Partha Pratim Saha — AI Researcher & Mechanistic Interpretability

01 · About

About Me

I am an AI researcher and lecturer with over 12 years of combined experience in industry (Infosys, BirlaSoft/J&J, Wipro, IIT Kanpur) and academia (BITS Pilani M.Tech — 9.08 GPA, top 5%), now focused full-time on foundational research in mechanistic interpretability, AI alignment, and geometric deep learning.

My research programme develops information-geometric frameworks for tracing how LLMs encode, propagate, and transform beliefs across layers, and how alignment, fine-tuning, cultural model merging, and distillation alter that internal geometry. I have three active research threads: (1) AI Safety & Mechanistic Interpretability via Torsional Belief Vector Fields; (2) Cross-cultural & Multilingual AI via Semantic Helix / nDNA; and (3) Privacy-Preserving Federated Learning for healthcare (published in SpringerNature ICOMP’24).

My long-term vision: AI systems aligned not only behaviourally but geometrically — with transparent, auditable internal representations that can be verified, not just tested. I am particularly motivated by applications where responsible, interpretable AI can have measurable global impact: global health, diagnostic AI, equitable NLP, and human-centred AI systems for underserved communities.

I currently serve as Lecturer in Computer Science at Nalhati Government Polytechnic College (West Bengal, India), where I teach ML, Deep Learning, IoT, and Python, and supervise 50+ students on AI/NLP research projects. I have been selected as a reviewer for the MTI-LLM Workshop at NeurIPS 2025 and as an AWS AI & ML Scholar, and have received scholarships to several summer schools (Duke ML, Cohere, Armenian LLM, University of Chicago DSI).

I am deeply concerned about the risks from advanced AI systems — particularly power-seeking behaviour, concentration of control, and the failure modes that emerge when capable AI systems pursue proxy objectives misaligned with human values. These are not abstract worries: they shape every architectural choice I make in my research. My goal is not to merely describe alignment failures but to build the mathematical tools that let us detect them inside the model — before they manifest as harmful outputs.

“AI systems must be aligned not only behaviourally but geometrically — with transparent, auditable internal representations that can be verified, not just tested.”

Education

M.Tech — Data Science & Engineering

BITS Pilani, India

GPA 9.08/10 · Distinction · Top 5% (2019–2021)

B.Tech — Computer Science & Engineering

WBUT, Kolkata, India

GPA 8.49/10 · Top 5% (2006–2010)

Research Interests

Mechanistic Interpretability · Geometric ML · AI Safety & Alignment · AI Deception Detection · Misalignment Monitoring · Cross-cultural LLMs · Privacy-Preserving FL · LLM Reasoning

Current Position

Lecturer in CS

Nalhati Govt. Polytechnic College

West Bengal, India · Dec 2021–Present

Links

LinkedIn: partha121

GitHub: pps121

Email: technical.partha@gmail.com

02 · Skills & Expertise

Technical Expertise

NLP & Foundation Models

LLM Alignment & Safety · Mechanistic Interpretability · AI Deception Detection · Misalignment Monitoring · Belief & Knowledge Editing · AI Agents · XAI / Explainable AI · Fine-tuning (HuggingFace) · Hybrid & Multi-hop RAG · Conversational AI

Deep Learning & ML

PyTorch · TensorFlow / Keras · Transformers (GPT, Llama, Mistral, Qwen, Gemma, DeepSeek) · CNN, RNN, Autoregressive Models · Regression / Classification / Clustering · Federated Learning

Geometric & Mathematical ML

Riemannian Geometry · Fisher-Rao Metric · Cartan Torsion · Persistent Homology · Information Geometry · DTW Analysis · Spectral Methods · Frenet-Serret Framework

Frameworks & Tools

Python · LangChain · LlamaIndex · Scikit-learn · NumPy / Pandas / NLTK / spaCy · Docker / Kubernetes · Azure / AWS / IBM Cloud · PostgreSQL / MySQL

Domain Knowledge

AI Safety & Responsible AI · Global Health AI · Cancer Genomics & Bioinformatics · Healthcare NLP · Education Technology · Finance & Banking · Cross-cultural & Multilingual AI

Research & Communication

NeurIPS / ICML / SpringerNature Publication · Academic Mentorship (50+ students) · Workshop & Seminar Facilitation · LaTeX & Technical Writing · Cross-functional Team Leadership

Technical AI Safety

BlueDot Impact (AGI Strategy) · BlueDot Impact (Technical AI Safety) · Power-seeking & Goal Misgeneralisation · Mechanistic Interpretability (Circuits) · Geometric Alignment Probing · RLHF / DPO Safety Analysis · Catastrophic Risk Evaluation · Representation Engineering

03 · Research

Research Projects

AI Safety Research Programme — Path to Impact

Core Research Focus

My core concern is straightforward but urgent: as AI systems grow more capable, misalignment between their internal objectives and human values becomes catastrophically consequential — not merely inconvenient. I am particularly worried about power-seeking behaviour in strategically capable agents, concentration of control in military and governance domains, and the fundamental challenge that behavioural safety does not imply geometric safety. A model can pass all safety evaluations while encoding deeply misaligned belief structures internally.

Geometric Torsion Framework (measuring alignment not as behaviour but as geometry):

Torsion norm T1_ℓ = ‖S_ℓ‖_F · H1 amplification 1500× for normative concepts · Thermodynamic gap 10× (normative vs factual) · Entropy–torsion bridge ρ = −0.387 (Mistral, p = 5.43 × 10⁻³⁰)

How I plan to create impact: (1) Develop post-hoc geometric probes that are compute-efficient enough for deployment-time monitoring; (2) Collaborate with world-leading AI safety labs (Anthropic, Redwood, ARC Evals, Mech Interp groups at Oxford/Cambridge/MIT/CMU) to validate geometric torsion findings against circuit-level mechanistic analysis; (3) Pursue a fully-funded PhD at a programme with strong AI safety infrastructure to work on scalable interpretability tools deployable beyond 7B-parameter models. I am actively applying for research fellowships, internships, and PhD positions starting 2026.

Geometric torsion metrics (8-scale)

3×2 IT/PA model pairs tested

20,439 LITMUS benchmark prompts

BlueDot Impact (AGI Strategy) · BlueDot Impact (Technical AI Safety) · Mechanistic Interpretability · Geometric Alignment · Power-seeking Analysis · Catastrophic Risk

Torsional Belief Vector Fields (TBVF) — AI Safety & Mechanistic Interpretability

Models transformer hidden-state trajectories as discrete curves on a Riemannian belief manifold (ℳ, g_F) with the Fisher-Rao metric. The torsion tensor S_ℓ = (M_ℓ − M_ℓ^T) / 2 captures rotational mismatch between consecutive belief updates — a non-commutative geometric signature invisible to attention patterns or activation magnitudes. Key discovery: DPO alignment creates geometrically localised “brake layers” that systematically suppress torsion, probed across OLMo-7B, Mistral-7B, and Zephyr-7B using 500 unsafe prompts (Litmus) and 17 geometric metrics.

Peak result (Layer 27, Mistral-7B; Bonferroni-corrected, n=500):

DPO torsion suppression: 44.4% | Cohen’s d = 0.741 | p = 7.7 × 10⁻¹³

DTW–Torsion theorem: DC(w) ≥ 0.875 · | Σ_ℓ ‖S_ℓ^IT‖_F − Σ_ℓ ‖S_ℓ^PA‖_F |
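
As a rough illustration of the torsion tensor S_ℓ = (M_ℓ − M_ℓ^T) / 2 and its Frobenius norm ‖S_ℓ‖_F described above, the following minimal sketch computes both for a single square matrix. How M_ℓ is built from hidden states is not specified on this page, so the input here is a toy matrix chosen only for illustration, and the function names `torsion_tensor`, `frobenius_norm`, and `torsion_norm` are hypothetical rather than taken from the project code.

```python
import math


def torsion_tensor(M):
    """Return the antisymmetric part S = (M - M^T) / 2 of a square matrix.

    M is a list of equal-length rows. How M is obtained from model
    hidden states is an assumption left outside this sketch.
    """
    n = len(M)
    if any(len(row) != n for row in M):
        raise ValueError("M must be a square matrix")
    return [[(M[i][j] - M[j][i]) / 2.0 for j in range(n)] for i in range(n)]


def frobenius_norm(A):
    """Frobenius norm: square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in A for x in row))


def torsion_norm(M):
    """Frobenius norm of the antisymmetric part of M."""
    return frobenius_norm(torsion_tensor(M))


# Toy input only (not model data): a matrix with an asymmetric component.
M_toy = [[1.0, 2.0],
         [0.0, 1.0]]
S_toy = torsion_tensor(M_toy)
print(S_toy)                # antisymmetric part of M_toy
print(torsion_norm(M_toy))  # its Frobenius norm

# A symmetric matrix has no antisymmetric part, so its torsion norm is zero.
print(torsion_norm([[1.0, 2.0], [2.0, 1.0]]))
```

A symmetric M_ℓ gives a zero norm under this definition, so the norm measures only the antisymmetric component of the update.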
      

Geometric metrics developed

3×2 SFT/DPO model pairs

500 Unsafe prompts (Litmus)

Concepts (DTW analysis)

Layer-level t-tests

16pp Full paper + appendix

Torsion Tensor · Fisher-Rao Metric · Frenet-Serret · Holonomy Defect · Spectral Torsion · DTW · DPO · OLMo-7B · Mistral-7B · Zephyr-7B

Semantic Helix of LLMs (nDNA) — Cross-cultural & Multilingual AI

Active

Unifies fine-tuning, alignment, distillation, and merging as measurable deformations of the same depth-wise semantic flow via spectral curvature κ_ℓ and thermodynamic length ℒ_ℓ (epistemic effort across layers). Investigates epistemic inheritance in merged LLMs using Fisher-Rao geometry, producing neural offspring; emergent cultural nDNA measured via spectral curvature deviation Δκ_ℓ and thermodynamic length divergence Δℒ_ℓ. Cultures studied: African, Latin American, South Asian, East Asian, Arabic, Indigenous, European, Pacific Islander.

Spectral Curvature · Thermodynamic Length · Model Merging · Cultural AI · Llama3-instruct · DeepSeek-R1 · Qwen

Privacy-Preserving Federated Learning for Healthcare (CFL)

Published · SpringerNature ICOMP’24

The Collaborative Federated Learning (CFL) cloud-based system separates datasets into public and private sets based on the removal of PHI/PII, enabling personalised GPT-like systems without centralising sensitive data. This supports AI for healthcare at scale while preserving patient privacy, a critical requirement for responsible, equitable AI deployment globally.

Federated Learning · Privacy-Preserving AI · Healthcare AI · PHI/PII Separation · SpringerNature

Neural Robustness Learning in Dense Transformers (Preliminary)

In Progress

Investigates whether formal robustness certificates for LLMs can be derived from data contamination, label noise, and adversarial attacks rather than from input-perturbation bounds. DPO-aligned models show compressed Fisher-norm spectra versus SFT counterparts, suggesting that alignment induces a geometric contraction that doubles as a robustness mechanism. Framework: Lipschitz robustness bounds intended to guarantee that near-natural robustness holds under up to 40% data contamination. Models: ViT, MedSigLIP, CLIP.

Robustness Certificates · Fisher-norm Spectra · ViT · MedSigLIP · CLIP

04 · Publications

Papers & Publications

2025

★ NeurIPS 2025 Workshop

Prompting Away Stereotypes? Evaluating Bias in Text-to-Image Models for Occupations

Shaina Raza, Maximus Powers, Partha Pratim Saha, Elham Dolatabadi, Usman Naseem

NeurIPS 2025 Workshop on Algorithmic Fairness · Empirical bias audit of DALL·E, Midjourney, Stable Diffusion across occupational stereotypes

arXiv · DOI

2024

SpringerNature ICOMP'24

Collaborative Federated Learning Cloud Based System for Privacy-Preserving Healthcare AI

Partha Pratim Saha

First-author · Privacy-preserving federated learning system separating PHI/PII for GPT-style personalised healthcare without centralising sensitive data

SpringerNature

Under Review & In Preparation

2026

Preprint · Under Review

GRAFT: Geometric Representations of Alignment's Fingerprint in Transformer Belief Trajectories

Partha Pratim Saha

Preprint · Under review · T2 torsion is 8× more concept-discriminative than CKA (AUC 0.89); three pre-registered hypotheses confirmed on LITMUS (20,439 prompts)

Code

2026

Under Review · NeurIPS 2026

MENTIS: What Belief Changes Under Alignment? Multi-Scale Latent Torsion in Language Models

Partha Pratim Saha, Samarth Raina, Mayur Parvatikar, Amit Dhanda, Vinija Jain, Aman Chadha, Amitava Das

Under review (NeurIPS 2026) · 8 new torsion metrics, full LITMUS benchmark study, DPO suppression 44.4% (Cohen’s d=0.741, p=7.7×10⁻¹³), entropy–torsion bridge ρ=−0.387

GitHub

2025

Preprint

Scaling-law and Preference Integration in Neural Alignment Layers (SPINAL)

Arion Das, Partha Pratim Saha, Aman Chadha, Vinija Jain, Amitava Das

Contribution: model experiments, paper writing

GitHub

2025

Journal Paper

Enhancing Human Empathy in Conversations Using Transformer-Based Models

Cherishma Kumar Subhasa, Endriyas Zenagebriel, Partha Pratim Saha, Zarah Rezaei, Joseph Akinyemi

Sciencematch · Impact Scholar Program 2025 · Top contributor; provided all technical ideas and process-pipeline

DOI: 10.5281/ZENODO.15126395

2024

SpringerNature

Collaborative Federated Learning Cloud Based System

Partha Pratim Saha, Naresh K. Sehgal, Miad Faezipour

International Conference on Internet Computing & IoT (ICOMP’24) · Computer Engineering & Applied Computing (CSCE), USA

SpringerNature

05 · Experience

Work Experience

Teaching Assistant (M.Tech Programme)

BITS Pilani, India

2021 – 2023

Teaching Assistant for three graduate courses: NLP Applications [Winter 2023], Deep Learning [Fall 2021], and Deep Reinforcement Learning [Spring 2021]. Conducted tutorials, graded assignments, and mentored students. Honorarium: USD 2,513.11 across all three courses.

Lecturer in Computer Science

WBSCTED — Nalhati Government Polytechnic College, West Bengal, India

Dec 2021 – Present

Teaching ML, Deep Learning, IoT, Python, and Java. Project supervisor for 50+ final-year students in AI, NLP, Agentic AI, and Empathetic Chatbot development. Head of Department (CS) responsibilities. Four active research projects on AI Safety, nDNA/Semantic Helix, Cultural LLMs, and Neural Robustness.

Lead Data Scientist — Conversational Dialog System

Wipro Limited, Bangalore, India

Sept 2021 – Nov 2021

Developed a conversational chatbot system that resolves ambiguous queries. Led a team of 5; implemented 50+ custom intents and dialog flows with IBM Watson. Impact: 0.3 million users worldwide.

Senior Data Scientist — Medical Search Engine (J&J R&D)

BirlaSoft · Johnson & Johnson R&D, New Delhi, India

Dec 2017 – Aug 2019

Built a medical search engine using SciBERT and a spaCy NLP pipeline. Impact: over 0.1 million J&J product users. Tools: Python, Word2Vec, scispaCy, Fuzzy Search, Flask.

Project Engineer — Threat Intelligence System

Indian Institute of Technology (IIT) Kanpur, India

Nov 2016 – Jul 2017

Developed a secure threat management system for academic institutions. Researched cyber-security measures against attacks on integrity, confidentiality, and non-repudiation. Tools: Python, Drupal, Django.

Senior Systems Engineer — Alignment & Cancer Genomics in AI

Infosys Technologies Limited, Chennai, India

Jan 2011 – Jul 2015

Applied Edit Distance and Needleman-Wunsch algorithms to DNA sequences (FASTQ) to identify minimum insertions/deletions. Identified the top 10 genes driving multiple myeloma (a blood cancer); implemented 3 research papers. Impact: biological hierarchy determination for new species, drug design, life expectancy improvements. Tools: Python, Word2Vec, NumPy, Pandas, Dynamic Programming.
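
For readers unfamiliar with the Needleman-Wunsch algorithm mentioned above, here is a minimal textbook sketch of global alignment by dynamic programming. It is not the code used in the Infosys project: the scoring values (match = 1, mismatch = −1, gap = −1), the function name `needleman_wunsch`, and the short example sequences are all assumptions chosen for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return (score, aligned_a, aligned_b).

    Scoring parameters are illustrative defaults, not project values.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two characters
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


# Toy sequences only: the second is the first with one base deleted.
best, aligned_a, aligned_b = needleman_wunsch("ACGT", "AGT")
indels = aligned_a.count("-") + aligned_b.count("-")
print(best, aligned_a, aligned_b, indels)
```

Counting the gap characters in the two aligned strings gives the number of insertions/deletions in that alignment. Real FASTQ reads would first need parsing and quality filtering, which this sketch omits.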

06 · Recognition

Awards & Achievements

🛡️

BlueDot Impact Scholar — Selected for both AGI Strategy and Technical AI Safety courses (2025–2026). Rigorous training in catastrophic risk, power-seeking, and technical safety evaluation.

🔬

LASR Labs: progressed through the initial selection rounds of the LASR (London AI Safety Research) Labs programme for mechanistic interpretability research.

🖥️

5x Google Colab Pro A100/H100 GPU (300 units each) from Neuromatch Academy for AI Safety research

📋

NeurIPS 2025 Reviewer — Selected to serve as reviewer for MTI-LLM Workshop at NeurIPS 2025

☁️

AWS AI & ML Scholar by Udacity, 2025

🎓

90% Scholarship — Armenian LLM Summer School 2025

🌐

Duke Machine Learning Summer School 2025 & Cohere Summer School 2025 attendee

🏆

SPAR Demo Day 2025 — Accepted for AI Safety & Alignment research demonstration (Neuromatch / AI Safety cohort)

🏙️

University of Chicago DSI Summer School 2024 — AI-Science Research Program; Eric & Wendy Schmidt Postdoctoral Fellowship (Schmidt Futures)

🎓

MLx Generative AI Fellowship — Oxford ML Summer School 2024 & 2025 — competitive scholarship award for generative AI research

🌍

Athens NLP Summer School 2024 — competitive international selection for NLP & large language models

🗼

diiP Summer School 2024, Paris — Deep Learning & Interpretability in Practice; competitive selection

🧠

Neuromatch Academy — Deep Learning — competitive global selection for the intensive summer school

🗽

NYU AI Summer School 2022 — New York University; competitive selection in AI & ML

🤖

AI4 IMPACT Scholar 2021 — AI Singapore; selected practitioner programme for applied AI impact

💡

Google Developer's Program 2019 — Google Developer Expert community; competitive global selection

🎓

Udacity Bertelsmann Technology Scholarship — Google-sponsored; competitive global selection in AI & ML