@nsubramani23.bsky.social
PhD student @CMU LTI - working on model #interpretability, student researcher @google; prev predoc @ai2; intern @MSFT nishantsubramani.github.io
We discovered that language models leave a natural "signature" on their API outputs that's extremely hard to fake. Here's how it works
arxiv.org/abs/2510.14086 1/
At @colmweb.org all week 🥯! Presenting 3 mechinterp + actionable interp papers at @interplay-workshop.bsky.social
1. BERTology in the Modern World w/ @bearseascape.bsky.social
2. MICE for CATs
3. LLM Microscope w/ Jiarui Liu, Jivitesh Jain, @monadiab77.bsky.social
Reach out to chat! #COLM2025
Excited to be attending NEMI in Boston today to present 🐭 MICE for CATs: Model-Internal Confidence Estimation for Calibrating Agents with Tools and co-moderate the model steering and control roundtable! Come find me to connect and chat about steering and actionable interp
At #ACL2025 in Vienna 🇦🇹 till next Saturday! Love to chat about anything #interpretability, understanding model internals 🔬, and finding yummy vegan food 🥬
At #ICML2025 🇨🇦 till Sunday! Love to chat about #interpretability, understanding model internals, and finding yummy vegan food in Vancouver 🥬
Congrats 🥳🥳🥳🥳
🚨 New #interpretability paper with @nsubramani23.bsky.social: 🕵️ Model Internal Sleuthing: Finding Lexical Identity and Inflectional Morphology in Modern Language Models
🚨 Check out our new #interpretability paper: 🕵🏽 Model Internal Sleuthing, led by the amazing @bearseascape.bsky.social, who is an undergrad at @scsatcmu.bsky.social @ltiatcmu.bsky.social
Excited to announce that I started at @googleresearch.bsky.social on the cloud team as a student researcher last month, working with Hamid Palangi on actionable #interpretability to build better tool-using #agents ⚙️🤖
Presenting this today at the poster session at #NAACL2025!
Come chat about interpretability, trustworthiness, and tool-using agents!
🗓️ - Thursday, May 1st (today)
📍 - Hall 3
🕑 - 2:00-3:30pm
At #NAACL2025 🌵 till Sunday! Love to chat about interpretability, understanding model internals, and finding vegan food 🥬
Come to our poster in Albuquerque on Thursday 2:00-3:30pm in the interpretability & analysis section!
Paper: aclanthology.org/2025.naacl-l...
Code (coming soon): github.com/microsoft/mi...
🧵/🧵
MICE 🐭:
🎯 - significantly beats baselines on expected tool-calling utility, especially in high-risk scenarios
✅ - matches expected calibration error of baselines
✅ - is sample efficient
✅ - generalizes zero-shot to unseen tools
5/🧵
Calibration is not sufficient: both an oracle and a model that just predicts the base rate are perfectly calibrated 🤦🏽‍♂️
We develop a new metric, expected tool-calling utility 🛠️, to measure the utility of deciding whether or not to execute a tool call via a confidence score!
4/🧵
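To make the "calibrated but not useful" point concrete, here is a minimal numeric sketch in Python. The threshold decision rule, the 0.9 threshold, and the stakes (+1 for executing a correct call, -5 for executing an incorrect one, 0 for deferring) are illustrative assumptions, not the paper's exact definition of expected tool-calling utility: a predictor that always outputs the base rate and an oracle are both essentially perfectly calibrated, yet only the oracle helps decide which tool calls to execute.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: 70% of candidate tool calls would succeed.
base_rate = 0.7
correct = rng.random(10_000) < base_rate                 # ground truth per call

# Two (near-)perfectly calibrated confidence estimators:
conf_base_rate = np.full(correct.shape, base_rate)       # always predicts the base rate
conf_oracle = correct.astype(float)                      # 1.0 if the call is correct, else 0.0

def ece(conf, correct, n_bins=10):
    """Expected calibration error with equal-width confidence bins."""
    bins = np.clip((conf * n_bins).astype(int), 0, n_bins - 1)
    err = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            err += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return err

def expected_utility(conf, correct, threshold=0.9, u_good=1.0, u_bad=-5.0, u_defer=0.0):
    """Utility of a threshold rule: execute the tool call iff confidence >= threshold."""
    execute = conf >= threshold
    util = np.where(execute, np.where(correct, u_good, u_bad), u_defer)
    return util.mean()

for name, conf in [("base-rate", conf_base_rate), ("oracle", conf_oracle)]:
    print(f"{name:9s}  ECE={ece(conf, correct):.3f}  utility={expected_utility(conf, correct):.2f}")
# Both estimators have ECE ~= 0, but only the oracle earns positive utility:
# the base-rate predictor never clears the threshold and always defers.
```

Under these toy numbers, lowering the threshold does not rescue the base-rate predictor either: it then executes every call and pays the penalty on the 30% that fail.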
We propose 🐭 MICE to better assess confidence when calling tools:
1️⃣ decode from each intermediate layer of an LM
2️⃣ compute similarity scores between each layer's generation and the final output
3️⃣ train a probabilistic classifier on these features (toy sketch below this post)
3/🧵
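A minimal sketch of those three steps, assuming gpt2 as a stand-in model with Hugging Face transformers and scikit-learn; this is not the paper's implementation. The helper layer_agreement_features, the prompts, and the labels are made-up toys, and agreement is measured token-by-token along the final model's greedy decode rather than by comparing full per-layer generations.

```python
# Minimal sketch of the three MICE steps above, using gpt2 as a stand-in model.
# NOT the paper's implementation: agreement is a crude proxy for similarity
# between each layer's generation and the final output, and the data are toys.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def layer_agreement_features(prompt: str, max_new_tokens: int = 16) -> list[float]:
    """1) logit-lens decode at every layer, 2) per-layer agreement with the final output."""
    ids = tok(prompt, return_tensors="pt").input_ids
    steps = []  # one entry per generated token: the greedy token id at every layer
    for _ in range(max_new_tokens):
        out = model(ids, output_hidden_states=True)
        # Apply the unembedding (lm_head) to each layer's last hidden state.
        # (A fuller logit lens would also apply the final layer norm first.)
        per_layer = [int(model.lm_head(h[0, -1]).argmax()) for h in out.hidden_states]
        steps.append(per_layer)
        # The final layer's token drives decoding, i.e. it is the model's actual output.
        ids = torch.cat([ids, torch.tensor([[per_layer[-1]]])], dim=-1)
    # Feature per layer: fraction of steps whose token matches the final layer's token.
    n_readouts = len(steps[0])
    return [sum(s[l] == s[-1] for s in steps) / len(steps) for l in range(n_readouts)]

# 3) Train a probabilistic classifier on these features to predict whether the
#    proposed tool call is correct; labels would come from execution feedback.
prompts = ["What is 17 * 24? Call calculator(", "Weather tomorrow? Call get_weather("]  # toy
labels = [1, 0]                                                                         # toy
clf = LogisticRegression(max_iter=1000).fit([layer_agreement_features(p) for p in prompts], labels)
print(clf.predict_proba([layer_agreement_features(prompts[0])])[:, 1])  # MICE-style confidence
```

The design intuition, as the thread frames it, is that broad agreement across intermediate layers with the final output plausibly signals higher confidence in the proposed tool call; the classifier turns that layer-wise profile into a probability.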
1️⃣ Tool-using agents need to be useful and safe as they take actions in the world
2️⃣ Language models are poorly calibrated
🤔 Can we use model internals to better calibrate language models to make tool-using agents safer and more useful?
2/🧵
Excited to share a new interp+agents paper: 🐭🐱 MICE for CATs: Model-Internal Confidence Estimation for Calibrating Agents with Tools, appearing at #NAACL2025
This was work done @msftresearch.bsky.social last summer with Jason Eisner, Justin Svegliato, Ben Van Durme, Yu Su, and Sam Thomson
1/🧵
Congrats!!
Congrats! 🥳
Have these people met … society? Read a book? Listened to music? Regurgitating esoteric facts isn't intelligence.
This is more like humanity's last stand at Jeopardy
www.nytimes.com/2025/01/23/t...
👍🏽 looks good to me!
👋🏽 Intro
💼 PhD student @ltiatcmu.bsky.social
My research is in model interpretability, understanding the internals of LLMs to build more controllable and trustworthy systems
🫵🏽 If you are interested in better understanding language technology or model interpretability, let's connect!
👋🏽
👋🏽
1) I'm working on using intermediate model generations with LLMs to calibrate tool-using agents ⚙️🤖 better than the output probabilities themselves! Turns out you can 🥳
2) There's gotta be a nice geometric understanding of what's going on within LLMs when we tune them 🤔
Love to be added too!
Utah is hiring tenure-track/tenured faculty & a priority area is NLP!
Please reach out over email if you have questions about the school and Salt Lake City; happy to share my experience so far.
utah.peopleadmin.com/postings/154...