1/6: Can we use an LLM’s hidden activations to predict and prevent wrong predictions? When it comes to arithmetic, yes!
I’m presenting new work w/ @alestolfo.bsky.social: “Probing for Arithmetic Errors in LMs” @ #ICML2025 Act Interp WS
🧵 below
18.07.2025 17:22 — 👍 1 🔁 1 💬 5 📌 0
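(To make the idea concrete, here is a minimal, hedged sketch of what probing hidden activations for arithmetic errors can look like: a linear probe trained on a layer's hidden states to predict whether the model's answer will be wrong. This is an illustration, not the paper's implementation; the model name, layer index, prompts, and labels are all assumptions.)

```python
# Sketch: train a linear probe on hidden activations to flag likely arithmetic errors.
# Model choice, layer, prompts, and labels are illustrative assumptions, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

model_name = "gpt2"  # assumption: any causal LM works for the sketch
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

LAYER = 6  # assumption: which layer's residual stream to probe

def hidden_at_last_token(prompt: str) -> torch.Tensor:
    """Hidden state of the final prompt token at LAYER."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[LAYER][0, -1]  # shape: (hidden_dim,)

# Toy training data: prompts plus labels (1 = the model's own answer was wrong).
# In practice, labels come from checking the model's generations against ground truth.
prompts = ["127 + 458 = ", "36 * 24 = ", "815 - 297 = ", "64 * 79 = "]
labels = [0, 1, 0, 1]  # illustrative only

X = torch.stack([hidden_at_last_token(p) for p in prompts]).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)

# At inference time, flag likely-wrong answers before trusting them.
risk = probe.predict_proba(hidden_at_last_token("52 * 48 = ").numpy()[None, :])[0, 1]
print(f"predicted probability of an arithmetic error: {risk:.2f}")
```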
[Image: logo for MIB, a Mechanistic Interpretability Benchmark]
Lots of progress in mech interp (MI) lately! But how can we measure when new mech interp methods yield real improvements over prior work?
We propose 😎 𝗠𝗜𝗕: a 𝗠echanistic 𝗜nterpretability 𝗕enchmark!
23.04.2025 18:15 — 👍 49 🔁 15 💬 1 📌 6
@vidhishab.bsky.social Safoora Yousefi @erichorvitz.bsky.social @besmiranushi.bsky.social
15.04.2025 16:36 — 👍 0 🔁 0 💬 0 📌 0
Our paper “Improving Instruction-Following in Language Models through Activation Steering” has been accepted to #ICLR2025!
We're also excited to share that our public GitHub repo is now live.
Code: github.com/microsoft/ll...
Camera-ready: arxiv.org/abs/2410.12877
15.04.2025 16:35 — 👍 7 🔁 2 💬 1 📌 2
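(For context, a minimal sketch of the general activation-steering idea: add a fixed direction to one layer's residual stream during generation via a forward hook. This is a generic illustration under assumed names, not the repo's API; a real steering vector would be derived from activation differences rather than sampled at random.)

```python
# Sketch of activation steering: add a direction to one layer's residual stream.
# Generic illustration only; model, layer, strength, and the random vector are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

LAYER = 8    # assumption: layer at which to intervene
ALPHA = 4.0  # assumption: steering strength

# A real steering vector is typically a mean activation difference
# (e.g., with vs. without the instruction); random here for illustration.
steer = torch.randn(model.config.hidden_size)
steer = steer / steer.norm()

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple; the hidden states come first.
    hidden = output[0] + ALPHA * steer.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
try:
    ids = tok("Rewrite this sentence politely:", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=30, do_sample=False)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # always detach the hook so later calls are unsteered
```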
Currently at ETH Zurich. Working on mechanistic interpretability.
PhD student in Computer Science and Natural Language Processing at ETH Zürich
NLP Researcher | CS PhD Candidate @ Technion
Research in NLP (mostly LM interpretability & explainability).
Assistant prof at UMD CS + CLIP.
Previously @ai2.bsky.social @uwnlp.bsky.social
Views my own.
sarahwie.github.io
PhD Student at the ILLC / UvA doing work at the intersection of (mechanistic) interpretability and cognitive science. Current Anthropic Fellow.
hannamw.github.io
Assistant professor of computer science at Technion; visiting scholar at @KempnerInst 2025-2026
https://belinkov.com/
Postdoc @ TakeLab, UniZG | previously: Technion; TU Darmstadt | PhD @ TakeLab, UniZG
Faithful explainability, controllability & safety of LLMs.
🔎 On the academic job market 🔎
https://mttk.github.io/
The largest workshop on analysing and interpreting neural networks for NLP.
BlackboxNLP will be held at EMNLP 2025 in Suzhou, China
blackboxnlp.github.io
Assistant Professor at Bar-Ilan University
https://yanaiela.github.io/
AI Evaluation and Interpretability @MicrosoftResearch, Prev PhD @CMU.
Chief Scientific Officer of Microsoft.
AI/ML, Responsible AI @Nvidia
Master's student at ENS Paris-Saclay / aspiring AI safety researcher / improviser
Prev research intern @ EPFL w/ wendlerc.bsky.social and Robert West
MATS Winter 7.0 Scholar w/ neelnanda.bsky.social
https://butanium.github.io
Postdoc at Northeastern and incoming Asst. Prof. at Boston U. Working on NLP, interpretability, causality. Previously: JHU, Meta, AWS
Interpretable Deep Networks. http://baulab.info/ @davidbau
https://mega002.github.io
AI Safety Research // Software Engineering
PhD Student at @gronlp.bsky.social 🐮, core dev @inseq.org. Interpretability ∩ HCI ∩ #NLProc.
gsarti.com