5/6: Finally, we use this information as a weak oracle to trigger self-correction. Re-prompting the LM based on the probe's prediction corrects up to 11% of the model's mistakes.
18.07.2025 17:25
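The post above doesn't give the exact re-prompting template, so the sketch below is only illustrative: `generate` is a stand-in for a real LM call, and the hint wording, function names, and numbers are made up. It shows the basic control flow of treating the probe's decoded result as a weak oracle and re-prompting only when it disagrees with the model.

```python
# Illustrative sketch (not the paper's code): probe-triggered self-correction.
def generate(prompt: str) -> str:
    """Stand-in for an LM call; swap in a real model.generate() in practice."""
    return "132"  # pretend the re-prompted model now answers correctly

def self_correct(query: str, model_answer: int, probe_prediction: int) -> int:
    """Use the probe's decoded result as a weak oracle; re-prompt on disagreement."""
    if probe_prediction == model_answer:
        return model_answer  # probe agrees with the model: keep the original answer
    hint = (f"{query}\nYour previous answer ({model_answer}) may contain a "
            f"calculation error. Please redo the computation step by step.")
    return int(generate(hint))

# Example: the model answered 122 to "57 + 75", the probe decodes 132 from the
# residual stream, so we re-prompt and recover the correct result.
print(self_correct("What is 57 + 75?", model_answer=122, probe_prediction=132))
```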
4/6: Can this be useful in a more realistic setting? We apply the probes trained on "pure arithmetic" queries to structured CoT traces obtained on GSM8K. The probes transfer robustly and consistently.
18.07.2025 17:25
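How the probes are applied to CoT traces isn't spelled out in the post, but a natural reading is that the frozen arithmetic probes are queried at the positions of intermediate results. Below is a minimal sketch under that assumption: the GSM8K-style trace is a real worked example, while `acts` is a random placeholder for per-position residual-stream activations, and `probe` / `decode_circular` are hypothetical names left as a comment.

```python
# Sketch: re-using arithmetic probes on a GSM8K-style chain-of-thought trace.
# Assumption: we probe the activation at each intermediate result (after "=").
import re
import numpy as np

cot = ("Natalia sold 48 clips in April. In May she sold 48 / 2 = 24 clips. "
       "In total she sold 48 + 24 = 72 clips.")

# Character offsets and values of each intermediate result in the trace.
step_results = [(m.start(1), int(m.group(1))) for m in re.finditer(r"= (\d+)", cot)]

# Placeholder activations: in practice these come from the LM's residual stream,
# one vector per token (here, crudely, one per character for illustration).
acts = np.random.default_rng(0).normal(size=(len(cot), 512))

for offset, value in step_results:
    h = acts[offset]  # activation at the position of the intermediate result
    # pred = decode_circular(probe.predict(h[None]))  # frozen pure-arithmetic probe
    print(f"intermediate result {value}: probe applied to activation at offset {offset}")
```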
3/6: Given the previous results, it should be possible to predict whether the model's output is correct. We designed lightweight probes that do this with high accuracy.
18.07.2025 17:24
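The post doesn't describe the correctness probe's architecture; one simple baseline consistent with "lightweight" is a logistic regression over the same residual-stream activations, trained on binary labels of whether the model answered correctly. The sketch below uses random placeholder activations with an injected signal, so shapes and numbers are purely illustrative.

```python
# Sketch of a lightweight correctness probe (logistic regression on hidden states).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
H = rng.normal(size=(2000, 512))          # placeholder residual-stream activations
correct = rng.integers(0, 2, size=2000)   # placeholder labels: did the LM get it right?
H[:, 0] += 2.0 * (2 * correct - 1)        # inject a signal so the demo is non-trivial

H_tr, H_te, y_tr, y_te = train_test_split(H, correct, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(H_tr, y_tr)
print(f"held-out accuracy at predicting correctness: {probe.score(H_te, y_te):.2f}")
```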
2/6: We feed an LM arithmetic queries and train lightweight probes (e.g., circular probes) on its residual stream. Interestingly, they accurately predict the ground-truth result, regardless of whether the LM itself answers correctly.
18.07.2025 17:23
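The post above doesn't define the circular probe precisely, so here is a minimal, self-contained sketch of one plausible design: regress the activation onto (cos, sin) of the result modulo some base, then read the prediction back off the angle. The activations, dimensions, and the injected signal are all placeholders; with real residual-stream features you would fit the same probe on extracted hidden states.

```python
# Sketch of a "circular" probe over residual-stream activations.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
d_model, n_samples, base = 512, 2000, 10   # hypothetical sizes; base 10 = last digit

# Placeholder activations (in practice: one residual-stream vector per query).
H = rng.normal(size=(n_samples, d_model))
y = rng.integers(0, 100, size=n_samples)   # placeholder ground-truth results

# Inject a circular encoding of the result's last digit so the demo has signal.
angles = 2 * np.pi * (y % base) / base
H[:, 0] += 3 * np.cos(angles)
H[:, 1] += 3 * np.sin(angles)

# Circular probe: linear regression onto (cos, sin) targets ...
targets = np.stack([np.cos(angles), np.sin(angles)], axis=1)
probe = Ridge(alpha=1.0).fit(H[:1500], targets[:1500])

# ... then decode the digit from the predicted angle.
pred = probe.predict(H[1500:])
digit = (np.round(np.arctan2(pred[:, 1], pred[:, 0]) / (2 * np.pi) * base) % base).astype(int)
print(f"held-out accuracy on result mod {base}: {(digit == y[1500:] % base).mean():.2f}")
```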
1/6: Can we use an LLM's hidden activations to predict and prevent its errors? When it comes to arithmetic, yes!
I'm presenting new work w/
@alestolfo.bsky.social
"Probing for Arithmetic Errors in LMs" @ #ICML2025 Act Interp WS
🧵 below
18.07.2025 17:22
Do you plan to work on AI safety/alignment in the future?
11.01.2025 14:07
Assistant Professor at ETH Zurich; interested in natural language processing, machine learning, and edtech
PhD @ ETHZ - LLM Interpretability
alestolfo.github.io
Senior Lecturer #USydCompSci at the University of Sydney. Postdocs at IBM Research and Stanford; PhD at Columbia. Converts ☕ into puns: sometimes theorems. He/him.
Master's student at ENS Paris-Saclay / aspiring AI safety researcher / improviser
Prev research intern @ EPFL w/ wendlerc.bsky.social and Robert West
MATS Winter 7.0 Scholar w/ neelnanda.bsky.social
https://butanium.github.io
Postdoc at Northeastern and incoming Asst. Prof. at Boston U. Working on NLP, interpretability, causality. Previously: JHU, Meta, AWS
Interpretable Deep Networks. http://baulab.info/ @davidbau
https://mega002.github.io
Gemini Post-Training ⚫️ Research Scientist at Google DeepMind ⚫️ PhD from ETH Zurich
AI Safety Research // Software Engineering
PhD Student at @gronlp.bsky.social 🐮, core dev @inseq.org. Interpretability ∩ HCI ∩ #NLProc.
gsarti.com
Waiting on a robot body. All opinions are universal and held by both employers and family.
Literally a professor. Recruiting students to start my lab.
ML/NLP/they/she.
Machine learning haruspex
NLP PhD student at Imperial College London and Apple AI/ML Scholar.
Machine learning PhD student @ Blei Lab, Columbia University
Working in mechanistic interpretability, nlp, causal inference, and probabilistic modeling!
Previously at Meta for ~3 years on the Bayesian Modeling & Generative AI teams.
www.sweta.dev
Machine Learning PhD Student
@ Blei Lab & Columbia University.
Working on probabilistic ML | uncertainty quantification | LLM interpretability.
Excited about everything ML, AI and engineering!
PhD student at Vector Institute / University of Toronto. Building tools to study neural nets and find out what they know. He/him.
www.danieldjohnson.com
Mechanistic interpretability
Creator of https://github.com/amakelov/mandala
prev. Harvard/MIT
machine learning, theoretical computer science, competition math.
Post-doc @ Harvard. PhD UMich. Spent time at FAIR and MSR. ML/NLP/Interpretability