5/6: Finally, we use this information as a weak oracle to trigger self-correction. Re-prompting the LM based on the probe's prediction corrects up to 11% of the model's mistakes.
18.07.2025 17:25
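The post above doesn't give the exact re-prompting template, so the sketch below is only illustrative: `generate` is a stand-in for a real LM call, and the hint wording, function names, and numbers are made up. It shows the basic control flow of treating the probe's decoded result as a weak oracle and re-prompting only when it disagrees with the model.

```python
# Illustrative sketch (not the paper's code): probe-triggered self-correction.
def generate(prompt: str) -> str:
    """Stand-in for an LM call; swap in a real model.generate() in practice."""
    return "132"  # pretend the re-prompted model now answers correctly

def self_correct(query: str, model_answer: int, probe_prediction: int) -> int:
    """Use the probe's decoded result as a weak oracle; re-prompt on disagreement."""
    if probe_prediction == model_answer:
        return model_answer  # probe agrees with the model: keep the original answer
    hint = (f"{query}\nYour previous answer ({model_answer}) may contain a "
            f"calculation error. Please redo the computation step by step.")
    return int(generate(hint))

# Example: the model answered 122 to "57 + 75", the probe decodes 132 from the
# residual stream, so we re-prompt and recover the correct result.
print(self_correct("What is 57 + 75?", model_answer=122, probe_prediction=132))
```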
4/6: Can this be useful in a more realistic setting? We apply the probes trained on "pure arithmetic" queries to structured CoT traces obtained on GSM8K. The probes transfer robustly and consistently.
18.07.2025 17:25
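How the probes are applied to CoT traces isn't spelled out in the post, but a natural reading is that the frozen arithmetic probes are queried at the positions of intermediate results. Below is a minimal sketch under that assumption: the GSM8K-style trace is a real worked example, while `acts` is a random placeholder for per-position residual-stream activations, and `probe` / `decode_circular` are hypothetical names left as a comment.

```python
# Sketch: re-using arithmetic probes on a GSM8K-style chain-of-thought trace.
# Assumption: we probe the activation at each intermediate result (after "=").
import re
import numpy as np

cot = ("Natalia sold 48 clips in April. In May she sold 48 / 2 = 24 clips. "
       "In total she sold 48 + 24 = 72 clips.")

# Character offsets and values of each intermediate result in the trace.
step_results = [(m.start(1), int(m.group(1))) for m in re.finditer(r"= (\d+)", cot)]

# Placeholder activations: in practice these come from the LM's residual stream,
# one vector per token (here, crudely, one per character for illustration).
acts = np.random.default_rng(0).normal(size=(len(cot), 512))

for offset, value in step_results:
    h = acts[offset]  # activation at the position of the intermediate result
    # pred = decode_circular(probe.predict(h[None]))  # frozen pure-arithmetic probe
    print(f"intermediate result {value}: probe applied to activation at offset {offset}")
```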
3/6: Given the previous results, it should be possible to predict whether the model's output is correct. We designed lightweight probes that do this with high accuracy.
18.07.2025 17:24
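The post doesn't describe the correctness probe's architecture; one simple baseline consistent with "lightweight" is a logistic regression over the same residual-stream activations, trained on binary labels of whether the model answered correctly. The sketch below uses random placeholder activations with an injected signal, so shapes and numbers are purely illustrative.

```python
# Sketch of a lightweight correctness probe (logistic regression on hidden states).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
H = rng.normal(size=(2000, 512))          # placeholder residual-stream activations
correct = rng.integers(0, 2, size=2000)   # placeholder labels: did the LM get it right?
H[:, 0] += 2.0 * (2 * correct - 1)        # inject a signal so the demo is non-trivial

H_tr, H_te, y_tr, y_te = train_test_split(H, correct, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(H_tr, y_tr)
print(f"held-out accuracy at predicting correctness: {probe.score(H_te, y_te):.2f}")
```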
2/6: We feed an LM arithmetic queries and train lightweight probes (e.g., circular probes) on its residual stream. Interestingly, they accurately predict the ground-truth result, regardless of whether the LM itself answers correctly.
18.07.2025 17:23
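The post above doesn't define the circular probe precisely, so here is a minimal, self-contained sketch of one plausible design: regress the activation onto (cos, sin) of the result modulo some base, then read the prediction back off the angle. The activations, dimensions, and the injected signal are all placeholders; with real residual-stream features you would fit the same probe on extracted hidden states.

```python
# Sketch of a "circular" probe over residual-stream activations.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
d_model, n_samples, base = 512, 2000, 10   # hypothetical sizes; base 10 = last digit

# Placeholder activations (in practice: one residual-stream vector per query).
H = rng.normal(size=(n_samples, d_model))
y = rng.integers(0, 100, size=n_samples)   # placeholder ground-truth results

# Inject a circular encoding of the result's last digit so the demo has signal.
angles = 2 * np.pi * (y % base) / base
H[:, 0] += 3 * np.cos(angles)
H[:, 1] += 3 * np.sin(angles)

# Circular probe: linear regression onto (cos, sin) targets ...
targets = np.stack([np.cos(angles), np.sin(angles)], axis=1)
probe = Ridge(alpha=1.0).fit(H[:1500], targets[:1500])

# ... then decode the digit from the predicted angle.
pred = probe.predict(H[1500:])
digit = (np.round(np.arctan2(pred[:, 1], pred[:, 0]) / (2 * np.pi) * base) % base).astype(int)
print(f"held-out accuracy on result mod {base}: {(digit == y[1500:] % base).mean():.2f}")
```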
1/6: Can we use an LLM's hidden activations to predict and prevent its errors? When it comes to arithmetic, yes!
I'm presenting new work w/
@alestolfo.bsky.social
"Probing for Arithmetic Errors in LMs" @ #ICML2025 Act Interp WS
🧵 below
18.07.2025 17:22
Do you plan to work on AI safety/alignment in the future?
11.01.2025 14:07
Assistant Professor at ETH Zurich; interested in natural language processing, machine learning, and edtech
PhD @ ETHZ - LLM Interpretability
alestolfo.github.io
Senior Lecturer #USydCompSci at the University of Sydney. Postdocs at IBM Research and Stanford; PhD at Columbia. Converts ☕ into puns: sometimes theorems. He/him.
Master's student at ENS Paris-Saclay / aspiring AI safety researcher / improviser
Prev research intern @ EPFL w/ wendlerc.bsky.social and Robert West
MATS Winter 7.0 Scholar w/ neelnanda.bsky.social
https://butanium.github.io
Postdoc at Northeastern and incoming Asst. Prof. at Boston U. Working on NLP, interpretability, causality. Previously: JHU, Meta, AWS
Interpretable Deep Networks. http://baulab.info/ @davidbau
https://mega002.github.io
Gemini Post-Training ⚫️ Research Scientist at Google DeepMind ⚫️ PhD from ETH Zurich
AI Safety Research // Software Engineering
PhD Student at @gronlp.bsky.social 🐮, core dev @inseq.org. Interpretability ∩ HCI ∩ #NLProc.
gsarti.com
Waiting on a robot body. All opinions are universal and held by both employers and family.
Literally a professor. Recruiting students to start my lab.
ML/NLP/they/she.
Machine learning haruspex
NLP PhD student at Imperial College London and Apple AI/ML Scholar.
Machine learning PhD student @ Blei Lab, Columbia University
Working in mechanistic interpretability, nlp, causal inference, and probabilistic modeling!
Previously at Meta for ~3 years on the Bayesian Modeling & Generative AI teams.
www.sweta.dev
Machine Learning PhD Student
@ Blei Lab & Columbia University.
Working on probabilistic ML | uncertainty quantification | LLM interpretability.
Excited about everything ML, AI and engineering!
PhD student at Vector Institute / University of Toronto. Building tools to study neural nets and find out what they know. He/him.
www.danieldjohnson.com
Mechanistic interpretability
Creator of https://github.com/amakelov/mandala
prev. Harvard/MIT
machine learning, theoretical computer science, competition math.
Post-doc @ Harvard. PhD UMich. Spent time at FAIR and MSR. ML/NLP/Interpretability