Yucheng Sun

@yuchengsun.bsky.social

Currently at ETH Zurich. Working on mechanistic interpretability.

5 Followers  |  39 Following  |  7 Posts  |  Joined: 20.11.2024

Latest posts by yuchengsun.bsky.social on Bluesky

Probing for Arithmetic Errors in Language Models
We investigate whether internal activations in language models can be used to detect arithmetic errors. Starting with a controlled setting of 3-digit addition, we show that simple probes can accuratel...

6/6: Thanks for the supervision, @alestolfo.bsky.social and @mrinmaya.bsky.social.

Check out our paper: arxiv.org/abs/2507.12379

18.07.2025 17:27 - 👍 0    🔁 0    💬 0    📌 0

5/6: Finally, we use this information as a weak oracle to trigger self-correction. Re-prompting the LM based on the probe's prediction leads to a correction of up to 11% of the mistakes made by the model.

18.07.2025 17:25 - 👍 1    🔁 0    💬 0    📌 0
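
As a rough illustration of the probe-triggered re-prompting described in 5/6, here is a minimal sketch. It is not the paper's implementation: `answer_with_probe_check`, `RETRY_TEMPLATE`, and the two toy callables are hypothetical names, and in a real setup the error signal would come from a probe reading the model's activations rather than from the answer text.

```python
from typing import Callable

# Hypothetical retry prompt; the wording is an assumption, not taken from the paper.
RETRY_TEMPLATE = (
    "{query}\n"
    "Your previous answer was {answer}, which may be wrong. "
    "Please recompute carefully."
)


def answer_with_probe_check(
    query: str,
    generate: Callable[[str], str],
    probe_flags_error: Callable[[str, str], bool],
    max_retries: int = 1,
) -> str:
    """Generate an answer and re-prompt while the probe flags it as wrong."""
    answer = generate(query)
    for _ in range(max_retries):
        if not probe_flags_error(query, answer):
            break  # the probe sees no error, so keep the current answer
        answer = generate(RETRY_TEMPLATE.format(query=query, answer=answer))
    return answer


# Toy stand-ins for the model and the probe, for demonstration only.
def toy_generate(prompt: str) -> str:
    # Pretend the model only gets it right when asked to recompute.
    return "579" if "recompute" in prompt else "578"


def toy_probe(query: str, answer: str) -> bool:
    # A perfect oracle here; the probe in the post is described as a weak one.
    return answer != "579"


result = answer_with_probe_check("123 + 456 =", toy_generate, toy_probe)
print(result)  # prints 579
```

Because the probe is only a weak oracle, capping `max_retries` keeps a false alarm from triggering an unbounded number of re-prompts.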

4/6: Can this be useful in a more realistic setting? We apply the probes trained on "pure arithmetic" queries to structured CoT traces obtained on GSM8K. The probes transfer well in a robust and consistent manner.

18.07.2025 17:25 - 👍 0    🔁 0    💬 0    📌 0

3/6: Given the previous results, it should be possible to predict the correctness of the model output. We designed lightweight probes that achieve high accuracy.

18.07.2025 17:24 - 👍 0    🔁 0    💬 0    📌 0
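
To make the idea in 3/6 concrete, a correctness probe can be as simple as a logistic regression on activation vectors. The sketch below is an assumption-laden toy, not the authors' probe: the activations are synthetic, and `train_correctness_probe` and `predict_correct` are hypothetical helper names.

```python
import numpy as np


def train_correctness_probe(acts, is_correct, lr=0.1, steps=500):
    """Fit a logistic-regression probe with plain gradient descent."""
    n, d = acts.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
        grad = probs - is_correct  # gradient of the log-loss w.r.t. the logits
        w -= lr * (acts.T @ grad) / n
        b -= lr * grad.mean()
    return w, b


def predict_correct(w, b, acts):
    """Return True where the probe predicts the model output is correct."""
    return (acts @ w + b) > 0


# Synthetic stand-in for residual-stream activations (an assumption):
# correct and incorrect examples are shifted along one hidden direction.
rng = np.random.default_rng(0)
direction = rng.normal(size=16)
labels = rng.integers(0, 2, size=1000)  # 1 = model output was correct
acts = rng.normal(size=(1000, 16)) + np.outer(2 * labels - 1, direction)

w, b = train_correctness_probe(acts[:800], labels[:800])
held_out_accuracy = float(
    np.mean(predict_correct(w, b, acts[800:]) == (labels[800:] == 1))
)
print(f"held-out accuracy: {held_out_accuracy:.3f}")
```

With real models, `acts` would be hidden states collected at a chosen layer and token position, and `labels` would record whether the model's generated answer matched the ground truth.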

2/6: We feed an LM arithmetic queries and we train lightweight probes (e.g., circular) on its residual stream. Interestingly, they can accurately predict the ground-truth result, regardless of the LM's correctness.

18.07.2025 17:23 - 👍 0    🔁 0    💬 0    📌 0
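
One way to picture the "circular" probe mentioned in 2/6: map activations linearly to a point on the unit circle whose angle encodes a digit, then read the digit back from the angle. The sketch below assumes this formulation and uses synthetic activations; it is not the paper's code, and `fit_circular_probe`, `predict_digit`, and `make_synthetic` are hypothetical names.

```python
import numpy as np


def fit_circular_probe(acts, digits, base=10):
    """Least-squares map from activations to (cos, sin) of each digit's angle."""
    angles = 2 * np.pi * digits / base
    targets = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    X = np.hstack([acts, np.ones((acts.shape[0], 1))])  # append a bias column
    W, *_ = np.linalg.lstsq(X, targets, rcond=None)
    return W


def predict_digit(W, acts, base=10):
    """Decode digits by rounding the predicted angle to the nearest digit."""
    X = np.hstack([acts, np.ones((acts.shape[0], 1))])
    out = X @ W
    angles = np.arctan2(out[:, 1], out[:, 0])
    return np.rint(angles * base / (2 * np.pi)).astype(int) % base


def make_synthetic(n, dim=32, base=10, noise=0.1, seed=0):
    """Toy activations (an assumption): a noisy linear image of the digit circle."""
    rng = np.random.default_rng(seed)
    digits = rng.integers(0, base, size=n)
    angles = 2 * np.pi * digits / base
    circle = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    mixing = rng.normal(size=(2, dim))
    acts = circle @ mixing + noise * rng.normal(size=(n, dim))
    return acts, digits


acts, digits = make_synthetic(2000)
W = fit_circular_probe(acts[:1500], digits[:1500])
accuracy = float(np.mean(predict_digit(W, acts[1500:]) == digits[1500:]))
print(f"held-out digit accuracy: {accuracy:.3f}")
```

The circular target treats 9 and 0 as neighbours, which a plain regression onto the digit value would not.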

1/6: Can we use an LLM's hidden activations to predict and prevent wrong predictions? When it comes to arithmetic, yes!
I'm presenting new work w/ @alestolfo.bsky.social
"Probing for Arithmetic Errors in LMs" @ #ICML2025 Act Interp WS
🧵 below

18.07.2025 17:22 - 👍 1    🔁 1    💬 5    📌 0

Do you plan to work on AI safety/alignment in the future?

11.01.2025 14:07 - 👍 2    🔁 0    💬 1    📌 0