1/6: Can we use an LLM's hidden activations to predict and prevent wrong predictions? When it comes to arithmetic, yes!
I'm presenting new work w/
@alestolfo.bsky.social
"Probing for Arithmetic Errors in LMs" @ #ICML2025 Act Interp WS
🧵 below
18.07.2025 17:22
Logo for MIB: A Mechanistic Interpretability Benchmark
Lots of progress in mech interp (MI) lately! But how can we measure when new mech interp methods yield real improvements over prior work?
We propose MIB: a Mechanistic Interpretability Benchmark!
23.04.2025 18:15
@vidhishab.bsky.social Safoora Yousefi @erichorvitz.bsky.social @besmiranushi.bsky.social
15.04.2025 16:36
Our paper "Improving Instruction-Following in Language Models through Activation Steering" has been accepted to #ICLR2025!
We're also excited to share that our public GitHub repo is now live.
Code: github.com/microsoft/ll...
Camera-ready: arxiv.org/abs/2410.12877
15.04.2025 16:35