Alessandro Stolfo

@alestolfo.bsky.social

PhD @ ETHZ - LLM Interpretability alestolfo.github.io

361 Followers  |  65 Following  |  2 Posts  |  Joined: 17.11.2024

Posts by Alessandro Stolfo (@alestolfo.bsky.social)

1/6: Can we use an LLM’s hidden activations to predict and prevent wrong predictions? When it comes to arithmetic, yes!
I’m presenting new work w/ @alestolfo.bsky.social, “Probing for Arithmetic Errors in LMs” @ #ICML2025 Act Interp WS
🧵 below

18.07.2025 17:22 - 👍 1    🔁 1    💬 5    📌 0
Logo for MIB: A Mechanistic Interpretability Benchmark

Lots of progress in mech interp (MI) lately! But how can we measure when new mech interp methods yield real improvements over prior work?

We propose 😎 𝗠𝗜𝗕: a 𝗠echanistic 𝗜nterpretability 𝗕enchmark!

23.04.2025 18:15 - 👍 51    🔁 15    💬 1    📌 6

@vidhishab.bsky.social Safoora Yousefi @erichorvitz.bsky.social @besmiranushi.bsky.social

15.04.2025 16:36 - 👍 0    🔁 0    💬 0    📌 0

Our paper "Improving Instruction-Following in Language Models through Activation Steering" has been accepted to #ICLR2025!

We're also excited to share that our public GitHub repo is now live.
Code: github.com/microsoft/ll...
Camera-ready: arxiv.org/abs/2410.12877

15.04.2025 16:35 - 👍 8    🔁 2    💬 1    📌 2