
Arthur Conmy

@arthurconmy.bsky.social

Aspiring 10x reverse engineer at Google DeepMind

1,122 Followers  |  474 Following  |  9 Posts  |  Joined: 31.10.2024

Latest posts by arthurconmy.bsky.social on Bluesky

Preventing Language Models From Hiding Their Reasoning Large language models (LLMs) often benefit from intermediate steps of reasoning to generate answers to complex problems. When these intermediate steps of reasoning are used to monitor the activity of ...

Good question. I'm not sure. arxiv.org/abs/2310.18512 has a bunch of discussion of mitigations

04.01.2025 15:33 — 👍 2    🔁 0    💬 0    📌 0

This isn't just about edge cases either: 1) happens with nice models like Claude, and 2) is true even for dumb models like Gemma-2 2B

02.01.2025 20:50 — 👍 2    🔁 0    💬 1    📌 0

2) Verification is considerably harder than generation. Even when there are only a few hundred tokens, it often takes me several minutes to work out whether the reasoning is OK or not

02.01.2025 20:50 — 👍 2    🔁 0    💬 1    📌 0

Been really enjoying unfaithful CoT research with collaborators recently. Two observations:

1) It quickly becomes clear that models sneak in reasoning without verbalising where it comes from (e.g. writing down an equation that gives the correct answer, but is defined out of thin air)

02.01.2025 20:48 — 👍 4    🔁 0    💬 1    📌 0
Post image

We scaled training data attribution (TDA) methods ~1000x to find influential pretraining examples for thousands of queries in an 8B-parameter LLM over the entire 160B-token C4 corpus!
medium.com/people-ai-re...

13.12.2024 18:57 — 👍 36    🔁 8    💬 2    📌 5
Post image

Sparse Autoencoders (SAEs) are popular, with 10+ new approaches proposed in the last year. How do we know if we are making progress? The field has relied on imperfect proxy metrics.

We are releasing SAE Bench, a suite of 8 SAE evaluations!

Project co-led with Adam Karvonen.
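[Editor's note: for readers new to SAEs, here is a minimal sketch of the vanilla ReLU architecture these evaluations target. All dimensions, parameter names, and the loss coefficient below are illustrative placeholders, not SAE Bench's actual setup.]

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: model activations of width d, an overcomplete dictionary of f features.
d, f = 16, 64

# Randomly initialised parameters (a real SAE trains these with gradient descent).
W_enc = rng.normal(0, 0.1, (d, f))
b_enc = np.zeros(f)
W_dec = rng.normal(0, 0.1, (f, d))
b_dec = np.zeros(d)

def sae_forward(x):
    """Encode activations into sparse non-negative features, then reconstruct them."""
    feats = np.maximum(0.0, x @ W_enc + b_enc)  # ReLU keeps feature activations >= 0
    recon = feats @ W_dec + b_dec
    return feats, recon

x = rng.normal(size=(8, d))        # a batch of fake residual-stream activations
feats, recon = sae_forward(x)

# Training minimises reconstruction error plus an L1 sparsity penalty on the features;
# the 1e-3 coefficient here is an arbitrary illustrative choice.
l2 = ((x - recon) ** 2).sum(axis=-1).mean()
l1 = np.abs(feats).sum(axis=-1).mean()
loss = l2 + 1e-3 * l1
```

The proxy-metric problem mentioned above is that low loss here does not guarantee the learned features are interpretable, which is what a benchmark like SAE Bench tries to measure more directly.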

11.12.2024 06:07 — 👍 4    🔁 2    💬 1    📌 0

Still, there is no ground truth for interpretability, so progress is tough

10.12.2024 19:45 — 👍 2    🔁 0    💬 0    📌 0
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders - Dec 2024

Currently, most of our SAE evaluations don't fully capture what we want, and are very expensive.

This work provides a battery of automated metrics that should help researchers understand their SAEs' strengths and weaknesses better: www.neuronpedia.org/sae-bench/info

10.12.2024 19:44 — 👍 7    🔁 0    💬 1    📌 0
Post image

Please report @deep-mind.bsky.social: it is not the real DeepMind, which would obviously use the google.com or deepmind.google domains rather than one they bought off GoDaddy 3 days ago: uk.godaddy.com/whois/result...

26.11.2024 00:43 — 👍 20    🔁 4    💬 0    📌 0

Awesome research. The caveat is that the humans worked for 8 hours, but they were explicitly encouraged to have results by the 2-hour mark, so I buy the claim.

22.11.2024 22:22 — 👍 1    🔁 0    💬 0    📌 0

I'm very bullish on automated research engineering arriving soon, but even I was surprised that AI agents are twice as good on 2-hour tasks as humans with 5+ years of experience or from a top AGI or safety lab. Paper: metr.org/AI_R_D_Evalu...

22.11.2024 22:21 — 👍 8    🔁 1    💬 1    📌 0
