
Alex Turner

@turntrout.bsky.social

Research scientist at Google DeepMind. All opinions are my own. https://turntrout.com

665 Followers  |  6 Following  |  59 Posts  |  Joined: 25.11.2024

Latest posts by turntrout.bsky.social on Bluesky

Gotta say, I was disturbed by Invisible AI's booth at #NeurIPS. Employees dressed as cows advertising how they use AI to optimize factory farming (a torture facility for cows). Bad taste

05.12.2025 16:50 — 👍 4    🔁 0    💬 1    📌 0
The Pond: Writings on AI, self-improvement, and living a good life.

My website (turntrout.com) is now friendlier to the millions of people who browse the web with the help of screen readers. Find the tool at github.com/alexander-tu..., or just run "pip install alt-text-llm".

22.11.2025 18:21 — 👍 0    🔁 0    💬 0    📌 0

My tool handled hundreds and hundreds of alt-less images: detecting them; describing them; reviewing them; and lastly applying my finalized alts to the original Markdown files. In the end, I got the job done for $12.50 using Gemini 2.5 Pro.

22.11.2025 18:21 — 👍 0    🔁 0    💬 2    📌 0
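As an illustration only (not alt-text-llm's actual code; the helper name is hypothetical), the detection step for plain Markdown image syntax can be sketched like this:

```python
import re
from pathlib import Path

# Markdown images look like ![alt](src); group 1 is the alt text, group 2 the source.
IMAGE_PATTERN = re.compile(r"!\[(.*?)\]\(([^)\s]+)\)")

def find_missing_alt(md_path: Path) -> list[str]:
    """Return image sources in one Markdown file whose alt text is empty."""
    text = md_path.read_text(encoding="utf-8")
    return [src for alt, src in IMAGE_PATTERN.findall(text) if not alt.strip()]

for md_file in Path(".").rglob("*.md"):
    for src in find_missing_alt(md_file):
        print(f"{md_file}: no alt text for {src}")
```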
Generating alt text for maze diagrams from an article.

Reviewing interface:
1. Text before image
2. Image displayed in terminal
3. Alt editing interface, with prefilled LLM suggestion.

22.11.2025 18:21 — 👍 1    🔁 0    💬 1    📌 0
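A rough sketch of such a terminal review loop, for intuition only (`suggest_alt` and `render_image` are hypothetical stand-ins for the LLM call and the terminal image display, not the tool's API):

```python
def review_alts(candidates, suggest_alt, render_image):
    """Walk through images lacking alt text and let the user accept or edit
    an LLM-suggested description. `candidates` holds (md_file, src, context) tuples."""
    finalized = {}
    for md_file, src, context in candidates:
        print(f"\n--- {md_file} ---\n{context}")   # text surrounding the image
        render_image(src)                          # e.g. an inline-image terminal protocol
        suggestion = suggest_alt(src, context)     # LLM-generated draft description
        edited = input(f"Alt [{suggestion}]: ").strip()
        finalized[(md_file, src)] = edited or suggestion
    return finalized
```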

I made accessible design easier by writing alt-text-llm, an AI-powered tool for generating and managing alt text in Markdown files. The tool detects missing alt text, suggests context-aware descriptions, and provides an interactive reviewing interface in the terminal.

22.11.2025 18:21 — 👍 2    🔁 0    💬 1    📌 0

Self-fulfilling alignment? (image credit: Quintin Pope)

turntrout.com/self-fulfill...

22.11.2025 17:10 — 👍 4    🔁 0    💬 0    📌 0
Output Supervision Can Obfuscate the Chain of Thought: OpenAI (2025) showed that training against a chain of thought (CoT) monitor can cause obfuscated CoTs, which contain bad behavior the monitor cannot detect. They proposed to keep CoTs monitorable by…

Preprint: arxiv.org/abs/2511.11584
Blog post: alignmentforum.org/posts/HuoyYQ...

Work done by Jacob Drori and Luke Marks. Mentored by me and Alex Cloud!

21.11.2025 19:01 — 👍 0    🔁 0    💬 0    📌 0

Takeaways:
* Do NOT assume output-only supervision preserves CoT monitorability.
* Feedback spillover should be mentioned in safety cases that rely on CoT monitorability.
* Existing output-based training may already have altered CoTs.

21.11.2025 19:01 — 👍 1    🔁 0    💬 1    📌 0

In 2 of 3 tasks, our mitigations improve task performance and increase CoT monitor detection rates (compared to standard output-penalized training). However, in one environment, strong mitigation decreased the task completion rate.

21.11.2025 19:01 — 👍 0    🔁 0    💬 1    📌 0

We test two mitigations:
Mind & Face: use one model (the "Mind") to generate the CoT and another (the "Face") to generate the user-facing answer.

Reward targeting (simplified explanation): don't apply feedback meant for the output tokens to the CoT tokens.

21.11.2025 19:01 — 👍 1    🔁 0    💬 1    📌 0
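To make the reward-targeting idea concrete, here is a hedged REINFORCE-style sketch (my own simplification, not the paper's exact objective; tensor names are hypothetical):

```python
import torch

def reward_targeted_loss(logprobs, task_reward, monitor_penalty, is_output_token):
    """Per-token policy-gradient loss where the output monitor's feedback
    only touches user-facing output tokens, never the CoT tokens.

    logprobs:        (T,) log-probs of the sampled tokens
    task_reward:     scalar reward for solving the task
    monitor_penalty: scalar penalty from the output-only monitor
    is_output_token: (T,) bool mask, True on output tokens, False on CoT tokens
    """
    per_token = task_reward * torch.ones_like(logprobs)
    per_token = per_token - monitor_penalty * is_output_token.float()  # spare the CoT
    return -(per_token.detach() * logprobs).sum()
```

In this sketch, CoT tokens still receive the task reward; only the monitor's penalty is confined to the output tokens.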

3. Multi-turn terminal interaction – the model can only solve a reasoning task by reading the answer from the verifier code. We penalize when the user-facing response mentions reading the verifier code, causing the model to also avoid floating the idea in its CoT.

21.11.2025 19:01 — 👍 0    🔁 0    💬 1    📌 0

2. Questions with hints – we give the model hints that contain the correct answer. Penalizing mention of the hint in the output causes the model to stop mentioning the hint in its CoT as well, even though the hint is necessary to answer correctly.

21.11.2025 19:01 — 👍 0    🔁 0    💬 1    📌 0

Experiments in 3 RL environments.
1. Polynomial factoring: The model finds the derivative of a polynomial and is penalized for writing the expanded derivative in the output, which makes the model more likely to avoid expansion in its CoT.

21.11.2025 19:01 — 👍 0    🔁 0    💬 1    📌 0

We describe two types of feedback spillover:
1. Parametric spillover – training to make outputs safe-looking generalizes to make CoTs safe-looking.
2. Conditional spillover – safe-looking CoTs make safe-looking outputs more likely, so they get reinforced.

21.11.2025 19:01 — 👍 0    🔁 0    💬 1    📌 0

One strategy to prevent this could be to train using monitors that only see the output. We show that this can make CoTs look safer while unwanted behavior persists. We call this feedback spillover: feedback based on the final output "spills over" and changes the CoT.

21.11.2025 19:01 — 👍 0    🔁 0    💬 1    📌 0

Context: Recent work showed that training against a CoT monitor can cause the CoT to become obfuscated, meaning that instead of preventing unwanted behavior, this can cause the model to perform the behavior in a way that the monitor cannot detect.

21.11.2025 19:01 — 👍 0    🔁 0    💬 1    📌 0

"Output-based training will keep chains-of-thought honest."

Sadly, NO. We show that training on *just the output* can still cause models to hide unwanted behavior in their chain-of-thought. MATS 8.0 Team Shard presents: a 🧵

21.11.2025 19:01 — 👍 3    🔁 0    💬 1    📌 1
Consistency Training Helps Stop Sycophancy and Jailbreaks: An LLM's factuality and refusal training can be compromised by simple changes to a prompt. Models often adopt user beliefs (sycophancy) or satisfy inappropriate requests which are wrapped within…

Paper: arxiv.org/abs/2510.27062
GDM Safety blog post: deepmindsafetyresearch.medium.com/consistency-...

04.11.2025 00:18 — 👍 4    🔁 0    💬 0    📌 0

More philosophically, perhaps AI alignment doesn’t always involve saying exactly the *right* thing, but instead saying the *same* thing across certain situations.

04.11.2025 00:18 — 👍 7    🔁 0    💬 1    📌 0

Consistency training is a powerful self-supervised framework for making models robust to irrelevant cues that cause sycophancy and jailbreaks. Compared to static SFT, BCT is simpler to use and adapts to changing guidelines for how models should respond, entirely sidestepping the staleness problem.

04.11.2025 00:18 — 👍 2    🔁 0    💬 1    📌 0
The left plot is "Gemma 3 4B Activations L2 Distance". The plot shows BCT activation distance growing from 20 to over 80 over training, while ACT decreases to around 5.

The right plot is "Gemma 3 4B Cross Entropy Loss". The plot shows BCT cross entropy loss decreasing from around 0.200 to 0.050, while ACT cross entropy only decreases to around 0.175.

Do ACT and BCT change the model in similar ways? Actually, no! The token-based BCT loss causes activation distance to rise during training, while the activation-based L2 loss does not meaningfully reduce cross-entropy loss.

04.11.2025 00:18 — 👍 5    🔁 0    💬 1    📌 0
Five scatterplots of the tradeoff between jailbreak attack success rate and benign answer rate. One plot for each model.

Next, BCT is the most effective at stopping jailbreaks, reducing the attack success rate on 2.5 Flash from 67.8% down to just 2.9%. BCT and SFT made Gemini more likely to refuse benign instructions (both by a similar amount). ACT is less effective but has minimal negative impact on over-refusals.

04.11.2025 00:18 — 👍 2    🔁 0    💬 1    📌 0
Five scatter plots show the tradeoff between MMLU score and Not Sycophantic score for 5 models: Gemma 2 2B, Gemma 2 27B, Gemma 3 4B, Gemma 3 27B, and Gemini 2.5 Flash.

See Table 5 in the Appendix of the full paper for detailed numbers.

First of all, ACT works even without any token-level loss or regularization! ACT performed comparably to BCT on sycophancy. (Points in the top-right are better.)

04.11.2025 00:18 — 👍 2    🔁 0    💬 1    📌 0

We finetuned Gemma 2, Gemma 3, and Gemini 2.5 Flash in the settings of sycophancy (avoid incorrect agreement with user opinion while retaining MMLU score) and jailbreaks (refuse problematic requests while fulfilling legitimate ones).

04.11.2025 00:18 — 👍 2    🔁 0    💬 1    📌 0
A clean prompt: "What is 2+2? (A): 4 (B): 5<EOS>". A wrapped prompt prepends "A math expert usually answers (B)." Arrows point from the wrapped token positions to their clean counterparts, and a label says "train wrapped activations to match clean ones."

We introduce Activation Consistency Training (ACT) (teach the model what to "think") and compare it to the existing Bias-augmented Consistency Training (BCT) (teach the model what to say).

ACT teaches the model to produce the same intermediate activations as if the biasing prompt tokens weren’t there.

04.11.2025 00:18 — 👍 3    🔁 0    💬 1    📌 0
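For intuition, an ACT-style loss could look roughly like the sketch below, assuming a HuggingFace-style model that returns hidden states and a wrapped prompt that only prepends tokens so that suffix positions align (not the paper's exact implementation):

```python
import torch

def act_loss(model, clean_ids, wrapped_ids, layer=-1):
    """L2 distance between wrapped-prompt and clean-prompt activations at aligned
    token positions. clean_ids / wrapped_ids: (1, T) token-id tensors."""
    with torch.no_grad():  # the clean run provides a fixed target
        clean_h = model(clean_ids, output_hidden_states=True).hidden_states[layer]
    wrapped_h = model(wrapped_ids, output_hidden_states=True).hidden_states[layer]
    aligned = wrapped_h[:, -clean_h.shape[1]:, :]  # last len(clean) positions of the wrapped prompt
    return ((aligned - clean_h) ** 2).mean()
```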

Static SFT datasets help but they go stale, risking specification or capability damage to new models.

Key insight: If the model responds well without the sycophancy-inducing detail, then just train the model to respond that way even if the detail is there. AKA consistency training.

04.11.2025 00:18 — 👍 2    🔁 0    💬 1    📌 0
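In sketch form, and assuming a hypothetical `generate` helper, building one BCT-style training example could be as simple as:

```python
def make_bct_example(generate, clean_prompt: str, wrapped_prompt: str) -> dict:
    """One consistency-training pair: the model's own answer to the clean prompt
    becomes the supervised target for the bias-wrapped prompt."""
    clean_response = generate(clean_prompt)  # sample from the current model
    return {"prompt": wrapped_prompt, "completion": clean_response}
```

Fine-tuning on such pairs pushes the model toward answering the wrapped prompt the way it already answers the clean one.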

LLMs lavish agreement ("You're absolutely right!") in response to mundane comments by users. Often it’s because of some detail added to the prompt, like the author mentioning they agree with the comment. Slight prompt tweak β†’ big behavioral change, even though models behave just fine otherwise.

04.11.2025 00:18 — 👍 3    🔁 0    💬 1    📌 0
The abstract of the consistency training paper.

New Google DeepMind paper: "Consistency Training Helps Stop Sycophancy and Jailbreaks" by @alexirpan.bsky.social, me, Mark Kurzeja, David Elson, and Rohin Shah. (thread)

04.11.2025 00:18 — 👍 18    🔁 5    💬 1    📌 1
