Fazl Barez's Avatar

Fazl Barez

@fbarez.bsky.social

Let's build AI's we can trust!

208 Followers  |  139 Following  |  57 Posts  |  Joined: 17.11.2024  |  2.2259

Latest posts by fbarez.bsky.social on Bluesky

It is so easy to confuse chain of thought and explainability and in fact in a lot of the media it is presented as if with current LLMs we are allowed to view their actual thought processes. It is not that!

02.07.2025 12:41 β€” πŸ‘ 6    πŸ” 2    πŸ’¬ 0    πŸ“Œ 0

Link to the paper www.alphaxiv.org/abs/2025.02

02.07.2025 07:33 β€” πŸ‘ 4    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

@alasdair-p.bsky.social‬, @adelbibi.bsky.social ‬, Robert Trager, Damiano Fornasiere, @john-yan.bsky.social ‬, @yanai.bsky.social@yoshuabengio.bsky.social

01.07.2025 15:41 β€” πŸ‘ 2    πŸ” 0    πŸ’¬ 2    πŸ“Œ 0

Work done with my wonderful collaborators @tonywu1105.bsky.social IvΓ‘n Arcuschin, @bearraptor, Vincent Wang, @noahysiegel.bsky.social , N. Collignon, C. Neo, @wordscompute.bsky.social pute.bsky.social‬

01.07.2025 15:41 β€” πŸ‘ 4    πŸ” 0    πŸ’¬ 2    πŸ“Œ 0
Post image

Bottom line: CoT can be useful but should never be mistaken for genuine interpretability. Ensuring trustworthy explanations requires rigorous validation and deeper insight into model internals, which is especially critical as AI scales up in high-stakes domains. (9/9) πŸ“–βœ¨

01.07.2025 15:41 β€” πŸ‘ 4    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Inspired by cognitive science, we suggest strategies like error monitoring, self-correcting narratives, and dual-process reasoning (intuitive + reflective steps). Enhanced human oversight tools are also critical to interpret and verify model reasoning. (8/9)

01.07.2025 15:41 β€” πŸ‘ 4    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

We suggest treating CoT as complementary rather than sufficient for interpretability, developing rigorous methods to verify CoT faithfulness, and applying causal validation techniques like activation patching, counterfactual checks, and verifier models. (7/9)

01.07.2025 15:41 β€” πŸ‘ 4    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Why does this disconnect occur? One possibility is that models process information via distributed, parallel computations. Yet CoT presents reasoning as a sequential narrative. This fundamental mismatch leads to inherently unfaithful explanations. (6/9)

01.07.2025 15:41 β€” πŸ‘ 5    πŸ” 0    πŸ’¬ 1    πŸ“Œ 1

Another red flag: models often silently correct errors within their reasoning steps. They may produce the correct final answer by reasoning steps that are not verbalised, while the steps they do verbalise remain flawed, creating an illusion of transparency. (5/9)

01.07.2025 15:41 β€” πŸ‘ 5    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Alarmingly, explicit prompt biases can easily sway model answers without ever being mentioned in their explanations. Models rationalize biased answers convincingly, yet fail to disclose these hidden influences. Trusting such rationales can be dangerous. (4/9)

01.07.2025 15:41 β€” πŸ‘ 5    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Our analysis shows high-stakes domains often rely on CoT explanations: ~38% of medical AI, 25% of AI for law, and 63% of autonomous vehicle papers using CoT misclaim it as interpretability. Misplaced trust here risks serious real-world consequences. (3/9)

01.07.2025 15:41 β€” πŸ‘ 3    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

Language models can be prompted or trained to verbalize reasoning steps in their Chain of Thought (CoT). Despite prior work showing such reasoning can be unfaithful, we find that around 25% of recent CoT-centric papers still mistakenly claim CoT as an interpretability technique. (2/9)

01.07.2025 15:41 β€” πŸ‘ 7    πŸ” 0    πŸ’¬ 1    πŸ“Œ 1
Post image

Excited to share our paper: "Chain-of-Thought Is Not Explainability"! We unpack a critical misconception in AI: models explaining their steps (CoT) aren't necessarily revealing their true reasoning. Spoiler: the transparency can be an illusion. (1/9) 🧡

01.07.2025 15:41 β€” πŸ‘ 83    πŸ” 31    πŸ’¬ 2    πŸ“Œ 5
Kander (@kander.bsky.social) Malware malbec and gummie bears!

Co-authored with: Isaac Friend, Keir Reid, Igor Krawczuk, Vincent Wang, @jakobmokander.bsky.social kander.bsky.social , @philiptorr.bsky.social , Julia C Morse and Robert Trager

27.06.2025 08:07 β€” πŸ‘ 2    πŸ” 1    πŸ’¬ 0    πŸ“Œ 0

In our new paper, Toward Resisting AI-Enabled Authoritarianism, we propose some technical safeguards to push back:
πŸ”’ Scalable privacy
πŸ” Verifiable interpretability
πŸ›‘οΈ Adversarial user tools

27.06.2025 08:07 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

Technology = power. AI is reshaping power β€” fast.

Today’s AI doesn’t just assist decisions; it makes them. Governments use it for surveillance, prediction, and control β€” often with no oversight.

Technical safeguards aren’t enough on their own β€” but they’re essential for AI to serve society.

27.06.2025 08:07 β€” πŸ‘ 4    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

And Anna Yelizarov, @fbarez.bsky.social, @scasper.bsky.social, Beatrice Erkers, among others.

We'll draw from political theory, cooperative AI, economics, mechanism design, history, and hierarchical agency.

18.06.2025 18:12 β€” πŸ‘ 2    πŸ” 1    πŸ’¬ 1    πŸ“Œ 0
Video thumbnail

This is a step toward targeted, interpretable, and robust knowledge removal β€” at the parameter level.

Joint work with Clara Suslik, Yihuai Hong, and @fbarez.bsky.social, advised by @megamor2.bsky.social
πŸ”— Paper: arxiv.org/abs/2505.22586
πŸ”— Code: github.com/yoavgur/PISCES

29.05.2025 16:22 β€” πŸ‘ 1    πŸ” 1    πŸ’¬ 0    πŸ“Œ 0
Preview
BERI-AIGI (University of Oxford) AI Safety Research programme 2025 Technical AI Governance Initiative (AIGI) at the University of Oxford is partnering with BERI to offer Research Programme focused on both the technical and the governance aspects of AI safety. BERI ar...

Application link: shorturl.at/Se8my

20.05.2025 17:13 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

Come work with me at Oxford this summer! Paid research opportunity to:

White-box LLMs & model security
Safe RL & reward hacking
Interpretability & governance tools

Remote or Oxford.

Apply by 30 May 23:59 UTC. DM with questions.

20.05.2025 17:13 β€” πŸ‘ 2    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Job Details

my.corehr.com/pls/uoxrecru...

15.05.2025 11:12 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

Come work with me at Oxford!

We’re hiring a Postdoc in Causal Systems Modelling to:

- Build causal & white-box models that make frontier AI safer and more transparent
- Turn technical insights into safety cases, policy briefs, and governance tools
]

DM if you have any questions.

15.05.2025 11:12 β€” πŸ‘ 4    πŸ” 4    πŸ’¬ 1    πŸ“Œ 0

How do you handle wildly different reviewer opinions? Any tips for meta-reviews that actually help authors improve their work?

08.04.2025 11:59 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

First-time Area Chair seeking advice! What helped you most when evaluating papers beyond just averaging scores?

After suffering through unhelpful reviews as an author, I want to do right by papers in my track.

08.04.2025 11:59 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

πŸŽ‰ Our Actionable Interpretability workshop has been accepted to #ICML2025! πŸŽ‰
> Follow @actinterp.bsky.social
> Website actionable-interpretability.github.io

@talhaklay.bsky.social @anja.re @mariusmosbach.bsky.social @sarah-nlp.bsky.social @iftenney.bsky.social

Paper submission deadline: May 9th!

31.03.2025 16:59 β€” πŸ‘ 43    πŸ” 16    πŸ’¬ 3    πŸ“Œ 3

ahahah so true!

01.04.2025 15:28 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

Organizers: Ben Bucknall, @lisasoder.bsky.social, @ankareuel.bsky.social @fbarez.bsky.social, @carlosmougan.bsky.social
Weiwei Pan, Siddharth Swaroop, @ankareuel.bsky.social , Robert Trager @maosbot.bsky.social

01.04.2025 14:58 β€” πŸ‘ 5    πŸ” 2    πŸ’¬ 0    πŸ“Œ 0

Technical AI Governance (TAIG) at #ICML2025 this July in Vancouver!

Credit to
Ben and Lisa for all the work!

We have a new centre at Oxford working on technical AI governance with Robert Trager and @maosbot.bsky.social many other great minds. We are hiring - please reach out!
Quote

01.04.2025 15:10 β€” πŸ‘ 6    πŸ” 1    πŸ’¬ 0    πŸ“Œ 0

Naomi is great, you should consider joining her group!

27.03.2025 05:39 β€” πŸ‘ 2    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
CDS building which looks like a jenga tower

CDS building which looks like a jenga tower

Life update: I'm starting as faculty at Boston University
@bucds.bsky.social in 2026! BU has SCHEMES for LM interpretability & analysis, I couldn't be more pumped to join a burgeoning supergroup w/ @najoung.bsky.social @amuuueller.bsky.social. Looking for my first students, so apply and reach out!

27.03.2025 02:24 β€” πŸ‘ 246    πŸ” 13    πŸ’¬ 35    πŸ“Œ 6

@fbarez is following 20 prominent accounts