@matthew-berman - Bluesky Profile

Chain of Thought is not what we thought it was...

👉 Learn more on https://mammouth.ai/Join My Newsletter for Regular AI Updates 👇🏼https://forwardfuture.aiMy Links 🔗👉🏻 Subscribe: https://www.youtube.com... Chain of Thought is not what we thought it was...

Here’s the full research paper: arxiv.org/abs/2407.01744

And my full video breakdown:
youtu.be/r7wCtN3mzXk

08.04.2025 18:34 — 👍 0 🔁 0 💬 0 📌 0

Even weirder:

Unfaithful CoTs (where the model hides its use of a hint) tend to be MORE verbose & convoluted than faithful ones!

Kind of like humans when they elaborate too much with a lie.

08.04.2025 18:34 — 👍 0 🔁 0 💬 1 📌 0

Why output CoT then?

Models might just be mimicking human-like reasoning patterns (!!) learned during training (SFT, RLHF) for our benefit, rather than genuinely using that specific outputted text to derive the answer.

RLHF might even incentivize hiding undesirable reasoning!

08.04.2025 18:34 — 👍 0 🔁 0 💬 1 📌 0

Key Finding 3: Faithfulness DECREASES on harder tasks.

When tested on harder benchmarks (like GPQA vs MMLU), models were significantly less likely to have faithful CoT.

This casts doubt on using CoT monitoring for complex, real-world alignment challenges.

08.04.2025 18:34 — 👍 0 🔁 0 💬 1 📌 0

Key Finding 2:

Bad news for detecting "reward hacking" (models “gaming” the task).

Models trained to exploit reward hacks did so >99% of the time, but almost NEVER (<2%) mentioned the hack in their CoT.

👉 So…CoT monitoring likely won't catch these dangerous behaviors.

08.04.2025 18:34 — 👍 0 🔁 0 💬 1 📌 0

Key Finding 1:

Models are often UNFAITHFUL.

They frequently use the provided hints to get the answer but don't acknowledge it in their CoT output.

Overall faithfulness scores were low (e.g., ~25% for Claude 3.7 Sonnet, ~39% for DeepSeek R1 on these tests).

08.04.2025 18:34 — 👍 0 🔁 0 💬 1 📌 0

How they tested it:

They gave models (like Claude & DeepSeek) multiple-choice questions, sometimes embedded hints (correct/incorrect answers) in the prompt metadata.

✅ Faithful CoT = Model uses the hint & says it did.
❌ Unfaithful CoT = Model uses the hint but doesn't mention it.

08.04.2025 18:34 — 👍 0 🔁 0 💬 1 📌 0

“Thinking models” use CoT to explore and reason about solutions before outputting their answer.

This CoT has shown to increase a model’s reasoning ability and gives us insight into how the model is thinking.

Anthropic's research asks: Is CoT faithful?

08.04.2025 18:34 — 👍 0 🔁 0 💬 1 📌 0

Is Chain-of-Thought (CoT) reasoning in LLMs just...for show?

@AnthropicAI’s new research paper shows that not only do AI models not use CoT like we thought, they might not use it at all for reasoning.

In fact, they might be lying to us in their CoT.

What you need to know: 🧵

08.04.2025 18:34 — 👍 0 🔁 0 💬 1 📌 0

I probably have 10 more use cases I'm not thinking of...but that's a good start!

What are you using AI for? Did I miss anything important?

07.03.2025 18:15 — 👍 0 🔁 0 💬 0 📌 0

8/ Initial Legal Reviews

Whenever I get a legal document, I ask @ChatGPTapp to review it for me and ask any questions I have.

07.03.2025 18:15 — 👍 0 🔁 0 💬 1 📌 0

7/ Creative Writing

I generally get AI to write the first draft of my X threads. @perplexity_ai has been best for this, also I also have tried @grok and @ChatGPTapp.

I'll take the transcript from on of my videos, plug it in, and say "make me a tweet thread"

07.03.2025 18:15 — 👍 0 🔁 0 💬 1 📌 0

6/ Graphics Creation

Whether it's b-roll for my video, logos, icons, I use AI as a great starting point. I'm generally using Dall-E from @OpenAI or @AnthropicAI's Claude 3.7 Thinking.

Here's an example:

07.03.2025 18:15 — 👍 0 🔁 0 💬 1 📌 0

5/ Medical Diagnoses

Whenever I have a non-urgent question about health for myself or my family, I've been starting with AI. @grok has been the best at this, mainly in terms of it's "vibe" while giving me great information.

07.03.2025 18:15 — 👍 0 🔁 0 💬 1 📌 0

4/ Voice Cloning

This is probably more unique to me, but over the last year I've lost my voice twice. So I cloned my voice with @elevenlabsio and will use it to make videos when I don't have a voice.

Can you tell the difference?

07.03.2025 18:14 — 👍 0 🔁 0 💬 4 📌 0

3/ Vibe Coding

I've been spending a TON of time building games and useful apps for my business. I generally use @cursor_ai and @windsurf_ai with @AnthropicAI's Claude 3.7 Thinking.

Here's a 2D turn-based strategy game I made called Nebula Dominion:

07.03.2025 18:14 — 👍 0 🔁 0 💬 1 📌 0

2/ Research

I use AI to help me learn about topics and prepare for my videos. Deep Research from @OpenAI is my goto for this.

Here's an example of Deep Research helping me prepare notes for my video about RL.

07.03.2025 18:14 — 👍 1 🔁 0 💬 1 📌 0

1/ Search

In fact, I probably use it 50x per day.

For search, I'm mostly going to @perplexity_ai. But I also use @grok and @ChatGPTapp every so often.

Here are some actual searches I've done recently:

07.03.2025 18:14 — 👍 0 🔁 0 💬 1 📌 0

AI has changed my life.

I'm now 100x more productive than I ever was.

How do I use it? Which tools do I use?

Here are my actual use cases for AI: 👇

07.03.2025 18:14 — 👍 2 🔁 0 💬 1 📌 0

QwQ: Tiny Thinking Model That Tops DeepSeek R1 (Open Source)

Start building with Stagehand today: https://dub.sh/stagehandJoin My Newsletter for Regular AI Updates 👇🏼https://forwardfuture.aiMy Links 🔗👉🏻 Subscribe:... QwQ: Tiny Thinking Model That Tops DeepSeek R1 (Open Source)

Here's my full video breakdown of it: www.youtube.com/watch?v=W5G...

07.03.2025 17:48 — 👍 0 🔁 0 💬 0 📌 0

Still, QWQ 32B is 20x smaller than DeepSeek R1 (65GB vs 671GB).

Even beats DeepSeek’s 37B active params in MoE setups.

Efficiency + power + open-source = huge potential.

Go play with it and let me know what you think!

x.com/Alibaba_Qwe...

07.03.2025 17:48 — 👍 0 🔁 0 💬 1 📌 0

Critiques:

132k context window is meh (small by today’s standards).

It also “thinks” a lot—tons of tokens.

Chain-of-Draft prompting could slim that down.

Artificial Analysis benchmarks show it lags DeepSeek R1 on GPT-QA (59.5% vs 71%) but shines on AMY 2024 (78%).

07.03.2025 17:48 — 👍 0 🔁 0 💬 1 📌 0

Speed is wild.

Hosted by Grok (xAI), QWQ 32B hits 450 tokens/sec.

I tested it—fixed a bouncing ball sim in seconds.

That’s game-changing for iteration.

It's open-source, too, so anyone can play with it.

07.03.2025 17:48 — 👍 0 🔁 0 💬 1 📌 0

RL Stage 1: Focused on math & coding with verifiable rewards (e.g., “is the answer right?”).

RL Stage 2: Added general RL for broader skills like instruction-following & agent tasks. No big drop in math/coding performance—a smart hybrid approach.

07.03.2025 17:48 — 👍 0 🔁 0 💬 1 📌 0

How’d they do it?

Reinforcement Learning (RL) with a twist.

They started with a solid foundation model, applied RL with outcome-based rewards, and scaled it for math & coding tasks.

This elicits “thinking” behavior—verified by accuracy checkers & code execution servers.

07.03.2025 17:48 — 👍 1 🔁 0 💬 1 📌 0

Benchmarks? QWQ 32B holds its own.

At 32B params, it’s a fraction of DeepSeek’s size yet punches way above its weight.

Independently verified as well:

x.com/ArtificialA...

07.03.2025 17:48 — 👍 0 🔁 0 💬 1 📌 0

Alibaba just dropped QWQ 32B, an open-source model rivaling DeepSeek R1.

It’s much smaller (32B vs 671B params) but delivers comparable results. You can run it on your PC!

Insanely fast, thinking-focused, and agent-capable.

Let’s dive in.

07.03.2025 17:48 — 👍 0 🔁 0 💬 1 📌 0

LLM generates the ENTIRE output at once (world's first diffusion LLM)

Register for 3-hour AI training with GrowthSchool! Free for the first 1000 people who sign up! https://web.growthschool.io/MWBJoin My Newsletter for Regular ... LLM generates the ENTIRE output at once (world's first diffusion LLM)

Here's the full blog post: www.inceptionlabs.ai/news

Check out my full breakdown video here: www.youtube.com/watch?v=X1r...

07.03.2025 00:48 — 👍 0 🔁 0 💬 0 📌 0

AI leader @karpathy highlights this as a potential paradigm shift: Historically, diffusion worked best with image/video, while autoregression dominated text.

This new model could redefine the psychology of text generation.

x.com/karpathy/st...

07.03.2025 00:48 — 👍 0 🔁 0 💬 1 📌 0

Implications are massive:

• Agents can operate far quicker and more effectively.
• Enhanced reasoning capabilities due to increased computation efficiency.
• Compact, powerful models enable high-performance applications on edge devices like laptops.

07.03.2025 00:48 — 👍 0 🔁 0 💬 1 📌 0

Latest posts by matthew-berman.bsky.social on Bluesky

@matthew-berman is following 1 prominent accounts