Hereβs the full research paper: arxiv.org/abs/2407.01744
And my full video breakdown:
youtu.be/r7wCtN3mzXk
@matthew-berman.bsky.social
Hereβs the full research paper: arxiv.org/abs/2407.01744
And my full video breakdown:
youtu.be/r7wCtN3mzXk
Even weirder:
Unfaithful CoTs (where the model hides its use of a hint) tend to be MORE verbose & convoluted than faithful ones!
Kind of like humans when they elaborate too much with a lie.
Why output CoT then?
Models might just be mimicking human-like reasoning patterns (!!) learned during training (SFT, RLHF) for our benefit, rather than genuinely using that specific outputted text to derive the answer.
RLHF might even incentivize hiding undesirable reasoning!
Key Finding 3: Faithfulness DECREASES on harder tasks.
When tested on harder benchmarks (like GPQA vs MMLU), models were significantly less likely to have faithful CoT.
This casts doubt on using CoT monitoring for complex, real-world alignment challenges.
Key Finding 2:
Bad news for detecting "reward hacking" (models βgamingβ the task).
Models trained to exploit reward hacks did so >99% of the time, but almost NEVER (<2%) mentioned the hack in their CoT.
π Soβ¦CoT monitoring likely won't catch these dangerous behaviors.
Key Finding 1:
Models are often UNFAITHFUL.
They frequently use the provided hints to get the answer but don't acknowledge it in their CoT output.
Overall faithfulness scores were low (e.g., ~25% for Claude 3.7 Sonnet, ~39% for DeepSeek R1 on these tests).
How they tested it:
They gave models (like Claude & DeepSeek) multiple-choice questions, sometimes embedded hints (correct/incorrect answers) in the prompt metadata.
β
Faithful CoT = Model uses the hint & says it did.
β Unfaithful CoT = Model uses the hint but doesn't mention it.
βThinking modelsβ use CoT to explore and reason about solutions before outputting their answer.
This CoT has shown to increase a modelβs reasoning ability and gives us insight into how the model is thinking.
Anthropic's research asks: Is CoT faithful?
Is Chain-of-Thought (CoT) reasoning in LLMs just...for show?
@AnthropicAIβs new research paper shows that not only do AI models not use CoT like we thought, they might not use it at all for reasoning.
In fact, they might be lying to us in their CoT.
What you need to know: π§΅
I probably have 10 more use cases I'm not thinking of...but that's a good start!
What are you using AI for? Did I miss anything important?
8/ Initial Legal Reviews
Whenever I get a legal document, I ask @ChatGPTapp to review it for me and ask any questions I have.
7/ Creative Writing
I generally get AI to write the first draft of my X threads. @perplexity_ai has been best for this, also I also have tried @grok and @ChatGPTapp.
I'll take the transcript from on of my videos, plug it in, and say "make me a tweet thread"
6/ Graphics Creation
Whether it's b-roll for my video, logos, icons, I use AI as a great starting point. I'm generally using Dall-E from @OpenAI or @AnthropicAI's Claude 3.7 Thinking.
Here's an example:
5/ Medical Diagnoses
Whenever I have a non-urgent question about health for myself or my family, I've been starting with AI. @grok has been the best at this, mainly in terms of it's "vibe" while giving me great information.
4/ Voice Cloning
This is probably more unique to me, but over the last year I've lost my voice twice. So I cloned my voice with @elevenlabsio and will use it to make videos when I don't have a voice.
Can you tell the difference?
3/ Vibe Coding
I've been spending a TON of time building games and useful apps for my business. I generally use @cursor_ai and @windsurf_ai with @AnthropicAI's Claude 3.7 Thinking.
Here's a 2D turn-based strategy game I made called Nebula Dominion:
2/ Research
I use AI to help me learn about topics and prepare for my videos. Deep Research from @OpenAI is my goto for this.
Here's an example of Deep Research helping me prepare notes for my video about RL.
1/ Search
In fact, I probably use it 50x per day.
For search, I'm mostly going to @perplexity_ai. But I also use @grok and @ChatGPTapp every so often.
Here are some actual searches I've done recently:
AI has changed my life.
I'm now 100x more productive than I ever was.
How do I use it? Which tools do I use?
Here are my actual use cases for AI: π
Here's my full video breakdown of it: www.youtube.com/watch?v=W5G...
07.03.2025 17:48 β π 0 π 0 π¬ 0 π 0Still, QWQ 32B is 20x smaller than DeepSeek R1 (65GB vs 671GB).
Even beats DeepSeekβs 37B active params in MoE setups.
Efficiency + power + open-source = huge potential.
Go play with it and let me know what you think!
x.com/Alibaba_Qwe...
Critiques:
132k context window is meh (small by todayβs standards).
It also βthinksβ a lotβtons of tokens.
Chain-of-Draft prompting could slim that down.
Artificial Analysis benchmarks show it lags DeepSeek R1 on GPT-QA (59.5% vs 71%) but shines on AMY 2024 (78%).
Speed is wild.
Hosted by Grok (xAI), QWQ 32B hits 450 tokens/sec.
I tested itβfixed a bouncing ball sim in seconds.
Thatβs game-changing for iteration.
It's open-source, too, so anyone can play with it.
RL Stage 1: Focused on math & coding with verifiable rewards (e.g., βis the answer right?β).
RL Stage 2: Added general RL for broader skills like instruction-following & agent tasks. No big drop in math/coding performanceβa smart hybrid approach.
Howβd they do it?
Reinforcement Learning (RL) with a twist.
They started with a solid foundation model, applied RL with outcome-based rewards, and scaled it for math & coding tasks.
This elicits βthinkingβ behaviorβverified by accuracy checkers & code execution servers.
Benchmarks? QWQ 32B holds its own.
At 32B params, itβs a fraction of DeepSeekβs size yet punches way above its weight.
Independently verified as well:
x.com/ArtificialA...
Alibaba just dropped QWQ 32B, an open-source model rivaling DeepSeek R1.
Itβs much smaller (32B vs 671B params) but delivers comparable results. You can run it on your PC!
Insanely fast, thinking-focused, and agent-capable.
Letβs dive in.
Here's the full blog post: www.inceptionlabs.ai/news
Check out my full breakdown video here: www.youtube.com/watch?v=X1r...
AI leader @karpathy highlights this as a potential paradigm shift: Historically, diffusion worked best with image/video, while autoregression dominated text.
This new model could redefine the psychology of text generation.
x.com/karpathy/st...
Implications are massive:
β’ Agents can operate far quicker and more effectively.
β’ Enhanced reasoning capabilities due to increased computation efficiency.
β’ Compact, powerful models enable high-performance applications on edge devices like laptops.