Co-led with @pkargupta.bsky.social ✨ We learned so much and couldn't have done it w/o our amazing collaborators and mentors: Ken Wang, Jinu Lee, @shan23chen.bsky.social, @orevaahia.bsky.social, Dean Light, Tom Griffiths, @maxkw.bsky.social, Jiawei Han, @asli-celikyilmaz.bsky.social, Yulia Tsvetkov🩵
25.11.2025 18:25 — 👍 5 · 🔁 0 · 💬 0 · 📌 0
What our Cognitive Foundations framework enables:
🔍 Systematic diagnosis of reasoning failures
🎯 Predicting which training yields which capabilities
🧪 Testing cognitive theories at scale
🌉 Shared vocabulary bridging cognition & AI research
More on opportunities & challenges in📄
25.11.2025 18:25 — 👍 1 · 🔁 0 · 💬 1 · 📌 0
Test-time reasoning guidance: up to 66.7% improvement 💡
We scaffold cognitive structures from successful traces to guide reasoning.
Major gains on ill-structured problems🌟
Models possess latent capabilities—they just don't deploy them adaptively without explicit guidance.
25.11.2025 18:25 — 👍 3 · 🔁 1 · 💬 1 · 📌 0
🧑🏻Humans reason differently‼️ More abstraction (54% vs 36%), self-awareness (49% vs 19%), conceptual processing. Less surface enumeration and rigid sequential chaining.
Even with correct answers—underlying mechanisms diverge fundamentally.
25.11.2025 18:25 — 👍 4 · 🔁 0 · 💬 1 · 📌 0
We analyzed 1,598 LLM reasoning papers:
Research concentrates on easily quantifiable behaviors—sequential organization (55%), decomposition (60%).
It neglects meta-cognitive controls (8-16%) and alternative representations (10-27%) that correlate with success⚠️
25.11.2025 18:25 — 👍 5 · 🔁 1 · 💬 1 · 📌 0
Structure matters as much as presence📐
We introduce a method to extract reasoning structure from traces
Successful: selective attention → knowledge alignment → forward chaining
Common: skip to forward chaining
LLMs prematurely seek solution before understanding constraints‼️
25.11.2025 18:25 — 👍 3 · 🔁 0 · 💬 1 · 📌 0
Model-specific patterns reveal training impact:
Olmo 3 exhibits more diverse cognitive elements (49%)—its developers explicitly included meta-reasoning data during midtraining.
DeepHermes-3: only 12% avg presence.
Training methodology shapes cognitive profiles dramatically.
25.11.2025 18:25 — 👍 3 · 🔁 0 · 💬 1 · 📌 0
Meta-cognitive deficit is severe:
🤔Self-awareness: 16% in research design, 19% in LLM traces vs 49% in humans
🧐Self-evaluation on non-verifiable problems collapses (53.5% presence, 0.031 correlation)
Models can't self-assess without ground truth.
25.11.2025 18:25 — 👍 2 · 🔁 0 · 💬 1 · 📌 0
The presence-effectiveness paradox:
Logical coherence: 91% of traces, 0.091 correlation w/ success
Knowledge alignment: 20% of traces, 0.234 correlation (high)
Models frequently attempt core elements but fail to execute. Having the capability ≠ deploying it successfully😬
25.11.2025 18:25 — 👍 2 · 🔁 0 · 💬 1 · 📌 0
Models deploy strategies inversely to what works 🚨
As problems become ill-structured, models narrow their repertoire—but successful traces show need for greater diversity (successful = high ppmi in fig).
Sequential organization dominates. Meta-cognition disappears in LLMs.
25.11.2025 18:25 — 👍 2 · 🔁 0 · 💬 1 · 📌 0
We analyze 192K reasoning traces from 18 models (text, image, and video LLMs) + 54 human think-aloud traces
We introduce a framework for fine-grained span-level cognitive evaluation: WHICH elements appear, WHERE, and HOW they're sequenced.
First analysis of its kind at this scale📊
25.11.2025 18:25 — 👍 2 · 🔁 0 · 💬 1 · 📌 0
Our taxonomy bridges cognitive science → LLM eval:
28 elements across 4 dimensions—reasoning invariants (compositionality, logical coherence), meta-cognitive controls (self-awareness), representations (hierarchical, causal), and operations (backtracking, verification)
25.11.2025 18:25 — 👍 4 · 🔁 1 · 💬 1 · 📌 0
LLMs solve hard problems but fail on easy variants, and exhibit reasoning patterns different from humans'.
The issue: reasoning is evaluated by outcomes, without understanding the cognitive processes that produce them. We can't diagnose failures or predict how training produces capabilities🚨
25.11.2025 18:25 — 👍 2 · 🔁 0 · 💬 1 · 📌 0
🤔💭What even is reasoning? It's time to answer the hard questions!
We built the first unified taxonomy of 28 cognitive elements underlying reasoning
Spoiler—LLMs commonly employ sequential reasoning, rarely self-awareness, and often fail to use correct reasoning structures🧠
25.11.2025 18:25 — 👍 46 · 🔁 8 · 💬 2 · 📌 0
Because Olmo 3 is fully open, we decontaminate our evals from our pretraining and midtraining data. @stellali.bsky.social proves this with spurious rewards: RL trained on a random reward signal can't improve on the evals, unlike some previous setups
20.11.2025 20:38 — 👍 1 · 🔁 1 · 💬 1 · 📌 0
Day 1 (Tue Oct 7) 4:30-6:30pm, Poster Session 2
Poster #77: ALFA: Aligning LLMs to Ask Good Questions: A Case Study in Clinical Reasoning; led by
@stellali.bsky.social & @jiminmun.bsky.social
06.10.2025 14:51 — 👍 2 · 🔁 1 · 💬 1 · 📌 0
This project was done as part of the Meta FAIR AIM mentorship program. Special thanks to my amazing collaborators and awesome mentors @melaniesclar.bsky.social @jcqln_h @hunterjlang @AnsongNi @andrew_e_cohen @jacoby_xu @chan_young_park @tsvetshop.bsky.social @asli-celikyilmaz.bsky.social 🫶🏻💙
22.07.2025 14:58 — 👍 0 · 🔁 0 · 💬 0 · 📌 0
🌍Bonus: PrefPalette🎨 is a computational social science goldmine!
📊 Quantify community values at scale
📈 Track how norms evolve over time
🔍 Understand group psychology
📋 Move beyond surveys to revealed preferences
22.07.2025 14:58 — 👍 0 · 🔁 0 · 💬 1 · 📌 0
💡Potential real-world applications:
🛡️Smart content moderation—explains why content is flagged/decisions are made
🎯Interpretable LM alignment—revealing prominent attributes
⚙️Controllable personalization—giving user agency to personalize select attributes
22.07.2025 14:58 — 👍 0 · 🔁 0 · 💬 1 · 📌 0
🔍More importantly‼️we can see WHY preferences differ:
r/AskHistorians:📚values verbosity
r/RoastMe:💥values directness
r/confession:❤️values empathy
We visualize each group’s unique preference decisions—no more one-size-fits-all. Understand your audience at a glance🏷️
22.07.2025 14:58 — 👍 0 · 🔁 0 · 💬 1 · 📌 0
🏆Results across 45 Reddit communities:
📈Performance boost: +46.6% vs GPT-4o
💪Outperforms other training-based baselines w/ statistical significance
🕰️Robust to temporal shifts—trained pref models can be used out-of-the-box!
22.07.2025 14:58 — 👍 0 · 🔁 0 · 💬 1 · 📌 0
⚙️How it works (pt.2)
1: 🎛️Train compact, efficient detectors for every attribute
2: 🎯Learn community-specific attribute weights during preference training
3: 🔧Add attribute embeddings to preference model for accurate & explainable predictions
22.07.2025 14:58 — 👍 0 · 🔁 0 · 💬 1 · 📌 0
⚙️How it works (prep stage)
📜Define 19 sociolinguistic & cultural attributes from literature
🏭Novel preference data generation pipeline to isolate attributes
Our data gen pipeline generates pairwise data on *any* decomposed dimension, w/ applications beyond preference modeling
22.07.2025 14:58 — 👍 0 · 🔁 0 · 💬 1 · 📌 0
Meet PrefPalette🎨! Our approach:
🔍⚖️models preferences w/ 19 attribute detectors and dynamic, context-aware weights
🕶️👍uses unobtrusive signals from Reddit to avoid response bias
🧠mirrors attribute-mediated human judgment—so you know not just what it predicts, but *why*🧐
22.07.2025 14:58 — 👍 0 · 🔁 0 · 💬 1 · 📌 0
🔬Cognitive science reveals how humans break choices into attributes, e.g.:
😂 Humor
❤️ Empathy
💬 Conformity
...then weight them based on context (e.g. comedy vs counseling).
These traits shape every decision, from product picks to conversation tone. Your mind is a colorful palette🎨
22.07.2025 14:58 — 👍 1 · 🔁 0 · 💬 1 · 📌 0
🚨Current preference models only output a reward/score:
❌No transparency in decision-making
❌Personalization breaks easily, one-size-fits-all scores
❌Use explicit annotations (response bias)
They can’t adapt to individual tastes, can’t debug errors, and fail to build trust🙅
22.07.2025 14:58 — 👍 0 · 🔁 0 · 💬 1 · 📌 0
WHY do you prefer something over another?
Reward models treat preference as a black box😶‍🌫️but human brains🧠decompose decisions into hidden attributes
We built the first system to mirror how people really make decisions in our recent COLM paper🎨PrefPalette✨
Why it matters👉🏻🧵
22.07.2025 14:58 — 👍 7 · 🔁 2 · 💬 1 · 📌 2
Want to quickly sample high-quality images from diffusion models, but can’t afford the time or compute to distill them? Introducing S4S, or Solving for the Solver, which learns the coefficients and discretization steps for a DM solver to improve few-NFE generation.
Thread 👇 1/
27.02.2025 18:24 — 👍 3 · 🔁 1 · 💬 1 · 📌 0