Cosmin Stamate

@stamate.bsky.social

AI & ML Scientist | Researcher • Engineer • Lecturer

570 Followers  |  4,420 Following  |  67 Posts  |  Joined: 20.09.2023

Posts by Cosmin Stamate (@stamate.bsky.social)

Original: x.com/rohanpaul_ai/status/1948572304809611701

25.07.2025 08:13 — 👍 0    🔁 0    💬 0    📌 0

... Paper – https://arxiv.org/abs/2507.16003

Paper Title: "Learning without training: The implicit dynamics of in-context learning"

25.07.2025 08:13 — 👍 0    🔁 0    💬 1    📌 0

... Results cover only the first generated token and a single transformer block without the MLP skip connection, so full‑stack models need more work.

Still, the finding hints that many in‑context tricks come from weight geometry rather than quirky attention rules.

--- ...

25.07.2025 08:13 — 👍 0    🔁 0    💬 1    📌 0

... 🤝 Finetune vs. Implicit Patch

They compare classic gradient finetuning on the same examples to the single‑shot patch strategy.

Both methods cut test loss in a similar pattern, yet the patch avoids any real back‑prop and keeps the rest of the network frozen.

---

🔎 Limits They Admit ...

25.07.2025 08:13 — 👍 0    🔁 0    💬 1    📌 0

... 🔬 Testing on Simple Linear Tasks

They train a small transformer to map x→w·x using 50 prompt pairs plus 1 query.

When they swap the prompt for its equivalent rank 1 patch and feed only the query, the loss curve overlaps the full‑prompt run almost perfectly.
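A rough sketch of that synthetic setup (the dimension and seed are made-up details, not the paper's exact configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_pairs = 16, 50

# Each prompt carries its own task vector w; targets are y = w·x.
w = rng.normal(size=d)
xs = rng.normal(size=(n_pairs + 1, d))   # 50 context inputs + 1 query
ys = xs @ w

context_x, context_y = xs[:-1], ys[:-1]  # the 50 in-context example pairs
query_x, query_y = xs[-1], ys[-1]        # the held-out query the model answers
```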

That overlap

--- ...

25.07.2025 08:13 — 👍 0    🔁 0    💬 1    📌 0

... 📐 Hidden Gradient Descent

Feeding tokens one by one stacks these tiny patches.

Proposition 3.1 proves each added token shifts the weights the same way online gradient descent would, with a step size tied to the query vector length.
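A toy numeric sketch of that stacking (simplified: each increment is computed against the frozen base weights, and random vectors stand in for real attention outputs):

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out = 6, 3
W0 = rng.normal(size=(d_out, d_in))   # frozen base weights
a_q = rng.normal(size=d_in)           # query's attention output, no context

# Attention outputs for the query as context tokens arrive one at a time.
a_steps = [rng.normal(size=d_in) for _ in range(5)]

W, prev = W0.copy(), a_q
for a_t in a_steps:
    # Each new token adds one rank-1 increment, analogous to an online
    # gradient-descent step with step size set by the query vector length.
    W = W + np.outer(W0 @ (a_t - prev), a_q) / a_q.dot(a_q)
    prev = a_t

# The increments telescope: stacking them equals the single patch
# for the final context.
single = W0 + np.outer(W0 @ (a_steps[-1] - a_q), a_q) / a_q.dot(a_q)
assert np.allclose(W, single)
```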

The shift shrinks as soon as a token stops

--- ...

25.07.2025 08:13 — 👍 0    🔁 0    💬 1    📌 0

... 🧩 How the Patch Works

Theorem 2.2 gives a formula: multiply the base weights by the context change vector, take the outer product with the query representation, boom, you get the patch.
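A minimal numeric check of that identity (random vectors stand in for real attention outputs; the shapes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 8, 4
W = rng.normal(size=(d_out, d_in))   # frozen base weights
a_q = rng.normal(size=d_in)          # attention output for the query alone
a_ctx = rng.normal(size=d_in)        # attention output with the context present
delta = a_ctx - a_q                  # context-induced change

# Rank-1 patch: (W @ delta) outer-producted with the normalized query.
patch = np.outer(W @ delta, a_q) / a_q.dot(a_q)

# Querying the patched weights without context reproduces the
# with-context output of the original weights.
assert np.allclose((W + patch) @ a_q, W @ a_ctx)
```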

Because the patch is rank 1, it stores almost no extra parameters yet still carries the full

--- ...

25.07.2025 08:13 — 👍 0    🔁 0    💬 1    📌 0

... 🛠️ Temporary rank 1 patch

A transformer block first runs the self‑attention layer and gets two things for the query token: the usual activation and the tiny difference between “with context” and “without context”.

It multiplies that difference by the frozen weight matrix, then

--- ...

25.07.2025 08:13 — 👍 0    🔁 0    💬 1    📌 0
Image 1 from X post

Image 2 from X post

⚙️ The Core Idea

They call any layer that can read a separate context plus a query a “contextual layer”.

Stack this layer on top of a normal multilayer perceptron and you get a “contextual block”.
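A minimal sketch of that definition (the mean-pool "attention" here is a made-up stand-in for any contextual layer, and a single linear map stands in for the MLP):

```python
import numpy as np

class ContextualBlock:
    """A contextual layer feeding a frozen linear map."""
    def __init__(self, W):
        self.W = W

    def attn(self, query, context=None):
        # Hypothetical contextual layer: nudge the query by the pooled context.
        if context is None:
            return query
        return query + context.mean(axis=0)

    def __call__(self, query, context=None):
        return self.W @ self.attn(query, context)

rng = np.random.default_rng(0)
block = ContextualBlock(rng.normal(size=(3, 8)))
q, C = rng.normal(size=8), rng.normal(size=(5, 8))

# The context's entire effect equals a rank-1 additive patch on W:
delta = block.attn(q, C) - block.attn(q)
patched = ContextualBlock(block.W + np.outer(block.W @ delta, q) / q.dot(q))
assert np.allclose(block(q, C), patched(q))
```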

For that block, the context acts exactly like a rank 1 additive patch on the

--- ...

25.07.2025 08:13 — 👍 1    🔁 0    💬 1    📌 0

Original: x.com/hardmaru/status/1947998113450631350

23.07.2025 13:32 — 👍 0    🔁 0    💬 0    📌 0
Image 1 from X post

ICML’s Statement about subversive hidden LLM prompts

We live in a weird timeline…

23.07.2025 13:32 — 👍 1    🔁 0    💬 1    📌 0

Original: x.com/mihirp98/status/1947736993229885545

23.07.2025 12:52 — 👍 0    🔁 0    💬 0    📌 0

... In collaboration with Amir Zadeh, Katerina Fragkiadaki (@KaterinaFragiad) and Deepak Pathak (@pathak2206) at @mldcmu

23.07.2025 12:52 — 👍 0    🔁 0    💬 1    📌 0

... Project webpage & code - https://diffusion-scaling.github.io

Arxiv - https://arxiv.org/abs/2507.15857

This project was co-led with Mengning Wu (@WuMengning54261). ...

23.07.2025 12:52 — 👍 0    🔁 0    💬 1    📌 0

... 🚨#8: A natural question here is—why does diffusion outperform AR when data is limited?

We hypothesize that the key advantage stems from the use of random masking in diffusion models, which serves as a form of data augmentation. Unlike AR models, which are trained on a single,
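The augmentation intuition can be illustrated with a toy masking loop (token values and mask-rate range are arbitrary, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = np.array([5, 9, 2, 7, 4, 1])   # one training sequence
MASK = -1                               # stand-in mask token id

# An AR model always sees the same left-to-right factorization of this
# sequence; diffusion-style random masking produces a fresh "view" of
# the same data every epoch, acting like data augmentation.
views = []
for epoch in range(3):
    rate = rng.uniform(0.15, 0.85)               # masking rate per step
    hidden = rng.random(tokens.shape) < rate     # which positions to mask
    views.append(np.where(hidden, MASK, tokens))

for v in views:
    print(v)
```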

---

🚨#9: ...

23.07.2025 12:52 — 👍 0    🔁 0    💬 1    📌 0

... 🚨Finding #7: The data efficiency of diffusion models translates to better downstream performance.

Lastly we evaluate the best-performing diffusion and AR models (trained under the same data budget) on a range of language understanding tasks.

Across most benchmarks, diffusion

--- ...

23.07.2025 12:52 — 👍 0    🔁 0    💬 1    📌 0

... 🚨 Finding #6: The compute required for diffusion to outperform AR follows a predictable power law.

Above we defined the critical compute threshold as the amount of FLOPs where diffusion matches AR performance for a given unique dataset size.
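The shape of such a fit can be sketched as a log-log regression (the numbers below are made up purely to show the mechanics, not the paper's coefficients):

```python
import numpy as np

# Hypothetical data: unique dataset sizes U and the critical compute
# C_crit at which diffusion matches AR; synthetic exact power law.
U = np.array([1e8, 3e8, 1e9, 3e9, 1e10])
C_crit = 2.3e3 * U ** 1.4

# Fit C_crit = a * U^b by linear regression in log space.
b, log_a = np.polyfit(np.log(U), np.log(C_crit), 1)
print(f"C_crit ~ {np.exp(log_a):.3g} * U^{b:.2f}")
```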

We find that we can derive a simple

--- ...

23.07.2025 12:52 — 👍 0    🔁 0    💬 1    📌 0

... ---

🚨 Finding #5: Muennighoff et al. showed that repeating the dataset for up to 4 epochs is nearly as effective as using fresh data for autoregressive models.

In contrast, we find that diffusion models can be trained on repeated data for up to 100 epochs, while having repeated data

--- ...

23.07.2025 12:52 — 👍 0    🔁 0    💬 1    📌 0

... 🚨 Finding #4: Diffusion models exhibit a much higher half-life of data reuse (R_D*) —i.e., the number of epochs after which returns from repeating data begins to significantly diminish.

We adopt the data-constrained scaling framework introduced by @Muennighoff et al. in their ...

23.07.2025 12:52 — 👍 0    🔁 0    💬 1    📌 0

... 🚨Finding #3: Diffusion models are significantly more robust to data repetition than autoregressive (AR) models.

We show training curves of models trained with the same total compute, but different trade-offs between unique data and number of epochs.

An “epoch” here means

--- ...

23.07.2025 12:52 — 👍 0    🔁 0    💬 1    📌 0

... 🚨 Finding #2: Autoregressive models begin to overfit much more quickly, while diffusion shows no signs of overfitting even after 10x the number of epochs.
In the above figure, we showed that increasing compute eventually favors diffusion. But compute can be scaled in two ways:

(i)

--- ...

23.07.2025 12:52 — 👍 0    🔁 0    💬 1    📌 0

... 🚨 Finding #1: Diffusion models outperform autoregressive models when trained with sufficient compute (i.e., more epochs & parameters).

Across different unique data scales, we observe:

1️⃣ At low compute, Autoregressive models win.
2️⃣ After a certain amount of compute,

--- ...

23.07.2025 12:52 — 👍 0    🔁 0    💬 1    📌 0
Image 1 from X post

Image 2 from X post

Image 3 from X post

Image 4 from X post

🚨 The era of infinite internet data is ending. So we ask:

👉 What’s the right generative modelling objective when data—not compute—is the bottleneck?

TL;DR:

▶️Compute-constrained? Train Autoregressive models

▶️Data-constrained? Train Diffusion models

Get ready for 🤿 1/n

--- ...

23.07.2025 12:52 — 👍 1    🔁 0    💬 1    📌 0
