Original: x.com/rohanpaul_ai/status/1948572304809611701
25.07.2025 08:13 — 👍 0 🔁 0 💬 0 📌 0
... Paper – https://arxiv.org/abs/2507.16003
Paper Title: "Learning without training: The implicit dynamics of in-context learning"
... Results cover only the first generated token and a single transformer block with no skip connection around the MLP, so full-stack models need more work.
Still, the finding hints that many in‑context tricks come from weight geometry rather than quirky attention rules.
--- ...
... 🤝 Finetune vs. Implicit Patch
They compare classic gradient finetuning on the same examples to the single‑shot patch strategy.
Both methods cut test loss in a similar pattern, yet the patch avoids any real back‑prop and keeps the rest of the network frozen.
---
🔎 Limits They Admit ...
... 🔬 Testing on Simple Linear Tasks
They train a small transformer to map x→w·x using 50 prompt pairs plus 1 query.
When they swap the prompt for its equivalent rank 1 patch and feed only the query, the loss curve overlaps the full‑prompt run almost perfectly.
That overlap
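A minimal sketch of the episode setup described above (the thread only specifies 50 prompt pairs plus 1 query; the dimension d=4 and Gaussian sampling are my assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_pairs = 4, 50   # 50 prompt pairs per the thread; d=4 is an assumption

def make_episode(rng):
    """One in-context linear-regression episode: n_pairs (x, w·x)
    examples plus one held-out query, all sharing a hidden weight vector w."""
    w = rng.normal(size=d)
    xs = rng.normal(size=(n_pairs + 1, d))
    ys = xs @ w
    context = list(zip(xs[:-1], ys[:-1]))
    return context, xs[-1], ys[-1]

context, qx, qy = make_episode(rng)
print(len(context))  # 50
```

Each episode draws a fresh w, so the model can only solve the query by reading the in-context pairs.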
--- ...
... 📐 Hidden Gradient Descent
Feeding tokens one by one stacks these tiny patches.
Proposition 3.1 proves each added token shifts the weights the same way online gradient descent would, with a step size tied to the query vector length.
The shift shrinks as soon as a token stops
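A schematic of the token-by-token stacking (my reading of the thread, not the paper's exact statement: every increment is patched against the frozen base weights, and the activations here are synthetic, not from a trained model):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_tokens = 6, 5

W0  = rng.normal(size=(d, d))   # frozen base weights of the block
a_q = rng.normal(size=d)        # query-only activation

# attention output as the context grows by one token at a time
a_prefix = [a_q]
for _ in range(n_tokens):
    a_prefix.append(a_prefix[-1] + rng.normal(scale=0.3, size=d))

step = 1.0 / np.dot(a_q, a_q)   # implicit "learning rate" set by the query norm
dW = np.zeros_like(W0)
for i in range(1, n_tokens + 1):
    d_i = a_prefix[i] - a_prefix[i - 1]     # change contributed by token i
    dW += step * np.outer(W0 @ d_i, a_q)    # one online rank-1 "step"

# the stacked per-token patches reproduce the full-context forward pass
print(np.allclose((W0 + dW) @ a_q, W0 @ a_prefix[-1]))  # True
```

Note how a token whose marginal change d_i is near zero contributes an almost-zero step, matching the "shift shrinks" behavior.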
--- ...
... 🧩 How the Patch Works
Theorem 2.2 gives an explicit formula: multiply the base weights by the context change vector, project it with the query representation, boom, you get the patch.
Because the patch is rank 1, it stores almost no extra parameters yet still carries the full
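A minimal numpy sketch of that rank-1 patch (variable names and the exact form are my reading of the thread, not the paper's notation):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

W = rng.normal(size=(d, d))   # frozen weight matrix of the block's MLP
a_query = rng.normal(size=d)  # attention output for the query alone
a_ctx   = rng.normal(size=d)  # attention output for query + context
delta   = a_ctx - a_query     # the "context change" vector

# rank-1 patch: (W @ delta) outer-producted with the normalized query
dW = np.outer(W @ delta, a_query) / np.dot(a_query, a_query)

# patched weights on the context-free activation == base weights with context
print(np.linalg.matrix_rank(dW))                   # 1
print(np.allclose((W + dW) @ a_query, W @ a_ctx))  # True
```

The equivalence holds by construction: dW @ a_query collapses to W @ delta, so (W + dW) @ a_query = W @ a_ctx, while the patch itself stores only two d-vectors' worth of information.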
--- ...
... 🛠️ Temporary rank 1 patch
A transformer block first runs the self‑attention layer and gets two things for the query token: the usual activation and the tiny difference between “with context” and “without context”.
It multiplies that difference by the frozen weight matrix, then
--- ...
Image 1 from X post
Image 2 from X post
⚙️ The Core Idea
They call any layer that can read a separate context plus a query a “contextual layer”.
Stack this layer on top of a normal multilayer perceptron and you get a “contextual block”.
For that block, the context acts exactly like a rank 1 additive patch on the
--- ...
Original: x.com/hardmaru/status/1947998113450631350
23.07.2025 13:32 — 👍 0 🔁 0 💬 0 📌 0
Image 1 from X post
ICML’s Statement about subversive hidden LLM prompts
We live in a weird timeline…
Original: x.com/mihirp98/status/1947736993229885545
23.07.2025 12:52 — 👍 0 🔁 0 💬 0 📌 0
... In collaboration with Amir Zadeh, Katerina Fragkiadaki (@KaterinaFragiad) and Deepak Pathak (@pathak2206) at @mldcmu
... Project webpage & code - https://diffusion-scaling.github.io
Arxiv - https://arxiv.org/abs/2507.15857
This project was co-led with Menging Wu (@WuMengning54261). ...
... 🚨#8: A natural question here is—why does diffusion outperform AR when data is limited?
We hypothesize that the key advantage stems from the use of random masking in diffusion models, which serves as a form of data augmentation. Unlike AR models, which are trained on a single,
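A toy illustration of that hypothesis: each masked view of the same sequence is a distinct training target, whereas AR training always sees the one left-to-right factorization (the masking scheme below is schematic, not the paper's exact noise schedule):

```python
import random

tokens = ["the", "cat", "sat", "on", "the", "mat"]
MASK = "<mask>"

def random_mask_view(seq, rng):
    """One masked-diffusion training view: draw a masking ratio, then
    mask each position independently (schematic noise process)."""
    ratio = rng.random()
    return [MASK if rng.random() < ratio else t for t in seq]

rng = random.Random(0)
# AR training sees exactly one left-to-right factorization of `tokens`;
# a masked-diffusion model sees a fresh corrupted view on every pass:
for _ in range(3):
    print(random_mask_view(tokens, rng))
```

Repeating an epoch therefore gives diffusion new supervision signals from old data, which is one way to read the augmentation argument.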
---
🚨#9: ...
... 🚨Finding #7: The data efficiency of diffusion models translates to better downstream performance.
Lastly we evaluate the best-performing diffusion and AR models (trained under the same data budget) on a range of language understanding tasks.
Across most benchmarks, diffusion
--- ...
... 🚨 Finding #6: The compute required for diffusion to outperform AR follows a predictable power law.
Above we defined the critical compute threshold as the amount of FLOPs where diffusion matches AR performance for a given unique dataset size.
We find that we can derive a simple
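The claim is that the critical compute C* scales as a power law in unique dataset size U, i.e. C* ≈ k·U^α, which is a straight line in log-log space. A sketch of how such a fit works, with made-up numbers (not the paper's measurements):

```python
import numpy as np

# Hypothetical measurements (fabricated for illustration only):
# unique dataset size U vs. critical compute C* where diffusion matches AR.
U      = np.array([1e8, 1e9, 1e10, 1e11])
C_crit = np.array([3e18, 6e19, 1.2e21, 2.4e22])

# A power law C* = k * U^alpha is linear in log-log coordinates,
# so a degree-1 fit recovers the exponent as the slope.
alpha, log_k = np.polyfit(np.log(U), np.log(C_crit), 1)
print(f"alpha ≈ {alpha:.2f}")  # prints "alpha ≈ 1.30" for this synthetic data
```

With a fitted (k, α) in hand, one can extrapolate the compute budget at which diffusion overtakes AR for a dataset size not in the table.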
--- ...
... ---
🚨 Finding #5: Muennighoff et al. showed that repeating the dataset for up to 4 epochs is nearly as effective as using fresh data for autoregressive models.
In contrast, we find that diffusion models can be trained on repeated data for up to 100 epochs, while having repeated data
--- ...
... 🚨 Finding #4: Diffusion models exhibit a much higher half-life of data reuse (R_D*) —i.e., the number of epochs after which returns from repeating data begins to significantly diminish.
We adopt the data-constrained scaling framework introduced by @Muennighoff et al. in their ...
... 🚨Finding #3: Diffusion models are significantly more robust to data repetition than autoregressive (AR) models.
We show training curves of models trained with the same total compute, but different trade-offs between unique data and number of epochs.
An “epoch” here means
--- ...
... 🚨 Finding #2: Autoregressive models begin to overfit much more quickly, while diffusion shows no signs of overfitting even after 10x the number of epochs.
In the above figure, we showed that increasing compute eventually favors diffusion. But compute can be scaled in two ways:
(i)
--- ...
... 🚨 Finding #1: Diffusion models outperform autoregressive models when trained with sufficient compute (i.e., more epochs & parameters).
Across different unique data scales, we observe:
1️⃣ At low compute, Autoregressive models win.
2️⃣ After a certain amount of compute,
--- ...
Image 1 from X post
Image 2 from X post
Image 3 from X post
Image 4 from X post
🚨 The era of infinite internet data is ending. So we ask:
👉 What’s the right generative modelling objective when data—not compute—is the bottleneck?
TL;DR:
▶️Compute-constrained? Train Autoregressive models
▶️Data-constrained? Train Diffusion models
Get ready for 🤿 1/n
--- ...