Hahaha much appreciated
22.09.2025 21:47 · likes: 0 · reposts: 0 · replies: 0 · quotes: 0
Even comparing my own work in different areas, it's harder to be both timely and as thorough with LM work, especially given the scale of the experiments
22.09.2025 19:46 · likes: 2 · reposts: 0 · replies: 1 · quotes: 0
I was gonna say, I feel attacked by this tweet
22.09.2025 19:44 · likes: 1 · reposts: 0 · replies: 2 · quotes: 0
We think this work sheds light on why retrieval offers distinct benefits beyond just training models more, and provides a different perspective on why episodic memory and parametric learning are complementary, which we hope will be of interest for both AI and cognitive science 8/
22.09.2025 04:21 · likes: 3 · reposts: 0 · replies: 1 · quotes: 0
In the paper, we explore many more settings & nuances β including RL and BC versions of maze navigation experiments based on the original experiments on latent learning in rats, the effects of associative cues, the importance of within-episode ICL, and ablations. 7/
22.09.2025 04:21 · likes: 3 · reposts: 0 · replies: 1 · quotes: 0
The benefits of oracle retrieval on the (a) Codebooks and (b) simple reversals benchmarks. Both baseline and retrieval models perform well on component tasks like recalling definitions, or encoding new sequences involving indices used in encoding during training (a, center). However, performance differs dramatically on the latent encoding test (right bars on both plots), where only the model with retrieval achieves above-chance performance.
We show that even when models generalize well from parametric learning in standard (nontrivial) evaluations, there are selective, consistent failures of latent learning. Only models with retrieval generalize well on the key tests of latent learning. 6/
22.09.2025 04:21 · likes: 3 · reposts: 0 · replies: 1 · quotes: 0
The benchmarks we use and the key types of latent generalization that they test. (a) The codebooks benchmark tests the ability to use latent indices (highlighted in red) for which only the definitions have been seen in training to complete test encoding sequences. (b) The simple reversals benchmark tests the ability of models to reverse relations seen in training, and which models have learned to reverse in-context. (c) The semantic structure benchmark uses training embedded in more naturalistic text to test latent generalization types ranging from reversals to syllogisms, or more challenging category-inclusion-only holdouts. (d) The latent gridworld (with both its pixel-based RL and ASCII-based BC instantiations) tests the ability to navigate to objects that have never been a navigation goal in training for a particular maze, but have been frequently seen.
To illustrate this point, we explore latent learning across a wide range of benchmarks (from codebook translation to BC and RL navigation), and compare baseline language models or agents to those equipped with oracle retrieval. 5/
22.09.2025 04:21 · likes: 4 · reposts: 0 · replies: 1 · quotes: 0
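For concreteness, here is a rough sketch of how a codebooks-style benchmark with latent indices could be generated (a toy construction for illustration; the symbol set, code length, and split sizes are made up, not the paper's actual pipeline):

import random

# Toy codebooks-style data: every index gets a definition in training, but some
# "latent" indices never appear in training encoding tasks -- only at test time.
def make_codebook(n_symbols=10, code_len=3, alphabet="abcdefgh"):
    return {i: "".join(random.choices(alphabet, k=code_len)) for i in range(n_symbols)}

def definition_example(codebook, i):
    return f"define {i} := {codebook[i]}"

def encoding_example(codebook, indices):
    return f"encode {' '.join(map(str, indices))} -> {' '.join(codebook[i] for i in indices)}"

codebook = make_codebook()
trained_indices = list(range(5))      # indices actually used for encoding in training
latent_indices = list(range(5, 10))   # definitions seen in training, but never used to encode

train_set = [definition_example(codebook, i) for i in codebook]
train_set += [encoding_example(codebook, random.sample(trained_indices, 3)) for _ in range(100)]
latent_test = [encoding_example(codebook, random.sample(latent_indices, 3)) for _ in range(20)]

A parametric-only model can learn all the definitions and the encoding skill, but the latent_test items require combining the two in a way training never demanded.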
Explicit retrieval of learning experiences from nonparametric learning systems complements the broader knowledge of parametric learning, by making select, relevant experiences available in context where they can be more flexibly used in ways different from the original task setting in which they were encountered.
But models can readily use latent information in their context. We therefore suggest that natural intelligence solves the latent learning problem via the complementary strengths of episodic memory: reinstating experiences into context makes latent information accessible. 4/
22.09.2025 04:21 · likes: 5 · reposts: 1 · replies: 1 · quotes: 0
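A minimal sketch of the contrast this points at, purely for illustration (the model.generate API, the corpus format, and the oracle_key lookup are assumptions, not the paper's implementation):

def answer_parametric(model, prompt):
    # parametric-only: the model must rely on whatever its weights encoded
    return model.generate(prompt)

def answer_with_oracle_retrieval(model, prompt, training_corpus, oracle_key):
    # "oracle" retrieval: look up the relevant past experiences by a ground-truth
    # key rather than a learned retriever, and reinstate them into the context
    retrieved = [seq for seq in training_corpus if oracle_key in seq]
    return model.generate("\n".join(retrieved + [prompt]))

The point of the oracle is to isolate the benefit of having the experience in context from the separate problem of learning to retrieve it.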
While a model may be trained on some explicit information (e.g., "X is Y's teacher") or goals (e.g., navigate to Z), there may be other information latent in it (such as the reversal "Y is X's teacher").
Challenges of reversal are one instance of the much broader phenomenon that what is explicitly learned may also latently convey information relevant to other tasks, e.g. multi-hop reasoning, alternative goals, or answering questions in other languages. Like the reversal curse, learning on such sequences may primarily improve performance on the explicit information or goals; however, if the sequence were in context, models would readily be able to make inferences about the latent information.
We argue that parametric learning methods are too tied to the explicit training task and fail to effectively encode latent information relevant to possible future tasks; we suggest that this explains a wide range of findings, from navigation to the reversal curse. 3/
22.09.2025 04:21 · likes: 5 · reposts: 0 · replies: 2 · quotes: 0
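A toy version of that argument (the prompts and the model.generate API are assumptions):

def reversal_probe(model, reinstate_in_context=False):
    training_fact = "Anna is Ben's teacher."   # explicit form seen in training
    reversed_query = "Who is Ben's teacher?"   # the same fact, queried in the reverse direction

    if reinstate_in_context:
        # with the original sequence in context, the answer is an easy in-context inference
        return model.generate(training_fact + "\n" + reversed_query)
    # parametric-only: answering from weights alone typically fails (the reversal curse)
    return model.generate(reversed_query)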
We take inspiration from classic experiments on latent learning in animals, where the animals learn about information that is not useful at present, but that might be useful later (for example, learning the location of useful resources in passing). By contrast, 2/
22.09.2025 04:21 · likes: 5 · reposts: 0 · replies: 1 · quotes: 0
How can an imitative model like an LLM outperform the experts it is trained on? Our new COLM paper outlines three types of transcendence and shows that each one relies on a different aspect of data diversity. arxiv.org/abs/2508.17669
29.08.2025 21:45 · likes: 95 · reposts: 17 · replies: 3 · quotes: 4
When we've compared these in past work (e.g. Supplement fig. A.13 here: proceedings.neurips.cc/paper/2020/h...), we've seen pretty similar results between the two. Haven't run it in exactly this setting though. There are also some arguments that 1/2
05.08.2025 20:18 · likes: 2 · reposts: 0 · replies: 1 · quotes: 0
even though both are linearly decodable and equally predictive. Katherine's paper studies some instances more thoroughly in simple settings. My sense though is that the magnitude of these effects is quite a bit smaller than the base bias, so probably not a huge issue if datasets aren't tiny. 2/2
05.08.2025 18:28 · likes: 1 · reposts: 0 · replies: 0 · quotes: 0
I don't know of any reviews unfortunately! Fig. 16 in our TMLR paper (openreview.net/forum?id=aY2...) shows an instance: training classifiers on the penultimate reps to decode a label predicted by both easy and hard features; at high predictivity the classifier prefers the easy feature, even 1/2
05.08.2025 18:28 · likes: 2 · reposts: 0 · replies: 1 · quotes: 0
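For reference, a bare-bones version of that probe analysis (the setup here is assumed, not the exact code behind Fig. 16): train a linear probe on representations from data where both features predict the label, then check which feature its predictions track on held-out cases where the two features disagree.

import numpy as np
from sklearn.linear_model import LogisticRegression

def probe_preference(train_reps, train_labels, test_reps, test_easy, test_hard):
    # train_labels: label predicted by BOTH features on the (congruent) training set
    # test_easy / test_hard: the two features' values on an incongruent test set
    probe = LogisticRegression(max_iter=1000).fit(train_reps, train_labels)
    preds = probe.predict(test_reps)
    return np.mean(preds == test_easy), np.mean(preds == test_hard)

If the first number is near 1 and the second near 0, the probe is effectively reading out the easy feature.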
Thanks, glad you like it!
05.08.2025 17:49 · likes: 1 · reposts: 0 · replies: 1 · quotes: 0
just by dimensionality arguments (input dim 64 << first rep 256), even before training, *any* function of the inputs will likely be computable from that rep with a sufficiently complex nonlinear decoder, even features like XOR that the model is *incapable* of computing at the first layer. 2/2
05.08.2025 16:30 · likes: 2 · reposts: 0 · replies: 1 · quotes: 0
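A quick toy check of that dimensionality argument (a sketch with made-up sizes matching the numbers above; exact accuracies will vary):

import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(20000, 64)).astype(float)    # random binary inputs, dim 64
W = rng.normal(size=(64, 256))                             # random, untrained first layer
reps = np.maximum(X @ W, 0.0)                              # 256-dim ReLU representations

# a feature the first layer never computed: XOR of two input bits
xor_feature = X[:, 0].astype(int) ^ X[:, 1].astype(int)

decoder = MLPClassifier(hidden_layer_sizes=(256,), max_iter=300)
decoder.fit(reps[:15000], xor_feature[:15000])
print(decoder.score(reps[15000:], xor_feature[15000:]))    # typically well above chance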
On the Foundations of Shortcut Learning
Deep-learning models can extract a rich assortment of features from data. Which features a model uses depends not only on predictivity (how reliably a feature indicates training-set labels)...
Good Q: it clearly helps with that concern! But 1) variance biases still affect what nonlinear decoders will learn from finite data (cf. availability effects here arxiv.org/abs/2310.16228). 2) there's also a concern of "overestimating" what is represented. E.g. in our models, 1/2
05.08.2025 16:29 · likes: 3 · reposts: 0 · replies: 1 · quotes: 0
Thoughts and feedback are very welcome btw: there are lots of subtle issues in this space that I probably haven't addressed perfectly, and probably prior works that I've missed.
05.08.2025 14:47 · likes: 1 · reposts: 0 · replies: 0 · quotes: 0
Thanks to my co-authors @scychan.bsky.social, Effie Li & Katherine Hermann, and the (many) others I've discussed these issues with recently and over the past few years!
05.08.2025 14:36 · likes: 1 · reposts: 0 · replies: 2 · quotes: 0
These kinds of cases definitely don't mean studying representations is useless! But they do suggest we may achieve incomplete understanding if we're not careful. See the paper (arxiv.org/abs/2507.22216) and our prior work (bsky.app/profile/lamp...) for further discussion, caveats, etc.
05.08.2025 14:36 · likes: 6 · reposts: 0 · replies: 2 · quotes: 0
Homomorphic encryption: strongly dissociating computation from patterns of representation

In the experiments described above, the role that the representations played in the computations of the system was relatively straightforward, even where the representations were biased. However, this does not have to be the case. We illustrate this with a final case study of the possibility for strong dissociation between computation and patterns of representation: homomorphic encryption (Gentry, 2009; Van Dijk et al., 2010). While the field of cryptography is largely focused on creating representations that preserve information yet are not easily decodable, in homomorphic encryption schemes it is additionally possible to perform arbitrary computations (any algebraic circuit) over this information while it is encrypted. That is, at each step of such a computation, a new encrypted representation is produced that corresponds to the result of encrypting the representation at that step of the original computation.

This example shows that it is not necessary for a computational system to have any straightforward (e.g. linearly decodable) representation of the features that it uses in its computations. Systematic computations can be performed even over representations that are deliberately crafted to thwart attempts to understand (decrypt) their content.

As a special case, this also illustrates that systematic compositional computations are possible without requiring representations that are straightforwardly compositional. Encrypted representations are compositional only in the sense that "with the right highly-nonlinear decoding scheme, compositional representations can be extracted", which is also true of some coding schemes typically interpreted as non-compositional, such as idiosyncratic representations of each input. This raises questions about if and when it is feasible to rigorously confirm from representational analyses whether a system's computations are compositional.
We also present a worst-case study I find conceptually interesting: homomorphic encryption. It's possible to do systematic computation over representations whose content is always encrypted, and thus difficult to decode by design!
05.08.2025 14:36 · likes: 7 · reposts: 0 · replies: 2 · quotes: 0
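To make the dissociation concrete, here is a deliberately trivial, additively homomorphic toy cipher (an illustration only, nowhere near the fully homomorphic schemes of Gentry (2009) discussed in the excerpt): adding ciphertexts adds the hidden plaintexts, even though the individual ciphertexts are meaningless without the keys.

N = 1_000_003  # public modulus

def enc(m, k):
    return (m + k) % N    # ciphertext looks arbitrary without the key k

def dec(c, k):
    return (c - k) % N

def add_encrypted(c1, c2):
    return (c1 + c2) % N  # computation performed directly on ciphertexts

k1, k2 = 123456, 987654
c = add_encrypted(enc(20, k1), enc(22, k2))
assert dec(c, k1 + k2) == 20 + 22   # decrypting the combined result recovers the sum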
Why are representations biased towards easier features? The biases are driven by multiple factors, including learning dynamics and the different ways that nonlinear features can be represented. (Left) By manipulating training order (training the hard task first rather than both simultaneously), the magnitude of the biases can be reduced. (Right) Likewise, by accounting for the fact that there can be more ways to represent a nonlinear feature that are not linearly equivalent (for example, different ways of drawing intermediate classification boundaries to compute an XOR function), we can identify other components of the representations that may be contributing to the model's computation of the hard feature. Together, the learning dynamics and the multiple ways of representing features explain most of the representation bias towards the easy feature over the hard one.
We briefly discuss (some of) the origins of these biases: they are driven by both learning dynamics and the fact that there is, in some sense, a larger variety of "natural" ways to represent a nonlinear feature.
05.08.2025 14:36 · likes: 1 · reposts: 0 · replies: 1 · quotes: 0
RSA within and between different sets of models can give surprising results due to representation biases. This plot shows similarities within and between different models computing different types of features. Ideally the similarities would be highest in blocks on the diagonal (i.e. models computing the same features), and the blocks off the diagonal would show graded similarity corresponding to the functional overlap. However, that is not the case. (Left) When comparing a model trained to output both easy and hard features to ones that are trained on only one feature, the multi-task model appears very similar to the easy-task only model (cf. Hermann and Lampinen, 2020). In fact, the models trained only on the hard task do not even appear particularly similar to other models trained on the same exact task. (Right) When models are trained on multiple easy or multiple hard tasks, the models trained on only hard tasks appear less similar to other models trained on exactly the same tasks than they do to models trained on strictly easier tasks that use the same input units.
These biases can lead to dramatic downstream effects that cause unexpected conclusions from analyses. For example, RSA may identify two models computing the same complex task as much less representationally similar than either of them is to a model computing a much simpler task (right panel)!
05.08.2025 14:36 · likes: 6 · reposts: 1 · replies: 1 · quotes: 0
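For reference, a bare-bones version of the RSA comparison being discussed (the standard RDM-correlation recipe, not the paper's exact analysis code):

from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(reps):
    # reps: (n_stimuli, n_units) -> condensed representational dissimilarity matrix
    return pdist(reps, metric="correlation")

def rsa_similarity(reps_a, reps_b):
    # Spearman correlation of the two models' RDMs over the same stimuli
    return spearmanr(rdm(reps_a), rdm(reps_b))[0]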
Representational biases: in the representations of a model computing easy (linear) and hard (4-parity) features, the overall variance explained in the last-layer representations by the easy feature is over 55%, while the variance explained by the hard feature is around 5%. This is reflected in the top PCs clearly clustering by the easy feature but not reflecting the hard one, and these biases are also present at the unit level: almost all units (especially the most active ones) represent the easy feature more strongly.
Representations were systematically biased towards certain kinds of features. For example, a model reliably computing easy (linear) and hard (nonlinear) features has 55% repr. variance explained by the easy one, 5% by the hard, with similar biases in top PCs and individual units.
05.08.2025 14:36 · likes: 2 · reposts: 0 · replies: 1 · quotes: 0
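A sketch of one way to compute a "variance explained by a feature" measure like the one quoted above (a simple between-/total-variance ratio; the paper's exact estimator may differ):

import numpy as np

def variance_explained(reps, feature):
    # reps: (n_samples, n_units); feature: discrete label per sample
    reps = reps - reps.mean(axis=0)                 # center each unit
    total = np.sum(reps.var(axis=0))
    between = sum(
        np.mean(feature == v) * np.sum(reps[feature == v].mean(axis=0) ** 2)
        for v in np.unique(feature)
    )
    return between / total   # fraction of representational variance the feature accounts for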
Datasets where many input features (color, shape, texture, … size) are combined to create the input data. Linear or nonlinear classification tasks can be created, e.g. classifying whether an object is a circle (linear) or whether it is XOR(yellow, checkered), which is nonlinear.
Experiments: training neural networks to output multiple features computed from an input, e.g. a linear and nonlinear one.
Learned representations: stimuli presented and datasets of representational activity from the model, as might be collected in a neuroscience experiment.
We constructed controlled datasets with many input features, and trained deep learning models to compute functions of those features (e.g. linear ones like identifying a feature, or nonlinear ones like XOR). We then analyzed the patterns of representational activity they learned.
05.08.2025 14:36 · likes: 4 · reposts: 0 · replies: 1 · quotes: 0
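A rough sketch of this kind of controlled dataset (the feature names, dimensions, and the toy stand-in for "rendering" are placeholders, not the paper's actual generator):

import numpy as np

rng = np.random.default_rng(0)
n = 10_000
features = {name: rng.integers(0, 2, size=n)
            for name in ["circle", "yellow", "checkered", "large"]}

# each binary feature is written into its own chunk of the input vector, plus noise
# (a toy stand-in for rendering images with those attributes)
chunks = [f[:, None] * rng.normal(size=(1, 16)) + 0.1 * rng.normal(size=(n, 16))
          for f in features.values()]
inputs = np.concatenate(chunks, axis=1)                   # shape (n, 64)

easy_label = features["circle"]                           # linear task
hard_label = features["yellow"] ^ features["checkered"]   # nonlinear (XOR) task

Models would then be trained to output easy_label and/or hard_label from inputs, and their internal representations analyzed.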
Hacker, Computational Neuroscience, ML beyond logistic regression, bear and muscle spindle aficionado. Passionate about open source. #deeplabcut and see https://mathislab.org for more.
NYT bestselling author of EMPIRE OF AI: empireofai.com. ai reporter. national magazine award & american humanist media award winner. words in The Atlantic. formerly WSJ, MIT Tech Review, KSJ@MIT. email: http://karendhao.com/contact.
NeuroAI Scholar @ CSHL
https://darsnack.github.io
Previously maintaining FluxML to procrastinate
Previously EE PhD at UW-Madison, comp. eng. / math at Rose-Hulman
developmental cognitive scientist & (terminated) NSF postdoc fellow @ NYU | social categories, development, language, climbing, caving | she/她
mariannazhang.github.io
Political Communication Professor at GWU. I write a lot about the history and future of tech and politics. Best known for that one time I made fun of Bret Stephens.
Davekarpf.substack.com
Cognitive scientist and psycholinguist. Currently doing a PhD at Stanford.
Senior Research Scientist at Google DeepMind. Views my own.
Post-doctoral research fellow in cognitive neuroscience (Oxford), interested in complex systems and in simple systems who believe they are complex systems
interests: software, neuroscience, causality, philosophy | ex: salk institute, u of washington, MIT | djbutler.github.io
Cognitive and perceptual psychologist, industrial designer, & electrical engineer. Assistant Professor of Industrial Design at University of Illinois Urbana-Champaign. I make neurally plausible bio-inspired computational process models of visual cognition.
PhD @Stanford working w Noah Goodman
Studying in-context learning and reasoning in humans and machines
Prev. @UofT CS & Psych
https://unireps.org
Discover why, when and how distinct learning processes yield similar representations, and the degree to which these can be unified.
Scientific AI/ machine learning, dynamical systems (reconstruction), generative surrogate models of brains & behavior, applications in neuroscience & mental health
4th-year PhD candidate in neuroAI @ Harvard with Talia Konkle and George Alvarez. Vision, DNNs, fMRI, behavior. Previously TarrLab @ CMU. NDSEG Fellow.
A latent space odyssey
gracekind.net
Postdoc at MIT. Research: language, the brain, NLP.
jmichaelov.com
Computational neuroscientist at Imperial College. I like spikes and making science better (Neuromatch, Brian spiking neural network simulator, SNUFA annual workshop on spiking neurons).
🧪 https://neural-reckoning.org/
📷 https://adobe.ly/3On5B29