
Vaishnavh Nagarajan

@vaishnavh.bsky.social

Foundations of AI. I like simple and minimal examples and creative ideas. I also like thinking about the next token 🧮🧸 Google | PhD, CMU | https://arxiv.org/abs/2504.15266 | https://arxiv.org/abs/2403.06963 | vaishnavh.github.io

3,289 Followers  |  386 Following  |  202 Posts  |  Joined: 13.11.2024

Latest posts by vaishnavh.bsky.social on Bluesky

Post image

The visual world is composed of objects, and those objects are composed of features. But do VLMs exploit this compositional structure when processing multi-object scenes? In our 🆒🆕 #ICLR2026 paper, we find they do – via emergent symbolic mechanisms for visual binding. 🧵👇

05.02.2026 20:54 | 👍 81    🔁 25    💬 1    📌 3
Post image

He also contrasts the personalities of Hardy and Einstein:

13.01.2026 20:50 | 👍 2    🔁 0    💬 0    📌 0
Post image

Currently reading "A Mathematician's Apology" by GH Hardy. This is an excerpt from the foreword by CP Snow describing Hardy's personality and his work:

13.01.2026 20:49 | 👍 13    🔁 1    💬 1    📌 0

in associative memory, the latent space doesn't really encode any interesting distance.

imagine you're trying to store which countries share borders. you could simply write down a list of adjacent countries OR you could visualize the world map in your head. this is "associative" vs "geometric".

08.01.2026 22:47 | 👍 1    🔁 1    💬 1    📌 0
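
A minimal toy sketch of that contrast, with made-up borders and coordinates (not from the paper): the "associative" store is a literal list of pairs, while the "geometric" store is a set of positions whose distances themselves carry information.

    # Associative: a brute-force list of which pairs co-occur (share a border).
    borders = {("india", "nepal"), ("india", "china"),
               ("nepal", "china"), ("china", "mongolia")}

    def shares_border(a, b):
        return (a, b) in borders or (b, a) in borders

    # Geometric: each country gets a position; the latent space itself encodes
    # how far apart things are, even for pairs never written down explicitly.
    coords = {"india": (0.0, 0.0), "nepal": (0.5, 0.6),
              "china": (1.0, 1.2), "mongolia": (1.8, 2.2)}

    def latent_distance(a, b):
        (x1, y1), (x2, y2) = coords[a], coords[b]
        return ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5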

fascinating!

12.01.2026 19:08 | 👍 1    🔁 0    💬 0    📌 0

Would love pointers to related lit! Will DM you about the other question. Thank you for your kind words!

12.01.2026 19:04 | 👍 0    🔁 0    💬 0    📌 0

Rare to see such long-term efforts these days 🫡

09.01.2026 22:52 | 👍 14    🔁 1    💬 0    📌 0
Post image

We introduce epiplexity, a new measure of information that provides a foundation for how to select, generate, or transform data for learning systems. We have been working on this for almost 2 years, and I cannot contain my excitement! arxiv.org/abs/2601.03220 1/7

07.01.2026 17:27 | 👍 144    🔁 34    💬 9    📌 9

Please welcome Google's Open Source efforts to Bluesky at @opensource.google!

07.01.2026 21:12 | 👍 245    🔁 38    💬 7    📌 4
Post image

for deeper models, they initialize the network in a way that the decomposition of each layer aligns with the previous layer. if you didn't assume this, there'd be "interference" across components, which I *suspect* would contribute to associative memorization.

08.01.2026 22:52 | 👍 4    🔁 1    💬 0    📌 0
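
A rough numerical sketch of the kind of aligned initialization described above, in the spirit of Saxe et al. (arXiv:1312.6120); the matrices and constants here are illustrative stand-ins, not the paper's setup. Both layers are built from the same singular directions, so the composed map starts out diagonal in the target basis and the modes don't interfere.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 5
    Sigma = rng.standard_normal((d, d))        # stand-in input-output correlation
    U, S, Vt = np.linalg.svd(Sigma)            # target singular directions

    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # shared arbitrary rotation
    a = np.full(d, 1e-3)                               # tiny initial mode strengths

    W1 = Q @ np.diag(np.sqrt(a)) @ Vt          # layer 1 aligned to V (via Q)
    W2 = U @ np.diag(np.sqrt(a)) @ Q.T         # layer 2 aligned to U (via Q)

    # Composed map is diagonal in the (U, V) basis: no cross-mode interference.
    print(np.allclose(U.T @ (W2 @ W1) @ Vt.T, np.diag(a)))  # True
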
Preview
Exact solutions to the nonlinear dynamics of learning in deep linear neural networks Despite the widespread practical success of deep learning methods, our theoretical understanding of the dynamics of learning in deep neural networks remains quite sparse. We attempt to bridge the gap ...

thanks for being curious about it :-) I'm basing this off of the assumptions made in this seminal paper arxiv.org/abs/1312.6120

they begin with an analysis of 2-layer (weight-untied) models where the dynamics neatly evolve along each spectral component.

08.01.2026 22:52 | 👍 3    🔁 0    💬 1    📌 0
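
For a single spectral component under that kind of aligned initialization, the 2-layer dynamics reduce to a one-dimensional logistic-style equation, as in the Saxe et al. analysis. A minimal sketch with illustrative constants:

    import numpy as np

    s, a0, tau = 3.0, 1e-3, 1.0    # target singular value, tiny init, time scale
    dt, T = 1e-3, 5.0

    # Numerically integrate the per-mode equation: tau * da/dt = 2 * a * (s - a)
    a = a0
    for _ in range(int(T / dt)):
        a += dt * (2.0 / tau) * a * (s - a)

    # Closed-form logistic solution of the same equation
    a_closed = s / (1.0 + (s / a0 - 1.0) * np.exp(-2.0 * s * T / tau))
    print(round(a, 3), round(a_closed, 3))   # both saturate at s after a sigmoidal rise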

now if I ask you "how many countries away is Mongolia from India?", in the lookup table approach, you have to sit and piece together the connections by iterating over a frustratingly long list. in the map approach, you can "see" the answer quickly.

08.01.2026 22:47 | 👍 1    🔁 0    💬 1    📌 0
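
Continuing the toy example from above (made-up borders, not real data): the lookup-table answer really is an iterative search, whereas with a map-like embedding you could read a multi-hop distance off the coordinates directly.

    from collections import deque

    adj = {"india": ["nepal", "china"], "nepal": ["india", "china"],
           "china": ["india", "nepal", "mongolia"], "mongolia": ["china"]}

    def hops(start, goal):
        # Piece the chain together hop by hop (breadth-first search over the list).
        frontier, seen = deque([(start, 0)]), {start}
        while frontier:
            node, d = frontier.popleft()
            if node == goal:
                return d
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, d + 1))
        return None

    print(hops("india", "mongolia"))  # 2 hops: india -> china -> mongolia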

Thanks for engaging with the work! Could you elaborate? I'm not an expert on graph theory but I'd be interested in any ideas to better understand this.

08.01.2026 22:41 | 👍 0    🔁 0    💬 0    📌 0
Preview
Deep sequence models tend to memorize geometrically; it is unclear why Deep sequence models are said to store atomic facts predominantly in the form of associative memory: a brute-force lookup of co-occurring entities. We identify a dramatically different form of storage...

19/ These findings build on many nascent, fragmented observations in the literature that we could not credit here for lack of space. There are also caveats in extending all this to natural language (each caveat, an open question ;) ). Please see the full story here:

arxiv.org/abs/2510.26745

08.01.2026 20:31 | 👍 15    🔁 4    💬 0    📌 0

18/ We hope this inspires revisiting analyses of Transformer knowledge/storage capacity/unlearning. Graph setups may also help cleanly understand the emergence of "world models".

08.01.2026 20:31 | 👍 7    🔁 0    💬 1    📌 0

17/ Our findings suggest there's "magic" in integrating knowledge into model weights rather than stuffing it into context. They also show a vivid contrast between traditional retrieval with two-tower models vs modern generative retrieval models.

08.01.2026 20:31 | 👍 8    🔁 0    💬 1    📌 0

16/ And practically: how do we make Transformer memory more geometric (if you want hasty reasoning/creativity) or more associative (if you want accurate retrieval, no hallucination)?
Understanding & manipulating this competition is a fundamental open question.

08.01.2026 20:31 | 👍 11    🔁 0    💬 1    📌 1
Post image

15/ Indeed in hindsight, the deeper Transformer model produces less elegant geometries than node2vec.

In more general (nastier) graphs, we suspect that memory may be a mix of associative/geometric. So, what notion of graph "complexity" & training dictates this bias?

08.01.2026 20:31 | 👍 6    🔁 0    💬 1    📌 0

14/ The more advanced open question is to study the dynamics in non-shallow models, where associative memory becomes a real competitor!

Here you can NOT merrily disentangle the dynamics of each spectral component to analyze them (not w/o weird assumptions about initialization).

08.01.2026 20:31 | 👍 9    🔁 0    💬 2    📌 0
Post image

13/ But strangely, prior node2vec/word2vec analyses assume pressures (bottleneck, early stopping, etc.) to explain the low-rank bias.

Our analysis gives intuition for how *the CE loss* itself nicely induces a low-rank spectral bias. We leave a formal proof as an open theoretical question.

08.01.2026 20:31 | 👍 9    🔁 0    💬 1    📌 0

12/ Part III: What is this geometry?

It connects to a known low-rank spectral bias in shallow word2vec/node2vec models: the model unearths the top eigenvectors of the adjacency matrix A, rather than storing A as is.

These eigenvectors dominate the multi-hop Aᴸ, revealing global info.

08.01.2026 20:31 | 👍 9    🔁 0    💬 1    📌 1
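
A quick numerical illustration of that last point on a random toy graph (nothing paper-specific): for a symmetric adjacency matrix, the L-hop matrix decomposes over the eigenvectors with weights that grow as the L-th power of the eigenvalues, so a handful of top modes capture more and more of it as L grows.

    import numpy as np

    rng = np.random.default_rng(0)
    n, k = 30, 3
    A = (rng.random((n, n)) < 0.2).astype(float)
    A = np.triu(A, 1); A = A + A.T                 # symmetric adjacency, no self-loops

    lam, V = np.linalg.eigh(A)                     # A = V diag(lam) V^T
    top = np.argsort(-np.abs(lam))[:k]             # k largest-|eigenvalue| modes

    for L in (2, 4, 8):
        AL = np.linalg.matrix_power(A, L)          # A^L = V diag(lam^L) V^T
        AL_k = (V[:, top] * lam[top] ** L) @ V[:, top].T   # keep only the top-k modes
        err = np.linalg.norm(AL - AL_k) / np.linalg.norm(AL)
        print(L, round(err, 3))                    # relative error shrinks as L grows
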
Post image

11/ Thus, the geometry cannot be explained by obvious explicit or implicit pressures from the supervision or architecture or optimization/gradient descent.

We call this the *memorization puzzle in sequence modeling* (as a nod to the "generalization puzzle", which is different!)

08.01.2026 20:31 | 👍 13    🔁 1    💬 1    📌 1
Post image

10/ Maybe associative memory is far away in optimization? Nope. For wide architectures, associative memory is just 2 steps away. Geometry takes longer to "grok"!

Maybe the geometry is due to compression? No. There are graphs where both memories are equally succinct. Yet there's geometry.

08.01.2026 20:31 | 👍 10    🔁 1    💬 1    📌 0
Post image

9/ You may think: Maybe the model memorizes geometrically because of multi-hop supervision? No. We found a geometry even without path-finding supervision (see fig).

Maybe the architecture is narrow and thus precludes associative memory? No. Even wide architectures produce geometries.

08.01.2026 20:31 | 👍 7    🔁 0    💬 1    📌 0

8/ Part II: But why does the model store geometrically, not associatively?

Some may think this is obvious---we routinely see geometries in language/arithmetic/reasoning tasks & word embedding models. What's the big deal?

We isolate what isn't obvious & is hard to explain.

08.01.2026 20:31 | 👍 9    🔁 1    💬 1    📌 0
Post image

7/ Thanks to this geometry, our hard compositional reasoning task became an easy spatial navigation task.

While geometries existed even in word2vec models, it's exciting in the Transformer era: reasoning/discovery over stray facts in a large pretraining set seems within reach.

08.01.2026 20:31 | 👍 11    🔁 1    💬 1    📌 0
Post image

6/ In practice though, we find a "geometric memory" in the model: entities are embedded in a highly organized way, revealing not just local adjacencies but also multi-hop distances.

The model has synthesized novel information not immediately available in the data.

08.01.2026 20:31 | 👍 14    🔁 1    💬 1    📌 0
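
One simple probe for this kind of geometric memory (a hypothetical helper, not the paper's evaluation code): check how well pairwise distances between the learned entity embeddings track shortest-path distances in the underlying graph.

    import numpy as np
    from collections import deque

    def bfs_distances(adj, src):
        # Shortest-path (hop) distances from src in an adjacency-list graph.
        dist, q = {src: 0}, deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        return dist

    def geometry_score(emb, adj):
        # Correlation between graph distance and embedding distance over all pairs.
        # `emb` is an (n x d) array of node embeddings read out of a trained model.
        xs, ys = [], []
        for u in adj:
            d_graph = bfs_distances(adj, u)
            for v in adj:
                if v != u and v in d_graph:
                    xs.append(d_graph[v])
                    ys.append(np.linalg.norm(emb[u] - emb[v]))
        return np.corrcoef(xs, ys)[0, 1]

A high score for the trained model but not for a randomly initialized one would be one operational signature of "geometric" rather than "associative" storage.
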
Post image

5/ In the associative view, the model stores local co-occurrences as a brute-force lookup table: an "adjacency" matrix in the weights.

This has been a useful way to abstract model capacity, editing, etc. It's intuitive too, as no co-occurrence in a graph can be derived from another.

08.01.2026 20:31 | 👍 12    🔁 0    💬 1    📌 0
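
A toy sketch of what such an associative, lookup-table store could look like (a classic linear associative memory, purely illustrative and not the paper's model): each edge is written into a weight matrix as an outer product, and retrieval is a single matrix-vector lookup with no geometry to it.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 50, 256
    E = rng.standard_normal((n, d)) / np.sqrt(d)   # one (near-unit) code per node
    edges = [(0, 1), (1, 2), (2, 3), (3, 4)]       # a small path graph

    W = np.zeros((d, d))
    for i, j in edges:
        W += np.outer(E[j], E[i])                  # store node j as the "value" for key i

    # Lookup: which node follows node 1? One matrix-vector product, then match.
    scores = E @ (W @ E[1])
    print(int(np.argmax(scores)))                  # 2 (a noisy table lookup, no geometry)
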
Post image

4/ But when we made the model memorize the graph *in-weights*, we found that the model succeeds, even on massive graphs (10k nodes, 10-hop paths) & *without* any hop-by-hop aid!

This is hard to explain under the common view that atomic co-occurrences are stored "associatively".

08.01.2026 20:31 | 👍 14    🔁 0    💬 1    📌 0
