The visual world is composed of objects, and those objects are composed of features. But do VLMs exploit this compositional structure when processing multi-object scenes? In our #ICLR2026 paper, we find they do, via emergent symbolic mechanisms for visual binding. 🧵👇
05.02.2026 20:54 · Likes: 81 · Reposts: 25 · Replies: 1 · Quotes: 3
He also contrasts the personalities of Hardy and Einstein:
13.01.2026 20:50 · Likes: 2 · Reposts: 0 · Replies: 0 · Quotes: 0
Currently reading "A Mathematician's Apology" by GH Hardy. This is an excerpt from the foreword by CP Snow describing Hardy's personality and his work:
13.01.2026 20:49 · Likes: 13 · Reposts: 1 · Replies: 1 · Quotes: 0
fascinating!
12.01.2026 19:08 · Likes: 1 · Reposts: 0 · Replies: 0 · Quotes: 0
Would love pointers to related lit! Will DM you about the other question. Thank you for your kind words!
12.01.2026 19:04 · Likes: 0 · Reposts: 0 · Replies: 0 · Quotes: 0
Rare to see such long-term efforts these days 🫡
09.01.2026 22:52 · Likes: 14 · Reposts: 1 · Replies: 0 · Quotes: 0
We introduce epiplexity, a new measure of information that provides a foundation for how to select, generate, or transform data for learning systems. We have been working on this for almost 2 years, and I cannot contain my excitement! arxiv.org/abs/2601.03220 1/7
07.01.2026 17:27 · Likes: 144 · Reposts: 34 · Replies: 9 · Quotes: 9
Please welcome Google's Open Source efforts to Bluesky at @opensource.google!
07.01.2026 21:12 · Likes: 245 · Reposts: 38 · Replies: 7 · Quotes: 4
for deeper models, they initialize the network so that the decomposition of each layer aligns with the previous layer's. if you didn't assume this, there'd be "interference" across components, which I *suspect* would contribute to associative memorization.
08.01.2026 22:52 · Likes: 4 · Reposts: 1 · Replies: 0 · Quotes: 0
now if I ask you "how many countries away is Mongolia from India?", in the lookup-table approach, you have to sit and piece together the connections by iterating over a frustratingly long list. in the map approach, you can "see" the answer quickly.
08.01.2026 22:47 · Likes: 1 · Reposts: 0 · Replies: 1 · Quotes: 0
in associative memory, the latent space doesn't really encode any interesting distance.
imagine you're trying to store which countries share borders. you could simply write down a list of adjacent countries OR you could visualize the world map in your head. this is "associative" vs "geometric".
08.01.2026 22:47 · Likes: 1 · Reposts: 1 · Replies: 1 · Quotes: 0
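To make the borders analogy above concrete, here is a small Python sketch (an illustration of the analogy only, with made-up countries and coordinates, not anything from the paper). The "associative" store is a plain neighbor list, so a multi-hop question is answered only by iterating over it (BFS); the "geometric" store keeps rough map coordinates, so an approximate answer is a single distance read-out:

from collections import deque
import math

# Associative storage: a brute-force list of adjacent pairs (a lookup table).
borders = {
    "India": ["China", "Nepal"],
    "Nepal": ["India", "China"],
    "China": ["India", "Nepal", "Mongolia"],
    "Mongolia": ["China"],
}

def hops_associative(src, dst):
    # "How many countries away?" requires iterating over the table (BFS).
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, d = frontier.popleft()
        if node == dst:
            return d
        for nxt in borders[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return None

# Geometric storage: each country gets a coordinate ("the map in your head").
# The coordinates here are invented; a learned embedding would play this role.
coords = {"India": (0.0, 0.0), "Nepal": (0.5, 0.3), "China": (1.0, 0.5), "Mongolia": (2.0, 1.0)}

def hops_geometric(src, dst, scale=1.0):
    # One distance computation approximates the multi-hop answer, no iteration.
    return math.dist(coords[src], coords[dst]) / scale

print(hops_associative("India", "Mongolia"))           # 2 (India -> China -> Mongolia)
print(round(hops_geometric("India", "Mongolia"), 2))   # ~2.24, read off directly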
Thanks for engaging with the work! Could you elaborate? I'm not an expert on graph theory but I'd be interested in any ideas to better understand this.
08.01.2026 22:41 · Likes: 0 · Reposts: 0 · Replies: 0 · Quotes: 0
Deep sequence models tend to memorize geometrically; it is unclear why
Deep sequence models are said to store atomic facts predominantly in the form of associative memory: a brute-force lookup of co-occurring entities. We identify a dramatically different form of storage...
19/ These findings build on many nascent, fragmented observations in the literature, not credited here for lack of space. There are also caveats in extending all this to natural language (each caveat, an open question ;) ). Please see the full story here:
arxiv.org/abs/2510.26745
08.01.2026 20:31 · Likes: 15 · Reposts: 4 · Replies: 0 · Quotes: 0
18/ We hope this inspires revisiting analyses of Transformer knowledge/storage capacity/unlearning. Graph setups may also help us cleanly understand the emergence of "world models".
08.01.2026 20:31 · Likes: 7 · Reposts: 0 · Replies: 1 · Quotes: 0
17/ Our findings suggest there's "magic" in integrating knowledge into model weights rather than stuffing it into context. They also show a vivid contrast between traditional retrieval with two-tower models and modern generative retrieval models.
08.01.2026 20:31 · Likes: 8 · Reposts: 0 · Replies: 1 · Quotes: 0
16/ And practically: how do we make Transformer memory more geometric (if you want hasty reasoning/creativity) or more associative (if you want accurate retrieval, no hallucination)?
Understanding & manipulating this competition is a fundamental open question.
08.01.2026 20:31 · Likes: 11 · Reposts: 0 · Replies: 1 · Quotes: 1
15/ Indeed, in hindsight, the deeper Transformer model produces less elegant geometries than node2vec.
In more general (nastier) graphs, we suspect that memory may be a mix of associative/geometric. So, what notion of graph "complexity" & training dictates this bias?
08.01.2026 20:31 · Likes: 6 · Reposts: 0 · Replies: 1 · Quotes: 0
14/ The more advanced open question is to study the dynamics in non-shallow models, where associative memory becomes a real competitor!
Here you can NOT merrily disentangle the dynamics of each spectral component to analyze them (not without weird assumptions about initialization).
08.01.2026 20:31 · Likes: 9 · Reposts: 0 · Replies: 2 · Quotes: 0
13/ But strangely, prior node/word2vec analyses assume pressures (bottleneck, early stopping, etc.) to explain the low-rank bias.
Our analysis intuits how *the CE loss* by nature nicely induces a low-rank spectral bias. We leave a formal proof as an open theoretical question.
08.01.2026 20:31 · Likes: 9 · Reposts: 0 · Replies: 1 · Quotes: 0
12/ Part III: What is this geometry?
It connects to a known low-rank spectral bias in shallow word/node2vec models: the model unearths the top eigenvectors of the adjacency A, rather than storing A as is.
These eigenvectors dominate the multi-hop Aᴸ, revealing global info.
08.01.2026 20:31 · Likes: 9 · Reposts: 0 · Replies: 1 · Quotes: 1
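A quick numerical check of the claim that the top eigenvectors dominate the multi-hop Aᴸ (a generic sketch on a random symmetric adjacency matrix, not the paper's setup):

import numpy as np

rng = np.random.default_rng(0)
n = 200
# Random undirected graph: symmetric 0/1 adjacency matrix A.
A = (rng.random((n, n)) < 0.05).astype(float)
A = np.triu(A, 1)
A = A + A.T

# Symmetric eigendecomposition A = V diag(w) V^T, sorted by |eigenvalue|.
w, V = np.linalg.eigh(A)
order = np.argsort(-np.abs(w))
w, V = w[order], V[:, order]

L, k = 6, 10  # number of hops and number of top eigenpairs kept
A_L = np.linalg.matrix_power(A, L)
A_L_rank_k = (V[:, :k] * w[:k] ** L) @ V[:, :k].T  # rank-k spectral approximation

rel_err = np.linalg.norm(A_L - A_L_rank_k) / np.linalg.norm(A_L)
print(f"relative error of rank-{k} approximation to A^{L}: {rel_err:.3f}")
# Raising each eigenvalue to the L-th power makes the spectrum of A^L far more
# top-heavy than that of A, so a few eigenvectors capture most of the multi-hop structure.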
11/ Thus, the geometry cannot be explained by obvious explicit or implicit pressures from the supervision, architecture, or optimization/gradient descent.
We call this the *memorization puzzle in sequence modeling* (as a nod to the "generalization puzzle", which is different!)
08.01.2026 20:31 · Likes: 13 · Reposts: 1 · Replies: 1 · Quotes: 1
10/ Maybe associative memory is far away in optimization? Nope. For wide architectures, associative memory is just 2 steps away. Geometry takes longer to "grok"!
Maybe the geometry is due to compression? No. There are graphs where both memories are equally succinct. Yet there's geometry.
08.01.2026 20:31 · Likes: 10 · Reposts: 1 · Replies: 1 · Quotes: 0
9/ You may think: Maybe the model memorizes geometrically because of multi-hop supervision? No. We found a geometry even without path-finding supervision (see fig).
Maybe the architecture is narrow and thus precludes associative memory? No. Even wide architectures produce geometries.
08.01.2026 20:31 · Likes: 7 · Reposts: 0 · Replies: 1 · Quotes: 0
8/ Part II: But why does the model store geometrically, not associatively?
Some may think this is obvious: we routinely see geometries in language/arithmetic/reasoning tasks & word embedding models. What's the big deal?
We isolate what isn't obvious & is hard to explain.
08.01.2026 20:31 · Likes: 9 · Reposts: 1 · Replies: 1 · Quotes: 0
7/ Thanks to this geometry, our hard compositional reasoning task became an easy spatial navigation task.
While geometries existed even in word2vec models, it's exciting to see them in the Transformer era: reasoning/discovery over stray facts in a large pretraining set seems within reach.
08.01.2026 20:31 · Likes: 11 · Reposts: 1 · Replies: 1 · Quotes: 0
6/ In practice, though, we find a "geometric memory" in the model: entities are embedded in a highly organized way, revealing not just local adjacencies but also multi-hop distances.
The model has synthesized novel information not immediately available in the data.
08.01.2026 20:31 · Likes: 14 · Reposts: 1 · Replies: 1 · Quotes: 0
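As a loose illustration of what a "geometric memory" read-out could look like (a toy stand-in using a spectral embedding of a ring graph; the paper's trained Transformer is of course not this), embedding distances can track multi-hop graph distances:

import numpy as np

# Toy graph: a ring of 30 nodes, so hop distances are easy to verify.
n = 30
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0

# Stand-in "geometric memory": embed each node with two leading non-trivial
# eigenvectors of the adjacency, instead of storing A as a lookup table.
w, V = np.linalg.eigh(A)   # eigenvalues in ascending order
emb = V[:, -3:-1]          # the eigenvector pair just below the top (constant) one

def hop_dist(i, j):
    # Shortest-path distance on the ring, for reference.
    d = abs(i - j)
    return min(d, n - d)

pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
emb_d = np.array([np.linalg.norm(emb[i] - emb[j]) for i, j in pairs])
hop_d = np.array([hop_dist(i, j) for i, j in pairs])
print("correlation(embedding distance, hop distance):",
      round(float(np.corrcoef(emb_d, hop_d)[0, 1]), 3))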
5/ In the associative view, the model stores local co-occurrences as a brute-force lookup table: an "adjacency" matrix in weights.
This has been a useful way to abstract model capacity/editing, etc. It's intuitive, too, as no co-occurrence in a graph can be derived from another.
08.01.2026 20:31 · Likes: 12 · Reposts: 0 · Replies: 1 · Quotes: 0
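For contrast, here is the associative/lookup abstraction in its simplest form (a sketch of the abstraction itself, not the paper's model): facts sit in the weights as entries of an adjacency matrix, and retrieval returns only the directly stored co-occurrences, with nothing multi-hop coming for free:

import numpy as np

n = 6
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]

# Associative storage: each edge is written independently into a weight matrix,
# i.e. W accumulates outer products of one-hot codes and ends up being the adjacency matrix.
W = np.zeros((n, n))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0

def neighbors(i):
    # Retrieval is a pure lookup: one matrix-vector product reads out node i's stored neighbors.
    key = np.eye(n)[i]
    return np.nonzero(W @ key)[0].tolist()

print(neighbors(2))  # [1, 3] -- only 1-hop facts come out; multi-hop questions need iteration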
4/ But when we made the model memorize the graph *in-weights*, we found that the model succeeds, even on massive graphs (10k nodes, 10-hop paths) & *without* any hop-by-hop aid!
This is hard to explain under the common view that atomic co-occurrences are stored "associatively".
08.01.2026 20:31 · Likes: 14 · Reposts: 0 · Replies: 1 · Quotes: 0
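To make "memorize the graph in-weights, without hop-by-hop aid" more tangible, here is one hypothetical way such data could be laid out (purely a guess at the flavor of the task; the paper's actual format may differ). Training exposes only atomic edge facts; at query time the model must emit a whole multi-hop path in one go, with no intermediate scratchpad:

# Hypothetical data-format sketch; node names, separators, and the "path" query
# syntax below are all invented for illustration.
nodes = [f"v{i}" for i in range(10)]
edges = [(nodes[i], nodes[i + 1]) for i in range(9)]  # a simple path graph

# 1) Atomic facts, seen during training and memorized into the weights.
fact_seqs = [f"{a} -> {b}" for a, b in edges]

# 2) Path queries: given two endpoints, emit the full route in a single pass.
def path_query(src, dst):
    i, j = nodes.index(src), nodes.index(dst)
    route = nodes[i:j + 1] if i <= j else nodes[j:i + 1][::-1]
    return f"path {src} {dst} : " + " ".join(route)

print(fact_seqs[:3])           # ['v0 -> v1', 'v1 -> v2', 'v2 -> v3']
print(path_query("v1", "v7"))  # path v1 v7 : v1 v2 v3 v4 v5 v6 v7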
Cognitive neuroscience. Deep learning. PhD Student at Princeton Neuroscience with @cocoscilab.bsky.social and Cohen Lab.
Announcing new open source releases, exploring projects, sharing how we approach FOSS, and supporting communities around the world.
Assistant Professor @ SFU MBB | Systems Biology of Host-Pathogen | Vaccinology | Neonatal Sepsis | Pathogen Genomics | Loves data, food & art.
Ramen whisperer, bad throat singer
The 2025 Conference on Language Modeling will take place at the Palais des Congrès in Montreal, Canada from October 7-10, 2025
#Engineer, #anarchist, husband, #GenX, ex- #goth, #dog owner, #cat owned
Focused on #Covid19, #ClimateChange, hybrid #nuclear energy
My meager #writing: http://ciar.org/ttk/orcish_opera
I also like #birds (mainly corvidae). #corvid #cawmunity
PhD Student @ Cornell CIS
Bluesky paper digest: https://bsky.app/profile/paper-feed.bsky.social/feed/preprintdigest
Google Chief Scientist, Gemini Lead. Opinions stated here are my own, not those of Google. Gemini, TensorFlow, MapReduce, Bigtable, Spanner, ML things, ...
Researcher in ML and Privacy.
PhD @UofT & @VectorInst. previously Research Intern @Google and @ServiceNowRSRCH
https://mhaghifam.github.io/mahdihaghifam/
PhD student @uwnlp.bsky.social
See(k)ing the surreal
Causal World Models for Curious Robots @ University of Tübingen/Max Planck Institute for Intelligent Systems 🇩🇪
#reinforcementlearning #robotics #causality #meditation #vegan
AI Reasoning and Foundations
Senior Research Scientist, Google |
PhD, Princeton University
The world's leading venue for collaborative research in theoretical computer science. Follow us at http://YouTube.com/SimonsInstitute.
Assistant Prof in ML @ KTH 🇸🇪
WASP Fellow
ELLIS Member
Ex: Aalto Uni 🇫🇮, TU Graz 🇦🇹, originally 🇩🇪.
https://trappmartin.github.io/
Reliable ML | UQ | Bayesian DL | tractability & PCs
Interests on bsky: ML research, applied math, and general mathematical and engineering miscellany. Also: Uncertainty, symmetry in ML, reliable deployment; applications in LLMs, computational chemistry/physics, and healthcare.
https://shubhendu-trivedi.org