
Max Vladymyrov

@mxvl.bsky.social

Research Scientist @ Google DeepMind

2,322 Followers  |  964 Following  |  26 Posts  |  Joined: 21.08.2024

Latest posts by mxvl.bsky.social on Bluesky

Preview
Long Context In-Context Compression by Getting to the Gist of Gisting Long context processing is critical for the adoption of LLMs, but existing methods often introduce architectural complexity that hinders their practical adoption. Gisting, an in-context compression me...

It’s incredible how much mileage you get out of simple ideas, once you understand where the original method breaks!

πŸ“„ Read the full paper here: arxiv.org/abs/2504.08934

This work is led by our brilliant intern Aleks Petrov, with Mark Sandler, Andrey Zhmoginov, and Nolan Miller as co-authors.

30.04.2025 22:41 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
Post image

The result is GistPool:

βœ… Matches or beats average pooling;
βœ… Fixes the issues with Gisting;
βœ… Achieves compression using in-context learning with no complex model modifications, so it’s easy to implement and deploy at scale.

It’s fast, simple, and works across datasets and compression rates.

30.04.2025 22:41 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

So we added an inductive bias:

πŸ‘‰ Spread gist tokens across the context
πŸ‘‰ Restrict each gist token to attend only to its own pooling window

This tiny masking tweak nudges the model toward poolingβ€”and suddenly performance shoots up.
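
A rough sketch of that masking tweak, assuming for simplicity that the gist tokens sit after a context split into equal windows (in the method itself they are spread through the context):

```python
# Illustrative windowed gist mask: each gist token may attend only to the
# tokens of its own pooling window (plus itself). Layout and details are
# assumptions for illustration, not the paper's exact masking.
import numpy as np

def windowed_gist_mask(n_ctx, n_gist):
    assert n_ctx % n_gist == 0
    w = n_ctx // n_gist                           # tokens pooled per gist token
    n = n_ctx + n_gist
    mask = np.tril(np.ones((n, n), dtype=bool))   # causal mask over [context | gists]
    for g in range(n_gist):
        row = n_ctx + g
        mask[row, :] = False
        mask[row, g * w:(g + 1) * w] = True       # only its own pooling window
        mask[row, row] = True                     # and itself
    return mask

print(windowed_gist_mask(n_ctx=8, n_gist=2).astype(int))
```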

30.04.2025 22:41 β€” πŸ‘ 0    πŸ” 1    πŸ’¬ 1    πŸ“Œ 0

Even after fixing these issues, Gisting still couldn’t match average pooling. Why? Because standard attention can't learn average pooling well. We show both experimentally and theoretically: attention struggles with simple operations like copying and pooling! 😫
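
One way to see the difficulty (an illustration only, not the paper's argument): with finite logits, softmax can never put all of its weight on a single position, so even an exact copy is only reached in the limit of ever-larger logit gaps.

```python
# Toy illustration of why hard copy / exact pooling patterns are awkward
# for softmax attention to represent exactly.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Try to "copy" token 0 out of four tokens by giving it a larger logit.
for gap in [1.0, 5.0, 20.0]:
    logits = np.array([gap, 0.0, 0.0, 0.0])
    print(f"logit gap {gap:>4}: weight on the copied token = {softmax(logits)[0]:.4f}")
# The weight approaches 1 only as the gap grows without bound.
```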

30.04.2025 22:41 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Second issue: 𝐜𝐨𝐧𝐟π₯𝐒𝐜𝐭𝐒𝐧𝐠 π¨π›π£πžπœπ­π’π―πžπ¬. You're asking the same parameters to summarize and to do inference. These clash. We split the two: one set of parameters for compressing context, another for predicting answers. That helped a lot. But still not enough. πŸ––
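
In code, the split looks roughly like the sketch below; the class and method names are hypothetical stand-ins, not the paper's implementation.

```python
# Two copies of the same backbone: one tuned only to compress the context
# into gist activations, one tuned only to predict answers from them.
# TinyLM and its methods are hypothetical placeholders.
import copy

class TinyLM:
    """Placeholder for a transformer LM; hypothetical interface."""
    def compress(self, context):           # hypothetical: context -> gist activations
        return f"gist({context[:24]}...)"
    def generate(self, prompt, memory):    # hypothetical: answer conditioned on gists
        return f"answer({prompt} | {memory})"

base = TinyLM()
compressor = copy.deepcopy(base)   # parameters for compressing the context
predictor  = copy.deepcopy(base)   # parameters for predicting the answers

gists = compressor.compress("a very long document that should be compressed")
print(predictor.generate("What does the document say?", memory=gists))
```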

30.04.2025 22:41 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

First culprit: 𝐒𝐧𝐟𝐨𝐫𝐦𝐚𝐭𝐒𝐨𝐧 𝐟π₯𝐨𝐰 𝐝𝐞π₯𝐚𝐲. A gist token summarizing layer i only delivers that summary to the model at layer i+2. That’s too late: the model expects the information a layer earlier. When we shifted activations down by a layer, performance improved immediately! πŸ”§
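
My best guess at what the shift amounts to in code (the indexing convention is an assumption, not taken from the paper): serve layer l of the answering pass the gist activations produced at layer l + 1, so each summary arrives one layer earlier than before.

```python
# Hypothetical sketch of the "shift activations down" fix.
# gist_acts[l] holds the gist activations produced at layer l of the
# compression pass.
def shift_gist_activations_down(gist_acts):
    n_layers = len(gist_acts)
    # Layer l of the answering pass now reads gist_acts[l + 1]
    # (the last layer keeps its own activations).
    return [gist_acts[min(l + 1, n_layers - 1)] for l in range(n_layers)]

print(shift_gist_activations_down(["h0", "h1", "h2", "h3"]))
# -> ['h1', 'h2', 'h3', 'h3']
```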

30.04.2025 22:41 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

Surprisingly, even a simple 𝘒𝘷𝘦𝘳𝘒𝘨𝘦 𝘱𝘰𝘰𝘭π˜ͺπ˜―π˜¨β€”just mean over token activationsβ€”beats Gisting by a mile. That shouldn’t happen. Why can't Gisting at least learn to emulate average pooling? πŸ€”
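
For concreteness, a toy version of that baseline (shapes and window handling are my assumptions, not the paper's exact setup):

```python
# Average-pooling baseline: compress a sequence of token activations into
# k vectors by taking the mean over contiguous windows.
import numpy as np

def avg_pool_activations(acts, k):
    """acts: (seq_len, d_model) activations; returns (k, d_model)."""
    windows = np.array_split(acts, k, axis=0)      # k roughly equal windows
    return np.stack([w.mean(axis=0) for w in windows])

acts = np.random.randn(128, 16)                    # toy activations
compressed = avg_pool_activations(acts, k=8)       # 16x compression
print(compressed.shape)                            # (8, 16)
```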

30.04.2025 22:41 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

Gisting (Mu et al., 2023) offers a starting point: just add "gist" tokens and adjust the attention mask to funnel context through them. This creates an attention bottleneck, forcing summarization into these tokens.

But it turns out that this method breaks down when compressing more than just a few tokens.
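
For the mechanics, a minimal sketch of a Gisting-style mask, assuming a [context | gist | suffix] token layout (the actual implementation in Mu et al. may differ):

```python
# The "bottleneck" idea: the suffix (prompt/answer) can reach the context
# only through the gist tokens.
import numpy as np

def gist_attention_mask(n_ctx, n_gist, n_suffix):
    n = n_ctx + n_gist + n_suffix
    mask = np.tril(np.ones((n, n), dtype=bool))   # standard causal mask
    # Suffix tokens may not attend to raw context tokens directly...
    mask[n_ctx + n_gist:, :n_ctx] = False
    # ...while the gist tokens keep their causal view of the full context,
    # so all context information must flow through them.
    return mask

print(gist_attention_mask(n_ctx=4, n_gist=2, n_suffix=3).astype(int))
```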

30.04.2025 22:41 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

Handling long contexts efficiently is a major hurdle for LLMs! While models support longer windows, cost & effectiveness remain challenging.

Excited to share our paper on in-context compression for long contexts.

Check out the thread and the paper below! πŸ‘‡
arxiv.org/abs/2504.08934

30.04.2025 22:41 β€” πŸ‘ 1    πŸ” 1    πŸ’¬ 1    πŸ“Œ 0

Preview
Don't lie to your friends: Learning what you know from collaborative self-play To be helpful assistants, AI agents must be aware of their own capabilities and limitations. This includes knowing when to answer from parametric knowledge versus using tools, when to trust tool outpu...

We all want LLMs to collaborate with humans to help them achieve their goals. But LLMs are not trained to collaborate; they are trained to imitate. Can we teach LM agents to help humans by first making them help each other?

arxiv.org/abs/2503.14481

24.03.2025 15:39 β€” πŸ‘ 56    πŸ” 20    πŸ’¬ 1    πŸ“Œ 0

Amazing work! Do you think it would be possible to express a general learning algorithm using ALTA, e.g. gradient descent?

24.10.2024 04:25 β€” πŸ‘ 3    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
