Long Context In-Context Compression by Getting to the Gist of Gisting
Long context processing is critical for the adoption of LLMs, but existing methods often introduce architectural complexity that hinders their practical adoption. Gisting, an in-context compression me...
It's incredible how much mileage you get out of simple ideas, once you understand where the original method breaks!
Read the full paper here: arxiv.org/abs/2504.08934
This work is led by our brilliant intern Aleks Petrov, with Mark Sandler, Andrey Zhmoginov, and Nolan Miller as co-authors.
30.04.2025 22:41
The result is GistPool:
✅ Matches or beats average pooling;
✅ Fixes the issues with Gisting;
✅ Achieves compression using in-context learning with no complex model modifications, so it's easy to implement and deploy at scale.
It's fast, simple, and works across datasets and compression rates.
30.04.2025 22:41
So we added an inductive bias:
• Spread gist tokens across the context
• Restrict each gist token to attend only to its own pooling window
This tiny masking tweak nudges the model toward pooling, and suddenly performance shoots up. A toy version of the mask is sketched below.
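For intuition only, here is one way such a pooling-window mask could look. This is a sketch under assumptions of my own (gist tokens appended after the context, equal-sized non-overlapping windows, and `pooling_window_gist_mask` is a made-up name), not the paper's implementation.

```python
import numpy as np

def pooling_window_gist_mask(n_ctx: int, n_gist: int) -> np.ndarray:
    """Boolean attention mask (True = may attend). Context tokens keep causal
    attention; each gist token sees only its own window of context tokens."""
    window = n_ctx // n_gist                        # context tokens per gist token
    n_total = n_ctx + n_gist                        # gist tokens appended at the end
    mask = np.zeros((n_total, n_total), dtype=bool)

    ctx = np.arange(n_ctx)
    mask[:n_ctx, :n_ctx] = ctx[:, None] >= ctx[None, :]    # causal over the context

    for g in range(n_gist):
        row = n_ctx + g
        mask[row, g * window:(g + 1) * window] = True       # only its pooling window
        mask[row, row] = True                                # and itself
    return mask

print(pooling_window_gist_mask(n_ctx=6, n_gist=2).astype(int))
```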
30.04.2025 22:41
Even after fixing these issues, Gisting still couldn't match average pooling. Why? Because standard attention can't learn average pooling well. We show both experimentally and theoretically that attention struggles with simple operations like copying and pooling!
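As a toy numerical illustration of that intuition (not the paper's formal argument; the setup below is my own): for a single softmax attention head to reproduce an exact window mean, its attention weights must be uniform, which forces the gist query's logits to be identical for every token regardless of content.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 8                         # head dimension, pooling-window size
X = rng.normal(size=(n, d))          # token activations in one window (keys = values)
target = X.mean(axis=0)              # what average pooling would return

def attn_out(q):
    """Single-head softmax attention from one gist query over the window."""
    logits = X @ q / np.sqrt(d)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ X

q = rng.normal(size=d)
print("error with a generic query:", np.linalg.norm(attn_out(q) - target))
# Uniform weights (an exact mean) only appear when every logit is equal,
# e.g. for a degenerate all-zero query:
print("error with a zero query:   ", np.linalg.norm(attn_out(np.zeros(d)) - target))
```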
30.04.2025 22:41
Second issue: conflicting objectives. You're asking the same parameters to summarize and to do inference. These clash. We split the two: one set of parameters for compressing context, another for predicting answers. That helped a lot. But still not enough.
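One way to picture the split, as a minimal sketch rather than the paper's exact parameterization (the module and its names are hypothetical): route gist tokens through a dedicated set of compression weights and all other tokens through the original inference weights.

```python
import torch
import torch.nn as nn

class SplitParamBlock(nn.Module):
    """Illustrative sub-block with two weight sets: one for gist (compression)
    tokens, one for ordinary (inference) tokens."""
    def __init__(self, d_model: int):
        super().__init__()
        self.infer_proj = nn.Linear(d_model, d_model)     # used to predict answers
        self.compress_proj = nn.Linear(d_model, d_model)  # used to summarize context

    def forward(self, x: torch.Tensor, is_gist: torch.Tensor) -> torch.Tensor:
        # x: (seq, d_model); is_gist: (seq,) boolean mask marking gist tokens.
        return torch.where(is_gist[:, None], self.compress_proj(x), self.infer_proj(x))

block = SplitParamBlock(d_model=8)
x = torch.randn(5, 8)
is_gist = torch.tensor([False, False, True, False, True])
print(block(x, is_gist).shape)   # torch.Size([5, 8])
```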
30.04.2025 22:41
First culprit: information flow delay. Gist tokens summarizing layer i only get the summary to the model at layer i+2. That's too late: the model expects the information a layer earlier. When we shifted the gist activations down a layer, performance improved immediately!
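My reading of that fix, sketched very loosely (the cache layout and the one-layer shift below are assumptions of mine, not code from the paper): expose the gist activations produced after layer l+1 in layer l's slot of the cache, so downstream tokens read the summary one layer earlier.

```python
import numpy as np

def shift_gist_cache_down(gist_hidden):
    """gist_hidden[l]: gist-token hidden states produced after layer l.
    Returns a cache where slot l serves layer (l+1)'s states, so the summary
    becomes visible to downstream queries one layer earlier."""
    shifted = list(gist_hidden)
    for l in range(len(gist_hidden) - 1):
        shifted[l] = gist_hidden[l + 1]
    return shifted

# Toy cache: 6 layers, 2 gist tokens, hidden size 4, filled with the layer index.
cache = [np.full((2, 4), l, dtype=float) for l in range(6)]
print([c[0, 0] for c in shift_gist_cache_down(cache)])   # [1.0, 2.0, 3.0, 4.0, 5.0, 5.0]
```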
30.04.2025 22:41
Surprisingly, even simple average pooling (just a mean over token activations) beats Gisting by a mile. That shouldn't happen. Why can't Gisting at least learn to emulate average pooling?
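Concretely, the baseline is as simple as it sounds. A minimal sketch, assuming non-overlapping windows and dropping any ragged tail; where the pooled activations are fed back into the model is a separate design choice:

```python
import numpy as np

def average_pool_activations(h: np.ndarray, window: int) -> np.ndarray:
    """Compress per-token activations h (n_tokens, d_model) by averaging every
    non-overlapping window of `window` tokens -> (n_tokens // window, d_model)."""
    n_tokens, d_model = h.shape
    n_keep = n_tokens - n_tokens % window           # drop the ragged tail for simplicity
    return h[:n_keep].reshape(-1, window, d_model).mean(axis=1)

h = np.random.randn(128, 64)                        # toy activations
print(average_pool_activations(h, window=8).shape)  # (16, 64)
```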
30.04.2025 22:41
Gisting (Mu et al., 2023) offers a starting point: just add "gist" tokens and adjust the attention mask to funnel context through them. This creates an attention bottleneck, forcing summarization into these tokens.
But it turns out that this method breaks down when compressing more than just a few tokens.
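For intuition, a sketch of that bottleneck mask under assumptions of my own (the [context | gist | suffix] layout and the function name are made up; this is not Mu et al.'s code):

```python
import numpy as np

def gisting_mask(n_ctx: int, n_gist: int, n_suffix: int) -> np.ndarray:
    """Boolean mask (True = may attend) for a [context | gist | suffix] layout.
    Suffix tokens cannot see the raw context, so all context information
    has to flow through the gist tokens."""
    n = n_ctx + n_gist + n_suffix
    idx = np.arange(n)
    mask = idx[:, None] >= idx[None, :]             # start from plain causal attention
    suffix_rows = idx >= n_ctx + n_gist
    ctx_cols = idx < n_ctx
    mask[np.ix_(suffix_rows, ctx_cols)] = False     # block suffix -> raw context
    return mask

print(gisting_mask(n_ctx=4, n_gist=2, n_suffix=3).astype(int))
```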
30.04.2025 22:41
Handling long contexts efficiently is a major hurdle for LLMs! While models support longer windows, cost & effectiveness remain challenging.
Excited to share our paper on in-context compression for long contexts.
Check out his thread and the paper below!
arxiv.org/abs/2504.08934
30.04.2025 22:41