
Nathan Godey

@nthngdy.bsky.social

Looking to start a post-doc in early 2025! Working on the representations of LMs and pretraining methods @InriaParis https://nathangodey.github.io

30 Followers  |  18 Following  |  20 Posts  |  Joined: 25.11.2024

Latest posts by nthngdy.bsky.social on Bluesky

Nathan Godey receiving the 2025 ATALA best thesis prize at CORIA-TALN 2025.

๐Ÿ†๐Ÿคฉ We are excited to share the news that @nthngdy.bsky.social, supervised by @bensagot.bsky.social and ร‰ric de la Clergerie, has received the 2025 ATALA Best PhD Dissertation Prize!

You can read his PhD thesis online here: hal.science/tel-04994414/

17.07.2025 09:40 — 👍 9    🔁 1    💬 1    📌 0

PS: I am looking for an academic post-doc on related topics (efficiency, sparsity, sequence compression, spectral analysis of LLMs, among others); feel free to reach out if you are interested :)

06.03.2025 16:02 — 👍 3    🔁 4    💬 0    📌 0
SLLM@ICLR 2025 Workshop Summary

This work was the final touch to my PhD at @inriaparisnlp.bsky.social and was just accepted to the SLLM workshop at ICLR 2025 (sparsellm.org) 🎉

06.03.2025 16:02 — 👍 1    🔁 0    💬 1    📌 0

Thanks a lot to all my amazing co-authors @alessiodevoto.bsky.social @sscardapane.bsky.social @yuzhaouoe.bsky.social @neuralnoise.com Eric de la Clergerie @bensagot.bsky.social

And a special thanks to @edoardo-ponti.bsky.social for the academic visit that made this work possible!

06.03.2025 16:02 — 👍 2    🔁 1    💬 1    📌 0
GitHub - NathanGodey/qfilters: Repository for the Q-Filters method (https://arxiv.org/pdf/2503.02812)

We also release a HuggingFace transformers-compatible implementation here:
github.com/NathanGodey/...

06.03.2025 16:02 — 👍 1    🔁 0    💬 1    📌 0

Our method is also competitive in the prompt compression setup, especially for synthetic token-retrieval tasks such as needle-in-a-haystack or variable tracking, maintaining reasonable error rates at compression ratios of up to 32x:

06.03.2025 16:02 — 👍 2    🔁 0    💬 1    📌 0
Q-Filters - a nthngdy Collection: Pre-computed Q-Filters for efficient KV cache compression.

This Q-Filter direction is context-agnostic, which means that it can be pre-computed once and for all for any attention head in a given model.

We release a collection of pre-computed Q-Filters for various models ranging from 1.5B to 405B parameters:
huggingface.co/collections/...

06.03.2025 16:02 — 👍 1    🔁 0    💬 1    📌 0
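For intuition, here is a minimal PyTorch sketch of how such a per-head direction could be pre-computed. It assumes, as one plausible reading of the paper rather than its actual code, that each filter is the top right singular vector of query vectors gathered on a small calibration set, sign-corrected so that queries project positively onto it:

    import torch

    def estimate_q_filters(queries: torch.Tensor) -> torch.Tensor:
        # Hypothetical sketch, not the released implementation.
        # queries: (num_heads, num_samples, head_dim) query vectors gathered
        # on a calibration set. Returns one unit filter per head: (num_heads, head_dim).
        _, _, vh = torch.linalg.svd(queries, full_matrices=False)
        v = vh[:, 0, :]  # dominant right singular vector of each head's query cloud
        # Orient each filter so queries project positively on average; Key
        # projections onto the filter then rank KV pairs consistently across heads.
        sign = torch.sign((queries * v.unsqueeze(1)).sum(-1).mean(-1, keepdim=True))
        return v * sign

Because the filters are fixed once calibrated, they only need to be computed once per model, which is what makes a pre-computed collection like this possible.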

The projection of a Key vector onto this direction strongly correlates with the average attention weight given to that Key during generation, providing a finer KV-pair ranking than previous work (arxiv.org/abs/2406.11430):

06.03.2025 16:02 — 👍 1    🔁 0    💬 1    📌 0
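Concretely, here is a hypothetical sketch of the resulting cache pruning, reusing the estimate_q_filters helper sketched above: each cached Key is scored with a single dot product against its head's filter, and only the top-scoring pairs are kept.

    import torch

    def compress_kv_cache(keys, values, q_filters, budget):
        # Hypothetical sketch. keys, values: (num_heads, seq_len, head_dim);
        # q_filters: (num_heads, head_dim). A Key's projection onto the filter
        # serves as a proxy for the average attention it will receive, so we
        # keep the `budget` best-scoring KV pairs per head.
        scores = torch.einsum("hsd,hd->hs", keys, q_filters)
        keep = scores.topk(budget, dim=-1).indices.sort(dim=-1).values  # restore temporal order
        idx = keep.unsqueeze(-1).expand(-1, -1, keys.size(-1))
        return keys.gather(1, idx), values.gather(1, idx)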

Building on arxiv.org/pdf/2401.12143, we find that Query and Key vectors share a single biased direction which encodes a selection mechanism in self-attention: Key vectors with a strong component along this direction are ignored by the model.

06.03.2025 16:02 — 👍 1    🔁 0    💬 1    📌 0
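One way to unpack this (my paraphrase, not a quote from the paper): writing u for the shared unit direction and q⊥, k⊥ for the components orthogonal to it, the attention logit decomposes exactly as

    ⟨q, k⟩ = ⟨q, u⟩ · ⟨k, u⟩ + ⟨q⊥, k⊥⟩

so if ⟨q, u⟩ keeps a consistent sign across Queries, a Key's projection ⟨k, u⟩ shifts all of its logits in the same direction, and Keys whose projection makes that first term strongly negative are systematically ignored.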

We vastly improve over comparable methods in the compress-as-you-generate scenario, reaching similar generation throughput while reducing the perplexity gap by up to 65% in the case of Llama-70B!

06.03.2025 16:02 — 👍 1    🔁 0    💬 1    📌 0

Q-Filters is very efficient, which allows streaming compression at virtually no latency cost, just like Streaming-LLM...

...but it is also much better at retaining relevant KV pairs than fast alternatives (and can even beat slower algorithms such as SnapKV)

06.03.2025 16:02 — 👍 2    🔁 0    💬 1    📌 0
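As a rough single-head sketch of that streaming setting (hypothetical code, not the released implementation): since scores are plain dot products with a fixed direction, the per-token eviction cost stays bounded by the cache budget, no matter how long generation runs.

    import torch

    def append_and_evict(keys, values, new_k, new_v, q_filter, budget):
        # Hypothetical sketch. keys, values: (seq_len, head_dim) for one head;
        # new_k, new_v: (head_dim,) for the freshly generated token.
        keys = torch.cat([keys, new_k.unsqueeze(0)])
        values = torch.cat([values, new_v.unsqueeze(0)])
        if keys.size(0) > budget:
            scores = keys @ q_filter  # one dot product per cached Key
            drop = scores.argmin()    # evict the Key expected to get least attention
            keep = torch.arange(keys.size(0), device=keys.device) != drop
            keys, values = keys[keep], values[keep]
        return keys, values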
Post image

🚀 New Paper Alert! 🚀

We introduce Q-Filters, a training-free method for efficient KV Cache compression!

It is compatible with FlashAttention and can compress during generation, which is particularly useful for reasoning models ⚡

TLDR: we make Streaming-LLM smarter using the geometry of attention

06.03.2025 16:02 — 👍 21    🔁 7    💬 1    📌 1

Q-Filters: Leveraging QK Geometry for Efficient KV Cache Compression
Autoregressive language models rely on a Key-Value (KV) Cache, which avoids re-computing past hidden states during generation, making it faster. As model sizes and context lengths grow, the KV Cache b...

If you want to read more about the theoretical background and limitations of the method, feel free to check out our pre-print:
arxiv.org/abs/2503.02812

06.03.2025 13:40 — 👍 1    🔁 1    💬 1    📌 0
