
@simjeg.bsky.social

Senior LLM Technologist @NVIDIA. Views and opinions are my own.

230 Followers  |  36 Following  |  16 Posts  |  Joined: 19.11.2024

Latest posts by simjeg.bsky.social on Bluesky

Optimal Yahtzee - a Hugging Face Space by simonjegou

🎲 Did you know Yahtzee can be solved optimally in less than 100 lines of Python and in under 5 minutes on 2 vCPUs?

I built a @gradio-hf.bsky.social app so you can try it yourself: huggingface.co/spaces/simon...

Implementation is based on the excellent paper "An Optimal Strategy for Yahtzee" (Glenn, 2006)
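For intuition, here is a minimal sketch (mine, not the Space's code) of why exact dynamic programming is tractable: between turns a game is summarized by which categories are filled plus the capped upper-section subtotal, and within a turn only the multiset of the five dice matters.

    from itertools import combinations_with_replacement

    # Between turns, a Yahtzee game is fully described by:
    #   - which of the 13 categories have been scored (a 13-bit mask)
    #   - the upper-section subtotal, capped at 63 (the bonus threshold)
    n_states = 2**13 * 64
    print(n_states)  # 524288

    # Within a turn, only the multiset of the 5 dice matters, not their order
    rolls = list(combinations_with_replacement(range(1, 7), 5))
    print(len(rolls))  # 252

Backward induction over these ~500k states, with an expectimax over the 252 distinct rolls and the keep/reroll decisions inside each turn, yields the optimal policy (Glenn, 2006).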

31.03.2025 15:07 · 👍 0    🔁 0    💬 0    📌 0
GitHub - NVIDIA/kvpress: LLM KV cache compression made easy

-> Repo: github.com/NVIDIA/kvpress (⭐️ me)
-> Blog post: huggingface.co/spaces/nvidi... (🔺 me)
-> Space: huggingface.co/spaces/nvidi... (❤️ me)

(2/2)

23.01.2025 10:03 · 👍 1    🔁 0    💬 0    📌 0

Fresh news from kvpress, our open source library for KV cache compression 🔥

1. We published a blog post with @huggingface
2. We published a Space for you to try it
3. Following feedback from the research community, we added a bunch of presses and benchmarks

Links 👇 (1/2)

23.01.2025 10:03 · 👍 2    🔁 0    💬 1    📌 0
Relax, it's Santa - a Kaggle notebook

How do you find the permutation of words that minimizes their perplexity as measured by an LLM? In this year's Kaggle Santa competition, I shared an approach that moves the problem to a continuous space where you can use gradient descent via REINFORCE: www.kaggle.com/code/simjeg/...
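The gist, as a hedged sketch (the toy reward below is a stand-in for a real LLM perplexity call, and the notebook's actual setup may differ): put one logit per word, sample permutations from the resulting Plackett-Luce distribution with Gumbel noise, and apply the REINFORCE estimator, since perplexity is not differentiable with respect to the ordering.

    import torch

    torch.manual_seed(0)
    n = 6                                        # number of words
    logits = torch.zeros(n, requires_grad=True)  # one score per word
    opt = torch.optim.Adam([logits], lr=0.1)

    def toy_perplexity(perm):
        # Placeholder reward standing in for a real LLM perplexity call
        return (perm - torch.arange(n)).abs().float().mean()

    for step in range(500):
        # Gumbel-argsort samples a permutation from the Plackett-Luce distribution
        gumbel = -torch.log(-torch.log(torch.rand(n)))
        perm = torch.argsort(logits.detach() + gumbel, descending=True)
        # Plackett-Luce log-probability of the sample (suffix logsumexp trick)
        ordered = logits[perm]
        log_prob = (ordered - torch.logcumsumexp(ordered.flip(0), 0).flip(0)).sum()
        loss = toy_perplexity(perm) * log_prob   # REINFORCE: minimize E[perplexity]
        opt.zero_grad(); loss.backward(); opt.step()

    print(torch.argsort(logits, descending=True))  # MAP permutation drifts toward 0..n-1

A baseline (e.g., a running mean of the reward) would reduce the variance of this estimator; it is omitted here for brevity.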

03.12.2024 12:40 · 👍 2    🔁 1    💬 0    📌 0

💡 We've just released KV cache quantization in kvpress, our open source package for KV cache compression. Check it out: github.com/NVIDIA/kvpress.

Special thanks to Arthur Zucker and Marc Sun from @huggingface.bsky.social for their support 🤗
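For readers new to the idea, here is an illustrative int8 round-trip (a minimal sketch of KV cache quantization in general, not kvpress's actual implementation): per-token absmax quantization halves the cache footprint versus float16 at the cost of a small reconstruction error.

    import torch

    # (batch, kv_heads, tokens, head_dim) key tensor, as cached during generation
    k = torch.randn(1, 8, 1024, 128, dtype=torch.float16)

    # Per-token absmax scale, then round-to-nearest into int8
    scale = k.abs().amax(dim=-1, keepdim=True).float() / 127.0
    k_int8 = torch.clamp((k.float() / scale).round(), -127, 127).to(torch.int8)

    # Dequantize and measure the error; storage drops from 2 bytes to 1 per value
    k_deq = (k_int8.float() * scale).to(torch.float16)
    print((k - k_deq).abs().max())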

26.11.2024 13:23 · 👍 3    🔁 3    💬 0    📌 0

Nice work! Identifying patterns could be done on the fly:
bsky.app/profile/simj...

22.11.2024 07:35 · 👍 2    🔁 0    💬 0    📌 0

Of course it's different! A transformer is an MLP predicting the parameters of another MLP 😀

21.11.2024 18:46 · 👍 1    🔁 0    💬 0    📌 0

You can reproduce this plot using this colab notebook: colab.research.google.com/drive/1DbAEm.... We used this property to create a new KV cache compression method called Expected Attention in our kvpress repository:

20.11.2024 10:06 · 👍 3    🔁 0    💬 0    📌 0

Hidden states in LLMs approximately follow normal distributions. Consequently, queries and keys also follow normal distributions, and if you replace all queries and keys by their average counterparts, this magically explains the slash pattern observed in attention matrices.
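A minimal numpy illustration of this point (my sketch; it assumes RoPE positional encoding, as in Llama-style models): with every query and key replaced by its mean vector, the score between positions i and j depends only on the offset i - j, i.e., the attention matrix is constant along its diagonals, which is exactly the "slash" pattern.

    import numpy as np

    def rope(x, pos, base=10000.0):
        # Standard rotary embedding: rotate consecutive pairs of dims by pos * theta
        d = x.shape[-1]
        ang = pos * base ** (-np.arange(0, d, 2) / d)
        out = np.empty_like(x)
        out[0::2] = x[0::2] * np.cos(ang) - x[1::2] * np.sin(ang)
        out[1::2] = x[0::2] * np.sin(ang) + x[1::2] * np.cos(ang)
        return out

    d, n = 64, 128
    rng = np.random.default_rng(0)
    q_bar, k_bar = rng.normal(size=d), rng.normal(size=d)  # "average" query and key

    scores = np.array([[rope(q_bar, i) @ rope(k_bar, j) for j in range(n)]
                       for i in range(n)])

    # Same offset => same score: constant diagonals, hence the slash pattern
    assert np.isclose(scores[10, 5], scores[20, 15])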

20.11.2024 10:06 · 👍 31    🔁 4    💬 2    📌 1

I created a DistillationPress that distills the (K, V) cache into a compressed (Kc, Vc) cache by minimizing ||A(q, K, V) - A(q, Kc, Vc)||^2. Check out my notebook here: github.com/NVIDIA/kvpre.... More work needs to be done; it's just a first step (3/3)
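A minimal sketch of that objective (mine; it uses random Gaussian queries, whereas the notebook may use observed queries and a smarter initialization): learn a smaller (Kc, Vc) by gradient descent on the squared difference of attention outputs.

    import torch

    torch.manual_seed(0)
    n, m, d = 256, 64, 32                 # compress n key-value pairs down to m
    K, V = torch.randn(n, d), torch.randn(n, d)
    Kc = torch.randn(m, d, requires_grad=True)
    Vc = torch.randn(m, d, requires_grad=True)
    opt = torch.optim.Adam([Kc, Vc], lr=1e-2)

    def attn(q, K, V):
        return torch.softmax(q @ K.T / d**0.5, dim=-1) @ V

    for step in range(1000):
        q = torch.randn(128, d)           # a fresh batch of queries per step
        loss = (attn(q, K, V) - attn(q, Kc, Vc)).pow(2).mean()
        opt.zero_grad(); loss.backward(); opt.step()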

20.11.2024 09:55 · 👍 5    🔁 1    💬 1    📌 0

KV cache quantization? KV cache pruning? KV cache approximation? Replace "KV cache" with "MLP" and you'll see most of this research has already been explored 🤯 So I gave it a try within our new kvpress repo 👇 (2/3)

20.11.2024 09:55 · 👍 4    🔁 0    💬 1    📌 0

Ever noticed that the attention mechanism in transformers is essentially a two-layer MLP? 🤔
A(q, K, V) = V.T @ softmax(K @ q / √d)
Weights: K / √d and V
Nonlinearity: softmax
💡 This offers fresh insights into KV cache compression research 🧵 (1/3)
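A quick numerical check of this identity for a single query q (shapes assume K and V store one key/value per row):

    import numpy as np

    n, d = 10, 4
    rng = np.random.default_rng(0)
    q, K, V = rng.normal(size=d), rng.normal(size=(n, d)), rng.normal(size=(n, d))

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    h = softmax(K @ q / np.sqrt(d))   # layer 1: weights K / sqrt(d), softmax nonlinearity
    out_mlp = V.T @ h                 # layer 2: weights V

    out_attn = softmax(q @ K.T / np.sqrt(d)) @ V  # standard single-query attention
    assert np.allclose(out_mlp, out_attn)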

20.11.2024 09:55 · 👍 56    🔁 5    💬 3    📌 0
kvpress/notebooks/expected_attention.ipynb at main · NVIDIA/kvpress

This release also introduces a new method we developed: Expected Attention! 🎯 By leveraging the normal distribution of LLM hidden states, it measures the importance of each key-value pair. Learn more in this notebook: github.com/NVIDIA/kvpre... (4/4)
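The core idea, as a hedged sketch (a simplification of mine; see the notebook for the actual method): if queries are approximately Gaussian, q ~ N(mu, Sigma), then the expected unnormalized attention weight of each key k has a closed form (the mean of a log-normal), so key-value pairs can be ranked without seeing future queries.

    import numpy as np

    # E[exp(k.q / sqrt(d))] = exp(k.mu / sqrt(d) + k.Sigma.k / (2d)) for q ~ N(mu, Sigma)
    rng = np.random.default_rng(0)
    n, d = 100, 64
    K = rng.normal(size=(n, d))                       # cached keys
    mu, Sigma = rng.normal(size=d), 0.5 * np.eye(d)   # estimated query statistics

    log_scores = K @ mu / np.sqrt(d) + np.einsum("nd,de,ne->n", K, Sigma, K) / (2 * d)
    keep = np.argsort(log_scores)[-n // 2:]           # keep the top 50% of KV pairs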

19.11.2024 14:25 · 👍 3    🔁 0    💬 0    📌 0

kvpress aims to help researchers and developers create and benchmark KV cache compression techniques, offering a user-friendly repo built on 🤗 Transformers. All implemented methods are training-free and model-agnostic (3/4)
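As a usage sketch, based on the repo's README around this release (the model id is illustrative and parameter names may have changed; check the repo for the current API):

    from transformers import pipeline
    from kvpress import ExpectedAttentionPress

    pipe = pipeline("kv-press-text-generation",
                    model="meta-llama/Meta-Llama-3.1-8B-Instruct", device="cuda")
    press = ExpectedAttentionPress(compression_ratio=0.5)  # prune 50% of KV pairs

    context, question = "A long document...", "A question about the document"
    answer = pipe(context, question=question, press=press)["answer"]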

19.11.2024 14:25 · 👍 2    🔁 0    💬 1    📌 0

Long-context LLMs are resource-heavy due to KV cache growth: e.g., the cache for 1M tokens with Llama 3.1-70B (float16) needs ~330 GB of memory 😬. This challenge has driven intense research into KV cache compression, with many submissions to #ICLR2025. (2/4)
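The ~330 GB figure checks out from the published Llama 3.1-70B configuration (80 layers, 8 KV heads via GQA, head dim 128):

    layers, kv_heads, head_dim = 80, 8, 128  # Llama 3.1-70B (GQA)
    bytes_per_scalar, tokens = 2, 1_000_000  # float16, 1M-token context
    kv_bytes = layers * 2 * kv_heads * head_dim * bytes_per_scalar * tokens  # 2 = K and V
    print(f"{kv_bytes / 1e9:.0f} GB")        # 328 GB, i.e. the ~330 GB quoted above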

19.11.2024 14:25 · 👍 1    🔁 0    💬 1    📌 0

🚀 Excited to announce KVPress, our open-source library for efficient LLM KV cache compression!
👉 Check it out (and drop a ⭐): github.com/NVIDIA/kvpress
🔗 Full details in the thread 🧵 (1/4)

19.11.2024 14:25 · 👍 50    🔁 6    💬 2    📌 2
