
Benjamin Minixhofer

@bminixhofer.bsky.social

45 Followers  |  21 Following  |  9 Posts  |  Joined: 10.12.2024

Latest posts by bminixhofer.bsky.social on Bluesky

Preview
ALM Transfers - a benjamin Collection: We're on a journey to advance and democratize artificial intelligence through open source and open science.

Models from our paper, including Gemma-2B and Llama-3B instruction-tunes transferred to byte-level, are up on Hugging Face 🤗

huggingface.co/collections/...

02.04.2025 06:42 | 👍 1    🔁 0    💬 0    📌 0
Preview
Cross-Tokenizer Distillation via Approximate Likelihood Matching Distillation has shown remarkable success in transferring knowledge from a Large Language Model (LLM) teacher to a student LLM. However, current distillation methods predominantly require the same tok...

Check out the paper for lots of details.

We are also releasing our code as part of `tokenkit`, a new library implementing advanced tokenization transfer methods. More to follow on that 👀

Paper: arxiv.org/abs/2503.20083
Code: github.com/bminixhofer/...

w/ Ivan Vulić and @edoardo-ponti.bsky.social

02.04.2025 06:41 | 👍 3    🔁 0    💬 1    📌 0
Post image

2️⃣ We also use ALM to directly transfer knowledge from a large teacher (with one tokenizer) to a smaller student (with another tokenizer).

We test this by distilling a large maths-specialized Llama into a small Gemma model. 🔢
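Distilling across tokenizers means comparing likelihoods computed over different token sequences, so both the teacher and the student first need per-token log-likelihoods over their own tokenizations. A minimal sketch in generic PyTorch (not code from the paper or `tokenkit`):

```python
import torch
import torch.nn.functional as F

def token_logprobs(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Per-token log-likelihoods log p(x_t | x_<t) from causal-LM outputs.

    logits: (batch, seq_len, vocab), input_ids: (batch, seq_len).
    Returns shape (batch, seq_len - 1); run this separately for the teacher
    and the student, each over its own tokenizer's input_ids.
    """
    logp = F.log_softmax(logits[:, :-1], dim=-1)  # position t predicts token t+1
    return logp.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
```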

02.04.2025 06:39 | 👍 3    🔁 0    💬 1    📌 0
Post image

1️⃣ continued: we can also transfer different base models to the same tokenizer, then ensemble them by combining their logits.

This would not be possible if they had different tokenizers.

We try ensembling Gemma, Llama and Qwen. They perform better together than separately! 🤝
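A minimal sketch of what logit-level ensembling looks like once the models share a tokenizer. The repo ids are placeholders, and the paper's exact combination scheme may differ; here the logits are simply averaged.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo ids: in practice these would be checkpoints already
# transferred to one shared tokenizer (e.g. via ALM).
repo_ids = ["org/model-a-shared-tok", "org/model-b-shared-tok"]

tokenizer = AutoTokenizer.from_pretrained(repo_ids[0])
models = [AutoModelForCausalLM.from_pretrained(r).eval() for r in repo_ids]

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    # With a shared vocabulary the logit tensors line up token-for-token,
    # so a simple average already gives an ensemble prediction.
    logits = torch.stack([m(**inputs).logits for m in models]).mean(dim=0)

print(tokenizer.decode(logits[0, -1].argmax().item()))
```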

02.04.2025 06:39 | 👍 3    🔁 0    💬 1    📌 0
Post image

We investigate two use cases of ALM in detail (but there's definitely more!)

1️⃣ Tokenizer transfer: the teacher is the model with its original tokenizer; the student is the same model with a new tokenizer.

Here, ALM even lets us distill subword models to a byte-level tokenizer 😮
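A rough sketch of that setup (not the actual pipeline from the paper or `tokenkit`): the student starts as a copy of the teacher with its embeddings resized to the new tokenizer's vocabulary, and distillation then recovers the teacher's behaviour under the new tokenization. The byte-level tokenizer path below is a placeholder.

```python
import copy
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative teacher; the byte-level tokenizer path is a placeholder.
teacher = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b-it")
new_tok = AutoTokenizer.from_pretrained("path/to/byte-level-tokenizer")

# The student starts as a copy of the teacher with its embedding matrix and
# LM head resized to the new vocabulary. The new rows still need a sensible
# initialization (e.g. a heuristic embedding transfer) before distillation
# pulls the student's behaviour back towards the teacher's.
student = copy.deepcopy(teacher)
student.resize_token_embeddings(len(new_tok))
```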

02.04.2025 06:38 | 👍 3    🔁 1    💬 1    📌 0
Post image

Chunks of tokens with different tokenization biases are not fairly comparable! ⚠️⚠️

We thus develop a method to find chunks with low tokenization bias differences (making them *approximately comparable*), then learn to match the likelihoods of those ✅
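A minimal sketch of the matching step, assuming the comparable chunks have already been found and per-token log-likelihoods are available for both models; the paper's actual objective and chunk-selection criterion are more involved.

```python
import torch

def chunk_matching_loss(teacher_logprobs, student_logprobs,
                        teacher_chunks, student_chunks):
    """teacher_logprobs / student_logprobs: 1D tensors of per-token
    log-likelihoods under each model's own tokenization.
    *_chunks: (start, end) token-index ranges; the i-th teacher chunk and
    the i-th student chunk are assumed to cover the same piece of text."""
    losses = []
    for (ts, te), (ss, se) in zip(teacher_chunks, student_chunks):
        t_ll = teacher_logprobs[ts:te].sum()  # chunk log-likelihood, teacher
        s_ll = student_logprobs[ss:se].sum()  # chunk log-likelihood, student
        losses.append((t_ll - s_ll).abs())    # pull the two likelihoods together
    return torch.stack(losses).mean()
```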

02.04.2025 06:38 | 👍 2    🔁 0    💬 1    📌 0
Post image

Our greatest adversary in this endeavour is *tokenization bias*.

Due to tokenization bias, a sequence of subword tokens can leak information about the future contents of the text they encode.
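A small illustration of the effect with an off-the-shelf BPE tokenizer (GPT-2 here, purely as an example): the same text prefix " the" is encoded differently depending on what follows it, so seeing the standalone token " the" already rules out continuations that would have been merged into a longer token.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

for text in [" the cat", " them", " theory"]:
    ids = tok.encode(text)
    print(repr(text), "->", [tok.decode([i]) for i in ids])

# Typical output (exact merges depend on the tokenizer):
#   ' the cat' -> [' the', ' cat']
#   ' them'    -> [' them']
#   ' theory'  -> [' theory']
# If a sequence contains the token ' the', the underlying text cannot continue
# with "m" or "ory" right after it, since the tokenizer would have emitted the
# longer token instead. That is tokenization bias leaking future information.
```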

02.04.2025 06:37 | 👍 2    🔁 0    💬 1    📌 0
Post image

Most distillation methods so far needed the teacher and the student to have the same tokenizer.

We lift this restriction by first identifying comparable chunks of tokens in a sequence (surprisingly, this is not so easy!), then minimizing the difference between their likelihoods.
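A minimal sketch of the boundary-alignment part of that idea, assuming per-token character offsets from each tokenizer over the same text; this greedy pass is not the paper's full procedure, which additionally filters chunks by tokenization bias.

```python
def align_chunks(offsets_a, offsets_b):
    """Greedily group two tokenizations of the same text into the smallest
    character spans whose boundaries coincide in both tokenizations.

    offsets_a, offsets_b: lists of (start, end) character offsets per token,
    e.g. from a Hugging Face fast tokenizer with return_offsets_mapping=True.
    Assumes both tokenizations tile exactly the same text.
    """
    chunks = []
    i = j = 0
    while i < len(offsets_a) and j < len(offsets_b):
        start = offsets_a[i][0]
        end_a, end_b = offsets_a[i][1], offsets_b[j][1]
        i += 1
        j += 1
        while end_a != end_b:
            if end_a < end_b:
                end_a = offsets_a[i][1]
                i += 1
            else:
                end_b = offsets_b[j][1]
                j += 1
        chunks.append((start, end_a))
    return chunks


# Toy example: an 11-character word split into three pieces by one tokenizer
# and two by another; the boundaries only meet at the end of the word, so the
# whole word becomes a single comparable chunk.
print(align_chunks([(0, 2), (2, 7), (7, 11)], [(0, 6), (6, 11)]))  # -> [(0, 11)]
```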

02.04.2025 06:37 | 👍 2    🔁 0    💬 1    📌 0
Image illustrating that ALM can enable Ensembling, Transfer to Bytes, and general Cross-Tokenizer Distillation.

We created Approximate Likelihood Matching, a principled (and very effective) method for *cross-tokenizer distillation*!

With ALM, you can create ensembles of models from different families, convert existing subword-level models to byte-level, and a bunch more 🧵

02.04.2025 06:36 | 👍 26    🔁 14    💬 1    📌 0


Two amazing papers from my students at #NeurIPS today:

⛓️💥 Switch the vocabulary and embeddings of your LLM tokenizer zero-shot on the fly (@bminixhofer.bsky.social)
neurips.cc/virtual/2024...

🌊 Align your LLM gradient-free with spectral editing of activations (Yifu Qiu)
neurips.cc/virtual/2024...

12.12.2024 17:45 | 👍 46    🔁 8    💬 2    📌 0
