Had a really fun time with @yanai.bsky.social, Niloofar Mireshghallah, and Reza Shokri discussing memorisation at the @l2m2workshop.bsky.social panel. Thanks to the entire organising team and attendees for making this such a fantastic workshop! #ACL2025
02.08.2025 17:02
@philipwitti.bsky.social will be presenting our paper "Tokenisation is NP-Complete" at #ACL2025. Come to the Language Modelling 2 session (Wednesday morning, 9:00–10:30) to learn more about how challenging tokenisation can be!
27.07.2025 09:41
Just arrived in Vienna for ACL 2025 🇦🇹 Excited to be here and to finally meet so many people in person!
We have several papers this year and many from @milanlp.bsky.social are around, come say hi!
Here are all the works I'm involved in ⬇️
#ACL2025 #ACL2025NLP
27.07.2025 10:29
Also, got burning questions about memorisation? Send them my way, and we'll make sure to pose them to our panelists during the workshop!
27.07.2025 06:41
Headed to Vienna for #ACL2025 to present our tokenisation bias paper and co-organise the L2M2 workshop on memorisation in language models. Reach out to chat about tokenisation, memorisation, and all things pre-training (esp. data-related topics)!
27.07.2025 06:40
Also, we find that:
✅ Tokenisation bias appears early in training
✅ Doesn't go away as models improve or with scale
We hope this approach can support:
✅ More principled vocabulary design
✅ Better understanding of generalisation trade-offs
✅ Fairer and more stable LMs
05.06.2025 10:43
As our main result, we find that when a token is in a model's vocabulary (i.e., when its characters are tokenised as a single symbol), the model may assign it up to 17x more probability than if it had been split into two tokens instead.
05.06.2025 10:43
The trick: tokenisers build vocabs incrementally up to a fixed size (e.g., 32k). This defines a "cutoff": tokens near it are similar (e.g., in frequency), but those inside the vocabulary appear as one symbol while those outside appear as two. A perfect setup for regression discontinuity! Details in the paper!
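To make the regression-discontinuity setup concrete, here is a minimal sketch on synthetic data. All numbers (ranks, cutoff, bandwidth, the +2.0 nat jump) are made up for illustration; the paper's actual estimator and controls will differ.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Running variable: a token's position in the tokeniser's merge ordering.
# Tokens ranked below the cutoff (e.g., 32k) make it into the vocabulary.
rank = rng.uniform(0, 64_000, size=5_000)
in_vocab = (rank < 32_000).astype(float)

# Synthetic outcome: the log-probability a model assigns to the token's
# characters, with a jump of +2.0 nats at the cutoff (the "bias").
logprob = -8.0 + 2.0 * in_vocab - 1e-4 * rank + rng.normal(0, 1.0, 5_000)

# Sharp RD: local linear regression on both sides of the cutoff.
cutoff, bandwidth = 32_000, 4_000
near = np.abs(rank - cutoff) < bandwidth
centred = rank[near] - cutoff
X = sm.add_constant(np.column_stack(
    [in_vocab[near], centred, centred * in_vocab[near]]))
fit = sm.OLS(logprob[near], X).fit()
print(f"estimated jump at the cutoff: {fit.params[1]:.2f} nats")
```

The identifying assumption is that tokens just inside and just outside the cutoff are otherwise comparable, so the jump at the cutoff isolates the effect of vocabulary inclusion.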
05.06.2025 10:43
So, did we train thousands of models, with and without each token in our vocabulary? No! Our method works observationally!
05.06.2025 10:43
While intuitive, this question is tricky. We can't just compare:
1️⃣ in- vs. out-of-vocab words (like "hello" vs. "appoggiatura"), as they differ systematically, e.g., in frequency
2️⃣ different tokenisations (e.g., ⟨he, llo⟩ or ⟨hello⟩), as the model only sees one during training
05.06.2025 10:43
In our paper, we estimate a specific type of tokenisation bias: what's the effect of including a token (e.g., ⟨hello⟩) in the tokeniser's vocabulary on the log-probability the model assigns to its characters ("hello")?
05.06.2025 10:43
Most language models assign probabilities to raw strings (like "hello") by first tokenising them (like ⟨he, llo⟩ or ⟨hello⟩). Ideally, different tokenisations shouldn't change these models' outputs. In practice, they do. We call this difference **tokenisation bias**.
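As a concrete illustration of the phenomenon (not the paper's estimator), here is a minimal sketch that scores the same characters under two different tokenisations with an off-the-shelf model; the GPT-2 choice, the prefix, and the split point are all arbitrary:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def seq_logprob(token_ids, prefix="She waved and said"):
    """Sum of next-token log-probs of `token_ids` given `prefix`."""
    prefix_ids = tok.encode(prefix)
    ids = torch.tensor([prefix_ids + token_ids])
    with torch.no_grad():
        logprobs = model(ids).logits.log_softmax(-1)
    return sum(logprobs[0, pos - 1, ids[0, pos]].item()
               for pos in range(len(prefix_ids), ids.size(1)))

one_token = tok.encode(" hello")               # canonical tokenisation
split = tok.encode(" he") + tok.encode("llo")  # same characters, split up
assert tok.decode(one_token) == tok.decode(split)

print("canonical:", seq_logprob(one_token))
print("split    :", seq_logprob(split))
```

The two calls score identical strings, so any gap between them is purely an artefact of the tokenisation.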
05.06.2025 10:43
All modern LLMs run on top of a tokeniser, an often overlooked "preprocessing detail". But what if that tokeniser systematically affects model behaviour? We call this tokenisation bias.
Let's talk about it and why it matters 👇
@aclmeeting.bsky.social #ACL2025 #NLProc
05.06.2025 10:43
[Image: the paper's title, "Causal Estimation of Tokenisation Bias", and a schematic of how we define tokenisation bias, the causal effect we are interested in.]
A string may get 17 times less probability if tokenised as two symbols (e.g., ⟨he, llo⟩) than as one (e.g., ⟨hello⟩), by an LM trained from scratch in each situation! Our new ACL paper proposes an observational method to estimate this causal effect! Longer thread soon!
04.06.2025 10:51
[Image: inline citations with only the first author's name, or the first two co-first authors' names.]
If you're finishing your camera-ready for ACL or ICML and want to cite co-first authors more fairly, I just made a simple fix for this! Just add $^*$ to the authors' names in your bibtex, and the citations should change :)
github.com/tpimentelms/...
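For example, with a hypothetical entry (the starred-author citation behaviour comes from the fix in the linked repo):

```bibtex
@inproceedings{doe2025example,
  title     = {An Example Paper with Co-First Authors},
  author    = {Doe$^*$, Jane and Roe$^*$, Richard and Smith, Alice},
  booktitle = {Proceedings of ACL},
  year      = {2025}
}
```

With the fix applied, an inline citation should then render both starred co-first authors instead of "Doe et al.".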
29.05.2025 08:53
[Link preview: ACL 2025 Workshop L2M2 ARR Commitment on OpenReview.]
📢 @aclmeeting.bsky.social notifications have been sent out, making this the perfect time to finalize your commitment. Don't miss the opportunity to be part of the L2M2 workshop!
🔗 Commit here: openreview.net/group?id=acl...
🗓️ Deadline: May 20, 2025 (AoE)
#ACL2025 #NLProc
16.05.2025 14:57
I'm truly honoured that our paper "Causal Estimation of Memorisation Profiles" has been selected as the Paper of the Year by @cst.cam.ac.uk 🎉
I thank my amazing co-authors Clara Meister, Thomas Hofmann, @tpimentel.bsky.social, and my great advisor and co-author @andreasvlachos.bsky.social!
30.04.2025 04:10
Big thanks to my co-authors: @ovdw.bsky.social, Max Müller-Eberstein, @nsaphra.bsky.social, @hails.computer, Willem Zuidema, and @stellaathena.bsky.social
22.04.2025 11:02
Come find us at the poster session:
🗓️ Fri 25 Apr, 3:00–5:30 p.m. (+08)
📍 Hall 3 + Hall 2B, Poster no. 259
22.04.2025 11:02
We find that:
📈 Language modelling is stable: consistent scaling laws for performance and information content.
🧠 Steps 1k–10k form the core of linguistic structure; 10k–100k bring the biggest jumps in performance.
🗺️ Training maps capture these phases and reveal outlier seeds early
22.04.2025 11:02
We introduce PolyPythias: 50 training runs across 5 sizes (14M–410M) and 10 seeds to explore:
1️⃣ How stable is downstream performance?
2️⃣ How similar are the learned linguistic representations?
3️⃣ Do training dynamics reveal distinct phases, and can we spot issues early? (See the sketch below.)
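For anyone wanting to poke at the runs, a minimal loading sketch. I'm assuming repo ids of the form `EleutherAI/pythia-14m-seed1` and step-tagged revisions like the original Pythia suite; check the Hugging Face collection for the exact names.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repo id and revision; see the PolyPythias collection on the
# Hugging Face Hub for the exact model names and available checkpoints.
repo = "EleutherAI/pythia-14m-seed1"
model = AutoModelForCausalLM.from_pretrained(repo, revision="step10000")
tok = AutoTokenizer.from_pretrained(repo)

print(model.num_parameters())
```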
22.04.2025 11:02
✈️ Headed to @iclr-conf.bsky.social! Whether you'll be there in person or tuning in remotely, I'd love to connect!
We'll be presenting our paper on pre-training stability in language models and the PolyPythias 🧵
📄 ArXiv: arxiv.org/abs/2503.09543
🤗 PolyPythias: huggingface.co/collections/...
22.04.2025 11:02
The First Workshop on Large Language Model Memorization will be co-located with @aclmeeting.bsky.social in Vienna. Help us spread the word!
27.01.2025 21:53
This year, when students in my optimization class asked for references on forward- and backward-mode autodiff, I didn't suggest books or articles: the #JAX documentation was actually the best thing I've found! What's your go-to reference for this?
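For what it's worth, the distinction is easy to demo in a few lines of JAX (the toy function is my own; `jax.jvp` is forward mode, `jax.vjp` reverse mode):

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sin(x[0]) * x[1] ** 2

x = jnp.array([0.5, 2.0])

# Forward mode: push a tangent vector through the computation (JVP).
# Cost scales with the number of *input* directions you probe.
_, tangent_out = jax.jvp(f, (x,), (jnp.array([1.0, 0.0]),))

# Reverse mode: pull a cotangent back through the computation (VJP).
# One backward pass recovers the full gradient of a scalar output.
_, vjp_fun = jax.vjp(f, x)
(grad_x,) = vjp_fun(1.0)

print(tangent_out)  # directional derivative df/dx0 at x
print(grad_x)       # full gradient [df/dx0, df/dx1]
```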
26.11.2024 03:15
[Image: paper screenshot and Figure 1(c), with cumulative ablations for the components of RealMLP-TD.]
Can deep learning finally compete with boosted trees on tabular data? 🌲
In our NeurIPS 2024 paper, we introduce RealMLP, an NN with improvements in all areas and meta-learned default parameters.
Some insights about RealMLP and other models on large benchmarks (>200 datasets): 🧵
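A hypothetical quick-start, assuming the accompanying `pytabkit` package exposes an sklearn-style estimator with the meta-learned "tuned default" parameters (the class name and defaults are my assumption; check the repo for exact usage):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
# Assumed import; see the pytabkit repo for the actual class names.
from pytabkit import RealMLP_TD_Classifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = RealMLP_TD_Classifier()  # meta-learned defaults, no tuning needed
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```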
18.11.2024 14:15
Anne Gagneux, Ségolène Martin, @quentinbertrand.bsky.social, Remi Emonet, and I wrote a tutorial blog post on flow matching: dl.heeere.com/conditional-... with lots of illustrations and intuition!
We got this idea after their cool work on improving Plug-and-Play with FM: arxiv.org/abs/2410.02423
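The tutorial has far better visuals, but as a minimal sketch of the core idea (straight-line conditional flow matching on a toy 2-D "dataset"; the network, data, and hyperparameters here are all illustrative):

```python
import torch
import torch.nn as nn

# Velocity network v(x_t, t): input is the 2-D point plus the time scalar.
model = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(2_000):
    x1 = torch.randn(256, 2) * 0.3 + 2.0   # toy "data" distribution
    x0 = torch.randn(256, 2)               # base noise
    t = torch.rand(256, 1)
    xt = (1 - t) * x0 + t * x1             # point along the straight path
    target = x1 - x0                       # the path's constant velocity
    loss = ((model(torch.cat([xt, t], 1)) - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Sampling: Euler-integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (data).
x = torch.randn(512, 2)
for i in range(100):
    t = torch.full((512, 1), i / 100)
    with torch.no_grad():
        x = x + model(torch.cat([x, t], 1)) / 100
print(x.mean(0))  # should end up roughly near [2.0, 2.0]
```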
27.11.2024 09:00
Amazing resource by @brandfonbrener.bsky.social and co-authors. They train and release (the last checkpoint of) >500 models, with sizes from 20M to 3.3B params and FLOPs from 2e17 to 1e21, across 6 different pre-training datasets.
Bonus: They have evaluations on downstream benchmarks!
Great work!
27.11.2024 18:15
Our lab is on Bluesky now: bsky.app/profile/camb...
25.11.2024 11:25