
Pietro Lesci

@pietrolesci.bsky.social

PhD student at Cambridge University. Causality & language models. Passionate musician, professional debugger. pietrolesci.github.io

249 Followers  |  1,001 Following  |  23 Posts  |  Joined: 22.11.2024

Latest posts by pietrolesci.bsky.social on Bluesky

Had a really great and fun time with @yanai.bsky.social, Niloofar Mireshghallah, and Reza Shokri discussing memorisation at the @l2m2workshop.bsky.social panel. Thanks to the entire organising team and attendees for making this such a fantastic workshop! #ACL2025

02.08.2025 17:02 – 👍 8    🔁 1    💬 0    📌 0

@philipwitti.bsky.social will be presenting our paper "Tokenisation is NP-Complete" at #ACL2025 😁 Come to the language modelling 2 session (Wednesday morning, 9h~10h30) to learn more about how challenging tokenisation can be!

27.07.2025 09:41 – 👍 7    🔁 3    💬 0    📌 0

Just arrived in Vienna for ACL 2025 🇦🇹 Excited to be here and to finally meet so many people in person!

We have several papers this year and many from @milanlp.bsky.social are around, come say hi!

Here are all the works I'm involved in ⬇️

#ACL2025 #ACL2025NLP

27.07.2025 10:29 – 👍 20    🔁 4    💬 1    📌 0

Also, got burning questions about memorisation? Send them my way – we'll make sure to pose them to our panelists during the workshop!

27.07.2025 06:41 – 👍 0    🔁 0    💬 0    📌 0

Headed to Vienna for #ACL2025 to present our tokenisation bias paper and co-organise the L2M2 workshop on memorisation in language models. Reach out to chat about tokenisation, memorisation, and all things pre-training (esp. data-related topics)!

27.07.2025 06:40 – 👍 19    🔁 2    💬 2    📌 0
Preview: Causal Estimation of Tokenisation Bias – Modern language models are typically trained over subword sequences, but ultimately define probabilities over character-strings. Ideally, the choice of the tokeniser -- which maps character-strings to...

Paper 📄: arxiv.org/abs/2506.03149
Code 💻: github.com/pietrolesci/...

Joint work with amazing collaborators: Clara Meister, Thomas Hofmann, @andreasvlachos.bsky.social, and @tpimentel.bsky.social!

05.06.2025 10:43 – 👍 6    🔁 0    💬 1    📌 1

Also, we find that:
– Tokenisation bias appears early in training
– It doesn’t go away as models improve or with scale

We hope this approach can support:
– More principled vocabulary design
– Better understanding of generalisation trade-offs
– Fairer and more stable LMs

05.06.2025 10:43 – 👍 1    🔁 0    💬 1    📌 0

As our main result, we find that when a token is in a model’s vocabulary (i.e., when its characters are tokenised as a single symbol), the model may assign it up to 17x more probability than if it had been split into two tokens instead.

05.06.2025 10:43 – 👍 2    🔁 1    💬 1    📌 0

The trick: tokenisers build vocabularies incrementally up to a fixed size (e.g., 32k). This defines a "cutoff": tokens near it are similar (e.g., in frequency), but those just inside the cutoff appear as one symbol while those just outside appear as two. A perfect setup for regression discontinuity! Details in the 📄, and see the toy sketch below!

05.06.2025 10:43 – 👍 4    🔁 0    💬 1    📌 0
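For intuition only, here is a toy regression-discontinuity sketch in Python (my own illustration on simulated data, not the paper's code or estimator): tokens are ranked by merge order around a hypothetical 32k cutoff, and the jump in log-probability at the cutoff is estimated with a local linear fit.

```python
import numpy as np

rng = np.random.default_rng(0)
cutoff = 32_000                                   # hypothetical vocabulary size
ranks = rng.integers(30_000, 34_000, size=2_000)  # token rank in the merge order
in_vocab = (ranks < cutoff).astype(float)         # 1 = token kept as a single symbol

# Simulated outcome: log-probability assigned to the token's characters,
# with a +2.5 "bias" for in-vocabulary tokens baked in.
logprob = -8.0 + 1e-4 * (cutoff - ranks) + 2.5 * in_vocab + rng.normal(0.0, 0.5, ranks.shape)

# Local linear regression within a bandwidth of the cutoff, allowing different
# slopes on each side; the coefficient on `in_vocab` is the estimated
# discontinuity, i.e. the (simulated) tokenisation bias.
bandwidth = 1_000
w = np.abs(ranks - cutoff) < bandwidth
centred = ranks[w] - cutoff
X = np.column_stack([np.ones(w.sum()), centred, in_vocab[w], centred * in_vocab[w]])
beta, *_ = np.linalg.lstsq(X, logprob[w], rcond=None)
print(f"Estimated jump at the cutoff: {beta[2]:.2f} (true simulated effect: 2.5)")
```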

So, did we train thousands of models, with and without each token in our vocabulary? No! Our method works observationally! 👀📊

05.06.2025 10:43 – 👍 1    🔁 0    💬 1    📌 0

While intuitive, this question is tricky. We can’t just compare
1️⃣ in- vs. out-of-vocab words (like "hello" vs "appoggiatura") as they differ systematically, e.g., in frequency
2️⃣ different tokenisations (e.g., ⟨he, llo⟩ or ⟨hello⟩) as the model only sees one during training

05.06.2025 10:43 – 👍 1    🔁 0    💬 2    📌 0

In our paper, we estimate a specific type of tokenisation bias: What’s the effect of including a token (e.g., ⟨hello⟩) in the tokeniser’s vocabulary on the log-probability the model assigns to its characters (“hello”)?

05.06.2025 10:43 – 👍 2    🔁 0    💬 1    📌 0

Most language models assign probabilities to raw strings (like "hello") by first tokenising them (like ⟨he, llo⟩ or ⟨hello⟩). Ideally, different tokenisations shouldn't change these models’ outputs. In practice, they do. We call this difference **tokenisation bias**

05.06.2025 10:43 – 👍 2    🔁 0    💬 1    📌 0
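To make the phenomenon concrete, here is a minimal sketch of scoring the same characters under two forced tokenisations with an off-the-shelf causal LM via Hugging Face transformers. The model name ("gpt2") and the token pieces are assumptions for illustration, and this naive comparison is exactly what the posts above argue cannot identify the causal effect; it only shows that the two tokenisations need not receive the same probability.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")        # illustrative model choice
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def sequence_logprob(token_ids):
    """Sum of log p(x_t | x_<t) for a token-id sequence, prefixed with BOS."""
    ids = torch.tensor([[tok.bos_token_id] + token_ids])
    with torch.no_grad():
        logits = model(ids).logits                 # shape (1, T, vocab)
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    return logprobs.gather(-1, ids[:, 1:, None]).sum().item()

# Two tokenisations of the same characters "hello"; we assume both the single
# piece and the split pieces exist in this vocabulary (check that the returned
# ids are real tokens, not the unknown token, before trusting the numbers).
single = tok.convert_tokens_to_ids(["hello"])      # ⟨hello⟩
split = tok.convert_tokens_to_ids(["he", "llo"])   # ⟨he, llo⟩
print("log p under ⟨hello⟩  :", sequence_logprob(single))
print("log p under ⟨he, llo⟩:", sequence_logprob(split))
```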

All modern LLMs run on top of a tokeniser, an often overlooked “preprocessing detail”. But what if that tokeniser systematically affects model behaviour? We call this tokenisation bias.

Let’s talk about it and why it matters 👇
@aclmeeting.bsky.social #ACL2025 #NLProc

05.06.2025 10:43 – 👍 62    🔁 8    💬 1    📌 2
Title of paper "Causal Estimation of Tokenisation Bias" and schematic of how we define tokenisation bias, which is the causal effect we are interested in.

A string may get 17 times less probability if tokenised as two symbols (e.g., ⟨he, llo⟩) than as one (e.g., ⟨hello⟩) – by an LM trained from scratch in each situation! Our new ACL paper proposes an observational method to estimate this causal effect! Longer thread soon!

04.06.2025 10:51 – 👍 53    🔁 9    💬 1    📌 3
Inline citations with only first author name, or first two co-first author names.

If you're finishing your camera-ready for ACL or ICML and want to cite co-first authors more fairly, I just made a simple fix to do this! Just add $^*$ to the authors' names in your bibtex, and the citations should change :)

github.com/tpimentelms/...

29.05.2025 08:53 – 👍 85    🔁 23    💬 4    📌 0
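For concreteness, a hypothetical BibTeX entry with the marker added might look like the following (author names are invented; how the $^*$ ends up rendered in inline citations depends on the style fix linked above):

```bibtex
% Illustrative entry only: the $^*$ marks co-first authors.
@inproceedings{doe2025example,
  title     = {An Example Paper Title},
  author    = {Doe$^*$, Jane and Roe$^*$, Richard and Smith, Alex},
  booktitle = {Proceedings of ACL},
  year      = {2025}
}
```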
Preview: ACL 2025 Workshop L2M2 ARR Commitment – Welcome to the OpenReview homepage for ACL 2025 Workshop L2M2 ARR Commitment

📢 @aclmeeting.bsky.social notifications have been sent out, making this the perfect time to finalize your commitment. Don't miss the opportunity to be part of the L2M2 workshop!

🔗 Commit here: openreview.net/group?id=acl...

🗓️ Deadline: May 20, 2025 (AoE)

#ACL2025 #NLProc

16.05.2025 14:57 – 👍 1    🔁 1    💬 0    📌 0

I'm truly honoured that our paper "Causal Estimation of Memorisation Profiles" has been selected as the Paper of the Year by @cst.cam.ac.uk 🎉

I thank my amazing co-authors Clara Meister, Thomas Hofmann, @tpimentel.bsky.social, and my great advisor and co-author @andreasvlachos.bsky.social!

30.04.2025 04:10 – 👍 7    🔁 2    💬 1    📌 0

Big thanks to my co-authors: @ovdw.bsky.social, Max MΓΌller-Eberstein, @nsaphra.bsky.social, @hails.computer, Willem Zuidema, and @stellaathena.bsky.social

22.04.2025 11:02 – 👍 2    🔁 0    💬 0    📌 0

Come find us at the poster session:
🗓️ Fri 25 Apr, 3:00–5:30 p.m. (+08)
📌 Hall 3 + Hall 2B, Poster n. 259

22.04.2025 11:02 – 👍 1    🔁 0    💬 1    📌 0

We find that:
📈 Language modelling is stable: consistent scaling laws for performance and info content.
📚 Steps 1k–10k form the core of linguistic structure; 10k–100k bring the biggest jumps in performance.
🗺️ Training maps capture these phases and reveal outlier seeds early.

22.04.2025 11:02 – 👍 1    🔁 0    💬 1    📌 0

We introduce PolyPythias: 50 training runs across 5 sizes (14M–410M) and 10 seeds to explore:
1️⃣ How stable is downstream performance?
2️⃣ How similar are the learned linguistic representations?
3️⃣ Do training dynamics reveal distinct phases, and can we spot issues early?

22.04.2025 11:02 – 👍 1    🔁 0    💬 1    📌 0

✈️ Headed to @iclr-conf.bsky.social – whether you’ll be there in person or tuning in remotely, I’d love to connect!

We’ll be presenting our paper on pre-training stability in language models and the PolyPythias 🧡

🔗 arXiv: arxiv.org/abs/2503.09543
🤗 PolyPythias: huggingface.co/collections/...

22.04.2025 11:02 – 👍 5    🔁 3    💬 1    📌 0

The First Workshop on Large Language Model Memorization will be co-located with @aclmeeting.bsky.social in Vienna. Help us spread the word!

27.01.2025 21:53 – 👍 7    🔁 2    💬 0    📌 0

This year, when students in my optimization class asked for references on forward- and backward-mode autodiff, I didn't suggest books or articles: the #JAX documentation was actually the best thing I've found! What's your go-to reference for this?

26.11.2024 03:15 – 👍 21    🔁 1    💬 2    📌 0
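For anyone wanting a concrete starting point, here is a tiny forward- vs. reverse-mode example in JAX (the function and inputs are arbitrary; the JAX autodiff docs cover the underlying JVP/VJP machinery in depth):

```python
import jax
import jax.numpy as jnp

def f(x):
    # Arbitrary scalar-valued function of a 2-vector.
    return jnp.sin(x[0]) * x[1] ** 2

x = jnp.array([0.5, 2.0])

# Forward mode: push a tangent vector through f (Jacobian-vector product).
tangent = jnp.array([1.0, 0.0])
y, dy = jax.jvp(f, (x,), (tangent,))

# Reverse mode: for a scalar output, the pullback is just the gradient.
grad_f = jax.grad(f)(x)

print("f(x) =", y, "| directional derivative =", dy, "| gradient =", grad_f)
```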
Paper screenshot and Figure 1 (c) with cumulative ablations for components of RealMLP-TD.

Can deep learning finally compete with boosted trees on tabular data? 🌲
In our NeurIPS 2024 paper, we introduce RealMLP, a NN with improvements in all areas and meta-learned default parameters.
Some insights about RealMLP and other models on large benchmarks (>200 datasets): 🧡

18.11.2024 14:15 – 👍 60    🔁 9    💬 1    📌 7

Anne Gagneux, Ségolène Martin, @quentinbertrand.bsky.social, Rémi Emonet, and I wrote a tutorial blog post on flow matching: dl.heeere.com/conditional-... with lots of illustrations and intuition!

We got this idea after their cool work on improving Plug and Play with FM: arxiv.org/abs/2410.02423

27.11.2024 09:00 – 👍 354    🔁 102    💬 12    📌 11
Preview: Loss-to-Loss Prediction: Scaling Laws for All Datasets – While scaling laws provide a reliable methodology for predicting train loss across compute scales for a single data distribution, less is known about how these predictions should change as we change t...

Check out their paper: arxiv.org/abs/2411.12925

Checkpoints: huggingface.co/KempnerInsti...
Evals: github.com/KempnerInsti...

27.11.2024 18:15 – 👍 1    🔁 0    💬 0    📌 0

Amazing resource by @brandfonbrener.bsky.social and co-authors. They train and release (the last checkpoint of) >500 models with sizes 20M to 3.3B params and FLOPs 2e17 to 1e21 across 6 different pre-training datasets.

Bonus: They have evaluations on downstream benchmarks!

Great work! 🚀

27.11.2024 18:15 – 👍 1    🔁 0    💬 1    📌 0

Our lab is on Bluesky now: bsky.app/profile/camb... 🎉

25.11.2024 11:25 – 👍 6    🔁 1    💬 0    📌 0
