Had a really great and fun time with @yanai.bsky.social, Niloofar Mireshghallah, and Reza Shokri discussing memorisation at the @l2m2workshop.bsky.social panel. Thanks to the entire organising team and attendees for making this such a fantastic workshop! #ACL2025
               
            
            
02.08.2025 17:02
    
    
@philipwitti.bsky.social will be presenting our paper "Tokenisation is NP-Complete" at #ACL2025! Come to the Language Modelling 2 session (Wednesday morning, 9h~10h30) to learn more about how challenging tokenisation can be!
               
            
            
27.07.2025 09:41
    
    
Just arrived in Vienna for ACL 2025 🇦🇹 Excited to be here and to finally meet so many people in person!
We have several papers this year and many from @milanlp.bsky.social are around, come say hi!
Here are all the works I'm involved in ⬇️
#ACL2025 #ACL2025NLP
               
            
            
27.07.2025 10:29
    
    
Also, got burning questions about memorisation? Send them my way, and we'll make sure to pose them to our panelists during the workshop!
               
            
            
27.07.2025 06:41
    
    
            Headed to Vienna for #ACL2025 to present our tokenisation bias paper and co-organise the L2M2 workshop on memorisation in language models. Reach out to chat about tokenisation, memorisation, and all things pre-training (esp. data-related topics)!
               
            
            
27.07.2025 06:40
    
    
Also, we find that:
• Tokenisation bias appears early in training
• It doesn't go away as models improve or with scale
We hope this approach can support:
• More principled vocabulary design
• Better understanding of generalisation trade-offs
• Fairer and more stable LMs
               
            
            
05.06.2025 10:43
    
    
As our main result, we find that when a token is in a model's vocabulary (i.e., when its characters are tokenised as a single symbol), the model may assign it up to 17x more probability than if it had been split into two tokens instead.
               
            
            
05.06.2025 10:43
    
    
The trick: tokenisers build vocabs incrementally up to a fixed size (e.g., 32k). This defines a "cutoff": tokens near the cutoff are similar (e.g., in frequency), but those just inside the vocabulary appear as one symbol while those just outside appear as two. A perfect setup for regression discontinuity! Details in the paper!
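To make the setup concrete, here is a tiny synthetic sketch of the regression-discontinuity idea (not the paper's actual estimator; the data and variable names below are made up for illustration): treat a token's rank in the tokeniser's merge ordering as the running variable, the vocabulary size as the cutoff, and estimate the jump in log-probability at the threshold with a local linear fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for illustration only: "rank" is a token's position in the
# tokeniser's merge ordering, "logprob" is the log-probability an LM assigns
# to that token's characters. Tokens with rank < CUTOFF are in the vocabulary
# (one symbol); tokens with rank >= CUTOFF are split into two symbols.
CUTOFF = 32_000
rank = rng.uniform(CUTOFF - 2_000, CUTOFF + 2_000, size=5_000)
in_vocab = (rank < CUTOFF).astype(float)
true_effect = 1.5  # nats; made-up ground truth for the simulation
logprob = (-10.0 - 0.001 * (rank - CUTOFF) + true_effect * in_vocab
           + rng.normal(scale=0.5, size=rank.size))

# Sharp regression discontinuity via a local linear fit around the cutoff:
# logprob ~ 1 + centred_rank + in_vocab + centred_rank * in_vocab.
# The coefficient on in_vocab estimates the jump at the threshold.
bandwidth = 1_000
keep = np.abs(rank - CUTOFF) < bandwidth
r = rank[keep] - CUTOFF
X = np.column_stack([np.ones(r.size), r, in_vocab[keep], r * in_vocab[keep]])
coef, *_ = np.linalg.lstsq(X, logprob[keep], rcond=None)
print(f"estimated jump at the cutoff: {coef[2]:.2f} nats (true: {true_effect})")
```

Because tokens just inside and just outside the threshold are otherwise comparable, that jump can be read as the effect of being in the vocabulary.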
               
            
            
05.06.2025 10:43
    
    
So, did we train thousands of models, with and without each token in our vocabulary? No! Our method works observationally!
               
            
            
05.06.2025 10:43
    
    
While intuitive, this question is tricky. We can't just compare:
1️⃣ in- vs. out-of-vocab words (like "hello" vs "appoggiatura"), as they differ systematically, e.g., in frequency
2️⃣ different tokenisations (e.g., ⟨he, llo⟩ or ⟨hello⟩), as the model only sees one during training
               
            
            
05.06.2025 10:43
    
    
In our paper, we estimate a specific type of tokenisation bias: What's the effect of including a token (e.g., ⟨hello⟩) in the tokeniser's vocabulary on the log-probability the model assigns to its characters ("hello")?
               
            
            
05.06.2025 10:43
    
    
Most language models assign probabilities to raw strings (like "hello") by first tokenising them (like ⟨he, llo⟩ or ⟨hello⟩). Ideally, different tokenisations shouldn't change these models' outputs. In practice, they do. We call this difference **tokenisation bias**.
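For intuition, here is a rough sketch of how one can score the same characters under two different tokenisations with an off-the-shelf causal LM (GPT-2 is used purely as an example; this is not the paper's estimator, and it assumes the pieces "hello", "he" and "llo" all exist in the tokeniser's vocabulary):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def sequence_logprob(token_ids: list[int]) -> float:
    """Sum of log p(token_t | tokens_<t) for a forced token-id sequence,
    using the end-of-text token as the initial context."""
    ids = [tok.eos_token_id] + token_ids
    input_ids = torch.tensor([ids])
    with torch.no_grad():
        logits = model(input_ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    return sum(logprobs[t, ids[t + 1]].item() for t in range(len(token_ids)))

# Two tokenisations of the same characters "hello" (assumed to exist as
# vocabulary items; check tok.get_vocab() if unsure).
one_piece = tok.convert_tokens_to_ids(["hello"])
two_pieces = tok.convert_tokens_to_ids(["he", "llo"])

print("log p as <hello>  :", sequence_logprob(one_piece))
print("log p as <he, llo>:", sequence_logprob(two_pieces))
```

The gap between those two numbers is the kind of discrepancy meant here; the causal question is what the probability would have been had the merged token not been in the vocabulary, which is what the regression-discontinuity setup above is for.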
               
            
            
05.06.2025 10:43
    
    
All modern LLMs run on top of a tokeniser, an often overlooked "preprocessing detail". But what if that tokeniser systematically affects model behaviour? We call this tokenisation bias.
Let's talk about it and why it matters!
@aclmeeting.bsky.social #ACL2025 #NLProc
               
            
            
05.06.2025 10:43
            
                                                 
[Image: Title of the paper "Causal Estimation of Tokenisation Bias" and a schematic of how we define tokenisation bias, the causal effect we are interested in.]
A string may get 17 times less probability if tokenised as two symbols (e.g., ⟨he, llo⟩) than as one (e.g., ⟨hello⟩), by an LM trained from scratch in each situation! Our new ACL paper proposes an observational method to estimate this causal effect! Longer thread soon!
               
            
            
04.06.2025 10:51
            
                                                 
[Image: Inline citations with only the first author's name, or the first two co-first authors' names.]
            If you're finishing your camera-ready for ACL or ICML and want to cite co-first authors more fairly, I just made a simple fix to do this! Just add $^*$ to the authors' names in your bibtex, and the citations should change :)
github.com/tpimentelms/...
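As a hypothetical example of the mechanics described above (the entry and names are made up; the actual rendering depends on the patched style files in the linked repo):

```bibtex
% Hypothetical entry: the $^*$ after a surname marks a co-first author, and
% the patched citation style then keeps both starred names in the inline
% citation instead of collapsing to only the first author.
@inproceedings{doe2025example,
  author    = {Doe$^*$, Jane and Roe$^*$, Richard and Smith, Alice},
  title     = {An Example Paper Title},
  booktitle = {Proceedings of ACL},
  year      = {2025}
}
```

With the patch, the inline citation should then show the first two (starred) co-first authors rather than only the first author, matching the screenshot above.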
               
            
            
29.05.2025 08:53
                             
                        
[Link: ACL 2025 Workshop L2M2 ARR Commitment (OpenReview)]
@aclmeeting.bsky.social notifications have been sent out, making this the perfect time to finalize your commitment. Don't miss the opportunity to be part of the L2M2 workshop!
Commit here: openreview.net/group?id=acl...
Deadline: May 20, 2025 (AoE)
#ACL2025 #NLProc
               
            
            
16.05.2025 14:57
    
    
I'm truly honoured that our paper "Causal Estimation of Memorisation Profiles" has been selected as the Paper of the Year by @cst.cam.ac.uk!
I thank my amazing co-authors Clara Meister, Thomas Hofmann, @tpimentel.bsky.social, and my great advisor and co-author @andreasvlachos.bsky.social!
               
            
            
30.04.2025 04:10
    
    
Big thanks to my co-authors: @ovdw.bsky.social, Max Müller-Eberstein, @nsaphra.bsky.social, @hails.computer, Willem Zuidema, and @stellaathena.bsky.social
               
            
            
22.04.2025 11:02
    
    
Come find us at the poster session:
Fri 25 Apr, 3:00–5:30 p.m. (+08)
Hall 3 + Hall 2B, Poster no. 259
               
            
            
22.04.2025 11:02
    
    
We find that:
• Language modelling is stable: consistent scaling laws for performance and information content.
• Steps 1k–10k form the core of linguistic structure; steps 10k–100k bring the biggest jumps in performance.
• Training maps capture these phases and reveal outlier seeds early.
               
            
            
22.04.2025 11:02
    
    
We introduce PolyPythias: 50 training runs across 5 sizes (14M–410M) and 10 seeds to explore:
1️⃣ How stable is downstream performance?
2️⃣ How similar are the learned linguistic representations?
3️⃣ Do training dynamics reveal distinct phases, and can we spot issues early?
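For anyone who wants to poke at these questions directly, here is a minimal sketch of loading one run at an early checkpoint, assuming the usual Pythia-style Hugging Face layout; the `-seed1` repo suffix and the `step3000` revision below are illustrative guesses, so check the collection linked in the post below for the actual names:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative repo id and revision only: PolyPythias runs follow the Pythia
# naming scheme (size + seed), and Pythia repos expose intermediate training
# checkpoints as git revisions such as "step3000". Verify both against the
# Hugging Face collection before relying on them.
REPO = "EleutherAI/pythia-14m-seed1"  # hypothetical seed-variant name
REVISION = "step3000"                 # assumed early-training checkpoint

tok = AutoTokenizer.from_pretrained(REPO, revision=REVISION)
model = AutoModelForCausalLM.from_pretrained(REPO, revision=REVISION).eval()

# E.g., compare the loss on a fixed probe sentence across seeds and steps.
text = "The quick brown fox jumps over the lazy dog."
batch = tok(text, return_tensors="pt")
with torch.no_grad():
    out = model(**batch, labels=batch["input_ids"])
print(f"{REPO}@{REVISION}: mean loss {out.loss.item():.3f}")
```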
               
            
            
22.04.2025 11:02
    
    
✈️ Headed to @iclr-conf.bsky.social: whether you'll be there in person or tuning in remotely, I'd love to connect!
We'll be presenting our paper on pre-training stability in language models and the PolyPythias 🧵
arXiv: arxiv.org/abs/2503.09543
PolyPythias: huggingface.co/collections/...
               
            
            
22.04.2025 11:02
    
    
The First Workshop on Large Language Model Memorization (L2M2) will be co-located with @aclmeeting.bsky.social in Vienna. Help us spread the word!
               
            
            
27.01.2025 21:53
    
    
This year, when students in my optimization class asked for references on forward- and backward-mode autodiff, I didn't suggest books or articles: the #JAX documentation was actually the best thing I've found! What's your go-to reference for this?
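For reference, the two modes side by side in JAX, which is roughly what those docs illustrate: jax.jvp pushes a tangent vector forward through the Jacobian, while jax.vjp (and jax.grad for scalar outputs) pulls a cotangent backward.

```python
import jax
import jax.numpy as jnp

def f(x):
    # A small vector-valued function from R^3 to R^2.
    return jnp.array([jnp.sin(x[0]) * x[1], x[1] * x[2] ** 2])

x = jnp.array([0.5, 2.0, 3.0])

# Forward mode: push a tangent vector v through the Jacobian, giving J @ v.
v = jnp.array([1.0, 0.0, 0.0])
y, jvp_out = jax.jvp(f, (x,), (v,))

# Reverse mode: pull a cotangent u back through the Jacobian, giving u^T @ J.
u = jnp.array([1.0, 0.0])
y2, vjp_fn = jax.vjp(f, x)
(vjp_out,) = vjp_fn(u)

print("f(x)    :", y)
print("J @ v   :", jvp_out)   # one forward pass per input direction
print("u^T @ J :", vjp_out)   # one backward pass per output direction

# For scalar losses, reverse mode is what jax.grad wraps:
loss = lambda z: jnp.sum(f(z) ** 2)
print("grad    :", jax.grad(loss)(x))
```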
               
            
            
26.11.2024 03:15
            
                                                 
[Image: Paper screenshot and Figure 1(c), with cumulative ablations for the components of RealMLP-TD.]
Can deep learning finally compete with boosted trees on tabular data? 🌲
In our NeurIPS 2024 paper, we introduce RealMLP, an NN with improvements in all areas and meta-learned default parameters.
Some insights about RealMLP and other models on large benchmarks (>200 datasets): 🧵
               
            
            
18.11.2024 14:15
    
    
Anne Gagneux, Ségolène Martin, @quentinbertrand.bsky.social, Remi Emonet and I wrote a tutorial blog post on flow matching: dl.heeere.com/conditional-... with lots of illustrations and intuition!
We got this idea after their cool work on improving Plug and Play with FM: arxiv.org/abs/2410.02423
               
            
            
27.11.2024 09:00
    
    
Amazing resource by @brandfonbrener.bsky.social and co-authors. They train and release (the last checkpoint of) >500 models with sizes 20M to 3.3B params and FLOPs 2e17 to 1e21 across 6 different pre-training datasets.
Bonus: they have evaluations on downstream benchmarks!
Great work!
               
            
            
27.11.2024 18:15
    
    
Our lab is on Bluesky now: bsky.app/profile/camb...
               
            
            
25.11.2024 11:25