Craig Schmidt

@craigschmidt.com.bsky.social

Interested in ML, AI, and NLP. Particularly interested in tokenization. Live in the Boston area and work at Kensho Technologies.

470 Followers  |  2,242 Following  |  47 Posts  |  Joined: 24.11.2024

Latest posts by craigschmidt.com on Bluesky

@crampell.bsky.social’s post got me to thinking and…yes…Trump has apparently canceled the research grant of Judea Pearl, who is one of the world’s leading scholars, is Jewish, Israeli-American, & is vocally opposed to antisemitism, & is the father of Daniel Pearl.
www.science.org/content/arti...

03.08.2025 02:44 — 👍 213    🔁 91    💬 9    📌 8
Jobs OBP - Georg-August-Universität Göttingen: Web pages of the Georg-August-Universität Göttingen

Interested in multilingual tokenization in #NLP? Lisa Beinborn and I are hiring!

PhD candidate position in Göttingen, Germany: www.uni-goettingen.de/de/644546.ht...

PostDoc position in Leuven, Belgium:
www.kuleuven.be/personeel/jo...

Deadline 6th of June

16.05.2025 08:23 — 👍 25    🔁 13    💬 2    📌 2

I've posted a few papers I missed, including yours, here: bsky.app/profile/crai.... Thomas pointed that out about 5 seconds after I posted on the Discord :-)

30.07.2025 15:17 — 👍 1    🔁 0    💬 1    📌 0
Causal Estimation of Tokenisation Bias Pietro Lesci, Clara Meister, Thomas Hofmann, Andreas Vlachos, Tiago Pimentel. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.

16) Causal Estimation of Tokenisation Bias
Pietro Lesci et al
aclanthology.org/2025.acl-lon...

30.07.2025 14:22 — 👍 2    🔁 0    💬 0    📌 0
Tokenisation is NP-Complete Philip Whittington, Gregor Bachmann, Tiago Pimentel. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.

15) Tokenisation is NP-Complete
Philip Whittington et al
aclanthology.org/2025.acl-lon...

30.07.2025 14:22 — 👍 3    🔁 1    💬 1    📌 0
GRaMPa: Subword Regularisation by Skewing Uniform Segmentation Distributions with an Efficient Path-counting Markov Model Thomas Bauwens, David Kaczér, Miryam De Lhoneux. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.

14) GRaMPa: Subword Regularisation by Skewing Uniform Segmentation Distributions with an Efficient Path-counting Markov Model
Thomas Bauwens et al
aclanthology.org/2025.acl-lon...

30.07.2025 14:22 — 👍 2    🔁 0    💬 1    📌 1

And of course I missed some tokenization-related papers at #ACL2025 in my previous post. Any more I should add?

30.07.2025 14:22 — 👍 2    🔁 0    💬 1    📌 0
Evaluating Tokenizer Adaptation Methods for Large Language Models on Low-Resource Programming Languages Georgy Andryushchenko, Vladimir V. Ivanov. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop). 2025.

13) Evaluating Tokenizer Adaptation Methods for Large Language Models on Low-Resource Programming Languages
Georgii Andriushchenko et al
aclanthology.org/2025.acl-srw...

30.07.2025 14:03 — 👍 1    🔁 0    💬 0    📌 0
Retrofitting Large Language Models with Dynamic Tokenization Darius Feher, Ivan Vulić, Benjamin Minixhofer. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.

12) Retrofitting Large Language Models with Dynamic Tokenization
Darius Feher et al
aclanthology.org/2025.acl-lon...

30.07.2025 14:03 — 👍 1    🔁 0    💬 1    📌 0
TokAlign: Efficient Vocabulary Adaptation via Token Alignment Chong Li, Jiajun Zhang, Chengqing Zong. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.

11) TokAlign: Efficient Vocabulary Adaptation via Token Alignment
Chong Li et al
aclanthology.org/2025.acl-lon...

30.07.2025 14:03 — 👍 1    🔁 0    💬 1    📌 0
Sticking to the Mean: Detecting Sticky Tokens in Text Embedding Models Kexin Chen, Dongxia Wang, Yi Liu, Haonan Zhang, Wenhai Wang. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.

10) Sticking to the Mean: Detecting Sticky Tokens in Text Embedding Models
Kexin Chen et al
aclanthology.org/2025.acl-lon...

30.07.2025 14:03 — 👍 1    🔁 0    💬 1    📌 0
Inconsistent Tokenizations Cause Language Models to be Perplexed by Japanese Grammar Andrew Gambardella, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2025.

9) Inconsistent Tokenizations Cause Language Models to be Perplexed by Japanese Grammar
Andrew Gambardella et al
aclanthology.org/2025.acl-sho...

30.07.2025 14:03 — 👍 1    🔁 0    💬 1    📌 0
Adversarial Tokenization Renato Geh, Zilei Shao, Guy Van Den Broeck. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.

8) Adversarial Tokenization
Renato Lui Geh et al
aclanthology.org/2025.acl-lon...

30.07.2025 14:03 — 👍 1    🔁 0    💬 1    📌 0
Incorporating Domain Knowledge into Materials Tokenization Yerim Oh, Jun-Hyung Park, Junho Kim, SungHo Kim, SangKeun Lee. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.

7) Incorporating Domain Knowledge into Materials Tokenization
Yerim Oh et al
aclanthology.org/2025.acl-lon...

30.07.2025 14:03 — 👍 1    🔁 0    💬 1    📌 0
Beyond Text Compression: Evaluating Tokenizers Across Scales Jonas F. Lotz, António V. Lopes, Stephan Peitz, Hendra Setiawan, Leonardo Emili. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.

6) Beyond Text Compression: Evaluating Tokenizers Across Scales
Jonas F. Lotz et al
aclanthology.org/2025.acl-lon...

30.07.2025 14:03 — 👍 1    🔁 0    💬 1    📌 0
Enhancing Character-Level Understanding in LLMs through Token Internal Structure Learning Zhu Xu, Zhiqiang Zhao, Zihan Zhang, Yuchi Liu, Quanwei Shen, Fei Liu, Yu Kuang, Jian He, Conglin Liu. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1:...

5) Enhancing Character-Level Understanding in LLMs through Token Internal Structure Learning
Zhu Xu et al
aclanthology.org/2025.acl-lon...

30.07.2025 14:03 — 👍 1    🔁 0    💬 1    📌 0
Unsupervised Morphological Tree Tokenizer Qingyang Zhu, Xiang Hu, Pengyu Ji, Wei Wu, Kewei Tu. Findings of the Association for Computational Linguistics: ACL 2025. 2025.

4) Unsupervised Morphological Tree Tokenizer
Xiang Hu et al
aclanthology.org/2025.finding...

30.07.2025 14:03 — 👍 1    🔁 0    💬 1    📌 0
Splintering Nonconcatenative Languages for Better Tokenization Bar Gazit, Shaltiel Shmidman, Avi Shmidman, Yuval Pinter. Findings of the Association for Computational Linguistics: ACL 2025. 2025.

3) Splintering Nonconcatenative Languages for Better Tokenization
Yuval Pinter et al
aclanthology.org/2025.finding...

30.07.2025 14:03 — 👍 1    🔁 0    💬 1    📌 0
Tokenization is Sensitive to Language Variation Anna Wegmann, Dong Nguyen, David Jurgens. Findings of the Association for Computational Linguistics: ACL 2025. 2025.

2) Tokenization is Sensitive to Language Variation
Anna Wegmann et al
aclanthology.org/2025.finding...

30.07.2025 14:03 — 👍 1    🔁 0    💬 1    📌 0
Byte Latent Transformer: Patches Scale Better Than Tokens Artidoro Pagnoni, Ramakanth Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason E Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman, Srini...

1) Byte Latent Transformer: Patches Scale Better Than Tokens
Artidoro Pagnoni et al
aclanthology.org/2025.acl-lon...

30.07.2025 14:03 — 👍 1    🔁 0    💬 1    📌 0

I'm sadly not at #ACL2025, but the work on tokenization seems to continue to explode. Here are the tokenization-related papers I could find, in no particular order. Let me know if I missed any.

30.07.2025 14:03 — 👍 11    🔁 4    💬 2    📌 0

Really grateful to the organizers for the recognition of our work!

19.07.2025 13:55 — 👍 12    🔁 1    💬 1    📌 0
ICML Poster Chameleon: A Flexible Data-mixing Framework for Language Model Pretraining and Finetuning. ICML 2025

You’re right, these results apply to general “big” datasets like ThePile or RedPajama. There are several papers at ICML on weighting datasets, like Chameleon (icml.cc/virtual/2025...), that could probably let you get away with less data.

17.07.2025 15:29 — 👍 1    🔁 0    💬 1    📌 0
Entropy-Driven Pre-Tokenization for Byte-Pair Encoding Byte-Pair Encoding (BPE) has become a widely adopted subword tokenization method in modern language models due to its simplicity and strong empirical performance across downstream tasks. However, appl...

The second is on entropy-driven pre-tokenization for non-space-delimited languages: arxiv.org/abs/2506.15889. That came out of a capstone project for a Harvard Master's program. Congrats to them on achieving a peer-reviewed paper.

17.07.2025 06:05 — 👍 4    🔁 0    💬 0    📌 0
How Much is Enough? The Diminishing Returns of Tokenization Training Data Tokenization, a crucial initial step in natural language processing, is governed by several key parameters, such as the tokenization algorithm, vocabulary size, pre-tokenization strategy, inference st...

I'm at #ICML2025 this week in Vancouver. My co-authors and I are presenting two posters at the Tokenization Workshop on Friday (tokenization-workshop.github.io). The first is on how much data is useful in training a tokenizer: arxiv.org/abs/2502.20273.

17.07.2025 06:05 — 👍 6    🔁 2    💬 2    📌 0

My son said he couldn’t call me on Father’s Day because he had worked the weekend dealing with a North Korean hacking group. Valid excuse I guess. The hack analysis …

18.06.2025 23:58 — 👍 1    🔁 0    💬 0    📌 0

I see I was too slow, and you're already on the discord

01.06.2025 14:11 — 👍 1    🔁 0    💬 0    📌 0

A bit of a mess around the conflict of COLM with the ARR (and to a lesser degree ICML) reviews release. We feel this is creating a lot of pressure and uncertainty. So, we are pushing our deadlines:

Abstracts due March 22 AoE (+48hr)
Full papers due March 28 AoE (+24hr)

Plz RT 🙏

20.03.2025 18:20 — 👍 37    🔁 31    💬 3    📌 2
Tokenization Is More Than Compression Craig W Schmidt, Varshini Reddy, Haoran Zhang, Alec Alameddine, Omri Uzan, Yuval Pinter, Chris Tanner. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024.

Sorry I'm a bit late to the party, but you might be interested in aclanthology.org/2024.emnlp-m...

13.03.2025 16:02 — 👍 3    🔁 0    💬 0    📌 0

If you have an interest in tokenization in Natural Language Processing (NLP), this is a nice Discord. Come say hi.

12.02.2025 14:17 — 👍 4    🔁 0    💬 1    📌 0
