
Eugene Jang @EMNLP

@eugeneonnlp.bsky.social

NLP PhD student @ Northeastern. Multilingual NLP, tokenizers. https://genesith.github.io/

273 Followers  |  226 Following  |  17 Posts  |  Joined: 08.11.2024

Latest posts by eugeneonnlp.bsky.social on Bluesky

I’ll be presenting our work on Byte-level Tokenizer Vulnerabilities at the poster session at 2:00pm!

If you’ve ever encountered oddities or frustrations with #tokenization I’d love to chat about it! #EMNLP

06.11.2025 21:23  |  👍 0    🔁 0    💬 0    📌 0

great list, would love an add!

05.12.2024 06:57  |  👍 0    🔁 0    💬 0    📌 0

To paraphrase Dennett (rip 💔), the goal of reviewing is to determine truth, not to conquer your opponent.

Too many reviewers seem to not have internalised this. In my opinion, this is the hardest lesson a reviewer has to learn, and I want to share some thoughts.

27.11.2024 17:25  |  👍 47    🔁 9    💬 3    📌 1

Would appreciate an add!

20.11.2024 12:48  |  👍 1    🔁 0    💬 0    📌 0

👋😶

17.11.2024 11:45  |  👍 0    🔁 0    💬 1    📌 0
Post image

Thanks to coauthors from S2W Inc. (Jin-Woo Chung, Keuntae Park), and KAIST (professors Kimin Lee and Seungwon Shin)!

You can find our paper here: arxiv.org/abs/2410.23684 (11/11)

12.11.2024 05:10  |  👍 0    🔁 0    💬 0    📌 0

Trustworthy models require more reliable tokenization, with robustness that extends beyond the training distribution.
Tokenizer research has surged this year. I'm hoping to share that there are more tokenizer-rooted vulnerabilities beyond undertrained tokens. (10/11)

12.11.2024 05:09  |  👍 0    🔁 0    💬 1    📌 0

But why?

During training, incomplete tokens can co-occur with only a few other tokens because of their byte syntax.
And since they can resolve to many different characters, they are also trained in semantically ambiguous contexts.
We hypothesize these factors can cause fragile token representations. (9/11)

12.11.2024 05:09  |  👍 0    🔁 0    💬 1    📌 0

This was very surprising, especially considering that the model never sees "<0x9F>" and "能" as adjacent tokens during training (the tokenizer always merges them into a single token).
Yet it was more reliable than using the original incomplete tokens. (8/11)

12.11.2024 05:08  |  👍 0    🔁 0    💬 1    📌 0
Post image

"But a phrase like ΰ€Ÿθƒ½ is very OOD. Are you sure these hallucinations are a tokenization problem?"

We think so! When we tokenize the same phrase differently to *avoid* incomplete tokens, the models generally performed much better (including a 93% reduction in Llama3.1). (7/11)

12.11.2024 05:08  |  👍 0    🔁 0    💬 1    📌 0
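(A hedged sketch of one way to obtain such an alternative, character-aligned tokenization, assuming a Hugging Face tokenizer; the model name is a placeholder, and this is not necessarily how the paper constructs its re-tokenization.)

```python
from transformers import AutoTokenizer

# Placeholder model name; any byte-level-BPE or byte-fallback tokenizer works here.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

phrase = "ट能"

# Default tokenization: for phrases like this one, it can contain incomplete
# tokens that straddle the character boundary.
default_ids = tok.encode(phrase, add_special_tokens=False)

# Character-aligned alternative: encode each character separately so no token
# spans two characters. (Caveat: some tokenizers prepend a word-boundary
# marker when encoding a lone character; a careful version would strip it.)
alt_ids = []
for ch in phrase:
    alt_ids.extend(tok.encode(ch, add_special_tokens=False))

print(tok.convert_ids_to_tokens(default_ids))
print(tok.convert_ids_to_tokens(alt_ids))
```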
Post image

We prepare up to 100 improbable bigrams for each tokenizer, and use comparable complete-token bigrams as baselines.
Improbable bigrams had significantly higher hallucination rates.
(For this, we only used trained tokens, to remove the influence of glitch tokens.) (6/11)

12.11.2024 05:08  |  👍 0    🔁 0    💬 1    📌 0
Post image

We test a model's ability to repeat a target phrase in three different scenarios, which should be doable even for meaningless phrases.
A target phrase is counted as hallucinatory only if the model fails to repeat it in all 3 prompts. (5/11)

12.11.2024 05:08  |  👍 0    🔁 0    💬 1    📌 0
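(A hedged sketch of such a repetition check. The three prompt templates below are illustrative placeholders, not the paper's actual prompts, and `generate` stands in for whatever completion or chat API is being evaluated.)

```python
# Hypothetical prompt wordings; the paper's three scenarios may differ.
PROMPT_TEMPLATES = [
    'Repeat the following phrase exactly: "{p}"',
    'Please write back this text verbatim: "{p}"',
    'Copy the text between the quotation marks: "{p}"',
]


def is_hallucinatory(phrase: str, generate) -> bool:
    """Flag a phrase only if the model fails to reproduce it under all prompts."""
    for template in PROMPT_TEMPLATES:
        output = generate(template.format(p=phrase))
        if phrase in output:      # a single successful repetition is enough to pass
            return False
    return True


# Trivial usage with an echo "model" (always passes):
print(is_hallucinatory("ट能", generate=lambda prompt: prompt))  # False
```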
Post image Post image

We can analyze each incomplete token's structure based on its starting bytes and continuation bytes. We can then find which tokens have complementary structures.
If the combined pair re-encodes back to the same incomplete tokens, it is a legal incomplete bigram. (4/11)

12.11.2024 05:07  |  👍 0    🔁 0    💬 1    📌 0
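(A minimal plain-Python sketch of the pairing check described in the post above, not the paper's code: a token counts as incomplete if its raw bytes are not valid UTF-8 on their own, and a pair is complementary if the concatenated bytes decode cleanly. The final re-encoding check, which needs the actual tokenizer, is only noted in a comment.)

```python
def is_incomplete(token: bytes) -> bool:
    """A token is 'incomplete' if its raw bytes are not valid UTF-8 on their own."""
    try:
        token.decode("utf-8")
        return False
    except UnicodeDecodeError:
        return True


def has_complementary_structure(left: bytes, right: bytes) -> bool:
    """Two incomplete tokens are complementary if their concatenation decodes
    cleanly, i.e. the trailing stray bytes of `left` are resolved by `right`."""
    if not (is_incomplete(left) and is_incomplete(right)):
        return False
    try:
        (left + right).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The post's final condition, that re-encoding the decoded string reproduces
# exactly this token pair (a "legal incomplete bigram"), needs the actual
# tokenizer and is omitted from this sketch.
left = bytes([0xE0, 0xA4])                    # '<0xE0><0xA4>'
right = bytes([0x9F]) + "能".encode("utf-8")  # '<0x9F>能'
print(has_complementary_structure(left, right))  # True -> together they decode to "ट能"
```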
Post image

ट能 combines two "incomplete tokens" ('<0xE0><0xA4>' and '<0x9F>能').
Such tokens with stray bytes rely on adjacent tokens' stray bytes to resolve as a character.
If two such tokens combine into an "improbable bigram" like ट能, we get a phrase that causes model errors. (3/11)

12.11.2024 05:07  |  👍 0    🔁 0    💬 1    📌 0
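(For concreteness, a plain-Python view of the bytes behind that example; the token names follow the byte-fallback notation used in the post above.)

```python
# "ट能" is two 3-byte UTF-8 characters, but the improbable bigram splits the
# byte string across the character boundary rather than along it.
phrase = "ट能"
print([hex(b) for b in phrase.encode("utf-8")])
# ['0xe0', '0xa4', '0x9f', '0xe8', '0x83', '0xbd']
#   ट = E0 A4 9F           能 = E8 83 BD
#
# Character-aligned grouping:  (E0 A4 9F) (E8 83 BD)
# Improbable-bigram grouping:  (E0 A4) (9F E8 83 BD)
#                               = '<0xE0><0xA4>' + '<0x9F>能'
```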

You might be familiar with this kind of model behavior from undertrained tokens (SolidGoldMagikarp, $PostalCodesNL). However, what we found was a completely separate phenomenon.
These hallucinatory behaviors persist even when we limit the vocabulary to trained tokens! (2/11)

12.11.2024 05:07  |  👍 0    🔁 0    💬 1    📌 0
Post image

#nlp
Have you ever wondered what "ट能" means?
Probably not, since it's not a meaningful phrase.
But if you ever did, any well-trained LLM should be able to tell you that. Right?
Not quite! We discover that phrases like "ट能" trigger vulnerabilities in Byte-Level BPE Tokenizers. (1/11)

12.11.2024 05:06  |  👍 0    🔁 0    💬 1    📌 1
Post image

A platform for coexistence.

08.11.2024 05:21  |  👍 1    🔁 0    💬 0    📌 0

Hello World!

The sky really is bluer on the other side.

08.11.2024 05:05  |  👍 9    🔁 0    💬 0    📌 0
