@eugeneonnlp.bsky.social
NLP PhD student @ Northeastern. Multilingual NLP, tokenizers. https://genesith.github.io/
I'll be presenting our work on Byte-level Tokenizer Vulnerabilities at the poster session at 2:00pm!
If you've ever encountered oddities or frustrations with #tokenization I'd love to chat about it! #EMNLP
great list, would love an add!
05.12.2024 06:57
To paraphrase Dennett (RIP), the goal of reviewing is to determine truth, not to conquer your opponent.
Too many reviewers seem not to have internalised this. In my opinion, this is the hardest lesson a reviewer has to learn, and I want to share some thoughts.
Would appreciate an add!
20.11.2024 12:48
17.11.2024 11:45
Thanks to coauthors from S2W Inc. (Jin-Woo Chung, Keuntae Park), and KAIST (professors Kimin Lee and Seungwon Shin)!
You can find our paper here: arxiv.org/abs/2410.23684 (11/11)
Trustworthy models require more reliable tokenization, with robustness that extends beyond the training distribution.
Tokenizer research has surged this year. I'm hoping to share that there are more tokenizer-rooted vulnerabilities beyond undertrained tokens. (10/11)
But why?
During training, incomplete tokens co-occur with only a few other tokens because of their byte syntax.
Since they can resolve to many different characters, they are also trained to be semantically ambiguous.
We hypothesize these factors can cause fragile token representations. (9/11)
This was very surprising, especially if you consider that the model was trained to never input/output the sequence of "<0x9F>" and "落" together (the tokenizer combines them into a single token).
Yet, this split sequence was more reliable than the original incomplete tokens. (8/11)
"But a phrase like ΰ€θ½ is very OOD. Are you sure these hallucinations are a tokenization problem?"
We think so! When we tokenize the same phrase differently to *avoid* incomplete tokens, the models generally performed much better (including a 93% reduction in Llama3.1). (7/11)
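(Not from the paper, just a sketch of what "tokenizing differently" can look like in practice: spell ट with explicit byte-fallback tokens so the incomplete-token merge never fires. Here `tok` is assumed to be a HuggingFace-style tokenizer whose vocab contains <0x00>–<0xFF> byte tokens; the exact token strings are illustrative.)

```python
# Hedged sketch, not the paper's code. `tok` = a byte-fallback BPE tokenizer.
phrase = "ट落"  # the improbable bigram from this thread

# Default segmentation: may produce incomplete tokens such as '<0x9F>落'.
default_ids = tok.encode(phrase, add_special_tokens=False)

# Alternative segmentation that avoids incomplete tokens: spell 'ट'
# (bytes E0 A4 9F) with three byte-fallback tokens, then encode '落' alone.
alt_ids = tok.convert_tokens_to_ids(["<0xE0>", "<0xA4>", "<0x9F>"]) \
          + tok.encode("落", add_special_tokens=False)

# Feed alt_ids to the model instead of default_ids; same surface string.
```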
We prepare up to 100 improbable bigrams for each tokenizer and use comparable complete-token bigrams as baselines.
Improbable bigrams led to significantly higher hallucination rates.
(For this, we only used trained tokens to remove the influence of glitch tokens.) (6/11)
We test a model's ability to repeat a target phrase in three different scenarios, which should be doable even for meaningless phrases.
A target phrase is considered hallucinatory only if the model fails to repeat it in all 3 prompts. (5/11)
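(A rough sketch of how the repetition check could be wired up; the three prompt templates and the `generate` callable are placeholders, not the paper's exact setup.)

```python
# Rough sketch of the repetition test; prompts and generate() are placeholders.
PROMPTS = [
    'Repeat the following text exactly: "{phrase}"',
    'Please copy this phrase verbatim: "{phrase}"',
    'Output the string "{phrase}" and nothing else.',
]

def is_hallucinatory(phrase: str, generate) -> bool:
    """Flag a phrase only if the model fails to reproduce it in every prompt."""
    return all(phrase not in generate(p.format(phrase=phrase)) for p in PROMPTS)
```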
We can analyze each incomplete token's structure based on starting bytes and continuation bytes. We can then find which tokens have complementary structures.
If the combined phrase re-encodes back to the same incomplete tokens, it is a legal incomplete bigram. (4/11)
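(My reconstruction of that structural check in code, not the authors' implementation; `encode` stands in for whichever byte-level tokenizer is being probed and is assumed to return each token's raw bytes.)

```python
def tail_need(token: bytes) -> int:
    """Continuation bytes the token still needs to finish its last character."""
    i, cont = len(token) - 1, 0
    while i >= 0 and token[i] & 0xC0 == 0x80:  # trailing 0b10xxxxxx bytes
        cont += 1
        i -= 1
    if i < 0:
        return 0
    lead = token[i]
    need = 3 if lead >= 0xF0 else 2 if lead >= 0xE0 else 1 if lead >= 0xC0 else 0
    return max(need - cont, 0)

def head_strays(token: bytes) -> int:
    """Stray continuation bytes at the start of the token."""
    n = 0
    while n < len(token) and token[n] & 0xC0 == 0x80:
        n += 1
    return n

def is_legal_incomplete_bigram(a: bytes, b: bytes, encode) -> bool:
    """Complementary byte structure + the phrase re-encodes to the same pair."""
    if tail_need(a) == 0 or tail_need(a) != head_strays(b):
        return False
    try:
        phrase = (a + b).decode("utf-8")  # also rules out leftover stray bytes
    except UnicodeDecodeError:
        return False
    return encode(phrase) == [a, b]  # re-encodability check (tokenizer-specific)
```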
ट落 combines two "incomplete tokens" ('<0xE0><0xA4>' and '<0x9F>落').
Such tokens with stray bytes rely on adjacent tokens' stray bytes to resolve as a character.
If two such tokens combine into an "improbable bigram" like ट落, we get a phrase that causes model errors. (3/11)
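(Pure byte arithmetic, runnable as-is, showing how the stray bytes resolve:)

```python
token_a = b"\xe0\xa4"                     # incomplete token '<0xE0><0xA4>'
token_b = b"\x9f" + "落".encode("utf-8")  # incomplete token '<0x9F>落'

# Neither token decodes on its own: token_a ends mid-character and
# token_b starts with a stray continuation byte.
for t in (token_a, token_b):
    try:
        t.decode("utf-8")
    except UnicodeDecodeError as err:
        print(repr(t), "->", err)

# Concatenated, the stray bytes resolve into 'ट' and the bigram appears.
print((token_a + token_b).decode("utf-8"))  # ट落
```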
You might be familiar with this kind of model behavior from undertrained tokens (SolidGoldMagikarp, $PostalCodesNL). However, what we found was a completely separate phenomenon.
These hallucinatory behaviors persist even when we limit the vocabulary to trained tokens! (2/11)
#nlp
Have you ever wondered what "ट落" means?
Probably not, since it's not a meaningful phrase.
But if you ever did, any well-trained LLM should be able to tell you that. Right?
Not quite! We discover that phrases like "ट落" trigger vulnerabilities in byte-level BPE tokenizers. (1/11)
A platform for coexistence.
08.11.2024 05:21
Hello World!
The sky really is bluer on the other side.