
garreth

@garrethlee.bsky.social

🇮🇩 | Co-Founder at Mundo AI (YC W25) | ex-{Hugging Face, Cohere}

82 Followers  |  76 Following  |  9 Posts  |  Joined: 17.11.2024

Latest posts by garrethlee.bsky.social on Bluesky

Number Tokenization Blog - a Hugging Face Space by huggingface

All this history is nice, but which method actually performs best for math?

Read our latest blog to find out:
huggingface.co/spaces/huggi...

[6/N]

16.12.2024 17:31 · 👍 7    🔁 1    💬 1    📌 0

Rumor has it that earlier Claude models used a modified three-digit tokenization, processing numbers right-to-left instead of left-to-right.

This method mirrors how we often read and interpret numbers, like grouping digits with commas. Theoretically, this should help with math reasoning!
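
For intuition, here's a minimal Python sketch (my own illustration, not Anthropic's actual tokenizer) comparing right-to-left and left-to-right three-digit chunking. Right-to-left grouping lines up with place value (thousands, millions, ...), which is the intuition behind the rumored design:

```python
def chunk_digits(number: str, right_to_left: bool = True, size: int = 3):
    """Split a digit string into groups of at most `size` digits."""
    if right_to_left:
        # Group from the right, the way we place commas: 1234567 -> 1 234 567
        chunks = []
        while number:
            chunks.append(number[-size:])
            number = number[:-size]
        return chunks[::-1]
    # Group from the left: 1234567 -> 123 456 7
    return [number[i:i + size] for i in range(0, len(number), size)]

print(chunk_digits("1234567", right_to_left=True))   # ['1', '234', '567']
print(chunk_digits("1234567", right_to_left=False))  # ['123', '456', '7']
```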

[5/N]

16.12.2024 17:31 · 👍 4    🔁 0    💬 1    📌 0

Alas, tokenizing numbers as digits was costly:

A 10-digit number now took 10 tokens instead of 3-4, roughly 2-3x more than before. That's a significant hit on training & inference costs!

LLaMA 3 fixed this by tokenizing numbers in groups of up to three digits, balancing compression and consistency.
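
A quick back-of-the-envelope comparison of the two schemes on a 10-digit number (token counts only, not a real tokenizer; the number is an arbitrary example):

```python
import math

number = "9876543210"  # an arbitrary 10-digit number

digit_tokens = len(number)                       # one token per digit -> 10
three_digit_tokens = math.ceil(len(number) / 3)  # groups of up to 3 digits -> 4

print(digit_tokens, three_digit_tokens)  # 10 4
```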

[4/N]

16.12.2024 17:31 · 👍 4    🔁 0    💬 1    📌 0

Then came LLaMA 1, which took a clever approach to fix number inconsistencies: it tokenized numbers into individual digits (0-9), meaning any number, no matter how large, could now be represented using a vocabulary of just 10 digit tokens.

The consistent representation of numbers made mathematical reasoning much better!
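
A minimal sketch of the idea (hypothetical token ids, not LLaMA's real vocabulary): every number, however long, maps onto the same 10 digit tokens:

```python
# Hypothetical 10-entry digit vocabulary; a real tokenizer assigns different ids.
DIGIT_TO_ID = {str(d): d for d in range(10)}

def tokenize_number(number: str) -> list[int]:
    """Represent any number as a sequence drawn from just 10 digit tokens."""
    return [DIGIT_TO_ID[digit] for digit in number]

# 20 digits -> 20 tokens, all drawn from the same 10-token vocabulary
print(tokenize_number("31415926535897932384"))
```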

[3/N]

16.12.2024 17:31 · 👍 3    🔁 0    💬 1    📌 0

When GPT-2 came out in 2019, its tokenizer used byte-pair encoding (BPE), still common today:

• Merges frequent substrings into single tokens, yielding much shorter sequences than feeding in individual characters
• However, the vocabulary depends on the training data
• Common numbers (e.g., 1999) get single tokens; others are split into arbitrary pieces (illustrated below)
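
To see this inconsistency yourself, here's a short sketch using the GPT-2 tokenizer from the Hugging Face `transformers` library (assuming it's installed and the public "gpt2" checkpoint is available; exact splits depend on the learned vocabulary):

```python
from transformers import AutoTokenizer

# Load the original GPT-2 BPE tokenizer.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

for number in ["1999", "2024", "1234567890"]:
    # Common numbers tend to be a single token; rarer ones get split into arbitrary chunks.
    print(number, tokenizer.tokenize(number))
```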

[2/N]

16.12.2024 17:31 · 👍 3    🔁 0    💬 1    📌 0

🚀 With Meta's recent paper replacing tokenization in LLMs with patches 🩹, I figured it's a great time to revisit how tokenization has evolved over the years using everyone's favourite medium - memes!

Let's take a trip down memory lane!

[1/N]

16.12.2024 17:31 · 👍 33    🔁 11    💬 4    📌 4

Shouted out by the goat 🥹🤗

25.11.2024 16:07 · 👍 2    🔁 0    💬 1    📌 0
GitHub - garrethlee/gcmt: A simple CLI tool that uses LLMs to automatically generate meaningful & conventional commit messages

github.com/garrethlee/g...

25.11.2024 04:31 · 👍 1    🔁 0    💬 0    📌 0

I made a simple CLI tool to write conventional git commit messages using the Hugging Face Inference API 🤗 (with some useful functionality baked into it)

➡️ To install: `pip install gcmt`

25.11.2024 04:31 · 👍 2    🔁 0    💬 1    📌 0

@garrethlee is following 20 prominent accounts