
garreth

@garrethlee.bsky.social

🇮🇩 | Co-Founder at Mundo AI (YC W25) | ex-{Hugging Face, Cohere}

82 Followers  |  76 Following  |  9 Posts  |  Joined: 17.11.2024

Latest posts by garrethlee.bsky.social on Bluesky

Number Tokenization Blog - a Hugging Face Space by huggingface

All this history is nice, but which method actually performs best for math?

Read our latest blog to find out:
huggingface.co/spaces/huggi...

[6/N]

16.12.2024 17:31 · 👍 7    🔁 1    💬 1    📌 0

Rumor has it that earlier Claude models used a modified three-digit tokenization, processing numbers right-to-left instead of left-to-right.

This method mirrors how we often read and interpret numbers, like grouping digits with commas. Theoretically, this should help with math reasoning!
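
For intuition, here's a minimal Python sketch (my own illustration, not Anthropic's actual tokenizer) comparing right-to-left and left-to-right three-digit chunking. Right-to-left grouping lines up with place value (thousands, millions, ...), which is the intuition behind the rumored design:

```python
def chunk_digits(number: str, right_to_left: bool = True, size: int = 3):
    """Split a digit string into groups of at most `size` digits."""
    if right_to_left:
        # Group from the right, the way we place commas: 1234567 -> 1 234 567
        chunks = []
        while number:
            chunks.append(number[-size:])
            number = number[:-size]
        return chunks[::-1]
    # Group from the left: 1234567 -> 123 456 7
    return [number[i:i + size] for i in range(0, len(number), size)]

print(chunk_digits("1234567", right_to_left=True))   # ['1', '234', '567']
print(chunk_digits("1234567", right_to_left=False))  # ['123', '456', '7']
```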

[5/N]

16.12.2024 17:31 · 👍 4    🔁 0    💬 1    📌 0

Alas, tokenizing numbers as digits was costly:

A 10-digit number now took 10 tokens instead of 3-4, roughly 2-3x more than before. That's a significant hit on training & inference costs!

LLaMA 3 fixed this by tokenizing numbers in groups of up to three digits, balancing compression and consistency.
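
A quick back-of-the-envelope comparison of the two schemes on a 10-digit number (token counts only, not a real tokenizer; the number is an arbitrary example):

```python
import math

number = "9876543210"  # an arbitrary 10-digit number

digit_tokens = len(number)                       # one token per digit -> 10
three_digit_tokens = math.ceil(len(number) / 3)  # groups of up to 3 digits -> 4

print(digit_tokens, three_digit_tokens)  # 10 4
```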

[4/N]

16.12.2024 17:31 · 👍 4    🔁 0    💬 1    📌 0

Then came LLaMA 1, which took a clever approach to fix number inconsistencies: it tokenized numbers into individual digits (0-9), meaning any number, no matter how large, could now be represented using a vocabulary of just 10 digit tokens.

The consistent representation of numbers made mathematical reasoning much better!
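
A minimal sketch of the idea (hypothetical token ids, not LLaMA's real vocabulary): every number, however long, maps onto the same 10 digit tokens:

```python
# Hypothetical 10-entry digit vocabulary; a real tokenizer assigns different ids.
DIGIT_TO_ID = {str(d): d for d in range(10)}

def tokenize_number(number: str) -> list[int]:
    """Represent any number as a sequence drawn from just 10 digit tokens."""
    return [DIGIT_TO_ID[digit] for digit in number]

# 20 digits -> 20 tokens, all drawn from the same 10-token vocabulary
print(tokenize_number("31415926535897932384"))
```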

[3/N]

16.12.2024 17:31 · 👍 3    🔁 0    💬 1    📌 0

When GPT-2 came out in 2019, its tokenizer used byte-pair encoding (BPE), still common today:

• Merges frequent substrings into single tokens, yielding much shorter sequences than feeding in individual characters
• However, the vocabulary depends on the training data
• Common numbers (e.g., 1999) get single tokens; others are split into arbitrary pieces (illustrated below)
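
To see this inconsistency yourself, here's a short sketch using the GPT-2 tokenizer from the Hugging Face `transformers` library (assuming it's installed and the public "gpt2" checkpoint is available; exact splits depend on the learned vocabulary):

```python
from transformers import AutoTokenizer

# Load the original GPT-2 BPE tokenizer.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

for number in ["1999", "2024", "1234567890"]:
    # Common numbers tend to be a single token; rarer ones get split into arbitrary chunks.
    print(number, tokenizer.tokenize(number))
```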

[2/N]

16.12.2024 17:31 · 👍 3    🔁 0    💬 1    📌 0

🚀 With Meta's recent paper replacing tokenization in LLMs with patches 🩹, I figured it's a great time to revisit how tokenization has evolved over the years using everyone's favourite medium - memes!

Let's take a trip down memory lane!

[1/N]

16.12.2024 17:31 · 👍 33    🔁 11    💬 4    📌 4

Shouted out by the goat 🥹🤗

25.11.2024 16:07 · 👍 2    🔁 0    💬 1    📌 0
GitHub - garrethlee/gcmt: A simple CLI tool that uses LLMs to automatically generate meaningful & conventional commit messages

github.com/garrethlee/g...

25.11.2024 04:31 · 👍 1    🔁 0    💬 0    📌 0

I made a simple CLI tool to write conventional git commit messages using the Hugging Face Inference API 🤗 (with some useful functionality baked into it)

➡️ To install: `pip install gcmt`

25.11.2024 04:31 · 👍 2    🔁 0    💬 1    📌 0

@garrethlee is following 20 prominent accounts