Christopher Akiki

@cakiki.bsky.social

research scientist at ScaDS.AI Leipzig in nlp, ir, and ml. @hf.co fellow. @lichess.org team member. @kaggle.com datasets expert.

1,224 Followers  |  71 Following  |  41 Posts  |  Joined: 01.06.2023  |  2.2055

Latest posts by cakiki.bsky.social on Bluesky

- Dataset: huggingface.co/datasets/Hug...
- Embeddings: huggingface.co/datasets/air... (H/T @loubnabnl.hf.co for recommending this)

31.10.2025 14:52 - 👍 1    🔁 0    💬 0    📌 0
We're doing AI all wrong. Here's how to get it right
Artificial intelligence is changing everything - but at what cost? AI sustainability expert Sasha Luccioni exposes how tech companies' massive data centers are burning through energy and wrecking the ...

- Sasha's talk: ted.com/talks/sasha_...
- Tools: datamapplot and EVōC by @lelandmcinnes.bsky.social and colleagues at the Tutte Institute, and openTSNE by @pavlinpolicar.bsky.social

31.10.2025 14:52 - 👍 1    🔁 0    💬 1    📌 0
Scatterplot of a document corpus with cluster information in the form of colors and cluster labels. Includes labels like computing, mental health, religion, etc.

I made this annotated scatter plot of 1 million FineWeb-Edu documents for @sashamtl.bsky.social's new TED talk.

31.10.2025 14:52 - 👍 4    🔁 1    💬 1    📌 0
A four frame comic about the Cambodian genocide. In the first frame, a man can be seen from the back, looking at a window. The text reads: "I never ask my father about Cambodia." In the second frame, a silhouette of the father can be seen, screaming in his sleep, arms reaching into nothingness. The text reads: "What could he say, really, that he didn't already scream in his nightmares?". The third frame shows the father pondering on something, expression neutral. The text reads: But sometimes he would say "It's funny, when it happened, all the fish left the river. They never came back." . The fourth frame shows the father lowering his head, thinking. He says: "I guess they knew."

The second page of the comic on the Cambodian genocide, and everything to come. The first frame shows the daughter, young, silent and unsure of what to say. The text reads: "I didn't say anything then." In the present, the daughter, now an adult, looks at the Seine in Paris. The text reads "But I often think of it now.". The daughter then crouches near the Seine to see if there are any fish left. She says: "Are you still here?".

When the fish left the river:

28.10.2025 00:00 - 👍 88    🔁 27    💬 2    📌 1

Also really love how organic the plot looks with "inferno" (left) and "viridis" (right).

27.10.2025 10:42 - 👍 4    🔁 1    💬 0    📌 0

Update: the color map in this post is misleading. See the quoted post for context.

bsky.app/profile/caki...

26.10.2025 16:07 - 👍 2    🔁 0    💬 0    📌 0

Thanks to @jamesabednar.bsky.social I realized I had used the wrong background color for the colormap I had chosen. This is another version of the plot (different embeddings) with the corrected background.

26.10.2025 16:06 - 👍 1    🔁 0    💬 1    📌 1

Glad you like it! Nothing yet, only a very messy jupyter notebook for now. I will share the code and data pipelines once they're all cleaned up.

26.10.2025 15:05 - 👍 2    🔁 0    💬 0    📌 0

Huge props to @lelandmcinnes.bsky.social for optimizing the hammer bundling code in datashader, and to Barrett Lyon for his inspiring work at the Opte Project.

26.10.2025 13:39 - 👍 0    🔁 0    💬 1    📌 0
organic looking graph of the BGP nodes of the internet. black and white

Map of the internet: 1.3M nodes (BGP)

26.10.2025 13:39 - 👍 29    🔁 5    💬 3    📌 2
Allô maman bobo
YouTube video by Alain Souchon - Topic Allô maman bobo

"Allô maman bobo" is another poignant classic. www.youtube.com/watch?v=AgdU...

14.10.2025 13:36 - 👍 4    🔁 0    💬 0    📌 0
Photo of three hard disk drives.

We're cooking.. 👀

07.10.2025 11:52 - 👍 18    🔁 1    💬 4    📌 0
very colorful scatterplot of player deaths in mario maker levels. "inferno" colormap. some level elements highlighted.

526.9 million player deaths in 24.7 million levels of Super Mario Maker 2. Data by @tgr.bsky.social

28.09.2025 15:54 - 👍 4    🔁 0    💬 0    📌 0

Such a cutie! 🥺

21.07.2025 15:56 - 👍 1    🔁 0    💬 0    📌 0

- repo: github.com/apple/embedd...
- live demo: apple.github.io/embedding-at...

11.07.2025 14:17 - 👍 2    🔁 0    💬 0    📌 0
screenshot from the embeddings atlas repo

example density plot exploring a wine dataset.

Really cool new embeddings exploration tool by @domoritz.de and colleagues from Apple. Can't wait to build with this. Also includes a streamlit component and a Jupyter widget.

11.07.2025 14:17 - 👍 2    🔁 0    💬 1    📌 0

Would you happen to have your university lectures on OT anywhere online?

04.06.2025 14:25 - 👍 0    🔁 0    💬 0    📌 0

We are grateful for your sacrifice 🫡

06.03.2025 10:07 - 👍 0    🔁 0    💬 0    📌 0

Licensing is weird though, they say it's GPL but also include this: "To use the compiled binaries, you must own the game".

28.02.2025 12:14 - 👍 0    🔁 0    💬 1    📌 0
box art from 1996 version of CNC RED ALERT

Woah! EA just open sourced "Command and Conquer: Red Alert" and a bunch of other CnC games! github.com/electronicar...

28.02.2025 12:12 - 👍 2    🔁 1    💬 1    📌 0
We observe that the merge list of LLAMA, LLAMA 3, GEMMA, and MISTRAL contain clusters of redundant merge rules. For instance, in the LLAMA 3 merge list, we see the sequence of merges _ the, _t he, and _th e, as well as _ and, _a nd, and _an d. Because the merge path for every token is unique, it is impossible for more than one of these merges to ever be used, and we empirically verify this by applying the tokenizer to a large amount of text.
We find that this is an artifact of the conversion from sentencepiece to Huggingface tokenizers format. To construct the merge list, the conversion algorithm naively combines every pair of tokens in the vocabulary, and then sorts them by token ID, which represents order of creation. While this is functionally correct, because the redundant merges are not products of the BPE algorithm (i.e., they do not actually represent the most-likely next-merge), we need to remove them to apply our algorithm. To do this, we do some simple pre-processing: for every cluster of redundant merges, we record the path of merges that achieves each merge; the earliest path is the one that would be taken, so we keep that merge and remove the rest.
As an aside, this means that a tokenizer's merge list can be completely reconstructed from its vocabulary list ordered by token creation. Given only the resulting token at each time step, we can derive the corresponding merge.

This is also addressed in the appendix of @alisawuffles.bsky.social and colleagues' paper on BPE mixture inference. I think it might have been discovered by @soldaini.net if I'm not mistaken.

arxiv.org/abs/2407.16607

28.02.2025 10:47 - 👍 3    🔁 1    💬 1    📌 0
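
The aside quoted in the screenshot (deriving each merge from the vocabulary in creation order) can be sketched roughly as below. This is a toy illustration under stated assumptions, not the paper's code: `apply_merges` and `reconstruct_merges` are hypothetical helper names, tokens are plain strings built from single characters, and pre-tokenization and byte-level details are ignored.

```python
def apply_merges(text, merges):
    """Tokenize `text` by applying each merge rule in creation order."""
    symbols = list(text)  # start from single characters
    for left, right in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                out.append(left + right)  # fuse the matching pair
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols


def reconstruct_merges(tokens_in_creation_order):
    """Derive one merge per multi-character token, in creation order.

    Assumption: every token was produced by merging exactly two symbols
    that already existed when it was created.
    """
    merges = []
    for token in tokens_in_creation_order:
        # Tokenize the new token with the merges known so far; the two
        # pieces that remain are the pair that was merged to create it.
        parts = apply_merges(token, merges)
        if len(parts) != 2:
            raise ValueError(f"cannot derive a unique merge for {token!r}: {parts}")
        merges.append((parts[0], parts[1]))
    return merges


# Toy vocabulary mirroring the "_ the" / "_ and" example from the screenshot.
ordered = ["_t", "_th", "_the", "_a", "_an", "_and"]
recovered = reconstruct_merges(ordered)
print(recovered)
```

In this toy case each token yields exactly one merge, so the redundant alternatives mentioned in the screenshot (such as `_t he`) never appear in the recovered list.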

Shouldn't "l" and "o" both still be part of the vocab along with "lo" after the merge? Vocab size should grow, not shrink.

06.02.2025 19:01 - 👍 3    🔁 0    💬 0    📌 0
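
A toy BPE training loop may make this point concrete: each merge adds one new token while the symbols it was built from stay in the vocabulary. This is only a minimal sketch; `train_bpe` and the example words are made up for illustration and do not correspond to any particular library.

```python
from collections import Counter


def train_bpe(words, num_merges):
    """Toy BPE trainer: returns (vocab, merges) after `num_merges` steps."""
    # Each word starts as a tuple of single-character symbols.
    corpus = Counter(tuple(word) for word in words)
    vocab = sorted({ch for word in words for ch in word})
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken deterministically.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merged = best[0] + best[1]
        merges.append(best)
        if merged not in vocab:
            vocab.append(merged)  # the vocabulary only ever grows
        # Rewrite the corpus with the new merged symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return vocab, merges


base_vocab, _ = train_bpe(["low", "low", "lower", "lot"], 0)
vocab, merges = train_bpe(["low", "low", "lower", "lot"], 1)
print(len(base_vocab), len(vocab), merges)
```

After the single merge, "l", "o", and "lo" are all present and the vocabulary is one entry larger than before.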
Screenshot of Lichess organization overview page on Kaggle

Lichess is now on @kaggle.com!

Use our puzzles, openings, and engine evaluation datasets directly in your Kaggle notebooks: https://www.kaggle.com/organizations/lichess ♟️

02.02.2025 12:03 - 👍 53    🔁 4    💬 1    📌 0
foursquare/fsq-os-places · Datasets at Hugging Face
We're on a journey to advance and democratize artificial intelligence through open source and open science.

Forgot to link the dataset: huggingface.co/datasets/fou...

08.12.2024 13:35 - 👍 5    🔁 0    💬 0    📌 0
world map with points of interest

The folks at Foursquare released a @hf.co dataset of 104.5 million places of interest and here's all of them plotted using datashader

08.12.2024 13:34 - 👍 16    🔁 3    💬 2    📌 0
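
For readers curious how a plot with this many points stays legible, the general idea is to aggregate points into a fixed pixel grid and shade by count rather than draw each marker. The sketch below is a pure-Python stand-in for that idea and does not use datashader; `rasterize` and `log_shade` are hypothetical names, and the sample coordinates are made up.

```python
import math


def rasterize(points, width, height,
              lon_range=(-180.0, 180.0), lat_range=(-90.0, 90.0)):
    """Count (lon, lat) points per cell of a width x height grid."""
    lon_min, lon_max = lon_range
    lat_min, lat_max = lat_range
    grid = [[0] * width for _ in range(height)]
    for lon, lat in points:
        if not (lon_min <= lon <= lon_max and lat_min <= lat <= lat_max):
            continue  # skip points outside the canvas
        col = min(int((lon - lon_min) / (lon_max - lon_min) * width), width - 1)
        # Row 0 is the top of the image, so latitude is flipped.
        row = min(int((lat_max - lat) / (lat_max - lat_min) * height), height - 1)
        grid[row][col] += 1
    return grid


def log_shade(grid):
    """Map counts to [0, 1] on a log scale so dense cells do not drown out sparse ones."""
    peak = max(max(row) for row in grid)
    if peak == 0:
        return [[0.0] * len(row) for row in grid]
    return [[math.log1p(count) / math.log1p(peak) for count in row] for row in grid]


# Made-up sample: two points at the origin, one near the north-east corner,
# and one outside the valid longitude range.
sample = [(0.0, 0.0), (0.0, 0.0), (179.9, 89.9), (200.0, 0.0)]
grid = rasterize(sample, width=4, height=2)
shade = log_shade(grid)
print(grid)
```

The resulting intensity grid can then be passed through any colormap; the cost of drawing depends on the grid size rather than on the number of points.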

I'm currently working on an interactive version which will hopefully answer some of these questions! An initial cluster analysis didn't align well with the puzzle labels that are included in the dataset.

06.12.2024 18:29 - 👍 1    🔁 0    💬 0    📌 0

It's part of an ongoing paper project which I hope to open source very soon!

06.12.2024 18:27 - 👍 1    🔁 0    💬 1    📌 0

I recently used the @lichess.org puzzles dataset to experiment with chess position embeddings and visualize 4.5M starting positions. (hf.co/datasets/Lic...)

06.12.2024 13:00 - 👍 28    🔁 4    💬 2    📌 1

The Lichess database of games, puzzles, and engine evaluations is now on @hf.co - https://huggingface.co/Lichess. Billions of chess data points to download, query, and stream and we're excited to see what you'll build with it! ♟️ 🤗

06.12.2024 09:46 - 👍 94    🔁 23    💬 3    📌 2
