@cakiki.bsky.social
research scientist at ScaDS.AI Leipzig in nlp, ir, and ml. @hf.co fellow. @lichess.org team member. @kaggle.com datasets expert.
- Dataset: huggingface.co/datasets/Hug...
- Embeddings: huggingface.co/datasets/air... (H/T @loubnabnl.hf.co for recommending this)
- Sasha's talk: ted.com/talks/sasha_...
- Tools: datamapplot and EVōC by @lelandmcinnes.bsky.social and colleagues at the Tutte Institute and openTSNE by @pavlinpolicar.bsky.social
Scatterplot of a document corpus with cluster information in the form of colors and cluster labels. Includes labels like computing, mental health, religion, etc.
I made this annotated scatter plot of 1 million FineWeb-Edu documents for @sashamtl.bsky.social's new TED talk.
31.10.2025 14:52
A four-frame comic about the Cambodian genocide. In the first frame, a man can be seen from the back, looking at a window. The text reads: "I never ask my father about Cambodia." In the second frame, a silhouette of the father can be seen, screaming in his sleep, arms reaching into nothingness. The text reads: "What could he say, really, that he didn't already scream in his nightmares?" The third frame shows the father pondering something, expression neutral. The text reads: But sometimes he would say, "It's funny, when it happened, all the fish left the river. They never came back." The fourth frame shows the father lowering his head, thinking. He says: "I guess they knew."
The second page of the comic on the Cambodian genocide, and everything to come. The first frame shows the daughter, young, silent and unsure of what to say. The text reads: "I didn't say anything then." In the present, the daughter, now an adult, looks at the Seine in Paris. The text reads: "But I often think of it now." The daughter then crouches near the Seine to see if there are any fish left. She says: "Are you still here?"
When the fish left the river:
28.10.2025 00:00
Also really love how organic the plot looks with "inferno" (left) and "viridis" (right).
27.10.2025 10:42
Update: the color map in this post is misleading. See the quoted post for context.
bsky.app/profile/caki...
Thanks to @jamesabednar.bsky.social I realized I had used the wrong background color for the colormap I had chosen. This is another version of the plot (different embeddings) with the corrected background.
26.10.2025 16:06
Glad you like it! Nothing yet, only a very messy Jupyter notebook for now. I will share the code and data pipelines once they're all cleaned up.
26.10.2025 15:05
Huge props to @lelandmcinnes.bsky.social for optimizing the hammer bundling code in datashader and to Barrett Lyon's inspiring work at the Opte Project.
26.10.2025 13:39
organic looking graph of the BGP nodes of the internet. black and white
Map of the internet: 1.3M nodes (BGP)
26.10.2025 13:39
"Allô maman bobo" is another poignant classic. www.youtube.com/watch?v=AgdU...
14.10.2025 13:36
Photo of three hard disk drives.
We're cooking..
07.10.2025 11:52
very colorful scatterplot of player deaths in mario maker levels. "inferno" colormap. some level elements highlighted.
526.9 million player deaths in 24.7 million levels of Super Mario Maker 2. Data by @tgr.bsky.social
28.09.2025 15:54
Such a cutie! 🥺
21.07.2025 15:56
- repo: github.com/apple/embedd...
- live demo: apple.github.io/embedding-at...
screenshot from the embeddings atlas repo
example density plot exploring a wine dataset.
Really cool new embeddings exploration tool by @domoritz.de and colleagues from Apple. Can't wait to build with this. Also includes a Streamlit component and a Jupyter widget.
11.07.2025 14:17
Would you happen to have your university lectures on OT anywhere online?
04.06.2025 14:25
We are grateful for your sacrifice 🫡
06.03.2025 10:07
Licensing is weird though, they say it's GPL but also include this: "To use the compiled binaries, you must own the game".
28.02.2025 12:14
box art from 1996 version of CNC RED ALERT
Woah! EA just open sourced "Command and Conquer: Red Alert" and a bunch of other CnC games! github.com/electronicar...
28.02.2025 12:12
We observe that the merge list of LLAMA, LLAMA 3, GEMMA, and MISTRAL contain clusters of redundant merge rules. For instance, in the LLAMA 3 merge list, we see the sequence of merges _ the, _t he, and _th e, as well as _ and, _a nd, and _an d. Because the merge path for every token is unique, it is impossible for more than one of these merges to ever be used, and we empirically verify this by applying the tokenizer to a large amount of text. We find that this is an artifact of the conversion from sentencepiece to Huggingface tokenizers format. To construct the merge list, the conversion algorithm naively combines every pair of tokens in the vocabulary, and then sorts them by token ID, which represents order of creation. While this is functionally correct, because the redundant merges are not products of the BPE algorithm (i.e., they do not actually represent the most-likely next-merge), we need to remove them to apply our algorithm. To do this, we do some simple pre-processing: for every cluster of redundant merges, we record the path of merges that achieves each merge; the earliest path is the one that would be taken, so we keep that merge and remove the rest. As an aside, this means that a tokenizer's merge list can be completely reconstructed from its vocabulary list ordered by token creation. Given only the resulting token at each time step, we can derive the corresponding merge.
This is also addressed in the appendix of @alisawuffles.bsky.social and colleagues' paper on BPE mixture inference. I think it might have been discovered by @soldaini.net if I'm not mistaken.
arxiv.org/abs/2407.16607
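The reconstruction described in the quoted appendix can be sketched in a few lines of Python. This is a toy illustration, not the paper's code or the Hugging Face conversion script: the vocabulary, the helper names (`apply_merges`, `reconstruct_merges`, `naive_candidates`), and the use of `_` as a word-boundary marker are all assumptions made for the example. The idea is that replaying the merges found so far on a new token's characters should leave exactly two pieces, and that pair is the merge that created the token.

```python
def apply_merges(symbols, merges):
    """Apply BPE merges, in creation order, to a sequence of symbols."""
    symbols = list(symbols)
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            # Join every adjacent (left, right) pair into a single symbol.
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


def reconstruct_merges(vocab_in_creation_order):
    """Derive one merge per multi-character token from an ordered vocabulary."""
    merges = []
    for token in vocab_in_creation_order:
        if len(token) == 1:
            continue  # base symbols are not produced by a merge
        pieces = apply_merges(token, merges)
        if len(pieces) != 2:
            raise ValueError(f"{token!r} is not reachable by a single merge: {pieces}")
        merges.append((pieces[0], pieces[1]))
    return merges


def naive_candidates(token, vocab):
    """Every split of `token` whose halves are both in the vocabulary."""
    known = set(vocab)
    return [
        (token[:i], token[i:])
        for i in range(1, len(token))
        if token[:i] in known and token[i:] in known
    ]


# Hypothetical toy vocabulary, ordered by token creation (base symbols first).
vocab = ["_", "t", "h", "e", "he", "_t", "the", "_the"]
merges = reconstruct_merges(vocab)

print(merges)                           # one merge per multi-character token
print(naive_candidates("_the", vocab))  # includes a redundant candidate
```

In this toy vocabulary, naive pairing proposes two ways to build `_the`, but replaying the earlier merges reaches only one of them, which is the one the reconstruction keeps.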
Shouldn't "l" and "o" both still be part of the vocab along with "lo" after the merge? Vocab size should grow, not shrink.
06.02.2025 19:01
Screenshot of Lichess organization overview page on Kaggle
Lichess is now on @kaggle.com!
Use our puzzles, openings, and engine evaluation datasets directly in your Kaggle notebooks: https://www.kaggle.com/organizations/lichess ♟️
Forgot to link the dataset: huggingface.co/datasets/fou...
08.12.2024 13:35
world map with points of interest
The folks at Foursquare released a @hf.co dataset of 104.5 million places of interest and here's all of them plotted using datashader
08.12.2024 13:34
I'm currently working on an interactive version which will hopefully answer some of these questions! An initial cluster analysis didn't align well with the puzzle labels that are included in the dataset.
06.12.2024 18:29
It's part of an ongoing paper project which I hope to open source very soon!
06.12.2024 18:27
I recently used the @lichess.org puzzles dataset to experiment with chess position embeddings and visualize 4.5M starting positions. (hf.co/datasets/Lic...)
06.12.2024 13:00
The Lichess database of games, puzzles, and engine evaluations is now on @hf.co - https://huggingface.co/Lichess. Billions of chess data points to download, query, and stream and we're excited to see what you'll build with it! ♟️ 🤗
06.12.2024 09:46