Stay tuned for more Open R1 updates!
huggingface.co/blog/open-r1...
🤗 Dataset: huggingface.co/datasets/ope...
12.02.2025 14:36
LLM reasoning labs will be eating good today!
We commandeered the HF cluster for a few days and generated 1.2M reasoning-filled solutions to 500k NuminaMath problems with DeepSeek-R1 🐳
Have fun!
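If you want to pull the data down, here's a minimal sketch with 🤗 datasets; the link above is truncated, so the dataset id used here is an assumption:

```python
# Minimal sketch: stream the R1-generated solutions and inspect one trace.
# The dataset id ("open-r1/OpenR1-Math-220k") is an assumption; the post's
# link is truncated, so check the actual URL above.
from datasets import load_dataset

ds = load_dataset("open-r1/OpenR1-Math-220k", split="train", streaming=True)

sample = next(iter(ds))
print(sample.keys())  # e.g. the problem and the reasoning-filled solution fields
```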
A plot showing increased performance of Llama-3.2-3B when pretrained on FineMath
Introducing 📐 FineMath: the best open math pre-training dataset with 50B+ tokens!
Math remains challenging for LLMs, and training on FineMath yields considerable gains over other math datasets, especially on GSM8K and MATH.
🤗 huggingface.co/datasets/Hug...
Here's a breakdown 🧵
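To poke at the data, a minimal streaming sketch; the "finemath-4plus" config name is an assumption matching the 4+ threshold discussed further down this thread:

```python
# Minimal sketch: stream the highest-quality FineMath subset.
# The config name "finemath-4plus" is an assumption (there should also be a
# "finemath-3plus" config for the looser threshold).
from datasets import load_dataset

finemath = load_dataset("HuggingFaceTB/finemath", "finemath-4plus",
                        split="train", streaming=True)
print(next(iter(finemath))["text"][:500])
```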
FineMath - The finest collection of mathematical content
We hope this dataset helps advance the performance of LLMs on math! We're also releasing all the ablation models in this collection, as well as the evaluation code.
Collection: huggingface.co/collections/...
Evaluation: github.com/huggingface/...
Below is the breakdown of the performance of each data source after decontamination: FineMath 4+ outperforms all other datasets for continued pre-training of Llama-3.2-3B-Base on 60B tokens.
19.12.2024 15:55
We got two high-quality datasets with 34B and 10B tokens, depending on the filtering threshold (3 vs 4).
We also augment the datasets by filtering the English text subset of InfiMM-WebMath-40B with our math classifier and adding it to FineMath.
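Conceptually, the threshold split looks like this (a sketch; the "score" field and the file name are hypothetical):

```python
# Sketch of the threshold split, assuming each page carries an integer
# quality score from the math classifier (the "score" field name and the
# input file are hypothetical).
from datasets import load_dataset

scored = load_dataset("json", data_files="scored_pages.jsonl", split="train")

finemath_3plus = scored.filter(lambda x: x["score"] >= 3)  # ~34B tokens
finemath_4plus = scored.filter(lambda x: x["score"] >= 4)  # ~10B tokens
```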
For the text extraction, we switched to Resiliparse with OWM's pipeline.
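For reference, a minimal Resiliparse extraction sketch; the flags here are illustrative, not OWM's exact settings:

```python
# Minimal HTML-to-text sketch with Resiliparse. extract_plain_text is
# Resiliparse's main extraction entry point; the flags are illustrative.
from resiliparse.extract.html2text import extract_plain_text

html = open("page.html", encoding="utf-8").read()
text = extract_plain_text(html, main_content=True, alt_texts=False)
print(text[:300])
```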
We then trained a classifier on Llama3's annotations to find pages with math reasoning and applied it in two stages. This helped us identify key math domains and recall high-quality math data.
huggingface.co/HuggingFaceT...
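The scoring step isn't spelled out in the post, so here's a hedged sketch in the FineWeb-Edu style (a small encoder with a regression head); the model id is an assumption, since the link above is truncated:

```python
# Sketch of scoring pages with a trained quality classifier, assuming a
# FineWeb-Edu-style regression head. The model id is an assumption.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "HuggingFaceTB/finemath-classifier"  # assumed id, see link above
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

inputs = tok("Let x be a prime number greater than 2 ...",
             return_tensors="pt", truncation=True)
with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()  # single regression logit
print(round(score))  # keep pages scoring >= 3 (or >= 4 for FineMath 4+)
```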
💡 It was time to re-extract the Common Crawl data directly.
We retrieved pages from FineWeb's URLs to retain its high-quality data. Then, we added back the math pages that earlier FineWeb filters had removed, such as those containing curly braces ("{}"), a common LaTeX pattern.
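A toy sketch of that recall logic, with the curly-brace check standing in for the real filters:

```python
# Toy sketch of the recall step: keep a re-extracted page if it was already
# in FineWeb, or if it looks like math that FineWeb's filters would have
# dropped (curly braces stand in for the actual filter set).
def keep_page(url: str, text: str, fineweb_urls: set[str]) -> bool:
    in_fineweb = url in fineweb_urls
    looks_like_latex = "{" in text and "}" in text  # e.g. \frac{a}{b}
    return in_fineweb or looks_like_latex
```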
It turns out math formatting is very important: our FineWebMath data was worse than OWM.
The classifier was mostly retrieving academic papers because math forums weren't properly extracted with Trafilatura, and most equations needed better formatting.
For FineMath, we first tried starting directly from FineWeb. Although we didn't tailor FineWeb's text extraction for math, the data retained enough equations.
Then we trained a fastText classifier to retrieve OWM-like data.
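A minimal fastText sketch of such a retrieval classifier; the training file and labels are hypothetical:

```python
# Sketch of a fastText retrieval classifier: OpenWebMath-like pages as
# positives, generic web pages as negatives. The file name and the
# "__label__math" / "__label__web" labels are hypothetical.
import fasttext

# train.txt lines look like: "__label__math <page text>" or "__label__web <page text>"
model = fasttext.train_supervised(input="train.txt", lr=0.1, epoch=3, wordNgrams=2)

labels, probs = model.predict("The integral of x^2 from 0 to 1 is 1/3")
print(labels[0], probs[0])  # keep pages predicted __label__math with high confidence
```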
Llama 3 trains a DistilRoBERTa classifier to target pages with math reasoning and deduction. The process resembles FineWeb-Edu, where we train classifiers on synthetic web annotations.
The authors also highlight a specialized extractor that preserves equations when pulling math from HTML pages.
First, let's break down how AI labs curate math pre-training datasets 🕵️
DeepSeekMath and QwenMath train a fastText classifier on data like OpenWebMath (OWM). They iteratively filter and recall math content from Common Crawl, focusing on the most relevant domains.
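A plain-Python sketch of that iterative domain recall; `score_page` stands in for a classifier like the fastText one sketched earlier in this thread:

```python
# Sketch of the iterative recall loop: score pages, find the domains where
# math concentrates, then pull in more pages from those domains before
# retraining. `score_page` is assumed to return a math probability in [0, 1].
from collections import defaultdict
from urllib.parse import urlparse

def math_domains(pages, score_page, threshold=0.5, min_ratio=0.1):
    hits, totals = defaultdict(int), defaultdict(int)
    for url, text in pages:
        domain = urlparse(url).netloc
        totals[domain] += 1
        if score_page(text) >= threshold:
            hits[domain] += 1
    # domains with a high fraction of math pages get recalled wholesale
    return {d for d in totals if hits[d] / totals[d] >= min_ratio}
```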
The Open LLM Leaderboard got a new front page for Christmas
Check it out at huggingface.co/spaces/open-...
Announcing 🥂 FineWeb2: a sparkling update with 1000s of 🗣️ languages.
We applied the same data-driven approach that led to SOTA English performance in 🍷 FineWeb to thousands of languages.
🥂 FineWeb2 has 8TB of compressed text data and outperforms other multilingual datasets.
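A minimal loading sketch; the repo id and the per-language config naming ("fra_Latn"-style codes) are assumptions based on the dataset card:

```python
# Minimal sketch: stream one language subset of FineWeb2. The repo id and
# the "fra_Latn" config name are assumptions, so double-check the card.
from datasets import load_dataset

fw2 = load_dataset("HuggingFaceFW/fineweb-2", name="fra_Latn",
                   split="train", streaming=True)
print(next(iter(fw2))["text"][:300])
```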
Let's go! We are releasing SmolVLM, a smol 2B VLM built for on-device inference that outperforms all models at similar GPU RAM usage and token throughput.
SmolVLM can be fine-tuned on a Google Colab and run on a laptop! Or process millions of documents with a consumer GPU!
Small yet mighty! 💫
We are releasing SmolVLM: a new 2B vision language model made for on-device use, fine-tunable on a consumer GPU, and immensely memory efficient 🤗
We release three checkpoints under Apache 2.0: SmolVLM-Instruct, SmolVLM-Synthetic, and SmolVLM-Base: huggingface.co/collections/...
Repo: github.com/huggingface/...
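A minimal inference sketch with transformers, following the usual chat-template flow for the Instruct checkpoint; the image path and generation settings are illustrative:

```python
# Minimal SmolVLM inference sketch: build a chat prompt with an image slot,
# then generate. Settings are illustrative, not tuned recommendations.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceTB/SmolVLM-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16)

image = Image.open("document.png")  # hypothetical input image
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "Summarize this page."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

out = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```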
Here's how we use it for SmolLM 🤗
github.com/huggingface/...
A screenshot of LightEval benchmarking results in a terminal
Check out how easy it is to do LLM evals with LightEval!
* any dataset on the 🤗 Hub can become an eval task in a few lines of code: customize the prompt, metrics, parsing, few-shots, everything! (see the sketch after this list)
* model- and data-parallel inference
* auto batching with the new vLLM backend
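Here's a sketch of what a Hub-dataset-to-eval-task definition can look like, assuming LightEval's community-task API (LightevalTaskConfig plus Doc, as in the repo's community_tasks examples); field names may differ across versions, and the dataset id is hypothetical:

```python
# Sketch of a custom LightEval task. The imports and field names follow the
# repo's community_tasks examples and may differ across versions (assumption);
# the Hub dataset id is hypothetical.
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc
from lighteval.metrics.metrics import Metrics

def prompt_fn(line, task_name: str = None):
    # Map one dataset row to a Doc: the prompt, the choices, the gold answer.
    return Doc(task_name=task_name, query=line["question"],
               choices=[line["answer"]], gold_index=0)

task = LightevalTaskConfig(
    name="my_task",
    prompt_function=prompt_fn,
    suite=["community"],
    hf_repo="my-org/my-eval-dataset",  # hypothetical Hub dataset
    hf_subset="default",
    evaluation_splits=["test"],
    metric=[Metrics.exact_match],
)
TASKS_TABLE = [task]
```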
Making SmolLM2 more reproducible: open-sourcing our training & evaluation toolkit 🛠️ github.com/huggingface/...
Pre-training & evaluation code, synthetic data generation pipelines, post-training scripts, on-device tools & demos
Apache 2.0. V2 data mix coming soon!
Which tools should we add next?
Excited to announce the SFT dataset used for @huggingface.bsky.social SmolLM2!
The dataset for SmolLM2 was created by combining multiple existing datasets and generating new synthetic datasets, including MagPie Ultra v1.0, using distilabel.
Check out the dataset:
huggingface.co/datasets/Hug...
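The link above is truncated, so assuming it points to the SmolTalk mix, a minimal loading sketch:

```python
# Minimal sketch: load the SmolLM2 SFT mix. The dataset id
# ("HuggingFaceTB/smoltalk") and the "all" config are assumptions, since the
# post's link is truncated.
from datasets import load_dataset

smoltalk = load_dataset("HuggingFaceTB/smoltalk", "all", split="train")
print(smoltalk[0]["messages"][:2])  # chat format: a list of {role, content} turns
```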
10x followers in the past week, I guess it's happening!
15.11.2024 14:54