Stay tuned for more Open R1 updates!
huggingface.co/blog/open-r1...
🤗 Dataset: huggingface.co/datasets/ope...
12.02.2025 14:36
LLM reasoning labs will be eating good today!
We commandeered the HF cluster for a few days and generated 1.2M reasoning-filled solutions to 500k NuminaMath problems with DeepSeek-R1 🐳
Have fun!
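If you want to pull the data down, here's a minimal sketch with 🤗 datasets; the link above is truncated, so the dataset id used here is an assumption:

```python
# Minimal sketch: stream the R1-generated solutions and inspect one trace.
# The dataset id ("open-r1/OpenR1-Math-220k") is an assumption; the post's
# link is truncated, so check the actual URL above.
from datasets import load_dataset

ds = load_dataset("open-r1/OpenR1-Math-220k", split="train", streaming=True)

sample = next(iter(ds))
print(sample.keys())  # e.g. the problem and the reasoning-filled solution fields
```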
A plot showing increased performance of Llama-3.2-3B when pretrained on FineMath
Introducing 📐 FineMath: the best open math pre-training dataset with 50B+ tokens!
Math remains challenging for LLMs, and training on FineMath yields considerable gains over other math datasets, especially on GSM8K and MATH.
🤗 huggingface.co/datasets/Hug...
Here's a breakdown 🧵
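To poke at the data, a minimal streaming sketch; the "finemath-4plus" config name is an assumption matching the 4+ threshold discussed further down this thread:

```python
# Minimal sketch: stream the highest-quality FineMath subset.
# The config name "finemath-4plus" is an assumption (there should also be a
# "finemath-3plus" config for the looser threshold).
from datasets import load_dataset

finemath = load_dataset("HuggingFaceTB/finemath", "finemath-4plus",
                        split="train", streaming=True)
print(next(iter(finemath))["text"][:500])
```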
FineMath - The finest collection of mathematical content
We hope this dataset helps advance the performance of LLMs on math! We're also releasing all the ablation models in this collection, as well as the evaluation code.
Collection: huggingface.co/collections/...
Evaluation: github.com/huggingface/...
Below is the breakdown of the performance of each data source after decontamination: FineMath 4+ outperforms all other datasets for continued pre-training of Llama-3.2-3B-Base on 60B tokens.
19.12.2024 15:55
We got two high-quality datasets with 34B and 10B tokens, depending on the filtering threshold (3 vs 4).
We also augment the datasets by filtering the English text subset of InfiMM-WebMath-40B with our math classifier and adding it to FineMath.
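Conceptually, the threshold split looks like this (a sketch; the "score" field and the file name are hypothetical):

```python
# Sketch of the threshold split, assuming each page carries an integer
# quality score from the math classifier (the "score" field name and the
# input file are hypothetical).
from datasets import load_dataset

scored = load_dataset("json", data_files="scored_pages.jsonl", split="train")

finemath_3plus = scored.filter(lambda x: x["score"] >= 3)  # ~34B tokens
finemath_4plus = scored.filter(lambda x: x["score"] >= 4)  # ~10B tokens
```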
For the text extraction, we switched to Resiliparse with OWM's pipeline.
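For reference, a minimal Resiliparse extraction sketch; the flags here are illustrative, not OWM's exact settings:

```python
# Minimal HTML-to-text sketch with Resiliparse. extract_plain_text is
# Resiliparse's main extraction entry point; the flags are illustrative.
from resiliparse.extract.html2text import extract_plain_text

html = open("page.html", encoding="utf-8").read()
text = extract_plain_text(html, main_content=True, alt_texts=False)
print(text[:300])
```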
We then trained a classifier on Llama3's annotations to find pages with math reasoning and applied it in two stages. This helped us identify key math domains and recall high-quality math data.
huggingface.co/HuggingFaceT...
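The scoring step isn't spelled out in the post, so here's a hedged sketch in the FineWeb-Edu style (a small encoder with a regression head); the model id is an assumption, since the link above is truncated:

```python
# Sketch of scoring pages with a trained quality classifier, assuming a
# FineWeb-Edu-style regression head. The model id is an assumption.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "HuggingFaceTB/finemath-classifier"  # assumed id, see link above
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

inputs = tok("Let x be a prime number greater than 2 ...",
             return_tensors="pt", truncation=True)
with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()  # single regression logit
print(round(score))  # keep pages scoring >= 3 (or >= 4 for FineMath 4+)
```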
💡 It was time to re-extract the Common Crawl data directly.
We retrieved pages from FineWeb's URLs to retain its high-quality data. Then, we added back the math pages that earlier FineWeb filters had removed, such as those containing curly braces ("{}"), a common LaTeX pattern.
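A toy sketch of that recall logic, with the curly-brace check standing in for the real filters:

```python
# Toy sketch of the recall step: keep a re-extracted page if it was already
# in FineWeb, or if it looks like math that FineWeb's filters would have
# dropped (curly braces stand in for the actual filter set).
def keep_page(url: str, text: str, fineweb_urls: set[str]) -> bool:
    in_fineweb = url in fineweb_urls
    looks_like_latex = "{" in text and "}" in text  # e.g. \frac{a}{b}
    return in_fineweb or looks_like_latex
```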
It turns out math formatting is very important: our FineWebMath data was worse than OWM.
The classifier was mostly retrieving academic papers because math forums weren't properly extracted with Trafilatura, and most equations needed better formatting.
For FineMath, we first tried starting directly from FineWeb. Although we didn't tailor FineWeb's text extraction for math, the data retained enough equations.
Then we trained a fastText classifier to retrieve OWM-like data.
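A minimal fastText sketch of such a retrieval classifier; the training file and labels are hypothetical:

```python
# Sketch of a fastText retrieval classifier: OpenWebMath-like pages as
# positives, generic web pages as negatives. The file name and the
# "__label__math" / "__label__web" labels are hypothetical.
import fasttext

# train.txt lines look like: "__label__math <page text>" or "__label__web <page text>"
model = fasttext.train_supervised(input="train.txt", lr=0.1, epoch=3, wordNgrams=2)

labels, probs = model.predict("The integral of x^2 from 0 to 1 is 1/3")
print(labels[0], probs[0])  # keep pages predicted __label__math with high confidence
```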
Llama 3 trains a DistilRoBERTa classifier to target pages with math reasoning and deduction. The process resembles FineWeb-Edu, where we train classifiers on synthetic web annotations.
The authors also highlight a specialized extractor that preserves equations when pulling math from HTML pages.
First, let's break down how AI labs curate math pre-training datasets 🕵️
DeepSeekMath and QwenMath train a fastText classifier on data like OpenWebMath (OWM). They iteratively filter and recall math content from Common Crawl, focusing on the most relevant domains.
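A plain-Python sketch of that iterative domain recall; `score_page` stands in for a classifier like the fastText one sketched earlier in this thread:

```python
# Sketch of the iterative recall loop: score pages, find the domains where
# math concentrates, then pull in more pages from those domains before
# retraining. `score_page` is assumed to return a math probability in [0, 1].
from collections import defaultdict
from urllib.parse import urlparse

def math_domains(pages, score_page, threshold=0.5, min_ratio=0.1):
    hits, totals = defaultdict(int), defaultdict(int)
    for url, text in pages:
        domain = urlparse(url).netloc
        totals[domain] += 1
        if score_page(text) >= threshold:
            hits[domain] += 1
    # domains with a high fraction of math pages get recalled wholesale
    return {d for d in totals if hits[d] / totals[d] >= min_ratio}
```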
The Open LLM Leaderboard got a new front page for Christmas
Check it out at huggingface.co/spaces/open-...
Announcing 🥂 FineWeb2: a sparkling update with 1000s of 🗣️ languages.
We applied the same data-driven approach that led to SOTA English performance in 🍷 FineWeb to thousands of languages.
🥂 FineWeb2 has 8TB of compressed text data and outperforms other multilingual datasets.
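A minimal loading sketch; the repo id and the per-language config naming ("fra_Latn"-style codes) are assumptions based on the dataset card:

```python
# Minimal sketch: stream one language subset of FineWeb2. The repo id and
# the "fra_Latn" config name are assumptions, so double-check the card.
from datasets import load_dataset

fw2 = load_dataset("HuggingFaceFW/fineweb-2", name="fra_Latn",
                   split="train", streaming=True)
print(next(iter(fw2))["text"][:300])
```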
Let's go! We are releasing SmolVLM, a smol 2B VLM built for on-device inference that outperforms all models at similar GPU RAM usage and token throughput.
SmolVLM can be fine-tuned on a Google Colab and run on a laptop! Or process millions of documents with a consumer GPU!
Small yet mighty! 💫
We are releasing SmolVLM: a new 2B vision language model made for on-device use, fine-tunable on a consumer GPU, and immensely memory efficient 🤗
We release three checkpoints under Apache 2.0: SmolVLM-Instruct, SmolVLM-Synthetic, and SmolVLM-Base: huggingface.co/collections/...
Repo: github.com/huggingface/...
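A minimal inference sketch with transformers, following the usual chat-template flow for the Instruct checkpoint; the image path and generation settings are illustrative:

```python
# Minimal SmolVLM inference sketch: build a chat prompt with an image slot,
# then generate. Settings are illustrative, not tuned recommendations.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceTB/SmolVLM-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16)

image = Image.open("document.png")  # hypothetical input image
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "Summarize this page."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

out = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```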
Here's how we use it for SmolLM 🤗
github.com/huggingface/...
A screenshot of LightEval benchmarking results in a terminal
Check out how easy it is to do LLM evals with LightEval!
* any dataset on the 🤗 Hub can become an eval task in a few lines of code: customize the prompt, metrics, parsing, few-shots, everything! (see the sketch after this list)
* model- and data-parallel inference
* auto batching with the new vLLM backend
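Here's a sketch of what a Hub-dataset-to-eval-task definition can look like, assuming LightEval's community-task API (LightevalTaskConfig plus Doc, as in the repo's community_tasks examples); field names may differ across versions, and the dataset id is hypothetical:

```python
# Sketch of a custom LightEval task. The imports and field names follow the
# repo's community_tasks examples and may differ across versions (assumption);
# the Hub dataset id is hypothetical.
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc
from lighteval.metrics.metrics import Metrics

def prompt_fn(line, task_name: str = None):
    # Map one dataset row to a Doc: the prompt, the choices, the gold answer.
    return Doc(task_name=task_name, query=line["question"],
               choices=[line["answer"]], gold_index=0)

task = LightevalTaskConfig(
    name="my_task",
    prompt_function=prompt_fn,
    suite=["community"],
    hf_repo="my-org/my-eval-dataset",  # hypothetical Hub dataset
    hf_subset="default",
    evaluation_splits=["test"],
    metric=[Metrics.exact_match],
)
TASKS_TABLE = [task]
```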
Making SmolLM2 more reproducible: open-sourcing our training & evaluation toolkit 🛠️ github.com/huggingface/...
Pre-training & evaluation code, synthetic data generation pipelines, post-training scripts, on-device tools & demos
Apache 2.0. V2 data mix coming soon!
Which tools should we add next?
Excited to announce the SFT dataset used for @huggingface.bsky.social SmolLM2!
The dataset for SmolLM2 was created by combining multiple existing datasets and generating new synthetic datasets, including MagPie Ultra v1.0, using distilabel.
Check out the dataset:
huggingface.co/datasets/Hug...
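The link above is truncated, so assuming it points to the SmolTalk mix, a minimal loading sketch:

```python
# Minimal sketch: load the SmolLM2 SFT mix. The dataset id
# ("HuggingFaceTB/smoltalk") and the "all" config are assumptions, since the
# post's link is truncated.
from datasets import load_dataset

smoltalk = load_dataset("HuggingFaceTB/smoltalk", "all", split="train")
print(smoltalk[0]["messages"][:2])  # chat format: a list of {role, content} turns
```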
10x followers in the past week, I guess it's happening!
15.11.2024 14:54