Thanks to my intrepid collaborators @zihengh1.bsky.social @jfrcooper2 @chu_ziyi18870 @srinath_namburi @jackcai1206 @kendallpark @nick11roberts.bsky.social @fredsala.bsky.social. Special thanks to @MayeeChen and members of @SprocketLab for feedback and discussion! @uwcdis @WisconsinCS
08.05.2025 17:00
Zooming out, it's been very encouraging to see the recent interest in clustering-based approaches to training data. Highlighting some recent works (@shizediao, CLIMB), (@wettig, OrganizeTheWeb), (@Olivia61368522, DoGE/DGA) in this space!
Our setup is not limited to language domains but applies broadly: we extend it to multimodal tasks (e.g., CLIP) as well as long reasoning traces!
Combining the regrouping and reweighting strategies gives the best of both worlds: we match or exceed the performance of prior reweighting methods while requiring orders of magnitude less compute to optimize domain weights, even with as many as 100 domains!
Intuitively, we should upweight domains that best support our downstream tasks. Our approach: cluster train/eval data identically. Then, use training gradients to estimate domain alignments while accounting for eval data composition, and optimize weights accordingly.
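The alignment-based reweighting idea can be sketched in a few lines. This is a minimal, hypothetical illustration (the function name, softmax normalization, and temperature are my assumptions, not the paper's exact update rule): score each training domain by how well its mean gradient aligns with each eval cluster's gradient, weight those scores by the eval data composition, and renormalize.

```python
import numpy as np

def update_domain_weights(domain_grads, eval_grads, eval_props, temp=1.0):
    """Hypothetical sketch: score each training domain by the alignment
    of its gradient with eval-cluster gradients, weighted by the eval
    data composition, then renormalize with a softmax."""
    # domain_grads: (k, d) mean gradient per training domain
    # eval_grads:   (k, d) mean gradient per eval cluster (same clustering)
    # eval_props:   (k,)   fraction of eval data in each cluster
    align = domain_grads @ eval_grads.T   # (k, k) pairwise alignments
    scores = align @ eval_props           # composition-weighted score per domain
    w = np.exp(scores / temp)
    return w / w.sum()                    # valid mixture: positive, sums to 1

rng = np.random.default_rng(0)
k, d = 4, 16
w = update_domain_weights(rng.normal(size=(k, d)),
                          rng.normal(size=(k, d)),
                          np.array([0.4, 0.3, 0.2, 0.1]))
print(w.round(3))
```

Because the gradients are the same ones already computed for the training step, the extra cost per update is just the small matrix products above.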
Still, optimizing data mixtures typically requires expensive evaluation passes. Our efficiency hack is to use domain gradients collected during training for two purposes: training the model AND estimating optimal proportions!
How many groups are optimal? There is a "sweet spot" in data mixing! Model performance shows a U-shaped relationship with the number of clusters: too few or too many hurt performance. Cluster geometry matters too: well-separated, compact clusters are better!
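One standard way to operationalize "well-separated, compact clusters" is the silhouette score; this is my illustrative choice (the post does not specify a metric), shown here on synthetic data with three obvious groups:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated Gaussian blobs in 8 dimensions.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(loc=c, size=(50, 8)) for c in (0.0, 4.0, 8.0)])

# Sweep the number of clusters and score the geometry of each partition:
# silhouette rewards compact, well-separated clusters.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # the sweep recovers the 3 planted groups
```

A sweep like this is one cheap proxy for locating the "sweet spot" before committing to a full training run.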
Paper: arxiv.org/abs/2505.00358
Take the Dolly-15k instruction set. Instead of human-defined categories, we repartition the data into semantic categories. Training on these newly-discovered domains results in better evaluation performance.
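The regrouping step above can be sketched as: embed each instruction, then cluster by semantic similarity instead of keeping the human-given categories. TF-IDF stands in for the text encoder here purely for self-containment; this is an assumption, not the paper's actual embedding model.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sketch of regrouping: ignore human-defined category labels
# and repartition instructions into semantic clusters.
instructions = [
    "Summarize this article about climate change.",
    "Write a short summary of the meeting notes.",
    "What is the capital of France?",
    "Name the largest planet in the solar system.",
]

# Embed the text (any encoder could be swapped in) and cluster it.
X = TfidfVectorizer().fit_transform(instructions)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```

The resulting cluster IDs then replace the original categories as the "domains" whose mixture weights get optimized.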
Online data mixing reduces training costs for foundation models, but faces challenges:
⚠️ Human-defined domains miss semantic nuances
⚠️ Limited eval accessibility
⚠️ Poor scalability
Introducing 🎵 R&B: first regroup data, then dynamically reweight domains during training!
First up at #NeurIPS2024 from our group, our work on labeling via programmatic distillation (a spotlight!). Label your data orders of magnitude faster and cheaper. Come join us today at Poster Session 2 East for a demo!
11.12.2024 23:15