
@albertge.bsky.social

11 Followers  |  14 Following  |  10 Posts  |  Joined: 09.12.2024

Latest posts by albertge.bsky.social on Bluesky


Thanks to my intrepid collaborators @zihengh1.bsky.social @jfrcooper2 @chu_ziyi18870 @srinath_namburi @jackcai1206 @kendallpark @nick11roberts.bsky.social @fredsala.bsky.social. Special thanks to @MayeeChen and members of @SprocketLab for feedback and discussion! @uwcdis @WisconsinCS

08.05.2025 17:00 — 👍 2    🔁 0    💬 0    📌 0

Zooming out, it’s been very encouraging to see the recent interest in clustering-based approaches to training data. Highlighting some recent works (@shizediao, CLIMB), (@wettig, OrganizeTheWeb), (@Olivia61368522, DoGE/DGA) in this space!

08.05.2025 17:00 — 👍 2    🔁 0    💬 1    📌 0
R&B: Domain Regrouping and Data Mixture Balancing for Efficient Foundation Model Training Data mixing strategies have successfully reduced the costs involved in training language models. While promising, such methods suffer from two flaws. First, they rely on predetermined data domains (e....

Check out our paper for more technical details - we’ve got some more theoretical and empirical nuggets on how our method works: arxiv.org/abs/2505.00358 Code+datasets will be released soon! If you found this interesting, feel free to spread the word!

08.05.2025 17:00 — 👍 1    🔁 0    💬 1    📌 0

Our setup is not just for language domains but is universally applicable - we extend to multimodal tasks (e.g., CLIP), as well as long reasoning traces!

08.05.2025 17:00 — 👍 2    🔁 0    💬 1    📌 0

When combining regrouping and reweighting strategies, we get the best of both worlds: we match or exceed performance while requiring orders of magnitude less compute overhead when optimizing domain weights - even with as many as 100 domains!

08.05.2025 17:00 — 👍 2    🔁 0    💬 1    📌 0

Intuitively, we should upweight domains that best support our downstream tasks. Our approach: cluster train/eval data identically. Then, use training gradients to estimate domain alignments while accounting for eval data composition, and optimize weights accordingly.

08.05.2025 17:00 — 👍 2    🔁 0    💬 1    📌 0
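A minimal sketch of the gradient-alignment idea described in the post: score each training domain by how well its average gradient aligns with the eval-composition-weighted gradient, then turn scores into mixture weights. All names, dimensions, and the softmax step here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: per-domain average training gradients and an
# eval-composition vector (fraction of eval data falling in each cluster).
n_domains, dim = 4, 8
domain_grads = rng.normal(size=(n_domains, dim))   # g_i: avg gradient of domain i
eval_mix = np.array([0.4, 0.3, 0.2, 0.1])          # eval data composition

def alignment_weights(domain_grads, eval_mix, temperature=1.0):
    """Score domains by gradient alignment with the eval-weighted gradient,
    then softmax the scores into a normalized data-mixture weight vector."""
    eval_grad = eval_mix @ domain_grads            # eval-composition-weighted gradient
    scores = domain_grads @ eval_grad              # alignment of each domain with it
    w = np.exp(scores / temperature)
    return w / w.sum()

w = alignment_weights(domain_grads, eval_mix)
```

Since the gradients are already computed for the optimizer step, scoring domains this way adds essentially no extra forward/backward passes, which is the efficiency point made above.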

Still, optimizing data mixtures typically requires expensive evaluation passes. Our efficiency hack is to use domain gradients collected during training for two purposes: training the model AND estimating optimal proportions!

08.05.2025 17:00 — 👍 2    🔁 0    💬 1    📌 0

πŸ”How many groups are optimal? There is a "sweet spot" in data mixing! Model performance shows a U-shaped relationship with the number of clustersβ€”too few or too many hurt performance. Their geometry matters too: well-separated, compact clusters are better!

08.05.2025 17:00 — 👍 2    🔁 0    💬 1    📌 0
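One toy way to quantify "well-separated, compact clusters" is a separation-to-compactness ratio: minimum distance between centroids divided by average within-cluster spread. This is an illustrative stand-in metric, not the geometry measure used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: three well-separated 2-D blobs standing in for embedded documents.
data = np.concatenate([rng.normal(c, 0.3, size=(50, 2))
                       for c in ([0, 0], [5, 0], [0, 5])])
labels = np.repeat([0, 1, 2], 50)

def cluster_quality(data, labels):
    """Separation / compactness: higher means tighter, better-separated clusters."""
    ks = np.unique(labels)
    centroids = np.stack([data[labels == k].mean(axis=0) for k in ks])
    # Compactness: mean distance of points to their own centroid.
    compact = np.mean([np.linalg.norm(data[labels == k] - centroids[i], axis=1).mean()
                       for i, k in enumerate(ks)])
    # Separation: smallest distance between any two centroids.
    pairwise = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=-1)
    sep = pairwise[np.triu_indices(len(ks), 1)].min()
    return sep / compact

q = cluster_quality(data, labels)
```

Sweeping the number of clusters and tracking a score like this is one simple way to probe the "sweet spot" behavior the post describes.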

Paper: arxiv.org/abs/2505.00358

Take the Dolly-15k instruction set. Instead of human-defined categories, we repartition the data into semantic categories. Training on these newly-discovered domains results in better evaluation performance.

08.05.2025 17:00 — 👍 2    🔁 0    💬 1    📌 0
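The repartitioning step above can be sketched as plain k-means over document embeddings: ignore the human-defined categories and let the data define its own semantic domains. This is a generic k-means sketch under assumed toy dimensions, not the paper's exact regrouping procedure.

```python
import numpy as np

def regroup(embeddings, k, iters=20, seed=0):
    """Plain k-means: repartition data into k semantic domains."""
    rng = np.random.default_rng(seed)
    centers = embeddings[rng.choice(len(embeddings), k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center, then recompute centers.
        dists = np.linalg.norm(embeddings[:, None] - centers[None], axis=-1)
        labels = dists.argmin(axis=1)
        centers = np.stack([embeddings[labels == j].mean(axis=0)
                            if (labels == j).any() else centers[j]
                            for j in range(k)])
    return labels

# Toy stand-in for instruction embeddings (e.g., from a sentence encoder).
rng = np.random.default_rng(1)
emb = rng.normal(size=(200, 16))
domains = regroup(emb, k=5)
```

On real data the embeddings would come from a text encoder; training on the resulting clusters instead of the original human-defined categories is the experiment described in the post.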

Online data mixing reduces training costs for foundation models, but faces challenges:
⚠️ Human-defined domains miss semantic nuances
⚠️ Limited eval accessibility
⚠️ Poor scalability

Introducing 🎵R&B: first regroup data, then dynamically reweight domains during training!

08.05.2025 17:00 — 👍 5    🔁 3    💬 1    📌 0

First up at #NeurIPS2024 from our group, our work on labeling via programmatic distillation (a spotlight!). Label your data orders of magnitude faster and cheaper — come join us today at Poster Session 2 East for a demo!

11.12.2024 23:15 — 👍 15    🔁 8    💬 0    📌 0
