Online data mixing reduces training costs for foundation models, but faces challenges:
⚠️ Human-defined domains miss semantic nuances
⚠️ Reliance on evaluation data that isn't always accessible
⚠️ Compute overhead that scales poorly
Introducing 🎵R&B: first regroup data by semantic similarity, then dynamically reweight domains during training! (sketch of the reweighting idea below)
08.05.2025 17:00 — 👍 5 🔁 3 💬 1 📌 0
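A minimal sketch of the general reweighting idea, assuming a simple multiplicative (softmax) update over per-domain losses; the update rule, domain count, and learning rate here are illustrative assumptions, not the paper's algorithm:

```python
import numpy as np

# Illustrative online domain reweighting (not R&B's exact update rule):
# keep one sampling weight per domain and shift mass toward domains
# whose recent training loss is high.
def update_domain_weights(weights, domain_losses, lr=0.1):
    logits = np.log(weights) + lr * domain_losses  # favor high-loss domains
    probs = np.exp(logits - logits.max())          # numerically stable softmax
    return probs / probs.sum()

weights = np.ones(4) / 4                  # start uniform over 4 domains
losses = np.array([0.9, 0.4, 1.2, 0.6])   # recent per-domain training losses
weights = update_domain_weights(weights, losses)  # sampling dist for next window
```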
What enables a strong model to surpass its weaker teacher?
🚀 Excited to share our ICLR 2025 paper: "Weak-to-Strong Generalization Through the Data-Centric Lens"! 🧵
05.02.2025 18:22 — 👍 4 🔁 2 💬 1 📌 0
Our results are exciting! 🎉
✔️ Pretrained model weights are reliable guides for data selection.
✔️ Grad-Mimic identifies noisy samples and estimates training dataset quality.
✔️ It even complements other filtering methods, boosting CLIP performance with less data!
09.02.2025 21:07 — 👍 0 🔁 0 💬 1 📌 0
Mimic Score helps identify samples that can misguide weight updates. We can automatically filter these out, improving training. Here is an example identified in noisy web datasets! (minimal filtering sketch below)
09.02.2025 21:07 — 👍 0 🔁 0 💬 1 📌 0
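A minimal sketch of threshold-based filtering, assuming per-sample mimic scores have already been computed; the function name and threshold are hypothetical:

```python
import torch

# Illustrative only: drop samples whose mimic score falls below a threshold.
def filter_by_mimic_score(samples, mimic_scores, threshold=0.0):
    keep = mimic_scores >= threshold               # boolean mask, shape [N]
    return [s for s, k in zip(samples, keep) if k]
```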
🛠️ Using the Mimic Score, we develop Grad-Mimic, a two-stage framework (sketch below):
1️⃣ Training Phase: Prioritizes which samples to learn from, boosting data efficiency.
2️⃣ Post-Training Phase: Evaluates sample utility across training steps, creating an ensemble filter using weak supervision.
09.02.2025 21:07 — 👍 0 🔁 0 💬 1 📌 0
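A simplified sketch of the two-stage idea; function and variable names are illustrative, not from the paper's code, and the softmax weighting and mean-score ensemble are assumptions standing in for the actual mechanisms:

```python
import torch

# Stage 1: during training, weight each sample's loss by its mimic score
# so that well-aligned samples dominate the update.
def weighted_step(model, optimizer, loss_fn, batch, mimic_scores):
    x, y = batch
    per_sample_loss = loss_fn(model(x), y)    # loss_fn uses reduction='none'
    w = torch.softmax(mimic_scores, dim=0)    # normalize scores into weights
    loss = (w * per_sample_loss).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Stage 2: after training, combine the per-step scores collected for each
# sample into one keep/drop decision, e.g. by averaging across steps.
def ensemble_filter(score_history, threshold):
    mean_scores = torch.stack(score_history).mean(dim=0)  # [num_samples]
    return mean_scores >= threshold                       # boolean keep-mask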
📣 We propose the Mimic Score: a new data quality metric. It leverages reference model weights to assess sample utility, measuring how well each sample's gradient aligns with a target direction induced by the reference model. (sketch below)
09.02.2025 21:07 — 👍 0 🔁 0 💬 1 📌 0
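A minimal sketch of the stated idea, assuming the "target direction" is the vector pointing from the current weights toward the reference model's weights; that reading is my assumption, not a quotation of the paper's formula:

```python
import torch
import torch.nn.functional as F

def mimic_score(per_sample_grad, theta_current, theta_ref):
    """Alignment between a sample's negative gradient and the direction
    from the current weights to the reference weights (assumed definition)."""
    target_dir = theta_ref - theta_current
    return F.cosine_similarity(-per_sample_grad.flatten(),
                               target_dir.flatten(), dim=0)
```

Cosine similarity keeps the score scale-free, so samples are ranked purely by how well their updates point the model toward the reference weights.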
Tons of model weights are available, but what else can we do with them besides prediction? 🤔 Introducing Grad-Mimic! A new data selection framework that uses a well-trained model's weights to find high-value samples for foundation models. Boost data curation & data efficiency!
09.02.2025 21:07 — 👍 3 🔁 3 💬 1 📌 0
First up at #NeurIPS2024 from our group: our work on labeling via programmatic distillation (a spotlight!). Label your data orders of magnitude faster and cheaper. Come join us today at Poster Session 2 East for a demo!
11.12.2024 23:15 — 👍 14 🔁 8 💬 0 📌 0
🤗 ML at Hugging Face
🌲 Academic Staff at Stanford University (AIMI Center)
🦴 Radiology AI is my thing
CS PhD student @ UW Madison. Working on data- and compute-efficient LLM adaptation.
STAT PhD @ Wisc | Working on social network analysis & LLM adaptation
Ph.D. Candidate at UW-Madison
https://harit7.github.io/
Ph.D student at @WisconsinCS @UWMadison
Ph.D. student at UW-Madison. Working on automating foundation model guided science. Previously at CMU, UCSD, Fresno City College.
https://nick11roberts.science
Data Quality x Privacy
PhD student @ CMU with Zico Kolter and Zack Lipton | Founding Member @datologyai.com | Prev. Comp Sc @iitdelhi
http://pratyushmaini.github.io/
Associate Professor at Princeton
Machine Learning Researcher
Director of the Center for the Advancement of Progress
Research Scientist @ Google DeepMind. Physics of learning, ML / AI, condensed matter. Prev Ph.D. Physics @ UC Berkeley.
I work on AI at OpenAI.
Former VP AI and Distinguished Scientist at Microsoft.
Professor and Head of Machine Learning Department at Carnegie Mellon. Board member OpenAI. Chief Technical Advisor Gray Swan AI. Chief Expert Bosch Research.
AI professor at Caltech. General Chair ICLR 2025.
http://www.yisongyue.com
CS PhD @UW-Madison | Data- and compute-efficient reasoning for foundation models
Website: https://jiayuww.github.io/
PhD Student @UW-Madison, working on synthetic data, instruction tuning, and foundation models. @BrownUniversity '24
https://avitrost.github.io/
Everything datasets and human feedback for AI at Hugging Face.
Prev: co-founder and CEO of Argilla (acquired by Hugging Face)