Online data mixing reduces training costs for foundation models, but faces challenges:
⚠️ Human-defined domains miss semantic nuances
⚠️ Limited eval accessibility
⚠️ Poor scalability
Introducing 🎵R&B: first regroup data, then dynamically reweight domains during training!
08.05.2025 17:00 —
👍 5
🔁 3
💬 1
📌 0
What enables a strong model to surpass its weaker teacher?
🚀 Excited to share our ICLR 2025 paper: "Weak-to-Strong Generalization Through the Data-Centric Lens"! 🧵
05.02.2025 18:22 —
👍 4
🔁 2
💬 1
📌 0
Our results are exciting! 🎉
✔️ Pretrained model weights are reliable guides for data selection.
✔️ Grad-Mimic identifies noisy samples and estimates training dataset quality.
✔️ It even complements other filtering methods, boosting CLIP performance with less data!
09.02.2025 21:07 —
👍 0
🔁 0
💬 1
📌 0
Mimic Score helps identify samples that can misguide weight updates. We can automatically filter these out, improving training. Here is an example it flagged in a noisy web dataset!
09.02.2025 21:07 —
👍 0
🔁 0
💬 1
📌 0
🛠️ Using Mimic Score, we develop Grad-Mimic, a two-stage framework:
1️⃣ Training Phase: Prioritizes which samples to learn from, boosting data efficiency.
2️⃣ Post-Training Phase: Evaluates sample utility across training steps, creating an ensemble filter using weak supervision.
09.02.2025 21:07 —
👍 0
🔁 0
💬 1
📌 0
📣 We propose the Mimic Score: a new data quality metric. It leverages reference model weights to assess sample utility, relying on the alignment between gradients and a target direction induced by the reference model.
09.02.2025 21:07 —
👍 0
🔁 0
💬 1
📌 0
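To make the idea concrete, here is a minimal sketch of one way such a score could be computed. It is an illustration under stated assumptions, not the paper's implementation: the function name `mimic_score`, the flat weight lists, and the choice to project the descent direction (the negative gradient) onto the unit vector pointing from the current weights toward the reference weights are all hypothetical readings of "alignment between gradients and a target direction".

```python
import math


def mimic_score(grad, weights, ref_weights):
    """Hypothetical score: projection of the descent direction (-grad)
    onto the unit vector from the current weights to the reference weights.

    Positive: a gradient step on this sample moves toward the reference model.
    Negative: the step moves away from it.
    """
    # Target direction induced by the reference model (assumed form).
    direction = [r - w for r, w in zip(ref_weights, weights)]
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        # Already at the reference weights, so no direction is defined.
        return 0.0
    return sum(-g * d for g, d in zip(grad, direction)) / norm


# Toy 2-D example with made-up numbers.
weights = [0.0, 0.0]
ref_weights = [3.0, 4.0]

helpful = mimic_score([-3.0, -4.0], weights, ref_weights)   # step heads toward the reference
harmful = mimic_score([3.0, 4.0], weights, ref_weights)     # step heads away from it
neutral = mimic_score([4.0, -3.0], weights, ref_weights)    # step is orthogonal

# Under this reading, low-scoring samples are the candidates for filtering.
kept = [name for name, s in [("helpful", helpful), ("harmful", harmful)] if s > 0]
```

In this toy setup, samples with a negative score are the ones that would pull the weights away from the reference model. The zero threshold is only a placeholder for illustration.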
Tons of model weights are available, but what can we do with them besides prediction? 🤔 Introducing Grad-Mimic! A new data selection framework that uses a well-trained model's weights to find high-value samples for foundation models. Boost data curation & data efficiency!
09.02.2025 21:07 —
👍 3
🔁 3
💬 1
📌 0
First up at #NeurIPS2024 from our group, our work on labeling via programmatic distillation (a spotlight!). Label your data orders of magnitude faster and cheaper — come join us today at Poster Session 2 East for a demo!
11.12.2024 23:15 —
👍 15
🔁 8
💬 0
📌 0