Tzu-Heng (Brian) Huang @zihengh1

Online data mixing reduces training costs for foundation models, but faces challenges:
⚠️ Human-defined domains miss semantic nuances
⚠️ Limited eval accessibility
⚠️ Poor scalability

Introducing 🎵R&B: first regroup data, then dynamically reweight domains during training!

08.05.2025 17:00 — 👍 5 🔁 3 💬 1 📌 0

What enables a strong model to surpass its weaker teacher?

🚀 Excited to share our ICLR 2025 paper: "Weak-to-Strong Generalization Through the Data-Centric Lens"! 🧵

05.02.2025 18:22 — 👍 4 🔁 2 💬 1 📌 0

Evaluating Sample Utility for Data Selection by Mimicking Model Weights
Foundation models are trained on large-scale web-crawled datasets, which often contain noise, biases, and irrelevant information. This motivates the use of data selection techniques, which can be divi...

This was my internship work at Apple! Huge thanks to Manjot Bilkhu, my advisor @fredsala.bsky.social, and Javier Movellan! Check out the full paper for all the details:
arxiv.org/abs/2501.06708

If you found this interesting, feel free to spread the word!

09.02.2025 21:07 — 👍 0 🔁 0 💬 0 📌 0

Our results are exciting! 🎉
✔️ Pretrained model weights are reliable guides for data selection.
✔️ Grad-Mimic identifies noisy samples and estimates training dataset quality.
✔️ It even complements other filtering methods, boosting CLIP performance with less data!

09.02.2025 21:07 — 👍 0 🔁 0 💬 1 📌 0

Mimic Score helps identify samples that can misguide weight updates. We can automatically filter these out, improving training. Here is an example identified in a noisy web dataset!

09.02.2025 21:07 — 👍 0 🔁 0 💬 1 📌 0

🛠️ Using Mimic Score, we develop Grad-Mimic, a two-stage framework:
1️⃣ Training Phase: Prioritizes which samples to learn, boosting data efficiency.
2️⃣ Post-Training Phase: Evaluates sample utility across training steps, creating an ensemble filter using weak supervision.

09.02.2025 21:07 — 👍 0 🔁 0 💬 1 📌 0

📣 We propose the Mimic Score: a new data quality metric. It leverages reference model weights to assess sample utility, relying on the alignment between gradients and a target direction induced by the reference model.

09.02.2025 21:07 — 👍 0 🔁 0 💬 1 📌 0
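
A rough sketch of the Mimic Score idea from the post above, for readers who want something concrete. This is one plausible reading, not the paper's exact definition: it assumes the "target direction" is the vector from the current weights to the reference weights, and that "alignment" means projecting each sample's negative gradient onto that direction. The function name `mimic_scores` and the flat-list weight layout are hypothetical.

```python
import math

def mimic_scores(per_sample_grads, current_weights, reference_weights):
    """Score each sample by how well its descent step points toward the reference weights.

    Assumption: weights and gradients are flattened into plain lists of floats.
    """
    # Target direction: from the current weights toward the reference weights.
    direction = [r - w for r, w in zip(reference_weights, current_weights)]
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        # Already at the reference weights, so no direction to align with.
        return [0.0 for _ in per_sample_grads]
    unit = [d / norm for d in direction]
    # A descent step moves along the negative gradient, so project -g onto the unit direction.
    return [-sum(g_i * u_i for g_i, u_i in zip(g, unit)) for g in per_sample_grads]

# Toy usage: three per-sample gradients in a two-parameter model.
grads = [[-2.0, 0.5], [3.0, 0.0], [0.0, 1.0]]
scores = mimic_scores(grads, current_weights=[0.0, 0.0], reference_weights=[1.0, 0.0])
# Keep only samples whose update would move the weights toward the reference.
kept = [i for i, s in enumerate(scores) if s > 0]
```

Under this reading, a positive score means the sample's update moves the model toward the reference weights, and a negative score flags a sample that pulls it away. In the toy example only the first sample is kept.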

Data selection is crucial!
⚠️ Existing methods have limitations: Model-free approaches are hard to design, while model-based ones can be computationally expensive.
✅ Grad-Mimic offers a better way using existing model weights!
arxiv.org/abs/2501.06708

09.02.2025 21:07 — 👍 0 🔁 0 💬 1 📌 0

Tons of model weights are available, but what else can we do with them besides prediction? 🤔 Introducing Grad-Mimic! A new data selection framework that uses a well-trained model’s weights to find high-value samples for foundation models. Boost data curation & data efficiency!

09.02.2025 21:07 — 👍 3 🔁 3 💬 1 📌 0

First up at #NeurIPS2024 from our group, our work on labeling via programmatic distillation (a spotlight!). Label your data orders of magnitude faster and cheaper — come join us today at Poster Session 2 East for a demo!

11.12.2024 23:15 — 👍 14 🔁 8 💬 0 📌 0
