
Vishaal Udandarao

@vishaalurao.bsky.social

@ELLISforEurope PhD Student @bethgelab @caml_lab @Cambridge_Uni @uni_tue; Currently SR @GoogleAI; Previously MPhil @Cambridge_Uni, RA @RutgersU, UG @iiitdelhi vishaal27.github.io

577 Followers  |  245 Following  |  13 Posts  |  Joined: 19.11.2024

Latest posts by vishaalurao.bsky.social on Bluesky

CuratedThoughts: Data Curation for RL Datasets 🚀

Since DeepSeek-R1 introduced reasoning-based RL, datasets like Open-R1 & OpenThoughts have emerged for fine-tuning & GRPO. Our deep dive found major flaws: 25% of OpenThoughts had to be eliminated through data curation.

Here's why 👇🧵
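For intuition, here is a minimal sketch of the kind of filters such a curation pass might apply to a reasoning dataset. The field names, thresholds, and filters below are hypothetical illustrations, not the actual CuratedThoughts pipeline:

```python
# Hypothetical curation pass over a reasoning-RL dataset (illustration only;
# field names and thresholds are assumptions, not the CuratedThoughts code).

def curate(examples):
    seen_prompts, kept = set(), []
    for ex in examples:
        prompt, solution, answer = ex["prompt"], ex["solution"], ex["answer"]
        if prompt in seen_prompts:                 # drop exact-duplicate prompts
            continue
        if answer is None or not answer.strip():   # drop unverifiable samples (no ground truth)
            continue
        if len(solution.split()) < 20:             # drop degenerate / truncated reasoning traces
            continue
        seen_prompts.add(prompt)
        kept.append(ex)
    return kept
```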

17.02.2025 18:22 · 👍 13    🔁 9    💬 1    📌 1

Ever wondered why presenting more facts can sometimes *worsen* disagreements, even among rational people? 🤔

It turns out, Bayesian reasoning has some surprising answers - no cognitive biases needed! Let's explore this fascinating paradox quickly ☺️
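A worked toy example (my own illustration, not from the thread) of how this can happen with purely Bayesian updating: two agents see the same ten pro-H reports but hold different likelihood models for how such reports are generated, so the same evidence pushes their posteriors in opposite directions.

```python
# Two Bayesian agents update on the SAME evidence stream yet disagree more
# afterwards, because they assume different likelihood models for the source.

def update(prior, p_report_if_true, p_report_if_false):
    """One Bayes update after observing a single pro-H report."""
    numer = p_report_if_true * prior
    return numer / (numer + p_report_if_false * (1 - prior))

post_a = post_b = 0.5            # same prior on hypothesis H
lik_a = (0.8, 0.3)               # Agent A: the source reliably tracks H
lik_b = (0.6, 0.9)               # Agent B: the source over-reports even when H is false

for _ in range(10):              # ten identical pro-H reports
    post_a = update(post_a, *lik_a)
    post_b = update(post_b, *lik_b)

print(f"Agent A: P(H | data) = {post_a:.3f}")   # rises toward 1
print(f"Agent B: P(H | data) = {post_b:.3f}")   # falls toward 0
```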

07.01.2025 22:25 · 👍 233    🔁 77    💬 8    📌 2

🎉 Had fun at #NeurIPS2024 Workshop on #AdaptiveFoundationModels!

🚀 Speakers: @rsalakhu.bsky.social, @sedielem.bsky.social, Kate Saenko, Matthias Bethge / @vishaalurao.bsky.social, Minjoon Seo, Bing Liu, Tianqi Chen

🌐 Posters: adaptive-foundation-models.org/papers

🎬 neurips.cc/virtual/2024...

🧵 Recap!

19.12.2024 04:59 · 👍 10    🔁 2    💬 1    📌 0

Our workshop in numbers:
πŸ–‡οΈ 128 Papers
πŸ’¬ 8 Orals
πŸ–‹οΈ 564 Authors
βœ… 40 Reviewers
πŸ”Š 7 Invited Speakers
πŸ‘• 100 T-Shirts

🔥 Organizers: Paul Vicol, Mengye Ren, Renjie Liao, Naila Murray, Wei-Chiu Ma, Beidi Chen

#NeurIPS2024 #AdaptiveFoundationModels

19.12.2024 04:59 · 👍 1    🔁 1    💬 1    📌 0

🚨 Looking to test your foundation model on an arbitrary and open-ended set of capabilities, not explicitly captured by static benchmarks? 🚨

Check out ✨ONEBench✨, where we show how sample-level evaluation is the solution.

🔎 arxiv.org/abs/2412.06745
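To make "sample-level evaluation" concrete, here is a toy sketch of the underlying idea: pool samples from many benchmarks, tag them, and score an open-ended capability by aggregating over whichever samples match the query. The tags and the simple mean-accuracy aggregator are my simplifications, not ONEBench's actual aggregation method.

```python
# Toy sample-level evaluation pool (illustrative only; ONEBench's real
# aggregation over heterogeneous, sparse measurements is more involved).

sample_pool = [
    {"id": "vqa-17",  "tags": {"ocr", "charts"},  "correct": 1},
    {"id": "coco-3",  "tags": {"counting"},       "correct": 0},
    {"id": "doc-42",  "tags": {"ocr", "tables"},  "correct": 1},
    {"id": "math-9",  "tags": {"charts", "math"}, "correct": 0},
]

def score_capability(pool, query_tags):
    """Score a model on an ad-hoc capability described by a set of tags."""
    hits = [s for s in pool if query_tags & s["tags"]]
    return sum(s["correct"] for s in hits) / len(hits) if hits else None

print(score_capability(sample_pool, {"ocr"}))      # 1.0 over the two OCR samples
print(score_capability(sample_pool, {"charts"}))   # 0.5 over the chart-related samples
```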

10.12.2024 17:44 · 👍 18    🔁 5    💬 1    📌 2

πŸ˜΅β€πŸ’« Continually pretraining large multimodal models to keep them up-to-date all-the-time is tough, covering everything from adapters, merging, meta-scheduling to data design and more!

So I'm really happy to present our large-scale study at #NeurIPS2024!

Come drop by to talk about all that and more!
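As one concrete example of the "merging" ingredient, here is a minimal weight-interpolation sketch between an old checkpoint and a newly adapted one. The function and the alpha coefficient are illustrative; the study compares many more strategies than this.

```python
import torch.nn as nn

# Minimal checkpoint-merging sketch (illustrative; not the paper's full recipe).

def merge_checkpoints(old_state, new_state, alpha=0.5):
    """Linearly interpolate two state dicts with identical keys and shapes."""
    return {k: (1 - alpha) * old_state[k] + alpha * new_state[k] for k in old_state}

# Toy usage with two models of the same architecture:
old_model, new_model = nn.Linear(4, 2), nn.Linear(4, 2)
merged = merge_checkpoints(old_model.state_dict(), new_model.state_dict(), alpha=0.3)
old_model.load_state_dict(merged)   # continue from the merged weights
```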

10.12.2024 16:42 · 👍 40    🔁 6    💬 1    📌 2

This was work done during my internship with amazing folks @google @deep-mind.bsky.social: @nikparth1.bsky.social (joint-first), Ferjad, Talfan, @samuelalbanie.bsky.social, Federico, Yongqin, Alessio & @olivierhenaff.bsky.social

Super excited about this direction of strong pretraining for smol models!

02.12.2024 18:06 · 👍 0    🔁 0    💬 0    📌 0

Bonus: Along the way, we found the current state of CLIP zero-shot benchmarking in disarray: some test datasets have a seed std of ~12%!

We construct a stable & reliable set of evaluations (StableEval), inspired by inverse-variance weighting, to prune out unreliable evals!
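A rough sketch of that inverse-variance-weighting idea (illustrative only; the precise StableEval criterion and threshold are in the paper, not reproduced here): prune datasets whose accuracy swings across seeds, and weight the rest by the inverse of their seed variance.

```python
import numpy as np

# Illustrative inverse-variance weighting over per-seed accuracies
# (the exact StableEval pruning rule and threshold are assumptions here).

def stable_aggregate(per_dataset_scores, std_threshold=0.05):
    """per_dataset_scores: {dataset_name: list of accuracies across seeds}."""
    kept, means, weights = [], [], []
    for name, scores in per_dataset_scores.items():
        scores = np.asarray(scores, dtype=float)
        if scores.std(ddof=1) > std_threshold:     # unreliable eval: prune it
            continue
        kept.append(name)
        means.append(scores.mean())
        weights.append(1.0 / (scores.var(ddof=1) + 1e-8))
    if not kept:
        return float("nan"), kept
    weights = np.asarray(weights) / sum(weights)
    return float(np.dot(weights, means)), kept

score, kept = stable_aggregate({
    "stable_dataset": [0.71, 0.72, 0.70],   # seed std ~1%: kept
    "noisy_dataset":  [0.55, 0.79, 0.31],   # seed std ~24%: pruned
})
print(score, kept)
```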

02.12.2024 18:03 · 👍 0    🔁 0    💬 1    📌 0

Finally, we scale all our insights to pretrain SoTA FLOP-efficient models across three different FLOP-scales: ACED-F{0,1,2}

Outperforming strong baselines including Apple's MobileCLIP, TinyCLIP, and @datologyai.com's CLIP models!

02.12.2024 18:02 · 👍 0    🔁 0    💬 1    📌 0

There's more! ACID and KD are complementary: they can be profitably combined at scale! Our simple pretraining recipe ACED-ACIDistill showcases continued benefits as we scale to 26B samples seen!

02.12.2024 18:01 · 👍 0    🔁 0    💬 1    📌 0

We also show that ACID strongly outperforms KD across different reference/teacher training datasets, KD objectives, and student sizes.

02.12.2024 18:01 · 👍 0    🔁 0    💬 1    📌 0

Our ACID method shows very strong scaling properties as the size of the reference model increases, until we hit a saturation point: the optimal reference-student capacity ratio.

Further, ACID significantly outperforms KD as we scale up the reference/teacher sizes.

02.12.2024 18:01 · 👍 0    🔁 0    💬 1    📌 0

As our ACID method performs implicit distillation, we can further combine our data curation strategy with an explicit distillation objective, and conduct a series of experiments to determine the optimal combination strategy.
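Schematically, one such combination is just a weighted sum of the student's contrastive loss on the ACID-selected batch and a softmax distillation term against the teacher. The names and the mixing weight below are placeholders, not the exact objective from the paper.

```python
import torch.nn.functional as F

# Schematic curation-plus-distillation objective (illustration; not the exact
# ACED recipe). The contrastive loss is assumed to be computed on a batch that
# an ACID-style selection step has already curated.

def combined_loss(contrastive_loss, student_logits, teacher_logits, lambda_kd=1.0, tau=2.0):
    """Contrastive loss on the curated batch plus temperature-scaled softmax KD."""
    kd = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    ) * tau**2
    return contrastive_loss + lambda_kd * kd
```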

02.12.2024 18:00 · 👍 0    🔁 0    💬 1    📌 0

Our online curation method (ACID) uses large pretrained reference models (adopted from prior work: JEST), & we show a theoretical equivalence between KD and ACID (Appendix C in the paper).
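The reference-model scoring at the heart of this can be sketched as ranking each candidate in a large super-batch by its "learnability" (student loss minus reference loss) and keeping only the top fraction; the exact scoring and batching in the paper may differ.

```python
import torch

# JEST-style learnability selection as adopted by ACID (simplified sketch).
# Per-example losses are assumed to be precomputed over a candidate super-batch.

def select_batch(student_losses, reference_losses, keep_frac=0.25):
    """Keep examples the student still finds hard but the reference finds easy."""
    learnability = student_losses - reference_losses    # higher = more worth training on
    k = max(1, int(keep_frac * learnability.numel()))
    _, idx = torch.topk(learnability, k)
    return idx                                          # indices into the super-batch
```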

02.12.2024 18:00 · 👍 0    🔁 0    💬 1    📌 0

TLDR: We introduce an online data curation method that, when coupled with simple softmax knowledge distillation, produces a very effective pretraining recipe yielding SoTA inference-efficient two-tower contrastive VLMs!

02.12.2024 17:59 · 👍 1    🔁 0    💬 1    📌 0

🚀 New Paper: Active Data Curation Effectively Distills Multimodal Models
arxiv.org/abs/2411.18674

Smol models are all the rage these days & knowledge distillation (KD) is key for model compression!

We show how data curation can effectively distill models, yielding SoTA FLOP-efficient {C/Sig}LIPs!!
🧵👇

02.12.2024 17:58 · 👍 23    🔁 6    💬 1    📌 2
Preview: Technical Deep-Dive: Curating Our Way to a State-of-the-Art Text Dataset. Our data curation pipeline obtains substantial improvements in LLM quality, training speed, and inference efficiency.

ICYMI, check out our latest results @datologyai.com on curating data for LLMs.

Intervening only on training data, our pipeline can train models faster (7.7x less compute), better (+8.5% performance), and smaller (models half the size outperform by >5%)!

www.datologyai.com/post/technic...

29.11.2024 16:36 · 👍 5    🔁 2    💬 0    📌 0

Great paper! Why do you think it doesn't make sense for pretraining to be made aware of the model being used in a few-shot setting downstream? Do you see any potential downsides of this kind of approach?

29.11.2024 09:39 · 👍 3    🔁 0    💬 1    📌 0

🤔 Can you turn your vision-language model from a great zero-shot model into a great-at-any-shot generalist?

Turns out you can, and here is how: arxiv.org/abs/2411.15099

Really excited to share this work on multimodal pretraining for my first Bluesky entry!

🧵 A short and hopefully informative thread:
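One plausible way to make pretraining "any-shot aware" (my illustration of the general idea, not necessarily the paper's exact recipe) is to pack several image-text demonstration pairs plus a query into a single interleaved training sequence, so the model already sees in-context examples during pretraining.

```python
import random

# Illustrative construction of an interleaved few-shot pretraining sample
# (hypothetical format; <image:...> stands in for visual embeddings).

def build_fewshot_sequence(pairs, num_shots=3):
    """pairs: list of (image_id, caption) tuples from the pretraining set."""
    demos = random.sample(pairs, num_shots + 1)
    context, (q_img, q_caption) = demos[:-1], demos[-1]
    seq = "".join(f"<image:{img}> {caption}\n" for img, caption in context)
    seq += f"<image:{q_img}> "          # the model is trained to predict q_caption here
    return seq, q_caption
```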

28.11.2024 14:32 · 👍 135    🔁 24    💬 2    📌 7

@bayesiankitten.bsky.social @dziadzio.bsky.social and I also work on continual pretraining :)

28.11.2024 21:13 · 👍 2    🔁 0    💬 0    📌 0

Congrats, super exciting!!

22.11.2024 12:50 · 👍 0    🔁 0    💬 1    📌 0

1/ Introducing OpenScholar: a retrieval-augmented LM to help scientists synthesize knowledge 📚
@uwnlp.bsky.social & Ai2
With open models & 45M-paper datastores, it outperforms proprietary systems & matches human experts.
Try out our demo!
openscholar.allen.ai
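The retrieval-augmented pattern behind a system like this, in its simplest form (a generic sketch with placeholder components, not OpenScholar's actual retriever, reranker, or feedback loop): pull the most relevant passages from the paper datastore and condition the LM on them.

```python
# Generic retrieval-augmented generation sketch. `embed`, `datastore`, and
# `lm_generate` are placeholders supplied by the caller, not OpenScholar APIs.

def answer_with_retrieval(question, datastore, embed, lm_generate, k=5):
    """Retrieve top-k passages for the question and condition the LM on them."""
    q_vec = embed(question)
    ranked = sorted(datastore, key=lambda p: float(q_vec @ embed(p["text"])), reverse=True)
    context = "\n\n".join(p["text"] for p in ranked[:k])
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer with citations:"
    return lm_generate(prompt)
```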

19.11.2024 16:30 · 👍 161    🔁 39    💬 6    📌 8
Preview: Tübingen AI · Join the conversation

Here's a fledgling starter pack for the AI community in Tübingen. Let me know if you'd like to be added!

go.bsky.app/NFbVzrA

19.11.2024 13:14 · 👍 24    🔁 13    💬 18    📌 0
