Adhiraj Ghosh

@adhirajghosh.bsky.social

ELLIS PhD, University of Tübingen | Data-centric Vision and Language @bethgelab.bsky.social Website: adhirajghosh.github.io Twitter: https://x.com/adhiraj_ghosh98

1,551 Followers 427 Following 77 Posts Joined Nov 2024
7 months ago

Excited to be in Vienna for #ACL2025 🇦🇹! You'll find @dziadzio.bsky.social and me by our ONEBench poster, so do drop by!

🗓️Wed, July 30, 11-12:30 CET
📍Hall 4/5

I’m also excited to talk about lifelong and personalised benchmarking, data curation and vision-language in general! Let’s connect!

8 months ago

Why More Researchers Should be Content Creators

Just trying something new! I recorded one of my recent talks, sharing what I learned from starting as a small content creator.

youtu.be/0W_7tJtGcMI

We all benefit when there are more content creators!

9 months ago

I'm in Nashville this week attending #CVPR2025. Excited to discuss post-training VLMs and diffusion models!

9 months ago

🧵1/10 Excited to share our #SIGGRAPH paper "MonetGPT: Solving Puzzles Enhances MLLMs' Image Retouching Skills" 🌟
We explore how to make MLLMs operation-aware by solving visual puzzles, and propose a procedural framework for image retouching.
#MLLM

9 months ago

🏆ONEBench accepted to ACL main! ✨
Stay tuned for the official leaderboard and real-time personalised benchmarking release!

If you’re attending ACL or are generally interested in the future of foundation model benchmarking, happy to talk!

#ACL2025NLP #ACL2025
@aclmeeting.bsky.social

11 months ago

🧠 Keeping LLMs factually up to date is a common motivation for knowledge editing.

But what would it actually take to support this in practice at the scale and speed the real world demands?

We explore this question and really push the limits of lifelong knowledge editing in the wild.
👇

1 year ago

Check out our newest paper!

As always, it was super fun working on this with @prasannamayil.bsky.social

1 year ago
Post image

🚨Great Models Think Alike and this Undermines AI Oversight🚨
New paper quantifies LM similarity
(1) LLM-as-a-judge favors more similar models🤥
(2) Complementary knowledge benefits Weak-to-Strong Generalization☯️
(3) More capable models have more correlated failures 📈🙀
🧵👇

1 year ago

Godsend

1 year ago

Fuck it, today we're open-sourcing the codebase used to train SmolVLM from scratch on 256 H100s 🔥
Inspired by our team's effort to open-source DeepSeek's R1, we are releasing the training and evaluation code on top of the weights 🫡
Now you can train any SmolVLM—or create your own custom VLMs!

1 year ago

Added you!

1 year ago

NLI Improves Compositionality in Vision-Language Models is accepted to #ICLR2025!

CECE enables interpretability and achieves significant improvements without fine-tuning on hard compositional benchmarks (e.g., Winoground, EqBen) and alignment benchmarks (e.g., DrawBench, EditBench). More info: cece-vlm.github.io

1 year ago

I feel like my “following” and “popular with friends” feeds are well tuned, as I have complete control over them. It's just that people still post less on bsky and are more active on Twitter. Once that changes (and I think it will), we'll have the same experience as on Twitter right now.

1 year ago

📄 New Paper: "How to Merge Your Multimodal Models Over Time?"

arxiv.org/abs/2412.06712

Model merging assumes all finetuned models are available at once. But what if they need to be created over time?

We study Temporal Model Merging through the TIME framework to find out!

🧵
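As a toy illustration of the sequential setting the post describes (a simple running average of expert weights as they arrive; the actual TIME framework explores a much broader space of strategies, and the dict-of-floats models here are a made-up stand-in for real checkpoints):

```python
def merge_over_time(base, experts_over_time):
    """Sketch: merge expert checkpoints as they arrive over time.

    Keeps a running (incremental) mean of expert parameters, so no
    expert has to be stored after it is merged. `base` only seeds the
    parameter names; each model is a dict of parameter name -> float.
    """
    merged = dict(base)  # copy so the base model is left untouched
    for t, expert in enumerate(experts_over_time, start=1):
        for name in merged:
            # incremental mean over the t experts seen so far
            merged[name] += (expert[name] - merged[name]) / t
    return merged


base = {"w": 0.0}
merged = merge_over_time(base, [{"w": 1.0}, {"w": 3.0}])  # mean of 1.0 and 3.0
```

The point of the sketch is only that temporal merging must commit to a model at each step without seeing future experts, which is what distinguishes it from merging all finetuned models at once.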

1 year ago

Added you!

1 year ago

Sure!

1 year ago

Welcome, stranger

1 year ago

How do we benchmark the vast capabilities of foundation models? Introducing ONEBench – a unifying benchmark to test them all, led by
@adhirajghosh.bsky.social and
@dziadzio.bsky.social!⬇️

Sample-level benchmarks could be the next generation: reusable, recombinable, and able to evaluate a wide range of capabilities!

1 year ago

This extremely ambitious project would not have been possible without @dziadzio.bsky.social @bayesiankitten.bsky.social @vishaalurao.bsky.social @samuelalbanie.bsky.social and Matthias Bethge!
Special thanks to everyone at @bethgelab.bsky.social, Bo Li, Yujie Lu and Palzer Lama for all your help!

1 year ago

In summary, we release ONEBench as a valuable tool for comprehensively evaluating foundation models and generating customised benchmarks, in the hope of sparking a restructuring of how benchmarking is done. We plan to publish the code, benchmark, and metadata for capability probing very soon.

1 year ago

Finally, as a proof of concept, we probe open-ended capabilities by defining a query pool to test and generating personalised model rankings. Expanding ONEBench can only improve the reliability and scale of these queries, and we're excited to extend this framework.
More insights like these in the paper!

1 year ago

Let's look under the hood! ONEBench comprises ONEBench-LLM and ONEBench-LMM: the largest pool of evaluation samples for foundation models (~50K for LLMs and ~600K for LMMs), spanning various domains and tasks. ONEBench will be continually expanded to accommodate more models and datasets.

1 year ago

We compare our Plackett-Luce implementation to ELO and ELO-distribution-based ranking methods, showing not only superior correlation with the aggregated mean model scores for each test set, but also extremely stable correlations under missing data points and missing measurements, even at up to 95% sparsity!

1 year ago

✅ ONEBench aggregates these rankings using the Plackett-Luce framework: an extremely efficient MLE over the aggregated probabilities of individual rankings yields an estimate of the strength (or utility) parameter of each model in a ranking.
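For intuition, here is a minimal sketch of Plackett-Luce strength estimation via the MM algorithm (Hunter, 2004); the model names and rankings are made up, and this toy code is not the paper's implementation:

```python
def plackett_luce_mle(rankings, n_iter=200, tol=1e-8):
    """Estimate Plackett-Luce strengths from best-to-worst rankings.

    Uses the MM update of Hunter (2004). Rankings may cover different
    subsets of models, mirroring a heterogeneous sample pool.
    """
    items = {i for r in rankings for i in r}
    w = {i: 1.0 for i in items}
    # wins[i]: number of "stage wins", i.e. times i is ranked above last place
    wins = {i: 0 for i in items}
    for r in rankings:
        for i in r[:-1]:
            wins[i] += 1
    for _ in range(n_iter):
        denom = {i: 0.0 for i in items}
        for r in rankings:
            for j in range(len(r) - 1):
                suffix_sum = sum(w[k] for k in r[j:])
                for i in r[j:]:
                    denom[i] += 1.0 / suffix_sum
        new_w = {i: (wins[i] / denom[i] if denom[i] > 0 else w[i])
                 for i in items}
        total = sum(new_w.values())
        new_w = {i: v / total for i, v in new_w.items()}
        if max(abs(new_w[i] - w[i]) for i in items) < tol:
            w = new_w
            break
        w = new_w
    return w


# Hypothetical per-sample preference rankings, best model first:
strengths = plackett_luce_mle([
    ["model_b", "model_a"], ["model_b", "model_a"],
    ["model_a", "model_b"], ["model_b", "model_c"],
])
```

A model that wins more of its pairwise/listwise comparisons ends up with a larger normalized strength, which is exactly the quantity used to rank models.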

1 year ago

🤔 How do we aggregate samples from different test sets, spanning different metrics?
The solution lies in converting individual model evals into ordinal measurements (A < B < C): two or more models can then be directly compared on the same data sample to obtain model preference rankings.
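To illustrate the conversion (the scores, metrics, and model names below are hypothetical):

```python
def ordinal_ranking(scores):
    """Order models on a single evaluation sample, best first.

    Any per-sample metric works as long as higher is better; only the
    resulting order is kept, not the raw values, so samples scored with
    different metrics become directly comparable.
    """
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical per-sample scores from two different metrics:
bleu_sample = {"model_a": 17.2, "model_b": 31.5, "model_c": 24.8}
preference = ordinal_ranking(bleu_sample)  # e.g. model_b > model_c > model_a
```

Once every sample yields such an ordinal ranking, the raw metric scales drop out and rankings from heterogeneous test sets can be pooled.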

1 year ago

Given a query of interest from a practitioner, we pick relevant samples from the data pool by top-k retrieval in embedding space, or by searching sample-specific metadata.
Aggregating over the relevant samples and model performance on them, we obtain the final model rankings.
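The embedding-space retrieval step can be sketched as cosine-similarity top-k search (the embedding model and the sample pool here are unspecified placeholders, not ONEBench's actual pipeline):

```python
import numpy as np


def top_k_samples(query_emb, sample_embs, k=5):
    """Return indices of the k pool samples closest to the query.

    Uses cosine similarity; `query_emb` is a 1-D vector and
    `sample_embs` holds one embedding per row.
    """
    q = query_emb / np.linalg.norm(query_emb)
    s = sample_embs / np.linalg.norm(sample_embs, axis=1, keepdims=True)
    sims = s @ q                      # cosine similarity per pool sample
    return np.argsort(-sims)[:k]      # indices of the k most similar


pool = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
hits = top_k_samples(np.array([1.0, 0.0]), pool, k=2)
```

The retrieved sample indices then feed the ranking aggregation to produce a query-specific leaderboard.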

1 year ago

❗Status quo: Benchmarking on large test sets is costly. Static benchmarks fail to use truly held-out data and can't probe the ever-evolving capabilities of LLMs/VLMs.

ONEBench mitigates this by re-structuring static benchmarks to accommodate an ever-expanding pool of datasets and models.

1 year ago

🚨Looking to test your foundation model on an arbitrary and open-ended set of capabilities, not explicitly captured by static benchmarks? 🚨

Check out ✨ONEBench✨, where we show how sample-level evaluation is the solution.

🔎 arxiv.org/abs/2412.06745

1 year ago

🚀New Paper: Active Data Curation Effectively Distills Multimodal Models
arxiv.org/abs/2411.18674

Smol models are all the rage these days & knowledge distillation (KD) is key for model compression!

We show how data curation can effectively distill to yield SoTA FLOP-efficient {C/Sig}LIPs!!
🧵👇
