Multimodal Large Language Models as Image Classifiers
Nikita Kisel, Illia Volkov, @klara-cz.bsky.social, Jiri Matas
tl;dr: if you evaluate a good model (ChatGPT) on a dirty test set (ImageNet), it looks bad. Yes, the ImageNet test set is noisy nowadays. Plus insights from the labeling process.
arxiv.org/abs/2603.065...
I am glad somebody has appreciated it!
I am not gonna lie, I tried to have my dog there at first, but even though over 100 of ImageNet's classes are dog breeds, they still somehow managed not to squeeze the Australian Shepherd in.
To study this, we introduce ReGT, a new multilabel reannotation of 625 ImageNet classes that corrects many of these issues. When evaluated on the cleaned labels, multimodal LLMs improve by up to +10.8% accuracy, substantially narrowing the gap with supervised vision models.
Work with Nikita Kisel, Illia Volkov and Jiri Matas, to be presented at #CVPR26 (Findings)!
Finally, we show that these models aren't just affected by annotation quality; they can help fix it. In a controlled verification study, annotators integrated model predictions in roughly half of the difficult cases, suggesting MLLMs can be useful tools for large-scale dataset curation.
We show that small changes in the evaluation protocol, such as the choice of distractors, output mapping, or even image order, significantly impact accuracy.
But there's a deeper issue: the data. ImageNet contains a lot of label noise, so even a perfect evaluation protocol may not give a meaningful result.
Let me introduce our new paper: Multimodal Large Language Models as Image Classifiers
Multimodal LLMs are increasingly used for visual tasks, but evaluating their image classification ability has produced conflicting conclusions.
Link: arxiv.org/html/2603.06...
He totally does, he is getting more snuggly every day
Morning walks
It also really does feel like reviewer psychology, since they did not explicitly point it out as the issue. Not being able to rerun the experiment with different framing but the same reviewers is tough :D
When you re-read the introduction of your freshly rejected paper, which was somewhat rushed before the deadline, and you go: OK, this is why.
Team 2/2 rejected, with one suggested for the findings workshop.
I am a bit sad because I feel they were rejected for the wrong reasons, and I am tired of getting BR ratings with no suggestions for rebuttal, but I am much more into ECCV than CVPR this year anyway.
Good luck with resubmission!
I feel like for the first time in my (short) reviewing career, I may have helped a (IMO of course) nice paper get accepted despite other reviewer(s).
1/n Attention, Please!
Our work "Revisiting Attentive Probing Through the Lens of Efficiency" has been accepted at #ICLR2026.
We introduce Efficient Probing (EP), a lightweight, multi-query attentive probing method for frozen encoders.
Paper + code at the end
I was starting to wonder what to do with my time now
Oh ok, that is a different level of wrong than I thought
I think most benchmarks are pretty noisy; it is just that for some (say ImageNet :)), enough people actually looked at the images and noticed.
To be fair, data annotation is HARD. I do agree people should at least try to do a better job and be responsive, of course :)
bsky.app/profile/klar...
What a beautiful day to be done with all deadlines!
This was my WFH lunch break today, if it is not clear why I do not live in Prague
25% left, a few more nice bedtime readings for me. :)
JAZZ HANDS!
I am currently at
R: Be very ready.
G: I am very ready. Be calm.
R: Am calm. You be calm.
G: NO YOU BE-
Stopping about midway through Project Hail Mary and forbidding myself to resume until I finished my CVPR reviews turned out to be pretty good motivation.
Also, if you have not read it yet but think you might enjoy it, go for it; you are in for a treat (and the movie is coming)!
I should have added it looks like this (a few lucky days a year), like today
True, not many positions come with free canistherapy (let's ignore that he's a teenager now). I hope my profile pic makes up for the regrettable omission and is self-explanatory!
Imagine this: Prague, a top CV lab, learning all the things we work on at VRG, regular cake at coffee breaks (hope you are not on a diet, but we also have a free gym on site), excellent filter coffee, and, last but not least, working with Giorgos.
There's a postdoc opening. Don't miss out!
Recently, Illia received an award for the research he has been doing with us.
Most people would think about something to buy for themselves. He donated it all to support his home.
#DoNotForget
Before, I was getting pretty good and diverse assignments (my work is a bit "all over the place"), at most one paper on topic X per conference :) But now it is three papers on my MSc topic X, and before that, I got my BSc stuff for WACV.
It accumulated and I had to vent a bit, but it is not enough to make me grumpy (yet)
I am fine reviewing a paper on it here and there, just not the whole batch like this CVPR one. But I had not thought of this, might use it next time, thanks!
The curse of doing research as an undergrad: publish one paper on topic X, then spend your entire PhD reviewing papers on X. There's a reason I changed topics.
At least one paper in my batch actually looks very interesting though :)
Tomorrow, 11 am, Hadfield Hall - come say hi to Nikita presenting!