Solved: robustness to paraphrasing and false premises, OCR, world-knowledge based reasoning.
Open: spatial reasoning, data-efficiency, learning compatible representations.
@dhruvbatra.bsky.social
Co-founder & Chief Scientist at Yutori. Prev: Senior Director leading FAIR Embodied AI at Meta, and Professor at Georgia Tech.
As part of the award ceremony, the VQA team presented a recap of vision-and-language research over the last decade: solved problems, progress, and open challenges for multimodal LLMs.
23.10.2025 17:17
Lots to be done. Thank you to all our collaborators and the research community for this recognition!
21.10.2025 19:27
Fun-fact: the T-shirt I'm wearing is an inside joke about the quality of 2015 models.
However, every few years we rediscover the lesson that on difficult tasks, VLMs silently regress to being nearly blind.
x.com/DhruvBatra_/...
VQA challenge series won the Mark Everingham prize at #ICCV2025 for stimulating a new strand of vision-and-language research.
It's extra special because ICCV25 marks the 10-year anniversary of the VQA paper.
When we started, the idea of answering any question about any image seemed outlandish.
Anything by Ted Chiang
20.10.2025 03:52
I dunno man, Dagger is cool.
20.10.2025 03:51
The problem with "AI slop" isn't the AI - it's the slop.
People act like AI is the issue, when it's actually part of the fix.
If we're honest: most of what we make, most of the time, is slop by our own standards.
That's the generator-discriminator gap in creative work that Ira Glass talks about.
Somebody is a fan of Abundance
10.06.2025 05:33
It is so refreshing to see conferences innovate on the reviewing model and run actual experiments (!) as opposed to fighting change.
16.04.2025 04:43
Good. Autonomous interface locomotion is the fundamental robotics problem of our time. The more the merrier.
01.04.2025 17:12
My entire robotics career has led to this.
01.04.2025 16:05
The answer to many "why X?" questions:
Because the laws of physics do not prohibit X and the forces of biology gave us curiosity.
The web is the ultimate boss-level for agents - dynamic, non-deterministic, and noisy; some mistakes are inevitable and so far, every agent fails eventually.
Yutori is building superhuman agents for this ultimate digital environment.
Join our waitlist for early access to our product!
yutori.com
Imagine a world where no human has to directly interact with the web again.
Where teams of AI assistants coordinate to book flights, manage budgets, or file paperwork, proactively surfacing insights and correcting errors.
Only problem - no one knows how to build AI agents that actually work.
I started something new last year with a wonderful group of people. We showed a demo in Jan.
Today, we're telling our story - show before you talk!
We are re-imagining how people interact with the web - one of humanity's greatest inventions and a mess overdue for an overhaul.
yutori.com
Ah, understood. No idea about the tracing of that meme.
23.03.2025 15:27
Seems like the ultimate thing to rally around, no? To the extent there is any purpose, what's the alternative?
23.03.2025 02:28
I'm already there for low-stakes queries.
23.03.2025 01:02
Where's the skepticism coming from? Now that web search and citations are in there, isn't it easy to verify and thus become more confident?
23.03.2025 00:59
Excited to announce our upcoming workshop - Vision Language Models For All: Building Geo-Diverse and Culturally Aware Vision-Language Models (VLMs-4-All) @CVPR 2025!
sites.google.com/view/vlms4all
Using a locally-running LLM to translate a review is explicitly prohibited by @iccv.bsky.social
Why? Whom does this possibly harm?
The way it's always been done isn't handling the current scale well (as evidenced by the feedback from the authors). Yes, outsource to a company, pay for creation of new tools, start new companies, all of the standard ways of addressing a growing market.
26.02.2025 15:52
Why is it volunteer work? Why doesn't an organization that takes in millions in sponsorship professionalize?
26.02.2025 15:46
Some of us did :)
21.12.2024 16:51
It's not just about how accurate the laws are, but also how robust their predictions are under uncertainty.
Intuitive physics operates directly from pixels without knowing precise masses, coefficients of friction, restitution, etc. Physics engines make heavy demands and "explode" when things are off.
Agreed on that comparison.
But one likely learns more about intuitive physics from watching billiard balls collide than by reading the wiki page.
Text is likely more information rich on average. My point is that we are not running out of other sources of information for learning about the world.
Fair, but text is not all of intelligence
14.12.2024 22:35
If it works, it's a good solution :)
14.12.2024 22:34
Brilliant talk by Ilya, but he's wrong on one point.
We are NOT running out of data. We are running out of human-written text.
We have more videos than we know what to do with. We just haven't solved pre-training in vision.
Just go out and sense the world. Data is easy.