Excited to be in Vienna for #ACL2025 🇦🇹! You'll find @dziadzio.bsky.social and me by our ONEBench poster, so do drop by!
🗓️ Wed, July 30, 11:00–12:30 CET
📍 Hall 4/5
I'm also excited to talk about lifelong and personalised benchmarking, data curation, and vision-language in general! Let's connect!
27.07.2025 22:26
Stumbled upon this blog post recently and found some very useful tips to improve the Bluesky experience. This seemed almost tailored to me - I don't live in the USA and the politics there don't affect me personally. Settings -> Moderation -> Muted Words & Tags cleaned up my feed - strongly recommend!
25.06.2025 16:14
Why More Researchers Should be Content Creators
Just trying something new! I recorded one of my recent talks, sharing what I learned from starting as a small content creator.
youtu.be/0W_7tJtGcMI
We all benefit when there are more content creators!
24.06.2025 21:58
I'm in Nashville this week attending #CVPR2025. Excited to discuss post-training VLMs and diffusion models!
11.06.2025 03:04
🧵 1/10 Excited to share our #SIGGRAPH paper "MonetGPT: Solving Puzzles Enhances MLLMs' Image Retouching Skills"
We explore how to make MLLMs operation-aware by solving visual puzzles, and propose a procedural framework for image retouching.
#MLLM
27.05.2025 15:13
ONEBench accepted to ACL main! ✨
Stay tuned for the official leaderboard and real-time personalised benchmarking release!
If you're attending ACL or are generally interested in the future of foundation model benchmarking, I'm happy to talk!
#ACL2025NLP #ACL2025
@aclmeeting.bsky.social
17.05.2025 19:52
🧠 Keeping LLMs factually up to date is a common motivation for knowledge editing.
But what would it actually take to support this in practice at the scale and speed the real world demands?
We explore this question and really push the limits of lifelong knowledge editing in the wild.
08.04.2025 15:31
Check out our newest paper!
As always, it was super fun working on this with @prasannamayil.bsky.social
18.02.2025 14:12
🚨Great Models Think Alike and this Undermines AI Oversight🚨
New paper quantifies LM similarity
(1) LLM-as-a-judge favors more similar models 🤥
(2) Complementary knowledge benefits Weak-to-Strong Generalization ⏯️
(3) More capable models have more correlated failures
🧵👇
07.02.2025 21:12
Godsend
07.02.2025 16:38
Fuck it, today we're open-sourcing the codebase used to train SmolVLM from scratch on 256 H100s 🔥
Inspired by our team's effort to open-source DeepSeek's R1, we are releasing the training and evaluation code on top of the weights 🫡
Now you can train any SmolVLM, or create your own custom VLMs!
31.01.2025 15:06
Added you!
27.01.2025 11:38
NLI Improves Compositionality in Vision-Language Models is accepted to #ICLR2025!
CECE enables interpretability and achieves significant improvements on hard compositional benchmarks (e.g., Winoground, EqBen) and alignment benchmarks (e.g., DrawBench, EditBench), all without fine-tuning. More info: cece-vlm.github.io
23.01.2025 18:34
I feel like my "following" and "popular with friends" feeds are well tuned, as I have complete control over them. It's just that people still post less on Bluesky and are more active on Twitter. Once that changes (and I think it will), we'll have the same experience as on Twitter right now.
12.01.2025 23:34
New Paper: "How to Merge Your Multimodal Models Over Time?"
arxiv.org/abs/2412.06712
Model merging assumes all finetuned models are available at once. But what if they need to be created over time?
We study Temporal Model Merging through the TIME framework to find out!
🧵
11.12.2024 18:00
Added you!
11.12.2024 23:55
Sure!
10.12.2024 21:27
Welcome, stranger
10.12.2024 21:25
How do we benchmark the vast capabilities of foundation models? Introducing ONEBench β a unifying benchmark to test them all, led by
@adhirajghosh.bsky.social and
@dziadzio.bsky.social! ⬇️
Sample-level benchmarks could be the new generation: reusable, recombinable, and able to evaluate a wide range of capabilities!
10.12.2024 18:39
This extremely ambitious project would not have been possible without @dziadzio.bsky.social @bayesiankitten.bsky.social @vishaalurao.bsky.social @samuelalbanie.bsky.social and Matthias Bethge!
Special thanks to everyone at @bethgelab.bsky.social, Bo Li, Yujie Lu and Palzer Lama for all your help!
10.12.2024 17:52
In summary, we release ONEBench as a valuable tool for comprehensively evaluating foundation models and generating customised benchmarks, in the hope of sparking a restructuring of how benchmarking is done. We plan to publish the code, benchmark, and capability-probing metadata very soon.
10.12.2024 17:51
Finally, we probe open-ended capabilities by defining a query pool as a proof of concept and generating personalised model rankings. Expanding ONEBench can only improve the reliability and scale of these queries, and we're excited to extend this framework.
More insights like these in the paper!
10.12.2024 17:50
Let's look under the hood! ONEBench comprises ONEBench-LLM and ONEBench-LMM: the largest pool of evaluation samples for foundation models (~50K for LLMs and ~600K for LMMs), spanning various domains and tasks. ONEBench will be continually expanded to accommodate more models and datasets.
10.12.2024 17:49
We compare our Plackett-Luce implementation to ELO and ELO-distribution-based ranking methods: it not only shows superior correlation to the aggregated mean model scores for each test set, but also remains extremely stable under missing data points and missing measurements, even at up to 95% sparsity!
10.12.2024 17:49
ONEBench takes these rankings and aggregates them using the Plackett-Luce framework: an extremely efficient maximum-likelihood estimate over the aggregated probabilities of individual rankings yields an estimate of each model's strength (or utility) parameter.
10.12.2024 17:48
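The aggregation step this thread describes can be sketched in a few lines. This is a toy illustration, not the ONEBench code: `plackett_luce` is a made-up name, and plain gradient ascent stands in for whatever MLE routine the paper actually uses.

```python
import math
from collections import defaultdict

def plackett_luce(rankings, iters=1000, lr=0.2):
    """Fit per-model log-strengths by gradient ascent on the
    Plackett-Luce log-likelihood of the observed rankings.

    Each ranking is a list of model names from best to worst.
    """
    models = {m for r in rankings for m in r}
    s = {m: 0.0 for m in models}  # log-strength per model
    n = len(rankings)
    for _ in range(iters):
        grad = defaultdict(float)
        for r in rankings:
            # A ranking is a sequence of "picks": at each position i the
            # winner r[i] is chosen from the remaining tail via a softmax.
            for i in range(len(r) - 1):  # last pick is deterministic
                tail = r[i:]
                z = sum(math.exp(s[m]) for m in tail)
                grad[r[i]] += 1.0  # winner of this pick
                for m in tail:     # softmax pull-down on the tail
                    grad[m] -= math.exp(s[m]) / z
        for m in models:
            s[m] += lr * grad[m] / n  # averaged gradient step
    return s
```

The returned log-strengths are identifiable only up to an additive constant, so only their ordering and differences are meaningful; that is all the thread's model rankings need.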
🤔 How do we aggregate samples from different test sets, spanning different metrics?
The solution lies in converting individual model evaluations into ordinal measurements (A < B < C): two or more models can then be directly compared on the same data sample to obtain model preference rankings.
10.12.2024 17:47
Given a query of interest from a practitioner, we pick relevant samples from the data pool via top-k retrieval in embedding space, or by searching sample-specific metadata.
Aggregating over the relevant samples and model performance on them, we obtain the final model rankings.
10.12.2024 17:45
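The retrieval step above can be sketched as a minimal cosine-similarity top-k over precomputed sample embeddings. `top_k` and the toy sample names are illustrative, not the actual ONEBench retrieval stack.

```python
import math

def top_k(query_vec, sample_vecs, k=3):
    """Return the k pool samples most cosine-similar to the query embedding.

    sample_vecs maps a sample id to its embedding vector.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm

    ranked = sorted(sample_vecs,
                    key=lambda name: cosine(query_vec, sample_vecs[name]),
                    reverse=True)
    return ranked[:k]
```

In practice the query would be embedded with the same encoder as the pool; the retrieved sample ids are then fed to the ranking aggregation to produce a query-specific leaderboard.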
Status quo: benchmarking on large test sets is costly. Static benchmarks fail to use truly held-out data and also can't probe the ever-evolving capabilities of LLMs/VLMs.
ONEBench mitigates this by re-structuring static benchmarks to accommodate an ever-expanding pool of datasets and models.
10.12.2024 17:45
🚨 Looking to test your foundation model on an arbitrary, open-ended set of capabilities not explicitly captured by static benchmarks? 🚨
Check out ✨ONEBench✨, where we show how sample-level evaluation is the solution.
arxiv.org/abs/2412.06745
10.12.2024 17:44
PhD student at the University of Amsterdam working on vision-language models and cognitive computational neuroscience
Assistant Professor at @cs.ubc.ca and @vectorinstitute.ai working on Natural Language Processing. Book: https://lostinautomatictranslation.com/
Waiting on a robot body. All opinions are universal and held by both employers and family.
Recruiting students to start my lab!
ML/NLP/they/she.
(jolly good) Fellow at the Kempner Institute @kempnerinstitute.bsky.social, incoming assistant professor at UBC Linguistics (and by courtesy CS, Sept 2025). PhD @stanfordnlp.bsky.social with the lovely @jurafsky.bsky.social
isabelpapad.com
PhD Candidate at the Max Planck ETH Center for Learning Systems working on 3D Computer Vision.
https://wimmerth.github.io
DH Prof @URichmond. Exploring computer vision and visual culture. Ideas for the Association for Computers & the Humanities @ach.bsky.social and Computational Humanities Research Journal? Please share!
AI researcher at Google DeepMind. Synthesized views are my own.
📍 SF Bay Area · http://jonbarron.info
This feed is a partial mirror of https://twitter.com/jon_barron
PhD student at the University of Tuebingen. Computer vision, video understanding, multimodal learning.
https://ninatu.github.io/
Postdoc at Utrecht University, previously PhD candidate at the University of Amsterdam
Multimodal NLP, Vision and Language, Cognitively Inspired NLP
https://ecekt.github.io/
LLMs together (co-created model merging, BabyLM, textArena.ai)
Spreading science over hype in #ML & #NLP
Proud shareLM Donor
@IBMResearch & @MIT_CSAIL
Research Scientist GoogleDeepMind
Ex @UniofOxford, AIatMeta, GoogleAI
CS PhD student at UCSB | Trustworthy AI/FL/HAI | DAOlivia co-founder | Building for this universe
Research scientist at Google DeepMind.
She/her.
http://www.aidanematzadeh.me/
Incoming assistant professor at JHU CS
PhD at UNC | Bloomberg PhD Fellow
Prev: Google, Microsoft, Adobe, AI2, SNU
https://j-min.io
#multimodal #nlp
PhD student @ CMU LTI. efficiency/data in NLP/ML
Postdoc @UW, Prev. @UMich, Ph.D @PSU, Research Intern @GoogleAI, @AmazonScience. https://hua-shen.org
Stanford Professor of Linguistics and, by courtesy, of Computer Science, and member of @stanfordnlp.bsky.social and The Stanford AI Lab. He/Him/His. https://web.stanford.edu/~cgpotts/
Research Scientist at Google DeepMind
https://e-bug.github.io
Research Scientist at Ai2, PhD in NLP (UofA). Ex
GoogleDeepMind, MSFTResearch, MilaQuebec
https://nouhadziri.github.io/