Feel free to reach out and chat with Xinyi on July 18th in Vancouver at #ICML.
14.07.2025 08:36
@jiaangli.bsky.social
PhD student at University of Copenhagen @belongielab.org | #nlp #computervision | ELLIS student @ellis.eu | https://jiaangli.github.io/
Would you present your next NeurIPS paper in Europe instead of traveling to San Diego (US) if this were an option? Søren Hauberg (DTU) and I would love to hear the answer through this poll: (1/6)
30.03.2025 18:04
Check out our new preprint TensorGRaD.
We use a robust decomposition of the gradient tensors into low-rank + sparse parts to reduce optimizer memory for Neural Operators by up to ππ%, while matching the performance of Adam, even on turbulent Navier-Stokes (Re 10e5).
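The post is light on mechanics, but the low-rank + sparse split it mentions can be illustrated with a minimal sketch: take a matricized gradient, keep a truncated-SVD low-rank part, and hard-threshold the residual into a sparse part. This is a generic illustration under my own assumptions (the `rank` and `tau` knobs are invented), not the TensorGRaD algorithm:

```python
import numpy as np

def lowrank_plus_sparse(grad, rank=8, tau=0.1):
    """Split a (matricized) gradient into a low-rank part plus a sparse residual.

    Generic robust-decomposition illustration; NOT the TensorGRaD algorithm.
    `rank` and `tau` are illustrative knobs.
    """
    # Low-rank component via truncated SVD.
    U, S, Vt = np.linalg.svd(grad, full_matrices=False)
    low_rank = (U[:, :rank] * S[:rank]) @ Vt[:rank, :]

    # Sparse component: keep only the large entries of the residual.
    residual = grad - low_rank
    sparse = np.where(np.abs(residual) > tau, residual, 0.0)

    # An optimizer could then track the truncated factors and the few nonzero
    # residual entries instead of dense gradient statistics.
    return (U[:, :rank], S[:rank], Vt[:rank, :]), sparse

g = np.random.randn(256, 256)          # stand-in for a matricized gradient tensor
factors, sparse_part = lowrank_plus_sparse(g)
```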
PhD student Jiaang Li and his collaborators offer insights into the cultural understanding of vision-language models.
02.06.2025 18:12
Paper title: "Cultural Evaluations of Vision-Language Models Have a Lot to Learn from Cultural Theory"
I am excited to announce our latest work, "Cultural Evaluations of Vision-Language Models Have a Lot to Learn from Cultural Theory". We review recent works on culture in VLMs and argue for deeper grounding in cultural theory to enable more inclusive evaluations.
Paper: arxiv.org/pdf/2505.22793
Great collaboration with @yfyuan01.bsky.social @wenyan62.bsky.social @aliannejadi.bsky.social @danielhers.bsky.social, Anders Søgaard, Ivan Vulić, Wenxuan Zhang, Paul Liang, Yang Deng, @serge.belongie.com
23.05.2025 17:04
More here:
Project Page: jiaangli.github.io/RAVENEA/
Code: github.com/yfyuan01/RAV...
Dataset: huggingface.co/datasets/jaa...
Our experiments demonstrate that even lightweight VLMs, when augmented with culturally relevant retrievals, outperform their non-augmented counterparts and even surpass the next larger model tier, achieving at least a 3.2% improvement in cVQA and 6.2% in cIC.
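For readers wondering what "augmented with culturally relevant retrievals" looks like in practice, here is a minimal sketch of the generic recipe: embed the query, retrieve the top-k most similar Wikipedia passages, and prepend them to the VLM prompt. The function names and the cosine-similarity retrieval over precomputed embeddings are my own assumptions, not the RAVENEA implementation:

```python
import numpy as np

def retrieve_top_k(query_emb, doc_embs, docs, k=3):
    """Rank candidate Wikipedia passages by cosine similarity to a query embedding."""
    sims = doc_embs @ query_emb
    sims /= (np.linalg.norm(doc_embs, axis=1) * np.linalg.norm(query_emb) + 1e-9)
    return [docs[i] for i in np.argsort(-sims)[:k]]

def build_prompt(question, retrieved):
    """Prepend culturally relevant context to the question before querying the VLM."""
    context = "\n".join(f"- {doc}" for doc in retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

# `query_emb` / `doc_embs` would come from a multimodal encoder (e.g. a CLIP-style model),
# and the resulting prompt would be passed to the VLM together with the image.
```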
23.05.2025 17:04
Culture-Aware Contrastive Learning
We propose Culture-aware Contrastive (CAC) Learning, a supervised learning framework compatible with both CLIP and SigLIP architectures. Fine-tuning with CAC can help models better capture culturally significant content.
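As intuition for what a culture-aware contrastive objective could look like, here is a generic supervised-contrastive sketch in which image-text pairs sharing a culture label count as extra positives. The `culture_ids` argument and the temperature value are my own assumptions; this is not the paper's exact CAC loss:

```python
import torch
import torch.nn.functional as F

def culture_contrastive_loss(img_emb, txt_emb, culture_ids, temperature=0.07):
    """Supervised contrastive loss where same-culture pairs count as extra positives.

    Generic sketch only; `culture_ids` (one label per example) and the temperature
    are illustrative, and this is not the paper's exact CAC objective.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature                    # (B, B)

    # Positives: the matched image-text pair (diagonal) plus same-culture pairs.
    same_culture = culture_ids.unsqueeze(0) == culture_ids.unsqueeze(1)
    eye = torch.eye(len(culture_ids), dtype=torch.bool, device=logits.device)
    pos_mask = same_culture | eye

    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    loss = -(log_prob * pos_mask).sum(dim=1) / pos_mask.sum(dim=1)
    return loss.mean()
```

Fine-tuning a CLIP- or SigLIP-style dual encoder with a loss of this shape on culture-labeled image-text pairs is, roughly, the setup the post describes.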
Dataset Construction
RAVENEA integrates 1,800+ images, 2,000+ culture-related questions, 500+ human captions, and 10,000+ human-ranked Wikipedia documents to support two key tasks:
- Culture-focused Visual Question Answering (cVQA)
- Culture-informed Image Captioning (cIC)
New Preprint
Can Multimodal Retrieval Enhance Cultural Awareness in Vision-Language Models?
Excited to introduce RAVENEA, a new benchmark aimed at evaluating cultural understanding in VLMs through RAG.
arxiv.org/abs/2505.14462
More details:
Super cool! Incidentally, in our previous project, we also found that linear alignment between embedding spaces from two modalities is viable, and the alignment improves as LLMs scale.
bsky.app/profile/jiaa...
I won't be attending #ICLR in person this year. But feel free to check out our paper "Revisiting the Othello World Model Hypothesis" with Anders Søgaard, accepted at the ICLR World Models Workshop!
Paper link: arxiv.org/abs/2503.04421
Thrilled to announce "Multimodality Helps Few-shot 3D Point Cloud Semantic Segmentation" is accepted as a Spotlight (5%) at #ICLR2025!
Our model MM-FSS leverages 3D, 2D, and text modalities for robust few-shot 3D segmentation, all without extra labeling cost.
arxiv.org/pdf/2410.22489
More details:
Forget just thinking in words.
Our New Preprint:
New Era of Multimodal Reasoning
Imagine While Reasoning in Space with MVoT
Multimodal Visualization-of-Thought (MVoT) revolutionizes reasoning by generating visual "thoughts" that transform how AI thinks, reasons, and explains itself.
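The post itself stays high level, but the "visual thoughts" idea can be pictured as an interleaved generation loop: at each step the model appends either a text rationale or a generated image of its intermediate spatial state to the reasoning trace. Everything below, including the `InterleavedModel` interface and its `next_step` method, is a hypothetical sketch for illustration, not the MVoT implementation:

```python
from typing import Protocol, Union
from PIL import Image

class InterleavedModel(Protocol):
    """Hypothetical interface: continues a mixed text/image reasoning trace."""
    def next_step(self, trace: list) -> Union[str, Image.Image]: ...

def reason_with_visual_thoughts(model: InterleavedModel, image: Image.Image,
                                question: str, max_steps: int = 8) -> str:
    # The trace interleaves text rationales with generated visual "thoughts".
    trace: list = [image, question]
    for _ in range(max_steps):
        step = model.next_step(trace)
        trace.append(step)                  # keep text and images in one trace
        if isinstance(step, str) and step.startswith("Answer:"):
            return step
    return "Answer: (no answer within the step budget)"
```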
FGVC12 Workshop is coming to #CVPR 2025 in Nashville!
Are you working on fine-grained visual problems?
This year we have two peer-reviewed paper tracks:
i) 8-page CVPR Workshop proceedings
ii) 4-page non-archival extended abstracts
CALL FOR PAPERS: sites.google.com/view/fgvc12/...
Here's a short film produced by the Danish Royal Academy of Sciences, showcasing the WineSensed project of Þóranna Bender et al.: thoranna.github.io/learning_to_...
30.12.2024 11:05
From San Diego to New York to Copenhagen, wishing you Happy Holidays!
21.12.2024 11:20
With @neuripsconf.bsky.social right around the corner, we're excited to be presenting our work soon! Here's an overview
(1/5)
Here's a starter pack with members of our lab who have joined Bluesky.
25.11.2024 10:42
No one can explain stochastic gradient descent better than this panda.
24.11.2024 15:04
Great collaboration with @constanzafierro.bsky.social, @YovaKem_v2, and Anders Søgaard!
Code: github.com/jiaangli/VLCA
Paper: direct.mit.edu/tacl/article...
Takeaways:
1. Representation spaces of LMs and VMs become increasingly, though only partially, similar as model size grows.
2. Concepts with lower frequency, polysemy, and dispersion can be easier to align.
3. Shared concepts between LMs and VMs might extend beyond nouns.
🧵(7/8)
#NLP #NLProc
We then discuss the implications of our findings:
- the LM understanding debate
- the study of emergent properties
- philosophy
🧵(6/8)
We also measure how well the mapping generalizes to other parts of speech (POS) and explore the impact of the training data size. To investigate the effect of incorporating text signals during vision pretraining, we compare pure vision models against selected CLIP vision encoders.
🧵(5/8)
What factors influence the convergence?
Our experiments show that the alignability of LMs and vision models is sensitive to image and language dispersion, polysemy, and frequency (see the dispersion sketch below).
🧵(4/8)
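For concreteness, one plausible way to quantify the "dispersion" of a concept is the mean pairwise cosine distance among its embeddings; the definition below is an assumption made for illustration, not necessarily the exact metric used in the paper:

```python
import numpy as np

def dispersion(embs):
    """Mean pairwise cosine distance among one concept's embeddings.

    Illustrative definition (an assumption, not necessarily the paper's metric):
    low dispersion means the concept's images or contexts embed tightly together.
    """
    X = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sims = X @ X.T
    iu = np.triu_indices(len(X), k=1)            # each unordered pair once
    return float(np.mean(1.0 - sims[iu]))
```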
The results show a clear trend:
LMs converge toward the geometry of visual models as they grow bigger and better.
🧵(3/8)
Mapping vector spaces:
We measure the alignment between vision models and LMs by mapping their vector spaces and evaluating retrieval precision on held-out data (see the sketch below).
🧵(2/8)
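As a rough sketch of what such a mapping and evaluation can look like: fit a linear map from LM embeddings to vision-model embeddings on a training split and score retrieval precision@k on held-out concepts. Ridge regression and the specific function names here are my own assumptions, not necessarily the paper's procedure:

```python
import numpy as np

def fit_linear_map(X_lm, Y_vm, reg=1e-2):
    """Ridge-regression map W so that X_lm @ W approximates Y_vm on the train split."""
    d = X_lm.shape[1]
    return np.linalg.solve(X_lm.T @ X_lm + reg * np.eye(d), X_lm.T @ Y_vm)

def precision_at_k(X_lm_test, Y_vm_test, W, k=1):
    """Fraction of held-out concepts whose mapped LM vector retrieves its own image vector in the top k."""
    pred = X_lm_test @ W
    pred /= np.linalg.norm(pred, axis=1, keepdims=True)
    tgt = Y_vm_test / np.linalg.norm(Y_vm_test, axis=1, keepdims=True)
    ranks = (-(pred @ tgt.T)).argsort(axis=1)[:, :k]     # top-k most similar targets
    return float(np.mean([i in ranks[i] for i in range(len(ranks))]))
```

Under this setup, higher precision@k for larger LMs would be the concrete signature of the convergence trend described in this thread.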
Do Vision and Language Models Share Concepts?
We present an empirical evaluation and find that language models partially converge towards representations isomorphic to those of vision models. #EMNLP
Paper: direct.mit.edu/tacl/article...