Cultural awareness is trickier. Different data for different cultures means we can't really compare performance across cultures in a straightforward way. And there's no clear optimization target for cultural awareness beyond curating diverse training data.
21.10.2025 13:30 — 👍 1 🔁 0 💬 0 📌 0
☝️🧵 Most current approaches emphasize language neutrality: about two-thirds of VL benchmarks use translation-based evaluation. This makes sense because we can explicitly train for language neutrality when we have parallel data. But... 🧵👇
21.10.2025 13:30 — 👍 0 🔁 0 💬 1 📌 0
With @andrei-a-manea.bsky.social, we posted a survey on multilingual vision-language models 👉 arxiv.org/pdf/2509.22123
We reviewed 31 models+21 benchmarks. There's a tension between language neutrality (same results across languages) & cultural awareness (context matters differently across cultures)
21.10.2025 13:30 — 👍 2 🔁 1 💬 1 📌 0
Most vision-language models only work in English. We explore how different parallel data types (machine-translated vs authentic captions) affect cross-lingual transfer. Key finding: authentic data can outperform machine translation, and multilingual training beats bilingual approaches. #NLP
01.09.2025 15:38 — 👍 2 🔁 0 💬 0 📌 0
So proud of my PhD student @andrei-a-manea.bsky.social for his first first-author publication! 🎉 He presented this work last week at TSD. Investigating the Effect of Parallel Data in the Cross-Lingual Transfer for Vision-Language Encoders arxiv.org/pdf/2504.21681
01.09.2025 15:38 — 👍 6 🔁 0 💬 1 📌 0
For evaluation researchers: Simple string-overlap metrics (BLEU, chrF) work surprisingly well for factual QA. 🤔 When answers are mostly named entities, exact matches matter more than we thought.
LLM-as-judge 🦙🧑‍⚖️ correlates best with human judgment, though.
25.08.2025 08:06 — 👍 1 🔁 0 💬 1 📌 0
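To make the string-overlap idea concrete, here is a minimal sketch of a character n-gram F-score in the spirit of chrF. It is a simplified illustration, not the metric implementation used in the paper: the function names, the n-gram range (1 to 3), and the whitespace handling are all assumptions chosen for brevity.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of length n, ignoring whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=3, beta=2.0):
    """Simplified chrF-style score (hypothetical helper, not a library API).

    Precision and recall are averaged over n-gram orders 1..max_n,
    then combined into an F-beta score (beta=2 weights recall higher).
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one of the strings is shorter than n characters
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)
```

With short named-entity answers, a score like this behaves almost like an exact-match check: identical strings score 1.0, strings with no shared characters score 0.0, and spelling variants such as "Prague" vs "Praha" land in between.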
The results are... humbling 😅
Even the best models:
>40% accuracy on textual questions
<30% on visual questions
Often perform better in English than the local language (!!)
Visual QA with regional images is especially challenging.
25.08.2025 08:06 — 👍 0 🔁 0 💬 1 📌 0
The problem: Most QA benchmarks focus on globally known facts. But real users ask about local geography, culture, and history.
We collected questions from native speakers in Czechia 🇨🇿, Slovakia 🇸🇰, and Ukraine 🇺🇦 about facts locals know but outsiders don't.
25.08.2025 08:06 — 👍 0 🔁 0 💬 1 📌 0
ufal/cus-qa · Datasets at Hugging Face
🧵 We're releasing CUS-QA - a new benchmark for testing LLMs on regional knowledge!
Find out what your model knows about Czechia 🇨🇿, Slovakia 🇸🇰, and Ukraine 🇺🇦!
👉 Textual and visual questions, answers, and human judgment on model outputs!
huggingface.co/datasets/ufa...
www.arxiv.org/abs/2507.22752
25.08.2025 08:06 — 👍 16 🔁 3 💬 1 📌 2
Stay tuned, we will release the dataset soon...
01.08.2025 16:49 — 👍 2 🔁 0 💬 0 📌 0
We need to have poster fights at the end of every conference.
29.07.2025 19:01 — 👍 3 🔁 1 💬 0 📌 0
Just presented MAGBIG, a new dataset and evaluation methodology for gender bias in multilingual text-to-image generation. Grammatical gender matters when studying these biases across languages!
Thanks to Felix Friedrich, @kathaem.bsky.social and all co-authors - it was fun to work on this together!
28.07.2025 13:14 — 👍 2 🔁 0 💬 0 📌 0
This week I am at #ACL2025NLP in Vienna 🎡🇦🇹. Find me 🕵️ or message 💌 me if you want to chat about multilinguality or tokenization. Stop 🛑 by our poster on gender bias in text-to-image generation on Monday aclanthology.org/2025.acl-lon...
27.07.2025 07:24 — 👍 8 🔁 0 💬 0 📌 0
TokShop 2025
Registering interest in all things tokenization at TokShop @ ICML 2025 (July 18)
Consider joining the Google group for future updates!
https://groups.google.com/g/tokshop
TokShop @ #ICML2025 got way more submissions than expected! 📈 We could really use a few more reviewers to help out. If you have the capacity to review a #tokenization paper by Saturday, please fill out this form: forms.gle/32A6sQHQrMSb... 🙏
02.06.2025 16:40 — 👍 0 🔁 4 💬 0 📌 2
ICML 2025 Workshop TokShop
📣 Call for Paper Alert: TokShop @ ICML 2025
TokShop explores tokenization across all data modalities. Topics include: subword NLP techniques, multimodal approaches, multilingual challenges, post-training modification, alternative representations, and statistical perspectives.
14.05.2025 13:31 — 👍 18 🔁 12 💬 1 📌 2
Tokenization Workshop @ ICML 2025
Got a tokenization paper that just didn't make the cut for ICML? Submit it to the Tokenization Workshop TokShop at #ICML2025 -- we'd love to see it there!
tokenization-workshop.github.io
04.05.2025 19:27 — 👍 8 🔁 6 💬 0 📌 0
Attending #NAACL2025 virtually. Since 2022, I've been training a classifier on papers I read to tackle the arXiv madness. Ran it on the NAACL proceedings for my personalized watch list. 🤓📺 However, it's far from perfect: Multilingual cultural awareness is great, but where is tokenization? 🤷
30.04.2025 12:50 — 👍 2 🔁 0 💬 2 📌 0
We're organizing the ✨Tokenization Workshop✨ TokShop❗ Join us at @icmlconf.bsky.social in July in Vancouver 🇨🇦. Follow @tokshop.bsky.social for updates! Submit your paper by May 30.
15.04.2025 17:37 — 👍 4 🔁 0 💬 0 📌 0
Random take on the #TuringTest: Rather than testing machine intelligence, it can be a measure of societal awareness about #AI capabilities. The real objective isn't creating a machine that passes but educating people to think critically and avoid being deceived, so the machines do not pass the test.
04.04.2025 19:37 — 👍 4 🔁 0 💬 0 📌 0
Our paper 'Beyond Literal Token Overlap: Token Alignability for Multilinguality' will be at #NAACL2025! We show that token alignability is a stronger predictor of cross-lingual transfer than literal token overlap.
Read it here: arxiv.org/abs/2502.06468
10.03.2025 15:48 — 👍 6 🔁 1 💬 0 📌 1
Welcome to SemEval-2025 Task-3 — Mu-SHROOM, the Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes
Join Mu-SHROOM 🍄, a SemEval 2025 shared task on detecting hallucination spans in multilingual LLM outputs! 🌍 Includes Czech with regional Czech questions 🇨🇿. Do you think you can spot when something isn’t true? 🤔 Try it out! 👉 helsinki-nlp.github.io/shroom #SemEval2025 #NLP
14.01.2025 15:56 — 👍 4 🔁 0 💬 0 📌 1
Happy holidays! 🎄🎅🤩🎁
24.12.2024 13:36 — 👍 9 🔁 0 💬 0 📌 0
This is going to be fun! 🤓 We have three years to spend 6.5M CZK on improving multilingual tokenization. The goal is to make subwords more alignable across languages and help languages that suffer from over-segmentation with current models.
03.12.2024 17:53 — 👍 11 🔁 1 💬 2 📌 0
🙋‍♂️👋
19.11.2024 21:02 — 👍 1 🔁 0 💬 0 📌 0
🙋‍♂️👋
19.11.2024 19:13 — 👍 0 🔁 0 💬 1 📌 0