Introducing BabyBabelLM: A Multilingual Benchmark of Developmentally Plausible Training Data!
LLMs learn from vastly more data than humans ever experience. BabyLM challenges this paradigm by focusing on developmentally plausible data.
We extend this effort to 45 new languages!
15.10.2025 10:53 · 👍 43 🔁 16 💬 1 📌 3
Do you really want to see what multilingual effort looks like? 🇨🇳🇮🇩🇸🇪
Here's the proof! BabyBabelLM is the first Multilingual Benchmark of Developmentally Plausible Training Data, available for 45 languages to the NLP community (see the loading sketch below).
arxiv.org/abs/2510.10159
14.10.2025 17:01 · 👍 42 🔁 16 💬 2 📌 1
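For anyone who wants to poke at the data right away: a minimal sketch of loading and inspecting one language's training split, assuming the corpus is mirrored on the Hugging Face Hub. The dataset id "babylm/babybabellm", the "nld" config, and the "text" column are hypothetical placeholders; check the paper and repo for the real coordinates.

```python
# Hypothetical sketch: inspect one BabyBabelLM language split.
# "babylm/babybabellm" and the "nld" config are placeholder names,
# not confirmed identifiers from the paper.
from datasets import load_dataset

ds = load_dataset("babylm/babybabellm", "nld", split="train")
print(ds)                    # number of rows and column names
print(ds[0]["text"][:200])   # peek at the first document

# Rough size check against a developmentally plausible word budget
n_words = sum(len(row["text"].split()) for row in ds)
print(f"~{n_words / 1e6:.1f}M whitespace-delimited words")
```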
Today our poster will be up at @loreslm.bsky.social Poster Session #2 (2-3pm local time Abu Dhabi).
It's also available online at Whova: whova.com/portal/webap...
20.01.2025 06:43 · 👍 1 🔁 0 💬 0 📌 1
This work was carried out by three great UCT CS Honours students - Alexis, Charl, and Hishaam.
14.01.2025 07:11 · 👍 0 🔁 0 💬 0 📌 0
This work unites two directions of research: cognitively plausible modelling and NLP for low-resource languages. We hope more researchers pursue work at the intersection of these two subfields, since they share the goal of improving data-efficiency in the era of scaling.
14.01.2025 07:11 · 👍 0 🔁 0 💬 1 📌 0
However, unlike in the original BabyLM challenge, our isiXhosa BabyLMs do not outperform all skylines. We attribute this to a lack of developmentally plausible isiXhosa data. The success of English BabyLMs is due to both modelling innovations and highly curated pretraining data.
14.01.2025 07:11 · 👍 0 🔁 0 💬 1 📌 0
We pretrain two of the top BabyLM submissions (ELC-BERT and MLSM) for isiXhosa and evaluate them on isiXhosa POS tagging, NER, and topic classification. The BabyLMs outperform an isiXhosa RoBERTa, and ELC-BERT even outperforms XLM-R on two tasks (see the fine-tuning sketch after this post).
14.01.2025 07:11 · 👍 0 🔁 0 💬 1 📌 0
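A hedged sketch of the downstream evaluation recipe (fine-tune a pretrained encoder with a task head, then score held-out data), shown here for topic classification. The checkpoint id "isixhosa-elc-bert" is a hypothetical placeholder, and MasakhaNEWS stands in as an illustrative isiXhosa topic dataset; the paper's exact datasets and hyperparameters may differ.

```python
# Illustrative sketch, not the paper's exact setup: fine-tune a small
# pretrained encoder for isiXhosa topic classification and report accuracy.
import numpy as np
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

ds = load_dataset("masakhane/masakhanews", "xho")         # illustrative isiXhosa topic data
tok = AutoTokenizer.from_pretrained("isixhosa-elc-bert")  # hypothetical checkpoint id

def encode(batch):
    return tok(batch["text"], truncation=True, max_length=256)

ds = ds.map(encode, batched=True)
num_labels = ds["train"].features["label"].num_classes

model = AutoModelForSequenceClassification.from_pretrained(
    "isixhosa-elc-bert", num_labels=num_labels)           # hypothetical checkpoint id

def accuracy(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": (np.argmax(logits, -1) == labels).mean()}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3),
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
    tokenizer=tok,             # enables dynamic padding in the collator
    compute_metrics=accuracy,
)
trainer.train()
print(trainer.evaluate())      # held-out accuracy
```

The same recipe with AutoModelForTokenClassification (plus label alignment across subword tokens) covers the POS tagging and NER tasks.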
The BabyLM challenge (babylm.github.io) produced new sample-efficient architectures. We investigate the potential of BabyLMs to improve LMs for low-resource languages with limited pretraining data. As a case study we use isiXhosa, a language with corpora similar in size to BabyLM strict-small (see the word-count sketch after this post).
14.01.2025 07:11 · 👍 0 🔁 0 💬 1 📌 0
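To make the size comparison concrete: the BabyLM strict-small track caps pretraining at roughly 10M words, so a quick word count shows whether a corpus is in the same regime. A minimal sketch; the corpus file path is a hypothetical placeholder.

```python
# Minimal sketch: compare a local corpus against the ~10M-word
# BabyLM strict-small budget. "isixhosa_corpus.txt" is a placeholder path.
from pathlib import Path

STRICT_SMALL_BUDGET = 10_000_000  # words

n_words = sum(len(line.split())
              for line in Path("isixhosa_corpus.txt").open(encoding="utf-8"))
print(f"{n_words:,} words = {n_words / STRICT_SMALL_BUDGET:.0%} of strict-small")
```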
Our paper "BabyLMs for isiXhosa: Data-Efficient Language Modelling in a Low-Resource Context" will be presented at The First Workshop on Language Models for Low-Resource Languages at #COLING2025 in Abu Dhabi.
Paper: arxiv.org/pdf/2501.03855
14.01.2025 07:09 · 👍 1 🔁 1 💬 1 📌 1
You can learn more about me here: https://nikitas-theo.github.io/
Computational linguist trying to understand how humans and computers learn and use language 👶🧠🗣️🖥️💬
The work is mysterious and important. See https://bbunzeck.github.io
PhDing at @clausebielefeld.bsky.social
2nd year PhD Student at @gronlp.bsky.social 🐮 - University of Groningen
Language Acquisition - NLP
Building personalized Bluesky feeds for academics! Pin Paper Skygest, which serves posts about papers from accounts you're following: https://bsky.app/profile/paper-feed.bsky.social/feed/preprintdigest. By @sjgreenwood.bsky.social and @nkgarg.bsky.social
Natural Language Processing and Computational Linguistics group at the University of Groningen 🐮
https://www.rug.nl/research/clcg/research/cl/
Associate Professor at GroNLP (@gronlp.bsky.social) #NLP | Multilingualism | Interpretability | Language Learning in Humans vs NeuralNets | Mum^2
Head of the InClow research group: https://inclow-lm.github.io/
Author of Interpretable Machine Learning and other books
Newsletter: https://mindfulmodeler.substack.com/
Website: https://christophmolnar.com/
Enjoy not enjoying ideals | Interpretability of modular convnets applied to 🏞️ and 🛰️🌍 | she/her
variint.github.io
INSERM group leader @ Neuromodulation Institute and NeuroSpin (Paris) in computational neuroscience.
How and why are computations enabling cognition distributed across the brain?
Expect neuroscience and ML content.
jbarbosa.org
Full of childlike wonder. Building friendly robots. UT Austin PhD student, MIT '20.
Professor of Statistical Machine Learning at the University of Adelaide.
https://sejdino.github.io/
Explainability, Computer Vision, Neuro-AI. 🪴 Kempner Fellow @Harvard.
Prev. PhD @Brown, @Google, @GoPro. Crêpe lover.
Boston | thomasfel.me
Principal Researcher @ CENTAI.eu | Leading the Responsible AI Team. Building Responsible AI through Explainable AI, Fairness, and Transparency. Researching Graph Machine Learning, Data Science, and Complex Systems to understand collective human behavior.
Research in NLP (mostly LM interpretability & explainability).
Assistant prof at UMD CS + CLIP.
Previously @ai2.bsky.social @uwnlp.bsky.social
Views my own.
sarahwie.github.io
Linguist in AI & CogSci 🧠👩‍💻🤖 PhD student @ ILLC, University of Amsterdam
🌐 https://mdhk.net/
🐘 https://scholar.social/@mdhk
🐦 https://twitter.com/mariannedhk
Postdoc AI Researcher (NLP) @ ITU Copenhagen
🧭 https://mxij.me
Comm tech & social media research professor by day, symphony violinist by night, outside as much as possible otherwise. German/American Pacific Northwestern New Englander, #firstgen academic, she/her, 🏳️‍🌈
https://anne-oeldorf-hirsch.uconn.edu
Machine Learner by day, 🦮 Statistician at ❤️
In search of statistical intuition for modern ML & simple explanations for complex things.
Interested in the mysteries of modern ML, causality & all of stats. Opinions my own.
https://aliciacurth.github.io
Assistant Professor at PoliTo 🇮🇹 |
Former Visiting scholar at UCSC 🇺🇸 |
she/her | TrustworthyAI, XAI, Fairness in AI
https://elianap.github.io/
PhD Candidate in Interpretability @FraunhoferHHI | 📍 Berlin, Germany
dilyabareeva.github.io