Francois Meyer

@francois-meyer.bsky.social

PhD student at the University of Cape Town, working on text generation for low-resource, morphologically complex languages. https://francois-meyer.github.io/ Cape Town, South Africa

51 Followers  |  426 Following  |  7 Posts  |  Joined: 19.11.2024

Latest posts by francois-meyer.bsky.social on Bluesky

๐ŸŒIntroducing BabyBabelLM: A Multilingual Benchmark of Developmentally Plausible Training Data!

LLMs learn from vastly more data than humans ever experience. BabyLM challenges this paradigm by focusing on developmentally plausible data.

We extend this effort to 45 new languages!

15.10.2025 10:53 · 👍 43    🔁 16    💬 1    📌 3
๐ƒ๐จ ๐ฒ๐จ๐ฎ ๐ซ๐ž๐š๐ฅ๐ฅ๐ฒ ๐ฐ๐š๐ง๐ญ ๐ญ๐จ ๐ฌ๐ž๐ž ๐ฐ๐ก๐š๐ญ ๐ฆ๐ฎ๐ฅ๐ญ๐ข๐ฅ๐ข๐ง๐ ๐ฎ๐š๐ฅ ๐ž๐Ÿ๐Ÿ๐จ๐ซ๐ญ ๐ฅ๐จ๐จ๐ค๐ฌ ๐ฅ๐ข๐ค๐ž? ๐Ÿ‡จ๐Ÿ‡ณ๐Ÿ‡ฎ๐Ÿ‡ฉ๐Ÿ‡ธ๐Ÿ‡ช

Here's the proof! BabyBabelLM is the first Multilingual Benchmark of Developmentally Plausible Training Data, covering 45 languages, now available to the NLP community 🎉

arxiv.org/abs/2510.10159

14.10.2025 17:01 · 👍 42    🔁 16    💬 2    📌 1

Today our poster will be up at @loreslm.bsky.social Poster Session #2 (2-3pm local time, Abu Dhabi).

It's also available online at Whova: whova.com/portal/webap...

20.01.2025 06:43 · 👍 1    🔁 0    💬 0    📌 1

This work was carried out by three great UCT CS Honours students: Alexis, Charl, and Hishaam.

14.01.2025 07:11 · 👍 0    🔁 0    💬 0    📌 0

This work unites two directions of research: cognitively plausible modelling and NLP for low-resource languages. We hope more researchers pursue work at the intersection of these two subfields, since they share the goal of improving data-efficiency in the era of scaling.

14.01.2025 07:11 · 👍 0    🔁 0    💬 1    📌 0

However, unlike in the original BabyLM challenge, our isiXhosa BabyLMs do not outperform all skylines. We attribute this to a lack of developmentally plausible isiXhosa data. The success of English BabyLMs is due to both modelling innovations and highly curated pretraining data.

14.01.2025 07:11 · 👍 0    🔁 0    💬 1    📌 0

We pretrain two of the top BabyLM submissions (ELC-BERT and MLSM) for isiXhosa and evaluate them on isiXhosa POS tagging, NER, and topic classification. The BabyLMs outperform an isiXhosa RoBERTa, and ELC-BERT even outperforms XLM-R on two tasks.

14.01.2025 07:11 · 👍 0    🔁 0    💬 1    📌 0
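As a concrete illustration of the evaluation setup described above, here is a minimal sketch of fine-tuning a pretrained encoder on isiXhosa POS tagging with HuggingFace transformers. It uses xlm-roberta-base (the XLM-R skyline mentioned in the thread) as a stand-in checkpoint, and a toy sentence with made-up tags; this is not the paper's pipeline, just the standard token-classification recipe such an evaluation would follow.

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Toy tag set and one toy isiXhosa sentence, purely for illustration.
labels = ["NOUN", "VERB", "PRON", "PUNCT"]
words = ["Molo", "mhlobo", "wam", "."]
tags = [1, 0, 2, 3]  # illustrative word-level tags, not gold annotations

# Stand-in checkpoint: swap in the actual BabyLM weights where available.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(labels)
)

enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")

# Align word-level tags to subword tokens: label each word's first subword,
# mask the rest (and special tokens) with -100 so the loss ignores them.
aligned, prev = [], None
for wid in enc.word_ids():
    aligned.append(tags[wid] if wid is not None and wid != prev else -100)
    prev = wid

out = model(**enc, labels=torch.tensor([aligned]))
out.loss.backward()  # one fine-tuning gradient step (optimizer loop omitted)
print("loss:", float(out.loss))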

The BabyLM challenge (babylm.github.io) produced new sample-efficient architectures. We investigate the potential of BabyLMs to improve LMs for low-resource languages with limited pretraining data. As a case study we use isiXhosa, a language with corpora similar in size to the BabyLM strict-small track (10M words).

14.01.2025 07:11 · 👍 0    🔁 0    💬 1    📌 0

Our paper "BabyLMs for isiXhosa: Data-Efficient Language Modelling in a Low-Resource Context" will be presented at The First Workshop on Language Models for Low-Resource Languages at #COLING2025 in Abu Dhabi.

Paper: arxiv.org/pdf/2501.03855

14.01.2025 07:09 · 👍 1    🔁 1    💬 1    📌 1
