Authors: @wpoelman.bsky.social, Thomas Bauwens and @mdlhx.bsky.social
03.11.2025 12:06 · @lagom-nlp.bsky.social
We are the Leuven AI Group of Multilingual NLP (LAGoM NLP), a research lab in the Department of Computer Science at KU Leuven, led by @mdlhx
03.11.2025 12:06 · We are presenting this paper at #EMNLP2025 in the "Multilinguality and Language Diversity" oral session this Wednesday (November 5th) from 11:00-12:30 (UTC+8). Paper: aclanthology.org/2025.emnlp-m... Code: github.com/LAGoM-NLP/Co...
03.11.2025 11:53 · Our proposed tokenizer metrics are a step in that direction.
03.11.2025 11:53 · We disentangle more such factors to sketch what the "ideal" experiment would look like and how to work backwards from it to a feasible setup. In doing so, we lay out the requirements for reliably answering whether, and how, morphology relates to language modeling.
03.11.2025 11:53 · Finally, we take a look at experimental factors that have confounded experiments and conclusions in prior research. Coarse language grouping is one of several such confounders.
03.11.2025 11:53 · What's more: using entropy allows for finer-grained ordering of languages than the coarse groupings of "agglutinative" and "fusional".
03.11.2025 11:53 · We compute the normalized entropy over each token's distribution of neighbors, and indeed find that agglutinative languages tend to have higher entropy than fusional languages on average.
03.11.2025 11:53 · To measure this token ambiguity, we revisit the idea of accessor variety (AV) from Harris (1955) and Feng et al. (2004) by counting which tokens neighbor each other in a corpus and how many times.
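For readers who want to try this on their own corpus, here is a minimal sketch of the neighbor-entropy idea described in the two posts above. It counts right-hand neighbors only and normalizes by the log of the number of distinct neighbors; the paper's exact counting direction, normalization, and tokenization are not reproduced here, and all names in the snippet are illustrative.

```python
from collections import Counter, defaultdict
from math import log2

def neighbor_entropy(token_sequences):
    """Normalized entropy of each token's right-neighbor distribution.

    `token_sequences` is an iterable of token lists (e.g. tokenized lines of a
    corpus). Returns {token: H_norm}, where H_norm is the Shannon entropy of the
    token's neighbor counts divided by log2 of the number of distinct neighbors,
    so values lie in [0, 1].
    """
    neighbor_counts = defaultdict(Counter)
    for tokens in token_sequences:
        for left, right in zip(tokens, tokens[1:]):
            neighbor_counts[left][right] += 1

    entropies = {}
    for token, counts in neighbor_counts.items():
        total = sum(counts.values())
        h = -sum((c / total) * log2(c / total) for c in counts.values())
        # A token observed with only one distinct neighbor is fully predictable.
        entropies[token] = h / log2(len(counts)) if len(counts) > 1 else 0.0
    return entropies

# Toy usage: higher values mean the next token is harder to predict.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
scores = neighbor_entropy(corpus)
print(scores["the"])  # "the" -> {cat: 2, dog: 1}, normalized entropy ~= 0.92
print(scores["cat"])  # "cat" -> {sat: 1, ran: 1}, normalized entropy = 1.0
```

Averaging these per-token values over a corpus gives one number per language that can be compared across languages; whether the paper aggregates exactly this way is an assumption of this sketch.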
03.11.2025 11:53 · it is harder to predict the next token. We then hypothesize that this contextual ambiguity is higher in morphologically complex languages.
03.11.2025 11:53 · In our new #EMNLP2025 paper, we argue that such statistics should relate directly to what a language model actually does: reliably predicting the next token produced by its tokenizer. Specifically, if the most recent token has more contextual ambiguity,
03.11.2025 11:53 · When is a language hard to model? Previous research has suggested that morphological complexity both does and does not play a role, but that work relates the performance of language models to corpus statistics of words or subword tokens in isolation.
03.11.2025 11:53 · Ok, added the ones that were missing from yours to ours.
12.08.2025 10:54
12.08.2025 10:48 · You're included in the NLP labs starter pack, see go.bsky.app/LKGekew
11.08.2025 09:46 · * (Findings) Supervised and Unsupervised Probing of Shortcut Learning: Case Study on the Emergence and Evolution of Syntactic Heuristics in BERT by Elke Vandermeerschen and @mdlhx.bsky.social, presented by Elke. URL: aclanthology.org/2025.finding...
28.07.2025 11:52 · Our group has two papers at #acl2025:
* (Main) GRaMPa: Subword Regularisation by Skewing Uniform Segmentation Distributions with an Efficient Path-counting Markov Model by Thomas Bauwens, David Kaczér and @mdlhx.bsky.social, presented by Thomas. URL: aclanthology.org/2025.acl-lon...
The submission deadline for #CLIN35 has been extended by one week! New deadline: June 20th. Spread the word! More info: clin35.ccl.kuleuven.be/call-for-abs...
06.06.2025 09:38 · Reminder: a few more days to apply!
03.06.2025 09:35 · Don't forget! The deadline for submitting your abstract to the #CLIN conference in Leuven is coming: 13th of June! Submitting is easy: name, title of your work, 500-word abstract, done! #nlp #nlproc #compling #llm #ai #dutch clin35.ccl.kuleuven.be
02.06.2025 07:08 · We are hiring in #nlproc!!
16.05.2025 08:24
18.02.2025 08:14 · I'm looking for a postdoc, ideally to start ASAP!
The work would be in the EU-funded TrustLLM project, focusing on modularisation and language adaptation of LLMs, tokenization, and evaluation benchmarks for multilingual LLMs. The position would be full-time for 2 years with no teaching obligation.
We look at the role of English in this evaluation: it can be, and often is, used as an interface to boost task performance, or as a natural language in which to evaluate language understanding. We recommend moving away from task performance as the main goal and focusing on language understanding.
12.12.2024 15:28 · @wpoelman.bsky.social and @mdlhx.bsky.social's hot takes on multilingual LLM evaluation, to appear at @nodalida.bsky.social, are up on arXiv: arxiv.org/abs/2412.08392
12.12.2024 15:28 · New Account Alert! This is the official account of the *MilaNLP group*. We had to recreate it because it was not indexed.
If you were following us before, please follow us again. If not, now's the perfect time to start!
milanlp.bsky.social is having the same issue, maybe take a look at this GitHub issue: github.com/bluesky-soci...
02.12.2024 09:39 · There's too many starter packs.
Here's a list, mostly for NLP, ML, and related areas.
Moreover, we advocate for a shift in perspective from seeking a general definition of data quality towards a more language- and task-specific one. Ultimately, we aim for this study to serve as a guide to using Wikipedia for pretraining in a multilingual setting.
29.11.2024 14:02 · We evaluate the downstream impact of quality filtering on Wikipedia by training tiny monolingual models for each Wikipedia, and find that data quality pruning is an effective means for resource-efficient training without hurting performance, especially for low-resource languages (LRLs).
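To give a sense of scale, here is a minimal sketch of what a "tiny" decoder-only language model configuration can look like with Hugging Face transformers. The vocabulary size, context length, and layer counts below are illustrative assumptions, not the settings used in the paper.

```python
# Minimal sketch of a "tiny" decoder-only LM; all sizes are illustrative guesses.
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=16_000,   # assumes a small monolingual tokenizer
    n_positions=512,     # short context keeps training cheap
    n_embd=256,
    n_layer=4,
    n_head=4,
)
model = GPT2LMHeadModel(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```

One such model would then be trained per (filtered) Wikipedia and compared on downstream evaluation.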
29.11.2024 14:02 · We subject non-English Wikipedias to common quality filtering techniques like script filtering, MinHash and heuristic filtering, which reveal widespread issues such as a high percentage of one-line articles and duplicate articles.
29.11.2024 14:02
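For illustration, here is a minimal, self-contained sketch of MinHash-based near-duplicate detection of the kind mentioned in the post above. The shingle size, number of hash functions, and similarity threshold are illustrative assumptions; the paper's actual filtering pipeline (including its script and heuristic filters) is not reproduced here.

```python
import hashlib
from itertools import combinations

NUM_HASHES = 64  # signature length; an illustrative choice

def shingles(text, n=3):
    """Character n-gram shingles; word-level shingles are an equally common choice."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def minhash_signature(shingle_set):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    return [
        min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8,
                                key=seed.to_bytes(2, "big")).digest(),
                "big",
            )
            for s in shingle_set
        )
        for seed in range(NUM_HASHES)
    ]

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_HASHES

# Toy usage: article pairs whose estimated similarity exceeds a chosen threshold
# (say 0.7) would be flagged as duplicates and pruned.
articles = {
    "a": "Leuven is a city in Flanders, Belgium.",
    "b": "Leuven is a city in Flanders, Belgium .",  # near-duplicate
    "c": "Entropy measures uncertainty in a distribution.",
}
signatures = {k: minhash_signature(shingles(v)) for k, v in articles.items()}
for x, y in combinations(signatures, 2):
    print(x, y, round(estimated_jaccard(signatures[x], signatures[y]), 2))
```

In practice one would bucket the signatures with locality-sensitive hashing rather than compare all pairs, so the check scales to a full Wikipedia dump.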