Form and Meaning in Intrinsic Multilingual Evaluations
Intrinsic evaluation metrics for conditional language models, such as perplexity or bits-per-character, are widely used in both mono- and multilingual settings. These metrics are rather straightforwar...
New EACL paper (with @mdlhx.bsky.social)! We tested whether comparing perplexity on parallel data across languages is fair. Turns out: it depends. We show that the choice of test set (even with meaning held constant) can flip conclusions about which language is easier to model.
Paper: arxiv.org/abs/2601.10580
28.01.2026 13:25
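Perplexity depends on the tokenizer's segmentation, so a common normalization for cross-language comparison (a minimal sketch of the general idea, not the paper's exact setup) is bits-per-character, which rescales the model's total negative log-likelihood by the character count of the raw text:

```python
import math

def bits_per_char(token_logprobs, num_chars):
    """Tokenizer-independent intrinsic metric: total negative
    log-likelihood (natural-log probabilities) rescaled to bits
    per character of the underlying text."""
    total_nll_nats = -sum(token_logprobs)
    return total_nll_nats / (math.log(2) * num_chars)
```

Because the denominator counts characters rather than tokens, two models with different tokenizers (or two languages with different segmentation granularity) are scored on the same scale.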
Authors: @wpoelman.bsky.social, Thomas Bauwens and @mdlhx.bsky.social
03.11.2025 12:06
Confounding Factors in Relating Model Performance to Morphology
Wessel Poelman, Thomas Bauwens, Miryam de Lhoneux. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025.
We are presenting this paper at #EMNLP2025 in the "Multilinguality and Language Diversity" oral session this Wednesday (November 5th) from 11:00-12:30 (UTC+8). Paper: aclanthology.org/2025.emnlp-m... Code: github.com/LAGoM-NLP/Co...
03.11.2025 11:53
Our proposed tokenizer metrics are a step in that direction
03.11.2025 11:53
We disentangle more such factors in an attempt to outline what the "ideal" experiment would look like and how to work backwards to a feasible setup. This way, we outline the requirements to reliably answer whether, and how, morphology relates to language modeling.
03.11.2025 11:53
Finally, we look at experimental factors that confounded the setups and conclusions of prior research. Coarse language grouping is one of several such confounding factors.
03.11.2025 11:53
What's more: using entropy allows for finer-grained ordering of languages than the coarse groupings of "agglutinative" and "fusional".
03.11.2025 11:53
We compute the normalized entropy over each token's distribution of neighbors, and indeed find that agglutinative languages tend to have higher entropy than fusional languages on average.
03.11.2025 11:53
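A minimal sketch of this kind of statistic (my own illustration under assumed details, not the paper's code): count each token's right-neighbor distribution in a corpus, where the number of distinct neighbors is the accessor variety, and normalize the Shannon entropy of that distribution by its maximum:

```python
import math
from collections import Counter, defaultdict

def next_token_entropy(tokens):
    """Normalized entropy of each token's right-neighbor distribution.

    For each token, counts its right neighbors in the corpus; the number
    of distinct neighbors is the accessor variety (AV). Returns entropy
    normalized by log2(AV), so values lie in [0, 1]."""
    neighbors = defaultdict(Counter)
    for left, right in zip(tokens, tokens[1:]):
        neighbors[left][right] += 1
    entropy = {}
    for tok, counts in neighbors.items():
        total = sum(counts.values())
        av = len(counts)  # accessor variety: distinct right neighbors
        if av == 1:
            entropy[tok] = 0.0  # fully predictable next token
            continue
        h = -sum((c / total) * math.log2(c / total) for c in counts.values())
        entropy[tok] = h / math.log2(av)  # normalize by maximum entropy
    return entropy
```

For example, on the toy corpus `"a b a c a b".split()`, the token "a" is followed by both "b" and "c" and so gets a non-zero normalized entropy, while "b" is always followed by "a" and scores 0.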
To measure this token ambiguity, we revisit the idea of accessor variety (AV) from Harris (1955) and Feng et al. (2004) by counting which tokens neighbor each other in a corpus, and how often.
03.11.2025 11:53
it is harder to predict the next token. We then hypothesize that this contextual ambiguity is higher in morphologically complex languages.
03.11.2025 11:53
In our new #EMNLP2025 paper, we argue that such statistics should relate directly to what a language model actually does: reliably predicting the next token produced by its tokenizer. We argue that if the most recent token has more contextual ambiguity,
03.11.2025 11:53
When is a language hard to model? Previous research has suggested that morphological complexity both does and does not play a role, but it does so by relating the performance of language models to corpus statistics of words or subword tokens in isolation.
03.11.2025 11:53
Ok, added the ones that were missing from yours to ours
12.08.2025 10:54
✅
12.08.2025 10:48
You're included in the NLP labs starter pack, see go.bsky.app/LKGekew
11.08.2025 09:46
Reminder, a few more days to apply!
03.06.2025 09:35
CLIN35
Computational Linguistics in The Netherlands (CLIN) is a yearly conference on computational linguistics. Each year the conference is organized by a different institution in the Dutch-speaking region. ...
Don't forget! The deadline for submitting your abstract to the #CLIN conference in Leuven is coming: 13th of June! Submitting is easy: name, title of your work, 500-word abstract, done! #nlp #nlproc #compling #llm #ai #dutch clin35.ccl.kuleuven.be
02.06.2025 07:08
We are hiring in #nlproc!!
16.05.2025 08:24
✅
18.02.2025 08:14
I'm looking for a postdoc, ideally to start ASAP!
The work would be in the EU-funded TrustLLM project, focusing on modularisation and language adaptation of LLMs, tokenization, and evaluation benchmarks for multilingual LLMs. The position would be full-time for 2 years with no teaching obligation.
13.12.2024 10:18
We look at the role of English in this evaluation: it can be, and often is, used as an interface to boost task performance, or as a natural language in which to evaluate language understanding. We recommend moving away from task performance as the main goal and focusing on language understanding.
12.12.2024 15:28
🚨 New Account Alert! This is the official account of the *MilaNLP group*. We had to recreate it because it was not indexed.
If you were following us before, please follow us again. If not, now's the perfect time to start!
06.12.2024 14:08
MilaNLP Lab (@milanlp.bsky.social)
The Milan Natural Language Processing Group #NLProc #ML #AI https://milanlproc.github.io/
milanlp.bsky.social is having the same issue, maybe take a look at this github issue here: github.com/bluesky-soci...
02.12.2024 09:39
NLP grad students
Join the conversation
There are too many starter packs.
👉 Here's a list, mostly for NLP, ML, and related areas.
01.12.2024 03:05
Moreover, we advocate for a shift in perspective from seeking a general definition of data quality towards a more language- and task-specific one. Ultimately, we aim for this study to serve as a guide to using Wikipedia for pretraining in a multilingual setting.
29.11.2024 14:02
We evaluate the downstream impact of quality filtering on Wikipedia by training tiny monolingual models on each Wikipedia, finding that quality-based data pruning enables resource-efficient training without hurting performance, especially for low-resource languages (LRLs).
29.11.2024 14:02