LTG Oslo

@ltgoslo.sigmoid.social.ap.brid.gy

We are Language Technology Group (LTG) at the University of Oslo, Norway. We do research on a number of topics related to Natural Language Processing (NLP), a […] [bridged from https://sigmoid.social/@ltgoslo on the fediverse by https://fed.brid.gy/ ]

10 Followers  |  1 Following  |  12 Posts  |  Joined: 11.03.2025

Latest posts by ltgoslo.sigmoid.social.ap.brid.gy on Bluesky

Post image

20,000 authors! #acl2025

28.07.2025 07:12 — 👍 2    🔁 2    💬 0    📌 0
Post image

Language Technology Group #Oslo at the #ACL2025 conference in #Vienna today

#NLProc

29.07.2025 20:29 — 👍 2    🔁 0    💬 0    📌 0
Original post on sigmoid.social

You are also welcome to join the "Multilingualism: from data crawling to evaluation" birds-of-a-feather (BoF) event, which is co-organized by the #HPLT project.

Join us to discuss web-scale text data collection and processing, as well as open multilingual #LLM training and evaluation. You will have […]

22.07.2025 13:04 — 👍 0    🔁 0    💬 0    📌 0
Original post on sigmoid.social

If you are attending the ACL 2025 conference in Vienna, come to the poster presenting the latest #HPLT v2 datasets (the paper is available here: https://arxiv.org/abs/2503.10267).

You can find the HPLT folks on Wednesday, July 30, 11:00 at the in-person poster session, Level 0, Exhibit Halls X4 […]

22.07.2025 13:02 — 👍 0    🔁 0    💬 1    📌 0
Original post on sigmoid.social

In addition to the five previously mentioned papers: we are very proud of the LTG Master's students who got their paper accepted to the ACL workshop on Fact Extraction and VERification (FEVER).
A big shout-out to Eivind Morris Bakke, Nora Winger Heggelund and their paper "(Fact) Check Your Bias"! […]

30.06.2025 14:47 — 👍 0    🔁 0    💬 0    📌 0
Original post on sigmoid.social

We're hiring! A postdoc-level researcher position in NLP, focusing on generative approaches to event extraction, is open at the University of Oslo. The contract is for 30 months. Closing date 11 Aug. Come join us! […]

30.06.2025 12:06 — 👍 0    🔁 3    💬 0    📌 0
Post image

This is what the Language Technology Group at the University of Oslo looks like these days :)

#NLProc #Norway #Oslo

18.06.2025 12:06 — 👍 1    🔁 0    💬 0    📌 0
Preview
Systematic Generalization in Language Models Scales with Information Entropy
Systematic generalization remains challenging for current language models, which are known to be both sensitive to semantically similar permutations of the input and to struggle with known concepts presented in novel contexts. Although benchmarks exist for assessing compositional behavior, it is unclear how to measure the difficulty of a systematic generalization problem. In this work, we show how one aspect of systematic generalization can be described by the entropy of the distribution of component parts in the training data. We formalize a framework for measuring entropy in a sequence-to-sequence task and find that the performance of popular model architectures scales with the entropy. Our work connects systematic generalization to information efficiency, and our results indicate that success at high entropy can be achieved even without built-in priors, and that success at low entropy can serve as a target for assessing progress towards robust systematic generalization.

5. "Systematic Generalization in Language Models Scales with Information Entropy", defining a framework to measure entropy in sequence-to-sequence tasks. Authored by Sondre Wold, Lucas Charpentier, Étienne Simon
https://arxiv.org/abs/2505.13089 (ACL Findings)
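
To make the core idea concrete, here is a minimal sketch (our illustration, not the paper's exact formalization) that treats the quantity in question as the Shannon entropy of the distribution of component parts, here simply whitespace tokens, in the training data:

```python
import math
from collections import Counter

def component_entropy(sequences):
    """Shannon entropy (in bits) of the distribution of component
    parts (here: whitespace-separated tokens) across a training set."""
    counts = Counter(tok for seq in sequences for tok in seq.split())
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# A skewed component distribution yields lower entropy than a uniform one.
print(component_entropy(["jump jump jump", "jump walk"]))  # ~0.72 bits
print(component_entropy(["jump walk", "run look"]))        # 2.0 bits
```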

See you in Vienna!
(end of 🧵)

06.06.2025 13:20 — 👍 0    🔁 0    💬 0    📌 0
Original post on sigmoid.social

"#NorEval: A Norwegian Language Understanding and Generation Evaluation Benchmark", introducing a new evaluation suite for benchmarking of #Norwegian generative language models. Authored by Vladislav Mikhailov, David Samuel, Andrey Kutuzov, Erik Velldal, Lilja Øvrelid (LTG), Tita Enstad, Hans […]

06.06.2025 13:19 — 👍 0    🔁 0    💬 0    📌 0
Preview
Re-identification of De-identified Documents with Autoregressive Infilling
Documents revealing sensitive information about individuals must typically be de-identified. This de-identification is often done by masking all mentions of personally identifiable information (PII), thereby making it more difficult to uncover the identity of the person(s) in question. To investigate the robustness of de-identification methods, we present a novel, RAG-inspired approach that attempts the reverse process of re-identification based on a database of documents representing background knowledge. Given a text in which personal identifiers have been masked, the re-identification proceeds in two steps. A retriever first selects from the background knowledge passages deemed relevant for the re-identification. Those passages are then provided to an infilling model which seeks to infer the original content of each text span. This process is repeated until all masked spans are replaced. We evaluate the re-identification on three datasets (Wikipedia biographies, court rulings and clinical notes). Results show that (1) as many as 80% of de-identified text spans can be successfully recovered and (2) the re-identification accuracy increases along with the level of background knowledge.

3. "Re-identification of De-identified Documents with Autoregressive Infilling" challenging the existing methods of masking personally identifiable information. Authored by Lucas Charpentier and Pierre Lison.
https://arxiv.org/abs/2505.12859 (main ACL)
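
As a minimal sketch of the two-step loop the abstract describes, assuming hypothetical `retriever.retrieve` and `infiller.infill` interfaces rather than the authors' actual implementation:

```python
# Sketch of the retrieve-then-infill re-identification loop; `retriever`
# and `infiller` are hypothetical placeholder interfaces.

def reidentify(masked_text, retriever, infiller, mask_token="[MASK]"):
    """Iteratively replace masked PII spans until none remain."""
    while mask_token in masked_text:
        # Step 1: select background-knowledge passages relevant to the text.
        passages = retriever.retrieve(masked_text, top_k=5)
        # Step 2: the infilling model infers the content of one masked span.
        guess = infiller.infill(masked_text, passages)
        masked_text = masked_text.replace(mask_token, guess, 1)
    return masked_text
```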

06.06.2025 13:18 — 👍 0    🔁 0    💬 0    📌 0
Preview
Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models
A fundamental question in interpretability research is to what extent neural networks, particularly language models, implement reusable functions through subnetworks that can be composed to perform more complex tasks. Recent advances in mechanistic interpretability have made progress in identifying circuits, which represent the minimal computational subgraphs responsible for a model's behavior on specific tasks. However, most studies focus on identifying circuits for individual tasks without investigating how functionally similar circuits relate to each other. To address this gap, we study the modularity of neural networks by analyzing circuits for highly compositional subtasks within a transformer-based language model. Specifically, given a probabilistic context-free grammar, we identify and compare circuits responsible for ten modular string-edit operations. Our results indicate that functionally similar circuits exhibit both notable node overlap and cross-task faithfulness. Moreover, we demonstrate that the circuits identified can be reused and combined through set operations to represent more complex functional model capabilities.

2. "Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models", studying modularity of transformer based language models. Authored by Sondre Wold (LTG), Philipp Mondorf, Barbara Plank
https://arxiv.org/abs/2410.01434 (main ACL)
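
A toy illustration of the node overlap and set-based composition mentioned in the abstract; circuits are reduced here to plain sets of made-up component names, whereas real circuits are computational subgraphs:

```python
# Circuits as sets of component names (a deliberate simplification;
# the component names below are invented for illustration).
copy_circuit = {"attn.0.head3", "mlp.1", "attn.2.head1"}
swap_circuit = {"attn.0.head3", "mlp.1", "attn.4.head0"}

# Node overlap between two functionally similar circuits (Jaccard index).
overlap = len(copy_circuit & swap_circuit) / len(copy_circuit | swap_circuit)

# Composing circuits through set union to cover a more complex capability.
combined = copy_circuit | swap_circuit
print(f"overlap = {overlap:.2f}, combined circuit size = {len(combined)}")
```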

06.06.2025 13:18 — 👍 0    🔁 0    💬 0    📌 0
Original post on sigmoid.social

1. "An Expanded Massive Multilingual Dataset for High-Performance Language Technologies (HPLT)", describing a new generation of the #HPLT web-crawled corpora in 193 languages. LTG co-authors: Nikolay Arefyev, Mariia Fedorova, Andrey Kutuzov, Petter Mæhlum, Vladislav Mikhailov, Stephan Oepen […]

06.06.2025 13:17 — 👍 0    🔁 0    💬 0    📌 0

Language Technology Group #Oslo will be presenting five papers at the #ACL2025NLP conference this summer in #Vienna: 🧵

#NLProc

06.06.2025 13:15 — 👍 0    🔁 1    💬 1    📌 0
