Theory of XAI Workshop
Explainable AI (XAI) is now deployed across a wide range of settings, including high-stakes domains in which misleading explanations can cause real harm. For example, explanations are required by law ...
Interested in provable guarantees and fundamental limitations of XAI? Join us at the "Theory of Explainable AI" workshop Dec 2 in Copenhagen! @ellis.eu @euripsconf.bsky.social
Speakers: @jessicahullman.bsky.social @doloresromerom.bsky.social @tpimentel.bsky.social
Call for Contributions: Oct 15
07.10.2025 12:53
Paper title: Language models align with brain regions that represent concepts across modalities.
Authors: Maria Ryskina, Greta Tuckute, Alexander Fung, Ashley Malkin, Evelina Fedorenko.
Affiliations: Maria is affiliated with the Vector Institute for AI, but the work was done at MIT. All other authors are affiliated with MIT.
Email address: maria.ryskina@vectorinstitute.ai.
Interested in language models, brains, and concepts? Check out our COLM 2025 Spotlight paper!
(And if you're at COLM, come hear about it on Tuesday: sessions Spotlight 2 & Poster 2!)
04.10.2025 02:15
Accepted to EMNLP (and more to come)! The camera-ready version is now online; very happy with how this turned out.
arxiv.org/abs/2507.01234
24.09.2025 15:21
See our paper for more: we also have analyses of other models, of downstream tasks, and of restricted subsets of tokens (e.g., only tokens with a certain part of speech)!
01.10.2025 18:08
This means that: (1) LMs can get less similar to each other, even while they all get closer to the true distribution; and (2) larger models reconverge faster, while small ones may never reconverge.
01.10.2025 18:08
* A sharp-divergence phase, where models diverge as they start using context.
* A slow-reconvergence phase, where predictions slowly become more similar again (especially in larger models).
01.10.2025 18:08
Surprisingly, convergence isn't monotonic. Instead, we find four convergence phases across model training.
* A uniform phase, where all seeds output nearly-uniform distributions.
* A sharp-convergence phase, where models align, largely due to unigram frequency learning.
01.10.2025 18:08
In this paper, we define convergence as the similarity between the outputs of LMs trained with different seeds, where similarity is measured as a per-token KL divergence. This lets us track whether models trained under identical settings, but with different seeds, behave the same.
01.10.2025 18:08
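A rough sketch of the metric just described (illustrative only, not the paper's code): the function below computes a per-position KL divergence from two models' next-token logits, and the random arrays stand in for two seeds' predictions on the same text.

    import numpy as np

    def log_softmax(logits: np.ndarray) -> np.ndarray:
        # Numerically stable log-softmax over the vocabulary axis.
        shifted = logits - logits.max(axis=-1, keepdims=True)
        return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

    def per_token_kl(logits_p: np.ndarray, logits_q: np.ndarray) -> np.ndarray:
        # KL(P || Q) at each position, given next-token logits of shape
        # (num_positions, vocab_size) from two models run on the same text.
        log_p, log_q = log_softmax(logits_p), log_softmax(logits_q)
        return (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)

    # Toy stand-in: two seeds' logits on the same 5-token context, vocab of 100.
    rng = np.random.default_rng(0)
    seed_a, seed_b = rng.normal(size=(5, 100)), rng.normal(size=(5, 100))
    print(per_token_kl(seed_a, seed_b).mean())  # lower = the two seeds agree more

Averaging this quantity over positions (and seed pairs) at each training checkpoint is what lets convergence be tracked over time.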
Figure showing the four phases of convergence in LM training
LLMs are trained to mimic a "true" distribution, and their decreasing cross-entropy confirms that they get closer to this target during training. But do similar models approach this target distribution in similar ways? 🤔 Not really! Our new paper studies this, finding four convergence phases in training 🧵
01.10.2025 18:08
Very happy this paper got accepted to NeurIPS 2025 as a Spotlight!
Main takeaway: in mechanistic interpretability, we need assumptions about how DNNs encode concepts in their representations (e.g., the linear representation hypothesis). Without them, we can claim that any DNN implements any algorithm!
01.10.2025 15:00
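In its simplest form, the linear representation hypothesis mentioned above assumes a concept corresponds to a direction in the network's activation space, so it can be read out with a single linear projection. A toy sketch on synthetic activations (the difference-of-means probe and the data are illustrative choices, not the paper's setup):

    import numpy as np

    rng = np.random.default_rng(0)
    d = 64
    # Ground-truth direction along which a binary concept is, by construction, encoded.
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)

    # Synthetic hidden states: the concept's value shifts activations along that direction.
    labels = rng.integers(0, 2, size=200)
    hidden = rng.normal(size=(200, d)) + np.outer(2.0 * labels - 1.0, direction)

    # Recover the direction with a difference-of-means "probe", then read the concept
    # back out by projecting each hidden state onto it.
    estimated = hidden[labels == 1].mean(axis=0) - hidden[labels == 0].mean(axis=0)
    predictions = (hidden @ estimated > 0).astype(int)
    print("linear read-out accuracy:", (predictions == labels).mean())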
Honoured to receive two (!!) SAC highlights awards at #ACL2025. (Conveniently placed on the same slide!)
With the amazing: @philipwitti.bsky.social, @gregorbachmann.bsky.social, @wegotlieb.bsky.social,
@cuiding.bsky.social, Giovanni Acampa, @alexwarstadt.bsky.social, and @tamaregev.bsky.social
31.07.2025 07:41
We are presenting this paper at #ACL2025! Find us at poster session 4 (Wednesday morning, 11h~12h30) to learn more about tokenisation bias!
27.07.2025 11:59
@philipwitti.bsky.social will be presenting our paper "Tokenisation is NP-Complete" at #ACL2025! Come to the Language Modelling 2 session (Wednesday morning, 9h~10h30) to learn more about how challenging tokenisation can be!
27.07.2025 09:41
Headed to Vienna for #ACL2025 to present our tokenisation bias paper and co-organise the L2M2 workshop on memorisation in language models. Reach out to chat about tokenisation, memorisation, and all things pre-training (esp. data-related topics)!
27.07.2025 06:40
Causal abstraction, the theory behind DAS, tests whether a network realizes a given algorithm. We show (w/ @denissutter.bsky.social, T. Hofmann, @tpimentel.bsky.social) that the theory collapses without the linear representation hypothesis, a problem we call the non-linear representation dilemma.
17.07.2025 10:57
Importantly, despite these results, we still believe causal abstraction is one of the best frameworks available for mechanistic interpretability. Going forward, we should try to better understand how it is affected by assumptions about how DNNs encode information. Longer 🧵 soon by @denissutter.bsky.social
14.07.2025 12:15
Overall, our results show that causal abstraction (and interventions) is not a silver bullet, as it relies on assumptions about how features are encoded in the DNNs. We then connect our results to the linear representation hypothesis and to older debates in the probing literature.
14.07.2025 12:15
We show, both theoretically (under reasonable assumptions) and empirically (on real-world models), that if we allow variables to be encoded in arbitrarily complex subspaces of the DNN's representations, any algorithm can be mapped to any model.
14.07.2025 12:15
Causal abstraction identifies this correspondence by finding subspaces in the DNN's hidden states that encode the algorithm's hidden variables. Given such a map, we say the DNN implements the algorithm if the two behave identically under interventions.
14.07.2025 12:15
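To make the intervention criterion in the post above concrete, here is a toy sketch (illustrative only, not the paper's or DAS's actual machinery): a hand-built "network" whose first hidden unit encodes the algorithm's intermediate variable, and an interchange intervention that patches that unit with its value from a second input, then checks that network and algorithm still agree.

    import numpy as np

    # "Algorithm": intermediate variable s = a + b, output = s * c.
    def algorithm(a, b, c, s_override=None):
        s = a + b if s_override is None else s_override
        return s * c

    # "Network": its first hidden unit encodes s by construction; the alignment map
    # assumed below is simply: algorithm variable s <-> hidden unit h[0].
    def network_hidden(a, b, c):
        return np.array([a + b, c, a - b])  # last unit is irrelevant to the output

    def network_output(h):
        return h[0] * h[1]

    base, source = (1.0, 2.0, 3.0), (4.0, 5.0, 6.0)
    h_base, h_source = network_hidden(*base), network_hidden(*source)

    # Interchange intervention: run on the base input, but patch the subspace aligned
    # with s using the value it takes on the source input.
    h_patched = h_base.copy()
    h_patched[0] = h_source[0]

    # The abstraction holds if the patched network matches the algorithm run on the
    # base input with s overridden by its source-run value.
    print(network_output(h_patched), algorithm(*base, s_override=source[0] + source[1]))  # 27.0 27.0

The dilemma discussed in this thread arises when the alignment map is allowed to be an arbitrarily complex (non-linear) function of the hidden state rather than a fixed unit or linear subspace as in this toy example.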
Paper title "The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?" with the paper's graphical abstract showing how more powerful alignment maps between a DNN and an algorithm allow more complex features to be found and more "accurate" abstractions.
Mechanistic interpretability often relies on *interventions* to study how DNNs work. Are these interventions enough to guarantee that the features we find are not spurious? No! ⚠️ In our new paper, we show that many mech interp methods implicitly rely on the linear representation hypothesis 🧵
14.07.2025 12:15
All modern LLMs run on top of a tokeniser, an often overlooked "preprocessing detail". But what if that tokeniser systematically affects model behaviour? We call this tokenisation bias.
Let's talk about it and why it matters!
@aclmeeting.bsky.social #ACL2025 #NLProc
05.06.2025 10:43
The word "laundry" contains both steps of the laundry process:
1. Undry
2. Dry
04.06.2025 19:14
Love this! Especially the explicit operationalization of what "bias" they are measuring by specifying the relevant counterfactual.
Definitely an approach that more papers discussing effects could incorporate to better clarify the phenomenon they are studying.
04.06.2025 15:55
If you use LLMs, tokenisation bias probably affects you:
* Text generation: tokenisation bias → length bias
* Psycholinguistics: tokenisation bias → systematically biased surprisal estimates
* Interpretability: tokenisation bias → biased logits
04.06.2025 14:55
Title of paper "Causal Estimation of Tokenisation Bias" and schematic of how we define tokenisation bias, which is the causal effect we are interested in.
A string may get 17 times less probability if tokenised as two symbols (e.g., ⟨he, llo⟩) than as one (e.g., ⟨hello⟩), by an LM trained from scratch in each situation! Our new ACL paper proposes an observational method to estimate this causal effect! Longer thread soon!
04.06.2025 10:51
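As a rough illustration of how the probability an LM assigns to a string depends on its tokenisation (a sketch under stated assumptions, not the paper's observational estimator): the snippet below scores the same string under its canonical segmentation and under a forced two-piece split. The gpt2 checkpoint and the " hello" example are purely illustrative choices.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Illustrative checkpoint; any causal LM with a subword tokeniser would do.
    name = "gpt2"
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)
    model.eval()

    def sequence_log_prob(token_ids):
        # Sum of log P(token_t | tokens_<t); gpt2's EOS token doubles as a start symbol.
        ids = torch.tensor([[tok.eos_token_id] + token_ids])
        with torch.no_grad():
            logits = model(ids).logits
        log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
        return log_probs[torch.arange(len(token_ids)), ids[0, 1:]].sum().item()

    text = " hello"
    canonical = tok(text, add_special_tokens=False)["input_ids"]
    # A forced alternative segmentation of the very same characters.
    alternative = (tok(" he", add_special_tokens=False)["input_ids"]
                   + tok("llo", add_special_tokens=False)["input_ids"])
    print(sequence_log_prob(canonical), sequence_log_prob(alternative))

The 17x gap quoted above comes from models trained from scratch under each tokenisation; this snippet only shows that the score an off-the-shelf LM assigns to a string is segmentation-dependent.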
I think it's a reasonable change, and it doesn't change the template style, so I'd say yes. There is also already the command `\citep*` to cite all authors of a paper, so citing only the first two should also be OK? I created a pull request this morning to add it to the official template :)
29.05.2025 14:13
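For reference, a minimal natbib sketch of the command mentioned above (the citation key is hypothetical, and the exact rendering depends on the bibliography style shipped with the template):

    \citep{smith-etal-2024}   % -> (Smith et al., 2024)
    \citep*{smith-etal-2024}  % -> all authors listed: (Smith, Jones, and Lee, 2024)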
I created a pull request earlier today. So hopefully they will approve and merge it soon-ish? :)
29.05.2025 14:04
Professor in Operations Research at Copenhagen Business School. In ❤️ with Sevilla and its Real Betis Balompié.
Ginni Rometty Prof @NorthwesternCS | Fellow @NU_IPR | Uncertainty + decisions | Humans + AI/ML | Blog @statmodeling
Strengthening Europe's Leadership in AI through Research Excellence | ellis.eu
EurIPS is a community-organized, NeurIPS-endorsed conference in Copenhagen where you can present papers accepted at @neuripsconf.bsky.social
eurips.cc
PostDoc @ Uni Tübingen
explainable AI, causality
gunnarkoenig.com
Postdoc @vectorinstitute.ai | organizer @queerinai.com | previously MIT, CMU LTI | rodent enthusiast | she/they
https://ryskina.github.io/
Helping machines make sense of the world. Asst Prof @icepfl.bsky.social; Before: @stanfordnlp.bsky.social @uwnlp.bsky.social AI2 #NLProc #AI
Website: https://atcbosselut.github.io/
MSc at @eth, interested in ML interpretability
Assistant Professor in NLP (Fairness, Interpretability, and lately interested in Political Science) at the University of Copenhagen
Before: PostDoc in NLP at Uni of CPH, PhD student in ML at TU Berlin
Postdoc at Utrecht University, previously PhD candidate at the University of Amsterdam
Multimodal NLP, Vision and Language, Cognitively Inspired NLP
https://ecekt.github.io/
The largest workshop on analysing and interpreting neural networks for NLP.
BlackboxNLP will be held at EMNLP 2025 in Suzhou, China
blackboxnlp.github.io
PhD student in NLP at ETH Zurich.
anejsvete.github.io
language model pretraining @ai2.bsky.social, co-lead of data research w/ @soldaini.net, statistics @uw, open science, tabletop, seattle, he/him, kyleclo.com
Posting about research, events, and news relevant for the Amsterdam NLP community. Account maintained by @wzuidema@bsky.social
MIT Brain and Cognitive Sciences
Postdoc researcher @ Fedorenko lab, MIT. Cognitive neuroscience of language and speech.
Asst Prof. @ UCSD | PI of LeM🍋N Lab | Former Postdoc at ETH Zürich, PhD @ NYU | computational linguistics, NLProc, CogSci, pragmatics | he/him 🏳️‍🌈
alexwarstadt.github.io
Assistant professor in NLP @UniMelb
Postdoc @rug.nl with Arianna Bisazza.
Interested in NLP, interpretability, syntax, language acquisition and typology.