Paper: arxiv.org/pdf/2507.08802
Code: github.com/densutter/no...
9/9
Paper title: The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
Authored by Denis Sutter, @jkminder, Thomas Hofmann, and @tpimentelms.
8/9 In summary, causal abstraction remains a valuable framework, but without explicit assumptions about how mechanisms are represented, it risks producing interpretability results that are not robust or meaningful.
7/9 For generality, we present these findings on simpler architectures (MLPs) across multiple random seeds and two additional tasks. This indicates that the issue is not confined to LLMs, but applies more broadly.
6/9 We further show that small LLMs which fail at the Indirect Object Identification task can nevertheless be interpreted as implementing an algorithm for it.
5/9 Beyond the theoretical argument, we present a broad set of experiments supporting our claim. Most notably, we show that a randomly initialised LLM can be interpreted as implementing an algorithm for Indirect Object Identification.
4/9 This occurs because the existing theoretical framework makes no structural assumptions about how mechanisms are encoded in distributed representations. This relates to the accuracy-complexity trade-off of probing.
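A toy illustration of that trade-off (a sketch of my own, not code or data from the paper): on random "hidden states" paired with random labels, a high-capacity probe can still reach near-perfect training accuracy, so probe accuracy alone says little unless probe complexity is controlled.

```python
# Toy sketch of the probing accuracy-complexity trade-off (illustrative only,
# not from the paper): the "representations" carry no information about the
# labels, yet an expressive enough probe can still fit them almost perfectly.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 32))      # fake hidden representations
y = rng.integers(0, 2, size=300)    # labels unrelated to X

# Low-complexity probe: linear decision boundary.
linear_probe = LogisticRegression(max_iter=2000).fit(X, y)
# High-complexity probe: wide MLP that can memorise the training set.
mlp_probe = MLPClassifier(hidden_layer_sizes=(1024,), alpha=0.0,
                          max_iter=5000, tol=1e-6).fit(X, y)

print("linear probe, train accuracy:", linear_probe.score(X, y))  # typically well below 1.0
print("MLP probe, train accuracy:   ", mlp_probe.score(X, y))     # typically close to 1.0
```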
3/9 We do not critique causal abstraction as a framework; rather, we show that combining it with the current understanding that modern models store information in a distributed way introduces a fundamental problem.
2/9 We demonstrate both theoretically (under reasonable assumptions) and empirically on real-world models that with arbitrarily complex representations, any algorithm can be mapped to any model.
1/9 In our new interpretability paper, we analyse causal abstraction, the framework behind Distributed Alignment Search (DAS), and show that it breaks when we remove linearity constraints on feature representations. We refer to this problem as the Non-Linear Representation Dilemma.
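To make the role of the linearity constraint concrete, here is a minimal sketch (a toy example with my own naming, not the paper's code) of a distributed interchange intervention: hidden states from a "base" and a "source" run are mapped into a feature space, a few coordinates are swapped, and the result is mapped back. DAS restricts the alignment map to an orthogonal rotation; the dilemma appears once arbitrary invertible non-linear maps, like the toy coupling layer below, are allowed.

```python
# Minimal sketch of a distributed interchange intervention with a linear
# (DAS-style rotation) versus a non-linear (invertible coupling layer)
# alignment map. Illustrative toy code, not the paper's implementation.
import numpy as np

rng = np.random.default_rng(0)
D, K = 8, 2  # hidden size, number of "aligned" feature coordinates

def interchange(h_base, h_source, to_feat, from_feat):
    """Map both hidden states into feature space, copy the first K feature
    coordinates from the source run into the base run, and map back."""
    z_base, z_source = to_feat(h_base), to_feat(h_source)
    z_base[:K] = z_source[:K]
    return from_feat(z_base)

# Linear alignment map: an orthogonal rotation, as in DAS.
Q, _ = np.linalg.qr(rng.normal(size=(D, D)))
linear_to = lambda h: Q.T @ h
linear_from = lambda z: Q @ z

# Non-linear alignment map: a toy invertible additive coupling layer.
W = rng.normal(size=(D // 2, D // 2))
def nonlin_to(h):
    a, b = h[:D // 2], h[D // 2:]
    return np.concatenate([a + np.tanh(W @ b), b])
def nonlin_from(z):
    a, b = z[:D // 2], z[D // 2:]
    return np.concatenate([a - np.tanh(W @ b), b])

h_base, h_source = rng.normal(size=D), rng.normal(size=D)
print(interchange(h_base.copy(), h_source, linear_to, linear_from))
print(interchange(h_base.copy(), h_source, nonlin_to, nonlin_from))
```

The rotation only re-expresses the hidden state in a different basis before the swap; once the alignment map may be arbitrarily non-linear, the swap can be made to realise far more mappings between algorithm and model, which is where the dilemma bites.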