Overview of descriptions for model components (neurons, attention heads) and model abstractions (SAE features, circuits).
Are you curious about uncovering the underlying mechanisms and identifying the roles of model components (neurons, …) and abstractions (SAEs, …)?
We provide the first survey of concept description generation and evaluation methods.
Joint effort w/ @lkopf.bsky.social
Paper: arxiv.org/abs/2510.01048
02.10.2025 09:13
Many thanks as well to the institutions that supported this research:
@tuberlin.bsky.social
@bifold.berlin
UMI Lab
@fraunhoferhhi.bsky.social
@unipotsdam.bsky.social
@leibnizatb.bsky.social
19.09.2025 12:01
I'm very grateful to my amazing collaborators @nfel.bsky.social, @kirillbykov.bsky.social, @philinelb.bsky.social, Anna Hedström, Marina M.-C. Höhne, and @eberleoliver.bsky.social
19.09.2025 12:01
Happy to share that our PRISM paper has been accepted at #NeurIPS2025!
In this work, we introduce a multi-concept feature description framework that can identify and score polysemantic features.
Paper: arxiv.org/abs/2506.15538
#NeurIPS #MechInterp #XAI
19.09.2025 12:01
Grateful to the institutions that supported this work:
@tuberlin.bsky.social
@bifold.berlin
UMI Lab
@fraunhoferhhi.bsky.social
@unipotsdam.bsky.social
@leibnizatb.bsky.social
(7/7)
19.06.2025 15:18
Many thanks to my amazing co-authors:
@nfel.bsky.social
@kirillbykov.bsky.social
@philinelb.bsky.social
Anna Hedström
Marina M.-C. Höhne
@eberleoliver.bsky.social
(6/7)
19.06.2025 15:18
Our results highlight that PRISM not only provides multiple human-interpretable descriptions for neurons but also aligns with human interpretations of polysemanticity. (5/7)
19.06.2025 15:18
In exploring the concept space, we use PRISM to characterize more complex components, finding and interpreting patterns that specific attention heads or groups of neurons respond to. (4/7)
19.06.2025 15:18
We benchmark PRISM across layers and architectures, showing how polysemanticity and interpretability shift through the model. (3/7)
19.06.2025 15:18
PRISM samples sentences from the top percentile activation distribution, clusters them in embedding space, and uses an LLM to generate labels for each concept cluster. (2/7)
19.06.2025 15:18
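For readers who want to see the shape of the pipeline described in the post above (top-percentile sampling, embedding-space clustering, LLM labeling), here is a minimal Python sketch. It is not the PRISM implementation, just an illustration under assumptions: `embed` stands in for any sentence encoder, `label_cluster` is a hypothetical wrapper around an LLM prompt, and k-means is only one possible clustering choice.

```python
# Illustrative PRISM-style feature description pipeline (a sketch, not the paper's code).
import numpy as np
from sklearn.cluster import KMeans


def top_percentile_sentences(sentences, activations, percentile=99.0):
    """Keep sentences whose activation for the feature lies in the top percentile."""
    threshold = np.percentile(activations, percentile)
    return [s for s, a in zip(sentences, activations) if a >= threshold]


def describe_feature(sentences, activations, embed, label_cluster, n_concepts=3):
    """Return one candidate concept description per cluster for a single feature.

    embed:         maps a list of sentences to a 2D array of embeddings
                   (assumed: any sentence encoder).
    label_cluster: maps a list of sentences to a short textual label
                   (hypothetical stand-in for an LLM prompt).
    """
    top = top_percentile_sentences(sentences, activations)
    vectors = np.asarray(embed(top))
    n_clusters = min(n_concepts, len(top))
    # Cluster the highly activating sentences; each cluster is treated as one concept.
    assignments = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(vectors)
    return [
        label_cluster([s for s, c in zip(top, assignments) if c == k])
        for k in range(n_clusters)
    ]
```

In this reading, a monosemantic feature would come back with near-duplicate labels across clusters, while a polysemantic one yields clearly distinct labels; comparing how distinct those clusters are is where a polysemanticity score of the kind the paper describes would plug in.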
When do neurons encode multiple concepts?
We introduce PRISM, a framework for extracting multi-concept feature descriptions to better understand polysemanticity.
Capturing Polysemanticity with PRISM: A Multi-Concept Feature Description Framework
arxiv.org/abs/2506.15538
(1/7)
19.06.2025 15:18
Huge thanks to my incredible supervisor
@kirillbykov.bsky.social, who laid the foundation for this project and provided brilliant guidance, and to @philinelb.bsky.social and Sebastian Lapuschkin, who unfortunately couldn't be there.
13.12.2024 02:48
Still overwhelmed by the amazing response to our poster session at @neuripsconf.bsky.social with Anna Hedström and Marina Höhne! It was incredible to have such lively and inspiring discussions with brilliant people whose work I admire.
13.12.2024 02:48
Thanks for putting together this amazing list, Margaret! I would love to be added if you still have space :)
12.12.2024 08:24
Want to know more about CoSy?
Paper: arxiv.org/abs/2405.20331
Code: github.com/lkopf/cosy
Poster: neurips.cc/virtual/2024...
#NeurIPS2024 #MechInterp #ExplainableAI #Interpretability
11.12.2024 06:43
Special thanks to our supporting institutions: UMI Lab, @xtraexer.bsky.social, @tuberlin.bsky.social, Uni Potsdam, ATB Potsdam, and Fraunhofer Heinrich-Hertz-Institut.
11.12.2024 06:43
My co-authors Anna Hedström and Marina Höhne will also be at @neuripsconf.bsky.social. A big thank you to my other co-authors @kirillbykov.bsky.social, @philinelb.bsky.social and Sebastian Lapuschkin, who unfortunately couldn't be there.
11.12.2024 06:43
I'll be presenting our work at @neuripsconf.bsky.social in Vancouver!
Join me this Thursday, December 12th, in East Exhibit Hall A-C, Poster #3107, from 11 a.m. PST to 2 p.m. PST. I'll be discussing our paper "CoSy: Evaluating Textual Explanations of Neurons."
11.12.2024 06:43
Language and keyboard stuff at Google + PhD student at Tokyo Institute of Technology.
I like computers and Korean and computers-and-Korean and high school CS education.
Georgia Tech → Yonsei University → Tokyo Institute of Technology.
https://theoreticallygoodwithcomputers.com/
PhD student @ Charles University. Researching evaluation and explainability of reasoning in language models.
Post-doc researcher at BIFOLD and the XplaiNLP group at the Quality and Usability Lab, TU Berlin. Interested in: XAI, fact-checking, synthetic data generation and evaluation
Assistant professor of computer science at Technion; visiting scholar at @KempnerInst 2025-2026
https://belinkov.com/
PhD student at TU Berlin & @bifold.berlin
I am interested in Machine Learning, Explainable AI (XAI), and Optimal Transport
Assistant Professor of Computer Science at TU Darmstadt, Member of @ellis.eu, DFG #EmmyNoether Fellow, PhD @ETH
Computer Vision & Deep Learning
PhD Student at the Max Planck Institute for Informatics @cvml.mpi-inf.mpg.de @maxplanck.de | Explainable AI, Computer Vision, Neuroexplicit Models
Web: sukrutrao.github.io
Head of XAI research at Fraunhofer HHI
Google Scholar: https://scholar.google.de/citations?user=wpLQuroAAAAJ
PhD student @ Fraunhofer HHI. XAI and Interpretability for NLP & Vision.
Professor of Psychology and Philosophy at the University of Illinois Urbana-Champaign. I build computational models and conduct experiments to understand how neural computing architectures give rise to symbolic thought.
PhD student explainable AI @ ML Group TU Berlin, BIFOLD
Cognitive and perceptual psychologist, industrial designer, & electrical engineer. Assistant Professor of Industrial Design at University of Illinois Urbana-Champaign. I make neurally plausible bio-inspired computational process models of visual cognition.
Postdoc in NeuroAI at Sorbonne University.
Studying collaboration and morality in humans and machines. Computational ethics, Cybernetics, ALife, self-organization, complexity, ecology, cultural evolution.
PhD candidate - Centre for Cognitive Science at TU Darmstadt,
explanations for AI, sequential decision-making, problem solving
PhD candidate studying perception of naturalistic facial expressions across lifespan | Former opera singer | Interested in multimodal communication (vocal/facial) & MSI, affective breathing, interoception
Postdoc at MIT BCS, interested in language(s) in humans and LMs
https://andrea-de-varda.github.io/
PhD student/research scientist intern at UCL NLP/Google DeepMind (50/50 split). Previously MS at KAIST AI and research engineer at Naver Clova. #NLP #ML https://soheeyang.github.io/
I work on speech and language technologies at Google. I like languages, history, maps, traveling, cycling, and buying way too many books.