Anna Tsvetkov

@annatsv.bsky.social

Postdoc @ Princeton AI Lab, Natural and Artificial Minds
Prev: PhD @ Brown, MIT FutureTech
Website: https://annatsv.github.io/

1,177 Followers  |  624 Following  |  5 Posts  |  Joined: 19.11.2024

Posts by Anna Tsvetkov (@annatsv.bsky.social)

Thinking of all the colleagues and friends I have who are connected to Brown University. What a devastating day for Brown, and for all of us.

14.12.2025 03:19 — 👍 119    🔁 13    💬 1    📌 0
The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks
Do neural networks, trained on well-understood algorithmic tasks, reliably rediscover known algorithms for solving those tasks? Several recent studies, on tasks ranging from group arithmetic to in-con...

Some tasks admit different algorithms that behave the same on the training data, so a model's learned mechanism can look arbitrary unless we know what the task requires (the goals, constraints, and invariances that define a correct solution). (Toy sketch after this post.)

❓ Other cases like this, or other limits of mech interp?
🧡 (2/2)

25.11.2025 23:55 — 👍 7    🔁 0    💬 0    📌 0
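To make the underdetermination point concrete, here is a toy sketch (my illustration, not the Clock/Pizza construction): two mechanisms that agree on every training input yet diverge off-distribution, so the training data alone cannot privilege one over the other.

    train_inputs = [0, 1, 2]

    def algo_a(x):
        # Mechanism 1: the identity function.
        return x

    def algo_b(x):
        # Mechanism 2: identity plus a term that vanishes exactly on the
        # training inputs (x*(x-1)*(x-2) is zero at x = 0, 1, 2).
        return x + x * (x - 1) * (x - 2)

    # Indistinguishable on the training data...
    assert all(algo_a(x) == algo_b(x) for x in train_inputs)

    # ...but different mechanisms, as an off-distribution probe shows.
    print(algo_a(3), algo_b(3))  # prints: 3 9

Only a computational-level spec of the task (identity? polynomial interpolation?) says which mechanism counts as correct.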

πŸ” What are the limits of interpretability in ML?
Mech interp often stays at Marr’s algorithmic level but without the computational level (what the task is, what counts as the right solution) the mechanisms we find can look arbitrary. Why does a model learn one algorithm rather than another?
🧡 (1/2)

25.11.2025 23:53 — 👍 10    🔁 0    💬 1    📌 0

Introspection targets our ongoing or recently past mental states. What could it mean for a system that lacks any obvious analogue of a continuous stream of experience to have current or recently past “internal states” to introspect on?

Robert Long makes a similar point on his Substack.

01.11.2025 18:33 — 👍 2    🔁 0    💬 1    📌 0
Emergent introspective awareness in large language models
Research from Anthropic on the ability of large language models to introspect

Anthropic has a great new piece on “Signs of introspection in large language models” 👉 www.anthropic.com/research/int...

🤔 Neat evidence that LLMs can report on manipulated activations, with big caveats!

🧠 But it leaves open: what are the “internal states” an LLM can introspect on in the first place? (Rough sketch of the setup after this post.)

01.11.2025 16:49 — 👍 9    🔁 1    💬 1    📌 0
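For context on the setup: the Anthropic experiments inject a “concept” into the model's activations mid-forward-pass and then ask the model about its own state. A rough sketch of that idea (my illustration, not Anthropic's code: GPT-2 as a stand-in model, an arbitrary layer, and a random placeholder vector where a real experiment would use a meaningful concept direction):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # stand-in; the paper studies Claude models
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    layer = model.transformer.h[6]               # arbitrary middle layer
    concept = torch.randn(model.config.n_embd)   # placeholder "concept" vector

    def inject(module, inputs, output):
        # Add the steering vector to this layer's hidden states;
        # the 4.0 scale is arbitrary.
        hidden = output[0] + 4.0 * concept
        return (hidden,) + output[1:]

    handle = layer.register_forward_hook(inject)
    ids = tok("Do you notice anything unusual about your internal state?",
              return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=30)
    handle.remove()
    print(tok.decode(out[0], skip_special_tokens=True))

A real version would derive the concept vector from activation differences (e.g., topic prompts vs. neutral prompts) and check whether the model's self-report tracks the injection.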

This is a beautiful paper! The first third helpfully labels a stream of recent work in philosophy of AI as "propositional interpretability". The idea is to use propositional attitudes like belief, desire, and intention to help explain AI in a way that we can understand. 1/n

29.01.2025 13:24 — 👍 49    🔁 11    💬 2    📌 0
MIT researchers release a repository of AI risks | TechCrunch
A group of researchers at MIT and elsewhere have compiled what they claim is the most thorough database of possible risks around AI use.

"The AI risk repository, which includes over 700 AI risks grouped by causal factors (e.g. intentionality), and domains (e.g. discrimination), was born out of a desire to understand the overlaps and disconnects in AI safety research"
#AIEthics

techcrunch.com/2024/08/14/m...

05.01.2025 21:03 — 👍 41    🔁 16    💬 2    📌 1
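The repository is distributed as a structured spreadsheet, so it is easy to slice; a hypothetical sketch (the filename and the column names "Domain" and "Intent" are my guesses at the export format, not confirmed field names):

    import pandas as pd

    # Load a local export of the AI Risk Repository (placeholder filename).
    risks = pd.read_csv("ai_risk_repository.csv")

    # Filter to one domain, then break it down by a causal factor.
    discrimination = risks[risks["Domain"].str.contains("Discrimination", na=False)]
    print(len(discrimination), "risks in the discrimination domain")
    print(discrimination.groupby("Intent").size())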

Would love to be included!

23.11.2024 20:21 — 👍 8    🔁 0    💬 0    📌 0