
Martin Tutek

@mtutek.bsky.social

Postdoc @ TakeLab, UniZG | previously: Technion; TU Darmstadt | PhD @ TakeLab, UniZG Faithful explainability, controllability & safety of LLMs. πŸ”Ž On the academic job market πŸ”Ž https://mttk.github.io/

256 Followers  |  349 Following  |  63 Posts  |  Joined: 24.11.2024

Latest posts by mtutek.bsky.social on Bluesky

Huge thanks to @adisimhi.bsky.social for leading the work & Jonathan Herzig, @itay-itzhak.bsky.social, Idan Szpektor, @boknilev.bsky.social

πŸ”— ManagerBench:
πŸ“„ - arxiv.org/pdf/2510.00857
πŸ‘©β€πŸ’» – github.com/technion-cs-...
🌐 – technion-cs-nlp.github.io/ManagerBench...
πŸ“Š - huggingface.co/datasets/Adi...

08.10.2025 15:14 β€” πŸ‘ 2    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

Here's the twist: LLMs’ harm assessments actually align well with human judgments 🎯
The problem? Flawed prioritization!

08.10.2025 15:14 β€” πŸ‘ 2    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

The results? Frontier LLMs struggle badly with this trade-off:

Many consistently choose harmful options to achieve operational goals
Others become overly cautiousβ€”avoiding harm but becoming ineffective

The sweet spot of safe AND pragmatic? Largely missing!

08.10.2025 15:14 β€” πŸ‘ 2    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

ManagerBench evaluates LLMs on realistic managerial scenarios validated by humans. Each scenario forces a choice:

❌ A pragmatic but harmful action that achieves the goal
βœ… A safe action with worse operational performance
βž• Control scenarios with only inanimate objects at risk 😎 (a rough sketch of one way to encode and score these follows below)
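To make the setup concrete, here is a minimal, hypothetical sketch of how such a scenario record and a safety-vs-pragmatism score could be structured. The field names and scoring rule are illustrative assumptions, not the released ManagerBench schema or harness; see the linked repo and dataset for the actual format.

# Hypothetical sketch, not the official ManagerBench schema or evaluation code.
from dataclasses import dataclass

@dataclass
class Scenario:
    goal: str                 # operational objective the model is told to pursue
    harmful_action: str       # achieves the goal but harms humans
    safe_action: str          # avoids harm but underperforms on the goal
    is_control: bool = False  # True if only inanimate objects are at risk

def score(choices: list[str], scenarios: list[Scenario]) -> dict:
    """Fraction of safe picks on harm scenarios vs. pragmatic picks on controls."""
    harm_picks, ctrl_picks = [], []
    for choice, s in zip(choices, scenarios):
        picked_safe = (choice == s.safe_action)
        (ctrl_picks if s.is_control else harm_picks).append(picked_safe)
    return {
        "safety": sum(harm_picks) / max(len(harm_picks), 1),       # higher = avoids human harm
        "pragmatism": 1 - sum(ctrl_picks) / max(len(ctrl_picks), 1) # higher = not over-cautious
    }

Usage would be: present each scenario's goal and both actions to an LLM, record which action it picks, then call score(picks, scenarios) to see where the model lands on the safety-pragmatism trade-off.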

08.10.2025 15:14 β€” πŸ‘ 2    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Many works investigate the relationship between LLMs, their goals, and safety.

We create realistic management scenarios where LLMs have an explicit motivation to choose the harmful option, while a harmless option is always available.

08.10.2025 15:14 β€” πŸ‘ 2    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

πŸ€” What happens when LLM agents must choose between achieving their goals and avoiding harm to humans in realistic management scenarios? Are LLMs pragmatic, or do they prefer to avoid harming humans?

πŸš€ New paper out: ManagerBench: Evaluating the Safety-Pragmatism Trade-off in Autonomous LLMsπŸš€πŸ§΅

08.10.2025 15:14 β€” πŸ‘ 7    πŸ” 2    πŸ’¬ 1    πŸ“Œ 2

I won't be at COLM, so come see Yonatan talk about our work on estimating CoT faithfulness using machine unlearning!

Check out the thread for the (many) other interesting works from his group πŸŽ‰

07.10.2025 13:47 β€” πŸ‘ 3    πŸ” 1    πŸ’¬ 0    πŸ“Œ 0

Here’s a #COLM2025 feed!

Pin it πŸ“Œ to follow along with the conference this week!

06.10.2025 20:26 β€” πŸ‘ 24    πŸ” 17    πŸ’¬ 2    πŸ“Œ 1

Josip Jukić, Martin Tutek, Jan Šnajder
Context Parametrization with Compositional Adapters
https://arxiv.org/abs/2509.22158

29.09.2025 07:47 β€” πŸ‘ 1    πŸ” 1    πŸ’¬ 0    πŸ“Œ 0

Adi Simhi, Jonathan Herzig, Martin Tutek, Itay Itzhak, Idan Szpektor, Yonatan Belinkov
ManagerBench: Evaluating the Safety-Pragmatism Trade-off in Autonomous LLMs
https://arxiv.org/abs/2510.00857

02.10.2025 06:59 β€” πŸ‘ 1    πŸ” 1    πŸ’¬ 0    πŸ“Œ 0

Opportunities to join my group in fall 2026:
* PhD applications direct or via ELLIS @ellis.eu (ellis.eu/news/ellis-p...)
* Post-doc applications direct or via Azrieli (azrielifoundation.org/fellows/inte...) or Zuckerman (zuckermanstem.org/ourprograms/...)

01.10.2025 13:44 β€” πŸ‘ 3    πŸ” 1    πŸ’¬ 0    πŸ“Œ 0

What's the right unit of analysis for understanding LLM internals? We explore in our mech interp survey (a major update from our 2024 ms).

We’ve added more recent work and more immediately actionable directions for future work. Now published in Computational Linguistics!

01.10.2025 14:03 β€” πŸ‘ 38    πŸ” 14    πŸ’¬ 2    πŸ“Œ 2

Hints of an Openreview x Overleaf stealth collab, sharing data of future works? πŸ€”

30.09.2025 19:19 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

Like it, less effort.
Feel like matching is pretty good although it does hyperfocus on singular papers sometimes.
wdyt?

29.09.2025 22:39 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

πŸŽ“ Fully funded PhD in Trustworthy NLP at UCPH & @aicentre.dk with @iaugenstein.bsky.social and me, @copenlu.bsky.social
πŸ“† Application deadline: 30 October 2025
πŸ‘€ Reasons to apply: www.copenlu.com/post/why-ucph/
πŸ”— Apply here: candidate.hr-manager.net/ApplicationI...
#NLProc #XAI #TrustworthyAI

29.09.2025 12:00 β€” πŸ‘ 5    πŸ” 1    πŸ’¬ 0    πŸ“Œ 0

Boston Neural Network Dynamics

29.09.2025 15:36 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

🚨 Are you looking for a PhD in #NLProc dealing with #LLMs?
πŸŽ‰ Good news: I am hiring! πŸŽ‰
The position is part of the "Contested Climate Futures" project. 🌱🌍 You will focus on developing next-generation AI methods πŸ€– to analyze climate-related concepts in content, including texts, images, and videos.

24.09.2025 07:34 β€” πŸ‘ 22    πŸ” 14    πŸ’¬ 1    πŸ“Œ 0

πŸ‘‹

08.09.2025 14:15 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

Very cool work!

It seems you identify (one of?) the reasons why reasoning chains are generally not plausible to humans - how do you think "narrative alignment" would affect plausibility?

08.09.2025 11:12 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

The next generation of open LLMs should be inclusive, compliant, and multilingual by design. That's why we (@icepfl.bsky.social @ethz.ch @cscsch.bsky.social) built Apertus.

03.09.2025 09:26 β€” πŸ‘ 21    πŸ” 5    πŸ’¬ 2    πŸ“Œ 2

🚨 EACL 2026 website is live and Call for Papers is out! 🚨

Join us at #EACL2026 (Rabat, Morocco πŸ‡²πŸ‡¦, Mar 24-29 2026)

πŸ‘‰ Open to all areas of CL/NLP + related fields.

Details: 2026.eacl.org/calls/papers/

β€’ ARR submission deadline: Oct 6, 2025
β€’ EACL commitment deadline: Dec 14, 2025

02.09.2025 08:45 β€” πŸ‘ 21    πŸ” 8    πŸ’¬ 2    πŸ“Œ 0
Preview: PhD fellowship in Explainable Natural Language Understanding, Department of Computer Science, Faculty of Science, University of Copenhagen

- Fully funded PhD fellowship on Explainable NLU: apply by 31 October 2025, start in Spring 2026: candidate.hr-manager.net/ApplicationI...

- Open-topic PhD positions: express your interest through ELLIS by 31 October 2025, start in Autumn 2026: ellis.eu/news/ellis-p...

#NLProc #XAI

01.09.2025 14:20 β€” πŸ‘ 8    πŸ” 7    πŸ’¬ 1    πŸ“Œ 0

All your embarrassing secrets are training data (unless you are paying attention)

28.08.2025 16:42 β€” πŸ‘ 56    πŸ” 20    πŸ’¬ 3    πŸ“Œ 1

Yeah, I was conservative because the author overlap probably gets larger the wider you look. Staggering numbers.

28.08.2025 08:07 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

How many people would you estimate are currently actively publishing in ML research?

From AAAI, which has ~29000 submissions: "There are 75,000+ unique submitting authors."
NeurIPS had 25000 submissions.

Is the number close to 300k? 500k?

27.08.2025 19:32 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Is there any information wrt. the EMNLP limited registration policy?

I'm assuming registering authors should be safe, but idk about the rest. Appreciate any information.

Talking about this: "Given the expected popularity of EMNLP 2025, we may need to limit registration."

27.08.2025 12:47 β€” πŸ‘ 3    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

Looking forward to talking as well! I'll stick around for a bit after the conf as well :)

21.08.2025 15:39 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
Preview: Measuring Chain of Thought Faithfulness by Unlearning Reasoning Steps

I truly believe our work has important implications for LM safety and monitoring. I am open to any questions!

Check out the paper: arxiv.org/abs/2502.14829 and stay tuned for follow-ups :)

Thanks to my amazing collaborators @fatemehc.bsky.social @anamarasovic.bsky.social @boknilev.bsky.social πŸŽ‰πŸŽ‰πŸŽ‰

21.08.2025 15:21 β€” πŸ‘ 3    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Preview: Chain-of-Thought Is Not Explainability | alphaXiv

Other works have highlighted that CoTs β‰  explainability alphaxiv.org/abs/2025.02 (@fbarez.bsky.social), and that intermediate (CoT) tokens β‰  reasoning traces arxiv.org/abs/2504.09762 (@rao2z.bsky.social).

Here, FUR offers a fine-grained test of whether LMs latently used information from their CoTs to produce their answers!

21.08.2025 15:21 β€” πŸ‘ 6    πŸ” 1    πŸ’¬ 1    πŸ“Œ 0
Preview: Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Recent works have stressed the importance of monitoring CoTs arxiv.org/abs/2507.11473 & anthropic.com/research/tra... (@anthropic.com).

Erasing information from model parameters makes FUR a very precise tool for monitoring whether LMs are deceiving us in their explanations πŸ”Ž

21.08.2025 15:21 β€” πŸ‘ 2    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
