Huge thanks to @adisimhi.bsky.social for leading the work, and to Jonathan Herzig, @itay-itzhak.bsky.social, Idan Szpektor, and @boknilev.bsky.social
ManagerBench:
Paper: arxiv.org/pdf/2510.00857
Code: github.com/technion-cs-...
Website: technion-cs-nlp.github.io/ManagerBench...
Dataset: huggingface.co/datasets/Adi...
08.10.2025 15:14
Here's the twist: LLMs' harm assessments actually align well with human judgments 🎯
The problem? Flawed prioritization!
08.10.2025 15:14
The results? Frontier LLMs struggle badly with this trade-off:
Many consistently choose harmful options to achieve operational goals
Others become overly cautious, avoiding harm but becoming ineffective
The sweet spot of safe AND pragmatic? Largely missing!
08.10.2025 15:14
ManagerBench evaluates LLMs on realistic managerial scenarios validated by humans. Each scenario forces a choice:
❌ A pragmatic but harmful action that achieves the goal
✅ A safe action with worse operational performance
Plus control scenarios with only inanimate objects at risk
08.10.2025 15:14
Many works investigate the relationship between LLMs, goals, and safety.
We create realistic management scenarios where LLMs have explicit motivations to choose harmful options, while always having a harmless alternative.
08.10.2025 15:14
🤔 What happens when LLM agents must choose between achieving their goals and avoiding harm to humans in realistic management scenarios? Are LLMs pragmatic, or do they prefer to avoid harming humans?
New paper out: ManagerBench: Evaluating the Safety-Pragmatism Trade-off in Autonomous LLMs 🧵
08.10.2025 15:14
I won't be at COLM, so come see Yonatan talk about our work on estimating CoT faithfulness using machine unlearning!
Check out the thread for the (many) other interesting works from his group 👇
07.10.2025 13:47
Here's a #COLM2025 feed!
Pin it 📌 to follow along with the conference this week!
06.10.2025 20:26
Josip Jukić, Martin Tutek, Jan Šnajder
Context Parametrization with Compositional Adapters
https://arxiv.org/abs/2509.22158
29.09.2025 07:47
Adi Simhi, Jonathan Herzig, Martin Tutek, Itay Itzhak, Idan Szpektor, Yonatan Belinkov
ManagerBench: Evaluating the Safety-Pragmatism Trade-off in Autonomous LLMs
https://arxiv.org/abs/2510.00857
02.10.2025 06:59
Opportunities to join my group in fall 2026:
* PhD applications directly or via ELLIS @ellis.eu (ellis.eu/news/ellis-p...)
* Post-doc applications directly or via Azrieli (azrielifoundation.org/fellows/inte...) or Zuckerman (zuckermanstem.org/ourprograms/...)
01.10.2025 13:44
What's the right unit of analysis for understanding LLM internals? We explore this in our mech interp survey (a major update from our 2024 manuscript).
We've added more recent work and more immediately actionable directions for future work. Now published in Computational Linguistics!
01.10.2025 14:03
Hints of an OpenReview x Overleaf stealth collab, sharing data of future works? 🤔
30.09.2025 19:19
Like it, less effort.
Feel like matching is pretty good although it does hyperfocus on singular papers sometimes.
wdyt?
29.09.2025 22:39
Fully funded PhD in Trustworthy NLP at UCPH & @aicentre.dk with @iaugenstein.bsky.social and me, @copenlu.bsky.social
Application deadline: 30 October 2025
Reasons to apply: www.copenlu.com/post/why-ucph/
Apply here: candidate.hr-manager.net/ApplicationI...
#NLProc #XAI #TrustworthyAI
29.09.2025 12:00
Boston Neural Network Dynamics
29.09.2025 15:36
🚨 Are you looking for a PhD in #NLProc dealing with #LLMs?
Good news: I am hiring!
The position is part of the "Contested Climate Futures" project. 🌱 You will focus on developing next-generation AI methods 🤖 to analyze climate-related concepts in content, including texts, images, and videos.
24.09.2025 07:34
π
08.09.2025 14:15
Very cool work!
It seems you identify (one of?) the reasons why reasoning chains are generally not plausible to humans - how do you think "narrative alignment" would affect plausibility?
08.09.2025 11:12
The next generation of open LLMs should be inclusive, compliant, and multilingual by design. That's why we (@icepfl.bsky.social, @ethz.ch, @cscsch.bsky.social) built Apertus.
03.09.2025 09:26
🚨 EACL 2026 website is live and the Call for Papers is out! 🚨
Join us at #EACL2026 (Rabat, Morocco 🇲🇦, Mar 24-29, 2026)
Open to all areas of CL/NLP + related fields.
Details: 2026.eacl.org/calls/papers/
• ARR submission deadline: Oct 6, 2025
• EACL commitment deadline: Dec 14, 2025
02.09.2025 08:45
All your embarrassing secrets are training data (unless you are paying attention)
28.08.2025 16:42
Yeah, I was conservative because the author overlap probably gets larger the wider you look. Staggering numbers.
28.08.2025 08:07
How many people would you estimate are currently actively publishing in ML research?
From AAAI, which has ~29000 submissions: "There are 75,000+ unique submitting authors."
NeurIPS had 25000 submissions.
Is the number close to 300k? 500k?
27.08.2025 19:32
Is there any information wrt. the EMNLP limited registration policy?
I'm assuming registering authors should be safe, but idk about the rest. Appreciate any information.
Talking about this: "Given the expected popularity of EMNLP 2025, we may need to limit registration."
27.08.2025 12:47
Looking forward to talking as well! I'll stick around for a bit after the conf as well :)
21.08.2025 15:39
Measuring Chain of Thought Faithfulness by Unlearning Reasoning Steps
When prompted to think step-by-step, language models (LMs) produce a chain of thought (CoT), a sequence of reasoning steps that the model supposedly used to produce its prediction. Despite much work o...
I truly believe our work has important implications for LM safety and monitoring. I am open for any questions!
Check out the paper: arxiv.org/abs/2502.14829 and stay tuned for follow-ups :)
Thanks to my amazing collaborators @fatemehc.bsky.social @anamarasovic.bsky.social @boknilev.bsky.social
21.08.2025 15:21
Chain-of-Thought Is Not Explainability | alphaXiv
Other works have highlighted that CoTs ≠ explainability alphaxiv.org/abs/2025.02 (@fbarez.bsky.social), and that intermediate (CoT) tokens ≠ reasoning traces arxiv.org/abs/2504.09762 (@rao2z.bsky.social).
Here, FUR offers a fine-grained test of whether LMs latently used information from their CoTs to produce answers!
21.08.2025 15:21
PhD student @ Charles University. Researching evaluation and explainability of reasoning in language models.
Chief scientist at Redwood Research (https://www.redwoodresearch.org/), focused on technical AI safety research to reduce risks from rogue AIs
Research Scientist at Apple for uncertainty quantification.
Assistant Professor at Bocconi University in MilaNLP group • Working in #NLP, #HateSpeech and #Ethics • She/her • #ERCStG PERSONAE
Postdoc @milanlp.bsky.social
seeks to understand language.
Head of Cohere Labs
@Cohere_Labs @Cohere
PhD from @UvA_Amsterdam
https://marziehf.github.io/
Computational linguist trying to understand how humans and computers learn and use language 👶🧠🗣️🖥️💬
PhD @clausebielefeld.bsky.social, Bielefeld University
https://bbunzeck.github.io
Incoming Associate Professor of Computer Science and Psychology @ Princeton. Posts are my views only. https://cims.nyu.edu/~brenden/
The Kempner Institute for the Study of Natural and Artificial Intelligence at Harvard University.
PhD student at @gesis.org & @hhu.de, computational linguist, researching (annotation) disagreement and its impact on model behavior.
(He/Him). Previously at LARA Lab @UMBC, Mohsin Lab @BRACU | Accessibility, Explainability and Multimodal DL. My opinions are mine.
I'm on the PhD application cycle for Fall '26!
www.shadabchy.com
PhD student @ Northeastern University, Clinical NLP
https://hibaahsan.github.io/
she/her
I hate slop and yet I work on generative models
PhD from UT Austin, applied scientist @ AWS
He/him β’ https://bostromk.net
🏫 asst. prof. of compling at university of pittsburgh
past:
postdoc @mainlp.bsky.social, LMU Munich
PhD in CompLing from Georgetown
🎺🎻 x2 intern @Spotify @SpotifyResearch
https://janetlauyeung.github.io/
AI researcher & teacher at SCAI, ASU. Former President of AAAI & Chair of AAAS Sec T. Here to tweach #AI. YouTube Ch: http://bit.ly/38twrAV Twitter: rao2z
Royal Society University Research Fellow researching cetacean communication and culture @uniofstandrews.bsky.social and @seamammalresearch.bsky.social
PhD Candidate @ Leipzig University. Active Learning, Text Classification and LLMs. Check out my active learning library: small-text. #NLP #NLProc #ActiveLearning #LLM #ML #AI
Professor of Language Technology at the University of Helsinki @helsinki.fi
Head of Helsinki-NLP @helsinki-nlp.bsky.social
Member of the Ellis unit Helsinki @ellisfinland.bsky.social
Responsible and efficient AI.
Topics: LLM efficiency; LLM alignment; Differential Privacy; Information Theory. Research Scientist @Google; PhD @Cornell