Scientific writing is an open-ended expert task shaped by individual preferences. Evaluating it means capturing both general and expert-specific quality criteria. Our new work pushes preference-aligned evaluation forward.
26.02.2026 08:14 · 👍 0 · 🔁 0 · 💬 1 · 📌 0
Poster titled “Reward Modeling for Scientific Writing Evaluation & Expert Preference Based Evaluation of Automated Related Work Generation,” with authors Furkan Sahinuç, Subhabrata Dutta, and Iryna Gurevych. The graphic illustrates a workflow: cited papers are used by an author to create a gold standard related work section and evaluation criteria, while AI generators produce draft related work sections. An LLM-supported evaluation system ranks outputs based on expert preferences, with feedback loops to improve generation quality.
📣🧪 𝗡𝗲𝘄 𝗿𝗲𝘀𝘂𝗹𝘁𝘀: 𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗻𝗴 𝗦𝗰𝗶𝗲𝗻𝘁𝗶𝗳𝗶𝗰 𝗪𝗿𝗶𝘁𝗶𝗻𝗴 𝘄𝗶𝘁𝗵 𝗘𝘅𝗽𝗲𝗿𝘁 𝗣𝗿𝗲𝗳𝗲𝗿𝗲𝗻𝗰𝗲𝘀
👉 Today we are presenting our first two papers that build on HAI-Co2, our recent position paper on human-AI text co-construction in expert domains:
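To make the workflow shown in the poster concrete, here is a minimal, hypothetical Python sketch (the function names, toy judge, and example data are our own assumptions, not the papers' actual pipeline): candidate related-work drafts are compared pairwise under expert-derived criteria and ranked by how often they are preferred.

```python
# Minimal sketch (not the authors' implementation) of preference-based
# ranking of candidate related-work drafts against expert criteria.
# `judge` stands in for an LLM call that returns the preferred draft;
# a toy length-based judge keeps the example self-contained and runnable.
from itertools import combinations
from typing import Callable, Dict, List

def rank_drafts(
    drafts: Dict[str, str],
    criteria: List[str],
    judge: Callable[[str, str, List[str]], str],
) -> List[str]:
    """Rank drafts by pairwise preference wins (simple win-count ranking)."""
    wins = {name: 0 for name in drafts}
    for a, b in combinations(drafts, 2):
        preferred = judge(drafts[a], drafts[b], criteria)
        wins[a if preferred == drafts[a] else b] += 1
    return sorted(wins, key=wins.get, reverse=True)

# Toy judge: prefers the longer draft. A real system would prompt an LLM
# with the gold related work section, the expert criteria, and both drafts.
toy_judge = lambda x, y, criteria: x if len(x) >= len(y) else y

drafts = {
    "model_A": "Prior work on citation generation ...",
    "model_B": "Related work sections have been studied extensively ...",
}
criteria = ["coverage of cited papers", "coherence", "faithful attribution"]
print(rank_drafts(drafts, criteria, toy_judge))
```

In the papers' setting, the judge would be the LLM-supported evaluation system conditioned on expert preferences; the toy judge above only keeps the sketch executable.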
26.02.2026 08:14 · 👍 0 · 🔁 0 · 💬 1 · 📌 0
We thank Andreas for his contributions to the Lab and wish him all the best for his future!
#NLP #PhDDefense #ComputationalArgumentation #Reliability #Interpretability #UKPLab #TUDarmstadt #UniTuebingen #LLMs #NLProc
24.02.2026 08:07 · 👍 1 · 🔁 0 · 💬 0 · 📌 0
👥 Jury: Marcus Rohrbach (@tuda.bsky.social), @igurevych.bsky.social (@tuda.bsky.social), Fajri Koto (MBZUAI, Mohamed bin Zayed University of Artificial Intelligence), Simone Schaub-Meyer (@tuda.bsky.social), Yufang Hou (@ituaustria.bsky.social)
24.02.2026 08:07 · 👍 1 · 🔁 1 · 💬 1 · 📌 0
We are excited to highlight that Andreas is now a PostDoc and Junior Group Leader at @unituebingen.bsky.social, working with Michael Franke in the Cognitive Science and Pragmatics group.
24.02.2026 08:07 · 👍 0 · 🔁 0 · 💬 1 · 📌 0
... and cognitive theories to better understand how language models operate and where their limitations lie.
With this work, Andreas makes a valuable contribution to the field of 𝗿𝗲𝗹𝗶𝗮𝗯𝗹𝗲 𝗮𝗻𝗱 𝗶𝗻𝘁𝗲𝗿𝗽𝗿𝗲𝘁𝗮𝗯𝗹𝗲 𝗡𝗟𝗣 and takes an important step toward more robust language technologies.
24.02.2026 08:07 · 👍 0 · 🔁 0 · 💬 1 · 📌 0
Andreas examines the reliability of language models in computational argumentation, focusing on how their outputs can be interpreted and evaluated more systematically. His work connects interpretability methods with linguistic ...
24.02.2026 08:07 · 👍 0 · 🔁 0 · 💬 1 · 📌 0
#NLP #LLMs #MentalHealth #ClinicalNLP #DigitalHealth #ResponsibleAI #NLProc #AIevaluation #ModelEvaluation #TrustworthyAI #Safety #Equity #HumanCenteredAI
19.02.2026 09:46 · 👍 1 · 🔁 0 · 💬 0 · 📌 0
This work has been supported by the LOEWE program as part of @loewe-dynamic.bsky.social, the LOEWE Chair of Excellence “Ubiquitous Knowledge Processing” in Hesse, Germany, and @dagstuhl.de through the Dagstuhl Seminar 25361 ‘Natural Language Processing for Mental Health’.
19.02.2026 09:46 · 👍 2 · 🔁 0 · 💬 1 · 📌 0
· @florplaza.bsky.social (@unileiden.bsky.social) · @dirkhovy.bsky.social (Università Bocconi) · Maria Liakata (Queen Mary University of London) · @igurevych.bsky.social (@tuda.bsky.social)
19.02.2026 09:46 · 👍 2 · 🔁 0 · 💬 1 · 📌 0
@schwartz-psyres.bsky.social (@unitrier.bsky.social) · @wlutzpsyres.bsky.social (@unitrier.bsky.social) · Tim Althoff (University of Washington) · Munmun De Choudhury (Georgia Institute of Technology) · Hamidreza Jamalabadi (@unimarburg.bsky.social) · Raj Shah (Georgia Institute of Technology) [...]
19.02.2026 09:46 · 👍 1 · 🔁 0 · 💬 1 · 📌 0
👥 𝗔𝘂𝘁𝗵𝗼𝗿𝘀:
Hiba Arnaout (@tuda.bsky.social) · @anmolgoel.bsky.social (@tuda.bsky.social) · H. Andrew Schwartz (@vanderbilt.edu) · Steffen T. Eberhardt (@unitrier.bsky.social) · Dana Atzil-Slonim (Bar-Ilan University) · Gavin Doherty (Trinity College Dublin) [...]
19.02.2026 09:46 · 👍 1 · 🔁 0 · 💬 1 · 📌 0
• An evaluation “dashboard” grounded in 𝘃𝗮𝗹𝗶𝗱𝗶𝘁𝘆, 𝗿𝗲𝗹𝗶𝗮𝗯𝗶𝗹𝗶𝘁𝘆, 𝗶𝗺𝗽𝗹𝗲𝗺𝗲𝗻𝘁𝗮𝘁𝗶𝗼𝗻 and 𝗺𝗮𝗶𝗻𝘁𝗲𝗻𝗮𝗻𝗰𝗲, drawing from psychometrics and implementation science—the study of how evidence-based tools can be successfully adopted and used in real-world settings.
19.02.2026 09:46 · 👍 1 · 🔁 0 · 💬 1 · 📌 0
• A taxonomy of mental health AI support types: 𝗮𝘀𝘀𝗲𝘀𝘀𝗺𝗲𝗻𝘁, 𝗶𝗻𝘁𝗲𝗿𝘃𝗲𝗻𝘁𝗶𝗼𝗻 and 𝗶𝗻𝗳𝗼𝗿𝗺𝗮𝘁𝗶𝗼𝗻 𝘀𝘆𝗻𝘁𝗵𝗲𝘀𝗶𝘀. Each comes with its own risks and needs distinct evidence.
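Purely as an illustration (our own sketch, not an artifact from the paper), the two axes of the taxonomy can be written down as a small data structure whose cells hold the evaluation questions for each support type and criterion:

```python
# Illustrative sketch only: the taxonomy's two axes encoded as data.
# Rows are support types; columns are the dashboard's quality and
# real-world criteria. Cell contents below are paraphrased examples,
# not quotations from the paper.
SUPPORT_TYPES = ("assessment", "intervention", "information synthesis")
CRITERIA = ("validity", "reliability", "implementation", "maintenance")

# Empty grid: each cell collects the evaluation questions for that
# support type and criterion (e.g. construct validity, safety, usability).
TAXONOMY = {s: {c: [] for c in CRITERIA} for s in SUPPORT_TYPES}

TAXONOMY["assessment"]["validity"].append(
    "Does the predicted score track the clinical construct it claims to measure?"
)
print(TAXONOMY["assessment"]["validity"])
```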
19.02.2026 09:46 · 👍 1 · 🔁 0 · 💬 1 · 📌 0
• A review of 𝟭𝟯𝟱 𝗔𝗖𝗟 𝗔𝗻𝘁𝗵𝗼𝗹𝗼𝗴𝘆 𝗽𝗮𝗽𝗲𝗿𝘀 from the past 5 years, showing that many studies rely on generic AI metrics and lack expert involvement or attention to safety and equity: ukplab.github.io/nlp-mh-evals/
19.02.2026 09:46 · 👍 1 · 🔁 0 · 💬 1 · 📌 0
... and clinical expertise, argues that current evaluation approaches are too narrow and are not easily understood or applied by different communities (AI, clinicians, users, regulators).
📌 𝗪𝗵𝗮𝘁 𝘁𝗵𝗲 𝗽𝗮𝗽𝗲𝗿 𝗶𝘀 𝗰𝗼𝗻𝘁𝗿𝗶𝗯𝘂𝘁𝗶𝗻𝗴:
19.02.2026 09:46 · 👍 1 · 🔁 0 · 💬 1 · 📌 0
Table titled “Taxonomy for evaluation of AI in mental health applications,” organized into columns for quality criteria (validity and reliability) and real-world use (implementation and maintenance). Rows distinguish support types: assessment, intervention, and information synthesis. Each cell lists detailed evaluation questions, such as construct and criterion validity, consistency across populations and time, feasibility, effectiveness, usability, acceptability, safety, and unintended consequences, providing a structured framework for assessing AI systems in mental health contexts.
🔎🧩 𝗕𝗲𝘆𝗼𝗻𝗱 𝗕𝗲𝗻𝗰𝗵𝗺𝗮𝗿𝗸𝘀: 𝗛𝗼𝘄 𝘁𝗼 𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗲 𝗠𝗲𝗻𝘁𝗮𝗹 𝗛𝗲𝗮𝗹𝘁𝗵 𝗔𝗜 𝗥𝗲𝘀𝗽𝗼𝗻𝘀𝗶𝗯𝗹𝘆
AI for mental health is a high-stakes area: its evaluation needs to meet the highest standards.
The new preprint 𝘙𝘦𝘴𝘱𝘰𝘯𝘴𝘪𝘣𝘭𝘦 𝘌𝘷𝘢𝘭𝘶𝘢𝘵𝘪𝘰𝘯 𝘰𝘧 𝘈𝘐 𝘧𝘰𝘳 𝘔𝘦𝘯𝘵𝘢𝘭 𝘏𝘦𝘢𝘭𝘵𝘩, written by an interdisciplinary team spanning AI [...]
19.02.2026 09:46 · 👍 3 · 🔁 3 · 💬 1 · 📌 0
The briefing also features perspectives from:
👤 Dr. Anne Reinhardt, @lmu.de
👤 Prof. Dr. Ute Schmid, Otto-Friedrich-Universität Bamberg / Bamberger Zentrum für Künstliche Intelligenz (BaCAI)
👤 Prof. Dr. Kerstin Denecke, @bfh-ch.bsky.social
18.02.2026 09:07 · 👍 1 · 🔁 0 · 💬 1 · 📌 0
Gurevych’s practical takeaway is clear: to be useful as a medical first contact, chatbots must do more than answer questions. They should 𝗴𝘂𝗶𝗱𝗲 𝘂𝘀𝗲𝗿𝘀 𝘁𝗼 𝗽𝗿𝗼𝘃𝗶𝗱𝗲 𝗰𝗼𝗺𝗽𝗹𝗲𝘁𝗲 𝗶𝗻𝗳𝗼𝗿𝗺𝗮𝘁𝗶𝗼𝗻, 𝗮𝘀𝗸 𝗳𝗼𝗹𝗹𝗼𝘄-𝘂𝗽 𝗾𝘂𝗲𝘀𝘁𝗶𝗼𝗻𝘀, 𝗰𝗼𝗺𝗺𝘂𝗻𝗶𝗰𝗮𝘁𝗲 𝘂𝗻𝗰𝗲𝗿𝘁𝗮𝗶𝗻𝘁𝘆, 𝗮𝗻𝗱 𝘀𝘁𝗮𝘆 𝘄𝗶𝘁𝗵𝗶𝗻 𝗰𝗹𝗲𝗮𝗿𝗹𝘆 𝗱𝗲𝗳𝗶𝗻𝗲𝗱, 𝗹𝗼𝘄-𝗿𝗶𝘀𝗸 𝗯𝗼𝘂𝗻𝗱𝗮𝗿𝗶𝗲𝘀.
18.02.2026 09:07 · 👍 0 · 🔁 0 · 💬 1 · 📌 0
Strikingly, the study finds that models perform much better with simulated users than with real people. This suggests that 𝘀𝗶𝗺𝘂𝗹𝗮𝘁𝗶𝗼𝗻𝘀 𝗺𝗮𝘆 𝘀𝘆𝘀𝘁𝗲𝗺𝗮𝘁𝗶𝗰𝗮𝗹𝗹𝘆 𝗼𝘃𝗲𝗿𝗲𝘀𝘁𝗶𝗺𝗮𝘁𝗲 𝗿𝗲𝗮𝗹-𝘄𝗼𝗿𝗹𝗱 𝗽𝗲𝗿𝗳𝗼𝗿𝗺𝗮𝗻𝗰𝗲.
18.02.2026 09:07 · 👍 0 · 🔁 0 · 💬 1 · 📌 0
A new Nature Medicine study suggests that LLMs don’t reliably add value when people search for health information. The key issue is less about raw model capability and more about 𝗵𝗼𝘄 𝗵𝘂𝗺𝗮𝗻𝘀 𝗮𝗻𝗱 𝗺𝗼𝗱𝗲𝗹𝘀 𝗶𝗻𝘁𝗲𝗿𝗮𝗰𝘁: users omit crucial details, misunderstand outputs or don’t act on correct suggestions.
18.02.2026 09:07 · 👍 0 · 🔁 0 · 💬 1 · 📌 0
Share graphic titled “Chatbots: Faulty Communication on Health Issues.” The top left shows the Ubiquitous Knowledge Processing Lab logo, and the bottom left features the Science Media Center Germany logo. On the right is a circular portrait of a woman with curly hair and glasses. The background shows blurred letter tiles.
⚕️ 𝗖𝗵𝗮𝘁𝗯𝗼𝘁𝘀 𝗳𝗼𝗿 𝗵𝗲𝗮𝗹𝘁𝗵 𝗾𝘂𝗲𝘀𝘁𝗶𝗼𝗻𝘀: 𝘄𝗵𝗲𝗿𝗲 𝗱𝗼𝗲𝘀 𝗰𝗼𝗺𝗺𝘂𝗻𝗶𝗰𝗮𝘁𝗶𝗼𝗻 𝗯𝗿𝗲𝗮𝗸 𝗱𝗼𝘄𝗻?
In a new briefing by @sciencemediacenter.de, Prof. Dr. @igurevych.bsky.social (UKP Lab, @tuda.bsky.social) highlights why the gap between benchmarks and real-world use matters: 𝗯𝗲𝗻𝗰𝗵𝗺𝗮𝗿𝗸𝘀 𝗮𝗿𝗲 𝗼𝗳𝘁𝗲𝗻 𝘀𝗶𝗺𝗽𝗹𝗶𝗳𝗶𝗲𝗱 𝗮𝗻𝗱 𝘀𝘁𝗿𝘂𝗰𝘁𝘂𝗿𝗲𝗱.
18.02.2026 09:07 · 👍 6 · 🔁 0 · 💬 1 · 📌 0
#AAAI2026 #ProcessSupervision #Reasoning #RewardModelling #ReferenceGuidedEvaluation #NLP #NLProc #LLMs
13.02.2026 11:07 · 👍 1 · 🔁 0 · 💬 0 · 📌 0
Follow the authors Md Imbesat Hassan Rizvi and @igurevych.bsky.social from the @ukplab.bsky.social at @tuda.bsky.social, and Xiaodan Zhu from the Department of Electrical and Computer Engineering, Smith Engineering, and the Ingenuity Labs Research Institute at Queen's University.
13.02.2026 11:07 · 👍 1 · 🔁 0 · 💬 1 · 📌 0