📄 Read the full statements here: (in German)
www.sciencemediacenter.de/angebote/gpt...
(4/4)
#AI #Benchmarks #NLP #AIresearch #MachineLearning
The briefing also features perspectives from:
👤 Prof. Dr. Chris Biemann, @uni-hamburg.de
👤 Dr. @paul-rottger.bsky.social, @milanlp.bsky.social, Università Bocconi
All experts note that strong benchmark scores don’t always mean strong real-world performance.
(3/🧵)
Following the release of #GPT-5, benchmarks are once again in the spotlight. In her statement, Prof. Gurevych notes that while they allow comparisons between models, they can give only a rough indication of performance and should be interpreted with caution.
(2/🧵)
Logo of the Science Media Center Germany. The design features a golden polygonal network pattern made up of connected triangles and dots on the left, resembling a stylized abstract map or molecular structure. To the right, the text reads “science media center” in black lowercase letters and “germany” in gold lowercase letters.
🔍 𝗛𝗼𝘄 𝗿𝗲𝗹𝗶𝗮𝗯𝗹𝗲 𝗮𝗿𝗲 𝗔𝗜 𝗯𝗲𝗻𝗰𝗵𝗺𝗮𝗿𝗸𝘀?
Prof. Dr. @igurevych.bsky.social (@athenecenter.bsky.social Distinguished Professor, @tuda.bsky.social / @ukplab.bsky.social) contributed to the latest @sciencemediacenter.de briefing on how benchmarks can — and cannot — measure AI model performance
(1/🧵)
Many thanks to our colleagues for their contributions and to the organizers for a great conference — we’re already looking forward to the next #ACL!
📸 First photo by @andreasgeiger.bsky.social – thank you!
#ACL2025 #NLProc
Highlights included inspiring discussions during poster sessions, reconnecting with collaborators, and catching up with UKP Lab alumni from across the globe 🌍.
14.08.2025 09:02
A group of twelve people sitting around a long outdoor dining table in a cozy courtyard restaurant with stone walls and greenery. They are enjoying food and drinks, smiling, and looking up toward the camera. Plates of food, wine glasses, and bottles are on the table, which is covered with a red-and-white checkered tablecloth. The atmosphere is warm and social, with other diners visible in the background.
𝗧𝗵𝗿𝗼𝘄𝗯𝗮𝗰𝗸 𝘁𝗼 #ACL2025 𝗶𝗻 𝗯𝗲𝗮𝘂𝘁𝗶𝗳𝘂𝗹 𝗩𝗶𝗲𝗻𝗻𝗮 🍰✨
The UKP Lab had a fantastic time at this year’s #ACL. Our team presented a total of 𝟭𝟬 𝗽𝗮𝗽𝗲𝗿𝘀 in the Main and Findings tracks and 𝟮 𝗧𝗔𝗖𝗟 journal papers, sharing our latest research with the global #NLP community.
🔹 How are big data, AI, and remote sensing changing archaeological research and fieldwork?
🔹 How can data be archived sustainably when old digital storage media reach their limits?
🔹 What questions about data sovereignty and ownership arise in an international context?
(3/🧵)
Digital methods open up entirely new possibilities for archaeology, from locating and documenting finds to rescuing endangered cultural heritage. At the same time, they confront the discipline with complex challenges.
(2/🧵)
We are happy to announce that our colleague Jingcheng Niu has received an Outstanding Paper award at #ACL2025! Congratulations 🎉🎉🎉
📄 Paper: arxiv.org/abs/2505.09338
UKP is at #ACL2025 in Vienna!
Meet @igurevych.bsky.social, our postdocs, and our PhD students, and be sure to check out their posters 🎉
More information about our ACL and TACL papers below ⬇️
Consider following the authors @jingyng.bsky.social (Quality and Usability Lab, TU Berlin), @maxglockner.bsky.social (@ukplab.bsky.social), Anderson Rocha (RECOD.ai, University of Campinas) and @igurevych.bsky.social (@ukplab.bsky.social) for an exchange of ideas.
See you in Vienna! #ACL2025
More surprising results are in the paper!
🖥️ Project: jingyng.github.io/tacl2025-ood...
📄 Paper: direct.mit.edu/tacl/article...
💻 Code: github.com/UKPLab/tacl2...
🧵 Key findings:
✅ Fine-tuning on just 384 samples matches 50K samples in OOD performance
❌ Data selection methods matter little compared to data size and source
✅ Better label prediction goes hand in hand with better explanations
OOD evaluation pipeline for self-rationalization, and the categories of OOD datasets considered in the paper.
Can LLMs generate explanations for datasets without such annotations? 🧠
We tested model explanations across 19 datasets (NLI, fact-checking, hallucination detection) to see how well they self-rationalize on completely unseen data.
#LLMs #Explainability #ACL2025 #TACL
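The evaluation described above could be sketched roughly as follows. This is a minimal illustration, not the paper's actual pipeline: `generate`, the prompt template, and the NLI label set are assumptions for the sake of the example.

```python
# Minimal sketch of an OOD self-rationalization check: ask a model for a
# label plus a free-text explanation on unseen data, then score the labels.
# `generate(prompt)` is a hypothetical wrapper around any instruction-tuned LLM.

PROMPT = (
    "Premise: {premise}\nHypothesis: {hypothesis}\n"
    "Answer with a label (entailment/neutral/contradiction) "
    "and a one-sentence explanation, separated by ' | '."
)

def evaluate(examples, generate):
    """Return label accuracy and the collected explanations.

    Explanations are gathered for separate quality assessment; here we
    only compute accuracy of the predicted labels.
    """
    correct, explanations = 0, []
    for ex in examples:
        reply = generate(PROMPT.format(**ex))
        label, _, explanation = reply.partition(" | ")
        explanations.append(explanation.strip())
        correct += label.strip().lower() == ex["label"]
    return correct / len(examples), explanations
```

Swapping in fact-checking or hallucination-detection datasets only changes the prompt template and label set, which is what makes a 19-dataset comparison tractable.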
📄 Paper: direct.mit.edu/tacl/article...
And consider following the authors @ccliu.bsky.social, @igurevych.bsky.social (@ukplab.bsky.social), and Anna Korhonen, if you are interested in more information or an exchange of ideas.
See you this week in Vienna! #ACL2025 #ResponsibleAI
A taxonomy of cultural NLP
🌍🌎🌏 Culture is more than language.
We organize cultural NLP with a new taxonomy—spanning the ideational, linguistic, and social dimensions of culture. We analyze 127+ papers to uncover the research gaps. Let’s build language technologies that work for everyone. 💬🗺️🎭🤝 (1/🧵)
#CulturalNLP #ACL2025
📄 Paper: arxiv.org/abs/2505.05949
💾 Data: huggingface.co/datasets/mgl...
💻 Code: github.com/amazon-scien...
#ACL2025 #NLProc #EvidenceBasedAI #LLM
(6/6)
The original post was published on Twitter/X by Markus Dreyer:
x.com/markusdr/sta...
Work by: @maxglockner.bsky.social (@ukplab.bsky.social, @hessianai.bsky.social), Xiang Jiang (Amazon AGI), @leonardofribeiro.bsky.social (Amazon AGI), @igurevych.bsky.social (@ukplab.bsky.social), and Markus Dreyer (Amazon AGI).
(5/🧵)
Our experiments with multiple LLMs show significant gaps in evidence-based reasoning. NeoQA exposes limitations in multi-hop reasoning and shortcut reliance—crucial insights for building #trustworthyAI.
(4/🧵)
𝗪𝗵𝗮𝘁'𝘀 𝘂𝗻𝗶𝗾𝘂𝗲?
NeoQA includes answerable, unanswerable, and misleading evidence scenarios to truly challenge LLMs. It reveals where models rely on shortcuts and struggle to detect mismatches between questions and evidence.
(3/🧵)
𝗪𝗵𝘆 𝗡𝗲𝗼𝗤𝗔?
Traditional RAG datasets grow stale as LLMs internalize real-world events during pretraining. #NeoQA solves this with fictional news timelines, preventing models from shortcut reasoning with parametric knowledge.
(2/🧵)
Screenshot of the first page of the research paper titled "NeoQA: Evidence-based Question Answering with Generated News Events", authored by Max Glockner, Xiang Jiang, Leonardo F. R. Ribeiro, Iryna Gurevych, and Markus Dreyer. Affiliations include UKP Lab at TU Darmstadt, the Hessian Center for AI (hessian.AI), and Amazon AGI. The abstract explains the motivation behind NeoQA: traditional benchmarks for evaluating Retrieval-Augmented Generation (RAG) in large language models become outdated quickly, as models integrate new knowledge during pretraining. NeoQA introduces a benchmark using fictional news events and entities to ensure LLMs rely solely on retrieved evidence rather than memorized information. A figure on the right shows a comparison: Left side: NeoQA setting, where the LLM answers a fictional question based only on provided documents (e.g., “What did Selvia Renek criticize?”). Right side: Real-world RAG task, where the LLM may rely on memorized knowledge (e.g., “How much was the 2022 Twitter deal worth?”). ArXiv identifier and submission date are on the left margin: arXiv:2505.05949v1
𝗡𝗲𝗼𝗤𝗔: A benchmark for evidence-based question answering with generated news events. Unlike traditional datasets, NeoQA constructs fictional timelines, ensuring LLMs must reason exclusively over retrieved evidence.
(1/🧵)
Also consider following the authors Tianyu Yang (@ukplab.bsky.social, @hessianai.bsky.social), Xiaodan Zhu (Queen's University Canada), and @igurevych.bsky.social (@ukplab.bsky.social).
(5/5)
#NLProc #ACL2025 #TextAnonymization #LLMSafety #AIPrivacy
📄 Paper: arxiv.org/abs/2407.11770
💻 Code: github.com/UKPLab/acl20...
🔗 Project: ukplab.github.io/acl2025-rupta/
(4/🧵)
⚙️ Supports 𝗰𝘂𝘀𝘁𝗼𝗺𝗶𝘇𝗮𝗯𝗹𝗲 𝗽𝗿𝗶𝘃𝗮𝗰𝘆-𝘂𝘁𝗶𝗹𝗶𝘁𝘆 𝘁𝗿𝗮𝗱𝗲-𝗼𝗳𝗳𝘀 and distillation into lightweight models for real-time use.
📊 𝗢𝘂𝘁𝗽𝗲𝗿𝗳𝗼𝗿𝗺𝘀 𝗽𝗿𝗶𝗼𝗿 𝗺𝗲𝘁𝗵𝗼𝗱𝘀, achieving lower re-identification success rates and higher downstream accuracy on DB-bio and PersonalReddit datasets.
(3/🧵)
✅ 𝗥𝗨𝗣𝗧𝗔 uses LLMs to:
→ 𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗲 𝗽𝗿𝗶𝘃𝗮𝗰𝘆 𝗿𝗶𝘀𝗸 via simulated re-identification attacks (privacy evaluator).
→ 𝗠𝗲𝗮𝘀𝘂𝗿𝗲 𝘂𝘁𝗶𝗹𝗶𝘁𝘆 𝗿𝗲𝘁𝗲𝗻𝘁𝗶𝗼𝗻 for tasks like classification (utility evaluator).
→ 𝗜𝘁𝗲𝗿𝗮𝘁𝗶𝘃𝗲𝗹𝘆 𝗼𝗽𝘁𝗶𝗺𝗶𝘇𝗲 𝘁𝗲𝘅𝘁 via lexicographic optimization: prioritize privacy, then maximize utility.
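The loop described above could look roughly like this. It is a simplified sketch under stated assumptions: `rewrite`, `privacy_risk`, and `utility_score` stand in for the LLM-based rewriter, privacy evaluator, and utility evaluator; the threshold and round count are illustrative.

```python
# Hypothetical sketch of a lexicographic privacy-then-utility loop in the
# spirit of RUPTA. The three callables are stand-ins for LLM calls:
#   privacy_risk(text)  -> simulated re-identification risk in [0, 1]
#   utility_score(text) -> downstream-task utility (higher is better)
#   rewrite(text, risk) -> a safer rewrite proposed by the LLM

def anonymize(text, rewrite, privacy_risk, utility_score,
              risk_threshold=0.1, max_rounds=5):
    """Iteratively rewrite `text`: first satisfy the privacy constraint,
    then, among constraint-satisfying candidates, keep the best utility."""
    best, best_utility = None, float("-inf")
    candidate = text
    for _ in range(max_rounds):
        risk = privacy_risk(candidate)       # simulated re-identification attack
        if risk <= risk_threshold:           # privacy constraint satisfied
            u = utility_score(candidate)     # utility for the downstream task
            if u > best_utility:
                best, best_utility = candidate, u
        candidate = rewrite(candidate, risk) # ask for a safer rewrite
    # fall back to the last candidate if no rewrite ever met the constraint
    return best if best is not None else candidate
```

The lexicographic ordering is what makes the trade-off customizable: tightening `risk_threshold` trades utility for privacy without changing the loop.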
(2/🧵)
Diagram showing three versions of a biographical text about French tennis player Jacques "Toto" Brugnon, comparing different abstraction levels in textual rewriting. Original Document (top box with red border): Describes Brugnon’s birth and death dates, nationality (French), and achievements as a tennis player in the 1920s–1930s. Highlights specific terms like "tennis player," "Paris," and names of Grand Slam championships. Adversarial Feedback (middle box with green border): A generalization of the original text using abstract placeholders, such as “a person,” “a certain century,” “a certain region,” “sport,” and “birth city.” This version removes all specific named entities and replaces them with vague references. RUPTA Version (bottom box with green border and gear icon): A refined rewrite that reintroduces meaningful specificity without exact reproduction. For example, it uses “tennis athlete,” “a nation,” “during a historical period,” and “various international tennis competitions,” while maintaining more natural language than the adversarial version. Each section is marked with distinct icons: a locked document for the original, two robot heads for the adversarial feedback, and a gear with a robot for the RUPTA approach. Yellow highlights show key phrases retained or transformed across versions.
🔍 𝗛𝗼𝘄 𝗰𝗮𝗻 𝘄𝗲 𝗮𝗻𝗼𝗻𝘆𝗺𝗶𝘇𝗲 𝘁𝗲𝘅𝘁 𝘀𝗼 𝗟𝗟𝗠𝘀 𝗰𝗮𝗻’𝘁 𝗿𝗲-𝗶𝗱𝗲𝗻𝘁𝗶𝗳𝘆 𝘀𝗲𝗻𝘀𝗶𝘁𝗶𝘃𝗲 𝗶𝗻𝗳𝗼𝗿𝗺𝗮𝘁𝗶𝗼𝗻 — 𝘄𝗵𝗶𝗹𝗲 𝗽𝗿𝗲𝘀𝗲𝗿𝘃𝗶𝗻𝗴 𝘂𝘁𝗶𝗹𝗶𝘁𝘆 𝗳𝗼𝗿 𝗱𝗼𝘄𝗻𝘀𝘁𝗿𝗲𝗮𝗺 𝘁𝗮𝘀𝗸𝘀?
🚀 𝗪𝗲 𝗶𝗻𝘁𝗿𝗼𝗱𝘂𝗰𝗲 𝗥𝗨𝗣𝗧𝗔: 𝗥𝗼𝗯𝘂𝘀𝘁 𝗨𝘁𝗶𝗹𝗶𝘁𝘆-𝗣𝗿𝗲𝘀𝗲𝗿𝘃𝗶𝗻𝗴 𝗧𝗲𝘅𝘁 𝗔𝗻𝗼𝗻𝘆𝗺𝗶𝘇𝗮𝘁𝗶𝗼𝗻.
(1/🧵)
Also consider following the authors
@aniket-pramanick.bsky.social (@ukplab.bsky.social), @yufanghou.bsky.social (IT:U- Interdisciplinary Transformation University Austria, IBM Research), Saif M. Mohammad (National Research Council Canada), and @igurevych.bsky.social (@ukplab.bsky.social).
(3/3)
🎁 Tools, data, and analysis await you:
📄 Paper: arxiv.org/abs/2409.19505
🌐Project: ukplab.github.io/acl25-nlp-co...
💻 Code: github.com/UKPLab/acl25...
💾 Data: tudatalib.ulb.tu-darmstadt.de/handle/tudat...
🗺️ See you at #ACL2025 in Vienna
(2/🧵)
#NLProc #ACL2025 #AI4Science