Early struggles and rejection are normal in academia. Your value as a scientist is measured not by how quickly things happen, but by your persistence and passion. Keep going!
Stay tuned: www.youtube.com/@WomeninAIRe...
#Science #Research #AcademicLife #WiAIRpodcast
11.08.2025 15:02
YouTube video by Women in AI Research WiAIR
LLM Hallucinations and Machine Unlearning, with Dr. Abhilasha Ravichander
Don't miss our conversation with Dr. Ravichander on the WiAIR Podcast, where we explore the paper's findings and their implications for trustworthy AI.
YouTube: www.youtube.com/watch?v=QPp0...
Spotify: open.spotify.com/episode/7lGC...
Apple: podcasts.apple.com/ca/podcast/l...
08.08.2025 16:50
Model performance: even the best models, such as GPT-4, still hallucinate, with error rates of up to 86% in certain tasks.
08.08.2025 16:50
Real-world impact: hallucinations can cause serious problems in applications such as content generation, scientific discovery, and decision-making, where accuracy is crucial.
08.08.2025 16:50
Key takeaways from HALoGEN:
HALoGEN benchmark: a comprehensive framework for evaluating hallucinations across 9 diverse domains, from scientific citations to programming.
Types of hallucinations: Type A (correct training knowledge recalled incorrectly), Type B (incorrect knowledge present in the training data itself), and Type C (outright fabrication). A toy scoring sketch follows below.
08.08.2025 16:50
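For the curious, here is a minimal sketch of how a HALoGEN-style hallucination score can be computed. This is our illustration, not the authors' code: `decompose` and `verify_against_source` are hypothetical stand-ins for the paper's domain-specific decomposers and high-quality knowledge sources.

```python
# Toy HALoGEN-style scoring (our sketch, not the authors' code).

def decompose(generation: str) -> list[str]:
    """Split a model generation into atomic facts (toy: one per sentence)."""
    return [s.strip() for s in generation.split(".") if s.strip()]

def verify_against_source(fact: str, knowledge: set[str]) -> bool:
    """Check one atomic fact against a trusted knowledge source (toy lookup)."""
    return fact in knowledge

def hallucination_score(generation: str, knowledge: set[str]) -> float:
    """Fraction of atomic facts NOT supported by the knowledge source."""
    facts = decompose(generation)
    if not facts:
        return 0.0
    unsupported = sum(not verify_against_source(f, knowledge) for f in facts)
    return unsupported / len(facts)

knowledge = {"Water boils at 100 C at sea level"}
text = "Water boils at 100 C at sea level. The moon is made of cheese"
print(hallucination_score(text, knowledge))  # -> 0.5 (one of two facts unsupported)
```

On a metric like this, an 86% rate in a domain means 86% of the atomic facts generated there failed verification.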
What are hallucinations in LLMs? They occur when a model generates statements that conflict with established world knowledge or with the given context, producing false or fabricated information.
08.08.2025 16:49
This paper tackles hallucinations in large language models (LLMs): cases where models produce misleading or inaccurate information. It's a critical problem in AI research.
08.08.2025 16:49
We're thrilled to congratulate Dr. Abhilasha Ravichander (@lasha.bsky.social) and her team on receiving the Outstanding Paper Award at #acl2025 for their work "HALoGEN: Fantastic LLM Hallucinations and Where to Find Them"!
#ACL #LLMs #Hallucination #WiAIR #WomenInAI
08.08.2025 16:49
New Women in AI Research episode out now!
We speak with Dr. Abhilasha Ravichander about:
- LLM hallucination types
- The WildHallucinations and HALoGEN benchmarks
- Machine unlearning and memorization probes
- Responsible AI and transparent systems
Listen here: youtu.be/QPp0cJNBbL8
#LLMHallucinations
06.08.2025 15:30
Can LLMs know when to be factual vs. creative?
In this short clip, Abhilasha Ravichander explores one of the hardest challenges in LLM behavior: adaptability.
The full #WiAIRpodcast episode drops in 2 days!
04.08.2025 15:30
Women in AI Research WiAIR
Women in AI Research (WiAIR) is a podcast dedicated to celebrating the remarkable contributions of female AI researchers from around the globe. Our mission is to challenge the prevailing perception th...
Episode out Aug 6.
Tune in to #WiAIRpodcast for critical conversations shaping the future of AI research:
YouTube: youtube.com/@WomeninAIRe...
Spotify: open.spotify.com/show/51RJNlZ...
Apple: podcasts.apple.com/ca/podcast/w...
4/
01.08.2025 14:35
We also explore:
- Tackling model memorization
- Pushing for data transparency
- Building tools for machine unlearning
- Advice on navigating academic transitions
3/
01.08.2025 14:35
HALoGEN, a benchmark for detecting hallucinations in LLMs, just won the Outstanding Paper Award @aclmeeting.bsky.social in Vienna.
We unpack the challenges of evaluating hallucinations and how factuality benchmarks can guide better LLM assessment.
2/
#ACL2025NLP #ACL2025
01.08.2025 14:35
Speaker announcement: the new episode of the Women in AI Research (WiAIR) podcast is out on August 6th. Our guest is Dr. Abhilasha Ravichander, a postdoc at the University of Washington and Assistant Professor at the Max Planck Institute for Software Systems.
New Women in AI Research #WiAIR episode coming Aug 6!
We talk to @lasha.bsky.social about LLM hallucination, her award-winning HALoGEN benchmark, and how we can better evaluate hallucinations in language models.
What's inside:
1/
01.08.2025 14:35
Our guest Dieuwke Hupkes questions the lack of accountability in academic peer review.
Why do bad reviewers get off the hook? Could consequences, such as limiting their ability to submit, actually improve the system?
What's your take on this?
30.07.2025 15:02
YouTube video by Women in AI Research WiAIR
Generalization in AI, with Dr. Dieuwke Hupkes
Hear Dieuwke Hupkes on why scaling laws differ for knowledge and reasoning in LLMs.
YouTube: www.youtube.com/watch?v=CuTW...
Apple: podcasts.apple.com/ca/podcast/g...
Spotify: open.spotify.com/show/51RJNlZ...
Paper: arxiv.org/abs/2503.10061
#WiAIR #AIResearch #Reasoning #WomenInAI
28.07.2025 16:10
Why it matters:
Compute-optimal training depends on the skills you care about.
Careful data-mix calibration and validation design are essential to train LLMs that perform well across both knowledge and reasoning. (7/8)
28.07.2025 16:08
3. Validation Sensitivity
The validation set you choose matters.
At small compute scales, the optimal parameter count can shift by ~50% depending on validation design. Even at large scales, >10% variation remains. (6/8)
28.07.2025 16:08
2. Beyond the Data Mix
Even after balancing data proportions, knowledge and code diverge.
Knowledge keeps demanding more parameters, while code scales better with more data. (5/8)
28.07.2025 16:08
1. Skill-Dependent Optima
- Knowledge QA is capacity-hungry (needs more parameters)
- Code is data-hungry (benefits more from tokens)
No single curve captures scaling behavior across skills; a toy illustration follows below. (4/8)
28.07.2025 16:07
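To make "skill-dependent optima" concrete, here is a toy illustration (ours, not the paper's setup): assume a standard two-term scaling law L(N, D) = A/N^alpha + B/D^beta with compute C ≈ 6ND, and give each skill its own hypothetical exponents. The optimal model size then differs sharply by skill.

```python
# Toy illustration of skill-dependent compute-optimal model size.
# Loss form and all constants are assumptions, not values from the paper:
# L(N, D) = A / N**alpha + B / D**beta, with compute budget C ~ 6 * N * D.
import numpy as np

def optimal_model_size(C, A, B, alpha, beta):
    """Grid-search the parameter count N minimizing loss at fixed compute C."""
    N = np.logspace(7, 12, 4000)   # candidate model sizes (parameters)
    D = C / (6.0 * N)              # training tokens implied by the budget
    loss = A / N**alpha + B / D**beta
    return N[np.argmin(loss)]

C = 1e21  # fixed compute budget (FLOPs)
# Hypothetical exponents chosen so the knowledge-like skill comes out
# capacity-hungry and the code-like skill data-hungry, as in the thread.
n_knowledge = optimal_model_size(C, A=400, B=400, alpha=0.28, beta=0.37)
n_code      = optimal_model_size(C, A=400, B=400, alpha=0.37, beta=0.28)
print(f"knowledge-optimal N ~ {n_knowledge:.1e} params")  # ~2e11
print(f"code-optimal      N ~ {n_code:.1e} params")       # ~8e8
```

Same budget, very different optimal allocations: that is the sense in which a single aggregate scaling curve can mislead.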
The study:
Experiments across 9 compute levels and 19 datasets, comparing two skill categories:
- Knowledge-based QA
- Code generation (as a proxy for reasoning)
The findings reveal fundamental differences in scaling behavior. (3/8)
28.07.2025 16:07
The problem:
Scaling laws guide LLM training by trading off model size and data under a fixed compute budget. But compute-optimal scaling is usually measured via aggregate validation loss.
What happens when we zoom in on specific skills? (The standard formulation is sketched below.) (2/8)
28.07.2025 16:06
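For reference, this is the Chinchilla-style formulation behind "compute-optimal", a common assumption in scaling-law work rather than something taken from this paper:

```latex
% Chinchilla-style scaling law (standard assumption, not from the paper):
% loss as a function of parameters N and training tokens D under budget C.
\[
  L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
  \qquad C \approx 6\,N D .
\]
% Minimizing L subject to the budget gives the compute-optimal allocation
\[
  N^{*}(C) \propto C^{\beta/(\alpha+\beta)},
  \qquad
  D^{*}(C) \propto C^{\alpha/(\alpha+\beta)} .
\]
% Fitting (alpha, beta) on aggregate validation loss yields one optimum;
% per-skill validation sets can yield different exponents, hence different optima.
```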
Trust in AI starts with generalization.
Dieuwke Hupkes (#MetaAI) explains why it's critical to know when you can count on your model, and when to be cautious.
Full episode:
YouTube: youtu.be/CuTWIW1JcsA?...
Spotify: open.spotify.com/episode/0KSR...
#LLMs #TrustworthyAI
25.07.2025 16:03
From Form(s) to Meaning: Probing the Semantic Depths of Language Models Using Multisense Consistency
Abstract. The staggering pace with which the capabilities of large language models (LLMs) are increasing, as measured by a range of commonly used natural language understanding (NLU) benchmarks, raise...
In our discussion, Dieuwke Hupkes reflects on model behavior, philosophical roots, and the importance of cross-form consistency in language evaluation.
YouTube: www.youtube.com/watch?v=CuTW...
Spotify: open.spotify.com/show/51RJNlZ...
Apple Podcasts: podcasts.apple.com/ca/podcast/w...
(8/8)
23.07.2025 16:05
Multisense Consistency does not focus on correctness alone. It probes semantic stability: whether a model preserves meaning across variation.
This reframes evaluation toward robustness, especially in multilingual and generalization settings; a toy metric is sketched below. (7/8)
23.07.2025 16:04
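A toy version of the idea (our sketch, not the paper's code; `ask_model` is a hypothetical stand-in for any LLM call): score agreement between answers to semantically equivalent forms of one question, without looking at correctness.

```python
# Toy multisense-consistency check: ask the same question in several "forms"
# (paraphrases / translations) and measure whether the answers agree.
from itertools import combinations

def consistency(answers: list[str]) -> float:
    """Fraction of answer pairs that agree across semantically equivalent forms."""
    norm = [a.strip().lower() for a in answers]
    pairs = list(combinations(norm, 2))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)

def multisense_score(ask_model, forms: list[str]) -> float:
    """Query the model once per form and score pairwise answer agreement."""
    return consistency([ask_model(f) for f in forms])

forms = [
    "What is the capital of Switzerland?",
    "Which city serves as Switzerland's capital?",
    "Quelle est la capitale de la Suisse ?",  # same question, different form
]
print(multisense_score(lambda q: "Bern", forms))  # -> 1.0 (perfectly consistent)
```

Note that a model that is consistently wrong still scores 1.0 here, which is exactly the thread's point: consistency and correctness are measured separately.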
They identify two sources of inconsistency:
- Task misinterpretation due to form change
- Failure to apply consistent logic across inputs
These issues occur in both simple and complex tasks, despite semantic equivalence. (6/8)
23.07.2025 16:03
Findings include:
- Frequent inconsistencies across reworded or translated inputs
- Lower consistency in non-English (especially low-resource) settings
- Incorrect answers were sometimes more stable than correct ones
These patterns challenge benchmark-based assumptions. (5/8)
23.07.2025 16:03