@wiair.bsky.social
WiAIR is dedicated to celebrating the remarkable contributions of female AI researchers from around the globe. Our goal is to empower early-career researchers, especially women, to pursue their passion for AI and make an impact in this exciting field.
Coming soon - don't miss it!
YouTube - www.youtube.com/@WomeninAIRe...
Spotify - open.spotify.com/show/51RJNlZ...
Apple Podcasts - podcasts.apple.com/ca/podcast/w...
Vered's work explores how to make AI more culturally aware, responsible, and reliable across language and vision models.
We're excited to feature @veredshwartz.bsky.social, Asst Prof at @cs.ubc.ca, CIFAR AI Chair at @vectorinstitute.ai, and author of lostinautomatictranslation.com, in the next @wiair.bsky.social episode.
"Trust only exists when there's risk." - Ana Marasović
Trust isn't about certainty - it's about risk acceptance.
Full conversation: youtu.be/xYb6uokKKOo
Synthetic evaluation speeds up data creation but can't replace human judgment in building benchmarks that truly test reasoning.
YouTube: www.youtube.com/@WiAIRPodcast
Spotify: open.spotify.com/episode/5JBN...
Apple: podcasts.apple.com/ca/podcast/c...
Paper: arxiv.org/pdf/2505.22830
NLP researchers often preferred LLM-generated items for their stricter adherence to annotation guidelines, yet those same datasets inflated model performance. Synthetic benchmarks look solid but miss the subtle reasoning complexity and creative diversity of human data. (5/6)
Models achieved significantly higher scores on these LLM-generated ("synthetic") benchmarks than on human ones.
Synthetic data proved easier and failed to preserve the model-ranking patterns that make real benchmarks informative. (4/6)
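For intuition, here is a minimal sketch (not from the paper; the model names and scores are invented) of what "preserving model-ranking patterns" means: compare how a set of models rank on the human-written benchmark versus the synthetic one using Spearman rank correlation.

```python
# Illustrative sketch with hypothetical scores: does a synthetic benchmark
# rank models the same way the human-written benchmark does?
from scipy.stats import spearmanr

# Accuracy of several (made-up) models on each benchmark.
human_scores     = {"model_a": 0.62, "model_b": 0.71, "model_c": 0.55, "model_d": 0.80}
synthetic_scores = {"model_a": 0.78, "model_b": 0.82, "model_c": 0.81, "model_d": 0.90}

models = sorted(human_scores)
rho, p_value = spearmanr([human_scores[m] for m in models],
                         [synthetic_scores[m] for m in models])

# High rho: the synthetic benchmark orders models like the human one.
# Low rho: the ranking information a benchmark is supposed to give is lost.
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")
```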
LLMs like GPT-4-Turbo can create moderately to highly valid benchmark questions and contrastive edits, often adhering to human-written guidelines at far lower cost.
But high validity alone doesn't ensure difficulty. (3/6)
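To make "guideline-following generation" concrete, here is a minimal sketch of prompting a model to produce a contrastive edit plus a question. The guidelines, passage, and prompt wording are our own placeholders, not the paper's actual pipeline; only the standard OpenAI client call is real.

```python
# Minimal sketch of guideline-conditioned benchmark-item generation.
# The guidelines and passage below are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

GUIDELINES = (
    "Rewrite the passage so that the highlighted negation is removed, "
    "keeping the edit minimal and the passage fluent. Then write one "
    "question whose answer changes because of the edit."
)

passage = "The museum is not open on Mondays, but it stays open late on Fridays."

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[
        {"role": "system",
         "content": "You are generating evaluation data and must follow the annotation guidelines exactly."},
        {"role": "user",
         "content": f"Guidelines:\n{GUIDELINES}\n\nPassage:\n{passage}"},
    ],
    temperature=0.7,
)

print(response.choices[0].message.content)
```

Generation like this is cheap, which is exactly why the validity-versus-difficulty gap the thread describes matters.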
The study focuses on two reasoning-over-text benchmarks:
• CondaQA - reasoning about negation
• DROP - discrete reasoning over quantities
It asks whether prompting LLMs can replace human annotators in building valid and challenging evaluation data. (2/6)
Can large language models build the very benchmarks used to evaluate them?
In "What Has Been Lost with Synthetic Evaluation", Ana Marasović (@anamarasovic.bsky.social) and collaborators ask what happens when LLMs start generating the datasets used to test their reasoning. (1/6)
AI academia and industry aren't rivals - they're partners.
As Ana Marasović says, innovation flows both ways: research trains the next generation who power real-world AI.
www.youtube.com/@WomeninAIRe...
A thought-provoking discussion on trust, transparency & reasoning.
YouTube: www.youtube.com/watch?v=xYb6...
Spotify: open.spotify.com/episode/5JBN...
Apple: podcasts.apple.com/ca/podcast/c...
#WiAIR #WomenInAI #AIResearch #ExplainableAI #TrustworthyAI #ChainOfThought #NLP #LLM
The normalized measure correlates strongly with accuracy (R² = 0.74), implying that what seemed like "reasoning faithfulness" may instead mirror task performance.
Thus, how often a model's answer changes with versus without CoT may not truly capture reasoning alignment. (5/6)
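As a tiny illustration of the R² claim (with invented numbers, not the paper's data), this is how one would check how much of the variance in an unfaithfulness score is explained by accuracy alone:

```python
# Illustrative only: how much of the variance in an unfaithfulness score
# is explained by task accuracy? The numbers below are invented.
import numpy as np

accuracy       = np.array([0.42, 0.55, 0.61, 0.70, 0.78])  # per-model accuracy
unfaithfulness = np.array([0.65, 0.52, 0.47, 0.38, 0.30])  # per-model metric

r = np.corrcoef(accuracy, unfaithfulness)[0, 1]
print(f"R^2 = {r**2:.2f}")  # close to 1.0 => the metric largely mirrors accuracy
```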
The team introduces a normalized unfaithfulness metric to correct for answer-choice bias - small models' habit of favoring particular options regardless of the reasoning.
After normalization, smaller models' apparent unfaithfulness drops, flattening the V-shape. (4/6)
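For intuition only, here is a sketch of the kind of chance correction the thread describes: raw unfaithfulness counted as how often the answer stays the same with and without CoT, and a normalized variant that discounts the agreement expected from the model's answer-choice bias alone (a kappa-style correction). This is our illustration of the idea; the paper's exact formulation may differ.

```python
# Sketch of a chance-corrected "unfaithfulness" score. Illustrative only;
# the paper's exact normalization may differ.
from collections import Counter

def raw_unfaithfulness(ans_with_cot, ans_without_cot):
    """Fraction of items where the answer is identical with and without CoT
    (Lanham-style: if removing the CoT rarely changes the answer, the CoT
    looks post hoc, i.e. unfaithful)."""
    same = sum(a == b for a, b in zip(ans_with_cot, ans_without_cot))
    return same / len(ans_with_cot)

def chance_agreement(ans_with_cot, ans_without_cot):
    """Agreement expected if the model just picked options according to its
    marginal answer distribution, regardless of the input."""
    n = len(ans_with_cot)
    p_with, p_without = Counter(ans_with_cot), Counter(ans_without_cot)
    options = set(ans_with_cot) | set(ans_without_cot)
    return sum((p_with[o] / n) * (p_without[o] / n) for o in options)

def normalized_unfaithfulness(ans_with_cot, ans_without_cot):
    """Kappa-style correction: agreement in excess of what answer-choice
    bias alone would produce."""
    raw = raw_unfaithfulness(ans_with_cot, ans_without_cot)
    chance = chance_agreement(ans_with_cot, ans_without_cot)
    return (raw - chance) / (1 - chance) if chance < 1 else 0.0

# A biased model that almost always answers "A" looks very unfaithful on the
# raw score, but much less so once its bias is accounted for.
with_cot    = ["A", "A", "A", "B", "A", "A", "A", "A"]
without_cot = ["A", "A", "A", "A", "A", "A", "C", "A"]
print(raw_unfaithfulness(with_cot, without_cot))        # high raw agreement
print(normalized_unfaithfulness(with_cot, without_cot)) # lower after correction
```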
Under the unnormalized Lanham metric, some models (especially Pythia-DPO) show a V-shaped trend: mid-sized models look most "faithful", while both smaller and larger ones deviate.
This reproduces the inverse-scaling behavior previously reported for proprietary models. (3/6)
Building on Lanham et al. (2023), the authors test whether reported "faithfulness" patterns hold beyond one model line. They replicate and extend the scaling experiments across the Llama-2, FLAN-T5/UL2, and Pythia-DPO model families. (2/6)
Do large language models really reason the way their chains of thought suggest?
This week on #WiAIRpodcast, we talk with Ana Marasović (@anamarasovic.bsky.social) about her paper "Chain-of-Thought Unfaithfulness as Disguised Accuracy." (1/6)
Paper: arxiv.org/pdf/2402.14897
AI Safety Like Aviation: Too Ambitious or Absolutely Necessary?
Can AI ever be regulated as rigorously as aviation?
Ana Marasović shares her vision for the future of AI governance - where safety principles and regulation become the default, not an afterthought.
www.youtube.com/@WomeninAIRe...
Tune in to hear Ana and her co-authors.
YouTube: www.youtube.com/@WomeninAIRe...
Spotify: open.spotify.com/show/51RJNlZ...
Apple: podcasts.apple.com/ca/podcast/w...
Paper: arxiv.org/abs/2407.03545 (8/8)
#ExplainableAI #TrustworthyAI #HumanAICollaboration #NLP #WomenInAI #WiAIR
Takeaway: True explainability isn't about opening the black box - it's about building systems that know when to ask for help and let humans lead when it matters most. (7/8)
Finetuning a stronger model (Flan-T5-3B) boosted performance by 22-24 F1 points, reminding us that reliable collaboration starts with capable models. (6/8)
The paper proposes a smarter path forward: let models defer to humans when uncertain, rather than explaining every prediction - boosting both efficiency and trust. (5/8)
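As a toy illustration of the defer-to-human idea (not the paper's actual setup), here is a sketch in which a classifier abstains and routes an example to a human reviewer whenever its confidence falls below a threshold; the labels and threshold are placeholders.

```python
# Toy sketch of selective prediction with deferral to a human.
# Illustrative only; labels, threshold, and probabilities are placeholders.
import numpy as np

CONFIDENCE_THRESHOLD = 0.75  # would be tuned on a validation set in practice

def predict_or_defer(probs: np.ndarray) -> str:
    """Return the predicted label if the model is confident enough,
    otherwise defer the decision to a human reviewer."""
    labels = ["entailment", "contradiction", "neutral"]
    if probs.max() < CONFIDENCE_THRESHOLD:
        return "DEFER_TO_HUMAN"
    return labels[int(probs.argmax())]

# Example: one confident prediction, one deferred to the human.
print(predict_or_defer(np.array([0.91, 0.05, 0.04])))  # -> "entailment"
print(predict_or_defer(np.array([0.45, 0.40, 0.15])))  # -> "DEFER_TO_HUMAN"
```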
Even when models were correct, human-AI teams often underperformed compared to AI alone. People hesitated or over-relied, showing that explanations don't always improve judgment. (4/8)
From 50+ explainability datasets, only 4 (ContractNLI, SciFact-Open, EvidenceInference v2, and ILDC) were suitable for studying explanation utility in realistic contexts. Most datasets miss how humans actually use AI explanations. (3/8)
Ana and her co-authors dive deep into "On Evaluating Explanation Utility for Human-AI Decision Making in NLP" (Findings of #EMNLP2024), asking whether explanations truly help humans make better decisions or just make us feel more confident. (2/8)
How do we really know when and how much to trust large language models?
In this week's #WiAIRpodcast, we talk with Ana Marasović (Asst Prof at the University of Utah; previously at Allen AI and UWNLP) about explainability, trust, and human-AI collaboration. (1/8)
We dive into how to make AI systems that truly earn our trust - not just appear trustworthy.
Full episode now on YouTube: youtu.be/xYb6uokKKOo
Also on Spotify: open.spotify.com/show/51RJNlZ...
Apple: podcasts.apple.com/ca/podcast/w...
Key takeaways from our conversation:
• Real AI research is messy, nonlinear, and full of surprises.
• Trust in AI comes in two forms: intrinsic (how it reasons) and extrinsic (proven reliability).
• Sometimes, human-AI collaboration makes things... worse.
New Women in AI Research episode out now!
This time, we sit down with @anamarasovic.bsky.social to unpack some of the toughest questions in AI explainability and trust.
Watch here: youtu.be/xYb6uokKKOo
New #WiAIR episode coming soon!
We sat down with Ana Marasović to talk about the uncomfortable truths behind AI trust.
When can we really trust AI explanations?
Watch the trailer: youtu.be/GBghj6S6cic
Then subscribe on YouTube to catch the full episode when it drops.