
Women in AI Research - WiAIR

@wiair.bsky.social

WiAIR is dedicated to celebrating the remarkable contributions of female AI researchers from around the globe. Our goal is to empower early career researchers, especially women, to pursue their passion for AI and make an impact in this exciting field.

66 Followers  |  0 Following  |  287 Posts  |  Joined: 19.02.2025

Latest posts by wiair.bsky.social on Bluesky

Video thumbnail

Early struggles and rejection are normal in academia. Your value as a scientist is not about how quickly things happen; it is about your persistence and passion. Keep going 💪✨

Stay tuned: www.youtube.com/@WomeninAIRe...

#Science #Research #AcademicLife #WiAIRpodcast

11.08.2025 15:02 — 👍 0    🔁 0    💬 0    📌 0
Preview
HALoGEN: Fantastic LLM Hallucinations and Where to Find Them Despite their impressive ability to generate high-quality and fluent text, generative large language models (LLMs) also produce hallucinations: statements that are misaligned with established world kn...

📄 Read the full paper here: arxiv.org/abs/2501.082...

08.08.2025 16:50 — 👍 0    🔁 0    💬 0    📌 0
LLM Hallucinations and Machine Unlearning, with Dr. Abhilasha Ravichander
YouTube video by Women in AI Research WiAIR

🎧 Don't miss our conversation with Dr. Ravichander on the WiAIR Podcast, where we explore the paper's findings and its implications for trustworthy AI.
🎬 YouTube: www.youtube.com/watch?v=QPp0...
🎙️ Spotify: open.spotify.com/episode/7lGC...
🍎 Apple: podcasts.apple.com/ca/podcast/l...

08.08.2025 16:50 — 👍 1    🔁 0    💬 1    📌 0

⚡ Model Performance: Even the best models, including GPT-4, still hallucinate, with up to 86% of generated atomic facts hallucinated in some domains.

08.08.2025 16:50 — 👍 0    🔁 0    💬 1    📌 0

🔍 Real-World Impact: Hallucinations can cause serious issues in applications like content generation, scientific discovery, and decision-making where accuracy is crucial.

08.08.2025 16:50 — 👍 0    🔁 0    💬 1    📌 0

Key Takeaways from HALoGEN 📜:
🧠 HALoGEN Benchmark: A comprehensive framework to evaluate hallucinations across 9 diverse domains, from scientific citations to programming.
💡 Types of Hallucinations: Type A (incorrect recollection of training data), Type B (incorrect knowledge in the training data), Type C (fabrication)

08.08.2025 16:50 — 👍 0    🔁 0    💬 1    📌 0
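To make the benchmark's core recipe concrete, here is a minimal sketch of the general idea HALoGEN automates: break a generation into atomic facts and verify each one against a trusted knowledge source. The decomposer, verifier, and toy knowledge base below are hypothetical placeholders, not the paper's actual components.

```python
# Minimal sketch of automatic hallucination scoring in the spirit of HALoGEN:
# split a model generation into atomic facts, verify each against a trusted
# source, and report the fraction that cannot be supported.
# NOTE: the decomposer, verifier, and toy knowledge base are hypothetical
# placeholders, not HALoGEN's actual components.
from typing import Callable, List


def hallucination_rate(
    generation: str,
    decompose_into_atomic_facts: Callable[[str], List[str]],
    is_supported: Callable[[str], bool],
) -> float:
    """Fraction of atomic facts in `generation` that the verifier cannot support."""
    facts = decompose_into_atomic_facts(generation)
    if not facts:
        return 0.0
    unsupported = sum(1 for fact in facts if not is_supported(fact))
    return unsupported / len(facts)


if __name__ == "__main__":
    toy_kb = {"Paris is the capital of France"}  # stand-in knowledge source
    rate = hallucination_rate(
        "Paris is the capital of France. Paris has 90 million residents.",
        decompose_into_atomic_facts=lambda text: [s.strip() for s in text.split(".") if s.strip()],
        is_supported=lambda fact: fact in toy_kb,
    )
    print(f"hallucination rate: {rate:.2f}")  # 0.50 for this toy input
```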

What are hallucinations in LLMs? They occur when a model generates statements that conflict with established world knowledge or the given context, producing false or fabricated information.

08.08.2025 16:49 — 👍 1    🔁 0    💬 1    📌 0

This paper tackles the issue of hallucinations in large language models (LLMs), where models produce misleading or inaccurate information. It's a critical problem in AI research. 🧠💡

08.08.2025 16:49 — 👍 1    🔁 0    💬 1    📌 0
Post image

We're thrilled to congratulate Dr. Abhilasha Ravichander (@lasha.bsky.social) and her team for receiving the Outstanding Paper Award at #acl2025 for their work titled "HALoGEN: Fantastic LLM Hallucinations and Where to Find Them"! 🏆✨

#ACL #LLMs #Hallucination #WiAIR #WomenInAI

08.08.2025 16:49 — 👍 15    🔁 2    💬 2    📌 0
Post image

πŸŽ™οΈ New Women in AI Research episode out now!
We speak with Dr. Abhilasha Ravichander about:

– LLM hallucination types
– Benchmarks WildHallucinations & HALoGEN
– Machine unlearning + memorization probes
– Responsible AI & transparent systems

🎧 Listen here: youtu.be/QPp0cJNBbL8

#LLMHallucinations

06.08.2025 15:30 — 👍 1    🔁 0    💬 0    📌 0
Video thumbnail

🤖 Can LLMs know when to be factual vs. creative?

🎙️ In this short clip, Abhilasha Ravichander explores one of the hardest challenges in LLM behavior: adaptability.

📅 Full #WiAIRpodcast episode drops in 2 days!

04.08.2025 15:30 — 👍 1    🔁 0    💬 0    📌 0
Preview
Women in AI Research WiAIR Women in AI Research (WiAIR) is a podcast dedicated to celebrating the remarkable contributions of female AI researchers from around the globe. Our mission is to challenge the prevailing perception th...

🎧 Episode out Aug 6.
Tune in to #WiAIRpodcast for critical conversations shaping the future of AI research:
🎬 YouTube: youtube.com/@WomeninAIRe...
🎙️ Spotify: open.spotify.com/show/51RJNlZ...
🍎 Apple: podcasts.apple.com/ca/podcast/w...
4/

01.08.2025 14:35 — 👍 0    🔁 0    💬 0    📌 0

We also explore:
🧠 Tackling model memorization
🔎 Pushing for data transparency
🛠️ Building tools for machine unlearning
🎓 Advice on navigating academic transitions
3/

01.08.2025 14:35 — 👍 1    🔁 0    💬 1    📌 0

πŸ† HALoGEN β€” a benchmark for detecting hallucinations in LLMs β€” just won the Outstanding Paper Award @aclmeeting.bsky.social in Vienna.
We unpack the challenges of evaluating hallucinations, and how factuality benchmarks can guide better LLM assessment.
2/

#ACL2025NLP #ACL2025

01.08.2025 14:35 — 👍 0    🔁 0    💬 1    📌 0
Speaker announcement: the new episode of the Women in AI Research WiAIR podcast is out on August 6th. Our guest is Dr. Abhilasha Ravichander, a postdoc at University of Washington and Assistant Professor at Max Planck Institute for Software Systems.

πŸŽ™οΈ New Women in AI Research #WiAIR episode coming Aug 6!

We talk to @lasha.bsky.social about LLM Hallucination, her award-winning HALoGEN benchmark, and how we can better evaluate hallucinations in language models.
πŸ‘‡ What’s inside:
1/

01.08.2025 14:35 β€” πŸ‘ 2    πŸ” 1    πŸ’¬ 1    πŸ“Œ 0
Video thumbnail

Our guest Dieuwke Hupkes questions the lack of accountability in academic peer review.

Why do bad reviewers get off the hook? Could consequences - like limiting their ability to submit - actually improve the system?

What's your take on this?

30.07.2025 15:02 — 👍 0    🔁 0    💬 0    📌 0
Generalization in AI, with Dr. Dieuwke Hupkes
YouTube video by Women in AI Research WiAIR

🎧 Hear Dieuwke Hupkes on why scaling laws differ for knowledge and reasoning in LLMs.

🎬 YouTube: www.youtube.com/watch?v=CuTW...
🎙️ Apple: podcasts.apple.com/ca/podcast/g...
🎧 Spotify: open.spotify.com/show/51RJNlZ...
🔗 Paper: arxiv.org/abs/2503.10061
#WiAIR #AIResearch #Reasoning #WomanInAI

28.07.2025 16:10 — 👍 1    🔁 0    💬 0    📌 0

📌 Why it matters:
Compute-optimal training depends on the skills you care about.
Careful datamix calibration & validation design are essential to train LLMs that perform well across both knowledge & reasoning. (7/8)

28.07.2025 16:08 — 👍 0    🔁 0    💬 1    📌 0

3️⃣ Validation Sensitivity
The validation set you choose matters.
At small compute scales, the optimal parameter count can shift by ~50% depending on validation design. Even at large scales, >10% variation remains. (6/8)

28.07.2025 16:08 — 👍 0    🔁 0    💬 1    📌 0

2️⃣ Beyond the Datamix
Even after balancing data proportions, knowledge vs. code diverge.
📊 Knowledge keeps demanding more parameters, while code scales better with more data. (5/8)

28.07.2025 16:08 — 👍 0    🔁 0    💬 1    📌 0

1️⃣ Skill-Dependent Optima
- Knowledge QA → capacity-hungry (needs more parameters)
- Code → data-hungry (benefits more from tokens)
Scaling laws can't be captured by a single curve. (4/8)

28.07.2025 16:07 — 👍 0    🔁 0    💬 1    📌 0
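As a rough illustration of what skill-dependent optima look like in practice, here is a hedged sketch, using made-up numbers rather than the paper's data, that fits a separate power law in model size for each skill and compares the fitted exponents.

```python
# Hedged sketch (synthetic numbers, not the paper's results): fit a separate
# power law  loss ~ A * N^(-alpha)  per skill and compare the exponents.
import numpy as np


def fit_power_law(n_params: np.ndarray, losses: np.ndarray) -> tuple[float, float]:
    """Fit loss ~= A * N^(-alpha) in log-log space; return (A, alpha)."""
    slope, intercept = np.polyfit(np.log(n_params), np.log(losses), deg=1)
    return float(np.exp(intercept)), float(-slope)


n_params = np.array([1e8, 3e8, 1e9, 3e9, 1e10])  # model sizes (parameters)

# Hypothetical validation losses per skill at each model size.
knowledge_loss = np.array([3.20, 2.90, 2.55, 2.25, 1.95])  # falls steeply with N
code_loss = np.array([2.10, 2.00, 1.92, 1.86, 1.81])       # much flatter in N

for name, losses in [("knowledge", knowledge_loss), ("code", code_loss)]:
    _, alpha = fit_power_law(n_params, losses)
    print(f"{name:9s} alpha = {alpha:.3f}")

# A larger fitted alpha for knowledge would mean it is more capacity-hungry:
# extra parameters buy more loss reduction there than for code, so the
# compute-optimal N/D split depends on which skill you validate on.
```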

πŸ› οΈ The study:
Experiments across 9 compute levels & 19 datasets, comparing two skill categories:
- Knowledge‑based QA
- Code generation (as a proxy for reasoning)
Findings reveal fundamental differences in scaling behavior. (3/8)

28.07.2025 16:07 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

πŸ” The problem:
Scaling laws guide LLM training by trading off model size & data under fixed compute. But compute‑optimal scaling is usually measured via aggregate validation loss.
What happens when we zoom in on specific skills? (2/8)

28.07.2025 16:06 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
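For readers who want the usual formal backdrop, here is a generic Chinchilla-style sketch of the compute-optimal setup in our own notation (the paper's exact parameterization may differ): loss is modeled as a function of parameter count N and training tokens D, then minimized under a fixed compute budget C.

```latex
% Generic compute-optimal scaling sketch (Chinchilla-style), not the paper's exact form.
\[
  L_{\text{skill}}(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
  \qquad C \approx 6ND,
  \qquad (N^{*}, D^{*}) = \arg\min_{N, D} L_{\text{skill}}(N, D) \ \text{ s.t. } \ 6ND = C.
\]
```

The skill-dependent result in the thread amounts to saying that the fitted constants and exponents, and therefore the optimal split between N and D, shift when L is measured on a knowledge-only versus a code-only validation set instead of aggregate validation loss.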
Post image Post image Post image

🧠 Do scaling laws in LLMs apply equally across skills, or do knowledge and reasoning scale differently?
A new study with Dieuwke Hupkes (Meta AI) shows that scaling laws are skill-dependent. 👇 (1/8)
#WiAIR #AIResearch

28.07.2025 16:05 — 👍 1    🔁 0    💬 1    📌 1
Video thumbnail

Trust in AI starts with generalization.
Dieuwke Hupkes (#MetaAI) explains why it's critical to know when you can count on your model, and when to be cautious.
Full episode:
🎥 YouTube: youtu.be/CuTWIW1JcsA?...
🎙️ Spotify: open.spotify.com/episode/0KSR...
#LLMs #TrustworthyAI

25.07.2025 16:03 — 👍 0    🔁 0    💬 0    📌 0
Preview
From Form(s) to Meaning: Probing the Semantic Depths of Language Models Using Multisense Consistency Abstract. The staggering pace with which the capabilities of large language models (LLMs) are increasing, as measured by a range of commonly used natural language understanding (NLU) benchmarks, raise...

📄 Paper: doi.org/10.1162/coli...
🔗 Code: github.com/facebookrese...

23.07.2025 16:05 — 👍 1    🔁 0    💬 0    📌 0
Preview
From Form(s) to Meaning: Probing the Semantic Depths of Language Models Using Multisense Consistency Abstract. The staggering pace with which the capabilities of large language models (LLMs) are increasing, as measured by a range of commonly used natural language understanding (NLU) benchmarks, raise...

🎧 In our discussion, Dieuwke Hupkes reflects on model behavior, philosophical roots, and the importance of cross-form consistency in language evaluation.
📽️ YouTube: www.youtube.com/watch?v=CuTW...
🎧 Spotify: open.spotify.com/show/51RJNlZ...
🎧 Apple Podcasts: podcasts.apple.com/ca/podcast/w...
(8/8)

23.07.2025 16:05 — 👍 1    🔁 0    💬 1    📌 0

Multisense Consistency does not focus on correctness alone. It probes semantic stability: whether a model preserves meaning across variation.
This reframes evaluation toward robustness, especially in multilingual and generalization settings. (7/8)

23.07.2025 16:04 — 👍 0    🔁 0    💬 1    📌 0
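As a rough illustration, not the paper's protocol, a multisense-consistency style check can be sketched as follows: pose the same question in several semantically equivalent forms and measure how often the model's answers agree with each other, regardless of correctness. The `ask_model` callable below is a hypothetical stand-in for whatever LLM interface is being evaluated.

```python
# Hedged sketch of a multisense-consistency style check (not the paper's code):
# ask the same question in semantically equivalent forms and score pairwise
# agreement of the answers, independent of whether they are correct.
from itertools import combinations
from typing import Callable, List


def consistency_score(forms: List[str], ask_model: Callable[[str], str]) -> float:
    """Pairwise agreement of answers across equivalent phrasings (1.0 = fully consistent)."""
    answers = [ask_model(form).strip().lower() for form in forms]
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    forms = [
        "Is Mount Everest taller than K2? Answer yes or no.",
        "Answer yes or no: is K2 shorter than Mount Everest?",
        "Mount Everest is taller than K2. Yes or no?",
    ]
    # Toy stand-in model that flips its answer on one rephrasing.
    # Translated forms would additionally need answers mapped to a shared label set.
    canned = {forms[0]: "Yes", forms[1]: "No", forms[2]: "Yes"}
    print(consistency_score(forms, ask_model=lambda q: canned[q]))  # ~0.33
```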

They identify two sources of inconsistency:
🧭 Task misinterpretation due to form change
⚙️ Failure to apply consistent logic across inputs
These issues occur in both simple and complex tasks, despite semantic equivalence. (6/8)

23.07.2025 16:03 — 👍 0    🔁 0    💬 1    📌 0

Findings include:
❌ Frequent inconsistencies across reworded or translated inputs
📉 Lower consistency in non-English (especially low-resource) settings
🤯 Sometimes incorrect answers were more stable than correct ones
These patterns challenge benchmark-based assumptions. (5/8)

23.07.2025 16:03 — 👍 0    🔁 0    💬 1    📌 0