Shiwali Mohan

@shiwali.bsky.social

Founder | AI Scientist | Intelligent Agents & Multi-Agent Systems | Agent Frameworks & Architectures | Human-Agent Collaboration | Cognitive Science

56 Followers  |  42 Following  |  10 Posts  |  Joined: 14.09.2023  |  1.6508

Latest posts by shiwali.bsky.social on Bluesky

Can Generative AI Support Patients' & Caregivers' Informational Needs? Towards Task-Centric Evaluation Of AI Systems
Generative AI systems such as ChatGPT and Claude are built upon language models that are typically evaluated for accuracy on curated benchmark datasets. Such evaluation paradigms measure predictive an...

It's a preliminary study, but it shows how we can make #AI #ML evaluations more informative, moving beyond benchmarks curated with minimal insight into what a useful question is and what an appropriate answer looks like. (8/8)

📖 Paper: arxiv.org/abs/2402.00234
🎥 Talk: drive.google.com/file/d/1m79W...

24.03.2025 19:27 - 👍 0    🔁 0    💬 0    📌 0

🤖 Measured how #GenAI systems did, not only in terms of correctness but also in how similar their answers were to those of an expert answering the same question. (7/8)

24.03.2025 19:27 - 👍 0    🔁 0    💬 1    📌 0

📥 Curated an evaluation question set from observed interactions. The set contains real questions asked by participants as they were attempting to do a specific task. Such datasets are critical to measuring if an #AI system is producing responses that are useful. (6/8)

24.03.2025 19:27 - 👍 0    🔁 0    💬 1    📌 0

🤕👩‍⚕️ Studied how people interact with an expert when one is available. This uncovered the specific needs people have as they make sense of data, and how an expert addresses those needs. (5/8)

24.03.2025 19:27 - 👍 0    🔁 0    💬 1    📌 0

🏥 Identified a specific use case in which people need support from an expert but the expert is not easily accessible: understanding medical scans and reports in order to make good decisions about your treatment. (4/8)

24.03.2025 19:27 - 👍 0    🔁 0    💬 1    📌 0

In our most recent paper, we explore an evaluation approach for #GenAI #GenerativeAI systems. Here are the steps we followed - (3/8)

24.03.2025 19:27 - 👍 0    🔁 0    💬 1    📌 0

As a science, we have to adopt rigorous evaluations that identify what #IntelligentSystem #AIAgent #Agent behavior should be & measure if it works as intended. Move beyond a problem-agnostic metric (accuracy) on a task-agnostic benchmark. Adopt practices from #HCI, #psychology, #economics. (2/8)

24.03.2025 19:27 - 👍 0    🔁 0    💬 1    📌 0

#AI #ML evals measure accuracy on benchmarks, telling us how algorithms compare with each other, but not much about how an #IntelligentSystem should be built. How do we make evals more informative? (1/8)

📖 Paper: arxiv.org/abs/2402.00234
🎥 Talk: drive.google.com/file/d/1m79W...

24.03.2025 19:27 - 👍 0    🔁 0    💬 1    📌 0
An architecture showing how a planning agent can be extended with metareasoning.

At #AAAI2025? Looking for #AI #ML research beyond #GenAI hype & doom? Excited about #AI running on your laptop? Listen to my colleague Wiktor Piotrowski talk about #OpenWorldLearning #OWL at 9:30 am on Feb 28th (Journal Track).

arxiv.org/abs/2306.06272
#AIPlanning #MBR #CognitiveSystems #KRR

26.02.2025 20:14 - 👍 1    🔁 0    💬 0    📌 0
Wait, are the AnthropicAI people seriously claiming to “unlock a rich theoretical landscape” for AI evaluation by proposing the use of… error bars? And this secret trove of deep statistical insight starts with “use the Central Limit Theorem”?

Befuddling

27.11.2024 22:09 - 👍 98    🔁 16    💬 11    📌 1

When your experiments show that your #AI is more human than humans, it is not that you have built #AGI or #SuperIntelligence, it is that you don't know how to evaluate, experiment, and measure.

27.11.2024 21:28 - 👍 2    🔁 0    💬 0    📌 0

@shiwali is following 19 prominent accounts