Shang Qu

@lindsayttsq.bsky.social

AI4Biomed & LLMs @ Tsinghua University

25 Followers  |  251 Following  |  8 Posts  |  Joined: 13.11.2024

Latest posts by lindsayttsq.bsky.social on Bluesky

πŸ“We've released the MedXpertQA dataset!
huggingface.co/datasets/Tsi...

πŸ“šCheck out more details:
Preprint: arxiv.org/pdf/2501.18362
Github: github.com/TsinghuaC3I/...

09.02.2025 02:19 β€” πŸ‘ 2    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
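
For anyone who wants to try the benchmark, a minimal loading sketch with the Hugging Face `datasets` library follows. Since the link above is truncated, the dataset ID `TsinghuaC3I/MedXpertQA` and the `Text`/`MM` subset and split names are assumptions; check the dataset card for the exact values.

```python
# Minimal sketch: loading MedXpertQA via the Hugging Face `datasets` library.
# Dataset ID, subset names, and split name are assumptions (the link above
# is truncated); verify them on the dataset card.
from datasets import load_dataset

text_set = load_dataset("TsinghuaC3I/MedXpertQA", "Text")  # text subset (assumed name)
mm_set = load_dataset("TsinghuaC3I/MedXpertQA", "MM")      # multimodal subset (assumed name)

print(text_set)             # splits and sizes
print(text_set["test"][0])  # inspect one question ("test" split is an assumption)
```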

Check out the details!
πŸ“’Preprint: arxiv.org/pdf/2501.18362
πŸ—ƒοΈData files will be released shortly at: github.com/TsinghuaC3I/...

04.02.2025 13:33 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

We also found that reasoning-process errors and, in the MM subset, perceptual errors account for a large share of model errors. Error cases offer further insight into the challenges models still face in clinical reasoning:

04.02.2025 13:33 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

πŸ’‘Clinical reasoning enables evaluating model reasoning beyond math & code. We annotate each MedXpertQA question as Reasoning or Understanding based on the reasoning complexity it requires.
Comparing 3 inference-time scaled models against their backbones, we find distinct improvements on the Reasoning subset:

04.02.2025 13:32 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
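
To make the comparison above concrete, here is an illustrative sketch (not the authors' code) of scoring a model and its backbone separately on Reasoning- and Understanding-labeled questions; the record layout and label strings are hypothetical.

```python
# Illustrative sketch: per-subset accuracy for the Reasoning/Understanding split.
# Field names ("label", "correct") and label values are hypothetical.
from collections import defaultdict

def subset_accuracy(records):
    """records: iterable of dicts with 'label' in {'Reasoning', 'Understanding'}
    and 'correct' (bool). Returns accuracy per label."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["label"]] += 1
        hits[r["label"]] += int(r["correct"])
    return {label: hits[label] / totals[label] for label in totals}

# acc_scaled = subset_accuracy(scaled_model_results)
# acc_base = subset_accuracy(backbone_results)
# delta = {k: acc_scaled[k] - acc_base[k] for k in acc_scaled}
# A larger delta on 'Reasoning' than 'Understanding' matches the finding above.
```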

Benchmark construction process: 38k original ➑️ 4k+ final questions
- Filtering for difficulty and diversity using responses from humans + 8 AI experts
- Question rewriting & option set expansion to lower data leakage risk
- Human expert proofreading & error correction

04.02.2025 13:31 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
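
As a rough illustration of the difficulty-filtering step, the sketch below keeps only questions that few reference responders answer correctly. The threshold and data layout are illustrative assumptions, not the exact rule from the paper.

```python
# Hypothetical sketch of consensus-based difficulty filtering: a question is
# kept when at most `max_correct` of the responders (humans + AI experts)
# answered it correctly. Threshold and data layout are assumptions.
def filter_hard_questions(question_ids, responses, max_correct=2):
    """question_ids: list of question IDs.
    responses: dict mapping question ID -> list of bools, one per responder."""
    return [q for q in question_ids if sum(responses[q]) <= max_correct]

# hard_ids = filter_hard_questions(all_ids, expert_responses)
```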

We improve clinical relevance through:
⭐️Medical specialty coverage: MedXpertQA includes questions from 20+ exams of medical licensing level or higher
⭐️Realistic context: MM is the first multimodal medical benchmark to introduce rich clinical information with diverse image types

04.02.2025 13:31 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Compared with rapidly saturating benchmarks like MedQA, we raise the bar with harder questions and a sharper focus on medical reasoning.
Full results evaluating 17 LLMs, LMMs, and inference-time scaled models:

04.02.2025 13:30 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

πŸ“ˆHow far are leading models from mastering realistic medical tasks? MedXpertQA, our new text & multimodal medical benchmark, reveals gaps in model abilities

πŸ“ŒPercentage scores on our Text subset:
o3-mini: 37.30
R1: 37.76 - frontrunner among open-source models
o1: 44.67 - still room for improvement!

04.02.2025 13:29 β€” πŸ‘ 4    πŸ” 1    πŸ’¬ 1    πŸ“Œ 1
