dpeskoff - Bluesky Statics

🚨 New Position Paper 🚨

Multiple choice evals for LLMs are simple and popular, but we know they are awful 😬

We complain they're full of errors, saturated, and test nothing meaningful, so why do we still use them? 🫠

Here's why MCQA evals are broken, and how to fix them 🧵

24.02.2025 21:03 — 👍 46 🔁 13 💬 2 📌 0

Hello, World!

05.04.2025 03:21 — 👍 3 🔁 0 💬 0 📌 0

Posts by @dpeskoff.bsky.social