🚨 New Position Paper 🚨
Multiple choice evals for LLMs are simple and popular, but we know they are awful 😬
We complain they're full of errors, saturated, and test nothing meaningful, so why do we still use them? 🫠
Here's why MCQA evals are broken, and how to fix them 🧵