@dradamb.bsky.social
This seems especially relevant for real-world cases, which are full of extra information that isn't necessary to reach an accurate conclusion
but often leads to inaccuracies and distractions in LLM outputs
A team at Apple recently published a really interesting paper where they tested LLM performance on GSM8K (a standard benchmark for mathematical reasoning)
They modified the questions with unnecessary information to distract the LLMs
It led to much lower accuracy, even for o1
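If you want to poke at the effect yourself, here's a rough sketch of the distractor-injection idea in Python. This is not the paper's actual pipeline; the wording is loosely modeled on the paper's kiwi example, and the `is_correct` helper is just an illustration of how you might score answers.

```python
# A minimal sketch of the distractor-injection idea: take a GSM8K-style word
# problem, append a true-but-irrelevant clause, and compare model accuracy on
# the two variants. Not the paper's code; just an assumed setup.

original = (
    "Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. "
    "On Sunday he picks double the number he picked on Friday. "
    "How many kiwis does Oliver have?"
)

# Irrelevant detail: nothing about the arithmetic changes.
distracted = original.replace(
    "How many",
    "Five of Sunday's kiwis were a bit smaller than average. How many",
)

ground_truth = 44 + 58 + 2 * 44  # 190 for both variants

def is_correct(model_answer: str) -> bool:
    """Score an answer by exact match on the final number."""
    return model_answer.strip().rstrip(".") == str(ground_truth)

if __name__ == "__main__":
    print("ORIGINAL:\n", original, "\n")
    print("DISTRACTED:\n", distracted)
    # To reproduce the effect, send both prompts to a model of your choice
    # and score the replies with is_correct(); a distracted model will often
    # subtract the 5 "smaller" kiwis and answer 185 instead of 190.
```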
I wonder how much of the improvement in performance is because of Goodhart's law:
"When a measure becomes a target, it ceases to be a good measure"
I.e., is better performance on benchmark tests translatable to real-world performance?