
@asaf-yehudai.bsky.social

83 Followers  |  45 Following  |  12 Posts  |  Joined: 19.11.2024  |  1.5119

Latest posts by asaf-yehudai.bsky.social on Bluesky

JuStRank - a Hugging Face Space by ibm Discover amazing ML apps made by the community

Yes!
huggingface.co/spaces/ibm/J...

13.12.2024 13:06  |  👍 2  |  🔁 0  |  💬 1  |  📌 0
JuStRank - a Hugging Face Space by ibm Discover amazing ML apps made by the community

Check out our full leaderboard here:
huggingface.co/spaces/ibm/J...

13.12.2024 10:16  |  👍 1  |  🔁 0  |  💬 0  |  📌 0
Paper page - JuStRank: Benchmarking LLM Judges for System Ranking Join the discussion on this paper page

Many more details are in the paper:
huggingface.co/papers/2412....

Thanks to my amazing collaborators from IBM Research: Ariel Gera, Odellia Boni, @yperlitz.bsky.social, Roy Bar-Haim, and Lilach Eden.

13.12.2024 10:16  |  👍 1  |  🔁 0  |  💬 1  |  📌 0
Post image

Overall, we found:
1️⃣ A strong correlation between a judge's ranking ability and its decisiveness
2️⃣ A negative correlation between ranking ability and the judge's tendency toward system-specific bias

13.12.2024 10:16  |  👍 0  |  🔁 0  |  💬 1  |  📌 0
Post image

Surprisingly, we found that self-bias is less prevalent than we thought

13.12.2024 10:16  |  👍 0  |  🔁 0  |  💬 1  |  📌 0
Post image

Second, we define a new type of bias:

System-specific bias

This is where a judge prefers or dislikes a specific system

Our results show large biases that affect system ranking

13.12.2024 10:16  |  👍 0  |  🔁 0  |  💬 1  |  📌 0
Post image

Analyzing these figures, we found an emergent judge behavior:

We call it decisiveness!
Decisive judges prefer stronger systems more than humans do!

We measure it with an empirical fit to the win-rate plots
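
One way to picture such a fit, as a rough sketch only: the post does not say which curve is fitted, so the one-parameter S-curve below (`sharpen`), the grid search, and the win-rate values are all illustrative assumptions, not the paper's method. Here `alpha = 1` reproduces the gold win-rates, while `alpha > 1` pushes the judge's win-rates toward 0 or 1, which is how a more decisive judge would look.

```python
def sharpen(x, alpha):
    """Map a win-rate x in [0, 1] through a one-parameter S-curve (assumed form)."""
    num = x ** alpha
    return num / (num + (1.0 - x) ** alpha)


def fit_alpha(gold, judge, grid=None):
    """Pick the alpha on a grid that minimises squared error to the judge win-rates."""
    if grid is None:
        grid = [i / 100 for i in range(10, 501)]  # candidate alphas 0.10 .. 5.00

    def sse(alpha):
        return sum((sharpen(g, alpha) - j) ** 2 for g, j in zip(gold, judge))

    return min(grid, key=sse)


# Hypothetical data: gold win-rates, and a judge that exaggerates them with alpha = 2.
gold_wr = [0.1, 0.3, 0.5, 0.7, 0.9]
judge_wr = [sharpen(g, 2.0) for g in gold_wr]

alpha_hat = fit_alpha(gold_wr, judge_wr)
print(f"fitted alpha: {alpha_hat:.2f}")
```

A fitted `alpha` above 1 would then read as "this judge amplifies the gap between strong and weak systems".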

13.12.2024 10:16  |  👍 2  |  🔁 0  |  💬 1  |  📌 0
Post image

What does JuStRank tell us about general judge behavior?

For that, we turn to the system preference task
Given a pair of systems, which one is better?

We plot the gold win-rates against the judge-predicted win-rates
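
As a minimal sketch of what a pairwise win-rate could look like: the definition below (share of instances where system A gets the higher judge score, with ties counted as half) is an assumption for illustration, and the scores are made up. The post itself does not spell out how win-rates are computed.

```python
def win_rate(scores_a, scores_b):
    """Fraction of instances where system A outscores system B (ties count 0.5)."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    wins = 0.0
    for a, b in zip(scores_a, scores_b):
        if a > b:
            wins += 1.0
        elif a == b:
            wins += 0.5
    return wins / len(scores_a)


# Hypothetical judge scores for the same four prompts.
system_a = [3, 5, 4, 2]
system_b = [2, 5, 1, 4]

wr_ab = win_rate(system_a, system_b)
print(f"A beats B with win-rate {wr_ab}")
```

Computing this once from judge scores and once from human preferences gives the two values that are plotted against each other for every system pair.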

13.12.2024 10:16  |  👍 0  |  🔁 0  |  💬 1  |  📌 0
Post image

With JuStRank we found:
1️⃣ Smaller dedicated judges are on par with big ones
2️⃣ An LLM judge's realization matters a lot
3️⃣ Comparative judgment is not the best choice for most judges

🕺💃

13.12.2024 10:16  |  👍 0  |  🔁 0  |  💬 1  |  📌 0
Post image

So how did we do it?

For LLM judges, we took 4 distinct realizations
➕ reward models
Each judge judged the responses of 64 systems,
which gave us a system ranking per judge

Then we compared each ranking to Arena's gold ranking
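
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions: averaging each system's scores and using Kendall's tau as the agreement measure are my choices here, not something the post specifies, and the system names and scores are invented.

```python
from itertools import combinations
from statistics import mean


def rank_systems(scores_by_system):
    """Order systems from best to worst by mean judge score (assumed aggregation)."""
    return sorted(scores_by_system,
                  key=lambda s: mean(scores_by_system[s]),
                  reverse=True)


def kendall_tau(ranking_a, ranking_b):
    """Kendall's tau-a between two rankings of the same items (no ties)."""
    pos_a = {s: i for i, s in enumerate(ranking_a)}
    pos_b = {s: i for i, s in enumerate(ranking_b)}
    concordant = discordant = 0
    for x, y in combinations(ranking_a, 2):
        agreement = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y])
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1
    n_pairs = len(ranking_a) * (len(ranking_a) - 1) / 2
    return (concordant - discordant) / n_pairs


# Hypothetical judge scores per system, and a hypothetical gold ranking.
judge_scores = {
    "sys_a": [0.9, 0.8, 0.7],
    "sys_b": [0.6, 0.7, 0.5],
    "sys_c": [0.65, 0.75, 0.7],
    "sys_d": [0.2, 0.3, 0.1],
}
gold_ranking = ["sys_a", "sys_b", "sys_c", "sys_d"]

judge_ranking = rank_systems(judge_scores)
tau = kendall_tau(judge_ranking, gold_ranking)
print(judge_ranking, round(tau, 3))
```

A tau of 1 means the judge orders the systems exactly like the gold ranking, and a value near 0 means there is no agreement.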

13.12.2024 10:16  |  👍 0  |  🔁 0  |  💬 1  |  📌 0
Post image

There are many new judge benchmarks
But most focus on evaluating the judge's ability to choose a better response

We focus on the judge's ability to choose a better system

13.12.2024 10:16  |  👍 0  |  🔁 0  |  💬 1  |  📌 0
JuStRank: Benchmarking LLM Judges for System Ranking Given the rapid progress of generative AI, there is a pressing need to systematically compare and choose between the numerous models and configurations available. The scale and versatility of such eva...

New preprint! ✨
Interested in LLM-as-a-Judge?
Want to get the best judge for ranking your system?
Our new work is just for you:
"JuStRank: Benchmarking LLM Judges for System Ranking"
🕺💃
arxiv.org/abs/2412.09569

13.12.2024 10:16  |  👍 9  |  🔁 5  |  💬 1  |  📌 1
