Yotam Perlitz

@yperlitz.bsky.social

Research Scientist at @ibmresearch #NLProc, #RL. Opinions are my own.

663 Followers  |  56 Following  |  58 Posts  |  Joined: 11.02.2024

Latest posts by yperlitz.bsky.social on Bluesky

How important are LLM evaluations to you?

A) Who cares?
B) Somewhat important (I guess?)
C) I'm an LLM, I evaluate myself.
D) Enough to join the pack

Let's talk about LLM evals here: go.bsky.app/DJpp8cy

18.11.2024 20:50 | 👍 6    🔁 1    💬 3    📌 0

Save yourselves the hours (or days) of running inference on all 64K examples when using HELM.
In arxiv.org/pdf/2308.116... we show that 160 examples 🤯🤯🤯 are enough to get a very good picture. #ComputeIsForTraining

with
@lchoshen.bsky.social and more

13.11.2024 18:40 | 👍 6    🔁 1    💬 0    📌 0

Thanks!

12.11.2024 19:51 | 👍 0    🔁 0    💬 0    📌 0

@yamadashy

12.11.2024 19:50 | 👍 0    🔁 0    💬 0    📌 0
GitHub - yamadashy/repomix: 📦 Repomix (formerly Repopack) is a powerful tool that packs your entire repository into a single, AI-friendly file. Perfect for when you need to feed your codebase to Large Language Models (LLMs) o...

If you haven't tried it yet:
github.com/yamadashy/re...
it can turn your repo into one file,
making it super easy to feed to a chatbot and ask questions

12.11.2024 19:50 | 👍 0    🔁 0    💬 1    📌 0
BenchBench Leaderboard - a Hugging Face Space by ibm. Discover amazing ML apps made by the community

✨ Developed a new benchmark or dataset for language models? ✨
Want the community to trust and adopt it? 🤔
Show that it (dis)agrees with common benchmarks

BenchBench makes it easy. Check it out:
👉 huggingface.co/spaces/ibm/b...

12.11.2024 19:47 | 👍 0    🔁 0    💬 0    📌 0

hi @mariaa.bsky.social
Can I be added to the pack?
Mostly posting about AI evaluations and benchmarking :)

12.11.2024 19:44 | 👍 0    🔁 0    💬 2    📌 0

hi @maosbot.bsky.social can I be added to the AI pack?
mostly posting on evaluations of AI but other things as well

12.11.2024 19:36 | 👍 0    🔁 0    💬 0    📌 0

Seems like it indeed measures what it claims to :)
Kudos to the authors
A faster, automatic (no annotators) alternative to the Chatbot Arena https://t.co/WNk3UmXRSq

24.10.2024 10:09 | 👍 0    🔁 0    💬 0    📌 0

https://t.co/TZlMiQdgWR

22.10.2024 12:57 | 👍 0    🔁 0    💬 0    📌 0

we've now added the decentralized arena to benchbench,

check out how it fares with other benchmarks

https://t.co/pjhtr8CPZD

22.10.2024 12:56 | 👍 0    🔁 0    💬 0    📌 0

Get your benchmark game on: https://t.co/yY0swLQOHZ https://t.co/3qzkcIOd7u https://t.co/5Y7QUz0Ype

17.09.2024 18:42 | 👍 0    🔁 0    💬 0    📌 0

Me trying to choose the right LLM benchmark without BenchBench:

https://t.co/TZlMiQdgWR https://t.co/DQEttklUGQ

17.09.2024 11:19 | 👍 0    🔁 0    💬 0    📌 0

Shoutout to @streamlit, our framework of choice! Shoutout to @huggingface for hosting our space 🤗 https://t.co/z8LFw6ZQG7

17.09.2024 11:16 | 👍 0    🔁 0    💬 0    📌 0

Visit the BenchBench Leaderboard to explore and visualize how established benchmarks compare: https://t.co/yY0swLQgSr
Use our Python package to perform your own BAT analysis: https://t.co/iU8favWVT6
And read the paper: https://t.co/RvCp3R6gU5 https://t.co/poHpewZkS3

16.09.2024 17:09 | 👍 0    🔁 0    💬 0    📌 0

BenchBench can prove your benchmark measures unique skills ❄️ (disagreement with existing benchmarks)

Or prove it captures what other benchmarks aim to measure (agreement), for example by agreeing with @lmsys while being more efficient. https://t.co/KwtHtTRESc

16.09.2024 17:09 | 👍 0    🔁 0    💬 0    📌 0

✨ Developed a new benchmark or dataset for language models? ✨

Want the community to trust and adopt it? 🤔

So, demonstrate its validity by comparing it to established benchmarks!

BenchBench makes it easy. Check it out:
👉 https://t.co/yY0swLQgSr

16.09.2024 17:09 | 👍 0    🔁 0    💬 0    📌 0

Shout-out to the amazing team at IBM behind Unitxt: @ElronBandel, @MatanOrbach, yoavkatz, eladv, @LChoshen, @yotamperlitz & more!

IBM is betting big on it (IBM Research AI VP 👇) https://t.co/BKfK0JriYB

06.09.2024 10:41 | 👍 0    🔁 0    💬 0    📌 0

HELM just got a great upgrade!
We've integrated with Unitxt for:

Easy dataset addition
2x the datasets
Sharable & reproducible pipelines

Check out the blogpost: https://t.co/UJXwfPKzGN
And the unitxt repo
https://t.co/GeqMCoQhjv

@ElronBandel @YifanMai

06.09.2024 10:27 | 👍 0    🔁 0    💬 0    📌 0

Everyone knows you never have to use the full test set
We show just how right they were 🤯!

Check out our presentation at @naacl
in Efficient/Low-Resources and Evaluation Methods for NLP (18 June 2024 @ 02:12)

or watch our video here:
https://t.co/pPOpKyLbhT

See you! https://t.co/ocVvmVBBlW

16.06.2024 20:37 | 👍 0    🔁 0    💬 0    📌 0

It is a great figure
and a great thing you did by sharing all your meta-data!

it has enabled a lot of great work
ours as well :)

https://t.co/9lGi8aW8IG https://t.co/Lz62fTdn7O

07.06.2024 20:33 | 👍 0    🔁 0    💬 0    📌 0

Bored with all benchmarks ranking models the same?
HOLMES doesn't 💪

Probing LMs for linguistic abilities is a fresh idea, @AndreasWaldis took it to the extreme 🦸

Give it a read!
or check out the leaderboard https://t.co/Byc1Nhp3nV https://t.co/zH0RLddkID

13.05.2024 16:11 | 👍 0    🔁 0    💬 0    📌 0

I've been working internally with this dataset
and let me tell you...

It's great! https://t.co/MOwn0OyVS3

03.04.2024 16:10 | 👍 0    🔁 0    💬 0    📌 0

like the color scheme 🏅 https://t.co/sdAosgxypV

13.02.2024 18:49 | 👍 0    🔁 0    💬 0    📌 0

Using contrastive representation for optimized human evaluation 👁️👁️👁️

Nice! https://t.co/49leLodOAQ

13.02.2024 17:11 | 👍 0    🔁 0    💬 0    📌 0

Check out the paper for more insights :) https://t.co/7zhb8mGtQ0

01.02.2024 21:35 | 👍 0    🔁 0    💬 0    📌 0

variance in evaluation has many sources,
this work really does a good job at profiling one of these https://t.co/nAf7zYDSd7

01.02.2024 21:33 | 👍 0    🔁 0    💬 0    📌 0

these models keep changing 💩
tomorrow this figure will have no meaning https://t.co/OsA2WfiLHn

19.12.2023 14:38 | 👍 0    🔁 0    💬 0    📌 0

this is a nice-to-have link :) https://t.co/DYApcasZen

19.12.2023 14:33 | 👍 0    🔁 0    💬 0    📌 0

seems like there are more recent findings similar to that. BTW @adinamwilliams, where can I find the full paper? https://t.co/sl1Jqa1R1R

12.12.2023 10:03 | 👍 0    🔁 0    💬 0    📌 0
