Potemkin Understanding in Large Language Models
Large language models (LLMs) are regularly evaluated using benchmark datasets. But what justifies making inferences about an LLM's capabilities based on its answers to a curated set of questions? This...
@simonwillison.net I'm not trying to be an LLM denier here, but man this paper hit home for me as not an ML kind of person and I'd love to see your take on it?
[2506.21521] Potemkin Understanding in Large Language Models share.google/W9cKIwYoWI5W...
Coherence seems like an important metric.
28.06.2025 19:20
These are great!
18.11.2024 19:37
keyboard.io. In past lives, I helped run VaccinateCA, created K-9 Mail for Android, created Request Tracker, and was the project lead for Perl.
I can usually be found in #Berkeley
jesse@fsck.com
jesse@keyboard.io
jesse@metasocial.com
was @obra on Twitter
The new way to hire Elixir developers, anywhere in the world 🌎.
https://elixirdevs.com
Elixir is a dynamic, functional language designed for building scalable and maintainable applications
🔗 https://elixir-lang.org/
Independent AI researcher, creator of datasette.io and llm.datasette.io, building open source tools for data journalism, writing about a lot of stuff at https://simonwillison.net/
Helping teams and individuals adopt Elixir and thrive. Elixir courses and podcast.
HELLDIVERS 2 developed by @ArrowheadGS
on PS5 and PC!
GIVE 'EM HELL.
parody* If you post cool art / screenshots I will retweet them.
JOIN THE FIGHT: http://discord.gg/helldivers
Artist based in San Francisco
The Windsurf Editor. Tomorrow's Editor, Today.
Creator of Elixir. Working at Dashbit and Livebook.
The Modern Software Engineering Channel ➡️ https://www.youtube.com/@ModernSoftwareEngineeringYT
The family of tech conferences focused on BEAM languages:
#Erlang, #Elixir and #Gleam
Code BEAM Europe: 02 Jun https://codebeameurope.com/
At wired.com where tomorrow is realized || Sign up for our newsletters: https://wrd.cm/newsletters
Find our WIRED journalists here: https://bsky.app/starter-pack/couts.bsky.social/3l6vez3xaus27
We're an AI safety and research company that builds reliable, interpretable, and steerable AI systems. Talk to our AI assistant Claude at Claude.ai.
Husband and father. 40th Governor of California. Former Lt. Governor of California. Former San Francisco Mayor. Personal account.
The official "Resistance" team of U.S. National Park Service. Our website: www.ourparks.org
Software Design Loudmouth. Works for Thoughtworks. Also hikes, watches theater, and plays modern board games. He/him.
host of https://martinfowler.com
Oil painter of uncomfortable intersections and sometimes concept artist.
I do a few print sales a year.
Www.katebalfe.com