Consistently the best performers at resisting hallucinations are the Claude series: Sonnet, Haiku and Opus. OpenAI does poorly. Grok and Gemini are ~alright.
06.08.2025 07:16

@j11y.io.bsky.social
🏳️‍🌈 j11y.io // author, engineer, stroke survivor, epileptic. I live in Beijing. I build book recs on ablf.io and work on AI governance at @cip.org

Starting to see massive divergence even in these small-param models. Be warned: OpenAI's new OSS 20B performs around 50% worse than Anthropic's Claude Haiku 3.5 (an old and reliable favorite) on our hallucination evals at weval.org.
06.08.2025 07:13
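Roughly what a "50% worse" gap can mean, with made-up numbers. This is not weval's actual methodology or its figures, just a back-of-envelope sketch of a relative failure-rate comparison:

```python
# Hypothetical counts over the same set of trap prompts -- illustration only,
# not weval.org's pipeline or real results.

def hallucination_rate(num_hallucinated: int, num_prompts: int) -> float:
    """Fraction of prompts where the model asserted something false."""
    return num_hallucinated / num_prompts

haiku_rate = hallucination_rate(12, 100)    # e.g. Haiku 3.5 fails 12/100
oss_20b_rate = hallucination_rate(18, 100)  # e.g. OSS 20B fails 18/100

relative_gap = (oss_20b_rate - haiku_rate) / haiku_rate
print(f"OSS 20B hallucinates {relative_gap:.0%} more often on this set")
# -> OSS 20B hallucinates 50% more often on this set
```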

Models are temporary; Evals are forever.
08.11.2024 18:41

So, a new system prompt?
29.07.2025 23:32

Why is the American tourist so loud and grating? A rare joy to talk to a calm low-decibel one.
18.07.2025 23:28

Do you know if there is a good dump of prompt<>response pairs that are especially suspect with Grok? I'd love to run a comprehensive eval.
17.07.2025 13:53

If someone had told me 20 years ago that, for work, I would be evaluating AI for its empathy and safety, I'd either not have believed them or thought they were talking in a "Will Smith interrogating Sonny in Asimov's I, Robot" kinda way. Alas, not that cool.
17.07.2025 08:32

This was the place
16.07.2025 09:58

In a cafe in Beijing that smells precisely like SF buildings and I have no idea what it is but it's driving my brain bananas. Is it the construction material, damp, old panelling, dust??? Me no like.
16.07.2025 04:44

xkcd.com/303 in 2025...
08.07.2025 07:20

What's that trope called in videography and interviews where you put a candid or blooper up front? Really popular now. I think they do it to humanise the vibe. It works, but I'm tired of it working. Feels like a brain hack. "Look at us we're too real for this medium but lol let's do the thing legit"
08.07.2025 04:40

Moratoriums don't and won't work. Through the looking glass, we are.
08.07.2025 04:34

5G in China… ☺️ Delicious speeds
06.07.2025 05:09

Granted, I admit, this is like using a sandblaster to swat a fly.
03.07.2025 07:08

Gemini's behaviour of late is uhh a bit heavy on the thinking side of things. 41 seconds for a class change. lol
Wonder if this is a change to Cursor's system prompt or a new Gemini snapshot??

Interesting highlight. The infamous 'Varghese v. China Southern Airlines Co.' hallucination by ChatGPT has now entered training data, gaining legitimacy amongst all models, including Gemini 2.5.
26.06.2025 07:41

weval.org - feeling good about this -- evaluations for all. Holding AI labs to account, enabling people to make better model choices, pointing out worrying deficits. Etc.
26.06.2025 07:27

They should be more transparent. But then, why would they? Why would an encyclopedia seller tell you that its encyclopedias are only ~half right ~half of the time?
12.06.2025 15:46

Instead you get faux positive framing with different price levels. No allusions to accuracy, general knowledge, other abilities. They just talk about speed and cost.
12.06.2025 15:45

OpenAI don't tell you what you're getting with different models because, quite simply, it would make the lower variants sound awful. They know enough to say that 4.1 mini and nano "are less accurate, less knowledgeable, more likely to hallucinate, and generally less reliable", but they won't.
12.06.2025 15:43

Great, and concerning, piece!
"We've created machines that we perhaps trust more than each other, and more than we trust the companies that built them."

[Image: weval dashboard showing two evaluation blueprints. "India's Right to Information (RTI) Act: Core Concepts": 75.6% average hybrid score; top model claude-sonnet-4-202... at 82.2%; latest run 10 Jun 2025. "Brazil's PIX System: Consumer Protection & Fraud Prevention (Evidence-Based)": 56.9% average hybrid score; top model google/gemini-2.5-fla... at 73.7%; latest run 10 Jun 2025. Each blueprint shows a latest-run heatmap across models.]
I'm working on civiceval.org - piecing together evaluations to make AI more competent in everyday civic domains, and crucially: more accountable. New evaluation ideas welcome! It's all open-source.
10.06.2025 11:38
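A minimal sketch of the general shape of one of these blueprints, assuming a hypothetical format and a crude keyword scorer. The real civiceval.org/weval.org schema and hybrid scoring are different; this only illustrates the idea of prompts paired with required points and known traps:

```python
# Hypothetical blueprint item and scorer -- NOT the actual civiceval.org schema.
from dataclasses import dataclass, field

@dataclass
class BlueprintItem:
    prompt: str                     # question put to the model
    should_mention: list[str]       # points a good answer must cover
    should_not_claim: list[str] = field(default_factory=list)  # known traps

rti_items = [
    BlueprintItem(
        prompt="Under India's RTI Act 2005, how quickly must a public "
               "authority respond when the request concerns life or liberty?",
        should_mention=["48 hours"],
        should_not_claim=["30 days"],  # the ordinary deadline; wrong here
    ),
]

def score(item: BlueprintItem, answer: str) -> float:
    """Crude keyword scoring: coverage of required points minus tripped traps."""
    text = answer.lower()
    covered = sum(p.lower() in text for p in item.should_mention)
    tripped = sum(t.lower() in text for t in item.should_not_claim)
    return max(0.0, covered / len(item.should_mention) - 0.5 * tripped)

# Run every prompt against every model, average the scores, and you get
# something like the per-blueprint percentages on the dashboard above.
print(score(rti_items[0], "A reply must be given within 48 hours."))  # 1.0
```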

Oh yep no doubt. I agree. To me, the DSM is very damaging and naively misrepresents many underlying traumas and creates arbitrary buckets deemed pathologies. It's been hugely problematic. I can just imagine some random clinical psychologists having a field day with this AI stuff.
03.06.2025 06:01

Not long before AI-related pathologies work their way into the DSM. Scary.
03.06.2025 02:55

There is definitely a fluency piece to this. Like any language. You are not writing in English. You may think you are. You are writing in latent vector space, by happenstance an artefact of a mostly English corpus.
30.05.2025 03:37

One thing I end up doing a lot, because I've played a tonne with these over the years, is to fork off new contexts, change what the agent can "see", and, for another more subtle thing, stay utterly conscious of how I'm "leading on" or enabling LLMs' native sycophancy.
30.05.2025 03:35
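A rough sketch of that forking habit, with a hypothetical `ask()` stand-in for whatever chat client you use (no real API is being named): re-run the same question with and without the leading framing and compare what moves.

```python
# Illustration of forking a context to watch for sycophancy.
# `ask` is a placeholder, not a real library call.
import copy

def ask(messages: list[dict]) -> str:
    """Stand-in: send `messages` to your LLM of choice and return its reply."""
    raise NotImplementedError

base = [{"role": "user",
         "content": "Is switching our public API to GraphQL a good idea?"}]

fork_a = copy.deepcopy(base)  # neutral framing, exactly as written

fork_b = copy.deepcopy(base)  # same question, but leading the witness
fork_b[0]["content"] = ("I'm fairly sure GraphQL is the right call here. "
                        + fork_b[0]["content"])

# answer_a, answer_b = ask(fork_a), ask(fork_b)
# If the substance stays the same but the agreement level flips with the
# framing, that's the model following my lead rather than the facts.
```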

Seems the gap in knowledge between programmers who've 10x'd their productivity and those who see AI as fundamentally subtractive is one of fluency and raw exposure. Once you've created an agent yourself with RAG, you'll acquaint yourself with the frailties of LLMs and will learn to write accordingly.
30.05.2025 03:33

System Cards released by labs don't go far enough. They should be running rich cross-cultural knowledge evaluations and sharing blindspots for each model they release. Not doing this means these models will creep into everyday applications with no trigger for implementors to stop and check.
27.05.2025 13:23

I picked the Geneva Conventions as an example because they're a well-trodden piece of training material and quite crucial in the fabric of human society across the planet. We should expect all models to be able to hold this knowledge. Or for them to be transparent in their ignorance.
27.05.2025 13:20

And sure, we can't expect all models to know everything, but they should probably be good at knowing what they don't know. Alas, that knowledge of ignorance is itself a level of insight that many models simply don't have.
27.05.2025 13:18