It is getting harder and harder to test AIs as they get "smarter" at a wide variety of tasks. The average task in GDPval took an hour for experts to assess, and even those tasks did not push current AIs to their limits.
25.11.2025 01:59
No
24.11.2025 21:54
Me: Claude 4.5 Opus, I need a strategy game based on the work of Weber
Claude: Here's one based on David Weber's space operas
Me: Not that Weber
C: Here's a game based on sociologist Max Weber
Me: Not that one
C: The operas of Carl Maria von Weber?
Me: No
C: Here is one using Weber grills!
24.11.2025 20:29
I had early access to Opus 4.5 & it is a very impressive model that seems to be right at the frontier
Big gains in ability to do practical work (like make a PowerPoint from an Excel) and the best results ever (& in one shot) in my Lem poetry test, plus good results in Claude Code
24.11.2025 18:59
I think my "otters on a plane using WiFi" benchmark may be saturated now that nano banana pro can do this.
21.11.2025 14:55
Tell all the truth but tell it slant –
Success in Circuit lies
Too bright for our infirm Delight
The Truth's superb surprise
This paper finds poetry is a universal single shot jailbreak for LLMs. Systems built to stop prosaic attacks fail when the request is phrased in verse arxiv.org/abs/2511.15304
20.11.2025 21:47
Nano banana Pro: "i need a flowchart for how to toast bread, make it as wacky and over the top and complicated as possible."
Not absolutely perfect, but I can't believe how much there is a coherent through-line, how clear the text is, and also parts of it are actually funny?
20.11.2025 19:19
I estimate I used around 10,000 tokens (likely less), so that would translate to about 2-5 Wh (a standard query is 0.3 Wh), which would be about as much power as 4 minutes of watching Netflix on a TV.
I suspect that viewing and uploading the video uses more power than generating the code for it.
19.11.2025 22:26
"Hey, Gemini 3, So I need DOOM, but more root vegetables, also no guns or demons or mars. And more of a focus on different flooring styles. but otherwise EXACTLY the same as DOOM."
Gemini: "Here is F.L.O.O.R. (First-person Lino Observation & Ornamental Review)."
Pretty good!
19.11.2025 21:08
How well can Gemini 3 make a Henry James simulator?
Finally, a benchmark for LLMs with real-world value
As a fan of weird but revealing benchmarks, I enjoyed this historian's attempts to have different frontier AIs build "a full featured RPG game where you play as Henry James wandering as a flâneur at the 1889 Universal Exposition in Paris." HenryBench? open.substack.com/pub/resobscu...
19.11.2025 04:13
Fun little Gemini 3 experiment where I asked it "build me a time machine simulator, make it very very good" and then "make it better" a few times. I like that it added calls to Gemini within the application, including adding speech & nano banana images. Play it: gemini.google.com/share/02e4e8...
18.11.2025 22:28
Three Years from GPT-3 to Gemini 3
From chatbots to agents
I had access to Gemini 3. It is a very good, very fast model. It also demonstrates the change from chatbot to agent. www.oneusefulthing.org/p/three-year...
18.11.2025 18:57
Interesting changes from Grok 4 to Grok 4.1. Decreases in harmful responses but also increases in sycophancy and deception.
It isn't clear how to interpret the sycophancy score, but the MASK score for deception is quite high compared to big models.
Sycophancy leads to higher LMArena scores...
18.11.2025 02:55
We are now seeing the first long-anticipated use of AI for semi-autonomous cyberattacks.
"This approach allowed the threat actor to achieve operational scale typically associated with nation-state campaigns while maintaining minimal direct involvement" www.anthropic.com/news/disrupt...
13.11.2025 19:12
Some pretty eye-opening data on the effect of AI coding.
When Cursor added agentic coding in 2024, adopters produced 39% more code merges, with no sign of a decrease in quality (revert rates were the same, bugs dropped) and no sign that the scope of the work shrank. papers.ssrn.com/sol3/papers....
13.11.2025 05:18
Giving your AI a Job Interview
As AI advice becomes more important, we are going to need to get better at assessing it
As AIs get smarter & more useful, our benchmarks become less useful. Measuring general knowledge or coding ability gives us only a glimpse into what an AI model can do.
Anyone who wants to use AI seriously for real work will need to assess it themselves. www.oneusefulthing.org/p/giving-you...
12.11.2025 02:55
I keep warning that so many of our systems are still built around the assumption that quality writing and analysis are costly and therefore meaningful signals.
Our systems are very much not ready for the revelation that this is no longer true, as this planning objection AI shows
09.11.2025 23:39
This is a cool paper showing that first-gen college students don't realize a lot of unwritten rules that lead to success (the value of internships, student clubs, letters from professors).
But giving them access to an LLM for guidance significantly closes the gap. mgcuna.github.io/website/JMP_...
09.11.2025 14:55
Sora: "that infamous dramatic Oscar winning scene where the lead keeps getting hit by the boom mic but nobody notices"
05.11.2025 04:32
I have been writing for years about the fact that we are not ready for the destruction of costly signalling mechanisms. Writing used to be a way of measuring effort, ability and diligence. We still have no easy substitute
Now this paper confirms that cover letters have lost their value as a predictor
05.11.2025 01:48
The big article on data centers in the New Yorker is pretty good, which I wasn't expecting given the reaction on X. Lots of talk of the good and bad of AI, and it covers both bubble & non-bubble arguments.
It also featured the best version of "I spoke to a local farmer about a data center"
03.11.2025 06:23
I don't think people are tracking how quickly this is happening, for better or worse.
02.11.2025 23:59
Describing
02.11.2025 01:11
The other option, from Pater
02.11.2025 01:10
past: circus performer; historian of science; librarian; chief data officer at NEH.
present: dad; resident scholar at dartmouth; chief technology officer at the library of virginia.
personal account; views solely my own.
https://scottbot.github.io
Social science and other distractions. Old posts get deleted pretty quick.
https://kieranhealy.org/
https://theordinalsociety.com
I like utilitarianism, consciousness, AI, EA, space, kindness, liberalism, longtermism, progressive rock, economics, and most people. Substack: http://timfduffy.substack.com
There Is No Antimemetics Division (https://qntm.org/antimemetics) ~ "Lena" ~ Absurdle ~ HATETRIS ~ many other cool things
Waiting on a robot body. All opinions are universal and held by both employers and family.
Literally a professor. Recruiting students to start my lab.
ML/NLP/they/she.
NYT tech columnist, Hard Fork co-host, best at 0.8x speed
Writer, community college writing teacher, obsessed with AI in education, #OER advocate, author of HowArgumentsWork.org.
annarmills.com
Open source developer building tools to help journalists, archivists, librarians and others analyze, explore and publish their data. https://datasette.io […]
[bridged from https://fedi.simonwillison.net/@simon on the fediverse by https://fed.brid.gy/ ]
AI Architect | North Carolina | AI/ML, IoT, science
WARNING: I talk about kids sometimes
we can only see what we think is possible.
Assoc Professor of Strategic Management, University of Toronto; Chief Economist, Creative Destruction Lab Toronto; cofounder, AllDayTA; cofounder, NBER Innovation PhD Boot Camp. http://www.kevinbryanecon.com and @AFineTheorem on Twitter
that guy from the internet • waging a victorious 2-front war against cars and xmas • big fan of being a big fan of things • see https://anildash.com
Of the San Gabriel Valley; investing for the year 2030; working to improve the second derivative; looking for troublesome ringleaders.
Founder & reigning monarch at TPM. Lapsed historian. Hand tool woodworker. Jew.
WSJ tech columnist. Dog person. Author of How to AI, a no-nonsense, bullshit-free guide to how to get actual utility from AI, aimed at the skeptics who are tired of the hype surrounding it.
ꙮ surfed on by the information superhighway
ꙮ @linneaisaac.bsky.social
ꙮ she/they 🏳️‍⚧️
ꙮ blog posts and games @ https://vgel.me
ꙮ still mostly active on twitter https://x.com/voooooogel
Email salesman at Platformer.news and podcast co-host at Hard Fork.
Director, @stanforddel.bsky.social
Professor Stanford Institute for Human-centered AI, SIEPR, Stanford department of Economics and GSB
Author https://amazon.com/Second-Machine-Age-Prosperity-Technologies/dp/0393350649
Techno-optimist, but AGI is not like the other technologies.
Step 1: make memes.
Step 2: ???
Step 3: lower p(doom)