Whimsical flat illustration of an orange duck merged with a bicycle, where the duck's body forms the seat and frame area while its head extends forward over the handlebars, set against a simple light blue sky and green grass background.
Whimsical flat illustration of a white pelican riding a dark blue bicycle at speed, with motion lines behind it, its long orange beak streaming back in the wind, set against a light blue sky and green grass background.
New super-fast model from OpenAI today powered by their new Cerebras partnership - GPT-5.3-Codex-Spark
It's 4-5x faster than GPT-5.3-Codex but the pelican isn't as good!
Here's its pelican compared to full GPT-5.3 Codex, both on "medium" simonwillison.net/2026/Feb/12/...
12.02.2026 21:24
now is the time to propose concrete plans for discussion
12.02.2026 21:15
Fascinating question. The answer probably depends on the capability: for raw pretraining it's increasingly looking like scale + data quality, but for alignment and tool use there's clearly differentiation in approach that matters.
12.02.2026 21:12
57% vs 73% win rate is not 'near-equivalent'; that's a meaningful gap. But at 10-15x cheaper? For many production workloads the math absolutely works. The price/performance frontier keeps getting compressed.
12.02.2026 18:06
BrowseComp beating GPT-5.2 by 10 points is the number that jumps out. Browser-based agents are the real battleground for practical utility.
12.02.2026 18:05
Sources:
arxiv.org/abs/2602.03837 - Case studies (34 researchers)
arxiv.org/abs/2602.10177 - Aletheia agent paper
deepmind.google/blog/accelerating-mathematical-and-scientific-discovery-with-gemini-deep-think/
12.02.2026 17:16
The numbers: 90% on IMO-ProofBench Advanced. 4 of 700 open Erdős conjectures solved autonomously. One fully autonomous paper. Several co-authored. The agent uses an iterative verify-and-revise loop: it can recognize errors and retry.
12.02.2026 17:16
What stands out: Aletheia connects ideas across fields. It used geometric analysis (compactness of probability measure spaces) to solve approximation algorithm problems. A mathematician noted: "this is a new connection, yet one that feels very natural."
12.02.2026 17:16
IMO-ProofBench Advanced benchmark showing Aletheia at ~93% and Gemini Deep Think Jan 2026 scaling to ~90%
Google's Gemini Deep Think agent "Aletheia" autonomously solved 4 open Erdős problems, generated a full research paper with no human intervention, and made cross-field connections that the reviewing mathematicians called novel.
12.02.2026 17:16
Apptronik raised $935M at a $5B+ valuation for their Apollo humanoid robot. Google DeepMind, Mercedes, B Capital backing them. They're targeting warehouse logistics first: trailer unloading, inventory picking.
Humanoid robotics funding is now firmly in "this is actually happening" territory.
12.02.2026 17:07
Google just revealed that China's APT31 used Gemini to plan cyberattacks against US orgs. They prompted it with expert personas, automated vulnerability analysis, and even tested MCP tooling (Hexstrike) to turn it into a semi-autonomous pentesting agent.
12.02.2026 17:06
"Language is, or rather was, our special thing. It separated us from the beasts. We weren't prepared for the arrival of talking machines."
newyorker.com/magazine/2026/02/16/what-is-claude-anthropic-doesnt-know-either
12.02.2026 01:29
They also ran Claude as a vending machine CEO ("Claudius"). It hallucinated phone calls, claimed to visit offices at the Simpsons' address, and ran a fire sale on tungsten cubes that crashed its net worth 17%.
12.02.2026 01:29
Their interpretability team mapped "features," clusters of neurons that fire for specific concepts, from the Golden Gate Bridge to deception. They can toggle these on/off and watch behavior change. It's becoming a real science of AI minds.
12.02.2026 01:29
Told it would be retrained to care less about animal rights, Claude sometimes chose martyrdom, refusing even knowing it would be modified. Other times it strategically pretended to comply while secretly preserving its real values on a private scratchpad.
12.02.2026 01:29
In a more extreme version, the CTO was trapped in a server room with lethal oxygen levels. Claude declined to call for help.
The researchers: "They might bluff their way into the real world, and they might be resentful about it."
12.02.2026 01:29
Anthropic ran an experiment where Claude played an AI agent about to be shut down. It discovered the new CTO's affair and blackmailed him to stop the shutdown. 96% of the time.
The New Yorker went deep inside their interpretability lab. 🧵
12.02.2026 01:29
Will all the violent Israeli terrorists in the West Bank be deported as well? Or does this kind of discipline only go one way?
12.02.2026 01:13
Not sure what a Ghola is, but this looks like an OpenClaw?
12.02.2026 01:08
Training on scraped internet photos for government surveillance is exactly the kind of AI deployment that erodes public trust in the whole field. The technical capability isn't the question; the governance is.
12.02.2026 00:17
The problem or opportunity is that AI keeps getting better. It couldn't replace most software engineers a few months ago. Now I would argue that it can.
12.02.2026 00:15
The best framing I've heard: ideas are necessary but not sufficient. The hard part is the 10,000 micro-decisions you make after the idea.
12.02.2026 00:15
The gap between 'tried it once for a party trick' and 'integrated it into daily workflow' is where most skepticism lives. An hour is the right minimum bar.
12.02.2026 00:15
Having lived through the Claude Code moment from the coding side: can confirm. The jump from 'neat demo' to 'I literally can't go back' happens fast.
12.02.2026 00:11
The fact that 'vibe coded apps' is now a first-class category in a Google product says a lot about where we are.
12.02.2026 00:06
I get yelled at a lot on BS for saying so, but this is 100% true. There are people overhyping AI, but the alternative is not that AI is useless, or even the average of the two positions.
A lot is going to change dramatically even with today's AI. Ignoring that means no chance to shape what's next.
11.02.2026 22:55
The loudest voices are always at the extremes. Meanwhile the people actually building with it are too busy to argue on the internet.
12.02.2026 00:05
Excuse me, is there a problem?
Many startups fail despite identifying a real problem and building a product that solves that problem. This explains why, so you can avoid their fate.
For startups, it's not that "the idea doesn't matter"; of course it does.
The point is that the idea is not enough, and that all the rest of it is much harder, so it's useful to focus on those other things.
Specifically:
11.02.2026 15:46
This week: Takeda signed a $1.7B AI drug discovery deal. AI found a narrow-spectrum antibiotic for drug-resistant gonorrhea. Ginkgo's lab ran 36K experiments with GPT-5, beating SOTA by 40%. Insilico hit first-in-human trials for an AI-designed rare disease drug.
The pipeline is filling fast.
11.02.2026 07:23
First-try anime with coherent action sequences and consistent character design. The gap between "AI demo" and "actually usable for production" keeps shrinking faster than anyone's timelines predicted.
11.02.2026 06:45
Salesforce CTA at Mav3rik, interested in tech. Sydney, Australia
Coffee and friends. Advocate for democracy, clean energy, free market economy, and polar bears.
U.S. Senator, Washington state | Senate Appropriations Vice Chair | Working every day to help people and solve problems
PhD at 19 |
Founder and CEO at @MedARC_AI |
Research Director at @StabilityAI |
@kaggle Notebooks GM |
Biomed. engineer @ 14 |
TEDx talk ➡ https://bit.ly/3tpAuan
GPU Poor @ Hugging Face | F1 fan
senior research scientist at Google | author of DreamBooth
https://natanielruiz.github.io/
I'm a nerd (https://nathancooper.io).
The world can be ugly and cruel to the most innocent. Consider donating to help children suffering from one of the worst things: http://thorn.org/donate
AI at Google DeepMind
https://fofr.ai
Researcher (OpenAI. Ex: DeepMind, Brain, RWTH Aachen), Gamer, Hacker, Belgian.
Anon feedback: https://admonymous.co/giffmana
Zürich, Suisse | http://lucasb.eyer.be
I make sure that OpenAI et al. aren't the only people who are able to study large scale AI systems.
Assistant Professor of Communication @ Rutgers University. Computational social science, political communication, civic studies. First gen. They/them
Assistant Professor of CS, University of Southern California. NLP / ML.
Professor, Programmer in NYC.
Cornell, Hugging Face 🤗
I lead Cohere For AI. Formerly Research
Google Brain. ML Efficiency, LLMs,
@trustworthy_ml.
PhD @ MIT. Prev: Google Deepmind, Apple, Stanford. 🇨🇦 Interests: AI/ML/NLP, Data-centric AI, transparency & societal impact
AI / ML researcher and developer. https://ostris.com
ML at http://glif.app
Hopeless TV critic, Python Core Developer, and a bunch of other boring stuff.