@deedydas.bsky.social
VC at Menlo Ventures. Formerly founding team Glean, Google Search. Cornell CS. Tweets about tech, immigration, India, fitness and search.
Try it here: yiyan.baidu.com
16.03.2025 17:17
Baidu, the Google of China, just dropped two models today:
→ ERNIE 4.5: beats GPT 4.5 for 1% of price
→ Reasoning model X1: beats DeepSeek R1 for 50% of price
China continues to build intelligence too cheap to meter. The AI price war is on.
"Make it look like I was on a luxury five star hotel vacation"
Google Gemini really cooked with this one.
This is next gen photo editing.
WOW the new Google Flash model is the first time ever that you can do targeted edits of pictures with English.
"Make the steak vegetarian"
"Make the bridge go away"
"Make the keyboard more colorful"
And my favorite
"Give the OpenAI logo more personality"
Source: www.nature.com/articles/d4...
09.03.2025 03:40
AI is now making cutting edge science better.
Nature reports that reasoning LLMs found errors in 1% of the 10,000 research papers analyzed, with a 35% false-positive rate, at a cost of $0.15-$1 per paper.
Anthropic founder's view of "a country of geniuses in a data center" is happening.
Source: arxiv.org/pdf/2503.00735
07.03.2025 17:02
HUGE: New research paper shows how a 7B param AI model (90%) can beat OpenAI o1 (80%) on the MIT Integration Bee.
LADDER:
→ Generate variants of problem
→ Solve, verify, use GRPO (DeepSeek) to learn
TTRL:
→ Do steps 1 & 2 when you see a new problem
New form of test time compute scaling!
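A minimal sketch of that recipe as described above, not the paper's code: the variant generator, sampler, and verifier below are placeholder stubs standing in for LLM calls and a numeric integral checker, while the group-relative advantage is the GRPO piece.

```python
import random
from statistics import mean, pstdev

# Placeholder stubs (hypothetical): in the paper these would be LLM calls plus
# a numeric checker that verifies a candidate integral by differentiating it.
def generate_variants(problem, n):
    return [f"{problem} [easier variant {i}]" for i in range(n)]

def sample_solutions(problem, k):
    return [f"candidate solution {i}" for i in range(k)]

def verify(problem, solution):
    return random.random() < 0.5          # stand-in for a real pass/fail check

def grpo_advantages(rewards):
    # GRPO-style group-relative advantage: score each sample against its own group.
    mu, sigma = mean(rewards), pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]

def ladder_step(hard_problem, n_variants=4, k_samples=4):
    for variant in generate_variants(hard_problem, n_variants):           # 1. easier variants
        solutions = sample_solutions(variant, k_samples)                   # 2. solve ...
        rewards = [1.0 if verify(variant, s) else 0.0 for s in solutions]  # ... and verify
        advantages = grpo_advantages(rewards)                              # 3. reinforce winners
        print(variant, advantages)   # a real run would do a policy-gradient update here

ladder_step("integrate x * exp(x) dx")   # TTRL = run this same loop on each new test problem
```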
There are two categories:
→ Daytona, for general-purpose sorting. The headline numbers below are Daytona.
→ Indy, which is specific to 100-byte records with 10-byte keys (see the sketch below).
Not super useful in practice though.
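To show the record format the Indy category fixes, here is a toy in-memory illustration of mine; real benchmark entries are distributed external sorts, not a one-machine list sort.

```python
import os

RECORD = 100   # fixed record size in bytes (Indy category)
KEY = 10       # first 10 bytes of each record are the sort key

def indy_sort(blob: bytes) -> bytes:
    # Split into fixed-width records and sort by the 10-byte key prefix.
    records = [blob[i:i + RECORD] for i in range(0, len(blob), RECORD)]
    records.sort(key=lambda r: r[:KEY])
    return b"".join(records)

data = os.urandom(RECORD * 1000)          # 1,000 random 100-byte records
out = indy_sort(data)
assert out[:KEY] <= out[RECORD:RECORD + KEY]   # keys now come out in order
```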
Link: sortbenchmark.org/
Google experiments on it: sortbenchmark.org/
How well can computers sort 1 trillion numbers?
SortBenchmark, a benchmark for distributed systems, measures this.
→ How fast? 134s
→ How cheap? $97
→ How many in 1 minute? 370B numbers
→ How much energy? ~59kJ, or walking for 15 mins
Every software engineer should know this.
Source: github.com/deepseek-ai...
01.03.2025 05:07
BREAKING: DeepSeek just let the world know they make $200M/yr at a 500%+ profit margin.
Revenue (/day): $562k
Cost (/day): $87k
Revenue (/yr): ~$205M
This is all while charging $2.19/M tokens on R1, ~25x less than OpenAI o1.
If this was in the US, this would be a >$10B company.
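The yearly figure and the margin follow directly from the daily numbers (my quick arithmetic check, not DeepSeek's accounting):

```python
revenue_per_day, cost_per_day = 562_000, 87_000
print(revenue_per_day * 365 / 1e6)                        # ~205 -> ~$205M/yr
print((revenue_per_day - cost_per_day) / cost_per_day)    # ~5.46 -> the "500%+" margin
```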
Source: claude.site/artifacts/3...
26.02.2025 02:34
Claude's new GitHub "talk to your code" integration changes how engineers understand software.
Fork a repo.
Select a folder.
Ask it anything.
It even shows you what percentage of the context window each folder takes.
Here it visualizes yt-dlp's (the YouTube downloader) flow:
Perplexity: www.perplexity.ai/search/read...
OpenAI: chatgpt.com/share/67a41...
Gemini: docs.google.com/document/d/...
I asked all 3 Deep Researches to "compare 25 LLMs in a table on 20 axes" to figure out which one was the best.
The winner was OpenAI.
It had the most detailed, high-quality and accurate answer, but you do pay $200/mo for it.
Source: academics.hamilton.edu/documents/t...
14.02.2025 02:47
"The Mundanity of Excellence" [1989] is a timeless essay everyone ought to read in today's day and age.
Excellence is boring. It's making the same boring "correct" choice over and over again. You win by being consistent for longer.
Our short attention spans tend to forget that.
Source: arxiv.org/pdf/2502.06807
(Check out the detailed code submissions and scoring in the appendix)
HUGE: OpenAI o3 scores 394 of 600 in the International Olympiad in Informatics (IOI) 2024, earning a Gold medal and placing 18th in the world.
The model was NOT contaminated with this data and the 50 submission limit was used.
We will likely see superhuman coding models this year.
12.02.2025 03:06
Everyone should be using this website to understand the inside of an LLM.
I'm surprised more people don't know about it. Brendan Bycroft made this beautiful interactive visualization showing exactly how each weight inside an LLM is used.
Here's a link: bbycroft.net/llm
Source: chatgpt.com/share/67a40...
Full step by step: www.reddit.com/r/ChatGPTPr...
New research shows that LLMs don't perform well on long context.
Perfect needle-in-the-haystack scores are easy: attention mechanisms can simply match the word. When you require even one hop of reasoning, performance degrades quickly.
This is why guaranteeing correctness for agents is hard.
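To make the distinction concrete, a hypothetical illustration of the two prompt types (mine, not the paper's exact benchmark):

```python
filler = "The sky was grey over the harbour that morning. " * 2000   # long distractor text

question = "What is the vault code?"

# 0-hop needle: the answer sentence shares words with the question, so attention can just match.
zero_hop = filler + "The vault code is 4172. " + filler

# 1-hop needle: two separated facts must be combined before the question can be answered.
one_hop = (filler + "The vault code is the same as Priya's locker number. "
           + filler + "Priya's locker number is 4172. " + filler)
```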
Source: www.youtube.com/watch?v=8Lm...
09.02.2025 02:34
Internal OpenAI models have improved to ~#50 in the world (3045 on Codeforces) and will hit #1 by the end of the year, Sam Altman said yesterday in Japan. That's up from o3's 2727 (~#175)!
This is a monumental result en route to AGI.
Source: www.sergey.fyi/articles/ge...
06.02.2025 17:37
PDF parsing is pretty much solved at scale now.
Gemini 2 Flash's $0.40/M tokens and 1M token context means you can now parse 6000 long PDFs at near perfect quality for $1
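Back-of-envelope on where a number in that ballpark comes from; the tokens-per-page figure is my assumption, not from the post:

```python
price_per_m_output = 0.40      # $ per 1M Gemini 2 Flash output tokens (the post's figure)
tokens_per_page = 400          # rough markdown output for one dense PDF page (assumption)
pages_per_dollar = 1_000_000 / price_per_m_output / tokens_per_page
print(pages_per_dollar)        # ~6,250 pages of PDF parsed per $1
```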
Who has better Deep Research, Google or OpenAI?
Deep research generates ~10 page reports in ~15mins by scouring 100s of websites. This could replace a lot of human work. I tried both so you don't have to.
The verdict: OpenAI is faster and better quality despite being more $$
Gemini's just launched their new Flash models. They are cheaper, better and have 8x the context of GPT 4o-mini!
Prices per million tokens (cached input, input, output):
Gemini 2 Flash Lite: $0.01875, $0.075, $0.30
Gemini 2 Flash: $0.025, $0.1, $0.40
GPT 4o-mini: $0.075, $0.15, $0.60
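As a concrete comparison, the cost of one hypothetical 100k-token-input / 10k-token-output call under each model (my arithmetic, using the figures above):

```python
# (cached input, input, output) in $ per 1M tokens, from the list above
prices = {
    "Gemini 2 Flash Lite": (0.01875, 0.075, 0.30),
    "Gemini 2 Flash":      (0.025,   0.10,  0.40),
    "GPT 4o-mini":         (0.075,   0.15,  0.60),
}
tokens_in, tokens_out = 100_000, 10_000   # hypothetical request, no cache hits
for model, (_, p_in, p_out) in prices.items():
    cost = tokens_in / 1e6 * p_in + tokens_out / 1e6 * p_out
    print(f"{model}: ${cost:.4f}")
# Gemini 2 Flash Lite: $0.0105, Gemini 2 Flash: $0.0140, GPT 4o-mini: $0.0210
```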