how do you do, my fellow web browser inputters?
11.10.2025 12:41 โ ๐ 0 ๐ 0 ๐ฌ 0 ๐ 0@tadamcz.bsky.social
senior technology brother @epochai.bsky.social tadamcz.com ๐London
how do you do, my fellow web browser inputters?
11.10.2025 12:41 โ ๐ 0 ๐ 0 ๐ฌ 0 ๐ 0git blow-up? are you OK there lil buddy? :(
06.10.2025 16:07 โ ๐ 0 ๐ 0 ๐ฌ 0 ๐ 0Link (note this blogger's stats about the number of people affected are not credible, see Toby's astute remark in the comments)
www.ianvisits.co.uk/articles/ho...
Amazing sign at the Abbey Wood DLR station
04.10.2025 12:32 โ ๐ 0 ๐ 0 ๐ฌ 1 ๐ 0Data, and links to full traces on every problem: epoch.ai/benchmarks
29.09.2025 21:23 โ ๐ 1 ๐ 0 ๐ฌ 0 ๐ 0Sonnet 4.5 sets a new SOTA of 65% (ยฑ2%) on SWE-bench with our scaffold (based on SWE-agent). The new model beats Sonnet 4 by 4 percentage points.
Eyeballing the plot, the SOTA improvement seems to be slowing down, compared to the progress we saw between Sonnet 3.5 and Opus 4.
great coinage by @greghburnham today:
pass@the-kitchen-sink
On a benchmark, count all problems that _any_ LLM/scaffold/system has ever solved at least once.
The three lines running normally have one thing in common...
Automate the unionized fuckers away. It Just Works.
[precisification needed]
08.09.2025 19:08 โ ๐ 0 ๐ 0 ๐ฌ 0 ๐ 0My full notes:
bayes.net/swebench-hack/
If you want to hack on this, it's easy to do with our Docker image registry.
Just drop a shell into the container with
`docker run -it ghcr dot io/epoch-research/swe-bench.eval.x86_64.django__django-13513 bash`
(JTBC, these images don't have the fixes yet, do them manually)
I spent a few hours today trying to find other loopholes: checking packfiles, commit-graphs, and other weird git internals.
I didn't find any. But I'm no expert. git is a big program with some arcane features, so who knows what else is possible? Can you find a better hack?
To fix this, `git gc --prune=now` should work to remove unreachable objects.
(SWE-bench authors already have this fix underway:
github.com/SWE-bench/S...)
Even after deleting all tags, a determined cheater could use git fsck --lost-found to uncover dangling tags and commits, then checkout directly to them.
05.09.2025 16:02 โ ๐ 0 ๐ 0 ๐ฌ 1 ๐ 0The problem I had noticed in July: SWE-bench only removed the default remote ("so the agent won't see newer commits"), but didn't think about other refs like tags. git log --all shows ALL commits reachable from any ref: branches, tags, remotes, etc.
05.09.2025 16:02 โ ๐ 0 ๐ 0 ๐ฌ 1 ๐ 0> Now I understand the situation perfectly! The issue described in the problem statement is a real bug that was already identified and fixed in later versions of pytest. Since we're working with pytest 5.2.4, we need to apply the same fix.
You're absolutely right, Claude :)
In one case, Claude 4 Sonnet cleverly searched for commit messages containing keywords relevant to the issue:
git log --oneline --all | grep -i "bracket|parametrize|modpath"
Someone just spotted the reward hacking behaviour I had hypothesised: AI models were using `git log --all` to look at future repo state, which makes it trivial to solve the issues.
(I didn't know about the `--all` flag, which makes this hack even easier)
github.com/SWE-bench/S...
In July, I predicted future AI models might someday learn to cheat on SWE-bench by accessing future git commits (e.g. via git tags)
Turns out, they were already doing it.
GUYS i'm going to be on geoguessr
looked out the window and saw the Google Street view car roll by; I ran out and caught it! waved to the driver, seemed like a chill guy.
(no pics, literally ran out the door without my phone)
half-pint still an incredible british institution. always costs half the price of a pint. cost-effective, preserves option value, chic. empowers healthy choices.
Everyone uses psychologymaxxed pricing to squeeze max surplus from you. Pub just says: half as much pint, half price
incredible name
12.08.2025 21:13 โ ๐ 1 ๐ 0 ๐ฌ 0 ๐ 0Really curious if any of the AlgoTune speedups have been submitted as PRs to the repos?
12.08.2025 09:29 โ ๐ 0 ๐ 0 ๐ฌ 0 ๐ 0Love the lack of nationalism among philosophers that led to the decision to have the British Journal for the Philosophy of Science published by an American press
07.08.2025 09:08 โ ๐ 0 ๐ 0 ๐ฌ 0 ๐ 0Spotted in Gemma 3 technical report: 18% tip in Switzerland lol?
I hope this is a Zurich-based DeepMind researcher artfully trolling American colleagues
Claude Code, when run in an IDE's terminal, immediately auto-installs the Claude Code IDE extension, without asking for user approval.
wtf, Anthropic? Not cool.
This feels like a 2005 Adobe Flash Player update trying to sneak the Yahoo! Toolbar past your grandma
Here is the link to the image registry.
We follow the same image naming convention as the SWE-bench authors. Example usage:
`docker pull ghcr dot io/epoch-research/swe-bench.eval.x86_64.astropy__astropy-13236`
github.com/orgs/epoch-...
The post linked above explains the project and describes some of the optimisations I made in more detail.
By improving layer caching, I reduced the total size of the registry to 67 GiB for all 2290 SWE-bench images and to 30 GiB for 500 SWE-bench Verified images.
At @epochai.bsky.socialโฌ we run evals of SWE-bench Verified in one hour, on a single 32-core machine.
What makes this possible is a registry of optimized Docker images for each issue in SWE-bench.
We are open-sourcing these Docker images: you can `docker pull` them
epoch.ai/blog/sweben...
The neighbourhood is very economically heterogenous: plenty of mansions & Porsches, but plenty of council housing too (and everything in between).
Glad that Susan, 67, can walk from her ยฃ1.8 million 4-bedroom house (a modest property purchased in 1985) to a cheap local massage!