Tom Adamczewski

@tadamcz.bsky.social

senior technology brother @epochai.bsky.social tadamcz.com 📍London

160 Followers  |  184 Following  |  175 Posts  |  Joined: 04.07.2023

Latest posts by tadamcz.bsky.social on Bluesky

Post image

how do you do, my fellow web browser inputters?

11.10.2025 12:41 | 👍 0    🔁 0    💬 0    📌 0
Post image

git blow-up? are you OK there lil buddy? :(

06.10.2025 16:07 | 👍 0    🔁 0    💬 0    📌 0
Preview
How many Beatles fans end up at the wrong Abbey Road?
Beatles fans looking for a famous zebra crossing need to go to Abbey Road in St John's Wood near Regents Park, but many accidentally end up at Abbey Road DLR station in east London by mistake.

Link (note this blogger's stats about the number of people affected are not credible, see Toby's astute remark in the comments)

www.ianvisits.co.uk/articles/ho...

04.10.2025 12:32 | 👍 0    🔁 0    💬 0    📌 0
Post image

Amazing sign at the Abbey Road DLR station

04.10.2025 12:32 | 👍 0    🔁 0    💬 1    📌 0
Preview
AI Benchmarking Dashboard
Our database of benchmark results, featuring the performance of leading AI models on challenging tasks. It includes results from benchmarks evaluated internally by Epoch AI as well as data collected from external sources. The dashboard tracks AI progress over time, and correlates benchmark scores with key factors like compute or model accessibility.

Data, and links to full traces on every problem: epoch.ai/benchmarks

29.09.2025 21:23 | 👍 1    🔁 0    💬 0    📌 0
Post image

Sonnet 4.5 sets a new SOTA of 65% (±2%) on SWE-bench with our scaffold (based on SWE-agent). The new model beats Sonnet 4 by 4 percentage points.

Eyeballing the plot, the SOTA improvement seems to be slowing down, compared to the progress we saw between Sonnet 3.5 and Opus 4.

29.09.2025 21:23 | 👍 3    🔁 0    💬 1    📌 0
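
For readers wondering where an error bar like ±2% might come from, here is a minimal sketch. It assumes, which the post does not state, that the score is a simple pass rate over 500 problems (the size of SWE-bench Verified) and that the ± figure is one binomial standard error. The actual method behind the published number may differ.

```python
import math


def binomial_standard_error(p: float, n: int) -> float:
    """Standard error of a proportion p estimated from n independent trials."""
    return math.sqrt(p * (1.0 - p) / n)


# Assumptions, not stated in the post: 500 problems, and "±" meaning one standard error.
score = 0.65
n_problems = 500

se = binomial_standard_error(score, n_problems)
print(f"{score:.0%} ± {se:.1%}")  # prints "65% ± 2.1%"
```

Under these assumptions the standard error comes out at roughly 2 percentage points, in line with the figure quoted in the post.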

great coinage by @greghburnham today:

pass@the-kitchen-sink

On a benchmark, count all problems that _any_ LLM/scaffold/system has ever solved at least once.

29.09.2025 16:39 | 👍 0    🔁 0    💬 0    📌 0
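
The definition of pass@the-kitchen-sink above amounts to a set union, and a small sketch makes that concrete. The function name, the system names, and the problem IDs are all made up for illustration, and the input format (one set of solved problem IDs per system) is an assumption.

```python
def pass_at_the_kitchen_sink(
    solved_by_system: dict[str, set[str]], all_problems: set[str]
) -> float:
    """Fraction of problems solved at least once by any system."""
    if not all_problems:
        raise ValueError("benchmark has no problems")
    # Union over every system's solved set, restricted to known problems.
    ever_solved = set().union(*solved_by_system.values()) & all_problems
    return len(ever_solved) / len(all_problems)


# Hypothetical data: two systems, four problems.
problems = {"p1", "p2", "p3", "p4"}
runs = {
    "system_a": {"p1", "p2"},
    "system_b": {"p2", "p3"},
}

kitchen_sink = pass_at_the_kitchen_sink(runs, problems)
print(kitchen_sink)  # prints 0.75
```

Neither system here scores above 50% alone, yet the union reaches 75%, which is why this metric reads as an upper bound on what has been shown to be solvable.
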
Post image

The three lines running normally have one thing in common...

Automate the unionized fuckers away. It Just Works.

18.09.2025 15:57 | 👍 0    🔁 0    💬 0    📌 0

[precisification needed]

08.09.2025 19:08 | 👍 0    🔁 0    💬 0    📌 0
Preview
Claude 4 hacked SWE-bench by peeking at future commits
I predicted future AI models would someday learn to cheat on SWE-bench. They were already doing it.

My full notes:
bayes.net/swebench-hack/

05.09.2025 16:02 | 👍 0    🔁 0    💬 0    📌 0

If you want to hack on this, it's easy to do with our Docker image registry.

Just drop a shell into the container with

`docker run -it ghcr.io/epoch-research/swe-bench.eval.x86_64.django__django-13513 bash`

(JTBC, these images don't have the fixes yet, do them manually)

05.09.2025 16:02 | 👍 0    🔁 0    💬 1    📌 0

I spent a few hours today trying to find other loopholes: checking packfiles, commit-graphs, and other weird git internals.

I didn't find any. But I'm no expert. git is a big program with some arcane features, so who knows what else is possible? Can you find a better hack?

05.09.2025 16:02 | 👍 0    🔁 0    💬 1    📌 0
Preview
Comparing main...fix/git-log-leak · SWE-bench/SWE-bench
SWE-bench [Multimodal]: Can Language Models Resolve Real-world Github Issues?

To fix this, `git gc --prune=now` should work to remove unreachable objects.

(SWE-bench authors already have this fix underway:
github.com/SWE-bench/S...)

05.09.2025 16:02 | 👍 0    🔁 0    💬 1    📌 0
Post image

Even after deleting all tags, a determined cheater could use `git fsck --lost-found` to uncover dangling tags and commits, then check them out directly.

05.09.2025 16:02 | 👍 0    🔁 0    💬 1    📌 0
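
To illustrate why deleting tags is not enough, here is a toy model of an object store. It is a sketch of the idea only: commits are plain strings, the store is a dict of parent links, and the helper names are invented. Real git also consults reflogs, packfiles, and tag objects, all of which this ignores.

```python
def reachable_from(parents: dict[str, list[str]], tips: list[str]) -> set[str]:
    """Commits reachable by walking parent links from the given tips."""
    seen: set[str] = set()
    stack = list(tips)
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return seen


def dangling(parents: dict[str, list[str]], ref_tips: list[str]) -> set[str]:
    """Unreachable commits that no other unreachable commit points to."""
    unreachable = set(parents) - reachable_from(parents, ref_tips)
    referenced = {p for c in unreachable for p in parents[c]}
    return unreachable - referenced


def prune(parents: dict[str, list[str]], ref_tips: list[str]) -> dict[str, list[str]]:
    """Drop every commit that no ref can reach (the effect pruning aims for)."""
    keep = reachable_from(parents, ref_tips)
    return {c: ps for c, ps in parents.items() if c in keep}


# Hypothetical store: c1 <- c2 <- c3, where c3 carried a tag that was deleted.
store = {"c1": [], "c2": ["c1"], "c3": ["c2"]}
ref_tips = ["c1"]  # the only remaining ref points at the pre-fix commit

found = dangling(store, ref_tips)               # {"c3"}: still in the store
recovered = reachable_from(store, list(found))  # walking back from c3 exposes c2 too
pruned = prune(store, ref_tips)                 # only c1 survives
```

In this model the deleted tag's commit stays discoverable until the store is pruned, which is the gap the `git gc --prune=now` fix mentioned in this thread is meant to close.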

The problem I had noticed in July: SWE-bench only removed the default remote ("so the agent won't see newer commits"), but didn't think about other refs like tags. git log --all shows ALL commits reachable from any ref: branches, tags, remotes, etc.

05.09.2025 16:02 | 👍 0    🔁 0    💬 1    📌 0
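
The difference between walking history from the current checkout and walking it from every ref can be sketched with a toy commit graph. Everything here is hypothetical: the commit names, the tag name, and the `reachable` helper. It models the idea behind `git log` versus `git log --all`, not git's actual implementation.

```python
from collections import deque


def reachable(parents: dict[str, list[str]], tips: list[str]) -> set[str]:
    """Breadth-first walk over parent links, starting from the given tips."""
    seen: set[str] = set()
    queue = deque(tips)
    while queue:
        commit = queue.popleft()
        if commit in seen:
            continue
        seen.add(commit)
        queue.extend(parents.get(commit, []))
    return seen


# Hypothetical history: c1 <- c2 <- c3, where c3 contains the fix.
parents = {"c1": [], "c2": ["c1"], "c3": ["c2"]}

# The remote was removed, but a tag still points at the future commit.
refs = {"HEAD": "c1", "refs/tags/v2.0": "c3"}

visible_from_head = reachable(parents, [refs["HEAD"]])    # like `git log`
visible_with_all = reachable(parents, list(refs.values()))  # like `git log --all`
leaked = visible_with_all - visible_from_head
print(sorted(leaked))  # prints ['c2', 'c3']
```

Any ref left behind that points past the checkout is enough to expose the future commits.
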
Post image

> Now I understand the situation perfectly! The issue described in the problem statement is a real bug that was already identified and fixed in later versions of pytest. Since we're working with pytest 5.2.4, we need to apply the same fix.

You're absolutely right, Claude :)

05.09.2025 16:02 | 👍 0    🔁 0    💬 1    📌 0
Post image

In one case, Claude 4 Sonnet cleverly searched for commit messages containing keywords relevant to the issue:

`git log --oneline --all | grep -i "bracket\|parametrize\|modpath"`

05.09.2025 16:02 | 👍 0    🔁 0    💬 1    📌 0
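
The keyword search above is a case-insensitive match of any of several terms against one-line commit summaries. A minimal Python equivalent is sketched below. The log text is invented for illustration and is not real pytest history, and `grep_commits` is a hypothetical helper name.

```python
import re


def grep_commits(oneline_log: str, keywords: list[str]) -> list[str]:
    """Return log lines that contain any keyword, ignoring case."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return [line for line in oneline_log.splitlines() if pattern.search(line)]


# Hypothetical `git log --oneline --all` output.
log = "\n".join(
    [
        "a1b2c3d Fix modpath for parametrized ids with brackets",
        "d4e5f6a Update changelog",
        "0f9e8d7 Improve Parametrize docs",
    ]
)

matches = grep_commits(log, ["bracket", "parametrize", "modpath"])
for line in matches:
    print(line)
```

Note that alternation needs explicit support in the pattern language: Python's `re` treats `|` as alternation by default, while plain `grep` needs `-E` or an escaped `\|` (a GNU extension).
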
Post image

Someone just spotted the reward hacking behaviour I had hypothesised: AI models were using `git log --all` to look at future repo state, which makes it trivial to solve the issues.

(I didn't know about the `--all` flag, which makes this hack even easier)

github.com/SWE-bench/S...

05.09.2025 16:02 | 👍 0    🔁 0    💬 1    📌 0

In July, I predicted future AI models might someday learn to cheat on SWE-bench by accessing future git commits (e.g. via git tags)

Turns out, they were already doing it.

05.09.2025 16:02 | 👍 3    🔁 0    💬 1    📌 0

GUYS i'm going to be on geoguessr

looked out the window and saw the Google Street view car roll by; I ran out and caught it! waved to the driver, seemed like a chill guy.

(no pics, literally ran out the door without my phone)

05.09.2025 06:47 | 👍 0    🔁 0    💬 0    📌 0

half-pint still an incredible british institution. always costs half the price of a pint. cost-effective, preserves option value, chic. empowers healthy choices.

Everyone uses psychologymaxxed pricing to squeeze max surplus from you. Pub just says: half as much pint, half price

25.08.2025 15:03 | 👍 1    🔁 0    💬 0    📌 0
Post image

incredible name

12.08.2025 21:13 | 👍 1    🔁 0    💬 0    📌 0
Post image

Really curious if any of the AlgoTune speedups have been submitted as PRs to the repos?

12.08.2025 09:29 | 👍 0    🔁 0    💬 0    📌 0
Post image

Love the lack of nationalism among philosophers that led to the decision to have the British Journal for the Philosophy of Science published by an American press

07.08.2025 09:08 | 👍 0    🔁 0    💬 0    📌 0
Post image

Spotted in Gemma 3 technical report: 18% tip in Switzerland lol?

I hope this is a Zurich-based DeepMind researcher artfully trolling American colleagues

20.07.2025 21:26 | 👍 0    🔁 0    💬 0    📌 0
Post image

Claude Code, when run in an IDE's terminal, immediately auto-installs the Claude Code IDE extension, without asking for user approval.

wtf, Anthropic? Not cool.

This feels like a 2005 Adobe Flash Player update trying to sneak the Yahoo! Toolbar past your grandma

20.07.2025 09:58 | 👍 0    🔁 0    💬 0    📌 0

Here is the link to the image registry.

We follow the same image naming convention as the SWE-bench authors. Example usage:

`docker pull ghcr.io/epoch-research/swe-bench.eval.x86_64.astropy__astropy-13236`

github.com/orgs/epoch-...

16.07.2025 13:38 | 👍 0    🔁 0    💬 0    📌 0
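
Going by the two image names quoted in these posts, the reference for an issue appears to be a fixed prefix followed by the SWE-bench instance ID. The sketch below builds references on that assumption. It reads "ghcr dot io" as `ghcr.io`, infers the `x86_64` segment from the examples only, and uses made-up helper names.

```python
# Prefix inferred from the two examples in the posts; treat it as an assumption.
REGISTRY_PREFIX = "ghcr.io/epoch-research/swe-bench.eval.x86_64"


def image_ref(instance_id: str) -> str:
    """Build the image reference for one SWE-bench instance ID."""
    return f"{REGISTRY_PREFIX}.{instance_id}"


def pull_command(instance_id: str) -> list[str]:
    """Argument list for pulling the image, suitable for subprocess.run."""
    return ["docker", "pull", image_ref(instance_id)]


print(image_ref("astropy__astropy-13236"))
print(" ".join(pull_command("django__django-13513")))
```

Returning the command as a list rather than a string avoids shell quoting issues if it is later passed to `subprocess.run`.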

The post linked above explains the project and describes some of the optimisations I made in more detail.

By improving layer caching, I reduced the total size of the registry to 67 GiB for all 2290 SWE-bench images and to 30 GiB for 500 SWE-bench Verified images.

16.07.2025 13:38 | 👍 0    🔁 0    💬 1    📌 0
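
As a quick back-of-the-envelope check on the registry sizes above, the average storage per image can be worked out directly from the quoted totals. The only inputs are the figures in the post, and the helper name is made up.

```python
GIB_IN_MIB = 1024


def average_mib_per_image(total_gib: float, image_count: int) -> float:
    """Average registry storage per image, in MiB."""
    return total_gib * GIB_IN_MIB / image_count


full = average_mib_per_image(67, 2290)     # all SWE-bench images
verified = average_mib_per_image(30, 500)  # SWE-bench Verified images
print(f"{full:.1f} MiB, {verified:.1f} MiB")  # prints "30.0 MiB, 61.4 MiB"
```

These are averages of deduplicated storage. Layers shared between images are stored once in the registry, so pulling a single image will generally transfer more than its average share.
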
Preview
How to run SWE-bench Verified in one hour on one machine
We are releasing a public registry of optimized Docker images for SWE-bench. This allows us to run SWE-bench Verified in 62 minutes on a single GitHub actions VM.

At @epochai.bsky.social we run evals of SWE-bench Verified in one hour, on a single 32-core machine.

What makes this possible is a registry of optimized Docker images for each issue in SWE-bench.

We are open-sourcing these Docker images: you can `docker pull` them

epoch.ai/blog/sweben...

16.07.2025 13:38 | 👍 0    🔁 0    💬 1    📌 0
Post image

The neighbourhood is very economically heterogeneous: plenty of mansions & Porsches, but plenty of council housing too (and everything in between).

Glad that Susan, 67, can walk from her £1.8 million 4-bedroom house (a modest property purchased in 1985) to a cheap local massage!

14.07.2025 12:41 | 👍 1    🔁 0    💬 0    📌 0
