
Danny Wood

@echostatements.bsky.social

AI Research Scientist || PhD in machine learning || Ensembles, probabilistic machine learning, recurrent neural networks || https://echostatements.github.io

89 Followers  |  181 Following  |  35 Posts  |  Joined: 09.08.2023

Latest posts by echostatements.bsky.social on Bluesky

Why Busy Beaver Hunters Fear the Antihydra
In which I explore the biggest barrier in the busy beaver game. What is Antihydra, what is the Collatz conjecture, how are they connected, and what makes them so daunting?

I published a new post on my rarely updated personal blog! It's a sequel of sorts to my Quanta coverage of the Busy Beaver game, focusing on a particularly fearsome Turing machine known by the awesome name Antihydra.
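For anyone who hasn't met it, here is the Collatz map the post refers to, as a quick sketch (this is just the classic conjecture, not the Antihydra's own iteration rule):

```python
def collatz_steps(n: int) -> int:
    """Count iterations of the Collatz map (n -> n/2 if even, 3n+1 if odd) until n reaches 1.

    The Collatz conjecture says this loop terminates for every positive integer n;
    nobody has proved it, which is why Collatz-like Turing machines such as
    Antihydra are so hard to settle.
    """
    steps = 0
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps

print(collatz_steps(27))  # 27 famously takes 111 steps to reach 1
```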

27.10.2025 16:04 · 👍 33    🔁 7    💬 2    📌 3
Alexander horned sphere - Wikipedia

The video is a fun watch, but if you don't check it out, the counterexample I'm referring to is the Alexander horned sphere

en.wikipedia.org/wiki/Alexand...

26.10.2025 13:43 · 👍 0    🔁 0    💬 0    📌 0
The Most Obvious Theorem in All of Mathematics (YouTube video by Metamorphic)

I'm a big fan of when counterexamples to intuitive-sounding theorems are so convoluted that, outside of mathematics, the kind of rules-lawyering needed to construct them would be considered pedantic or even rude

youtu.be/pLgcZLysOFk?...

26.10.2025 13:39 · 👍 1    🔁 0    💬 1    📌 0

It's interesting that its blend of coherence and unexpected nonsense captures a dreamy quality that most things described as "dreamlike" don't quite match for me

24.10.2025 17:30 · 👍 1    🔁 0    💬 0    📌 0
Blocky Planet – Making Minecraft Spherical
Discover the unique design challenges of creating a spherical planet out of Minecraft-like blocks.

How do you make Minecraft spherical? A really fun read about all the problem solving that goes into transferring Minecraft gameplay onto a spherical world

www.bowerbyte.com/posts/blocky...

21.10.2025 19:35 · 👍 1    🔁 0    💬 0    📌 0
What the Books Get Wrong about AI [Double Descent] (YouTube video by Welch Labs)

Really nice primer on double-descent and the bias-variance trade-off

I'm impressed by the depth that Welch Labs consistently manages to pack into their videos without sacrificing the storytelling for a popular science audience

www.youtube.com/watch?v=z64a...
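As a quick refresher (my notation, not the video's), the squared-error bias-variance decomposition underlying the discussion, for data y = f(x) + ε with noise variance σ² and a model f̂ fit on a random training set D, is:

```latex
\mathbb{E}_{D,\varepsilon}\big[(y - \hat{f}_D(x))^2\big]
  = \underbrace{\big(f(x) - \mathbb{E}_D[\hat{f}_D(x)]\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_D\Big[\big(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)]\big)^2\Big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```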

20.10.2025 09:16 · 👍 2    🔁 1    💬 0    📌 0

How best to define "open problem" depends largely on whether you start off with a problem metric or just a problem topology

19.10.2025 10:56 · 👍 2    🔁 0    💬 0    📌 0
Crinkled Arcs And Brownian Motion
A crinkled arc is a continuous curve that appears as if it is making right-angle turns at every point along its trajectory. Additionally, if you draw a straight line between two recent points and comp...

Since @spmontecarlo.bsky.social shared a post about Crinkled Arcs a couple of weeks ago, I've spent a fair bit of time digging into them

This new blog post attempts to collect what I found into an interactive introduction to crinkled arcs and their relationship to Brownian Motion

14.10.2025 11:25 · 👍 1    🔁 0    💬 0    📌 0

For me, it's roughly:

Monday: Odd
Tuesday: Odd
Wednesday: Even if you were just thinking about Tue/Thu, Odd if you were just thinking about Monday
Thursday: Odd
Friday: Same as Wednesday
Saturday: Even
Sunday: Weakly even

14.10.2025 10:37 · 👍 3    🔁 0    💬 0    📌 0
Mate-in-Omega, The Great Phenomenon of Infinite Chess (YouTube video by Naviary)

A surprisingly intuitive and practical use for limit ordinals (assuming you have an infinite chess board handy)

youtu.be/CQ4Ap5itTX4?...

11.10.2025 16:07 · 👍 1    🔁 0    💬 0    📌 0
https://en.wikipedia.org/wiki/Infinite-dimensional_vector_function#Crinkled_arcs

Answering my own question, the example from the "Infinite-dimensional vector function" Wikipedia page makes it feel clear that a curve with these properties will exist

Even so, without more thought, this feels more like it side-steps the issue in my intuition than directly tackles it
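For concreteness, here is that Wikipedia example written out (my notation):

```latex
% A curve of indicator functions in L^2[0,1]:
f \colon [0,1] \to L^2[0,1], \qquad f(t) = \chi_{[0,t]}
% Continuity: \| f(t) - f(s) \|^2 = |t - s|
% Non-overlapping chords are orthogonal: for s_1 < t_1 \le s_2 < t_2,
\langle f(t_1) - f(s_1),\; f(t_2) - f(s_2) \rangle
  = \langle \chi_{[s_1, t_1]},\, \chi_{[s_2, t_2]} \rangle = 0
% because the indicators have disjoint supports: the same orthogonality of
% increments that Brownian motion exhibits
```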

28.09.2025 10:30 · 👍 1    🔁 0    💬 0    📌 0

Wait... what?!

Even though continuity means that the arcs have to exist in a countable-basis subspace, it feels so unintuitive that this gives room for a curve with uncountably many orthogonal chord pairs set up like this

Do you have any intuition on how you reconcile these two things?

27.09.2025 14:19 · 👍 0    🔁 0    💬 1    📌 0

"Everyone knows" what an autoencoder is… but there's an important complementary picture missing from most introductory material.

In short: we emphasize how autoencoders are implemented, but not always what they represent (and some of the implications of that representation). 🧵
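For context, the "how they are implemented" picture is the familiar encoder/decoder pair trained on reconstruction error; a minimal sketch (toy dimensions and names are mine, not from the thread):

```python
import torch
import torch.nn.functional as F
from torch import nn

class AutoEncoder(nn.Module):
    """The usual implementation-level picture: an encoder compresses, a decoder reconstructs."""
    def __init__(self, input_dim: int = 784, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, input_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

model = AutoEncoder()
x = torch.rand(16, 784)              # a toy batch
loss = F.mse_loss(model(x), x)       # trained purely on reconstruction error
```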

06.09.2025 21:20 · 👍 69    🔁 10    💬 2    📌 1

It seems chronological for me too, with the caveat that reposts are shown according to when they were reposted but display the time elapsed since the original posting

19.08.2025 21:31 · 👍 2    🔁 0    💬 1    📌 0

Does anyone have a good reference for paradoxes in set theory? I'm looking for something self-contained.

14.08.2025 14:01 · 👍 51    🔁 9    💬 2    📌 2

This is what I've pieced together from a couple of hours of reading... If there's anything I've missed or got wrong, let me know!

07.08.2025 16:29 · 👍 0    🔁 0    💬 0    📌 0
gpt-oss-120b & gpt-oss-20b Model Card
We introduce gpt-oss-120b and gpt-oss-20b, two open-weight reasoning models available under the Apache 2.0 license and our gpt-oss usage policy.

Both the blog post and paper are cited in the GPT-OSS model card here:

openai.com/index/gpt-os...

07.08.2025 16:29 · 👍 0    🔁 0    💬 1    📌 0

This turns out to be a special case of the attention sink as implemented in the new GPT models, as explained in this paper:
arxiv.org/pdf/2309.174...

07.08.2025 16:29 · 👍 0    🔁 0    💬 1    📌 0
Attention Is Off By One
Let's fix these pesky Transformer outliers using Softmax One and QuietAttention.

This idea seems to have originated in a blog post by Evan Miller, which is a really nice read and suggests that just adding +1 to the denominator of the softmax could solve the extreme value issue:
www.evanmiller.org/attention-is...
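A minimal sketch of the proposal as I read it (the function name is mine): the only change is an extra +1 in the softmax denominator, so a head can assign close to zero total attention.

```python
import torch

def softmax_plus_one(scores: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Evan Miller's "softmax1": exp(x_i) / (1 + sum_j exp(x_j)).

    Unlike ordinary softmax, the outputs need not sum to 1, so an attention head
    can shrink all of its weights towards zero instead of being forced to dump
    probability mass somewhere.
    """
    max_score = scores.max(dim=dim, keepdim=True).values
    exp_scores = torch.exp(scores - max_score)
    # Subtracting the max rescales numerator and denominator by exp(-max);
    # the "+1" must be rescaled by the same factor to keep the result identical.
    one = torch.exp(-max_score)
    return exp_scores / (one + exp_scores.sum(dim=dim, keepdim=True))
```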

07.08.2025 16:29 · 👍 0    🔁 0    💬 1    📌 0

But when you have mostly very small values and one or two extremely large ones, this makes quantisation much lossier

The solution is essentially to add an extra element to the input sequence that the attention heads can "sink" their attention into, but which is removed from subsequent calculations
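Roughly how I picture the mechanism (a sketch with made-up names, not the actual gpt-oss/transformers code): append a learnable "sink" logit to each head's attention scores, let the softmax spread mass onto it, then drop its column so it contributes nothing to the output.

```python
import torch
import torch.nn.functional as F

def attention_with_sink(q, k, v, sink_logit):
    """Scaled dot-product attention with a per-head "sink" slot.

    q, k, v: (batch, heads, seq, head_dim); sink_logit: (heads,) learnable scalars.
    The sink acts like an extra key that heads can dump attention onto, but it is
    removed before the weighted sum, so attending to it means "add nothing".
    """
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5        # (b, h, seq, seq)
    sink = sink_logit.view(1, -1, 1, 1).expand(*scores.shape[:-1], 1)
    weights = F.softmax(torch.cat([scores, sink], dim=-1), dim=-1)
    weights = weights[..., :-1]                                  # drop the sink column
    return weights @ v                                           # rows may now sum to < 1

# toy usage with made-up sizes
b, h, s, d = 2, 4, 10, 16
out = attention_with_sink(torch.randn(b, h, s, d), torch.randn(b, h, s, d),
                          torch.randn(b, h, s, d), torch.zeros(h))
```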

07.08.2025 16:29 · 👍 0    🔁 0    💬 1    📌 0
Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing
Transformer models have been widely adopted in various domains over the last years, and especially large language models have advanced the field of AI significantly. Due to their size, the capability ...

It looks like they're designed to solve a problem raised in this paper (amongst others), that when an attention head wants to not add additional information, it will give the vast majority of its attention probability mass to non-informative tokens...

arxiv.org/abs/2306.12929

07.08.2025 16:29 · 👍 0    🔁 0    💬 1    📌 0
The code for the attention mechanism in the `transformers` implementation of gpt_oss

Looking at the code in Hugging Face's library for the new GPT models, I was a bit disappointed by how similar they are to Llama & Mistral models, but there is one cool trick I hadn't seen before: attention sinks. These are a mechanism by which attention heads can say "I don't have anything to add"

07.08.2025 16:29 · 👍 2    🔁 0    💬 1    📌 0

There seems to be a lot of nuance around exactly how to translate "lemma": it can be anything from a premise to, more literally, something like "I take", as in something taken for granted. But "proposition" is nice in that it fits both contexts

22.07.2025 13:01 · 👍 0    🔁 0    💬 1    📌 0

A little etymological fact that I like is that "lemma" and "dilemma" are from the same ancient Greek origin

If you translate "lemma" as meaning a proposition, a dilemma is literally having two propositions to consider

22.07.2025 12:59 · 👍 2    🔁 0    💬 1    📌 0

There are of course some details and caveats discussed in the blog:

echostatements.github.io/posts/2025/0...

You can also find code for the experiments on Github:

github.com/EchoStatemen...

9/9 🧵

18.07.2025 12:19 · 👍 0    🔁 0    💬 0    📌 0
Graph from start of thread repeated, showing effects of faster model and faster training

And even better, these tricks are not mutually exclusive: by doing both simultaneously, you get a 2.5x speed-up (dependent on batch size)

8/9 🧵

18.07.2025 12:17 · 👍 0    🔁 0    💬 1    📌 0
Graph showing training speed of classic vs faster training (~250 seconds vs ~150 seconds)

Again, this gets some pretty significant speed-up

7/9 🧵

18.07.2025 12:17 · 👍 0    🔁 0    💬 1    📌 0
Image showing how paired samples can be offset by one in a batch to make unpaired samples

Optimisation 2:

When training Siamese networks, people tend to generate matching and non-matching pairs in equal ratio. However, you can train more efficiently if you generate only matching pairs, then create the non-matching ones with some shifting of the subnetwork outputs.

6/9 🧵
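Concretely, the shifting trick can look something like this (a hypothetical sketch, not the blog's code): roll one side of the batch by one position, so every matching pair also yields a non-matching pair for free.

```python
import torch
import torch.nn.functional as F

def paired_batch_loss(emb_a: torch.Tensor, emb_b: torch.Tensor, margin: float = 1.0):
    """emb_a[i] and emb_b[i] are subnetwork embeddings of a *matching* pair.

    Rolling emb_b by one position pairs each emb_a[i] with emb_b[i+1], which
    (assuming distinct identities within the batch) gives a non-matching pair,
    so one batch of positives supplies the negatives as well.
    """
    pos = F.pairwise_distance(emb_a, emb_b)
    neg = F.pairwise_distance(emb_a, torch.roll(emb_b, shifts=1, dims=0))
    # standard contrastive loss: pull positives together, push negatives beyond the margin
    return (pos.pow(2) + (margin - neg).clamp(min=0).pow(2)).mean()

# toy usage: a batch of matching pairs of embeddings from the shared subnetwork
emb_a, emb_b = torch.randn(32, 64), torch.randn(32, 64)
loss = paired_batch_loss(emb_a, emb_b)
```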

18.07.2025 12:17 · 👍 0    🔁 0    💬 1    📌 0
A diagram of a faster implementation of a Siamese network vs a standard one

Comparison of inference and throughput for classic vs faster network (1.35 seconds for classic, 0.91 seconds for faster)

Optimisation 1:

In practice, when implementing these networks there is only one subnetwork, called twice, once for each input

But by stacking the inputs, we actually only need to make one call to the network:

Depending on batch size, the effect can be significant

5/9 🧵
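In code, the stacking trick is roughly the following (made-up sizes, not the post's code): concatenate the two inputs along the batch dimension, make a single forward pass through the shared subnetwork, then split the result.

```python
import torch
from torch import nn

# a shared subnetwork, as in any Siamese setup
subnetwork = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))

x1 = torch.randn(256, 128)   # first input of each pair
x2 = torch.randn(256, 128)   # second input of each pair

# Classic Siamese forward pass: two separate calls to the shared subnetwork
emb1, emb2 = subnetwork(x1), subnetwork(x2)

# Stacked version: one call on a doubled batch, then split the outputs
emb1_fast, emb2_fast = subnetwork(torch.cat([x1, x2], dim=0)).chunk(2, dim=0)
```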

18.07.2025 12:17 · 👍 0    🔁 0    💬 1    📌 0

This is useful if you're doing, say, facial recognition. You train the network to learn whether pairs of images are the same person or different people

Then you can recognise someone not in your training set by providing a reference image for that person along with your test image

4/9 🧵

18.07.2025 12:17 · 👍 0    🔁 0    💬 1    📌 0
