Tim Mastny

@timmastny.bsky.social

Software engineer bringing digital finance to Africa. Writes about low-level programming, data science, and algorithms at https://timmastny.com

20 Followers 53 Following 33 Posts Joined Jan 2025
11 months ago

#statstab #317 Tests for Pairwise Mean Differences in R
by @timmastny.bsky.social

Thoughts: A nice illustration of what {emmeans} is doing when computing contrasts using various functions.

#emmeans #r #pairwise #ttest #tutorial #dataviz

timmastny.com/blog/tests-p...

1 year ago
Manus turns out to be just Claude Sonnet + 29 other tools, Reflection 70B vibes ngl

Word is that it's just a wrapper around Claude 3.7! www.reddit.com/r/LocalLLaMA...

1 year ago
Dead Rise (Original Soundtrack) Ian Wright · EP · 2025 · 4 songs

My friend @timmastny.bsky.social made a game for the NES and asked me to write music. We made both the game and music under the constraints of the NES.

Check out the game on tmastny.itch.io/dead-rise

And the soundtrack on all streaming platforms!

Spotify link: open.spotify.com/album/2NDjba...

1 year ago
False Sharing | Tim Mastny

This is called false sharing - when CPUs coordinate over variables they don't actually share.

In this case the penalty was small, but in other cases it can cause 5%, 25%, or even 8000% performance degradation. If you want to learn more, check out my post!
timmastny.com/blog/false-s...

1 year ago

The cache line is the smallest unit of data unique to each CPU cache, commonly 64 or 128 bytes. If a CPU wants to modify a variable on a cache line, it must coordinate with all other CPUs to invalidate their copies, even if some of the variables on that line are not shared.

1 year ago

Variables `a` and `b` are not shared between the two threads: thread 0 uses `a` and thread 1 uses `b`. But since they are declared next to each other in the program on the left, they share the same cache line.

1 year ago

The program on the right is 0.7% faster measured over millions of iterations. Why?

1 year ago

Which program is faster? 1/

1 year ago
Expected Database Query Latency

Check out problem 2: sirupsen.com/napkin/probl...
And my napkin math: timmastny.com/projects/nap...

1 year ago

It’s also random for main memory, but we only need to load one new cache line from main memory per page: 50 * 50 ns = 2.5 us. The prefetcher should ensure the sequential read of the rest of the page is fast.

Even if we have to fetch more cache lines, it’s much faster than 1ms!

1 year ago

Solution: I didn’t consider that each page access is random! A random 8KB SSD read is 100 us, so a 16KB page is 101 us ~ 100 us, and 10 pages take 1 ms.

When doing random reads, SSDs are 100x slower, not 10x!

1 year ago

Total:
50 pages * 0.25 us / page = 12.5 us
10 pages * 2 us / page = 20 us
12.5 us + 20 us = 32.5 us

Seems reasonable: SSD is 10x slower than main memory.

1 year ago

50 pages / query * 80% of pages in cache = 40 pages in cache, 10 on disk.

Sequential SSD read: 1 us / 8KB * 16KB / page = 2 us / page.

Page scan time: 1 ns / 64 bytes * 16KB / page = 250 ns / page = 0.25 us / page

1 year ago

Problem 2 from @sirupsen.bsky.social's napkin math newsletter:

Your SSD-backed database has a usage pattern that rewards you with an 80% page-cache hit rate. The median query gathers its results from 50 distinct disk pages. What is the expected average query time from your database?

1 year ago
Tuning B+ Trees for Memory Performance | Tim Mastny

The blue line in the first plot is our theoretical model and the black points are average search times on my implementation of B+ tree search. Our theoretical model agrees very closely with our benchmark!

Check out my blog post for more details: timmastny.com/blog/tuning-...

1 year ago

Pointer chasing is slow, but the number is limited to the height of the tree. Scanning is fast, but quickly adds up. Our model combines these costs: for any number of total keys, we can calculate the optimal number of keys per node 'b' that minimizes total search time.

1 year ago

On the other hand, a B+ tree with infinite keys per node is an array. A sequential scan of 1 trillion keys would take about 16 minutes at 1ns per key.

1 year ago

Let’s start with the extremes. With one key per node, our B+ tree is a binary search tree. Even with 1,000,000,000 keys, the maximum height is <30. We can find any key in less than 3 us: 30 pointer jumps * 100 ns per jump = 3 us

1 year ago

Search in B+ trees is a balance of two operations: scanning within nodes (1ns each) and jumping between nodes (100ns each). The size of each node determines how many of each operation we do - and that's why finding the optimal node size is crucial for performance.

1 year ago
Problem 1 | Tim Mastny

Here's problem one from the newsletter:
sirupsen.com/napkin/probl...

And my napkin math:
timmastny.com/projects/nap...

1 year ago

The solution estimated the storage cost to be $0.01 / GB / month, but didn't mention write costs. Simon mentions disk storage rather than blob, which I did not consider 🤔

1 year ago

Tips from solution: calculate and estimate with powers of 10 to simplify the math, e.g. 1 day ≈ 9 * 10^4 seconds. I also like to use powers of 10 that are multiples of three to match SI prefixes: 5 * 10^5 bytes = 0.5 * 10^6 bytes = 0.5 MB

1 year ago

At $5 / 1 million writes, we have to batch requests together before writing to the blob, or we would 250x our storage cost. Batching 1000 requests costs $15,000 / year; batching 1,000,000 costs only $15.

1 year ago

Data size: 1 byte per character, 5 characters per word, 200 words per request = 1KB per request. That works out to 9TB / day. Blob storage at $0.02 / GB / month is $65,000 per year. But we also have to pay for writes to blob storage!

1 year ago

Starting to work through the Napkin Math newsletter by x.com/Sirupsen

Problem 1: How much will the storage of logs cost for a standard, monolithic 100,000 RPS web application?

1 year ago
RGA Editor

The demo includes several interactive examples, showing how the data structure solves some typical synchronization problems.

Check out the demo here: timmastny.com/rga/

1 year ago

Latest demo: collaborative text editing synchronized with a conflict-free data structure.

The data structure ensures that insertions, edits, and deletions are properly merged during synchronization, even if the changes were made offline. 1/

1 year ago
Why Branch Prediction Needs Real Data | Tim Mastny

Loops, hot paths, and repetitive data mean programs are often predictable: even a simple "branch taken" prediction can achieve 70% accuracy.

Unlike sorting algos, we don't care how predictors perform on random data: we need to compare them against real data.
timmastny.com/blog/branch-...

1 year ago

Think about hashing like putting balls randomly into bins. gshare could put all the balls into one bin.

But with concatenation, each bin has a maximum of 4 balls, leading to fewer collisions on average.

The problem is that branch history is not random! 2/

1 year ago

When analyzing branch prediction algorithms, the "better" hashing approach (gshare) is theoretically worse than the simple one (concatenation).

What's going on?
