Tim Mastny

@timmastny.bsky.social

Software engineer bringing digital finance to Africa. Writes about low-level programming, data science, and algorithms at https://timmastny.com

20 Followers 53 Following 33 Posts Joined Jan 2025
11 months ago

#statstab #317 Tests for Pairwise Mean Differences in R
by @timmastny.bsky.social

Thoughts: A nice illustration of what {emmeans} is doing when computing contrasts using various functions.

#emmeans #r #pairwise #ttest #tutorial #dataviz

timmastny.com/blog/tests-p...

1 year ago
Manus turns out to be just Claude Sonnet + 29 other tools, Reflection 70B vibes ngl

Word is that it's just a wrapper around Claude 3.7! www.reddit.com/r/LocalLLaMA...

1 year ago
Dead Rise (Original Soundtrack) Ian Wright · EP · 2025 · 4 songs

My friend @timmastny.bsky.social made a game for the NES and asked me to write music. We made both the game and music under the constraints of the NES.

Check out the game on tmastny.itch.io/dead-rise

And the soundtrack on all streaming platforms!

Spotify link: open.spotify.com/album/2NDjba...

1 year ago
False Sharing | Tim Mastny

This is called false sharing - when CPUs coordinate over variables they don't actually share.

In this case the penalty was small, but in other cases it can cause 5%, 25%, or even 8000% performance degradation. If you want to learn more, check out my post!
timmastny.com/blog/false-s...

1 year ago

The cache line is the smallest unit of data unique to each CPU cache, commonly 64 or 128 bytes. If a CPU wants to modify a variable on a cache line, it must coordinate with all other CPUs to invalidate their copies, even if some of the variables on that line are not shared.

1 year ago

Variables `a` and `b` are not shared between the two threads: thread 0 uses `a` and thread 1 uses `b`. But since they are declared next to each other in the program on the left, they share the same cache line.

1 year ago

The program on the right is 0.7% faster measured over millions of iterations. Why?

1 year ago

Which program is faster? 1/

1 year ago
Expected Database Query Latency

Check out problem 2: sirupsen.com/napkin/probl...
And my napkin math: timmastny.com/projects/nap...

1 year ago

It’s also random for main memory, but we only need to load one new cache line from main memory per page: 50 * 50 ns = 2.5 us. The prefetcher should ensure the sequential read of the rest of the page is fast.

Even if we have to fetch more cache lines, it’s much faster than 1ms!

1 year ago

Solution: I didn’t consider that each page access is random! A random 8KB SSD read is 100 us, so a 16KB page is 101 us ~ 100 us, and 10 pages take 1 ms.

When doing random reads, SSDs are 100x slower, not 10x!

1 year ago

Total:
50 pages * 0.25 us / page = 12.5 us
10 pages * 2 us / page = 20 us
12.5 us + 20 us = 32.5 us

Seems reasonable: SSD is 10x slower than main memory.

1 year ago

50 pages / query * 80% of pages in cache = 40 pages in cache, 10 on disk.

Sequential SSD read: 1 us / 8KB * 16KB / page = 2 us / page.

Page scan time: 1 ns / 64 bytes * 16KB / page = 250 ns / page = 0.25 us / page

1 year ago

Problem 2 from @sirupsen.bsky.social's napkin math newsletter:

Your SSD-backed database has a usage pattern that rewards you with an 80% page-cache hit rate. The median query gathers its results from 50 distinct disk pages. What is the expected average query time from your database?

1 year ago
Tuning B+ Trees for Memory Performance | Tim Mastny

The blue line in the first plot is our theoretical model and the black points are average search times on my implementation of B+ tree search. Our theoretical model agrees very closely with our benchmark!

Check out my blog post for more details: timmastny.com/blog/tuning-...

1 year ago

Pointer chasing is slow, but the number is limited to the height of the tree. Scanning is fast, but quickly adds up. Our model combines these costs: for any number of total keys, we can calculate the optimal number of keys per node 'b' that minimizes total search time.

1 year ago

On the other hand, a B+ tree with infinite keys per node is an array. A sequential scan of 1 trillion keys would take about 16 minutes at 1ns per key.

1 year ago

Let’s start with the extremes. With one key per node, our B+ tree is a binary search tree. Even with 1,000,000,000 keys, the maximum height is <30. We can find any key in less than 3 us: 30 pointer jumps * 100 ns per jump = 3 us

1 year ago

Search in B+ trees is a balance of two operations: scanning within nodes (1ns each) and jumping between nodes (100ns each). The size of each node determines how many of each operation we do - and that's why finding the optimal node size is crucial for performance.

1 year ago
Problem 1 | Tim Mastny

Here's problem one from the newsletter:
sirupsen.com/napkin/probl...

And my napkin math:
timmastny.com/projects/nap...

1 year ago

The solution estimated the storage cost to be $0.01 / GB / month, but didn't mention write costs. Simon mentions disk storage rather than blob, which I did not consider 🤔

1 year ago

Tips from solution: calculate and estimate with powers of 10 to simplify the math, e.g. 1 day ≈ 9 * 10^4 seconds. I also like to use powers of 10 that are multiples of three to match SI prefixes: 5 * 10^5 bytes = 0.5 * 10^6 bytes = 0.5 MB

1 year ago

At $5 / 1 million writes, we have to batch requests together before writing to the blob, or we would 250x our storage cost. Batching 1000 requests costs $15,000 / year; batching 1,000,000 costs only $15.

1 year ago

Data size: 1 byte per character, 5 characters per word, 200 words per request = 1KB per request. That works out to 9TB / day. Blob storage at $0.02 / GB / month is $65,000 per year. But we also have to pay for writes to blob storage!

1 year ago

Starting to work through the Napkin Math newsletter by x.com/Sirupsen

Problem 1: How much will the storage of logs cost for a standard, monolithic 100,000 RPS web application?

1 year ago
RGA Editor

The demo includes several interactive examples, showing how the data structure solves some typical synchronization problems.

Check out the demo here: timmastny.com/rga/

1 year ago

Latest demo: collaborative text editing synchronized with a conflict-free data structure.

The data structure ensures that insertions, edits, and deletions are properly merged during synchronization, even if the changes were made offline. 1/

1 year ago
Why Branch Prediction Needs Real Data | Tim Mastny

Loops, hot paths, and repetitive data mean programs are often predictable: even a simple "branch taken" prediction can achieve 70% accuracy.

Unlike sorting algos, we don't care how predictors perform on random data: we need to compare them against real data.
timmastny.com/blog/branch-...

1 year ago

Think about hashing like putting balls randomly into bins. gshare could put all the balls into one bin.

But with concatenation, each bin has a maximum of 4 balls, leading to fewer collisions on average.

The problem is that branch history is not random! 2/

1 year ago

When analyzing branch prediction algorithms, the "better" hashing approach (gshare) is theoretically worse than the simple one (concatenation).

What's going on?
