Reproducing the AWS Outage Race Condition with a Model Checker | Waqas Younas' blog
Welcome to Waqas' blog
wyounas.github.io/aws/concurre...
We'll use a model checker to see how such a race could happen. Formal verification can't prevent every failure, but it helps us think more clearly about correctness and reason about subtle bugs.
10.11.2025 12:02
TLA+ Modeling of AWS outage DNS race condition
On Oct 19–20, 2025, AWS's N. Virginia region suffered a major DynamoDB outage triggered by a DNS automation defect that broke endpoint resol...
muratbuffalo.blogspot.com/2025/11/tla-...
AWS's N. Virginia region suffered a DynamoDB outage triggered by a DNS automation defect. This post focuses narrowly on the race condition at the core of the bug, which is best understood through TLA+ modeling.
06.11.2025 12:01
TernFS: an exabyte scale, multi-region distributed filesystem
www.xtxmarkets.com/tech/2025-te...
This post motivates TernFS, explains its high-level architecture, and then explores some key implementation details.
03.11.2025 12:01
Dynamo, DynamoDB, and Aurora DSQL - Marc's Blog
Names are hard, ok?
brooker.co.za/blog/2025/08...
People often ask me about the architectural relationship between Amazon Dynamo, Amazon DynamoDB and Aurora DSQL. I'll start off by comparing how the systems achieve a few key properties.
14.10.2025 18:52
Linearizability testing S2 with deterministic simulation
s2.dev/blog/lineari...
We can gain confidence that S2 is linearizable by taking an empirical validation approach, using a model checker like Knossos or Porcupine.
30.09.2025 11:00
How I solved a distributed queue problem after 15 years
dbos.dev/blog/durable...
What we really needed to make distributed task queueing robust are durable queues that checkpoint the status of our queued tasks to a durable store like Postgres.
22.09.2025 18:27
Understanding Paxos the intuitive way
relentless-leader.com/dive-deep-in...
09.08.2025 14:07
Murat Demirbas and Aleksey Charapko read and discuss the HotOS paper "Real Life Is Uncertain. Consensus Should Be Too!"
31.07.2025 21:37
Learning about distributed systems: where to start?
muratbuffalo.blogspot.com/2020/06/lear...
A principled, from-the-foundations-up study of distributed systems, which will take a good three months in the first pass and many more months to build competence after that.
30.05.2025 11:00
FLP Result: Impossibility of Distributed Consensus with One Faulty Process (1985)
groups.csail.mit.edu/tds/papers/L...
29.05.2025 11:00
Just make it scale: An Aurora DSQL story
www.allthingsdistributed.com/2025/05/just...
A few weeks ago, at our internal dev conference, I watched a talk from two of our PEs on building DSQL. I asked if they'd be willing to turn their insights into a deeper exploration of DSQL's development.
28.05.2025 11:01
Reasoning about Distributed Protocols with Smart Casual Verification
decentralizedthoughts.github.io/2025-05-23-s...
Reasoning about distributed algorithms is hard at the best of times, with state split across remote nodes, asynchrony, concurrency, and non-determinism in the order that events occur.
27.05.2025 11:00
Apache Iceberg Internals Dive Deep On Performance
relentless-leader.com/apache-icebe...
Apache Iceberg is an ACID table format designed for large-scale analytics workloads.
15.05.2025 11:01
Concurrency bugs in Lucene: How to fix optimistic concurrency failures
www.elastic.co/search-labs/...
Debugging concurrency bugs is no picnic, but we're going to get into it. Enter Fray, a deterministic concurrency testing framework that turns flaky failures into reproducible ones.
12.05.2025 11:04
Erlang's not about lightweight processes and message passing...
stevana.github.io/erlangs_not_...
To me it's clear that the big idea there isn't lightweight processes and message passing, but rather the generic components which in Erlang are called behaviours.
09.05.2025 11:01
So, You Want to Learn More About Deterministic Simulation Testing?
pierrezemb.fr/posts/learn-...
A curated collection of resources about deterministic simulation testing for distributed systems.
08.05.2025 11:01
May thy bits chip and shatter: Patterns for Building High-Performance Observability Pipelines at Scale
sumercip.com/posts/patter...
07.05.2025 11:02
Parallel, Concurrent and Distributed Programming
ilyasergey.net/YSC4231/
This course on basic concurrent and parallel algorithms has been taught by Ilya Sergey at Yale-NUS College in 2019-2024.
06.05.2025 11:01
Systems Correctness Practices at AWS: Leveraging Formal and Semi-formal Methods
dl.acm.org/doi/10.1145/...
05.05.2025 11:03
Distributed consensus
shachaf.net/w/consensus
This page is a relatively informal discussion of distributed consensus and Paxos, what it does, how it works, and some tricks and variants.
28.04.2025 11:02
Why is the raft consensus algorithm called "raft"?
groups.google.com/g/raft-dev/c...
25.04.2025 11:01
Three Clocks are Better than One
Insights, updates, and technical deep dives on building a high-performance financial transactions database.
tigerbeetle.com/blog/2021-08...
CLOCK_MONOTONIC_RAW, CLOCK_MONOTONIC and CLOCK_BOOTTIME, all monotonic clock stopwatches provided by the Linux kernel through the clock_gettime(2) syscall to measure elapsed time
22.04.2025 11:00
Building a modern Durable Execution Engine from First Principles
restate.dev/blog/buildin...
We built a precursor and from all the lessons learned there, we arrived at a design with a self-contained complete stack, centered around a command log and event-processor, shipping as a single Rust binary
21.04.2025 17:10
Decomposing Transactional Systems
transactional.blog/blog/2025-de...
Every transactional system does four things: execute, order, validate, and persist transactions.
All four of these things must be done before the system may acknowledge a transaction's result to a client.
18.04.2025 11:00
How crawlers impact the operations of the Wikimedia projects
diff.wikimedia.org/2025/04/01/h...
Since the beginning of 2024, the demand for the content created by the Wikimedia volunteer community, especially for the 144 million images, videos, and other files on Wikimedia Commons, has grown.
15.04.2025 11:02
Memcached: VerifyThis Long-term Challenge
verifythis.github.io/ltc/03memcac...
VerifyThis Long-Term Challenge aims at proving that deductive program verification can produce relevant results for real systems with acceptable effort on a large scale in a collaborative manner.
10.04.2025 11:02
Testing Distributed Systems
asatarin.github.io/testing-dist...
01.04.2025 18:35
How concurrency works: A visual guide
wyounas.github.io/concurrency/...
Concurrent programming is hard.
24.03.2025 22:02