Two rules to make pause_minority work:
1. Odd number of nodes (3, 5, 7). Even numbers can deadlock: in a 50/50 split neither side has a majority, so both halves pause and the whole cluster stops.
2. Monitor rabbitmq_partitions metric. Partitions are network problems. Fix the network.
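In rabbitmq.conf terms, rule 1's setting is a single line (a minimal sketch):

```ini
# rabbitmq.conf - minority side freezes instead of split-braining
cluster_partition_handling = pause_minority
```

For rule 2, the partitions list in `rabbitmqctl cluster_status` output should stay empty; anything else means the network needs fixing.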
Full RabbitMQ production guide:
podostack.com/p/rabbitmq-...
Decision matrix:
"My messages can't be lost" -> pause_minority
"Uptime matters more than a few lost messages" -> autoheal
"I like chaos" -> ignore (please don't)
99% of teams should use pause_minority.
autoheal - the "move fast" choice.
Both sides run during the split (like ignore). When they reconnect, the side with fewer clients gets wiped and resynced.
Fast recovery. But messages from the losing side are gone forever.
pause_minority - the safe choice.
Nodes in the minority partition freeze. They stop accepting connections. Only the majority keeps running.
When the network heals, minority nodes sync and resume. No split-brain. No data loss.
cluster_partition_handling has 3 options:
ignore (default) - both sides keep running. Split-brain. Guaranteed data loss when they reconnect.
It's the default and it's the most dangerous. Let that sink in.
Network split in your RabbitMQ cluster. What happens next depends on one setting.
The default? The worst possible choice.
Top use cases for policies:
- DLX routing (dead letter handling)
- TTL enforcement (auto-expire stale messages)
- Queue length limits (protect against runaway producers)
- Quorum queue migration (switch type without code changes)
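The first three are one-liners. A sketch with illustrative policy names and queue-name patterns (the policy keys message-ttl and max-length are standard):

```shell
# Expire messages on temp queues after 60 seconds
rabbitmqctl set_policy TempTTL "^temp\." '{"message-ttl":60000}' --apply-to queues

# Cap event queues at 100k messages to contain runaway producers
rabbitmqctl set_policy EventCap "^events\." '{"max-length":100000}' --apply-to queues
```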
Full guide:
podostack.com/p/rabbitmq-...
One gotcha: client-side arguments beat policies.
If your app declares a queue with x-max-length=50, the policy's max-length is ignored for that queue.
Policies set defaults. Client args are overrides.
Priority system resolves conflicts:
Priority 0: global default (DLX for all)
Priority 10: group-specific (TTL for temp queues)
Priority 20: override (quorum for critical queues)
One catch: only one policy applies to a queue at a time. The highest-priority match wins outright - definitions are not merged across policies, so the winning policy must carry every argument the queue needs.
The real power is pattern matching.
"^temp\." gets TTL.
"^critical\." gets quorum type.
".*" is the global default.
Name your queues with prefixes and policies practically write themselves.
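The matching logic fits in a few lines of Python (a simulation for illustration, not the broker's actual code; the policies mirror the tiers above): every policy whose pattern matches the queue name is a candidate, and the single highest-priority candidate applies.

```python
import re

# (name, pattern, priority, definition) - mirrors rabbitmqctl set_policy arguments
POLICIES = [
    ("global-dlx",  r".*",          0,  {"dead-letter-exchange": "dlx"}),
    ("temp-ttl",    r"^temp\.",     10, {"message-ttl": 60000}),
    ("crit-quorum", r"^critical\.", 20, {"queue-type": "quorum"}),
]

def effective_policy(queue_name):
    """Return the definition of the highest-priority policy matching the queue.
    Only ONE policy applies - definitions are not merged across policies."""
    matches = [p for p in POLICIES if re.search(p[1], queue_name)]
    if not matches:
        return None
    return max(matches, key=lambda p: p[2])[3]

print(effective_policy("orders"))            # {'dead-letter-exchange': 'dlx'}
print(effective_policy("temp.report"))       # {'message-ttl': 60000} - note: no DLX!
print(effective_policy("critical.billing"))  # {'queue-type': 'quorum'}
```

Note the middle case: a temp queue gets only the TTL policy, not the global DLX one - the winning policy replaces the lower-priority ones rather than layering on top of them.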
A policy is a server-side regex rule:
rabbitmqctl set_policy DLX ".*" \
'{"dead-letter-exchange":"dlx"}' \
--apply-to queues
Every queue now routes dead letters to your DLX - just make sure the "dlx" exchange actually exists. Applied instantly. Zero downtime.
Stop hardcoding TTL and DLX in your application code.
RabbitMQ policies do it better. One CLI command. No code change. No restart.
I learned this the hard way - a $2M revenue report attributed to the wrong state because someone ran an UPDATE instead of an INSERT.
Full guide with SQL examples and common mistakes:
podostack.com/p/slowly-ch...
Decision guide:
Type 1 for: typo fixes, attributes nobody reports on
Type 2 for: geography, category, status - anything you'd analyze over time
Type 6 for: when compliance needs both current AND historical views
Type 2 covers 80% of cases.
The trick is surrogate keys.
customer_key 1001 = NYC version
customer_key 1002 = Austin version
customer_id stays "CUST-42" on both
Facts point to the key, not the ID. That's how you preserve what was true at the time.
SCD Type 2 in action:
Customer moves from NYC to Austin.
Old row: NYC, valid_from 2022, valid_to 2025
New row: Austin, valid_from 2025, valid_to 9999
Past sales still show NYC. New sales show Austin. History intact.
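That flow is small enough to sketch end to end with Python's built-in sqlite3 (table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY,  -- surrogate key, one per version
        customer_id  TEXT,                 -- business key, stable across versions
        city         TEXT,
        valid_from   TEXT,
        valid_to     TEXT
    )""")

def scd2_update(conn, customer_id, new_city, change_date):
    """Close the current row, then insert a new version (Type 2, never an in-place UPDATE of the attribute)."""
    conn.execute(
        "UPDATE dim_customer SET valid_to = ? "
        "WHERE customer_id = ? AND valid_to = '9999-12-31'",
        (change_date, customer_id))
    conn.execute(
        "INSERT INTO dim_customer (customer_id, city, valid_from, valid_to) "
        "VALUES (?, ?, ?, '9999-12-31')",
        (customer_id, new_city, change_date))

# Initial load, then the customer moves
conn.execute("INSERT INTO dim_customer (customer_id, city, valid_from, valid_to) "
             "VALUES ('CUST-42', 'NYC', '2022-01-01', '9999-12-31')")
scd2_update(conn, "CUST-42", "Austin", "2025-03-01")

for row in conn.execute("SELECT customer_key, city, valid_from, valid_to "
                        "FROM dim_customer ORDER BY customer_key"):
    print(row)
# (1, 'NYC', '2022-01-01', '2025-03-01')
# (2, 'Austin', '2025-03-01', '9999-12-31')
```

Facts recorded before 2025-03-01 join to customer_key 1 and keep reporting NYC; new facts join to key 2.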
Slowly Changing Dimensions (SCD) fix this.
6 types, but you really need to know 2:
Type 1: overwrite (simple, history gone)
Type 2: add a new row (history preserved)
Your data warehouse tracks what happened.
But does it track what CHANGED?
If you UPDATE customer.city in place, you just erased history. Every past sale now shows the new address.
Enable it:
rabbitmqctl enable_feature_flag quorum_queue_non_voters
It's transparent to your application code. Publishers and consumers don't know or care.
Deep dive with architecture diagrams:
podostack.com/p/rabbitmq-...
Best use cases:
- Geo-distributed clusters (replicas in every AZ, voting in the fast zone)
- Compliance requiring 5+ copies
- Hardware migrations without write degradation
If a voter goes down, a non-voter gets auto-promoted.
Your quorum recovers without operator intervention. The other non-voters keep holding their copies.
Same idea as ZooKeeper observers or etcd learners.
Non-voter replicas break this trade-off.
They receive the full replication stream. They store complete copies. But they don't participate in consensus voting.
7 replicas, 3 voters. Quorum stays at 2. Write speed stays fast.
Raft's scaling problem:
3 voters, quorum of 2 - fast.
5 voters, quorum of 3 - slower.
7 voters, quorum of 4 - noticeably slower.
Every voter adds a network round trip to every write. More durability = more latency.
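The arithmetic is just simple majority over voters only - non-voters never enter it. A quick sketch (illustrative, not RabbitMQ internals):

```python
def write_quorum(voters: int) -> int:
    """Raft commits a write once a majority of VOTERS acknowledge it."""
    return voters // 2 + 1

# Non-voters replicate the full log but never count toward the majority,
# so adding them buys durability without adding write latency.
for voters, non_voters in [(3, 0), (5, 0), (7, 0), (3, 4)]:
    total = voters + non_voters
    print(f"{total} replicas, {voters} voters -> quorum of {write_quorum(voters)}")
```

The last line is the non-voter setup: 7 replicas, but writes still wait on only 2 acknowledgements.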
You can replicate a RabbitMQ queue to 7 nodes without slowing down writes.
The trick: not all replicas need to vote.
Full breakdown with Mermaid diagrams, K8s YAML examples, and a decision matrix for when to use quorum vs streams vs classic.
Read the full issue:
podostack.com/p/rabbitmq-...
Network split in your cluster? The default config (ignore) gives you split-brain.
pause_minority is the only safe choice. Odd number of nodes. Always.
Broker policies: stop hardcoding TTL and DLX in your app.
One CLI command applies dead-letter routing to every queue in your cluster. Pattern-matched. Priority-based. No code changes. No restart.
Non-voter replicas: the scaling trick nobody talks about.
7 copies of your data for durability, but only 3 vote for consensus. Write latency stays fast. Storage stays safe.
Same idea as ZooKeeper observers.
RabbitMQ now has Kafka-like streams.
Append-only log. Multiple consumers. Offset tracking. Message replay.
You might not need that Kafka cluster anymore.
Mirrored queues are deprecated. Quorum queues replaced them.
The difference? Raft consensus vs best-effort sync. One guarantees your messages survive node failures. The other hopes for the best.