Consensus algorithms have a reputation problem. Paxos is famously described as “simple” by its author and “incomprehensible” by everyone else. Raft was designed to be understandable, yet production implementations still routinely surprise their operators. This article distills the lessons from running distributed state machines in production.
Why Consensus Matters
At its core, consensus solves one problem: getting multiple machines to agree on a sequence of operations. This enables:
- Replicated databases — all nodes see the same writes in the same order
- Configuration management — cluster-wide settings that are always consistent
- Leader election — exactly one node acts as primary at any given time
- Distributed locks — mutual exclusion across machines
Raft: The Practical Choice
Raft decomposes consensus into three sub-problems:
1. Leader Election
```go
type Node struct {
	id        string
	peers     []Peer    // RPC handles to the other cluster members
	state     NodeState // Follower, Candidate, Leader
	term      uint64    // latest term this node has seen
	votedFor  string    // candidate this node voted for in the current term
	log       []LogEntry
	commitIdx uint64 // highest log index known to be committed
}
```
```go
func (n *Node) startElection() {
	n.term++
	n.state = Candidate
	n.votedFor = n.id
	votes := 1 // vote for self
	for _, peer := range n.peers {
		// Real implementations issue these RPCs concurrently with timeouts;
		// they are shown sequentially here for clarity.
		resp := peer.RequestVote(n.term, n.id, n.lastLogIndex())
		if resp.Granted {
			votes++
		}
	}
	// Majority of the full cluster: n.peers excludes this node, so the
	// cluster has len(n.peers)+1 members. Comparing against len(n.peers)/2
	// alone under-counts for even cluster sizes.
	if votes > (len(n.peers)+1)/2 {
		n.state = Leader
		n.sendHeartbeats()
	}
}
```
2. Log Replication
Once a leader is elected, it accepts client requests and replicates them:
- Client sends command to leader
- Leader appends to local log
- Leader sends AppendEntries to all followers
- Once a majority confirms, leader commits the entry
- Leader responds to client
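The commit step above can be sketched in Go. This is a minimal illustration of the majority-commit rule, assuming a leader that tracks a matchIndex per follower the way real implementations do; the Leader type and advanceCommit name are hypothetical, not part of the code above.

```go
package main

import (
	"fmt"
	"sort"
)

// Leader is an illustrative stand-in for the leader-side replication state.
type Leader struct {
	matchIndex []uint64 // highest log index known replicated on each follower
	lastIndex  uint64   // leader's own last log index
	commitIdx  uint64
}

// advanceCommit finds the highest index replicated on a majority of the
// cluster (followers plus the leader itself) and advances commitIdx to it.
// Real Raft additionally requires the entry at that index to be from the
// leader's current term before committing; that check is omitted here.
func (l *Leader) advanceCommit() {
	// The leader always has its own last index replicated locally.
	indexes := append([]uint64{l.lastIndex}, l.matchIndex...)
	sort.Slice(indexes, func(i, j int) bool { return indexes[i] > indexes[j] })
	// In the descending-sorted slice, the entry at position N/2 is held by
	// N/2+1 nodes, which is a majority for both odd and even cluster sizes.
	majorityIdx := indexes[len(indexes)/2]
	if majorityIdx > l.commitIdx {
		l.commitIdx = majorityIdx
	}
}

func main() {
	// Five-node cluster: leader at index 7, followers at 7, 5, 3, 3.
	l := &Leader{matchIndex: []uint64{7, 5, 3, 3}, lastIndex: 7}
	l.advanceCommit()
	fmt.Println(l.commitIdx) // index 5 is on 3 of 5 nodes
}
```

Note that the rule counts the leader's own copy toward the majority, which is why a three-node cluster can commit with a single follower acknowledgment.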
3. Safety
The critical invariants:
- Election Safety — at most one leader per term
- Leader Append-Only — leaders never overwrite or delete log entries
- Log Matching — if two logs contain an entry with the same index and term, all preceding entries are identical
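Raft enforces these invariants partly through its election restriction: a node grants its vote only if the candidate's log is at least as up-to-date as its own, which guarantees an elected leader holds every committed entry. A minimal sketch of that comparison, with illustrative function and parameter names:

```go
package main

import "fmt"

// logUpToDate reports whether a candidate's log is at least as up-to-date
// as the voter's. Raft compares the last entries' terms first, and breaks
// ties by log length (last index).
func logUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex uint64) bool {
	if candLastTerm != myLastTerm {
		return candLastTerm > myLastTerm
	}
	return candLastIndex >= myLastIndex
}

func main() {
	// A higher last term wins even against a longer log.
	fmt.Println(logUpToDate(3, 5, 2, 9)) // true
	// Same term, shorter log: vote denied.
	fmt.Println(logUpToDate(2, 4, 2, 9)) // false
}
```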
Production Lessons
Lesson 1: Timeouts Kill You
The most common production issue isn’t split-brain or data loss—it’s timeout tuning. Too aggressive, and you get unnecessary leader elections during GC pauses. Too conservative, and failover takes minutes.
```yaml
# Start conservative, tune down
election_timeout_min: 1000ms
election_timeout_max: 2000ms
heartbeat_interval: 200ms
```
Rule of thumb: heartbeat_interval << election_timeout << MTBF
Lesson 2: Snapshots Are Not Optional
Without snapshots, your log grows unbounded. Nodes that fall behind must replay the entire history to catch up. In practice:
- Snapshot every 10,000 entries or 5 minutes, whichever comes first
- Compress snapshots — they’re often highly compressible
- Transfer snapshots in chunks for large state
Lesson 3: Membership Changes Are the Hardest Part
Adding or removing nodes from a running cluster is where most implementations break. The single-server membership change approach (one node at a time) is dramatically simpler than joint consensus and sufficient for most use cases.
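The key discipline behind single-server changes is serialization: at most one add or remove may be in flight, and it must commit before the next begins, so old and new configurations always share a majority. A minimal sketch of that guard, with an illustrative Cluster type (real implementations clear the pending flag when the configuration entry commits through the log):

```go
package main

import (
	"errors"
	"fmt"
)

// Cluster is an illustrative stand-in for membership state.
type Cluster struct {
	members       map[string]bool
	pendingChange bool // true while a config change is uncommitted
}

// addNode admits one new member at a time; a second change is rejected
// until the first configuration entry commits.
func (c *Cluster) addNode(id string) error {
	if c.pendingChange {
		return errors.New("membership change already in progress")
	}
	c.pendingChange = true // cleared once the config entry commits
	c.members[id] = true
	return nil
}

func main() {
	c := &Cluster{members: map[string]bool{"a": true, "b": true, "c": true}}
	err1 := c.addNode("d")
	err2 := c.addNode("e") // rejected: first change not yet committed
	fmt.Println(err1 == nil, err2 != nil) // true true
}
```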
When Not to Use Consensus
Consensus is expensive. Every write requires a round-trip to a majority of nodes. Before reaching for Raft or Paxos, consider:
- Do you actually need strong consistency? — eventual consistency is cheaper and sufficient for many workloads
- Is your cluster small? — consensus works best with 3-7 nodes
- Can you partition your state? — independent consensus groups for independent data
The right amount of consensus is the minimum that satisfies your consistency requirements. Everything beyond that is wasted latency.