Consensus algorithms have a reputation problem. Paxos is famously described as “simple” by its author and “incomprehensible” by everyone else. Raft was designed to be understandable, yet production implementations still routinely surprise their operators. This article distills the lessons from running distributed state machines in production.
Why Consensus Matters
At its core, consensus solves one problem: getting multiple machines to agree on a sequence of operations. This enables:
- Replicated databases — all nodes see the same writes in the same order
- Configuration management — cluster-wide settings that are always consistent
- Leader election — exactly one node acts as primary at any given time
- Distributed locks — mutual exclusion across machines
Raft: The Practical Choice
Raft decomposes consensus into three sub-problems:
1. Leader Election
```go
type Node struct {
	id        string
	peers     []Peer    // RPC handles to the other cluster members
	state     NodeState // Follower, Candidate, Leader
	term      uint64    // latest term this node has seen
	votedFor  string    // candidate this node voted for in the current term
	log       []LogEntry
	commitIdx uint64 // highest log index known to be committed
}
```
```go
func (n *Node) startElection() {
	n.term++
	n.state = Candidate
	n.votedFor = n.id
	votes := 1 // vote for self
	for _, peer := range n.peers {
		// Real implementations issue these RPCs concurrently with timeouts;
		// they are shown sequentially here for clarity.
		resp := peer.RequestVote(n.term, n.id, n.lastLogIndex())
		if resp.Granted {
			votes++
		}
	}
	// Majority of the full cluster: n.peers excludes this node, so the
	// cluster has len(n.peers)+1 members. Comparing against len(n.peers)/2
	// alone under-counts for even cluster sizes.
	if votes > (len(n.peers)+1)/2 {
		n.state = Leader
		n.sendHeartbeats()
	}
}
```
2. Log Replication
Once a leader is elected, it accepts client requests and replicates them:
- Client sends command to leader
- Leader appends to local log
- Leader sends AppendEntries to all followers
- Once a majority confirms, leader commits the entry
- Leader responds to client
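The commit step above can be sketched in Go. This is a minimal illustration of the majority-commit rule, assuming a leader that tracks a matchIndex per follower the way real implementations do; the Leader type and advanceCommit name are hypothetical, not part of the code above.

```go
package main

import (
	"fmt"
	"sort"
)

// Leader is an illustrative stand-in for the leader-side replication state.
type Leader struct {
	matchIndex []uint64 // highest log index known replicated on each follower
	lastIndex  uint64   // leader's own last log index
	commitIdx  uint64
}

// advanceCommit finds the highest index replicated on a majority of the
// cluster (followers plus the leader itself) and advances commitIdx to it.
// Real Raft additionally requires the entry at that index to be from the
// leader's current term before committing; that check is omitted here.
func (l *Leader) advanceCommit() {
	// The leader always has its own last index replicated locally.
	indexes := append([]uint64{l.lastIndex}, l.matchIndex...)
	sort.Slice(indexes, func(i, j int) bool { return indexes[i] > indexes[j] })
	// In the descending-sorted slice, the entry at position N/2 is held by
	// N/2+1 nodes, which is a majority for both odd and even cluster sizes.
	majorityIdx := indexes[len(indexes)/2]
	if majorityIdx > l.commitIdx {
		l.commitIdx = majorityIdx
	}
}

func main() {
	// Five-node cluster: leader at index 7, followers at 7, 5, 3, 3.
	l := &Leader{matchIndex: []uint64{7, 5, 3, 3}, lastIndex: 7}
	l.advanceCommit()
	fmt.Println(l.commitIdx) // index 5 is on 3 of 5 nodes
}
```

Note that the rule counts the leader's own copy toward the majority, which is why a three-node cluster can commit with a single follower acknowledgment.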
3. Safety
The critical invariants:
- Election Safety — at most one leader per term
- Leader Append-Only — leaders never overwrite or delete log entries
- Log Matching — if two logs contain an entry with the same index and term, all preceding entries are identical
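Raft enforces these invariants partly through its election restriction: a node grants its vote only if the candidate's log is at least as up-to-date as its own, which guarantees an elected leader holds every committed entry. A minimal sketch of that comparison, with illustrative function and parameter names:

```go
package main

import "fmt"

// logUpToDate reports whether a candidate's log is at least as up-to-date
// as the voter's. Raft compares the last entries' terms first, and breaks
// ties by log length (last index).
func logUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex uint64) bool {
	if candLastTerm != myLastTerm {
		return candLastTerm > myLastTerm
	}
	return candLastIndex >= myLastIndex
}

func main() {
	// A higher last term wins even against a longer log.
	fmt.Println(logUpToDate(3, 5, 2, 9)) // true
	// Same term, shorter log: vote denied.
	fmt.Println(logUpToDate(2, 4, 2, 9)) // false
}
```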
Production Lessons
Lesson 1: Timeouts Kill You
The most common production issue isn’t split-brain or data loss—it’s timeout tuning. Too aggressive, and you get unnecessary leader elections during GC pauses. Too conservative, and failover takes minutes.
```yaml
# Start conservative, tune down
election_timeout_min: 1000ms
election_timeout_max: 2000ms
heartbeat_interval: 200ms
```
Rule of thumb: heartbeat_interval << election_timeout << MTBF
Lesson 2: Snapshots Are Not Optional
Without snapshots, your log grows unbounded. Nodes that fall behind must replay the entire history to catch up. In practice:
- Snapshot every 10,000 entries or 5 minutes, whichever comes first
- Compress snapshots — they’re often highly compressible
- Transfer snapshots in chunks for large state
Lesson 3: Membership Changes Are the Hardest Part
Adding or removing nodes from a running cluster is where most implementations break. The single-server membership change approach (one node at a time) is dramatically simpler than joint consensus and sufficient for most use cases.
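The key discipline behind single-server changes is serialization: at most one add or remove may be in flight, and it must commit before the next begins, so old and new configurations always share a majority. A minimal sketch of that guard, with an illustrative Cluster type (real implementations clear the pending flag when the configuration entry commits through the log):

```go
package main

import (
	"errors"
	"fmt"
)

// Cluster is an illustrative stand-in for membership state.
type Cluster struct {
	members       map[string]bool
	pendingChange bool // true while a config change is uncommitted
}

// addNode admits one new member at a time; a second change is rejected
// until the first configuration entry commits.
func (c *Cluster) addNode(id string) error {
	if c.pendingChange {
		return errors.New("membership change already in progress")
	}
	c.pendingChange = true // cleared once the config entry commits
	c.members[id] = true
	return nil
}

func main() {
	c := &Cluster{members: map[string]bool{"a": true, "b": true, "c": true}}
	err1 := c.addNode("d")
	err2 := c.addNode("e") // rejected: first change not yet committed
	fmt.Println(err1 == nil, err2 != nil) // true true
}
```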
When Not to Use Consensus
Consensus is expensive. Every write requires a round-trip to a majority of nodes. Before reaching for Raft or Paxos, consider:
- Do you actually need strong consistency? — eventual consistency is cheaper and sufficient for many workloads
- Is your cluster small? — consensus works best with 3-7 nodes
- Can you partition your state? — independent consensus groups for independent data
The right amount of consensus is the minimum that satisfies your consistency requirements. Everything beyond that is wasted latency.