
PBFT: When You Can’t Trust Anyone

In 1999, Miguel Castro and Barbara Liskov published a paper that changed the landscape of Byzantine fault tolerance. Before PBFT, Byzantine agreement protocols were largely academic curiosities — theoretically interesting, practically useless. The existing protocols had exponential message complexity or required synchronous networks, which is another way of saying they required networks that don’t exist. Castro and Liskov’s contribution was making BFT practical. The word “practical” is doing an enormous amount of heavy lifting in that title, as we’re about to discover.

PBFT can tolerate up to f Byzantine (arbitrarily faulty) nodes out of a total of n = 3f + 1 replicas. It guarantees safety in an asynchronous network and liveness when the network is eventually synchronous. The protocol operates in views, each with a designated primary (leader), and uses a three-phase commit protocol for normal operation. If the primary misbehaves, a view-change protocol replaces it.

This all sounds straightforward. It isn’t.

The System Model

Before we dive into the protocol, let’s be precise about what PBFT assumes:

  • n = 3f + 1 replicas. To tolerate f Byzantine faults, you need at least 3f + 1 total nodes. This comes from the fundamental impossibility result: with fewer nodes, a Byzantine adversary can create ambiguity that prevents correct nodes from distinguishing valid states. If you want to tolerate 1 faulty node, you need 4 replicas. Tolerate 2? Seven replicas. The overhead adds up.

  • Asynchronous network with eventual synchrony. PBFT does not assume messages arrive within bounded time for safety. A Byzantine node can delay messages, reorder them, duplicate them — the protocol remains safe. However, liveness requires eventual synchrony: messages must eventually get through within some (unknown) bound. Without this, FLP impossibility kicks in and nobody makes progress.

  • Cryptographic assumptions. Every message is signed. PBFT uses public-key cryptography to authenticate messages and MACs (message authentication codes) for efficiency. The adversary cannot forge signatures or break the cryptographic primitives. This assumption is doing real work — the entire protocol falls apart without it.

  • Independent failures. The Byzantine nodes are assumed to fail independently, not in a coordinated attack by a single adversary controlling multiple nodes. In practice, if someone compromises your deployment infrastructure and controls multiple replicas simultaneously, the 3f + 1 bound may not hold. This is the kind of thing the paper mentions in passing and practitioners lose sleep over.
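The replica and quorum arithmetic above recurs throughout the chapter. A minimal sketch (Python; `pbft_params` is a made-up helper name, not part of any PBFT implementation):

```python
def pbft_params(f):
    """Sizes implied by tolerating f Byzantine faults in PBFT."""
    n = 3 * f + 1            # total replicas required
    quorum = 2 * f + 1       # size of a committed certificate
    client_replies = f + 1   # matching replies a client must collect
    return n, quorum, client_replies

# Tolerating 1 fault needs 4 replicas; tolerating 2 needs 7.
```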

The Three-Phase Protocol: Normal Case Operation

PBFT’s normal case protocol has three phases: pre-prepare, prepare, and commit. The client sends a request to the primary, and the primary initiates the three-phase protocol to replicate the request across all replicas.

Message Flow: Normal Case

Here’s the complete message flow for a single client request with n = 4 (f = 1):

Client    Primary(0)  Replica 1   Replica 2   Replica 3
  |           |           |           |           |
  |--REQUEST->|           |           |           |
  |           |           |           |           |
  |           |--PRE-PREPARE-------->|           |
  |           |--PRE-PREPARE---------------->|   |
  |           |--PRE-PREPARE---------------------->|
  |           |           |           |           |
  |           |<--PREPARE-|           |           |
  |           |<--PREPARE-------------|           |
  |           |<--PREPARE-------------------------|
  |           |           |           |           |
  |           | (backups 1-3 also multicast      |
  |           |  PREPARE to each other; each     |
  |           |  replica collects 2f PREPAREs    |
  |           |  matching the PRE-PREPARE; the   |
  |           |  primary sends no PREPARE)       |
  |           |           |           |           |
  |           |--COMMIT-->|           |           |
  |           |--COMMIT-------------->|           |
  |           |--COMMIT-------------------------->|
  |           |<--COMMIT--|           |           |
  |           |<--COMMIT--------------|           |
  |           |<--COMMIT--------------------------|
  |           |           |           |           |
  |           | (backups also exchange COMMIT    |
  |           |  with each other; each replica   |
  |           |  collects 2f+1 matching COMMITs, |
  |           |  counting its own)               |
  |           |           |           |           |
  |<--REPLY---|           |           |           |
  |<--REPLY---------------|           |           |
  |<--REPLY--------------------------|           |
  |<--REPLY-------------------------------------------|

The client waits for f + 1 matching replies (2 in this case). Since at most f replicas can be Byzantine, f + 1 matching replies guarantee at least one came from a correct replica. The client doesn’t need to know which replicas are faulty.

Notice the all-to-all communication in the prepare and commit phases. Every replica sends to every other replica. This is the O(n^2) per phase that will haunt PBFT’s scalability story.
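The client-side reply check can be sketched in a few lines (Python; illustrative only, with a made-up `accept_reply` helper — the point is that replies are keyed by replica id so a Byzantine replica cannot vote twice):

```python
from collections import Counter

def accept_reply(replies, f):
    """Accept a result once f + 1 replies from distinct replicas agree.

    replies maps replica_id -> result, so each replica contributes at
    most one vote. With at most f Byzantine replicas, f + 1 matching
    votes guarantee at least one came from a correct replica."""
    counts = Counter(replies.values())
    for result, votes in counts.items():
        if votes >= f + 1:
            return result
    return None  # not enough agreement yet; keep waiting
```

With f = 1, two matching replies suffice even if a third replica lies: `accept_reply({0: "ok", 1: "ok", 2: "bogus"}, f=1)` yields `"ok"`.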

Phase 1: Pre-Prepare

The primary assigns a sequence number to the client request and broadcasts a PRE-PREPARE message to all backups.

// Primary replica p in view v
function on_client_request(request):
    if i_am_not_primary(v):
        forward request to primary
        return

    seq_num = next_sequence_number()

    // Check watermarks
    if seq_num < low_watermark or seq_num > high_watermark:
        drop request  // Outside our window
        return

    msg = PrePrepare{
        view:     v,
        seq:      seq_num,
        digest:   hash(request),
        request:  request
    }

    sign(msg)
    log.append(msg)
    broadcast_to_all_backups(msg)

// Backup replica i receives PRE-PREPARE from primary
function on_pre_prepare(msg):
    // Validate the message
    if msg.view != current_view:
        discard(msg)
        return

    if not verify_signature(msg, primary_of(msg.view)):
        discard(msg)
        return

    if msg.seq < low_watermark or msg.seq > high_watermark:
        discard(msg)
        return

    // Critical check: have we already accepted a different
    // PRE-PREPARE for this view and sequence number?
    if exists_pre_prepare(msg.view, msg.seq) and
       existing.digest != msg.digest:
        discard(msg)  // Primary is equivocating!
        return

    // Accept the PRE-PREPARE
    log.append(msg)

    // Enter PREPARE phase
    prepare_msg = Prepare{
        view:   msg.view,
        seq:    msg.seq,
        digest: msg.digest,
        sender: my_id
    }
    sign(prepare_msg)
    log.append(prepare_msg)
    broadcast_to_all_replicas(prepare_msg)

A backup accepts the PRE-PREPARE if:

  1. It’s in the correct view.
  2. The signature is valid.
  3. The sequence number is within the watermark window.
  4. It hasn’t accepted a different PRE-PREPARE for the same view and sequence number. This last check prevents a Byzantine primary from assigning the same sequence number to different requests for different backups — a form of equivocation.

The primary does not send a PREPARE message. It already demonstrated its intent via the PRE-PREPARE. This is a minor detail the paper handles carefully and many implementations get wrong initially.

Phase 2: Prepare

Each backup that accepts the PRE-PREPARE multicasts a PREPARE message to all replicas. A replica collects PREPARE messages until it has a prepared certificate: the PRE-PREPARE plus 2f matching PREPARE messages from different replicas. A replica's own PREPARE counts toward the total; the primary contributes no PREPARE, since its PRE-PREPARE already commits it to the proposal.

function on_prepare(msg):
    if msg.view != current_view:
        discard(msg)
        return

    if not verify_signature(msg, msg.sender):
        discard(msg)
        return

    if msg.seq < low_watermark or msg.seq > high_watermark:
        discard(msg)
        return

    log.append(msg)
    check_prepared(msg.view, msg.seq, msg.digest)

function check_prepared(view, seq, digest):
    // Do we have a matching PRE-PREPARE?
    if not has_pre_prepare(view, seq, digest):
        return

    // Count matching PREPAREs from distinct replicas
    prepare_count = count_matching_prepares(view, seq, digest)

    if prepare_count >= 2 * f:
        // We are PREPARED for this request
        mark_as_prepared(view, seq, digest)

        // Enter COMMIT phase
        commit_msg = Commit{
            view:   view,
            seq:    seq,
            digest: digest,
            sender: my_id
        }
        sign(commit_msg)
        log.append(commit_msg)
        broadcast_to_all_replicas(commit_msg)

Why 2f? Because out of n = 3f + 1 replicas, at most f are Byzantine. The primary sent the PRE-PREPARE (1 replica). We need 2f PREPAREs from distinct backups. Since at most f of those could be Byzantine, at least f + 1 correct replicas have prepared. This is enough to guarantee that no two different requests can both become prepared at the same sequence number in the same view. This is the quorum intersection argument: any two sets of 2f + 1 replicas out of 3f + 1 total must overlap in at least f + 1 replicas, and at least one of those must be correct.
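For small f, the quorum intersection argument can be checked exhaustively. A brute-force sanity check (Python; not part of the protocol):

```python
from itertools import combinations

def quorums_overlap(f):
    """Verify that any two quorums of 2f + 1 replicas out of n = 3f + 1
    share at least f + 1 members -- hence at least one correct replica,
    since at most f of the overlap can be Byzantine."""
    n, q = 3 * f + 1, 2 * f + 1
    return all(len(set(a) & set(b)) >= f + 1
               for a in combinations(range(n), q)
               for b in combinations(range(n), q))
```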

The prepared predicate is the critical invariant: if a correct replica is prepared for request m at sequence number n in view v, then no correct replica can be prepared for a different request m’ at the same sequence number in the same view. This is what prevents the Byzantine primary from causing inconsistency.

Phase 3: Commit

Once prepared, a replica broadcasts a COMMIT message. A replica collects COMMIT messages until it has a committed certificate: 2f + 1 matching COMMIT messages from different replicas (including itself).

function on_commit(msg):
    if msg.view != current_view:
        discard(msg)
        return

    if not verify_signature(msg, msg.sender):
        discard(msg)
        return

    if msg.seq < low_watermark or msg.seq > high_watermark:
        discard(msg)
        return

    log.append(msg)
    check_committed(msg.view, msg.seq, msg.digest)

function check_committed(view, seq, digest):
    // Do we have a prepared certificate?
    if not is_prepared(view, seq, digest):
        return

    // Count matching COMMITs from distinct replicas
    commit_count = count_matching_commits(view, seq, digest)

    if commit_count >= 2 * f + 1:
        // We are COMMITTED-LOCAL for this request
        mark_as_committed(view, seq, digest)

        // Execute requests in sequence number order
        execute_pending_requests()

function execute_pending_requests():
    while has_committed(last_executed + 1):
        last_executed += 1
        request = get_request(last_executed)
        result = execute(request)

        // Send reply to client
        reply = Reply{
            view:      current_view,
            timestamp: request.timestamp,
            client:    request.client_id,
            replica:   my_id,
            result:    result
        }
        sign(reply)
        send(request.client_id, reply)

The commit phase is necessary because the prepared predicate alone isn’t sufficient across view changes. A replica might be prepared, but if the view changes before it commits, other correct replicas might not know about it. The commit phase ensures that enough correct replicas know the request is prepared, so the information survives a view change. Without the commit phase, a view change could cause a prepared request to be “forgotten,” violating safety.

This is one of those subtleties that looks like over-engineering until you try removing it and watch your protocol break in testing.

Why Three Phases? Can’t We Do Two?

A natural question: Raft and Paxos work with essentially two phases (propose and accept). Why does PBFT need three?

The answer lies in the difference between crash fault tolerance and Byzantine fault tolerance. In CFT protocols, a node either works correctly or crashes — it never lies. So when a leader says “commit this entry,” replicas can trust it. In BFT, the leader might be lying. The three-phase structure ensures:

  1. Pre-prepare: The primary proposes an ordering (and might be lying about it).
  2. Prepare: Replicas verify they all received the same proposal. This catches equivocation by the primary — if the primary sent different proposals to different replicas, they’ll discover the inconsistency. The prepared certificate proves that 2f + 1 replicas agree on what the primary proposed.
  3. Commit: Replicas verify that enough of them are prepared. This ensures the decision survives view changes even if the primary is replaced.

You can think of it this way: in CFT, you trust the leader and just need to ensure durability. In BFT, you first need to establish what the leader actually said (prepare), and then ensure enough replicas know what the leader actually said (commit).
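The prepare phase's equivocation-catching role can be illustrated with a toy check (Python; `prepared_digest` is a made-up helper for illustration, assuming one vote per backup and at most 3f backup votes): with a 2f threshold, at most one digest can ever gather a prepared certificate for a given view and sequence number, no matter how an equivocating primary splits its proposals.

```python
from collections import Counter
from itertools import product

def prepared_digest(f, prepare_digests):
    """Return the digest reaching 2f matching PREPAREs from distinct
    backups (to pair with the matching PRE-PREPARE), or None.
    Assumes at most 3f backup votes, one per backup."""
    winners = [d for d, c in Counter(prepare_digests).items() if c >= 2 * f]
    assert len(winners) <= 1  # two digests can never both prepare
    return winners[0] if winners else None

# f = 1: three backups. However the primary splits them between
# digests "A" and "B", at most one digest prepares -- the assert
# above never fires.
for votes in product("AB", repeat=3):
    prepared_digest(1, votes)
```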

View Changes: The Hard Part

The view-change protocol is, without exaggeration, the most complex part of PBFT. It handles leader replacement when the primary is suspected of being faulty. If you’ve implemented Raft, think of view changes as leader election, except the leader might be actively trying to sabotage the process.

A replica triggers a view change when:

  • It suspects the primary is faulty (e.g., timeout on expected PRE-PREPARE).
  • It receives f + 1 VIEW-CHANGE messages for a higher view (indicating other replicas also suspect the primary).

View Change Protocol

// Replica i suspects the primary is faulty
function start_view_change(new_view):
    if new_view <= current_view:
        return  // Don't go backwards

    // Stop accepting messages for current view
    stop_timers()

    // Construct the VIEW-CHANGE message
    // P is the set of prepared certificates for sequence numbers
    // between low and high watermarks
    P = collect_prepared_certificates()

    msg = ViewChange{
        new_view: new_view,
        last_stable_checkpoint: last_checkpoint_seq,
        checkpoint_proof: checkpoint_certificate,
        prepared_set: P,     // Set of (seq, digest, prepare_certificate)
        sender: my_id
    }
    sign(msg)
    send_to_new_primary(msg)
    // Also keep a copy for ourselves
    log_view_change(msg)

The new primary (determined by new_view mod n) collects VIEW-CHANGE messages:

// New primary for view v collects VIEW-CHANGE messages
function on_view_change(msg):
    if msg.new_view != pending_view:
        return

    if not verify_signature(msg, msg.sender):
        return

    if not verify_checkpoint_proof(msg.checkpoint_proof):
        return

    if not verify_prepared_certificates(msg.prepared_set):
        return

    view_change_messages[msg.sender] = msg

    if count(view_change_messages) >= 2 * f + 1:
        construct_new_view(msg.new_view)

function construct_new_view(v):
    // Determine the starting sequence number:
    // the highest stable checkpoint across all VIEW-CHANGE messages
    min_s = max(vc.last_stable_checkpoint
                for vc in view_change_messages.values())

    // Determine the highest sequence number mentioned in any
    // prepared certificate across all VIEW-CHANGE messages
    max_s = max(seq for vc in view_change_messages.values()
                    for (seq, _, _) in vc.prepared_set)

    // For each sequence number from min_s+1 to max_s,
    // determine what request should be assigned
    O = {}  // The set of PRE-PREPARE messages for the new view
    for seq in range(min_s + 1, max_s + 1):
        // Find if any VIEW-CHANGE message has a prepared certificate
        // for this sequence number
        prepared = find_highest_view_prepared(seq, view_change_messages)

        if prepared != null:
            // Re-propose the request that was prepared
            O[seq] = PrePrepare{
                view: v, seq: seq, digest: prepared.digest
            }
        else:
            // No replica had this prepared — assign a null request
            O[seq] = PrePrepare{
                view: v, seq: seq, digest: NULL_DIGEST
            }

    // Broadcast NEW-VIEW message
    new_view_msg = NewView{
        view: v,
        view_changes: view_change_messages,  // The 2f+1 VIEW-CHANGE msgs
        pre_prepares: O                      // The re-proposed requests
    }
    sign(new_view_msg)
    broadcast_to_all(new_view_msg)

    // Enter the new view
    current_view = v
    // Start processing the pre-prepares in O
    for pp in O.values():
        process_as_pre_prepare(pp)

// Backup receives NEW-VIEW message
function on_new_view(msg):
    if msg.view <= current_view:
        return

    // Verify the NEW-VIEW message
    if not verify_signature(msg, primary_of(msg.view)):
        return

    // Verify that the VIEW-CHANGE messages are valid
    if count(msg.view_changes) < 2 * f + 1:
        return

    for vc in msg.view_changes.values():
        if not verify_view_change(vc):
            return

    // Independently recompute what O should be
    // using the same deterministic algorithm
    expected_O = recompute_pre_prepares(msg.view, msg.view_changes)

    if expected_O != msg.pre_prepares:
        // New primary is lying about what should be re-proposed!
        return  // Don't enter new view; may trigger another view change

    // Accept the new view
    current_view = msg.view
    for pp in msg.pre_prepares.values():
        process_as_pre_prepare(pp)

Message Flow: View Change

Here’s the view change flow when replica 0 (primary of view 0) is suspected faulty. Replica 1 becomes the new primary for view 1:

Replica 0     Replica 1     Replica 2     Replica 3
(old primary)  (new primary)
   |              |              |              |
   | (timeout)    | (timeout)    | (timeout)    |
   |              |              |              |
   |              |<-VIEW-CHANGE-|              |
   |              |<-VIEW-CHANGE---------------|
   |              |  (also sends own VC)        |
   |              |              |              |
   |              | (has 2f+1 = 3 VIEW-CHANGE   |
   |              |  messages including own)     |
   |              |              |              |
   |<--NEW-VIEW---|              |              |
   |              |--NEW-VIEW--->|              |
   |              |--NEW-VIEW------------------>|
   |              |              |              |
   |              | (replicas verify NEW-VIEW,  |
   |              |  recompute O, enter view 1) |
   |              |              |              |
   |              | (normal protocol resumes    |
   |              |  in view 1)                 |

Why View Changes Are Nightmarishly Complex

Let me enumerate the things that make view changes the hardest part of PBFT:

  1. State transfer. The new primary needs to reconstruct the state from VIEW-CHANGE messages. Each VIEW-CHANGE carries prepared certificates, checkpoint proofs, and sequence number information. Assembling this into a consistent view is non-trivial, especially when some of those certificates may have been constructed by Byzantine nodes.

  2. The O set computation. The new primary must correctly determine, for every sequence number in the window, whether a request was prepared in a previous view and re-propose it. Get this wrong and you violate safety. The paper describes this as a deterministic computation, which it is, but implementing it involves iterating over potentially thousands of prepared certificates across multiple VIEW-CHANGE messages and handling edge cases around null requests, gaps, and partially-prepared requests.

  3. Verification. Backups must independently verify the NEW-VIEW message by recomputing the O set themselves. This means every backup performs the same complex computation and checks it against what the new primary claims. A Byzantine new primary could try to subtly alter the O set to drop a committed request.

  4. Concurrent view changes. What happens if the view change to view 2 fails because the new primary for view 2 is also Byzantine? You need another view change to view 3. Multiple concurrent view changes can interact in subtle ways, especially if messages from different views are in flight simultaneously.

  5. Liveness during view changes. While a view change is in progress, the system makes no progress on client requests. Byzantine nodes can trigger unnecessary view changes to degrade performance (a denial-of-service vector). The paper addresses this with exponentially increasing timeouts, but tuning these timeouts in practice is an art form.

I’ve seen production PBFT implementations where the normal-case protocol worked fine for months, and the view-change code had critical bugs that were never triggered because the primary never failed. Then the primary failed, and the system didn’t recover. The view-change protocol accounts for maybe 20% of the paper’s text but 80% of the implementation complexity.

Watermarks and Garbage Collection

PBFT uses a sliding window mechanism controlled by two watermarks:

  • Low watermark (h): The sequence number of the last stable checkpoint.
  • High watermark (H): h + k, where k is a configurable window size, typically a small multiple of the checkpoint interval (the paper's example takes checkpoints every 100 requests and sets k = 200).

// Checkpoint protocol
function maybe_checkpoint(seq):
    if seq mod CHECKPOINT_INTERVAL != 0:
        return

    state_digest = hash(application_state)

    msg = Checkpoint{
        seq:    seq,
        digest: state_digest,
        sender: my_id
    }
    sign(msg)
    broadcast_to_all(msg)

function on_checkpoint(msg):
    checkpoint_messages[msg.seq][msg.sender] = msg

    // Do we have 2f+1 matching checkpoints?
    matching = count(cp for cp in checkpoint_messages[msg.seq].values()
                     if cp.digest == msg.digest)

    if matching >= 2 * f + 1:
        // Stable checkpoint!
        stable_checkpoint = msg.seq
        low_watermark = msg.seq
        high_watermark = msg.seq + WINDOW_SIZE

        // Garbage collect: discard all messages with seq <= msg.seq
        garbage_collect(msg.seq)

The watermarks serve two purposes:

  1. Memory management. Without garbage collection, replicas would accumulate messages forever. The checkpoint protocol allows replicas to discard old messages once 2f + 1 replicas agree on the state at a sequence number.
  2. Flow control. The high watermark prevents a faulty primary from exhausting memory by assigning arbitrarily high sequence numbers.

The checkpoint interval is a tuning knob: too frequent and you waste bandwidth on checkpoint messages; too infrequent and replicas accumulate too much state. In practice, I’ve seen intervals ranging from 128 to 1024, depending on the application’s state size and message rate.

Message Complexity: The Scalability Wall

Let’s count the messages for a single client request:

Phase         Messages                                  Complexity
Request       1 (client to primary)                     O(1)
Pre-prepare   n - 1 (primary to backups)                O(n)
Prepare       (n - 1) * n (each backup to all)          O(n^2)
Commit        n * (n - 1) (each replica to all others)  O(n^2)
Reply         n (all replicas to client)                O(n)
Total         ~2n^2                                     O(n^2)

  • n = 4 (f = 1): approximately 32 messages per request.
  • n = 7 (f = 2): approximately 98 messages per request.
  • n = 13 (f = 4): approximately 338 messages per request.
  • n = 31 (f = 10): approximately 1,922 messages per request.
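These totals follow directly from the per-phase counts. A quick sketch (Python; illustrative only):

```python
def messages_per_request(n):
    """Total protocol messages for one request in PBFT's normal case."""
    request = 1                  # client to primary
    pre_prepare = n - 1          # primary to each backup
    prepare = (n - 1) * n        # each backup to all replicas
    commit = n * (n - 1)         # each replica to all others
    reply = n                    # every replica answers the client
    return request + pre_prepare + prepare + commit + reply
```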

Each message must be authenticated (either MAC or digital signature). Each replica must process all messages it receives. The quadratic blowup means that PBFT hits a practical scalability wall around 20-30 replicas. Beyond that, the communication overhead dominates, and throughput collapses.

This is the fundamental challenge that motivated protocols like HotStuff (Chapter 11). The quadratic message complexity isn’t a bug in PBFT — it’s a consequence of the all-to-all communication needed to ensure that correct replicas can independently verify consensus without trusting the primary. Reducing this requires different cryptographic tools (threshold signatures) or different protocol structures.

Bandwidth Analysis

It’s not just message count — it’s message size. Each PREPARE and COMMIT message carries a view number, sequence number, digest, and signature. With RSA-2048 signatures, that’s at least 300+ bytes per message. With n = 31 replicas, a single request generates roughly 1,922 messages * 300 bytes = ~576 KB of authentication overhead alone, not counting the actual request data.

Using MACs instead of signatures (as the paper suggests for prepare and commit messages) reduces per-message overhead but introduces complexity in verification during view changes, where you need transferable proofs. Castro and Liskov’s paper handles this with an elaborate scheme of MAC vectors, which is yet another source of implementation headaches.

Optimizations

The base PBFT protocol is not fast enough for most practical use cases. Castro and Liskov’s paper includes several optimizations, and subsequent work added more.

Batching

Instead of running a three-phase protocol for every client request, the primary batches multiple requests into a single PRE-PREPARE. This amortizes the O(n^2) message overhead across many requests.

function batch_requests():
    // Accumulate requests until batch is full or timeout expires
    while batch.size() < MAX_BATCH_SIZE and not timeout_expired():
        if has_pending_request():
            batch.add(next_request())

    if batch.size() > 0:
        seq_num = next_sequence_number()
        msg = PrePrepare{
            view:     current_view,
            seq:      seq_num,
            digest:   hash(batch),
            requests: batch
        }
        broadcast_to_all_backups(msg)

Batching is the single most important optimization. In the original PBFT paper’s benchmarks, batching improved throughput by an order of magnitude. Without it, PBFT would be too slow for any serious workload.

The tradeoff is latency: individual requests wait in the batch buffer. With a batch timeout of, say, 10ms and a maximum batch size of 100, you add up to 10ms of latency but amortize the protocol overhead 100x.
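The amortization is easy to quantify. A sketch (Python; illustrative, assuming the per-phase message counts from the complexity table and that each request still has its own client request and n replies):

```python
def amortized_messages(n, batch_size):
    """Per-request message count when the primary batches requests:
    the three-phase overhead is shared across the batch, while the
    client request and the n replies remain per-request."""
    three_phase = (n - 1) + (n - 1) * n + n * (n - 1)
    return three_phase / batch_size + 1 + n

# n = 4: 32 messages per request unbatched, ~5.3 with a batch of 100.
```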

Speculative Execution

Replicas can execute requests speculatively after the prepare phase, before collecting the full commit certificate. The speculative result is sent to the client immediately. The client accepts the speculative result if it receives 3f + 1 matching speculative replies (all replicas agree).

function on_prepared(seq, digest):
    // Speculatively execute before commit
    result = execute_speculatively(get_request(seq))

    spec_reply = SpeculativeReply{
        view:      current_view,
        seq:       seq,
        client:    request.client_id,
        result:    result,
        sender:    my_id
    }
    send_to_client(spec_reply)

    // Continue with normal commit phase
    broadcast_commit(seq, digest)

If the speculative execution turns out to be wrong (due to a view change or reordering), the replica rolls back and re-executes. This optimization reduces perceived latency for the common case but requires application-level support for rollback, which is not always feasible.

Read-Only Optimization

Read-only requests can be processed without the full three-phase protocol. A replica that is up-to-date can respond directly to a read request. The client collects 2f + 1 matching replies.

This is safe because read-only requests don’t modify state, so there’s no ordering concern. The client waits for 2f + 1 matching replies to ensure at least f + 1 came from correct replicas, guaranteeing the read reflects a committed state.

In practice, this optimization provides significant throughput gains for read-heavy workloads, which is most workloads.

Tentative Execution

A variant where replicas execute requests tentatively after collecting the prepared certificate but don’t wait for the commit certificate. They send a tentative reply to the client. The client waits for 2f + 1 matching tentative replies. If the replies match, the client treats them as final.

This reduces the effective latency from three message delays (pre-prepare, prepare, commit) to two (pre-prepare, prepare), at the cost of requiring the client to collect more replies and handle the rare case where tentative execution needs to be rolled back.

Benchmarks and Real-World Performance

Castro and Liskov’s original benchmarks (1999, running on 200 MHz Pentium Pros connected by 100 Mbps Ethernet) showed:

  • BFS (Byzantine File System) throughput: approximately 50% of the unreplicated NFS server
  • Latency: approximately 3ms for small requests in a LAN setting

More modern benchmarks (circa 2010-2020, on contemporary hardware) show PBFT achieving:

Configuration   Throughput (ops/sec)   Latency (ms)   Notes
n=4, LAN        50,000 - 100,000       1-3            With batching
n=7, LAN        30,000 - 60,000        2-5            With batching
n=13, LAN       10,000 - 25,000        5-15           Throughput drops noticeably
n=31, LAN       1,000 - 5,000          20-100         Quadratic overhead dominates
n=4, WAN        1,000 - 5,000          100-500        Network latency dominates

These numbers depend heavily on request size, batch size, signing algorithm, and network characteristics. The general pattern is clear: PBFT performs well at small scale and degrades rapidly as you add replicas.

For comparison, a Raft cluster with n=5 on similar hardware typically achieves 100,000+ ops/sec with sub-millisecond latency. The cost of Byzantine tolerance is roughly a 2-5x throughput penalty at small scale, increasing to 10x or more at larger scale. This is the cost of not trusting your peers.

The Crypto Tax

Cryptographic operations are a significant portion of PBFT’s overhead. Each message requires signature verification. Let’s look at the costs:

Operation   RSA-2048   ECDSA P-256   Ed25519    HMAC-SHA256
Sign        ~1.5 ms    ~0.3 ms       ~0.05 ms   ~0.001 ms
Verify      ~0.05 ms   ~0.3 ms       ~0.1 ms    ~0.001 ms

With n = 31 replicas, each replica verifies roughly 2n = 62 messages per request (one PRE-PREPARE plus the incoming PREPARE and COMMIT messages). If using RSA signatures, that's 62 signature verifications * 0.05 ms ≈ 3.1 ms just for crypto. With Ed25519, it's ~6.2 ms (Ed25519 verify is slower than RSA verify). With MACs, it's ~0.06 ms.

This is why the original PBFT paper uses MACs for the prepare and commit phases and reserves full signatures for view-change messages that need transferability. It’s also why modern BFT protocols increasingly use threshold signatures (more on this in Chapter 11).

When PBFT Makes Sense

PBFT is appropriate when:

  • You have a small, known set of replicas (4-20). The quadratic overhead is manageable at this scale.
  • You genuinely face Byzantine threats. If your replicas might be compromised, buggy in non-crash ways, or operated by mutually distrusting parties, CFT protocols are insufficient.
  • You need deterministic finality. PBFT provides immediate, irrevocable finality. Once a request is committed, it’s committed. No probabilistic guarantees, no confirmation delays.
  • You’re in a permissioned setting. PBFT requires knowing the identity and public key of every replica. It doesn’t work for open, permissionless networks.

PBFT is not appropriate when:

  • You trust your replicas to be non-Byzantine. If the only failure mode is crashes, use Raft or Multi-Paxos. They’re simpler, faster, and more scalable.
  • You need more than ~20 replicas. The O(n^2) overhead makes PBFT impractical for large replica sets. Use HotStuff or a protocol with linear complexity.
  • You’re building a public blockchain. PBFT can’t handle open membership. You need Nakamoto-style consensus or a protocol designed for permissionless settings.
  • You can’t afford the latency. Three network round trips (pre-prepare, prepare, commit) plus cryptographic overhead means PBFT adds significant latency compared to CFT protocols.

The Gap Between Paper and Implementation

Let me enumerate some things the paper handles gracefully in a paragraph that take weeks to implement correctly:

  1. Message retransmission. The paper assumes reliable point-to-point channels. In practice, you need retransmission logic. But how do you retransmit efficiently without flooding the network? How do you distinguish a lost message from a delayed one from a Byzantine node claiming it never received a message?

  2. Request deduplication. Clients may retransmit requests. Replicas must detect duplicates using timestamps or nonces. The paper mentions this; implementing it correctly, especially across view changes and state transfers, is surprisingly fiddly.

  3. State transfer. When a replica falls behind (crashed and recovered, or was partitioned), it needs to catch up. The paper describes a checkpoint-based state transfer mechanism, but implementing it efficiently — transferring gigabytes of application state while the system continues processing requests — is a significant engineering challenge.

  4. Key management. Every message is signed. In a long-running system, keys need rotation. How do you rotate keys without stopping the protocol? The paper doesn’t address this because it’s an operational concern, not a protocol concern. But it’s very much your concern.

  5. Performance isolation. A single slow replica can affect the whole system because the protocol waits for 2f + 1 responses. In practice, you need timeout tuning, adaptive batching, and careful resource management to prevent stragglers from dominating latency.

  6. Testing Byzantine behavior. How do you test that your implementation actually tolerates Byzantine faults? Unit tests are insufficient — you need a test harness that can inject arbitrary Byzantine behavior: equivocation, selective message dropping, message reordering, invalid signatures, and combinations thereof. Building this test infrastructure is almost as much work as building the protocol itself.
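The deduplication in item 2 is usually built on per-client monotonic timestamps, as in the paper. A minimal sketch of the idea (Python; `ReplyCache` is a hypothetical structure, not from any particular implementation):

```python
class ReplyCache:
    """Per-client dedup: execute a request only if its timestamp is
    newer than the last one executed for that client; otherwise resend
    the cached reply instead of re-executing."""

    def __init__(self):
        self.last = {}  # client_id -> (timestamp, reply)

    def handle(self, client_id, timestamp, execute):
        ts, reply = self.last.get(client_id, (-1, None))
        if timestamp <= ts:
            return reply          # duplicate: resend cached reply
        reply = execute()         # fresh request: execute it
        self.last[client_id] = (timestamp, reply)
        return reply
```

The fiddly part the chapter alludes to is keeping this cache consistent across view changes and state transfers, since the cache itself is replicated state.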

Historical Context and Legacy

PBFT was a breakthrough. Before it, Byzantine fault tolerance was considered impractical — the province of theoreticians, not systems builders. Castro and Liskov showed that BFT could be “only” 3x slower than unreplicated execution, which was a dramatic improvement over prior BFT protocols.

PBFT directly influenced:

  • Hyperledger Fabric v0.6, which used PBFT as its consensus protocol (later replaced with a pluggable model).
  • BFT-SMaRt, a well-known Java implementation of BFT state machine replication.
  • Every subsequent BFT protocol, including HotStuff, Tendermint, SBFT, and dozens of others that define themselves in relation to PBFT.

The paper has been cited over 10,000 times, making it one of the most influential distributed systems papers ever published. And yet, if you ask anyone who has implemented PBFT from the paper, they’ll tell you it was one of the most painful engineering experiences of their career.

That’s the legacy of PBFT: a protocol that proved BFT could be practical, while also demonstrating that “practical” is a relative term. The subsequent two decades of BFT research have been, in large part, attempts to make PBFT’s ideas work better at scale. We’ll look at the most successful of those attempts next.