

Chapter 1 of 2

The Circuit Breaker Pattern: A State Machine Between Your Service and a Slow Neighbor

A circuit breaker does not prevent failures. It turns catastrophic cascades into bounded, recoverable blips by refusing new calls to a failing dependency before they are ever sent.

Key Takeaways

  • A circuit breaker is a state machine wrapping a remote dependency: CLOSED (normal), OPEN (fast-fail), HALF_OPEN (probe a few requests).
  • Its job is to protect the caller's thread pool and user latency, not to heal the downstream. Healing is someone else's problem.
  • Three numbers matter more than the library: failure threshold, sleep window, and trial-request count. Misconfigure any of them and the breaker either chatters or sleeps through real outages.
  • The counter-intuitive insight: a circuit breaker is most valuable during partial failure, not total outage. When the dependency is fully dead, you'd find out anyway. The breaker shines when 1 in 5 calls hang for 30 seconds.
  • The 2012 Netflix Christmas Eve outage is the canonical proof point, but the breaker earned its keep in *subsequent* failures rather than in that outage itself.

---

A request enters your service. It calls a downstream API. The downstream is slow — not dead, slow. Each call takes 30 seconds before timing out. Within two minutes, your inbound request queue is full. Your HTTP threads are blocked. New requests pile up at the load balancer. Health checks start failing. The orchestrator kills your pod. The load balancer marks you unhealthy. You are now the failure — not the downstream.

I used to think the lesson here was "add a timeout." It is not. A timeout tells you how long to wait before giving up. It does nothing about the hundreds of in-flight requests already burning your threads, or about the new arrivals that will each wait out the full timeout in turn. The circuit breaker pattern exists to address what timeouts cannot: the difference between a slow dependency and a dead dependency is invisible from the caller until the caller is already on fire.

The pattern, in one sentence

A circuit breaker sits in front of a remote dependency and tracks recent failure rates. When failures cross a threshold, it stops calling the dependency entirely and returns a fast error to new requests. After a cooldown, it lets a few requests through to probe whether the dependency has recovered. If they succeed, it resumes normal traffic. If they fail, it goes back to fast-failing.

That is the entire pattern. Everything else — sliding windows, exponential backoff on the half-open probe, thread-pool isolation, fallback chains — is engineering refinement, not conceptual content.

The state machine

Three states, four transitions between them, and a counter.

```mermaid
stateDiagram-v2
    [*] --> CLOSED
    CLOSED --> OPEN: failure rate > threshold<br/>(e.g., 50% of last 20 calls)
    OPEN --> HALF_OPEN: sleep window elapsed<br/>(e.g., 30 seconds)
    HALF_OPEN --> CLOSED: trial requests succeed<br/>(e.g., 3 of 3)
    HALF_OPEN --> OPEN: any trial fails
    OPEN --> OPEN: new request arrives<br/>(fast-fail, no call attempted)
    CLOSED --> CLOSED: call succeeds or fails<br/>(counted in window)
```

Most circuit breakers you encounter (Resilience4j, Hystrix's modern descendants, Polly) implement this same machine with slightly different defaults, and Envoy's outlier detection applies a similar eject-and-readmit cycle to individual upstream hosts. The semantics matter more than the library.

CLOSED is the boring happy path. Calls go through. The breaker counts successes and failures over a sliding window, typically either the last N calls (count-based) or the last T seconds (time-based). When the window's failure rate crosses a threshold (commonly 50%, but the right number is workload-specific), the breaker trips.
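The windowed count described above can be sketched in a few lines. This is a minimal, count-based illustration rather than any library's implementation; `CountWindow` and `should_trip` are hypothetical names, and judging only a full window is an assumption made here to keep early noise from tripping the breaker.

```python
from collections import deque


class CountWindow:
    """Failure rate over the last `size` calls (count-based sliding window)."""

    def __init__(self, size=20):
        # deque(maxlen=...) silently drops the oldest outcome once full.
        self.outcomes = deque(maxlen=size)

    def record(self, success):
        self.outcomes.append(success)

    def is_full(self):
        return len(self.outcomes) == self.outcomes.maxlen

    def failure_rate(self):
        if not self.outcomes:
            return 0.0
        failures = sum(1 for ok in self.outcomes if not ok)
        return failures / len(self.outcomes)


def should_trip(window, threshold=0.5):
    # Assumption: only a full window is judged, so one early failure
    # cannot open the circuit on its own.
    return window.is_full() and window.failure_rate() > threshold
```

A time-based window works the same way, except that outcomes expire by age instead of by position.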

OPEN is the fast-fail state. New requests do not touch the dependency. They fail immediately with a CircuitBreakerOpenException (or your framework's equivalent). This is the moment the breaker earns its name — it has *broken the circuit* between caller and callee. The breaker sleeps for a configured window (5 seconds, 30 seconds, 60 seconds — context-dependent) and then transitions.

HALF_OPEN is the probe state. The breaker lets a small number of trial requests through (commonly 1, 3, or 5). If they all succeed, the dependency is presumed recovered and the breaker returns to CLOSED, resetting its window. If any fail, it goes back to OPEN and restarts the sleep timer.
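Put together, the three states fit in a small class. The sketch below is single-threaded and illustrative only: `CircuitBreaker`, `CircuitOpenError`, and every parameter name are hypothetical, the clock is injectable so the sleep window can be tested without waiting, and HALF_OPEN does not cap concurrent probes the way a production breaker would.

```python
import time


class CircuitOpenError(Exception):
    """Raised instead of calling the dependency while the breaker is OPEN."""


class CircuitBreaker:
    def __init__(self, threshold=0.5, window_size=20, sleep_window=30.0,
                 trial_requests=3, clock=time.monotonic):
        self.threshold = threshold
        self.window_size = window_size
        self.sleep_window = sleep_window
        self.trial_requests = trial_requests
        self.clock = clock
        self.state = "CLOSED"
        self.outcomes = []          # recent results while CLOSED (True = success)
        self.opened_at = 0.0
        self.trial_successes = 0

    def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.sleep_window:
                raise CircuitOpenError("fast-fail: dependency not called")
            self.state = "HALF_OPEN"    # sleep window elapsed: start probing
            self.trial_successes = 0
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _trip(self):
        self.state = "OPEN"
        self.opened_at = self.clock()   # (re)start the sleep timer
        self.outcomes = []

    def _on_failure(self):
        if self.state == "HALF_OPEN":
            self._trip()                # any failed trial reopens the circuit
        else:
            self._record(False)

    def _on_success(self):
        if self.state == "HALF_OPEN":
            self.trial_successes += 1
            if self.trial_successes >= self.trial_requests:
                self.state = "CLOSED"   # all trials passed: fresh window
                self.outcomes = []
        else:
            self._record(True)

    def _record(self, ok):
        self.outcomes = (self.outcomes + [ok])[-self.window_size:]
        if len(self.outcomes) == self.window_size:
            rate = self.outcomes.count(False) / self.window_size
            if rate > self.threshold:
                self._trip()
```

Wrapping a call is then `breaker.call(fetch_profile, user_id)`, where `fetch_profile` stands in for whatever function talks to the dependency.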

The interesting design questions live in the transitions, not the states.

The three numbers that matter

Every circuit breaker has parameters, and most engineers treat them as defaults to ignore. They are not.

Failure threshold. What counts as "broken"? 50% over the last 20 calls is a common default. But consider two services at very different scales. A dependency that handles 10,000 requests per minute with a 0.1% failure rate is performing brilliantly; a dependency handling 100 requests per minute with a 5% failure rate has failed only five calls in that minute and may be a few dropped packets away from recovery. A flat percentage threshold says nothing about how many calls stand behind the percentage, and at low volume a handful of failures can swing it wildly. The right threshold is a joint function of request volume and acceptable false-positive rate. Many production systems combine a percentage threshold with a minimum-volume floor: "trip if >50% failures over at least 20 requests in a 10-second window." This avoids tripping on a single dropped packet during low traffic.
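The quoted rule, a percentage threshold combined with a minimum-volume floor over a time window, might look like the following. It is a sketch under assumptions: `TimeWindow` and its parameter names are invented for illustration, and timestamps are passed in explicitly so the behavior is easy to check.

```python
from collections import deque


class TimeWindow:
    """Trip rule: failure rate > threshold over at least `min_calls` calls
    seen in the last `span` seconds."""

    def __init__(self, span=10.0, min_calls=20, threshold=0.5):
        self.span = span
        self.min_calls = min_calls
        self.threshold = threshold
        self.events = deque()   # (timestamp, success) pairs, oldest first

    def record(self, now, success):
        self.events.append((now, success))

    def should_trip(self, now):
        # Drop events that have aged out of the window.
        while self.events and now - self.events[0][0] > self.span:
            self.events.popleft()
        total = len(self.events)
        if total < self.min_calls:
            return False        # too little evidence to judge
        failures = sum(1 for _, ok in self.events if not ok)
        return failures / total > self.threshold
```

With the floor in place, three failures out of three calls during a quiet period do not open the circuit, while 11 failures out of 20 do.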

Sleep window. How long should the breaker stay open before probing? Too short and the breaker thrashes (opens, probes, opens again), adding load to a dependency that is still sick. Too long and the breaker keeps fast-failing long after the dependency has recovered, hurting availability. 30 seconds is a common starting point because it roughly matches human incident-response timescales, but library defaults vary widely and real systems tune this to the dependency's known recovery profile. A database failover takes minutes; a caching layer recovers in milliseconds.

Trial-request count. How many probes in HALF_OPEN? One is simplest, but a single probe is statistically thin — especially when failures are intermittent. Three to five is more robust. The trade-off is that during HALF_OPEN, those probes are real traffic against the downstream, and sending too many at once is exactly the "thundering herd" that the breaker was designed to prevent.
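To put a number on "statistically thin": if a still-sick dependency fails some fraction of calls, and probes are assumed to be independent (a simplification, since real failures tend to cluster), the chance that every probe happens to succeed falls off geometrically with the probe count. The function name below is made up for this illustration.

```python
def false_recovery_probability(failure_rate, probes):
    """Chance that `probes` independent trial calls all succeed even though
    the dependency still fails `failure_rate` of the time."""
    return (1.0 - failure_rate) ** probes


# A dependency where 1 in 5 calls still fails:
for n in (1, 3, 5):
    print(n, round(false_recovery_probability(0.2, n), 3))
# prints:
# 1 0.8
# 3 0.512
# 5 0.328
```

A single probe wrongly declares recovery four times out of five in this scenario; five probes cut that to roughly one in three, at the cost of five real calls against the downstream.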

These three parameters form an implicit contract: the breaker promises to detect failures quickly (low threshold, short window), promises not to add load to a sick dependency (controlled probe count), and promises to recover quickly when the dependency heals (reasonable sleep window). No single set of numbers maximizes all three. You choose what to optimize for.

Why timeouts are not enough

This is the question I get most often when explaining the pattern. "Doesn't a timeout solve this?"

A timeout bounds the worst case for a single request. A circuit breaker bounds the worst case for a *fleet* of requests during a *window* of time. They are different layers of defense.

Imagine 200 concurrent requests hit your service. Each calls a downstream with a 30-second timeout. The downstream is hung. Every request occupies a thread for 30 seconds. Your thread pool of 200 is exhausted within seconds. Health checks time out. You appear down. The downstream was slow; you are now dead.

Add a circuit breaker. The first 20 calls fail slowly, each waiting out the timeout. The breaker trips. The next 180 requests to arrive fail in milliseconds with CircuitBreakerOpenException. Your threads stay free. Your health checks pass. Your users see errors, but fast errors, and your service has the capacity to serve requests that *don't* depend on the sick downstream.
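The difference is easy to quantify as thread time. The arithmetic below reuses the numbers from the scenario and assumes, purely for illustration, that a fast-failed request costs about one millisecond; `thread_ms` is a hypothetical helper, not a measurement.

```python
def thread_ms(requests, timeout_ms, trip_after=None, fast_fail_ms=1):
    """Total thread time (in ms) spent on calls to a hung dependency.

    trip_after=None models timeouts only. Otherwise the breaker opens after
    that many slow failures and every later request fails in `fast_fail_ms`.
    """
    if trip_after is None or trip_after >= requests:
        return requests * timeout_ms
    slow = trip_after * timeout_ms
    fast = (requests - trip_after) * fast_fail_ms
    return slow + fast


print(thread_ms(200, 30_000))                 # 6000000 ms = 6,000 thread-seconds
print(thread_ms(200, 30_000, trip_after=20))  # 600180 ms, about 600 thread-seconds
```

Under these assumptions the breaker cuts blocked thread time by roughly a factor of ten, and almost all of what remains is spent on the 20 calls that had to fail slowly before the breaker could trip.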

The breaker is not making your downstream healthier. It is preventing your downstream's sickness from infecting you. That distinction is the whole pattern.

What a circuit breaker is not

The pattern gets confused with three neighbors.

It is not a retry policy. Retries amplify load. If the downstream is overloaded, retries make it worse. A breaker *stops* retrying. (Retries and circuit breakers compose, but the breaker must sit *outside* the retry loop — never retry through an open breaker.)
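One way to read "the breaker sits outside the retry loop" is sketched below: the breaker wraps the whole retrying operation, so an open breaker rejects the request before the first attempt. `StubBreaker` is a deliberately trivial stand-in for a real breaker, and all names here are hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Hypothetical error raised while the breaker is OPEN."""


class StubBreaker:
    """Stand-in for a real breaker; only the OPEN check matters here."""

    def __init__(self):
        self.is_open = False

    def call(self, func):
        if self.is_open:
            raise CircuitOpenError("fast-fail")
        return func()


def with_retries(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Retry func with exponential backoff; re-raise after the last attempt."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


def guarded_fetch(breaker, fetch, sleep=time.sleep):
    # Breaker outside, retries inside: when the breaker is open, the retry
    # loop never starts, so no attempt reaches the dependency.
    return breaker.call(lambda: with_retries(fetch, sleep=sleep))
```

A side effect of this ordering is that a real breaker would count one failure per exhausted retry sequence rather than one per attempt, which is worth keeping in mind when choosing the failure threshold.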

It is not a rate limiter. Rate limiters protect the downstream from the caller. Circuit breakers protect the caller from the downstream. They are mirror images. The same Envoy proxy can run both — the limit on inbound traffic and the outlier detection on outbound calls are configured in the same file but solving opposite problems.

It is not a load shedder. Load shedding (dropping requests because *you* are overloaded) and circuit breaking (dropping requests because *they* are sick) share the symptom — fast failure — but the trigger is different. Load shedding responds to your own queue depth; circuit breaking responds to observed downstream health. Some advanced systems cross-reference both signals, but they remain conceptually distinct.

The case for half-open pessimism

Here is a counter-intuitive design point worth lingering on. Most implementations make HALF_OPEN *optimistic*: send a few probes; if they succeed, return to CLOSED. This treats recovery as binary. In practice, recovery is partial.

A database that just failed over has a new primary that may not be warm. A cache cluster that just restarted is rebuilding indexes. The first few requests will be slow, not failed. A breaker that treats slow-but-successful probes as "fully recovered" closes, releases full traffic, and can re-trip almost immediately, thrashing the dependency at its most fragile moment.

Some libraries address this with a rolling HALF_OPEN: instead of flipping to CLOSED after a fixed number of successes, they gradually increase the proportion of traffic admitted back to the dependency. Envoy's outlier detection works in a similar spirit at the host level: it ejects individual unhealthy hosts and caps the share of a cluster that may be ejected at once (max_ejection_percent), rather than cutting off the whole cluster. The principle is the same: don't trust the first signal.
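A gradual re-admission policy can be sketched as a stepped ramp. This is an illustration of the idea, not the behavior of any particular library: `RampingAdmission`, the step sizes, and the successes-per-step count are all assumptions chosen for the example.

```python
import random


class RampingAdmission:
    """Admit a growing share of traffic after recovery instead of all at once."""

    def __init__(self, steps=(0.1, 0.25, 0.5, 1.0), successes_per_step=5,
                 rng=random.random):
        self.steps = steps
        self.successes_per_step = successes_per_step
        self.rng = rng              # injectable for deterministic tests
        self.level = 0              # index into steps
        self.successes = 0

    def admit(self):
        # True: send the request downstream. False: fast-fail or fall back.
        return self.rng() < self.steps[self.level]

    def on_success(self):
        self.successes += 1
        at_top = self.level == len(self.steps) - 1
        if self.successes >= self.successes_per_step and not at_top:
            self.level += 1
            self.successes = 0

    def on_failure(self):
        # Any failure drops back to the smallest share; a fuller version
        # would reopen the breaker here.
        self.level = 0
        self.successes = 0

    def fully_recovered(self):
        return self.steps[self.level] >= 1.0
```

The dependency sees 10% of traffic, then 25%, then 50%, and only then the full load, which gives caches and connection pools time to warm between steps.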

This is not universal advice. Strict binary transitions are simpler and adequate for many workloads. But if your downstream has expensive warm-up — JIT compilation, cache priming, connection-pool growth — the optimistic HALF_OPEN will punish the dependency precisely when it is most vulnerable.

Where the breaker stops being enough

The pattern has a load-bearing assumption: that the breaker's caller can do something useful with a fast-failure. If the *only* path through your service requires the downstream, fast-failure is just a faster 500. You have not improved the user experience; you have just made the failure more honest.

This is where fallbacks enter the picture — and where the pattern shades into broader resilience engineering. A degraded response (cached data, a simplified version of the feature, a static page) is often more useful than an error. The breaker enables the fallback by freeing the request thread quickly enough to attempt one.
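A fallback on an open breaker can be as simple as serving the last good response. The sketch below assumes a hypothetical recommendations endpoint; `get_recommendations`, `fetch_live`, the `CircuitOpenError` type, and the dict-based cache are illustrative names, not part of any library.

```python
class CircuitOpenError(Exception):
    """Hypothetical error raised by a breaker while it is OPEN."""


def get_recommendations(user_id, fetch_live, cache):
    """Serve live data when possible, the last good copy when not."""
    try:
        fresh = fetch_live(user_id)     # breaker-guarded downstream call
    except CircuitOpenError:
        # Fast failure: the thread is free right away to build a degraded reply.
        return {"items": cache.get(user_id, []), "degraded": True}
    cache[user_id] = fresh              # remember the last good response
    return {"items": fresh, "degraded": False}
```

Marking the response as degraded lets the caller decide whether stale data is acceptable for its purpose, rather than hiding the outage entirely.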

Some teams use the breaker as a *signal*, not just a gate. When the breaker opens, an alert fires. When it stays open for more than N seconds, paging happens. The breaker becomes an instrument for incident response, not just a runtime guard. This is legitimate, but it should be an explicit choice, not an accident. A breaker that flaps and pages every five minutes trains operators to ignore it.
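The alert-then-page policy can be made explicit in a small monitor that consumes breaker state changes. This is a sketch under assumptions: `BreakerMonitor`, its return values, and the 60-second paging delay are hypothetical, and a real setup would route these signals through whatever alerting system is already in place.

```python
class BreakerMonitor:
    """Alert once when the breaker opens; page once if it stays unhealthy."""

    def __init__(self, page_after_s=60.0):
        self.page_after_s = page_after_s
        self.opened_at = None       # start of the current outage, if any
        self.paged = False

    def on_state_change(self, new_state, now):
        """Return "alert" when a new outage starts, else None."""
        if new_state == "OPEN" and self.opened_at is None:
            self.opened_at = now
            return "alert"
        if new_state == "CLOSED":
            # Only a full recovery ends the outage; OPEN/HALF_OPEN flapping
            # in between does not produce repeat alerts.
            self.opened_at = None
            self.paged = False
        return None

    def tick(self, now):
        """Call periodically; return "page" once per continuous outage."""
        if (self.opened_at is not None and not self.paged
                and now - self.opened_at >= self.page_after_s):
            self.paged = True
            return "page"
        return None
```

Tying the page to one continuous outage, rather than to every OPEN transition, is one way to keep a flapping breaker from generating the noise that operators learn to tune out.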

---

The circuit breaker pattern is small enough to implement in a weekend, and deep enough to spend a career tuning. The next chapter takes the abstract state machine and applies it to a real production failure — the kind of night where the abstractions either earn their complexity or get ripped out in the postmortem.

---

References:

  • Michael Nygard, *Release It!* (2nd ed., Pragmatic Programmers, 2018) — Chapter 5, "Stability Patterns," introduces the circuit breaker in the form most engineers still learn from.
  • Martin Fowler, "CircuitBreaker," martinfowler.com, 2014 — the canonical bliki entry that pulled the pattern from Nygard's book into the wider distributed-systems vocabulary.
  • Netflix Technology Blog, "Fault Tolerance in a High Volume, Distributed System" (Feb 2012) — describes Hystrix as it was being deployed, before the Christmas Eve incident.
  • Resilience4j documentation (resilience4j.readme.io) — modern Java implementation with explicit configuration of failure-rate thresholds, slow-call duration thresholds, and minimum-volume floors.
  • Envoy proxy outlier detection documentation (envoyproxy.io) — implementation of circuit-breaking semantics as a sidecar proxy, including split-ejection logic for partial recovery.

---

The state machine is one diagram. The configuration is three numbers. The question that separates a working breaker from a useless one is whether you understand what your downstream looks like while it is dying.