learning path

Idempotency Twin


1. The Circuit Breaker Pattern

The Circuit Breaker Pattern: A State Machine Between Your Service and a Slow Neighbor

A circuit breaker does not prevent failures. It turns catastrophic cascades into bounded, recoverable blips by refusing new calls before they ever reach the struggling dependency.

Key Takeaways

A circuit breaker is a state machine wrapping a remote dependency: CLOSED (normal), OPEN (fast-fail), HALF_OPEN (probe a few requests).
Its job is to protect the caller's thread pool and user latency, not to heal the downstream. Healing is someone else's problem.
Three numbers matter more than the library: failure threshold, sleep window, and trial-request count. Misconfigure any of them and the breaker either chatters or sleeps through real outages.
The counter-intuitive insight: a circuit breaker is most valuable during partial failure, not total outage. When the dependency is fully dead, you'd find out anyway. The breaker shines when 1 in 5 calls hangs for 30 seconds.
The 2012 Netflix Christmas Eve outage is the canonical proof point — but the breaker only earned its keep in *subsequent* failures, not the one that motivated it.

---

A request enters your service. It calls a downstream API. The downstream is slow — not dead, slow. Each call takes 30 seconds before timing out. Within two minutes, your inbound request queue is full. Your HTTP threads are blocked. New requests pile up at the load balancer. Health checks start failing. The orchestrator kills your pod. The load balancer marks you unhealthy. You are now the failure — not the downstream.

I used to think the lesson here was "add a timeout." It is not. A timeout tells you how long to wait before giving up. It does nothing about the 499 in-flight requests already burning your threads. The circuit breaker pattern exists to address what timeouts cannot: the difference between a slow dependency and a dead dependency is invisible from the caller until the caller is already on fire.
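The arithmetic behind that claim can be sketched with Little's law (requests in flight ≈ arrival rate × time each request is held). Every number below is an illustrative assumption, not a measurement: 50 requests per second, a 200-thread pool, 50 ms healthy latency, and a 30-second timeout.

```python
def in_flight(arrival_rate_per_s: float, latency_s: float) -> float:
    """Little's law: average number of requests held concurrently."""
    return arrival_rate_per_s * latency_s


def seconds_until_exhausted(pool_size: int, arrival_rate_per_s: float) -> float:
    """Time for an empty pool to fill when every new call hangs until timeout."""
    return pool_size / arrival_rate_per_s


# Assumed workload (hypothetical values chosen for illustration)
RATE = 50.0      # inbound requests per second
POOL = 200       # request threads available
HEALTHY = 0.05   # seconds per call when the dependency is well
TIMEOUT = 30.0   # seconds a hung call holds its thread

healthy_threads = in_flight(RATE, HEALTHY)          # threads busy on a good day
hung_threads = in_flight(RATE, TIMEOUT)             # threads needed if every call hangs
time_to_full = seconds_until_exhausted(POOL, RATE)  # seconds until the pool is gone

# Partial failure: 1 in 5 calls hangs, the rest stay fast.
mean_latency = 0.2 * TIMEOUT + 0.8 * HEALTHY
partial_threads = in_flight(RATE, mean_latency)

if __name__ == "__main__":
    print(f"healthy: {healthy_threads:.1f} threads busy")
    print(f"all calls hang: {hung_threads:.0f} threads needed, pool full in {time_to_full:.0f}s")
    print(f"1 in 5 hangs: {partial_threads:.0f} threads needed")
```

Under these assumptions the timeout is doing its job on every single call, and the pool is still exhausted within seconds. Even the partial-failure case asks for roughly 300 threads from a 200-thread pool.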

The pattern, in one sentence

A circuit breaker sits in front of a remote dependency and tracks recent failure rates. When failures cross a threshold, it stops calling the dependency entirely and returns a fast error to new requests. After a cooldown, it lets a few requests through to probe whether the dependency has recovered. If they succeed, it resumes normal traffic. If they fail, it goes back to fast-failing.

That is the entire pattern. Everything else — sliding windows, exponential backoff on the half-open probe, thread-pool isolation, fallback chains — is engineering refinement, not conceptual content.

The state machine

Three states, four state-changing transitions, and a counter.

stateDiagram-v2
    [*] --> CLOSED
    CLOSED --> OPEN: failure rate > threshold<br/>(e.g., 50% of last 20 calls)
    OPEN --> HALF_OPEN: sleep window elapsed<br/>(e.g., 30 seconds)
    HALF_OPEN --> CLOSED: trial requests succeed<br/>(e.g., 3 of 3)
    HALF_OPEN --> OPEN: any trial fails
    OPEN --> OPEN: new request arrives<br/>(fast-fail, no call attempted)
    CLOSED --> CLOSED: call succeeds or fails<br/>(counted in window)

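Most circuit breakers you encounter (Resilience4j, Hystrix's modern descendants, Polly) implement this same machine with slightly different defaults. Envoy's outlier detection is a close cousin: it ejects unhealthy hosts from the load-balancing pool rather than wrapping individual calls. The semantics matter more than the library.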

CLOSED is the boring happy path. Calls go through. The breaker counts successes and failures over a sliding window, usually either the last N calls or the last T seconds, depending on how it is configured. When the window's failure rate crosses a threshold (commonly 50%, but the right number is workload-specific), the breaker trips.
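A count-based window is small enough to sketch in full. The class below is a hypothetical illustration, not any library's API. The `min_calls` guard is an added assumption so that a handful of early failures cannot trip the breaker before the window has filled.

```python
from collections import deque


class CountBasedWindow:
    """Failure rate over the most recent `size` call outcomes."""

    def __init__(self, size: int = 20, min_calls: int = 20) -> None:
        # deque(maxlen=...) silently drops the oldest outcome when full.
        self._outcomes: deque = deque(maxlen=size)  # True means the call failed
        self._min_calls = min_calls

    def record(self, failed: bool) -> None:
        self._outcomes.append(failed)

    def failure_rate(self) -> float:
        if not self._outcomes:
            return 0.0
        return sum(self._outcomes) / len(self._outcomes)

    def should_trip(self, threshold: float = 0.5) -> bool:
        # Assumption: do not judge the dependency on too small a sample.
        if len(self._outcomes) < self._min_calls:
            return False
        return self.failure_rate() > threshold


if __name__ == "__main__":
    window = CountBasedWindow(size=20, min_calls=20)
    for failed in [False] * 9 + [True] * 11:
        window.record(failed)
    print(window.failure_rate(), window.should_trip())
```

A time-based window follows the same shape, with timestamps stored alongside each outcome and old entries pruned by age instead of by count.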

OPEN is the fast-fail state. New requests do not touch the dependency. They fail immediately with a circuit-open exception (CallNotPermittedException in Resilience4j, BrokenCircuitException in Polly, or your framework's equivalent). This is the moment the breaker earns its name: it has *broken the circuit* between caller and callee. The breaker sleeps for a configured window (5 seconds, 30 seconds, 60 seconds, depending on context) and then transitions to HALF_OPEN.

HALF_OPEN is the probe state. The breaker lets a small number of trial requests through (commonly 1, 3, or 5). If they all succeed, the dependency is presumed recovered and the breaker returns to CLOSED, resetting its window. If any fail, it goes back to OPEN and restarts the sleep timer.
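The three states described above fit in a few dozen lines. This is a minimal single-threaded sketch under stated assumptions, not a production implementation: the names `CircuitBreaker`, `CircuitOpenError`, and the constructor parameters are hypothetical, there is no locking, and HALF_OPEN does not cap how many trial calls run concurrently.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised instead of calling the dependency while the breaker is OPEN."""


class CircuitBreaker:
    def __init__(self, threshold=0.5, window=20, sleep_s=30.0, trials=3,
                 clock=time.monotonic):
        self.threshold = threshold  # failure rate that trips the breaker
        self.window = window        # number of recent calls considered
        self.sleep_s = sleep_s      # how long to stay OPEN before probing
        self.trials = trials        # successes needed in HALF_OPEN
        self.clock = clock          # injectable so tests can control time
        self.state = State.CLOSED
        self.outcomes = []          # recent results while CLOSED, True = failure
        self.opened_at = 0.0
        self.trial_successes = 0

    def call(self, fn, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.sleep_s:
                raise CircuitOpenError("fast-fail: dependency not called")
            # Sleep window elapsed: let this call through as a probe.
            self.state = State.HALF_OPEN
            self.trial_successes = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        if self.state is State.HALF_OPEN:
            self.trial_successes += 1
            if self.trial_successes >= self.trials:
                self.state = State.CLOSED
                self.outcomes = []  # start again with a fresh window
        else:
            self._record(False)

    def _on_failure(self):
        if self.state is State.HALF_OPEN:
            self._trip()  # any failed trial reopens the circuit
            return
        self._record(True)
        full = len(self.outcomes) == self.window
        if full and sum(self.outcomes) / self.window > self.threshold:
            self._trip()

    def _record(self, failed):
        self.outcomes = (self.outcomes + [failed])[-self.window:]

    def _trip(self):
        self.state = State.OPEN
        self.opened_at = self.clock()  # restart the sleep timer
        self.outcomes = []
```

A caller would wrap each outbound call as `breaker.call(fetch_profile, user_id)` (where `fetch_profile` stands in for any remote call) and treat `CircuitOpenError` as the signal to serve a fallback.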

The interesting design questions live in the transitions, not the states.

The three numbers that matter

Every circuit breaker has parameters, and most engineers treat them as defaults to ignore. They are not.

Failure threshold. What counts as "broken"? 50% over the last 20 calls is a common default. But consider two services at very different scales. A dependency that handles 10,000 requests per minute with a 0.1% failure rate is performing brilliantly; a dependency handling 100 requests per minute with a 5% failure rate may be a few dropped packets away from recovery. A flat percentage threshold cannot distinguish these. The right threshold is a joint function of request volume and acce


2. When a Blackout Became a Brownout

When a Blackout Became a Brownout: Netflix, Christmas Eve 2012, and What the Breaker Actually Bought Them

The circuit breaker did not save Netflix on December 24, 2012. It saved them on every outage after that — and the difference between those two facts is the whole story of why the pattern is worth the engineering.

Key Takeaways

The Netflix Christmas Eve 2012 outage was caused by a failure in AWS's Elastic Load Balancing service, not by Netflix's code. No application-level circuit breaker could have prevented it. It was an infrastructure-layer failure.
The pattern's payoff came in subsequent regional failures: Netflix services that had been retrofitted with Hystrix survived partial AWS degradations that took down unprotected services.
A circuit breaker converts a blackout (every request hangs, callers exhaust, total regional failure) into a brownout (fast errors, degraded features, users see something instead of nothing).
The lesson is not "wrap everything in a breaker." The lesson is "decide which dependencies are allowed to take down the whole experience if they misbehave" — and then enforce that decision with a state machine.
Counter-intuitive result: the most valuable circuit breakers are the ones that protect against dependencies you already consider highly reliable. The 2015 DynamoDB throttling incident hit a service Netflix treated as nearly perfect, and the breakers were what kept "nearly perfect" from meaning "completely unavailable."

---

At 12:24 PM Pacific on December 24, 2012, a maintenance process mistakenly run against production deleted part of the state data behind AWS Elastic Load Balancing in us-east-1, according to AWS's post-incident summary. Over the following hours, a growing share of the region's load balancers were left misconfigured and stopped routing traffic to backends that were, in fact, healthy. Netflix's streaming API, fronted by these ELBs, became unreachable for several hours. It was Christmas Eve. Hundreds of thousands of users trying to watch *It's a Wonderful Life* could not.

I want to start with this story not because it is a circuit breaker success. It is not. The breaker played no role in this outage. Hystrix, the library Netflix had open-sourced in late November 2012, had been introduced publicly only about four weeks earlier and was not yet broadly deployed. The Christmas Eve outage is the motivation, not the validation.

The validation is in what happened the next time AWS us-east-1 had a bad day.

What the outage actually was

To be precise about what Netflix's breakers did and did not do, you have to be precise about the failure mode.

sequenceDiagram
    participant User as User device
    participant ELB as AWS ELB (us-east-1)
    participant API as Netflix API tier
    participant Cache as EVCache (memcached)
    participant S3 as AWS S3

    User->>ELB: GET /catalog
    Note over ELB: 12:24 PM PT: ELB state data deleted
    ELB--xAPI: Routing to backend breaks<br/>(backend is healthy)
    ELB->>API: Evict from pool
    Note over ELB: Cascades across region
    API--xUser: 503 (no backend available)
    Note over API: No application code<br/>is failing — infra layer is

    User->>API: Retry, retry, retry
    Note over User: Total blackout for hours

This was an infrastructure-layer failure. The ELB was the only thing wrong. Netflix's API tier was healthy. Their caches were healthy. S3 was healthy. But the load balancer in front of the API was refusing to forward requests, so the health of everything behind it was moot. No application-level pattern — circuit breaker, retry, fallback — sits *upstream* of an ELB that has evicted your fleet. The blast radius was total.

This is important to internalize before praising the pattern: circuit breakers operate at the boundary between your service and a dependency. If the failure is in your service's own ingress, the breaker cannot help. I have watched multiple teams invest in resilience patterns and then experience an outage caused by their own load balancer, CDN, or DNS — exactly the layer the breakers do not see.

What Hystrix was actually for

About four weeks before the outage, at AWS re:Invent 2012 in late November, Netflix had introduced Hystrix: a library wrapping every outbound dependency call in a circuit breaker, with thread-pool isolation and a fallback chain. The library was not yet widely deployed on December 24. But by mid-2013, it had become the default for any new service joining the Netflix stack.

The architecture that Hystrix enforced looked like this:

flowchart LR
    A[API request thread] -->|acquire| B[Hystrix command<br/>thread pool: 10]
    B -->|HTTP| C[(Downstream service)]
    B -->|success| D[Return result]
    B -->|failure| E[Fast-fail<br/>+ fallback]
    E --> F[Cached response]
    E --> G[Degraded feature]
    E --> H[Default value]

    style B fill:#fff4e1
    style E fill:#ffe1e1
    style C fill:#e1f4ff

The key insight: the API request thread is never blocked indefinitely by the downstream call. Hystrix uses a dedicated thread pool (size 10 is the famous default) per dependency. If the dependency hangs, only those 10 threads are stuck. Once they are all busy, further calls are rejected immediately, and as timeouts and rejections push the error rate past the threshold, the breaker trips and new requests fail fast. The calling API request thread, meanwhile, is free to return a degraded response or fail quickly to its own caller.
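The isolation idea can be sketched in Python with a small dedicated pool per dependency. This is an illustration of the bulkhead-plus-fallback technique, not Hystrix's API: the `Bulkhead` class, its `run` method, and the default timeout are all hypothetical, and the breaker's error-rate tracking is left out to keep the focus on isolation.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Bulkhead:
    """Runs calls to one dependency on its own small pool, with a fallback."""

    def __init__(self, max_concurrent: int = 10, timeout_s: float = 1.0) -> None:
        self._pool = ThreadPoolExecutor(max_workers=max_concurrent)
        # One permit per worker, so a saturated pool rejects instead of queueing.
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._timeout_s = timeout_s

    def run(self, fn, fallback):
        # Every worker busy: answer from the fallback without touching the pool.
        if not self._slots.acquire(blocking=False):
            return fallback()
        future = self._pool.submit(self._guarded, fn)
        try:
            return future.result(timeout=self._timeout_s)
        except Exception:
            # Timeout or dependency error: the caller still gets a degraded answer.
            # A hung worker keeps its permit until the underlying call returns.
            return fallback()

    def _guarded(self, fn):
        try:
            return fn()
        finally:
            self._slots.release()


if __name__ == "__main__":
    catalog = Bulkhead(max_concurrent=10, timeout_s=0.5)
    # fetch and cached are stand-ins for a remote call and a cached response.
    fetch = lambda: ["title-1", "title-2"]
    cached = lambda: ["title-1"]
    print(catalog.run(fetch, cached))
```

The design choice worth noticing is the non-blocking `acquire`: a hung dependency can occupy at most `max_concurrent` workers, and every caller beyond that gets the fallback immediately rather than waiting in a queue.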

The 10-thread default is a real number, not a coincidence. Netflix observed that with 10 threads, a downstream that takes 30 seconds to time out would saturate in 5 seconds of inbound traffic. The user would notice within 5 seconds, not 5 minutes. That window — "user notices something is
