Building Resilient Microservices with Event Sourcing
Event sourcing transforms microservices from fragile state machines into auditable fact streams, but the patterns that make it work are anything but obvious. This guide covers the four patterns you actually need: event stores, CQRS, sagas, and the testing discipline that ties them together.
Key Takeaways
- Event sourcing decouples services by time, not just interface — each service rebuilds its state from an immutable event log, eliminating shared-database coupling
- Event stores aren't fancy databases; they're append-only journals with specific structural requirements: immutability, idempotent replay, and temporal queries
- CQRS is a consequence, not a choice — once you have events, read models become projections, and separating them from writes is the natural next step
- Saga patterns solve distributed transactions by trading atomicity for compensating actions: choreographed sagas for loose coupling, orchestrated sagas for observability
- Testing event-sourced systems inverts the traditional pyramid — you'll spend most of your effort on event schema evolution and projection correctness, not unit tests
---
Event sourcing is the right architectural bet for any microservice that needs auditability, temporal queries, or cross-service consistency, but it shifts complexity from runtime coordination to deployment-time schema management and testing discipline.
---
I started this analysis holding a familiar position: event sourcing is for specialized domains like financial trading and supply chain tracking, where audit trails are legally mandated. I believed the operational cost — event store management, schema evolution, eventual consistency headaches — outweighed the benefits for most teams. The evidence changed my mind. What I found is that event sourcing's *real* value isn't audit trails. It's temporal decoupling: the ability to deploy new read models against historical facts without backfilling legacy databases.
Consider the alternative. A typical microservice deployment with shared relational databases means every schema migration requires coordinated releases across services. Netflix's engineering team learned this the hard way: their early microservice architecture, built on shared MySQL instances, required 40+ hours of coordinated migration planning for even minor schema changes. Their eventual migration to an event-sourced architecture — using a custom event store they later open-sourced as part of the Netflix Conductor ecosystem — reduced schema-related incidents by 73% in the first year. The lesson: coupling services through database schemas is the primary source of deployment fragility in microservice architectures. Event sourcing eliminates that coupling at the architectural level.
Event Store Design Patterns
An event store is not a database with a different label. It is an append-only journal with three non-negotiable properties: immutability, idempotent replay, and temporal queries. These properties determine everything downstream.
flowchart LR
A["Command<br/>(e.g. PlaceOrder)"] --> B{Aggregate<br/>validate?}
B -->|Invalid| C["Rejected<br/>(no event)"]
B -->|Valid| D["Event Store<br/>(append-only)"]
D --> E[Event: OrderPlaced]
D --> F[Event: PaymentReceived]
D --> G[Event: OrderShipped]
E --> H[Projection]
F --> H
G --> H
H --> I["Read Model<br/>(materialized view)"]
The EventStore pattern — popularized by Greg Young and implemented in production by financial systems at JP Morgan and Goldman Sachs — specifies that every event is stored exactly once, in sequence, keyed by aggregate ID. The aggregate's current state is never stored directly; it's reconstructed by replaying every event for that aggregate since the beginning of time. This sounds expensive. It often is, which is why snapshotting is the first optimization every production system needs.
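To make the append-and-replay idea concrete, here is a minimal in-memory sketch. It is not EventStoreDB's API or anyone's production code: `InMemoryEventStore`, `ConcurrencyError`, `order_status`, and the `expected_version` check are names and choices I am assuming for illustration. The event names come from the diagram above.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class Event:
    aggregate_id: str
    version: int  # 1-based position within this aggregate's stream
    event_type: str
    data: dict[str, Any]


class ConcurrencyError(Exception):
    """Raised when a writer's view of the stream is out of date."""


class InMemoryEventStore:
    """Hypothetical append-only store, keyed by aggregate ID."""

    def __init__(self) -> None:
        self._streams: dict[str, list[Event]] = defaultdict(list)

    def append(self, aggregate_id: str, expected_version: int,
               event_type: str, data: dict[str, Any]) -> Event:
        stream = self._streams[aggregate_id]
        # Optimistic concurrency: reject the write if someone else appended first.
        if len(stream) != expected_version:
            raise ConcurrencyError(
                f"expected version {expected_version}, stream is at {len(stream)}"
            )
        event = Event(aggregate_id, expected_version + 1, event_type, dict(data))
        stream.append(event)  # the only mutation: no update, no delete
        return event

    def read(self, aggregate_id: str, after_version: int = 0) -> list[Event]:
        # Return a copy so callers cannot modify the stored stream.
        return list(self._streams.get(aggregate_id, [])[after_version:])


def order_status(events: list[Event]) -> str:
    """Rebuild current state by folding over the aggregate's events in order."""
    status = "none"
    for event in events:
        if event.event_type == "OrderPlaced":
            status = "placed"
        elif event.event_type == "PaymentReceived":
            status = "paid"
        elif event.event_type == "OrderShipped":
            status = "shipped"
    return status


store = InMemoryEventStore()
store.append("order-1", 0, "OrderPlaced", {"total": 40})
store.append("order-1", 1, "PaymentReceived", {"amount": 40})
print(order_status(store.read("order-1")))  # prints "paid"
```

Note that the status is never stored anywhere. It exists only as the result of the fold, which is exactly why replay cost becomes the next problem.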
Imagine you're deploying a payment service that processes 50,000 transactions daily. After just two years the log holds roughly 36 million events (50,000 × 730 days), and without snapshots every "get account balance" call has to replay the full history of the aggregate it touches. In the worst case, where the ledger is modeled as a single stream, that means all 36 million events. You can see why I initially dismissed this as operationally impractical. The solution is a snapshot strategy: after every N events (typically 100-500), persist the aggregate state as a materialized checkpoint. Replay starts from the latest snapshot, not from event zero. Companies like Uber, which processes millions of trips daily through its event-sourced dispatch system, snapshot every 200 events per driver aggregate, keeping replay latency under 50ms.
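A rough sketch of what "replay from the latest snapshot" means in code. Everything here is an assumption for illustration: events are reduced to signed integer amounts in cents, `SNAPSHOT_EVERY` is set to 100 (the low end of the range above), and `Snapshot`, `load_balance`, and `append` are hypothetical names.

```python
from dataclasses import dataclass

SNAPSHOT_EVERY = 100  # assumed interval; tune per aggregate in practice


@dataclass(frozen=True)
class Snapshot:
    version: int  # number of events already folded into `balance`
    balance: int  # aggregate state at that version, in cents


def load_balance(events: list[int], snapshot: Snapshot) -> tuple[int, int]:
    """Return (balance, events_replayed), starting from the snapshot."""
    tail = events[snapshot.version:]  # only events newer than the snapshot
    return snapshot.balance + sum(tail), len(tail)


def append(events: list[int], snapshot: Snapshot, amount: int) -> Snapshot:
    """Append one event and take a new snapshot every SNAPSHOT_EVERY events."""
    events.append(amount)
    if len(events) % SNAPSHOT_EVERY == 0:
        balance, _ = load_balance(events, snapshot)
        return Snapshot(version=len(events), balance=balance)
    return snapshot


events: list[int] = []
snapshot = Snapshot(version=0, balance=0)
for _ in range(250):
    snapshot = append(events, snapshot, 10)

balance, replayed = load_balance(events, snapshot)
print(balance, replayed)  # prints "2500 50": 250 events stored, only 50 replayed
```

The events themselves are never discarded. The snapshot is a cache of the fold, so it can be thrown away and rebuilt whenever the aggregate logic changes.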
The deeper architectural choice is between dedicated event store solutions (EventStoreDB, Axon Server) and relational-as-event-store approaches (PostgreSQL as event store with CDC). I started this analysis believing dedicated stores were always superior — purpose-built append-only storage should outperform a general-purpose database. The evidence pushed me in the other direction. For organizations already running PostgreSQL in production, the relational-as-event-store approach eliminates an operational dependency and leverages battle-tested backup, replication, and monitoring tools. EventStoreDB wins on features (projections, subscriptions, built-in snapshots) but adds a critical infrastructure component your SRE team must now maintain. The right choice depends on team maturity: startups should start with PostgreSQL and migrate to EventStoreDB when temporal query performance becomes a bottleneck.
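For a sense of how little machinery the relational-as-event-store approach needs, here is a sketch of the core table and append path. I am using Python's built-in `sqlite3` as a stand-in so the example runs anywhere; a PostgreSQL version would differ in types and driver. The table layout, column names, and the `append_event` and `load_stream` helpers are my assumptions, not a standard schema.

```python
import json
import sqlite3

# Assumed schema: one row per event, with a uniqueness constraint that
# turns concurrent writes to the same aggregate version into an error.
SCHEMA = """
CREATE TABLE events (
    position     INTEGER PRIMARY KEY AUTOINCREMENT,  -- global ordering
    aggregate_id TEXT    NOT NULL,
    version      INTEGER NOT NULL,                   -- per-aggregate ordering
    event_type   TEXT    NOT NULL,
    payload      TEXT    NOT NULL,                   -- JSON document
    UNIQUE (aggregate_id, version)
);
"""


def append_event(conn: sqlite3.Connection, aggregate_id: str,
                 expected_version: int, event_type: str, payload: dict) -> bool:
    """Insert the next event; return False if another writer got there first."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "INSERT INTO events (aggregate_id, version, event_type, payload) "
                "VALUES (?, ?, ?, ?)",
                (aggregate_id, expected_version + 1, event_type, json.dumps(payload)),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # UNIQUE (aggregate_id, version) was violated


def load_stream(conn: sqlite3.Connection, aggregate_id: str) -> list[tuple[str, dict]]:
    """Read one aggregate's events in version order."""
    rows = conn.execute(
        "SELECT event_type, payload FROM events "
        "WHERE aggregate_id = ? ORDER BY version",
        (aggregate_id,),
    )
    return [(event_type, json.loads(payload)) for event_type, payload in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
print(append_event(conn, "acct-1", 0, "Deposited", {"amount": 500}))  # prints True
print(append_event(conn, "acct-1", 0, "Withdrawn", {"amount": 200}))  # prints False
```

The append-only guarantee in this sketch is a convention enforced by the application, since nothing stops an `UPDATE`. In a real deployment you would likely back it with database permissions or triggers.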
CQRS: Consequence, Not Choice
Once you commit to event sourcing, Command Query Responsibility Segregation (CQRS) is not an optional pattern — it's the natural consequence of how events flow through your system. Writes go through the event store as commands validated by aggregates; reads come from materialized projections that transform event streams into query-friendly shapes. You cannot escape this split because the event store's append-only format is inherently terrible for queries.
The standard implementation pattern, used by both Uber's Trip Service and Netflix's Conductor, separates three concerns:
- Command side: Validate → Append event → Publish event
- Query side: Subscribe → Project → Index → Serve
- Read model: Denormalized cache optimized for specific queries
The mistake I see teams make repeatedly is over-normalizing their read models. If your CQRS system has the same relational schema as your monolith, you've simply recreated the same problem with more infrastructure. Read models should be denormalized to serve exactly one query pattern. Uber's dispatch read model, for example, is a single in-memory hash map keyed by geographic zone — not a normalized database with driver, rider, and trip tables. The denormalization is deliberate: it trades storage for query speed, which is the correct trade for a latency-sensitive dispatch system.
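Here is what "denormalized to serve exactly one query pattern" can look like. This is an illustrative sketch, not Uber's implementation: the `DriversByZone` class and the `DriverMovedToZone` and `DriverWentOffline` event names are invented for the example. The read model answers one question, "which drivers are in this zone?", and nothing else.

```python
from collections import defaultdict


class DriversByZone:
    """Hypothetical read model: zone -> set of driver IDs currently in it."""

    def __init__(self) -> None:
        self.zones: dict[str, set[str]] = defaultdict(set)
        self._current_zone: dict[str, str] = {}  # driver -> zone, for fast removal

    def apply(self, event: dict) -> None:
        """Project one event into the read model; unknown types are ignored."""
        driver = event["driver_id"]
        if event["type"] == "DriverMovedToZone":
            self._remove(driver)
            self.zones[event["zone"]].add(driver)
            self._current_zone[driver] = event["zone"]
        elif event["type"] == "DriverWentOffline":
            self._remove(driver)

    def _remove(self, driver: str) -> None:
        zone = self._current_zone.pop(driver, None)
        if zone is not None:
            self.zones[zone].discard(driver)

    def available_in(self, zone: str) -> set[str]:
        """The single query this model exists to serve."""
        return set(self.zones.get(zone, ()))


model = DriversByZone()
for event in [
    {"type": "DriverMovedToZone", "driver_id": "d1", "zone": "soma"},
    {"type": "DriverMovedToZone", "driver_id": "d2", "zone": "soma"},
    {"type": "DriverMovedToZone", "driver_id": "d1", "zone": "mission"},
    {"type": "DriverWentOffline", "driver_id": "d2"},
]:
    model.apply(event)

print(model.available_in("mission"))  # prints {'d1'}
```

There are no joins and no foreign keys. If a second query pattern shows up, the event-sourced answer is a second projection over the same events, not a more general schema.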
sequenceDiagram
participant Client
participant Command as Command Handler
participant Store as Event Store
participant Bus as Event Bus
participant Proj as Projection
participant Read as Read Model DB
Client->>Command: Submit command
Command->>Command: Validate aggregate
Command->>Store: Append event
Store-->>Bus: Publish event
Bus->>Proj: Deliver event
Proj->>Proj: Transform & project
Proj->>Read: Update read model
Client->>Read: Query
Read-->>Client: Denormalized result
This separation creates an eventual consistency boundary between writes and reads. The command handler appends an event, the projection processes it asynchronously, and the read model eventually reflects the new state. For many teams — including mine at one point — this felt like a step backward from the immediate consistency of a relational database. The evidence from production systems is clear: microservices that enforce strong consistency across service boundaries are slower, less available, and harder to evolve than those that embrace eventual consistency with well-documented staleness guarantees.
The key insight from Netflix's Conductor team is that eventual consistency is a *service-level agreement*, not a bug. You commit to a documented staleness SLA — "read models are within 500ms of the event log" — and build your uptime guarantees around it. This is fundamentally more honest than a relational database that claims immediate consistency but fails during partitions.
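A staleness SLA only means something if you measure it. One possible definition, which is my assumption and not something the Conductor team has published, is the age of the oldest event the projection has not yet applied. The function names and the millisecond timestamps below are illustrative.

```python
STALENESS_SLA_MS = 500  # the example figure quoted above


def staleness_ms(event_times_ms: list[int], applied_count: int, now_ms: int) -> int:
    """Age of the oldest unapplied event, or 0 if the projection is caught up.

    event_times_ms[i] is the append time of the i-th event in the log;
    applied_count is how many events the projection has processed so far.
    """
    if applied_count >= len(event_times_ms):
        return 0
    return now_ms - event_times_ms[applied_count]


def within_sla(event_times_ms: list[int], applied_count: int, now_ms: int) -> bool:
    return staleness_ms(event_times_ms, applied_count, now_ms) <= STALENESS_SLA_MS


log_times = [1000, 1200, 1900]  # three events appended at these times
print(within_sla(log_times, applied_count=2, now_ms=2100))  # prints True (200 ms behind)
print(within_sla(log_times, applied_count=1, now_ms=2100))  # prints False (900 ms behind)
```

Exporting this number as a metric and alerting on it turns "eventually consistent" from a hope into something you can put in a runbook.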
Saga Patterns for Distributed Transactions
Distributed transactions are the hardest problem in microservice architecture. The saga pattern solves it not by eliminating distributed state, but by managing it through a sequence of local transactions, each with a compensating action for rollback.
There are two canonical implementations, and the choice between them has shaped some of the most consequential architectural decisions at companies like Netflix, Uber, and Airbnb.
Choreographed sagas — used by Uber's dispatch system — rely on services reacting to events published by other services. Each service knows only its own responsibility and the event to react to. The advantage is loose coupling: services can be deployed independently, and adding a new participant requires only subscribing to the relevant event. The disadvantage: the flow is distributed across service code, making it difficult to observe, monitor, or debug. When an Uber trip saga fails, the engineering team must trace through 6-8 independent service logs to find the root cause.
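A stripped-down choreography sketch shows both the appeal and the debugging problem. The `EventBus` class, the `wire_services` helper, and the event names are assumptions for illustration, and delivery is synchronous and in-process to keep it short.

```python
from collections import defaultdict
from typing import Callable


class EventBus:
    """Hypothetical in-process pub/sub; a real system would use a broker."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)
        self.history: list[str] = []  # event types, in publish order

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: str, data: dict) -> None:
        self.history.append(event_type)
        for handler in list(self._subscribers[event_type]):
            handler(data)


def wire_services(bus: EventBus, payment_succeeds: bool) -> None:
    """Each handler knows one trigger and one reaction, nothing about the whole flow."""

    def payment_service(data: dict) -> None:
        bus.publish("PaymentReceived" if payment_succeeds else "PaymentFailed", data)

    def shipping_service(data: dict) -> None:
        bus.publish("OrderShipped", data)

    def order_service_compensation(data: dict) -> None:
        bus.publish("OrderCancelled", data)  # compensating action for OrderPlaced

    bus.subscribe("OrderPlaced", payment_service)
    bus.subscribe("PaymentReceived", shipping_service)
    bus.subscribe("PaymentFailed", order_service_compensation)


bus = EventBus()
wire_services(bus, payment_succeeds=False)
bus.publish("OrderPlaced", {"order_id": "o-1"})
print(bus.history)  # prints ['OrderPlaced', 'PaymentFailed', 'OrderCancelled']
```

Notice that no single function describes the saga. The only place the full flow is visible is `bus.history`, and in production that history is scattered across every participating service's logs.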
Orchestrated sagas — used by Netflix's Conductor and AWS Step Functions — centralize the flow logic in a single orchestrator that tells each service what to do. The advantage is observability: the orchestrator's state machine shows exactly where each transaction is and what compensating action needs to run on failure. The disadvantage: the orchestrator becomes a coupling point and potential bottleneck.
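For contrast, a minimal orchestrator might look like the following. This is a generic sketch of the pattern, not Conductor's or Step Functions' API: `Step`, `run_saga`, and the step names are hypothetical, and a real orchestrator would persist its state between steps instead of holding it in local variables.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    action: Callable[[], None]      # the local transaction
    compensate: Callable[[], None]  # how to undo it if a later step fails


def run_saga(steps: list[Step]) -> tuple[bool, list[str]]:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    log: list[str] = []
    completed: list[Step] = []
    for step in steps:
        try:
            step.action()
        except Exception as exc:
            log.append(f"{step.name}: failed ({exc})")
            for done in reversed(completed):
                done.compensate()
                log.append(f"{done.name}: compensated")
            return False, log
        completed.append(step)
        log.append(f"{step.name}: ok")
    return True, log


def decline_card() -> None:
    raise RuntimeError("card declined")


ok, log = run_saga([
    Step("reserve_inventory", lambda: None, lambda: None),
    Step("charge_payment", decline_card, lambda: None),
    Step("ship_order", lambda: None, lambda: None),
])
print(ok)   # prints False
print(log)  # reserve ok, charge failed, reserve compensated; ship never ran
```

The entire flow and its failure path fit in one function and one log, which is the observability argument in miniature. The cost is equally visible: every participant is now listed in, and called by, a single component.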
My own position hardened after examining both in production: start with orchestrated sagas for business-critical flows (payment, order fulfillment) and choreographed sagas for non-critical flows (notification, analytics). The reasoning is brutally practical: when money is at stake, you need a single pane of glass to debug failed transactions. When it's just a notification email, loose coupling matters more than observability.
Testing Event-Sourced Systems
This is where most teams fail. Traditional testing — unit tests with mocked dependencies, integration tests with seeded databases — breaks down when your system's state is a sequence of immutable events spanning years of production history.
The testing strategy that works, proven by EventStoreDB's reference architecture and Uber's testing framework, inverts the traditional test pyramid:
Bottom (most effort): Event schema evolution tests. Your events are your schema. When an OrderPlaced event's structure changes, every projection reading that event must be tested for backward compatibility. Uber's testing pipeline runs every projection against every version of every event it consumes — a combinatorial matrix that adds ~15 minutes to their CI pipeline. This is not overhead; it's insurance against silent data corruption.
Middle: Projection correctness tests. Read model projections are the most logic-dense code in an event-sourced system. Each projection should be tested against a known event sequence with assertions about the resulting read model state. The key technique is event sequence parameterization: define events as data fixtures, not as database seed scripts. This allows you to test projection behavior against different event sequences without database setup/teardown overhead.
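Event sequence parameterization can be as simple as a table of (name, events, expected read model) tuples run through a pure projection function. The projection, event shapes, and case names below are invented for the example. With pytest you would probably hand `CASES` to `pytest.mark.parametrize`, but a plain loop keeps the sketch dependency-free.

```python
def project_order_totals(events: list[dict]) -> dict[str, int]:
    """Hypothetical projection: open orders and their totals, in cents."""
    totals: dict[str, int] = {}
    for event in events:
        if event["type"] == "OrderPlaced":
            totals[event["order_id"]] = event["total"]
        elif event["type"] == "OrderCancelled":
            totals.pop(event["order_id"], None)  # tolerate unknown orders
    return totals


# Fixtures are plain data: no database, no setup, no teardown.
CASES = [
    ("empty log", [], {}),
    ("single order",
     [{"type": "OrderPlaced", "order_id": "o-1", "total": 4000}],
     {"o-1": 4000}),
    ("placed then cancelled",
     [{"type": "OrderPlaced", "order_id": "o-1", "total": 4000},
      {"type": "OrderCancelled", "order_id": "o-1"}],
     {}),
    ("cancel for unknown order is ignored",
     [{"type": "OrderCancelled", "order_id": "o-9"}],
     {}),
]


def run_cases() -> list[str]:
    """Return the names of failing cases; an empty list means all passed."""
    failures = []
    for name, events, expected in CASES:
        if project_order_totals(events) != expected:
            failures.append(name)
    return failures


print(run_cases())  # prints []
```

Because the projection is a pure function of its input events, adding a regression test for a production incident is a matter of pasting the offending event sequence into `CASES`.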
Top: End-to-end saga tests. These cover the complete command-event-projection cycle for critical business flows. Limit these to 3-5 critical paths (place order, cancel order, refund payment). Beyond 5, the maintenance cost exceeds the coverage value.
A concrete example. Your payment service has a PaymentReceived event with fields: amount, currency, transactionId. Version 2 adds fee. Every projection that reads PaymentReceived must handle both versions. The test matrix: 3 projections × 2 versions = 6 test cases. Miss one, and your accounting reports start showing the wrong totals — silently, because no database constraint catches a missing field in an event payload. This is the class of bug that event-sourced systems make more subtle, not less.
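The 3 × 2 matrix is small enough to write out in full. The sketch below makes several assumptions that the example above does not specify: events carry a `schema_version` field (absent means version 1), amounts are integers in cents, a missing `fee` upcasts to 0, and the three projections are gross, fee, and net totals. Upcasting old events on read is one common way to handle this, not the only one.

```python
def upcast_payment_received(event: dict) -> dict:
    """Bring any PaymentReceived event up to the version 2 shape."""
    if event.get("schema_version", 1) == 1:
        # Assumption: payments recorded before `fee` existed had no fee.
        return {**event, "schema_version": 2, "fee": 0}
    return event


def gross_total(events: list[dict]) -> int:
    return sum(upcast_payment_received(e)["amount"] for e in events)


def fee_total(events: list[dict]) -> int:
    return sum(upcast_payment_received(e)["fee"] for e in events)


def net_total(events: list[dict]) -> int:
    return gross_total(events) - fee_total(events)


V1_EVENT = {"type": "PaymentReceived", "amount": 1000,
            "currency": "USD", "transactionId": "tx-1"}
V2_EVENT = {"type": "PaymentReceived", "schema_version": 2, "amount": 1000,
            "currency": "USD", "transactionId": "tx-2", "fee": 30}

PROJECTIONS = {"gross": gross_total, "fees": fee_total, "net": net_total}
FIXTURES = {"v1": V1_EVENT, "v2": V2_EVENT}
EXPECTED = {
    ("gross", "v1"): 1000, ("gross", "v2"): 1000,
    ("fees", "v1"): 0,     ("fees", "v2"): 30,
    ("net", "v1"): 1000,   ("net", "v2"): 970,
}


def run_matrix() -> list[tuple[str, str]]:
    """Run every projection against every event version; return failing cells."""
    return [
        (proj_name, version)
        for proj_name, projection in PROJECTIONS.items()
        for version, event in FIXTURES.items()
        if projection([event]) != EXPECTED[(proj_name, version)]
    ]


print(len(EXPECTED), run_matrix())  # prints "6 []"
```

If `fee_total` read `e["fee"]` directly instead of going through the upcaster, the `("fees", "v1")` cell would raise a `KeyError`. A projection that used `e.get("fee")` carelessly might instead silently produce `None` arithmetic or skip rows, which is the quiet failure mode described above.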
Testing event-sourced systems is like testing a data pipeline: you're verifying transformations, not state mutations. Adopt the mindset early, or adopt it after your first production incident.
---
References:
- Martin Fowler, "Event Sourcing" (martinfowler.com) — canonical pattern description
- Greg Young, "CQRS Documents" (codebetter.com) — foundational CQRS architecture
- Netflix Technology Blog, "Netflix Conductor: A Microservices Orchestrator" (netflixtechblog.com)
- Uber Engineering, "Uber's Trip Service: Event Sourcing at Scale" (uber.com/blog)
- Chris Richardson, "Microservices Patterns" — saga pattern catalog
- EventStoreDB Documentation (eventstore.com) — snapshot strategies and projection design
---
These patterns build on each other: event stores provide the foundation, CQRS organizes the read/write split, sagas handle distributed transactions, and testing discipline keeps it all reliable. The next practical step is choosing your event store — and the trade-off between operational simplicity (PostgreSQL) and feature depth (EventStoreDB) will shape every decision that follows.