Failure Modes and Resilience Patterns
Anticipating how financial systems fail and building resilience. Circuit breakers, fallbacks, graceful degradation, and operational recovery.
Everything Fails Eventually
The question isn’t if your system will fail; it’s when and how. Payment processors go down. Databases lock up. Network partitions happen. Your job is to make sure failures are contained, visible, and recoverable.
Common Failure Modes in Fintech
Cascading failures: Service A times out waiting for Service B, which is waiting for Service C. With a 30-second timeout at each hop, a request at the front of the chain can block for up to 90 seconds while backed-up requests pile up across all three services.
Resource exhaustion: Database connection pool fills up. New requests can’t get connections. System grinds to a halt even though the database is fine.
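One way to keep pool exhaustion from stalling every request is to bound how long a caller waits for a connection and fail fast past that bound. A minimal sketch (class and parameter names are illustrative, not from any particular driver):

```python
import threading


class BoundedPool:
    """Hand out at most max_size connections; callers that cannot get
    one within acquire_timeout fail fast instead of queueing forever."""

    def __init__(self, connect, max_size=10, acquire_timeout=2.0):
        self._connect = connect          # factory for new connections
        self._sem = threading.Semaphore(max_size)
        self._timeout = acquire_timeout

    def acquire(self):
        if not self._sem.acquire(timeout=self._timeout):
            raise TimeoutError("connection pool exhausted; failing fast")
        return self._connect()

    def release(self, conn):
        # Sketch only: a real pool would recycle conn rather than drop it.
        self._sem.release()
```

A fast `TimeoutError` surfaces the exhaustion in your metrics immediately, instead of hiding it behind requests that hang until clients give up.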
Silent failures: An API returns 200 but the data is wrong. Your system happily processes garbage until someone notices the accounting doesn’t reconcile.
Thundering herd: System comes back online after an outage. Every pending request hits at once. System falls over again.
Circuit Breakers Done Right
Circuit breakers prevent cascading failures. When a downstream service fails repeatedly, stop calling it. Simple concept, tricky implementation.
Key decisions:
- Failure threshold: How many failures before opening the circuit? We use 5 failures in 10 seconds.
- Timeout: How long before trying again? Start with 30 seconds, exponential backoff to 5 minutes.
- Half-open state: Try one request to test if the service recovered. If it succeeds, close the circuit. If it fails, back to open.
Monitor your circuit breakers. If they’re opening frequently, you have a dependency problem.
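The decisions above can be sketched in code. This is a minimal, single-threaded illustration using the thresholds from the text (5 failures in 10 seconds, 30-second timeout doubling to a 5-minute cap, one trial request in half-open); a production breaker would also need locking and metrics:

```python
import time


class CircuitBreaker:
    """Open after `failure_threshold` failures within `window` seconds;
    retry after `base_timeout`, doubling up to `max_timeout`; allow a
    single trial call in the half-open state."""

    def __init__(self, failure_threshold=5, window=10.0,
                 base_timeout=30.0, max_timeout=300.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window = window
        self.base_timeout = base_timeout
        self.max_timeout = max_timeout
        self.clock = clock              # injectable for testing
        self.failures = []              # timestamps of recent failures
        self.state = "closed"           # closed | open | half_open
        self.opened_at = 0.0
        self.timeout = base_timeout

    def call(self, fn, *args, **kwargs):
        now = self.clock()
        if self.state == "open":
            if now - self.opened_at < self.timeout:
                raise RuntimeError("circuit open; failing fast")
            self.state = "half_open"    # allow one trial request through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._record_failure(now)
            raise
        # Success: close the circuit and reset the backoff.
        self.state = "closed"
        self.failures.clear()
        self.timeout = self.base_timeout
        return result

    def _record_failure(self, now):
        if self.state == "half_open":
            # Trial request failed: reopen with a doubled timeout.
            self.state = "open"
            self.opened_at = now
            self.timeout = min(self.timeout * 2, self.max_timeout)
            return
        self.failures = [t for t in self.failures if now - t <= self.window]
        self.failures.append(now)
        if len(self.failures) >= self.failure_threshold:
            self.state = "open"
            self.opened_at = now
```

Exposing `state` and `timeout` as metrics gives you the monitoring signal: a breaker that keeps reopening is a dependency problem, not a breaker problem.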
Graceful Degradation
Not every feature needs to work for the system to be useful. When payment processing fails, you can still show account balances. When reporting is down, you can still process loans.
Build feature flags into your system. When a dependency fails, toggle off dependent features automatically. The system degrades but doesn’t collapse.
For fintech, the hierarchy is:
- Core transactions (payments, loans) must work
- Customer-facing reads can degrade (show cached data, delay reports)
- Internal tools can fail completely if needed
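The automatic toggle can be sketched as a registry that maps features to the dependencies they need; when a dependency is marked down, every dependent feature reports itself disabled. Feature and dependency names here are illustrative:

```python
class FeatureFlags:
    """Map features to required dependencies; a feature is enabled
    only while none of its dependencies are marked down."""

    def __init__(self, dependencies):
        self._deps = dependencies       # feature -> set of dependency names
        self._down = set()

    def dependency_failed(self, dep):
        self._down.add(dep)

    def dependency_recovered(self, dep):
        self._down.discard(dep)

    def enabled(self, feature):
        return not (self._deps.get(feature, set()) & self._down)


flags = FeatureFlags({
    "payments": {"payment_processor"},
    "reports":  {"reporting_db"},
    "balances": {"ledger_db"},
})
flags.dependency_failed("payment_processor")
# payments toggles off automatically; balances keep working
```

Wiring `dependency_failed` to your health checks (or to a circuit breaker opening) is what turns this from a manual kill switch into automatic degradation.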
Testing for Failure
Chaos engineering in production scares people. Start smaller. Run failure simulations in staging:
- Kill random service instances
- Inject latency into database queries
- Return errors from APIs
- Fill up disk space
See what breaks. Fix it. Repeat.
The goal is to understand your system’s failure modes before they happen in production.
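One of the staging simulations above, injecting latency into database queries, can be sketched as a wrapper around any query function; it assumes nothing about your driver, and the parameter names are illustrative:

```python
import random
import time


def with_injected_latency(query_fn, min_delay=0.1, max_delay=2.0,
                          probability=0.25):
    """Wrap a query function so a fraction of calls are delayed,
    surfacing timeout and queueing problems before production does."""
    def wrapper(*args, **kwargs):
        if random.random() < probability:
            time.sleep(random.uniform(min_delay, max_delay))
        return query_fn(*args, **kwargs)
    return wrapper
```

Raising `probability` and `max_delay` gradually in staging shows which callers have sane timeouts and which quietly queue forever.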