Failure Modes and Resilience Patterns
Anticipating how financial systems fail and building resilience. Circuit breakers, fallbacks, graceful degradation, and operational recovery.
Everything Fails Eventually
The question isn’t if your system will fail; it’s when and how. Payment processors go down. Databases lock up. Network partitions happen. Your job is to make sure failures are contained, visible, and recoverable.
Common Failure Modes in Fintech
Cascading failures: Service A times out waiting for Service B, which is waiting for Service C. With a 30-second timeout at each hop, a request at the front of the chain can block for up to 90 seconds while backed-up requests pile up across all three services.
Resource exhaustion: Database connection pool fills up. New requests can’t get connections. System grinds to a halt even though the database is fine.
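One way to keep pool exhaustion from stalling every request is to bound how long a caller waits for a connection and fail fast past that bound. A minimal sketch (class and parameter names are illustrative, not from any particular driver):

```python
import threading


class BoundedPool:
    """Hand out at most max_size connections; callers that cannot get
    one within acquire_timeout fail fast instead of queueing forever."""

    def __init__(self, connect, max_size=10, acquire_timeout=2.0):
        self._connect = connect          # factory for new connections
        self._sem = threading.Semaphore(max_size)
        self._timeout = acquire_timeout

    def acquire(self):
        if not self._sem.acquire(timeout=self._timeout):
            raise TimeoutError("connection pool exhausted; failing fast")
        return self._connect()

    def release(self, conn):
        # Sketch only: a real pool would recycle conn rather than drop it.
        self._sem.release()
```

A fast `TimeoutError` surfaces the exhaustion in your metrics immediately, instead of hiding it behind requests that hang until clients give up.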
Silent failures: An API returns 200 but the data is wrong. Your system happily processes garbage until someone notices the accounting doesn’t reconcile.
Thundering herd: System comes back online after an outage. Every pending request hits at once. System falls over again.
Circuit Breakers Done Right
Circuit breakers prevent cascading failures. When a downstream service fails repeatedly, stop calling it. Simple concept, tricky implementation.
Key decisions:
- Failure threshold: How many failures before opening the circuit? We use 5 failures in 10 seconds.
- Timeout: How long before trying again? Start with 30 seconds, exponential backoff to 5 minutes.
- Half-open state: Try one request to test if the service recovered. If it succeeds, close the circuit. If it fails, back to open.
Monitor your circuit breakers. If they’re opening frequently, you have a dependency problem.
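The decisions above can be sketched in code. This is a minimal, single-threaded illustration using the thresholds from the text (5 failures in 10 seconds, 30-second timeout doubling to a 5-minute cap, one trial request in half-open); a production breaker would also need locking and metrics:

```python
import time


class CircuitBreaker:
    """Open after `failure_threshold` failures within `window` seconds;
    retry after `base_timeout`, doubling up to `max_timeout`; allow a
    single trial call in the half-open state."""

    def __init__(self, failure_threshold=5, window=10.0,
                 base_timeout=30.0, max_timeout=300.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window = window
        self.base_timeout = base_timeout
        self.max_timeout = max_timeout
        self.clock = clock              # injectable for testing
        self.failures = []              # timestamps of recent failures
        self.state = "closed"           # closed | open | half_open
        self.opened_at = 0.0
        self.timeout = base_timeout

    def call(self, fn, *args, **kwargs):
        now = self.clock()
        if self.state == "open":
            if now - self.opened_at < self.timeout:
                raise RuntimeError("circuit open; failing fast")
            self.state = "half_open"    # allow one trial request through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._record_failure(now)
            raise
        # Success: close the circuit and reset the backoff.
        self.state = "closed"
        self.failures.clear()
        self.timeout = self.base_timeout
        return result

    def _record_failure(self, now):
        if self.state == "half_open":
            # Trial request failed: reopen with a doubled timeout.
            self.state = "open"
            self.opened_at = now
            self.timeout = min(self.timeout * 2, self.max_timeout)
            return
        self.failures = [t for t in self.failures if now - t <= self.window]
        self.failures.append(now)
        if len(self.failures) >= self.failure_threshold:
            self.state = "open"
            self.opened_at = now
```

Exposing `state` and `timeout` as metrics gives you the monitoring signal: a breaker that keeps reopening is a dependency problem, not a breaker problem.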
Graceful Degradation
Not every feature needs to work for the system to be useful. When payment processing fails, you can still show account balances. When reporting is down, you can still process loans.
Build feature flags into your system. When a dependency fails, toggle off dependent features automatically. The system degrades but doesn’t collapse.
For fintech, the hierarchy is:
- Core transactions (payments, loans) must work
- Customer-facing reads can degrade (show cached data, delay reports)
- Internal tools can fail completely if needed
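The automatic toggle can be sketched as a registry that maps features to the dependencies they need; when a dependency is marked down, every dependent feature reports itself disabled. Feature and dependency names here are illustrative:

```python
class FeatureFlags:
    """Map features to required dependencies; a feature is enabled
    only while none of its dependencies are marked down."""

    def __init__(self, dependencies):
        self._deps = dependencies       # feature -> set of dependency names
        self._down = set()

    def dependency_failed(self, dep):
        self._down.add(dep)

    def dependency_recovered(self, dep):
        self._down.discard(dep)

    def enabled(self, feature):
        return not (self._deps.get(feature, set()) & self._down)


flags = FeatureFlags({
    "payments": {"payment_processor"},
    "reports":  {"reporting_db"},
    "balances": {"ledger_db"},
})
flags.dependency_failed("payment_processor")
# payments toggles off automatically; balances keep working
```

Wiring `dependency_failed` to your health checks (or to a circuit breaker opening) is what turns this from a manual kill switch into automatic degradation.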
Testing for Failure
Chaos engineering in production scares people. Start smaller. Run failure simulations in staging:
- Kill random service instances
- Inject latency into database queries
- Return errors from APIs
- Fill up disk space
See what breaks. Fix it. Repeat.
The goal is to understand your system’s failure modes before they happen in production.
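One of the staging simulations above, injecting latency into database queries, can be sketched as a wrapper around any query function; it assumes nothing about your driver, and the parameter names are illustrative:

```python
import random
import time


def with_injected_latency(query_fn, min_delay=0.1, max_delay=2.0,
                          probability=0.25):
    """Wrap a query function so a fraction of calls are delayed,
    surfacing timeout and queueing problems before production does."""
    def wrapper(*args, **kwargs):
        if random.random() < probability:
            time.sleep(random.uniform(min_delay, max_delay))
        return query_fn(*args, **kwargs)
    return wrapper
```

Raising `probability` and `max_delay` gradually in staging shows which callers have sane timeouts and which quietly queue forever.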