Practitioner Knowledge Base

Observability Strategies for Production Systems

Logging, metrics, and tracing that actually help when things break. Building observability into debt-collection or fintech systems from day one.

Tags: writing, observability, monitoring, operations

Observability vs Monitoring

Monitoring tells you something is broken. Observability tells you why.

Monitoring: “Payment success rate dropped to 85%”

Observability: “Payments to Processor X are timing out after 10 seconds, started at 14:23 UTC, affects users in the EU region”

The difference is context. Observability gives you the data you need to debug production without deploying new code.

The Three Pillars

Logs: What happened, in narrative form. “Payment X attempted, processor returned 503, retrying”

Metrics: Counters and gauges. Payment attempts, success rate, latency p95.

Traces: Request flow across services. This payment request hit these 7 services in this order with these latencies.

You need all three. Metrics alert you. Logs explain. Traces connect the dots.

Structured Logging That Doesn’t Suck

JSON logs are searchable. Text logs are readable. Do both:

logger.info({
  msg: "Payment processed successfully",
  payment_id: "pay_123",
  loan_id: "loan_456",
  amount: 100.00,
  pursuable: 40.00,
  source: "host"
})

Log as JSON but format it nicely in development. Parse it in production.
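As a minimal sketch of that idea (not any specific logging library; the function names are illustrative), the same log entry can be rendered as JSON in production and as a readable line in development:

```javascript
// Sketch: one logger, two output formats, switched by environment.
// A real logger would write to stdout; this one returns the line so it
// can be inspected directly.
function makeLogger(env) {
  return {
    info(fields) {
      const entry = { level: "info", time: new Date().toISOString(), ...fields };
      if (env === "production") {
        return JSON.stringify(entry); // machine-parseable for log pipelines
      }
      // Development: human-readable "msg" first, then key=value pairs
      const { msg, ...rest } = entry;
      const pairs = Object.entries(rest).map(([k, v]) => `${k}=${v}`).join(" ");
      return `${msg} | ${pairs}`;
    }
  };
}

const prod = makeLogger("production");
const dev = makeLogger("development");
prod.info({ msg: "Payment processed", payment_id: "pay_123" }); // JSON string
dev.info({ msg: "Payment processed", payment_id: "pay_123" });  // readable line
```

In practice the same effect comes from a structured logger with a pretty-printing transport enabled only in development.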

What to Log

Log business events, not debug info. These logs are for 3am debugging, not development.

Do log:

  • Every state transition (file received, bulkloader processing succeeded)
  • Every external API call with latency
  • Every error with full context
  • Every host account status change

Don’t log:

  • Verbose debug output
  • Sensitive PII: redact it, or better, don’t log it at all
  • Info that’s better suited for metrics
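Redaction is simple enough to sketch. This is an illustrative helper, not a library API; the key list here is made up, and real systems keep that list in one central place:

```javascript
// Sketch: replace known-sensitive keys before a log entry leaves the process.
// The key names below are examples, not a complete PII list.
const SENSITIVE_KEYS = new Set(["ssn", "card_number", "date_of_birth", "email"]);

function redact(fields) {
  const out = {};
  for (const [key, value] of Object.entries(fields)) {
    out[key] = SENSITIVE_KEYS.has(key) ? "[REDACTED]" : value;
  }
  return out;
}

redact({ payment_id: "pay_123", ssn: "123-45-6789" });
// payment_id is kept; ssn is replaced with "[REDACTED]"
```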

Metrics That Matter

RED metrics for every service:

  • Rate: Requests per second
  • Errors: Error rate
  • Duration: Latency (p50, p95, p99)

USE metrics for resources:

  • Utilization: CPU, memory, disk
  • Saturation: Queue depth, connection pool
  • Errors: Failed connections, timeouts

For debt collection (or fintech more broadly), add business metrics:

  • Payment success rate by processor
  • DCA returns velocity
  • Collections contact rate
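To make the RED metrics concrete, here is an in-process sketch of a counter plus a latency reservoir with percentile lookup. In production you would use a metrics client (Prometheus, StatsD, etc.); the class and method names here are assumptions for illustration:

```javascript
// Illustrative RED tracker: rate (requests), errors, duration (percentiles).
class RedMetrics {
  constructor() {
    this.requests = 0;
    this.errors = 0;
    this.latenciesMs = []; // unbounded here; real code uses a bounded reservoir
  }
  record(latencyMs, isError) {
    this.requests += 1;
    if (isError) this.errors += 1;
    this.latenciesMs.push(latencyMs);
  }
  errorRate() {
    return this.requests === 0 ? 0 : this.errors / this.requests;
  }
  percentile(p) { // e.g. p = 0.95 for p95
    const sorted = [...this.latenciesMs].sort((a, b) => a - b);
    if (sorted.length === 0) return 0;
    const idx = Math.min(sorted.length - 1, Math.ceil(p * sorted.length) - 1);
    return sorted[idx];
  }
}

const red = new RedMetrics();
red.record(120, false);
red.record(450, true);
red.record(80, false);
red.errorRate();      // 1 error out of 3 requests
red.percentile(0.95); // worst of the three samples here
```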

Distributed Tracing

When a request touches 5 services, you need traces to connect the logs. The same applies when a single workflow event fans out to multiple downstream workflows. Use correlation IDs (or similar):

const eventId = generateId();
logger.info({ event_id: eventId, event: "Agency Placement Returned" });
// Pass eventId along to downstream services (or chained workflows)

Now you can query logs and see the entire flow.

Better: use a tracing tool. It captures service dependencies and latencies automatically.

Alerting Rules

Alert on symptoms, not causes. “Disk is 80% full” is a cause and usually doesn’t need a page; “API latency > 1 second” is a symptom users actually feel.

For each alert, answer:

  • Does this need immediate action?
  • Can we auto-remediate?
  • Who needs to know?

If the answer to #1 is no, it’s not an alert, it’s a metric to review later.
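One way to keep symptom alerts from flapping on a single slow request is to require the breach to hold across the whole evaluation window. A minimal sketch, using the article’s 1-second latency example (the function name and window shape are assumptions):

```javascript
// Sketch: fire only when every p95 sample in the evaluation window is above
// the threshold, so one slow blip doesn't page anyone.
function shouldAlert(p95SamplesMs, thresholdMs = 1000) {
  if (p95SamplesMs.length === 0) return false;
  return p95SamplesMs.every(sample => sample > thresholdMs);
}

shouldAlert([1200, 1500, 1100]); // sustained breach -> alert
shouldAlert([1200, 300, 1100]);  // one healthy sample -> no alert yet
```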

The Support or On-Call Dashboard

When you’re on-call at 2am, you need answers fast. Build a dashboard that shows:

  • Current error rates across all services
  • Recent deployments
  • Active circuit breakers
  • Queue depths
  • Database connection pools

Keep it simple. Too much data is worse than too little. You want anomalies to jump out.

Testing Observability

Before you go on-call, test your instrumentation:

  • Can you find all logs for a specific payment or transaction?
  • Can you trace a request (or event) across services?

If you can’t answer these in staging, you won’t be able to in production.
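The first check above can itself be automated as a staging smoke test. A sketch, assuming JSON log lines that carry a `payment_id` field as in the earlier example:

```javascript
// Sketch: recover every log line for one payment from raw JSON log output.
// Lines that fail to parse (e.g. stray text output) are skipped.
function logsForPayment(rawLines, paymentId) {
  return rawLines
    .map(line => { try { return JSON.parse(line); } catch { return null; } })
    .filter(entry => entry && entry.payment_id === paymentId);
}

const lines = [
  '{"msg":"Payment attempted","payment_id":"pay_123"}',
  '{"msg":"Unrelated","payment_id":"pay_999"}',
  'not json at all',
  '{"msg":"Payment processed","payment_id":"pay_123"}'
];
logsForPayment(lines, "pay_123"); // the two pay_123 entries
```

If a query like this comes back empty or incomplete in staging, the instrumentation gap is found before it matters at 2am.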