Production Readiness Checklist

Tags: software-architecture, operations, checklist

Jun 25, 2026 · 2 min read

What a service needs before it carries real traffic — and what an auditor (or a 3 a.m. page) will expose if it’s missing.

Observability

Structured logging with correlation/trace IDs
Metrics for the four golden signals: latency, traffic, errors, saturation
Distributed tracing in place (OpenTelemetry)
Dashboards per service + actionable alerts (alert on symptoms, not causes)
SLOs defined with error budgets
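
The error budget behind an SLO is simple arithmetic: the share of requests allowed to fail is one minus the target. Below is a minimal Python sketch, assuming a request-based SLO measured over some window. The `error_budget` function and the `ErrorBudget` type are illustrative names, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    allowed_failures: float    # failures the SLO tolerates in the window
    consumed_fraction: float   # 0.25 means a quarter of the budget is spent
    remaining_fraction: float  # negative once the budget is overspent


def error_budget(slo_target: float, total_requests: int,
                 failed_requests: int) -> ErrorBudget:
    """Compute error-budget usage for a request-based SLO (hypothetical helper)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests < 0 or not 0 <= failed_requests <= total_requests:
        raise ValueError("request counts are inconsistent")

    # The budget is the complement of the target, scaled by traffic.
    allowed = (1.0 - slo_target) * total_requests
    if allowed > 0:
        consumed = failed_requests / allowed
    else:
        # No traffic in the window means there is no budget to spend.
        consumed = 0.0 if failed_requests == 0 else float("inf")
    return ErrorBudget(allowed, consumed, 1.0 - consumed)
```

With a 99.9% target over 1,000,000 requests, roughly 1,000 failures are allowed, so 250 failures consume about a quarter of the budget.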

Delivery

CI/CD pipeline — automated build, test, deploy (GitHub Actions)
Progressive delivery: canary or blue-green deployments
Automatic rollback on alarm (errors, latency, data checks)
Infrastructure as Code (Terraform / AWS CDK / CloudFormation), so manual console changes cannot cause drift
Tidy Git branches: trunk-based development or short-lived feature branches, not a sprawl of long-lived ones
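
Automatic rollback on alarm comes down to a decision rule that compares the canary against the stable baseline. Here is one way to sketch it in Python. The `WindowStats` type, the `should_roll_back` function, and every threshold are assumptions chosen for illustration; a real pipeline would take these numbers from its own metrics and SLOs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowStats:
    """Metrics for one deployment over one observation window."""
    requests: int
    errors: int
    p99_latency_ms: float

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def should_roll_back(canary: WindowStats, baseline: WindowStats,
                     max_error_rate_delta: float = 0.01,
                     max_latency_ratio: float = 1.5,
                     min_requests: int = 100) -> bool:
    """Return True when the canary is clearly worse than the baseline."""
    # Too little traffic to judge: keep observing rather than flapping.
    if canary.requests < min_requests:
        return False
    # Symptom 1: the error rate rose by more than the allowed margin.
    if canary.error_rate - baseline.error_rate > max_error_rate_delta:
        return True
    # Symptom 2: tail latency regressed relative to the baseline.
    if (baseline.p99_latency_ms > 0
            and canary.p99_latency_ms / baseline.p99_latency_ms > max_latency_ratio):
        return True
    return False
```

Comparing against a live baseline rather than fixed limits keeps the rule from firing when the whole system is degraded for reasons unrelated to the release.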

Reliability & Data

Health checks + autoscaling
Backups and a tested restore (point-in-time recovery, PITR, for RDS/DynamoDB)
Idempotent, retry-safe operations; dead-letter queues (DLQs) on async paths
Defined failure modes, timeouts, and circuit breakers
Capacity / quota headroom reviewed
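
Idempotency, retries, and dead-letter queues fit together in a message consumer. The Python sketch below uses an in-memory set and list in place of a durable deduplication store and a real DLQ; `Message`, `process_with_retries`, and the attempt limit are hypothetical names and values. Backoff between attempts is omitted to keep the example short.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass(frozen=True)
class Message:
    id: str    # idempotency key: the same id always means the same work
    body: str


def process_with_retries(message: Message,
                         handler: Callable[[Message], None],
                         processed_ids: Set[str],
                         dead_letters: List[Message],
                         max_attempts: int = 3) -> str:
    """Process a message at most once; park it after repeated failure."""
    # A redelivered message that already succeeded is skipped.
    if message.id in processed_ids:
        return "duplicate"

    for _attempt in range(max_attempts):
        try:
            handler(message)
        except Exception:
            continue  # treated as transient here; try again
        # Record success only after the handler finished. A crash between
        # the two steps replays the message, so the handler itself should
        # also tolerate being run twice.
        processed_ids.add(message.id)
        return "processed"

    # Retries exhausted: keep the message for inspection instead of losing it.
    dead_letters.append(message)
    return "dead-lettered"
```

The dead-letter list is only useful if someone looks at it, so an alert on DLQ depth belongs alongside this pattern.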

Security

Least-privilege IAM; no long-lived keys in code
Secrets in Secrets Manager / Parameter Store (not env files in git)
Encryption at rest + in transit
Dependency and image scanning in CI
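
A CI step can catch the most obvious case of long-lived keys in code. This minimal Python sketch looks for strings shaped like long-term AWS access key IDs, assuming the common 20-character form that starts with `AKIA`. The `find_access_key_ids` function is an illustrative name, and a dedicated secret scanner covers far more patterns than this single regex.

```python
import re
from typing import List, Tuple

# Assumed shape of a long-term AWS access key ID: "AKIA" plus 16
# uppercase letters or digits. Other credential types are not covered.
ACCESS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def find_access_key_ids(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, match) pairs for every suspected key ID."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in ACCESS_KEY_ID.finditer(line):
            findings.append((line_number, match.group(0)))
    return findings
```

Running this over the files changed in a commit and failing the build on any finding turns the checklist item into an enforced rule.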

Process

Runbook for on-call (common alerts → actions)
Issue tracking (Jira or similar) with fields filled in and actually used
Ownership clear (who’s paged, who decides)
Post-incident reviews are blameless and produce owned action items

The cheap signal

If you can’t answer “how would I know this is broken, and how would I roll it back?” the service isn’t ready — regardless of how good the code is.
