Production Readiness Checklist
What a service needs before it carries real traffic — and what an auditor (or a 3 a.m. page) will expose if it’s missing.
Observability
- Structured logging with correlation/trace IDs
- Metrics for the four golden signals: latency, traffic, errors, saturation
- Distributed tracing in place (e.g. OpenTelemetry)
- Dashboards per service + actionable alerts (alert on symptoms, not causes)
- SLOs defined with error budgets
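The SLO and error-budget item can be made concrete with a little arithmetic. The sketch below is illustrative only: the `Slo` class and the function names are hypothetical, and it assumes a simple request-based availability SLO rather than any particular monitoring product's model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A request-based availability SLO (hypothetical shape for illustration)."""
    target: float      # e.g. 0.999 means 99.9% of requests must succeed
    window_days: int   # rolling window the target applies to

    def __post_init__(self) -> None:
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")


def error_budget_fraction(slo: Slo) -> float:
    """Fraction of requests that may fail inside the window."""
    return 1.0 - slo.target


def budget_remaining(slo: Slo, total_requests: int, failed_requests: int) -> float:
    """Share of the budget still unspent: 1.0 is untouched, below 0 is overspent."""
    allowed_failures = error_budget_fraction(slo) * total_requests
    if allowed_failures == 0:
        # No traffic yet: the budget is intact unless failures were somehow recorded.
        return 0.0 if failed_requests else 1.0
    return 1.0 - failed_requests / allowed_failures


def burn_rate(slo: Slo, observed_error_rate: float) -> float:
    """Speed of budget spend; 1.0 uses the budget up exactly over the window."""
    return observed_error_rate / error_budget_fraction(slo)
```

For scale: a 99.9% target over 30 days leaves a budget of roughly 43 minutes of full outage. Alerting on a sustained high burn rate is one way to follow the "symptoms, not causes" advice above, though the right thresholds depend on the service.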
Delivery
- CI/CD pipeline: automated build, test, and deploy (e.g. GitHub Actions)
- Progressive delivery: canary or blue-green deployments
- Automatic rollback on alarm (errors, latency, data checks)
- Infrastructure as Code (Terraform / AWS CDK / CloudFormation): no manual console changes that cause drift
- Tidy Git branches: trunk-based development or short-lived branches, not a sprawl of stale ones
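As a rough illustration of "automatic rollback on alarm" during a canary rollout, the decision can be reduced to comparing the canary against the baseline over the same time window. Everything here is an assumption made for the sketch: the `WindowStats` shape, the `should_roll_back` name, and the default thresholds, which would need tuning per service.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class WindowStats:
    """Aggregated metrics for one deployment group over the same time window."""
    requests: int
    errors: int
    p99_latency_ms: float

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def should_roll_back(
    baseline: WindowStats,
    canary: WindowStats,
    *,
    min_requests: int = 500,             # illustrative: below this, don't judge yet
    max_error_rate_delta: float = 0.01,  # illustrative: canary may be 1 point worse
    max_latency_ratio: float = 1.5,      # illustrative: canary p99 may be 1.5x baseline
) -> tuple[bool, str]:
    """Return (roll_back, reason) for the current canary window."""
    if canary.requests < min_requests:
        return False, "insufficient canary traffic; keep waiting"
    if canary.error_rate - baseline.error_rate > max_error_rate_delta:
        return True, "canary error rate exceeds baseline"
    if (
        baseline.p99_latency_ms > 0
        and canary.p99_latency_ms / baseline.p99_latency_ms > max_latency_ratio
    ):
        return True, "canary p99 latency exceeds baseline"
    return False, "canary healthy"
```

Comparing against a live baseline rather than a fixed threshold keeps the check meaningful when traffic or upstream conditions shift during the rollout. In practice the deploy tooling would evaluate something like this on a timer and trigger the rollback itself.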
Reliability & Data
- Health checks + autoscaling
- Backups and a tested restore (point-in-time recovery, PITR, for RDS/DynamoDB)
- Idempotent, retry-safe operations; DLQs on async paths
- Defined failure modes, timeouts, and circuit breakers
- Capacity / quota headroom reviewed
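One way to picture "idempotent, retry-safe operations; DLQs on async paths" is a consumer that deduplicates by message ID, retries a bounded number of times, and parks what it cannot process. This is a minimal in-memory sketch: `Message`, `Consumer`, and their fields are hypothetical names, and the set and list stand in for a durable store and a real dead-letter queue.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Message:
    message_id: str  # stable ID set by the producer; doubles as the idempotency key
    body: dict


@dataclass
class Consumer:
    handler: Callable[[Message], None]
    max_attempts: int = 3
    processed_ids: set = field(default_factory=set)    # stand-in for a durable store
    dead_letters: list = field(default_factory=list)   # stand-in for a real DLQ

    def consume(self, message: Message) -> str:
        """Apply the handler once per message ID; park the message after repeated failure."""
        if message.message_id in self.processed_ids:
            return "duplicate"  # redelivery: acknowledge and skip the side effect
        last_error = None
        for _attempt in range(self.max_attempts):
            try:
                self.handler(message)
            except Exception as exc:  # a real consumer would back off between attempts
                last_error = exc
                continue
            self.processed_ids.add(message.message_id)
            return "processed"
        self.dead_letters.append((message, repr(last_error)))
        return "dead-lettered"
```

Note the gap this sketch leaves open: the handler's side effect and the recording of the ID are not atomic, so a crash between the two would cause a repeat. A production version would typically close that gap with a transactional write or a conditional put keyed on the message ID.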
Security
- Least-privilege IAM; no long-lived keys in code
- Secrets in AWS Secrets Manager / SSM Parameter Store (not .env files committed to Git)
- Encryption at rest + in transit
- Dependency and image scanning in CI
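The "no long-lived keys in code" item can be partly enforced with a CI check. The sketch below looks only for strings shaped like long-lived AWS access key IDs (the `AKIA` prefix followed by 16 uppercase letters or digits); the function name is hypothetical, and a dedicated secret scanner would cover far more patterns than this.

```python
from __future__ import annotations

import re

# Long-lived IAM user access key IDs are 20 characters and start with "AKIA".
ACCESS_KEY_ID = re.compile(r"\b(AKIA[0-9A-Z]{16})\b")


def find_access_key_ids(text: str) -> list[tuple[int, str]]:
    """Return (line_number, redacted_match) for each suspected key ID in text."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in ACCESS_KEY_ID.finditer(line):
            key = match.group(1)
            # Redact the middle so the scan output does not leak the key again.
            findings.append((line_number, key[:4] + "*" * 12 + key[-4:]))
    return findings
```

A CI step could run this over changed files and fail the build on any finding. Matching the key ID rather than the secret itself is deliberate, since the ID has a recognizable prefix while the secret does not.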
Process
- Runbook for on-call (common alerts → actions)
- Jira/issue tracking — fields filled and actually used
- Ownership clear (who’s paged, who decides)
- Post-incident reviews are blameless and produce owned action items
The cheap signal
If you can’t answer “how would I know this is broken, and how would I roll it back?” the service isn’t ready — regardless of how good the code is.