Step Functions

Serverless workflow orchestration. You define a state machine in Amazon States Language (ASL, a JSON-based format); AWS runs it, tracks state, retries failed steps, and gives you a visual execution history, so you don’t hand-roll orchestration logic in a Lambda.

stateDiagram-v2
  [*] --> Validate
  Validate --> Charge
  Charge --> Ship: success
  Charge --> Refund: failure
  Ship --> [*]
  Refund --> [*]
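
As a rough illustration, the diagram above could be written in ASL along these lines. This is a minimal sketch, not a definitive definition: the Lambda ARNs, account ID, and function names are hypothetical placeholders, and the "failure" edge is modeled with a catch-all Catch on the Charge state. The definition is built as a Python dict and serialized to the JSON that Step Functions expects.

```python
import json

# Hypothetical Lambda ARN prefix; replace with real function ARNs.
LAMBDA = "arn:aws:lambda:us-east-1:123456789012:function:"

definition = {
    "Comment": "Order workflow sketch: Validate -> Charge -> Ship, or Refund on failure",
    "StartAt": "Validate",
    "States": {
        "Validate": {"Type": "Task", "Resource": LAMBDA + "validate", "Next": "Charge"},
        "Charge": {
            "Type": "Task",
            "Resource": LAMBDA + "charge",
            # Any error raised by Charge routes to Refund (the "failure" edge).
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "Refund"}],
            "Next": "Ship",  # the "success" edge
        },
        "Ship": {"Type": "Task", "Resource": LAMBDA + "ship", "End": True},
        "Refund": {"Type": "Task", "Resource": LAMBDA + "refund", "End": True},
    },
}


def next_states(state):
    """Return every state name reachable in one step from `state`."""
    targets = []
    if "Next" in state:
        targets.append(state["Next"])
    targets += [catcher["Next"] for catcher in state.get("Catch", [])]
    return targets


asl_json = json.dumps(definition, indent=2)
print(asl_json)
```

The printed JSON is what would be supplied as the state machine definition; `next_states` is only a helper for checking that the edges match the diagram.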

Standard vs Express

Feature              Standard                  Express
Max duration         1 year                    5 minutes
Execution semantics  Exactly-once              At-least-once (asynchronous) / at-most-once (synchronous)
Throughput           Lower                     Very high (>100k execution starts/s)
Pricing              Per state transition      Per request, plus duration and memory
History              Full, in console          Via CloudWatch Logs
Use case             Long-running, auditable   High-volume event processing

State Types

  • Task — do work (Lambda, or 200+ service integrations).
  • Choice — branch on input.
  • Parallel — run branches concurrently.
  • Map — fan out over an array; Distributed Map scales to millions of items (up to 10,000 parallel child executions), reading its input from S3 (for example JSON or CSV files).
  • Wait / Pass / Succeed / Fail — delay, inject, terminate.
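
To make the state types above concrete, here is a small sketch that combines Choice, Map (inline mode), Wait, Pass, and Succeed in one definition. The state names, the `$.total` and `$.items` input fields, and the threshold of 100 are illustrative assumptions, not part of any real workflow. The helper only checks that every transition points at a state that exists.

```python
import json

states = {
    "CheckTotal": {
        "Type": "Choice",
        # Branch on input: orders over 100 are processed item by item.
        "Choices": [
            {"Variable": "$.total", "NumericGreaterThan": 100, "Next": "ProcessItems"}
        ],
        "Default": "Skip",
    },
    "ProcessItems": {
        "Type": "Map",
        "ItemsPath": "$.items",   # the array to fan out over
        "MaxConcurrency": 5,      # at most 5 iterations run at once
        "ItemProcessor": {
            "ProcessorConfig": {"Mode": "INLINE"},
            "StartAt": "Tag",
            "States": {"Tag": {"Type": "Pass", "Result": {"tagged": True}, "End": True}},
        },
        "Next": "Pause",
    },
    "Pause": {"Type": "Wait", "Seconds": 10, "Next": "Done"},
    "Skip": {"Type": "Pass", "Next": "Done"},
    "Done": {"Type": "Succeed"},
}
definition = {"StartAt": "CheckTotal", "States": states}


def dangling_targets(states):
    """List (state, target) pairs whose target is not a state at this level."""
    found = []
    for name, state in states.items():
        targets = [state.get("Next"), state.get("Default")]
        targets += [choice["Next"] for choice in state.get("Choices", [])]
        found += [(name, t) for t in targets if t and t not in states]
    return found


print(json.dumps(definition, indent=2))
print("dangling:", dangling_targets(states))
```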

Error Handling

  • Retry — backoff on matched errors (ErrorEquals, IntervalSeconds, MaxAttempts, BackoffRate).
  • Catch — route failures to a fallback state.
  • This built-in resilience is the main reason to use Step Functions over chained Lambdas.
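
The Retry and Catch fields above might look like this on a Task state. This is a sketch under assumptions: the Lambda ARN and the `Refund` and `Ship` state names are placeholders. The helper computes the wait before each retry as IntervalSeconds × BackoffRate^(attempt − 1); it deliberately ignores the optional MaxDelaySeconds and JitterStrategy fields.

```python
import json

charge = {
    "Type": "Task",
    # Hypothetical function ARN.
    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:charge",
    "Retry": [
        {
            "ErrorEquals": ["States.Timeout", "States.TaskFailed"],
            "IntervalSeconds": 2,   # wait before the first retry
            "MaxAttempts": 3,       # number of retries, not total tries
            "BackoffRate": 2.0,     # multiplier applied after each retry
        }
    ],
    # Once retries are exhausted, keep the error details and fall back to Refund.
    "Catch": [{"ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": "Refund"}],
    "Next": "Ship",
}


def retry_delays(retrier):
    """Seconds waited before each retry, using the ASL defaults when a field is absent."""
    first = retrier.get("IntervalSeconds", 1)
    rate = retrier.get("BackoffRate", 2.0)
    attempts = retrier.get("MaxAttempts", 3)
    return [first * rate ** n for n in range(attempts)]


print(json.dumps(charge, indent=2))
print(retry_delays(charge["Retry"][0]))  # [2.0, 4.0, 8.0]
```

With these settings a persistently failing Charge is retried after 2, 4, and 8 seconds before the Catch sends the execution to Refund.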

Service Integration Patterns

  • Request/Response — call and continue.
  • Run a Job (.sync) — call and wait for completion (e.g. ECS task, Glue job).
  • Wait for Callback (.waitForTaskToken) — pause until an external system returns a token (human approval, third-party webhook).
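
In ASL the three patterns above are selected by the suffix of the Task state's Resource ARN. The sketch below shows one Task per pattern; the function name, Glue job name, queue URL, and state names are hypothetical. In the callback case the task token is read from the context object (`$$.Task.Token`) and handed to the external system, which later resumes the execution by calling SendTaskSuccess or SendTaskFailure with that token.

```python
import json

request_response = {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke",  # no suffix: call and continue
    "Parameters": {"FunctionName": "validate", "Payload.$": "$"},
    "Next": "RunJob",
}
run_a_job = {
    "Type": "Task",
    "Resource": "arn:aws:states:::glue:startJobRun.sync",  # wait for the job to finish
    "Parameters": {"JobName": "nightly-etl"},
    "Next": "AwaitApproval",
}
wait_for_callback = {
    "Type": "Task",
    "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
    "Parameters": {
        "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123456789012/approvals",
        # Pass the token along so the approver can return it later.
        "MessageBody": {"order.$": "$", "taskToken.$": "$$.Task.Token"},
    },
    "TimeoutSeconds": 86400,  # give up if nobody responds within a day
    "End": True,
}


def pattern_of(resource):
    """Name the integration pattern implied by a Resource ARN suffix."""
    if resource.endswith(".sync") or resource.endswith(".sync:2"):
        return "Run a Job"
    if resource.endswith(".waitForTaskToken"):
        return "Wait for Callback"
    return "Request/Response"


for state in (request_response, run_a_job, wait_for_callback):
    print(pattern_of(state["Resource"]), "->", json.dumps(state["Resource"]))
```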

vs EventBridge

Step Functions orchestrates a known sequence of steps. EventBridge routes events between decoupled services with no central flow. The two combine well: an EventBridge rule can start a state machine execution whenever a matching event arrives.
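
One way to wire the two together is an EventBridge rule whose target is the state machine. The sketch below only builds the request parameters; the rule name, event source, detail type, and both ARNs are made-up placeholders, and the IAM role is assumed to allow `states:StartExecution` on the state machine.

```python
import json

# Hypothetical ARNs for illustration only.
STATE_MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:OrderWorkflow"
EVENTS_ROLE_ARN = "arn:aws:iam::123456789012:role/eventbridge-start-execution"

# Rule: match "OrderPlaced" events emitted by a hypothetical orders service.
rule_kwargs = {
    "Name": "order-placed-starts-workflow",
    "EventPattern": json.dumps(
        {"source": ["myapp.orders"], "detail-type": ["OrderPlaced"]}
    ),
}

# Target: start the state machine, assuming the role above.
target_kwargs = {
    "Rule": rule_kwargs["Name"],
    "Targets": [
        {"Id": "order-workflow", "Arn": STATE_MACHINE_ARN, "RoleArn": EVENTS_ROLE_ARN}
    ],
}

# With boto3 installed and credentials configured, these could be applied as:
#   events = boto3.client("events")
#   events.put_rule(**rule_kwargs)
#   events.put_targets(**target_kwargs)
print(json.dumps(rule_kwargs, indent=2))
print(json.dumps(target_kwargs, indent=2))
```

The matched event becomes the input of the execution, so the first state can read fields such as `$.detail` directly.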