The Problem
I’ve built event-driven pipelines for clients before, but usually with custom retry logic and lots of Lambda-to-Lambda calls that got messy to debug. I wanted to see how Step Functions would handle the orchestration instead.
The typical approaches have issues: cron jobs on EC2 sit idle most of the time, direct Lambda triggers create tight coupling, and custom retry logic is error-prone. Step Functions promises visual workflows with built-in error handling, but I’d only seen toy examples in the docs.
What I Built
A distributed order fulfillment system using Step Functions to orchestrate the entire workflow. API Gateway receives orders, Step Functions coordinates validation and storage, then SQS handles the actual fulfillment processing with automatic error recovery.
The Flow
Step Functions handles the sequential workflow (validate → store → queue), while SQS provides the asynchronous processing layer. Failed orders get tracked in both the main orders table and a separate failed_orders table for analysis.
Why This Architecture
I used Step Functions instead of Lambda-to-Lambda calls because retries are configured declaratively, the visual workflow makes debugging much easier, and error states are explicit. No custom orchestration logic to maintain.
Implementation Details
Step Functions Orchestration
The state machine definition coordinates the entire order flow:
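As a sketch of what that definition might look like in Amazon States Language (the state names, Lambda ARNs, and queue URL below are placeholders, not the project's actual resources):

```json
{
  "Comment": "Order flow: validate, store, then queue for fulfillment",
  "StartAt": "ValidateOrder",
  "States": {
    "ValidateOrder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:order-validator",
      "Retry": [
        {
          "ErrorEquals": ["States.TaskFailed"],
          "IntervalSeconds": 2,
          "MaxAttempts": 3,
          "BackoffRate": 2.0
        }
      ],
      "Catch": [
        { "ErrorEquals": ["States.ALL"], "Next": "MarkFailed" }
      ],
      "Next": "StoreOrder"
    },
    "StoreOrder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:order-storage",
      "Next": "QueueForFulfillment"
    },
    "QueueForFulfillment": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sqs:sendMessage",
      "Parameters": {
        "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123456789012/fulfillment-queue",
        "MessageBody.$": "$"
      },
      "End": true
    },
    "MarkFailed": {
      "Type": "Fail",
      "Error": "OrderValidationFailed"
    }
  }
}
```

The Retry and Catch blocks are where the declarative error handling lives: retries with exponential backoff on the validator, and an explicit failure state instead of custom orchestration code.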
Error Handling Strategy
The SQS dead letter queue captures orders that fail fulfillment after retries:
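The DLQ wiring is a redrive policy on the fulfillment queue. A minimal sketch of that attribute (the ARN and threshold are illustrative):

```json
{
  "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:fulfillment-dlq",
  "maxReceiveCount": "3"
}
```

With a maxReceiveCount of 3, SQS moves a message to the dead letter queue after three failed processing attempts, so the fulfillment Lambda's retries happen automatically before anything is declared dead.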
Failed orders get processed by a DLQ handler that writes them to a separate failed_orders table for analysis. This gives you a clear audit trail of what failed and why.
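A minimal sketch of that DLQ handler, assuming a failed_orders DynamoDB table and an order_id field in the message body (both are my assumptions, not confirmed details of the project):

```python
import json


def build_failed_order_item(record: dict) -> dict:
    """Turn a raw SQS DLQ record into a failed_orders row (schema is illustrative)."""
    order = json.loads(record["body"])
    return {
        "order_id": order["order_id"],
        "payload": record["body"],          # keep the full message for debugging
        "failure_source": "fulfillment_dlq",
        "received_at": record["attributes"]["SentTimestamp"],
    }


def handler(event, context):
    """Lambda entry point: persist every dead-lettered order for analysis."""
    import boto3  # imported lazily so the pure logic above stays testable offline

    table = boto3.resource("dynamodb").Table("failed_orders")  # table name assumed
    for record in event["Records"]:
        table.put_item(Item=build_failed_order_item(record))
    return {"recorded": len(event["Records"])}
```

Keeping the record-to-item mapping in a separate pure function makes the audit-trail logic unit-testable without AWS credentials.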
Lambda Function Architecture
Four specialized functions handle different parts of the workflow:
- API Handler: Validates request format, starts Step Functions execution
- Validator: Checks business rules (customer exists, inventory available)
- Order Storage: Writes to DynamoDB orders table with PROCESSING status
- Fulfillment: Simulates order processing (configurable success rate for testing)
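The API handler's job is small: check the request shape, then hand off to Step Functions. A sketch of that pattern (the field names and STATE_MACHINE_ARN environment variable are assumptions for illustration):

```python
import json
import os


def parse_order_request(body: str) -> dict:
    """Basic format check before handing off to Step Functions (fields assumed)."""
    order = json.loads(body)
    missing = [f for f in ("customer_id", "items") if f not in order]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return order


def handler(event, context):
    import boto3  # lazy import keeps parse_order_request testable offline

    try:
        order = parse_order_request(event["body"])
    except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
        return {"statusCode": 400, "body": json.dumps({"error": str(exc)})}

    sfn = boto3.client("stepfunctions")
    execution = sfn.start_execution(
        stateMachineArn=os.environ["STATE_MACHINE_ARN"],
        input=json.dumps(order),
    )
    # 202 Accepted: the order is queued for processing, not yet fulfilled
    return {"statusCode": 202, "body": json.dumps({"execution": execution["executionArn"]})}
```

Returning 202 rather than 200 signals to callers that fulfillment happens asynchronously after the API responds.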
I intentionally built in a 70% success rate for the fulfillment function to test error handling under realistic failure conditions.
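The failure simulation can be sketched like this; the SUCCESS_RATE environment variable and payload shape are my assumptions, not the project's actual code:

```python
import os
import random


def fulfill_order(order: dict, success_rate=None) -> dict:
    """Simulate fulfillment, failing randomly so retry and DLQ paths get exercised.

    success_rate defaults to the SUCCESS_RATE env var (0.7 here, matching the
    70% success rate described above).
    """
    if success_rate is None:
        success_rate = float(os.environ.get("SUCCESS_RATE", "0.7"))
    if random.random() < success_rate:
        return {"order_id": order["order_id"], "status": "FULFILLED"}
    # Raising pushes the SQS message back for redelivery; after maxReceiveCount
    # attempts the message lands in the dead letter queue.
    raise RuntimeError(f"simulated fulfillment failure for {order['order_id']}")
```

Because failure is signaled by raising rather than returning an error value, the SQS retry machinery does all the work: no custom retry code in the function itself.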
What I Learned
Step Functions vs Lambda Chains
Step Functions adds cost (roughly $25 per million state transitions on Standard Workflows) but eliminates custom orchestration code. The visual workflow in the AWS console makes debugging so much easier than digging through Lambda logs to trace execution.
SQS for Decoupling
The Step Functions workflow ends by sending a message to SQS rather than calling the fulfillment Lambda directly. This decouples the orchestration from the processing: if fulfillment takes 5 minutes instead of 5 seconds, it doesn't affect the API response time.
Failure Rate Testing
I built in a configurable failure rate (70% success) to validate error handling. In real systems, this helps you test dead letter queue processing, retry limits, and monitoring alerts before deployment.
Two-Tier Validation
API Gateway handles format validation (required fields, data types), while the validator Lambda handles business logic (customer exists, inventory check). This separates concerns and gives better error messages.
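The format-validation tier can live entirely in an API Gateway request model. A sketch of what such a model might look like (field names are illustrative; API Gateway models use JSON Schema draft-04):

```json
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "title": "OrderRequest",
  "type": "object",
  "required": ["customer_id", "items"],
  "properties": {
    "customer_id": { "type": "string" },
    "items": {
      "type": "array",
      "minItems": 1,
      "items": {
        "type": "object",
        "required": ["sku", "quantity"],
        "properties": {
          "sku": { "type": "string" },
          "quantity": { "type": "integer", "minimum": 1 }
        }
      }
    }
  }
}
```

With request validation enabled, malformed orders are rejected with a 400 before any Lambda is invoked, so the validator function only ever sees well-formed payloads and can focus on business rules.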
Prior Experience
I’d built similar event-driven architectures for a Canadian startup, but that was more focused on real-time data ingestion. This project let me explore the order processing use case and see how Step Functions handles sequential vs parallel workflows.
Why This Pattern Works
Serverless orchestration solves the scaling problems without managing infrastructure. Step Functions gives you visual workflows, automatic retries, and explicit error handling. SQS provides the async processing so your API stays responsive.
See the DOFS project for the actual code and deployment patterns. The failure simulation and DLQ handling patterns apply to any event-driven pipeline.
Working on event-driven architecture? Contact me to discuss your specific pipeline requirements.