The Problem
I’ve built event-driven pipelines for clients before, but usually with custom retry logic and lots of Lambda-to-Lambda calls that got messy to debug. I wanted to see how Step Functions would handle the orchestration instead.
The typical approaches have issues: cron jobs on EC2 sit idle most of the time, direct Lambda triggers create tight coupling, and custom retry logic is error-prone. Step Functions promises visual workflows with built-in error handling, but I’d only seen toy examples in the docs.
What I Built
A distributed order fulfillment system using Step Functions to orchestrate the entire workflow. API Gateway receives orders, Step Functions coordinates validation and storage, then SQS handles the actual fulfillment processing with automatic error recovery.
The Flow
Step Functions handles the sequential workflow (validate → store → queue), while SQS provides the asynchronous processing layer. Failed orders get tracked in both the main orders table and a separate failed_orders table for analysis.
Why This Architecture
I used Step Functions instead of Lambda-to-Lambda calls because retries are configured declaratively, the visual workflow makes debugging much easier, and error states are explicit. No custom orchestration logic to maintain.
Implementation Details
Step Functions Orchestration
The state machine definition coordinates the entire order flow:
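As a sketch of what that definition might look like in Amazon States Language (the state names, Lambda ARNs, and queue URL below are placeholders, not the project's actual resources):

```json
{
  "Comment": "Order flow: validate, store, then queue for fulfillment",
  "StartAt": "ValidateOrder",
  "States": {
    "ValidateOrder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:order-validator",
      "Retry": [
        {
          "ErrorEquals": ["States.TaskFailed"],
          "IntervalSeconds": 2,
          "MaxAttempts": 3,
          "BackoffRate": 2.0
        }
      ],
      "Catch": [
        { "ErrorEquals": ["States.ALL"], "Next": "MarkFailed" }
      ],
      "Next": "StoreOrder"
    },
    "StoreOrder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:order-storage",
      "Next": "QueueForFulfillment"
    },
    "QueueForFulfillment": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sqs:sendMessage",
      "Parameters": {
        "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123456789012/fulfillment-queue",
        "MessageBody.$": "$"
      },
      "End": true
    },
    "MarkFailed": {
      "Type": "Fail",
      "Error": "OrderValidationFailed"
    }
  }
}
```

The Retry and Catch blocks are where the declarative error handling lives: retries with exponential backoff on the validator, and an explicit failure state instead of custom orchestration code.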
Error Handling Strategy
The SQS dead letter queue captures orders that fail fulfillment after retries:
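The DLQ wiring is a redrive policy on the fulfillment queue. A minimal sketch of that attribute (the ARN and threshold are illustrative):

```json
{
  "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:fulfillment-dlq",
  "maxReceiveCount": "3"
}
```

With a maxReceiveCount of 3, SQS moves a message to the dead letter queue after three failed processing attempts, so the fulfillment Lambda's retries happen automatically before anything is declared dead.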
Failed orders get processed by a DLQ handler that writes them to a separate failed_orders table for analysis. This gives you a clear audit trail of what failed and why.
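A minimal sketch of that DLQ handler, assuming a failed_orders DynamoDB table and an order_id field in the message body (both are my assumptions, not confirmed details of the project):

```python
import json


def build_failed_order_item(record: dict) -> dict:
    """Turn a raw SQS DLQ record into a failed_orders row (schema is illustrative)."""
    order = json.loads(record["body"])
    return {
        "order_id": order["order_id"],
        "payload": record["body"],          # keep the full message for debugging
        "failure_source": "fulfillment_dlq",
        "received_at": record["attributes"]["SentTimestamp"],
    }


def handler(event, context):
    """Lambda entry point: persist every dead-lettered order for analysis."""
    import boto3  # imported lazily so the pure logic above stays testable offline

    table = boto3.resource("dynamodb").Table("failed_orders")  # table name assumed
    for record in event["Records"]:
        table.put_item(Item=build_failed_order_item(record))
    return {"recorded": len(event["Records"])}
```

Keeping the record-to-item mapping in a separate pure function makes the audit-trail logic unit-testable without AWS credentials.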
Lambda Function Architecture
Four specialized functions handle different parts of the workflow:
- API Handler: Validates request format, starts Step Functions execution
- Validator: Checks business rules (customer exists, inventory available)
- Order Storage: Writes to DynamoDB orders table with PROCESSING status
- Fulfillment: Simulates order processing (configurable success rate for testing)
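The API handler's job is small: check the request shape, then hand off to Step Functions. A sketch of that pattern (the field names and STATE_MACHINE_ARN environment variable are assumptions for illustration):

```python
import json
import os


def parse_order_request(body: str) -> dict:
    """Basic format check before handing off to Step Functions (fields assumed)."""
    order = json.loads(body)
    missing = [f for f in ("customer_id", "items") if f not in order]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return order


def handler(event, context):
    import boto3  # lazy import keeps parse_order_request testable offline

    try:
        order = parse_order_request(event["body"])
    except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
        return {"statusCode": 400, "body": json.dumps({"error": str(exc)})}

    sfn = boto3.client("stepfunctions")
    execution = sfn.start_execution(
        stateMachineArn=os.environ["STATE_MACHINE_ARN"],
        input=json.dumps(order),
    )
    # 202 Accepted: the order is queued for processing, not yet fulfilled
    return {"statusCode": 202, "body": json.dumps({"execution": execution["executionArn"]})}
```

Returning 202 rather than 200 signals to callers that fulfillment happens asynchronously after the API responds.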
I intentionally built in a 70% success rate for the fulfillment function to test error handling under realistic failure conditions.
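The failure simulation can be sketched like this; the SUCCESS_RATE environment variable and payload shape are my assumptions, not the project's actual code:

```python
import os
import random


def fulfill_order(order: dict, success_rate=None) -> dict:
    """Simulate fulfillment, failing randomly so retry and DLQ paths get exercised.

    success_rate defaults to the SUCCESS_RATE env var (0.7 here, matching the
    70% success rate described above).
    """
    if success_rate is None:
        success_rate = float(os.environ.get("SUCCESS_RATE", "0.7"))
    if random.random() < success_rate:
        return {"order_id": order["order_id"], "status": "FULFILLED"}
    # Raising pushes the SQS message back for redelivery; after maxReceiveCount
    # attempts the message lands in the dead letter queue.
    raise RuntimeError(f"simulated fulfillment failure for {order['order_id']}")
```

Because failure is signaled by raising rather than returning an error value, the SQS retry machinery does all the work: no custom retry code in the function itself.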
What I Learned
Step Functions vs Lambda Chains
Step Functions adds cost (roughly $25 per million state transitions on Standard Workflows) but eliminates custom orchestration code. The visual workflow in the AWS console makes debugging so much easier than digging through Lambda logs to trace execution.
SQS for Decoupling
The Step Functions workflow ends by sending a message to SQS rather than calling the fulfillment Lambda directly. This decouples the orchestration from the processing: if fulfillment takes 5 minutes instead of 5 seconds, it doesn't affect the API response time.
Failure Rate Testing
I built in a configurable failure rate (70% success) to validate error handling. In real systems, this helps you test dead letter queue processing, retry limits, and monitoring alerts before deployment.
Two-Tier Validation
API Gateway handles format validation (required fields, data types), while the validator Lambda handles business logic (customer exists, inventory check). This separates concerns and gives better error messages.
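The format-validation tier can live entirely in an API Gateway request model. A sketch of what such a model might look like (field names are illustrative; API Gateway models use JSON Schema draft-04):

```json
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "title": "OrderRequest",
  "type": "object",
  "required": ["customer_id", "items"],
  "properties": {
    "customer_id": { "type": "string" },
    "items": {
      "type": "array",
      "minItems": 1,
      "items": {
        "type": "object",
        "required": ["sku", "quantity"],
        "properties": {
          "sku": { "type": "string" },
          "quantity": { "type": "integer", "minimum": 1 }
        }
      }
    }
  }
}
```

With request validation enabled, malformed orders are rejected with a 400 before any Lambda is invoked, so the validator function only ever sees well-formed payloads and can focus on business rules.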
Prior Experience
I’d built similar event-driven architectures for a Canadian startup, but that was more focused on real-time data ingestion. This project let me explore the order processing use case and see how Step Functions handles sequential vs parallel workflows.
Why This Pattern Works
Serverless orchestration solves the scaling problems without managing infrastructure. Step Functions gives you visual workflows, automatic retries, and explicit error handling. SQS provides the async processing so your API stays responsive.
See the DOFS project for the actual code and deployment patterns. The failure simulation and DLQ handling patterns apply to any event-driven pipeline.
Working on event-driven architecture? Contact me to discuss your specific pipeline requirements.