Event-driven serverless ETL using AWS Lambda, Step Functions, EMR and Fargate

Siddharth Sharma
3 min read · Mar 6, 2019


In this crisp article, I will show you how to design an end-to-end, event-driven serverless ETL application orchestrated using only managed services.

Photo by Larisa Birta on Unsplash

TL;DR

  1. As soon as a file is uploaded to an S3 bucket, an event notification is sent to an SQS queue
  2. A Lambda function is subscribed to the SQS queue
  3. The Lambda function reads metadata for the given file and starts a Step Functions execution (a sketch of this Lambda follows the diagrams below)
  4. The Step Functions state machine orchestrates a series of Lambdas
  5. One Lambda triggers an EMR step to process and load data into DynamoDB using an S3-Hive-DynamoDB loader
  6. Another Lambda triggers a Fargate task to load data into Redshift
(Diagram: Architecture)
(Diagram: Step Functions state machine)
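
To make steps 1–3 concrete, here is a minimal sketch of the triggering Lambda, assuming the state machine ARN arrives via an environment variable; the variable name and the shape of the metadata are my assumptions, not details from the architecture above.

```python
import json
import os
import boto3

sfn = boto3.client("stepfunctions")

# Illustrative only: the ARN is assumed to be injected via configuration.
STATE_MACHINE_ARN = os.environ["STATE_MACHINE_ARN"]


def handler(event, context):
    """Triggered by SQS; each record body wraps an S3 event notification."""
    for record in event["Records"]:
        s3_event = json.loads(record["body"])
        for s3_record in s3_event.get("Records", []):
            bucket = s3_record["s3"]["bucket"]["name"]
            key = s3_record["s3"]["object"]["key"]
            # Hypothetical metadata for the uploaded file; a real pipeline
            # might enrich this from a DynamoDB config table.
            metadata = {"bucket": bucket, "key": key}
            sfn.start_execution(
                stateMachineArn=STATE_MACHINE_ARN,
                input=json.dumps(metadata),
            )
```

Batch size and retry behavior live on the SQS event source mapping, so the handler itself only needs to unwrap the S3 notification and hand the file's coordinates to Step Functions.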

Excited to know more? Read on!

Let’s look at the benefits of each of the AWS services used in the above architecture. I won’t go into the details of the code, CI/CD and pricing as that would require a separate article for each of the components involved in the workflow.

  • AWS Lambda: Lets you run code without provisioning or managing servers. In the above scenario, Lambda is mostly used to run a small snippet of code that either kicks off a long-running service or evaluates some periodic conditional logic.
  • AWS Step Functions: Lets you coordinate multiple AWS services into serverless workflows. It is a simple service that orchestrates a fairly limited set of AWS services, viz. Lambda, Glue, SNS, SQS, Fargate, DynamoDB, etc. It provides a rich JSON DSL for configuring a finite state machine and passing input/output parameters from one step to another (a minimal definition is sketched after this list). Prior to Step Functions, the only way to orchestrate distributed services was either AWS Data Pipeline (which hasn’t been updated by AWS for a long time) or Apache Airflow (a very powerful and customizable workflow engine). Both options require some coding to wire in Lambdas.
  • AWS EMR: Lets you run big-data processing on a managed Hadoop framework. Dataflow workloads running Spark/Flink are a great choice for distributed batch computing on EMR; in this architecture, a Lambda submits the Hive load as an EMR step (see the EMR sketch below). My previous two articles highlight how to run EMR workloads in a cost-effective manner using Instance Fleets.
  • AWS Fargate: Lets you run Docker containers without having to manage the underlying compute instances, viz. EC2 servers in ECS clusters. These containers suit long-running batch processes that don’t require distributed processing and can run on a single server. In the above scenario, a Fargate task runs a long-running, multithreaded Java process (see the Fargate sketch below).
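
For the JSON DSL mentioned above, here is a minimal, hypothetical state-machine definition registered through boto3. The parallel branches mirror the independent EMR and Fargate loads; every ARN and name below is a placeholder, not the article's real configuration.

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

# Sketch of an Amazon States Language definition; all ARNs are placeholders.
definition = {
    "Comment": "Hypothetical ETL orchestration sketch",
    "StartAt": "FanOut",
    "States": {
        "FanOut": {
            "Type": "Parallel",
            "End": True,
            "Branches": [
                {
                    "StartAt": "TriggerEMR",
                    "States": {
                        "TriggerEMR": {
                            "Type": "Task",
                            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:trigger-emr",
                            "End": True,
                        }
                    },
                },
                {
                    "StartAt": "TriggerFargate",
                    "States": {
                        "TriggerFargate": {
                            "Type": "Task",
                            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:trigger-fargate",
                            "End": True,
                        }
                    },
                },
            ],
        }
    },
}

sfn.create_state_machine(
    name="etl-orchestrator",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/step-functions-role",
)
```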
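The EMR-triggering Lambda could look something like this sketch: it adds a Hive step to an already-running cluster with `add_job_flow_steps`. The cluster ID, script path, and step name are assumptions for illustration.

```python
import boto3

emr = boto3.client("emr")


def handler(event, context):
    """Sketch: submit the Hive S3-to-DynamoDB load as an EMR step."""
    response = emr.add_job_flow_steps(
        JobFlowId=event["cluster_id"],  # assumed to arrive in the state input
        Steps=[
            {
                "Name": "s3-hive-dynamo-loader",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": [
                        "hive-script",
                        "--run-hive-script",
                        "--args",
                        "-f",
                        "s3://my-bucket/scripts/load_dynamo.q",  # hypothetical script
                    ],
                },
            }
        ],
    )
    return {"step_id": response["StepIds"][0]}
```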
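Likewise, the Fargate-triggering Lambda is a thin wrapper around ECS `run_task`; the cluster name, task definition, container name, and subnet below are placeholders.

```python
import boto3

ecs = boto3.client("ecs")


def handler(event, context):
    """Sketch: launch the Redshift loader as a one-off Fargate task."""
    response = ecs.run_task(
        cluster="etl-cluster",               # placeholder cluster name
        launchType="FARGATE",
        taskDefinition="redshift-loader:1",  # placeholder task definition
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-0123456789abcdef0"],  # placeholder
                "assignPublicIp": "DISABLED",
            }
        },
        overrides={
            "containerOverrides": [
                {
                    "name": "loader",  # placeholder container name
                    "environment": [
                        {"name": "S3_KEY", "value": event["key"]},
                    ],
                }
            ]
        },
    )
    return {"task_arn": response["tasks"][0]["taskArn"]}
```

Because the Java loader is a single-server process, handing it to Fargate keeps the pipeline serverless end to end without standing up EC2 capacity for one container.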

I hope this succinct article showcases how seamless it is to design and orchestrate a serverless architecture using AWS managed services. This pattern can solve a myriad of problems while keeping time to market short and costs on a pay-as-you-go model.
