Event-driven serverless ETL using AWS Lambda, Step Functions, EMR and Fargate

3 min read · Mar 6, 2019

In this crisp article, I will show you how to design an end-to-end serverless ETL application that is event-driven and orchestrated using managed services only.

TL;DR

As soon as a file is uploaded to an S3 bucket, an event notification is sent to an SQS queue
A Lambda function consumes messages from the SQS queue
The Lambda function reads metadata for the given file and triggers a Step Functions execution
The Step Function orchestrates a series of Lambdas
One Lambda launches an EMR cluster to process the data and load it into DynamoDB using the S3-Hive-Dynamo loader
Another Lambda triggers a Fargate task to load data into Redshift
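
The first three steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the code from the actual project. It assumes the S3 notification arrives unmodified as the SQS message body, and the names extract_s3_objects and make_handler as well as the input fields "bucket" and "key" are hypothetical. The Step Functions client is passed in so the logic can be exercised without AWS access, and the metadata lookup is left out.

```python
import json
from urllib.parse import unquote_plus


def extract_s3_objects(sqs_event):
    """Yield (bucket, key) pairs from S3 notifications wrapped in SQS records."""
    for sqs_record in sqs_event.get("Records", []):
        # Assumption: the SQS body is the raw S3 notification JSON.
        s3_notification = json.loads(sqs_record["body"])
        # S3 test events carry no "Records" key, so default to an empty list.
        for s3_record in s3_notification.get("Records", []):
            bucket = s3_record["s3"]["bucket"]["name"]
            # Object keys arrive URL-encoded in S3 notifications.
            key = unquote_plus(s3_record["s3"]["object"]["key"])
            yield bucket, key


def make_handler(sfn_client, state_machine_arn):
    """Build a Lambda handler that starts one execution per uploaded file."""

    def handler(event, context):
        started = []
        for bucket, key in extract_s3_objects(event):
            response = sfn_client.start_execution(
                stateMachineArn=state_machine_arn,
                # Hypothetical input shape for the state machine.
                input=json.dumps({"bucket": bucket, "key": key}),
            )
            started.append(response["executionArn"])
        return started

    return handler
```

In a deployed function the handler could be created once at module load, for example with make_handler(boto3.client("stepfunctions"), os.environ["STATE_MACHINE_ARN"]), where the environment variable name is again an assumption.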

Excited to know more? Read on!

Let’s look at the benefits of each of the AWS services used in the above architecture. I won’t go into the details of the code, CI/CD and pricing as that would require a separate article for each of the components involved in the workflow.

AWS Lambda: Lets you run code without provisioning or managing servers. In the above scenario, Lambda is mostly used to execute a small snippet of code that either starts execution of a long-running service or executes some periodic conditional logic.
AWS Step Functions: Lets you coordinate multiple AWS services into serverless workflows. It is a simple service that orchestrates a fairly limited set of AWS services, viz. Lambda, Glue, SNS, SQS, Fargate, DynamoDB etc. It provides a rich JSON DSL for configuring a finite state machine and passing input/output parameters from one step to another. Prior to Step Functions, the only way to orchestrate distributed services was either by using AWS DataPipeline (hasn’t been updated by AWS for a long time) or Apache Airflow (a very powerful and customizable workflow engine). Both these options require some coding to wire in Lambdas.
AWS EMR: Lets you run big-data processing on a managed Hadoop framework. Data flow workloads running Spark/Flink are a great choice for performing distributed batch computing on EMR. My previous two articles cover how to run EMR workloads in a cost-effective manner using Instance Fleets.
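
To give a feel for the JSON DSL (the Amazon States Language), here is a small sketch that builds a two-step state machine definition in Python. The state names, the Lambda ARNs and the choice to run the two loads one after the other are all assumptions made for illustration; the real workflow may well be arranged differently.

```python
import json


def build_definition(start_emr_lambda_arn, start_fargate_lambda_arn):
    """Return a state machine definition with two sequential Lambda tasks."""
    return {
        "Comment": "Illustrative ETL workflow (hypothetical state names)",
        "StartAt": "StartEmrLoad",
        "States": {
            "StartEmrLoad": {
                "Type": "Task",
                "Resource": start_emr_lambda_arn,
                # Keep the original input and attach the task result under $.emr.
                "ResultPath": "$.emr",
                "Next": "StartFargateLoad",
            },
            "StartFargateLoad": {
                "Type": "Task",
                "Resource": start_fargate_lambda_arn,
                "ResultPath": "$.fargate",
                "End": True,
            },
        },
    }


# Placeholder ARNs, not real resources.
definition_json = json.dumps(
    build_definition(
        "arn:aws:lambda:us-east-1:123456789012:function:start-emr",
        "arn:aws:lambda:us-east-1:123456789012:function:start-fargate",
    ),
    indent=2,
)
print(definition_json)
```

The resulting JSON string is what would be supplied as the definition when creating the state machine, for instance through the create_state_machine call of the boto3 Step Functions client.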

Part 1: Event-driven serverless architecture for supporting EMR Spot Instance Fleet in DataPipeline

This post describes the problem statement, implementation, and the learnings of how we solved an AWS limitation using…

Distributed Batch Processing using Apache Flink on AWS EMR YARN Cluster

Batch processing using Apache Flink on AWS EMR

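
For the Lambda that launches EMR in this workflow, the request it sends could look roughly like the sketch below. It only builds the parameters for the EMR RunJobFlow API using Instance Fleets and a Hive step. The cluster name, release label, instance types, capacities, default role names and the INPUT Hive variable are assumptions chosen for illustration, and the function name build_emr_request is hypothetical.

```python
def build_emr_request(hive_script_s3_uri, input_s3_uri, subnet_id):
    """Build RunJobFlow parameters for a transient cluster running one Hive step."""
    return {
        "Name": "s3-hive-dynamo-load",  # hypothetical cluster name
        "ReleaseLabel": "emr-5.21.0",  # assumed release
        "Applications": [{"Name": "Hive"}],
        "Instances": {
            "InstanceFleets": [
                {
                    "InstanceFleetType": "MASTER",
                    "TargetOnDemandCapacity": 1,
                    "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}],
                },
                {
                    "InstanceFleetType": "CORE",
                    # Spot capacity spread over two candidate instance types.
                    "TargetSpotCapacity": 2,
                    "InstanceTypeConfigs": [
                        {"InstanceType": "m5.xlarge", "WeightedCapacity": 1},
                        {"InstanceType": "m4.xlarge", "WeightedCapacity": 1},
                    ],
                },
            ],
            "Ec2SubnetIds": [subnet_id],
            # Terminate the cluster once the step list is exhausted.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [
            {
                "Name": "load-into-dynamodb",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": [
                        "hive-script", "--run-hive-script", "--args",
                        "-f", hive_script_s3_uri,
                        "-d", "INPUT=" + input_s3_uri,
                    ],
                },
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }
```

Inside the Lambda, the dictionary would be passed on with boto3.client("emr").run_job_flow(**request). Keeping the request builder separate from the API call makes it easy to unit test without touching AWS.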

AWS Fargate: Lets you run Docker containers without having to manage the underlying compute instances, viz. EC2 servers or ECS clusters. These containers can run long-running batch processes which don’t require distributed processing and can run on a single server. In the above scenario, a Fargate task is used to run a multithreaded, long-running Java process.
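
As a rough sketch of how a Lambda might start such a task, the snippet below builds the parameters for the ECS RunTask API with the Fargate launch type. The function name build_fargate_request, the INPUT_S3_URI environment variable and the decision to disable the public IP are assumptions for illustration; the cluster, task definition and container names would come from your own setup.

```python
def build_fargate_request(cluster, task_definition, container_name, subnet_ids, s3_uri):
    """Build RunTask parameters that launch one Fargate task for a given file."""
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "launchType": "FARGATE",
        "count": 1,
        # Fargate tasks use awsvpc networking, so subnets must be supplied.
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": list(subnet_ids),
                "assignPublicIp": "DISABLED",
            }
        },
        "overrides": {
            "containerOverrides": [
                {
                    "name": container_name,
                    # Hypothetical variable telling the Java process what to load.
                    "environment": [{"name": "INPUT_S3_URI", "value": s3_uri}],
                }
            ]
        },
    }
```

The Lambda would then call boto3.client("ecs").run_task(**request). Passing the file location through a container override keeps a single task definition reusable across uploads.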

I hope this succinct article showcases how seamless it is to design and orchestrate a serverless architecture using AWS services. The same approach can be applied to a myriad of problems while keeping time to market and the pay-as-you-go model in mind.

References

Fargate: https://lobster1234.github.io/2017/12/03/run-tasks-with-aws-fargate-and-lambda/ — Manish Pandit
Step Functions: https://epsagon.com/blog/hitchhikers-guide-to-aws-step-functions/

Written by Siddharth Sharma