AWS EMR and Hadoop both produce hundreds of log files that report status on the cluster. AWS EMR writes step, bootstrap action, and instance state logs. Apache Hadoop writes logs that report the processing of YARN jobs, tasks, and task attempts, and it also records logs from its daemons. This makes debugging an issue complex when thousands of clusters are executed per day. One tedious way to debug an issue in a specific cluster is to SSH into tens of servers on the failed cluster and inspect the logs, which is possible only if the cluster is long-lived and archives all its logs.
For transient clusters, the logs have to be copied to a persistent store before the cluster terminates. For this, EMR provides a way to archive cluster logs to an S3 bucket. However, it doesn’t provide any built-in tool for log analysis. One cumbersome alternative is to use the EMR UI and view logs per cluster, but this approach doesn’t scale and is inefficient for log aggregation and pattern finding across a myriad of clusters.
Log aggregation system
The idea behind log aggregation is to not only get all of your logs in one place but to turn those logs into more than just text. To be useful, logs have to be something you can search on, report on, and even get statistics from. Graphs and charts are much easier to read than millions of lines of logs.
Below is a serverless solution that ingests EMR logs using Lambda, with AWS Elasticsearch + Kibana for log aggregation, searching, and visualization, as an alternative to either rolling your own ELK stack or using a paid third-party SaaS solution such as Loggly, Splunk, or Sumo Logic.
Assuming you’ve got a basic understanding of the following services:
- EMR — Managed Hadoop service
- S3 — Object Store
- Lambda — Serverless computing
- AWS Elasticsearch — Managed Elasticsearch service
- Kibana — Open-source visualization tool
AWS Elasticsearch + Kibana can be replaced with a self-managed ELK stack or any third-party log aggregation tool that supports log ingestion from an S3 bucket.
To save on data storage and cost, the aim is to ingest only EMR step logs, since developers are most often interested in application-specific log statements for every step submitted to the cluster. Log lines can be enriched with MDC (Mapped Diagnostic Context) so that every log line carries custom context, e.g. class name, user id.
Application step logs in EMR can be identified by matching a regex pattern against the S3 key.
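As an illustrative sketch (the exact key layout should be verified against your own bucket; EMR’s default structure archives step logs under `<log-prefix>/<cluster-id>/steps/<step-id>/`), such a pattern might look like:

```python
import re

# Hypothetical pattern for EMR step logs archived to S3. EMR's default
# layout is <log-prefix>/<cluster-id>/steps/<step-id>/<file>.gz; adjust
# to match the key format in your bucket.
STEP_LOG_PATTERN = re.compile(
    r"(?P<cluster_id>j-[A-Z0-9]+)/steps/(?P<step_id>s-[A-Z0-9]+)/.+\.gz$"
)

def parse_step_log_key(key):
    """Return (cluster_id, step_id) if the S3 key is a step log, else None."""
    match = STEP_LOG_PATTERN.search(key)
    if match:
        return match.group("cluster_id"), match.group("step_id")
    return None
```

The same pattern doubles as the filter in the Lambda handler and as the extractor for the cluster id.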
1. EMR automatically archives and uploads the hundreds of logs belonging to a cluster to an S3 bucket. Every cluster is partitioned by a cluster id key in the bucket.
2. Enable an S3 event on the bucket to trigger a Lambda on every new file that gets uploaded with the configured suffix.
3. In the Lambda handler code, check whether the key (filename) that was uploaded matches the regex, and if it does:
   - Fetch the cluster id from the key (filename) using the regex.
   - Query EMR with the cluster id to fetch tags (team, feature name) and other metadata.
   - Read each line in the file, convert it to JSON, enrich it with custom metadata, and insert it into Elasticsearch (sample code).
4. Enable the Kibana plugin for Elasticsearch for log analysis and visualization.
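The handler logic in step 3 can be sketched as follows. This is a minimal outline, not a complete implementation: the regex, the tag handling, and the commented-out boto3/Elasticsearch calls are assumptions, and error handling, gzip decompression, and bulk batching are omitted.

```python
import re

# Assumed key layout for EMR step logs (see the regex step above).
STEP_LOG_PATTERN = re.compile(r"(?P<cluster_id>j-[A-Z0-9]+)/steps/s-[A-Z0-9]+/.+\.gz$")

def enrich_log_line(line, cluster_id, tags):
    """Turn a raw log line into a JSON-serializable document with cluster context."""
    doc = {"message": line.rstrip("\n"), "cluster_id": cluster_id}
    doc.update(tags)  # e.g. {"team": "...", "feature": "..."}
    return doc

def handler(event, context):
    # The S3 event notification carries the bucket and key of the new file.
    record = event["Records"][0]["s3"]
    key = record["object"]["key"]

    match = STEP_LOG_PATTERN.search(key)
    if not match:
        return  # not a step log; ignore
    cluster_id = match.group("cluster_id")

    # With boto3 you would fetch the cluster's tags and metadata here, e.g.:
    #   tags = emr.describe_cluster(ClusterId=cluster_id)["Cluster"]["Tags"]
    tags = {}

    # Then download and gunzip the S3 object, and index each enriched line,
    # e.g. with an Elasticsearch client's bulk helper (sketched, not shown):
    #   for line in body:
    #       es.index(index="emr-logs", body=enrich_log_line(line, cluster_id, tags))
```

Keeping the enrichment in a pure function like `enrich_log_line` makes the transformation easy to unit-test without any AWS dependencies.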
- Instrument and enrich Hadoop APIs to log error stack traces and other important metrics (job name, number of input files, file split size, input/output paths).
- Use MDC to inject custom application values into every log line so that those values help optimize searches on indexed fields.
- Ingesting logs in near real time makes it possible to build alerts based on predefined queries.
- It is important to keep the cost of the log aggregation system low by filtering out unwanted logs and expiring Elasticsearch documents or indices based on a TTL.
- For more granular control over log ingestion, the Lambda can be replaced by a self-managed Docker instance of Logstash, which has a native plugin to periodically poll S3 buckets, enrich logs, and ingest them into an Elasticsearch instance.
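MDC itself is a JVM (SLF4J/Log4j) facility; the equivalent effect in Python can be sketched with a `logging.LoggerAdapter`, which stamps fixed context fields onto every record so they land as searchable fields in Elasticsearch. The field names below (`user_id`, `job_name`) are illustrative, not prescribed by the solution above.

```python
import io
import logging

# Python analogue of MDC: a LoggerAdapter injects fixed context fields
# into every log record, so each line carries indexable custom context.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(user_id)s %(job_name)s %(message)s"))
logger = logging.getLogger("emr-step")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Every line logged through the adapter carries user_id and job_name.
log = logging.LoggerAdapter(logger, {"user_id": "u-42", "job_name": "daily-agg"})
log.info("step started")
# stream now contains: "u-42 daily-agg step started"
```

In a JVM Hadoop job the same idea would be `MDC.put("userId", ...)` with a pattern layout that emits the MDC keys.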
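On expiring data: modern Elasticsearch versions dropped per-document TTL, so the usual approach is time-based indices (e.g. one per day) that are deleted once they age out of the retention window, typically via Curator or an ILM policy. A sketch of the selection logic, assuming a hypothetical `emr-logs-YYYY.MM.DD` daily naming scheme:

```python
from datetime import date, timedelta

INDEX_PREFIX = "emr-logs-"  # assumed daily index naming: emr-logs-YYYY.MM.DD

def indices_to_expire(index_names, today, retention_days):
    """Return the daily log indices older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in index_names:
        try:
            day = date.fromisoformat(name[len(INDEX_PREFIX):].replace(".", "-"))
        except ValueError:
            continue  # not a daily log index; leave it alone
        if day < cutoff:
            expired.append(name)
    return expired
```

A scheduled Lambda (or Curator) would call this daily and issue delete-index requests for the returned names.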