Part 2: Event-driven monitoring and alerting using CloudWatch, API Gateway, Lambda and Slack

Siddharth Sharma
5 min read · Nov 6, 2018


Systems that Run Forever Self-heal and Scale

— Joe Armstrong

Preface

This post complements my previous article on event-driven serverless architecture and explains how we’ve implemented monitoring and alerting using the same principles of

  • Event sourcing — Workflow modeling using loosely coupled stateless components such that monitoring and alerting are decoupled from the business logic.
  • No server provisioning — Using managed services and FaaS for a pay-as-you-go model

With a daily scale of ~3,000 ephemeral EMR clusters running on ~100,000 Spot Instances, it’s imperative that every system is built with fault tolerance from the ground up. The two core pillars of any self-healing, resilient system are real-time monitoring and interactive alerting.

I’ve maintained, and at times built, systems that had proper monitors and alarms yet still weren’t resilient to failures. Often, these systems raised email alerts but missed a key component of automated self-healing: interactive alerts that reach the right person or team at the right time, with the right information, and are actionable.

What’s wrong with email alerts?

Life is a dream when the system being monitored is small and sends out a few emails to a small group of people. However, as the complexity and scale of the system grow, failures grow proportionately, and email quickly goes from great to okay to unmanageable. Before you know it, tens or even hundreds of email alerts are pouring in every day. Triaging from the most recent email sounds reasonable, but it breaks down once you have to sift through dozens of them. Email is a poor alerting tool because of

  • No Actions — Getting only the content isn’t enough. As a start, after receiving an alert, I want to specify that I’m working on it or if this alert should be handled by someone else.
  • No Team Collaboration — Handling email alerts in a team is a real pain. Looking at the email, how do you share information? How do you know who is working on it? Or how do you see the notes from the last time the alert happened?
  • Reply All — It’s far too easy to hit Reply All, add your $0.02, and derail the actual meaning of the message.
  • Too much information — Difficult to find a needle in a haystack.

What did we choose?

Slack — Because we were already using Slack for team chat and collaboration, and because it has a beautifully documented API, it was an obvious choice. We didn’t jettison email, though: it remains ubiquitous and is quite useful for summarized periodic updates, especially for upper-management reporting.

We quickly implemented a custom monitoring and real-time interactive alerting system. These alerts were categorized using metadata tags and routed to team- and/or feature-specific Slack channels for better accountability and auditing.

Implementation

This walkthrough assumes a basic understanding of the following AWS services:

  • EMR — Managed Hadoop service
  • CloudWatch — Monitoring and alerting service
  • SQS — Managed queuing service
  • SNS — Managed topic service
  • Lambda — Serverless computing
  • DynamoDB — Managed NoSQL store
  • SES — Managed email service
  • API Gateway — Managed API web service
  • Slack — Team collaboration chat tool with rich API
Real time EMR monitoring and alerting on AWS

Preconditions

  1. Tags — For better cost and alert management, it is important to tag every AWS resource that is used. In this case, all EMR clusters carry Feature and Team tags that identify which team or feature owns the cluster.
  2. Slack WebHooks — Enable Incoming Webhooks on Slack channels that will receive alerts. Each channel will have a unique Webhook URL that will receive alerts in the form of POST payloads.
  3. CloudWatch Event Rule — Create a CloudWatch Event Rule with the below event source pattern to capture all TERMINATED_WITH_ERRORS cluster state change events, and add an SNS topic as the event target.
CloudWatch Event Rule

4. CloudWatch Event Cron — Add a scheduled CloudWatch rule that invokes a Lambda twice a day, at 8 AM and 8 PM PST (cron expression 0 4,16 * * ? * in UTC). This trigger consolidates all EMR failures from the last 12 hours and sends one summary report. Both rules are sketched below.
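For illustration, here is a minimal boto3 sketch of the two rules from preconditions 3 and 4. The rule names and ARNs are placeholders, not values from our setup; the SNS topic policy and the summary Lambda would additionally need permissions allowing events.amazonaws.com to publish to / invoke them.

```python
import json
import boto3

events = boto3.client("events")

# Precondition 3: match EMR clusters that terminate with errors and fan out to an SNS topic.
events.put_rule(
    Name="emr-terminated-with-errors",  # hypothetical rule name
    EventPattern=json.dumps({
        "source": ["aws.emr"],
        "detail-type": ["EMR Cluster State Change"],
        "detail": {"state": ["TERMINATED_WITH_ERRORS"]},
    }),
    State="ENABLED",
)
events.put_targets(
    Rule="emr-terminated-with-errors",
    Targets=[{"Id": "failure-topic", "Arn": "arn:aws:sns:us-east-1:123456789012:emr-failures"}],
)

# Precondition 4: UTC cron schedule that fires at 8 AM and 8 PM PST to trigger the summary Lambda.
events.put_rule(
    Name="emr-failure-summary-cron",  # hypothetical rule name
    ScheduleExpression="cron(0 4,16 * * ? *)",
    State="ENABLED",
)
events.put_targets(
    Rule="emr-failure-summary-cron",
    Targets=[{"Id": "summary-lambda", "Arn": "arn:aws:lambda:us-east-1:123456789012:function:emr-summary"}],
)
```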

Walkthrough

  1. The CloudWatch Event Rule listens for EMR cluster state changes and, on any failure event, sends a JSON payload containing the cluster id to an SNS topic. The advantage of using SNS over invoking Lambda or SQS directly is that multiple disparate subscribers can be notified at the same time.
  2. The SNS topic has two subscribers (wired up as sketched after this list):
  • Lambda for real-time alerting — Reads individual messages in real time and sends an alert on Slack.
  • SQS Queue for consolidated reporting — Queues up messages until a listener reads them in bulk and sends out a summary email.
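As a rough boto3 sketch (the ARNs and names are placeholders), attaching both subscribers to the topic looks something like this:

```python
import boto3

sns = boto3.client("sns")
topic_arn = "arn:aws:sns:us-east-1:123456789012:emr-failures"  # placeholder

# Subscriber 1: the real-time alerting Lambda.
sns.subscribe(
    TopicArn=topic_arn,
    Protocol="lambda",
    Endpoint="arn:aws:lambda:us-east-1:123456789012:function:emr-slack-alerter",
)

# Subscriber 2: the SQS queue that buffers failures for the 12-hour summary.
sns.subscribe(
    TopicArn=topic_arn,
    Protocol="sqs",
    Endpoint="arn:aws:sqs:us-east-1:123456789012:emr-failure-queue",
)
```

The Lambda also needs a resource-based permission allowing SNS to invoke it, and the queue policy must allow the topic to send messages to it.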

Real time alerting

  • As soon as a message arrives in the SNS topic, the Lambda is triggered; it queries the EMR service with the given cluster id and reads the metadata tags to determine which team or feature group should be notified.
  • It builds an interactive Slack attachment payload, looks up the team-to-channel webhook URL mapping, and sends a POST request to that URL. A feature-to-channel webhook URL mapping is also supported for users interested in alerts on business-critical features.
  • The message is delivered to the intended Slack channel with an interactive Retry button (this Lambda is sketched after the list).
Interactive Slack Message
  • If a user wants to retry, they can click the Retry button, which invokes an AWS API Gateway endpoint and passes along the cluster id and the name of the user who attempted the retry.
  • The API Gateway endpoint relays the message to a Lambda, which queries DynamoDB for the original EMR cluster definition and submits a new attempt. Business logic prevents more than x retries.
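A simplified sketch of what the alerting Lambda could look like. The webhook mapping is hypothetical, the event fields follow the shape of EMR cluster state change events delivered through SNS, and the attachment uses Slack’s legacy interactive-message format, whose button clicks are routed (via the Slack app’s interactivity request URL) to the API Gateway endpoint described above.

```python
import json
import urllib.request
import boto3

emr = boto3.client("emr")

# Hypothetical mapping of Team tag value -> Slack incoming-webhook URL.
TEAM_WEBHOOKS = {
    "data-platform": "https://hooks.slack.com/services/T000/B000/XXXX",
}
DEFAULT_WEBHOOK = "https://hooks.slack.com/services/T000/B000/YYYY"  # catch-all channel

def handler(event, context):
    # The SNS record wraps the original CloudWatch event; pull out the cluster id.
    detail = json.loads(event["Records"][0]["Sns"]["Message"])["detail"]
    cluster_id = detail["clusterId"]

    # Read the Team/Feature tags from the failed cluster to decide where to route the alert.
    cluster = emr.describe_cluster(ClusterId=cluster_id)["Cluster"]
    tags = {t["Key"]: t["Value"] for t in cluster.get("Tags", [])}
    webhook = TEAM_WEBHOOKS.get(tags.get("Team"), DEFAULT_WEBHOOK)

    # Legacy Slack interactive attachment with a Retry button.
    payload = {
        "text": f"EMR cluster {cluster['Name']} ({cluster_id}) terminated with errors",
        "attachments": [{
            "fallback": "Retry the failed cluster",
            "callback_id": "emr_retry",
            "actions": [{"name": "retry", "text": "Retry", "type": "button", "value": cluster_id}],
        }],
    }
    request = urllib.request.Request(
        webhook,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)
```

And a sketch of the retry Lambda behind API Gateway, assuming a hypothetical DynamoDB table that stores the original run_job_flow arguments keyed by cluster id; the retry cap stands in for the "x retries" guard mentioned above.

```python
import boto3

dynamodb = boto3.resource("dynamodb")
emr = boto3.client("emr")
definitions = dynamodb.Table("emr-cluster-definitions")  # hypothetical table name
MAX_RETRIES = 3  # stand-in for the "x retries" business rule

def retry_handler(event, context):
    # cluster_id and user are assumed to be extracted from the Slack payload by an API Gateway mapping.
    cluster_id = event["cluster_id"]
    item = definitions.get_item(Key={"cluster_id": cluster_id})["Item"]

    if item.get("retries", 0) >= MAX_RETRIES:
        return {"statusCode": 200, "body": "Retry limit reached"}

    # Resubmit the cluster from its stored definition
    # (in practice the stored kwargs may need type cleanup, e.g. Decimal -> int).
    emr.run_job_flow(**item["definition"])
    definitions.update_item(
        Key={"cluster_id": cluster_id},
        UpdateExpression="SET retries = if_not_exists(retries, :zero) + :one",
        ExpressionAttributeValues={":zero": 0, ":one": 1},
    )
    return {"statusCode": 200, "body": f"Retry submitted by {event.get('user', 'unknown')}"}
```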

Consolidated reporting

  • CloudWatch raises a scheduled event every 12 hours and triggers a Lambda invocation.
  • The Lambda polls the SQS queue in batches, queries EMR to enrich each failure, consolidates them, and sends out an email using SES, as sketched below.
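A minimal sketch of the summary Lambda, with the queue URL and email addresses as placeholders: it drains the queue in batches, enriches each failure with cluster details from EMR, and sends one SES email.

```python
import json
import boto3

sqs = boto3.client("sqs")
emr = boto3.client("emr")
ses = boto3.client("ses")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/emr-failure-queue"  # placeholder

def summary_handler(event, context):
    failures = []
    # Drain the queue in batches of up to 10 messages.
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=1)
        messages = resp.get("Messages", [])
        if not messages:
            break
        for msg in messages:
            # SNS wraps the CloudWatch event, so unwrap twice to reach the EMR detail.
            detail = json.loads(json.loads(msg["Body"])["Message"])["detail"]
            cluster = emr.describe_cluster(ClusterId=detail["clusterId"])["Cluster"]
            failures.append(f"{cluster['Name']} ({detail['clusterId']}): {detail.get('message', '')}")
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

    body = "\n".join(failures) or "No EMR failures in the last 12 hours."
    ses.send_email(
        Source="alerts@example.com",  # placeholder addresses
        Destination={"ToAddresses": ["team@example.com"]},
        Message={
            "Subject": {"Data": "EMR failure summary (last 12 hours)"},
            "Body": {"Text": {"Data": body}},
        },
    )
```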

Learnings

  1. Significant improvement in Mean Time To Repair (MTTR), as failures were detected and fixed within SLA and with greater accountability.
  2. Root Cause Analysis (RCA) of failures was discussed in context and made visible to a wider audience.
  3. Improved productivity, as dedicated alerting channels meant engineers weren’t constantly interrupted and could separate signal from noise.
  4. Summarized reports helped management and product owners identify top failure reasons and decide where to spend engineering hours.
