Avoid Observability MELTdown using Metrics, Events, Logs and Traces
At the end of August 2022, I was honoured to represent Habu and speak at the inaugural super{summit} developer conference in Denver, spending three days with nearly 100 like-minded early-stage startup builders from 11 different super{set} companies.
This post recaps some of the learnings and takeaways from one of my talks, co-presented with Peter Han and Tyler Bonilla from Ketch, on a topic that has been near and dear to my heart over the last decade and a half of my software development journey: Observability (o11y)!
It’s amusing to see how our industry flocks around the latest tech buzzword. It’s cool, hip and trending, so why not get on the bandwagon?
The new buzzword kid on the block is Observability!
Reading along should bring you up to speed on some of the concepts, patterns, tools and cultural aspects of embracing observability, to the best of my knowledge. All references to open source and vendor tools are for illustration purposes only, to guide you on the path of embracing observability. Depending on the cost and maintenance overhead trade-offs, you can evaluate the right tool for your business needs.
What is Observability?
In control theory, Observability lets us understand a system from the outside by asking questions about it, without knowing the inner workings of that system. In other words, Observability lets us deal with unknown unknowns.
In software parlance, Observability is the ability to ask new questions of the health of your running services without deploying new code.
Why is Observability Important?
In a world of cloud-native loosely coupled services, polyglot persistence, and dynamic infrastructure, traditional metrics-based monitoring approaches are woefully inadequate when it comes to understanding system state, and triaging and diagnosing behavioural and performance issues.
Over the years, Kubernetes has become the developer's choice for designing and deploying scalable, distributed applications. However, Kubernetes is unaware of the internal state of an application, and its dynamic nature has given rise to a growing number of problems for platform engineers who need to keep track of performance despite the pace of change.
Observability vs. Monitoring
These two are often used interchangeably; however, they differ in their application. Monitoring lets you verify that the infrastructure and applications are functioning as expected. Observability, on the other hand, provides comprehensive, actionable insights to help you improve performance and make the applications and the entire infrastructure more stable and resilient.
Observability Challenges
Managing networking in a monolithic architecture is a relatively simple task: the path between the client and the server generally runs through a finite collection of points.
In a distributed microservice architecture, the network becomes much more critical and complex: the path between client and application is far more winding and harder to reason about due to:
- The complexity of dynamic multi-cloud environments
- Dynamic microservices and containers on Kubernetes spot nodes, changing in real time (pets vs. cattle)
- The volume, velocity, and variety of data and alerts
- Siloed Dev and Ops Teams
All of this makes root cause analysis and incident resolution potentially a lot harder, yet the same questions still need answering:
- What is the overall health of my solution?
- What is the root cause of errors and defects?
- What are the performance bottlenecks?
- Which of these problems could impact the user experience?
Observability Pillars
In the past few years, much has been talked about and written about the “three pillars of observability”: metrics, logs, and traces.
Traces
A Trace records the paths taken by requests as they propagate through microservices.
A Span represents a unit of work or operation. It tracks specific operations that a request makes, painting a picture of what happened during the time in which that operation was executed.
Metrics
Metrics are aggregations, over a period of time, of numeric data about your infrastructure or application.
Logs
A log is a record of events that happened over time: a snapshot of something that happened, with an associated timestamp.
However, as pointed out by Charity Majors, these three pillars are not exhaustive and can be complemented with other pillars, namely Events and Profiles.
Events
Though pretty much all signals are events, Events here refer specifically to occurrences external to the observed system that cause some change in that system. The most common examples are deployments of application code, configuration changes, experiments, and auto-scaling events.
Events can be seen as analogous to structured logs; however, they differ from logs in the following traits:
- Zero data loss — each event can indicate an important signal for incident troubleshooting, so none should be dropped.
- High Precision — accuracy in finding the right event.
- Low Volume — events are usually produced in much lower volumes than logs.
Profiling
Profiling is the act of measuring a program’s behavior using data we gather as our code executes (for example, frequency and duration of function calls, CPU time or memory usage, and more).
Profiling is a new addition to the Observability stack, and the profiling SIG (special interest group) has only just kicked off. I will be keeping a close eye on it as it evolves and gains support from various open source and vendor tools.
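Continuous profiling support is still maturing across the ecosystem, but most runtimes can already expose profiles on demand. As a minimal illustration (separate from the OpenTelemetry profiling work), a Go service can expose CPU, heap and goroutine profiles using the standard net/http/pprof package; port 6060 below is just a conventional choice:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Profiles become available at http://localhost:6060/debug/pprof/,
	// e.g. `go tool pprof http://localhost:6060/debug/pprof/profile` for a CPU profile.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```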
Best Practices
Below are some of the best practices for implementing Observability in your stack. They can be used to build common shim layers or shared libraries for cross-cutting concerns. Developing such generic layers in your programming language of choice avoids code duplication, helps apply standard conventions across the stack, and simplifies maintenance and upgrades.
Logs
- Centralized
Ingest, index and visualize all the logs from various sources in a centralized store using either open source (Fluent Bit, Loki, ELK, etc.) or vendor tools (DataDog, Honeycomb, New Relic).
- Structured (JSON)
Use JSON structured logging as an alternative to traditional plain-text logging. Logs written in JSON are easily readable by both humans and machines, and structured JSON logs are easily tabularized to enable filtering and queries (see the sketch after this list).
- Contextual (Baggage)
Include meaningful information about the event that triggered the log, as well as additional context that can help you understand what happened, find correlations with other events, and diagnose potential issues that require further investigation. A few examples of such fields are:
— User Request Identifiers (X-Request-Id)
— Unique Identifiers (X-User-Id, X-Tenant-Id)
- Redact PII or Sensitive Data
Avoid logging sensitive data and personally identifiable information (PII) that may be covered by data privacy and security regulations or standards like the European GDPR, HIPAA, or PCI DSS.
- Common Interceptors
Use interceptors wherever applicable for logging gRPC/REST requests/responses and prefer using language constructs (MDC in Java or Go Context) to inject context across all logs that are part of the same request/response lifecycle.
- Levels
Use appropriate log levels (INFO, WARN, ERROR) and avoid logging non-essential information that doesn't help with diagnostics or root cause analysis; excess logging increases time-to-insight, data volumes, and costs.
- Hot vs Cold Persistence
Set different retention policies (S3 IA, Glacier) for different types of logs, depending on the cost and compliance needs.
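To make the practices above concrete, here is a minimal sketch in Go (handler names and fields are illustrative, not a prescribed layout). It uses the standard log/slog package (Go 1.21+) to emit structured JSON logs, with a simple HTTP middleware acting as a REST interceptor that enriches every log line with contextual fields such as X-Request-Id:

```go
package main

import (
	"log/slog"
	"net/http"
	"os"
	"time"
)

// A JSON handler makes every log line machine-parseable, so a centralized
// store (Fluent Bit, Loki, DataDog, ...) can index and query the fields.
var logger = slog.New(slog.NewJSONHandler(os.Stdout, nil))

// withRequestLogging is a simple REST "interceptor" (middleware): it derives a
// request-scoped logger carrying contextual fields, similar to MDC in Java.
func withRequestLogging(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		reqLogger := logger.With(
			slog.String("request_id", r.Header.Get("X-Request-Id")),
			slog.String("method", r.Method),
			slog.String("path", r.URL.Path),
		)
		next.ServeHTTP(w, r)
		// Remember: never log PII or other sensitive data here.
		reqLogger.Info("request completed", slog.Duration("duration", time.Since(start)))
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	logger.Info("starting server", slog.String("addr", ":8080"))
	if err := http.ListenAndServe(":8080", withRequestLogging(mux)); err != nil {
		logger.Error("server stopped", slog.Any("error", err))
	}
}
```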
Metrics
- Infrastructure Metrics
More metrics are always better if you have the right tools. Hence, gather all the infrastructure golden signals using either open source (Prometheus with Thanos or Cortex) or vendor tools (DataDog, Honeycomb, etc.).
- Custom Application Metrics
Build language-specific libraries (Micrometer, Prometheus clients) to instrument application code to emit key business metrics that need alerting (count of failed custom jobs, count of in-flight messages in a queue); see the sketch after this list.
- Types
Depending on the use case, use the appropriate metric type to capture data points, which can then be aggregated into Sums, Gauges, or Histograms.
- Common Interceptors
Like logging, use interceptors wherever applicable to capture gRPC/REST request/response metrics and prefer using language constructs (MDC in Java or Go Context) to inject context across all metrics that are part of the same request/response lifecycle.
- Methods to collect Golden Signals
RED (Rate, Error, Duration)
Request-scoped — For every request, check the rate, errors, and duration.
- Rate: Request Throughput, in requests per second
- Errors: Request Error Rate, as either a throughput metric or a fraction of overall throughput
- Duration: Request Latency, Duration, or Response Time
USE (Utilization, Saturation, Error)
Resource-scoped — For every resource, check utilization, saturation, and errors.
- Utilization: the average time the resource was busy servicing work
- Saturation: the degree to which the resource has extra work which it can’t service, often queued
- Errors: the count of error events
Once the golden signals are collected, they can be used collectively for alerting, troubleshooting or tuning and capacity planning.
- Cardinality
Cardinality is the number of unique combinations of metric names and dimension values. Choose which dimensions to attach to your metrics based on the meaningful information you want to extract from your telemetry data. Immutable infrastructure on Kubernetes and containers can lead to a cardinality explosion: because resources are replaced rather than updated in place, every new pod or container introduces fresh unique dimension values.
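As a minimal sketch of custom application metrics and the RED method (again in Go, with illustrative metric and route names), the snippet below uses the Prometheus client library to expose a counter, a histogram and a gauge on a /metrics endpoint, keeping label values deliberately low-cardinality:

```go
package main

import (
	"math/rand"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// RED-style metrics: rate and errors via a counter, duration via a histogram.
// Keep labels low-cardinality (route and status code, not user or request IDs).
var (
	requestsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "http_requests_total",
		Help: "Total number of HTTP requests.",
	}, []string{"route", "code"})

	requestDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "http_request_duration_seconds",
		Help:    "HTTP request latency in seconds.",
		Buckets: prometheus.DefBuckets,
	}, []string{"route"})

	// A custom business metric, e.g. jobs currently in flight.
	jobsInFlight = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "custom_jobs_in_flight",
		Help: "Number of custom jobs currently being processed.",
	})
)

func handleOrders(w http.ResponseWriter, r *http.Request) {
	timer := prometheus.NewTimer(requestDuration.WithLabelValues("/orders"))
	defer timer.ObserveDuration()

	jobsInFlight.Inc()
	defer jobsInFlight.Dec()

	time.Sleep(time.Duration(rand.Intn(50)) * time.Millisecond) // simulated work
	requestsTotal.WithLabelValues("/orders", "200").Inc()
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/orders", handleOrders)
	// Prometheus (or the OpenTelemetry Collector) scrapes this endpoint.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```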
Traces
- Context
In addition to the default W3C Span Context, code can be instrumented to use custom key-value attributes to annotate a Span to carry information about the operation it is tracking.
For example, if a span tracks an operation that adds an item to a user's shopping cart in an eCommerce system, you can capture the user's ID, the ID of the item added to the cart, and the cart ID (see the sketch after this list).
- Implicit vs Explicit Propagation
An application can be instrumented for emitting traces either automatically or manually.
- With automatic instrumentation, the API and SDK take the configuration provided (through code or environment variables) and do most of the work automatically (for example, starting and ending spans).
- Manual instrumentation, while requiring more work on the user/developer side, enables far more options for customization, from naming various components within OpenTelemetry (for example, spans and the tracer) to adding your own attributes, specific exception handling, and more.
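Below is a minimal sketch of manual instrumentation with the OpenTelemetry Go SDK, following the shopping-cart example above. It assumes an OpenTelemetry Collector listening on localhost:4317 (the default OTLP gRPC port); the checkout tracer and AddItemToCart span names are illustrative:

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func initTracer(ctx context.Context) (*sdktrace.TracerProvider, error) {
	// Export spans over OTLP/gRPC to a local OpenTelemetry Collector.
	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("localhost:4317"),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))
	otel.SetTracerProvider(tp)
	return tp, nil
}

// addToCart shows manual instrumentation: a span annotated with
// custom attributes describing the operation it tracks.
func addToCart(ctx context.Context, userID, itemID, cartID string) {
	ctx, span := otel.Tracer("checkout").Start(ctx, "AddItemToCart")
	defer span.End()

	span.SetAttributes(
		attribute.String("user.id", userID),
		attribute.String("item.id", itemID),
		attribute.String("cart.id", cartID),
	)
	// ... call the cart service, passing ctx so child spans join this trace.
	_ = ctx
}

func main() {
	ctx := context.Background()
	tp, err := initTracer(ctx)
	if err != nil {
		panic(err)
	}
	defer tp.Shutdown(ctx)

	addToCart(ctx, "user-123", "item-456", "cart-789")
}
```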
Tools, Tools, Tools….
The monitoring market used to be dominated by proprietary vendors. Each vendor had its own share of pros and cons related to cost and feature support. In response, various free and open-source software projects started or were spun out of tech companies. Early examples include Prometheus for metrics and Zipkin and Jaeger for tracing. In the logging space, the “ELK stack” (Elasticsearch, Logstash, and Kibana) gained market share and became popular.
The market has hit an inflection point, and cloud-native architectures are much larger in scale, more distributed, and more interdependent. Developers need the flexibility to choose and control the data they collect and analyze.
OpenTelemetry to the rescue
One key milestone was the merger of the OpenTracing and OpenCensus projects to form OpenTelemetry, a major project within CNCF.
- OpenTelemetry is a collection of tools, APIs, and SDKs.
- It provides a vendor-agnostic standard for observability, aiming to standardise the generation and collection of telemetry data (traces, metrics, and logs).
- This is good because it means we are not tied to any tool (or vendor). Not only can we use any programming language we want, but we can also pick and choose the storage backend, thus avoiding potential lock-in to commercial vendors.
- It also means that developers can instrument their applications without having to know where the data will be stored.
- It isn’t a data ingest, storage, backend, or visualization component. Such components will be provided either by other open-source projects or by vendors.
Sample OpenTelemetry Collector Config
The configuration below should help you get started locally by running Docker containers of Prometheus, Jaeger and OpenTelemetry Collector.
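Here is a minimal sketch of such a Collector config; the exporter endpoints (jaeger:14250 for the Jaeger container, port 8889 for the Prometheus scrape endpoint) are illustrative and depend on how your Docker containers are networked, and the exporter names reflect Collector versions current at the time of writing:

```yaml
receivers:
  otlp:
    protocols:
      grpc:   # applications send OTLP over gRPC (default port 4317)
      http:   # or over HTTP (default port 4318)

processors:
  batch:      # batch telemetry before export to reduce overhead

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"   # scraped by the Prometheus container
  jaeger:
    endpoint: "jaeger:14250"   # Jaeger collector gRPC endpoint
    tls:
      insecure: true
  logging:                     # also dump telemetry to stdout for debugging

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [jaeger, logging]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus, logging]
```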
You can then instrument your applications either automatically by using the OpenTelemetry Agents or by manually instrumenting them using OpenTelemetry SDK.
In both approaches, metrics and traces from your applications are sent to the OpenTelemetry Collector over either gRPC or HTTP. The Collector can then be configured with pipelines that filter or enrich telemetry data before exporting it to tools like Jaeger, Prometheus, DataDog or New Relic.
This multiplexing gateway approach avoids first having to select the tool and then instrument your application using the selected tool’s SDKs or Agents. You can seamlessly swap out tools without changing your application code.
Observability is more than Logs, Metrics, Events & Traces
Similar to DevOps, one cannot buy observability off the shelf. Tooling is part of the equation — you’ll need a platform that can ingest, correlate and analyze data — but tools alone aren’t the key to observability. It’s more than deploying certain tools or adopting certain workflows since it has to be embraced in your engineering culture. The culture has to be supported and backed by the engineering leadership and applied to all aspects of the software development lifecycle.
Across the board, super{set} companies are embracing observability by leveraging some of the above patterns and tools as part of their engineering culture and using it to improve the developer and customer experience.
If you’re passionate about Platform Engineering and a motivated self-starter with an entrepreneurial spirit and love working in fast-paced environments, we would like to hear from you. Please use our careers portal to check out the opportunities and apply today.