Learning how to instrument an application for the first time can be a daunting task. There's a fair amount of terminology to understand before jumping into the code. I always find that seeing the finish line helps me get motivated and stay on track. This chapter's goal is to see what telemetry generated by OpenTelemetry looks like in practice while learning about the theory. In this chapter, we will dive into the specifics of the following:
To get a more practical sense of the terminology and become comfortable with telemetry, we will look at the data using various open source tools that help us query and visualize it.
This chapter will use an application that is already instrumented with OpenTelemetry (a grocery store) and several backends to walk through the different concepts of the signals. The environment we will be launching relies on Docker Compose. The first step is to install Docker by following the installation instructions at https://docs.docker.com/get-docker/. Ensure Docker is running on your local system by using the following command:
$ docker version
Client:
Cloud integration: 1.0.14
Version: 20.10.6
API version: 1.41
Go version: go1.16.3 ...
Next, let's ensure Compose is also installed by running the following command:
$ docker compose version
Docker Compose version 2.0.0-beta.1
Important Note
Compose was added to the Docker client in more recent client versions. If the previous command returns an error, follow the instructions on the Docker website (https://docs.docker.com/compose/install/) to install Compose. Alternatively, you may want to try the docker-compose command to see if you already have an older version installed.
The following diagram shows an overview of the containers we are launching in the Docker environment to give you an idea of the components involved. The applications on the left are emitting telemetry processed by the Collector and forwarded to the telemetry backends. The diagram also shows the port number exposed by each container for future reference.
This chapter briefly introduces the following open source projects that support the storage and visualization of OpenTelemetry data:
I strongly recommend visiting the website for each project to gain familiarity with the tools, as we will use them throughout the chapter. Each of these tools will be revisited in Chapter 10, Configuring Backends. No prior knowledge of them is required to go through the examples, but they are pretty helpful to have in your toolbelt. The configuration files necessary to launch the applications in this chapter are available in the companion repository (https://github.com/PacktPublishing/Cloud-Native-Observability) in the chapter02 directory. The following command downloads the repository using git:
$ git clone https://github.com/PacktPublishing/Cloud-Native-Observability
$ cd Cloud-Native-Observability/chapter02
To bring up the applications and telemetry backends, run the following command:
$ docker compose up
We will test the various tools to ensure each one is working as expected and is accessible from your browser. Let's start with Jaeger by accessing the following URL: http://localhost:16686. The following screenshot shows the interface you should see:
The next backend this chapter will use for metrics is Prometheus; let's test the application by visiting http://localhost:9090. The following screenshot is a preview of the Prometheus web interface:
The last tool we need to ensure is working in our backend for logs is Loki. We will use Grafana as a dashboard to visualize the logs being emitted. Begin by visiting http://localhost:3000/explore to ensure Grafana is up; you should be greeted by an interface like the one in Figure 2.4:
The next application we will check is the OpenTelemetry Collector, which acts as the routing layer for all the telemetry produced by the example application. The Collector exposes a health check endpoint discussed in Chapter 8, OpenTelemetry Collector. For now, it's enough to know that accessing the endpoint will give us information about the health of the Collector, using the following curl command:
$ curl localhost:13133
{"status":"Server available","upSince":"2021-10-03T15:42:02.7345149Z","uptime":"9.3414709s"}
Lastly, let's ensure the containers forming the grocery store demo application are running. To do this, we use curl again in the following commands to access an endpoint each application exposes that returns its health status. Any other tool capable of making HTTP requests, including the browser, can accomplish this. The following checks the status of the grocery store:
$ curl localhost:5000/healthcheck
{
"service": "grocery-store",
"status": "ok"
}
The same command can be used to check the status of the inventory application by specifying port 5001:
$ curl localhost:5001/healthcheck
{
"service": "inventory",
"status": "ok"
}
The shopper application represents a client application and does not provide any endpoint to expose its health status. Instead, we can look at the logs emitted by the application to get a sense of whether it's doing the right thing or not. The following uses the docker logs command to look at the output from the application. Although it may vary slightly, the output should contain information about the shopper connecting to the grocery store:
$ docker logs -n 2 shopper
DEBUG:urllib3.connectionpool:http://grocery-store:5000 "GET /products HTTP/1.1" 200 107
INFO:shopper:message="add orange to cart"
The same docker logs command can be used on any of the other containers if you're interested in seeing more information about them. Once you're done with the chapter, you can clean up all the containers by running stop to terminate the running containers, and rm to delete the containers themselves:
$ docker compose stop
$ docker compose rm
All the examples in this chapter will expect that the Docker Compose environment is already up and running. When in doubt, come back to this technical requirement section to ensure your environment is still running as expected. Now, let's see what these OpenTelemetry signals are all about, starting with traces.
Distributed tracing is the foundation behind the tracing signal of OpenTelemetry. A distributed trace is a series of event data generated at various points throughout a system tied together via a unique identifier. This identifier is propagated across all components responsible for any operation required to complete the request, allowing each operation to associate the event data to the originating request. The following diagram gives us a simplified example of what a single request may look like when ordering groceries through an app:
Each trace represents a unique request through a system that can be either synchronous or asynchronous. Synchronous requests occur in sequence, with each unit of work completed before continuing. An example of a synchronous request may be a client application making a call to a server and waiting, or blocking, until a response is returned before proceeding. In contrast, asynchronous requests can initiate a series of operations that occur simultaneously and independently. An example of an asynchronous request is a server application submitting messages to a queue, or a process that batches operations. Each operation recorded in a trace is represented by a span, a single unit of work done in the system. Let's see what the specifics of the data captured in the trace look like.
The definition of what constitutes a trace has evolved as various systems have been developed to support distributed tracing. The World Wide Web Consortium (W3C), an international group that collaborates to move the web forward, assembled a working group in 2017 to produce a definition for tracing. In February 2020, the first version of the Trace Context specification was completed, with its details available on the W3C's website (https://www.w3.org/TR/trace-context-1/). OpenTelemetry follows the recommendation from the W3C in its definition of the SpanContext, which contains information about the trace and must be propagated throughout the system. The elements of a trace available within a span context include the following:
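These elements of the span context map directly onto the W3C traceparent header that carries them between services. The following is a minimal, standard-library-only sketch of parsing such a header; the sample value comes from the examples in the W3C specification:

```python
import re

# A traceparent header carries four fields: version, trace ID
# (16 bytes), parent/span ID (8 bytes), and trace flags (1 byte),
# all hex-encoded and separated by dashes.
TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-"
    r"(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<span_id>[0-9a-f]{16})-"
    r"(?P<flags>[0-9a-f]{2})$"
)

def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header into its four fields."""
    match = TRACEPARENT_RE.match(header)
    if match is None:
        raise ValueError(f"invalid traceparent: {header!r}")
    fields = match.groupdict()
    # An all-zero trace ID or span ID is invalid per the specification.
    if fields["trace_id"] == "0" * 32 or fields["span_id"] == "0" * 16:
        raise ValueError("trace ID and span ID must be non-zero")
    return fields

header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(header)
print(ctx["trace_id"])  # 4bf92f3577b34da6a3ce929d0e0e4736
print(ctx["flags"])     # 01 -> the "sampled" flag is set
```

Every service that receives this header and forwards it downstream keeps the trace ID intact, which is what ties all the spans of a request together.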
A span can represent a method call or a subset of the code being called within a method. Multiple spans within a trace are linked together in a parent-child relationship, with each child span containing information about its parent. The first span in a trace is called the root span and is identified because it does not have a parent span identifier. The following shows a typical visualization of a trace and the spans associated with it. The horizontal axis indicates the duration of the entire trace operation. The vertical axis shows the order in which the operations captured by spans took place, starting with the first operation at the top:
Let's look closer at a trace by bringing up a sample generated from the telemetry produced by the grocery store application. Access the Jaeger web interface by opening a browser to the following URL: http://localhost:16686/.
Search for a trace by selecting a service from the drop-down and clicking the Find Traces button. The following screenshot shows the traces found for the shopper service:
To obtain details about a specific trace, select one of the search results by clicking on the row. The following screenshot, Figure 2.8, shows the details of the trace generated by a request through the grocery store applications. It includes the following:
The preceding screenshot gives us an immediate sense of where time may be spent as the system processes the request. It also provides us with a glimpse into what the underlying code may look like without ever opening an editor. Additional details are captured in spans; let's look at those now.
As mentioned previously, the work captured in a trace is broken into separate units or operations, each represented by a span. The span is a data structure containing the following information:
In OpenTelemetry, a span identifier is represented by a 64-bit integer. The start and end times are used to calculate the operation's duration. Additionally, spans can contain metadata in the form of key-value pairs. In the case of Jaeger and Zipkin, these pairs are referred to as tags, whereas OpenTelemetry calls them attributes. In both cases, the goal is to enrich the data with additional context.
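To make these fields concrete, here is a span modeled as a plain Python dataclass. This is an illustrative toy, not the OpenTelemetry SDK's actual span class, and the field names are my own:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ToySpan:
    # Illustrative model only, not the OpenTelemetry SDK's Span class.
    name: str
    trace_id: int
    span_id: int
    parent_id: Optional[int] = None   # None identifies the root span
    start_time: float = 0.0
    end_time: float = 0.0
    attributes: dict = field(default_factory=dict)

    @property
    def duration(self) -> float:
        # The operation's duration is derived from start and end times.
        return self.end_time - self.start_time

root = ToySpan("checkout", trace_id=1, span_id=10,
               start_time=0.0, end_time=0.25,
               attributes={"http.method": "GET"})
child = ToySpan("query-inventory", trace_id=1, span_id=11,
                parent_id=10, start_time=0.05, end_time=0.20)

print(root.parent_id is None)           # True: no parent, so it's the root
print(child.trace_id == root.trace_id)  # True: both belong to the same trace
```

Note how the child's parent_id points at the root's span_id; that parent-child link is what a tracing backend uses to draw the waterfall view.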
Look for the following details in Figure 2.9, which shows the detailed view of a specific span as shown in Jaeger:
Many of the tags captured in the span shown previously rely on semantic conventions, which will be discussed further in this chapter.
When producing distributed traces in a system, it's worth considering the tradeoffs of the additional visibility. Generating tracing information can incur performance overhead at the application level. It can add latency if tracing information is gathered and transmitted inline. There is also memory overhead to consider, as collecting information inevitably allocates resources. These concerns can be largely mitigated using configuration available in OpenTelemetry, as we'll see in Chapter 4, Distributed Tracing – Tracing Code Execution.
Depending on where the data is sent, additional costs, such as bandwidth or storage, can also become a factor. One of the ways to mitigate these costs is to reduce the amount of data produced by sampling only a certain amount of the data. We will dive deeper into sampling in Chapter 12, Sampling.
Another challenging aspect of producing distributed tracing data is ensuring that all the services correctly propagate the context. Failing to propagate the trace ID across the system means that requests will be broken into multiple traces, making them difficult to use or not helpful at all.
The last thing to consider is the effort required to instrument an application correctly. This is a non-trivial amount of effort, but as we'll see in future chapters, OpenTelemetry provides instrumentation libraries to make this easier.
Now that we have a deeper understanding of traces, let's look at metrics.
Just as distributed traces do, metrics provide information about the state of a running system to developers and operators. The data collected via metrics can be aggregated over time to identify trends and patterns, and graphed through various tools and visualizations. The term metric covers a broad range of measurements: metrics can capture low-level system details such as CPU cycles, or higher-level details such as the number of blue sweaters sold today. These examples would be helpful to different groups in an organization.
Additionally, metrics are critical to monitoring the health of an application and deciding when an on-call engineer should be alerted. They form the basis of service level indicators (SLIs) (https://en.wikipedia.org/wiki/Service_level_indicator) that measure the performance of an application. These indicators are then used to set service level objectives (SLOs) (https://en.wikipedia.org/wiki/Service-level_objective) that organizations use to calculate error budgets.
Important Note
SLIs, SLOs, and service level agreements (SLAs) are essential topics in production environments where third-party dependencies can impact the availability of your service. Entire books are dedicated to the topic, which we will not cover here. The Google site reliability engineering (SRE) book is a great resource for this: https://sre.google/sre-book/service-level-objectives/.
The metrics signal of OpenTelemetry combines various existing open source formats into a unified data model. Primarily, it looks to OpenMetrics, StatsD, and Prometheus for existing definitions, requirements, and usage, ensuring the use cases of each of those communities are understood and addressed by the new standard.
Just about anything can be a metric; record a value at a given time, and you have yourself a metric. The common fields a metric contains include the following:
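Whatever the exact set of fields, a metric can be modeled minimally as a named value with a timestamp and identifying attributes. The following is an illustrative sketch; the field names are my own, not the OpenTelemetry data model's:

```python
import time
from typing import NamedTuple

class Measurement(NamedTuple):
    # Illustrative fields only: a minimal metric data point.
    name: str
    value: float
    timestamp: float
    attributes: dict

m = Measurement(
    name="request_counter",
    value=1,
    timestamp=time.time(),
    attributes={"service": "grocery-store", "http.status_code": 200},
)
print(m.name, m.value)  # request_counter 1
```

Record a value at a given time under a name, and you have yourself a metric; everything else (aggregation, temporality, attributes) builds on this shape.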
Let's look at data produced by metrics sent from the demo application. Access the Prometheus interface via a browser and the following URL: http://localhost:9090. The user interface for Prometheus allows us to query the time-series database by using the metric's name. The following screenshot contains a table showing the value of the request_counter metric. Look for the following details in the resulting table:
The table view shows the current value as cumulative. An alternative representation of the recorded metric is shown in the following figure. As the data received by Prometheus is stored over time, a line graph can be generated. Click the Graph tab of the interface to see what the data in a chart looks like:
By looking at the values for the metric over time, we can deduce additional information about the service, for example, its start time or trends in its usage. Visualizing metrics also provides opportunities to identify anomalies.
A metric is a more generic term that encapsulates different measurements that can be used to represent a wide array of information. As such, the data is captured using various data point types. The following diagram compares different kinds of data points that can be captured within a metric:
Each data point type can be used in different scenarios and has slightly different meanings. It's worth noting that even though competing standards provide support for types using the same name, their definition may vary. For example, a counter in StatsD (https://github.com/statsd/statsd/blob/master/docs/metric_types.md#counting) resets every time the value has been flushed, whereas, in Prometheus (https://prometheus.io/docs/concepts/metric_types/#counter), it keeps its cumulative value until the process recording the counter is restarted. The following definitions describe how data point types are represented in the OpenTelemetry specification:
Important Note
A cumulative sum will reset when an application restarts. This is useful to identify an event in the application but may be surprising if it's not accounted for.
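Consumers of cumulative sums typically account for such resets by watching for the value to drop. The following is a simplified sketch of that logic, similar in spirit to how Prometheus computes increases from a counter, though its real implementation also extrapolates over time windows:

```python
def increases(samples):
    """Per-interval increases from cumulative samples; a drop in the
    value is treated as a counter reset (for example, an application
    restart)."""
    deltas = []
    previous = None
    for value in samples:
        if previous is None:
            deltas.append(0)          # nothing to compare against yet
        elif value >= previous:
            deltas.append(value - previous)
        else:
            # Reset detected: the counter started again from zero, so
            # the whole new value counts as the increase since the reset.
            deltas.append(value)
        previous = value
    return deltas

# Cumulative samples with a restart between the 90 and 12 readings.
print(increases([10, 40, 90, 12, 30]))  # [0, 30, 50, 12, 18]
```

Without this reset handling, the drop from 90 to 12 would naively produce a negative increase of -78, which is exactly the kind of surprise the note above warns about.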
The following diagram shows an example of a sum counter reporting the number of visits over a period of time. The table on the right-hand side shows what values are to be expected depending on the type of temporal aggregation chosen:
A sum data point also includes the time window for calculating the sum.
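The distinction between delta and cumulative reporting can be sketched with two toy counters. These are illustrative classes, not actual StatsD or Prometheus client code:

```python
class StatsdStyleCounter:
    """Delta temporality: the value resets to zero on every flush."""
    def __init__(self):
        self.value = 0

    def add(self, n=1):
        self.value += n

    def flush(self):
        # Report the change since the last flush, then start over.
        delta, self.value = self.value, 0
        return delta

class PrometheusStyleCounter:
    """Cumulative temporality: the value only grows until restart."""
    def __init__(self):
        self.value = 0

    def add(self, n=1):
        self.value += n

    def collect(self):
        # Reporting does not reset the running total.
        return self.value

d, c = StatsdStyleCounter(), PrometheusStyleCounter()
for _ in range(3):
    d.add()
    c.add()
print(d.flush(), d.flush())      # 3 0: the delta counter resets on flush
print(c.collect(), c.collect())  # 3 3: the cumulative value persists
```

The same three increments yield very different reported values depending on the temporality chosen, which is why the time window matters when interpreting a sum.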
Unlike gauge definitions in other specifications, a gauge in OpenTelemetry is never incremented or decremented; it is only ever set to the value being recorded. A timestamp of the observation time must be included with the data point.
Like sums, histograms also support a delta or a cumulative aggregation and must contain a time window for the recorded observation. Note that in the case of cumulative aggregation, the data points captured in the distribution will continue to accumulate with each recording.
A summary is somewhat similar to a histogram: where a histogram contains maximum and minimum values, a summary includes a 1.0-quantile and a 0.0-quantile to represent the same information. The 0.5-quantile, also known as the median, is often included in a summary as well. For a summary data point, the quantile calculations happen in the producer of the telemetry, which can become expensive for applications. OpenTelemetry supports summaries to provide interoperability with OpenMetrics (https://openmetrics.io) and Prometheus, but prefers histograms, which move the calculation of quantiles to the receiver of the telemetry. The following screenshot shows histogram values recorded by the inventory service for the http_request_duration_milliseconds_bucket metric stored in Prometheus. The data shown represents requests grouped into buckets. Each bucket represents the request duration in milliseconds:
The count of requests per bucket can then be used to calculate quantiles for further analysis. Now that we're familiar with the different types of metric data points, let's see how metrics can be combined with tracing to provide additional insights.
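As a sketch of that quantile calculation, the following estimates a quantile from cumulative bucket counts using linear interpolation, similar in spirit to Prometheus's histogram_quantile function. The bucket bounds and counts here are made up for illustration:

```python
def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs in
    ascending bound order, mirroring a Prometheus _bucket series.
    The result is linearly interpolated inside the target bucket.
    """
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if rank <= count:
            # The target rank falls inside this bucket; interpolate.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound

# Hypothetical request-duration buckets (milliseconds).
buckets = [(5, 10), (10, 55), (25, 90), (50, 100)]
print(estimate_quantile(0.5, buckets))  # the median falls in the 5-10 ms bucket
```

Because only bucket counts cross the wire, this calculation can happen entirely in the backend, which is the advantage histograms have over summaries.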
Metrics are often helpful on their own, but when correlated with tracing information, they provide much more context and depth on the events occurring in a system. Exemplars offer a tool to accomplish this in OpenTelemetry by enabling a metric to contain information about an active span. Data points defined in OpenTelemetry include an exemplar field as part of their definition. This field contains the following:
The direct correlation that exemplars provide replaces the guesswork involved today in stitching together metrics and traces via timestamps. Although exemplars are already defined in the stable metrics section of the OpenTelemetry protocol, their implementation is still under active development at the time of writing.
A concern that often arises with any telemetry is the importance of managing cardinality. Cardinality refers to the uniqueness of a value in a set. When counting cars in a parking lot, the number of wheels offers little value and a low-cardinality result, as most cars have four wheels. The color, make, and model of the cars produce higher cardinality. The license plate, or vehicle identification number, results in the highest cardinality, providing the most valuable data in an event concerning a specific vehicle. For example, if the lights have been left on and the owner should be notified, calling out for the person with a four-wheeled car won't work nearly as well as calling out a specific license plate. However, the count of cars with a specific license plate will always be one, making the counter itself somewhat useless.
One of the challenges with high-cardinality data is the increased storage cost. Specifically, in the case of metrics, adding a single attribute or label can significantly increase the number of metrics being produced and stored. Suppose an application creating a counter for each request processed uses a unique identifier as the metric's name. In that case, the producer or receiver may translate this into a unique time series for each request, resulting in a sudden and unexpected increase in load on the system. This is sometimes referred to as a cardinality explosion.
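The effect is easy to demonstrate: each distinct combination of metric name and attribute values becomes its own time series. The following toy model of a metrics backend's storage shows how a bounded attribute keeps the series count small while a unique identifier explodes it:

```python
from collections import defaultdict

# Toy storage: each unique (name, attributes) pair is one time series.
series = defaultdict(int)

def record(name, attributes):
    key = (name, tuple(sorted(attributes.items())))
    series[key] += 1

# Bounded attribute values: the series count stays small.
for code in [200, 200, 404, 200, 500]:
    record("request_counter", {"http.status_code": code})
print(len(series))  # 3 series: one per distinct status code

# A unique identifier per request creates one new series per request.
for request_id in range(1000):
    record("request_counter", {"request.id": request_id})
print(len(series))  # 1003 series: a cardinality explosion
```

Five requests with a bounded label produced three series; a thousand requests with a unique label produced a thousand more.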
When choosing attributes associated with produced metrics, it's essential to consider the scale of the services and infrastructure producing the telemetry. Some questions to keep in mind are as follows:
Using labels with finite and knowable values (for example, countries rather than street names) may be preferable depending on how the data is stored. When choosing a solution, understanding the storage model and limits of the telemetry backend must also be considered.
Although logs have evolved, what constitutes a log remains quite broad. Also known as log files, a log is a record of events written to output. Traditionally, logs would be written to a file on disk and searched through as needed. A more recent practice is to emit logs over the network to remote services. This provides long-term storage for the data in a central location and improves searchability and aggregation.
Many applications define their own log formats, although several standard formats exist; one example is the Common Log Format often used by web servers. It's challenging to identify commonalities across formats, but at the very least, a log should consist of the following:
This message can take many forms and include various application-specific information. In the case of structured logging, the log is formatted as a series of key-value pairs to simplify identifying the different fields contained within it. Other formats instead record fields in a specific order with a separating character. The following is an example log emitted by the standard formatter in Flask, a Python web framework:
172.20.0.9 - - [11/Oct/2021 18:50:25] "GET /inventory HTTP/1.1" 200 -
The previous sample is an example of the Common Log Format mentioned earlier. The same log may look something like this as a structured log encoded as JSON:
{
"host": "172.20.0.9",
"date": "11/Oct/2021 18:50:25",
"method": "GET",
"path": "/inventory",
"protocol": "HTTP/1.1",
"status": 200
}
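A structured log like the previous example can be produced with Python's standard logging module and a small custom formatter. This is a generic sketch, not what Flask or the demo application actually uses:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""
    def format(self, record):
        payload = {
            "severity": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Any extra fields passed by the caller become top-level keys.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("inventory")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("request handled",
            extra={"fields": {"method": "GET", "path": "/inventory",
                              "status": 200}})
```

Running this emits a single JSON line per event, which log backends such as Loki can then parse into searchable fields.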
As you can see, with structured logs, identifying the information is more intuitive even if you're not already familiar with the type of logs produced. Let's see what logs our demo application produces by looking at the Grafana interface, at http://localhost:3000/explore.
This brings us to the explore view, which allows us to search through telemetry generated by the demo application. Ensure that Loki is selected from the data source drop-down in the top left corner. Filter the logs using the {job="shopper"} query to retrieve all the logs generated by the shopper application. The following screenshot shows a log emitted to the Loki backend, which contains the following:
Now that we can search for logs, let's see how we can combine the information provided by logs with other signals via correlation to give us more context.
In the same way that information provided by metrics can be augmented by combining them with other signals, logs too can provide more context by embedding tracing information. As we'll see in Chapter 6, Logging - Capturing Events, one of the goals of the logging signal in OpenTelemetry is to provide correlation capability to already existing logging libraries. Logs recorded via OpenTelemetry contain the trace ID and span ID for any span active at the time of the event. The following screenshot shows the details of a log record containing the traceID and spanID attributes:
Using these attributes can then reveal the specific request that triggered this event. The following screenshot demonstrates what the corresponding trace looks like in Jaeger. If you'd like to try for yourself, copy the traceID attribute into the Lookup by Trace ID field to search for the trace:
The correlation demonstrated in the previous example makes exploring events faster and less error-prone. As we will see in Chapter 6, Logging - Capturing Events, the OpenTelemetry specification provides recommendations for what information should be included in logs being emitted. It also provides guidelines for how existing formats can map their values with OpenTelemetry.
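One common way to wire up this kind of correlation is a logging filter that stamps each record with the active trace and span IDs. The sketch below stubs the IDs with fixed values; in real code, they would come from the active OpenTelemetry span context rather than a module-level dictionary:

```python
import logging

# Stubbed context for illustration: real code would read these
# identifiers from the currently active span.
CURRENT_SPAN = {
    "trace_id": "0af7651916cd43dd8448eb211c80319c",
    "span_id": "b7ad6b7169203331",
}

class TraceContextFilter(logging.Filter):
    """Attach the current trace and span IDs to every log record."""
    def filter(self, record):
        record.trace_id = CURRENT_SPAN["trace_id"]
        record.span_id = CURRENT_SPAN["span_id"]
        return True  # never drop the record, only enrich it

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    "%(levelname)s %(message)s traceID=%(trace_id)s spanID=%(span_id)s"))
logger = logging.getLogger("shopper")
logger.addFilter(TraceContextFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info('message="add orange to cart"')
```

Every record this logger emits now carries the identifiers needed to jump from a log line straight to the corresponding trace in Jaeger.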
The free form of traditional logs makes them incredibly convenient to use without considering their structure. If you want to add any data to the logs, just call a function and print anything you'd like; it'll be great. However, this can pose some challenges. One of them is the risk of leaking potentially private information into the logs and transmitting it to a centralized logging platform. This problem applies to all telemetry, but it's particularly easy to do with logs. This is especially true when logs contain debugging information, which may include data structures with password fields or private keys. It's good to review any logging calls in the code to ensure the logged data does not contain information that should not be logged.
Logs can also be overly verbose, which can cause unexpected volumes of data to be generated. This may make sifting through the logs for useful information difficult, if not impossible, depending on the size of the environment. It can also lead to unanticipated costs when using centralized logging platforms. Some libraries or frameworks generate a great deal of debugging information. Ensuring the correct severity level is configured goes a long way towards addressing this concern. However, it's hard to predict upfront just how much data will be needed. On more than one occasion, I've responded to alerts in the middle of the night, wishing for a more verbose log level to be configured.
High-quality telemetry allows the data consumer to find answers to questions when needed. Sometimes critical operations lack instrumentation, causing blind spots in the observability of a system. Other times, the processes are instrumented, but the data is not rich enough to be helpful. The OpenTelemetry project attempts to solve this through semantic conventions defined in the specification. These conventions cover the following:
These semantic conventions help ensure that the data generated when following the OpenTelemetry specification is consistent. This simplifies the work of folks instrumenting applications or libraries by providing guidelines for what should be instrumented and how. It also means that anyone analyzing telemetry produced by standard-compliant code can understand the meaning of the data by referencing the specification for additional information.
Following semantic convention recommendations from a specification written in a Markdown document can be challenging when writing code. Thankfully, OpenTelemetry also provides some tools to help.
Semantic conventions are great, but it makes sense to turn the recommendations into code to make it practical for developers to use them. The OpenTelemetry specification repository provides a folder that contains the semantic conventions described as YAML for this specific reason (https://github.com/open-telemetry/opentelemetry-specification/tree/main/semantic_conventions). These are combined with the semantic conventions generator (https://github.com/open-telemetry/build-tools/blob/v0.7.0/semantic-conventions/) to produce code in various languages. This code is shipped as independent libraries in some languages, helping guide developers. We will repeatedly rely upon the semantic conventions package in Python in further chapters as we instrument application code.
A challenge of semantic conventions is that as telemetry and observability evolve, so will the terminology used to describe events that we want to observe. An example of this happened when the db.hbase.namespace and db.cassandra.keyspace keys were renamed to use db.name instead. Such a change would cause problems for anyone already using this field as part of their analysis, or even alerting. To ensure the semantic conventions can evolve as needed while remaining backward-compatible with existing instrumentation, the OpenTelemetry community introduced the schema URL.
Important Note
The OpenTelemetry community understands the importance of backward compatibility in instrumentation code. Going back and re-instrumenting an application because of a new version of a telemetry library is a pain. As such, a significant amount of effort has gone into ensuring that components defined in OpenTelemetry remain interoperable with previous versions. The project defines its versioning and stability guarantees as part of the specification (https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/versioning-and-stability.md).
The schema URL is a field added to the telemetry generated for logs, metrics, resources, and traces tying the emitted telemetry to a version of the semantic conventions. This field allows the producers and consumers of telemetry to understand how to interpret the data. The schema also provides instructions for converting data from one version to another, as per the following example:
1.8.0 schema
file_format: 1.0.0
schema_url: https://opentelemetry.io/schemas/1.8.0
versions:
  1.8.0:
    spans:
      changes:
        - rename_attributes:
            attribute_map:
              db.cassandra.keyspace: db.name
              db.hbase.namespace: db.name
  1.7.0:
  1.6.1:
Continuing with the previous example, imagine a producer of Cassandra telemetry is emitting db.cassandra.keyspace as the name for a Cassandra database and specifying the schema as 1.7.0. It sends the data to a backend that implements schema 1.8.0. By reading the schema URL and implementing the appropriate translation, the backend can produce telemetry in its expected version, which is powerful! Schemas decouple systems involved in telemetry, providing them with the flexibility to evolve independently.
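In essence, the backend's translation step amounts to applying the rename map from the schema file to each span's attributes. A minimal sketch of that transformation:

```python
def apply_schema_changes(attributes, rename_map):
    """Translate span attributes from one schema version to the next
    by applying a rename_attributes mapping from a schema file."""
    return {rename_map.get(key, key): value
            for key, value in attributes.items()}

# The 1.7.0 -> 1.8.0 renames from the schema shown above.
RENAMES_1_8_0 = {
    "db.cassandra.keyspace": "db.name",
    "db.hbase.namespace": "db.name",
}

old = {"db.cassandra.keyspace": "inventory", "db.system": "cassandra"}
print(apply_schema_changes(old, RENAMES_1_8_0))
# {'db.name': 'inventory', 'db.system': 'cassandra'}
```

Keys not mentioned in the map pass through untouched, so a backend only needs the chain of schema files between the producer's version and its own to upgrade any incoming telemetry.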
This chapter allowed us to learn or review some concepts that will assist us when instrumenting applications using OpenTelemetry. We looked at the building blocks of distributed tracing, which will come in handy when we go through instrumenting our first application with OpenTelemetry in Chapter 4, Distributed Tracing – Tracing Code Execution. We also started analyzing tracing data using tools that developers and operators make use of every day.
We then switched to the metrics signal; first, looking at the minimal contents of a metric, then comparing different data types commonly used to produce metrics and their structures. Discussing exemplars gave us a brief introduction to how correlating metrics with traces can create a more complete picture of what is happening within a system by combining telemetry across signals.
Looking at log formats and searching through logs to find information about the demo application allowed us to get familiar with yet another tool available in the observability practitioner's toolbelt.
Lastly, by leveraging semantic conventions defined in OpenTelemetry, we can begin to produce consistent, high-quality data. Following these conventions spares producers of telemetry the painful task of naming things, which everyone in the software industry agrees is hard. Additionally, these conventions remove the guesswork when interpreting the data.
Knowing the theory and concepts behind instrumentation and telemetry gives us the tools to do all the instrumentation work ourselves. Still, what if I were to tell you it may not be necessary to manually instrument every call in every library? The next chapter will cover how auto-instrumentation looks to help developers in their quest for better visibility into their systems.