Chapter 2: OpenTelemetry Signals – Traces, Metrics, and Logs

Learning how to instrument an application for the first time can be a daunting task. There's a fair amount of terminology to understand before jumping into the code. I always find that seeing the finish line helps me get motivated and stay on track. This chapter's goal is to see what telemetry generated by OpenTelemetry looks like in practice while learning about the theory. In this chapter, we will dive into the specifics of the following:

  • Distributed tracing
  • Metrics
  • Logs
  • Producing consistent quality data with semantic conventions

To help us get a more practical sense of the terminology and get comfortable with telemetry, we will look at the data using various open source tools that can help us to query and visualize telemetry.

Technical requirements

This chapter will use a grocery store application that is already instrumented with OpenTelemetry, along with several backends, to walk through the different concepts of the signals. The environment we will be launching relies on Docker Compose. The first step is to install Docker by following the installation instructions at https://docs.docker.com/get-docker/. Ensure Docker is running on your local system by using the following command:

$ docker version

Client:
 Cloud integration: 1.0.14
 Version:           20.10.6
 API version:       1.41
 Go version:        go1.16.3 ...

Next, let's ensure Compose is also installed by running the following command:

$ docker compose version

Docker Compose version 2.0.0-beta.1

Important Note

Compose was added to the Docker client in more recent client versions. If the previous command returns an error, follow the instructions on the Docker website (https://docs.docker.com/compose/install/) to install Compose. Alternatively, you may want to try the docker-compose command to see if you already have an older version installed.

The following diagram shows an overview of the containers we are launching in the Docker environment to give you an idea of the components involved. The applications on the left are emitting telemetry processed by the Collector and forwarded to the telemetry backends. The diagram also shows the port number exposed by each container for future reference.

Figure 2.1 – Containers within Docker environment

This chapter briefly introduces the following open source projects that support the storage and visualization of OpenTelemetry data:

  • Jaeger for distributed traces
  • Prometheus for metrics
  • Loki for logs
  • Grafana for visualizing the logs stored in Loki

I strongly recommend visiting the website for each project to gain familiarity with the tools as we will use them throughout the chapter. Each of these tools will be revisited in Chapter 10, Configuring Backends. No prior knowledge of them is required to go through the examples, but they are pretty helpful to have in your toolbelt. The configuration files necessary to launch the applications in this chapter are available in the companion repository (https://github.com/PacktPublishing/Cloud-Native-Observability) in the chapter2 directory. The following downloads the repository using the git command:

$ git clone https://github.com/PacktPublishing/Cloud-Native-Observability

$ cd Cloud-Native-Observability/chapter02

To bring up the applications and telemetry backends, run the following command:

$ docker compose up

We will test the various tools to ensure each one is working as expected and is accessible from your browser. Let's start with Jaeger by accessing the following URL: http://localhost:16686. The following screenshot shows the interface you should see:

Figure 2.2 – The Jaeger web interface

The next backend this chapter will use for metrics is Prometheus; let's test the application by visiting http://localhost:9090. The following screenshot is a preview of the Prometheus web interface:

Figure 2.3 – The Prometheus web interface

The last tool we need to ensure is working in our backend for logs is Loki. We will use Grafana as a dashboard to visualize the logs being emitted. Begin by visiting http://localhost:3000/explore to ensure Grafana is up; you should be greeted by an interface like the one in Figure 2.4:

Figure 2.4 – The Grafana web interface

The next application we will check is the OpenTelemetry Collector, which acts as the routing layer for all the telemetry produced by the example application. The Collector exposes a health check endpoint discussed in Chapter 8, OpenTelemetry Collector. For now, it's enough to know that accessing the endpoint will give us information about the health of the Collector, using the following curl command:

$ curl localhost:13133

{"status":"Server available","upSince":"2021-10-03T15:42:02.7345149Z","uptime":"9.3414709s"}

Lastly, let's ensure the containers forming the grocery store demo application are running. To do this, we use curl again in the following commands to access an endpoint in the applications that returns a status showing the application's health. It's possible to use any other tool capable of making HTTP requests, including the browser, to accomplish this. The following checks the status of the grocery store:

$ curl localhost:5000/healthcheck

{
  "service": "grocery-store",
  "status": "ok"
}

The same command can be used to check the status of the inventory application by specifying port 5001:

$ curl localhost:5001/healthcheck

{
  "service": "inventory",
  "status": "ok"
}

The shopper application represents a client application and does not provide any endpoint to expose its health status. Instead, we can look at the logs emitted by the application to get a sense of whether it's doing the right thing or not. The following uses the docker logs command to look at the output from the application. Although it may vary slightly, the output should contain information about the shopper connecting to the grocery store:

$ docker logs -n 2 shopper

DEBUG:urllib3.connectionpool:http://grocery-store:5000 "GET /products HTTP/1.1" 200 107
INFO:shopper:message="add orange to cart"

The same docker logs command can be used on any of the other containers if you're interested in seeing more information about them. Once you're done with the chapter, you can clean up all the containers by running stop to terminate the running containers, and rm to delete the containers themselves:

$ docker compose stop

$ docker compose rm

All the examples in this chapter will expect that the Docker Compose environment is already up and running. When in doubt, come back to this technical requirement section to ensure your environment is still running as expected. Now, let's see what these OpenTelemetry signals are all about, starting with traces.

Traces

Distributed tracing is the foundation behind the tracing signal of OpenTelemetry. A distributed trace is a series of event data generated at various points throughout a system tied together via a unique identifier. This identifier is propagated across all components responsible for any operation required to complete the request, allowing each operation to associate the event data to the originating request. The following diagram gives us a simplified example of what a single request may look like when ordering groceries through an app:

Figure 2.5 – Example request through a simplified ordering system

Each trace represents a unique request through a system that can be either synchronous or asynchronous. Synchronous requests occur in sequence with each unit of work completed before continuing. An example of a synchronous request may be of a client application making a call to a server and waiting or blocking until a response is returned before proceeding. In contrast, asynchronous requests can initiate a series of operations that can occur simultaneously and independently. An example of an asynchronous request is a server application submitting messages to a queue or a process that batches operations. Each operation recorded in a trace is represented by a span, a single unit of work done in the system. Let's see what the specifics of the data captured in the trace look like.

Anatomy of a trace

The definition of what constitutes a trace has evolved as various systems have been developed to support distributed tracing. The World Wide Web Consortium (W3C), an international group that collaborates to move the web forward, assembled a working group in 2017 to produce a definition for tracing. In February 2020, the first version of the Trace Context specification was completed, with its details available on the W3C's website (https://www.w3.org/TR/trace-context-1/). OpenTelemetry follows the recommendation from the W3C in its definition of the SpanContext, which contains information about the trace and must be propagated throughout the system. The elements of a trace available within a span context include the following:

  • A unique identifier, referred to as a trace ID, identifies the request through the system.
  • A second identifier, the span ID, is associated with the span that last interacted with the context. This may also be referred to as the parent identifier.
  • Trace flags include additional information about the trace, such as the sampling decision and trace level.
  • Vendor-specific information is carried forward using a Trace state field. This allows individual vendors to propagate information necessary for their systems to interpret the tracing data. For example, if a vendor needs an additional identifier to be present in the trace information, this identifier could be inserted as vendorA=123456 in the trace state field. Other vendors would add their own as needed, allowing traces to be shared across vendors.
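
To make these fields concrete, here is a minimal sketch, using the OpenTelemetry Python API, that reads the span context of whichever span is currently active. When propagated over HTTP following the W3C recommendation, the same information travels in a traceparent header such as 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01 (version, trace ID, span ID, and flags).

from opentelemetry import trace

# If no span is active, an invalid (all-zero) context is returned.
ctx = trace.get_current_span().get_span_context()
print(f"trace ID:    {format(ctx.trace_id, '032x')}")  # 128-bit identifier
print(f"span ID:     {format(ctx.span_id, '016x')}")   # 64-bit identifier
print(f"trace flags: {ctx.trace_flags}")               # includes the sampling decision
print(f"trace state: {ctx.trace_state}")               # vendor-specific entries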

A span can represent a method call or a subset of the code being called within a method. Multiple spans within a trace are linked together in a parent-child relationship, with each child span containing information about its parent. The first span in a trace is called the root span and is identified because it does not have a parent span identifier. The following shows a typical visualization of a trace and the spans associated with it. The horizontal axis indicates the duration of the entire trace operation. The vertical axis shows the order in which the operations captured by spans took place, starting with the first operation at the top:

Figure 2.6 – Visual representation of a trace
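
In code, this parent-child relationship typically comes from nesting spans. The following is a minimal sketch using the OpenTelemetry Python API; the tracer and span names are purely illustrative.

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# The outermost span has no parent and becomes the root span of the trace.
with tracer.start_as_current_span("process order"):
    # A span started while another is active becomes its child and shares
    # the same trace ID.
    with tracer.start_as_current_span("check inventory"):
        pass  # the unit of work recorded by the child span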

Let's look closer at a trace by bringing up a sample generated from the telemetry produced by the grocery store application. Access the Jaeger web interface by opening a browser to the following URL: http://localhost:16686/.

Search for a trace by selecting a service from the drop-down and clicking the Find Traces button. The following screenshot shows the traces found for the shopper service:

Figure 2.7 – Traces search result

To obtain details about a specific trace, select one of the search results by clicking on the row. The following screenshot, Figure 2.8, shows the details of the trace generated by a request through the grocery store applications. It includes the following:

  1. The unique trace ID for this request. In OpenTelemetry, this is represented by a 128-bit integer; it's worth noting that other systems may represent it as a 64-bit integer. In many systems, the integer is encoded as a string of hexadecimal characters.
  2. The start time for the request.
  3. The total duration of the request through the system, calculated by subtracting the root span's start time from its end time.
  4. A count of the number of services included in this request.
  5. A count of spans recorded in this request is shown in Total Spans.
  6. A hierarchical view of the spans in the trace.
Figure 2.8 – A trace in Jaeger

The preceding screenshot gives us an immediate sense of where time may be spent as the system processes the request. It also provides us with a glimpse into what the underlying code may look like without ever opening an editor. Additional details are captured in spans; let's look at those now.

Details of a span

As mentioned previously, the work captured in a trace is broken into separate units or operations, each represented by a span. The span is a data structure containing the following information:

  • A unique identifier
  • A parent span identifier
  • A name describing the work being recorded
  • A start and end time

In OpenTelemetry, a span identifier is represented by a 64-bit integer. The start and end times are used to calculate the operation's duration. Additionally, spans can contain metadata in the form of key-value pairs. Jaeger and Zipkin refer to these pairs as tags, whereas OpenTelemetry calls them attributes. In both cases, the goal is to enrich the data with additional context.
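
As a brief illustration, the following sketch attaches attributes to a span using the OpenTelemetry Python API; the keys and values shown are only examples.

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("/inventory") as span:
    # Attributes are key-value pairs that enrich the recorded operation.
    span.set_attribute("http.method", "GET")
    span.set_attribute("http.status_code", 200)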

Look for the following details in Figure 2.9, which shows the detailed view of a specific span as shown in Jaeger:

  1. The name identifies the operation represented by this span. In this case, /inventory is the operation's name.
  2. SpanID is the unique 64-bit identifier represented in hex-encoded formatting.
  3. Start Time is when the operation started, relative to the start of the request. In the case shown here, the operation started 8.36 milliseconds after the beginning of the request.
  4. Duration is the time it took for the operation to complete and is calculated using the start and end times recorded in the span.
  5. The Service name identifies the application that triggered the operation and recorded the telemetry.
  6. Tags represent additional information about the operation being recorded.
  7. Process shows information about the application or process fulfilling the requested operation.
Figure 2.9 – Span details

Many of the tags captured in the span shown previously rely on semantic conventions, which will be discussed later in this chapter.

Additional considerations

When producing distributed traces in a system, it's worth considering the tradeoffs of the added visibility. Generating tracing information can potentially incur performance overhead at the application level, and it can result in added latency if tracing information is gathered and transmitted inline. There is also memory overhead to consider, as collecting information inevitably allocates resources. These concerns can be largely mitigated using configuration available in OpenTelemetry, as we'll see in Chapter 4, Distributed Tracing – Tracing Code Execution.

Depending on where the data is sent, additional costs, such as bandwidth or storage, can also become a factor. One way to mitigate these costs is to reduce the amount of data produced by sampling only a portion of it. We will dive deeper into sampling in Chapter 12, Sampling.
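
As a preview of one such mitigation, the following sketch configures a ratio-based sampler through the OpenTelemetry Python SDK, keeping roughly 10 percent of traces; the exact configuration used later in the book may differ.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

# Record approximately 1 in 10 traces to reduce volume, bandwidth, and storage.
trace.set_tracer_provider(TracerProvider(sampler=TraceIdRatioBased(0.1)))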

Another challenging aspect of producing distributed tracing data is ensuring that all the services correctly propagate the context. Failing to propagate the trace ID across the system means that requests will be broken into multiple traces, making them difficult to use or not helpful at all.

The last thing to consider is the effort required to instrument an application correctly. This is a non-trivial amount of effort, but as we'll see in future chapters, OpenTelemetry provides instrumentation libraries to make this easier.

Now that we have a deeper understanding of traces, let's look at metrics.

Metrics

Just as distributed traces do, metrics provide information about the state of a running system to developers and operators. The data collected via metrics can be aggregated over time to identify trends and patterns in applications and can be graphed through various tools and visualizations. The term metrics covers a broad range of measurements, from low-level system metrics such as CPU cycles to higher-level details such as the number of blue sweaters sold today. These examples would be helpful to different groups in an organization.

Additionally, metrics are critical to monitoring the health of an application and deciding when an on-call engineer should be alerted. They form the basis of service level indicators (SLIs) (https://en.wikipedia.org/wiki/Service_level_indicator) that measure the performance of an application. These indicators are then used to set service level objectives (SLOs) (https://en.wikipedia.org/wiki/Service-level_objective) that organizations use to calculate error budgets.
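
For example, a service with a 99.9% availability SLO over a 30-day window has an error budget of roughly 43 minutes, since 0.1% of the 43,200 minutes in that window may be spent out of compliance.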

Important Note

SLIs, SLOs, and service level agreements (SLAs) are essential topics in production environments where third-party dependencies can impact the availability of your service. There are entire books dedicated to the issue that we will not cover here. The Google site reliability engineering (SRE) book is a great resource for this: https://sre.google/sre-book/service-level-objectives/.

The metrics signal of OpenTelemetry combines various existing open source formats into a unified data model. Primarily, it looks to OpenMetrics, StatsD, and Prometheus for existing definitions, requirements, and usage, to ensure the use cases of each of those communities are understood and addressed by the new standard.

Anatomy of a metric

Just about anything can be a metric; record a value at a given time, and you have yourself a metric. The common fields a metric contains include the following:

  • A name identifies the metric being recorded.
  • A data point value may be an integer or a floating-point value. Note that in the case of a histogram or a summary, there is more than one value associated with the metric.
  • Additional dimension information about the metric. The representation of these dimensions varies depending on the metrics backend. In Prometheus, these dimensions are represented by labels, whereas in StatsD, it is common to add a prefix in the metric's name. In OpenTelemetry, dimensions are added to metrics via attributes.
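
To connect these fields to code, here is a minimal sketch using the OpenTelemetry Python metrics API to record a counter similar to the request_counter metric emitted by the demo application; the meter name and attribute values are illustrative.

from opentelemetry import metrics

meter = metrics.get_meter(__name__)

# The metric name and the attributes become the name and dimensions of the
# time series stored in the backend.
request_counter = meter.create_counter(
    "request_counter", description="number of requests processed"
)
request_counter.add(1, {"service_name": "store"})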

Let's look at data produced by metrics sent from the demo application. Access the Prometheus interface via a browser and the following URL: http://localhost:9090. The user interface for Prometheus allows us to query the time-series database by using the metric's name. The following screenshot contains a table showing the value of the request_counter metric. Look for the following details in the resulting table:

  1. The name of the metric, in this case, request_counter.
  2. The dimensions recorded for this metric are displayed in curly braces as key-value pairs with the key emboldened. In the example shown, the service_name label caused two different metrics to be recorded, one for the shopper service and another for the store service.
  3. A reported value, in this example, is an integer. This value may be the last received or a calculated current value depending on the metric type.
Figure 2.10 – Table view of metric in Prometheus

The table view shows the current cumulative value. An alternative representation of the recorded metric is shown in the following figure. As the data received by Prometheus is stored over time, a line graph can be generated. Click the Graph tab of the interface to see what the data looks like in a chart:

Figure 2.11 – Graph view of the same metric in Prometheus

By looking at the values for the metric over time, we can deduce additional information about the service, for example, its start time or trends in its usage. Visualizing metrics also provides opportunities to identify anomalies.

Data point types

Metric is a generic term encapsulating different measurements that can be used to represent a wide array of information. As such, the data is captured using various data point types. The following diagram compares the different kinds of data points that can be captured within a metric:

Figure 2.12 – Comparison of counter, gauge, histogram, and summary data points

Each data point type can be used in different scenarios and has slightly different meanings. It's worth noting that even though competing standards provide support for types using the same name, their definition may vary. For example, a counter in StatsD (https://github.com/statsd/statsd/blob/master/docs/metric_types.md#counting) resets every time the value has been flushed, whereas, in Prometheus (https://prometheus.io/docs/concepts/metric_types/#counter), it keeps its cumulative value until the process recording the counter is restarted. The following definitions describe how data point types are represented in the OpenTelemetry specification:

  • A sum measures incremental changes to a recorded value. This incremental change is either monotonic or non-monotonic and must be associated with an aggregation temporality. The temporality can be either of the following:
    1. Delta aggregation: The reported values contain the change in value from its previous recording.
    2. Cumulative aggregation: The value reported includes the previously reported sum in addition to the delta being reported.

    Important Note

    A cumulative sum will reset when an application restarts. This is useful to identify an event in the application but may be surprising if it's not accounted for.

The following diagram shows an example of a sum counter reporting the number of visits over a period of time. The table on the right-hand side shows what values are to be expected depending on the type of temporal aggregation chosen:

Figure 2.13 – Sum showing delta and cumulative aggregation values

A sum data point also includes the time window for calculating the sum.
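
For example, if 5, 3, and 7 visits are recorded in three consecutive reporting intervals, delta aggregation reports 5, 3, and 7, whereas cumulative aggregation reports 5, 8, and 15.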

  • A gauge represents non-monotonic values that only measure the last or current known value at observation. This likely means some information is missing, but it may not be relevant. For example, the following diagram represents temperatures recorded at an hourly interval. More specific data points could provide greater granularity as to the rise and fall of the temperature. These incremental changes in the temperature may not be required if the goal is to observe trends over weeks or months.
Figure 2.14 – Gauge values recorded

Unlike gauge definitions in other specifications, a gauge in OpenTelemetry is never incremented or decremented; it is only ever set to the value being recorded. A timestamp of the observation time must be included with the data point.
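
The following sketch uses the OpenTelemetry Python metrics API to register an observable gauge whose callback reports the last known value at each observation; the temperature reading is a stand-in for any real measurement.

from opentelemetry import metrics
from opentelemetry.metrics import CallbackOptions, Observation

def read_temperature(options: CallbackOptions):
    # Only the value at observation time is reported; changes between
    # observations are not captured.
    yield Observation(21.5, {"room": "kitchen"})

meter = metrics.get_meter(__name__)
meter.create_observable_gauge("temperature", callbacks=[read_temperature])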

  • A histogram data point provides a compressed view of a larger number of data points by grouping the data into a distribution and summarizing it, rather than reporting individual measurements for every value recorded. The following diagram shows sample histogram data points representing a distribution of response durations.
Figure 2.15 – Histogram data points

Like sums, histograms also support a delta or a cumulative aggregation and must contain a time window for the recorded observation. Note that in the case of cumulative aggregation, the data points captured in the distribution will continue to accumulate with each recording.
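
Recording a histogram looks much like recording a counter; the following sketch, again using the OpenTelemetry Python metrics API, records response durations that the backend can group into buckets. The metric name and attribute are illustrative.

from opentelemetry import metrics

meter = metrics.get_meter(__name__)
duration_histogram = meter.create_histogram(
    "http.request.duration", unit="ms"
)
# Each value is added to the distribution rather than stored individually.
duration_histogram.record(7.5, {"http.route": "/inventory"})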

  • The summary data type provides a similar capability to histograms, but it's specifically tailored around providing quantiles of a distribution. A quantile, sometimes also referred to as percentile, is a fraction between zero and one, representing a percentage of the total number of values recorded that falls under a certain threshold. For example, consider the following 10 response times in milliseconds: 1.1, 2.9, 7.5, 8.3, 9, 10, 10, 10, 10, 25. The 0.9-quantile, or the 90th percentile, equals 10 milliseconds.
Figure 2.16 – Summary data points

A summary is somewhat similar to a histogram: where a histogram contains a maximum and a minimum value, a summary includes a 1.0-quantile and a 0.0-quantile to represent the same information. The 0.5-quantile, also known as the median, is often expressed in the summary. For a summary data point, the quantile calculations happen in the producer of the telemetry, which can become expensive for applications. OpenTelemetry supports summaries to provide interoperability with OpenMetrics (https://openmetrics.io) and Prometheus, but prefers the use of histograms, which move the calculation of quantiles to the receiver of the telemetry. The following screenshot shows histogram values recorded by the inventory service for the http_request_duration_milliseconds_bucket metric stored in Prometheus. The data shown represents requests grouped into buckets. Each bucket represents the request duration in milliseconds:

Figure 2.17 – Histogram value in Prometheus

The count of requests per bucket can then be used to calculate quantiles for further analysis, for example, via the histogram_quantile function in Prometheus. Now that we're familiar with the different types of metric data points, let's see how metrics can be combined with tracing to provide additional insights.

Exemplars

Metrics are often helpful on their own, but when correlated with tracing information, they provide much more context and depth on the events occurring in a system. Exemplars offer a tool to accomplish this in OpenTelemetry by enabling a metric to contain information about an active span. Data points defined in OpenTelemetry include an exemplar field as part of their definition. This field contains the following:

  • A trace ID of the current span in progress
  • The span ID of the current span in progress
  • A timestamp of the event measured
  • A set of attributes associated with the exemplar
  • The value being recorded

The direct correlation that exemplars provide replaces the guesswork involved today in piecing together metrics and traces via timestamps. Although exemplars are already defined in the stable metrics section of the OpenTelemetry protocol, the implementation of exemplars is still under active development at the time of writing.

Additional considerations

A concern that often arises with any telemetry is the importance of managing cardinality. Cardinality refers to the uniqueness of a value in a set. When counting cars in a parking lot, the number of wheels offers little value and low cardinality, as most cars have four wheels. The color, make, and model of the cars produce higher cardinality. The license plate, or vehicle identification number, results in the highest cardinality, providing the most valuable data in an event concerning a specific vehicle. For example, if the lights have been left on and the owners should be notified, calling out for the person with a four-wheeled car won't work nearly as well as calling for a specific license plate. However, the count of cars with a specific license plate will always be one, making the counter itself somewhat useless.

One of the challenges with high-cardinality data is the increased storage cost. Specifically, in the case of metrics, adding a single attribute or label can significantly increase the number of metrics being produced and stored. Suppose an application creating a counter for each request processed uses a unique identifier as the metric's name; the producer or receiver may then translate this into a unique time series for each request, resulting in a sudden and unexpected increase in load on the system. This is sometimes referred to as cardinality explosion.

When choosing attributes associated with produced metrics, it's essential to consider the scale of the services and infrastructure producing the telemetry. Some questions to keep in mind are as follows:

  • Will scaling components of the system increase the number of metrics in a way that is understood? When a system scales, the last thing anyone wants is for an unexpected spike in metrics to cause outages.
  • Are any attributes specific to individual instances of an application? This could cause problems if an application is repeatedly crashing and restarting, as each new instance can introduce new time series.

Using labels with finite and knowable values (for example, countries rather than street names) may be preferable depending on how the data is stored. When choosing a solution, understanding the storage model and limits of the telemetry backend must also be considered.
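
As a contrived sketch of this tradeoff, the first call below keeps cardinality bounded, while the second creates a new time series per customer; the attribute names are hypothetical.

from opentelemetry import metrics

meter = metrics.get_meter(__name__)
request_counter = meter.create_counter("request_counter")

# Bounded: the set of countries is finite and knowable.
request_counter.add(1, {"country": "CA"})

# Unbounded: a unique identifier per customer multiplies the number of time
# series and risks a cardinality explosion.
request_counter.add(1, {"customer_id": "a1b2c3d4"})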

Logs

Although logs have evolved, what constitutes a log remains quite broad. Also known as log files, a log is a record of events written to an output. Traditionally, logs were written to a file on disk and searched through as needed. A more recent practice is to emit logs to remote services over the network. This provides long-term storage for the data in a central location and improves searchability and aggregation.

Anatomy of a log

Many applications define their own formats for what constitutes a log, although several standard formats exist, such as the Common Log Format often used by web servers. It's challenging to identify commonalities across formats, but at the very least, a log should consist of the following:

  • A timestamp recording the time of the event
  • The message or payload representing the event

This message can take many forms and include various application-specific information. In the case of structured logging, the log is formatted as a series of key-value pairs to simplify identifying the different fields contained within the log. Other formats instead record logs in a specific order with a separating character. The following shows an example log emitted by the standard formatter in Flask, a Python web framework. The log contains the following:

  • A timestamp is enclosed in square brackets.
  • A space-delimited set of elements forms the message logged, including the client IP, the HTTP method used to make a request, the request's path, the protocol version, and the response code:

172.20.0.9 - - [11/Oct/2021 18:50:25] "GET /inventory HTTP/1.1" 200 -

The previous sample is an example of the Common Log Format mentioned earlier. The same log may look something like this as a structured log encoded as JSON:

{
    "host": "172.20.0.9",
    "date": "11/Oct/2021 18:50:25",
    "method": "GET",
    "path": "/inventory",
    "protocol": "HTTP/1.1",
    "status": 200
}

As you can see with structured logs, identifying the information is more intuitive, even if you're not already familiar with the type of logs produced. Let's see what logs our demo application produces by looking at the Grafana interface, at http://localhost:3000/explore.

This brings us to the explore view, which allows us to search through telemetry generated by the demo application. Ensure that Loki is selected from the data source drop-down in the top left corner. Filter the logs using the {job="shopper"} query to retrieve all the logs generated by the shopper application. The following screenshot shows a log emitted to the Loki backend, which contains the following:

  1. The name of the application is under the job label.
  2. A timestamp of the log is shown both as a timestamp and as a nanosecond value.
  3. The body of the logged event.
  4. Additional labels and values associated with the event.
Figure 2.18 – Log shown in Grafana

Now that we can search for logs, let's see how we can combine the information provided by logs with other signals via correlation to give us more context.

Correlating logs

In the same way that information provided by metrics can be augmented by combining them with other signals, logs too can provide more context by embedding tracing information. As we'll see in Chapter 6, Logging - Capturing Events, one of the goals of the logging signal in OpenTelemetry is to provide correlation capability to already existing logging libraries. Logs recorded via OpenTelemetry contain the trace ID and span ID for any span active at the time of the event. The following screenshot shows the details of a log record containing the traceID and spanID attributes:

Figure 2.19 – Log containing trace ID

These attributes can then be used to find the specific request that triggered this event. The following screenshot demonstrates what the corresponding trace looks like in Jaeger. If you'd like to try for yourself, copy the traceID attribute into the Lookup by Trace ID field to search for the trace:

Figure 2.20 – Corresponding trace in Jaeger

The correlation demonstrated in the previous example makes exploring events faster and less error-prone. As we will see in Chapter 6, Logging - Capturing Events, the OpenTelemetry specification provides recommendations for what information should be included in logs being emitted. It also provides guidelines for how existing formats can map their values with OpenTelemetry.
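
A rough sketch of how this correlation can be achieved by hand with the Python logging module and the OpenTelemetry API follows; the message format is illustrative, and instrumentation libraries can inject these fields automatically, as we'll see in Chapter 6.

import logging

from opentelemetry import trace

logger = logging.getLogger("shopper")
tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("add item to cart"):
    ctx = trace.get_current_span().get_span_context()
    # Embed the identifiers of the active span so the log record can be
    # matched to its trace in the backend.
    logger.info(
        "add orange to cart traceID=%s spanID=%s",
        format(ctx.trace_id, "032x"),
        format(ctx.span_id, "016x"),
    )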

Additional considerations

The free form of traditional logs makes them incredibly convenient to use without considering their structure. If you want to add any data to the logs, just call a function and print anything you'd like; it'll be great. However, this can pose some challenges. One of these challenges is the risk of leaking potentially private information into the logs and transmitting it to a centralized logging platform. This problem applies to all telemetry, but it's particularly easy to run into with logs, especially when logs contain debugging information, which may include data structures containing password fields or private keys. It's good to review any logging calls in the code to ensure the logged data does not contain information that should not be logged.

Logs can also be overly verbose, which can cause unexpected volumes of data to be generated. This may make sifting through the logs for useful information difficult, if not impossible, depending on the size of the environment. It can also lead to unanticipated costs when using centralized logging platforms. Some libraries or frameworks generate a lot of debugging information. Ensuring the correct severity level is configured goes a long way towards addressing this concern. However, it's hard to predict just how much data will be needed upfront. On more than one occasion, I've responded to alerts in the middle of the night, wishing for a more verbose log level to be configured.

Semantic conventions

High-quality telemetry allows the data consumer to find answers to questions when needed. Sometimes critical operations can lack instrumentation, causing blind spots in the observability of a system. Other times, the operations are instrumented, but the data is not rich enough to be helpful. The OpenTelemetry project attempts to solve this through semantic conventions defined in the specification. These conventions cover the following:

  • Attributes that should be present for traces, metrics, and logs.
  • Resource attribute definitions for various types of workloads, including hosts, containers, and functions. The resource attributes described by the specification also include characteristics specific to multiple popular cloud platforms.
  • Recommendations for what telemetry should be emitted by components participating in various scenarios such as messaging systems, client-server applications, and database interactions.

These semantic conventions help ensure that the data generated when following the OpenTelemetry specification is consistent. This simplifies the work of folks instrumenting applications or libraries by providing guidelines for what should be instrumented and how. It also means that anyone analyzing telemetry produced by standard-compliant code can understand the meaning of the data by referencing the specification for additional information.

Following semantic conventions recommendations from a specification in a Markdown document can be challenging when writing code. Thankfully, OpenTelemetry also provides some tools to help.

Adopting semantic conventions

Semantic conventions are great, but it makes sense to turn the recommendations into code to make it practical for developers to use them. The OpenTelemetry specification repository provides a folder that contains the semantic conventions described as YAML for this specific reason (https://github.com/open-telemetry/opentelemetry-specification/tree/main/semantic_conventions). These are combined with the semantic conventions generator (https://github.com/open-telemetry/build-tools/blob/v0.7.0/semantic-conventions/) to produce code in various languages. This code is shipped as independent libraries in some languages, helping guide developers. We will repeatedly rely upon the semantic conventions package in Python in further chapters as we instrument application code.
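
As a small preview, the following sketch uses constants from the Python semantic conventions package instead of hand-typed attribute keys; the exact module path and constant names may vary between releases of the package.

from opentelemetry import trace
from opentelemetry.semconv.trace import SpanAttributes

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("/products") as span:
    # The generated constants avoid typos and keep attribute names aligned
    # with the specification.
    span.set_attribute(SpanAttributes.HTTP_METHOD, "GET")
    span.set_attribute(SpanAttributes.HTTP_STATUS_CODE, 200)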

Schema URL

A challenge of semantic conventions is that as telemetry and observability evolve, so will the terminology used to describe events that we want to observe. An example of this happened when the db.hbase.namespace and db.cassandra.keyspace keys were renamed to use db.name instead. Such a change would cause problems for anyone already using this field as part of their analysis, or even alerting. To ensure the semantic conventions can evolve as needed while remaining backward-compatible with existing instrumentation, the OpenTelemetry community introduced the schema URL.

Important Note

The OpenTelemetry community understands the importance of backward compatibility in instrumentation code. Going back and re-instrumenting an application because of a new version of a telemetry library is a pain. As such, a significant amount of effort has gone into ensuring that components defined in OpenTelemetry remain interoperable with previous versions. The project defines its versioning and stability guarantees as part of the specification (https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/versioning-and-stability.md).

The schema URL is a field added to the telemetry generated for logs, metrics, resources, and traces, tying the emitted telemetry to a version of the semantic conventions. This field allows the producers and consumers of telemetry to understand how to interpret the data. The schema also provides instructions for converting data from one version to another, as per the following example:

1.8.0 schema

file_format: 1.0.0
schema_url: https://opentelemetry.io/schemas/1.8.0
versions:
  1.8.0:
    spans:
      changes:
        - rename_attributes:
            attribute_map:
              db.cassandra.keyspace: db.name
              db.hbase.namespace: db.name
  1.7.0:
  1.6.1:
Continuing with the previous example, imagine a producer of Cassandra telemetry is emitting db.cassandra.keyspace as the name for a Cassandra database and specifying the schema as 1.7.0. It sends the data to a backend that implements schema 1.8.0. By reading the schema URL and implementing the appropriate translation, the backend can produce telemetry in its expected version, which is powerful! Schemas decouple systems involved in telemetry, providing them with the flexibility to evolve independently.
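
On the producing side, the Python API accepts a schema URL when obtaining a tracer, tying the spans it creates to a specific version of the conventions; the following is a sketch, and you should confirm that your installed version supports this argument.

from opentelemetry import trace

# Associate all spans created by this tracer with version 1.8.0 of the
# OpenTelemetry schema.
tracer = trace.get_tracer(
    __name__, schema_url="https://opentelemetry.io/schemas/1.8.0"
)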

Summary

This chapter allowed us to learn or review some concepts that will assist us when instrumenting applications using OpenTelemetry. We looked at the building blocks of distributed tracing, which will come in handy when we go through instrumenting our first application with OpenTelemetry in Chapter 4, Distributed Tracing – Tracing Code Execution. We also started analyzing tracing data using tools that developers and operators make use of every day.

We then switched to the metrics signal; first, looking at the minimal contents of a metric, then comparing different data types commonly used to produce metrics and their structures. Discussing exemplars gave us a brief introduction to how correlating metrics with traces can create a more complete picture of what is happening within a system by combining telemetry across signals.

Looking at log formats and searching through logs to find information about the demo application allowed us to get familiar with yet another tool available in the observability practitioner's toolbelt.

Lastly, by leveraging semantic conventions defined in OpenTelemetry, we can begin to produce consistent, high-quality data. Following these conventions spares producers of telemetry the painful task of naming things, which everyone in the software industry agrees is hard. Additionally, these conventions remove the guesswork when interpreting the data.

Knowing the theory and concepts behind instrumentation and telemetry gives us the tools to do all the instrumentation work ourselves. Still, what if I were to tell you it may not be necessary to instrument every call in every library manually? The next chapter will cover how auto-instrumentation looks to help developers in their quest for better visibility into their systems.
