Chapter 12: Sampling

One of the challenges of telemetry, in general, is managing the quantity of data that can be produced by instrumentation. This can be problematic at the time of generation if the tools producing telemetry consume too many resources. It can also be costly to transfer the data across various points of the network. And, of course, the more data is produced, the more storage it consumes, and the more resources are required to sift through it at the time of analysis. The last topic we'll discuss in this book focuses on how we can reduce the amount of data produced by instrumentation while retaining the value and fidelity of the data. To achieve this, we will be looking at sampling. Although primarily a concern of tracing, sampling has an impact across metrics and logs as well, which we'll learn about throughout this chapter. We'll look at the following areas:

  • Concepts of sampling, including sampling strategies, across the different signals of OpenTelemetry
  • How to configure sampling at the application level via the OpenTelemetry Software Development Kit (SDK)
  • Using the OpenTelemetry collector to sample data

Along the way, we'll look at some common pitfalls of sampling to learn how they can best be avoided. Let's start with the technical requirements for the chapter.

Technical requirements

All the code for the examples in the chapter is available in the companion repository, which can be downloaded using git with the following command. The examples are under the chapter12 directory:

$ git clone https://github.com/PacktPublishing/Cloud-Native-Observability

$ cd Cloud-Native-Observability/chapter12

The first example in the chapter consists of an example application that uses the OpenTelemetry Python SDK to configure a sampler. To run the code, we'll need Python 3.6 or greater installed:

$ python --version
Python 3.8.9

$ python3 --version
Python 3.8.9

If Python is not installed on your system, or the installed version of Python is less than the supported version, follow the instructions from the Python website (https://www.python.org/downloads/) to install a compatible version.

Next, install the following OpenTelemetry packages via pip. Note that through dependency requirements, additional packages will automatically be installed:

$ pip install opentelemetry-distro opentelemetry-exporter-otlp

$ pip freeze | grep opentelemetry
opentelemetry-api==1.8.0
opentelemetry-distro==0.27b0
opentelemetry-exporter-otlp==1.8.0
opentelemetry-exporter-otlp-proto-grpc==1.8.0
opentelemetry-exporter-otlp-proto-http==1.8.0
opentelemetry-instrumentation==0.27b0
opentelemetry-proto==1.8.0
opentelemetry-sdk==1.8.0

The second example will use the OpenTelemetry Collector, which can be downloaded directly from GitHub. The example will focus on the tail sampling processor, which currently resides in the opentelemetry-collector-contrib repository. The version used in this chapter can be found at the following location: https://github.com/open-telemetry/opentelemetry-collector-releases/releases/tag/v0.43.0. Download a binary that matches your system from the available releases. For example, the following commands download the AMD64-compatible macOS binary, extract it, ensure the executable flag is set, and run the binary to check that things are working:

$ wget -O otelcol.tar.gz https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v0.43.0/otelcol-contrib_0.43.0_darwin_amd64.tar.gz

$ tar -xzf otelcol.tar.gz otelcol-contrib

$ chmod +x ./otelcol-contrib

$ ./otelcol-contrib --version
otelcol-contrib version 0.43.0

If a package matching your environment isn't available, you can compile the collector manually. The source is available on GitHub: https://github.com/open-telemetry/opentelemetry-collector-contrib. With this in place, let's get started with sampling!

Concepts of sampling across signals

Sampling, a method often used in the domain of research, is the process of selecting a subset of data points from a larger dataset to reduce the amount of data to be analyzed. This is done either because analyzing the entire dataset would be impossible or impractical, or because it is unnecessary to achieve the research goal. For example, if we wanted to record how many doors, on average, each car in a store parking lot has, it may be possible to go through the entire parking lot and record the data in its entirety. However, if the parking lot contains 20,000 cars, it may be best to select a sample of those cars, say 2,000, and analyze that instead. Many sampling methods exist to ensure that a representative subset of the data is selected, so that the meaning of the data is not lost through sampling.

Methods for sampling can be grouped as either of the following:

  • Probabilistic (https://en.wikipedia.org/wiki/Probability_sampling): The probability of sampling is a known quantity, and that quantity is applied across all the data points in the dataset. Returning to the parking lot example, a probabilistic strategy would be to sample 10% of all cars. To accomplish this, we could record the data for every tenth car parked. In small datasets, probabilistic sampling is less effective as the variability between data points is higher.
  • Non-probabilistic (https://en.wikipedia.org/wiki/Nonprobability_sampling): The selection of data is based on specific characteristics of the data. An example of this may be to choose the 2,000 cars closest to the store out of convenience. This introduces bias into the selection process. The parking area located closest to the store may include designated spots or even spots reserved for smaller cars, therefore impacting the results.
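These two approaches can be sketched in a few lines of Python. The parking-lot numbers below are made up purely for illustration:

```python
import random

random.seed(42)  # fixed seed so the example is reproducible
# Made-up parking lot: 20,000 cars, each with 2, 4, or 5 doors.
cars = [{"spot": i, "doors": random.choice([2, 4, 5])} for i in range(20_000)]

# Probabilistic: record every tenth car, a fixed 10% of the dataset.
probabilistic_sample = cars[::10]

# Non-probabilistic: take the 2,000 cars closest to the store out of
# convenience, which biases the sample toward whatever parks there.
convenience_sample = cars[:2_000]

def average_doors(sample):
    return sum(car["doors"] for car in sample) / len(sample)

print(len(probabilistic_sample), len(convenience_sample))
print(round(average_doors(probabilistic_sample), 2))
```

Both samples are the same size, but only the probabilistic one gives every car an equal chance of selection.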

Traces

In the context of OpenTelemetry, sampling means deciding what to do with the spans that form a particular trace. Spans in a trace are either processed or dropped, depending on the configuration of the sampler. Various components of OpenTelemetry are involved in carrying the decision throughout the system:

  • A Sampler is the starting point, allowing users to select a sampling level. Several samplers are defined in the OpenTelemetry specification; more on these shortly.
  • The TracerProvider class receives a sampler as a configuration parameter. This ensures that all traces produced by a Tracer obtained from a specific TracerProvider are sampled consistently.
  • Once a trace is created, a decision is made on whether to sample it. This decision is stored in the SpanContext associated with all spans in the trace and is propagated to all the services participating in the distributed trace via the configured Propagator.
  • Finally, once a span has ended, the SpanProcessor applies the sampling decision: it passes the spans of all sampled traces to the SpanExporter, while traces that are not sampled are not exported.
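These moving parts can be illustrated with a toy model. The classes below are simplified stand-ins, not the real SDK types, but they show how a decision made once by the sampler travels with the span context and is finally applied by the processor:

```python
from dataclasses import dataclass

@dataclass
class SpanContext:
    trace_id: int
    sampled: bool  # the decision, carried by every span in the trace

@dataclass
class Span:
    name: str
    context: SpanContext

class RatioSampler:
    """Toy sampler: the decision is made once, when the trace starts."""
    def __init__(self, ratio):
        self.ratio = ratio
    def should_sample(self, trace_id):
        return (trace_id % 100) < self.ratio * 100

class SpanProcessor:
    """Toy processor: applies the decision, dropping unsampled spans."""
    def __init__(self):
        self.exported = []
    def on_end(self, span):
        if span.context.sampled:
            self.exported.append(span.name)

sampler = RatioSampler(0.5)
processor = SpanProcessor()
for trace_id in (10, 90):  # one sampled, one not, under the toy rule
    ctx = SpanContext(trace_id, sampler.should_sample(trace_id))
    processor.on_end(Span(f"span-{trace_id}", ctx))

print(processor.exported)  # only the sampled trace's span survives
```

Note that the processor never re-evaluates anything: it simply reads the decision stored in the context, just as the SDK does.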

Metrics

For certain types of data, sampling just doesn't work. Sampling in the case of metrics may severely alter the data, rendering it effectively useless. For example, imagine recording data for each incoming request to a service, incrementing a counter by one with each request. Sampling this data would mean that any increment that is not sampled would result in unaccounted requests. Values recorded as a result would lose the meaning of the original data.

A single metric data point is smaller than a single trace. This means that typically, managing metrics data creates less overhead to process and store. I say typically here because this depends on many factors, such as the dimensions of the data and the frequency at which data points are collected.

Reducing the amount of data produced by the metrics signal focuses on aggregating the data, which reduces the number of data points transmitted. It does this by combining data points rather than selecting specific points and discarding others. There is, however, one aspect of metrics where sampling comes into play: exemplars. If you recall from Chapter 2, OpenTelemetry Signals – Traces, Metrics, and Logs, exemplars are data points that allow metrics to be correlated with traces. There is no need to produce exemplars that reference unsampled traces. The details of how exemplars and their sampling should be configured are still being discussed in the OpenTelemetry specification as of December 2021. It is good to be aware that this will be a feature of OpenTelemetry in the near future.
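Returning to the request counter example, a quick sketch shows the difference between sampling increments and aggregating them:

```python
# Contrast sampling a counter's increments with aggregating them.
requests = [1] * 1000  # one increment per incoming request

# Sampling 10% of increments: the other 900 requests simply vanish
# from the count, destroying the meaning of the metric.
sampled_total = sum(requests[::10])

# Aggregating: increments are combined into a single data point, so
# the total is preserved and only one value needs to be exported.
aggregated_total = sum(requests)

print(sampled_total, aggregated_total)  # 100 vs 1000
```

Aggregation reduces the number of data points transmitted without losing the total, which is why it, rather than sampling, is the tool of choice for metrics.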

Logs

At the time of writing, there is no specification in OpenTelemetry around if or how the logging signal should be sampled. The following shows a couple of ways that are currently being considered:

  • OpenTelemetry provides the ability for logs to be correlated with traces. As such, it may make sense to provide a configuration option to only emit log records that are correlated with sampled traces.
  • Log records could be sampled in the same way that traces can be configured via a sampler, to only emit a fraction of the total logs (https://github.com/open-telemetry/opentelemetry-specification/issues/2237).

An alternative to sampling for logging is aggregation. Log records that contain the same message could be aggregated and transmitted as a single record, which could include a counter of repeated events. As these options are purely speculative, we won't focus any additional efforts on sampling and logging in this chapter.
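As a quick illustration of the aggregation alternative just mentioned, repeated records can be collapsed into a single record carrying a repeat count. This is only a sketch of the idea, not an OpenTelemetry API:

```python
# Sketch of log aggregation: identical messages collapse into a
# single record that carries a counter of repeated events.
from collections import Counter

records = [
    "connection refused",
    "connection refused",
    "request served",
    "connection refused",
]

aggregated = Counter(records)
for message, count in aggregated.items():
    print(f"{message} (x{count})")
```

Four records become two, and no information about how often each event occurred is lost.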

Before diving into the code and what samplers are available, let's get familiar with some of the sampling strategies available.

Sampling strategies

When deciding on how to best configure sampling for a distributed system, the strategy selected often depends on the environment. Depending on the strategy chosen, the sampling decision is made at different points in the system, as shown in the following diagram:

Figure 12.1 – Different points at which sampling decisions can take place

The previous diagram shows where the decisions to sample are made, but before choosing a strategy, we must understand what they are and when they are appropriate.

Head sampling

The quickest way to decide about a trace is to decide at the very beginning whether to drop it or not; this is known as head sampling. The application that creates the first span in a trace, the root span, decides whether to sample the trace or not, and propagates that decision via the context to every subsequent service called. This signals to all other participants in the trace whether they should be sending this span to a backend.

Head sampling reduces the overhead for the entire system, as each application can discard unnecessary spans without computing a sampling decision. It also reduces the amount of data transmitted, which can have a significant impact on network costs.

Although it is the most efficient way to sample data, deciding at the beginning of a trace whether it should be sampled doesn't always work. As we'll see shortly when exploring the available samplers, it's possible for applications to configure sampling differently from one another. This can cause applications to ignore the decision made at the root span, resulting in broken traces being received by the backend. Figure 12.2 shows five applications interacting to form a distributed system that produces spans. It highlights what would happen if two applications, B and C, were configured to sample a trace, but the other applications in the system were not:

Figure 12.2 – Inconsistent sampling configuration

The backend would receive four spans and some context about the system but would be missing four additional spans and quite a bit of information.

Important Note

Inconsistent sampler configuration is a problem that affects all sampling strategies. Configuring multiple applications in a distributed system introduces the possibility of inconsistencies. Using a consistent sampling configuration across applications is critical.

Making a sampling decision at the very beginning of a trace can also cause valuable information to be missed. Continuing with the example from the previous diagram, if an error occurs in application D, but the sampling decision made by application A discards the trace, that error would not be reported to the backend. An inherent problem with head sampling is that the decision is made before all the information is available.

Tail sampling

If making the decision at the beginning of a trace is problematic because of a lack of information, what about making the decision at the end of a trace? Tail sampling is another common strategy that waits until a trace is complete before making a sampling decision. This allows the sampler to perform some analysis on the trace to detect potentially anomalous or interesting occurrences.

With tail sampling, all the applications in a distributed system must produce and transmit the telemetry to a destination that decides to sample the data or not. This can become costly for large distributed systems. Depending on where the tail sampling is performed, this option may cause significant amounts of data to be produced and transferred over the network, which could have little value.

Additionally, to make sampling decisions, the sampler must buffer in memory, or otherwise store, the data for an entire trace until it is ready to decide. This inevitably increases memory and storage consumption, depending on the size and duration of traces. To mitigate memory concerns, a maximum trace duration can be configured for tail sampling. However, this creates data gaps for any traces that don't finish within that time, which is problematic because those traces can help identify problems within a system.
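The buffering behavior described above can be sketched with a toy tail sampler. The real tail sampling processor is considerably more involved; this only illustrates why memory grows with trace size and why the full trace is needed before deciding:

```python
# Toy tail-sampling buffer: spans are held per trace until the trace
# is considered complete, then a decision is made on the whole trace.
from collections import defaultdict

buffer = defaultdict(list)  # trace_id -> spans buffered until a decision

def record(trace_id, name, duration_ms):
    # Every span must be held in memory until the trace can be evaluated.
    buffer[trace_id].append((name, duration_ms))

def decide(trace_id, latency_threshold_ms=1000):
    # Only with every span in hand can the full trace duration be seen.
    spans = buffer.pop(trace_id)
    total = sum(duration for _, duration in spans)
    return spans if total > latency_threshold_ms else None

record("t1", "db-query", 900)
record("t1", "render", 400)
record("t2", "cache-hit", 5)

kept = decide("t1")     # 1,300 ms total: the slow trace is sampled
dropped = decide("t2")  # 5 ms total: dropped, decide returns None
print(kept, dropped)
```

Notice that the decision about the slow trace could not have been made from the first span alone; the anomaly is only visible once every span has been buffered.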

Probability sampling

As discussed earlier in the chapter, probability sampling ensures that data is selected randomly, removing bias from the data sampled. Probability sampling is somewhat different from head and tail sampling, as it is both a configuration that can be applied to those other strategies and a strategy in itself. The sampling decision can be made by each component in the system individually, so long as the components share the same algorithm for applying the probability. In OpenTelemetry, the TraceIdRatioBased sampler (https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/sdk.md#traceidratiobased) combined with the standard random trace ID generator provides a mechanism for probability sampling. The decision to sample is calculated by applying a configurable ratio to a hash of the trace ID. Since the trace ID is propagated across the system, all components configured with the same ratio and the TraceIdRatioBased sampler can apply the same logic at decision time independently:

Figure 12.3 – Probabilistic sampling decisions can be applied at every step of the system
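The mechanics can be sketched as follows. The exact arithmetic of the real TraceIdRatioBased sampler may differ between implementations; the key property is that the decision is a pure function of the trace ID and the ratio, so independently configured services that share both will always agree:

```python
# Simplified trace-ID-ratio decision, considering only the lower
# 64 bits of the 128-bit trace ID.
MAX_TRACE_ID = 2**64

def should_sample(trace_id: int, ratio: float) -> bool:
    bound = int(ratio * MAX_TRACE_ID)
    return (trace_id & (MAX_TRACE_ID - 1)) < bound

# A trace ID taken from the collector output later in this chapter.
trace_id = 0x2A8950F2365E515324C62DFDC23735BA

# Two "services" sharing the same ratio reach the same decision
# independently, with no coordination required:
service_a = should_sample(trace_id, 0.5)
service_b = should_sample(trace_id, 0.5)
print(service_a == service_b)  # always True for a shared ratio
```

Because the trace ID is random and propagated everywhere, this gives each trace a uniform probability of being kept while keeping the whole system consistent.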

There are other sampling strategies available, but these are the ones we'll concern ourselves with for the remainder of this chapter.

Samplers available

There are a few different options when choosing a sampler. The following options are defined in the OpenTelemetry specification and are available in all implementations:

  • Always on: As the name suggests, the always_on sampler samples all traces.
  • Always off: The always_off sampler does not sample any traces.
  • Trace ID ratio: The trace ID ratio sampler, as discussed earlier, is a type of probability sampler available in OpenTelemetry.
  • Parent-based: The parent-based sampler supports the head sampling strategy by honoring the sampling decision already recorded in a span's parent context. It can be configured with an always_on, always_off, or trace ID ratio sampler as a fallback for when a sampling decision has not already been made for a trace.

Using the OpenTelemetry Python SDK will give us a chance to put these samplers to use.

Sampling at the application level via the SDK

Allowing applications to decide what to sample provides a great amount of flexibility to application developers and operators, as these applications are the source of the tracing data. Samplers can be configured in OpenTelemetry as a property of the tracer provider. In the following code, a configure_tracer method configures the OpenTelemetry tracing pipeline and receives a sampler as an argument. This method is used to obtain three different tracers, each with its own sampling configuration:

  • ALWAYS_ON: A sampler that always samples.
  • ALWAYS_OFF: A sampler that never samples.
  • TraceIdRatioBased: A probability sampler, which in the example is configured to sample traces 50% of the time.

The code then produces a separate trace using each tracer to demonstrate how sampling impacts the output generated by ConsoleSpanExporter:

sample.py

from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.trace.sampling import ALWAYS_OFF, ALWAYS_ON, TraceIdRatioBased

def configure_tracer(sampler):
    provider = TracerProvider(sampler=sampler)
    provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
    return provider.get_tracer(__name__)

always_on_tracer = configure_tracer(ALWAYS_ON)
always_off_tracer = configure_tracer(ALWAYS_OFF)
ratio_tracer = configure_tracer(TraceIdRatioBased(0.5))

with always_on_tracer.start_as_current_span("always-on") as span:
    span.set_attribute("sample", "always sampled")

with always_off_tracer.start_as_current_span("always-off") as span:
    span.set_attribute("sample", "never sampled")

with ratio_tracer.start_as_current_span("ratio") as span:
    span.set_attribute("sample", "sometimes sampled")

Run the code using the following command:

$ python sample.py

The output should have the following characteristics:

  • Contain a trace with a span named always-on.
  • Not contain a trace with a span named always-off.
  • Maybe contain a trace with a span named ratio. You may need to run the code a few times to get this trace to produce output.

The following sample output is abbreviated to only show the name of the span and significant attributes:

output

{
    "name": "ratio",
    "attributes": {
        "sample": "sometimes sampled"
    },
}
{
    "name": "always-on",
    "attributes": {
        "sample": "always sampled"
    },
}

Note that although the example configures three different samplers, a real-world application would only ever use one sampler. An exception to this is a single application containing multiple services with separate sampling requirements.

Note

In addition to configuring a sampler via code, it's also possible to configure it via the OTEL_TRACES_SAMPLER and OTEL_TRACES_SAMPLER_ARG environment variables.
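For example, an application run under auto-instrumentation, as in the collector example later in this chapter, could select the trace ID ratio sampler without any code changes. The variable names come from the OpenTelemetry specification; the 0.25 ratio and the app.py filename are illustrative placeholders:

```shell
# Select the trace ID ratio sampler at 25% via the environment; the
# choice applies to the tracer provider configured by
# auto-instrumentation, with no code changes required.
export OTEL_TRACES_SAMPLER=traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.25
opentelemetry-instrument python app.py
```

Note that a sampler passed explicitly to TracerProvider in code, as in sample.py, takes precedence over these variables.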

Using application configuration allows us to use head sampling, but individual applications don't have the information needed to make tail sampling decisions. For that, we need to go further down the pipeline.

Using the OpenTelemetry Collector to sample data

Configuring the application to sample traces is great, but what if we wanted to use tail sampling instead? The OpenTelemetry Collector provides a natural point where sampling can be performed. Today, it supports both tail sampling and probabilistic sampling via processors. As we've already discussed the probabilistic sampling processor in Chapter 8, The OpenTelemetry Collector, we'll focus this section on the tail sampling processor.

Tail sampling processor

In addition to supporting probabilistic sampling via a configurable sampling percentage, the tail sampling processor can make sampling decisions based on a variety of characteristics of a trace, including the following:

  • Overall trace duration
  • Span attributes' values
  • Status code of a span

To accomplish this, the tail sampling processor supports the configuration of policies for sampling traces. To better understand how configuring different policies can impact the tracing data produced, let's look at a configuration that sets up a collector with the following:

  • The OpenTelemetry protocol listener, which will receive the telemetry from an example application
  • A logging exporter to allow us to see the tracing data in the terminal
  • The tail sampling processor with a policy to always sample all traces

The following code snippet contains the elements of the previous list:

config/collector/config.yml

receivers:
  otlp:
    protocols:
      grpc:
exporters:
  logging:
    loglevel: debug
processors:
  tail_sampling:
    decision_wait: 5s
    policies: [{ name: always, type: always_sample }]
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [logging]

Start the collector using the following command, which includes the configuration previously shown:

$ ./otelcol-contrib --config ./config/collector/config.yml

Next, the ensuing code is an application that will send multiple traces to the collector to demonstrate some of the capabilities of the tail sampling processor:

multiple_traces.py

import time

from opentelemetry import trace

tracer = trace.get_tracer_provider().get_tracer(__name__)

with tracer.start_as_current_span("slow-span"):
    time.sleep(1)

for i in range(0, 20):
    with tracer.start_as_current_span("fast-span"):
        pass

Open a new terminal and start the program using OpenTelemetry auto-instrumentation, as per the following command:

$ opentelemetry-instrument python multiple_traces.py

Looking through the output in the collector terminal, you should see a total of 21 traces being emitted. Let's now update the collector configuration to only sample 10% of all traces. This can be configured via a policy, as per the following:

config/collector/config.yml

processors:
  tail_sampling:
    decision_wait: 5s
    policies:
      [
        {
          name: probability,
          type: probabilistic,
          probabilistic: { sampling_percentage: 10 },
        },
      ]

Restart the collector and run multiple_traces.py once more to see the effects of applying the new policy. The results should show roughly 10% of traces, which in this case would be about two traces. I say roughly here because the configuration relies on probabilistic sampling using the trace identifier. Since the trace ID is randomly generated, there is some variance in the results with such a small sample set. Run the command a few times if needed to see the sampling policy in action:

output

Span #0
    Trace ID       : 9581c95ae58bc8368050728f50c32f73
    Parent ID      :
    ID             : b9c3fb8838eb0f33
    Name           : fast-span
    Kind           : SPAN_KIND_INTERNAL
    Start time     : 2021-12-28 21:29:01.144907 +0000 UTC
    End time       : 2021-12-28 21:29:01.144922 +0000 UTC
    Status code    : STATUS_CODE_UNSET
    Status message :
Span #0
    Trace ID       : 2a8950f2365e515324c62dfdc23735ba
    Parent ID      :
    ID             : c5217fb16c4d90ff
    Name           : fast-span
    Kind           : SPAN_KIND_INTERNAL
    Start time     : 2021-12-28 21:29:01.14498 +0000 UTC
    End time       : 2021-12-28 21:29:01.144996 +0000 UTC
    Status code    : STATUS_CODE_UNSET
    Status message :

Note that in the previous output, only spans named fast-span were emitted. This is unfortunate, because the information about slow-span may be more useful to us. Fortunately, the tail sampling processor can also combine policies to make more complex sampling decisions.

For example, you may want to continue capturing only 10% of all traces but always capture traces representing operations that took longer than 1 second to complete. In this case, the following combination of a latency-based policy with a probabilistic policy would make this possible:

config/collector/config.yml

processors:
  tail_sampling:
    decision_wait: 5s
    policies:
      [
        {
          name: probability,
          type: probabilistic,
          probabilistic: { sampling_percentage: 10 },
        },
        { name: slow, type: latency, latency: { threshold_ms: 1000 } },
      ]

Restart the collector one last time and run the example code. You'll notice that both a percentage of the traces and the trace containing slow-span are visible in the collector's output. Other characteristics can be configured as well, but this gives you an idea of how the tail sampling processor works. Another example is basing the sampling decision on the status code, which is a convenient way to capture errors in a system. Yet another is sampling on custom attributes, which can be used to scope sampling to specific systems.
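As a sketch of those two variations, the policy list could be extended as follows. The status_code and string_attribute policy types are taken from the tail sampling processor's documentation, but verify them against the collector version you downloaded, as the configuration schema has evolved over time; the attribute key and values here are illustrative:

```yaml
processors:
  tail_sampling:
    decision_wait: 5s
    policies:
      [
        # Keep any trace that contains a span with an error status.
        { name: errors, type: status_code, status_code: { status_codes: [ERROR] } },
        # Keep traces carrying a specific attribute value, scoping
        # sampling to one part of the system.
        {
          name: checkout-only,
          type: string_attribute,
          string_attribute: { key: service.name, values: [checkout] },
        },
      ]
```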

Important Note

Choosing to sample traces on known characteristics introduces bias in the selection of spans that could inadvertently hide useful telemetry. Tread carefully when configuring sampling to use non-probabilistic data as it may exclude more information than you'd like. Combining probabilistic and non-probabilistic sampling, as in the previous example, allows us to work around this limitation.

Summary

Understanding the different options for sampling provides us with the ability to manage the amount of data produced by our applications. Knowing the trade-offs of different sampling strategies and some of the methods available helps decrease the level of noise in a busy environment.

The OpenTelemetry configuration and samplers available to configure sampling at the application level can help reduce the load and cost upfront in systems via head sampling. Configuring tail sampling at collection time provides the added benefit of making a more informed decision on what to keep or discard. This benefit comes at the added cost of having to run a collection point with sufficient resources to buffer the data until a decision can be reached.

Ultimately, the decisions made when configuring sampling will impact what data is available to observe what is happening in a system. Sample too little and you may miss important events. Sample too much and the cost of producing telemetry for a system may be too high or the data too noisy to search through. Sample only for known issues and you may miss the opportunity to find abnormalities you didn't even know about.

During development, sampling 100% of the data makes sense as the volume is low. In production, a much smaller percentage of data, under 10%, is often representative of the data as a whole.

The information in this chapter has given us an understanding of the concepts of sampling. It has also given us an idea of the trade-offs in choosing different sampling strategies. In the end, choosing the right strategy requires experimenting and tweaking as we learn more about our systems.
