Chapter 2. Cloud Native Approach to Uniform Observability

In this chapter, we walk through a cloud native approach to uniform observability through the lens of service meshes. Split into three sections, this chapter breaks down the concepts of cloud native, observability, and uniform observability. In the first section, we examine the amorphous concept of cloud native by characterizing its various facets. Then we consider the difference between the act of monitoring a service and the property of a service being observable. In the last section, we reflect on the power of having autogenerated telemetry that provides ubiquitous and consistent insight into your running services. Service meshes are a product of cloud native, so let’s start our discussion by defining what cloud native really means.

What Does It Mean to Be Cloud Native?

“Cloud native” as an umbrella term is a combination of both technology and process. Driven by the need for both machine and human efficiency, cloud native technology spans application architecture, packaging, and infrastructure. Cloud native process embodies the full software life cycle. Often, but not always, cloud native process reduces historically separate organizational functions and life cycle steps (e.g., architecture, QA, security, documentation, development, operations, sustaining, and so on) to two functions: development and operations. Development and operations are the two primary functions of individuals who deliver software as a service, commonly employing DevOps practices and culture. Cloud native software is commonly, but not always, continuously delivered as a service.

Tip

The more services you deploy, the greater the return on investment you’ll see from using a service mesh. Cloud native architectures lend themselves to higher numbers of services, hence our need to understand what it means to be cloud native. Service meshes provide value to noncontainerized workloads as well as monolithic services. Examples of this added value are highlighted throughout this book.

Cloud native applications typically run in a public or private cloud. Minimally, they run on top of programmatically addressable infrastructure. That said, lifting and shifting an application into a cloud doesn’t quite make it cloud native.

The following are characteristics of cloud native applications:

  • They run on programmatically addressable infrastructure, and are dynamic and decoupled from physical resources by one or more layers of abstraction across compute, network, and storage resources.

  • They are distributed and decentralized with the focus often being on how the application behaves, not on where it’s running. They account for software life cycle events to allow for (rolling) updates, smoothly upgrading services without service disruption.

  • They are resilient and scalable, designed to run redundantly, without single points of failure, and to survive continual disruption.

  • They are observable through their own instrumentation and/or that provided by underlying layers. Given their dynamic nature, distributed systems are relatively more difficult to inspect and debug, and so their observability must be accounted for.

The Path to Cloud Native

For most organizations, the path to cloud native is an evolutionary act of applying cloud native principles to existing services, whether through retrofit or rewrite. Others are fortunate enough to have started projects after cloud native principles and tools were generally available and accepted. Whether your journey calls for dealing with an existing service or writing a new collection of them, service meshes offer considerable value—value that increases as you increase the number of services that you own and run. Service meshes are the next logical step after a container orchestration deployment. Figure 2-1 outlines various cloud native paths.

As some service meshes are easier to deploy than others and some meshes offer more value than others, depending on which mesh you deploy, you might need a certain number of microservices to make the deployment worthwhile. In time, service meshes (and extensions to them) will relieve developers of common application-level considerations (e.g., cost accounting and price planning), with the service mesh simply providing for these ubiquitous concerns.

Figure 2-1. The paths taken to cloud native are varied and replete with choice.

Depending upon your teams’ experience levels and your specific projects, your path to cloud native will use different combinations of software development process, operational practices, application architecture, packaging and runtimes, and application infrastructure. Applications and teams exhibiting cloud native characteristics use one, a combination of, or all of the approaches highlighted in Figure 2-2.

Figure 2-2. The evolution to cloud native through process and technology (architecture, packaging, and infrastructure)

Packaging and Deployment

Cloud native technology often takes the form of microservices (engaged through service endpoints), built in containers (engaged through a scheduler), and functions (engaged through event notifications). Evolutionary shifts in packaging patterns are driven by engineers’ need for efficient utilization of machines and for delivery speed. The path to cloud native pushes for smaller and smaller units of deployment, enabled through high levels of resource isolation. Isolation through virtualization and containerization provides higher levels of efficiency, as smaller packages make for more tightly bin-packed servers.

Each phase of the packaging evolution, from bare-metal servers to VMs to containers to unikernels to functions, has seen varying degrees of use when measured by the number of deployments in the wild. Some package types provide better guarantees of portability, interoperability, isolation, efficiency, and so on. For example, containers deliver higher degrees of portability and interoperability than VMs. Though they’re lightweight, isolated, and highly scalable, functions suffer in regard to portability, exhibiting possibly the highest degree of lock-in among the various types of packaging. Irrespective of your chosen packaging, whether you deploy services directly on the host OS, in a VM, in a container, as a unikernel, or as a function, service meshes can provide connection, control, observability, and security.

Application Architecture

More important than the form they take are the characteristics exhibited by cloud native application architecture. Central to cloud native are qualities such as ephemerality, actively scheduled workloads, loose coupling with explicitly described dependencies, event-driven design, horizontal scaling, and cleanly separated stateless and stateful services. Cloud native applications commonly exemplify a declarative architectural approach that incorporates resiliency, availability, and observability as upfront design concerns.

Cloud native technologies empower organizations to build and run scalable applications in dynamic environments such as public, private, and hybrid clouds. These applications are built around declarative APIs for interfacing with the infrastructure. These techniques enable loosely coupled systems that are resilient, manageable, and observable. Istio and other open source service meshes deliver the next generation of networking designed for cloud native applications.
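To make “declarative” concrete, here is a minimal sketch of a Kubernetes Deployment manifest: you declare the desired state (three replicas of a given image) and the platform’s controllers continuously reconcile the running system toward it. The service name and image are hypothetical placeholders.

```yaml
# A minimal, hypothetical Deployment: desired state only, no imperative steps.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout                 # hypothetical service name
spec:
  replicas: 3                    # declare how many copies should exist
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      containers:
        - name: checkout
          image: registry.example.com/checkout:1.2.0   # hypothetical image
          ports:
            - containerPort: 8080
```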

Development and Operations Processes

Developer and operator experience is also central to the philosophy of cloud native design and process, which fosters code and component reuse and a high degree of automation. Using infrastructure as code (IaC), operators aggressively automate the methods by which cloud native applications and their infrastructure are deployed, monitored, and scaled. When combined with robust automation, microservices enable engineers to make high-impact changes frequently and predictably with minimal toil, typically using multiple continuous integration (CI) and continuous delivery (CD) pipelines to build and deploy them.
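As a hedged sketch of what one such per-service pipeline might look like (using a GitHub Actions-style syntax; the registry, image name, and manifest path are hypothetical), a push to the main branch builds, publishes, and declaratively rolls out a single microservice:

```yaml
# A hypothetical per-service delivery pipeline: build, publish, and apply
# declarative manifests. Names and paths are illustrative assumptions.
name: checkout-service-delivery
on:
  push:
    branches: [main]
jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build and push the container image
        run: |
          docker build -t registry.example.com/checkout:${GITHUB_SHA} .
          docker push registry.example.com/checkout:${GITHUB_SHA}
      - name: Deploy by applying declarative manifests
        run: kubectl apply -f deploy/    # assumes cluster credentials are configured
```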

A high level of granular observability is a key focus of the systems and services that site reliability engineers monitor and manage. Istio generates metrics, logs, and traces for requests sent across the mesh, instrumenting services so that metrics creation and log and trace generation happen without code changes (save for context propagation in traces). Istio, and service meshes in general, insert a dedicated infrastructure layer between Dev and Ops, separating out the common concerns of service communication and providing independent control over services. Without a service mesh, operators would still be tied to developers for many of these concerns, needing new application builds to shape network traffic, affect access control, or change which downstream services a service talks to. The decoupling of Dev and Ops is key to providing autonomous, independent iteration.

Cloud Native Infrastructure

Public, hybrid, and private clouds are clearly core to the definition of what it means to be cloud native. In a nutshell, the cloud is software-defined infrastructure. The use of APIs as the primary interface to infrastructure is a principal cloud concept. Natively integrated workloads use these APIs (or abstractions of them), unlike nonnative workloads, which are ignorant of their infrastructure. As the definition of “cloud native” advances, so do cloud services themselves. Broadly, cloud services have evolved from infrastructure as a service (IaaS) to managed services to serverless offerings. Given that most functions as a service (FaaS) compute systems execute inside a container, these FaaS platforms can run on a service mesh and benefit from uniform observability.

Cloud native technology and process radically improve machine efficiency and resource utilization, reducing the costs associated with maintenance and operations while significantly increasing the overall agility and maintainability of applications. Though employing a container orchestrator addresses a layer of infrastructure needs, it doesn’t meet all application or service-level requirements. Service meshes provide a layer of tooling for the unmet service-level needs of cloud native applications.

What Is Observability?

Properly defining any new terms we use is important not only to facilitate a common nomenclature (and understanding) but also to avoid debate. The notion of a system being observable versus being monitored has been discussed at length within the industry. To clarify, let’s define monitoring (a verb) as a function performed, an activity; whereas observability (a noun) is an attribute of a system.

When speaking of a system’s observability, you describe how well and in what way the system provides you with signals to monitor. Observable software is typically instrumented to capture and expose information (telemetry/measurements), allowing you to reason over complex software.
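As one common (though convention-dependent) sketch of exposing such signals, a workload can advertise its metrics endpoint through pod annotations so that a Prometheus-style collector discovers and scrapes it; the port and path shown are assumptions, and the prometheus.io annotations are a widely used convention rather than a formal standard.

```yaml
# A hedged sketch: a pod advertising its instrumentation endpoint via the
# widely used (but informal) prometheus.io annotations. Port and path are
# assumptions about how the workload exposes its metrics.
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"
    prometheus.io/path: "/metrics"
```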

In contrast, monitoring is the action of observing and checking the behavior and outputs of a system and its components over time, evaluating system state. Your ability to monitor a system is improved by its number of observable attributes (its observability). Monitoring asserts whether a state is true or not true (e.g., a system is degraded or not).

Consider monitorability (a noun) as the condition of being monitorable: the ability to be monitored. Monitoring is being on the lookout for failures, typically through polling observable endpoints. Simplistically, early monitoring systems targeted uptime as the key metric for measuring resilience. Modern monitoring tooling is oriented toward top-level service metrics like latency, errors (the rate of requests that fail), traffic volume (requests per second for a web service, or transactions and retrievals per second for a key/value store), and saturation (a measure of how utilized a resource is). Modern monitoring systems are often infused with analytics for identifying anomalous behavior, predicting capacity breaches, and so on. Service meshes bridge observability and monitoring, providing some of both by generating, aggregating, and reasoning over telemetry. Various service meshes incorporate monitoring tooling as a capability or easy add-on.
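As a hedged sketch of monitoring asserting whether a state is true, the following Prometheus-style alerting rule pages when more than 5% of requests to a service fail. It assumes Istio’s standard istio_requests_total metric is being scraped; the service name, threshold, and windows are illustrative.

```yaml
# A hedged sketch of a golden-signal (errors) alert. The metric name follows
# Istio's standard request metric; service name, threshold, and windows are
# illustrative assumptions.
groups:
  - name: service-golden-signals
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(istio_requests_total{destination_service="checkout", response_code=~"5.."}[5m]))
            /
          sum(rate(istio_requests_total{destination_service="checkout"}[5m])) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: More than 5% of requests to checkout are failing
```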

Note

Debate would subside if we were to use “monitoring” and “monitorability.” This verb and its noun adjunct can be left alone in synonymous company with its sister “observing” and “observability.” That said, we’ll need to offer a new term for vendors to claim, coin, wield, and posture around. Lacking a term and definition to debate would be like a quinceañera without a piñata to beat up.

Is observability for developers, and monitoring for operators? Maybe, but that’s beside the point. Until service meshes arrived, it was unclear whose responsibility it was to make a system observable and to perform monitoring of it. Most teams give different answers to the question of whose responsibility it is to define and deliver a service-level objective for a given service. Responsibilities such as these are often diffused. Service meshes decouple development and delivery teams by introducing a management layer—Layer 5 (L5)—between the lower-layer infrastructure and higher-layer application services.

Pillars of Telemetry

Observability can include logs in the form of events and errors; traces in the form of spans and annotations; and metrics in the form of histograms, gauges, summaries, and counters, as depicted in Figure 2-3.

Figure 2-3. The three pillars of observability: key types of telemetry

Logs

Logs provide additional context for data such as metrics and are well suited for debugging. However, logs are also costly in terms of performance because they tend to be of the highest volume and require the most storage, making it difficult to strike the correct balance in tuning which logs to centrally save versus which to allow to eventually rotate out. Though structured logging doesn’t have the downsides inherent in pure string-based log manipulation, it still takes up far more storage and is slower to query and process (as is evident from log-based monitoring vendors’ pricing models). Some best practices for logging include enforcing quotas and dynamically adjusting the rate of log generation.

Metrics

Metrics, unlike logs, have a constant overhead and are good for alerts. Taken together, logs and metrics give insight into individual systems, but they make it difficult to see into the lifetime of a request that has traversed multiple systems, a common need in distributed systems. Metrics can be powerful and, when aggregated, quite insightful; they’re good for identifying known-unknowns. High compression rates give metrics a small footprint, considering that they’re optimized for storage (a good Gorilla implementation can get a sample down to 1.37 bytes), and they enable historical trend analysis through long-term retention.

Traces

Tracing allows you to granularly track request segments (spans) as the request is processed across various services. It’s difficult to introduce later, as (among other reasons) third-party libraries used by the application also need to be instrumented. Distributed tracing can be costly; thus, most service mesh tracing systems employ various forms of sampling to capture only a portion of the observed traces. Sampling reduces performance overhead and storage costs, but it also reduces visibility. The sampling rate (typically expressed as a percentage of service request volume) determines how frequently traces are captured and is chosen to balance overhead against visibility.
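As a hedged example of tuning that balance, Istio 1.x installations expose a mesh-wide trace sampling percentage; the snippet below uses the pilot.traceSampling Helm value, though the exact key and default vary by Istio version and install method.

```yaml
# A hedged sketch: mesh-wide trace sampling as install-time configuration.
# pilot.traceSampling is the Helm value used by Istio 1.x releases; other
# versions and install methods expose the same knob differently.
pilot:
  traceSampling: 1.0    # sample roughly 1% of requests; raise temporarily when debugging
```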

Combining Telemetry Pillars

A maximally observable system exploits each of these signals and augments them with synthetic checks, end-user experience monitoring (real user monitoring), and tooling for distributed debugging. Black-box testing and synthetic checks are still needed because they provide end-to-end validation of everything your internal signals might not have captured.

Figure 2-4 presents a spectrum demonstrating how collecting telemetry in production is a compromise between cost, in terms of storage and performance overhead (CPU, memory, and request latency), and the value of the information collected, typically measured by how expressive it is and how useful it is in diagnosing slow or errored responses.

Figure 2-4. A comparative spectrum showing value provided by each pillar versus cost

Arguably, metrics provide the best ROI. But given that some service meshes facilitate distributed tracing without code changes, you could also argue that distributed tracing provides the greatest value for the least investment, relative to the level of insight it provides. Ideally, your instrumentation allows you to dial verbosity levels and sampling rates up or down, giving you control over overhead costs versus desired observability.

Many organizations are now used to having individual monitoring solutions for distributed tracing, logging, security, access control, and so on. Service meshes centralize and assist in solving these observability challenges. Istio generates and sends multiple telemetric signals based on requests sent into the mesh, as shown in Figure 2-5.

Figure 2-5. Istio’s Mixer can collect and send telemetric signals to backend monitoring, authentication, and quota systems via adapters.

Why Is Observability Key in Distributed Systems?

The ability to aggregate and correlate logs, metrics, and traces together when running distributed systems is key to your ability to reason over what’s happening within your application across the disparate infrastructure upon which it runs. When running a distributed system, you understand that failures will happen and can account for some percentage of these known-unknowns. You will not know beforehand all of the ways in which failure will occur (unknown-unknowns); therefore, your system must be granularly observable (and debuggable) so you can ask new questions and reason over the behavior of your application (in context of its infrastructure). Of the many signals available, which are the most critical to monitor?

As a service owner, you need to explore these complex and interconnected systems and explain anomalies based on telemetry delivered from your instrumentation. It’s through a combination of internal observables and external monitoring that service meshes illuminate service operation, where you might otherwise be blind.

Monitoring is an activity you perform by observing the state of a system over a period of time. Observability (the condition of being observable) is a measure of how well the internal states of a system can be inferred from knowledge of its external outputs; it is a measure of the extent to which something is observable.

Rather than attempting to overcome distributed systems concerns by writing infrastructure logic into application code, you can manage these challenges with a service mesh. A service mesh helps ensure that the responsibility of service management is centralized, avoiding redundant instrumentation, and making observability ubiquitous and uniform across services.

Uniform Observability with a Service Mesh

Insight (observability) is the number one reason why people deploy a service mesh. Not only do service meshes provide a level of immediate insight, but they also do so uniformly and ubiquitously. You might be accustomed to having individual monitoring solutions for distributed tracing, logging, security, access control, metering, and so on. Service meshes centralize and assist in consolidating these separate panes of glass by generating metrics, logs, and traces of requests transiting the mesh. Taking advantage of automatically generated span identifiers from the data plane, Istio provides a baseline of distributed tracing in order to visualize dependencies, request volumes, and failure rates. Istio’s default attribute template (more on attributes in Chapter 9) emits metrics for global request volume, global success rate, and individual service responses by version, source, and time. When metrics are ubiquitous across your cluster, they unlock new insights, and also free developers from having to instrument code to emit these metrics.

The importance of ubiquity and uniformity of insight (and control over request behavior) is well illustrated by the challenges that arise from using client libraries.

Client Libraries

Client libraries (sometimes referred to as microservices frameworks) are yesterday’s go-to tooling for developers looking to infuse resilience into their microservices. There are a number of popular language-specific client libraries that offer resiliency features like timing out a request or backing off and retrying when a service isn’t responding in a timely fashion.

Client libraries became popular as microservices gained a foothold in cloud native application design, as a means of avoiding having to rewrite the same infrastructure and operational logic in every service. One problem with microservices frameworks is that they couple those same infrastructure and operational concerns with your code. This leads to code duplication across your services and inconsistency in what different libraries provide and how they behave. As shown in Figure 2-6, when running multiple versions of the same library or different libraries, getting service teams to update their libraries can be an arduous process. When these distributed systems concerns are embedded into your service code, you need to chase your engineers to update and correct their libraries (of which there might be a few, used to varying degrees). Getting a consistent and recent version deployed can take some time. Achieving and enforcing consistency is challenging.

Figure 2-6. Applications tightly coupled with infrastructure control logic

Interfacing with Monitoring Systems

From the application’s vantage point, service meshes largely provide black-box monitoring (observing a system from the outside) of service-to-service communication, leaving white-box monitoring (observing a system from within, reporting measurements from the inside out) of an application as the responsibility of the microservice. The proxies that comprise the data plane are well positioned (transparently, in-band) to generate metrics, logs, and traces, providing uniform and thorough observability throughout the mesh as a whole. Istio provides adapters to translate this telemetry and transmit it to your monitoring system(s) of choice.

Driven by the need for speed of delivery, potential global scale, and judicious resource utilization, cloud native applications run as immutable, isolated, ephemeral packages on what is typically shared infrastructure.

Client libraries and microservices frameworks come with challenges. Service meshes move these concerns into the service proxy and decouple from the application code.
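As a hedged illustration of that decoupling, the following Istio VirtualService sketch declares a request timeout and retry policy in the mesh rather than reimplementing them in every client library; the host name and values are illustrative assumptions.

```yaml
# A hedged sketch: resilience policy (timeout and retries) declared once in
# the mesh instead of in each client library. Host and values are
# illustrative assumptions.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: checkout
spec:
  hosts:
    - checkout
  http:
    - route:
        - destination:
            host: checkout
      timeout: 2s             # fail fast instead of letting callers hang
      retries:
        attempts: 3
        perTryTimeout: 500ms
```

Changing these values is a configuration rollout, not a rebuild of every service that calls the target service.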

Is your application easy to monitor in production? Many applications are, but sadly, some are designed with observability as an afterthought. Ideally, you should consider observability in advance, as it is an important factor in running apps at scale, just like backups, security, auditability, and the like. In this way, you can make the trade-offs consciously. Whether or not observability was considered upfront in your environment, a service mesh offers much value.

There’s a cost to telemetry. Various techniques and algorithms are used to gather only the signals that are most insightful.
