Chapter 12. Real-World Considerations for Application Deployment

People adopt service meshes for many reasons, not least of which is to improve the reliability of the services they deliver to their users. A key part of improving the reliability of a workload on Istio is understanding Istio’s own reliability characteristics; as you might expect, the two are linked. Although the use of Istio (or any service mesh) stands to dramatically lift the boats of all workloads (on and off the mesh), the introduction of additional components to the system, like a service proxy that creates a services-network, presents new modes of potential failure. Considerations for reliably deploying Istio, and workloads on Istio, are the focus of this chapter.

In earlier chapters, we talked about how Istio can improve your application’s reliability (with outlier detection, circuit breakers, and retries, among others). We examined how Istio can allow you to control where traffic flows through your mesh very precisely, and how Istio helps you gain visibility into your deployment (by generating telemetry for your applications). We have addressed features of Istio that are helpful in protecting you against certain classes of failures, but we haven’t addressed, in detail, how Istio helps when it comes to mitigating the most common sources of outages: deploying new versions of your applications. Fortunately, the ability to control traffic and see how that traffic is behaving is exactly what we need to minimize the risk of (re)deploying applications.

Let’s dig into considerations for deploying Istio’s control-plane components and walk through a case study of a canary deployment of an application. As we review how Istio can help us deploy our own applications more safely, we’ll reflect on how Istio control-plane components interact at runtime and how their behavior affects that reliability.

Control-Plane Considerations

Each of Istio’s components has a variety of failure modes, and their failures manifest in the mesh in different ways. To best understand these modes and their behavior, for each control-plane component we review a common set of failure modes (the component partitioned from a workload, partitioned from the other components, etc.), any failure modes specific to that component, and how these issues manifest in the mesh at runtime. We cover the main failure modes, but it’s impossible for us to be exhaustive. Our goal here is to arm you with knowledge of behavior patterns and dependency implications that, combined with the information from the component-specific chapters, should enable you to understand new failure modes when they appear.

In this section, we discuss most failures in terms of a network partition, which is really a stand-in for many failure modes. The failure “Galley is partitioned from its config store,” for example, could be caused by many things: an actual network partition, the config store serving 500 errors, the config store not accepting connections at all, or the config store serving at unacceptably high latencies. Irrespective of the root cause, for the purposes of understanding the reliability characteristics of control-plane components, all of these failures are equivalent in that one component in the system cannot get the data it needs when it needs it.

It’s also worth discussing Istio component upgrades at a high level. Historically, upgrading Istio has been a painful process, fraught with error (and outage). As of Istio’s 1.0 release, the project committed to ensuring smooth upgrade processes, but this remains an ongoing learning process for the project. For example, it was discovered after release that for the configurations deployed by a small number of users, the upgrade from 1.0 to 1.1 would break application traffic in the mesh. As a result of this incident and other learnings, the project has started several long-running initiatives around upgradability of the mesh control plane and Istio’s components generally (the control plane, but also node agents and the data plane). These efforts primarily focus on the ability to canary control-plane components themselves. As of this writing, this work hasn’t yet landed. For each component we look at in this chapter, we’ll mention special upgrade considerations, but we will not cite specific known issues of upgrading from one version to another.

Galley

Galley is responsible for configuration distribution to the other Istio control-plane components. When Galley is unavailable (partitioned from its own source of truth or from the other Istio control-plane components, crash-looping, or otherwise), the primary symptom you’ll see is the inability to push new configuration into your mesh. The mesh should continue to function in its current steady state, but you’ll be unable to effect changes in its configuration until Galley is restored.

A typical Istio mesh installation will have relatively few Galley instances. One or two (if you’re running an HA pair) Galley instances per control-plane deployment is usual. Galley does not necessarily need to be “close” (in terms of network latency) to the rest of the control-plane components. Higher latency between Galley and the rest of the control plane means a longer time for user configurations to take effect; even a single global Galley instance controlling a control-plane instance on the other side of the world would have latency low enough for the mesh to function correctly.

Partitioned from the configuration store

When Galley cannot reach its configuration store, no new configuration will flow from Galley into the other Istio components. All of Istio’s components cache their current state in memory, and Galley is no exception; as long as Galley itself doesn’t die, it will continue to serve its current configuration to the rest of the control plane while attempting to reestablish a connection to the configuration store. If Galley itself dies and is restarted during this time, it will be unable to serve any configuration to the rest of the control plane until it reestablishes a connection to its own configuration store.

One way to mitigate this category of failure is to cache local configurations more persistently. Galley has the ability to ingest configurations from the local filesystem in addition to remote sources (like the Kubernetes API server). A base set of configurations can be provided on the local filesystem (persistent across restarts of Galley) that Galley can always serve while it attempts to establish a connection to remote configuration stores. Taken to the extreme, in systems with low rates of change, it’s entirely possible to run Galley using only the filesystem-based configuration source.

When Galley is deployed on Kubernetes, it also acts as a validating admission controller; that is, Galley is responsible for validating configuration that is submitted to the Kubernetes API server. In this case, configuration pushed into Kubernetes will be rejected at push time (i.e., kubectl apply will fail).

Partitioned from other Istio components

When Galley is partitioned from other control-plane components, Galley itself won’t fail, but those components will not receive configuration updates. See the “Partitioned from the configuration store” discussion in each component’s section for details about its failure mode when it can’t reach Galley.

Partitioned from mesh workloads

Galley does not interact directly with workloads or nodes deployed in the mesh; it interacts only with the other Istio control-plane components and its own configuration store. Galley being unreachable from the workloads in the mesh is completely fine, so long as the Istio control-plane components themselves can communicate with Galley.

Upgrades

Because of the nature of Galley’s failure modes, an in-place or rolling upgrade of Galley is fairly easy (an in-place upgrade is effectively the same as a temporary partition in this context, because the previous job is descheduled and a new one is created to replace it). The other Istio components locate Galley by DNS and will attempt to reconnect to Galley when their connection is severed. Istio performs skew testing (in which different versions of control- and data-plane components are tested together) to ensure that upgrades between any two adjacent versions (e.g., 1.0 to 1.1) are not breaking. There’s no guarantee that skipping multiple versions (e.g., 1.0 to 1.2) is safe.

Pilot

Pilot is responsible for configuring the service mesh’s data plane at runtime. When it’s unavailable, you’ll be unable to change the mesh’s current networking configuration; new workloads will not be able to start, but existing workloads will continue to serve under the configuration they had just prior to loss of communication with Pilot. Service proxies will retain this same configuration until either they reestablish a connection to Pilot or until they restart. Other data-plane configurations that require runtime (not bootstrap) configurations—for example, updates to policy or telemetry settings across the mesh—will also not take effect until Pilot recovers.

A typical service mesh deployment will have several Pilot instances. Pilot, like the other Istio control-plane components, is a stateless service that can be horizontally scaled as needed; in fact, this is the recommended best practice for production deployments. Underlying platforms like Kubernetes make such production configurations relatively simple, with support for horizontal autoscaling of pods out of the box. Latency between the service registry, Pilot, and the service proxies under Pilot’s management is the critical path for updating endpoints in the mesh as workloads are scheduled or move around; keeping this latency low results in better overall mesh performance. Generally, Pilot should be “near” (low latency to reach) the service proxies for which it is providing configuration. Pilot’s performance is less sensitive to its distance from its configuration sources.
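As a rough sketch of what such autoscaling might look like (assuming Pilot runs as the istio-pilot Deployment in the istio-system namespace, as Istio’s Helm charts do by default; the replica counts and CPU target here are illustrative, not recommendations), a Kubernetes HorizontalPodAutoscaler keeps Pilot scaled with load:

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: istio-pilot
  namespace: istio-system
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: istio-pilot
  minReplicas: 2
  maxReplicas: 5
  targetCPUUtilizationPercentage: 80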

There is one pain point related to scaling Pilot that’s worth discussing. Because Envoy uses a gRPC stream to receive configuration, there is no per-request load balancing across Pilot instances. Instead, each Envoy in an Istio deployment is sticky to its associated Pilot; a given Envoy will not communicate with another Pilot unless its associated Pilot severs the connection (or dies, severing its connections). For this reason, scaling up Pilot can be tricky. Often, you’ll need to scale out several instances and then kill an overloaded Pilot instance to force its Envoys to rebalance across the newly deployed Pilot instances. This maintenance issue is being addressed in upcoming Istio releases, in which Pilot will shed load by closing some connections itself when overloaded, forcing those Envoys to reconnect to another instance.
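On Kubernetes, that rebalancing dance looks something like the following sketch (we assume Pilot runs as the istio-pilot Deployment in istio-system; substitute the name of the overloaded pod that kubectl reports):

$ kubectl -n istio-system scale deployment istio-pilot --replicas=4
$ kubectl -n istio-system get pods | grep istio-pilot
$ kubectl -n istio-system delete pod <name-of-overloaded-pilot-pod>

The Envoys that were connected to the deleted instance reconnect through the Pilot service and end up spread across the remaining instances.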

Partitioned from the configuration store

As with the other Istio components, Pilot caches its current state in memory. The configuration in Pilot is divided into two categories: Istio networking configuration and environmental state from service registries.

When Pilot is unable to communicate with its configuration store for Istio networking configuration (Galley or the Kubernetes API server in older versions of Istio), it will continue to serve out of its currently cached state. New workloads will be able to be scheduled and their service proxies will receive configurations based on Pilot’s currently cached configuration. If Pilot itself restarts while it’s unable to communicate with the configuration store (or a new instance of Pilot is started), it will be unable to serve configurations for any service proxies that communicate with it.

When Pilot is unable to communicate with its service registries, it will again serve its current state out of memory while attempting to reconnect to the source. During this time, new services introduced into the mesh (e.g., the creation of a new Service resource in Kubernetes) will not be routable by workloads in the mesh. Similarly, new endpoints won’t be pushed to Envoy service proxies—this means that while Pilot is disconnected from its service discovery source, workloads being descheduled or otherwise moved will not have their network endpoints removed from the load-balancing set of the other service proxies in the mesh, and can still attempt to send traffic to those now-dead endpoints. Setting automatic retries with outlier detection across your deployment will help keep application traffic healthy during transient failures like this. And as before, if Pilot itself restarts in this window when the service registry is unavailable, no services from that registry will be routable in the mesh.
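For example, a default retry policy can be expressed on a VirtualService so that requests sent to endpoints that have just disappeared are transparently retried against other endpoints. This is only a sketch: the service name is hypothetical, and the attempt count and per-try timeout should be tuned to your own traffic:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-service-retries
spec:
  hosts:
  - my-service.default.svc.cluster.local
  http:
  - route:
    - destination:
        host: my-service.default.svc.cluster.local
    retries:
      attempts: 3
      perTryTimeout: 2s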

Partitioned from other Istio components

Pilot receives its network identity from Citadel, just like the other Istio components; being unable to reach Citadel when Pilot needs a new identity document (e.g., when a new instance is scheduled or the current credential expires) will result in workloads being unable to communicate with Pilot. (See “Citadel” for more information.) Pilot’s behavior when it cannot communicate with Galley was described in the previous section and in the section on Galley. Pilot does not communicate directly with Mixer, so being unable to contact either the Mixer policy or telemetry services does not affect Pilot at runtime.

Partitioned from mesh workloads

When workloads in the mesh cannot communicate with Pilot, their service proxies cannot receive new runtime configurations. Specifically, updates to networking configuration and service-to-service authorization policy, newly added services, and endpoint changes will not be pushed to service proxies. As with every other Istio component, service proxies cache their current configuration and will continue to serve it until they reestablish a connection to some Pilot instance. While Pilot is unavailable, newly scheduled workloads will not receive any configuration and therefore will not be able to communicate over the network at all (Istio configures the sidecar to fail closed). Newly scheduled workloads will also not be able to receive an identity from Citadel, because their identity is first populated by Pilot. Existing workloads will continue to serve using their current identity (as of the time they lost contact with Pilot) and will continue to be able to receive fresh credentials for that identity from Citadel even while Pilot is unavailable.

As we discussed earlier, configuring default retry, circuit breaking, and outlier detection policies across your mesh can help mitigate the impact of the transient Pilot outages that cause stale runtime configurations. One key benefit of client-side load balancing is the ability for individual clients to choose the servers they communicate with based on how available the server seems to the client. The mesh continues to function well with a degraded Pilot as long as the rate of change of the rest of the deployment is low and good network resiliency policies are in place across the mesh.
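A baseline resiliency policy for a service might look something like the following DestinationRule (again a sketch with a hypothetical service name; the thresholds are illustrative):

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: my-service-resilience
spec:
  host: my-service.default.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 64
        maxRequestsPerConnection: 10
    outlierDetection:
      consecutiveErrors: 5
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 50

The connectionPool settings act as the circuit breaker, and outlierDetection ejects endpoints that keep failing (such as stale, now-dead endpoints) from the load-balancing set for a period of time.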

Upgrades

As with Galley, upgrading Pilot in a live deployment is similar to a network partition. In particular, even though the data plane will continue to serve, updates (like newly scheduled or descheduled workloads) will not propagate to Envoy instances. As we noted at the beginning of the Pilot section, Envoys do not load-balance requests across Pilot instances; therefore, deploying a new Pilot side by side with an old one is not sufficient. You really need to perform a rolling upgrade (in which the new version replaces the old, shifting Envoy traffic as the old instances die) or manually deschedule old Pilot instances as new ones are brought up.

These restrictions are in part due to limitations in Envoy. Envoy takes a bootstrap configuration, which is immutable; part of this configuration is Pilot’s address. To update the bootstrap configuration, Envoy must be restarted, and the full suite of Envoy’s configuration is not available for configuring how Envoy talks to Pilot. For example, it is not possible to use Envoy’s (or Istio’s) own configuration to perform a percentage-based rollout of a new Pilot for Envoy to consume. (This is an intentional design decision on Envoy’s part to limit a large class of failures in which the control plane misconfigures Envoy such that it is not able to communicate with the control plane again to receive a correcting configuration.) Therefore, we’re limited in the techniques available for rolling out new Pilot versions incrementally.

Mixer

Mixer has two modes of operation with very different failure modes. In policy mode, Mixer is part of the request path, and failures directly affect user traffic, because policy checks necessarily block requests. In telemetry mode, Mixer is out of the request path, and failures affect only the mesh’s ability to produce telemetry (which, of course, can result in all sorts of alarms going off as telemetry fails to come in for part of the mesh). The sections that follow address failure modes common to both and call out the special considerations for each mode separately.

Mixer is written almost as a router: its configuration really just describes how to create values from a set of data and where to forward those values. As a result, in both modes of operation Mixer usually communicates with a set of remote backends for every request it receives. This means that Mixer is particularly sensitive to network partitions and increased network latency, much more so than the other control-plane components. It’s also important to note that in today’s model of a Mixer deployment, Istio assumes that the backends Mixer communicates with are not themselves in the mesh. Istio makes this assumption for a number of reasons, one being to avoid recursive calls (Mixer sends a trace to the collector, which triggers the collector’s sidecar to send a trace to Mixer; Mixer sends that trace to the collector, which triggers the collector’s sidecar again, and so on). The saving grace is that Mixer, unlike Pilot, sits behind a sidecar itself. This makes it possible to use Istio configuration to control how Mixer communicates with its backends (including resiliency configuration like circuit breaking and automatic retries).

Partitioned from the configuration store

Like the other Istio components, Mixer holds its current serving configuration in memory. A partition from Galley means that it won’t receive new configurations, but it won’t hinder Mixer’s ability to keep executing its current configuration. If Mixer dies and is restarted while Galley is unavailable, Mixer will serve only its default configuration. We discuss the behavior of Mixer serving its default configuration in each mode separately in a moment. In both modes, it is possible to provide Mixer with a different default configuration by giving it configurations from the local filesystem.
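As a hedged sketch of what that looks like (the flag name and path here are assumptions; verify them against the help output of your Mixer version), Mixer’s server can be pointed at a directory of configuration on its local filesystem rather than a remote store:

$ mixs server --configStoreURL=fs:///etc/istio/mixer-default-config

where /etc/istio/mixer-default-config is a directory of Istio configuration resources baked into the Mixer image or mounted from a ConfigMap.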

Mixer policies

Mixer policy can be set at installation to default open or default closed when it is unconfigured. Istio’s default installation ships with a default-open configuration that applies no policy: with that configuration, a service proxy calling check against an unconfigured Mixer will always allow traffic. It does this so that installing Istio into an existing cluster does not break all traffic. If you are using Mixer to apply policy that is nonoptional, you should configure Mixer to default closed when unconfigured during Istio installation. Authorization policy is a good example of policy that some service teams deem nonoptional. Other service teams choose to tolerate rate limiting or even abuse detection failing part of the time in favor of serving user traffic, so in this context, those policies could be considered “optional.”
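For example, with Istio’s Helm-based installation two global values control this behavior (a sketch; treat the exact value names as assumptions and confirm them against your chart’s values.yaml):

$ helm template install/kubernetes/helm/istio \
    --name istio --namespace istio-system \
    --set global.disablePolicyChecks=false \
    --set global.policyCheckFailOpen=false \
    > istio.yaml
$ kubectl apply -f istio.yaml

The first value ensures that policy checks are enforced at all; the second makes the proxies fail closed when Mixer is unreachable or unconfigured.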

Mixer telemetry

Mixer telemetry is not in the request path, so its failure will never affect traffic in the mesh. An unconfigured Mixer telemetry will accept report data from service proxies in the mesh; however, it will not generate telemetry from it (an unconfigured Mixer’s report is a no-op). This situation can cause a pager storm, because alerts for every affected service might trigger at the same time due to missing metrics.

Unfortunately, Mixer does not support a mode in which it reads configuration both from a local filesystem and from a remote configuration server. So, when using a remote configuration server (i.e., Galley) today, you cannot set a default configuration for Mixer other than via high-level flags (e.g., policy defaulting to open or closed). This is a known area for improvement, and subsequent versions of Istio should begin to address it.

Partitioned from other Istio components

For the most part, other Istio components do not communicate with Mixer, nor do they have a runtime dependency on Mixer. For Istio components (e.g., Pilot) that run behind service proxies that enforce policy, their failure mode when they’re unable to communicate with Mixer is identical to any other workload in the mesh. Otherwise, there are no special runtime dependencies between the other components and Mixer.

Partitioned from mesh workloads

When workloads cannot communicate with Mixer, they fail in the ways you might expect, given the two modes. For policy, the service proxy will enforce the default behavior set at install time (either fail open or fail closed). For telemetry, the service proxies will buffer as much data as they can until they’re able to forward it to Mixer again. Service proxies use a fixed-size circular buffer to temporarily store report metadata that they are attempting to forward to Mixer, so eventually data will be lost. The size of the buffer is configurable via the flags passed to the service proxy at startup, though this configuration item is not exposed in the Helm chart today.

Upgrades

Service proxies communicate with Mixer using unary gRPC calls, unlike how they communicate with Pilot (Envoy uses gRPC streams to Pilot’s xDS interface). That is to say, messages are sent individually rather than over a long-lived stream, so each request can be load balanced across Mixer instances. This makes it a lot easier to roll out new versions of Mixer, because we can use Envoy’s and Istio’s regular primitives to canary a new version automatically.

When you’re upgrading Mixer policy, beware of the latency spike that you’ll see when transitioning to a new instance. Mixer policy caches policy decisions very aggressively, and when you transition traffic to a new instance, you’ll see latencies spike as checks miss the cache and call policy backends for a decision. This means that your policy backends will also see an increase in traffic during the new instance’s cache-warming period. This is unlikely to affect your 50th-percentile latency, but it can affect your 99th percentile.

There aren’t any special considerations for Mixer telemetry, though the way that Mixer lazily loads its runtime configuration means that you’re likely to see very high latency for the first few report requests. Because reports are asynchronous and out-of-band of your traffic, this shouldn’t manifest as slowdowns in user traffic.

Citadel

Citadel is responsible for identity issuance and rotation in the service mesh. When Citadel is unavailable, nothing will happen until certificates begin to expire. Then, you’ll see failures to establish communication across the mesh. Existing traffic will continue to function while Citadel is down, but new workloads will not be able to communicate and new connections cannot be established (by either new or existing workloads) if the workload’s certificates expire while it’s unable to communicate with Citadel. When you’re using mTLS across Istio control-plane components, which is the default installation setting, the startup of all other control-plane components depends on Citadel starting up. This is because Citadel needs to mint an identity for each control-plane component before communication is allowed.

Partitioned from the configuration store

Like the other components, Citadel will continue to serve with its current state when it’s unable to reach its configuration store. Unlike most of the other control-plane components, Citadel receives little configuration from Galley, instead being more tightly coupled to its environmental configuration sources (in particular, the Kubernetes API server), which it uses to discover the set of identities for which it will mint certificates. When these identity sources are unavailable, Citadel is unable to mint certificates for new workloads and might be unable to rotate certificates for existing ones.

Partitioned from other Istio components

The other Istio components are just like normal mesh workloads from Citadel’s perspective: none of the control-plane components are special. The next section discusses this further.

Partitioned from mesh workloads

When workloads in the mesh can’t communicate with Citadel, they cannot receive new identity certificates. This means that new workloads starting up will be unable to communicate with anything in the mesh that requires mTLS because Citadel cannot mint an identity for those workloads. Existing workloads whose certificates are expiring will be unable to establish new connections, but their existing connections will remain open and valid until they receive a new certificate or close them. This failure to communicate will manifest as TCP handshakes failing, producing the dreaded “connection reset by peer” error. If you set shorter-lived certificates—for instance, a few hours—some edge cases around certificate rotation can manifest as 503s (due to a connection reset error) in the deployment. It is an ongoing effort in Istio to eliminate all 503s like these from the deployment, and some edge cases in certificate rotation are the last remaining source of these errors.

You should also note that to avoid “thundering-herd” problems, individual workloads will request refreshed certificates at random intervals before their certificate expires (to prevent every workload from asking for a new certificate every hour on the hour, for example). This means that, when partitioned from Citadel, various workloads for the same service might be unable to communicate, whereas others are still able to communicate.

Upgrades

Because of the random nature of certificate refresh requests, there’s no easy way to “schedule downtime” for Citadel, even though it’s called only very intermittently by workloads in the mesh. However, a new Citadel can be deployed along with an existing instance, and the existing instance can be drained (or killed entirely; e.g., in Kubernetes) to force traffic in the mesh to the new instance without interruption. Given that Citadel eagerly attempts to create certificates for all identities in the mesh at startup and that it cannot issue certificates until it finishes this process, you’ll likely want to deploy a new version of Citadel and wait for it to warm up before sending traffic to it.
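On Kubernetes this can be a standard rolling update of the Citadel Deployment, with a readiness probe gating traffic until the new instance has warmed up. The following is a sketch only: we assume the Deployment and container are named istio-citadel and citadel, as in Istio’s default install, and the image tag is illustrative:

$ kubectl -n istio-system set image deployment/istio-citadel \
    citadel=docker.io/istio/citadel:1.1.7
$ kubectl -n istio-system rollout status deployment/istio-citadel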

Case Study: Canary Deployment

The information in the previous sections about how Istio’s control-plane components interact should help you build a mental model of how the mesh as a whole behaves and how the failure of each component will manifest within, or affect, your applications. With this knowledge in hand, you should be able to begin developing plans for safely and reliably running and managing Istio in production. Now the question becomes: how do you use Istio’s functionality to improve the reliability of the applications in your deployment?

Nearly every outage is the result of some change(s). Controlling how changes are deployed into production and how they take effect is critical for controlling outages. For a service, deploying a new binary is the most common change, with deploying updated configuration for that service a close second. We recommend that you treat management of changes (configuration and binary changes alike) identically. You’ll find that although binary deployments might cause more outages today, as your production deployment matures, it’s likely that the root cause of most outages will shift to configuration. By handling both in a single, consistent manner you can build a single set of practices and processes for mitigating service outages regardless of the root cause. Hopefully, it’s atypical for an outage to be just like previous outages you’ve experienced (because the issue that caused that outage has been addressed). Assuming so, it follows that, generally speaking, there is no one-size-fits-all approach to problem resolution, but that in emergency situations having a known set of patterns to act against saves time, money, and error budget. Make sense? So, then, how can you safely use Istio to deploy a new binary?

Canarying is the process of gradually deploying a change, carefully controlling how it takes effect and who it affects. For example, it’s common for a company to have its employees test the next version of a product during development before it rolls out to customers; this is a canary. With Istio, we have a wide range of options to use in deciding how to route traffic to any groups of instances of a service—many of which are covered in Chapter 8. Here, we walk through a case study using a percentage-based traffic split to canary a new deployment of a service. To prepare a test environment in which we can see our canary in real time, in Example 12-1 we first create a simple deployment in Kubernetes with a service (see httpbin-svc-depl.yaml in this book’s GitHub repository).

Note

Notice the version: v1 label. It’s common in the Kubernetes community to use the version label to denote versions of deployments, and to use the app label to select a set of deployments for a service. A lot of tooling, including Istio’s own default dashboards, assumes the version label when drawing service graphs. We use this same label in our case study to control traffic routing.

Example 12-1. A Kubernetes service and deployment definition for the httpbin app
apiVersion: v1
kind: Service
metadata:
  name: httpbin
  labels:
    app: httpbin
spec:
  ports:
  - name: http
    port: 8000
    targetPort: 80
  selector:
    app: httpbin
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: httpbin-v1
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: httpbin
        version: v1
    spec:
      containers:
      - image: docker.io/kennethreitz/httpbin
        imagePullPolicy: IfNotPresent
        name: httpbin
        ports:
        - containerPort: 80

We can send traffic to the httpbin service within our cluster and we should see metrics appear. To make this easier, we can expose httpbin on our Gateway, as shown in Example 12-2, so that it’s accessible outside of our cluster (i.e., from our local machine). See httpbin-gw-vs.yaml on this book’s GitHub repository for the following example:

Example 12-2. An Istio Gateway and VirtualService exposing the httpbin service on the istio-ingressgateway deployment
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: httpbin-gateway
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "*"
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: httpbin
spec:
  hosts:
  - "httpbin.svc.default.cluster.local"
  gateways:
  - httpbin-gateway
  http:
  - route:
    - destination:
        host: httpbin
        port:
          number: 8000

Note

We use hosts: "*" here to make it easier to curl the Gateway’s IP address. If you have a DNS name for your istio-ingressgateway service, or already know the IP address, you can use that as the value for the hosts field in the Gateway and VirtualService (rather than using “*”).

We can issue a curl from our local machine to verify, as demonstrated in Example 12-3.

Example 12-3. A curl command issuing a request to the httpbin service
$ curl ${ISTIO_INGRESS_IP}/status/200
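If you don’t already have the ingress IP address handy, you can typically look it up from the istio-ingressgateway service (this sketch assumes a platform that provisions a LoadBalancer; on other platforms, use the node port instead). If the hosts field of your VirtualService is a specific hostname rather than "*", also pass a matching Host header with curl -H:

$ export ISTIO_INGRESS_IP=$(kubectl -n istio-system get service istio-ingressgateway \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
$ curl ${ISTIO_INGRESS_IP}/status/200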

Now, to canary a new version of the httpbin application, we could just create a new deployment. This would result in a round-robin distribution of load across all of the httpbin instances in our cluster. If there are many instances of httpbin running with lots of traffic, this might be acceptable; but that’s not often the case. Instead, we’ll use Istio to ensure that traffic stays pinned to the known good version while we roll out a new deployment and then gradually shift traffic over.

To do this, we need to create a few resources. First, we need to create a DestinationRule for our httpbin service that lets us describe subsets of the deployment. Then, we use those subsets in our VirtualService to make sure traffic stays directed at v1 even as we roll out v2, as shown in Example 12-4 (see httpbin-destination-v2.yaml on this book’s GitHub repository).

Example 12-4. An Istio DestinationRule for the httpbin service that declares two subsets
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: httpbin
spec:
  host: httpbin
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2

We declare two subsets: the v1 subset for the workloads we’ve already deployed, and the v2 subset for the workloads we’re about to deploy.

Notice that we make sure we use the version: v1 label, with which we originally deployed our application. Also, note that we can define a new subset, v2, which targets labels that we haven’t deployed yet. This is totally fine. When we do deploy workloads with the version: v2 label, the DestinationRule will target them. Until then, traffic pointed to the httpbin v2 subset will result in a 503 error (because there’s no healthy server in the v2 set).

Now, let’s update our VirtualService to use the subset shown in Example 12-5 (see httpbin-vs-v1.yaml on this book’s GitHub repository).

Example 12-5. The VirtualService from Example 12-2, updated to include a subset in its destination clause
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: httpbin
spec:
  hosts:
  - "*"
  gateways:
  - httpbin-gateway
  - mesh # Also direct traffic in the mesh with the same VirtualService
  http:
  - route:
    - destination:
        host: httpbin
        subset: v1
        port:
          number: 8000

This ensures that all traffic, both within the mesh (due to gateways: mesh) and at ingress (gateways: httpbin-gateway), is pinned to the subset of httpbin that is v1. Now, it’s safe for us to deploy a new version of httpbin that we’re confident will not receive user traffic, as illustrated in Example 12-6 (see httpbin-depl-v2.yaml on this book’s GitHub repository).

Example 12-6. A second deployment of httpbin with a version: v2 label
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: httpbin-v2
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: httpbin
        version: v2
    spec:
      containers:
      - image: docker.io/kennethreitz/httpbin
        imagePullPolicy: IfNotPresent
        name: httpbin
        ports:
        - containerPort: 80

We can continue to send traffic to the httpbin service, from outside the cluster or within it, and we should see that no traffic arrives at this new deployment, because traffic is still pinned to the v1 subset. You can use Istio’s metrics to verify this.
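One way to verify, assuming you’ve installed the Prometheus add-on and are using Istio’s standard metrics (label names can vary slightly between Istio releases), is to break request rates down by version in the Prometheus UI:

sum(rate(istio_requests_total{destination_app="httpbin"}[1m])) by (destination_version)

With the VirtualService still pinned to the v1 subset, the destination_version="v2" series should stay at zero; once we begin shifting weight in the next step, we expect it to climb in proportion.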

Now, we can finally canary this new deployment. We do this in Example 12-7 by directing 5% of our traffic to the new deployment, observing our service’s response codes and latency via Istio’s metrics to ensure the rollout looks good, and then ramping up the percentage gradually over time (see httpbin-vs-v2-5.yaml on this book’s GitHub repository).

Example 12-7. The VirtualService from Example 12-5 updated to send 5% of traffic to httpbin’s v2 subset
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: httpbin
spec:
  hosts:
  - "*"
  gateways:
  - httpbin-gateway
  - mesh # Also direct traffic in the mesh with the same VirtualService
  http:
  - route:
    - destination:
        host: httpbin
        subset: v1
        port:
          number: 8000
      weight: 95
    - destination:
        host: httpbin
        subset: v2
        port:
          number: 8000
      weight: 5

We can continue this process by incrementally increasing the weight for subset: v2 and decreasing the weight for subset: v1 in step. Keep in mind that all weights must add up to 100(%) and that you can have as many subsets receiving traffic at a time as you’d like (not just the two we use in this example).
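For example, a later step in the ramp changes nothing but the two weight values in the same VirtualService (everything outside the http routes stays as it was in Example 12-7):

  http:
  - route:
    - destination:
        host: httpbin
        subset: v1
        port:
          number: 8000
      weight: 50
    - destination:
        host: httpbin
        subset: v2
        port:
          number: 8000
      weight: 50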

After we have rolled out the new deployment, we have a choice to make: either we can clean up the DestinationRule, removing our (now-unused) subsets, or we can leave the old ones in place. We’d recommend a middle ground: keep your DestinationRules and VirtualServices fixed for dealing with two subsets, the current and the next. In our example, we can keep subsets v1 and v2 around until it’s time to roll out v3. Then, we can replace v1 with v3 and repeat our entire rollout procedure, migrating incrementally from v2 to v3. At that point we have configurations (a DestinationRule and VirtualService) for subsets v2 and v3 in hand, and when it’s time for v4, we replace v2 and canary from v3 to v4 as before, and so on. This has the side effect of greatly lowering the amount of configuration that needs to be changed in an emergency when you need to roll back to the previously known good deployment: we already have the configuration for that deployment, and all we need to do is redeploy the binary and change the weights in our VirtualService.

Cross-Cluster Deployments

As we discussed in Chapter 8, techniques for routing traffic aren’t restricted for use only with services within the same cluster. Every company making a serious investment in Kubernetes must deal with the realities of managing and deploying into multiple clusters. Multiple clusters are commonly used to create multiple, isolated failure domains. If you’re using clusters to create failure domains, it’s important to be able to shift traffic across your clusters so that you can route around failures at runtime. Using the same traffic-splitting techniques we use for canarying, we can also incrementally (or all at once) force traffic from one cluster to another. Istio supports this as a first-class use case. (See Chapter 13 for more details.) To quickly highlight how this works, suppose that we have a remote cluster with an ingress IP address of 1.2.3.4 that also hosts the httpbin service. In our first cluster with the httpbin service, shown in Example 12-8, we can create a new Istio ServiceEntry pointing at the ingress of that new cluster (see httpbin-cross-cluster-svcentry.yaml on this book’s GitHub repository).

Example 12-8. A ServiceEntry for httpbin.remote.global
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: httpbin-remote
spec:
  hosts:
  - httpbin.remote.global # the .global suffix is used by Istio's DNS plug-in
  location: MESH_INTERNAL # make sure we use mTLS
  ports:
  - name: http
    number: 8000
    protocol: http
  resolution: DNS
  addresses:
  # Does not need to be routable, but needs to be unique for each service you're
  # routing across clusters; used by Istio's DNS plug-in
  - 127.255.0.2
  endpoints:
  - address: 1.2.3.4 # address of our remote cluster's ingress
    ports:
      # Do not change this port value if you're using the Istio multicluster
      # installation
      http: 15443

In our local cluster, we can update our VirtualService to force traffic directed to httpbin over to the remote cluster, as demonstrated in Example 12-9 (see httpbin-cross-cluster-vs.yaml on this book’s GitHub repository).

Example 12-9. Updated version of Example 12-5
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: httpbin
spec:
  hosts:
  - "*"
  gateways:
  - httpbin-gateway
  - mesh # Also direct traffic in the mesh with the same VirtualService
  http:
  - route:
    - destination:
        host: httpbin.remote.global
        port:
          number: 8000

This will route traffic, both at ingress into our local cluster and for traffic inside the mesh in the local cluster, out to the remote cluster when trying to contact httpbin. Because our ServiceEntry declares the endpoint as MESH_INTERNAL, we’re guaranteed that mTLS will be used end to end in the communication across clusters, so there’s no need to set up VPN connectivity between the clusters; we can route over the internet, if needed.

In this chapter, we took a whirlwind tour of control-plane component failure modes as well as the effects of those failures on the service mesh, from sibling control-plane components to data-plane service proxies and the workloads to which they are sidecarred. We walked through a case study of safely deploying a new version of an existing service so that we had a high level of control over how users accessed the new version, including retaining the ability to roll back user traffic to the old version if needed. Finally, we looked at a very brief example of using the same traffic-routing primitives we used for canaries to control failover across clusters. Istio shines in this area, enabling fairly low-effort active/passive and active/active deployments.
