Chapter 11. Running Kubernetes on Multiple Clouds and Cluster Federation

In this chapter, we'll take it to the next level: running on multiple clouds and cluster federation. A Kubernetes cluster is a closely-knit unit where all the components run in relative proximity and are connected by a fast network (typically a physical data center or a cloud provider's availability zone). This is great for many use cases, but there are several important use cases where systems need to scale beyond a single cluster. Kubernetes federation is a methodical way to combine multiple Kubernetes clusters and interact with them as a single entity. The topics we will cover include the following:

  • A deep dive into what cluster federation is all about
  • How to prepare, configure, and manage a cluster federation
  • How to run a federated workload across multiple clusters

Understanding cluster federation

Cluster federation is conceptually simple. You aggregate multiple Kubernetes clusters and treat them as a single logical cluster. There is a federation control plane that presents to clients a single unified view of the system.

The following diagram demonstrates the big picture of Kubernetes cluster federation:

[Diagram: the big picture of Kubernetes cluster federation]

The federation control plane consists of a federation API server and a federation controller manager that collaborate. The federation API server forwards requests to all the clusters in the federation. In addition, the federation controller manager performs the duties of the controller manager across all clusters by routing the necessary changes to the individual federation member clusters. In practice, cluster federation is not trivial and can't be totally abstracted away. Cross-pod communication and data transfer may suddenly incur massive latency and cost overheads. Let's look at the use cases for cluster federation first, understand how the federated components and resources work, and then examine the hard parts: location affinity, cross-cluster scheduling, and federated data access.

Important use cases for cluster federation

There are four categories of use cases that benefit from cluster federation.

Capacity overflow

Public cloud platforms such as AWS, GCE, and Azure are great and provide many benefits, but they are not cheap. Many large organizations have invested a lot in their own data centers. Other organizations work with private service providers such as OVH, Rackspace, or DigitalOcean. If you have the operational capacity to manage and operate infrastructure on your own, it makes a lot of economic sense to run your Kubernetes cluster on your own infrastructure rather than in the cloud. But what if some of your workloads fluctuate and, for a relatively short amount of time, require a lot more capacity?

For example, your system may be hit especially hard on weekends or during holidays. The traditional approach is to just provision extra capacity, but in many dynamic situations that is not easy. With capacity overflow, you can run the bulk of your work in a Kubernetes cluster running in an on-premises data center or with a private service provider, and have a secondary cloud-based Kubernetes cluster running on one of the big platform providers. Most of the time, the cloud-based cluster will be shut down (stopped instances), but when the need arises you can elastically add capacity to your system by starting some of the stopped instances. Kubernetes cluster federation can make this configuration relatively straightforward. It eliminates a lot of headaches about capacity planning and paying for hardware that's not used most of the time.

This approach is sometimes called Cloud bursting.

Sensitive workloads

This is almost the opposite of capacity overflow. Maybe you've embraced the cloud native lifestyle and your entire system runs on the cloud, but some data or workloads deal with sensitive information. Regulatory compliance or your organization's security policies may dictate that those data and workloads must run in an environment that's fully controlled by you. Your sensitive data and workloads may be subject to external auditing. It may be critical to ensure no information ever leaks from the private Kubernetes cluster to the cloud-based Kubernetes cluster. But it may be desirable to have visibility into the public cluster and the ability to launch non-sensitive workloads from the private cluster to the cloud-based cluster. If the nature of a workload can change dynamically from non-sensitive to sensitive, it needs to be addressed with a proper policy and implementation. For example, you may prevent workloads from changing their nature. Alternatively, you may migrate a workload that suddenly became sensitive and ensure that it doesn't run on the cloud-based cluster anymore. Another important instance is national compliance, where certain data is required by law to remain in, and be accessed only from, a designated geographical region (typically a country). In this case, a cluster must be created in that geographical region.

Avoiding vendor lock-in

Large organizations often prefer to have options and not be tied to a single provider. The risk is often too great, because the provider may shut down or be unable to provide the same level of service. Having multiple providers is often good for negotiating prices, too. Kubernetes is designed to be vendor-agnostic. You can run it on different cloud platforms, private service providers, and on-premise data centers.

However, this is not trivial. If you want to be sure that you can switch providers quickly or shift some workloads from one provider to the next, you should already be running your system on multiple providers. You can do it yourself, or use one of the companies that provide the service of running Kubernetes transparently on multiple providers. Since different providers run different data centers, you automatically get some redundancy and protection from vendor-wide outages.

Geo-distributing high availability

High availability means that a service will remain available to users even when some parts of the system fail. In the context of a federated Kubernetes cluster, the scope of failure is an entire cluster, which is typically due to problems with the physical data center hosting the cluster, or perhaps a wider issue with the platform provider. The key to high availability is redundancy. Geo-distributed redundancy means having multiple clusters running in different locations. It may be different availability zones of the same cloud provider, different regions of the same cloud provider, or even different cloud providers altogether (see the Avoiding vendor lock-in section). There are many issues to address when it comes to running a cluster federation with redundancy. We'll discuss some of these issues later. Assuming that the technical and organizational issues have been resolved, high availability will allow the switching of traffic from a failed cluster to another cluster. This should be transparent to the users up to a point (there will be a delay during switchover, and some in-flight requests or tasks may disappear or fail). The system administrators may need to take extra steps to support the switchover and to deal with the original cluster failure.

The federation control plane

The federation control plane consists of two components that together enable a federation of Kubernetes clusters to appear and function as a single unified Kubernetes cluster.

Federation API server

The federation API server manages the Kubernetes clusters that together comprise the federation. It stores the federation state in an etcd database, just like a regular Kubernetes cluster, but the state it keeps is only which clusters are members of the federation. The state of each cluster is stored in that cluster's own etcd database. The main job of the federation API server is to interact with the federation controller manager and route requests to the federation member clusters. The federation members don't need to know they are part of a federation: they just work the same.

The following diagram demonstrates the relationships between the federation API server, the federation replication controllers, and the Kubernetes clusters in the federation:

[Diagram: the federation API server, federation replication controllers, and the member clusters]

Federation controller manager

The federation controller manager makes sure the federation's desired state matches its actual state. It forwards any necessary changes to the relevant cluster or clusters. The federated controller manager binary contains multiple controllers for all the different federated resources we'll cover later in the chapter. The control logic is similar, though: it observes changes and brings the cluster state to the desired state whenever they deviate. This is done for each member of the cluster federation.

The following diagram demonstrates this perpetual control loop:

[Diagram: the federation controller manager control loop]

Federated resources

Kubernetes federation is still a work in progress. As of Kubernetes 1.5, only some of the standard resources can be federated. We'll cover them here. To create a federated resource, you use the --context=federation-cluster command-line argument with kubectl. When you use --context=federation-cluster, the command goes to the federation API server, which takes care of sending it to all the member clusters.

Federated ConfigMap

Federated ConfigMaps are very useful because they help centralize the configuration of applications that may be spread across multiple clusters.

Creating a federated ConfigMap

Here is an example of creating a federated ConfigMap:

> kubectl --context=federation-cluster create -f configmap.yaml

As you can see, the only difference from creating a ConfigMap in a single Kubernetes cluster is the context.
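For reference, the configmap.yaml file referenced above is just an ordinary ConfigMap manifest. Here is a minimal sketch, in which the name and data keys are placeholders rather than anything prescribed by federation:

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config           # hypothetical name
data:
  log-level: info
  cache-size: "128"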

When a federated ConfigMap is created, it is stored in the control plane etcd database, but a copy is also stored in each member cluster. This way, each cluster can operate independently and doesn't need to access the control plane.

Viewing a federated ConfigMap

You can view a federated ConfigMap by accessing the control plane or by accessing a member cluster. To access the ConfigMap copy in a member cluster, specify the federation cluster member name in the context:

> kubectl --context=cluster-1 get configmap <configmap-name>

Updating a federated ConfigMap

It's important to note that, when created through the control plane, the ConfigMap will be identical across all member clusters. However, since it is stored separately in each cluster in addition to the control plane cluster, there is no single source of truth. It is possible (although not recommended) to later modify the ConfigMap of each member cluster independently. That leads to non-uniform configuration across the federation. There are valid use cases for different configurations for different clusters in the federation, but in those cases I suggest just configuring each cluster directly. When you create a federated ConfigMap, you make a statement that all the clusters should share this configuration. Usually, you will want to update the ConfigMap across all the federation clusters by specifying --context=federation-cluster.

Deleting a federated ConfigMap

That's right, you guessed it. You delete as usual, but specify the context:

> kubectl --context=federation-cluster delete configmap <configmap-name>

There is just one little twist. As of Kubernetes 1.5, when you delete a federated ConfigMap, the individual ConfigMaps that were created automatically in each cluster remain. You must delete them separately in each cluster. That is, if you have three clusters in your federation called cluster-1, cluster-2, and cluster-3, you'll have to run these extra three commands to get rid of the ConfigMap across the federation:

> kubectl --context=cluster-1 delete configmap <configmap-name>
> kubectl --context=cluster-2 delete configmap <configmap-name>
> kubectl --context=cluster-3 delete configmap <configmap-name>

This will be rectified in the future.

Federated DaemonSet

A federated DaemonSet is pretty much the same as a regular Kubernetes DaemonSet. You create it and interact with it via the control plane, and the control plane propagates it to all the member clusters. At the end of the day, you can be sure that your Daemons run on every node in every cluster of the federation.
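For illustration, here is a minimal sketch of a DaemonSet manifest you could create through the federation context. The name, labels, and image are placeholders, and at the time of writing (Kubernetes 1.5) DaemonSets are served from the extensions/v1beta1 API group:

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: node-logger          # hypothetical name
spec:
  template:
    metadata:
      labels:
        app: node-logger
    spec:
      containers:
      - name: logger
        image: busybox       # placeholder image; a real log shipper would go here
        command: ["sh", "-c", "while true; do echo heartbeat; sleep 60; done"]

You would create it with kubectl --context=federation-cluster create -f daemonset.yaml, where daemonset.yaml is whatever you named the file.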

Federated deployment

Federated deployments are a little smarter. When you create a federated deployment with X replicas and you have N clusters, the replicas will be distributed evenly between the clusters by default. If you have three clusters and the federated deployment specifies 15 pods, then each cluster will run five replicas. As with other federated resources, the control plane will store the federated deployment with 15 replicas and then create three deployments (one for each cluster) with five replicas each. You can control the number of replicas per cluster by adding an annotation: federation.kubernetes.io/deployment-preferences. As of Kubernetes 1.5, federated deployment is still in alpha. In the future, the annotation will become a proper field in the federated deployment configuration.
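To give a flavor of that annotation, here is a sketch of a federated deployment that spreads 15 replicas unevenly. The names are placeholders, and since the feature is alpha, the exact JSON schema of the preferences (which mirrors the ReplicaSet preferences structure shown later in this chapter) is worth verifying against your federation version:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: web-frontend         # hypothetical name
  annotations:
    federation.kubernetes.io/deployment-preferences: |
      {
        "rebalance": true,
        "clusters": {
          "cluster-1": {"minReplicas": 5, "weight": 2},
          "*": {"weight": 1}
        }
      }
spec:
  replicas: 15
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      containers:
      - name: web
        image: nginx

With preferences like these, cluster-1 would be guaranteed at least five replicas and would receive roughly twice the share of the remaining replicas compared to each other cluster.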

Federated events

Federated events are different from the other federated resources. They are only stored in the control plane and are not propagated to the underlying Kubernetes member clusters.

You can query the federation events with --context=federation-cluster as usual:

> kubectl --context=federation-cluster get events

Federated ingress

The federated ingress does more than just create matching ingress objects in each cluster. One of the main features of federated ingress is that if a whole cluster goes down it can direct traffic to other clusters. As of Kubernetes 1.4, federated ingress is supported on Google Cloud Platform, both on GKE and GCE. In the future, hybrid cloud support for federated ingress will be added.

The federated ingress performs the following roles:

  • Create Kubernetes ingress objects in each cluster member of the federation
  • Provide a one-stop logical L7 load balancer with a single IP address for all the cluster ingress objects
  • Monitor the health and capacity of the service backend pods behind the ingress object in each cluster
  • Make sure to route client connections to a healthy service endpoint in the face of various failures, such as pod, cluster, availability zone, or whole-region failures, as long as there is one healthy cluster in the federation

Creating a federated ingress

You create a federated ingress by addressing the federation control plane:

> kubectl --context=federation-cluster create -f ingress.yaml

The federation control plane will create the corresponding ingress in each cluster. All the clusters will share the same namespace and name for the ingress object:

> kubectl --context=cluster-1 get ingress myingress
NAME        HOSTS     ADDRESS           PORTS     AGE
myingress   *         157.231.15.33    80, 443   1m
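For reference, a minimal ingress.yaml sketch might look like the following. The backing service name and port are placeholders, and at the time of writing the Ingress resource is served from the extensions/v1beta1 API group:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: myingress
spec:
  backend:
    serviceName: myservice   # hypothetical federated service
    servicePort: 80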

Request routing with a federated ingress

The federated ingress controller will route requests to the closest cluster. Ingress objects expose one or more IP addresses (via the Status.Loadbalancer.Ingress field) that remain static for the lifetime of the ingress object. When an internal or external client connects to an IP address of a cluster-specific ingress object, it will be routed to one of the pods in that cluster. However, when a client connects to the IP address of a federated ingress object it will be automatically routed, via the shortest network path, to a healthy pod in the closest cluster to the origin of the request. So, for example, HTTP(S) requests from Internet users in Europe will be routed directly to the closest cluster in Europe that has available capacity. If there are no such clusters in Europe, the request will be routed to the next closest cluster (often in the US).

Handling failures with federated ingress

There are two broad categories of failure:

  • Pod failure
  • Cluster failure

Pods might fail for many reasons. In a properly configured Kubernetes cluster (whether a cluster federation member or not), pods will be managed by services and ReplicaSets that can automatically handle pod failures. It shouldn't impact the cross-cluster routing and load balancing done by the federated ingress. A whole cluster might fail due to problems with the data center or global connectivity. In this case, the federated services and federated ReplicaSets will ensure that the other clusters in the federation run enough pods to handle the workload, and the federated ingress will take care of routing client requests away from the failed cluster. To benefit from this auto-healing capability, clients must always connect to the federated ingress object and not to individual cluster members.

Federated namespace

Kubernetes namespaces are used within a cluster to isolate independent areas and support multi-tenant deployments. Federated namespaces provide the same capabilities across a cluster federation. The API is identical. When a client is accessing the federation control plane, they will only get access to the namespaces they requested and are authorized to access across all the clusters in the federation.

You use the same commands and add --context=federation-cluster:

> kubectl --context=federation-cluster create -f namespace.yaml
> kubectl --context=cluster-1 get namespaces namespace
> kubectl --context=federation-cluster delete namespace namespace
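The namespace.yaml file itself is as simple as it gets. Here is a sketch that uses the same name as the commands above:

apiVersion: v1
kind: Namespace
metadata:
  name: namespace            # matches the name used in the commands above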

Federated ReplicaSet

It is best to use deployments and federated deployments to manage the replicas in your cluster or federation. However, if for some reason you prefer to work directly with ReplicaSets, then Kubernetes supports a federated ReplicaSet. There is no federated replication controller because ReplicaSets supersede replication controllers.

When you create a federated ReplicaSet, the job of the control plane is to ensure that the number of replicas across the clusters matches your federated ReplicaSet configuration. The control plane will create a regular ReplicaSet in each federation member. Each cluster will get, by default, an equal (or as close to equal as possible) number of replicas, so that the total adds up to the specified number of replicas.

You can control the number of replicas per cluster by using the following annotation: federation.kubernetes.io/replica-set-preferences.

The corresponding data structure is as follows:

type FederatedReplicaSetPreferences struct {
  Rebalance bool
  Clusters map[string]ClusterReplicaSetPreferences
}

If Rebalance is true, then running replicas may be moved between clusters as necessary. The clusters map determines the ReplicaSet preferences per cluster. If * is specified as the key, then all unspecified clusters will use that set of preferences. If there is no * entry, then replicas will only run on clusters that show up in the map. Clusters that belong to the federation but don't have an entry will not have pods scheduled (for that pod template).

The individual ReplicaSet preferences per cluster are specified using the following data structure:

type ClusterReplicaSetPreferences struct {
  MinReplicas int64
  MaxReplicas *int64
  Weight int64
}

MinReplicas is 0 by default. MaxReplicas is unbounded by default. Weight expresses the preference to add an additional replica to this cluster's ReplicaSet and defaults to 0.
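To tie this together, here is a sketch of a federated ReplicaSet that uses the annotation. The resource name, cluster name, and image are placeholders, and because the feature is alpha, the exact JSON field names are worth double-checking against your federation version:

apiVersion: extensions/v1beta1
kind: ReplicaSet
metadata:
  name: backend-worker       # hypothetical name
  annotations:
    federation.kubernetes.io/replica-set-preferences: |
      {
        "rebalance": true,
        "clusters": {
          "cluster-1": {"minReplicas": 2, "maxReplicas": 6, "weight": 2},
          "*": {"weight": 1}
        }
      }
spec:
  replicas: 9
  template:
    metadata:
      labels:
        app: backend-worker
    spec:
      containers:
      - name: worker
        image: busybox       # placeholder image
        command: ["sh", "-c", "sleep 3600"]

With these preferences, cluster-1 gets at least two replicas (but no more than six), and the rest of the nine replicas are spread across the other clusters according to their weights, with rebalancing allowed.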

Federated secrets

Federated secrets are simple. When you create a federated secret as usual through the control plane, it gets propagated to all the member clusters. That's it.
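For completeness, here is a sketch of a secret.yaml (a hypothetical filename) that you could create through the control plane. The name, key, and base64-encoded value are placeholders:

apiVersion: v1
kind: Secret
metadata:
  name: db-credentials       # hypothetical name
type: Opaque
data:
  password: c2VjcmV0         # base64-encoded "secret"

> kubectl --context=federation-cluster create -f secret.yaml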

The hard parts

So far, federation seems almost straightforward. You group a bunch of clusters together, access them through the control plane, and everything just gets replicated to all the clusters. But there are hard problems and fundamental factors that complicate this simplified view. Much of the power of Kubernetes is derived from its ability to do a lot of work behind the scenes. Within a single cluster deployed fully in a single physical data center or availability zone, where all the components are connected by a fast network, Kubernetes is very effective on its own. In a Kubernetes cluster federation, the situation is different. Latency, data transfer costs, and moving pods between clusters all have different trade-offs. Depending on the use case, making federation work may require extra attention, planning, and maintenance on the part of the system designers and operators. In addition, some of the federated resources are not as mature as their local counterparts, and that adds more uncertainty.

Federated unit of work

The unit of work in a Kubernetes cluster is the pod. You can't break a pod apart in Kubernetes. The entire pod will always be deployed together and be subject to the same lifecycle treatment. Should the pod remain the unit of work for a cluster federation? Maybe it makes more sense to be able to associate a bigger unit, such as a whole ReplicaSet, deployment, or service, with a specific cluster. If the cluster fails, the entire ReplicaSet, deployment, or service is scheduled to a different cluster. How about a collection of tightly coupled ReplicaSets? The answers to these questions are not always easy and may even change dynamically as the system evolves.

Location affinity

Location affinity is a major concern. When can pods be distributed across clusters? What are the relationships between those pods? Are there any requirements for affinity between pods or pods and other resources, such as storage? There are several major categories:

  • Strictly-coupled
  • Loosely-coupled
  • Preferentially-coupled
  • Strictly-decoupled
  • Uniformly-spread

When designing the system and deciding how to allocate and schedule services and pods across the federation, it's important to make sure the location affinity requirements are always respected.

Strictly-coupled

The strictly-coupled requirement applies to applications where the pods must be in the same cluster. If you partition the pods, the application will fail (perhaps due to real-time requirements that can't be met when networking across clusters) or the cost may be too high (pods accessing a lot of local data). The only way to move such tightly coupled applications to another cluster is to start a complete copy (including data) on another cluster and then shut down the application on the current cluster. If the data is too large, the application may practically be immovable and sensitive to catastrophic failure. This is the most difficult situation to deal with, and if possible you should architect your system to avoid the strictly-coupled requirement.

Loosely-coupled

Loosely-coupled applications are best when the workload is embarrassingly parallel and each pod doesn't need to know about the other pods or access a lot of data. In these situations, pods can be scheduled to clusters just based on capacity and resource utilization across the federation. If necessary, pods can be moved from one cluster to another without problems. A good example is a stateless validation service that performs some calculation, gets all its input in the request itself, and doesn't query or write any federation-wide data. It just validates its input and returns a valid/invalid verdict to the caller.

Preferentially-coupled

Preferentially-coupled applications perform better when all the pods are in the same cluster, or when the pods and the data are co-located, but it is not a hard requirement. For example, this can work for applications that require only eventual consistency, where some federation-wide process periodically synchronizes the application state across all clusters. In these cases, allocation is done explicitly to one cluster, but leaves a safety hatch for running or migrating to other clusters under stress.

Strictly-decoupled

Some services have fault isolation or high availability requirements that force partitioning across clusters. There is no point running three replicas of a critical service if all replicas might end up scheduled to the same cluster, because that cluster just becomes an ad hoc Single Point Of Failure (SPOF).

Uniformly-spread

Uniformly-spread is when an instance of a service, ReplicaSet, or pod must run on each cluster. It is similar to a DaemonSet, but instead of ensuring there is one instance on each node, it's one per cluster. A good example is a Redis cache backed by some external persistent storage. The pods in each cluster should have their own cluster-local Redis cache to avoid accessing the central storage, which may be slower or become a bottleneck. On the other hand, there is no need for more than one Redis service per cluster (it could be distributed across several pods in the same cluster).

Cross-cluster scheduling

Cross-cluster scheduling goes hand in hand with location affinity. When a new pod is created, or an existing pod fails and a replacement needs to be scheduled, where should it go? The current cluster federation doesn't handle all the scenarios and options for location affinity we mentioned earlier. At this point, cluster federation handles the loosely-coupled (including weighted distribution) and uniformly-spread (by making sure the number of replicas matches the number of clusters) categories well. Anything else will require that you don't rely on cluster federation alone; you'll have to add your own custom federation layer that takes more specialized concerns into account and can accommodate more intricate scheduling use cases.

Federated data access

This is a tough problem. If you have a lot of data and pods running in multiple clusters (possibly on different continents) and need to access it quickly, then you have several unpleasant options:

  • Replicate your data to each cluster (slow to replicate, expensive to transfer, expensive to store, and complicated to sync and deal with errors)
  • Access the data remotely (slow to access, expensive on each access, can be a SPOF)
  • Sophisticated hybrid solution with per-cluster caching of some of the hottest data (complicated, stale data, and you still need to transfer a lot of data)

Federated auto-scaling

There is currently no support for federated auto-scaling. There are two dimensions of scaling that can be utilized, as well as a combination of the two:

  • Per cluster scaling
  • Adding/removing clusters from the federation
  • Hybrid approach

Consider the relatively simple scenario of a loosely coupled application running on three clusters with five pods in each cluster. At some point, 15 pods can't handle the load anymore and we need to add more capacity. We can increase the number of pods per cluster, but if we do it at the federation level then we will have six pods running in each cluster. We've increased the federation's capacity by three pods, when only one extra pod is needed. Of course, if you have more clusters the problem gets worse. Another option is to pick a cluster and just change its capacity. This is possible with annotations, but now we're explicitly managing capacity across the federation. It can get complicated very quickly if we have lots of clusters running hundreds of services with dynamically changing requirements.

Adding a whole new cluster is even more complicated. Where should we add the new cluster? There is no requirement for extra availability that can guide the decision; it is just about extra capacity. Creating a new cluster also often requires complicated first-time setup, and it may take days to approve various quotas on public cloud platforms. The hybrid approach increases the capacity of existing clusters in the federation until some threshold is reached and then starts adding new clusters. The benefit of this approach is that when you're getting closer to the per-cluster capacity limit, you start preparing new clusters that will be ready to go when necessary. Other than that, it requires a lot of effort, and you pay in increased complexity for the flexibility and scalability.
