In this chapter, we’ll take it to the next level and consider options for running Kubernetes and deploying workloads on multiple clouds and multiple clusters. Since a single Kubernetes cluster has limits, once you exceed these limits you must run multiple clusters. A typical Kubernetes cluster is a closely-knit unit where all the components run in relative proximity and are connected by a fast network (typically, a physical data center or cloud provider availability zone). This is great for many use cases, but there are several important use cases where systems need to scale beyond a single cluster or a cluster needs to be stretched across multiple availability zones.
This is a very active area in Kubernetes these days. In the previous edition of the book, this chapter covered Kubernetes Federation and Gardener. Since then, the Kubernetes Federation project was abandoned. There are now many projects that provide different flavors of multi-cluster solutions, such as direct management, Virtual Kubelet solutions, and the gardener.cloud project, which is pretty unique.
The topics we will cover include the following:

- Stretched Kubernetes clusters versus multi-cluster Kubernetes
- The history of Kubernetes Cluster Federation
- The Cluster API
- Karmada, Clusternet, and Clusterpedia
- Open Cluster Management
- Virtual Kubelet-based projects: tensile-kube, Admiralty, and Liqo
- The Gardener project
There are several reasons to run multiple Kubernetes clusters: high availability across geographical locations, exceeding the scale limits of a single cluster, and isolation between tenants or environments. For the first reason, it is possible to use a stretched cluster; for the other reasons, you must run multiple clusters.
A stretched cluster (AKA a wide cluster) is a single Kubernetes cluster where the control plane nodes and the worker nodes are provisioned across multiple geographical availability zones or regions. Cloud providers offer this model for highly available managed Kubernetes clusters.
There are several benefits to the stretched cluster model: you keep managing a single cluster while gaining resilience to zone failures. However, the stretched model has its downsides too: etcd and other control plane components are sensitive to the higher network latency between zones, and cross-zone traffic adds cost.
In short, it’s good to have the option for stretched clusters, but be prepared to switch to the multi-cluster model if some of the downsides are unacceptable.
Multi-cluster Kubernetes means provisioning multiple Kubernetes clusters. Large-scale systems often can’t be deployed on a single cluster for various reasons mentioned earlier. That means you need to provision multiple Kubernetes clusters and then figure out how to deploy your workloads on all these clusters and how to handle various use cases, such as some clusters being unavailable or having degraded performance. There are many more degrees of freedom.
The multi-cluster model has its own benefits: better fault isolation, practically unlimited scale, and the freedom to mix clouds, regions, and environments. However, there are some non-trivial downsides to the multi-cluster model too: you now need to solve cluster provisioning, workload placement, networking, and identity across many clusters.
There are solutions out there for some of these problems, but at this point in time, there is no clear winner you can just adopt and easily configure for your needs. Instead, you will need to adapt existing solutions to the specific issues your organization's multi-cluster structure raises.
In the previous editions of the book, we discussed Kubernetes Cluster Federation as a solution to managing multiple Kubernetes clusters as a single conceptual cluster. Unfortunately, this project has been inactive since 2019, and the Kubernetes multi-cluster Special Interest Group (SIG) is considering archiving it. Before we describe more modern approaches, let’s get some historical context. It’s funny to talk about the history of a project like Kubernetes that didn’t even exist before 2014, but the pace of development and the large number of contributors took Kubernetes through an accelerated evolution. This is especially relevant for Kubernetes Federation.
In March 2015, the first revision of the Kubernetes Cluster Federation proposal was published. It was fondly nicknamed “Ubernetes” back then. The basic idea was to reuse the existing Kubernetes APIs to manage multiple clusters. This proposal, now called Federation V1, went through several rounds of revision and implementation but never reached general availability, and the main repo has been retired: https://github.com/kubernetes-retired/federation.
The SIG multi-cluster workgroup realized that the multi-cluster problem is more complicated than initially perceived. There are many ways to skin this particular cat and there is no one-size-fits-all solution. The new direction for cluster federation was to use dedicated APIs for federation. A new project and a set of tools were created and implemented as Kubernetes Federation V2: https://github.com/kubernetes-sigs/kubefed.
Unfortunately, this didn’t take off either, and the consensus of the multi-cluster SIG is that since the project is not being maintained, it needs to be archived.
See the notes for the meeting from 2022-08-09: https://tinyurl.com/sig-multicluster-notes.
There are a lot of projects out there moving fast to try to solve the multi-cluster problem, and they all operate at different levels. Let’s look at some of the prominent ones. The goal here is just to introduce these projects and what makes them unique. It is beyond the scope of this chapter to fully explore each one. However, we will dive deeper into one of the projects – the Cluster API – in Chapter 17, Running Kubernetes in Production.
The Cluster API (AKA CAPI) is a project from the Cluster Lifecycle SIG. Its goal is to make provisioning, upgrading, and operating multiple Kubernetes clusters easy. It supports both kubeadm-based clusters as well as managed clusters via dedicated providers. It has a cool logo inspired by the famous “It’s turtles all the way down” story. The idea is that the Cluster API uses Kubernetes to manage Kubernetes clusters.
Figure 11.1: The Cluster API logo
The Cluster API has a very clean and extensible architecture. The primary components are:

- The management cluster
- Work clusters
- The bootstrap provider
- The infrastructure provider
- The control plane
- Custom resources (CRDs)
Figure 11.2: Cluster API architecture
Let’s understand the role of each one of these components and how they interact with each other.
The management cluster is a Kubernetes cluster that is responsible for managing other Kubernetes clusters (work clusters). It runs the Cluster API control plane and providers, and it hosts the Cluster API custom resources that represent the other clusters.
The clusterctl command-line tool is used to work with the management cluster. It has a lot of commands and options; if you want to experiment with the Cluster API through its CLI, visit https://cluster-api.sigs.k8s.io/clusterctl/overview.html.
A work cluster is just a regular Kubernetes cluster. These are the clusters that developers use to deploy their workloads. The work clusters don’t need to be aware that they are managed by the Cluster API.
When CAPI creates a new Kubernetes cluster, it needs to generate certificates, initialize the work cluster’s control plane, and, finally, join the worker nodes. This is the job of the bootstrap provider. It ensures all the requirements are met and eventually joins the worker nodes to the control plane.
The infrastructure provider is a pluggable component that allows CAPI to work in different infrastructure environments, such as cloud providers or bare-metal infrastructure providers. The infrastructure provider implements a set of interfaces as defined by CAPI to provide access to compute and network resources.
Check out the current providers’ list here: https://cluster-api.sigs.k8s.io/reference/providers.html.
The control plane of a Kubernetes cluster consists of the API server, the etcd state store, the scheduler, and the controllers that run the control loops to reconcile the resources in the cluster. The control plane of the work clusters can be provisioned in various ways. CAPI supports the following modes:

- Machine-based, where the control plane components run on dedicated machines provisioned by the infrastructure provider
- Pod-based, where the control plane components are deployed in the management cluster as Deployments and StatefulSets, and the API server is exposed as a Service
- External, where the control plane is provisioned and managed by an external provider
The custom resources represent the Kubernetes clusters and machines managed by CAPI as well as additional auxiliary resources. There are a lot of custom resources, and some of them are still considered experimental. The primary CRDs are:

- Cluster
- ControlPlane (represents control plane machines)
- MachineSet (represents worker machines)
- MachineDeployment
- Machine
- MachineHealthCheck
Some of these generic resources have references to corresponding resources offered by the infrastructure provider.
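To make the generic-to-provider reference concrete, here is a hedged sketch of a Cluster resource, loosely based on the CAPI quick start. The names and the Docker provider kinds are illustrative assumptions, and exact API versions vary between CAPI releases:

```yaml
# Illustrative sketch only: resource names and the Docker provider kinds
# are assumptions; check the CAPI release you are running.
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: my-cluster
  namespace: default
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["192.168.0.0/16"]
  controlPlaneRef:        # generic resource -> control plane provider resource
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: my-cluster-control-plane
  infrastructureRef:      # generic resource -> infrastructure provider resource
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: DockerCluster
    name: my-cluster
```

Note how the generic Cluster resource stays provider-agnostic and delegates everything infrastructure-specific to the referenced resources.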
The following diagram illustrates the relationships between the control plane resources that represent the clusters and machine sets:
Figure 11.3: Cluster API control plane resources
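For example, a MachineDeployment that manages a set of worker machines might look roughly like this. This is a sketch modeled on the CAPI quick start; the template names, labels, and Kubernetes version are assumptions:

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: my-cluster-md-0
spec:
  clusterName: my-cluster
  replicas: 3             # CAPI creates a MachineSet that maintains 3 Machines
  selector:
    matchLabels:
      cluster.x-k8s.io/cluster-name: my-cluster
  template:
    metadata:
      labels:
        cluster.x-k8s.io/cluster-name: my-cluster
    spec:
      clusterName: my-cluster
      version: v1.27.3    # desired Kubernetes version for the workers
      bootstrap:
        configRef:        # bootstrap provider resource
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfigTemplate
          name: my-cluster-md-0
      infrastructureRef:  # infrastructure provider resource
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: DockerMachineTemplate
        name: my-cluster-md-0
```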
CAPI also has an additional set of experimental resources that represent a managed cloud provider environment:

- MachinePool
- ClusterResourceSet
- ClusterClass
See https://github.com/kubernetes-sigs/cluster-api for more details.
Karmada is a CNCF sandbox project that focuses on deploying and running workloads across multiple Kubernetes clusters. Its claim to fame is that you don’t need to make changes to your application configuration. While CAPI was focused on the lifecycle management of clusters, Karmada picks up when you already have a set of Kubernetes clusters and you want to deploy workloads across all of them. Conceptually, Karmada is a modern take on the abandoned Kubernetes Federation project.
It can work with Kubernetes in the cloud, on-prem, and on the edge.
See https://github.com/karmada-io/karmada.
Let’s look at Karmada’s architecture.
Karmada is heavily inspired by Kubernetes. It provides a multi-cluster control plane with components similar to those of the Kubernetes control plane: an API server, a controller manager, a scheduler, and an etcd store. If you understand how Kubernetes works, then it is pretty easy to understand how Karmada extends the same model to multiple clusters.
The following diagram illustrates the Karmada architecture:
Figure 11.4: Karmada architecture
Karmada is centered around several concepts implemented as Kubernetes CRDs. You define and update your applications and services using these concepts and Karmada ensures that your workloads are deployed and run in the right place across your multi-cluster system.
Let’s look at these concepts.
The resource template looks just like a regular Kubernetes resource such as a Deployment or StatefulSet, but it doesn’t actually get deployed to the Karmada control plane. It only serves as a blueprint that will eventually be deployed to member clusters.
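For example, the nginx Deployment referenced by the propagation policy that follows is just a standard Kubernetes Deployment; the labels and image tag here are illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.25  # nothing Karmada-specific; a plain Deployment
```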
The propagation policy determines where a resource template should be deployed. Here is a simple propagation policy that will place the nginx Deployment into two clusters, called member1 and member2:
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: cool-policy
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx
  placement:
    clusterAffinity:
      clusterNames:
        - member1
        - member2
Propagation policies operate across multiple clusters, but sometimes, there are exceptions. The override policy lets you apply fine-grained rules to override existing propagation policies. There are several types of rules:
- ImageOverrider: dedicated to overriding images for workloads
- CommandOverrider: dedicated to overriding commands for workloads
- ArgsOverrider: dedicated to overriding args for workloads
- PlaintextOverrider: a general-purpose tool to override any kind of resource

There is much more to Karmada than we can cover here.
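As a hedged sketch (the cluster and registry names are made up; the field names follow the Karmada docs), an override policy that swaps the image registry for a single member cluster could look like this:

```yaml
apiVersion: policy.karmada.io/v1alpha1
kind: OverridePolicy
metadata:
  name: nginx-override
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx
  overrideRules:
    - targetCluster:
        clusterNames:
          - member1             # only applies in this cluster
      overriders:
        imageOverrider:
          - component: Registry # Registry, Repository, or Tag
            operator: replace
            value: registry.example.com
```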
Check the Karmada documentation for more details: https://karmada.io/docs/.
Clusternet is an interesting project. It is centered around the idea of managing multiple Kubernetes clusters as easily as “visiting the internet” (hence the name “Clusternet”). It supports cloud-based, on-prem, edge, and hybrid clusters. Its core features include multi-cluster management and governance, cross-cluster application coordination, a kubectl plugin, and programmatic access via a client-go wrapper.
The Clusternet architecture is similar to Karmada but simpler. There is a parent cluster that runs the Clusternet hub and Clusternet scheduler. On each child cluster, there is a Clusternet agent. The following diagram illustrates the structure and interactions between the components:
Figure 11.5: Clusternet architecture
The hub has multiple roles. It is responsible for approving cluster registration requests and creating namespaces, service accounts, and RBAC resources for all child clusters. It also serves as an aggregated API server that maintains WebSocket connections to the agent on child clusters. The hub also provides a Kubernetes-like API to proxy requests to each child cluster. Last but not least, the hub coordinates the deployment of applications and their dependencies to multiple clusters from a single set of resources.
The Clusternet scheduler is the component that is responsible for ensuring that resources (called feeds in Clusternet terminology) are deployed and balanced across all the child clusters according to policies called SchedulingStrategy.
The Clusternet agent runs on every child cluster and communicates with the hub. The agent on a child cluster is the equivalent of the kubelet on a node. It has several roles. The agent registers its child cluster with the parent cluster. The agent provides a heartbeat to the hub that includes a lot of information, such as the Kubernetes version, running platform, health, readiness, and liveness of workloads. The agent also sets up the WebSocket connection to the hub on the parent cluster to allow full-duplex communication channels over a single TCP connection.
Clusternet models multi-cluster deployment as subscriptions and feeds. It provides a Subscription custom resource that can be used to deploy a set of resources (called feeds) to multiple clusters (called subscribers) based on different criteria. Here is an example of a Subscription that deploys a Namespace, a Service, and a Deployment to all clusters that have a clusters.clusternet.io/cluster-id label:
# examples/dynamic-dividing-scheduling/subscription.yaml
apiVersion: apps.clusternet.io/v1alpha1
kind: Subscription
metadata:
  name: dynamic-dividing-scheduling-demo
  namespace: default
spec:
  subscribers: # filter out a set of desired clusters
    - clusterAffinity:
        matchExpressions:
          - key: clusters.clusternet.io/cluster-id
            operator: Exists
  schedulingStrategy: Dividing
  dividingScheduling:
    type: Dynamic
    dynamicDividing:
      strategy: Spread # currently we only support Spread dividing strategy
  feeds: # defines all the resources to be deployed with
    - apiVersion: v1
      kind: Namespace
      name: qux
    - apiVersion: v1
      kind: Service
      name: my-nginx-svc
      namespace: qux
    - apiVersion: apps/v1 # with a total of 6 replicas
      kind: Deployment
      name: my-nginx
      namespace: qux
See https://clusternet.io for more details.
Clusterpedia is a CNCF sandbox project. Its central metaphor is Wikipedia for Kubernetes clusters. It has a lot of capabilities around multi-cluster search, filtering, field selection, and sorting. This is unusual because it is a read-only project: it doesn’t help with managing clusters or deploying workloads. It is focused on observing your clusters.
The architecture is similar to other multi-cluster projects. There is a control plane element that runs the Clusterpedia API server and ClusterSynchro manager components. For each observed cluster, there is a dedicated ClusterSynchro component that synchronizes the state of that cluster into the storage layer of Clusterpedia. One of the most interesting aspects of the architecture is the Clusterpedia aggregated API server, which makes all your clusters seem like a single huge logical cluster. Note that the Clusterpedia API server and the ClusterSynchro manager are loosely coupled and don’t interact directly with each other. They just read and write from a shared storage layer.
Figure 11.6: Clusterpedia architecture
Let’s look at each of the components and understand what their purpose is.
The Clusterpedia API server is an aggregated API server. That means that it registers itself with the Kubernetes API server and, in practice, extends the standard Kubernetes API server via custom endpoints. When requests come to the Kubernetes API server, it forwards them to the Clusterpedia API server, which accesses the storage layer to satisfy them. The Kubernetes API server serves as a forwarding layer for the requests that Clusterpedia handles.
This is an advanced aspect of Kubernetes. We will discuss API server aggregation in Chapter 15, Extending Kubernetes.
Clusterpedia observes multiple clusters to provide its search, filter, and aggregation features. One way to implement this would be to query all the observed clusters whenever a request comes in, collect the results, and return them. This approach is very problematic, as some clusters might be slow to respond, and repeated requests would fetch the same information again and again, which is wasteful and costly. Instead, the ClusterSynchro manager continuously synchronizes the state of each observed cluster into Clusterpedia storage, where the Clusterpedia API server can respond quickly.
The storage layer is an abstraction layer that stores the state of all observed clusters. It provides a uniform interface that can be implemented by different storage components. The Clusterpedia API server and the ClusterSynchro manager interact with the storage layer interface and never talk to each other directly.
The storage component is an actual data store that implements the storage layer interface and stores the state of observed clusters. Clusterpedia was designed to support different storage components to provide flexibility for their users. Currently, supported storage components include MySQL, Postgres, and Redis.
To onboard clusters into Clusterpedia, you define a PediaCluster custom resource. It is pretty straightforward:
apiVersion: cluster.clusterpedia.io/v1alpha2
kind: PediaCluster
metadata:
  name: cluster-example
spec:
  apiserver: "https://10.30.43.43:6443"
  kubeconfig:
  caData:
  tokenData:
  certData:
  keyData:
  syncResources: []
You need to provide credentials to access the cluster, and then Clusterpedia will take over and sync its state.
This is where Clusterpedia shines. You can access the Clusterpedia cluster via an API or through kubectl. When accessing it through a URL, you hit the aggregated API server endpoint:
kubectl get --raw="/apis/clusterpedia.io/v1beta1/resources/apis/apps/v1/deployments?clusters=cluster-1,cluster-2"
You can specify the target clusters as a query parameter (in this case, cluster-1 and cluster-2).

When accessing through kubectl, you specify the target clusters as a label (in this case, "search.clusterpedia.io/clusters in (cluster-1,cluster-2)"):
kubectl --cluster clusterpedia get deployments -l "search.clusterpedia.io/clusters in (cluster-1,cluster-2)"
Other search labels and queries exist for namespaces and resource names:

- search.clusterpedia.io/namespaces (query parameter is namespaces)
- search.clusterpedia.io/names (query parameter is names)

There is also an experimental fuzzy search label, internalstorage.clusterpedia.io/fuzzy-name, for resource names, but no query parameter. This is useful because resources often have generated names with random suffixes.

You can also search by creation time:

- search.clusterpedia.io/before (query parameter is before)
- search.clusterpedia.io/since (query parameter is since)

Other capabilities include filtering by resource labels or field selectors as well as organizing the results using OrderBy and Paging.
Another important concept is resource collections. The standard Kubernetes API offers a straightforward REST API where you can list or get one kind of resource at a time. However, users often would like to get multiple types of resources at the same time – for example, the Deployment, Service, and HorizontalPodAutoscaler with a specific label. This requires multiple calls via the standard Kubernetes API, even if all these resources are available on one cluster.

Clusterpedia defines a CollectionResource that groups together resources that belong to the following categories:

- any (all resources)
- workloads (Deployments, StatefulSets, and DaemonSets)
- kuberesources (all resources other than workloads)

You can search for any combination of resources in one API call by passing API groups and resource kinds:
kubectl get --raw "/apis/clusterpedia.io/v1beta1/collectionresources/any?onlyMetadata=true&groups=apps&resources=batch/jobs,batch/cronjobs"
See https://github.com/clusterpedia-io/clusterpedia for more details.
Open Cluster Management (OCM) is a CNCF sandbox project for multi-cluster management, as well as multi-cluster scheduling and workload placement. Its claim to fame is closely following many Kubernetes concepts, extensibility via addons, and strong integration with other open source projects, such as Argo CD, KubeVela, and Submariner.
The scope of OCM covers cluster lifecycle, application lifecycle, and governance.
Let’s look at OCM’s architecture.
OCM’s architecture follows the hub and spokes model. It has a hub cluster, which is the OCM control plane that manages multiple other clusters (the spokes).
The control plane’s hub cluster runs two controllers: the registration controller and the placement controller. In addition, the control plane runs multiple management addons, which are the foundation for OCM’s extensibility. On each managed cluster, there is a so-called Klusterlet that has a registration-agent and work-agent that interact with the registration controller and placement controller on the hub cluster. Then, there are also addon agents that interact with the addons on the hub cluster.
The following diagram illustrates how the different components of OCM communicate:
Figure 11.7: OCM architecture
Let’s look at the different aspects of OCM.
Cluster registration is a big part of OCM’s secure multi-cluster story. OCM prides itself on its secure, double opt-in handshake registration. Since the hub cluster and the spoke clusters may have different administrators, this model protects each side from undesired requests. Either side can terminate the relationship at any time.
The following diagram demonstrates the registration process (CSR means certificate signing request):
Figure 11.8: OCM registration process
The OCM application lifecycle supports creating, updating, and deleting resources across multiple clusters.
The primary building block is the ManifestWork custom resource, which can define multiple resources. Here is an example that contains only a single Deployment:
apiVersion: work.open-cluster-management.io/v1
kind: ManifestWork
metadata:
  namespace: <target managed cluster>
  name: awesome-workload
spec:
  workload:
    manifests:
      - apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: hello
          namespace: default
        spec:
          selector:
            matchLabels:
              app: hello
          template:
            metadata:
              labels:
                app: hello
            spec:
              containers:
                - name: hello
                  image: quay.io/asmacdo/busybox
                  command:
                    ["sh", "-c", 'echo "Hello, Kubernetes!" && sleep 3600']
The ManifestWork is created on the hub cluster and is deployed to the target cluster according to the namespace mapping: each target cluster has a namespace representing it in the hub cluster. A work agent running on the target cluster monitors all ManifestWork resources in its namespace on the hub cluster and syncs the changes.
OCM provides a governance model based on policies, policy templates, and policy controllers. The policies can be bound to a specific set of clusters for fine-grained control.
Here is a sample policy that requires the existence of a namespace called prod:
apiVersion: policy.open-cluster-management.io/v1
kind: Policy
metadata:
  name: policy-namespace
  namespace: policies
  annotations:
    policy.open-cluster-management.io/standards: NIST SP 800-53
    policy.open-cluster-management.io/categories: CM Configuration Management
    policy.open-cluster-management.io/controls: CM-2 Baseline Configuration
spec:
  remediationAction: enforce
  disabled: false
  policy-templates:
    - objectDefinition:
        apiVersion: policy.open-cluster-management.io/v1
        kind: ConfigurationPolicy
        metadata:
          name: policy-namespace-example
        spec:
          remediationAction: inform
          severity: low
          object-templates:
            - complianceType: MustHave
              objectDefinition:
                kind: Namespace # must have namespace 'prod'
                apiVersion: v1
                metadata:
                  name: prod
See https://open-cluster-management.io/ for more details.
Virtual Kubelet is a fascinating project. It impersonates a kubelet to connect Kubernetes to other APIs such as AWS Fargate or Azure ACI. The Virtual Kubelet looks like just another node to the Kubernetes cluster, but the compute resources backing it are abstracted away:
Figure 11.9: Virtual Kubelet, which looks like a regular node to the Kubernetes cluster
The Virtual Kubelet’s features include a pluggable provider architecture and support for the standard kubelet operations: creating, updating, and deleting pods, as well as retrieving pod status, container logs, exec, and metrics.
See https://github.com/virtual-kubelet/virtual-kubelet for more details.
This concept can be used to connect multiple Kubernetes clusters too, and several projects follow this approach. Let’s look briefly at some projects that use Virtual Kubelet for multi-cluster management such as tensile-kube, Admiralty, and Liqo.
Tensile-kube is a sub-project of the Virtual Kubelet organization on GitHub.
Tensile-kube brings the following to the table: automatic discovery of the resources of lower clusters, syncing of pod dependencies such as ConfigMaps and Secrets, global scheduling with a custom scheduler, and a descheduler for rebalancing pods.
Tensile-kube uses the terminology of the upper cluster for the cluster that contains the Virtual Kubelets, and the lower clusters for the clusters that are exposed as virtual nodes in the upper cluster.
Here is the tensile-kube architecture:
Figure 11.10: Tensile-kube architecture
See https://github.com/virtual-kubelet/tensile-kube for more details.
Admiralty is an open source project backed by a commercial company. Admiralty takes the Virtual Kubelet concept and builds a sophisticated solution for multi-cluster orchestration and scheduling. Target clusters are represented as virtual nodes in the source cluster. It has a pretty complicated architecture that involves three levels of scheduling: proxy pods are created on the source cluster, candidate pods are created on each target cluster, and eventually, one of the candidate pods is selected and becomes a delegate pod, which is a real pod that actually runs its containers. This is all supported by custom multi-cluster schedulers built on top of the Kubernetes scheduling framework. To schedule workloads on Admiralty, you annotate a pod template with multicluster.admiralty.io/elect="" and Admiralty will take it from there.
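For example, a Job whose pods should be scheduled across clusters by Admiralty might carry the annotation on its pod template like this (a sketch; the job name, image, and command are arbitrary):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: global-hello
spec:
  template:
    metadata:
      annotations:
        multicluster.admiralty.io/elect: "" # opt this pod into multi-cluster scheduling
    spec:
      restartPolicy: Never
      containers:
        - name: hello
          image: busybox
          command: ["sh", "-c", "echo Hello from some cluster"]
```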
Here is a diagram that demonstrates the interplay between different components:
Figure 11.11: Admiralty architecture
Admiralty supports advanced multi-cluster use cases such as high availability across clusters and clouds, cloud bursting, and edge computing.
See https://admiralty.io for more details.
Liqo is an open source project based on the liquid computing concept. Let your tasks and data float around and find the best place to run. Its scope is very impressive, as it targets not only the compute aspect of running pods across multiple clusters but also provides network fabric and storage fabric. These aspects of connecting clusters and managing data across clusters are often harder problems to solve than just running workloads.
In Liqo’s terminology, the management cluster is called the home cluster and the target clusters are called foreign clusters. The virtual nodes in the home cluster are called “Big” nodes, and they represent the foreign clusters.
Liqo utilizes IP address mapping to achieve a flat IP address space across all foreign clusters that may have internal IP conflicts.
Liqo filters and batches events from the foreign clusters to reduce pressure on the home cluster.
Here is a diagram of the Liqo architecture:
Figure 11.12: Liqo architecture
See https://liqo.io for more details.
Let’s move on and take an in-depth look at the Gardener project, which takes a different approach.
The Gardener project is an open source project developed by SAP. It lets you manage thousands (yes, thousands!) of Kubernetes clusters efficiently and economically. Gardener solves a very complex problem, and the solution is elegant but not simple. Gardener is the only project that addresses both the cluster lifecycle and application lifecycle.
In this section, we will cover the terminology of Gardener and its conceptual model, dive deep into its architecture, and learn about its extensibility features. The primary theme of Gardener is to use Kubernetes to manage Kubernetes clusters. A good way to think about Gardener is Kubernetes-control-plane-as-a-service.
See https://gardener.cloud for more details.
The Gardener project, as you may have guessed, uses botanical terminology to describe the world. There is a garden, which is a Kubernetes cluster responsible for managing seed clusters. A seed is a Kubernetes cluster responsible for managing a set of shoot clusters. A shoot cluster is a Kubernetes cluster that runs actual workloads.
The cool idea behind Gardener is that the shoot clusters contain only the worker nodes. The control planes of all the shoot clusters run as Kubernetes pods and services in the seed cluster.
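Creating a new shoot cluster boils down to creating a Shoot resource in the garden cluster. As a rough, hedged sketch (the field names follow the Gardener docs, but the exact schema depends on the Gardener version and provider; the names and values here are invented):

```yaml
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
metadata:
  name: my-shoot
  namespace: garden-team-a # shoots live in a project namespace of the garden cluster
spec:
  cloudProfileName: aws
  region: eu-west-1
  provider:
    type: aws
    workers:               # only worker pools; the control plane runs in a seed
      - name: worker-pool-1
        machine:
          type: m5.large
        minimum: 2
        maximum: 5
  kubernetes:
    version: "1.27"
```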
The following diagram describes in detail the structure of Gardener and the relationships between its components:
Figure 11.13: The Gardener project structure
Don’t panic! Underlying all this complexity is a crystal clear conceptual model.
The architecture diagram of Gardener can be overwhelming. Let’s unpack it slowly and surface the underlying principles. Gardener really embraces the spirit of Kubernetes and offloads a lot of the complexity of managing a large set of Kubernetes clusters to Kubernetes itself. At its heart, Gardener is an aggregated API server that manages a set of custom resources using various controllers. It embraces and takes full advantage of Kubernetes’ extensibility. This approach is common in the Kubernetes community. Define a set of custom resources and let Kubernetes manage them for you. The novelty of Gardener is that it takes this approach to the extreme and abstracts away parts of Kubernetes infrastructure itself.
In a “normal” Kubernetes cluster, the control plane runs in the same cluster as the worker nodes. Typically, in large clusters, control plane components like the Kubernetes API server and etcd run on dedicated nodes and don’t mix with the worker nodes. Gardener thinks in terms of many clusters: it takes the control planes of all the shoot clusters and manages them in a seed cluster. So the Kubernetes control plane of each shoot cluster is managed in the seed cluster as regular Kubernetes Deployments, which automatically provides replication, monitoring, self-healing, and rolling updates by Kubernetes.
So, the control plane of a Kubernetes shoot cluster is analogous to a Deployment. The seed cluster, on the other hand, maps to a Kubernetes node: it manages multiple shoot clusters. It is recommended to have a seed cluster per cloud provider. The Gardener developers actually work on a gardenlet controller for seed clusters that is similar to the kubelet on nodes.
If the seed clusters are like Kubernetes nodes, then the Garden cluster that manages those seed clusters is like a Kubernetes cluster that manages its worker nodes.
By pushing the Kubernetes model this far, the Gardener project leverages the strengths of Kubernetes to achieve robustness and performance that would be very difficult to build from scratch.
Let’s dive into the architecture.
Gardener creates a Kubernetes namespace in the seed cluster for each shoot cluster. It manages the certificates of the shoot clusters as Kubernetes secrets in the seed cluster.
The etcd data store for each cluster is deployed as a StatefulSet with one replica. In addition, events are stored in a separate etcd instance. The etcd data is periodically snapshotted and stored in remote storage for backup and restore purposes. This enables very fast recovery of clusters that lost their control plane (e.g., when an entire seed cluster becomes unreachable). Note that when a seed cluster goes down, the shoot cluster continues to run as usual.
As mentioned before, the control plane of a shoot cluster runs in a separate seed cluster, while the worker nodes run in the shoot cluster itself. This means that pods in the shoot cluster can use internal DNS to locate each other, but communication with the Kubernetes API server running in the seed cluster must go through an external DNS. This means the Kubernetes API server runs as a Service of the LoadBalancer type.
When creating a new shoot cluster, it’s important to provide the necessary infrastructure. Gardener uses Terraform for this task. A Terraform script is dynamically generated based on the shoot cluster specification and stored as a ConfigMap within the seed cluster. To facilitate this process, a dedicated component (Terraformer) runs as a job, performs all the provisioning, and then writes the state into a separate ConfigMap.
To provision nodes in a provider-agnostic manner that can work for private clouds too, Gardener defines several custom resources, such as MachineDeployment, MachineClass, MachineSet, and Machine. The Gardener developers work with the Kubernetes Cluster Lifecycle group to unify these abstractions, because there is a lot of overlap. In addition, Gardener takes advantage of the cluster autoscaler to offload the complexity of scaling node pools up and down.
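Note how these resources mirror the familiar Deployment/ReplicaSet/Pod hierarchy. A sketch of a MachineDeployment for the machine-controller-manager might look like the following; the names, labels, and namespace are hypothetical:

```yaml
# Illustrative sketch of a machine-controller-manager MachineDeployment
# (names and values are hypothetical).
apiVersion: machine.sapcloud.io/v1alpha1
kind: MachineDeployment
metadata:
  name: pool-01
  namespace: shoot--team-a--prod-eu1   # hypothetical shoot namespace
spec:
  replicas: 2
  selector:
    matchLabels:
      pool: pool-01
  template:
    metadata:
      labels:
        pool: pool-01
    spec:
      class:
        kind: AWSMachineClass   # provider-specific class (AMI, machine type, etc.)
        name: pool-01-class
```

Just as a Deployment manages ReplicaSets that manage pods, a MachineDeployment manages MachineSets that manage Machines, each of which maps to an actual VM at the provider.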
The seed cluster and shoot clusters can run on different cloud providers. The worker nodes in the shoot clusters are often deployed in private networks. Since the control plane needs to interact closely with the worker nodes (mostly the kubelet), Gardener creates a VPN for direct communication.
Observability is a big part of operating complex distributed systems. Gardener provides a lot of monitoring out of the box using best-in-class open source projects. A central Prometheus server deployed in the garden cluster collects information about all seed clusters, and each shoot cluster gets its own Prometheus instance in the seed cluster. To collect metrics, Gardener deploys two kube-state-metrics instances per cluster (one for the control plane in the seed and one for the worker nodes in the shoot). The node-exporter is deployed too, to provide additional information about the nodes. The Prometheus Alertmanager is used to notify the operator when something goes wrong, and Grafana is used to display dashboards with relevant data on the state of the system.
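The central Prometheus server can pull from the per-seed instances using Prometheus's standard federation mechanism. The following scrape configuration sketches the general pattern; the job names and targets are hypothetical, and Gardener's actual monitoring setup may differ:

```yaml
# Illustrative sketch only - federating metrics from per-seed Prometheus
# instances into a central Prometheus (targets are hypothetical).
scrape_configs:
- job_name: federate-seeds
  honor_labels: true
  metrics_path: /federate
  params:
    'match[]':
    - '{job="kube-state-metrics"}'
    - '{job="node-exporter"}'
  static_configs:
  - targets:
    - prometheus.seed-aws.example.com
    - prometheus.seed-gcp.example.com
```

The `match[]` selectors restrict federation to the series that are actually needed centrally, which keeps the central server's load manageable as the number of seeds grows.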
You can manage Gardener using only kubectl, but you will have to switch profiles and contexts a lot as you explore different clusters. Gardener provides the gardenctl command-line tool, which offers higher-level abstractions and can operate on multiple clusters at the same time. Here is an example:
$ gardenctl ls shoots
projects:
- project: team-a
  shoots:
  - dev-eu1
  - prod-eu1

$ gardenctl target shoot prod-eu1
[prod-eu1]

$ gardenctl show prometheus
NAME           READY   STATUS    RESTARTS   AGE    IP              NODE
prometheus-0   3/3     Running   0          106d   10.241.241.42   ip-10-240-7-72.eu-central-1.compute.internal

URL: https://user:[email protected]
One of the most prominent features of Gardener is its extensibility. It has a large surface area and it supports many environments. Let’s see how extensibility is built into its design.
Gardener supports the following environments:
It started, like Kubernetes itself, with a lot of provider-specific support in the primary Gardener repository. Over time, it followed the Kubernetes example of externalizing cloud providers and migrated them to separate Gardener extensions. Providers can be specified using a CloudProfile CRD such as:
apiVersion: core.gardener.cloud/v1beta1
kind: CloudProfile
metadata:
  name: aws
spec:
  type: aws
  kubernetes:
    versions:
    - version: 1.24.3
    - version: 1.23.8
      expirationDate: "2022-10-31T23:59:59Z"
  machineImages:
  - name: coreos
    versions:
    - version: 2135.6.0
  machineTypes:
  - name: m5.large
    cpu: "2"
    gpu: "0"
    memory: 8Gi
    usable: true
  volumeTypes:
  - name: gp2
    class: standard
    usable: true
  - name: io1
    class: premium
    usable: true
  regions:
  - name: eu-central-1
    zones:
    - name: eu-central-1a
    - name: eu-central-1b
    - name: eu-central-1c
  providerConfig:
    apiVersion: aws.provider.extensions.gardener.cloud/v1alpha1
    kind: CloudProfileConfig
    machineImages:
    - name: coreos
      versions:
      - version: 2135.6.0
        regions:
        - name: eu-central-1
          ami: ami-034fd8c3f4026eb39
          # architecture: amd64 # optional
Then, a shoot cluster specification references a cloud profile and configures the provider with the necessary information:
apiVersion: gardener.cloud/v1alpha1
kind: Shoot
metadata:
  name: johndoe-aws
  namespace: garden-dev
spec:
  cloudProfileName: aws
  secretBindingName: core-aws
  cloud:
    type: aws
    region: eu-west-1
    providerConfig:
      apiVersion: aws.cloud.gardener.cloud/v1alpha1
      kind: InfrastructureConfig
      networks:
        vpc: # specify either 'id' or 'cidr'
          # id: vpc-123456
          cidr: 10.250.0.0/16
        internal:
        - 10.250.112.0/22
        public:
        - 10.250.96.0/22
        workers:
        - 10.250.0.0/19
      zones:
      - eu-west-1a
    workerPools:
    - name: pool-01
      # Taints, labels, and annotations are not yet implemented. This requires
      # interaction with the machine-controller-manager, see
      # https://github.com/gardener/machine-controller-manager/issues/174.
      # It is only mentioned here as a future proposal.
      # taints:
      # - key: foo
      #   value: bar
      #   effect: PreferNoSchedule
      # labels:
      # - key: bar
      #   value: baz
      # annotations:
      # - key: foo
      #   value: hugo
      machineType: m4.large
      volume: # optional, not needed in every environment, may only be specified
              # if the referenced CloudProfile contains the volumeTypes field
        type: gp2
        size: 20Gi
      providerConfig:
        apiVersion: aws.cloud.gardener.cloud/v1alpha1
        kind: WorkerPoolConfig
        machineImage:
          name: coreos
          ami: ami-d0dcef3
      zones:
      - eu-west-1a
      minimum: 2
      maximum: 2
      maxSurge: 1
      maxUnavailable: 0
  kubernetes:
    version: 1.11.0
  ...
  dns:
    provider: aws-route53
    domain: johndoe-aws.garden-dev.example.com
  maintenance:
    timeWindow:
      begin: 220000+0100
      end: 230000+0100
    autoUpdate:
      kubernetesVersion: true
  backup:
    schedule: "*/5 * * * *"
    maximum: 7
  addons:
    kube2iam:
      enabled: false
    kubernetes-dashboard:
      enabled: true
    cluster-autoscaler:
      enabled: true
    nginx-ingress:
      enabled: true
      loadBalancerSourceRanges: []
    kube-lego:
      enabled: true
      email: [email protected]
But, the extensibility goals of Gardener go far beyond just being provider agnostic. The overall process of standing up a Kubernetes cluster involves many steps. The Gardener project aims to let the operator customize each and every step by defining custom resources and webhooks. Here is the general flow diagram with the CRDs, mutating/validating admission controllers, and webhooks associated with each step:
Figure 11.14: Flow diagram of CRDs, mutating/validating admission controllers, and webhooks
Here are the CRD categories that comprise the extensibility space of Gardener:
We have covered Gardener in depth, which brings us to the end of the chapter.
In this chapter, we’ve covered the exciting area of multi-cluster management. There are many projects that tackle this problem from different angles. The Cluster API project has a lot of momentum for solving the sub-problem of managing the lifecycle of multiple clusters. Many other projects take on resource management and the application lifecycle. These projects can be divided into two categories: projects that explicitly manage multiple clusters using a management cluster and managed clusters, and projects that utilize the Virtual Kubelet, where whole clusters appear as virtual nodes in the main cluster.
The Gardener project has a very interesting approach and architecture. It tackles the problem of multiple clusters from a different perspective and focuses on the large-scale management of clusters. It is the only project that addresses both cluster lifecycle and application lifecycle.
At this point, you should have a clear understanding of the current state of multi-cluster management and what the different projects offer. You may decide that it’s still too early or that you want to take the plunge.
In the next chapter, we will explore the exciting world of serverless computing on Kubernetes. Serverless can mean two different things: you don’t have to manage servers for your long-running workloads, and also, running functions as a service. Both forms of serverless are available for Kubernetes, and both of them are extremely useful.