In this chapter, we’ll look at how Kubernetes manages storage. Storage is very different from compute, but at a high level they are both resources. Kubernetes as a generic platform takes the approach of abstracting storage behind a programming model and a set of plugins for storage providers. First, we’ll go into detail about the conceptual storage model and how storage is made available to containers in the cluster. Then, we’ll cover the common cloud platform storage providers, such as Amazon Web Services (AWS), Google Compute Engine (GCE), and Azure. Next, we’ll look at a prominent open source storage provider, GlusterFS from Red Hat, which provides a distributed filesystem. We’ll also look into another solution – Ceph – that manages your data in containers as part of the Kubernetes cluster using the Rook operator. We’ll see how Kubernetes supports the integration of existing enterprise storage solutions. Finally, we will explore the Container Storage Interface (CSI) and all the advanced capabilities it brings to the table.
This chapter will cover the following main topics:
At the end of this chapter, you’ll have a solid understanding of how storage is represented in Kubernetes, the various storage options in each deployment environment (local testing, public cloud, and enterprise), and how to choose the best option for your use case.
You should try the code samples in this chapter on minikube, or another cluster that supports storage adequately. The KinD cluster has some problems related to labeling nodes, which is necessary for some storage solutions.
In this section, we will understand the Kubernetes storage conceptual model and see how to map persistent storage into containers, so they can read and write. Let’s start by understanding the problem of storage.
Containers and pods are ephemeral. Anything a container writes to its own filesystem gets wiped out when the container dies. Containers can also mount directories from their host node and read or write to them. These will survive container restarts, but the nodes themselves are not immortal. Also, if the pod itself is evicted and scheduled to a different node, the pod’s containers will not have access to the old node host’s filesystem.
There are other problems, such as ownership of mounted host directories when the container dies. Just imagine a bunch of containers writing important data to various data directories on their host and then going away, leaving all that data all over the nodes with no direct way to tell which container wrote which data. You can try to record this information, but where would you record it? It’s pretty clear that for a large-scale system, you need persistent storage accessible from any node to reliably manage the data.
The basic Kubernetes storage abstraction is the volume. Containers mount volumes that are bound to their pod, and they access the storage wherever it may be as if it’s in their local filesystem. This is nothing new, and it is great, because as a developer who writes applications that need access to data, you don’t have to worry about where and how the data is stored. Kubernetes supports many types of volumes with their own distinctive features. Let’s review some of the main volume types.
It is very simple to share data between containers in the same pod using a shared volume. Container 1 and container 2 simply mount the same volume and can communicate by reading and writing to this shared space. The most basic volume is the emptyDir. An emptyDir volume is an empty directory on the host. Note that it is not persistent because when the pod is evicted or deleted, the contents are erased. If a container just crashes, the pod will stick around, and the restarted container can access the data in the volume. Another very interesting option is to use a RAM disk, by specifying the medium as Memory. Now, your containers communicate through shared memory, which is much faster, but more volatile of course. If the node is restarted, the emptyDir volume’s contents are lost.
Here is a pod configuration file that has two containers that mount the same volume, called shared-volume. The containers mount it in different paths, but when the hue-global-listener container is writing a file to /notifications, the hue-job-scheduler will see that file under /incoming:
apiVersion: v1
kind: Pod
metadata:
  name: hue-scheduler
spec:
  containers:
  - image: g1g1/hue-global-listener:1.0
    name: hue-global-listener
    volumeMounts:
    - mountPath: /notifications
      name: shared-volume
  - image: g1g1/hue-job-scheduler:1.0
    name: hue-job-scheduler
    volumeMounts:
    - mountPath: /incoming
      name: shared-volume
  volumes:
  - name: shared-volume
    emptyDir: {}
To use the shared memory option, we just need to add medium: Memory to the emptyDir section:
volumes:
- name: shared-volume
  emptyDir:
    medium: Memory
Note that a memory-based emptyDir counts toward the container’s memory limit.
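Because of that accounting, it can be prudent to cap a memory-backed emptyDir with the optional sizeLimit field, which bounds how much RAM the volume can consume (the 64Mi value below is just an illustrative choice):

```yaml
volumes:
- name: shared-volume
  emptyDir:
    medium: Memory
    sizeLimit: 64Mi  # the pod is evicted if the volume grows beyond this
```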
To verify it worked, let’s create the pod and then write a file using one container and read it using the other container:
$ k create -f hue-scheduler.yaml
pod/hue-scheduler created
Note that the pod has two containers:
$ k get pod hue-scheduler -o json | jq .spec.containers
[
  {
    "image": "g1g1/hue-global-listener:1.0",
    "name": "hue-global-listener",
    "volumeMounts": [
      {
        "mountPath": "/notifications",
        "name": "shared-volume"
      },
      ...
    ]
    ...
  },
  {
    "image": "g1g1/hue-job-scheduler:1.0",
    "name": "hue-job-scheduler",
    "volumeMounts": [
      {
        "mountPath": "/incoming",
        "name": "shared-volume"
      },
      ...
    ]
    ...
  }
]
Now, we can create a file in the /notifications directory of the hue-global-listener container and list it in the /incoming directory of the hue-job-scheduler container:
$ kubectl exec -it hue-scheduler -c hue-global-listener -- touch /notifications/1.txt
$ kubectl exec -it hue-scheduler -c hue-job-scheduler -- ls /incoming
1.txt
As you can see, a file created in one container is visible in the filesystem of the other container, so the containers can communicate via the shared filesystem.
Sometimes, you want your pods to get access to some host information (for example, the Docker daemon) or you want pods on the same node to communicate with each other. This is useful if the pods know they are on the same host. Since Kubernetes schedules pods based on available resources, pods usually don’t know what other pods they share the node with. There are several cases where a pod can rely on other pods being scheduled with it on the same node:

- DaemonSet pods always share a node with any other pod that matches their selector

For example, in Chapter 5, Using Kubernetes Resources in Practice, we discussed a DaemonSet pod that serves as an aggregating proxy to other pods. Another way to implement this behavior is for the pods to simply write their data to a mounted volume that is bound to a host directory, and the DaemonSet pod can directly read it and act on it.
A HostPath volume is a host file or directory that is mounted into a pod. Before you decide to use the HostPath volume, make sure you understand the consequences:

- The behavior of pods with the same configuration might be different if the data on the host is different
- It can violate resource-based scheduling because Kubernetes can’t monitor HostPath resources
- Containers that access host directories must have a security context with privileged set to true or, on the host side, you need to change the permissions to allow writing

Here is a configuration file that mounts the /coupons directory into the hue-coupon-hunter container, which is mapped to the host’s /etc/hue/data/coupons directory:
apiVersion: v1
kind: Pod
metadata:
  name: hue-coupon-hunter
spec:
  containers:
  - image: the_g1g1/hue-coupon-hunter
    name: hue-coupon-hunter
    volumeMounts:
    - mountPath: /coupons
      name: coupons-volume
  volumes:
  - name: coupons-volume
    hostPath:
      path: /etc/hue/data/coupons
Since the pod doesn’t have a privileged security context, it will not be able to write to the host directory. Let’s change the container spec to enable it by adding a security context:
- image: the_g1g1/hue-coupon-hunter
  name: hue-coupon-hunter
  volumeMounts:
  - mountPath: /coupons
    name: coupons-volume
  securityContext:
    privileged: true
In the following diagram, you can see that each container has its own local storage area inaccessible to other containers or pods, and the host’s /data
directory is mounted as a volume into both container 1 and container 2:
Figure 6.1: Container local storage
Local volumes are similar to HostPath, but they persist across pod restarts and node restarts. In that sense, they are considered persistent volumes. They were added in Kubernetes 1.7 and have been considered stable as of Kubernetes 1.14. The purpose of local volumes is to support StatefulSets where specific pods need to be scheduled on nodes that contain specific storage volumes. Local volumes have node affinity annotations that simplify the binding of pods to the storage they need to access.
We need to define a storage class for using local volumes. We will cover storage classes in depth later in this chapter. In one sentence, storage classes use a provisioner to allocate storage to pods. Let’s define the storage class in a file called local-storage-class.yaml and create it:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
$ k create -f local-storage-class.yaml
storageclass.storage.k8s.io/local-storage created
Now, we can create a persistent volume using the storage class that will persist even after the pod that’s using it is terminated:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv
  labels:
    release: stable
    capacity: 10Gi
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-storage
  local:
    path: /mnt/disks/disk-1
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - k3d-k3s-default-agent-1
While emptyDir volumes can be mounted and used by containers, they are not persistent and don’t require any special provisioning because they use existing storage on the node. HostPath volumes persist on the original node, but if a pod is restarted on a different node, it can’t access the HostPath volume from its previous node. Local volumes are real persistent volumes that use storage provisioned ahead of time by administrators or dynamic provisioning via storage classes. They persist on the node and can survive pod restarts, rescheduling, and even node restarts. Some persistent volumes use external storage (not a disk physically attached to the node) provisioned ahead of time by administrators. In cloud environments, the provisioning may be very streamlined, but it is still required, and as a Kubernetes cluster administrator you have to at least make sure your storage quota is adequate and monitor usage versus quota diligently.
Remember that persistent volumes are cluster-level resources, similar to nodes. As such, they are not namespaced and their lifecycle is independent of any single pod that uses them.
You can provision resources statically or dynamically.
Static provisioning is straightforward. The cluster administrator creates persistent volumes backed up by some storage media ahead of time, and these persistent volumes can be claimed by containers.
Dynamic provisioning may happen when a persistent volume claim doesn’t match any of the statically provisioned persistent volumes. If the claim specified a storage class and the administrator configured that class for dynamic provisioning, then a persistent volume may be provisioned on the fly. We will see examples later when we discuss persistent volume claims and storage classes.
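As a preview, dynamic provisioning is typically triggered by a claim like the following sketch; the class name fast is a hypothetical storage class that an administrator has configured with a dynamic provisioner:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dynamic-claim
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: fast  # hypothetical class backed by a dynamic provisioner
  resources:
    requests:
      storage: 5Gi
```

If no statically provisioned volume matches, the provisioner behind the fast class creates one on the fly and binds it to the claim.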
Kubernetes originally contained a lot of code for storage provisioning “in-tree” as part of the main Kubernetes code base. With the introduction of CSI, storage provisioners started to migrate out of Kubernetes core into volume plugins (AKA out-of-tree). External provisioners work just like in-tree dynamic provisioners but can be deployed and updated independently. Most in-tree storage provisioners have been migrated out-of-tree. Check out this project for a library and guidelines for writing external storage provisioners: https://github.com/kubernetes-sigs/sig-storage-lib-external-provisioner.
Here is the configuration file for an NFS persistent volume:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-777
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteMany
  - ReadOnlyMany
  persistentVolumeReclaimPolicy: Recycle
  storageClassName: slow
  mountOptions:
  - hard
  - nfsvers=4.2
  nfs:
    path: /tmp
    server: nfs-server.default.svc.cluster.local
A persistent volume has a spec and metadata that possibly includes a storage class name. Let’s focus on the spec here. There are six sections: capacity, volume mode, access modes, reclaim policy, storage class, and the volume type (nfs in the example).
Each volume has a designated amount of storage. Storage claims may be satisfied by persistent volumes that have at least that amount of storage. In the example, the persistent volume has a capacity of 10 gibibytes (a single gibibyte is 2 to the power of 30 bytes).
capacity:
  storage: 10Gi
It is important when allocating static persistent volumes to understand the storage request patterns. For example, if you provision 20 persistent volumes with 100 GiB capacity and a container claims a persistent volume with 150 GiB, then this claim will not be satisfied even though there is enough capacity overall in the cluster.
The optional volume mode was added in Kubernetes 1.9 as an Alpha feature (moved to Beta in Kubernetes 1.13) for static provisioning. It lets you specify if you want a file system (Filesystem) or raw storage (Block). If you don’t specify volume mode, then the default is Filesystem, just like it was pre-1.9.
There are three access modes:

- ReadOnlyMany: Can be mounted read-only by many nodes
- ReadWriteOnce: Can be mounted as read-write by a single node
- ReadWriteMany: Can be mounted as read-write by many nodes

The storage is mounted to nodes, so even with ReadWriteOnce, multiple containers on the same node can mount the volume and write to it. If that causes a problem, you need to handle it through some other mechanism (for example, claim the volume only in DaemonSet pods that you know will have just one per node).
Different storage providers support some subset of these modes. When you provision a persistent volume, you can specify which modes it will support. For example, NFS supports all modes, but in the example, only these modes were enabled:
accessModes:
- ReadWriteMany
- ReadOnlyMany
The reclaim policy determines what happens when a persistent volume claim is deleted. There are three different policies:
- Retain – the volume will need to be reclaimed manually
- Delete – the content, the volume, and the backing storage are removed
- Recycle – delete content only (rm -rf /volume/*)

The Retain and Delete policies mean the persistent volume is not available anymore for future claims. The Recycle policy allows the volume to be claimed again.
At the moment, NFS and HostPath support the recycle policy, while AWS EBS, GCE PD, Azure disk, and Cinder volumes support the delete policy. Note that dynamically provisioned volumes are always deleted.
You can specify a storage class using the optional storageClassName field of the spec. If you do, then only persistent volume claims that specify the same storage class can be bound to the persistent volume. If you don’t specify a storage class, then only PV claims that don’t specify a storage class can be bound to it.
storageClassName: slow
The volume type is specified by name in the spec. There is no volumeType stanza in the spec. In the preceding example, nfs is the volume type:
nfs:
  path: /tmp
  server: 172.17.0.8
Each volume type may have its own set of parameters. In this case, it’s a path and server.
We will go over various volume types later.
Some persistent volume types have additional mount options you can specify. The mount options are not validated. If you provide an invalid mount option, the volume provisioning will fail. For example, NFS supports additional mount options:
mountOptions:
- hard
- nfsvers=4.1
Now that we have looked at provisioning a single persistent volume, let’s look at projected volumes, which add more flexibility and abstraction of storage.
Projected volumes allow you to mount multiple persistent volumes into the same directory. You need to be careful of naming conflicts of course.
The following volume types support projected volumes:

- ConfigMap
- Secret
- DownwardAPI
- ServiceAccountToken

The snippet below projects a ConfigMap and a Secret into the same directory:
apiVersion: v1
kind: Pod
metadata:
  name: projected-volumes-demo
spec:
  containers:
  - name: projected-volumes-demo
    image: busybox:1.28
    volumeMounts:
    - name: projected-volumes-demo
      mountPath: "/projected-volume"
      readOnly: true
  volumes:
  - name: projected-volumes-demo
    projected:
      sources:
      - secret:
          name: the-user
          items:
          - key: username
            path: the-group/the-user
      - configMap:
          name: the-config-map
          items:
          - key: config
            path: the-group/the-config-map
The parameters for projected volumes are very similar to regular volumes. The exceptions are:

- To be consistent with ConfigMap naming, the secretName field has been updated to name for secrets.
- defaultMode can only be set at the projected level and cannot be specified individually for each volume source (but you can specify the mode explicitly for each projection).

Let’s look at a special kind of projected volume – the serviceAccountToken projected volume.
Kubernetes pods can access the Kubernetes API server using the permissions of the service account associated with the pod. serviceAccountToken projected volumes give you more granularity and control from a security standpoint. The token can have an expiration and a specific audience.
More details are available here: https://kubernetes.io/docs/concepts/storage/projected-volumes/#serviceaccounttoken.
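For illustration, here is a hedged sketch of a pod that projects a service account token with a specific audience and expiration; the audience value vault and the pod/volume names are placeholders for this example:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sa-token-demo
spec:
  containers:
  - name: main
    image: busybox:1.28
    command: ["sleep", "3600"]
    volumeMounts:
    - name: token
      mountPath: /var/run/secrets/tokens
  volumes:
  - name: token
    projected:
      sources:
      - serviceAccountToken:
          audience: vault           # intended consumer of the token
          expirationSeconds: 3600   # kubelet rotates the token before expiry
          path: vault-token         # file name under the mount path
```

The kubelet requests the token on behalf of the pod and refreshes it automatically, so the container always reads a valid token from /var/run/secrets/tokens/vault-token.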
Local volumes are static persistent disks that are allocated on a specific node. They are similar to HostPath volumes, but Kubernetes knows which node a local volume belongs to and will always schedule pods that bind to that local volume to that node. This means the pod will not be evicted and scheduled to another node where the data is not available.
Let’s create a local volume. First, we need to create a backing directory. For KinD and k3d clusters you can access the node through Docker:
$ docker exec -it k3d-k3s-default-agent-1 mkdir -p /mnt/disks/disk-1
$ docker exec -it k3d-k3s-default-agent-1 ls -la /mnt/disks
total 12
drwxr-xr-x 3 0 0 4096 Jun 29 21:40 .
drwxr-xr-x 3 0 0 4096 Jun 29 21:40 ..
drwxr-xr-x 2 0 0 4096 Jun 29 21:40 disk-1
For minikube, you need to use minikube ssh.
Now, we can create a local volume backed by the /mnt/disks/disk-1 directory:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv
  labels:
    release: stable
    capacity: 10Gi
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-storage
  local:
    path: /mnt/disks/disk-1
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - k3d-k3s-default-agent-1
$ k create -f local-volume.yaml
persistentvolume/local-pv created
$ k get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
local-pv 10Gi RWO Delete Bound default/local-storage-claim local-storage 6m44s
When containers want access to some persistent storage they make a claim (or rather, the developer and cluster administrator coordinate on necessary storage resources to claim). Here is a sample claim that matches the persistent volume from the previous section - Creating a local volume:
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: local-storage-claim
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 8Gi
  storageClassName: local-storage
  selector:
    matchLabels:
      release: "stable"
    matchExpressions:
    - {key: capacity, operator: In, values: [8Gi, 10Gi]}
Let’s create the claim and then explain what the different pieces do:
$ k create -f local-persistent-volume-claim.yaml
persistentvolumeclaim/local-storage-claim created
$ k get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
local-storage-claim WaitForFirstConsumer local-pv 10Gi RWO local-storage 21m
The name local-storage-claim will be important later when mounting the claim into a container.
The access mode in the spec is ReadWriteOnce, which means if the claim is satisfied, no other claim with the ReadWriteOnce access mode can be satisfied, but claims for ReadOnlyMany can still be satisfied.
The resources section requests 8 GiB. This can be satisfied by our persistent volume, which has a capacity of 10 Gi. But, this is a little wasteful because 2 Gi will not be used by definition.
The storage class name is local-storage. As mentioned earlier, it must match the class name of the persistent volume. However, with PVCs there is a difference between an empty class name ("") and no class name at all. The former (an empty class name) matches persistent volumes with no storage class name. The latter (no class name) will be able to bind to persistent volumes only if the DefaultStorageClass admission plugin is turned on and the default storage class is used.
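To make the distinction concrete, here is a sketch of a claim that explicitly opts out of dynamic provisioning by using an empty class name (the claim name is just an example); it can only bind to persistent volumes that have no storage class:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: no-class-claim
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: ""  # empty string: match only volumes with no storage class
  resources:
    requests:
      storage: 8Gi
```

Omitting storageClassName entirely would instead fall through to the default storage class, if one is configured.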
The selector section allows you to filter available volumes further. For example, here the volume must match the label release: stable and also have a label with either capacity: 8Gi or capacity: 10Gi. Imagine that we have several other volumes provisioned with capacities of 20 Gi and 50 Gi. We don’t want to claim a 50 Gi volume when we only need 8 Gi.
Kubernetes always tries to match the smallest volume that can satisfy a claim, but if there are no 8 Gi or 10 Gi volumes then the labels will prevent assigning a 20 Gi or 50 Gi volume and use dynamic provisioning instead.
It’s important to realize that claims don’t mention volumes by name. You can’t claim a specific volume. The matching is done by Kubernetes based on storage class, capacity, and labels.
Finally, persistent volume claims belong to a namespace. Binding a persistent volume to a claim is exclusive, which means that a persistent volume will be bound to a namespace. Even if the access mode is ReadOnlyMany or ReadWriteMany, all the pods that mount the persistent volume claim must be from that claim’s namespace.
OK. We have provisioned a volume and claimed it. It’s time to use the claimed storage in a container. This turns out to be pretty simple. First, the persistent volume claim must be used as a volume in the pod and then the containers in the pod can mount it, just like any other volume. Here is a pod manifest that specifies the persistent volume claim we created earlier (bound to the local persistent volume we provisioned):
kind: Pod
apiVersion: v1
metadata:
  name: the-pod
spec:
  containers:
  - name: the-container
    image: g1g1/py-kube:0.3
    volumeMounts:
    - mountPath: "/mnt/data"
      name: persistent-volume
  volumes:
  - name: persistent-volume
    persistentVolumeClaim:
      claimName: local-storage-claim
The key is in the persistentVolumeClaim section under volumes. The claim name (local-storage-claim here) uniquely identifies the specific claim within the current namespace and makes it available as a volume (named persistent-volume here). Then, the container can refer to it by its name and mount it to "/mnt/data".
Before we create the pod it’s important to note that the persistent volume claim didn’t actually claim any storage yet and wasn’t bound to our local volume. The claim is pending until some container actually attempts to mount a volume using the claim:
$ k get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
local-storage-claim Pending local-storage 6m14s
Now, the claim will be bound when creating the pod:
$ k create -f pod-with-local-claim.yaml
pod/the-pod created
$ k get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
local-storage-claim Bound local-pv 10Gi RWO local-storage 20m
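At this point, it’s worth verifying that the container can actually write to the mounted volume. A hedged check, assuming the-pod and the /mnt/data mount path from the manifest above (the file name is arbitrary):

```shell
$ kubectl exec the-pod -- sh -c 'echo hello > /mnt/data/test.txt && cat /mnt/data/test.txt'
hello
```

Because the claim is backed by a local volume, the file survives container restarts as long as the pod is rescheduled to the same node.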
Kubernetes 1.9 added raw block volume support as an Alpha feature. Kubernetes 1.13 moved it to Beta, and since Kubernetes 1.18 it is GA.
Raw block volumes provide direct access to the underlying storage, which is not mediated via a file system abstraction. This is very useful for applications that require high-performance storage, like databases, or when consistent I/O performance and low latency are needed. Several in-tree volume types support raw block volumes, and many CSI storage providers also offer raw block volumes. For the full list, check out: https://kubernetes-csi.github.io/docs/drivers.html.
Here is how to define a raw block volume using the Fibre Channel (fc) provider:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: block-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  volumeMode: Block
  persistentVolumeReclaimPolicy: Retain
  fc:
    targetWWNs: ["50060e801049cfd1"]
    lun: 0
    readOnly: false
A matching persistent volume claim (PVC) MUST specify volumeMode: Block as well. Here is what it looks like:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: block-pvc
spec:
  accessModes:
  - ReadWriteOnce
  volumeMode: Block
  resources:
    requests:
      storage: 10Gi
Pods consume raw block volumes as devices under /dev and NOT as mounted filesystems. Containers can then access these devices and read/write to them. In practice, this means that I/O requests to block storage go directly to the underlying block storage and don’t pass through the file system drivers. This is in theory faster, but in practice it can actually decrease performance if your application benefits from file system buffering.
Here is a pod with a container that binds the block-pvc with the raw block storage as a device named /dev/xvda:
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-block-volume
spec:
  containers:
  - name: fc-container
    image: fedora:26
    command: ["/bin/sh", "-c"]
    args: ["tail -f /dev/null"]
    volumeDevices:
    - name: data
      devicePath: /dev/xvda
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: block-pvc
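Since the volume shows up as a device rather than a mounted filesystem, the container interacts with it using block-level tools. A sketch of writing the first mebibyte of the device with dd (note that writing to the raw device is destructive to any data already on it):

```shell
$ kubectl exec pod-with-block-volume -c fc-container -- dd if=/dev/zero of=/dev/xvda bs=1M count=1
```

A real application, such as a database engine, would typically manage its own on-disk layout directly on the device.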
We will cover the Container Storage Interface (CSI) in detail later in the chapter in the section The Container Storage Interface. CSI ephemeral volumes are backed by local storage on the node. These volumes’ lifecycles are tied to the pod’s lifecycle. In addition, they can only be mounted by containers of that pod, which is useful for populating secrets and certificates directly into a pod, without going through a Kubernetes secret object.
Here is an example of a pod with a CSI ephemeral volume:
kind: Pod
apiVersion: v1
metadata:
  name: the-pod
spec:
  containers:
  - name: the-container
    image: g1g1/py-kube:0.3
    volumeMounts:
    - mountPath: "/data"
      name: the-volume
    command: [ "sleep", "1000000" ]
  volumes:
  - name: the-volume
    csi:
      driver: inline.storage.kubernetes.io
      volumeAttributes:
        key: value
CSI ephemeral volumes have been GA since Kubernetes 1.25. However, they may not be supported by all CSI drivers. As usual check the list: https://kubernetes-csi.github.io/docs/drivers.html.
Generic ephemeral volumes are yet another volume type that is tied to the pod lifecycle. When the pod is gone the generic ephemeral volume is gone.
This volume type actually creates a full-fledged persistent volume claim behind the scenes, which brings the full set of PVC capabilities, such as dynamic provisioning via storage classes.
Here is an example of a pod with a generic ephemeral volume:
kind: Pod
apiVersion: v1
metadata:
  name: the-pod
spec:
  containers:
  - name: the-container
    image: g1g1/py-kube:0.3
    volumeMounts:
    - mountPath: "/data"
      name: the-volume
    command: [ "sleep", "1000000" ]
  volumes:
  - name: the-volume
    ephemeral:
      volumeClaimTemplate:
        metadata:
          labels:
            type: generic-ephemeral-volume
        spec:
          accessModes: [ "ReadWriteOnce" ]
          storageClassName: generic-storage
          resources:
            requests:
              storage: 1Gi
Note that, from a security point of view, users that have permission to create pods but not PVCs can now create PVCs via generic ephemeral volumes. To prevent that, it is possible to use admission control.
We’ve run into storage classes already. What are they exactly? Storage classes let an administrator configure a cluster with custom persistent storage (as long as there is a proper plugin to support it). A storage class has a name in the metadata (which must be specified in the storageClassName field of the claim), a provisioner, a reclaim policy, and parameters.
We declared a storage class for local storage earlier. Here is a sample storage class that uses AWS EBS as a provisioner (so, it works only on AWS):
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
reclaimPolicy: Retain
allowVolumeExpansion: true
mountOptions:
- debug
volumeBindingMode: Immediate
You may create multiple storage classes for the same provisioner with different parameters. Each provisioner has its own parameters.
The currently supported provisioners are:
This list doesn’t contain provisioners for other volume types, such as configMap or secret, that are not backed by your typical network storage. Those volume types don’t require a storage class. Utilizing volume types intelligently is a major part of architecting and managing your cluster.
The cluster administrator can also assign a default storage class. When a default storage class is assigned and the DefaultStorageClass admission plugin is turned on, then claims with no storage class will be dynamically provisioned using the default storage class. If the default storage class is not defined or the admission plugin is not turned on, then claims with no storage class can only match volumes with no storage class.
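The default is controlled by an annotation on the StorageClass object itself. A sketch of marking a class as the default, reusing the standard class from the earlier example:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"  # makes this the default class
provisioner: kubernetes.io/aws-ebs
```

Only one storage class should carry this annotation at a time; otherwise, Kubernetes can’t tell which default to use for class-less claims.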
We covered a lot of ground and a lot of options for provisioning storage and using it in different ways. Let’s put everything together and show the whole process from start to finish.
To illustrate all the concepts, let’s do a mini demonstration where we create a HostPath volume, claim it, mount it, and have containers write to it. We will use k3d for this part.
Let’s start by creating a hostPath volume using the dir storage class. Save the following in dir-persistent-volume.yaml:
kind: PersistentVolume
apiVersion: v1
metadata:
  name: dir-pv
spec:
  storageClassName: dir
  capacity:
    storage: 1Gi
  accessModes:
  - ReadWriteMany
  hostPath:
    path: "/tmp/data"
Then, let’s create it:
$ k create -f dir-persistent-volume.yaml
persistentvolume/dir-pv created
To check out the available volumes, you can use the resource type persistentvolumes, or pv for short:
$ k get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
dir-pv 1Gi RWX Retain Available dir 22s
The capacity is 1 GiB as requested. The reclaim policy is Retain because host path volumes are retained (not destroyed). The status is Available because the volume has not been claimed yet. The access mode is specified as RWX, which means ReadWriteMany. All of the access modes have a shorthand version:

- RWO – ReadWriteOnce
- ROX – ReadOnlyMany
- RWX – ReadWriteMany
We have a persistent volume. Let’s create a claim. Save the following to dir-persistent-volume-claim.yaml:
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: dir-pvc
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
Then, run the following command:
$ k create -f dir-persistent-volume-claim.yaml
persistentvolumeclaim/dir-pvc created
Let’s check the claim and the volume:
$ k get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
dir-pvc Bound dir-pv 1Gi RWX dir 106s
$ k get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
dir-pv 1Gi RWX Retain Bound default/dir-pvc dir 4m25s
As you can see, the claim and the volume are bound to each other and reference each other. The reason the binding works is that the same storage class is used by the volume and the claim. But, what happens if they don’t match? Let’s remove the storage class from the persistent volume claim and see what happens. Save the following persistent volume claim to some-persistent-volume-claim.yaml:
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: some-pvc
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
Then, create it:
$ k create -f some-persistent-volume-claim.yaml
persistentvolumeclaim/some-pvc created
Ok. It was created. Let’s check it out:
$ k get pvc some-pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
some-pvc Pending local-path 3m29s
Very interesting. The some-pvc claim was associated with the local-path storage class that we never specified, but it is still pending. Let’s understand why.
Here is the local-path storage class:
$ k get storageclass local-path -o yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    objectset.rio.cattle.io/applied: H4sIAAAAAAAA/4yRT+vUMBCGv4rMua1bu1tKwIO... (truncated)
    objectset.rio.cattle.io/id: ""
    objectset.rio.cattle.io/owner-gvk: k3s.cattle.io/v1, Kind=Addon
    objectset.rio.cattle.io/owner-name: local-storage
    objectset.rio.cattle.io/owner-namespace: kube-system
    storageclass.kubernetes.io/is-default-class: "true"
  creationTimestamp: "2022-06-22T18:16:56Z"
  labels:
    objectset.rio.cattle.io/hash: 183f35c65ffbc3064603f43f1580d8c68a2dabd4
  name: local-path
  resourceVersion: "290"
  uid: b51cf456-f87e-48ac-9062-4652bf8f683e
provisioner: rancher.io/local-path
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
It is a storage class that comes with k3d (k3s).
Note the annotation storageclass.kubernetes.io/is-default-class: "true". It tells Kubernetes that this is the default storage class. Since our PVC had no storage class name, it was associated with the default storage class. But why is the claim still pending? The reason is that volumeBindingMode is WaitForFirstConsumer. This means that the volume for the claim will be provisioned dynamically only when a container attempts to mount the volume via the claim.
Back to our dir-pvc. The final step is to create a pod with two containers and assign the claim as a volume to both of them. Save the following to shell-pod.yaml:
kind: Pod
apiVersion: v1
metadata:
  name: just-a-shell
  labels:
    name: just-a-shell
spec:
  containers:
  - name: a-shell
    image: g1g1/py-kube:0.3
    command: ["sleep", "10000"]
    volumeMounts:
    - mountPath: "/data"
      name: pv
  - name: another-shell
    image: g1g1/py-kube:0.3
    command: ["sleep", "10000"]
    volumeMounts:
    - mountPath: "/another-data"
      name: pv
  volumes:
  - name: pv
    persistentVolumeClaim:
      claimName: dir-pvc
This pod has two containers that use the g1g1/py-kube:0.3 image and both just sleep for a long time. The idea is that the containers will keep running, so we can connect to them later and check their filesystem. The pod mounts our persistent volume claim with a volume name of pv. Note that the volume specification is done at the pod level just once, and multiple containers can mount it into different directories.
Let’s create the pod and verify that both containers are running:
$ k create -f shell-pod.yaml
pod/just-a-shell created
$ k get po just-a-shell -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
just-a-shell 2/2 Running 0 74m 10.42.2.104 k3d-k3s-default-agent-1 <none> <none>
Then, connect to the node (k3d-k3s-default-agent-1). This is the host whose /tmp/data directory backs the pod’s volume, which is mounted as /data and /another-data into the respective containers:
$ docker exec -it k3d-k3s-default-agent-1 sh
/ #
Then, let’s create a file in the /tmp/data directory on the host. It should be visible to both containers via the mounted volume:
/ # echo "yeah, it works" > /tmp/data/cool.txt
Let’s verify from the outside that the file cool.txt is indeed available:
$ docker exec -it k3d-k3s-default-agent-1 cat /tmp/data/cool.txt
yeah, it works
Next, let’s verify the file is available in the containers (in their mapped directories):
$ k exec -it just-a-shell -c a-shell -- cat /data/cool.txt
yeah, it works
$ k exec -it just-a-shell -c another-shell -- cat /another-data/cool.txt
yeah, it works
We can even create a new file, yo.txt, in one of the containers and see that it’s available to the other container or to the node itself:
$ k exec -it just-a-shell -c another-shell -- bash -c "echo yo > /another-data/yo.txt"
$ k exec -it just-a-shell -c a-shell -- cat /data/yo.txt
yo
$ k exec -it just-a-shell -c another-shell -- cat /another-data/yo.txt
yo
Yes. Everything works as expected and both containers share the same storage.
In this section, we’ll look at some of the common volume types available in the leading public cloud platforms. Managing storage at scale is a difficult task that eventually involves physical resources, similar to nodes. If you choose to run your Kubernetes cluster on a public cloud platform, you can let your cloud provider deal with all these challenges and focus on your system. But it’s important to understand the various options, constraints, and limitations of each volume type.
Many of the volume types we will go over used to be handled by in-tree plugins (part of core Kubernetes), but have now migrated to out-of-tree CSI plugins.
The CSI migration feature allows in-tree plugins that have corresponding out-of-tree CSI plugins to direct operations toward the out-of-tree plugins as a transitioning measure.
We will cover the CSI itself later.
AWS provides the Elastic Block Store (EBS) as persistent storage for EC2 instances. An AWS Kubernetes cluster can use AWS EBS as persistent storage with the following limitations:
Those are severe limitations. The restriction for a single availability zone, while great for performance, eliminates the ability to share storage at scale or across a geographically distributed system without custom replication and synchronization. The limit of a single EBS volume to a single EC2 instance means even within the same availability zone, pods can’t share storage (even for reading) unless you make sure they run on the same node.
This is an example of an in-tree plugin that also has a CSI driver and supports CSIMigration. That means that if the CSI driver for AWS EBS (ebs.csi.aws.com) is installed, the in-tree plugin will redirect all plugin operations to the out-of-tree plugin.
It is also possible to prevent the in-tree awsElasticBlockStore storage plugin from being loaded by setting the InTreePluginAWSUnregister feature gate to true (the default is false).
Check out all the feature gates here: https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/.
Let’s see how to define an AWS EBS persistent volume (static provisioning):
apiVersion: v1
kind: PersistentVolume
metadata:
  name: test-pv
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 5Gi
  csi:
    driver: ebs.csi.aws.com
    volumeHandle: {EBS volume ID}
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.ebs.csi.aws.com/zone
          operator: In
          values:
          - {availability zone}
Then you need to define a PVC:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ebs-claim
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
Finally, a pod can mount the PVC:
apiVersion: v1
kind: Pod
metadata:
  name: some-pod
spec:
  containers:
  - image: some-container
    name: some-container
    volumeMounts:
    - name: persistent-storage
      mountPath: /data
  volumes:
  - name: persistent-storage
    persistentVolumeClaim:
      claimName: ebs-claim
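Dynamic provisioning with the EBS CSI driver is also possible by defining a storage class. Here is a sketch; the class name is illustrative, and the `type` parameter follows the aws-ebs-csi-driver conventions:

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: ebs-sc
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer
```

A PVC that references ebs-sc in its storageClassName will then have an EBS volume provisioned on demand, in the availability zone of the node the consuming pod lands on.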
AWS has a service called the Elastic File System (EFS). This is really a managed NFS service. It uses the NFS 4.1 protocol and has many benefits over EBS:
That said, EFS is more expensive than EBS, even when you factor in the automatic replication to multiple AZs (assuming you fully utilize your EBS volumes). The recommended way to use EFS is via its dedicated CSI driver: https://github.com/kubernetes-sigs/aws-efs-csi-driver.
Here is an example of static provisioning. First, define the persistent volume:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-pv
spec:
  capacity:
    storage: 1Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: efs.csi.aws.com
    volumeHandle: <Filesystem Id>
You can find the Filesystem Id using the AWS CLI:
aws efs describe-file-systems --query "FileSystems[*].FileSystemId"
Next, define a matching PVC:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-claim
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: ""
  resources:
    requests:
      storage: 1Gi
Here is a pod that consumes it:
apiVersion: v1
kind: Pod
metadata:
  name: efs-app
spec:
  containers:
  - name: app
    image: centos
    command: ["/bin/sh"]
    args: ["-c", "while true; do echo $(date -u) >> /data/out.txt; sleep 5; done"]
    volumeMounts:
    - name: persistent-storage
      mountPath: /data
  volumes:
  - name: persistent-storage
    persistentVolumeClaim:
      claimName: efs-claim
You can also use dynamic provisioning by defining a proper storage class instead of creating a static volume:
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap
  fileSystemId: <Filesystem Id>
  directoryPerms: "700"
  gidRangeStart: "1000" # optional
  gidRangeEnd: "2000" # optional
  basePath: "/dynamic_provisioning" # optional
The PVC is similar, but now uses the storage class name:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-claim
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 5Gi
The pod consumes the PVC just like before:
apiVersion: v1
kind: Pod
metadata:
  name: efs-app
spec:
  containers:
  - name: app
    image: centos
    command: ["/bin/sh"]
    args: ["-c", "while true; do echo $(date -u) >> /data/out; sleep 5; done"]
    volumeMounts:
    - name: persistent-storage
      mountPath: /data
  volumes:
  - name: persistent-storage
    persistentVolumeClaim:
      claimName: efs-claim
The gcePersistentDisk volume type is very similar to awsElasticBlockStore. You must provision the disk aheadatof time, and it can only be used by GCE instances in the same project and zone. However, the same volume can be used as read-only on multiple instances, which means it supports ReadWriteOnce and ReadOnlyMany. You can use a GCE persistent disk to share data as read-only between multiple pods in the same zone.
It also has a CSI driver called pd.csi.storage.gke.io and supports CSIMigration.
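Static provisioning of a pre-existing GCE persistent disk through the CSI driver looks much like the EBS example. This is a sketch; the placeholder names in the volumeHandle follow the driver’s documented format:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: gce-pd-pv
spec:
  capacity:
    storage: 200Gi
  accessModes:
  - ReadWriteOnce
  csi:
    driver: pd.csi.storage.gke.io
    volumeHandle: projects/{project}/zones/{zone}/disks/{disk-name}
    fsType: ext4
```

A claim can then bind to this volume just like in the AWS EBS case.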
If the pod that’s using a persistent disk in ReadWriteOnce mode is controlled by a replication controller, a replica set, or a deployment, the replica count must be 0 or 1. Trying to scale beyond 1 will fail for obvious reasons.
Here is a storage class for GCE persistent disk using the CSI driver:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-gce-pd
provisioner: pd.csi.storage.gke.io
parameters:
  labels: key1=value1,key2=value2
volumeBindingMode: WaitForFirstConsumer
Here is the PVC:
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: gce-pd-pvc
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: csi-gce-pd
  resources:
    requests:
      storage: 200Gi
A Pod can consume it for dynamic provisioning:
apiVersion: v1
kind: Pod
metadata:
  name: some-pod
spec:
  containers:
  - image: some-image
    name: some-container
    volumeMounts:
    - mountPath: /pd
      name: some-volume
  volumes:
  - name: some-volume
    persistentVolumeClaim:
      claimName: gce-pd-pvc
      readOnly: false
The GCE persistent disk has supported a regional disk option since Kubernetes 1.10 (in Beta). Regional persistent disks automatically sync between two zones. Here is what the storage class looks like for a regional persistent disk:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-gce-pd
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-standard
  replication-type: regional-pd
volumeBindingMode: WaitForFirstConsumer
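You can restrict which pair of zones the regional disk is replicated across by adding an allowedTopologies section to such a storage class. The zone values here are just examples:

```yaml
allowedTopologies:
- matchLabelExpressions:
  - key: topology.gke.io/zone
    values:
    - europe-west1-b
    - europe-west1-c
```

If you omit allowedTopologies, the provisioner picks the zones itself based on where the consuming pod is scheduled.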
Google Cloud Filestore is the managed NFS file service of GCP. Kubernetes doesn’t have an in-tree plugin for it, and there is no general-purpose supported CSI driver. However, there is a CSI driver used on GKE, and if you are adventurous, you may want to try it even if you’re installing Kubernetes yourself on GCP and want to use Filestore as a storage option.
See: https://github.com/kubernetes-sigs/gcp-filestore-csi-driver.
The Azure data disk is a virtual hard disk stored in Azure storage. It’s similar in capabilities to AWS EBS or a GCE persistent disk.
It also has a CSI driver called disk.csi.azure.com and supports CSIMigration. See: https://github.com/kubernetes-sigs/azuredisk-csi-driver.
Here is an example of defining an Azure disk persistent volume:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-azuredisk
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: managed-csi
  csi:
    driver: disk.csi.azure.com
    readOnly: false
    volumeHandle: /subscriptions/{sub-id}/resourcegroups/{group-name}/providers/microsoft.compute/disks/{disk-id}
    volumeAttributes:
      fsType: ext4
In addition to the mandatory diskName and diskURI parameters, the in-tree azureDisk volume source also has a few optional parameters:

- kind: The disk storage configuration. The options are Shared (allowing multiple disks per storage account), Dedicated (providing a single blob disk per storage account), or Managed (offering an Azure-managed data disk). The default is Shared.
- cachingMode: The disk caching mode. This must be one of None, ReadOnly, or ReadWrite. The default is None.
- fsType: The filesystem type to mount. The default is ext4.
- readOnly: Whether the filesystem is mounted as readOnly. The default is false.

Azure managed data disks can be as large as 32,767 GiB. The number of data disks you can attach to an Azure VM depends on its size; larger VM sizes can have more disks attached. You can attach an Azure data disk to a single Azure VM at a time.
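Putting these parameters together, a pod-level volume using the in-tree azureDisk source might look like the following sketch; the disk name, account name, and URI are placeholders:

```yaml
volumes:
- name: azure-volume
  azureDisk:
    diskName: test.vhd
    diskURI: https://{account-name}.blob.core.windows.net/vhds/test.vhd
    kind: Shared
    cachingMode: ReadOnly
    fsType: ext4
    readOnly: false
```

With the CSI driver, you would instead go through a PV or a storage class as shown earlier.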
As usual, you should create a PVC and consume it in a pod (or via a pod controller).
In addition to the data disk, Azure also has a shared filesystem similar to AWS EFS. However, Azure file storage uses the SMB/CIFS protocol (it supports SMB 2.1 and SMB 3.0). It is based on the Azure storage platform and has the same availability, durability, scalability, and geo-redundancy capabilities as Azure Blob, Table, or Queue storage.
In order to use Azure file storage, you need to install the cifs-utils package on each client VM. You also need to create a secret, which is a required parameter:
apiVersion: v1
kind: Secret
metadata:
  name: azure-file-secret
type: Opaque
data:
  azurestorageaccountname: <base64 encoded account name>
  azurestorageaccountkey: <base64 encoded account key>
Here is a pod that uses Azure file storage:
apiVersion: v1
kind: Pod
metadata:
  name: some-pod
spec:
  containers:
  - image: some-container
    name: some-container
    volumeMounts:
    - name: some-volume
      mountPath: /azure
  volumes:
  - name: some-volume
    azureFile:
      secretName: azure-file-secret
      shareName: azure-share
      readOnly: false
Azure file storage supports sharing within the same region as well as connecting on-premise clients.
This covers the public cloud storage volume types. Let’s look at some distributed storage volumes you can install on your own in your cluster.
GlusterFS and Ceph are two distributed persistent storage systems. GlusterFS is, at its core, a network filesystem, while Ceph is, at its core, an object store. Both expose block, object, and filesystem interfaces, and both use the xfs filesystem under the hood to store the data and metadata as xattr attributes. There are several reasons why you may want to use GlusterFS or Ceph as persistent volumes in your Kubernetes cluster:
Let’s take a closer look at GlusterFS.
GlusterFS is intentionally simple, exposing the underlying directories as they are and leaving it to clients (or middleware) to handle high availability, replication, and distribution. GlusterFS organizes the data into logical volumes, which encompass multiple nodes (machines) that contain bricks, which store files. Files are allocated to bricks according to a DHT (distributed hash table). If files are renamed or the GlusterFS cluster is expanded or rebalanced, files may be moved between bricks. The following diagram shows the GlusterFS building blocks:
Figure 6.2: GlusterFS building blocks
To use a GlusterFS cluster as persistent storage for Kubernetes (assuming you have an up-and-running GlusterFS cluster), you need to follow several steps. In particular, the GlusterFS nodes are managed by the plugin as a Kubernetes service.
Here is an example of an endpoints resource that you can create as a normal Kubernetes resource using kubectl create
:
kind: Endpoints
apiVersion: v1
metadata:
  name: glusterfs-cluster
subsets:
- addresses:
  - ip: 10.240.106.152
  ports:
  - port: 1
- addresses:
  - ip: 10.240.79.157
  ports:
  - port: 1
To make the endpoints persistent, you use a Kubernetes service with no selector to indicate the endpoints are managed manually:
kind: Service
apiVersion: v1
metadata:
  name: glusterfs-cluster
spec:
  ports:
  - port: 1
Finally, in the pod spec’s volumes section, provide the following information:
volumes:
- name: glusterfsvol
  glusterfs:
    endpoints: glusterfs-cluster
    path: kube_vol
    readOnly: true
The containers can then mount glusterfsvol by name.
The endpoints tell the GlusterFS volume plugin how to find the storage nodes of the GlusterFS cluster.
There was an effort to create a CSI driver for GlusterFS, but it was abandoned: https://github.com/gluster/gluster-csi-driver.
After covering GlusterFS, let’s look at Ceph.
Ceph’s object store can be accessed using multiple interfaces. Unlike GlusterFS, Ceph does a lot of work automatically. It does distribution, replication, and self-healing all on its own. The following diagram shows how RADOS – the underlying object store – can be accessed in multiple ways.
Figure 6.3: Accessing RADOS
Kubernetes supports Ceph via the Rados Block Device (RBD) interface.
You must install ceph-common on each node of the Kubernetes cluster. Once you have your Ceph cluster up and running, you need to provide some information required by the Ceph RBD volume plugin in the pod configuration file:
- monitors: Ceph monitors.
- pool: The name of the RADOS pool. If not provided, the default RBD pool is used.
- image: The image name that RBD has created.
- user: The RADOS username. If not provided, the default admin is used.
- keyring: The path to the keyring file. If not provided, the default /etc/ceph/keyring is used.
- secretName: The name of the authentication secrets. If provided, secretName overrides keyring. Note: see the following paragraph about how to create a secret.
- fsType: The filesystem type (ext4, xfs, and so on) that is formatted on the device.
- readOnly: Whether the filesystem is used as readOnly.

If the Ceph authentication secret is used, you need to create a secret object:
apiVersion: v1
kind: Secret
metadata:
  name: ceph-secret
type: "kubernetes.io/rbd"
data:
  key: QVFCMTZWMVZvRjVtRXhBQTVrQ1FzN2JCajhWVUxSdzI2Qzg0SEE9PQ==
The secret type is kubernetes.io/rbd.
Here is a sample pod that uses Ceph through RBD with a secret using the in-tree provider:
apiVersion: v1
kind: Pod
metadata:
  name: rbd2
spec:
  containers:
  - image: kubernetes/pause
    name: rbd-rw
    volumeMounts:
    - name: rbdpd
      mountPath: /mnt/rbd
  volumes:
  - name: rbdpd
    rbd:
      monitors:
      - '10.16.154.78:6789'
      - '10.16.154.82:6789'
      - '10.16.154.83:6789'
      pool: kube
      image: foo
      fsType: ext4
      readOnly: true
      user: admin
      secretRef:
        name: ceph-secret
Ceph RBD supports the ReadWriteOnce and ReadOnlyMany access modes. But, these days it is best to work with Ceph via Rook.
Rook is an open source cloud native storage orchestrator. It is currently a graduated CNCF project. It used to provide a consistent experience on top of multiple storage solutions like Ceph, EdgeFS, Cassandra, Minio, NFS, CockroachDB, and YugabyteDB. But eventually, it laser-focused on supporting only Ceph. Here are the features Rook provides:
Rook takes advantage of modern Kubernetes best practices like CRDs and operators.
Here is the Rook architecture:
Figure 6.4: Rook architecture
Once you install the Rook operator you can create a Ceph cluster using a Rook CRD such as: https://github.com/rook/rook/blob/release-1.10/deploy/examples/cluster.yaml.
Here is a shortened version (without the comments):
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph # namespace:cluster
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v17.2.5
    allowUnsupported: false
  dataDirHostPath: /var/lib/rook
  skipUpgradeChecks: false
  continueUpgradeAfterChecksEvenIfNotHealthy: false
  waitTimeoutForHealthyOSDInMinutes: 10
  mon:
    count: 3
    allowMultiplePerNode: false
  mgr:
    count: 2
    allowMultiplePerNode: false
    modules:
    - name: pg_autoscaler
      enabled: true
  dashboard:
    enabled: true
    ssl: true
  monitoring:
    enabled: false
  network:
    connections:
      encryption:
        enabled: false
      compression:
        enabled: false
  crashCollector:
    disable: false
  logCollector:
    enabled: true
    periodicity: daily # one of: hourly, daily, weekly, monthly
    maxLogSize: 500M # SUFFIX may be 'M' or 'G'. Must be at least 1M.
  cleanupPolicy:
    confirmation: ""
    sanitizeDisks:
      method: quick
      dataSource: zero
      iteration: 1
    allowUninstallWithVolumes: false
  annotations:
  labels:
  resources:
  removeOSDsIfOutAndSafeToRemove: false
  priorityClassNames:
    mon: system-node-critical
    osd: system-node-critical
    mgr: system-cluster-critical
  storage: # cluster level storage configuration and selection
    useAllNodes: true
    useAllDevices: true
    config:
    onlyApplyOSDPlacement: false
  disruptionManagement:
    managePodBudgets: true
    osdMaintenanceTimeout: 30
    pgHealthCheckTimeout: 0
    manageMachineDisruptionBudgets: false
    machineDisruptionBudgetNamespace: openshift-machine-api
  healthCheck:
    daemonHealth:
      mon:
        disabled: false
        interval: 45s
      osd:
        disabled: false
        interval: 60s
      status:
        disabled: false
        interval: 60s
    livenessProbe:
      mon:
        disabled: false
      mgr:
        disabled: false
      osd:
        disabled: false
    startupProbe:
      mon:
        disabled: false
      mgr:
        disabled: false
      osd:
        disabled: false
Here is a Rook storage class for Ceph object storage buckets:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-ceph-retain-bucket
provisioner: rook-ceph.ceph.rook.io/bucket # driver:namespace:cluster
# set the reclaim policy to retain the bucket when its OBC is deleted
reclaimPolicy: Retain
parameters:
  objectStoreName: my-store # port 80 assumed
  objectStoreNamespace: rook-ceph # namespace:cluster
The full code is available here: https://github.com/rook/rook/blob/release-1.10/deploy/examples/storageclass-bucket-retain.yaml.
Now that we’ve covered using distributed storage using GlusterFS, Ceph, and Rook, let’s look at enterprise storage options.
If you have an existing Storage Area Network (SAN) exposed over the iSCSI interface, Kubernetes has a volume plugin for you. It follows the same model as the other shared persistent storage plugins we’ve seen earlier, including support for multipath I/O via multipathd.
You must configure the iSCSI initiator, but you don’t have to provide any initiator information. All you need to provide is the following:

- The IP address of the target portal and the port (the default is 3260)
- The target’s iqn (iSCSI qualified name)
- The LUN (logical unit number)
- The filesystem type
- A readOnly Boolean flag

The iSCSI plugin supports ReadWriteOnce and ReadOnlyMany. Note that you can’t partition your device at this time. Here is an example pod with an iSCSI volume spec:
---
apiVersion: v1
kind: Pod
metadata:
  name: iscsipd
spec:
  containers:
  - name: iscsipd-rw
    image: kubernetes/pause
    volumeMounts:
    - mountPath: "/mnt/iscsipd"
      name: iscsipd-rw
  volumes:
  - name: iscsipd-rw
    iscsi:
      targetPortal: 10.0.2.15:3260
      portals: ['10.0.2.16:3260', '10.0.2.17:3260']
      iqn: iqn.2001-04.com.example:storage.kube.sys1.xyz
      lun: 0
      fsType: ext4
      readOnly: true
The Kubernetes storage scene keeps innovating. A lot of companies adapt their products to Kubernetes and some companies and organizations build Kubernetes-dedicated storage solutions. Here are some of the more popular and mature solutions:
The Container Storage Interface (CSI) is a standard interface for the interaction between container orchestrators and storage providers. It was developed by Kubernetes, Docker, Mesos, and Cloud Foundry. The idea is that storage providers implement just one CSI driver and all container orchestrators need to support only the CSI. It is the equivalent of CNI for storage.
A CSI volume plugin was added in Kubernetes 1.9 as an Alpha feature and has been generally available since Kubernetes 1.13. The older FlexVolume approach (which you may have come across) is deprecated now.
Here is a diagram that demonstrates how CSI works within Kubernetes:
Figure 6.5: CSI architecture
The migration effort to port all in-tree plugins to out-of-tree CSI drivers is well underway. See https://kubernetes-csi.github.io for more details.
These features are available only to CSI drivers. They represent the benefits of a uniform storage model that allows adding optional advanced functionality across all storage providers with a uniform interface.
Volume snapshots are generally available as of Kubernetes 1.20. They are exactly what they sound like – a snapshot of a volume at a certain point in time. You can create and later restore volumes from a snapshot. It’s interesting that the API objects associated with snapshots are CRDs and not part of the core Kubernetes API. The objects are:
- VolumeSnapshotClass
- VolumeSnapshotContents
- VolumeSnapshot
Volume snapshots work using an external-snapshotter sidecar container that the Kubernetes team developed. It watches for snapshot CRDs to be created and interacts with the snapshot controller, which can invoke the CreateSnapshot and DeleteSnapshot operations of CSI drivers that implement snapshot support.
Here is how to declare a volume snapshot:
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: new-snapshot-test
spec:
  volumeSnapshotClassName: csi-hostpath-snapclass
  source:
    persistentVolumeClaimName: pvc-test
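The referenced snapshot class determines which CSI driver takes the snapshot and what happens to the snapshot contents on deletion. Here is a sketch of such a class; the driver name matches the hostpath naming above, but treat it as illustrative:

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-hostpath-snapclass
driver: hostpath.csi.k8s.io
deletionPolicy: Delete
```

With deletionPolicy set to Retain instead, the underlying snapshot would survive deletion of the VolumeSnapshot object.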
You can also provision volumes from a snapshot.
Here is a persistent volume claim bound to a snapshot:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restore-pvc
spec:
  storageClassName: csi-hostpath-sc
  dataSource:
    name: new-snapshot-test
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
See https://github.com/kubernetes-csi/external-snapshotter#design for more details.
Volume cloning has been generally available since Kubernetes 1.18. Volume clones are new volumes that are populated with the content of an existing volume. Once the cloning is complete, there is no relation between the original and the clone, and their content will diverge over time. You could perform a clone manually by creating a snapshot and then creating a new volume from the snapshot, but volume cloning is more streamlined and efficient.
It only works for dynamic provisioning and uses the storage class of the source volume for the clone as well. You initiate a volume clone by specifying an existing persistent volume claim as a data source of a new persistent volume claim. That triggers the dynamic provisioning of a new volume that clones the source claim’s volume:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: clone-of-pvc-1
  namespace: myns
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: cloning
  resources:
    requests:
      storage: 5Gi
  dataSource:
    kind: PersistentVolumeClaim
    name: pvc-1
See https://kubernetes.io/docs/concepts/storage/volume-pvc-datasource/ for more details.
Storage capacity tracking (GA as of Kubernetes 1.24) allows the scheduler to better schedule pods that require storage into nodes that can provide that storage. This requires a CSI driver that supports storage capacity tracking.
The CSI driver will create a CSIStorageCapacity object for each storage class and determine which nodes have access to this storage. In addition, the storageCapacity field of the CSIDriver object must be set to true.
When a pod specifies a storage class name in WaitForFirstConsumer mode and the CSI driver has storageCapacity set to true, the Kubernetes scheduler will consider the CSIStorageCapacity objects associated with the storage class and schedule the pod only to nodes that have sufficient storage.
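The capacity objects themselves are simple. Here is a hypothetical example of what a driver might publish (drivers normally create these objects themselves; the names and zone label value are illustrative):

```yaml
apiVersion: storage.k8s.io/v1
kind: CSIStorageCapacity
metadata:
  name: csisc-example
nodeTopology:
  matchLabels:
    topology.kubernetes.io/zone: us-east-1a
storageClassName: fast-storage
capacity: 100Gi
```

The scheduler compares a pending claim’s requested size against the capacity reported for the topology segment of each candidate node.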
Check out: https://kubernetes.io/docs/concepts/storage/storage-capacity for more details.
Volume health monitoring is a recent addition to the storage APIs. It has been in Alpha since Kubernetes 1.21. It involves two components:
CSI drivers that support volume health monitoring will update PVCs with events on abnormal conditions of associated storage volumes. The external health monitor also watches nodes for failures and will report events on PVCs bound to these nodes.
In the case where a CSI driver enables volume health monitoring from the node side, any abnormal condition detected will result in an event being reported for every pod that utilizes a PVC with the corresponding issue.
There is also a new metric associated with volume health: kubelet_volume_stats_health_status_abnormal. It has two labels, namespace and persistentvolumeclaim, and its value is 0 or 1.
More details are available here: https://kubernetes.io/docs/concepts/storage/volume-health-monitoring/.
CSI is an exciting initiative that simplifies the Kubernetes code base itself by externalizing storage drivers. It eases the life of storage vendors, who can develop out-of-tree drivers, and it adds a lot of advanced capabilities to the Kubernetes storage story.
In this chapter, we took a deep look into storage in Kubernetes. We’ve looked at the generic conceptual model based on volumes, claims, and storage classes, as well as the implementation of volume plugins. Kubernetes eventually maps all storage systems into mounted filesystems in containers or devices of raw block storage. This straightforward model allows administrators to configure and hook up any storage system from local host directories, through cloud-based shared storage, all the way to enterprise storage systems. The transition of storage provisioners from in-tree to CSI-based out-of-tree drivers bodes well for the storage ecosystem. You should now have a clear understanding of how storage is modeled and implemented in Kubernetes and be able to make intelligent choices on how to implement storage in your Kubernetes cluster.
In Chapter 7, Running Stateful Applications with Kubernetes, we’ll see how Kubernetes can raise the level of abstraction and, on top of storage, help to develop, deploy, and operate stateful applications using concepts such as stateful sets.