6

Managing Storage

In this chapter, we’ll look at how Kubernetes manages storage. Storage is very different from compute, but at a high level they are both resources. Kubernetes, as a generic platform, takes the approach of abstracting storage behind a programming model and a set of plugins for storage providers. First, we’ll go into detail about the conceptual storage model and how storage is made available to containers in the cluster. Then, we’ll cover the common cloud platform storage providers, such as Amazon Web Services (AWS), Google Compute Engine (GCE), and Azure. Next, we’ll look at a prominent open source storage provider, GlusterFS from Red Hat, which provides a distributed filesystem. We’ll also look into another solution – Ceph – that manages your data in containers as part of the Kubernetes cluster using the Rook operator. We’ll see how Kubernetes supports the integration of existing enterprise storage solutions. Finally, we will explore the Container Storage Interface (CSI) and all the advanced capabilities it brings to the table.

This chapter will cover the following main topics:

  • Persistent volumes walk-through
  • Demonstrating persistent volume storage end to end
  • Public cloud storage volume types – GCE, AWS, and Azure
  • GlusterFS and Ceph volumes in Kubernetes
  • Integrating enterprise storage into Kubernetes
  • The Container Storage Interface

At the end of this chapter, you’ll have a solid understanding of how storage is represented in Kubernetes, the various storage options in each deployment environment (local testing, public cloud, and enterprise), and how to choose the best option for your use case.

You should try the code samples in this chapter on minikube, or another cluster that supports storage adequately. The KinD cluster has some problems related to labeling nodes, which is necessary for some storage solutions.

Persistent volumes walk-through

In this section, we will understand the Kubernetes storage conceptual model and see how to map persistent storage into containers, so they can read and write. Let’s start by understanding the problem of storage.

Containers and pods are ephemeral. Anything a container writes to its own filesystem gets wiped out when the container dies. Containers can also mount directories from their host node and read or write to them. These will survive container restarts, but the nodes themselves are not immortal. Also, if the pod itself is evicted and scheduled to a different node, the pod’s containers will not have access to the old node host’s filesystem.

There are other problems, such as ownership of mounted hosted directories when the container dies. Just imagine a bunch of containers writing important data to various data directories on their host and then going away, leaving all that data all over the nodes with no direct way to tell what container wrote what data. You can try to record this information, but where would you record it? It’s pretty clear that for a large-scale system, you need persistent storage accessible from any node to reliably manage the data.

Understanding volumes

The basic Kubernetes storage abstraction is the volume. Containers mount volumes that are bound to their pod, and they access the storage wherever it may be as if it’s in their local filesystem. This is nothing new, and it is great, because as a developer who writes applications that need access to data, you don’t have to worry about where and how the data is stored. Kubernetes supports many types of volumes with their own distinctive features. Let’s review some of the main volume types.

Using emptyDir for intra-pod communication

It is very simple to share data between containers in the same pod using a shared volume. Container 1 and container 2 simply mount the same volume and can communicate by reading and writing to this shared space. The most basic volume is the emptyDir. An emptyDir volume is an empty directory on the host. Note that it is not persistent, because when the pod is evicted or deleted, its contents are erased. If a container just crashes, the pod will stick around and the restarted container can access the data in the volume. Another very interesting option is to use a RAM disk, by specifying the medium as Memory. Now, your containers communicate through shared memory, which is much faster but more volatile, of course. If the node is restarted, the emptyDir volume’s contents are lost.

Here is a pod configuration file that has two containers that mount the same volume, called shared-volume. The containers mount it in different paths, but when the hue-global-listener container is writing a file to /notifications, the hue-job-scheduler will see that file under /incoming:

apiVersion: v1
kind: Pod
metadata:
  name: hue-scheduler
spec:
  containers:
  - image: g1g1/hue-global-listener:1.0
    name: hue-global-listener
    volumeMounts:
    - mountPath: /notifications
      name: shared-volume
  - image: g1g1/hue-job-scheduler:1.0
    name: hue-job-scheduler
    volumeMounts:
    - mountPath: /incoming
      name: shared-volume
  volumes:
  - name: shared-volume
    emptyDir: {}

To use the shared memory option, we just need to add medium: Memory to the emptyDir section:

  volumes:
  - name: shared-volume
    emptyDir:
     medium: Memory

Note that memory-based emptyDir counts toward the container’s memory limit.
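If you are worried about a misbehaving container filling up the node’s memory or disk, emptyDir also supports an optional sizeLimit field. A minimal sketch, reusing the shared-volume name from the example above:

```yaml
  volumes:
  - name: shared-volume
    emptyDir:
      medium: Memory
      sizeLimit: 128Mi
```

If the volume grows beyond the limit, the kubelet evicts the pod.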

To verify it worked, let’s create the pod and then write a file using one container and read it using the other container:

$ k create -f hue-scheduler.yaml
pod/hue-scheduler created

Note that the pod has two containers:

$ k get pod hue-scheduler -o json | jq .spec.containers
[
  {
    "image": "g1g1/hue-global-listener:1.0",
    "name": "hue-global-listener",
    "volumeMounts": [
      {
        "mountPath": "/notifications",
        "name": "shared-volume"
      },
      ...
    ]
    ...
  },
  {
    "image": "g1g1/hue-job-scheduler:1.0",
    "name": "hue-job-scheduler",
    "volumeMounts": [
      {
        "mountPath": "/incoming",
        "name": "shared-volume"
      },
      ...  
    ]
    ...
  }
]

Now, we can create a file in the /notifications directory of the hue-global-listener container and list it in the /incoming directory of the hue-job-scheduler container:

$ kubectl exec -it hue-scheduler -c hue-global-listener -- touch /notifications/1.txt
$ kubectl exec -it hue-scheduler -c hue-job-scheduler -- ls /incoming
1.txt

As you can see, a file created in one container is visible in the filesystem of the other container, so the two containers can communicate via the shared filesystem.

Using HostPath for intra-node communication

Sometimes, you want your pods to get access to some host information (for example, the Docker daemon) or you want pods on the same node to communicate with each other. This is useful if the pods know they are on the same host. Since Kubernetes schedules pods based on available resources, pods usually don’t know what other pods they share the node with. There are several cases where a pod can rely on other pods being scheduled with it on the same node:

  • In a single-node cluster, all pods obviously share the same node
  • DaemonSet pods always share a node with any other pod that matches their selector
  • Pods with required pod affinity are always scheduled together

For example, in Chapter 5, Using Kubernetes Resources in Practice, we discussed a DaemonSet pod that serves as an aggregating proxy to other pods. Another way to implement this behavior is for the pods to simply write their data to a mounted volume that is bound to a host directory, and the DaemonSet pod can directly read it and act on it.

A HostPath volume is a host file or directory that is mounted into a pod. Before you decide to use the HostPath volume, make sure you understand the consequences:

  • It is a security risk since access to the host filesystem can expose sensitive data (e.g. kubelet keys)
  • The behavior of pods with the same configuration might be different if they are data-driven and the files on their host are different
  • It can violate resource-based scheduling because Kubernetes can’t monitor HostPath resources
  • The containers that access host directories must have a security context with privileged set to true or, on the host side, you need to change the permissions to allow writing
  • It’s difficult to coordinate disk usage across multiple pods on the same node
  • You can easily run out of disk space

Here is a configuration file that mounts the /coupons directory into the hue-coupon-hunter container, which is mapped to the host’s /etc/hue/data/coupons directory:

apiVersion: v1
kind: Pod
metadata:
  name: hue-coupon-hunter
spec:
  containers:
  - image: the_g1g1/hue-coupon-hunter
    name: hue-coupon-hunter
    volumeMounts:
    - mountPath: /coupons
      name: coupons-volume
  volumes:
  - name: coupons-volume
    hostPath:
      path: /etc/hue/data/coupons

Since the pod doesn’t have a privileged security context, it will not be able to write to the host directory. Let’s change the container spec to enable it by adding a security context:

  - image: the_g1g1/hue-coupon-hunter
    name: hue-coupon-hunter
    volumeMounts:
    - mountPath: /coupons
      name: coupons-volume
    securityContext:
      privileged: true
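Another safeguard worth knowing about is the optional type field of hostPath, which tells the kubelet to validate (or create) the host path before mounting. A sketch using the same hypothetical coupons directory:

```yaml
  volumes:
  - name: coupons-volume
    hostPath:
      path: /etc/hue/data/coupons
      type: DirectoryOrCreate
```

DirectoryOrCreate creates an empty directory if it doesn’t exist; other useful values include Directory, File, FileOrCreate, and Socket.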

In the following diagram, you can see that each container has its own local storage area inaccessible to other containers or pods, and the host’s /data directory is mounted as a volume into both container 1 and container 2:

Figure 6.1: Container local storage

Using local volumes for durable node storage

Local volumes are similar to HostPath, but they persist across pod restarts and node restarts. In that sense, they are considered persistent volumes. They were added in Kubernetes 1.7 and have been considered stable since Kubernetes 1.14. The purpose of local volumes is to support StatefulSets where specific pods need to be scheduled on nodes that contain specific storage volumes. Local volumes have node affinity rules that simplify the binding of pods to the storage they need to access.

We need to define a storage class for using local volumes. We will cover storage classes in depth later in this chapter. In one sentence, storage classes use a provisioner to allocate storage to pods. Let’s define the storage class in a file called local-storage-class.yaml and create it:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
$ k create -f local-storage-class.yaml
storageclass.storage.k8s.io/local-storage created 

Now that we have the storage class, we can create persistent volumes that use it and that will persist even after the pod using them is terminated. We will do exactly that in the Creating a local volume section later in this chapter.

Provisioning persistent volumes

While emptyDir volumes can be mounted and used by containers, they are not persistent and don’t require any special provisioning because they use existing storage on the node. HostPath volumes persist on the original node, but if a pod is restarted on a different node, it can’t access the HostPath volume from its previous node. Local volumes are real persistent volumes that use storage provisioned ahead of time by administrators or dynamic provisioning via storage classes. They persist on the node and can survive pod restarts and rescheduling and even node restarts. Some persistent volumes use external storage (not a disk physically attached to the node) provisioned ahead of time by administrators. In cloud environments, the provisioning may be very streamlined, but it is still required, and as a Kubernetes cluster administrator you have to at least make sure your storage quota is adequate and monitor usage versus quota diligently.

Remember that persistent volumes are cluster-level resources, similar to nodes. As such, they don’t belong to any namespace, and their lifecycle is managed separately from the pods that use them.

You can provision resources statically or dynamically.

Provisioning persistent volumes statically

Static provisioning is straightforward. The cluster administrator creates persistent volumes backed up by some storage media ahead of time, and these persistent volumes can be claimed by containers.

Provisioning persistent volumes dynamically

Dynamic provisioning may happen when a persistent volume claim doesn’t match any of the statically provisioned persistent volumes. If the claim specified a storage class and the administrator configured that class for dynamic provisioning, then a persistent volume may be provisioned on the fly. We will see examples later when we discuss persistent volume claims and storage classes.
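For example, assuming the administrator has defined a storage class named fast that is backed by a dynamic provisioner (the class name here is hypothetical), the following claim would trigger on-the-fly provisioning if no matching static volume exists:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dynamic-claim
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: fast
  resources:
    requests:
      storage: 20Gi
```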

Provisioning persistent volumes externally

Kubernetes originally contained a lot of code for storage provisioning “in-tree” as part of the main Kubernetes code base. With the introduction of CSI, storage provisioners started to migrate out of Kubernetes core into volume plugins (AKA out-of-tree). External provisioners work just like in-tree dynamic provisioners but can be deployed and updated independently. Most in-tree storage provisioners have been migrated out-of-tree. Check out this project for a library and guidelines for writing external storage provisioners: https://github.com/kubernetes-sigs/sig-storage-lib-external-provisioner.

Creating persistent volumes

Here is the configuration file for an NFS persistent volume:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-777
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Recycle
  storageClassName: slow
  mountOptions:
    - hard
    - nfsvers=4.2
  nfs:
    path: /tmp
    server: nfs-server.default.svc.cluster.local

A persistent volume has a spec and metadata that possibly includes a storage class name. Let’s focus on the spec here. There are six sections: capacity, volume mode, access modes, reclaim policy, storage class, and the volume type (nfs in the example).

Capacity

Each volume has a designated amount of storage. Storage claims may be satisfied by persistent volumes that have at least that amount of storage. In the example, the persistent volume has a capacity of 10 gibibytes (a single gibibyte is 2 to the power of 30 bytes).

capacity:
    storage: 10Gi

It is important when allocating static persistent volumes to understand the storage request patterns. For example, if you provision 20 persistent volumes with 100 GiB capacity and a container claims a persistent volume with 150 GiB, then this claim will not be satisfied even though there is enough capacity overall in the cluster.

Volume mode

The optional volume mode was added in Kubernetes 1.9 as an Alpha feature (moved to Beta in Kubernetes 1.13) for static provisioning. It lets you specify if you want a file system (Filesystem) or raw storage (Block). If you don’t specify volume mode, then the default is Filesystem, just like it was pre-1.9.

Access modes

There are three access modes:

  • ReadOnlyMany: Can be mounted read-only by many nodes
  • ReadWriteOnce: Can be mounted as read-write by a single node
  • ReadWriteMany: Can be mounted as read-write by many nodes

The storage is mounted to nodes, so even with ReadWriteOnce, multiple containers on the same node can mount the volume and write to it. If that causes a problem, you need to handle it through some other mechanism (for example, claim the volume only in DaemonSet pods that you know will have just one per node).

Different storage providers support some subset of these modes. When you provision a persistent volume, you can specify which modes it will support. For example, NFS supports all modes, but in the example, only these modes were enabled:

accessModes:
    - ReadWriteMany
    - ReadOnlyMany

Reclaim policy

The reclaim policy determines what happens when a persistent volume claim is deleted. There are three different policies:

  • Retain – the volume will need to be reclaimed manually
  • Delete – the content, the volume, and the backing storage are removed
  • Recycle – delete content only (rm -rf /volume/*)

The Retain and Delete policies mean the persistent volume is not available anymore for future claims. The Recycle policy allows the volume to be claimed again.

At the moment, NFS and HostPath support the recycle policy, while AWS EBS, GCE PD, Azure disk, and Cinder volumes support the delete policy. Note that dynamically provisioned volumes inherit the reclaim policy of their storage class, which is Delete by default.

Storage class

You can specify a storage class using the optional storageClassName field of the spec. If you do then only persistent volume claims that specify the same storage class can be bound to the persistent volume. If you don’t specify a storage class, then only PV claims that don’t specify a storage class can be bound to it.

  storageClassName: slow

Volume type

The volume type is specified by name in the spec. There is no volumeType stanza in the spec. In the preceding example, nfs is the volume type:

nfs:
    path: /tmp
    server: nfs-server.default.svc.cluster.local

Each volume type may have its own set of parameters. In this case, it’s a path and server.

We will go over various volume types later.

Mount options

Some persistent volume types have additional mount options you can specify. The mount options are not validated up front, so if you provide an invalid mount option, mounting the volume will fail. For example, NFS supports additional mount options:

  mountOptions:
    - hard
    - nfsvers=4.2

Now that we have looked at provisioning a single persistent volume, let’s look at projected volumes, which add more flexibility and abstraction of storage.

Projected volumes

Projected volumes allow you to map several existing volume sources into the same directory. You need to be careful about naming conflicts, of course.

The following volume types support projected volumes:

  • ConfigMap
  • Secret
  • DownwardAPI
  • ServiceAccountToken

The snippet below projects a ConfigMap and a Secret into the same directory:

apiVersion: v1
kind: Pod
metadata:
  name: projected-volumes-demo
spec:
  containers:
  - name: projected-volumes-demo
    image: busybox:1.28
    volumeMounts:
    - name: projected-volumes-demo
      mountPath: "/projected-volume"
      readOnly: true
  volumes:
  - name: projected-volumes-demo
    projected:
      sources:
      - secret:
          name: the-user
          items:
            - key: username
              path: the-group/the-user
      - configMap:
          name: the-config-map
          items:
            - key: config
              path: the-group/the-config-map

The parameters for projected volumes are very similar to regular volumes. The exceptions are:

  • To maintain consistency with ConfigMap naming, the field secretName has been updated to name for secrets.
  • The defaultMode can only be set at the projected level and cannot be specified individually for each volume source (but you can specify the mode explicitly for each projection).

Let’s look at a special kind of projected volume – the serviceAccountToken projected volume.

serviceAccountToken projected volumes

Kubernetes pods can access the Kubernetes API server using the permissions of the service account associated with the pod. serviceAccountToken projected volumes give you more granularity and control from a security standpoint. The token can have an expiration and a specific audience.

More details are available here: https://kubernetes.io/docs/concepts/storage/projected-volumes/#serviceaccounttoken.
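To make this concrete, here is a sketch of a pod that projects a service account token restricted to a specific audience and lifetime (the vault audience is just an illustrative value):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sa-token-demo
spec:
  containers:
  - name: main
    image: busybox:1.28
    command: ["sleep", "1000000"]
    volumeMounts:
    - name: token-volume
      mountPath: /var/run/secrets/tokens
      readOnly: true
  volumes:
  - name: token-volume
    projected:
      sources:
      - serviceAccountToken:
          audience: vault
          expirationSeconds: 3600
          path: vault-token
```

The kubelet takes care of rotating the token before it expires, so the application only needs to re-read the file periodically.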

Creating a local volume

Local volumes are static persistent disks that are allocated on a specific node. They are similar to HostPath volumes, but Kubernetes knows which node a local volume belongs to and will always schedule pods that bind to that local volume onto that node. This means the pod will not be evicted and scheduled to another node where the data is not available.

Let’s create a local volume. First, we need to create a backing directory. For KinD and k3d clusters you can access the node through Docker:

$ docker exec -it k3d-k3s-default-agent-1 mkdir -p /mnt/disks/disk-1
$ docker exec -it k3d-k3s-default-agent-1 ls -la /mnt/disks
total 12
drwxr-xr-x 3 0 0 4096 Jun 29 21:40 .
drwxr-xr-x 3 0 0 4096 Jun 29 21:40 ..
drwxr-xr-x 2 0 0 4096 Jun 29 21:40 disk-1

For minikube you need to use minikube ssh.

Now, we can create a local volume backed by the /mnt/disks/disk-1 directory:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv
  labels:
    release: stable
    capacity: 10Gi
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-storage
  local:
    path: /mnt/disks/disk-1
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - k3d-k3s-default-agent-1

Here is the create command:

$ k create -f local-volume.yaml
persistentvolume/local-pv created
$ k get pv
NAME       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS    REASON   AGE
local-pv   10Gi       RWO            Delete           Available           local-storage            6m44s

Making persistent volume claims

When containers want access to some persistent storage they make a claim (or rather, the developer and cluster administrator coordinate on necessary storage resources to claim). Here is a sample claim that matches the persistent volume from the previous section - Creating a local volume:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: local-storage-claim
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 8Gi
  storageClassName: local-storage
  selector:
    matchLabels:
      release: "stable"
    matchExpressions:
      - {key: capacity, operator: In, values: [8Gi, 10Gi]}

Let’s create the claim and then explain what the different pieces do:

$ k create -f local-persistent-volume-claim.yaml
persistentvolumeclaim/local-storage-claim created
$ k get pvc
NAME                  STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS    AGE
local-storage-claim   Pending                                      local-storage   21m

The name local-storage-claim will be important later when mounting the claim into a container.

The access mode in the spec is ReadWriteOnce, which means if the claim is satisfied no other claim with the ReadWriteOnce access mode can be satisfied, but claims for ReadOnlyMany can still be satisfied.

The resources section requests 8 GiB. This can be satisfied by our persistent volume, which has a capacity of 10 Gi. But, this is a little wasteful because 2 Gi will not be used by definition.

The storage class name is local-storage. As mentioned earlier it must match the class name of the persistent volume. However, with PVC there is a difference between an empty class name ("") and no class name at all. The former (an empty class name) matches persistent volumes with no storage class name. The latter (no class name) will be able to bind to persistent volumes only if the DefaultStorageClass admission plugin is turned on and the default storage class is used.
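For example, a claim that explicitly opts out of storage classes (and therefore of dynamic provisioning) sets the class name to an empty string; it can bind only to persistent volumes that have no storage class:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: no-class-claim
spec:
  storageClassName: ""
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
```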

The selector section allows you to filter available volumes further. For example, here the volume must match the label release:stable and also have a label with either capacity:8Gi or capacity:10Gi. Imagine that we have several other volumes provisioned with capacities of 20 Gi and 50 Gi. We don’t want to claim a 50 Gi volume when we only need 8 Gi.

Kubernetes always tries to match the smallest volume that can satisfy a claim, but if there are no 8 Gi or 10 Gi volumes, the labels will prevent binding to a 20 Gi or 50 Gi volume, and the claim may fall back to dynamic provisioning instead.

It’s important to realize that claims don’t mention volumes by name. You can’t claim a specific volume. The matching is done by Kubernetes based on storage class, capacity, and labels.

Finally, persistent volume claims belong to a namespace, and binding a persistent volume to a claim is exclusive. This effectively ties the bound persistent volume to the claim’s namespace. Even if the access mode is ReadOnlyMany or ReadWriteMany, all the pods that mount the persistent volume claim must be from that claim’s namespace.

Mounting claims as volumes

OK. We have provisioned a volume and claimed it. It’s time to use the claimed storage in a container. This turns out to be pretty simple. First, the persistent volume claim must be used as a volume in the pod and then the containers in the pod can mount it, just like any other volume. Here is a pod manifest that specifies the persistent volume claim we created earlier (bound to the local persistent volume we provisioned):

kind: Pod
apiVersion: v1
metadata:
  name: the-pod
spec:
  containers:
    - name: the-container
      image: g1g1/py-kube:0.3
      volumeMounts:
      - mountPath: "/mnt/data"
        name: persistent-volume
  volumes:
    - name: persistent-volume
      persistentVolumeClaim:
        claimName: local-storage-claim

The key is the persistentVolumeClaim section under volumes. The claim name (local-storage-claim here) uniquely identifies the specific claim within the current namespace and makes it available as a volume (named persistent-volume here). Then, the container can refer to it by name and mount it to /mnt/data.

Before we create the pod it’s important to note that the persistent volume claim didn’t actually claim any storage yet and wasn’t bound to our local volume. The claim is pending until some container actually attempts to mount a volume using the claim:

$ k get pvc
NAME                  STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS    AGE
local-storage-claim   Pending                                      local-storage   6m14s

Now, the claim will be bound when creating the pod:

$ k create -f pod-with-local-claim.yaml
pod/the-pod created
$ k get pvc
NAME                  STATUS   VOLUME     CAPACITY   ACCESS MODES   STORAGECLASS    AGE
local-storage-claim   Bound    local-pv   10Gi       RWO            local-storage   20m

Raw block volumes

Kubernetes 1.9 added this capability as an Alpha feature. Kubernetes 1.13 moved it to Beta. Since Kubernetes 1.18 it is GA.

Raw block volumes provide direct access to the underlying storage, which is not mediated via a file system abstraction. This is very useful for applications that require high-performance storage like databases or when consistent I/O performance and low latency are needed. The following storage providers support raw block volumes:

  • AWSElasticBlockStore
  • AzureDisk
  • FC (Fibre Channel)
  • GCE Persistent Disk
  • iSCSI
  • Local volume
  • OpenStack Cinder
  • RBD (Ceph Block Device)
  • VsphereVolume

In addition, many CSI storage providers also offer raw block volumes. For the full list, check out https://kubernetes-csi.github.io/docs/drivers.html.

Here is how to define a raw block volume using the Fibre Channel (FC) provider:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: block-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  volumeMode: Block
  persistentVolumeReclaimPolicy: Retain
  fc:
    targetWWNs: ["50060e801049cfd1"]
    lun: 0
    readOnly: false

A matching Persistent Volume Claim (PVC) MUST specify volumeMode: Block as well. Here is what it looks like:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: block-pvc
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Block
  resources:
    requests:
      storage: 10Gi

Pods consume raw block volumes as devices under /dev and NOT as mounted filesystems. Containers can then access these devices and read/write to them. In practice this means that I/O requests to block storage go directly to the underlying block storage and don’t pass through the file system drivers. This is in theory faster, but in practice it can actually decrease performance if your application benefits from file system buffering.

Here is a pod with a container that consumes the block-pvc claim, exposing the raw block storage as a device named /dev/xvda:

apiVersion: v1
kind: Pod
metadata:
  name: pod-with-block-volume
spec:
  containers:
    - name: fc-container
      image: fedora:26
      command: ["/bin/sh", "-c"]
      args: ["tail -f /dev/null"]
      volumeDevices:
        - name: data
          devicePath: /dev/xvda
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: block-pvc

CSI ephemeral volumes

We will cover the Container Storage Interface (CSI) in detail later in the chapter in the section The Container Storage Interface. CSI ephemeral volumes are backed by local storage on the node. These volumes’ lifecycles are tied to the pod’s lifecycle. In addition, they can only be mounted by containers of that pod, which is useful for populating secrets and certificates directly into a pod, without going through a Kubernetes secret object.

Here is an example of a pod with a CSI ephemeral volume:

kind: Pod
apiVersion: v1
metadata:
  name: the-pod
spec:
  containers:
    - name: the-container
      image: g1g1/py-kube:0.3
      volumeMounts:
        - mountPath: "/data"
          name: the-volume
      command: [ "sleep", "1000000" ]
  volumes:
    - name: the-volume
      csi:
        driver: inline.storage.kubernetes.io
        volumeAttributes:
          key: value

CSI ephemeral volumes have been GA since Kubernetes 1.25. However, they may not be supported by all CSI drivers. As usual check the list: https://kubernetes-csi.github.io/docs/drivers.html.

Generic ephemeral volumes

Generic ephemeral volumes are yet another volume type that is tied to the pod lifecycle. When the pod is gone the generic ephemeral volume is gone.

This volume type actually creates a full-fledged persistent volume claim. This provides several capabilities:

  • The storage for the volume can be either local or network-attached.
  • The volume has the option to be provisioned with a fixed size.
  • Depending on the driver and specified parameters, the volume may contain initial data.
  • If supported by the driver, typical operations such as snapshotting, cloning, resizing, and storage capacity tracking can be performed on the volumes.

Here is an example of a pod with a generic ephemeral volume:

kind: Pod
apiVersion: v1
metadata:
  name: the-pod
spec:
  containers:
    - name: the-container
      image: g1g1/py-kube:0.3
      volumeMounts:
        - mountPath: "/data"
          name: the-volume
      command: [ "sleep", "1000000" ]
  volumes:
    - name: the-volume
      ephemeral:
        volumeClaimTemplate:
          metadata:
            labels:
              type: generic-ephemeral-volume
          spec:
            accessModes: [ "ReadWriteOnce" ]
            storageClassName: generic-storage
            resources:
              requests:
                storage: 1Gi

Note that, from a security point of view, users who have permission to create pods, but not PVCs, can now create PVCs via generic ephemeral volumes. To prevent that, it is possible to use admission control.

Storage classes

We’ve run into storage classes already. What are they exactly? Storage classes let an administrator configure a cluster with custom persistent storage (as long as there is a proper plugin to support it). A storage class has a name in the metadata (it must be specified in the storageClassName field of the claim), a provisioner, a reclaim policy, and parameters.

We declared a storage class for local storage earlier. Here is a sample storage class that uses AWS EBS as a provisioner (so, it works only on AWS):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
reclaimPolicy: Retain
allowVolumeExpansion: true
mountOptions:
  - debug
volumeBindingMode: Immediate

You may create multiple storage classes for the same provisioner with different parameters. Each provisioner has its own parameters.
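For example, the following sketch defines a second class for the same AWS EBS provisioner that uses provisioned-IOPS volumes; the class name and parameter values are illustrative:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast
provisioner: kubernetes.io/aws-ebs
parameters:
  type: io1          # provisioned-IOPS volume type
  iopsPerGB: "50"    # parameter specific to the aws-ebs provisioner
reclaimPolicy: Delete
volumeBindingMode: Immediate
```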

The currently supported provisioners are:

  • AWSElasticBlockStore
  • AzureFile
  • AzureDisk
  • CephFS
  • Cinder
  • FC
  • FlexVolume
  • Flocker
  • GCE Persistent Disk
  • GlusterFS
  • iSCSI
  • Quobyte
  • NFS
  • RBD
  • VsphereVolume
  • PortworxVolume
  • ScaleIO
  • StorageOS
  • Local

This list doesn’t contain provisioners for other volume types, such as configMap or secret, that are not backed by your typical network storage. Those volume types don’t require a storage class. Utilizing volume types intelligently is a major part of architecting and managing your cluster.

Default storage class

The cluster administrator can also assign a default storage class. When a default storage class is assigned and the DefaultStorageClass admission plugin is turned on, then claims with no storage class will be dynamically provisioned using the default storage class. If the default storage class is not defined or the admission plugin is not turned on, then claims with no storage class can only match volumes with no storage class.
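A storage class is marked as the default via an annotation. Here is a sketch, reusing the AWS EBS provisioner from the earlier example:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
  annotations:
    # This annotation makes the class the cluster-wide default
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
```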

We covered a lot of ground and a lot of options for provisioning storage and using it in different ways. Let’s put everything together and show the whole process from start to finish.

Demonstrating persistent volume storage end to end

To illustrate all the concepts, let’s do a mini demonstration where we create a hostPath volume, claim it, mount it, and have containers write to it. We will use k3d for this part.

Let’s start by creating a hostPath volume using the dir storage class. Save the following in dir-persistent-volume.yaml:

kind: PersistentVolume
apiVersion: v1
metadata:
  name: dir-pv
spec:
  storageClassName: dir
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteMany
  hostPath:
    path: "/tmp/data"

Then, let’s create it:

$ k create -f dir-persistent-volume.yaml
persistentvolume/dir-pv created

To check out the available volumes, you can use the resource type persistentvolumes or pv for short:

$ k get pv
NAME       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS    REASON   AGE
dir-pv     1Gi        RWX            Retain           Available           dir                      22s

The capacity is 1 GiB as requested. The reclaim policy is Retain because host path volumes are retained (not destroyed). The status is Available because the volume has not been claimed yet. The access mode is specified as RWX, which means ReadWriteMany. All of the access modes have a shorthand version:

  • RWO: ReadWriteOnce
  • ROX: ReadOnlyMany
  • RWX: ReadWriteMany

We have a persistent volume. Let’s create a claim. Save the following to dir-persistent-volume-claim.yaml:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: dir-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi

Then, run the following command:

$ k create -f  dir-persistent-volume-claim.yaml
persistentvolumeclaim/dir-pvc created

Let’s check the claim and the volume:

$ k get pvc
NAME                  STATUS   VOLUME     CAPACITY   ACCESS MODES   STORAGECLASS    AGE
dir-pvc               Bound    dir-pv     1Gi        RWX            dir             106s
$ k get pv
NAME       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM             STORAGECLASS    REASON   AGE
dir-pv     1Gi        RWX            Retain           Bound    default/dir-pvc   dir                      4m25s

As you can see, the claim and the volume are bound to each other and reference each other. The reason the binding works is that the same storage class is used by the volume and the claim. But, what happens if they don’t match? Let’s remove the storage class from the persistent volume claim and see what happens. Save the following persistent volume claim to some-persistent-volume-claim.yaml:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: some-pvc
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 1Gi

Then, create it:

$ k create -f some-persistent-volume-claim.yaml
persistentvolumeclaim/some-pvc created

Ok. It was created. Let’s check it out:

$ k get pvc some-pvc
NAME       STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
some-pvc   Pending                                      local-path     3m29s

Very interesting. The some-pvc claim was associated with the local-path storage class that we never specified, but it is still pending. Let’s understand why.

Here is the local-path storage class:

$ k get storageclass local-path -o yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    objectset.rio.cattle.io/applied: H4sIAAAAAAAA/4yRT+vUMBCGv4rMua1bu1tKwIO u7EUEQdDzNJlux6aZkkwry7LfXbIqrIffn2PyZN7hfXIFXPg7xcQSwEBSiXimaupSxfJ2q6GAiYMDA9 /+oKPHlKCAmRQdKoK5AoYgisoSUj5K/5OsJtIqslQWVT3lNM4xUDzJ5VegWJ63CQxMTXogW128+czBvf/gnIQXIwLOBAa8WPTl30qvGkoL2jw5rT2V6ZKUZij+SbG5eZVRDKR0F8SpdDTg6rW8YzCgcSW4FeCxJ/+sjxHTCAbqrhmag20Pw9DbZtfu210z7JuhPnQ719m2w3cOe7fPof81W1DHfLlE2Th/IEUwEDHYkWJe8PCs gJgL8PxVPNsLGPhEnjRr2cSvM33k4Dicv4jLC34g60niiWPSo4S0zhTh9jsAAP//ytgh5S0CAAA
    objectset.rio.cattle.io/id: ""
    objectset.rio.cattle.io/owner-gvk: k3s.cattle.io/v1, Kind=Addon
    objectset.rio.cattle.io/owner-name: local-storage
    objectset.rio.cattle.io/owner-namespace: kube-system
    storageclass.kubernetes.io/is-default-class: "true"
  creationTimestamp: "2022-06-22T18:16:56Z"
  labels:
    objectset.rio.cattle.io/hash: 183f35c65ffbc3064603f43f1580d8c68a2dabd4
  name: local-path
  resourceVersion: "290"
  uid: b51cf456-f87e-48ac-9062-4652bf8f683e
provisioner: rancher.io/local-path
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer

It is a storage class that comes with k3d (k3s).

Note the annotation: storageclass.kubernetes.io/is-default-class: "true". It tells Kubernetes that this is the default storage class. Since our PVC had no storage class name it was associated with the default storage class. But, why is the claim still pending? The reason is that volumeBindingMode is WaitForFirstConsumer. This means that the volume for the claim will be provisioned dynamically only when a container attempts to mount the volume via the claim.
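To see the claim transition from Pending to Bound, you can create a pod that mounts it; the pod name here is arbitrary:

```yaml
kind: Pod
apiVersion: v1
metadata:
  name: some-pvc-consumer
spec:
  containers:
    - name: shell
      image: g1g1/py-kube:0.3
      command: ["sleep", "10000"]
      volumeMounts:
        - mountPath: "/data"
          name: some-volume
  volumes:
    - name: some-volume
      persistentVolumeClaim:
        claimName: some-pvc
```

Once the pod is scheduled, the local-path provisioner creates a volume on the target node and binds the claim.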

Back to our dir-pvc. The final step is to create a pod with two containers and assign the claim as a volume to both of them. Save the following to shell-pod.yaml:

kind: Pod
apiVersion: v1
metadata:
  name: just-a-shell
  labels:
    name: just-a-shell
spec:
  containers:
    - name: a-shell
      image: g1g1/py-kube:0.3
      command: ["sleep", "10000"]
      volumeMounts:
        - mountPath: "/data"
          name: pv
    - name: another-shell
      image: g1g1/py-kube:0.3
      command: ["sleep", "10000"]
      volumeMounts:
        - mountPath: "/another-data"
          name: pv
  volumes:
    - name: pv
      persistentVolumeClaim:
        claimName: dir-pvc

This pod has two containers that use the g1g1/py-kube:0.3 image and both just sleep for a long time. The idea is that the containers will keep running, so we can connect to them later and check their file system. The pod mounts our persistent volume claim with a volume name of pv. Note that the volume specification is done at the pod level just once and multiple containers can mount it into different directories.

Let’s create the pod and verify that both containers are running:

$ k create -f shell-pod.yaml
pod/just-a-shell created
$ k get po just-a-shell -o wide
NAME           READY   STATUS    RESTARTS   AGE   IP            NODE                      NOMINATED NODE   READINESS GATES
just-a-shell   2/2     Running   0          74m   10.42.2.104   k3d-k3s-default-agent-1   <none>           <none>

Then, connect to the node (k3d-k3s-default-agent-1). This is the host whose /tmp/data directory backs the pod’s volume, which is mounted as /data and /another-data in the two running containers:

$ docker exec -it k3d-k3s-default-agent-1 sh
/ #

Then, let’s create a file in the /tmp/data directory on the host. It should be visible by both containers via the mounted volume:

/ # echo "yeah, it works" > /tmp/data/cool.txt

Let’s verify from the outside that the file cool.txt is indeed available:

$ docker exec -it k3d-k3s-default-agent-1 cat /tmp/data/cool.txt
yeah, it works 

Next, let’s verify the file is available in the containers (in their mapped directories):

$ k exec -it just-a-shell -c a-shell -- cat  /data/cool.txt
yeah, it works
$ k exec -it just-a-shell -c another-shell -- cat  /another-data/cool.txt
yeah, it works

We can even create a new file, yo.txt, in one of the containers and see that it’s available to the other container or to the node itself:

$ k exec -it just-a-shell -c another-shell -- bash -c "echo yo > /another-data/yo.txt"
$ k exec -it just-a-shell -c a-shell -- cat /data/yo.txt
yo
$ k exec -it just-a-shell -c another-shell -- cat /another-data/yo.txt
yo

Yes. Everything works as expected and both containers share the same storage.

Public cloud storage volume types – GCE, AWS, and Azure

In this section, we’ll look at some of the common volume types available in the leading public cloud platforms. Managing storage at scale is a difficult task that eventually involves physical resources, similar to nodes. If you choose to run your Kubernetes cluster on a public cloud platform, you can let your cloud provider deal with all these challenges and focus on your system. But it’s important to understand the various options, constraints, and limitations of each volume type.

Many of the volume types we will go over used to be handled by in-tree plugins (part of core Kubernetes), but have now migrated to out-of-tree CSI plugins.

The CSI migration feature allows in-tree plugins that have corresponding out-of-tree CSI plugins to direct operations toward the out-of-tree plugins as a transitioning measure.

We will cover the CSI itself later.

AWS Elastic Block Store (EBS)

AWS provides the Elastic Block Store (EBS) as persistent storage for EC2 instances. An AWS Kubernetes cluster can use AWS EBS as persistent storage with the following limitations:

  • The pods must run on AWS EC2 instances as nodes
  • Pods can only access EBS volumes provisioned in their availability zone
  • An EBS volume can be mounted on a single EC2 instance

Those are severe limitations. The restriction to a single availability zone, while great for performance, eliminates the ability to share storage at scale or across a geographically distributed system without custom replication and synchronization. The limit of a single EBS volume per EC2 instance means that even within the same availability zone, pods can’t share storage (even for reading) unless you make sure they run on the same node.

This is an example of an in-tree plugin that also has a CSI driver and supports CSIMigration. That means that if the CSI driver for AWS EBS (ebs.csi.aws.com) is installed, then the in-tree plugin will redirect all plugin operations to the out-of-tree plugin.

It is also possible to prevent the in-tree awsElasticBlockStore storage plugin from being loaded by setting the InTreePluginAWSUnregister feature gate to true (the default is false).

Check out all the feature gates here: https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/.

Let’s see how to define an AWS EBS persistent volume (static provisioning):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: test-pv
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 5Gi
  csi:
    driver: ebs.csi.aws.com
    volumeHandle: {EBS volume ID}
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.ebs.csi.aws.com/zone
              operator: In
              values:
                - {availability zone}

Then you need to define a PVC:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ebs-claim
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: "" # an empty string disables dynamic provisioning, so the claim binds to the static volume
  resources:
    requests:
      storage: 5Gi

Finally, a pod can mount the PVC:

apiVersion: v1
kind: Pod
metadata:
  name: some-pod
spec:
  containers:
  - image: some-container
    name: some-container
    volumeMounts:
    - name: persistent-storage
      mountPath: /data
  volumes:
  - name: persistent-storage
    persistentVolumeClaim:
      claimName: ebs-claim

AWS Elastic File System (EFS)

AWS has a service called the Elastic File System (EFS). This is really a managed NFS service. It uses the NFS 4.1 protocol and has many benefits over EBS:

  • Multiple EC2 instances can access the same files across multiple availability zones (but within the same region)
  • Capacity is automatically scaled up and down based on actual usage
  • You pay only for what you use
  • You can connect on-premise servers to EFS over VPN
  • EFS runs off SSD drives that are automatically replicated across availability zones

That said, EFS is more expensive than EBS, even when you consider the automatic replication to multiple AZs (assuming you fully utilize your EBS volumes). The recommended way to use EFS is via its dedicated CSI driver: https://github.com/kubernetes-sigs/aws-efs-csi-driver.

Here is an example of static provisioning. First, define the persistent volume:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-pv
spec:
  capacity:
    storage: 1Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: efs.csi.aws.com
    volumeHandle: <Filesystem Id> 

You can find the Filesystem Id using the AWS CLI:

aws efs describe-file-systems --query "FileSystems[*].FileSystemId"

Then define a PVC:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-claim
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ""
  resources:
    requests:
      storage: 1Gi

Here is a pod that consumes it:

apiVersion: v1
kind: Pod
metadata:
  name: efs-app
spec:
  containers:
  - name: app
    image: centos
    command: ["/bin/sh"]
    args: ["-c", "while true; do echo $(date -u) >> /data/out.txt; sleep 5; done"]
    volumeMounts:
    - name: persistent-storage
      mountPath: /data
  volumes:
  - name: persistent-storage
    persistentVolumeClaim:
      claimName: efs-claim

You can also use dynamic provisioning by defining a proper storage class instead of creating a static volume:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap
  fileSystemId: <Filesystem Id>
  directoryPerms: "700"
  gidRangeStart: "1000" # optional
  gidRangeEnd: "2000" # optional
  basePath: "/dynamic_provisioning" # optional

The PVC is similar, but now uses the storage class name:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 5Gi

The pod consumes the PVC just like before:

apiVersion: v1
kind: Pod
metadata:
  name: efs-app
spec:
  containers:
    - name: app
      image: centos
      command: ["/bin/sh"]
      args: ["-c", "while true; do echo $(date -u) >> /data/out; sleep 5; done"]
      volumeMounts:
        - name: persistent-storage
          mountPath: /data
  volumes:
    - name: persistent-storage
      persistentVolumeClaim:
        claimName: efs-claim

GCE persistent disk

The gcePersistentDisk volume type is very similar to awsElasticBlockStore. You must provision the disk ahead of time. It can only be used by GCE instances in the same project and zone. But the same volume can be used as read-only on multiple instances. This means it supports ReadWriteOnce and ReadOnlyMany. You can use a GCE persistent disk to share data as read-only between multiple pods in the same zone.

It also has a CSI driver called pd.csi.storage.gke.io and supports CSIMigration.

If the pod that’s using a persistent disk in ReadWriteOnce mode is controlled by a replication controller, a replica set, or a deployment, the replica count must be 0 or 1. Trying to scale beyond 1 will fail for obvious reasons.

Here is a storage class for GCE persistent disk using the CSI driver:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-gce-pd
provisioner: pd.csi.storage.gke.io
parameters:
  labels: key1=value1,key2=value2
volumeBindingMode: WaitForFirstConsumer

Here is the PVC:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: gce-pd-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: csi-gce-pd
  resources:
    requests:
      storage: 200Gi

A Pod can consume it for dynamic provisioning:

apiVersion: v1
kind: Pod
metadata:
  name: some-pod
spec:
  containers:
  - image: some-image
    name: some-container
    volumeMounts:
    - mountPath: /pd
      name: some-volume
  volumes:
  - name: some-volume
    persistentVolumeClaim:
       claimName: gce-pd-pvc
       readOnly: false

The GCE persistent disk has supported a regional disk option since Kubernetes 1.10 (in Beta). Regional persistent disks automatically sync between two zones. Here is what the storage class looks like for a regional persistent disk:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-gce-pd
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-standard
  replication-type: regional-pd
volumeBindingMode: WaitForFirstConsumer

Google Cloud Filestore

Google Cloud Filestore is the managed NFS file service of GCP. Kubernetes doesn’t have an in-tree plugin for it and there is no general-purpose supported CSI driver.

However, there is a CSI driver used on GKE, and if you are adventurous, you may want to try it even if you’re installing Kubernetes yourself on GCP and want to use Google Cloud Filestore as a storage option.

See: https://github.com/kubernetes-sigs/gcp-filestore-csi-driver.

Azure data disk

The Azure data disk is a virtual hard disk stored in Azure storage. It’s similar in capabilities to AWS EBS or a GCE persistent disk.

It also has a CSI driver called disk.csi.azure.com and supports CSIMigration. See: https://github.com/kubernetes-sigs/azuredisk-csi-driver.

Here is an example of defining an Azure disk persistent volume:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-azuredisk
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: managed-csi
  csi:
    driver: disk.csi.azure.com
    readOnly: false
    volumeHandle: /subscriptions/{sub-id}/resourcegroups/{group-name}/providers/microsoft.compute/disks/{disk-id}
    volumeAttributes:
      fsType: ext4

In addition to the mandatory diskName and diskURI parameters, the in-tree azureDisk volume type also has a few optional parameters:

  • kind: The available options for disk storage configurations are Shared (allowing multiple disks per storage account), Dedicated (providing a single blob disk per storage account), or Managed (offering an Azure-managed data disk). The default is Shared.
  • cachingMode: The disk caching mode. This must be one of None, ReadOnly, or ReadWrite. The default is None.
  • fsType: The filesystem type set to mount. The default is ext4.
  • readOnly: Whether the filesystem is used as readOnly. The default is false.

Azure managed data disks can be as large as 32 TiB. The number of data disks a VM can have depends on its size; larger VM sizes can have more disks attached. You can attach an Azure data disk to only a single Azure VM at a time.

As usual, you should create a PVC and consume it in a pod (or via a pod controller).
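For example, here is a sketch of a claim that binds to the pv-azuredisk volume defined above; the claim name is an assumption:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-azuredisk
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: managed-csi
  volumeName: pv-azuredisk   # pin the claim to the statically created volume
  resources:
    requests:
      storage: 10Gi
```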

Azure file

In addition to the data disk, Azure also has a shared filesystem similar to AWS EFS. However, Azure file storage uses the SMB/CIFS protocol (it supports SMB 2.1 and SMB 3.0). It is based on the Azure storage platform and has the same availability, durability, scalability, and geo-redundancy capabilities as Azure Blob, Table, or Queue storage.

In order to use Azure file storage, you need to install the cifs-utils package on each client VM. You also need to create a secret, which is a required parameter:

apiVersion: v1
kind: Secret
metadata:
  name: azure-file-secret
type: Opaque
data:
  azurestorageaccountname: <base64 encoded account name>
  azurestorageaccountkey: <base64 encoded account key>

Here is a pod that uses Azure file storage:

apiVersion: v1
kind: Pod
metadata:
  name: some-pod
spec:
  containers:
    - image: some-container
      name: some-container
      volumeMounts:
        - name: some-volume
          mountPath: /azure
  volumes:
    - name: some-volume
      azureFile:
        secretName: azure-file-secret
        shareName: azure-share
        readOnly: false

Azure file storage supports sharing within the same region as well as connecting on-premise clients.

This covers the public cloud storage volume types. Let’s look at some distributed storage volumes you can install on your own in your cluster.

GlusterFS and Ceph volumes in Kubernetes

GlusterFS and Ceph are two distributed persistent storage systems. GlusterFS is, at its core, a network filesystem. Ceph is, at its core, an object store. Both expose block, object, and filesystem interfaces. Both use the xfs filesystem under the hood to store the data and metadata as xattr attributes. There are several reasons why you may want to use GlusterFS or Ceph as persistent volumes in your Kubernetes cluster:

  • You run on-premises and cloud storage is not available
  • You may have a lot of data and applications that access the data in GlusterFS or Ceph
  • You have operational expertise managing GlusterFS or Ceph
  • You run in the cloud, but the limitations of the cloud platform persistent storage are a non-starter

Let’s take a closer look at GlusterFS.

Using GlusterFS

GlusterFS is intentionally simple, exposing the underlying directories as they are and leaving it to clients (or middleware) to handle high availability, replication, and distribution. GlusterFS organizes the data into logical volumes, which encompass multiple nodes (machines) that contain bricks, which store files. Files are allocated to bricks according to DHT (distributed hash table). If files are renamed or the GlusterFS cluster is expanded or rebalanced, files may be moved between bricks. The following diagram shows the GlusterFS building blocks:

Figure 6.2: GlusterFS building blocks

To use a GlusterFS cluster as persistent storage for Kubernetes (assuming you have an up-and-running GlusterFS cluster), you need to follow several steps. In particular, the GlusterFS storage nodes are exposed to the volume plugin via Kubernetes endpoints and a service.

Creating endpoints

Here is an example of an endpoints resource that you can create as a normal Kubernetes resource using kubectl create:

kind: Endpoints
apiVersion: v1
metadata:
  name: glusterfs-cluster
subsets:
- addresses:
  - ip: 10.240.106.152
  ports:
  - port: 1
- addresses:
  - ip: 10.240.79.157
  ports:
  - port: 1 

Adding a GlusterFS Kubernetes service

To make the endpoints persistent, you use a Kubernetes service with no selector to indicate the endpoints are managed manually:

kind: Service
apiVersion: v1
metadata:
  name: glusterfs-cluster
spec:
  ports:
  - port: 1

Creating pods

Finally, in the pod spec’s volumes section, provide the following information:

volumes:
- name: glusterfsvol
  glusterfs:
    endpoints: glusterfs-cluster
    path: kube_vol
    readOnly: true

The containers can then mount glusterfsvol by name.

The endpoints tell the GlusterFS volume plugin how to find the storage nodes of the GlusterFS cluster.
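Putting it together, here is a sketch of a complete pod that mounts the GlusterFS volume; the pod name and mount path are illustrative:

```yaml
kind: Pod
apiVersion: v1
metadata:
  name: glusterfs-pod
spec:
  containers:
    - name: the-container
      image: g1g1/py-kube:0.3
      command: ["sleep", "10000"]
      volumeMounts:
        - mountPath: "/mnt/glusterfs"
          name: glusterfsvol
  volumes:
    - name: glusterfsvol
      glusterfs:
        endpoints: glusterfs-cluster   # the Endpoints resource created earlier
        path: kube_vol                 # the GlusterFS volume name
        readOnly: true
```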

There was an effort to create a CSI driver for GlusterFS, but it was abandoned: https://github.com/gluster/gluster-csi-driver.

After covering GlusterFS, let’s look at Ceph.

Using Ceph

Ceph’s object store can be accessed using multiple interfaces. Unlike GlusterFS, Ceph does a lot of work automatically. It does distribution, replication, and self-healing all on its own. The following diagram shows how RADOS – the underlying object store – can be accessed in multiple ways.

Figure 6.3: Accessing RADOS

Kubernetes supports Ceph via the Rados Block Device (RBD) interface.

Connecting to Ceph using RBD

You must install ceph-common on each node of the Kubernetes cluster. Once you have your Ceph cluster up and running, you need to provide some information required by the Ceph RBD volume plugin in the pod configuration file:

  • monitors: Ceph monitors.
  • pool: The name of the RADOS pool. If not provided, the default RBD pool is used.
  • image: The image name that RBD has created.
  • user: The RADOS username. If not provided, the default admin is used.
  • keyring: The path to the keyring file. If not provided, the default /etc/ceph/keyring is used.
  • secretName: The name of the authentication secret. If provided, secretName overrides keyring. Note: see the following paragraph about how to create a secret.
  • fsType: The filesystem type (ext4, xfs, and so on) that is formatted on the device.
  • readOnly: Whether the filesystem is used as readOnly.

If the Ceph authentication secret is used, you need to create a secret object:

apiVersion: v1
kind: Secret
metadata:
  name: ceph-secret
type: "kubernetes.io/rbd"
data:
  key: QVFCMTZWMVZvRjVtRXhBQTVrQ1FzN2JCajhWVUxSdzI2Qzg0SEE9PQ==

The secret type is kubernetes.io/rbd.
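The value in the data section is the base64-encoded Ceph client key. You can produce it like this (the key shown is the illustrative value from the secret above, not a real credential):

```shell
# base64-encode a Ceph client key for the secret's data section
echo -n 'AQB16V1VoF5mExAA5kCQs7bBj8VULRw26C84HA==' | base64
```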

Here is a sample pod that uses Ceph through RBD with a secret using the in-tree provider:

apiVersion: v1
kind: Pod
metadata:
  name: rbd2
spec:
  containers:
    - image: kubernetes/pause
      name: rbd-rw
      volumeMounts:
      - name: rbdpd
        mountPath: /mnt/rbd
  volumes:
    - name: rbdpd
      rbd:
        monitors:
        - '10.16.154.78:6789'
        - '10.16.154.82:6789'
        - '10.16.154.83:6789'
        pool: kube
        image: foo
        fsType: ext4
        readOnly: true
        user: admin
        secretRef:
          name: ceph-secret

Ceph RBD supports ReadWriteOnce and ReadOnlyMany access modes. But, these days it is best to work with Ceph via Rook.

Rook

Rook is an open source cloud native storage orchestrator. It is currently a graduated CNCF project. It used to provide a consistent experience on top of multiple storage solutions such as Ceph, EdgeFS, Cassandra, Minio, NFS, CockroachDB, and YugabyteDB, but eventually it narrowed its focus to supporting only Ceph. Here are the features Rook provides:

  • Automating deployment
  • Bootstrapping
  • Configuration
  • Provisioning
  • Scaling
  • Upgrading
  • Migration
  • Scheduling
  • Lifecycle management
  • Resource management
  • Monitoring
  • Disaster recovery

Rook takes advantage of modern Kubernetes best practices like CRDs and operators.

Here is the Rook architecture:

Figure 6.4: Rook architecture

Once you install the Rook operator you can create a Ceph cluster using a Rook CRD such as: https://github.com/rook/rook/blob/release-1.10/deploy/examples/cluster.yaml.

Here is a shortened version (without the comments):

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph # namespace:cluster
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v17.2.5
    allowUnsupported: false
  dataDirHostPath: /var/lib/rook
  skipUpgradeChecks: false
  continueUpgradeAfterChecksEvenIfNotHealthy: false
  waitTimeoutForHealthyOSDInMinutes: 10
  mon:
    count: 3
    allowMultiplePerNode: false
  mgr:
    count: 2
    allowMultiplePerNode: false
    modules:
      - name: pg_autoscaler
        enabled: true
  dashboard:
    enabled: true
    ssl: true
  monitoring:
    enabled: false
  network:
    connections:
      encryption:
        enabled: false
      compression:
        enabled: false
  crashCollector:
    disable: false
  logCollector:
    enabled: true
    periodicity: daily # one of: hourly, daily, weekly, monthly
    maxLogSize: 500M # SUFFIX may be 'M' or 'G'. Must be at least 1M.
  cleanupPolicy:
    confirmation: ""
    sanitizeDisks:
      method: quick
      dataSource: zero
      iteration: 1
    allowUninstallWithVolumes: false
  annotations:
  labels:
  resources:
  removeOSDsIfOutAndSafeToRemove: false
  priorityClassNames:
    mon: system-node-critical
    osd: system-node-critical
    mgr: system-cluster-critical
  storage: # cluster level storage configuration and selection
    useAllNodes: true
    useAllDevices: true
    config:
    onlyApplyOSDPlacement: false
  disruptionManagement:
    managePodBudgets: true
    osdMaintenanceTimeout: 30
    pgHealthCheckTimeout: 0
    manageMachineDisruptionBudgets: false
    machineDisruptionBudgetNamespace: openshift-machine-api
  healthCheck:
    daemonHealth:
      mon:
        disabled: false
        interval: 45s
      osd:
        disabled: false
        interval: 60s
      status:
        disabled: false
        interval: 60s
    livenessProbe:
      mon:
        disabled: false
      mgr:
        disabled: false
      osd:
        disabled: false
    startupProbe:
      mon:
        disabled: false
      mgr:
        disabled: false
      osd:
        disabled: false

Here is a storage class for Ceph object storage buckets provisioned by Rook:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
   name: rook-ceph-retain-bucket
provisioner: rook-ceph.ceph.rook.io/bucket # driver:namespace:cluster
# set the reclaim policy to retain the bucket when its OBC is deleted
reclaimPolicy: Retain
parameters:
   objectStoreName: my-store # port 80 assumed
   objectStoreNamespace: rook-ceph # namespace:cluster

The full code is available here: https://github.com/rook/rook/blob/release-1.10/deploy/examples/storageclass-bucket-retain.yaml.
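A client can then request a bucket from this class with an Object Bucket Claim (OBC); here is a sketch, where the claim name and bucket name prefix are assumptions:

```yaml
apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
  name: ceph-bucket
spec:
  generateBucketName: ceph-bkt          # prefix for the generated bucket name
  storageClassName: rook-ceph-retain-bucket
```

Rook provisions the bucket and creates a ConfigMap and Secret with the endpoint and credentials for applications to consume.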

Now that we’ve covered using distributed storage using GlusterFS, Ceph, and Rook, let’s look at enterprise storage options.

Integrating enterprise storage into Kubernetes

If you have an existing Storage Area Network (SAN) exposed over the iSCSI interface, Kubernetes has a volume plugin for you. It follows the same model as other shared persistent storage plugins we’ve seen earlier. It supports the following features:

  • Connecting to one portal
  • Mounting a device directly or via multipathd
  • Formatting and partitioning any new device
  • Authenticating via CHAP

You must configure the iSCSI initiator, but you don’t have to provide any initiator information. All you need to provide is the following:

  • IP address of the iSCSI target and port (if not the default 3260)
  • Target’s IQN (iSCSI Qualified Name) – typically the reversed domain name
  • LUN (Logical Unit Number)
  • Filesystem type
  • Readonly Boolean flag

The iSCSI plugin supports ReadWriteOnce and ReadOnlyMany. Note that you can’t partition your device at this time. Here is an example pod with an iSCSI volume spec:

---
apiVersion: v1
kind: Pod
metadata:
  name: iscsipd
spec:
  containers:
  - name: iscsipd-rw
    image: kubernetes/pause
    volumeMounts:
    - mountPath: "/mnt/iscsipd"
      name: iscsipd-rw
  volumes:
  - name: iscsipd-rw
    iscsi:
      targetPortal: 10.0.2.15:3260
      portals: ['10.0.2.16:3260', '10.0.2.17:3260']
      iqn: iqn.2001-04.com.example:storage.kube.sys1.xyz
      lun: 0
      fsType: ext4
      readOnly: true

Other storage providers

The Kubernetes storage scene keeps innovating. A lot of companies adapt their products to Kubernetes and some companies and organizations build Kubernetes-dedicated storage solutions. Here are some of the more popular and mature solutions:

  • OpenEBS
  • Longhorn
  • Portworx

The Container Storage Interface

The Container Storage Interface (CSI) is a standard interface for the interaction between container orchestrators and storage providers. It was developed by Kubernetes, Docker, Mesos, and Cloud Foundry. The idea is that storage providers implement just one CSI driver and all container orchestrators need to support only the CSI. It is the equivalent of CNI for storage.

A CSI volume plugin was added in Kubernetes 1.9 as an Alpha feature and has been generally available since Kubernetes 1.13. The older FlexVolume approach (which you may have come across) is deprecated now.

Here is a diagram that demonstrates how CSI works within Kubernetes:

Figure 6.5: CSI architecture

The migration effort to port all in-tree plugins to out-of-tree CSI drivers is well underway. See https://kubernetes-csi.github.io for more details.
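To consume a CSI driver for dynamic provisioning, you reference it in a storage class by its registered provisioner name. Here is a minimal sketch; the driver name (mydriver.example.com), storage class name, and the parameters are hypothetical, as each driver defines its own:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-example-sc
provisioner: mydriver.example.com   # the CSI driver's registered name
parameters:
  type: fast                        # driver-specific parameters
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```

Persistent volume claims that specify this storage class will then be provisioned by the CSI driver rather than by an in-tree plugin.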

Advanced storage features

These features are available only to CSI drivers. They represent the benefits of a uniform storage model that allows adding optional advanced functionality across all storage providers with a uniform interface.

Volume snapshots

Volume snapshots are generally available as of Kubernetes 1.20. They are exactly what they sound like – a snapshot of a volume at a certain point in time. You can create and later restore volumes from a snapshot. It’s interesting that the API objects associated with snapshots are CRDs and not part of the core Kubernetes API. The objects are:

  • VolumeSnapshotClass
  • VolumeSnapshotContent
  • VolumeSnapshot

Volume snapshots work using an external-snapshotter sidecar container that the Kubernetes team developed. It watches for snapshot CRDs to be created and interacts with the snapshot controller, which can invoke the CreateSnapshot and DeleteSnapshot operations of CSI drivers that implement snapshot support.
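A VolumeSnapshotClass ties snapshots to the CSI driver that creates them, much like a storage class does for volumes. As a sketch, a class named csi-hostpath-snapclass backed by the CSI hostpath example driver might look like this (substitute your own driver’s name):

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-hostpath-snapclass
driver: hostpath.csi.k8s.io   # CSI driver that handles snapshots for this class
deletionPolicy: Delete        # delete the VolumeSnapshotContent when the snapshot is deleted
```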

Here is how to declare a volume snapshot:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: new-snapshot-test
spec:
  volumeSnapshotClassName: csi-hostpath-snapclass
  source:
    persistentVolumeClaimName: pvc-test

You can also provision volumes from a snapshot.

Here is a persistent volume claim bound to a snapshot:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restore-pvc
spec:
  storageClassName: csi-hostpath-sc
  dataSource:
    name: new-snapshot-test
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi

See https://github.com/kubernetes-csi/external-snapshotter#design for more details.

CSI volume cloning

Volume cloning has been generally available since Kubernetes 1.18. Volume clones are new volumes that are populated with the content of an existing volume. Once the cloning is complete, there is no relation between the original and the clone; their content will diverge over time. You can perform a clone manually by creating a snapshot and then creating a new volume from the snapshot, but volume cloning is more streamlined and efficient.

It only works for dynamic provisioning and uses the storage class of the source volume for the clone as well. You initiate a volume clone by specifying an existing persistent volume claim as a data source of a new persistent volume claim. That triggers the dynamic provisioning of a new volume that clones the source claim’s volume:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: clone-of-pvc-1
  namespace: myns
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: cloning
  resources:
    requests:
      storage: 5Gi
  dataSource:
    kind: PersistentVolumeClaim
    name: pvc-1

See https://kubernetes.io/docs/concepts/storage/volume-pvc-datasource/ for more details.

Storage capacity tracking

Storage capacity tracking (GA as of Kubernetes 1.24) allows the scheduler to better schedule pods that require storage into nodes that can provide that storage. This requires a CSI driver that supports storage capacity tracking.

The CSI driver creates a CSIStorageCapacity object for each storage class and determines which nodes have access to this storage. In addition, the CSIDriver object’s storageCapacity field must be set to true.

When a pod specifies a storage class with the WaitForFirstConsumer volume binding mode, and the CSI driver has storageCapacity set to true, the Kubernetes scheduler considers the CSIStorageCapacity objects associated with the storage class and schedules the pod only to nodes that have sufficient storage.
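Here is a sketch of the two objects involved: a CSIDriver object that opts into storage capacity tracking, and a CSIStorageCapacity object as the driver might publish it. The driver and storage class names are hypothetical, and CSIStorageCapacity objects are normally created by the driver, not by hand:

```yaml
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: mydriver.example.com
spec:
  storageCapacity: true           # tell the scheduler to consult CSIStorageCapacity objects
---
apiVersion: storage.k8s.io/v1
kind: CSIStorageCapacity
metadata:
  name: mydriver-capacity-zone-a
storageClassName: csi-example-sc
capacity: 100Gi                   # how much storage the driver can provision here
nodeTopology:                     # which nodes this capacity applies to
  matchLabels:
    topology.kubernetes.io/zone: zone-a
```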

Check out: https://kubernetes.io/docs/concepts/storage/storage-capacity for more details.

Volume health monitoring

Volume health monitoring is a recent addition to the storage APIs. It has been in Alpha since Kubernetes 1.21. It involves two components:

  • An external health monitor
  • The kubelet

CSI drivers that support volume health monitoring will update PVCs with events on abnormal conditions of associated storage volumes. The external health monitor also watches nodes for failures and will report events on PVCs bound to these nodes.

In the case where a CSI driver enables volume health monitoring from the node side, any abnormal condition detected will result in an event being reported for every pod that utilizes a PVC with the corresponding issue.

There is also a new kubelet metric associated with volume health: kubelet_volume_stats_health_status_abnormal. It has two labels, namespace and persistentvolumeclaim, and its value is 0 (healthy) or 1 (abnormal).
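If you scrape kubelet metrics with Prometheus, you can alert on this metric. A minimal sketch of an alerting rule (the group and alert names are made up for illustration, and assume Prometheus already scrapes the kubelet):

```yaml
groups:
- name: volume-health
  rules:
  - alert: PersistentVolumeAbnormal
    expr: kubelet_volume_stats_health_status_abnormal == 1
    for: 5m
    annotations:
      summary: >-
        Volume for PVC {{ $labels.persistentvolumeclaim }}
        in namespace {{ $labels.namespace }} reports an abnormal health status.
```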

More details are available here: https://kubernetes.io/docs/concepts/storage/volume-health-monitoring/.

CSI is an exciting initiative that simplifies the Kubernetes code base itself by externalizing storage drivers. It also simplifies life for storage vendors, who can develop out-of-tree drivers, and adds a lot of advanced capabilities to the Kubernetes storage story.

Summary

In this chapter, we took a deep look into storage in Kubernetes. We’ve looked at the generic conceptual model based on volumes, claims, and storage classes, as well as the implementation of volume plugins. Kubernetes eventually maps all storage systems into mounted filesystems in containers or devices of raw block storage. This straightforward model allows administrators to configure and hook up any storage system from local host directories, through cloud-based shared storage, all the way to enterprise storage systems. The transition of storage provisioners from in-tree to CSI-based out-of-tree drivers bodes well for the storage ecosystem. You should now have a clear understanding of how storage is modeled and implemented in Kubernetes and be able to make intelligent choices on how to implement storage in your Kubernetes cluster.

In Chapter 7, Running Stateful Applications with Kubernetes, we’ll see how Kubernetes can raise the level of abstraction and, on top of storage, help to develop, deploy, and operate stateful applications using concepts such as stateful sets.

Join us on Discord!

Read this book alongside other users, cloud experts, authors, and like-minded professionals.

Ask questions, provide solutions to other readers, chat with the authors via Ask Me Anything sessions, and much more.

Scan the QR code or visit the link to join the community now.

https://packt.link/cloudanddevops
