Handling node failure with PVC involvement

Let's run the describe nodes command again as follows:

kubectl describe nodes

We found that, in our cluster, agent-0 was hosting all the critical pods: both the database and WordPress.
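If you want to confirm pod placement without reading through the full node description, a quicker check (assuming the release is running in the default namespace) is the following command; the NODE column shows which agent each pod is scheduled on:

kubectl get pods -o wide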

We are going to be evil again and stop the node that can cause the most damage by shutting down agent-0 in the Azure portal:

Let the fun begin.
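If you prefer the command line over the portal, you can deallocate the VM instead. This is only a sketch: the node resource group (the MC_... group that AKS creates) and the VM name will differ in your cluster, so look them up first:

az vm list --resource-group <node-resource-group> -o table

az vm deallocate --resource-group <node-resource-group> --name <agent-0-vm-name>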

Refresh the WordPress page in your browser to verify that the site no longer works.

You will have to wait at least 300s (in this case), as that is Kubernetes' default toleration period for node failures. You can check this by running the following command:

kubectl describe pods/<pod-id>

You will get the following output:

Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
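If the five-minute failover is too slow for your workload, you can shorten the toleration on the pod template. The following is only a sketch, assuming the chart created a deployment named handsonaks-wp-wordpress; it lowers both tolerations to 30s. Note that once you set these tolerations explicitly, Kubernetes no longer injects the 300s defaults for those keys:

kubectl patch deploy/handsonaks-wp-wordpress --type merge -p '{"spec":{"template":{"spec":{"tolerations":[{"key":"node.kubernetes.io/not-ready","operator":"Exists","effect":"NoExecute","tolerationSeconds":30},{"key":"node.kubernetes.io/unreachable","operator":"Exists","effect":"NoExecute","tolerationSeconds":30}]}}}}'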

Keep refreshing the page once in a while, and eventually you will see Kubernetes try to move the pods to the running agent.
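A convenient way to follow the rescheduling as it happens is to watch the pod list; the NODE column shows where each replacement pod lands:

kubectl get pods -o wide --watch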

Use kubectl edit deploy/... to fix any insufficient CPU/memory errors as shown in the last section.

In our case, we saw the following errors when running kubectl get events:

2s          2s           1         handsonaks-wp-wordpress-55644f585c-hqmn5.15753d7898e1545c   Pod                      Warning   FailedAttachVolume    attachdetach-controller   Multi-Attach error for volume "pvc-74f785bc-0c73-11e9-9914-82000ff4ac53" Volume is already used by pod(s) handsonaks-wp-wordpress-6ddcfd5c89-p2925
36s 36s 1 handsonaks-wp-wordpress-55644f585c-hqmn5.15753d953ce0bdca Pod Warning FailedMount kubelet, aks-agentpool-18162866-1 Unable to mount volumes for pod "handsonaks-wp-wordpress-55644f585c-hqmn5_default(ec03818d-0c83-11e9-9914-82000ff4ac53)": timeout expired waiting for volumes to attach or mount for pod "default"/"handsonaks-wp-wordpress-55644f585c-hqmn5". list of unmounted volumes=[wordpress-data]. list of unattached volumes=[wordpress-data default-token-rskvs]

This is because detaching storage from a VM is a tricky business, and Azure waits a long time before it will let a disk be detached from one VM and attached to another. As usual, there are multiple ways around this issue:

  • Use the Azure portal to manually detach the disk we identified previously (a command-line sketch follows this list).
  • Delete the old pod manually (the one with the status Unknown) to force-detach the volume (also shown below).
  • Give it around 5 or 10 minutes, then delete the pod to force Kubernetes to try again.
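Here is a minimal sketch of the first two options from the command line. The names are placeholders: substitute the disk name from the Multi-Attach error, your cluster's node resource group, and the name of the pod stuck in the Unknown state:

az vm disk detach --resource-group <node-resource-group> --vm-name <agent-0-vm-name> --name <pvc-disk-name>

kubectl delete pod <old-wordpress-pod> --grace-period=0 --force

The --grace-period=0 --force combination removes the pod object immediately instead of waiting for the unreachable kubelet to confirm termination.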

By trying some, or all, of these, we were able to mount the WordPress volume on agent-1, but not the mariadb volume. We had to restart agent-0 to get the cluster to a decent state. At this point, there are only two options:

  • Contact Microsoft support.
  • Make sure that you have good backups.

It is for this reason, among others, that we recommend using managed databases with your pods rather than hosting them yourself. We will see how to do that in the upcoming chapters.

When your cluster has been running for a long time, or at scale, you will eventually run into the following issues:

  • The kube-dns pod stops working and will require a restart (a sketch follows this list).
  • Azure limits outbound connections from a single VM to 1,024.
  • If any of your pods create zombie processes and don't clean them up, you won't even be able to connect to the pod. You will have to restart the pod, as shown below.
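A sketch for the first and last issues. Deleting a managed pod causes its controller to recreate it, which is the Kubernetes way of restarting; the label selector below assumes kube-dns carries its usual k8s-app=kube-dns label, and the pod name is a placeholder:

kubectl delete pods -n kube-system -l k8s-app=kube-dns

kubectl delete pod <stuck-pod-name>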

Before continuing, let's clean up the PV/PVC using the following command:

helm delete --purge handsonaks-wp
# delete any pv or pvc that might be present using kubectl delete pvc/...
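To verify that nothing is left behind, list the remaining volumes and claims and delete any stragglers; the claim name below is a placeholder, so use whatever the get command shows in your cluster:

kubectl get pv,pvc

kubectl delete pvc <leftover-pvc-name>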

By the end of this section, you have detailed knowledge of how to investigate and fix node failures that involve persistent volumes.
