Chapter 15: Rancher and Kubernetes Troubleshooting

In this chapter, we'll explore the master components of Kubernetes, how they interact, and how to troubleshoot the most common problems. Next, we'll walk through some common failure scenarios, identifying the failures and resolving them as quickly as possible, using the same troubleshooting steps and tools that Rancher's support team uses when supporting enterprise customers. Then, we'll discuss recovering from some common cluster failures. This chapter also includes scripts and documentation (based on actual events) for reproducing all of these failures in a lab environment.

In this chapter, we're going to cover the following main topics:

  • Recovering an RKE cluster from an etcd split-brain
  • Rebuilding from etcd backup
  • Pods not being scheduled with OPA Gatekeeper
  • A runaway app stomping all over a cluster
  • Can rotating kube-ca break my cluster?
  • A namespace is stuck in terminating
  • General troubleshooting for RKE clusters

Recovering an RKE cluster from an etcd split-brain

In this section, we are going to cover what an etcd split-brain is, how to detect it, and finally, how to recover from it.

What is an etcd split-brain?

Etcd is a leader-based distributed system. The leader periodically sends heartbeats to all followers in order to keep its leader lease. Etcd requires a majority of members – (n+1)/2 for a cluster of n members – to be up and healthy in order to accept writes. As long as fewer than half of the etcd members fail, the etcd cluster can still accept read/write requests. For example, if you have a five-node etcd cluster and lose two nodes, the Kubernetes cluster will still be up and running. But if you lose an additional node, the etcd cluster will lose quorum, and the remaining nodes will go into read-only mode until quorum is restored.
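As a quick reference, the following shows the standard etcd quorum math – the majority (quorum) size and the number of member failures a cluster of a given size can tolerate:

```

Cluster size    Quorum    Failure tolerance
1               1         0
3               2         1
5               3         2
7               4         3

```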

After a failure, the etcd cluster goes through a recovery process. The first step is to elect a new leader, which verifies that the cluster has a majority of members in a healthy state – that is, responding to health checks. The leader then returns the cluster to a healthy state and begins accepting write requests.
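If you want to see which member ended up as the leader after an election, etcdctl can report this per endpoint. The following is a minimal sketch for an RKE1 cluster, reusing the same docker exec wrapper used later in this chapter; the table output includes an IS LEADER column:

`docker exec -e ETCDCTL_ENDPOINTS=$(docker exec etcd /bin/sh -c "etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ','") etcd etcdctl endpoint status --write-out=table`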

Now, another common failure scenario is what we call a network partition. This is when most or all nodes in the etcd cluster lose access to one another, which generally happens during an infrastructure outage such as a switch failure or a storage outage. But this can also occur if you have an even number of etcd nodes – for example, if you have three etcd nodes in data center A and three etcd nodes in data center B.

Important Note

Having etcd running across two data centers is not recommended.

Then, the network connection between the data centers fails. In this case, all etcd nodes will go into read-only mode, because neither side can reach quorum.

In the preceding scenario, you should rarely run into a split-brain cluster if you have an odd number of nodes, but it can still happen. Of course, the question that comes up is, what is a split-brain cluster? The essential thing to understand is that etcd uses a cluster ID and member IDs to track the state of the cluster. The first node to come online creates the cluster ID, sometimes called the initial-cluster-token. As nodes join that cluster, each is assigned a unique member ID and is sent the cluster ID. At this point, the new node starts syncing data from the other members in the cluster.

There are only three main reasons why the cluster ID would be changed:

  • The first is data corruption; this is a rare occurrence (I have only seen it once, during an intentional data corruption test that used the dd command to write random data to the drive storing the etcd database filesystem). Most of the time, the safeguards and consistency checks built into etcd prevent this.
  • The second reason is a misconfiguration, which is more common when someone is making a cluster change. For example, when an etcd node fails, some users will try to add a new etcd node without removing the broken node first, causing the new node to fail to join correctly and putting the cluster into a strange, broken state. The new node sometimes generates a new cluster ID instead of joining the existing cluster.
  • The third reason is a failed etcd restore. During the etcd restore process, a new etcd cluster is created, with the first node being used as a bootstrap node and the original data being injected into this new cluster. The rest of the etcd nodes should then join the new cluster, but this process can fail if the connection between Rancher and the cluster/nodes is unstable, or if there is a bug in Rancher/RKE/RKE2. A restore can also fail partway through, leaving some etcd nodes running on older data and some nodes running on newer data.

Now we know how etcd can get into a split-brain state. In the next section, we are going to cover how to identify this issue in the real world, including the common error messages that you should look for.

Identifying the common error messages

When etcd goes into a split-brain state, it is typically discovered when a cluster goes offline – that is, requests to the kube-apiserver(s) start failing, which generally shows itself as the cluster going offline in the Rancher UI.

You should run the following commands for RKE(1) clusters and review the output:

Error messages in etcd logs:

`docker logs --tail 100 -f etcd`

```

2021-05-04 07:50:10.140405 E | rafthttp: request cluster ID mismatch (got ecdd18d533c7bdc3 want a0b4701215acdc84)

2021-05-04 07:50:10.142212 E | rafthttp: request sent was ignored (cluster ID mismatch: peer[fa573fde1c0b9eb9]=ecdd18d533c7bdc3, local=a0b4701215acdc84)

2021-05-04 07:50:10.155090 E | rafthttp: request sent was ignored (cluster ID mismatch: peer[fa573fde1c0b9eb9]=ecdd18d533c7bdc3, local=a0b4701215acdc84)

```

Note in the output that the fa573fde1c0b9eb9 member responds with a cluster ID different from the local copy. In the following command, we jump into the etcd container and connect to the etcd server using the etcdctl command-line tool. Finally, we run the member list sub-command to show all the nodes in this etcd cluster:

Unhealthy members in etcd cluster:

`docker exec -e ETCDCTL_ENDPOINTS=$(docker exec etcd /bin/sh -c "etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ','") etcd etcdctl member list`

```

15de45eddfe271bb, started, etcd-a1ublabat03, https://172.27.5.33:2380, https://172.27.5.33:2379, false

1d6ed2e3fa3a12e1, started, etcd-a1ublabat02, https://172.27.5.32:2380, https://172.27.5.32:2379, false

68d49b1389cdfca0, started, etcd-a1ublabat01, https://172.27.5.31:2380, https://172.27.5.31:2379, false

```

Note that the output shows all etcd members in the started state, which would make you think that they are all healthy. However, this output can be misleading; in particular, it does not prove that the members have successfully joined the same cluster:

Endpoint health:

`docker exec -e ETCDCTL_ENDPOINTS=$(docker exec etcd /bin/sh -c "etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ','") etcd etcdctl endpoint health`

```

https://172.27.5.31:2379 is healthy: successfully committed proposal: took = 66.729472ms

https://172.27.5.32:2379 is healthy: successfully committed proposal: took = 70.804719ms

https://172.27.5.33:2379 is healthy: successfully committed proposal: took = 71.457556ms

```

Note that the output shows all etcd members reporting as healthy even though one of the members has the wrong cluster ID. This output only tells us that the etcd process is up and running and responding on its health check endpoint.
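Because the health check can be misleading, it is worth comparing the cluster ID that each endpoint reports. The following is a minimal sketch for an RKE1 cluster, reusing the same docker exec wrapper as the preceding commands; in a healthy cluster, every endpoint should print the same cluster_id value, while a split-brain cluster will show two different values:

`docker exec -e ETCDCTL_ENDPOINTS=$(docker exec etcd /bin/sh -c "etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ','") etcd etcdctl endpoint status --write-out=json | grep -o '"cluster_id":[0-9]*'`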

You should run the following commands for RKE2 clusters and review the output:

Error messages in etcd logs:

`tail -f /var/log/pods/kube-system_etcd-*/etcd/*.log`

Note that the output is very similar to the output for the RKE1 cluster, with the only difference being that etcd runs as a Pod instead of a standalone container. In the following command, we run a for loop that goes through each etcd Pod and tests its health endpoint. This endpoint tells us whether the etcd server is healthy or having issues:

Endpoint health for each etcd member:

`for etcdpod in $(kubectl -n kube-system get pod -l component=etcd --no-headers -o custom-columns=NAME:.metadata.name); do echo $etcdpod; kubectl -n kube-system exec $etcdpod -- sh -c "ETCDCTL_ENDPOINTS='https://127.0.0.1:2379' ETCDCTL_CACERT='/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt' ETCDCTL_CERT='/var/lib/rancher/rke2/server/tls/etcd/server-client.crt' ETCDCTL_KEY='/var/lib/rancher/rke2/server/tls/etcd/server-client.key' ETCDCTL_API=3 etcdctl --write-out=table endpoint health"; echo ""; done;`

In the following screenshot, we can see that we are testing a total of five etcd servers, with each server reporting HEALTH equal to true, along with output showing how long each server took to respond to the health check request. The last column shows whether there are any known errors with the etcd server:

Figure 15.1 – The RKE2 endpoint health output table

Note that the output shows the health status of each of the master nodes. It is crucial to note that this script uses kubectl to execute into each etcd Pod and runs the etcdctl endpoint health command, so each member is checking itself.

If kubectl is unavailable, you can SSH into each of the master nodes and run the following command instead:

```
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
etcdcontainer=$(/var/lib/rancher/rke2/bin/crictl ps --label io.kubernetes.container.name=etcd --quiet)
/var/lib/rancher/rke2/bin/crictl exec $etcdcontainer sh -c "ETCDCTL_ENDPOINTS='https://127.0.0.1:2379' ETCDCTL_CACERT='/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt' ETCDCTL_CERT='/var/lib/rancher/rke2/server/tls/etcd/server-client.crt' ETCDCTL_KEY='/var/lib/rancher/rke2/server/tls/etcd/server-client.key' ETCDCTL_API=3 etcdctl endpoint health --cluster --write-out=table"
```

This command connects directly to the etcd container via crictl instead of going through kubectl.

To recover from this issue in an RKE(1) cluster, you'll want to try the following steps:

  1. Trigger a cluster update by running the rke up --config cluster.yml command (a minimal sketch follows this list), or for Rancher-managed RKE(1) clusters, make a change to the cluster settings to trigger an update.
  2. If the rke up command fails, use etcd-tools, found at https://github.com/rancherlabs/support-tools/tree/master/etcd-tools, to rebuild the etcd cluster manually.
  3. If etcd-tools fails, you need to restore the cluster from an etcd snapshot.
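
The following is a minimal sketch of step 1, assuming you have the cluster.yml and cluster.rkestate files for this cluster on your workstation; it takes a one-time snapshot first so that you have a fallback before triggering the update:

```
# Take a one-time etcd snapshot as a safety net before making changes
rke etcd snapshot-save --config cluster.yml --name pre-recovery-$(date +%Y%m%d)

# Re-run the cluster deployment, which rebuilds the etcd/controlplane containers as needed
rke up --config cluster.yml
```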

At this point, we know how to resolve an etcd failure such as this. We also need to take steps to prevent these issues from happening again, and one of the best ways to understand them is to reproduce them safely.

To reproduce this issue in a lab environment, you can follow the steps located at https://github.com/mattmattox/Kubernetes-Master-Class/tree/main/disaster-recovery/etcd-split-brain#reproducing-in-a-lab. Note that this process only applies to RKE(1) clusters, as finding a repeatable process for RKE2 is very difficult due to the built-in self-healing processes that are part of RKE2.

At this point, we have handled a broken etcd cluster by restoring it in place. We, of course, need to take this to the next step, which is how to recover when the cluster is lost and we need to rebuild. In the next section, we are going to cover the steps for rebuilding a cluster from an etcd backup.

Rebuilding from an etcd backup

Cluster data, including Deployments, Secrets, and ConfigMaps, is stored in etcd. Using RKE1/2, we can take an etcd backup and seed a new cluster using that backup. This feature can be helpful in disasters such as a large-scale storage outage or the accidental deletion of a cluster's data.

For RKE v0.2.0 and newer versions, etcd backups are turned on by default. Using the default settings, RKE takes a backup every 12 hours, keeping 6 copies locally on each etcd node, located at /opt/rke/etcd-snapshots. You can, of course, customize these settings by overriding the values in cluster.yml or through the Rancher UI; the details can be found at https://rancher.com/docs/rke/latest/en/etcd-snapshots/recurring-snapshots/#configuring-the-snapshot-service-in-yaml.

The most important settings are the Amazon Simple Storage Service (S3) settings, which allow you to store the etcd snapshots in an S3 bucket instead of locally on the etcd nodes. This is important because we want to get the backups off the servers being backed up. Note that RKE uses a standard S3 Go library that supports any S3 provider that follows the S3 standard. For example, you can use Wasabi in place of AWS S3, but you cannot use Azure Blob, as it's not fully S3 compatible. For environments where sending data to the cloud is not allowed, some enterprise storage arrays, such as NetApp and EMC, can act as an S3 provider.
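The following is a minimal sketch of what these settings look like in cluster.yml, based on the schema in the linked documentation; the interval, retention, bucket name, endpoint, and credentials shown here are placeholder values that you would replace with your own:

```
services:
  etcd:
    backup_config:
      interval_hours: 12      # how often a snapshot is taken
      retention: 6            # how many snapshots to keep
      s3backupconfig:
        access_key: "REPLACE_ME"
        secret_key: "REPLACE_ME"
        bucket_name: "my-etcd-snapshots"
        region: "us-east-1"
        endpoint: "s3.amazonaws.com"
```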

RKE can restore an etcd snapshot into the same cluster or a new cluster. To restore etcd, run the rke etcd snapshot-restore --name SnapshotName command, and RKE takes care of the rest. Restoring a snapshot into a new cluster is slightly different because the etcd snapshot restores all the cluster data, including items such as the node objects for the old nodes. In addition, the Kubernetes certificates are regenerated, which invalidates the service account tokens and breaks several services, such as canal, coredns, and the ingress-nginx controllers. To work around this issue, I created a script that deletes all the broken service account tokens and recycles the affected services and nodes. This script can be found at https://github.com/mattmattox/Kubernetes-Master-Class/tree/main/disaster-recovery/rebuild-from-scratch#restoringrecovering.
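If you cannot use the script, the manual cleanup follows the same idea. The following is a minimal sketch only; the namespaces and label selectors are assumptions that may differ in your cluster, and the linked script remains the reference for the full process:

```
# Delete stale service account token Secrets so new ones are issued with the new certificates
for ns in kube-system ingress-nginx; do
  kubectl -n "$ns" get secret --field-selector type=kubernetes.io/service-account-token -o name \
    | xargs -r kubectl -n "$ns" delete
done

# Recycle the Pods that cached the old tokens (label selectors are assumptions; adjust for your cluster)
kubectl -n kube-system delete pod -l k8s-app=canal
kubectl -n kube-system delete pod -l k8s-app=kube-dns
kubectl -n ingress-nginx delete pod --all
```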

You can find more details about the backup and restore process in Rancher's official documentation, located at https://rancher.com/docs/rke/latest/en/etcd-snapshots/.

In an RKE2 cluster, you can restore an etcd snapshot using the built-in rke2 command on the master nodes, using the following steps (a combined sketch follows this list):

  1. Stop rke2 on all master nodes using the systemctl stop rke2-server command.
  2. Reset the cluster on one of the master nodes using the rke2 server --cluster-reset command. This command creates a new etcd cluster with only a single member.
  3. Clean the other master nodes using the mv /var/lib/rancher/rke2/server/db/etcd /var/lib/rancher/rke2/server/db/etcd-old-%date% command.
  4. Then, rejoin the other master nodes to the cluster by running systemctl start rke2-server.
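
Putting these steps together, the following is a minimal sketch; the snapshot path is a placeholder, and step 2 shows the optional --cluster-reset-restore-path flag for restoring from a specific snapshot, as described in the RKE2 documentation linked below:

```
# 1. On ALL master nodes
systemctl stop rke2-server

# 2. On the first master node, reset the cluster (optionally restoring a specific snapshot)
rke2 server --cluster-reset \
  --cluster-reset-restore-path=/var/lib/rancher/rke2/server/db/snapshots/<snapshot-name>

# 3. On the OTHER master nodes, move the old etcd data out of the way
mv /var/lib/rancher/rke2/server/db/etcd /var/lib/rancher/rke2/server/db/etcd-old-$(date +%Y%m%d)

# 4. Start rke2 on the first master node, then on the remaining master nodes, so that they rejoin
systemctl start rke2-server
```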

You can find more details on this process in the official RKE2 documentation at https://docs.rke2.io/backup_restore/.

At this point, you should be able to take an etcd backup and rebuild a cluster using just that backup, for both RKE1 and RKE2 clusters.

How to resolve Pods not being scheduled due to OPA Gatekeeper

As we covered in Chapter 12, Security and Compliance Using OPA Gatekeeper, OPA Gatekeeper uses ValidatingWebhookConfigurations to screen update requests sent to kube-apiserver and verify whether they pass the OPA Gatekeeper policies. If the OPA Gatekeeper Pod(s) are down, these requests will fail, which breaks kube-scheduler because all of its update requests are blocked. This means that all new Pods will fail to be created.

Important Note

OPA Gatekeeper can be set to fail open – that is, if OPA Gatekeeper is down, the request is assumed to be approved and allowed through. I have seen larger clusters where the delay caused by OPA Gatekeeper timing out put significant load on the kube-apiservers, which caused the cluster to go offline.

You can identify this issue by reviewing the kube-scheduler logs using the following commands:

  1. For RKE(1) clusters, run the docker logs --tail 10 -t kube-scheduler command. If the output looks like the following, it's telling us that the kube-scheduler is having issues connecting to the OPA Gatekeeper service endpoint:

    ```

    2021-05-08T04:44:41.406070907Z E0508 04:44:41.405968       1 leaderelection.go:361] Failed to update lock: Internal error occurred: failed calling webhook "validation.gatekeeper.sh": Post "https://gatekeeper-webhook-service.gatekeeper-system.svc:443/v1/admit?timeout=3s": dial tcp 10.43.104.236:443: connect: connection refused

    ```

  2. By running the following command, you can discover which RKE server is currently hosting the kube-scheduler leader:

    ```

    NODE="$(kubectl get leases -n kube-system kube-scheduler -o 'jsonpath={.spec.holderIdentity}' | awk -F '_' '{print $1}')"

    echo "kube-scheduler is the leader on node $NODE"

    ```

  3. For RKE2 clusters, it's a little different because kube-scheduler runs as a pod instead of a standalone container. You can use the following command to show the logs for all the kube-scheduler Pods:

    kubectl -n kube-system logs -f -l component=kube-scheduler

To recover from this issue, you need to restore the OPA Gatekeeper Pods, but this is a problem because all new Pod creations are being blocked. To work around this, we need to remove the webhook, allow OPA Gatekeeper to restart successfully, and then restore the webhook:

  1. First, try setting the failure policy to open using the following command:

    kubectl get ValidatingWebhookConfiguration gatekeeper-validating-webhook-configuration -o yaml | sed 's/failurePolicy.*/failurePolicy: Ignore/g' | kubectl apply -f -

  2. If setting the policy to Ignore doesn't work, back up and remove the Gatekeeper admission webhook using the following commands:

    kubectl get validatingwebhookconfigurations.admissionregistration.k8s.io gatekeeper-validating-webhook-configuration -o yaml > webhook.yaml

    kubectl delete validatingwebhookconfigurations.admissionregistration.k8s.io gatekeeper-validating-webhook-configuration

  3. Monitor the cluster and wait for it to stabilize (a sketch for verifying that OPA Gatekeeper has recovered follows this list).
  4. Restore the webhook using the kubectl apply -f webhook.yaml command.
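
Before restoring the webhook in step 4, it is worth confirming that the Gatekeeper Pods are actually back up. The following is a minimal sketch, assuming the default gatekeeper-system namespace and the deployment name used by the upstream charts:

```
# Wait for the Gatekeeper controller to become available again
kubectl -n gatekeeper-system rollout status deployment/gatekeeper-controller-manager
kubectl -n gatekeeper-system get pods
```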

At this point, you should be able to recover from an OPA Gatekeeper outage. In addition, you should be able to use these steps for recovery of other software that uses webhooks in your cluster.

A runaway app stomping all over a cluster

One question that comes up a lot is, How can a single app bring down my cluster?

Let's say an application was deployed without CPU and memory limits. Its Pods can consume so much of a node's resources that the node becomes unresponsive and goes into an unschedulable state – that is, NotReady. By default, Kubernetes evicts and reschedules the Pods running on such a node after 5 minutes. The rescheduled Pods then overload the next node, and the process repeats until all nodes are broken.

Important Note

Most of the time, the node will crash and self-recover, meaning you'll only see nodes flipping up and down as the Pods are bouncing between nodes. But I have seen environments where the nodes become locked up but don't restart.

You can identify this issue by reviewing the cluster events using the kubectl get events -A command, which shows events for all namespaces. What we are looking for is a large number of Pod evictions, which is Kubernetes moving Pods off the dying/dead node. You can also review the current CPU and memory usage of the running Pods using the kubectl top pod -A command, which breaks the usage down by Pod. It's also recommended that you review any monitoring software, such as Prometheus, to watch node resource usage over time.
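The following is a minimal sketch of how you might narrow down the offending workload from the command line; the --sort-by and --field-selector options shown here are standard kubectl features, but the exact event reasons can vary between Kubernetes versions:

```
# Show eviction events across all namespaces
kubectl get events -A --field-selector reason=Evicted

# Show which nodes are flapping between Ready and NotReady
kubectl get nodes

# Find the heaviest memory and CPU consumers (requires metrics-server)
kubectl top pod -A --sort-by=memory | head -20
kubectl top pod -A --sort-by=cpu | head -20
```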

To recover from this issue, you need to disable the Pod/workload – for example, by scaling the deployment to zero using the kubectl -n <namespace> scale deployment/<deployment name> --replicas=0 command. Then, to prevent the issue from happening again, you should add resource requests and limits to all workloads by adding the following settings:

```
    resources:
      limits:
        cpu: "800m"
        memory: "500Mi"
      requests:
        cpu: "500m"
        memory: "250Mi"
```

It is important to note that in Chapter 12, Security and Compliance Using OPA Gatekeeper, we covered how to use OPA Gatekeeper to enforce these settings on all Pods in your cluster. It is highly recommended that you use that policy, which can be found at https://docs.rafay.co/recipes/governance/limits_policy/.

To reproduce this issue in the lab, you can find an example application, located at https://github.com/mattmattox/Kubernetes-Master-Class/tree/main/disaster-recovery/run-away-app.

At this point, you should be able to detect a runaway application in your cluster. Then, you should be able to apply resource requests and limits to stop the application from damaging your cluster. Finally, we covered how to use OPA Gatekeeper to prevent this issue in the future.

Can rotating kube-ca break my cluster?

What is kube-ca, and how can it break my cluster?

Kubernetes protects all of its services using SSL/TLS certificates, and as part of this, a Certificate Authority (CA) is needed for everything to work correctly. In the case of Kubernetes, kube-ca is the root certificate authority for the cluster and signs all the different certificates needed by the cluster. RKE creates key pairs for kube-apiserver, etcd, kube-scheduler, and more, and signs them using kube-ca. This also includes the kube-service-account-token certificate, which signs service account tokens as part of the authentication model. This means that if that chain is broken, kubectl and other Kubernetes services will choose the safest option and block the connection, as the token can no longer be trusted. And, of course, several services, such as canal, coredns, and the ingress-nginx controller, use service account tokens to communicate and authenticate with the cluster.

Typically, with RKE1/2, the kube-ca certificate is valid for 10 years, so there is normally no need for this certificate ever to be rotated. But it can be rotated for a couple of reasons, the first being a cluster upgrade. Sometimes, during a Kubernetes upgrade, cluster services change to different versions, requiring new certificates to be created. Most of the time, however, this issue is accidentally caused when someone runs the rke up command with a missing or out-of-date cluster.rkestate file on their local machine. This is because the rkestate file stores the certificates and their private keys, and if this file is missing, RKE defaults to generating new certificates as if it were building a new cluster. This process typically fails, as some services, such as kubelet, are still using the old certificates and tokens and so never go into a healthy state, causing the rke up process to error out and leaving the cluster in a broken state.
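Before running rke up against an existing cluster, it is worth confirming that you have the matching state file and checking when the current kube-ca expires. The following is a minimal sketch, assuming the RKE1 default certificate location of /etc/kubernetes/ssl on the nodes:

```
# On your workstation: confirm the state file for this cluster exists alongside cluster.yml
ls -l cluster.yml cluster.rkestate

# On a controlplane node: check the kube-ca certificate subject and expiry date
openssl x509 -in /etc/kubernetes/ssl/kube-ca.pem -noout -subject -enddate
```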

At this point, you should have a better understanding of what kube-ca is and how rotating it can affect your cluster. In addition, you should be able to fix the cluster by running the rke up command with the correct cluster.rkestate file.

How to fix a namespace that is stuck in terminating status

Why is my namespace stuck in termination?

When you run kubectl delete ns <namespace> on a namespace, status.phase is set to Terminating, at which point the kube-controller-manager waits for the finalizers to be removed. At this point, the different controllers detect that they need to clean up their resources inside the namespace.

For example, if you delete a namespace with a PVC inside it, the volume controller unmaps and deletes the volume(s), at which point the controller removes its finalizer. Once all the finalizers have been removed, the namespace is finally deleted. Finalizers are a safety mechanism built into Kubernetes to ensure that all objects are cleaned up before the namespace is deleted. This whole process can take a few minutes. The issue comes into play when a finalizer never gets removed.
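To see which finalizers a stuck namespace is waiting on, you can inspect both the spec and the metadata of the namespace object. A minimal sketch:

```
# Show the finalizers on a namespace (both spec.finalizers and metadata.finalizers can hold them)
kubectl get ns <namespace> -o jsonpath='{.spec.finalizers}{"\n"}{.metadata.finalizers}{"\n"}'

# Or review the full object, including status conditions that explain what is still pending
kubectl get ns <namespace> -o yaml
```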

We'll see some of the common finalizers and how to resolve them:

  • Rancher-created namespaces getting stuck.
  • Custom metrics causing all namespaces to be stuck.
  • The Longhorn system is stuck terminating.

Rancher-created namespaces getting stuck

In this example, when disabling/uninstalling monitoring in Rancher, the controller.cattle.io/namespace-auth finalizer is left behind by Rancher. Because of this, the namespace gets stuck in Terminating and will never resolve itself. You can confirm this issue by running the kubectl get ns NamespaceName -o yaml command.

It is important to note that this issue has mostly stopped occurring since Rancher v2.4 but still comes up if Rancher is unhealthy or disconnected from the cluster. In the following screenshot, you'll see the YAML output for a stuck namespace. The most important part that we want to look at is the finalizers section, which tells us which finalizers are currently assigned to this namespace:

Figure 15.2 – An example of a stuck namespace YAML output

To resolve this issue, you have two options:

  • Manually remove the finalizer using the kubectl edit namespace NamespaceName command; delete the line containing controller.cattle.io/namespace-auth and save the edit.
  • If you need to make a mass change for all namespaces in the cluster, you can run the following command:

    kubectl get ns | awk '{print $1}' | grep -v NAME | xargs -I{} kubectl patch namespace {}  -p '{"metadata":{"finalizers":[]}}' --type='merge' -n {}

Custom metrics causing all namespaces to be stuck

A common reason for a namespace getting stuck is the custom metrics endpoint. Prometheus adds an API service for custom.metrics.k8s.io/v1beta1, which exposes Prometheus metrics to Kubernetes features such as the Horizontal Pod Autoscaler (HPA). In this case, the kubernetes finalizer is the one left behind, which is not a very helpful status. You can confirm this issue by running the following command:

`kubectl get ns NamespaceName -o yaml`

In the following screenshot, you'll see a namespace with the kubernetes finalizer:

Figure 15.3 – A namespace stuck terminating with the Kubernetes finalizer

To resolve this issue, you have a few different options:

  • Fix Prometheus; as long as it is up and running, the finalizer should be removed automatically without issue.
  • If Prometheus has been disabled/removed from the cluster, you should clean up the leftover custom.metrics endpoint using the following commands:
    • Run kubectl get apiservice | grep metrics to find the name.
    • Delete it using the kubectl delete apiservice v1beta1.custom.metrics.k8s.io command.
  • You can also remove the finalizer by running the following command:

    for ns in $(kubectl get ns --field-selector status.phase=Terminating -o jsonpath='{.items[*].metadata.name}'); do  kubectl get ns $ns -ojson | jq '.spec.finalizers = []' | kubectl replace --raw "/api/v1/namespaces/$ns/finalize" -f -; done

It is important to note that this command fixes all the namespaces that are stuck in Terminating. Also, it does not fix the root cause but is more of a workaround to recover a broken cluster.

  • You can use a tool called knsk, which can be found at https://github.com/thyarles/knsk. The aim of this script is to fix stuck namespaces and clean up broken API resources.

The Longhorn system is stuck terminating

Another common issue is the longhorn-system namespace being stuck in Terminating after uninstalling Longhorn. This namespace is used by Longhorn and stores several Custom Resource Definitions (CRDs) and their custom resources. You can confirm this issue by running the kubectl get ns longhorn-system -o json command.

In the following screenshot, you'll see the JSON output for the longhorn-system namespace, which is the default namespace for Longhorn:

Figure 15.4 – longhorn-system stuck terminating with the Kubernetes finalizer

To resolve this issue, you have various options:

  • Run the Longhorn cleanup script, which can be found at https://longhorn.io/docs/1.2.4/deploy/uninstall/. This script cleans up all the other CRD resources used by Longhorn.
  • Run the following command to cycle through all the namespaced api-resource types in the cluster and list anything left in the namespace, so you can see what still needs to be deleted (a sketch for deleting the leftovers follows this list):

    kubectl api-resources --verbs=list --namespaced -o name | xargs -n 1 kubectl get --show-kind --ignore-not-found -n longhorn-system
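
Once you can see what is left, the following is a minimal sketch for deleting the leftover namespaced resources; this is an assumption-level cleanup loop rather than the official Longhorn procedure, so prefer the linked uninstall script when possible:

```
# Delete every remaining namespaced resource in longhorn-system
# Note: some Longhorn custom resources may have their own finalizers that block deletion
for res in $(kubectl api-resources --verbs=list,delete --namespaced -o name); do
  kubectl -n longhorn-system delete "$res" --all --ignore-not-found
done
```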

At this point, you should be able to clean up a namespace that is stuck in Terminating by finding which finalizer is assigned to it. Then, you should be able to resolve that finalizer or remove it.

General troubleshooting for RKE clusters

This section will cover some common troubleshooting commands and scripts that can be used to debug issues. All these commands and scripts are designed around standard RKE clusters.

Find the current leader node by running the following script. This script reviews the kube-scheduler endpoint in the kube-system namespace, which includes an annotation used for leader election.

This is the script for finding the kube-scheduler leader Pod:

curl https://raw.githubusercontent.com/mattmattox/k8s-troubleshooting/master/kube-scheduler | bash

Here is an output example of a healthy cluster:

```

kube-scheduler is the leader on node a1ubk8slabl03

```

Suppose this node is unhealthy or overlay networking isn't working correctly. In that case, kube-scheduler isn't operating correctly, and you should recycle the containers by running rke up. If that doesn't resolve the issue, you should stop the container on the leader node and allow another node to take over.
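A minimal sketch of forcing a leader change on an RKE1 cluster; kube-scheduler runs as a Docker container named kube-scheduler on each controlplane node, so stopping it on the current leader lets another node take over (restart it afterwards):

```
# On the current kube-scheduler leader node
docker stop kube-scheduler

# Wait for another node to acquire the lease (re-run the leader script to confirm), then
docker start kube-scheduler
```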

In order to show the etcd cluster members list, we'll use the following command:

docker exec etcd etcdctl member list

With the preceding command, you can see the current list of members – that is, the nodes in the etcd cluster.

Here is an output example of a healthy cluster from the preceding command:

```

2f080bc6ec98f39b, started, etcd-a1ubrkeat03, https://172.27.5.33:2380, https://172.27.5.33:2379,https://172.27.5.33:4001, false

9d7204f89b221ba3, started, etcd-a1ubrkeat01, https://172.27.5.31:2380, https://172.27.5.31:2379,https://172.27.5.31:4001, false

bd37bc0dc2e990b6, started, etcd-a1ubrkeat02, https://172.27.5.32:2380, https://172.27.5.32:2379,https://172.27.5.32:4001, false

```

If this list does not match the cluster – for example, it includes a node that should have been removed or a duplicate node – then you know that the etcd cluster is currently misconfigured and needs to be synced using RKE and etcd-tools.

To expand the member list command, you can run the following command to show the health status of each etcd node:

curl https://raw.githubusercontent.com/mattmattox/etcd-troubleshooting/master/etcd-endpoints | bash

It is important to note that this health check only shows that etcd is up and running; the node might be having other issues, such as a full filesystem or low memory, but may still report as healthy.

From the preceding command, this is an output example of a healthy cluster:

```

Validating connection to https://172.27.5.33:2379/health

{"health":"true"}

Validating connection to https://172.27.5.31:2379/health

{"health":"true"}

Validating connection to https://172.27.5.32:2379/health

{"health":"true"}

```

Finally, we will wrap up this section by going over some common errors and what they mean:

  • The following error tells us that etcd is failing to make a connection to a peer etcd node on port 2380, so we need to verify that the etcd container is up and running on that node. Your first step should be to review the logs of that etcd container:

    `health check for peer xxx could not connect: dial tcp IP:2380: getsockopt: connection refused`

  • The following error means that the etcd cluster has lost quorum and is trying to elect a new leader. Typically, this occurs when the majority of the nodes running etcd go down or cannot be reached – for example, if two out of three etcd nodes are down. This message usually appears following an outage, but if it is reported multiple times without etcd nodes being rebooted, it should be taken seriously, as it means the leader is switching nodes due to etcd timing out leader leases, which should be investigated:

    `xxx is starting a new election at term x`

  • The following error means that the TCP connection to an etcd node is timing out and the request sent by the client never received a response. This can be because the node is offline or because a firewall is dropping the traffic:

    `connection error: desc = "transport: Error while dialing dial tcp 0.0.0.0:2379: i/o timeout"; Reconnecting to {0.0.0.0:2379 0 <nil>}`

  • The etcd service stores the etcd node and cluster state in a directory (/var/lib/etcd). If this state is wrong for any reason, the node should be removed from the cluster and cleaned; the recommended cleanup script can be found at https://github.com/rancherlabs/support-tools/blob/master/extended-rancher-2-cleanup/extended-cleanup-rancher2.sh. The node can then be re-added to the cluster. The following error indicates this:

    `rafthttp: failed to find member.`

You can find more scripts and commands at https://github.com/mattmattox/Kubernetes-Master-Class/tree/main/troubleshooting-kubernetes.

At this point, you should be able to detect and resolve the most common failures that can happen with your RKE cluster. In addition, we covered how to prevent these kinds of failures from happening.

Summary

This chapter went over the main parts of an RKE1 and RKE2 cluster. We then dove into some of the common failure scenarios, covering how these scenarios happen, how to find them, and finally, how to resolve them.

We then closed out the chapter by covering some common troubleshooting commands and scripts that can be used to debug other issues.

In the next chapter, we are going to dive into the topic of CI/CD pipelines and image registries, including how to install tools such as Drone and Harbor. Then, we'll be covering how to integrate with our clusters. Finally, we'll be covering how to set up our applications to use the new pipelines.
