Diagnosing out-of-resource errors

When deleting agent-0, we can observe an out-of-resources issue. Only one node is left available, and the scheduler also reports a node that is out of disk space:

0s        1m        13        redis-slave-b58dc4644-lqkdp.1574e88a78913f1a   Pod                 Warning   FailedScheduling   default-scheduler   0/2 nodes are available: 1 Insufficient cpu, 1 node(s) were not ready, 1 node(s) were out of disk space.
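Note that the reasons in this message add up to three across only two nodes, so at least one node must be failing more than one scheduling check. The aggregated message is also easy to take apart programmatically; the following Python sketch is our own illustrative helper (not part of kubectl) that splits it into per-reason counts:

```python
import re

def parse_failed_scheduling(message):
    """Split the scheduler's aggregated FailedScheduling message
    ("0/N nodes are available: ...") into {reason: node count}."""
    _, _, reasons = message.partition("nodes are available: ")
    result = {}
    for part in reasons.split(", "):
        match = re.match(r"(\d+) (.+?)\.?$", part.strip())
        if match:
            result[match.group(2)] = int(match.group(1))
    return result

msg = ("0/2 nodes are available: 1 Insufficient cpu, "
       "1 node(s) were not ready, 1 node(s) were out of disk space.")
print(parse_failed_scheduling(msg))
# {'Insufficient cpu': 1, 'node(s) were not ready': 1,
#  'node(s) were out of disk space': 1}
```

The three counts sum to more than the two nodes in the cluster, which is our first hint that the not-ready node and the out-of-disk node are one and the same.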

To confirm this, run the following command:

kc get pods

When you run the command, you will get the following output:

NAME                          READY     STATUS    RESTARTS   AGE
redis-slave-b58dc4644-tcl2x   0/1       Pending   0          4h
redis-slave-b58dc4644-wtkwj   1/1       Unknown   0          6h
redis-slave-b58dc4644-xtdkx   1/1       Unknown   1          20h

If you launched the cluster on VMs with more vCPUs (ours was running the smallest available size, A1), you can recreate this issue by scaling the replicas to 10 or higher as follows:

kubectl scale --replicas=10 deployment/redis-slave

Now that we have confirmed the issue, let's get back to the error:

redis-slave-... Warning   FailedScheduling ...   0/2 nodes are available: 1 Insufficient cpu, 1 node(s) were not ready, 1 node(s) were out of disk space.

There are three errors as follows:

  • Insufficient CPU
  • One node not ready
  • One node out of disk space

Let's look at them in detail:

  • One node not ready: We know about this error because we caused it by deleting agent-0. We can also reasonably guess that it is the same node that is reporting out of disk space.

How can we make sure that it is the Insufficient cpu issue instead of the node being out of disk space? Let's explore this using the following steps:

  1. Get hold of a running pod:

kc get pods
NAME                        READY     STATUS    RESTARTS   AGE
frontend-56f7975f44-9k7f2   1/1       Unknown   0          5h
frontend-56f7975f44-lsnrq   1/1       Running   0          4h

  2. Use the kubectl exec command to run a shell inside the running pod as follows:

kc exec -it frontend-<running-pod-id> bash

  3. Once we are in, run the following command:

df -h

The output we should get will be similar to the following: 

root@frontend-56f7975f44-lsnrq:/var/www/html# df -h
Filesystem      Size   Used   Avail   Use%   Mounted on
overlay          30G    15G     15G    51%   /
tmpfs            64M      0     64M     0%   /dev
tmpfs           966M      0    966M     0%   /sys/fs/cgroup
/dev/sda1        30G    15G     15G    51%   /etc/hosts
shm              64M      0     64M     0%   /dev/shm
tmpfs           966M    12K    965M     1%   /run/secrets/kubernetes.io/serviceaccount
tmpfs           966M      0    966M     0%   /proc/acpi
tmpfs           966M      0    966M     0%   /proc/scsi
tmpfs           966M      0    966M     0%   /sys/firmware
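As a quick sanity check, you can compare the available space on the root filesystem against the kubelet's default disk-pressure eviction threshold (nodefs.available below 10%). The following Python sketch is our own illustrative helper; the 10% figure is the documented kubelet default, and the sample df_output is trimmed from the output above:

```python
def root_fs_available_percent(df_output):
    """Return the available space on '/' as a percentage of its size,
    parsed from `df -h`-style output (mount point in the last column)."""
    for line in df_output.splitlines():
        fields = line.split()
        if fields and fields[-1] == "/":
            use_percent = int(fields[-2].rstrip("%"))
            return 100 - use_percent
    raise ValueError("no root filesystem in df output")

# Trimmed from the df -h output above
df_output = """\
overlay          30G   15G    15G   51%   /
/dev/sda1        30G   15G    15G   51%   /etc/hosts"""

available = root_fs_available_percent(df_output)
print(available)       # 49
print(available < 10)  # False: well above the 10% eviction threshold
```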
  4. Clearly there is enough disk space on the node running this pod, yet a node is still being reported as out of disk. So, enter the following command to find out why the node is showing out of disk space:

kc describe nodes

This is not of much help in determining where the out of disk space issue is coming from (the Unknown status doesn't mean out of disk). This seems to be a bug in Kubernetes' event reporting mechanism, although it might be fixed by the time you read this.

In our case, the CPU is the bottleneck. So, let's see what Kubernetes is having trouble with by getting the ReplicaSet definition of redis-slave as follows:

ab443838-9b3e-4811-b287-74e417a9@Azure:~$ kc get rs
NAME                      DESIRED   CURRENT   READY     AGE
frontend-56f7975f44       1         1         1         20h
redis-master-6b464554c8   1         1         1         20h
redis-slave-b58dc4644     1         1         0         20h
ab443838-9b3e-4811-b287-74e417a9@Azure:~$ kc get -o yaml rs/redis-slave-b58dc4644
apiVersion: extensions/v1beta1
...
kind: ReplicaSet
...
        resources:
          requests:
            cpu: 100m
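With the cpu request in hand, you can do the scheduler's arithmetic yourself. The following sketch assumes a 1 vCPU (1,000 millicore) node, as on an A1-sized VM, and an illustrative allocatable figure of 940m left after system reservations; check kubectl describe nodes for the real allocatable value on your cluster:

```python
# Back-of-the-envelope sketch: how many pods requesting 100m CPU fit on
# a 1 vCPU node? The allocatable figure is an assumption for
# illustration, not a value measured from our cluster.
NODE_CPU_MILLI = 1000      # 1 vCPU on an A1-sized VM
ALLOCATABLE_MILLI = 940    # assumed allocatable after system reservations
POD_REQUEST_MILLI = 100    # cpu request from the redis-slave ReplicaSet

max_pods = ALLOCATABLE_MILLI // POD_REQUEST_MILLI
print(max_pods)  # 9
```

At these assumed numbers only nine 100m pods fit on the node, so a tenth replica would stay Pending with an Insufficient cpu message, exactly as we saw.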

You might think that since redis-slave is used only for reading, the application might still work. On the surface, it looks okay: the guestbook appears in the browser when we enter its IP address.

But if you try to add an entry, nothing happens.

The Developer Web Tools are good debugging tools for these cases, and are available in most browsers. You can launch them by right clicking and choosing Inspect:

After a page refresh, you can see this error in the Network tab:

<br />
<b>Fatal error</b>: Uncaught exception 'Predis\Connection\ConnectionException' with message 'Connection timed out [tcp://redis-slave:6379]' in /usr/local/lib/php/Predis/Connection/AbstractConnection.php:168

There are multiple ways we can solve this issue. In production, you would restart the node or add additional nodes. To demonstrate, we will try multiple approaches (all of them coming from practical experience).
