Diagnosing out-of-resource errors

When deleting agent-0, we can observe an out-of-resources issue. Only one node is left available, and the scheduler also reports a node that is out of disk space:

0s        1m        13        redis-slave-b58dc4644-lqkdp.1574e88a78913f1a   Pod                 Warning   FailedScheduling   default-scheduler   0/2 nodes are available: 1 Insufficient cpu, 1 node(s) were not ready, 1 node(s) were out of disk space.
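Note that the reasons in this message add up to three across only two nodes, so at least one node must be failing more than one scheduling check. The aggregated message is also easy to take apart programmatically; the following Python sketch is our own illustrative helper (not part of kubectl) that splits it into per-reason counts:

```python
import re

def parse_failed_scheduling(message):
    """Split the scheduler's aggregated FailedScheduling message
    ("0/N nodes are available: ...") into {reason: node count}."""
    _, _, reasons = message.partition("nodes are available: ")
    result = {}
    for part in reasons.split(", "):
        match = re.match(r"(\d+) (.+?)\.?$", part.strip())
        if match:
            result[match.group(2)] = int(match.group(1))
    return result

msg = ("0/2 nodes are available: 1 Insufficient cpu, "
       "1 node(s) were not ready, 1 node(s) were out of disk space.")
print(parse_failed_scheduling(msg))
# {'Insufficient cpu': 1, 'node(s) were not ready': 1,
#  'node(s) were out of disk space': 1}
```

The three counts sum to more than the two nodes in the cluster, which is our first hint that the not-ready node and the out-of-disk node are one and the same.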

To confirm this, run the following command:

kc get pods

When you run the command, you will get the following output:

NAME                          READY     STATUS    RESTARTS   AGE
redis-slave-b58dc4644-tcl2x   0/1       Pending   0          4h
redis-slave-b58dc4644-wtkwj   1/1       Unknown   0          6h
redis-slave-b58dc4644-xtdkx   1/1       Unknown   1          20h

If you launched the cluster on VMs with more vCPUs (ours was running the smallest available size, A1), you can recreate this issue by scaling the replicas to 10 or higher as follows:

kubectl scale --replicas=10 deployment/redis-slave

Now that we have confirmed the issue, let's get back to the error:

redis-slave-... Warning   FailedScheduling ...   0/2 nodes are available: 1 Insufficient cpu, 1 node(s) were not ready, 1 node(s) were out of disk space.

There are three errors as follows:

  • Insufficient CPU
  • One node not ready
  • One node out of disk space

Let's look at them in detail:

  • One node not ready: We know about this error because we caused it by deleting agent-0. We can also reasonably guess that it is the same node that is reporting out of disk space.

How can we make sure that it is the Insufficient cpu issue instead of the node being out of disk space? Let's explore this using the following steps:

  1. Get hold of a running pod:

kc get pods
NAME                        READY     STATUS    RESTARTS   AGE
frontend-56f7975f44-9k7f2   1/1       Unknown   0          5h
frontend-56f7975f44-lsnrq   1/1       Running   0          4h

  2. Use the kubectl exec command to run a shell inside the running pod as follows:

kc exec -it frontend-<running-pod-id> bash

  3. Once we are in, run the following command:

df -h

The output we should get will be similar to the following: 

root@frontend-56f7975f44-lsnrq:/var/www/html# df -h
Filesystem      Size   Used   Avail   Use%   Mounted on
overlay          30G    15G     15G    51%   /
tmpfs            64M      0     64M     0%   /dev
tmpfs           966M      0    966M     0%   /sys/fs/cgroup
/dev/sda1        30G    15G     15G    51%   /etc/hosts
shm              64M      0     64M     0%   /dev/shm
tmpfs           966M    12K    965M     1%   /run/secrets/kubernetes.io/serviceaccount
tmpfs           966M      0    966M     0%   /proc/acpi
tmpfs           966M      0    966M     0%   /proc/scsi
tmpfs           966M      0    966M     0%   /sys/firmware
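As a quick sanity check, you can compare the available space on the root filesystem against the kubelet's default disk-pressure eviction threshold (nodefs.available below 10%). The following Python sketch is our own illustrative helper; the 10% figure is the documented kubelet default, and the sample df_output is trimmed from the output above:

```python
def root_fs_available_percent(df_output):
    """Return the available space on '/' as a percentage of its size,
    parsed from `df -h`-style output (mount point in the last column)."""
    for line in df_output.splitlines():
        fields = line.split()
        if fields and fields[-1] == "/":
            use_percent = int(fields[-2].rstrip("%"))
            return 100 - use_percent
    raise ValueError("no root filesystem in df output")

# Trimmed from the df -h output above
df_output = """\
overlay          30G   15G    15G   51%   /
/dev/sda1        30G   15G    15G   51%   /etc/hosts"""

available = root_fs_available_percent(df_output)
print(available)       # 49
print(available < 10)  # False: well above the 10% eviction threshold
```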
  4. Clearly there is enough disk space on the node running this pod, yet a node is still being reported as out of disk. So, enter the following command to find out why the node is showing out of disk space:

kc describe nodes

This is not of much help in determining where the out of disk space issue is coming from (the Unknown status doesn't mean out of disk). This seems to be a bug in Kubernetes' event reporting mechanism, although it might be fixed by the time you read this.

In our case, the CPU is the bottleneck. So, let's see what Kubernetes is having trouble with by getting the ReplicaSet definition of redis-slave as follows:

ab443838-9b3e-4811-b287-74e417a9@Azure:~$ kc get rs
NAME                      DESIRED   CURRENT   READY     AGE
frontend-56f7975f44       1         1         1         20h
redis-master-6b464554c8   1         1         1         20h
redis-slave-b58dc4644     1         1         0         20h
ab443838-9b3e-4811-b287-74e417a9@Azure:~$ kc get -o yaml rs/redis-slave-b58dc4644
apiVersion: extensions/v1beta1
...
kind: ReplicaSet
...
        resources:
          requests:
            cpu: 100m
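With the cpu request in hand, you can do the scheduler's arithmetic yourself. The following sketch assumes a 1 vCPU (1,000 millicore) node, as on an A1-sized VM, and an illustrative allocatable figure of 940m left after system reservations; check kubectl describe nodes for the real allocatable value on your cluster:

```python
# Back-of-the-envelope sketch: how many pods requesting 100m CPU fit on
# a 1 vCPU node? The allocatable figure is an assumption for
# illustration, not a value measured from our cluster.
NODE_CPU_MILLI = 1000      # 1 vCPU on an A1-sized VM
ALLOCATABLE_MILLI = 940    # assumed allocatable after system reservations
POD_REQUEST_MILLI = 100    # cpu request from the redis-slave ReplicaSet

max_pods = ALLOCATABLE_MILLI // POD_REQUEST_MILLI
print(max_pods)  # 9
```

At these assumed numbers only nine 100m pods fit on the node, so a tenth replica would stay Pending with an Insufficient cpu message, exactly as we saw.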

You might think that since redis-slave is used only for reading, the application might still work. On the surface, it looks okay: the guestbook appears in the browser when we enter its IP address.

But if you try to add an entry, nothing happens.

The Developer Web Tools are good debugging tools for these cases, and are available in most browsers. You can launch them by right clicking and choosing Inspect:

After a page refresh, you can see this error in the Network tab:

<br />
<b>Fatal error</b>: Uncaught exception 'Predis\Connection\ConnectionException' with message 'Connection timed out [tcp://redis-slave:6379]' in /usr/local/lib/php/Predis/Connection/AbstractConnection.php:168

There are multiple ways we can solve this issue. In production, you would restart the node or add additional nodes. To demonstrate, we will try multiple approaches (all of them coming from practical experience).
