Detecting node problems

In Kubernetes' conceptual model, the unit of work is the pod, but pods are scheduled on nodes. When it comes to monitoring and reliability, the nodes require the most attention, because Kubernetes itself (the scheduler and replication controllers) takes care of the pods. Nodes can suffer from a variety of problems that Kubernetes is unaware of; as a result, the scheduler will keep placing pods on the bad nodes, and those pods may fail to function properly. Here are some of the problems that nodes may suffer while still appearing functional:

  • Bad CPU
  • Bad memory
  • Bad disk
  • Kernel deadlock
  • Corrupt filesystem
  • Problems with the Docker daemon

The kubelet and cAdvisor don't detect these issues. Another solution is needed. Enter the node problem detector.

Node problem detector

The node problem detector is a pod that runs on every node. It has to solve a difficult problem: detecting various problems across different environments, different hardware, and different OSes. It must be reliable enough not to be affected itself (otherwise, it can't report the problem), and it should have relatively low overhead to avoid spamming the master. In addition, it needs to run on every node. Kubernetes recently received a new capability called DaemonSet that addresses that last concern.

DaemonSet

A DaemonSet is a pod for every node. Once you define a DaemonSet, every node that's added to the cluster automatically gets a pod. If that pod dies, Kubernetes will start another instance of that pod on that node. Think of it as a fancy replication controller with 1:1 node-pod affinity. The node problem detector is defined as a DaemonSet, which is a perfect match for its requirements.
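To make this concrete, here is a minimal sketch of a DaemonSet manifest for the node problem detector, along the lines of the add-on that ships with Kubernetes 1.4. The image tag, resource limits, and log path are illustrative assumptions; check the project for the values that match your cluster:

    apiVersion: extensions/v1beta1
    kind: DaemonSet
    metadata:
      name: node-problem-detector
      namespace: kube-system
    spec:
      template:
        metadata:
          labels:
            app: node-problem-detector
        spec:
          containers:
          - name: node-problem-detector
            # Illustrative image tag; pick the release that matches your cluster
            image: gcr.io/google_containers/node-problem-detector:v0.2
            resources:
              limits:
                cpu: 200m
                memory: 100Mi
            volumeMounts:
            # The kernel monitor reads kernel logs from the host
            - name: log
              mountPath: /log
              readOnly: true
          volumes:
          - name: log
            hostPath:
              path: /var/log/

Create it with kubectl create -f node-problem-detector.yaml, and Kubernetes will run one instance on every node, including nodes added to the cluster later.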

Problem Daemons

The problem with the node problem detector (pun intended) is that there are too many problems for it to handle. Trying to cram all of them into a single codebase can lead to a complex, bloated, never-stabilizing program. The design of the node problem detector calls for separating the core functionality of reporting node problems to the master from the specific problem detection. The reporting API is based on generic conditions and events. The problem detection should be done by separate problem Daemons (each in its own container). This way, it is possible to add and evolve new problem detectors without impacting the core node problem detector. In addition, the control plane may have a remedy controller that can resolve some node problems automatically, thereby implementing self-healing:

[Figure: Problem Daemons]

Note

At this stage (Kubernetes 1.4), problem Daemons are baked into the node problem detector binary and execute as goroutines, so you don't get the benefits of the loosely coupled design just yet.
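Whichever way the problem Daemons run, their findings surface as standard node conditions and events. For example, the kernel monitor sets a KernelDeadlock condition on the node's status when it matches a known pattern in the kernel log. The snippet below is an illustrative condition as it might appear in the node object; the exact reason and message depend on the monitor's rules:

    conditions:
    # Set by the node problem detector's kernel monitor (illustrative values)
    - type: KernelDeadlock
      status: "True"
      lastHeartbeatTime: "2016-10-20T15:04:05Z"
      lastTransitionTime: "2016-10-20T15:04:05Z"
      reason: DockerHung
      message: "task docker:20744 blocked for more than 120 seconds."

You can inspect these conditions with kubectl describe node <node-name> or kubectl get node <node-name> -o yaml; one-off problems that don't warrant a persistent condition are reported as events instead.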

In this section, we covered the important topic of node problems, which can get in the way of successful scheduling of workloads, and how the node problem detector can help. In the next section, we'll talk about various failure scenarios and how to troubleshoot them using Heapster, central logging, the Kubernetes dashboard, and the node problem detector.
