The concepts explained in Chapter 5, OpenShift Deployment, provided the foundation for your first contact with an OpenShift cluster. In this chapter, we will give some tips on how to perform a health check on a cluster, dive into some root cause analysis (RCA), and provide details on how to make a cluster run according to best practices. Our intention with this chapter is to give you general guidance about troubleshooting; however, we recommend you always open a support ticket with Red Hat before making any changes to the platform as part of a troubleshooting attempt.
This chapter covers the following topics:
Note
The source code used in this chapter is available at https://github.com/PacktPublishing/OpenShift-Multi-Cluster-Management-Handbook/tree/main/chapter06.
Every time we start learning about a technology, it is common to be twice as careful with the installation, configuration, or adjustments, to be as thorough as possible. To achieve goals related to troubleshooting, performance, and best practices, you might turn to multiple expert texts on each of the related topics, or go through the pain of trial and error, which takes a lot of effort to get right.
OpenShift is a great and disruptive technology, but you will navigate through a puzzle of different aspects related to storage, compute, network, and others. Obviously, in the official documentation—or even in quick internet searches—you will find the commands to get started, but in many situations, even with the necessary commands and parameters, it is difficult to navigate from troubleshooting to a solution.
Currently, OpenShift has an automatic recovery system, but this alone is usually not enough to ensure a stable environment. For this self-healing to take place successfully, many prerequisites need to be checked on the cluster first. So, before we look at what can potentially crash, let's understand how this self-adjustment mechanism works.
In the world of technology, there are many roles played, and some of them are linked to the administration of infrastructure. There are several names for this role, the most common still being the system administrator, or sysadmin, who operates the servers and services of an information technology (IT) infrastructure. Likewise, OpenShift has operators, which are nothing more than applications designed to monitor platform behavior and maintain its operation.
How do operators work? Operators are assigned a single task: maintaining an application and all its components according to a standard. Note that operators are not the same for all applications; each operator is unique, with its own parameter definitions and its own required and optional configurations.
The operator parameters' contract is described in a Custom Resource Definition (CRD). A CRD is a definition that extends the Kubernetes application programming interface (API), giving the cluster the flexibility to store a collection of objects of a certain type. Once a CRD is defined, you can create Custom Resources (CRs), which are instances of that custom API, in the cluster.
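As a minimal sketch of this relationship (the group, kind, and schema below are hypothetical examples, not taken from any real OpenShift operator), a CRD and a CR created from it might look like this:

```yaml
# Hypothetical CRD: teaches the API server about a new "Backup" object type
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: backups.example.hybridcloud.com
spec:
  group: example.hybridcloud.com
  scope: Namespaced
  names:
    plural: backups
    singular: backup
    kind: Backup
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                schedule:
                  type: string
---
# A CR of that new type, which an operator would watch and reconcile
apiVersion: example.hybridcloud.com/v1
kind: Backup
metadata:
  name: nightly
spec:
  schedule: "0 2 * * *"
```

An operator for this hypothetical type would watch Backup objects and act whenever one is created or changed.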
Operators are a tool for keeping a cluster or application healthy, so why should we care about learning OpenShift troubleshooting if it fixes itself? Indeed, operators are a powerful tool, but as we mentioned earlier, OpenShift is a big puzzle, and the pieces need to fit together perfectly for it to work properly. Although it is governed by operators that are prepared to maintain its integrity, failures can still occur, and the cluster administrator's experience in solving problems will help keep all these operators healthy.
In the next sections, we’ll go deeper into the main components of OpenShift and which aspects to be concerned about.
In the case of OpenShift clusters, etcd is a distributed key-value store responsible for holding the state of the cluster. In it, all objects contained in the cluster are stored in a key-value format, so it is important to consider at least three important factors in this service, which is the heart of the control plane's operation. Note the following:
Red Hat reduced etcd's complexity by setting the default number of master nodes to three and by providing a cluster operator that manages etcd and reports any issues with it; however, you must still understand how etcd works to be able to troubleshoot any complex issue that occurs. Read on to learn how etcd's quorum and leader-based algorithm works.
An etcd cluster works on the concept of leader and followers, which is known as the Raft Distributed Consensus protocol. This protocol implements an algorithm based on a leader election to establish a distributed consensus among all members of an etcd cluster. Once members are added to an etcd cluster and a leader is elected, the process only requires sending periodic heartbeats to confirm that the leader still responds within a suitable latency time.
If heartbeats go unanswered within the expected time frame, the members start a new election to guarantee cluster resilience, self-healing, and continuity of service.
It is recommended that an etcd cluster has an odd number of nodes, so that the following formula guarantees tolerance of a given number of failed members. The resulting number of working members required is called the quorum:
Quorum = (n/2)+1, where “n” represents the number of members.
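The formula can be checked quickly for the usual cluster sizes. Note how an even-sized cluster raises the quorum without raising the failure tolerance, which is why odd sizes are recommended. Here is a small shell sketch:

```shell
# Quorum and failure tolerance for common etcd cluster sizes
# quorum = floor(n/2) + 1; tolerance = n - quorum
for n in 1 2 3 4 5; do
  quorum=$(( n / 2 + 1 ))
  echo "members=$n quorum=$quorum tolerance=$(( n - quorum ))"
done
```

Among other lines, this prints `members=3 quorum=2 tolerance=1` and `members=4 quorum=3 tolerance=1`: a fourth member adds cost and no extra resilience.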
A cluster must always have at least the quorum number of members working to be functioning properly. For the sake of clarity, let’s check out some scenarios, as follows:
Figure 6.1 – Healthy etcd cluster (three-node member health)
Analysis: Quorum is OK as there are a majority of working members and leadership is assured, so the cluster is healthy.
Figure 6.2 – Healthy etcd cluster (two-node member health; risk of outage)
Analysis: Quorum is OK as there are a majority of working members and leadership is assured. There is a degradation risk in case of disruption of one more node, but the cluster is healthy.
Figure 6.3 – Degraded etcd cluster (one-node member health; unhealthy cluster)
Analysis: There is no quorum as the majority of members are down, so it is no longer possible to elect a new leader, and the cluster is degraded.
As mentioned earlier, OpenShift is managed by its operators, which can provide standardization, self-healing, and activity metrics, but this is not always enough to keep the cluster fully functional.
Situations such as scenarios 2 and 3 can occur due to different factors related to any infrastructure layer. It is important to carry out an in-depth analysis to restore the cluster to a functional state. Here’s an approach to troubleshooting etcd.
For troubleshooting, it is important to consider what kind of disruption the cluster has been hit by. If you are still able to use the OpenShift API to access the nodes, use the first approach; if the API is unavailable, refer to the second scenario.
In cases where it is still possible to execute oc or kubectl commands, the quickest approach is to use the rsh command to open a shell in an etcd pod and perform the following steps through etcdctl.
An etcd degradation can occur for many different reasons, such as storage or network failures, live migration of the master nodes, or even manipulations in the operating system (OS), which may cause immediate disruption of the cluster.
As mentioned before, run the following commands to open a terminal in an etcd pod and identify the failure:
$ oc project openshift-etcd
$ oc get pods -n openshift-etcd | grep -v etcd-quorum-guard | grep etcd
etcd-ocp-master-0   3/3   Pending            0   14m
etcd-ocp-master-1   3/3   CrashLoopBackOff   6   17m
etcd-ocp-master-3   2/3   Running            0   9m11s
Notice that in the previous output, two out of three masters are having issues, so let's rsh to master-3, perform a backup, and recreate the etcd nodes, as follows:
$ oc rsh etcd-ocp-master-3
Defaulting container name to etcdctl.
Use 'oc describe pod/etcd-ocp-master-3 -n openshift-etcd' to see all of the containers in this pod.
sh-4.4# etcdctl member list -w table
+------------------+---------+--------------+------------------------------+------------------------------+------------+
|        ID        | STATUS  |     NAME     |          PEER ADDRS          |         CLIENT ADDRS         | IS LEARNER |
+------------------+---------+--------------+------------------------------+------------------------------+------------+
| 5bdacda4e0f48d91 | failed  | ocp-master-1 | https://192.168.200.235:2380 | https://192.168.200.235:2379 | false      |
| b50b656cba1b0122 | started | ocp-master-3 | https://192.168.200.14:2380  | https://192.168.200.14:2379  | false      |
| cdc9d2f71033600a | failed  | ocp-master-0 | https://192.168.200.234:2380 | https://192.168.200.234:2379 | false      |
+------------------+---------+--------------+------------------------------+------------------------------+------------+
sh-4.4# etcdctl endpoint status -w table
+------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|           ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.200.14:2379  | b50b656cba1b0122 | 3.4.9   | 136 MB  | true      | false      | 281       | 133213554  | 133213554          |        |
| https://192.168.200.234:2379 | cdc9d2f71033600a | 3.4.9   | 137 MB  | false     | false      | 281       | 133213554  | 133213554          |        |
| https://192.168.200.235:2379 | 5bdacda4e0f48d91 | 3.4.9   | 136 MB  | false     | false      | 281       | 133213554  | 133213554          |        |
+------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
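If the member table is long, it can be handy to extract the failed member IDs programmatically. The following is a sketch; it assumes you saved the `etcdctl member list -w table` output to a file, with the sample rows from the preceding output inlined here for illustration:

```shell
# Save the member list (inlined here for illustration) and pull out the failed member IDs.
cat > /tmp/members.txt <<'EOF'
| 5bdacda4e0f48d91 | failed  | ocp-master-1 | https://192.168.200.235:2380 | https://192.168.200.235:2379 | false |
| b50b656cba1b0122 | started | ocp-master-3 | https://192.168.200.14:2380  | https://192.168.200.14:2379  | false |
| cdc9d2f71033600a | failed  | ocp-master-0 | https://192.168.200.234:2380 | https://192.168.200.234:2379 | false |
EOF
# Column 2 is the ID and column 3 is the status when splitting on "|"
awk -F'|' '$3 ~ /failed/ { gsub(/ /, "", $2); print $2 }' /tmp/members.txt
```

This prints the two failed IDs (5bdacda4e0f48d91 and cdc9d2f71033600a), which are exactly the values you need for the member removal step described next.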
Once the identifiers (IDs) of the failed members are determined, the next step is to remove the etcd members, leaving only the one that is running as part of the cluster. To do so, first run the backup of etcd, like so:
/usr/local/bin/cluster-backup.sh /home/core/assets/backup
Note
The etcd backup will be saved in the home directory of the core user in the same node as the one where the etcd pod is running. It is highly recommended you copy it to an off-node location as this will avoid the risk of losing access to the node and, as such, losing the etcd backup file and putting the cluster recovery at risk.
Now that you have identified the problem with the etcd cluster, see next the suggested steps to recover it.
Formerly, on OpenShift version 3, the usual solution for both cases was reestablishing the etcd cluster using a backup. Now, with OpenShift 4, it is easier to provision new master nodes in your infrastructure. That said, if your issue matches the first scenario (the API is still available) and your installation used the installer-provisioned infrastructure (IPI) method, we suggest the following steps to recreate the problematic master node using a new node name:
$ oc get machine <master-node> \
    -n openshift-machine-api \
    -o yaml \
    > new-master-machine.yaml
The resulting new-master-machine.yaml file will look similar to the following:
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
finalizers:
- machine.machine.openshift.io
labels:
machine.openshift.io/cluster-api-cluster: ocp-sgw5f
(.. omitted ..)
name: ocp-master-4
namespace: openshift-machine-api
selfLink: /apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machines/ocp-master-4
spec:
metadata: {}
providerSpec:
value:
apiVersion: vsphereprovider.openshift.io/v1beta1
credentialsSecret:
name: vsphere-cloud-credentials
diskGiB: 120
kind: VSphereMachineProviderSpec
(.. omitted ..)
Change the metadata.name field to a new name (for example, <clustername>-<clusterid>-master-3).
$ oc delete machine <problematic-master-node-name> -n openshift-machine-api
$ oc get machines -n openshift-machine-api -o wide
$ oc apply -f new-master-machine.yaml
Note
Repeat this procedure if necessary, changing only the metadata.name and metadata.selfLink fields for each problematic master node in your cluster.
After the new master nodes have been provisioned, follow these steps to verify that the etcd cluster is healthy:
$ oc get pods -n openshift-etcd | grep -v etcd-quorum-guard | grep etcd
$ oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
# Get the name of one etcd pod
$ oc get pods -n openshift-etcd | grep -v etcd-quorum-guard | grep etcd
$ oc rsh <etcd-pod-name> -n openshift-etcd
etcdctl member list -w table
etcdctl member remove <member-id>
If the OpenShift API is down, it is important to perform any steps with much more caution to avoid an irreversible loss to the cluster. In such a scenario, you can't use the rsh command, get logs using oc logs, or use any other oc or kubectl command, as all of them rely on the OpenShift API, which makes troubleshooting and finding a solution much more difficult and complex.
Due to that, there must be regular etcd backups in place before the cluster malfunctions. If there is no previous backup, the first step is to perform a direct backup on the node that is in operation. To do so, proceed as follows:
$ ssh -i ~/.ssh/id_rsa core@ocp-master-3
$ sudo crictl ps | grep -i etcd
$ crictl exec bd077a3f1b211 etcdctl member list -w table
+------------------+---------+--------------+------------------------------+------------------------------+------------+
|        ID        | STATUS  |     NAME     |          PEER ADDRS          |         CLIENT ADDRS         | IS LEARNER |
+------------------+---------+--------------+------------------------------+------------------------------+------------+
| 9e715067705c0f7c | unknown | ocp-master-4 | https://192.168.200.15:2380  | https://192.168.200.15:2379  | false      |
| b50b656cba1b0122 | started | ocp-master-3 | https://192.168.200.14:2380  | https://192.168.200.14:2379  | false      |
| cdc9d2f71033600a | failed  | ocp-master-0 | https://192.168.200.234:2380 | https://192.168.200.234:2379 | false      |
+------------------+---------+--------------+------------------------------+------------------------------+------------+
sudo /usr/local/bin/cluster-backup.sh /home/core/assets/backup
$ crictl exec bd077a3f1b211 etcdctl endpoint status -w table
+------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|           ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.200.14:2379  | b50b656cba1b0122 | 3.4.9   | 136 MB  | true      | false      | 491       | 133275501  | 133275501          |        |
| https://192.168.200.15:2379  | 9e715067705c0f7c | 3.4.9   | 137 MB  | false     | false      | 491       | 133275501  | 133275501          |        |
+------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
At this stage, it is possible to draw some conclusions: there is no minimum quorum to keep the cluster up, so the API became unavailable. Read on for a suggestion on how to proceed in this scenario.
In general, to restore a cluster in this scenario, you will use the etcdctl member remove command to remove all problematic etcd members and then use the procedure described in the previous How to solve it? section to remove the problematic master nodes and provision new ones. However, troubleshooting a cluster in this scenario is much more challenging, so work with Red Hat support to find the best way to restore it.
Now that we have already gone through etcd troubleshooting, let’s discuss another important aspect of it: performance analysis.
Kubernetes clusters are highly sensitive to latency and throughput, so some precautions are necessary to achieve a stable cluster with great performance. OpenShift is a platform designed for high availability (HA), and, as such, etcd usage is expected to be traffic-intensive. It is important, then, to follow some best practices to have a stable cluster. Let's look at some recommended configurations.
etcd's disk usage is intensive, so it is recommended to use solid-state drive (SSD) disks for fast write/read response times. Regarding response times, we could say that 50 sequential input/output operations per second (IOPS) would be a minimum requirement, but from our experience, OpenShift usage grows really fast, so we recommend you consider disks that can deliver at least 500 concurrent IOPS to maintain the cluster's health and stability. However, note that some providers do not publish sequential IOPS, only shared IOPS. In such cases, consider that concurrent IOPS is equivalent to 10 times the sequential IOPS value.
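As a quick sanity check of that rule of thumb, the conversion is simple arithmetic; the numbers below are the ones from this section:

```shell
# A provider publishes only sequential IOPS; estimate the concurrent
# equivalent using the rule of thumb above (concurrent ~= 10x sequential)
sequential_iops=50
concurrent_iops=$(( sequential_iops * 10 ))
echo "~${concurrent_iops} concurrent IOPS"
```

A disk advertised at 50 sequential IOPS therefore lands right at the 500 concurrent IOPS minimum we recommend.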
Here is an example of how to measure the performance of the etcd disks using a customized version of the fio tool. In the OpenShift cluster, run the debug command to get access to a master node, like so:
$ oc debug node/master1.ocp.hybridcloud.com
As soon as the command is executed, the following message will be displayed. Run the chroot command at the shell prompt to be able to execute commands in privileged mode:
Starting pod/ocp-master1hybridcloud-debug ...
To use host binaries, run `chroot /host`
Pod IP: 172.19.10.4
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4#
Create a container, as indicated in the following code snippet. After the etcd-perf container starts, it will automatically run performance checks:
sh-4.4# podman run --volume /var/lib/etcd:/var/lib/etcd:Z quay.io/openshift-scale/etcd-perf
Trying to pull quay.io/openshift-scale/etcd-perf:latest...
Getting image source signatures
(.. omitted ..)
------- Running fio -------
{
  "fio version" : "fio-3.7",
  "timestamp" : 1631814461,
  "timestamp_ms" : 1631814461780,
  "time" : "Thu Sep 16 17:47:41 2021",
  "global options" : {
    "rw" : "write",
    "ioengine" : "sync",
    "fdatasync" : "1",
    "directory" : "/var/lib/etcd",
    "size" : "22m",   [1]
    "bs" : "2300"     [2]
  },
  (.. omitted ..)
  "write" : {
    "io_bytes" : 23066700,
    "io_kbytes" : 22526,
    "bw_bytes" : 1319077,
    "bw" : 1288,           [3]
    "iops" : 573.511752,   [4]
    (.. omitted ..)
    "read_ticks" : 3309,
    "write_ticks" : 29285,
    "in_queue" : 32594,
    "util" : 98.318751
  }
]
}
---------
99th percentile of fsync is 5406720 ns
99th percentile of the fsync is within the recommended threshold - 10 ms, the disk can be used to host etcd   [5]
In the preceding code snippet, we have used the following annotations:
[1]: A chunk size of 22 megabytes (MB) is usually enough to analyze performance results.
[2]: Instead of using a 4k block size, etcd uses small chunks of a 2.3k block size, so the test guarantees realistic performance measurement, including with small write fragmentation.
[3]: The bandwidth required considering traffic between the node and the underlying storage. It is recommended you use at least a 1 gigabit per second (Gbps) network interface. For medium and large clusters, the recommendation is a 10 Gbps interface.
[4]: The recommendation is at least 500 concurrent IOPS, as explained previously.
[5]: The report of the etcd I/O check. In the example, 5.40 milliseconds (ms) demonstrates a reliable performance—for the disk to be considered suitable, this value must be under 10 ms.
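If you script this verification yourself, remember that the fio report expresses the fsync percentile in nanoseconds, while the threshold is stated in milliseconds. A small sketch of the comparison, using the value from the sample report above:

```shell
# p99 fsync latency from the fio report (nanoseconds) vs. the 10 ms threshold
p99_ns=5406720
threshold_ns=$(( 10 * 1000000 ))
if [ "$p99_ns" -lt "$threshold_ns" ]; then
  echo "OK: p99 fsync $(( p99_ns / 1000 )) us is under the 10 ms threshold"
else
  echo "WARN: disk is too slow to host etcd"
fi
```

With the sample value, 5406720 ns is about 5.4 ms, comfortably under the threshold, so the check reports OK.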
Besides using etcd-perf to check disk performance, you can also run fio directly with custom parameters as you need, such as block size, chunk size, and so on; the fio binary tool is available through the standard Red Hat package manager (for example, by executing yum/dnf install fio).
Note
For didactic reasons, we suppressed some results, leaving only items that are pertinent to our analysis.
To avoid any problems related to the central processing unit (CPU), you must understand whether your cluster is well sized. You must consider some factors when checking cluster sizing, such as the number of clients using the platform, the expected number of requests per second, and the amount of storage available for etcd.
First, let’s give you some parameters to consider for your cluster size:
The following table demonstrates some use cases using public clouds and on-premises infrastructures according to the amount of CPU, memory, disk IOPS, and bandwidth linked to the cluster size:
In a nutshell, when you size a cluster, you should consider these thresholds because they have already been benchmarked by the etcd community, and performance will likely be acceptable if these recommendations are followed. Further information regarding sizing an etcd cluster can be found at the link we have provided in the Further reading section of this chapter.
In this section, you have seen some ways to check etcd performance and troubleshoot it, and you also got some important information regarding sizing best practices. We hope you enjoyed the approach; now take a close look at the next section about authentication, which is another interesting theme.
Another important aspect of an OpenShift cluster is user authentication and authorization flow. OpenShift’s flexibility and easy-to-use authentication plugins are a smart way of setting up users and groups. Instead of simply having a vault of usernames and passwords, OpenShift’s authentication service can authenticate a user in a variety of ways—we call it an identity provider (IdP). In this way, OpenShift is responsible for trusting the IdP and allowing or denying authentication according to the provider. In the following diagram, you can see how the process of authenticating a user works:
Figure 6.4 – Authentication and authorization flow
The IdP is responsible for notifying OpenShift if the username and password are valid and returning to OpenShift the success or failure authentication status. This process is known as authentication (AuthN in some literature).
We mentioned that the authentication process uses IdPs to validate the user against an authentication provider, but, just as importantly, you need to understand how authorization (also known as AuthZ) occurs. Initially, a user on OpenShift can log on but doesn't have any permissions in any project; they will not have the rights to perform any tasks, except creating their own projects if the cluster has the self-provisioner role enabled (self-provisioner is a role that allows any logged-in user to create their own projects). To grant permissions to a user, it is necessary to assign the appropriate role. This can be accomplished by adding the user to a group that has the role or by assigning the role directly to the user. This process of binding roles to users or groups is done through RoleBindings.
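Under the hood, such an assignment is stored as a RoleBinding object. The following is a minimal sketch (the user name mary and the project myproject are hypothetical; edit is one of OpenShift's default cluster roles):

```yaml
# RoleBinding granting the default "edit" role to user "mary" in project "myproject"
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: edit-mary
  namespace: myproject
subjects:
  - kind: User
    name: mary
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: edit
  apiGroup: rbac.authorization.k8s.io
```

You rarely write these by hand; the oc adm policy commands shown in this section create equivalent objects for you.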
To better understand which role best fits a user or group, consider looking at the default roles that already exist in an OpenShift cluster, as set out here:
You must also understand the scope of the RoleBinding, which can be one of the following:
$ oc adm policy add-role-to-user <role> <user> -n <project>
$ oc adm policy add-cluster-role-to-user <role> <user>
Similar commands can be applied to groups, just replacing user with group (for example, add-role-to-group and add-cluster-role-to-group). Similarly, to remove a role from a user or group, use remove-role-from-user/group, as in the following example:
$ oc adm policy remove-role-from-user <role> <user> -n <project>
$ oc adm policy remove-role-from-group <role> <group> -n <project>
$ oc adm policy remove-cluster-role-from-user <role> <user>
$ oc adm policy remove-cluster-role-from-group <role> <group>
These are some of the most popular default roles used with OpenShift, but you can create custom ones if needed. To create a custom role, you first need to understand what verbs and resources are, so here are their definitions:
That said, to define a custom role, you need to know which verbs for the user or group will be allowed to run over which objects. As soon as you have defined verbs and resources, a role can be created using the following command:
$ oc create role <role-name> --verb=<verbs-list> --resource=<resources-list>
Have a look at the following example:
$ oc create role sample --verb=get,list,watch --resource=pods,pods/status
There is more to authentication and authorization with OpenShift than we intend to cover here. We have tried to highlight the important aspects you need to know, and we have left a set of links in the Further reading section of this chapter if you want to go deeper into this subject.
With that, we have demystified the authentication process a bit, and you can now perform the process of AuthN and AuthZ. The previous diagram showed a quick point of view about the steps of an authentication process. It is important to give proper permissions to each user or group, and—more importantly—to plan the roles you will need to have in place to give your users and groups the proper permissions to perform their job. In the following section, we will cover another important aspect that an OpenShift operator needs to know about: troubleshooting.
In this section, you will see some approaches to troubleshooting your OpenShift cluster if you face any issues. Due to the power of the oc command-line interface (CLI), you will have different ways to succeed in almost any troubleshooting scenario of your OpenShift cluster. Along with your training, you will gain the experience you need to take a step further in using and troubleshooting your OpenShift/Kubernetes cluster.
As we have mentioned, the oc CLI is a powerful tool that helps OpenShift users perform many operations, including troubleshooting. One of the first steps of troubleshooting is to get the details and description of an object. Suppose, for instance, you have an issue with a pod that is not coming up for some reason. Let's start our troubleshooting by checking the pod details, as follows:
$ oc describe pod sso-10-qm2hc
Check the output in the Events section of the object to see what is preventing the pod from spinning up, as illustrated in the following code snippet:
Events:
  Type     Reason          Age                From               Message
  ----     ------          ----               ----               -------
  Normal   Scheduled       90s                default-scheduler  Successfully assigned rhsso/sso-10-qm2hc to ocp-hml4.hybridcloud.com
  Normal   AddedInterface  89s                multus             Add eth0 [10.242.22.12/23] from openshift-sdn
  Normal   Pulling         39s (x3 over 89s)  kubelet            Pulling image "image-registry.openshift-image-registry.svc:5000/rhsso/sso74-custom"
  Warning  Failed          33s (x3 over 83s)  kubelet            Failed to pull image "image-registry.openshift-image-registry.svc:5000/rhsso/sso74-custom": rpc error: code = Unknown desc = pinging container registry image-registry.openshift-image-registry.svc:5000: Get "https://image-registry.openshift-image-registry.svc:5000/v2/": dial tcp 10.244.109.169:5000: connect: no route to host
  Warning  Failed          33s (x3 over 83s)  kubelet            Error: ErrImagePull
  Normal   BackOff         8s (x4 over 83s)   kubelet            Back-off pulling image "image-registry.openshift-image-registry.svc:5000/rhsso/sso74-custom"
  Warning  Failed          8s (x4 over 83s)   kubelet            Error: ImagePullBackOff
In this case, you were able to see quickly in the oc describe command that the error is related to the connection between the node and the image registry (no route to host). You can act accordingly to fix the connectivity issue and get the pod up and running. You can also use the Events log to see other meaningful information, as you can see in the following section.
Another parameter on the oc CLI that helps with problem investigation is the oc get events command. This is very useful for showing a log of recently executed tasks, along with success or error messages. Events can be queried cluster-wide or scoped to a project. Check out the following sample of event logs:
$ oc get events -n openshift-image-registry
LAST SEEN  TYPE    REASON          OBJECT                                                 MESSAGE
35m        Normal  Scheduled       pod/cluster-image-registry-operator-7456697c64-88hxc  Successfully assigned openshift-image-registry/cluster-image-registry-operator-7456697c64-88hxc to ocp-master2.hybridcloud.com
35m        Normal  AddedInterface  pod/cluster-image-registry-operator-7456697c64-88hxc  Add eth0 [10.242.0.37/23] from openshift-sdn
35m        Normal  Pulled          pod/cluster-image-registry-operator-7456697c64-88hxc  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6a78c524aab5bc95c671811b2c76d59a6c2d394c8f9ba3f2a92bc05a780c783a" already present on machine
(...omitted...)
If the pod is up and running but you still have some issues in an application, you can also use OpenShift to check the application logs, as you will see next.
Pod logs regularly contain important information related to the scheduler, pod affinity/anti-affinity, container images, and persistent volumes. There are several ways to check pod logs, as outlined here:
$ oc project mynamespace
$ oc logs mypod
$ oc -n mynamespace logs mypod
$ oc -n mynamespace logs mypod -c kube-proxy
Figure 6.5 – Pod logs example (OpenShift console graphical UI (GUI))
You may also have issues during the application deployment. See next what you can check to find evidence of deployment problems.
In some cases, a pod does not start and remains in a constant crashing state, which makes it difficult to get any logs. In such cases, you can check the deployment or deploymentconfig logs, which can help you to identify deployment misconfigurations.
Similar to pod logs, deployment logs are accessible by running the oc logs command. For deployment logs, run the following command:
$ oc -n mynamespace logs deployment/mydeploypods
For deploymentconfigs logs, use this one:
$ oc -n mynamespace logs dc/mydeploypods
Usually, you will not find the exact root cause and the solution to be applied, but the logs will give you a good indication of why the deployment is failing—for example, dependent components that are missing from the solution, such as unavailable images, security context constraints with incorrect permissions, or missing configmaps, secrets, and serviceaccounts.
Another useful way to troubleshoot an issue is to use the debug command with temporary root privileges. Learn more about this in the following section.
Another interesting way to troubleshoot a pod that is constantly crashing is to run the oc debug deployment or oc debug deploymentconfig command. Through this, you instruct OpenShift not to fail and restart the pod when it crashes. The pod is kept alive so that you can check the logs, access it, and troubleshoot from inside the container. To use the tool, run the following command:
$ oc debug deployment/<deployment-name>
Note
The oc debug command allows some interesting options, such as --as-user to run the pod with a defined user. To see a comprehensive list of allowed parameters and examples, run the oc debug -h command.
As we already covered in this book, OpenShift uses several operators to deploy and monitor critical features for the platform. That said, operators bring helpful logs to identify some configuration issues and instability with the platform. Those logs are stored in namespaces starting with the name openshift-*, which is a standard for most tools included with OpenShift.
Several operator pods run alongside regular project pods due to the affinity/anti-affinity rules and taints/tolerations strategies that a user can apply to the cluster. Operator pods are the watchdogs that maintain their namespaces' health, watching CRDs, their standards, liveness, and readiness, and preventing OpenShift's critical namespaces from suffering undesirable changes.
One of the main benefits of operators for the cluster’s stability is the ability to maintain the desired state of the operator namespace’s objects. In other words, if any unwanted changes are made by mistake directly into the namespace’s objects, they will be reverted by the operator itself. Every change in the operator’s objects needs to be done through the operator itself, by editing ConfigMaps or CR objects, according to the operator’s specification. That also means that any changes are checked and confirmed by operators before they are effectively applied.
To check the cluster operators’ functions, you must do the following:
$ oc get co
$ oc describe co <clusteroperatorName>
Here’s an example:
$ oc describe co storage
(...omitted...)
Status:
Conditions:
Last Transition Time: 2021-08-26T14:51:59Z
Message: All is well
Reason: AsExpected
Status: False
Type: Degraded
Last Transition Time: 2021-08-26T14:51:59Z
Message: All is well
Reason: AsExpected
Status: False
Type: Progressing
Last Transition Time: 2021-08-26T14:51:59Z
Message: DefaultStorageClassControllerAvailable: No default StorageClass for this platform
Reason: AsExpected
Status: True
Type: Available
Last Transition Time: 2021-08-26T14:52:00Z
Message: All is well
Reason: AsExpected
Status: True
Type: Upgradeable
(...omitted...)
In the previous example, only a warning message about setting up a default storage class for the cluster is shown; no critical issues were found with storage. In case of any issue, look at the events and pod logs in the operator's namespace.
The oc CLI has some other powerful commands for troubleshooting, some of which are listed here:
$ oc -n <namespaceName> logs <podName> -v 8
$ oc get events
$ oc -n <namespaceName> get events
$ oc exec mypod -- date
$ oc exec mypod -c httpd-container -- date
$ oc exec mypod -i -t -- ls -t /usr
$ oc -n <namespaceName> rsh <podName>
All of the previous commands help you to identify problems in a cluster or even an application. Furthermore, you can investigate the node directly using the oc debug command, like so:
$ oc debug node/<nodeName>
The oc debug command opens a shell in a temporary debug pod with the node's filesystem mounted under /host, so many OS commands are not directly available. To run them, we recommend you use the chroot command, like so. After that, you can use the OS shell commands as usual:
$ chroot /host /bin/bash
As you can see, OpenShift has a lot of useful debugging commands to help you identify cluster-wide or scoped issues. Although it is not recommended, it is also possible to SSH directly into nodes. This kind of approach requires good knowledge of Red Hat Enterprise Linux CoreOS (RHCOS), Podman, and CRI-O to avoid node disruption.
In any situation, we also recommend you open a support ticket with Red Hat, who will assist you and give you the right guidance to solve your problem. The Red Hat support team often asks for the output of the must-gather command, which spins up a temporary pod and collects meaningful logs and configurations that help the Red Hat engineering team analyze and correlate events to find the issue or root cause.
The most common way to run must-gather is shown here:
$ oc adm must-gather --dest-dir=/local/directory
This will collect all the relevant logs under the chosen directory, which can be very useful to identify problems. We suggest you always run this command, compress the resulting directory into a tar file, and upload it when you open the support ticket to speed up the analysis of the issue.
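Red Hat support expects the collected data as a single compressed file. The following sketch packages a stand-in directory the same way you would package the real must-gather output (the directory and its contents here are fabricated for illustration):

```shell
# Stand-in for the directory populated by:
#   oc adm must-gather --dest-dir=/local/directory
dest_dir=$(mktemp -d)
echo "sample log line" > "$dest_dir/example.log"   # fake collected data

# Compress the whole directory into one archive to attach to the case:
tar -czf must-gather.tar.gz -C "$dest_dir" .
ls -lh must-gather.tar.gz
```

Attaching one compressed archive (rather than loose files) keeps the directory layout intact, which the support tooling relies on.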
In this section, you saw different debugging approaches that will certainly help you in everyday life. In the next section, you will see the most common error messages that occur at pod startup, so that you can build your own line of reasoning toward the solution.
Even after learning the different ways to identify a problem, it is not unusual for the error shown to lack enough information to help you detect and fix the issue. With that in mind, we decided to highlight some very common error messages in this section, along with some suggestions on how to solve them.
ImagePullBackOff is a common error related to a missing container image. Check out the following lines to become familiar with this kind of issue when you face it:
NAMESPACE    NAME            READY   STATUS             RESTARTS   AGE
namespace1   backend-tfmqm   0/1     ImagePullBackOff   0          17h
Here’s a message that may come up when investigating the pod log:
$ oc -n namespace1 logs backend-tfmqm
Error from server (BadRequest): container "backend" in pod "backend-tfmqm" is waiting to start: trying and failing to pull image
As the error message suggests, this is typically linked to the absence of the image in the registry. It can occur for a few reasons, such as the image or its tag not being available in the registry, or an incorrect image reference in the Deployment/DeploymentConfig. Another possibility is that the node where the pod is scheduled is not able to reach the image registry.
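A practical way to confirm which of these causes you are facing is to read the pod's events: the Failed event carries the exact registry error. The sketch below parses sample events (the image reference and event text are made-up examples); on a live cluster, use the oc command shown in the comment instead:

```shell
# On a live cluster: events=$(oc -n namespace1 get events)
events=$(cat <<'EOF'
2m  Warning  Failed     pod/backend-tfmqm  Failed to pull image "registry.example.com/backend:v2": manifest unknown
2m  Warning  BackOff    pod/backend-tfmqm  Back-off pulling image "registry.example.com/backend:v2"
5m  Normal   Scheduled  pod/backend-tfmqm  Successfully assigned namespace1/backend-tfmqm to worker-1
EOF
)
# The Failed event names the precise registry error (wrong tag, auth,
# unreachable registry, and so on):
echo "$events" | grep 'Failed to pull'
```

Here, "manifest unknown" tells you the tag does not exist in the registry, so the fix is in the image reference rather than in network connectivity.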
CrashLoopBackOff is an error that requires some knowledge before you can act on it effectively. It occurs because the application is crashing constantly, so the root cause can be any of several different issues. You can see an example of this here:
NAMESPACE   NAME                    READY   STATUS             RESTARTS   AGE
3scale      backend-redis-1-9qs2q   0/1     CrashLoopBackOff   211        17h
Here’s a message you may see when investigating the pod log:
$ oc logs backend-redis-1-9qs2q
1:M 11 Jan 13:02:19.042 # Bad file format reading the append only file: make a backup of your AOF file, then use ./redis-check-aof --fix <filename>
The log usually gives you some hints about the root cause of the issue, but it can also lead you to a wrong conclusion. It is important to consider whether the pod uses a persistent volume, or whether it depends on another application starting first and preparing that volume, among many other scenarios that can lead to this error.
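When a container is crash-looping, the log of the previous (crashed) instance and the last terminated state are often more telling than the live log. The following sketch shows the relevant oc commands as comments (using the pod name from the example above) and extracts the exit code from a sample status JSON:

```shell
# On a live cluster:
#   oc logs backend-redis-1-9qs2q --previous
#   oc get pod backend-redis-1-9qs2q \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'
# Sample of what the second command returns:
last_state='{"exitCode":1,"reason":"Error","finishedAt":"2021-08-26T15:00:00Z"}'

# The exit code tells you whether the process failed on its own (non-zero
# exit) or was killed, for example, 137 for an out-of-memory kill:
echo "$last_state" | grep -o '"exitCode":[0-9]*'
```

Correlating the exit code with the previous instance's log usually narrows the investigation down quickly.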
When you get an Init:0/1 status, it typically means the pod is waiting for another pod or for a condition that hasn't been satisfied yet. The following lines show an example of a pod in this state:
NAMESPACE   NAME                   READY   STATUS     RESTARTS   AGE
3scale      backend-cron-1-zmnpj   0/1     Init:0/1   0          17h
Here’s a message you may see when investigating the pod log:
$ oc logs backend-cron-1-zmnpj
Error from server (BadRequest): container "backend-cron" in pod "backend-cron-1-zmnpj" is waiting to start: PodInitializing
This is perhaps the most confusing status to troubleshoot: almost anything in the namespace can be wrong, yet the error message only shows PodInitializing. You could interpret it simply as a pod waiting to start; in practice, this message means some precondition hasn't been satisfied yet.
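One quick check is to find out which init container has not become ready, and then read that container's own log. The sketch below shows the oc commands as comments (the init container name in the sample output is hypothetical) and parses a sample status line locally:

```shell
# On a live cluster:
#   oc -n 3scale get pod backend-cron-1-zmnpj \
#     -o jsonpath='{range .status.initContainerStatuses[*]}{.name}={.ready}{"\n"}{end}'
#   oc -n 3scale logs backend-cron-1-zmnpj -c <initContainerName>
# Sample of what the first command might return:
init_status=$(printf 'backend-cron-init=false\n')

# Any name=false line is an init container that is still blocking the pod:
echo "$init_status" | grep '=false'
```

The blocking init container's log then points you at the unsatisfied precondition, such as a dependency that hasn't started or a volume that isn't ready.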
To help you, we have listed here some items to check that may be preventing the pod from starting:
If you are not familiar with security context constraints (SCCs) yet, don't freak out. We will be covering them in depth in Chapter 8, OpenShift Security.
In this chapter, we focused on important components of OpenShift, such as operators and their role in maintaining the resilience of a cluster, and we also talked about situations that can cause damage to the cluster.
We have dived into the heart of OpenShift, which is its distributed database (known as etcd), understanding its importance in the cluster and how to prepare it to receive a high volume of traffic, as well as verifying its sizing and performance, and understanding how to perform troubleshooting in some cases.
We have also discussed a bit about the AuthN and AuthZ processes, so you now know the power and flexibility of the OpenShift IDPs. Finally, we have seen some important troubleshooting tips and tools that will certainly help you in your daily job of operating OpenShift clusters and applications.
In the next chapter, we will present some other important information about the network on OpenShift. We will discuss and give examples to explain the main differences between a pod network and a service network, as well as the difference between North-South and East-West traffic. Stay with us for this interesting read and learn more in Chapter 7, OpenShift Network.
If you want to look at more information on what we covered in this chapter, check out the following references: