Common Kubernetes Errors and Solutions: OOMKilled, CrashLoopBackOff, and More

By Contributing Writer Gilad David Maayan | November 04, 2022

What is Kubernetes Troubleshooting?

Kubernetes is an open source platform for managing Linux containers in private, public, and hybrid cloud environments. It is often used to manage large microservices applications. While Kubernetes is very powerful, it is also complex, and it can be difficult to identify and fix problems with its many components and the resources they create.

When troubleshooting an issue in your Kubernetes deployments, it’s important to realize that the symptom you are experiencing might be only part of the problem. For example, a cluster may be unavailable or pods may not respond as expected, and it can take deeper inspection to identify which other components of the Kubernetes cluster or the infrastructure are involved.

Let's look at five common Kubernetes troubleshooting scenarios that IT and DevOps teams may face and how to solve them.

Common Kubernetes Errors and Solutions

OOMKilled

The OOMKilled (Out Of Memory) error indicates that a pod or container terminated because it used more memory than allowed. It has an exit code of 137.
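Exit code 137 follows the usual shell convention of 128 plus the signal number: the process was ended by SIGKILL (signal 9), which is the signal the kernel's out-of-memory killer sends. The following is a minimal sketch that reproduces the number locally without Kubernetes; the child shell simply kills itself.

```shell
# Start a child shell that sends SIGKILL (signal 9) to itself.
# "|| code=$?" captures the status without aborting under "set -e".
code=0
sh -c 'kill -9 $$' || code=$?

echo "exit code: $code"
echo "signal number: $((code - 128))"
```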

To identify the error:

Use the following command to identify the OOMKilled error:

    kubectl get pods

The pod with the error will have OOMKilled under the STATUS column. To investigate further, run kubectl describe pod [pod-name] and look for the following in the container state section of the output:

    State:          Running
      Started:      Thu, 10 Oct 2019 11:14:13 +0200
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
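When many pods are running, reading each description by hand gets slow. As a rough sketch, the same field can be pulled out for every pod in the current namespace with a JSONPath query. The helper name oomkilled_pods is hypothetical, and the query only looks at the last terminated state of each container, so a pod whose container has not been restarted yet would not be listed.

```shell
# Hypothetical helper: print the names of pods in the current namespace
# that have at least one container whose last termination reason
# was OOMKilled.
oomkilled_pods() {
  kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' |
    awk -F '\t' '$2 ~ /OOMKilled/ { print $1 }'
}
```

Calling oomkilled_pods with no arguments prints one pod name per line, or nothing if no container was OOM-killed.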

Diagnosis and resolution

Now, go through the pod’s recent activity history and pinpoint what caused the error. Here are some potential causes:

  • A container limit was reached, and the pod was terminated.

  • A pod was terminated because the node was overcommitted. This means the pods on the node collectively used more memory than the node had available, which can happen when their memory limits add up to more than the node's capacity.

If the pod termination occurred because the container limit was reached:

  • Determine if the application indeed needs the extra memory. If it does, increase the container’s memory limit in the pod specification.

  • If the increase in memory use is sudden and cannot be tied to the application’s load, the application could have a memory leak. Debug the application for memory leaks and resolve them. Don’t simply raise the memory limit in this case, since a leaking application will keep consuming more of the node's resources.

If the pod was terminated because the node was overcommitted, review the memory requests (the minimum amount of memory reserved for each container) and limits of the pods on that node. The total of the memory requests for all pods on a node should be less than the node’s available memory. If needed, adjust the memory requests and limits to ensure that the node doesn’t get overcommitted.
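The overcommit check described above is simple arithmetic. The sketch below sums a list of memory requests and compares the total with a node's available memory; the function name requests_fit and all the numbers are illustrative assumptions, with every value given in Mi.

```shell
# Hypothetical helper: succeed when the summed memory requests fit
# within the node's available memory. All values are in Mi.
# Usage: requests_fit AVAILABLE_MI REQUEST_MI [REQUEST_MI ...]
requests_fit() {
  available_mi=$1
  shift
  total=0
  for request in "$@"; do
    total=$((total + request))
  done
  echo "requested ${total}Mi of ${available_mi}Mi"
  [ "$total" -le "$available_mi" ]
}

# Once suitable values are known, they can be applied to a workload,
# for example (demo-app is a placeholder deployment name):
#   kubectl set resources deployment demo-app \
#     --requests=memory=256Mi --limits=memory=512Mi
```

For example, requests_fit 4096 1024 2048 512 succeeds because the requests total 3584Mi, while requests_fit 4096 2048 2048 512 fails.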

CrashLoopBackOff

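The CrashLoopBackOff error indicates that a container in a pod repeatedly starts, crashes, and is restarted, with Kubernetes waiting longer between each restart attempt. It can occur when the application fails at startup, when the node doesn’t have the resources required to run the pod, or when the needed volumes haven't mounted successfully.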

To identify the error:

  1. Use the following command to identify the error:

    kubectl get pods

    The pod facing the issue will have CrashLoopBackOff under STATUS.

  2. Use the following command to get further details about the error:

    kubectl describe pod [pod-name]
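Because the container keeps restarting, its current logs are often empty. One option, sketched here under the assumption that the application writes its errors to standard output or standard error, is to read the logs of the previous (crashed) container instance. The helper name previous_logs and the default namespace are illustrative.

```shell
# Hypothetical helper: show the last 50 log lines of the previous
# (crashed) container instance of a pod.
# Usage: previous_logs POD_NAME [NAMESPACE]
previous_logs() {
  pod=$1
  namespace=${2:-default}
  kubectl logs "$pod" --namespace "$namespace" --previous --tail=50
}
```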

Common causes and resolution

Here are some common causes of the error:

  • Inadequate resources: if the node has insufficient resources, manually evict pods from it or scale up your cluster so that more nodes are available for the pods.

  • Errors in volume mounting: if there is a problem mounting a storage volume, check the volume the pod is trying to mount and ensure it is correctly defined in the pod's manifest. Also, ensure that a storage volume matching those definitions exists.

  • Using hostPort: if pods are bound to a hostPort, only one pod using that port can be scheduled on each node. In most cases, you can avoid hostPort and instead use a Service object to communicate with the pod.
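As a sketch of the Service alternative mentioned in the last item, the pods can be exposed through a Service rather than a hostPort. This assumes the pods are managed by a Deployment; the helper name, the deployment name demo-app, and the port numbers in the example are placeholders.

```shell
# Hypothetical helper: create a Service in front of a deployment's pods
# instead of binding each pod to a hostPort.
# Usage: expose_deployment DEPLOYMENT SERVICE_PORT CONTAINER_PORT
expose_deployment() {
  deployment=$1
  port=$2
  target_port=$3
  kubectl expose deployment "$deployment" --port="$port" --target-port="$target_port"
}
```

For example, expose_deployment demo-app 80 8080 would create a Service that listens on port 80 and forwards to port 8080 on the pods.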

CreateContainerConfigError

The CreateContainerConfigError error commonly results from a missing Secret or ConfigMap. A Secret is a Kubernetes object that stores confidential information such as database credentials. ConfigMaps store data in key-value pair format and are useful for storing the configuration information needed by multiple pods.

To identify the error:

  1. Use the following command to identify the error:

    kubectl get pods

    The pod facing the issue will have CreateContainerConfigError under STATUS.

  2. Use the following command to get further details about the error:

    kubectl describe pod demo-pod

    Here is what the output might look like:

    Warning Failed 34s (x6 over 1m45s) kubelet Error: configmap "configmap-7" not found

  3. Run the following command to check if the ConfigMap named in the previous step is present in the cluster:

    kubectl get configmap configmap-7

    If it’s absent, create the missing ConfigMap.

  4. Once created, use the following command to confirm the ConfigMap is available:

    kubectl get configmap configmap-7

Use the command in step 1 to ensure the pod is now running.
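The check-then-create steps above can be combined into one small helper. This is only a sketch: the function name ensure_configmap and the key-value pair in the example are assumptions, and a real ConfigMap would need whatever keys the pod actually expects.

```shell
# Hypothetical helper: create a ConfigMap with one literal key=value
# entry, but only if a ConfigMap with that name does not exist yet.
# Usage: ensure_configmap NAME KEY=VALUE
ensure_configmap() {
  name=$1
  literal=$2
  if kubectl get configmap "$name" >/dev/null 2>&1; then
    echo "configmap $name already exists"
  else
    kubectl create configmap "$name" --from-literal="$literal"
  fi
}
```

For example, ensure_configmap configmap-7 mode=demo creates the ConfigMap from the earlier error message with a single placeholder entry.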

ImagePullBackOff or ErrImagePull

The ImagePullBackOff and ErrImagePull errors mean that a pod couldn’t start because Kubernetes failed to pull a container image from a registry. As a result, one or more of the containers defined in the pod's manifest cannot be created.

 

To identify the error:

  1. Use the following command to identify the error:

    kubectl get pods

    The pod facing the issue will have ImagePullBackOff or ErrImagePull under STATUS.

  2. Use the following command to get further details about the error:

    kubectl describe pod demo-pod

 

Root causes and resolution

Here are some causes behind the issue:

  • Wrong container image name or tag: this commonly happens when the container’s image name or tag was mistyped in the pod manifest.

Ensure that the image name and tag are correct using the following command:

    docker pull <image-name>:<image-tag>

  • Authentication error with the container registry: the pod might not have authenticated successfully with the registry to pull the container image. This can happen because of a problem with the Secret that stores the registry credentials, or because the pod does not reference that Secret in its imagePullSecrets.

Ensure that the pod references the required Secret and that the stored credentials are valid. Then, manually attempt the operation using the docker pull command.
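If the registry Secret is missing or wrong, it can be recreated. The sketch below wraps the kubectl command for creating a registry credential Secret; the helper name and all argument values in the example are placeholders, and passing a password on the command line is shown only for brevity. The pod specification then needs to list the Secret under imagePullSecrets.

```shell
# Hypothetical helper: create a registry credential Secret that pods
# can reference through imagePullSecrets.
# Usage: create_pull_secret SECRET_NAME REGISTRY_SERVER USERNAME PASSWORD
create_pull_secret() {
  secret=$1
  server=$2
  user=$3
  password=$4
  kubectl create secret docker-registry "$secret" \
    --docker-server="$server" \
    --docker-username="$user" \
    --docker-password="$password"
}
```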

Kubernetes Node Not Ready

All the stateful pods on a node become unavailable when the node crashes or shuts down. The node then shows NotReady as its status. If this status persists for more than five minutes, Kubernetes changes the status of the pods scheduled on that node to Unknown. Kubernetes then attempts to schedule the pods on another node, where they receive a ContainerCreating status.

 

To identify the error:

  1. Use the following command to identify the error:

    kubectl get nodes

    The node facing the issue will have NotReady under STATUS.

  2. Use the following command to see if the pods scheduled on the node are being shifted to other nodes:

    kubectl get pods -o wide

    Check if the same pod appears on two different nodes in the output.
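To narrow the output to a single node, pods can be filtered by the node they are assigned to. This is a small sketch; the helper name pods_on_node is hypothetical and demo-node in the example is a placeholder node name.

```shell
# Hypothetical helper: list the pods, across all namespaces, that are
# assigned to the given node.
# Usage: pods_on_node NODE_NAME
pods_on_node() {
  kubectl get pods --all-namespaces -o wide --field-selector "spec.nodeName=$1"
}
```

Running pods_on_node demo-node before and after the node fails shows which pods still need to be rescheduled.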

 

Resolving the issue

The issue can resolve itself if the failed node recovers or you reboot it. Here is what happens once it recovers and joins the cluster:

 
  • The pod with Unknown status gets deleted, and the failed node's volumes are detached.

  • The pod's status changes to ContainerCreating once it's rescheduled to a new node and the required volumes are attached.

  • Kubernetes waits for a default period of five minutes. After that, the pod's status will change from ContainerCreating to Running once it starts running on the new node.

 

If you are short on time or the node fails to recover, you must tell Kubernetes to reschedule the stateful pods on a different working node. There are two ways to do this:

 
  • Remove failed node:

Use the following command to remove the failed node from the cluster:

 

    kubectl delete node demo-node

 
  • Delete stateful pods with unknown status:

    Use the following command to delete the stateful pods:

 

kubectl delete pods demo-pod --grace-period=0 --force -n demo-namespace
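Before force-deleting anything, it helps to see exactly which pods are in the Unknown status. The sketch below only lists them and leaves the deletion to you; it assumes the default column layout of kubectl get pods, where STATUS is the third column, and the helper name unknown_pods is hypothetical.

```shell
# Hypothetical helper: print the names of pods in a namespace whose
# STATUS column reads Unknown. Assumes the default column order
# NAME, READY, STATUS, RESTARTS, AGE.
# Usage: unknown_pods NAMESPACE
unknown_pods() {
  kubectl get pods --namespace "$1" --no-headers |
    awk '$3 == "Unknown" { print $1 }'
}
```

Each name printed by unknown_pods demo-namespace can then be passed to the force-delete command shown above.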

Conclusion

In this article, I covered some of the most common Kubernetes errors and showed how to solve them:

 
  • OOMKilled - indicates that a pod or container terminated because it used more memory than allowed.

  • CrashLoopBackOff - indicates that a container in a pod is repeatedly crashing and being restarted.

  • CreateContainerConfigError - commonly results from a missing Secret or ConfigMap.

  • ImagePullBackOff or ErrImagePull - indicates that a pod couldn’t start because Kubernetes failed to pull an image from a registry.

  • Kubernetes Node Not Ready - status shown when a node crashes or shuts down and all stateful pods become unavailable.

 

I hope this will help you get a head start in the exciting world of Kubernetes troubleshooting.