Attack of the Kubernetes Clones

One of the customers I support is using Kubernetes under Docker EE UCP (Enterprise Edition Universal Control Plane) and has been very impressed with its stability and ease of management. Recently, however, a worker node that had been very stable for months started evicting Kubernetes pods extremely frequently, reporting inadequate CPU resources. Our DevOps team was still working out the resource requirements for many of their containerized apps, so at first we thought the problem was caused by resource contention between pods running on the node.

It was also very unusual that only one of the three nodes designated for a particular set of apps was showing this behavior.

We checked the status of the node using kubectl describe node <NODE_NAME>, and the output showed that the Kubernetes control plane thought the node had rebooted many times in the past few hours. Monitoring the node status over time showed that the Kubernetes control plane also thought the node's CPU count was constantly flipping between four and eight, while the underlying VM was configured with eight CPU cores. It made sense that if the node suddenly started reporting only half of its CPU capacity, pods would be evicted. But why was this happening?
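
For anyone wanting to watch for the same symptoms, the checks below are a rough sketch of what we were running; the node name is a placeholder, and the jsonpath expression simply pulls the CPU capacity the control plane currently believes the node has:

    # Full node detail, including the Conditions and Capacity sections
    kubectl describe node <NODE_NAME>

    # Watch the CPU capacity the control plane reports for the node
    watch -n 30 "kubectl get node <NODE_NAME> -o jsonpath='{.status.capacity.cpu}'"

    # Recent events referencing the node (NotReady transitions, etc.)
    kubectl get events --field-selector involvedObject.name=<NODE_NAME>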

Monitoring the status of the other worker nodes in the cluster, including those designated for the same set of apps, showed none of the constant reboot reports or CPU count changes. Checking the kubelet and kube-proxy logs provided further evidence that the problems were isolated to just one node.
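
For reference, on a UCP worker the Kubernetes components typically run as containers rather than systemd services, so their logs come from the Docker CLI; the container names below (ucp-kubelet and ucp-kube-proxy) match our environment but may differ between UCP versions:

    # Kubernetes component logs on a UCP worker (container names assumed)
    docker logs --since 1h ucp-kubelet
    docker logs --since 1h ucp-kube-proxy

    # On clusters where kubelet runs as a systemd service instead:
    journalctl -u kubelet --since "1 hour ago"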

The next step was to check the status of that node at the VM/OS level. According to the uptime command, the node had been up for over a week since the last time it was intentionally rebooted, and the logs, system files, and system utilities showed no underlying system problems. Both the CPU count at the VM/OS level and the connectivity of the VM's primary network interface were stable based on VM/OS level logs and utilities. We checked for a duplicate IP address or a routing misconfiguration, but we found no problems there either.
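
The OS-level checks were the usual suspects; a rough sketch, with the interface name and IP address as placeholders:

    # Uptime and CPU count as the OS sees them
    uptime
    nproc
    lscpu | grep '^CPU(s):'

    # Addresses and error counters on the primary interface
    ip addr show eth0
    ip -s link show eth0

    # From another host on the same subnet: ARP for the node's IP;
    # replies from more than one MAC address would indicate a duplicate IP
    arping -I eth0 -c 4 <NODE_IP>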

After our troubleshooting provided no real clues regarding the root cause of the issues, we were left with a couple of options to try:

  • Remove the node from the cluster and then re-join it to the cluster.
  • Build a new node from a new VM, remove the old node from the cluster, and then add the new node to the cluster.

We started with the simpler option: remove the existing node from the cluster, and then re-join it to the cluster. We also planned to clean up the file system locations where Kubernetes and Docker store container, image, volume, and pod data prior to re-joining the node to the cluster. With Kubernetes under Docker EE and UCP, this is done using the following sequence of activities (a consolidated command sketch follows the list):

Note that while these are basically the same steps used for a Docker Swarm scenario, Docker EE and UCP are smart enough to handle the Kubernetes details for you.

  • Issue the command docker swarm leave at a terminal on the node.
  • Remove the node from the UCP / Swarm perspective, forcing the removal if necessary.
    • This can be done from the UCP GUI or by using the Docker CLI while connected to a manager node using a Client Certificate Bundle. From the CLI, you can execute the command: docker node rm <NODE-NAME> -f.
  • Validate that the node has been removed from UCP/Swarm, either by using the UCP GUI or by using the Docker CLI while connected using a Client Certificate Bundle. From the CLI, you can execute the command docker node ls to get a list of nodes currently in the cluster.
  • Clean up old Docker and Kubernetes files on the node if desired. By default, this data is stored in the following locations:
    – /var/lib/docker
    – /var/lib/kubelet
  • Re-join the node to the Swarm. The easiest way to do this is to start by clicking the “+” symbol (add a node) from the UCP GUI’s nodes panel, providing the details about the node role and type, and then copying the CLI command from the resulting text field in that GUI. You then simply connect to the target node with SSH and paste the copied command into the command line terminal.
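
Pulled together, the CLI side of that sequence looks roughly like the sketch below; the node name, join token, and manager address are placeholders, and the actual join command should be the one copied from the UCP GUI:

    # 1. On the problem node: leave the swarm
    docker swarm leave

    # 2. From a manager node (connected with a Client Certificate Bundle):
    #    remove the node record, forcing if necessary
    docker node rm <NODE-NAME> -f

    # 3. Confirm the node no longer appears in the cluster
    docker node ls

    # 4. Optional cleanup on the node itself (destroys local container,
    #    image, volume, and pod data, so only run it if that is the intent)
    sudo systemctl stop docker
    sudo rm -rf /var/lib/docker /var/lib/kubelet
    sudo systemctl start docker

    # 5. Re-join by pasting the command copied from the UCP GUI on the node
    docker swarm join --token <TOKEN> <UCP-MANAGER-ADDRESS>:2377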

Finally, a Real Clue

As soon as we removed the node from the cluster and checked to validate that it was gone, we got our first real clue about the root cause of the problem. Our problem node was gone, but a new node with a different hostname was trying to join the cluster in its place! The hostname was a name we planned to use for a new node to be added in the future to provide additional compute capacity. But we had not added that node yet, and as far as we knew, it was not even ready to be added!

We checked with our Linux and VMware support teams and got another clue: they had planned to clone a “model” VM numerous times, and then reconfigure those clones to use for the new nodes. The idea was that this would make the process much more efficient while reducing the possibility of OS-level configuration differences and Docker installation errors. And as you may already be guessing, the node they chose to clone from was the same node we were having problems with.

Problem Resolved

Unfortunately, the cloning process created clones that carried the same Docker node ID as the original node and still had the Docker service enabled for autostart. So… when the Linux team brought up the clones to reconfigure them (change the hostname, clean up some of the old Docker and Kubernetes stored data, etc.), the Docker service started up. When that happened, UCP started the rest of the Swarm and Kubernetes components, and each clone started trying to interact with the cluster using the original node’s Docker node ID. The frequency of the attempted interactions increased as more of the cloned nodes were brought online. And apparently, some of the clones were configured with only four CPU cores instead of the eight used by the node they were cloned from, which explains why the control plane kept seeing the node’s CPU count flip between four and eight.
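
A quick way to spot this kind of identity collision is to compare the Swarm node ID each host reports and to check whether Docker is set to start at boot; a minimal sketch:

    # The Swarm node ID this Docker engine believes it owns; two hosts
    # reporting the same ID will compete for the same cluster identity
    docker info --format '{{.Swarm.NodeID}}'

    # Is the Docker service enabled to start automatically at boot?
    systemctl is-enabled docker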

We suspected that the VM clones’ attempted interactions with the cluster were the cause of the instability of the original node they were cloned from. We asked the Linux team to stop and disable the Docker service on the VM clones. We then added the original node back into the cluster and reapplied its previous Swarm and Kubernetes labels so that workload scheduling would take place correctly. The original node immediately returned to its previous stable state, and monitoring its status and logs showed none of the issues we had been struggling with.
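
For completeness, the remediation on the clones and the re-labeling of the original node came down to commands along these lines; the label key/value pairs are made-up examples, so use whatever your scheduling rules actually expect:

    # On each clone: stop Docker and keep it from starting at boot
    sudo systemctl stop docker
    sudo systemctl disable docker

    # From a manager, after re-joining the original node: reapply the Swarm
    # label used for workload placement (example label only)
    docker node update --label-add com.example.workload=app-tier <NODE-NAME>

    # Reapply the Kubernetes node label as well (example label only)
    kubectl label node <NODE_NAME> workload=app-tier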

Our plan going forward with the clones is to remove Docker entirely, including file system cleanup, and then reinstall Docker Engine. It might be possible to just leave the swarm and uninstall UCP on each of the clones, but we had enjoyed enough troubleshooting at this point and opted for the overkill approach. As more capacity is needed in the cluster, the new nodes can be joined to the cluster and labeled as needed for their intended workloads.
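
The clone rebuild is expected to look something like the sketch below; the package names assume a RHEL-family host running Docker EE, so adjust them for your distribution and installation method:

    # Stop Docker and keep it down until the rebuild is complete
    sudo systemctl stop docker
    sudo systemctl disable docker

    # Remove the Docker EE packages (names vary by distribution and version)
    sudo yum remove -y docker-ee docker-ee-cli containerd.io

    # Clean out the old engine, image, and kubelet state
    sudo rm -rf /var/lib/docker /var/lib/kubelet /etc/docker

    # Reinstall Docker Engine per the Docker EE documentation, then join the
    # node to the cluster and label it when the extra capacity is needed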

If you have questions or feel like you need help with Kubernetes, Docker or anything related to running your applications in containers, get in touch with us at Capstone IT.

Dave Thompson
Solution Architect
Docker Accredited Consultant
Certified Kubernetes Administrator