
Kubernetes Node Maintenance for Security Patching

Performed Kubernetes node maintenance on GKE to apply security patches, achieving zero downtime and enhanced protection for a critical web application.

5 min read
Client: WebSecure Inc. | Duration: 2 days | Team: 4 DevOps Engineers

Technologies

Kubernetes · Google Kubernetes Engine (GKE) · Prometheus · Grafana · Elasticsearch · Fluentd · Kibana · Nginx Ingress · Kube-bench · Terraform

Challenges

Security Vulnerabilities · Node Maintenance · Application Availability

Solutions

Node Patching · Pod Disruption Budget · Monitoring and Logging

Key Results

100% uptime during maintenance


Situation

A Kubernetes cluster running on Google Kubernetes Engine (GKE) hosts a critical web application with high availability requirements. The cluster, consisting of multiple worker and control-plane nodes, is managed by a team of DevOps engineers. A recent security advisory revealed vulnerabilities in the container runtime and container images, necessitating immediate patching to prevent potential exploits. The cluster uses Pod Disruption Budgets (PDBs) for high availability, namespaces for resource organization, and network policies for security. Monitoring is handled by Prometheus, with logs centralized using an EFK (Elasticsearch, Fluentd, Kibana) stack. The application is exposed via an Ingress controller with TLS for secure communication. The challenge was to perform node maintenance without disrupting the application’s availability.

Task

The DevOps engineer was tasked with performing maintenance on a specific Kubernetes node to apply security patches to the container runtime and update container images. The maintenance had to:

  • Ensure minimal disruption, respecting the Pod Disruption Budget.
  • Maintain application availability for users.
  • Safely return the node to the cluster post-maintenance.
  • Verify the application’s performance and security.
  • Document the process and provide a sequence diagram for the maintenance workflow.

Action

The DevOps engineer followed a structured approach, leveraging Kubernetes tools and best practices. Below are the detailed actions taken:

1. Cluster Setup with Terraform

To manage the GKE cluster infrastructure consistently, we used Terraform, ensuring repeatable and secure node provisioning. The Terraform configuration, written in HCL, defined the GKE cluster with security features:

provider "google" {
project = "websecure-inc"
region  = "us-central1"
}
 
resource "google_container_cluster" "healthcare_cluster" {
name     = "websecure-gke"
location = "us-central1-a"
network  = "default"
 
initial_node_count = 3
 
node_config {
machine_type = "e2-standard-4"
oauth_scopes = [
"https://www.googleapis.com/auth/cloud-platform",
]
metadata = {
disable-legacy-endpoints = "true"
}
}
 
private_cluster_config {
enable_private_nodes    = true
enable_private_endpoint = false
master_ipv4_cidr_block  = "172.16.0.0/28"
}
 
master_auth {
client_certificate_config {
issue_client_certificate = false
}
}
}

This configuration ensured private nodes and secure cluster settings, aligning with the application’s high availability and security requirements.
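
If the configuration needs to be re-applied, it follows the standard Terraform workflow; the plan file name below is illustrative:

terraform init
terraform plan -out=node-maintenance.tfplan
terraform apply node-maintenance.tfplan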

2. Preparation and Planning

We identified the target node (e.g., node-1) using:

kubectl get nodes

This listed node names, status, and roles, ensuring we selected the correct node. We verified the Pod Disruption Budget (PDB), which keeps at least two pods with the app: web-app label available:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb
  namespace: production
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web-app
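
A quick check like the following (using the object names from the manifest above) confirms the PDB is in place and shows how many voluntary disruptions it currently allows:

kubectl get pdb web-app-pdb -n production
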
We reviewed pod resource requests and limits to confirm other nodes could handle evicted pods:
 
apiVersion: v1
kind: Pod
metadata:
  name: web-app-pod
  namespace: production
spec:
  containers:
    - name: web-app-container
      image: web-app:latest
      resources:
        requests:
          memory: "256Mi"
          cpu: "500m"
        limits:
          memory: "512Mi"
          cpu: "1000m"

We notified stakeholders via Slack to avoid conflicting operations.

3. Cordon the Node

We marked the node as unschedulable using:

kubectl cordon node-1

This prevented new pods from being scheduled on the node, reducing disruption risk.

4. Drain the Node

We safely evicted pods with:

kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data

The --ignore-daemonsets flag preserved Fluentd pods for logging, and --delete-emptydir-data handled emptyDir volumes. We monitored pod eviction using:

kubectl get pods -n production -o wide

This ensured pods were rescheduled correctly, respecting the PDB.
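
To double-check that only DaemonSet-managed pods (such as Fluentd) remained on the node after the drain, a field selector query of this form can be used:

kubectl get pods --all-namespaces --field-selector spec.nodeName=node-1 -o wide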

5. Perform Maintenance

We applied security patches by SSHing into the node (using gcloud compute ssh node-1) and running:

sudo apt-get update && sudo apt-get upgrade

This patched the container runtime (e.g., containerd). We updated container images in the deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app-deployment
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: web-app-container
          image: web-app:1.0.1 # Updated image version
          resources:
            requests:
              memory: "256Mi"
              cpu: "500m"
            limits:
              memory: "512Mi"
              cpu: "1000m"

We restarted the kubelet service:

sudo systemctl restart kubelet

This applied runtime updates.
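
The runtime and kubelet versions the node now reports can be spot-checked, for example:

kubectl get node node-1 -o jsonpath='{.status.nodeInfo.containerRuntimeVersion}{"\n"}{.status.nodeInfo.kubeletVersion}{"\n"}'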

6. Return Node to Service

We marked the node schedulable again with:

kubectl uncordon node-1

We verified its status using:

kubectl get nodes

This confirmed the node was in the Ready state.
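
Beyond the Ready status, it is worth confirming that the unschedulable flag and any maintenance-related taints were cleared, for instance:

kubectl describe node node-1 | grep -E "Taints|Unschedulable"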

7. Post-Maintenance Validation

We checked application health via the /healthz endpoint, monitored CPU, memory, and latency in Prometheus and Grafana, and reviewed logs in Kibana for errors. We used kube-bench to verify the node’s security configuration post-patching.
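
One common way to run kube-bench on a cluster is as a Kubernetes Job using the manifest published in the aquasecurity/kube-bench repository; the URL and job name below follow that project's documentation and should be verified against the current release:

kubectl apply -f https://raw.githubusercontent.com/aquasecurity/kube-bench/main/job.yaml
kubectl logs job/kube-bench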

8. Documentation and Reporting

We documented the process in Confluence, including commands, timestamps, and observations. A PlantUML sequence diagram was created to visualize the workflow.
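
The full diagram lives in the internal documentation; a minimal PlantUML sketch of the same workflow might look like this:

@startuml
actor "DevOps Engineer" as eng
participant "Kubernetes API (GKE)" as api
participant "node-1" as node

eng -> api : kubectl cordon node-1
eng -> api : kubectl drain node-1 --ignore-daemonsets
api -> node : evict pods (respecting the PDB)
eng -> node : gcloud compute ssh node-1
eng -> node : apply OS / runtime patches, restart kubelet
eng -> api : kubectl uncordon node-1
eng -> api : verify node Ready and application health
@enduml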

Result

The maintenance was completed with zero downtime, maintaining 100% application availability. The node was patched, and container images were updated to secure versions. The Pod Disruption Budget ensured at least two pods remained available. Post-maintenance validation confirmed application health, performance, and security. Prometheus and EFK provided real-time insights, enabling quick issue detection. The documented process and sequence diagram enhanced team preparedness for future maintenance. The application continued to serve users securely and reliably, protected against vulnerabilities.

Architectural Diagram

Need a Similar Solution?

I can help you design and implement similar cloud infrastructure and DevOps solutions for your organization.