
Kubernetes Node Maintenance for Security Patching

Performed Kubernetes node maintenance on GKE to apply security patches, achieving zero downtime and enhanced protection for a critical web application.

5 min read
Client: WebSecure Inc. | Duration: 2 days | Team: 4 DevOps Engineers

Technologies

Kubernetes · Google Kubernetes Engine (GKE) · Prometheus · Grafana · Elasticsearch · Fluentd · Kibana · Nginx Ingress · Kube-bench · Terraform

Challenges

Security Vulnerabilities · Node Maintenance · Application Availability

Solutions

Node Patching · Pod Disruption Budget · Monitoring and Logging

Key Results

100% uptime during maintenance


Situation

A Kubernetes cluster running on Google Kubernetes Engine (GKE) hosts a critical web application with high availability requirements. The cluster, consisting of multiple worker and control-plane nodes, is managed by a team of DevOps engineers. A recent security advisory revealed vulnerabilities in the container runtime and container images, necessitating immediate patching to prevent potential exploits. The cluster uses Pod Disruption Budgets (PDBs) for high availability, namespaces for resource organization, and network policies for security. Monitoring is handled by Prometheus, with logs centralized using an EFK (Elasticsearch, Fluentd, Kibana) stack. The application is exposed via an Ingress controller with TLS for secure communication. The challenge was to perform node maintenance without disrupting the application’s availability.

Task

The DevOps engineer was tasked with performing maintenance on a specific Kubernetes node to apply security patches to the container runtime and update container images. The maintenance had to:

  • Ensure minimal disruption, respecting the Pod Disruption Budget.
  • Maintain application availability for users.
  • Safely return the node to the cluster post-maintenance.
  • Verify the application’s performance and security.
  • Document the process and provide a sequence diagram for the maintenance workflow.

Action

The DevOps engineer followed a structured approach, leveraging Kubernetes tools and best practices. Below are the detailed actions taken:

1. Cluster Setup with Terraform

To manage the GKE cluster infrastructure consistently, we used Terraform, ensuring repeatable and secure node provisioning. The Terraform configuration, written in HCL, defined the GKE cluster with security features:

provider "google" {
project = "websecure-inc"
region  = "us-central1"
}
 
resource "google_container_cluster" "healthcare_cluster" {
name     = "websecure-gke"
location = "us-central1-a"
network  = "default"
 
initial_node_count = 3
 
node_config {
machine_type = "e2-standard-4"
oauth_scopes = [
"https://www.googleapis.com/auth/cloud-platform",
]
metadata = {
disable-legacy-endpoints = "true"
}
}
 
private_cluster_config {
enable_private_nodes    = true
enable_private_endpoint = false
master_ipv4_cidr_block  = "172.16.0.0/28"
}
 
master_auth {
client_certificate_config {
issue_client_certificate = false
}
}
}

This configuration ensured private nodes and secure cluster settings, aligning with the application’s high availability and security requirements.
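
If the configuration needs to be re-applied, it follows the standard Terraform workflow; the plan file name below is illustrative:

terraform init
terraform plan -out=node-maintenance.tfplan
terraform apply node-maintenance.tfplan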

2. Preparation and Planning

We identified the target node (e.g., node-1) using:

kubectl get nodes

This listed node names, status, and roles, ensuring we selected the correct node. We verified the Pod Disruption Budget (PDB), which keeps at least two pods with the app: web-app label available:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb
  namespace: production
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web-app
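
A quick check like the following (using the object names from the manifest above) confirms the PDB is in place and shows how many voluntary disruptions it currently allows:

kubectl get pdb web-app-pdb -n production
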
We reviewed pod resource requests and limits to confirm other nodes could handle evicted pods:
 
apiVersion: v1
kind: Pod
metadata:
  name: web-app-pod
  namespace: production
spec:
  containers:
    - name: web-app-container
      image: web-app:latest
      resources:
        requests:
          memory: "256Mi"
          cpu: "500m"
        limits:
          memory: "512Mi"
          cpu: "1000m"

We notified stakeholders via Slack to avoid conflicting operations.

3. Cordon the Node

We marked the node as unschedulable using:

kubectl cordon node-1

This prevented new pods from being scheduled on the node, reducing disruption risk.

4. Drain the Node

We safely evicted pods with:

kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data

The --ignore-daemonsets flag preserved Fluentd pods for logging, and --delete-emptydir-data handled emptyDir volumes. We monitored pod eviction using:

kubectl get pods -n production -o wide

This ensured pods were rescheduled correctly, respecting the PDB.
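
To double-check that only DaemonSet-managed pods (such as Fluentd) remained on the node after the drain, a field selector query of this form can be used:

kubectl get pods --all-namespaces --field-selector spec.nodeName=node-1 -o wide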

5. Perform Maintenance

We applied security patches by SSHing into the node (using gcloud compute ssh node-1) and running:

sudo apt-get update && sudo apt-get upgrade

This patched the container runtime (e.g., containerd). We updated container images in the deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app-deployment
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: web-app-container
          image: web-app:1.0.1 # Updated image version
          resources:
            requests:
              memory: "256Mi"
              cpu: "500m"
            limits:
              memory: "512Mi"
              cpu: "1000m"

We restarted the kubelet service:

sudo systemctl restart kubelet

This applied runtime updates.
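
The runtime and kubelet versions the node now reports can be spot-checked, for example:

kubectl get node node-1 -o jsonpath='{.status.nodeInfo.containerRuntimeVersion}{"\n"}{.status.nodeInfo.kubeletVersion}{"\n"}'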

6. Return Node to Service

We marked the node schedulable again with:

kubectl uncordon node-1

We verified its status using:

kubectl get nodes

This confirmed the node was in the Ready state.
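
Beyond the Ready status, it is worth confirming that the unschedulable flag and any maintenance-related taints were cleared, for instance:

kubectl describe node node-1 | grep -E "Taints|Unschedulable"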

7. Post-Maintenance Validation

We checked application health via the /healthz endpoint, monitored CPU, memory, and latency in Prometheus and Grafana, and reviewed logs in Kibana for errors. We used kube-bench to verify the node’s security configuration post-patching.
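
One common way to run kube-bench on a cluster is as a Kubernetes Job using the manifest published in the aquasecurity/kube-bench repository; the URL and job name below follow that project's documentation and should be verified against the current release:

kubectl apply -f https://raw.githubusercontent.com/aquasecurity/kube-bench/main/job.yaml
kubectl logs job/kube-bench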

8. Documentation and Reporting

We documented the process in Confluence, including commands, timestamps, and observations. A PlantUML sequence diagram was created to visualize the workflow.
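
The full diagram lives in the internal documentation; a minimal PlantUML sketch of the same workflow might look like this:

@startuml
actor "DevOps Engineer" as eng
participant "Kubernetes API (GKE)" as api
participant "node-1" as node

eng -> api : kubectl cordon node-1
eng -> api : kubectl drain node-1 --ignore-daemonsets
api -> node : evict pods (respecting the PDB)
eng -> node : gcloud compute ssh node-1
eng -> node : apply OS / runtime patches, restart kubelet
eng -> api : kubectl uncordon node-1
eng -> api : verify node Ready and application health
@enduml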

Result

The maintenance was completed with zero downtime, maintaining 100% application availability. The node was patched, and container images were updated to secure versions. The Pod Disruption Budget ensured at least two pods remained available. Post-maintenance validation confirmed application health, performance, and security. Prometheus and EFK provided real-time insights, enabling quick issue detection. The documented process and sequence diagram enhanced team preparedness for future maintenance. The application continued to serve users securely and reliably, protected against vulnerabilities.

Architectural Diagram

Need a Similar Solution?

I can help you design and implement similar cloud infrastructure and DevOps solutions for your organization.