Kubernetes - Cluster Rebalancing with Descheduler

This post is part of our ongoing Kubernetes series and focuses on how the Kubernetes Descheduler helps rebalance clusters for better performance and resource utilization. As a cluster evolves (pods start or terminate, nodes join or leave, resource demands shift), the initial pod placement may no longer be optimal for current conditions.

The Descheduler addresses this by identifying and evicting pods that no longer align with the current state of the cluster, allowing the scheduler to place their replacements more efficiently across the available infrastructure.

1. Set Up Descheduler

Let’s configure Descheduler by creating cluster/default/descheduler.yaml. This declarative configuration establishes:

  • The Kubernetes Sigs Helm Repository reference
  • The Descheduler Helm Release with appropriate configuration
---
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: descheduler
  namespace: kube-system
spec:
  interval: 30m
  url: https://kubernetes-sigs.github.io/descheduler
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: descheduler
  namespace: kube-system
spec:
  releaseName: descheduler
  interval: 10m
  chart:
    spec:
      chart: descheduler
      version: "0.32.2"
      interval: 10m
      sourceRef:
        kind: HelmRepository
        name: descheduler
        namespace: kube-system
  values:
    schedule: "0 0 * * *"
    timeZone: "Asia/Kolkata"
    deschedulerPolicy:
      profiles:
        - name: default
          pluginConfig:
            - name: DefaultEvictor
              args:
                ignorePvcPods: true
                evictLocalStoragePods: true
            - name: RemoveDuplicates
            - name: RemovePodsHavingTooManyRestarts
              args:
                podRestartThreshold: 100
                includingInitContainers: true
            - name: RemovePodsViolatingNodeAffinity
              args:
                nodeAffinityType:
                - requiredDuringSchedulingIgnoredDuringExecution
            - name: RemovePodsViolatingNodeTaints
            - name: RemovePodsViolatingInterPodAntiAffinity
            - name: RemovePodsViolatingTopologySpreadConstraint
            - name: LowNodeUtilization
              args:
                thresholds:
                  cpu: 30
                  memory: 20
                  pods: 4
                targetThresholds:
                  cpu: 30
                  memory: 30
                  pods: 10
          plugins:
            balance:
              enabled:
                - RemoveDuplicates
                - RemovePodsViolatingTopologySpreadConstraint
                - LowNodeUtilization
            deschedule:
              enabled:
                - RemovePodsHavingTooManyRestarts
                - RemovePodsViolatingNodeTaints
                - RemovePodsViolatingNodeAffinity
                - RemovePodsViolatingInterPodAntiAffinity
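
If this repository is reconciled by Flux (as the HelmRepository and HelmRelease above assume), committing the file and triggering a reconcile is enough to roll out the release. The commands below are just a sketch; the namespace and names match the manifest above:

flux reconcile source helm descheduler -n kube-system
flux reconcile helmrelease descheduler -n kube-system
flux get helmreleases -n kube-system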

After applying this configuration, verify that the Descheduler CronJob has been successfully deployed:

kubectl -n kube-system get cronjobs

NAME          SCHEDULE    TIMEZONE       SUSPEND   ACTIVE   LAST SCHEDULE   AGE
descheduler   0 0 * * *   Asia/Kolkata   False     0        <none>          17m

This confirms Descheduler is properly configured as a CronJob scheduled to execute daily at midnight in the Asia/Kolkata timezone.
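
To double-check that the policy above was rendered correctly, you can also inspect the ConfigMap the chart generates from deschedulerPolicy (the name is assumed here to match the release name):

kubectl -n kube-system get configmap descheduler -o yaml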

The descheduler policy above groups its strategies into two plugin categories, along with a few evictor and threshold settings:

  • Balance Plugins: RemoveDuplicates, RemovePodsViolatingTopologySpreadConstraint, and LowNodeUtilization focus on spreading workloads evenly across the cluster.
  • Deschedule Plugins: RemovePodsHavingTooManyRestarts, RemovePodsViolatingNodeTaints, RemovePodsViolatingNodeAffinity, and RemovePodsViolatingInterPodAntiAffinity handle pods that no longer fit their rules or have issues like excessive restarts.
  • Additional Settings: The DefaultEvictor allows pods with local storage to be evicted but skips pods bound to Persistent Volume Claims (PVCs). The LowNodeUtilization strategy treats a node as underutilized when its usage is below all of the thresholds (30% CPU, 20% memory, 4% of pod capacity) and as overutilized when it exceeds any of the target thresholds (30% CPU, 30% memory, 10% of pod capacity); pods are then evicted from overutilized nodes so they can be rescheduled onto underutilized ones.
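
Note that, by default, LowNodeUtilization computes these percentages from pod resource requests rather than live metrics. To see the request-based figures it compares against the thresholds, inspect a node's allocated resources (the node name below is simply one of the nodes used later in this post):

kubectl describe node 192.168.1.19 | grep -A 8 "Allocated resources"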

2. Validate Setup

To validate our implementation, we'll first simulate an unbalanced cluster by stopping one node of our two-node k3s cluster. Once the node becomes unreachable, Kubernetes' default taint-based eviction takes over: pods tolerate the node.kubernetes.io/unreachable and node.kubernetes.io/not-ready taints for 300 seconds, after which they are evicted and rescheduled onto the remaining node.
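
One way to simulate this, assuming the second node runs the k3s agent as a systemd service (adjust the service name to match your installation), is to stop the agent and start it again once the pods have moved:

# on the second node
sudo systemctl stop k3s-agent

# later, once its pods are running on the first node
sudo systemctl start k3s-agent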

Once all pods are running on the first node, we'll bring the second node back into the cluster and examine the pod distribution across nodes:

kubectl get pods -A --field-selector=status.phase=Running -o=jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | sort | uniq -c | awk '{print $2 " : " $1}'

192.168.1.19 : 23
192.168.1.24 : 3

The output reveals a significant imbalance with 23 pods running on the first node and only 3 pods on the second node. Let’s manually trigger the Descheduler CronJob to verify its functionality.

kubectl -n kube-system create job --from=cronjob/descheduler descheduler-test

job.batch/descheduler-test created

Now, let’s examine the execution logs to observe the Descheduler’s decision-making process:

kubectl -n kube-system logs jobs/descheduler-test 

I0307 09:31:00.411392       1 nodeutilization.go:199] "Node is overutilized" node="192.168.1.19" usage={"cpu":"940m","memory":"1134Mi","pods":"23"} usagePercentage={"cpu":47,"memory":31.92,"pods":20.91}
I0307 09:31:00.411483       1 nodeutilization.go:196] "Node is underutilized" node="192.168.1.24" usage={"cpu":"500m","memory":"256Mi","pods":"4"} usagePercentage={"cpu":25,"memory":7.21,"pods":3.64}
I0307 09:31:00.411529       1 lownodeutilization.go:143] "Criteria for a node under utilization" CPU=30 Mem=20 Pods=4
I0307 09:31:00.411545       1 lownodeutilization.go:144] "Number of underutilized nodes" totalNumber=1
I0307 09:31:00.411555       1 lownodeutilization.go:147] "Criteria for a node above target utilization" CPU=30 Mem=30 Pods=10
I0307 09:31:00.411586       1 lownodeutilization.go:148] "Number of overutilized nodes" totalNumber=1
I0307 09:31:00.411602       1 nodeutilization.go:267] "Total capacity to be moved" CPU=100 Mem=849180262 Pods=7
I0307 09:31:00.411615       1 nodeutilization.go:270] "Evicting pods from node" node="192.168.1.19" usage={"cpu":"940m","memory":"1134Mi","pods":"23"}
I0307 09:31:00.411718       1 nodeutilization.go:273] "Pods on node" node="192.168.1.19" allPods=23 nonRemovablePods=8 removablePods=15
I0307 09:31:00.411759       1 nodeutilization.go:280] "Evicting pods based on priority, if they have same priority, they'll be evicted based on QoS tiers"
I0307 09:31:00.430412       1 evictions.go:551] "Evicted pod" pod="default/reflector-dcc5cf554-lhtqt" reason="" strategy="LowNodeUtilization" node="192.168.1.19" profile="default"
I0307 09:31:00.430476       1 nodeutilization.go:337] "Evicted pods" pod="default/reflector-dcc5cf554-lhtqt"
I0307 09:31:00.430489       1 nodeutilization.go:353] "Updated node usage" node="192.168.1.19" CPU=940 Mem=1189085184 Pods=22
I0307 09:31:00.536118       1 evictions.go:551] "Evicted pod" pod="monitoring/kube-prometheus-stack-grafana-b6ff94658-8gwhd" reason="" strategy="LowNodeUtilization" node="192.168.1.19" profile="default"
I0307 09:31:00.536181       1 nodeutilization.go:337] "Evicted pods" pod="monitoring/kube-prometheus-stack-grafana-b6ff94658-8gwhd"
I0307 09:31:00.536192       1 nodeutilization.go:353] "Updated node usage" node="192.168.1.19" CPU=940 Mem=1189085184 Pods=21
I0307 09:31:00.661102       1 evictions.go:551] "Evicted pod" pod="monitoring/kube-prometheus-stack-kube-state-metrics-655b644b89-r6lgd" reason="" strategy="LowNodeUtilization" node="192.168.1.19" profile="default"
I0307 09:31:00.661967       1 nodeutilization.go:337] "Evicted pods" pod="monitoring/kube-prometheus-stack-kube-state-metrics-655b644b89-r6lgd"
I0307 09:31:00.661984       1 nodeutilization.go:353] "Updated node usage" node="192.168.1.19" CPU=940 Mem=1189085184 Pods=20
I0307 09:31:00.794059       1 evictions.go:551] "Evicted pod" pod="monitoring/kube-prometheus-stack-operator-56bf5c7657-vwt9k" reason="" strategy="LowNodeUtilization" node="192.168.1.19" profile="default"
I0307 09:31:00.794110       1 nodeutilization.go:337] "Evicted pods" pod="monitoring/kube-prometheus-stack-operator-56bf5c7657-vwt9k"
I0307 09:31:00.794124       1 nodeutilization.go:353] "Updated node usage" node="192.168.1.19" CPU=940 Mem=1189085184 Pods=19
I0307 09:31:00.916132       1 evictions.go:551] "Evicted pod" pod="cert-manager/cert-manager-webhook-5f65ff988f-wmjsd" reason="" strategy="LowNodeUtilization" node="192.168.1.19" profile="default"
I0307 09:31:00.916167       1 nodeutilization.go:337] "Evicted pods" pod="cert-manager/cert-manager-webhook-5f65ff988f-wmjsd"
I0307 09:31:00.916175       1 nodeutilization.go:353] "Updated node usage" node="192.168.1.19" CPU=940 Mem=1189085184 Pods=18
I0307 09:31:01.113925       1 evictions.go:551] "Evicted pod" pod="cert-manager/cert-manager-5c86944dc6-5m956" reason="" strategy="LowNodeUtilization" node="192.168.1.19" profile="default"
I0307 09:31:01.113955       1 nodeutilization.go:337] "Evicted pods" pod="cert-manager/cert-manager-5c86944dc6-5m956"
I0307 09:31:01.113963       1 nodeutilization.go:353] "Updated node usage" node="192.168.1.19" CPU=940 Mem=1189085184 Pods=17
I0307 09:31:01.407545       1 evictions.go:551] "Evicted pod" pod="cert-manager/cert-manager-cainjector-bc8dbfcdd-srnct" reason="" strategy="LowNodeUtilization" node="192.168.1.19" profile="default"
I0307 09:31:01.407575       1 nodeutilization.go:337] "Evicted pods" pod="cert-manager/cert-manager-cainjector-bc8dbfcdd-srnct"
I0307 09:31:01.407585       1 nodeutilization.go:353] "Updated node usage" node="192.168.1.19" CPU=940 Mem=1189085184 Pods=16
I0307 09:31:01.407617       1 profile.go:361] "Total number of evictions/requests" extension point="Balance" evictedPods=7 evictionRequests=0
I0307 09:31:01.407631       1 descheduler.go:248] "Number of evictions/requests" totalEvicted=7 evictionRequests=0

The logs indicate that the Descheduler has identified an imbalance between nodes and strategically evicted 7 pods from the overutilized node (192.168.1.19). Let’s verify the pod distribution after this rebalancing operation:

kubectl get pods -A -o json | jq -r '.items[] | select(.status.phase=="Running" and .metadata.deletionTimestamp==null) | .spec.nodeName' | sort | uniq -c | awk '{print $2 " : " $1}'

192.168.1.19 : 16
192.168.1.24 : 10

As the updated counts show, the Descheduler has rebalanced the workload across both nodes, leaving 16 pods on the first node and 10 on the second. This more even distribution improves resource utilization and overall cluster performance.
