Kubernetes - Cluster Rebalancing with Descheduler

This post is part of our ongoing Kubernetes series. It focuses on how the Kubernetes Descheduler helps rebalance clusters for better performance and resource utilization. As a cluster evolves (pods start or terminate, nodes join or leave, resource demands shift), the initial pod placement may no longer be optimal for current conditions.

Descheduler addresses this challenge by identifying and relocating pods that no longer align with current cluster requirements, enabling Kubernetes to redistribute them more efficiently across the available infrastructure.

1. Set Up Descheduler

Let’s configure Descheduler by creating cluster/default/descheduler.yaml. This declarative configuration establishes:

  • The Kubernetes Sigs Helm Repository reference
  • The Descheduler Helm Release with appropriate configuration
---
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: descheduler
  namespace: kube-system
spec:
  interval: 30m
  url: https://kubernetes-sigs.github.io/descheduler
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: descheduler
  namespace: kube-system
spec:
  releaseName: descheduler
  interval: 10m
  chart:
    spec:
      chart: descheduler
      version: "0.32.2"
      interval: 10m
      sourceRef:
        kind: HelmRepository
        name: descheduler
        namespace: kube-system
  values:
    schedule: "0 0 * * *"
    timeZone: "Asia/Kolkata"
    deschedulerPolicy:
      profiles:
        - name: default
          pluginConfig:
            - name: DefaultEvictor
              args:
                ignorePvcPods: true
                evictLocalStoragePods: true
            - name: RemoveDuplicates
            - name: RemovePodsHavingTooManyRestarts
              args:
                podRestartThreshold: 100
                includingInitContainers: true
            - name: RemovePodsViolatingNodeAffinity
              args:
                nodeAffinityType:
                - requiredDuringSchedulingIgnoredDuringExecution
            - name: RemovePodsViolatingNodeTaints
            - name: RemovePodsViolatingInterPodAntiAffinity
            - name: RemovePodsViolatingTopologySpreadConstraint
            - name: LowNodeUtilization
              args:
                thresholds:
                  cpu: 30
                  memory: 20
                  pods: 4
                targetThresholds:
                  cpu: 30
                  memory: 30
                  pods: 10
          plugins:
            balance:
              enabled:
                - RemoveDuplicates
                - RemovePodsViolatingTopologySpreadConstraint
                - LowNodeUtilization
            deschedule:
              enabled:
                - RemovePodsHavingTooManyRestarts
                - RemovePodsViolatingNodeTaints
                - RemovePodsViolatingNodeAffinity
                - RemovePodsViolatingInterPodAntiAffinity
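
Since this manifest is reconciled by Flux, once it has been committed you can confirm that the HelmRelease was installed successfully (a quick check, assuming the flux CLI is available):

# Verify that Flux has reconciled the Descheduler HelmRelease.
flux get helmreleases descheduler -n kube-system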

After applying this configuration, verify that the Descheduler CronJob has been successfully deployed:

kubectl -n kube-system get cronjobs

NAME          SCHEDULE    TIMEZONE       SUSPEND   ACTIVE   LAST SCHEDULE   AGE
descheduler   0 0 * * *   Asia/Kolkata   False     0        <none>          17m

This confirms Descheduler is properly configured as a CronJob scheduled to execute daily at midnight in the Asia/Kolkata timezone.
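
If you want to double-check the policy that the chart rendered from our values, it is stored in a ConfigMap. The ConfigMap name (descheduler) and data key (policy.yaml) below follow the chart's defaults and may differ in other setups:

# Print the rendered Descheduler policy from the chart-managed ConfigMap.
# The ConfigMap name and the policy.yaml key are assumptions based on the chart defaults.
kubectl -n kube-system get configmap descheduler -o jsonpath='{.data.policy\.yaml}'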

The Descheduler policy contains several strategies grouped into two plugin categories, along with a few evictor and threshold settings:

  • Balance Plugins: RemoveDuplicates, RemovePodsViolatingTopologySpreadConstraint, and LowNodeUtilization focus on spreading workloads evenly across the cluster.
  • Deschedule Plugins: RemovePodsHavingTooManyRestarts, RemovePodsViolatingNodeTaints, RemovePodsViolatingNodeAffinity, and RemovePodsViolatingInterPodAntiAffinity handle pods that no longer fit their rules or have issues like excessive restarts.
  • Additional Settings: The DefaultEvictor permits eviction of pods that use local storage (evictLocalStoragePods: true) while skipping pods bound to Persistent Volume Claims (ignorePvcPods: true). The LowNodeUtilization strategy treats its thresholds as percentages of each node's allocatable capacity: a node is underutilized when its usage falls below all of the thresholds (CPU 30%, memory 20%, pods 4%) and overutilized when it exceeds any of the targetThresholds (CPU 30%, memory 30%, pods 10%), in which case pods are evicted from it so they can be rescheduled onto underutilized nodes (see the command sketch after this list).
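
By default, LowNodeUtilization evaluates these percentages against the sum of pod resource requests on each node rather than live usage. A quick way to eyeball those request percentages is the "Allocated resources" section of the node description (a rough sketch; the node names match this example cluster and the grep context length is approximate):

# Show each node's "Allocated resources" section, which lists the CPU and
# memory request percentages that LowNodeUtilization compares against its thresholds.
kubectl describe node 192.168.1.19 | grep -A 8 "Allocated resources"
kubectl describe node 192.168.1.24 | grep -A 8 "Allocated resources"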

2. Validate Setup

To validate our implementation, we’ll first simulate an unbalanced cluster state by stopping one node of our two-node k3s cluster. When the node becomes unreachable, Kubernetes taints it, and once the default 5-minute toleration on its pods expires, those pods are evicted and rescheduled onto the remaining operational node.
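
The 5-minute figure comes from the tolerations Kubernetes injects into pods by default; you can inspect them on any running pod (the pod name below is a placeholder, and jq is assumed to be installed):

# Inspect a pod's default tolerations. The node.kubernetes.io/unreachable
# toleration with tolerationSeconds: 300 is what delays rescheduling by
# roughly five minutes after a node goes down.
kubectl get pod <pod-name> -o json | jq '.spec.tolerations'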

Once all pods have successfully migrated to the first node, we’ll reintroduce our second node to the cluster and examine the pod distribution across nodes.

kubectl get pods -A --field-selector=status.phase=Running -o=jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | sort | uniq -c | awk '{print $2 " : " $1}'

192.168.1.19 : 23
192.168.1.24 : 3

The output reveals a significant imbalance with 23 pods running on the first node and only 3 pods on the second node. Let’s manually trigger the Descheduler CronJob to verify its functionality.

kubectl -n kube-system create job --from=cronjob/descheduler descheduler-test

job.batch/descheduler-test created
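
Before reading the logs, you can optionally wait for the job to complete (the 120-second timeout here is arbitrary):

# Block until the manually triggered Descheduler run finishes.
kubectl -n kube-system wait --for=condition=complete job/descheduler-test --timeout=120s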

Now, let’s examine the execution logs to observe the Descheduler’s decision-making process:

kubectl -n kube-system logs jobs/descheduler-test

I0307 09:31:00.411392       1 nodeutilization.go:199] "Node is overutilized" node="192.168.1.19" usage={"cpu":"940m","memory":"1134Mi","pods":"23"} usagePercentage={"cpu":47,"memory":31.92,"pods":20.91}
I0307 09:31:00.411483       1 nodeutilization.go:196] "Node is underutilized" node="192.168.1.24" usage={"cpu":"500m","memory":"256Mi","pods":"4"} usagePercentage={"cpu":25,"memory":7.21,"pods":3.64}
I0307 09:31:00.411529       1 lownodeutilization.go:143] "Criteria for a node under utilization" CPU=30 Mem=20 Pods=4
I0307 09:31:00.411545       1 lownodeutilization.go:144] "Number of underutilized nodes" totalNumber=1
I0307 09:31:00.411555       1 lownodeutilization.go:147] "Criteria for a node above target utilization" CPU=30 Mem=30 Pods=10
I0307 09:31:00.411586       1 lownodeutilization.go:148] "Number of overutilized nodes" totalNumber=1
I0307 09:31:00.411602       1 nodeutilization.go:267] "Total capacity to be moved" CPU=100 Mem=849180262 Pods=7
I0307 09:31:00.411615       1 nodeutilization.go:270] "Evicting pods from node" node="192.168.1.19" usage={"cpu":"940m","memory":"1134Mi","pods":"23"}
I0307 09:31:00.411718       1 nodeutilization.go:273] "Pods on node" node="192.168.1.19" allPods=23 nonRemovablePods=8 removablePods=15
I0307 09:31:00.411759       1 nodeutilization.go:280] "Evicting pods based on priority, if they have same priority, they'll be evicted based on QoS tiers"
I0307 09:31:00.430412       1 evictions.go:551] "Evicted pod" pod="default/reflector-dcc5cf554-lhtqt" reason="" strategy="LowNodeUtilization" node="192.168.1.19" profile="default"
I0307 09:31:00.430476       1 nodeutilization.go:337] "Evicted pods" pod="default/reflector-dcc5cf554-lhtqt"
I0307 09:31:00.430489       1 nodeutilization.go:353] "Updated node usage" node="192.168.1.19" CPU=940 Mem=1189085184 Pods=22
I0307 09:31:00.536118       1 evictions.go:551] "Evicted pod" pod="monitoring/kube-prometheus-stack-grafana-b6ff94658-8gwhd" reason="" strategy="LowNodeUtilization" node="192.168.1.19" profile="default"
I0307 09:31:00.536181       1 nodeutilization.go:337] "Evicted pods" pod="monitoring/kube-prometheus-stack-grafana-b6ff94658-8gwhd"
I0307 09:31:00.536192       1 nodeutilization.go:353] "Updated node usage" node="192.168.1.19" CPU=940 Mem=1189085184 Pods=21
I0307 09:31:00.661102       1 evictions.go:551] "Evicted pod" pod="monitoring/kube-prometheus-stack-kube-state-metrics-655b644b89-r6lgd" reason="" strategy="LowNodeUtilization" node="192.168.1.19" profile="default"
I0307 09:31:00.661967       1 nodeutilization.go:337] "Evicted pods" pod="monitoring/kube-prometheus-stack-kube-state-metrics-655b644b89-r6lgd"
I0307 09:31:00.661984       1 nodeutilization.go:353] "Updated node usage" node="192.168.1.19" CPU=940 Mem=1189085184 Pods=20
I0307 09:31:00.794059       1 evictions.go:551] "Evicted pod" pod="monitoring/kube-prometheus-stack-operator-56bf5c7657-vwt9k" reason="" strategy="LowNodeUtilization" node="192.168.1.19" profile="default"
I0307 09:31:00.794110       1 nodeutilization.go:337] "Evicted pods" pod="monitoring/kube-prometheus-stack-operator-56bf5c7657-vwt9k"
I0307 09:31:00.794124       1 nodeutilization.go:353] "Updated node usage" node="192.168.1.19" CPU=940 Mem=1189085184 Pods=19
I0307 09:31:00.916132       1 evictions.go:551] "Evicted pod" pod="cert-manager/cert-manager-webhook-5f65ff988f-wmjsd" reason="" strategy="LowNodeUtilization" node="192.168.1.19" profile="default"
I0307 09:31:00.916167       1 nodeutilization.go:337] "Evicted pods" pod="cert-manager/cert-manager-webhook-5f65ff988f-wmjsd"
I0307 09:31:00.916175       1 nodeutilization.go:353] "Updated node usage" node="192.168.1.19" CPU=940 Mem=1189085184 Pods=18
I0307 09:31:01.113925       1 evictions.go:551] "Evicted pod" pod="cert-manager/cert-manager-5c86944dc6-5m956" reason="" strategy="LowNodeUtilization" node="192.168.1.19" profile="default"
I0307 09:31:01.113955       1 nodeutilization.go:337] "Evicted pods" pod="cert-manager/cert-manager-5c86944dc6-5m956"
I0307 09:31:01.113963       1 nodeutilization.go:353] "Updated node usage" node="192.168.1.19" CPU=940 Mem=1189085184 Pods=17
I0307 09:31:01.407545       1 evictions.go:551] "Evicted pod" pod="cert-manager/cert-manager-cainjector-bc8dbfcdd-srnct" reason="" strategy="LowNodeUtilization" node="192.168.1.19" profile="default"
I0307 09:31:01.407575       1 nodeutilization.go:337] "Evicted pods" pod="cert-manager/cert-manager-cainjector-bc8dbfcdd-srnct"
I0307 09:31:01.407585       1 nodeutilization.go:353] "Updated node usage" node="192.168.1.19" CPU=940 Mem=1189085184 Pods=16
I0307 09:31:01.407617       1 profile.go:361] "Total number of evictions/requests" extension point="Balance" evictedPods=7 evictionRequests=0
I0307 09:31:01.407631       1 descheduler.go:248] "Number of evictions/requests" totalEvicted=7 evictionRequests=0

The logs show that the Descheduler identified the imbalance between the nodes and evicted 7 pods from the overutilized node (192.168.1.19). Let’s verify the pod distribution after this rebalancing operation:

kubectl get pods -A -o json | jq -r '.items[] | select(.status.phase=="Running" and .metadata.deletionTimestamp==null) | .spec.nodeName' | sort | uniq -c | awk '{print $2 " : " $1}'

192.168.1.19 : 16
192.168.1.24 : 10

As the updated counts show, the Descheduler has rebalanced the workload across both nodes, with 16 pods on the first node and 10 on the second. This more even distribution improves resource utilization and leaves headroom on both nodes.
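
Once the rebalanced distribution has been confirmed, the manually created test job can be removed:

# Clean up the one-off job created from the CronJob.
kubectl -n kube-system delete job descheduler-test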
