Kubernetes - Cluster Rebalancing with Descheduler
This post is part of our ongoing Kubernetes series and focuses on how the Kubernetes Descheduler helps rebalance clusters for better performance and resource utilization. As clusters evolve—pods start or terminate, nodes join or leave, or resource demands shift—the initial pod placement may no longer be optimal for current conditions.
Descheduler addresses this challenge by identifying and relocating pods that no longer align with current cluster requirements, enabling Kubernetes to redistribute them more efficiently across the available infrastructure.
1. Setup Descheduler
Let’s configure Descheduler by creating cluster/default/descheduler.yaml. This declarative configuration establishes:
- The Kubernetes Sigs Helm Repository reference
- The Descheduler Helm Release with appropriate configuration
---
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: descheduler
  namespace: kube-system
spec:
  interval: 30m
  url: https://kubernetes-sigs.github.io/descheduler
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: descheduler
  namespace: kube-system
spec:
  releaseName: descheduler
  interval: 10m
  chart:
    spec:
      chart: descheduler
      version: "0.32.2"
      interval: 10m
      sourceRef:
        kind: HelmRepository
        name: descheduler
        namespace: kube-system
  values:
    schedule: "0 0 * * *"
    timeZone: "Asia/Kolkata"
    deschedulerPolicy:
      profiles:
        - name: default
          pluginConfig:
            - name: DefaultEvictor
              args:
                ignorePvcPods: true
                evictLocalStoragePods: true
            - name: RemoveDuplicates
            - name: RemovePodsHavingTooManyRestarts
              args:
                podRestartThreshold: 100
                includingInitContainers: true
            - name: RemovePodsViolatingNodeAffinity
              args:
                nodeAffinityType:
                  - requiredDuringSchedulingIgnoredDuringExecution
            - name: RemovePodsViolatingNodeTaints
            - name: RemovePodsViolatingInterPodAntiAffinity
            - name: RemovePodsViolatingTopologySpreadConstraint
            - name: LowNodeUtilization
              args:
                thresholds:
                  cpu: 30
                  memory: 20
                  pods: 4
                targetThresholds:
                  cpu: 30
                  memory: 30
                  pods: 10
          plugins:
            balance:
              enabled:
                - RemoveDuplicates
                - RemovePodsViolatingTopologySpreadConstraint
                - LowNodeUtilization
            deschedule:
              enabled:
                - RemovePodsHavingTooManyRestarts
                - RemovePodsViolatingNodeTaints
                - RemovePodsViolatingNodeAffinity
                - RemovePodsViolatingInterPodAntiAffinity
After applying this configuration, verify that the Descheduler CronJob has been successfully deployed:
kubectl -n kube-system get cronjobs

NAME          SCHEDULE    TIMEZONE       SUSPEND   ACTIVE   LAST SCHEDULE   AGE
descheduler   0 0 * * *   Asia/Kolkata   False     0        <none>          17m
This confirms Descheduler is properly configured as a CronJob scheduled to execute daily at midnight in the Asia/Kolkata timezone.
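To double-check that the policy defined in the HelmRelease values was rendered as expected, you can also inspect the ConfigMap the chart generates for the Descheduler policy (the chart typically names it after the release, so adjust the name if your setup differs):

kubectl -n kube-system get configmap descheduler -o yaml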
The descheduler policy contains several strategies grouped into two plugin categories, along with evictor-level settings:
- Balance Plugins: RemoveDuplicates, RemovePodsViolatingTopologySpreadConstraint, and LowNodeUtilization focus on spreading workloads evenly across the cluster.
- Deschedule Plugins: RemovePodsHavingTooManyRestarts, RemovePodsViolatingNodeTaints, RemovePodsViolatingNodeAffinity, and RemovePodsViolatingInterPodAntiAffinity handle pods that no longer satisfy their placement rules or show problems such as excessive restarts.
- Additional Settings: The DefaultEvictor permits eviction of pods with local storage but preserves those bound to Persistent Volume Claims (PVCs). LowNodeUtilization marks a node as underutilized when it sits below 30% CPU, 20% memory, and 4% of pod capacity, and as overutilized when it exceeds the target thresholds of 30% CPU, 30% memory, and 10% of pod capacity, which is what drives the workload redistribution.
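By default, LowNodeUtilization measures utilization from the resource requests of the pods scheduled on each node rather than from live metrics. To preview how a node measures up against the thresholds above, you can inspect its allocated resources (the node name below is from this post's cluster; substitute your own):

kubectl describe node 192.168.1.19 | grep -A 8 "Allocated resources"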
2. Validate Setup
To validate our implementation, we'll first simulate an unbalanced cluster state by stopping one node of our two-node k3s cluster. Kubernetes' taint-based eviction then kicks in: pods on the stopped node tolerate the node.kubernetes.io/unreachable taint for the default 5 minutes, after which they are rescheduled onto the remaining operational node.
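If powering off a node isn't convenient, you can approximate the same imbalance with standard kubectl commands by cordoning and draining the second node, then making it schedulable again once its pods have landed on the first node (substitute your own node name):

kubectl cordon 192.168.1.24
kubectl drain 192.168.1.24 --ignore-daemonsets --delete-emptydir-data
kubectl uncordon 192.168.1.24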
Once all pods have successfully migrated to the first node, we’ll reintroduce our second node to the cluster and examine the pod distribution across nodes.
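Before checking the distribution, confirm that both nodes have rejoined the cluster and report a Ready status:

kubectl get nodes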
kubectl get pods -A --field-selector=status.phase=Running -o=jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | sort | uniq -c | awk '{print $2 " : " $1}'

192.168.1.19 : 23
192.168.1.24 : 3
The output reveals a significant imbalance with 23 pods running on the first node and only 3 pods on the second node. Let’s manually trigger the Descheduler CronJob to verify its functionality.
kubectl -n kube-system create job --from=cronjob/descheduler descheduler-test

job.batch/descheduler-test created
Now, let’s examine the execution logs to observe the Descheduler’s decision-making process:
kubectl -n kube-system logs jobs/descheduler-test

I0307 09:31:00.411392 1 nodeutilization.go:199] "Node is overutilized" node="192.168.1.19" usage={"cpu":"940m","memory":"1134Mi","pods":"23"} usagePercentage={"cpu":47,"memory":31.92,"pods":20.91}
I0307 09:31:00.411483 1 nodeutilization.go:196] "Node is underutilized" node="192.168.1.24" usage={"cpu":"500m","memory":"256Mi","pods":"4"} usagePercentage={"cpu":25,"memory":7.21,"pods":3.64}
I0307 09:31:00.411529 1 lownodeutilization.go:143] "Criteria for a node under utilization" CPU=30 Mem=20 Pods=4
I0307 09:31:00.411545 1 lownodeutilization.go:144] "Number of underutilized nodes" totalNumber=1
I0307 09:31:00.411555 1 lownodeutilization.go:147] "Criteria for a node above target utilization" CPU=30 Mem=30 Pods=10
I0307 09:31:00.411586 1 lownodeutilization.go:148] "Number of overutilized nodes" totalNumber=1
I0307 09:31:00.411602 1 nodeutilization.go:267] "Total capacity to be moved" CPU=100 Mem=849180262 Pods=7
I0307 09:31:00.411615 1 nodeutilization.go:270] "Evicting pods from node" node="192.168.1.19" usage={"cpu":"940m","memory":"1134Mi","pods":"23"}
I0307 09:31:00.411718 1 nodeutilization.go:273] "Pods on node" node="192.168.1.19" allPods=23 nonRemovablePods=8 removablePods=15
I0307 09:31:00.411759 1 nodeutilization.go:280] "Evicting pods based on priority, if they have same priority, they'll be evicted based on QoS tiers"
I0307 09:31:00.430412 1 evictions.go:551] "Evicted pod" pod="default/reflector-dcc5cf554-lhtqt" reason="" strategy="LowNodeUtilization" node="192.168.1.19" profile="default"
I0307 09:31:00.430476 1 nodeutilization.go:337] "Evicted pods" pod="default/reflector-dcc5cf554-lhtqt"
I0307 09:31:00.430489 1 nodeutilization.go:353] "Updated node usage" node="192.168.1.19" CPU=940 Mem=1189085184 Pods=22
I0307 09:31:00.536118 1 evictions.go:551] "Evicted pod" pod="monitoring/kube-prometheus-stack-grafana-b6ff94658-8gwhd" reason="" strategy="LowNodeUtilization" node="192.168.1.19" profile="default"
I0307 09:31:00.536181 1 nodeutilization.go:337] "Evicted pods" pod="monitoring/kube-prometheus-stack-grafana-b6ff94658-8gwhd"
I0307 09:31:00.536192 1 nodeutilization.go:353] "Updated node usage" node="192.168.1.19" CPU=940 Mem=1189085184 Pods=21
I0307 09:31:00.661102 1 evictions.go:551] "Evicted pod" pod="monitoring/kube-prometheus-stack-kube-state-metrics-655b644b89-r6lgd" reason="" strategy="LowNodeUtilization" node="192.168.1.19" profile="default"
I0307 09:31:00.661967 1 nodeutilization.go:337] "Evicted pods" pod="monitoring/kube-prometheus-stack-kube-state-metrics-655b644b89-r6lgd"
I0307 09:31:00.661984 1 nodeutilization.go:353] "Updated node usage" node="192.168.1.19" CPU=940 Mem=1189085184 Pods=20
I0307 09:31:00.794059 1 evictions.go:551] "Evicted pod" pod="monitoring/kube-prometheus-stack-operator-56bf5c7657-vwt9k" reason="" strategy="LowNodeUtilization" node="192.168.1.19" profile="default"
I0307 09:31:00.794110 1 nodeutilization.go:337] "Evicted pods" pod="monitoring/kube-prometheus-stack-operator-56bf5c7657-vwt9k"
I0307 09:31:00.794124 1 nodeutilization.go:353] "Updated node usage" node="192.168.1.19" CPU=940 Mem=1189085184 Pods=19
I0307 09:31:00.916132 1 evictions.go:551] "Evicted pod" pod="cert-manager/cert-manager-webhook-5f65ff988f-wmjsd" reason="" strategy="LowNodeUtilization" node="192.168.1.19" profile="default"
I0307 09:31:00.916167 1 nodeutilization.go:337] "Evicted pods" pod="cert-manager/cert-manager-webhook-5f65ff988f-wmjsd"
I0307 09:31:00.916175 1 nodeutilization.go:353] "Updated node usage" node="192.168.1.19" CPU=940 Mem=1189085184 Pods=18
I0307 09:31:01.113925 1 evictions.go:551] "Evicted pod" pod="cert-manager/cert-manager-5c86944dc6-5m956" reason="" strategy="LowNodeUtilization" node="192.168.1.19" profile="default"
I0307 09:31:01.113955 1 nodeutilization.go:337] "Evicted pods" pod="cert-manager/cert-manager-5c86944dc6-5m956"
I0307 09:31:01.113963 1 nodeutilization.go:353] "Updated node usage" node="192.168.1.19" CPU=940 Mem=1189085184 Pods=17
I0307 09:31:01.407545 1 evictions.go:551] "Evicted pod" pod="cert-manager/cert-manager-cainjector-bc8dbfcdd-srnct" reason="" strategy="LowNodeUtilization" node="192.168.1.19" profile="default"
I0307 09:31:01.407575 1 nodeutilization.go:337] "Evicted pods" pod="cert-manager/cert-manager-cainjector-bc8dbfcdd-srnct"
I0307 09:31:01.407585 1 nodeutilization.go:353] "Updated node usage" node="192.168.1.19" CPU=940 Mem=1189085184 Pods=16
I0307 09:31:01.407617 1 profile.go:361] "Total number of evictions/requests" extension point="Balance" evictedPods=7 evictionRequests=0
I0307 09:31:01.407631 1 descheduler.go:248] "Number of evictions/requests" totalEvicted=7 evictionRequests=0
The logs indicate that the Descheduler has identified an imbalance between nodes and strategically evicted 7 pods from the overutilized node (192.168.1.19). Let’s verify the pod distribution after this rebalancing operation:
kubectl get pods -A -o json | jq -r '.items[] | select(.status.phase=="Running" and .metadata.deletionTimestamp==null) | .spec.nodeName' | sort | uniq -c | awk '{print $2 " : " $1}'

192.168.1.19 : 16
192.168.1.24 : 10
As the updated counts show, the Descheduler has successfully rebalanced the workload across both nodes, leaving 16 pods on the first node and 10 pods on the second. This more even distribution makes better use of the available resources across the cluster.
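Since the Descheduler evicts pods through the standard Kubernetes eviction API, it respects PodDisruptionBudgets. If a workload should never be drained below a minimum replica count during rebalancing, a PDB is the natural guard rail; here is a minimal sketch, assuming a Deployment whose pods carry the label app: my-app (both the name and the label are placeholders for your own workload):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
  namespace: default
spec:
  # Eviction requests that would drop the workload below one
  # available replica are rejected, so the Descheduler skips them.
  minAvailable: 1
  selector:
    matchLabels:
      app: my-app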
References
- Kubernetes Documentation - https://kubernetes.io/docs/concepts/scheduling-eviction/
- Descheduler Documentation - https://github.com/kubernetes-sigs/descheduler