Kubernetes - Cluster Rebalancing with Descheduler
This post is part of our ongoing Kubernetes series. It focuses on how the Kubernetes Descheduler helps rebalance clusters for better performance and resource utilization. As clusters evolve (pods start or terminate, nodes join or leave, resource demands shift), the initial pod placement may no longer be optimal for current conditions.
Descheduler addresses this challenge by identifying and relocating pods that no longer align with current cluster requirements, enabling Kubernetes to redistribute them more efficiently across the available infrastructure.
1. Set Up Descheduler
Let’s configure Descheduler by creating cluster/default/descheduler.yaml. This declarative configuration establishes:
- The Kubernetes Sigs Helm Repository reference
- The Descheduler Helm Release with appropriate configuration
---
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: descheduler
  namespace: kube-system
spec:
  interval: 30m
  url: https://kubernetes-sigs.github.io/descheduler
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: descheduler
  namespace: kube-system
spec:
  releaseName: descheduler
  interval: 10m
  chart:
    spec:
      chart: descheduler
      version: "0.32.2"
      interval: 10m
      sourceRef:
        kind: HelmRepository
        name: descheduler
        namespace: kube-system
  values:
    schedule: "0 0 * * *"
    timeZone: "Asia/Kolkata"
    deschedulerPolicy:
      profiles:
        - name: default
          pluginConfig:
            - name: DefaultEvictor
              args:
                ignorePvcPods: true
                evictLocalStoragePods: true
            - name: RemoveDuplicates
            - name: RemovePodsHavingTooManyRestarts
              args:
                podRestartThreshold: 100
                includingInitContainers: true
            - name: RemovePodsViolatingNodeAffinity
              args:
                nodeAffinityType:
                  - requiredDuringSchedulingIgnoredDuringExecution
            - name: RemovePodsViolatingNodeTaints
            - name: RemovePodsViolatingInterPodAntiAffinity
            - name: RemovePodsViolatingTopologySpreadConstraint
            - name: LowNodeUtilization
              args:
                thresholds:
                  cpu: 30
                  memory: 20
                  pods: 4
                targetThresholds:
                  cpu: 30
                  memory: 30
                  pods: 10
          plugins:
            balance:
              enabled:
                - RemoveDuplicates
                - RemovePodsViolatingTopologySpreadConstraint
                - LowNodeUtilization
            deschedule:
              enabled:
                - RemovePodsHavingTooManyRestarts
                - RemovePodsViolatingNodeTaints
                - RemovePodsViolatingNodeAffinity
                - RemovePodsViolatingInterPodAntiAffinity
After applying this configuration, verify that the Descheduler CronJob has been successfully deployed:
kubectl -n kube-system get cronjobs
NAME          SCHEDULE    TIMEZONE       SUSPEND   ACTIVE   LAST SCHEDULE   AGE
descheduler   0 0 * * *   Asia/Kolkata   False     0        <none>          17m
This confirms Descheduler is properly configured as a CronJob scheduled to execute daily at midnight in the Asia/Kolkata timezone.
The descheduler policy groups its strategies into two plugin categories, plus evictor settings:
- Balance plugins: RemoveDuplicates, RemovePodsViolatingTopologySpreadConstraint, and LowNodeUtilization focus on spreading workloads evenly across the cluster.
- Deschedule plugins: RemovePodsHavingTooManyRestarts, RemovePodsViolatingNodeTaints, RemovePodsViolatingNodeAffinity, and RemovePodsViolatingInterPodAntiAffinity handle pods that no longer satisfy their placement rules or have issues like excessive restarts.
- Additional settings: The DefaultEvictor permits eviction of pods using local storage but preserves those bound to Persistent Volume Claims (PVCs). The LowNodeUtilization strategy treats a node as underutilized when its usage falls below 30% CPU, 20% memory, and 4% of pod capacity, and as overutilized when usage exceeds the target thresholds of 30% CPU, 30% memory, or 10% of pod capacity, enabling workload redistribution from overutilized to underutilized nodes.
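To make the threshold semantics concrete, here is a minimal Python sketch of how LowNodeUtilization classifies nodes. This is illustrative only, not the descheduler's actual code; the usage percentages mirror the two-node example in the next section.

```python
# Sketch of LowNodeUtilization's classification logic (illustrative only,
# not the descheduler's real implementation). All values are percentages
# of node allocatable capacity.
thresholds = {"cpu": 30, "memory": 20, "pods": 4}          # under-utilization bounds
target_thresholds = {"cpu": 30, "memory": 30, "pods": 10}  # over-utilization bounds

def is_underutilized(usage_pct):
    # A node is underutilized only if usage is below ALL thresholds.
    return all(usage_pct[r] < thresholds[r] for r in thresholds)

def is_overutilized(usage_pct):
    # A node is overutilized if usage exceeds ANY target threshold.
    return any(usage_pct[r] > target_thresholds[r] for r in target_thresholds)

# Usage percentages taken from the descheduler log output shown later:
node_a = {"cpu": 47, "memory": 31.92, "pods": 20.91}
node_b = {"cpu": 25, "memory": 7.21, "pods": 3.64}

print(is_overutilized(node_a))   # True  -> pods will be evicted from this node
print(is_underutilized(node_b))  # True  -> evicted pods can land here
```

Note the asymmetry: under-utilization requires every resource to be below its threshold, while a single resource above its target threshold is enough to mark a node overutilized.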
2. Validate Setup
To validate our implementation, we’ll first simulate an unbalanced cluster state by stopping one node of our two-node k3s cluster. Once the node goes NotReady and the default 5-minute toleration for the not-ready node taint expires, Kubernetes evicts its pods and reschedules them onto the remaining operational node.
Once all pods have successfully migrated to the first node, we’ll reintroduce our second node to the cluster and examine the pod distribution across nodes.
kubectl get pods -A --field-selector=status.phase=Running -o=jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | sort | uniq -c | awk '{print $2 " : " $1}'
192.168.1.19 : 23
192.168.1.24 : 3
The output reveals a significant imbalance with 23 pods running on the first node and only 3 pods on the second node. Let’s manually trigger the Descheduler CronJob to verify its functionality.
kubectl -n kube-system create job --from=cronjob/descheduler descheduler-test
job.batch/descheduler-test created
Now, let’s examine the execution logs to observe the Descheduler’s decision-making process:
kubectl -n kube-system logs jobs/descheduler-test
I0307 09:31:00.411392 1 nodeutilization.go:199] "Node is overutilized" node="192.168.1.19" usage={"cpu":"940m","memory":"1134Mi","pods":"23"} usagePercentage={"cpu":47,"memory":31.92,"pods":20.91}
I0307 09:31:00.411483 1 nodeutilization.go:196] "Node is underutilized" node="192.168.1.24" usage={"cpu":"500m","memory":"256Mi","pods":"4"} usagePercentage={"cpu":25,"memory":7.21,"pods":3.64}
I0307 09:31:00.411529 1 lownodeutilization.go:143] "Criteria for a node under utilization" CPU=30 Mem=20 Pods=4
I0307 09:31:00.411545 1 lownodeutilization.go:144] "Number of underutilized nodes" totalNumber=1
I0307 09:31:00.411555 1 lownodeutilization.go:147] "Criteria for a node above target utilization" CPU=30 Mem=30 Pods=10
I0307 09:31:00.411586 1 lownodeutilization.go:148] "Number of overutilized nodes" totalNumber=1
I0307 09:31:00.411602 1 nodeutilization.go:267] "Total capacity to be moved" CPU=100 Mem=849180262 Pods=7
I0307 09:31:00.411615 1 nodeutilization.go:270] "Evicting pods from node" node="192.168.1.19" usage={"cpu":"940m","memory":"1134Mi","pods":"23"}
I0307 09:31:00.411718 1 nodeutilization.go:273] "Pods on node" node="192.168.1.19" allPods=23 nonRemovablePods=8 removablePods=15
I0307 09:31:00.411759 1 nodeutilization.go:280] "Evicting pods based on priority, if they have same priority, they'll be evicted based on QoS tiers"
I0307 09:31:00.430412 1 evictions.go:551] "Evicted pod" pod="default/reflector-dcc5cf554-lhtqt" reason="" strategy="LowNodeUtilization" node="192.168.1.19" profile="default"
I0307 09:31:00.430476 1 nodeutilization.go:337] "Evicted pods" pod="default/reflector-dcc5cf554-lhtqt"
I0307 09:31:00.430489 1 nodeutilization.go:353] "Updated node usage" node="192.168.1.19" CPU=940 Mem=1189085184 Pods=22
I0307 09:31:00.536118 1 evictions.go:551] "Evicted pod" pod="monitoring/kube-prometheus-stack-grafana-b6ff94658-8gwhd" reason="" strategy="LowNodeUtilization" node="192.168.1.19" profile="default"
I0307 09:31:00.536181 1 nodeutilization.go:337] "Evicted pods" pod="monitoring/kube-prometheus-stack-grafana-b6ff94658-8gwhd"
I0307 09:31:00.536192 1 nodeutilization.go:353] "Updated node usage" node="192.168.1.19" CPU=940 Mem=1189085184 Pods=21
I0307 09:31:00.661102 1 evictions.go:551] "Evicted pod" pod="monitoring/kube-prometheus-stack-kube-state-metrics-655b644b89-r6lgd" reason="" strategy="LowNodeUtilization" node="192.168.1.19" profile="default"
I0307 09:31:00.661967 1 nodeutilization.go:337] "Evicted pods" pod="monitoring/kube-prometheus-stack-kube-state-metrics-655b644b89-r6lgd"
I0307 09:31:00.661984 1 nodeutilization.go:353] "Updated node usage" node="192.168.1.19" CPU=940 Mem=1189085184 Pods=20
I0307 09:31:00.794059 1 evictions.go:551] "Evicted pod" pod="monitoring/kube-prometheus-stack-operator-56bf5c7657-vwt9k" reason="" strategy="LowNodeUtilization" node="192.168.1.19" profile="default"
I0307 09:31:00.794110 1 nodeutilization.go:337] "Evicted pods" pod="monitoring/kube-prometheus-stack-operator-56bf5c7657-vwt9k"
I0307 09:31:00.794124 1 nodeutilization.go:353] "Updated node usage" node="192.168.1.19" CPU=940 Mem=1189085184 Pods=19
I0307 09:31:00.916132 1 evictions.go:551] "Evicted pod" pod="cert-manager/cert-manager-webhook-5f65ff988f-wmjsd" reason="" strategy="LowNodeUtilization" node="192.168.1.19" profile="default"
I0307 09:31:00.916167 1 nodeutilization.go:337] "Evicted pods" pod="cert-manager/cert-manager-webhook-5f65ff988f-wmjsd"
I0307 09:31:00.916175 1 nodeutilization.go:353] "Updated node usage" node="192.168.1.19" CPU=940 Mem=1189085184 Pods=18
I0307 09:31:01.113925 1 evictions.go:551] "Evicted pod" pod="cert-manager/cert-manager-5c86944dc6-5m956" reason="" strategy="LowNodeUtilization" node="192.168.1.19" profile="default"
I0307 09:31:01.113955 1 nodeutilization.go:337] "Evicted pods" pod="cert-manager/cert-manager-5c86944dc6-5m956"
I0307 09:31:01.113963 1 nodeutilization.go:353] "Updated node usage" node="192.168.1.19" CPU=940 Mem=1189085184 Pods=17
I0307 09:31:01.407545 1 evictions.go:551] "Evicted pod" pod="cert-manager/cert-manager-cainjector-bc8dbfcdd-srnct" reason="" strategy="LowNodeUtilization" node="192.168.1.19" profile="default"
I0307 09:31:01.407575 1 nodeutilization.go:337] "Evicted pods" pod="cert-manager/cert-manager-cainjector-bc8dbfcdd-srnct"
I0307 09:31:01.407585 1 nodeutilization.go:353] "Updated node usage" node="192.168.1.19" CPU=940 Mem=1189085184 Pods=16
I0307 09:31:01.407617 1 profile.go:361] "Total number of evictions/requests" extension point="Balance" evictedPods=7 evictionRequests=0
I0307 09:31:01.407631 1 descheduler.go:248] "Number of evictions/requests" totalEvicted=7 evictionRequests=0
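On larger clusters these logs get long, so it can help to tally evictions programmatically. Below is a short Python sketch that assumes the klog format shown above; the sample contains two eviction lines copied from that output.

```python
import re
from collections import Counter

# Matches the "Evicted pod" lines in the descheduler log format shown above.
EVICTION_RE = re.compile(r'"Evicted pod" pod="([^"]+)".*strategy="([^"]+)".*node="([^"]+)"')

def summarize_evictions(log_text):
    """Tally evictions per (node, strategy) from descheduler log output."""
    counts = Counter()
    for match in EVICTION_RE.finditer(log_text):
        _pod, strategy, node = match.groups()
        counts[(node, strategy)] += 1
    return counts

sample = '''
I0307 09:31:00.430412 1 evictions.go:551] "Evicted pod" pod="default/reflector-dcc5cf554-lhtqt" reason="" strategy="LowNodeUtilization" node="192.168.1.19" profile="default"
I0307 09:31:00.536118 1 evictions.go:551] "Evicted pod" pod="monitoring/kube-prometheus-stack-grafana-b6ff94658-8gwhd" reason="" strategy="LowNodeUtilization" node="192.168.1.19" profile="default"
'''
print(summarize_evictions(sample))  # Counter({('192.168.1.19', 'LowNodeUtilization'): 2})
```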
The logs indicate that the Descheduler has identified an imbalance between nodes and strategically evicted 7 pods from the overutilized node (192.168.1.19). Let’s verify the pod distribution after this rebalancing operation:
kubectl get pods -A -o json | jq -r '.items[] | select(.status.phase=="Running" and .metadata.deletionTimestamp==null) | .spec.nodeName' | sort | uniq -c | awk '{print $2 " : " $1}'
192.168.1.19 : 16
192.168.1.24 : 10
As the updated counts show, the Descheduler has successfully rebalanced the workload across both nodes, with 16 pods on the first node and 10 on the second, a far more even distribution that improves resource utilization and overall cluster performance.
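The jq/awk one-liners used above can also be expressed in Python, which is handy on machines without those tools. This is a sketch that parses the JSON produced by kubectl get pods -A -o json; the sample document here is hypothetical.

```python
import json
from collections import Counter

def pods_per_node(pod_list_json):
    """Count running pods per node from `kubectl get pods -A -o json` output."""
    pods = json.loads(pod_list_json)["items"]
    return Counter(
        p["spec"]["nodeName"]
        for p in pods
        if p.get("status", {}).get("phase") == "Running"
    )

# Hypothetical three-pod sample standing in for real kubectl output:
sample = json.dumps({"items": [
    {"spec": {"nodeName": "192.168.1.19"}, "status": {"phase": "Running"}},
    {"spec": {"nodeName": "192.168.1.19"}, "status": {"phase": "Running"}},
    {"spec": {"nodeName": "192.168.1.24"}, "status": {"phase": "Running"}},
]})
print(pods_per_node(sample))  # Counter({'192.168.1.19': 2, '192.168.1.24': 1})
```

In practice you would pipe the live cluster state in, e.g. kubectl get pods -A -o json | python3 count_pods.py, reading the JSON from stdin.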
References
- Kubernetes Documentation - https://kubernetes.io/docs/concepts/scheduling-eviction/
- Descheduler Documentation - https://github.com/kubernetes-sigs/descheduler