Kubernetes scheduling constraints
Affinity and anti-affinity rules allow you to fine-tune your Kubernetes deployments, optimizing resource utilization and enhancing reliability.
Pod Affinity
- Definition: Pod affinity constrains where a Pod can be scheduled based on the labels of Pods already running in a topology domain (a Node, a zone, and so on), rather than on the labels of the Nodes themselves.
- Purpose: It encourages Pods to be colocated on the same Node if they need to communicate frequently over the network.
- Example: Imagine a microservices architecture where two Pods, ServiceA and ServiceB, interact frequently. You can set up pod affinity so that both ServiceA and ServiceB prefer to run on the same Node. This enhances communication efficiency.
- Description: The affinity rule ensures that Pods with a specific label will be scheduled onto a Node that already hosts a Pod with the same label.
The Deployment below ensures that all nginx Pods are scheduled on the same Node, using kubernetes.io/hostname as the topology key.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - nginx
              topologyKey: "kubernetes.io/hostname"
      containers:
        - name: nginx
          image: nginx
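The ServiceA and ServiceB scenario above describes a preference rather than a hard requirement. A minimal sketch of that softer form follows. It assumes ServiceA's Pods carry a hypothetical label app: service-a; the Pod name, labels, and image for ServiceB are placeholders, not part of the original example.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: service-b              # hypothetical name
  labels:
    app: service-b             # hypothetical label
spec:
  affinity:
    podAffinity:
      # Soft rule: the scheduler favors matching Nodes but may use others.
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100          # 1-100; a higher weight counts more when scoring Nodes
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: app
                  operator: In
                  values:
                    - service-a   # assumed label on ServiceA's Pods
            topologyKey: kubernetes.io/hostname
  containers:
    - name: service-b
      image: nginx             # placeholder image
```

Because the rule is only a preference, the Pod still starts when no Node is running ServiceA, which a required rule would not allow.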
Pod Anti-Affinity
- Definition: Pod anti-affinity discourages scheduling Pods onto Nodes that already have Pods with certain labels.
- Purpose: It helps distribute workloads across different Nodes, promoting fault tolerance and resilience.
- Example: Consider a scenario where you have two Pods, Frontend and Backend, serving a web application. You can set up pod anti-affinity so that Frontend and Backend avoid running on the same Node. This way, a single Node failure does not take down both components at once.
- Description: Written as a hard requirement (requiredDuringSchedulingIgnoredDuringExecution), the anti-affinity rule ensures that a Pod with a specific label is not scheduled on a Node that already hosts a Pod with the same label. Written as a preference (preferredDuringSchedulingIgnoredDuringExecution), it only makes such placement less likely.
The Deployment below ensures that no two nginx Pods are scheduled on the same Node, using kubernetes.io/hostname as the topology key. Because the rule is required, the cluster needs at least three schedulable Nodes for the three replicas; otherwise the extra Pods stay Pending.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - nginx
              topologyKey: "kubernetes.io/hostname"
      containers:
        - name: nginx
          image: nginx
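Where a hard rule is too strict, the same idea can be expressed as a preference. The sketch below is an assumption-laden variant of the Deployment above, not a replacement for it: it spreads replicas across zones instead of Nodes and lets the scheduler co-locate Pods when no other zone is available. The weight value is arbitrary.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        podAntiAffinity:
          # Soft rule: avoid zones that already run an nginx Pod when possible.
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100            # arbitrary choice within the allowed 1-100 range
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: nginx
                topologyKey: topology.kubernetes.io/zone
      containers:
        - name: nginx
          image: nginx
```

With this form, scaling beyond the number of zones does not leave Pods Pending; the extra replicas simply share a zone.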
Node Affinity
- Definition: Node affinity constrains which Nodes can receive a Pod by matching labels on those Nodes.
- Purpose: It allows you to specify an affinity toward a group of Nodes based on their labels.
- Example: Suppose you have a set of high-memory Nodes labeled memory=high. You want to run memory-intensive Pods on these Nodes. You can define node affinity on those Pods so that they are scheduled onto Nodes carrying the memory=high label.
- Description: Node affinity can be a hard requirement (requiredDuringSchedulingIgnoredDuringExecution) or a preference (preferredDuringSchedulingIgnoredDuringExecution). As a preference, it tells the scheduler to use a Node with the specified characteristics if one is available.
The Pod below is scheduled only on a Node with the disktype=ssd label.

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: disktype
                operator: In
                values:
                  - ssd
  containers:
    - name: nginx
      image: nginx
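The high-memory example from the list above can be sketched in the preferred form. This assumes the Nodes were labeled memory=high beforehand; the Pod name, weight, and image are illustrative only.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: memory-intensive-app   # hypothetical name
spec:
  affinity:
    nodeAffinity:
      # Soft rule: favor high-memory Nodes, but do not block scheduling without them.
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 50           # arbitrary choice within the allowed 1-100 range
          preference:
            matchExpressions:
              - key: memory
                operator: In
                values:
                  - high
  containers:
    - name: app
      image: nginx             # placeholder image
```

The label itself would typically be applied with something like `kubectl label nodes <node-name> memory=high`, where the Node name is whatever your cluster uses.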
Node Anti-Affinity
- Definition: Kubernetes has no dedicated nodeAntiAffinity field. Node anti-affinity is expressed with nodeAffinity and a negative operator (NotIn or DoesNotExist), which keeps Pods away from Nodes that carry specific labels.
- Purpose: It keeps workloads off Nodes that are unsuitable for them or reserved for other uses.
- Example: Imagine you have GPU Nodes, labeled gpu=true, that should stay free for GPU workloads. You can set up node anti-affinity so that ordinary Pods avoid those Nodes. (To keep CPU-intensive Pods away from each other, use pod anti-affinity instead, since that rule is about other Pods rather than Node labels.)
- Description: Node anti-affinity acts as a repelling rule. As a required rule, it prevents Pods from being scheduled on Nodes with the specified label; as a preferred rule, it only makes that less probable.
The Pod below avoids Nodes with the gpu=true label.

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  affinity:
    nodeAffinity:                  # anti-affinity is written under nodeAffinity
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: gpu
                operator: NotIn    # negative operator: exclude matching Nodes
                values:
                  - "true"         # label values are strings, so quote it
  containers:
    - name: nginx
      image: nginx
requiredDuringSchedulingIgnoredDuringExecution
requiredDuringSchedulingIgnoredDuringExecution can be broken into two parts:
- requiredDuringScheduling: This component implies that a pod should be scheduled on a node only if it satisfies certain criteria. In other words, the node must meet specific conditions for the pod to be placed there during the initial scheduling process.
- IgnoredDuringExecution: This part comes into play after a pod is already scheduled and running on a node.
  - If any changes occur in the labels on that node during the pod’s execution (for example, due to an update), the existing pod should not be evicted based on these label changes.
  - Instead, only newly scheduled pods should be required to match the updated criteria.
In summary, requiredDuringSchedulingIgnoredDuringExecution ensures that pods are initially placed on suitable nodes and avoids unnecessary evictions during runtime due to label changes on the node. It’s a way to maintain stability and predictability in your Kubernetes cluster.
topologyKey
topologyKey represents the key of node labels that the scheduler uses to determine the topology domain for pod placement. For example, when using pod affinity, the scheduler ensures that a pod is scheduled in the same domain (topology) as other pods that match a specific expression.
Common values for topologyKey include:
- topology.kubernetes.io/zone: Pods are scheduled in the same zone as other pods with matching labels.
- kubernetes.io/hostname: Pods are scheduled on the same node (hostname) as other pods with matching labels.
apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: security
                operator: In
                values:
                  - S1
          # Place this Pod in a zone that already runs a Pod labeled security=S1.
          topologyKey: topology.kubernetes.io/zone
  containers:
    - name: with-pod-affinity
      image: k8s.gcr.io/pause:2.0
topologySpreadConstraints
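topologySpreadConstraints allow you to control how Pods are distributed across your cluster among different failure domains such as regions, zones, nodes, and other user-defined topology domains. The goal is to achieve both high availability and efficient resource utilization.
For example, to avoid depending on a single node, the YAML below spreads matching pods evenly across all nodes. The app: example label and the container are illustrative additions that make the manifest complete.

apiVersion: v1
kind: Pod
metadata:
  name: example-pod
  labels:
    app: example              # illustrative label counted by the constraint
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: DoNotSchedule
      labelSelector:          # selects the Pods whose spread is measured
        matchLabels:
          app: example
  containers:                 # a Pod needs at least one container
    - name: example
      image: nginx            # placeholder image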
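maxSkew defines the maximum allowed imbalance in the number of matching pods across topology domains, which keeps the spread even and so improves reliability and performance. With whenUnsatisfiable: DoNotSchedule, the skew of a domain is its number of matching pods minus the smallest number found in any eligible domain. Setting maxSkew to 1 therefore means no domain may hold more than one pod above the least-loaded domain. For instance, if three nodes hold 2, 2, and 1 matching pods, the next pod can only go to the third node, because placing it on either of the others would give a skew of 2.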
topologySpreadConstraints are ideal for hierarchical topologies (where nodes are spread across logical domains), while pod/node affinity is suitable for linear topologies (where all nodes are on the same level). topologySpreadConstraints provide more expressive control over pod scheduling across broader topological domains, and combining them with other affinity rules allows you to fine-tune your workload placement.
In this example, the pods of my-app are spread across different zones (based on the topology.kubernetes.io/zone label).
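A minimal sketch of such a zone-level constraint is given here as an illustration. It assumes a Deployment named my-app whose Pods carry the label app: my-app; the replica count, label, and image are hypothetical.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 6                  # hypothetical replica count
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app            # assumed label; the constraint below counts Pods with it
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: my-app
      containers:
        - name: my-app
          image: nginx         # placeholder image
```

Setting whenUnsatisfiable to ScheduleAnyway instead would turn the constraint into a preference that the scheduler tries to honor without ever blocking a Pod.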
Note that a topologySpreadConstraints entry can carry a labelSelector, and the behavior differs depending on whether it is present.
1. With labelSelector:
- When you define a topologySpreadConstraints entry with a labelSelector, it allows you to select specific Pods based on their labels. These selected Pods are then counted to determine the number of Pods in their corresponding topology domain (such as nodes, zones, or other user-defined domains).
- The labelSelector helps you control the spreading behavior of your Pods across different failure domains. You can ensure that Pods with specific labels are distributed evenly or according to your desired criteria.
- For example, if you want to avoid running multiple Pods with the same label on a single node, you can use a labelSelector to enforce this constraint.
2. Without labelSelector:
- In cluster-level default constraints, which are set in the scheduler configuration rather than in a Pod spec, the labelSelector is left empty and the selector is calculated automatically from the Services, ReplicationControllers, ReplicaSets, or StatefulSets that the Pod belongs to.
- In a constraint written directly in a Pod spec, an omitted labelSelector matches no Pods, so there is nothing to count and the constraint does not spread anything.
- The automatic approach is convenient, but it does not give fine-grained control over which Pods are counted.
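In upstream Kubernetes, selectors derived automatically from Services, ReplicaSets, and similar owners belong to cluster-level default constraints, which live in the scheduler configuration rather than in a Pod spec. The fragment below is a sketch of that configuration. It assumes the kubescheduler.config.k8s.io/v1 API version, which depends on your cluster version, and the constraint values are illustrative.

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: PodTopologySpread
        args:
          # Applied to Pods that define no topologySpreadConstraints of their own.
          defaultConstraints:
            - maxSkew: 1
              topologyKey: topology.kubernetes.io/zone
              whenUnsatisfiable: ScheduleAnyway
          defaultingType: List
```

Default constraints carry no labelSelector of their own, which is why the selector has to be derived from the workload that owns the Pod.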
Taints and Tolerations
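Taints are applied to nodes to mark them with a key, an optional value, and an effect. With the NoSchedule effect, the scheduler will not place a pod on the tainted node unless the pod has a matching toleration; the other effects are described below.
Tolerations are set on pods to allow them to be scheduled onto nodes with matching taints. For NoExecute taints, the optional tolerationSeconds field also defines how long a pod may keep running on the node after the taint is added.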
Add a taint with the NoSchedule effect to a node:
kubectl taint nodes node1 key1=value1:NoSchedule
The allowed values for the effect field are:
- NoExecute: This affects pods that are already running on the node as follows:
  - Pods that do not tolerate the taint are evicted immediately.
  - Pods that tolerate the taint without specifying tolerationSeconds in their toleration specification remain bound forever.
  - Pods that tolerate the taint with a specified tolerationSeconds remain bound for the specified amount of time. After that time elapses, the node lifecycle controller evicts the Pods from the node.
- NoSchedule: No new Pods will be scheduled on the tainted node unless they have a matching toleration. Pods currently running on the node are not evicted.
- PreferNoSchedule: A "preference" or "soft" version of NoSchedule. The control plane will try to avoid placing a Pod that does not tolerate the taint on the node, but it is not guaranteed.
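The tolerationSeconds behavior under NoExecute can be sketched as follows. The example assumes a node tainted with key1=value1:NoExecute; the Pod name, the one-hour value, and the image are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tolerant-pod           # hypothetical name
spec:
  tolerations:
    - key: "key1"
      operator: "Equal"
      value: "value1"
      effect: "NoExecute"
      # Stay bound for up to one hour after the taint appears, then get evicted.
      tolerationSeconds: 3600
  containers:
    - name: app
      image: nginx             # placeholder image
```

Leaving out tolerationSeconds would let the Pod remain on the node indefinitely, matching the second case in the list above.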
Remove taint from a node.
kubectl taint nodes node1 key1=value1:NoSchedule-
Get the node's taint info
kubectl get node/node1 -o json | jq .spec.taints
Tolerations are usually declared in a Pod or Deployment spec. In the YAML below, pods tolerate the taint with key "hardware" and value "gpu", so they can be scheduled on nodes that carry it.
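A minimal sketch of such a declaration, assuming the taint uses the NoSchedule effect (the text does not say which one). The Pod name and image are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload           # hypothetical name
spec:
  tolerations:
    - key: "hardware"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"     # assumed effect; must match the taint on the node
  containers:
    - name: app
      image: nginx             # placeholder image
```

The matching taint would be added with `kubectl taint nodes node1 hardware=gpu:NoSchedule`. A toleration only permits scheduling onto the tainted node; it does not attract the Pod there, so pair it with node affinity if the Pod must land on GPU nodes.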