Kubernetes scheduling constraints

Posted by TigerLu on 2023-12-26 20:11 in #Other

Affinity and anti-affinity rules allow you to fine-tune your Kubernetes deployments, optimizing resource utilization and enhancing reliability.

Pod Affinity

Definition: Pod affinity expresses scheduling constraints based on the labels of Pods that are already running on candidate Nodes.
Purpose: It encourages Pods to be colocated on the same Node (or, more generally, in the same topology domain) if they need to communicate frequently over the network.
Example: Imagine a microservices architecture where two Pods, ServiceA and ServiceB, interact frequently. You can set up pod affinity so that both ServiceA and ServiceB prefer to run on the same Node. This keeps their traffic on a single Node and can reduce latency.
Description: With a required rule, a Pod is scheduled only into a topology domain that already hosts a Pod matching the rule's labelSelector.

The Deployment below requires every nginx Pod to land on a Node that already runs an nginx Pod, using the kubernetes.io/hostname label as the topology domain, so all replicas end up on the same Node.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - nginx
              topologyKey: "kubernetes.io/hostname"
      containers:
        - name: nginx
          image: nginx
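
The ServiceA/ServiceB scenario above describes a preference rather than a hard requirement. A minimal sketch of how that could be written, assuming ServiceA's Pods carry the hypothetical label app: service-a (the names, labels, and image are placeholders, not taken from a real deployment):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: service-b              # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: service-b
  template:
    metadata:
      labels:
        app: service-b
    spec:
      affinity:
        podAffinity:
          # Soft rule: favor Nodes that already run a service-a Pod,
          # but still schedule elsewhere if no such Node has room.
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100      # 1-100; a higher weight counts more in scoring
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: service-a   # assumed label on ServiceA's Pods
                topologyKey: kubernetes.io/hostname
      containers:
        - name: service-b
          image: nginx         # placeholder image
```

Unlike the required rule in the nginx Deployment, this rule never leaves a Pod Pending just because no Node with a service-a Pod is available.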

Pod Anti-Affinity

Definition: Pod anti-affinity discourages scheduling Pods onto Nodes that already have Pods with certain labels.
Purpose: It helps distribute workloads across different Nodes, promoting fault tolerance and resilience.
Example: Consider a scenario where you have two Pods, Frontend and Backend, serving a web application. You can set up pod anti-affinity so that Frontend and Backend avoid running on the same Node. This way, if one Node fails, the other Node can still handle requests.
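Description: With a required rule, a Pod is not scheduled onto a Node that already hosts a Pod matching the rule's labelSelector. With a preferred rule, the scheduler only tries to avoid such Nodes.

This ensures that no two nginx Pods are scheduled on the same Node based on the hostname label. With replicas: 3, the cluster needs at least three schedulable Nodes; otherwise the extra Pods stay Pending.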

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - nginx
              topologyKey: "kubernetes.io/hostname"
      containers:
        - name: nginx
          image: nginx
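
The Frontend/Backend scenario above selects a different workload's Pods rather than the Deployment's own. A hedged sketch, assuming the Backend Pods carry the hypothetical label app: backend (names and image are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend               # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
    spec:
      affinity:
        podAntiAffinity:
          # Hard rule: never place a frontend Pod on a Node
          # that already runs a Pod labeled app=backend.
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: backend       # assumed label on the Backend Pods
              topologyKey: kubernetes.io/hostname
      containers:
        - name: frontend
          image: nginx         # placeholder image
```

The scheduler also honors the required anti-affinity of Pods that are already running, so Backend Pods are kept off Nodes that host these Frontend Pods even if the Backend Deployment declares no rule of its own.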

Node Affinity

Definition: Node affinity constrains which Nodes can receive a Pod by matching labels on those Nodes.
Purpose: It allows you to specify an affinity toward a group of Nodes based on their labels.
Example: Suppose you have a set of high-memory Nodes labeled as memory=high. You want to run memory-intensive Pods on these Nodes. You can define node affinity in those Pods so that they are scheduled only on Nodes carrying the memory=high label.
Description: A required node affinity rule is a hard constraint: the Pod is scheduled only on Nodes that match. A preferred rule is a soft constraint: the scheduler favors Nodes with the specified characteristics but falls back to other Nodes if none is available.

This ensures that the nginx Pod is scheduled only on a Node with the disktype=ssd label.

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: disktype
                operator: In
                values:
                  - ssd
  containers:
    - name: nginx
      image: nginx
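
For the memory=high scenario, the soft form of node affinity might look like the sketch below. The Pod name, weight, and image are illustrative assumptions; only the memory=high label comes from the example above.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: memory-hungry          # hypothetical name
spec:
  affinity:
    nodeAffinity:
      # Soft rule: favor Nodes labeled memory=high, but allow
      # other Nodes when no high-memory Node can take the Pod.
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 50           # 1-100; relative importance among preferences
          preference:
            matchExpressions:
              - key: memory
                operator: In
                values:
                  - high
  containers:
    - name: app
      image: nginx             # placeholder image
```

The label itself has to exist on the Nodes first, for example via kubectl label nodes <node-name> memory=high.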

Node Anti-Affinity

Definition: Kubernetes has no dedicated nodeAntiAffinity field. Node anti-affinity is expressed through nodeAffinity with the negative operators NotIn or DoesNotExist, which keep Pods away from Nodes that carry specific labels.
Purpose: It keeps workloads off Nodes that are unsuitable for them or reserved for other uses.
Example: Imagine a scenario where you have Pods performing CPU-intensive computations that gain nothing from a GPU. You can use node anti-affinity to keep them off Nodes labeled gpu=true, leaving those Nodes free for workloads that need them. (Keeping such Pods apart from each other is a job for pod anti-affinity instead.)
Description: Node anti-affinity acts as a repelling rule. With a required rule, Nodes with the specified label are excluded outright; with a preferred rule, they are only less likely to be chosen.

This ensures that the nginx Pod avoids Nodes with the gpu=true label.

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: gpu
                operator: NotIn    # negative operator: exclude matching Nodes
                values:
                  - "true"         # quoted: label values must be strings
  containers:
    - name: nginx
      image: nginx
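
Node affinity also supports the DoesNotExist operator, which expresses avoidance regardless of the label's value. A short sketch, assuming GPU Nodes carry a gpu label with any value (the Pod name is a placeholder):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-no-gpu           # hypothetical name
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: gpu
                operator: DoesNotExist   # takes no values list
  containers:
    - name: nginx
      image: nginx
```

This variant excludes a Node labeled gpu=yes as well, which a rule matching only the value "true" would let through.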

requiredDuringSchedulingIgnoredDuringExecution

requiredDuringSchedulingIgnoredDuringExecution can be broken into two parts:

requiredDuringScheduling:
- This part means that a pod is scheduled onto a node only if the rule is satisfied. In other words, the node must meet the stated conditions for the pod to be placed there during the initial scheduling process.
IgnoredDuringExecution:
- This part comes into play after a pod is already scheduled and running on a node.
- If the labels on that node change while the pod is running (for example, due to an update) so that the rule no longer holds, the existing pod is not evicted.
- Instead, only newly scheduled pods are required to match the updated criteria.

In summary, requiredDuringSchedulingIgnoredDuringExecution ensures that pods are initially placed on suitable nodes and avoids unnecessary evictions during runtime due to label changes on the node. It’s a way to maintain stability and predictability in your Kubernetes cluster.

topologyKey

topologyKey represents the key of node labels that the scheduler uses to determine the topology domain for pod placement. For example, when using pod affinity, the scheduler ensures that a pod is scheduled in the same domain (topology) as other pods that match a specific expression.

Commonly used topologyKey labels include:

topology.kubernetes.io/zone: Each zone is one domain. With pod affinity, Pods are scheduled in the same zone as other Pods with matching labels.
kubernetes.io/hostname: Each Node is its own domain. With pod affinity, Pods are scheduled on the same Node as other Pods with matching labels.

apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: security
                operator: In
                values:
                  - S1
          topologyKey: topology.kubernetes.io/zone
  containers:
    - name: with-pod-affinity
      image: k8s.gcr.io/pause:2.0

topologySpreadConstraints

topologySpreadConstraints allow you to control how Pods are distributed across your cluster among different failure domains such as regions, zones, nodes, and other user-defined topology domains. The goal is to achieve both high availability and efficient resource utilization.

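For example, they can help avoid dependence on a single Node. The YAML below asks the scheduler to spread Pods labeled app: example evenly across all Nodes.

apiVersion: v1
kind: Pod
metadata:
  name: example-pod
  labels:
    app: example               # assumed label, counted by the constraint below
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: DoNotSchedule
      labelSelector:           # selects the Pods that are counted per Node
        matchLabels:
          app: example
  containers:
    - name: example
      image: nginx             # placeholder image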

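maxSkew helps maintain a more even spread of pods, enhancing reliability and performance in your Kubernetes clusters. It defines the maximum allowed imbalance in the number of matching pods across topology domains: the difference between the number of matching pods in any one domain and the smallest number in any eligible domain. With maxSkew set to 1, no domain may hold more than one pod above the least-loaded domain. With whenUnsatisfiable: DoNotSchedule this is a hard limit; with ScheduleAnyway the scheduler only gives preference to placements that reduce the skew.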
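
As a worked example of how skew is evaluated (the Pod counts are made up for illustration): suppose three Nodes currently run 2, 1, and 1 matching Pods, and maxSkew is 1. The skew of a candidate placement is the count in the chosen domain after placement minus the smallest count in any domain.

```latex
\begin{aligned}
\text{skew}(d) &= \text{count}(d) - \min_{d'} \text{count}(d') \\
\text{place on node 1:}\quad (2 + 1) - 1 &= 2 > 1 \quad \text{(violates maxSkew)} \\
\text{place on node 2:}\quad (1 + 1) - 1 &= 1 \le 1 \quad \text{(allowed)}
\end{aligned}
```

So with DoNotSchedule, the new Pod can only go to node 2 or node 3 in this situation.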

topologySpreadConstraints are ideal for hierarchical topologies (where nodes are spread across logical domains), while pod/node affinity is suitable for linear topologies (where all nodes are on the same level). topologySpreadConstraints provide more expressive control over pod scheduling across broader topological domains, and combining them with other affinity rules allows you to fine-tune your workload placement.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 5
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - my-app
              topologyKey: kubernetes.io/hostname
      topologySpreadConstraints:
        - labelSelector:
            matchLabels:
              app: my-app
          maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
      containers:
        - name: my-app
          image: nginx         # placeholder image (assumption); substitute your own

In this example, the pods of my-app are spread across zones (based on the topology.kubernetes.io/zone label) on a best-effort basis, while the required pod anti-affinity rule keeps any two replicas off the same node.

Notice the labelSelector inside topologySpreadConstraints. The constraint behaves differently with and without it.

1. With labelSelector:

When you define a topologySpreadConstraints with a labelSelector, it allows you to select specific Pods based on their labels. These selected Pods are then counted to determine the number of Pods in their corresponding topology domain (such as nodes, zones, or other user-defined domains).
The labelSelector helps you control the spreading behavior of your Pods across different failure domains. You can ensure that Pods with specific labels are distributed evenly or according to your desired criteria.
For example, to keep the Pods of one application from piling up on a single node, select them with a labelSelector and spread them with topologyKey: kubernetes.io/hostname.

2. Without labelSelector:

When a constraint omits the labelSelector, it selects no Pods, so there is nothing to count and the constraint has no practical effect.
The automatic behavior applies in a different case: when a Pod defines no topologySpreadConstraints at all, cluster-level default constraints are applied, and their selectors are calculated from the Services, ReplicationControllers, ReplicaSets, or StatefulSets that the Pod belongs to.
It’s a more automatic approach, but it does not provide fine-grained control over which Pods are counted.

Taints and Tolerations

Taints are applied to nodes and consist of a key, an optional value, and an effect. Depending on the effect, a tainted node repels pods that do not have a matching toleration.

Tolerations are set on pods to allow them to be scheduled onto nodes with matching taints. A toleration permits scheduling but does not guarantee it. For NoExecute taints, tolerationSeconds additionally defines how long a pod may stay on the node after the taint is added.

Add a taint with the NoSchedule effect to a node:

kubectl taint nodes node1 key1=value1:NoSchedule

The allowed values for the effect field are:

NoExecute: This affects pods that are already running on the node as follows:

- Pods that do not tolerate the taint are evicted immediately.
- Pods that tolerate the taint without specifying tolerationSeconds in their toleration specification remain bound forever.
- Pods that tolerate the taint with a specified tolerationSeconds remain bound for the specified amount of time. After that time elapses, the node lifecycle controller evicts the Pods from the node.

NoSchedule: No new Pods will be scheduled on the tainted node unless they have a matching toleration. Pods currently running on the node are not evicted.

PreferNoSchedule: PreferNoSchedule is a "preference" or "soft" version of NoSchedule. The control plane will try to avoid placing a Pod that does not tolerate the taint on the node, but it is not guaranteed.
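
To illustrate the NoExecute case with tolerationSeconds, here is a minimal sketch of a Pod that would stay for one hour on a node after it is tainted with key1=value1:NoExecute. The Pod name, the one-hour value, and the image are assumptions chosen for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tolerant-pod           # hypothetical name
spec:
  tolerations:
    - key: "key1"
      operator: "Equal"        # key and value must both match the taint
      value: "value1"
      effect: "NoExecute"
      tolerationSeconds: 3600  # evicted one hour after the taint appears
  containers:
    - name: app
      image: nginx             # placeholder image
```

tolerationSeconds is only accepted together with effect NoExecute; omitting it lets the Pod stay on the tainted node indefinitely.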

Remove the taint from the node (note the trailing minus sign):

kubectl taint nodes node1 key1=value1:NoSchedule-

Get the node's taint info:

kubectl get node/node1 -o json | jq .spec.taints
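
For orientation, taints are stored on the Node object under spec.taints. While the key1=value1:NoSchedule taint from the earlier command is in place, the relevant fragment would look roughly like this in YAML form (all other Node fields omitted):

```yaml
spec:
  taints:
    - key: key1
      value: value1
      effect: NoSchedule
```

Once the taint has been removed, the list no longer contains this entry.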

Tolerations are usually declared in a Pod or Deployment spec. In the YAML below, the pods tolerate a taint with key "hardware", value "gpu", and effect NoSchedule, and node affinity restricts them to the two GPU nodes.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-deployment
spec:
  replicas: 3
  selector:                    # required for apps/v1; must match the template labels
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: ai
          image: skynet:1997-08-29
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.io/hostname
                    operator: In
                    values: ["big-gpu", "expensive-gpu"]
      tolerations:
        - key: "hardware"
          operator: "Equal"
          value: "gpu"
          effect: "NoSchedule"
          # tolerationSeconds is valid only with effect NoExecute, so it is not set here