This the multi-page printable view of this section. Click here to print.
Run Applications
- 1: Run a Stateless Application Using a Deployment
- 2: Run a Single-Instance Stateful Application
- 3: Run a Replicated Stateful Application
- 4: Scale a StatefulSet
- 5: Delete a StatefulSet
- 6: Force Delete StatefulSet Pods
- 7: Horizontal Pod Autoscaler
- 8: Horizontal Pod Autoscaler Walkthrough
- 9: Specifying a Disruption Budget for your Application
- 10: Accessing the Kubernetes API from a Pod
1 - Run a Stateless Application Using a Deployment
This page shows how to run an application using a Kubernetes Deployment object.
Objectives
- Create an nginx deployment.
- Use kubectl to list information about the deployment.
- Update the deployment.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
Your Kubernetes server must be at or later than version v1.9. To check the version, enterkubectl version
.Creating and exploring an nginx deployment
You can run an application by creating a Kubernetes Deployment object, and you can describe a Deployment in a YAML file. For example, this YAML file describes a Deployment that runs the nginx:1.14.2 Docker image:
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx-deployment
spec:
selector:
matchLabels:
app: nginx
replicas: 2 # tells deployment to run 2 pods matching the template
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: nginx
image: nginx:1.14.2
ports:
- containerPort: 80
Create a Deployment based on the YAML file:
kubectl apply -f https://k8s.io/examples/application/deployment.yaml
Display information about the Deployment:
kubectl describe deployment nginx-deployment
The output is similar to this:
Name: nginx-deployment Namespace: default CreationTimestamp: Tue, 30 Aug 2016 18:11:37 -0700 Labels: app=nginx Annotations: deployment.kubernetes.io/revision=1 Selector: app=nginx Replicas: 2 desired | 2 updated | 2 total | 2 available | 0 unavailable StrategyType: RollingUpdate MinReadySeconds: 0 RollingUpdateStrategy: 1 max unavailable, 1 max surge Pod Template: Labels: app=nginx Containers: nginx: Image: nginx:1.14.2 Port: 80/TCP Environment: <none> Mounts: <none> Volumes: <none> Conditions: Type Status Reason ---- ------ ------ Available True MinimumReplicasAvailable Progressing True NewReplicaSetAvailable OldReplicaSets: <none> NewReplicaSet: nginx-deployment-1771418926 (2/2 replicas created) No events.
List the Pods created by the deployment:
kubectl get pods -l app=nginx
The output is similar to this:
NAME READY STATUS RESTARTS AGE nginx-deployment-1771418926-7o5ns 1/1 Running 0 16h nginx-deployment-1771418926-r18az 1/1 Running 0 16h
Display information about a Pod:
kubectl describe pod <pod-name>
where
<pod-name>
is the name of one of your Pods.
Updating the deployment
You can update the deployment by applying a new YAML file. This YAML file specifies that the deployment should be updated to use nginx 1.16.1.
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx-deployment
spec:
selector:
matchLabels:
app: nginx
replicas: 2
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: nginx
image: nginx:1.16.1 # Update the version of nginx from 1.14.2 to 1.16.1
ports:
- containerPort: 80
Apply the new YAML file:
kubectl apply -f https://k8s.io/examples/application/deployment-update.yaml
Watch the deployment create pods with new names and delete the old pods:
kubectl get pods -l app=nginx
Scaling the application by increasing the replica count
You can increase the number of Pods in your Deployment by applying a new YAML
file. This YAML file sets replicas
to 4, which specifies that the Deployment
should have four Pods:
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx-deployment
spec:
selector:
matchLabels:
app: nginx
replicas: 4 # Update the replicas from 2 to 4
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: nginx
image: nginx:1.14.2
ports:
- containerPort: 80
Apply the new YAML file:
kubectl apply -f https://k8s.io/examples/application/deployment-scale.yaml
Verify that the Deployment has four Pods:
kubectl get pods -l app=nginx
The output is similar to this:
NAME READY STATUS RESTARTS AGE nginx-deployment-148880595-4zdqq 1/1 Running 0 25s nginx-deployment-148880595-6zgi1 1/1 Running 0 25s nginx-deployment-148880595-fxcez 1/1 Running 0 2m nginx-deployment-148880595-rwovn 1/1 Running 0 2m
Deleting a deployment
Delete the deployment by name:
kubectl delete deployment nginx-deployment
ReplicationControllers -- the Old Way
The preferred way to create a replicated application is to use a Deployment, which in turn uses a ReplicaSet. Before the Deployment and ReplicaSet were added to Kubernetes, replicated applications were configured using a ReplicationController.
What's next
- Learn more about Deployment objects.
2 - Run a Single-Instance Stateful Application
This page shows you how to run a single-instance stateful application in Kubernetes using a PersistentVolume and a Deployment. The application is MySQL.
Objectives
- Create a PersistentVolume referencing a disk in your environment.
- Create a MySQL Deployment.
- Expose MySQL to other pods in the cluster at a known DNS name.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enterkubectl version
.You need to either have a dynamic PersistentVolume provisioner with a default StorageClass, or statically provision PersistentVolumes yourself to satisfy the PersistentVolumeClaims used here.
Deploy MySQL
You can run a stateful application by creating a Kubernetes Deployment and connecting it to an existing PersistentVolume using a PersistentVolumeClaim. For example, this YAML file describes a Deployment that runs MySQL and references the PersistentVolumeClaim. The file defines a volume mount for /var/lib/mysql, and then creates a PersistentVolumeClaim that looks for a 20G volume. This claim is satisfied by any existing volume that meets the requirements, or by a dynamic provisioner.
Note: The password is defined in the config yaml, and this is insecure. See Kubernetes Secrets for a secure solution.
apiVersion: v1
kind: Service
metadata:
name: mysql
spec:
ports:
- port: 3306
selector:
app: mysql
clusterIP: None
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: mysql
spec:
selector:
matchLabels:
app: mysql
strategy:
type: Recreate
template:
metadata:
labels:
app: mysql
spec:
containers:
- image: mysql:5.6
name: mysql
env:
# Use secret in real usage
- name: MYSQL_ROOT_PASSWORD
value: password
ports:
- containerPort: 3306
name: mysql
volumeMounts:
- name: mysql-persistent-storage
mountPath: /var/lib/mysql
volumes:
- name: mysql-persistent-storage
persistentVolumeClaim:
claimName: mysql-pv-claim
apiVersion: v1
kind: PersistentVolume
metadata:
name: mysql-pv-volume
labels:
type: local
spec:
storageClassName: manual
capacity:
storage: 20Gi
accessModes:
- ReadWriteOnce
hostPath:
path: "/mnt/data"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: mysql-pv-claim
spec:
storageClassName: manual
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 20Gi
Deploy the PV and PVC of the YAML file:
kubectl apply -f https://k8s.io/examples/application/mysql/mysql-pv.yaml
Deploy the contents of the YAML file:
kubectl apply -f https://k8s.io/examples/application/mysql/mysql-deployment.yaml
Display information about the Deployment:
kubectl describe deployment mysql
The output is similar to this:
Name: mysql Namespace: default CreationTimestamp: Tue, 01 Nov 2016 11:18:45 -0700 Labels: app=mysql Annotations: deployment.kubernetes.io/revision=1 Selector: app=mysql Replicas: 1 desired | 1 updated | 1 total | 0 available | 1 unavailable StrategyType: Recreate MinReadySeconds: 0 Pod Template: Labels: app=mysql Containers: mysql: Image: mysql:5.6 Port: 3306/TCP Environment: MYSQL_ROOT_PASSWORD: password Mounts: /var/lib/mysql from mysql-persistent-storage (rw) Volumes: mysql-persistent-storage: Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace) ClaimName: mysql-pv-claim ReadOnly: false Conditions: Type Status Reason ---- ------ ------ Available False MinimumReplicasUnavailable Progressing True ReplicaSetUpdated OldReplicaSets: <none> NewReplicaSet: mysql-63082529 (1/1 replicas created) Events: FirstSeen LastSeen Count From SubobjectPath Type Reason Message --------- -------- ----- ---- ------------- -------- ------ ------- 33s 33s 1 {deployment-controller } Normal ScalingReplicaSet Scaled up replica set mysql-63082529 to 1
List the pods created by the Deployment:
kubectl get pods -l app=mysql
The output is similar to this:
NAME READY STATUS RESTARTS AGE mysql-63082529-2z3ki 1/1 Running 0 3m
Inspect the PersistentVolumeClaim:
kubectl describe pvc mysql-pv-claim
The output is similar to this:
Name: mysql-pv-claim Namespace: default StorageClass: Status: Bound Volume: mysql-pv-volume Labels: <none> Annotations: pv.kubernetes.io/bind-completed=yes pv.kubernetes.io/bound-by-controller=yes Capacity: 20Gi Access Modes: RWO Events: <none>
Accessing the MySQL instance
The preceding YAML file creates a service that
allows other Pods in the cluster to access the database. The Service option
clusterIP: None
lets the Service DNS name resolve directly to the
Pod's IP address. This is optimal when you have only one Pod
behind a Service and you don't intend to increase the number of Pods.
Run a MySQL client to connect to the server:
kubectl run -it --rm --image=mysql:5.6 --restart=Never mysql-client -- mysql -h mysql -ppassword
This command creates a new Pod in the cluster running a MySQL client and connects it to the server through the Service. If it connects, you know your stateful MySQL database is up and running.
Waiting for pod default/mysql-client-274442439-zyp6i to be running, status is Pending, pod ready: false
If you don't see a command prompt, try pressing enter.
mysql>
Updating
The image or any other part of the Deployment can be updated as usual
with the kubectl apply
command. Here are some precautions that are
specific to stateful apps:
- Don't scale the app. This setup is for single-instance apps only. The underlying PersistentVolume can only be mounted to one Pod. For clustered stateful apps, see the StatefulSet documentation.
- Use
strategy:
type: Recreate
in the Deployment configuration YAML file. This instructs Kubernetes to not use rolling updates. Rolling updates will not work, as you cannot have more than one Pod running at a time. TheRecreate
strategy will stop the first pod before creating a new one with the updated configuration.
Deleting a deployment
Delete the deployed objects by name:
kubectl delete deployment,svc mysql
kubectl delete pvc mysql-pv-claim
kubectl delete pv mysql-pv-volume
If you manually provisioned a PersistentVolume, you also need to manually delete it, as well as release the underlying resource. If you used a dynamic provisioner, it automatically deletes the PersistentVolume when it sees that you deleted the PersistentVolumeClaim. Some dynamic provisioners (such as those for EBS and PD) also release the underlying resource upon deleting the PersistentVolume.
What's next
Learn more about Deployment objects.
Learn more about Deploying applications
3 - Run a Replicated Stateful Application
This page shows how to run a replicated stateful application using a StatefulSet controller. This application is a replicated MySQL database. The example topology has a single primary server and multiple replicas, using asynchronous row-based replication.
Note: This is not a production configuration. MySQL settings remain on insecure defaults to keep the focus on general patterns for running stateful applications in Kubernetes.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enterkubectl version
.You need to either have a dynamic PersistentVolume provisioner with a default StorageClass, or statically provision PersistentVolumes yourself to satisfy the PersistentVolumeClaims used here.
- This tutorial assumes you are familiar with PersistentVolumes and StatefulSets, as well as other core concepts like Pods, Services, and ConfigMaps.
- Some familiarity with MySQL helps, but this tutorial aims to present general patterns that should be useful for other systems.
- You are using the default namespace or another namespace that does not contain any conflicting objects.
Objectives
- Deploy a replicated MySQL topology with a StatefulSet controller.
- Send MySQL client traffic.
- Observe resistance to downtime.
- Scale the StatefulSet up and down.
Deploy MySQL
The example MySQL deployment consists of a ConfigMap, two Services, and a StatefulSet.
ConfigMap
Create the ConfigMap from the following YAML configuration file:
apiVersion: v1
kind: ConfigMap
metadata:
name: mysql
labels:
app: mysql
data:
primary.cnf: |
# Apply this config only on the primary.
[mysqld]
log-bin
replica.cnf: |
# Apply this config only on replicas.
[mysqld]
super-read-only
kubectl apply -f https://k8s.io/examples/application/mysql/mysql-configmap.yaml
This ConfigMap provides my.cnf
overrides that let you independently control
configuration on the primary MySQL server and replicas.
In this case, you want the primary server to be able to serve replication logs to replicas
and you want replicas to reject any writes that don't come via replication.
There's nothing special about the ConfigMap itself that causes different portions to apply to different Pods. Each Pod decides which portion to look at as it's initializing, based on information provided by the StatefulSet controller.
Services
Create the Services from the following YAML configuration file:
# Headless service for stable DNS entries of StatefulSet members.
apiVersion: v1
kind: Service
metadata:
name: mysql
labels:
app: mysql
spec:
ports:
- name: mysql
port: 3306
clusterIP: None
selector:
app: mysql
---
# Client service for connecting to any MySQL instance for reads.
# For writes, you must instead connect to the primary: mysql-0.mysql.
apiVersion: v1
kind: Service
metadata:
name: mysql-read
labels:
app: mysql
spec:
ports:
- name: mysql
port: 3306
selector:
app: mysql
kubectl apply -f https://k8s.io/examples/application/mysql/mysql-services.yaml
The Headless Service provides a home for the DNS entries that the StatefulSet
controller creates for each Pod that's part of the set.
Because the Headless Service is named mysql
, the Pods are accessible by
resolving <pod-name>.mysql
from within any other Pod in the same Kubernetes
cluster and namespace.
The Client Service, called mysql-read
, is a normal Service with its own
cluster IP that distributes connections across all MySQL Pods that report
being Ready. The set of potential endpoints includes the primary MySQL server and all
replicas.
Note that only read queries can use the load-balanced Client Service. Because there is only one primary MySQL server, clients should connect directly to the primary MySQL Pod (through its DNS entry within the Headless Service) to execute writes.
StatefulSet
Finally, create the StatefulSet from the following YAML configuration file:
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: mysql
spec:
selector:
matchLabels:
app: mysql
serviceName: mysql
replicas: 3
template:
metadata:
labels:
app: mysql
spec:
initContainers:
- name: init-mysql
image: mysql:5.7
command:
- bash
- "-c"
- |
set -ex
# Generate mysql server-id from pod ordinal index.
[[ `hostname` =~ -([0-9]+)$ ]] || exit 1
ordinal=${BASH_REMATCH[1]}
echo [mysqld] > /mnt/conf.d/server-id.cnf
# Add an offset to avoid reserved server-id=0 value.
echo server-id=$((100 + $ordinal)) >> /mnt/conf.d/server-id.cnf
# Copy appropriate conf.d files from config-map to emptyDir.
if [[ $ordinal -eq 0 ]]; then
cp /mnt/config-map/primary.cnf /mnt/conf.d/
else
cp /mnt/config-map/replica.cnf /mnt/conf.d/
fi
volumeMounts:
- name: conf
mountPath: /mnt/conf.d
- name: config-map
mountPath: /mnt/config-map
- name: clone-mysql
image: gcr.io/google-samples/xtrabackup:1.0
command:
- bash
- "-c"
- |
set -ex
# Skip the clone if data already exists.
[[ -d /var/lib/mysql/mysql ]] && exit 0
# Skip the clone on primary (ordinal index 0).
[[ `hostname` =~ -([0-9]+)$ ]] || exit 1
ordinal=${BASH_REMATCH[1]}
[[ $ordinal -eq 0 ]] && exit 0
# Clone data from previous peer.
ncat --recv-only mysql-$(($ordinal-1)).mysql 3307 | xbstream -x -C /var/lib/mysql
# Prepare the backup.
xtrabackup --prepare --target-dir=/var/lib/mysql
volumeMounts:
- name: data
mountPath: /var/lib/mysql
subPath: mysql
- name: conf
mountPath: /etc/mysql/conf.d
containers:
- name: mysql
image: mysql:5.7
env:
- name: MYSQL_ALLOW_EMPTY_PASSWORD
value: "1"
ports:
- name: mysql
containerPort: 3306
volumeMounts:
- name: data
mountPath: /var/lib/mysql
subPath: mysql
- name: conf
mountPath: /etc/mysql/conf.d
resources:
requests:
cpu: 500m
memory: 1Gi
livenessProbe:
exec:
command: ["mysqladmin", "ping"]
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
readinessProbe:
exec:
# Check we can execute queries over TCP (skip-networking is off).
command: ["mysql", "-h", "127.0.0.1", "-e", "SELECT 1"]
initialDelaySeconds: 5
periodSeconds: 2
timeoutSeconds: 1
- name: xtrabackup
image: gcr.io/google-samples/xtrabackup:1.0
ports:
- name: xtrabackup
containerPort: 3307
command:
- bash
- "-c"
- |
set -ex
cd /var/lib/mysql
# Determine binlog position of cloned data, if any.
if [[ -f xtrabackup_slave_info && "x$(<xtrabackup_slave_info)" != "x" ]]; then
# XtraBackup already generated a partial "CHANGE MASTER TO" query
# because we're cloning from an existing replica. (Need to remove the tailing semicolon!)
cat xtrabackup_slave_info | sed -E 's/;$//g' > change_master_to.sql.in
# Ignore xtrabackup_binlog_info in this case (it's useless).
rm -f xtrabackup_slave_info xtrabackup_binlog_info
elif [[ -f xtrabackup_binlog_info ]]; then
# We're cloning directly from primary. Parse binlog position.
[[ `cat xtrabackup_binlog_info` =~ ^(.*?)[[:space:]]+(.*?)$ ]] || exit 1
rm -f xtrabackup_binlog_info xtrabackup_slave_info
echo "CHANGE MASTER TO MASTER_LOG_FILE='${BASH_REMATCH[1]}',\
MASTER_LOG_POS=${BASH_REMATCH[2]}" > change_master_to.sql.in
fi
# Check if we need to complete a clone by starting replication.
if [[ -f change_master_to.sql.in ]]; then
echo "Waiting for mysqld to be ready (accepting connections)"
until mysql -h 127.0.0.1 -e "SELECT 1"; do sleep 1; done
echo "Initializing replication from clone position"
mysql -h 127.0.0.1 \
-e "$(<change_master_to.sql.in), \
MASTER_HOST='mysql-0.mysql', \
MASTER_USER='root', \
MASTER_PASSWORD='', \
MASTER_CONNECT_RETRY=10; \
START SLAVE;" || exit 1
# In case of container restart, attempt this at-most-once.
mv change_master_to.sql.in change_master_to.sql.orig
fi
# Start a server to send backups when requested by peers.
exec ncat --listen --keep-open --send-only --max-conns=1 3307 -c \
"xtrabackup --backup --slave-info --stream=xbstream --host=127.0.0.1 --user=root"
volumeMounts:
- name: data
mountPath: /var/lib/mysql
subPath: mysql
- name: conf
mountPath: /etc/mysql/conf.d
resources:
requests:
cpu: 100m
memory: 100Mi
volumes:
- name: conf
emptyDir: {}
- name: config-map
configMap:
name: mysql
volumeClaimTemplates:
- metadata:
name: data
spec:
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 10Gi
kubectl apply -f https://k8s.io/examples/application/mysql/mysql-statefulset.yaml
You can watch the startup progress by running:
kubectl get pods -l app=mysql --watch
After a while, you should see all 3 Pods become Running:
NAME READY STATUS RESTARTS AGE
mysql-0 2/2 Running 0 2m
mysql-1 2/2 Running 0 1m
mysql-2 2/2 Running 0 1m
Press Ctrl+C to cancel the watch. If you don't see any progress, make sure you have a dynamic PersistentVolume provisioner enabled as mentioned in the prerequisites.
This manifest uses a variety of techniques for managing stateful Pods as part of a StatefulSet. The next section highlights some of these techniques to explain what happens as the StatefulSet creates Pods.
Understanding stateful Pod initialization
The StatefulSet controller starts Pods one at a time, in order by their ordinal index. It waits until each Pod reports being Ready before starting the next one.
In addition, the controller assigns each Pod a unique, stable name of the form
<statefulset-name>-<ordinal-index>
, which results in Pods named mysql-0
,
mysql-1
, and mysql-2
.
The Pod template in the above StatefulSet manifest takes advantage of these properties to perform orderly startup of MySQL replication.
Generating configuration
Before starting any of the containers in the Pod spec, the Pod first runs any Init Containers in the order defined.
The first Init Container, named init-mysql
, generates special MySQL config
files based on the ordinal index.
The script determines its own ordinal index by extracting it from the end of
the Pod name, which is returned by the hostname
command.
Then it saves the ordinal (with a numeric offset to avoid reserved values)
into a file called server-id.cnf
in the MySQL conf.d
directory.
This translates the unique, stable identity provided by the StatefulSet
controller into the domain of MySQL server IDs, which require the same
properties.
The script in the init-mysql
container also applies either primary.cnf
or
replica.cnf
from the ConfigMap by copying the contents into conf.d
.
Because the example topology consists of a single primary MySQL server and any number of
replicas, the script assigns ordinal 0
to be the primary server, and everyone
else to be replicas.
Combined with the StatefulSet controller's
deployment order guarantee,
this ensures the primary MySQL server is Ready before creating replicas, so they can begin
replicating.
Cloning existing data
In general, when a new Pod joins the set as a replica, it must assume the primary MySQL server might already have data on it. It also must assume that the replication logs might not go all the way back to the beginning of time. These conservative assumptions are the key to allow a running StatefulSet to scale up and down over time, rather than being fixed at its initial size.
The second Init Container, named clone-mysql
, performs a clone operation on
a replica Pod the first time it starts up on an empty PersistentVolume.
That means it copies all existing data from another running Pod,
so its local state is consistent enough to begin replicating from the primary server.
MySQL itself does not provide a mechanism to do this, so the example uses a
popular open-source tool called Percona XtraBackup.
During the clone, the source MySQL server might suffer reduced performance.
To minimize impact on the primary MySQL server, the script instructs each Pod to clone
from the Pod whose ordinal index is one lower.
This works because the StatefulSet controller always ensures Pod N
is
Ready before starting Pod N+1
.
Starting replication
After the Init Containers complete successfully, the regular containers run.
The MySQL Pods consist of a mysql
container that runs the actual mysqld
server, and an xtrabackup
container that acts as a
sidecar.
The xtrabackup
sidecar looks at the cloned data files and determines if
it's necessary to initialize MySQL replication on the replica.
If so, it waits for mysqld
to be ready and then executes the
CHANGE MASTER TO
and START SLAVE
commands with replication parameters
extracted from the XtraBackup clone files.
Once a replica begins replication, it remembers its primary MySQL server and
reconnects automatically if the server restarts or the connection dies.
Also, because replicas look for the primary server at its stable DNS name
(mysql-0.mysql
), they automatically find the primary server even if it gets a new
Pod IP due to being rescheduled.
Lastly, after starting replication, the xtrabackup
container listens for
connections from other Pods requesting a data clone.
This server remains up indefinitely in case the StatefulSet scales up, or in
case the next Pod loses its PersistentVolumeClaim and needs to redo the clone.
Sending client traffic
You can send test queries to the primary MySQL server (hostname mysql-0.mysql
)
by running a temporary container with the mysql:5.7
image and running the
mysql
client binary.
kubectl run mysql-client --image=mysql:5.7 -i --rm --restart=Never --\
mysql -h mysql-0.mysql <<EOF
CREATE DATABASE test;
CREATE TABLE test.messages (message VARCHAR(250));
INSERT INTO test.messages VALUES ('hello');
EOF
Use the hostname mysql-read
to send test queries to any server that reports
being Ready:
kubectl run mysql-client --image=mysql:5.7 -i -t --rm --restart=Never --\
mysql -h mysql-read -e "SELECT * FROM test.messages"
You should get output like this:
Waiting for pod default/mysql-client to be running, status is Pending, pod ready: false
+---------+
| message |
+---------+
| hello |
+---------+
pod "mysql-client" deleted
To demonstrate that the mysql-read
Service distributes connections across
servers, you can run SELECT @@server_id
in a loop:
kubectl run mysql-client-loop --image=mysql:5.7 -i -t --rm --restart=Never --\
bash -ic "while sleep 1; do mysql -h mysql-read -e 'SELECT @@server_id,NOW()'; done"
You should see the reported @@server_id
change randomly, because a different
endpoint might be selected upon each connection attempt:
+-------------+---------------------+
| @@server_id | NOW() |
+-------------+---------------------+
| 100 | 2006-01-02 15:04:05 |
+-------------+---------------------+
+-------------+---------------------+
| @@server_id | NOW() |
+-------------+---------------------+
| 102 | 2006-01-02 15:04:06 |
+-------------+---------------------+
+-------------+---------------------+
| @@server_id | NOW() |
+-------------+---------------------+
| 101 | 2006-01-02 15:04:07 |
+-------------+---------------------+
You can press Ctrl+C when you want to stop the loop, but it's useful to keep it running in another window so you can see the effects of the following steps.
Simulating Pod and Node downtime
To demonstrate the increased availability of reading from the pool of replicas
instead of a single server, keep the SELECT @@server_id
loop from above
running while you force a Pod out of the Ready state.
Break the Readiness Probe
The readiness probe
for the mysql
container runs the command mysql -h 127.0.0.1 -e 'SELECT 1'
to make sure the server is up and able to execute queries.
One way to force this readiness probe to fail is to break that command:
kubectl exec mysql-2 -c mysql -- mv /usr/bin/mysql /usr/bin/mysql.off
This reaches into the actual container's filesystem for Pod mysql-2
and
renames the mysql
command so the readiness probe can't find it.
After a few seconds, the Pod should report one of its containers as not Ready,
which you can check by running:
kubectl get pod mysql-2
Look for 1/2
in the READY
column:
NAME READY STATUS RESTARTS AGE
mysql-2 1/2 Running 0 3m
At this point, you should see your SELECT @@server_id
loop continue to run,
although it never reports 102
anymore.
Recall that the init-mysql
script defined server-id
as 100 + $ordinal
,
so server ID 102
corresponds to Pod mysql-2
.
Now repair the Pod and it should reappear in the loop output after a few seconds:
kubectl exec mysql-2 -c mysql -- mv /usr/bin/mysql.off /usr/bin/mysql
Delete Pods
The StatefulSet also recreates Pods if they're deleted, similar to what a ReplicaSet does for stateless Pods.
kubectl delete pod mysql-2
The StatefulSet controller notices that no mysql-2
Pod exists anymore,
and creates a new one with the same name and linked to the same
PersistentVolumeClaim.
You should see server ID 102
disappear from the loop output for a while
and then return on its own.
Drain a Node
If your Kubernetes cluster has multiple Nodes, you can simulate Node downtime (such as when Nodes are upgraded) by issuing a drain.
First determine which Node one of the MySQL Pods is on:
kubectl get pod mysql-2 -o wide
The Node name should show up in the last column:
NAME READY STATUS RESTARTS AGE IP NODE
mysql-2 2/2 Running 0 15m 10.244.5.27 kubernetes-node-9l2t
Then drain the Node by running the following command, which cordons it so
no new Pods may schedule there, and then evicts any existing Pods.
Replace <node-name>
with the name of the Node you found in the last step.
This might impact other applications on the Node, so it's best to only do this in a test cluster.
kubectl drain <node-name> --force --delete-emptydir-data --ignore-daemonsets
Now you can watch as the Pod reschedules on a different Node:
kubectl get pod mysql-2 -o wide --watch
It should look something like this:
NAME READY STATUS RESTARTS AGE IP NODE
mysql-2 2/2 Terminating 0 15m 10.244.1.56 kubernetes-node-9l2t
[...]
mysql-2 0/2 Pending 0 0s <none> kubernetes-node-fjlm
mysql-2 0/2 Init:0/2 0 0s <none> kubernetes-node-fjlm
mysql-2 0/2 Init:1/2 0 20s 10.244.5.32 kubernetes-node-fjlm
mysql-2 0/2 PodInitializing 0 21s 10.244.5.32 kubernetes-node-fjlm
mysql-2 1/2 Running 0 22s 10.244.5.32 kubernetes-node-fjlm
mysql-2 2/2 Running 0 30s 10.244.5.32 kubernetes-node-fjlm
And again, you should see server ID 102
disappear from the
SELECT @@server_id
loop output for a while and then return.
Now uncordon the Node to return it to a normal state:
kubectl uncordon <node-name>
Scaling the number of replicas
With MySQL replication, you can scale your read query capacity by adding replicas. With StatefulSet, you can do this with a single command:
kubectl scale statefulset mysql --replicas=5
Watch the new Pods come up by running:
kubectl get pods -l app=mysql --watch
Once they're up, you should see server IDs 103
and 104
start appearing in
the SELECT @@server_id
loop output.
You can also verify that these new servers have the data you added before they existed:
kubectl run mysql-client --image=mysql:5.7 -i -t --rm --restart=Never --\
mysql -h mysql-3.mysql -e "SELECT * FROM test.messages"
Waiting for pod default/mysql-client to be running, status is Pending, pod ready: false
+---------+
| message |
+---------+
| hello |
+---------+
pod "mysql-client" deleted
Scaling back down is also seamless:
kubectl scale statefulset mysql --replicas=3
Note, however, that while scaling up creates new PersistentVolumeClaims automatically, scaling down does not automatically delete these PVCs. This gives you the choice to keep those initialized PVCs around to make scaling back up quicker, or to extract data before deleting them.
You can see this by running:
kubectl get pvc -l app=mysql
Which shows that all 5 PVCs still exist, despite having scaled the StatefulSet down to 3:
NAME STATUS VOLUME CAPACITY ACCESSMODES AGE
data-mysql-0 Bound pvc-8acbf5dc-b103-11e6-93fa-42010a800002 10Gi RWO 20m
data-mysql-1 Bound pvc-8ad39820-b103-11e6-93fa-42010a800002 10Gi RWO 20m
data-mysql-2 Bound pvc-8ad69a6d-b103-11e6-93fa-42010a800002 10Gi RWO 20m
data-mysql-3 Bound pvc-50043c45-b1c5-11e6-93fa-42010a800002 10Gi RWO 2m
data-mysql-4 Bound pvc-500a9957-b1c5-11e6-93fa-42010a800002 10Gi RWO 2m
If you don't intend to reuse the extra PVCs, you can delete them:
kubectl delete pvc data-mysql-3
kubectl delete pvc data-mysql-4
Cleaning up
Cancel the
SELECT @@server_id
loop by pressing Ctrl+C in its terminal, or running the following from another terminal:kubectl delete pod mysql-client-loop --now
Delete the StatefulSet. This also begins terminating the Pods.
kubectl delete statefulset mysql
Verify that the Pods disappear. They might take some time to finish terminating.
kubectl get pods -l app=mysql
You'll know the Pods have terminated when the above returns:
No resources found.
Delete the ConfigMap, Services, and PersistentVolumeClaims.
kubectl delete configmap,service,pvc -l app=mysql
If you manually provisioned PersistentVolumes, you also need to manually delete them, as well as release the underlying resources. If you used a dynamic provisioner, it automatically deletes the PersistentVolumes when it sees that you deleted the PersistentVolumeClaims. Some dynamic provisioners (such as those for EBS and PD) also release the underlying resources upon deleting the PersistentVolumes.
What's next
- Learn more about scaling a StatefulSet.
- Learn more about debugging a StatefulSet.
- Learn more about deleting a StatefulSet.
- Learn more about force deleting StatefulSet Pods.
- Look in the Helm Charts repository for other stateful application examples.
4 - Scale a StatefulSet
This task shows how to scale a StatefulSet. Scaling a StatefulSet refers to increasing or decreasing the number of replicas.
Before you begin
StatefulSets are only available in Kubernetes version 1.5 or later. To check your version of Kubernetes, run
kubectl version
.Not all stateful applications scale nicely. If you are unsure about whether to scale your StatefulSets, see StatefulSet concepts or StatefulSet tutorial for further information.
You should perform scaling only when you are confident that your stateful application cluster is completely healthy.
Scaling StatefulSets
Use kubectl to scale StatefulSets
First, find the StatefulSet you want to scale.
kubectl get statefulsets <stateful-set-name>
Change the number of replicas of your StatefulSet:
kubectl scale statefulsets <stateful-set-name> --replicas=<new-replicas>
Make in-place updates on your StatefulSets
Alternatively, you can do in-place updates on your StatefulSets.
If your StatefulSet was initially created with kubectl apply
,
update .spec.replicas
of the StatefulSet manifests, and then do a kubectl apply
:
kubectl apply -f <stateful-set-file-updated>
Otherwise, edit that field with kubectl edit
:
kubectl edit statefulsets <stateful-set-name>
Or use kubectl patch
:
kubectl patch statefulsets <stateful-set-name> -p '{"spec":{"replicas":<new-replicas>}}'
Troubleshooting
Scaling down does not work right
You cannot scale down a StatefulSet when any of the stateful Pods it manages is unhealthy. Scaling down only takes place after those stateful Pods become running and ready.
If spec.replicas > 1, Kubernetes cannot determine the reason for an unhealthy Pod. It might be the result of a permanent fault or of a transient fault. A transient fault can be caused by a restart required by upgrading or maintenance.
If the Pod is unhealthy due to a permanent fault, scaling without correcting the fault may lead to a state where the StatefulSet membership drops below a certain minimum number of replicas that are needed to function correctly. This may cause your StatefulSet to become unavailable.
If the Pod is unhealthy due to a transient fault and the Pod might become available again, the transient error may interfere with your scale-up or scale-down operation. Some distributed databases have issues when nodes join and leave at the same time. It is better to reason about scaling operations at the application level in these cases, and perform scaling only when you are sure that your stateful application cluster is completely healthy.
What's next
- Learn more about deleting a StatefulSet.
5 - Delete a StatefulSet
This task shows you how to delete a StatefulSet.
Before you begin
- This task assumes you have an application running on your cluster represented by a StatefulSet.
Deleting a StatefulSet
You can delete a StatefulSet in the same way you delete other resources in Kubernetes: use the kubectl delete
command, and specify the StatefulSet either by file or by name.
kubectl delete -f <file.yaml>
kubectl delete statefulsets <statefulset-name>
You may need to delete the associated headless service separately after the StatefulSet itself is deleted.
kubectl delete service <service-name>
When deleting a StatefulSet through kubectl
, the StatefulSet scales down to 0. All Pods that are part of this workload are also deleted. If you want to delete only the StatefulSet and not the Pods, use --cascade=orphan
.
For example:
kubectl delete -f <file.yaml> --cascade=orphan
By passing --cascade=orphan
to kubectl delete
, the Pods managed by the StatefulSet are left behind even after the StatefulSet object itself is deleted. If the pods have a label app=myapp
, you can then delete them as follows:
kubectl delete pods -l app=myapp
Persistent Volumes
Deleting the Pods in a StatefulSet will not delete the associated volumes. This is to ensure that you have the chance to copy data off the volume before deleting it. Deleting the PVC after the pods have terminated might trigger deletion of the backing Persistent Volumes depending on the storage class and reclaim policy. You should never assume ability to access a volume after claim deletion.
Note: Use caution when deleting a PVC, as it may lead to data loss.
Complete deletion of a StatefulSet
To delete everything in a StatefulSet, including the associated pods, you can run a series of commands similar to the following:
grace=$(kubectl get pods <stateful-set-pod> --template '{{.spec.terminationGracePeriodSeconds}}')
kubectl delete statefulset -l app=myapp
sleep $grace
kubectl delete pvc -l app=myapp
In the example above, the Pods have the label app=myapp
; substitute your own label as appropriate.
Force deletion of StatefulSet pods
If you find that some pods in your StatefulSet are stuck in the 'Terminating' or 'Unknown' states for an extended period of time, you may need to manually intervene to forcefully delete the pods from the apiserver. This is a potentially dangerous task. Refer to Force Delete StatefulSet Pods for details.
What's next
Learn more about force deleting StatefulSet Pods.
6 - Force Delete StatefulSet Pods
This page shows how to delete Pods which are part of a stateful set, and explains the considerations to keep in mind when doing so.
Before you begin
- This is a fairly advanced task and has the potential to violate some of the properties inherent to StatefulSet.
- Before proceeding, make yourself familiar with the considerations enumerated below.
StatefulSet considerations
In normal operation of a StatefulSet, there is never a need to force delete a StatefulSet Pod. The StatefulSet controller is responsible for creating, scaling and deleting members of the StatefulSet. It tries to ensure that the specified number of Pods from ordinal 0 through N-1 are alive and ready. StatefulSet ensures that, at any time, there is at most one Pod with a given identity running in a cluster. This is referred to as at most one semantics provided by a StatefulSet.
Manual force deletion should be undertaken with caution, as it has the potential to violate the at most one semantics inherent to StatefulSet. StatefulSets may be used to run distributed and clustered applications which have a need for a stable network identity and stable storage. These applications often have configuration which relies on an ensemble of a fixed number of members with fixed identities. Having multiple members with the same identity can be disastrous and may lead to data loss (e.g. split brain scenario in quorum-based systems).
Delete Pods
You can perform a graceful pod deletion with the following command:
kubectl delete pods <pod>
For the above to lead to graceful termination, the Pod must not specify a
pod.Spec.TerminationGracePeriodSeconds
of 0. The practice of setting a
pod.Spec.TerminationGracePeriodSeconds
of 0 seconds is unsafe and strongly discouraged
for StatefulSet Pods. Graceful deletion is safe and will ensure that the Pod
shuts down gracefully
before the kubelet deletes the name from the apiserver.
A Pod is not deleted automatically when a node is unreachable. The Pods running on an unreachable Node enter the 'Terminating' or 'Unknown' state after a timeout. Pods may also enter these states when the user attempts graceful deletion of a Pod on an unreachable Node. The only ways in which a Pod in such a state can be removed from the apiserver are as follows:
- The Node object is deleted (either by you, or by the Node Controller).
- The kubelet on the unresponsive Node starts responding, kills the Pod and removes the entry from the apiserver.
- Force deletion of the Pod by the user.
The recommended best practice is to use the first or second approach. If a Node is confirmed to be dead (e.g. permanently disconnected from the network, powered down, etc), then delete the Node object. If the Node is suffering from a network partition, then try to resolve this or wait for it to resolve. When the partition heals, the kubelet will complete the deletion of the Pod and free up its name in the apiserver.
Normally, the system completes the deletion once the Pod is no longer running on a Node, or the Node is deleted by an administrator. You may override this by force deleting the Pod.
Force Deletion
Force deletions do not wait for confirmation from the kubelet that the Pod has been terminated. Irrespective of whether a force deletion is successful in killing a Pod, it will immediately free up the name from the apiserver. This would let the StatefulSet controller create a replacement Pod with that same identity; this can lead to the duplication of a still-running Pod, and if said Pod can still communicate with the other members of the StatefulSet, will violate the at most one semantics that StatefulSet is designed to guarantee.
When you force delete a StatefulSet pod, you are asserting that the Pod in question will never again make contact with other Pods in the StatefulSet and its name can be safely freed up for a replacement to be created.
If you want to delete a Pod forcibly using kubectl version >= 1.5, do the following:
kubectl delete pods <pod> --grace-period=0 --force
If you're using any version of kubectl <= 1.4, you should omit the --force
option and use:
kubectl delete pods <pod> --grace-period=0
If even after these commands the pod is stuck on Unknown
state, use the following command to remove the pod from the cluster:
kubectl patch pod <pod> -p '{"metadata":{"finalizers":null}}'
Always perform force deletion of StatefulSet Pods carefully and with complete knowledge of the risks involved.
What's next
Learn more about debugging a StatefulSet.
7 - Horizontal Pod Autoscaler
The Horizontal Pod Autoscaler automatically scales the number of Pods in a replication controller, deployment, replica set or stateful set based on observed CPU utilization (or, with custom metrics support, on some other application-provided metrics). Note that Horizontal Pod Autoscaling does not apply to objects that can't be scaled, for example, DaemonSets.
The Horizontal Pod Autoscaler is implemented as a Kubernetes API resource and a controller. The resource determines the behavior of the controller. The controller periodically adjusts the number of replicas in a replication controller or deployment to match the observed metrics such as average CPU utilisation, average memory utilisation or any other custom metric to the target specified by the user.
How does the Horizontal Pod Autoscaler work?
The Horizontal Pod Autoscaler is implemented as a control loop, with a period controlled
by the controller manager's --horizontal-pod-autoscaler-sync-period
flag (with a default
value of 15 seconds).
During each period, the controller manager queries the resource utilization against the metrics specified in each HorizontalPodAutoscaler definition. The controller manager obtains the metrics from either the resource metrics API (for per-pod resource metrics), or the custom metrics API (for all other metrics).
For per-pod resource metrics (like CPU), the controller fetches the metrics from the resource metrics API for each Pod targeted by the HorizontalPodAutoscaler. Then, if a target utilization value is set, the controller calculates the utilization value as a percentage of the equivalent resource request on the containers in each Pod. If a target raw value is set, the raw metric values are used directly. The controller then takes the mean of the utilization or the raw value (depending on the type of target specified) across all targeted Pods, and produces a ratio used to scale the number of desired replicas.
Please note that if some of the Pod's containers do not have the relevant resource request set, CPU utilization for the Pod will not be defined and the autoscaler will not take any action for that metric. See the algorithm details section below for more information about how the autoscaling algorithm works.
For per-pod custom metrics, the controller functions similarly to per-pod resource metrics, except that it works with raw values, not utilization values.
For object metrics and external metrics, a single metric is fetched, which describes the object in question. This metric is compared to the target value, to produce a ratio as above. In the
autoscaling/v2beta2
API version, this value can optionally be divided by the number of Pods before the comparison is made.
The HorizontalPodAutoscaler normally fetches metrics from a series of aggregated APIs (metrics.k8s.io
,
custom.metrics.k8s.io
, and external.metrics.k8s.io
). The metrics.k8s.io
API is usually provided by
metrics-server, which needs to be launched separately. See
metrics-server
for instructions. The HorizontalPodAutoscaler can also fetch metrics directly from Heapster.
Note:FEATURE STATE:Kubernetes v1.11 [deprecated]
Fetching metrics from Heapster is deprecated as of Kubernetes 1.11.
See Support for metrics APIs for more details.
The autoscaler accesses corresponding scalable controllers (such as replication controllers, deployments, and replica sets) by using the scale sub-resource. Scale is an interface that allows you to dynamically set the number of replicas and examine each of their current states. More details on scale sub-resource can be found here.
Algorithm Details
From the most basic perspective, the Horizontal Pod Autoscaler controller operates on the ratio between desired metric value and current metric value:
desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )]
For example, if the current metric value is 200m
, and the desired value
is 100m
, the number of replicas will be doubled, since 200.0 / 100.0 == 2.0
If the current value is instead 50m
, we'll halve the number of
replicas, since 50.0 / 100.0 == 0.5
. We'll skip scaling if the ratio is
sufficiently close to 1.0 (within a globally-configurable tolerance, from
the --horizontal-pod-autoscaler-tolerance
flag, which defaults to 0.1).
When a targetAverageValue
or targetAverageUtilization
is specified,
the currentMetricValue
is computed by taking the average of the given
metric across all Pods in the HorizontalPodAutoscaler's scale target.
Before checking the tolerance and deciding on the final values, we take
pod readiness and missing metrics into consideration, however.
All Pods with a deletion timestamp set (i.e. Pods in the process of being shut down) and all failed Pods are discarded.
If a particular Pod is missing metrics, it is set aside for later; Pods with missing metrics will be used to adjust the final scaling amount.
When scaling on CPU, if any pod has yet to become ready (i.e. it's still initializing) or the most recent metric point for the pod was before it became ready, that pod is set aside as well.
Due to technical constraints, the HorizontalPodAutoscaler controller
cannot exactly determine the first time a pod becomes ready when
determining whether to set aside certain CPU metrics. Instead, it
considers a Pod "not yet ready" if it's unready and transitioned to
unready within a short, configurable window of time since it started.
This value is configured with the --horizontal-pod-autoscaler-initial-readiness-delay
flag, and its default is 30
seconds. Once a pod has become ready, it considers any transition to
ready to be the first if it occurred within a longer, configurable time
since it started. This value is configured with the --horizontal-pod-autoscaler-cpu-initialization-period
flag, and its
default is 5 minutes.
The currentMetricValue / desiredMetricValue
base scale ratio is then
calculated using the remaining pods not set aside or discarded from above.
If there were any missing metrics, we recompute the average more conservatively, assuming those pods were consuming 100% of the desired value in case of a scale down, and 0% in case of a scale up. This dampens the magnitude of any potential scale.
Furthermore, if any not-yet-ready pods were present, and we would have scaled up without factoring in missing metrics or not-yet-ready pods, we conservatively assume the not-yet-ready pods are consuming 0% of the desired metric, further dampening the magnitude of a scale up.
After factoring in the not-yet-ready pods and missing metrics, we recalculate the usage ratio. If the new ratio reverses the scale direction, or is within the tolerance, we skip scaling. Otherwise, we use the new ratio to scale.
Note that the original value for the average utilization is reported back via the HorizontalPodAutoscaler status, without factoring in the not-yet-ready pods or missing metrics, even when the new usage ratio is used.
If multiple metrics are specified in a HorizontalPodAutoscaler, this
calculation is done for each metric, and then the largest of the desired
replica counts is chosen. If any of these metrics cannot be converted
into a desired replica count (e.g. due to an error fetching the metrics
from the metrics APIs) and a scale down is suggested by the metrics which
can be fetched, scaling is skipped. This means that the HPA is still capable
of scaling up if one or more metrics give a desiredReplicas
greater than
the current value.
Finally, right before HPA scales the target, the scale recommendation is recorded. The
controller considers all recommendations within a configurable window choosing the
highest recommendation from within that window. This value can be configured using the --horizontal-pod-autoscaler-downscale-stabilization
flag, which defaults to 5 minutes.
This means that scaledowns will occur gradually, smoothing out the impact of rapidly
fluctuating metric values.
API Object
The Horizontal Pod Autoscaler is an API resource in the Kubernetes autoscaling
API group.
The current stable version, which only includes support for CPU autoscaling,
can be found in the autoscaling/v1
API version.
The beta version, which includes support for scaling on memory and custom metrics,
can be found in autoscaling/v2beta2
. The new fields introduced in autoscaling/v2beta2
are preserved as annotations when working with autoscaling/v1
.
When you create a HorizontalPodAutoscaler API object, make sure the name specified is a valid DNS subdomain name. More details about the API object can be found at HorizontalPodAutoscaler Object.
Support for Horizontal Pod Autoscaler in kubectl
Horizontal Pod Autoscaler, like every API resource, is supported in a standard way by kubectl
.
We can create a new autoscaler using kubectl create
command.
We can list autoscalers by kubectl get hpa
and get detailed description by kubectl describe hpa
.
Finally, we can delete an autoscaler using kubectl delete hpa
.
In addition, there is a special kubectl autoscale
command for creating a HorizontalPodAutoscaler object.
For instance, executing kubectl autoscale rs foo --min=2 --max=5 --cpu-percent=80
will create an autoscaler for replication set foo, with target CPU utilization set to 80%
and the number of replicas between 2 and 5.
The detailed documentation of kubectl autoscale
can be found here.
Autoscaling during rolling update
Kubernetes lets you perform a rolling update on a Deployment. In that
case, the Deployment manages the underlying ReplicaSets for you.
When you configure autoscaling for a Deployment, you bind a
HorizontalPodAutoscaler to a single Deployment. The HorizontalPodAutoscaler
manages the replicas
field of the Deployment. The deployment controller is responsible
for setting the replicas
of the underlying ReplicaSets so that they add up to a suitable
number during the rollout and also afterwards.
If you perform a rolling update of a StatefulSet that has an autoscaled number of replicas, the StatefulSet directly manages its set of Pods (there is no intermediate resource similar to ReplicaSet).
Support for cooldown/delay
When managing the scale of a group of replicas using the Horizontal Pod Autoscaler, it is possible that the number of replicas keeps fluctuating frequently due to the dynamic nature of the metrics evaluated. This is sometimes referred to as thrashing.
Starting from v1.6, a cluster operator can mitigate this problem by tuning
the global HPA settings exposed as flags for the kube-controller-manager
component:
Starting from v1.12, a new algorithmic update removes the need for the upscale delay.
--horizontal-pod-autoscaler-downscale-stabilization
: Specifies the duration of the downscale stabilization time window. Horizontal Pod Autoscaler remembers the historical recommended sizes and only acts on the largest size within this time window. The default value is 5 minutes (5m0s
).
Note: When tuning these parameter values, a cluster operator should be aware of the possible consequences. If the delay (cooldown) value is set too long, there could be complaints that the Horizontal Pod Autoscaler is not responsive to workload changes. However, if the delay value is set too short, the scale of the replicas set may keep thrashing as usual.
Support for resource metrics
Any HPA target can be scaled based on the resource usage of the pods in the scaling target.
When defining the pod specification the resource requests like cpu
and memory
should
be specified. This is used to determine the resource utilization and used by the HPA controller
to scale the target up or down. To use resource utilization based scaling specify a metric source
like this:
type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 60
With this metric the HPA controller will keep the average utilization of the pods in the scaling target at 60%. Utilization is the ratio between the current usage of resource to the requested resources of the pod. See Algorithm for more details about how the utilization is calculated and averaged.
Note: Since the resource usages of all the containers are summed up the total pod utilization may not accurately represent the individual container resource usage. This could lead to situations where a single container might be running with high usage and the HPA will not scale out because the overall pod usage is still within acceptable limits.
Container Resource Metrics
Kubernetes v1.20 [alpha]
HorizontalPodAutoscaler
also supports a container metric source where the HPA can track the
resource usage of individual containers across a set of Pods, in order to scale the target resource.
This lets you configure scaling thresholds for the containers that matter most in a particular Pod.
For example, if you have a web application and a logging sidecar, you can scale based on the resource
use of the web application, ignoring the sidecar container and its resource use.
If you revise the target resource to have a new Pod specification with a different set of containers, you should revise the HPA spec if that newly added container should also be used for scaling. If the specified container in the metric source is not present or only present in a subset of the pods then those pods are ignored and the recommendation is recalculated. See Algorithm for more details about the calculation. To use container resources for autoscaling define a metric source as follows:
type: ContainerResource
containerResource:
name: cpu
container: application
target:
type: Utilization
averageUtilization: 60
In the above example the HPA controller scales the target such that the average utilization of the cpu
in the application
container of all the pods is 60%.
Note:If you change the name of a container that a HorizontalPodAutoscaler is tracking, you can make that change in a specific order to ensure scaling remains available and effective whilst the change is being applied. Before you update the resource that defines the container (such as a Deployment), you should update the associated HPA to track both the new and old container names. This way, the HPA is able to calculate a scaling recommendation throughout the update process.
Once you have rolled out the container name change to the workload resource, tidy up by removing the old container name from the HPA specification.
Support for multiple metrics
Kubernetes 1.6 adds support for scaling based on multiple metrics. You can use the autoscaling/v2beta2
API
version to specify multiple metrics for the Horizontal Pod Autoscaler to scale on. Then, the Horizontal Pod
Autoscaler controller will evaluate each metric, and propose a new scale based on that metric. The largest of the
proposed scales will be used as the new scale.
Support for custom metrics
Note: Kubernetes 1.2 added alpha support for scaling based on application-specific metrics using special annotations. Support for these annotations was removed in Kubernetes 1.6 in favor of the new autoscaling API. While the old method for collecting custom metrics is still available, these metrics will not be available for use by the Horizontal Pod Autoscaler, and the former annotations for specifying which custom metrics to scale on are no longer honored by the Horizontal Pod Autoscaler controller.
Kubernetes 1.6 adds support for making use of custom metrics in the Horizontal Pod Autoscaler.
You can add custom metrics for the Horizontal Pod Autoscaler to use in the autoscaling/v2beta2
API.
Kubernetes then queries the new custom metrics API to fetch the values of the appropriate custom metrics.
See Support for metrics APIs for the requirements.
Support for metrics APIs
By default, the HorizontalPodAutoscaler controller retrieves metrics from a series of APIs. In order for it to access these APIs, cluster administrators must ensure that:
The API aggregation layer is enabled.
The corresponding APIs are registered:
For resource metrics, this is the
metrics.k8s.io
API, generally provided by metrics-server. It can be launched as a cluster addon.For custom metrics, this is the
custom.metrics.k8s.io
API. It's provided by "adapter" API servers provided by metrics solution vendors. Check with your metrics pipeline, or the list of known solutions. If you would like to write your own, check out the boilerplate to get started.For external metrics, this is the
external.metrics.k8s.io
API. It may be provided by the custom metrics adapters provided above.
The
--horizontal-pod-autoscaler-use-rest-clients
istrue
or unset. Setting this to false switches to Heapster-based autoscaling, which is deprecated.
For more information on these different metrics paths and how they differ please see the relevant design proposals for the HPA V2, custom.metrics.k8s.io and external.metrics.k8s.io.
For examples of how to use them see the walkthrough for using custom metrics and the walkthrough for using external metrics.
Support for configurable scaling behavior
Starting from
v1.18
the v2beta2
API allows scaling behavior to be configured through the HPA
behavior
field. Behaviors are specified separately for scaling up and down in
scaleUp
or scaleDown
section under the behavior
field. A stabilization
window can be specified for both directions which prevents the flapping of the
number of the replicas in the scaling target. Similarly specifying scaling
policies controls the rate of change of replicas while scaling.
Scaling Policies
One or more scaling policies can be specified in the behavior
section of the spec.
When multiple policies are specified the policy which allows the highest amount of
change is the policy which is selected by default. The following example shows this behavior
while scaling down:
behavior:
scaleDown:
policies:
- type: Pods
value: 4
periodSeconds: 60
- type: Percent
value: 10
periodSeconds: 60
periodSeconds
indicates the length of time in the past for which the policy must hold true.
The first policy (Pods) allows at most 4 replicas to be scaled down in one minute. The second policy
(Percent) allows at most 10% of the current replicas to be scaled down in one minute.
Since by default the policy which allows the highest amount of change is selected, the second policy will only be used when the number of pod replicas is more than 40. With 40 or less replicas, the first policy will be applied. For instance if there are 80 replicas and the target has to be scaled down to 10 replicas then during the first step 8 replicas will be reduced. In the next iteration when the number of replicas is 72, 10% of the pods is 7.2 but the number is rounded up to 8. On each loop of the autoscaler controller the number of pods to be change is re-calculated based on the number of current replicas. When the number of replicas falls below 40 the first policy (Pods) is applied and 4 replicas will be reduced at a time.
The policy selection can be changed by specifying the selectPolicy
field for a scaling
direction. By setting the value to Min
which would select the policy which allows the
smallest change in the replica count. Setting the value to Disabled
completely disables
scaling in that direction.
Stabilization Window
The stabilization window is used to restrict the flapping of replicas when the metrics
used for scaling keep fluctuating. The stabilization window is used by the autoscaling
algorithm to consider the computed desired state from the past to prevent scaling. In
the following example the stabilization window is specified for scaleDown
.
scaleDown:
stabilizationWindowSeconds: 300
When the metrics indicate that the target should be scaled down the algorithm looks into previously computed desired states and uses the highest value from the specified interval. In above example all desired states from the past 5 minutes will be considered.
Default Behavior
To use the custom scaling not all fields have to be specified. Only values which need to be customized can be specified. These custom values are merged with default values. The default values match the existing behavior in the HPA algorithm.
behavior:
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 100
periodSeconds: 15
scaleUp:
stabilizationWindowSeconds: 0
policies:
- type: Percent
value: 100
periodSeconds: 15
- type: Pods
value: 4
periodSeconds: 15
selectPolicy: Max
For scaling down the stabilization window is 300 seconds (or the value of the
--horizontal-pod-autoscaler-downscale-stabilization
flag if provided). There is only a single policy
for scaling down which allows a 100% of the currently running replicas to be removed which
means the scaling target can be scaled down to the minimum allowed replicas.
For scaling up there is no stabilization window. When the metrics indicate that the target should be
scaled up the target is scaled up immediately. There are 2 policies where 4 pods or a 100% of the currently
running replicas will be added every 15 seconds till the HPA reaches its steady state.
Example: change downscale stabilization window
To provide a custom downscale stabilization window of 1 minute, the following behavior would be added to the HPA:
behavior:
scaleDown:
stabilizationWindowSeconds: 60
Example: limit scale down rate
To limit the rate at which pods are removed by the HPA to 10% per minute, the following behavior would be added to the HPA:
behavior:
scaleDown:
policies:
- type: Percent
value: 10
periodSeconds: 60
To ensure that no more than 5 Pods are removed per minute, you can add a second scale-down
policy with a fixed size of 5, and set selectPolicy
to minimum. Setting selectPolicy
to Min
means
that the autoscaler chooses the policy that affects the smallest number of Pods:
behavior:
scaleDown:
policies:
- type: Percent
value: 10
periodSeconds: 60
- type: Pods
value: 5
periodSeconds: 60
selectPolicy: Min
Example: disable scale down
The selectPolicy
value of Disabled
turns off scaling the given direction.
So to prevent downscaling the following policy would be used:
behavior:
scaleDown:
selectPolicy: Disabled
Implicit maintenance-mode deactivation
You can implicitly deactivate the HPA for a target without the
need to change the HPA configuration itself. If the target's desired replica count
is set to 0, and the HPA's minimum replica count is greater than 0, the HPA
stops adjusting the target (and sets the ScalingActive
Condition on itself
to false
) until you reactivate it by manually adjusting the target's desired
replica count or HPA's minimum replica count.
What's next
- Design documentation: Horizontal Pod Autoscaling.
- kubectl autoscale command: kubectl autoscale.
- Usage example of Horizontal Pod Autoscaler.
8 - Horizontal Pod Autoscaler Walkthrough
Horizontal Pod Autoscaler automatically scales the number of Pods in a replication controller, deployment, replica set or stateful set based on observed CPU utilization (or, with beta support, on some other, application-provided metrics).
This document walks you through an example of enabling Horizontal Pod Autoscaler for the php-apache server. For more information on how Horizontal Pod Autoscaler behaves, see the Horizontal Pod Autoscaler user guide.
Before you begin
This example requires a running Kubernetes cluster and kubectl, version 1.2 or later. Metrics server monitoring needs to be deployed in the cluster to provide metrics through the Metrics API. Horizontal Pod Autoscaler uses this API to collect metrics. To learn how to deploy the metrics-server, see the metrics-server documentation.
To specify multiple resource metrics for a Horizontal Pod Autoscaler, you must have a Kubernetes cluster and kubectl at version 1.6 or later. To make use of custom metrics, your cluster must be able to communicate with the API server providing the custom Metrics API. Finally, to use metrics not related to any Kubernetes object you must have a Kubernetes cluster at version 1.10 or later, and you must be able to communicate with the API server that provides the external Metrics API. See the Horizontal Pod Autoscaler user guide for more details.
Run and expose php-apache server
To demonstrate Horizontal Pod Autoscaler we will use a custom docker image based on the php-apache image. The Dockerfile has the following content:
FROM php:5-apache
COPY index.php /var/www/html/index.php
RUN chmod a+rx index.php
It defines an index.php page which performs some CPU intensive computations:
<?php
$x = 0.0001;
for ($i = 0; $i <= 1000000; $i++) {
$x += sqrt($x);
}
echo "OK!";
?>
First, we will start a deployment running the image and expose it as a service using the following configuration:
apiVersion: apps/v1
kind: Deployment
metadata:
name: php-apache
spec:
selector:
matchLabels:
run: php-apache
replicas: 1
template:
metadata:
labels:
run: php-apache
spec:
containers:
- name: php-apache
image: k8s.gcr.io/hpa-example
ports:
- containerPort: 80
resources:
limits:
cpu: 500m
requests:
cpu: 200m
---
apiVersion: v1
kind: Service
metadata:
name: php-apache
labels:
run: php-apache
spec:
ports:
- port: 80
selector:
run: php-apache
Run the following command:
kubectl apply -f https://k8s.io/examples/application/php-apache.yaml
deployment.apps/php-apache created
service/php-apache created
Create Horizontal Pod Autoscaler
Now that the server is running, we will create the autoscaler using
kubectl autoscale.
The following command will create a Horizontal Pod Autoscaler that maintains between 1 and 10 replicas of the Pods
controlled by the php-apache deployment we created in the first step of these instructions.
Roughly speaking, HPA will increase and decrease the number of replicas
(via the deployment) to maintain an average CPU utilization across all Pods of 50%
(since each pod requests 200 milli-cores by kubectl run
), this means average CPU usage of 100 milli-cores).
See here for more details on the algorithm.
kubectl autoscale deployment php-apache --cpu-percent=50 --min=1 --max=10
horizontalpodautoscaler.autoscaling/php-apache autoscaled
We may check the current status of autoscaler by running:
kubectl get hpa
NAME REFERENCE TARGET MINPODS MAXPODS REPLICAS AGE
php-apache Deployment/php-apache/scale 0% / 50% 1 10 1 18s
Please note that the current CPU consumption is 0% as we are not sending any requests to the server
(the TARGET
column shows the average across all the pods controlled by the corresponding deployment).
Increase load
Now, we will see how the autoscaler reacts to increased load. We will start a container, and send an infinite loop of queries to the php-apache service (please run it in a different terminal):
kubectl run -i --tty load-generator --rm --image=busybox --restart=Never -- /bin/sh -c "while sleep 0.01; do wget -q -O- http://php-apache; done"
Within a minute or so, we should see the higher CPU load by executing:
kubectl get hpa
NAME REFERENCE TARGET MINPODS MAXPODS REPLICAS AGE
php-apache Deployment/php-apache/scale 305% / 50% 1 10 1 3m
Here, CPU consumption has increased to 305% of the request. As a result, the deployment was resized to 7 replicas:
kubectl get deployment php-apache
NAME READY UP-TO-DATE AVAILABLE AGE
php-apache 7/7 7 7 19m
Note: It may take a few minutes to stabilize the number of replicas. Since the amount of load is not controlled in any way it may happen that the final number of replicas will differ from this example.
Stop load
We will finish our example by stopping the user load.
In the terminal where we created the container with busybox
image, terminate
the load generation by typing <Ctrl> + C
.
Then we will verify the result state (after a minute or so):
kubectl get hpa
NAME REFERENCE TARGET MINPODS MAXPODS REPLICAS AGE
php-apache Deployment/php-apache/scale 0% / 50% 1 10 1 11m
kubectl get deployment php-apache
NAME READY UP-TO-DATE AVAILABLE AGE
php-apache 1/1 1 1 27m
Here CPU utilization dropped to 0, and so HPA autoscaled the number of replicas back down to 1.
Note: Autoscaling the replicas may take a few minutes.
Autoscaling on multiple metrics and custom metrics
You can introduce additional metrics to use when autoscaling the php-apache
Deployment
by making use of the autoscaling/v2beta2
API version.
First, get the YAML of your HorizontalPodAutoscaler in the autoscaling/v2beta2
form:
kubectl get hpa php-apache -o yaml > /tmp/hpa-v2.yaml
Open the /tmp/hpa-v2.yaml
file in an editor, and you should see YAML which looks like this:
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
name: php-apache
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: php-apache
minReplicas: 1
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 50
status:
observedGeneration: 1
lastScaleTime: <some-time>
currentReplicas: 1
desiredReplicas: 1
currentMetrics:
- type: Resource
resource:
name: cpu
current:
averageUtilization: 0
averageValue: 0
Notice that the targetCPUUtilizationPercentage
field has been replaced with an array called metrics
.
The CPU utilization metric is a resource metric, since it is represented as a percentage of a resource
specified on pod containers. Notice that you can specify other resource metrics besides CPU. By default,
the only other supported resource metric is memory. These resources do not change names from cluster
to cluster, and should always be available, as long as the metrics.k8s.io
API is available.
You can also specify resource metrics in terms of direct values, instead of as percentages of the
requested value, by using a target.type
of AverageValue
instead of Utilization
, and
setting the corresponding target.averageValue
field instead of the target.averageUtilization
.
There are two other types of metrics, both of which are considered custom metrics: pod metrics and object metrics. These metrics may have names which are cluster specific, and require a more advanced cluster monitoring setup.
The first of these alternative metric types is pod metrics. These metrics describe Pods, and
are averaged together across Pods and compared with a target value to determine the replica count.
They work much like resource metrics, except that they only support a target
type of AverageValue
.
Pod metrics are specified using a metric block like this:
type: Pods
pods:
metric:
name: packets-per-second
target:
type: AverageValue
averageValue: 1k
The second alternative metric type is object metrics. These metrics describe a different
object in the same namespace, instead of describing Pods. The metrics are not necessarily
fetched from the object; they only describe it. Object metrics support target
types of
both Value
and AverageValue
. With Value
, the target is compared directly to the returned
metric from the API. With AverageValue
, the value returned from the custom metrics API is divided
by the number of Pods before being compared to the target. The following example is the YAML
representation of the requests-per-second
metric.
type: Object
object:
metric:
name: requests-per-second
describedObject:
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
name: main-route
target:
type: Value
value: 2k
If you provide multiple such metric blocks, the HorizontalPodAutoscaler will consider each metric in turn. The HorizontalPodAutoscaler will calculate proposed replica counts for each metric, and then choose the one with the highest replica count.
For example, if you had your monitoring system collecting metrics about network traffic,
you could update the definition above using kubectl edit
to look like this:
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
name: php-apache
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: php-apache
minReplicas: 1
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 50
- type: Pods
pods:
metric:
name: packets-per-second
target:
type: AverageValue
averageValue: 1k
- type: Object
object:
metric:
name: requests-per-second
describedObject:
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
name: main-route
target:
type: Value
value: 10k
status:
observedGeneration: 1
lastScaleTime: <some-time>
currentReplicas: 1
desiredReplicas: 1
currentMetrics:
- type: Resource
resource:
name: cpu
current:
averageUtilization: 0
averageValue: 0
- type: Object
object:
metric:
name: requests-per-second
describedObject:
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
name: main-route
current:
value: 10k
Then, your HorizontalPodAutoscaler would attempt to ensure that each pod was consuming roughly 50% of its requested CPU, serving 1000 packets per second, and that all pods behind the main-route Ingress were serving a total of 10000 requests per second.
Autoscaling on more specific metrics
Many metrics pipelines allow you to describe metrics either by name or by a set of additional
descriptors called labels. For all non-resource metric types (pod, object, and external,
described below), you can specify an additional label selector which is passed to your metric
pipeline. For instance, if you collect a metric http_requests
with the verb
label, you can specify the following metric block to scale only on GET requests:
type: Object
object:
metric:
name: http_requests
selector: {matchLabels: {verb: GET}}
This selector uses the same syntax as the full Kubernetes label selectors. The monitoring pipeline
determines how to collapse multiple series into a single value, if the name and selector
match multiple series. The selector is additive, and cannot select metrics
that describe objects that are not the target object (the target pods in the case of the Pods
type, and the described object in the case of the Object
type).
Autoscaling on metrics not related to Kubernetes objects
Applications running on Kubernetes may need to autoscale based on metrics that don't have an obvious relationship to any object in the Kubernetes cluster, such as metrics describing a hosted service with no direct correlation to Kubernetes namespaces. In Kubernetes 1.10 and later, you can address this use case with external metrics.
Using external metrics requires knowledge of your monitoring system; the setup is
similar to that required when using custom metrics. External metrics allow you to autoscale your cluster
based on any metric available in your monitoring system. Provide a metric
block with a
name
and selector
, as above, and use the External
metric type instead of Object
.
If multiple time series are matched by the metricSelector
,
the sum of their values is used by the HorizontalPodAutoscaler.
External metrics support both the Value
and AverageValue
target types, which function exactly the same
as when you use the Object
type.
For example if your application processes tasks from a hosted queue service, you could add the following section to your HorizontalPodAutoscaler manifest to specify that you need one worker per 30 outstanding tasks.
- type: External
external:
metric:
name: queue_messages_ready
selector: "queue=worker_tasks"
target:
type: AverageValue
averageValue: 30
When possible, it's preferable to use the custom metric target types instead of external metrics, since it's easier for cluster administrators to secure the custom metrics API. The external metrics API potentially allows access to any metric, so cluster administrators should take care when exposing it.
Appendix: Horizontal Pod Autoscaler Status Conditions
When using the autoscaling/v2beta2
form of the HorizontalPodAutoscaler, you will be able to see
status conditions set by Kubernetes on the HorizontalPodAutoscaler. These status conditions indicate
whether or not the HorizontalPodAutoscaler is able to scale, and whether or not it is currently restricted
in any way.
The conditions appear in the status.conditions
field. To see the conditions affecting a HorizontalPodAutoscaler,
we can use kubectl describe hpa
:
kubectl describe hpa cm-test
Name: cm-test
Namespace: prom
Labels: <none>
Annotations: <none>
CreationTimestamp: Fri, 16 Jun 2017 18:09:22 +0000
Reference: ReplicationController/cm-test
Metrics: ( current / target )
"http_requests" on pods: 66m / 500m
Min replicas: 1
Max replicas: 4
ReplicationController pods: 1 current / 1 desired
Conditions:
Type Status Reason Message
---- ------ ------ -------
AbleToScale True ReadyForNewScale the last scale time was sufficiently old as to warrant a new scale
ScalingActive True ValidMetricFound the HPA was able to successfully calculate a replica count from pods metric http_requests
ScalingLimited False DesiredWithinRange the desired replica count is within the acceptable range
Events:
For this HorizontalPodAutoscaler, we can see several conditions in a healthy state. The first,
AbleToScale
, indicates whether or not the HPA is able to fetch and update scales, as well as
whether or not any backoff-related conditions would prevent scaling. The second, ScalingActive
,
indicates whether or not the HPA is enabled (i.e. the replica count of the target is not zero) and
is able to calculate desired scales. When it is False
, it generally indicates problems with
fetching metrics. Finally, the last condition, ScalingLimited
, indicates that the desired scale
was capped by the maximum or minimum of the HorizontalPodAutoscaler. This is an indication that
you may wish to raise or lower the minimum or maximum replica count constraints on your
HorizontalPodAutoscaler.
Appendix: Quantities
All metrics in the HorizontalPodAutoscaler and metrics APIs are specified using
a special whole-number notation known in Kubernetes as a
quantity. For example,
the quantity 10500m
would be written as 10.5
in decimal notation. The metrics APIs
will return whole numbers without a suffix when possible, and will generally return
quantities in milli-units otherwise. This means you might see your metric value fluctuate
between 1
and 1500m
, or 1
and 1.5
when written in decimal notation.
Appendix: Other possible scenarios
Creating the autoscaler declaratively
Instead of using kubectl autoscale
command to create a HorizontalPodAutoscaler imperatively we
can use the following file to create it declaratively:
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
name: php-apache
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: php-apache
minReplicas: 1
maxReplicas: 10
targetCPUUtilizationPercentage: 50
We will create the autoscaler by executing the following command:
kubectl create -f https://k8s.io/examples/application/hpa/php-apache.yaml
horizontalpodautoscaler.autoscaling/php-apache created
9 - Specifying a Disruption Budget for your Application
Kubernetes v1.21 [stable]
This page shows how to limit the number of concurrent disruptions that your application experiences, allowing for higher availability while permitting the cluster administrator to manage the clusters nodes.
Before you begin
Your Kubernetes server must be version v1.21. To check the version, enterkubectl version
.- You are the owner of an application running on a Kubernetes cluster that requires high availability.
- You should know how to deploy Replicated Stateless Applications and/or Replicated Stateful Applications.
- You should have read about Pod Disruptions.
- You should confirm with your cluster owner or service provider that they respect Pod Disruption Budgets.
Protecting an Application with a PodDisruptionBudget
- Identify what application you want to protect with a PodDisruptionBudget (PDB).
- Think about how your application reacts to disruptions.
- Create a PDB definition as a YAML file.
- Create the PDB object from the YAML file.
Identify an Application to Protect
The most common use case when you want to protect an application specified by one of the built-in Kubernetes controllers:
- Deployment
- ReplicationController
- ReplicaSet
- StatefulSet
In this case, make a note of the controller's .spec.selector
; the same
selector goes into the PDBs .spec.selector
.
From version 1.15 PDBs support custom controllers where the scale subresource is enabled.
You can also use PDBs with pods which are not controlled by one of the above controllers, or arbitrary groups of pods, but there are some restrictions, described in Arbitrary Controllers and Selectors.
Think about how your application reacts to disruptions
Decide how many instances can be down at the same time for a short period due to a voluntary disruption.
- Stateless frontends:
- Concern: don't reduce serving capacity by more than 10%.
- Solution: use PDB with minAvailable 90% for example.
- Concern: don't reduce serving capacity by more than 10%.
- Single-instance Stateful Application:
- Concern: do not terminate this application without talking to me.
- Possible Solution 1: Do not use a PDB and tolerate occasional downtime.
- Possible Solution 2: Set PDB with maxUnavailable=0. Have an understanding (outside of Kubernetes) that the cluster operator needs to consult you before termination. When the cluster operator contacts you, prepare for downtime, and then delete the PDB to indicate readiness for disruption. Recreate afterwards.
- Concern: do not terminate this application without talking to me.
- Multiple-instance Stateful application such as Consul, ZooKeeper, or etcd:
- Concern: Do not reduce number of instances below quorum, otherwise writes fail.
- Possible Solution 1: set maxUnavailable to 1 (works with varying scale of application).
- Possible Solution 2: set minAvailable to quorum-size (e.g. 3 when scale is 5). (Allows more disruptions at once).
- Concern: Do not reduce number of instances below quorum, otherwise writes fail.
- Restartable Batch Job:
- Concern: Job needs to complete in case of voluntary disruption.
- Possible solution: Do not create a PDB. The Job controller will create a replacement pod.
- Concern: Job needs to complete in case of voluntary disruption.
Rounding logic when specifying percentages
Values for minAvailable
or maxUnavailable
can be expressed as integers or as a percentage.
- When you specify an integer, it represents a number of Pods. For instance, if you set
minAvailable
to 10, then 10 Pods must always be available, even during a disruption. - When you specify a percentage by setting the value to a string representation of a percentage (eg.
"50%"
), it represents a percentage of total Pods. For instance, if you setmaxUnavailable
to"50%"
, then only 50% of the Pods can be unavailable during a disruption.
When you specify the value as a percentage, it may not map to an exact number of Pods. For example, if you have 7 Pods and
you set minAvailable
to "50%"
, it's not immediately obvious whether that means 3 Pods or 4 Pods must be available.
Kubernetes rounds up to the nearest integer, so in this case, 4 Pods must be available. You can examine the
code
that controls this behavior.
Specifying a PodDisruptionBudget
A PodDisruptionBudget
has three fields:
- A label selector
.spec.selector
to specify the set of pods to which it applies. This field is required. .spec.minAvailable
which is a description of the number of pods from that set that must still be available after the eviction, even in the absence of the evicted pod.minAvailable
can be either an absolute number or a percentage..spec.maxUnavailable
(available in Kubernetes 1.7 and higher) which is a description of the number of pods from that set that can be unavailable after the eviction. It can be either an absolute number or a percentage.
Note: The behavior for an empty selector differs between the policy/v1beta1 and policy/v1 APIs for PodDisruptionBudgets. For policy/v1beta1 an empty selector matches zero pods, while for policy/v1 an empty selector matches every pod in the namespace.
You can specify only one of maxUnavailable
and minAvailable
in a single PodDisruptionBudget
.
maxUnavailable
can only be used to control the eviction of pods
that have an associated controller managing them. In the examples below, "desired replicas"
is the scale
of the controller managing the pods being selected by the
PodDisruptionBudget
.
Example 1: With a minAvailable
of 5, evictions are allowed as long as they leave behind
5 or more healthy pods among those selected by the PodDisruptionBudget's selector
.
Example 2: With a minAvailable
of 30%, evictions are allowed as long as at least 30%
of the number of desired replicas are healthy.
Example 3: With a maxUnavailable
of 5, evictions are allowed as long as there are at most 5
unhealthy replicas among the total number of desired replicas.
Example 4: With a maxUnavailable
of 30%, evictions are allowed as long as no more than 30%
of the desired replicas are unhealthy.
In typical usage, a single budget would be used for a collection of pods managed by a controller—for example, the pods in a single ReplicaSet or StatefulSet.
Note: A disruption budget does not truly guarantee that the specified number/percentage of pods will always be up. For example, a node that hosts a pod from the collection may fail when the collection is at the minimum size specified in the budget, thus bringing the number of available pods from the collection below the specified size. The budget can only protect against voluntary evictions, not all causes of unavailability.
If you set maxUnavailable
to 0% or 0, or you set minAvailable
to 100% or the number of replicas,
you are requiring zero voluntary evictions. When you set zero voluntary evictions for a workload
object such as ReplicaSet, then you cannot successfully drain a Node running one of those Pods.
If you try to drain a Node where an unevictable Pod is running, the drain never completes. This is permitted as per the
semantics of PodDisruptionBudget
.
You can find examples of pod disruption budgets defined below. They match pods with the label
app: zookeeper
.
Example PDB Using minAvailable:
Example PDB Using maxUnavailable:
For example, if the above zk-pdb
object selects the pods of a StatefulSet of size 3, both
specifications have the exact same meaning. The use of maxUnavailable
is recommended as it
automatically responds to changes in the number of replicas of the corresponding controller.
Create the PDB object
You can create or update the PDB object with a command like kubectl apply -f mypdb.yaml
.
Check the status of the PDB
Use kubectl to check that your PDB is created.
Assuming you don't actually have pods matching app: zookeeper
in your namespace,
then you'll see something like this:
kubectl get poddisruptionbudgets
NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE
zk-pdb 2 N/A 0 7s
If there are matching pods (say, 3), then you would see something like this:
kubectl get poddisruptionbudgets
NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE
zk-pdb 2 N/A 1 7s
The non-zero value for ALLOWED DISRUPTIONS
means that the disruption controller has seen the pods,
counted the matching pods, and updated the status of the PDB.
You can get more information about the status of a PDB with this command:
kubectl get poddisruptionbudgets zk-pdb -o yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
annotations:
…
creationTimestamp: "2020-03-04T04:22:56Z"
generation: 1
name: zk-pdb
…
status:
currentHealthy: 3
desiredHealthy: 2
disruptionsAllowed: 1
expectedPods: 3
observedGeneration: 1
Arbitrary Controllers and Selectors
You can skip this section if you only use PDBs with the built-in application controllers (Deployment, ReplicationController, ReplicaSet, and StatefulSet), with the PDB selector matching the controller's selector.
You can use a PDB with pods controlled by another type of controller, by an "operator", or bare pods, but with these restrictions:
- only
.spec.minAvailable
can be used, not.spec.maxUnavailable
. - only an integer value can be used with
.spec.minAvailable
, not a percentage.
You can use a selector which selects a subset or superset of the pods belonging to a built-in controller. The eviction API will disallow eviction of any pod covered by multiple PDBs, so most users will want to avoid overlapping selectors. One reasonable use of overlapping PDBs is when pods are being transitioned from one PDB to another.
10 - Accessing the Kubernetes API from a Pod
This guide demonstrates how to access the Kubernetes API from within a pod.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
Accessing the API from within a Pod
When accessing the API from within a Pod, locating and authenticating to the API server are slightly different to the external client case.
The easiest way to use the Kubernetes API from a Pod is to use one of the official client libraries. These libraries can automatically discover the API server and authenticate.
Using Official Client Libraries
From within a Pod, the recommended ways to connect to the Kubernetes API are:
For a Go client, use the official Go client library. The
rest.InClusterConfig()
function handles API host discovery and authentication automatically. See an example here.For a Python client, use the official Python client library. The
config.load_incluster_config()
function handles API host discovery and authentication automatically. See an example here.There are a number of other libraries available, please refer to the Client Libraries page.
In each case, the service account credentials of the Pod are used to communicate securely with the API server.
Directly accessing the REST API
While running in a Pod, the Kubernetes apiserver is accessible via a Service named
kubernetes
in the default
namespace. Therefore, Pods can use the
kubernetes.default.svc
hostname to query the API server. Official client libraries
do this automatically.
The recommended way to authenticate to the API server is with a
service account credential. By default, a Pod
is associated with a service account, and a credential (token) for that
service account is placed into the filesystem tree of each container in that Pod,
at /var/run/secrets/kubernetes.io/serviceaccount/token
.
If available, a certificate bundle is placed into the filesystem tree of each
container at /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
, and should be
used to verify the serving certificate of the API server.
Finally, the default namespace to be used for namespaced API operations is placed in a file
at /var/run/secrets/kubernetes.io/serviceaccount/namespace
in each container.
Using kubectl proxy
If you would like to query the API without an official client library, you can run kubectl proxy
as the command
of a new sidecar container in the Pod. This way, kubectl proxy
will authenticate
to the API and expose it on the localhost
interface of the Pod, so that other containers
in the Pod can use it directly.
Without using a proxy
It is possible to avoid using the kubectl proxy by passing the authentication token directly to the API server. The internal certificate secures the connection.
# Point to the internal API server hostname
APISERVER=https://kubernetes.default.svc
# Path to ServiceAccount token
SERVICEACCOUNT=/var/run/secrets/kubernetes.io/serviceaccount
# Read this Pod's namespace
NAMESPACE=$(cat ${SERVICEACCOUNT}/namespace)
# Read the ServiceAccount bearer token
TOKEN=$(cat ${SERVICEACCOUNT}/token)
# Reference the internal certificate authority (CA)
CACERT=${SERVICEACCOUNT}/ca.crt
# Explore the API with TOKEN
curl --cacert ${CACERT} --header "Authorization: Bearer ${TOKEN}" -X GET ${APISERVER}/api
The output will be similar to this:
{
"kind": "APIVersions",
"versions": [
"v1"
],
"serverAddressByClientCIDRs": [
{
"clientCIDR": "0.0.0.0/0",
"serverAddress": "10.0.1.149:443"
}
]
}