Autoscalable Kubernetes cluster with preemptible nodes and non-preemptible fallback
Preemptible VMs are far cheaper than regular on-demand VMs on all major public clouds — by a lot, I mean up to 80% cheaper. [1] [2] [3] New businesses are emerging [4] that let enterprises run their critical workloads on preemptible VMs by increasing their tolerance for faults in the underlying infrastructure. With Kubernetes, it is rather easy to use preemptible instances, since Kubernetes is reasonably reactive to infrastructure faults.
This is going to be a brief hypothesis, since I don’t have a running Kubernetes cluster to test it on and derive any useful conclusions from. My primary goal here is to document this for myself. [5]
To fully grasp the text ahead, you should have a working knowledge of Kubernetes, Kops, and the Cluster Autoscaler.
Kops allows us to create many Instance Groups in a single cluster. We’ll create \(2 \times n\) Instance Groups for the Node role, where \(n\) is the number of different machine configurations in use. Each pair of Instance Groups should contain a preemptible and a non-preemptible group with the same machine configuration.
Kops can create a preemptible Instance Group when you specify maxPrice in its YAML spec. I am not sure whether Kops supports this for other cloud providers, but I have used it with AWS in the past. To set this up, you can refer to this how-to.
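As a sketch, a preemptible Instance Group for AWS might look like the following — the cluster name, machine type, sizes, and bid price are all illustrative:

```yaml
# Hypothetical kops InstanceGroup spec for a preemptible (Spot) group on AWS.
# Cluster name, machine type, sizes, and maxPrice are illustrative values.
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: mycluster.example.com
  name: nodes-preemptible
spec:
  role: Node
  machineType: m5.large
  maxPrice: "0.10"   # setting a bid price makes the group use Spot instances
  minSize: 0
  maxSize: 10
  subnets:
  - us-east-1a
```

The non-preemptible twin of this group would be identical except for its name and the absence of maxPrice.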
Instance Groups map to Auto Scaling Groups in AWS and to Managed Instance Groups in GCE. [6] The Cluster Autoscaler uses these vendor-dependent APIs to scale an Instance Group. It can work with many Instance Groups in a cluster, but it must be able to discover them.
To specify each Instance Group individually, we can pass them to the Cluster Autoscaler using the --nodes flag. Alternatively, we can set it up for auto-discovery using --node-group-auto-discovery. As the documentation states: [7]
> `--nodes`: Sets min, max size and other configuration data for a node group in a format accepted by the cloud provider. Can be used multiple times. Format: `<min>:<max>:<other...>`
>
> `--node-group-auto-discovery`: One or more definition(s) of node group auto-discovery. A definition is expressed as `<name of discoverer>:[<key>[=<value>]]`. The `aws`, `gce`, and `azure` cloud providers are currently supported. AWS matches by ASG tags, e.g. `asg:tag=tagKey,anotherTagKey`. GCE matches by IG name prefix, and requires you to specify min and max nodes per IG, e.g. `mig:namePrefix=pfx,min=0,max=10`. Azure matches by tags on VMSS, e.g. `label:foo=bar`, and will auto-detect `min` and `max` tags on the VMSS to set scaling limits. Can be used multiple times.
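On AWS, discovery by ASG tag could be wired up roughly like this in the Cluster Autoscaler Deployment. The image version and cluster name are illustrative; the tag keys follow the conventional `k8s.io/cluster-autoscaler/*` scheme from the Cluster Autoscaler docs:

```yaml
# Excerpt from a hypothetical Cluster Autoscaler Deployment; only the
# container command is shown. The cluster name is illustrative.
containers:
- name: cluster-autoscaler
  image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.27.0
  command:
  - ./cluster-autoscaler
  - --cloud-provider=aws
  - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/mycluster.example.com
```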
To avoid setting labels for auto-discovery manually through the cloud vendor’s interface, use cloudLabels in the Instance Group spec. Once the Cluster Autoscaler can discover these Instance Groups, we can move on to setting up a suitable expander. An expander defines the strategy used when scaling up a cluster with many Instance Groups. [8]
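For instance, the tags that AWS auto-discovery matches on could be attached through cloudLabels — a sketch, with the cluster name illustrative and the keys following the conventional auto-discovery tag scheme:

```yaml
# Fragment of a kops InstanceGroup spec: these labels become ASG tags,
# which the Cluster Autoscaler's tag-based auto-discovery can match on.
spec:
  cloudLabels:
    k8s.io/cluster-autoscaler/enabled: ""
    k8s.io/cluster-autoscaler/mycluster.example.com: ""
```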
The price-based expander uses pseudo costs of instances to score Instance Groups. [9] It is only available for GCE/GKE clusters. Since it doesn’t use the actual cost matrix, it won’t work for our use case: it would consider a preemptible and a non-preemptible instance of the same machine configuration to cost the same.
A priority-based expander uses a user-defined priority list to score the Instance Groups in the cluster, where a higher number means a higher priority. [10] To declare priorities, create a ConfigMap named cluster-autoscaler-priority-expander in the kube-system namespace:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    10:
      - .*-non-preemptible  # regex that matches IG names
    50:
      - .*-preemptible
```
To use this expander, deploy the ConfigMap and add the following argument to the autoscaler Deployment:
```
./cluster-autoscaler --expander=priority # ... other args
```
Fine-tuning the autoscaler
The following cluster autoscaler parameters caught my eye. I’ll most definitely want to tweak these on a running cluster and observe their effects.
- `scan-interval`: Time period for cluster reevaluation (default: 10 seconds). Reducing it may cost more CPU, but it should decrease the autoscaler’s reaction time to instance preemption events.
- `max-node-provision-time`: Maximum time the CA waits for a node to be provisioned (default: 15 minutes). Tune this to closely match the average time it takes to provision an instance, since higher values may lead to resource starvation in the cluster during mass-preemption events.
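Putting the pieces together, a tuned invocation might look like the following — the values are illustrative starting points to experiment with, not recommendations:

```
./cluster-autoscaler \
  --expander=priority \
  --scan-interval=5s \
  --max-node-provision-time=5m
```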
There are still many things to consider to make this work smoothly. The following is a list of a few additional resources worth looking into.