Autoscalable Kubernetes cluster with preemptible nodes and non-preemptible fallback


Preemptible VMs are much cheaper than non-preemptible, on-demand VMs on all major public clouds: up to 80% cheaper. [1][2][3] New businesses are emerging [4] that let enterprises run critical workloads on preemptible VMs by increasing their tolerance for faults in the underlying infrastructure. With Kubernetes, using preemptible instances is fairly easy, since Kubernetes already reacts reasonably well to infrastructure faults.

This is going to be a brief hypothesis, since I don’t have a Kubernetes cluster running to test it and draw any useful conclusions. My primary goal here is to document this for myself. [5]


To fully grasp the text ahead, you should have


We can use Kops to create a set of Instance Groups, and run Cluster Autoscaler with a priority expander, to autoscale on preemptible nodes while using non-preemptible nodes as a fallback.


Kops allows us to create many Instance Groups in a single cluster. We’ll create \(2 \times n\) Instance Groups for the Node role, where \(n\) is the number of different machine configurations in use. Each pair of Instance Groups consists of a preemptible and a non-preemptible group with the same machine configuration.

To make an Instance Group preemptible, specify maxPrice in its YAML spec. I am not sure whether Kops supports this for other cloud providers, but I have used it with AWS in the past. To set this up, you can refer to this how-to.
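As a rough sketch of the pairing scheme, the snippet below builds one preemptible and one non-preemptible Instance Group definition per machine type. The cluster name, machine types, sizes, bid prices, and the naming suffixes are all hypothetical, and the field names follow the Kops InstanceGroup spec for AWS as I understand it, so check them against your Kops version. It also leaves out fields a real group needs, such as the image and subnets.

```python
import json


def instance_group_pair(cluster, machine_type, max_price, min_size=0, max_size=10):
    """Return (preemptible, non_preemptible) InstanceGroup dicts for one machine type."""

    def base(name):
        return {
            "apiVersion": "kops.k8s.io/v1alpha2",
            "kind": "InstanceGroup",
            "metadata": {
                "name": name,
                "labels": {"kops.k8s.io/cluster": cluster},
            },
            "spec": {
                "role": "Node",
                "machineType": machine_type,
                "minSize": min_size,
                "maxSize": max_size,
            },
        }

    # Hypothetical naming convention: "<machine-type>-preemptible" / "-non-preemptible".
    slug = machine_type.replace(".", "-")
    preemptible = base(f"{slug}-preemptible")
    # Setting maxPrice is what makes the group preemptible (a spot bid on AWS).
    preemptible["spec"]["maxPrice"] = max_price
    non_preemptible = base(f"{slug}-non-preemptible")
    return preemptible, non_preemptible


def build_groups(cluster, prices):
    """prices maps machine type -> max bid price; returns 2 * n group dicts."""
    groups = []
    for machine_type, max_price in prices.items():
        groups.extend(instance_group_pair(cluster, machine_type, max_price))
    return groups


if __name__ == "__main__":
    # Made-up cluster name and prices, for illustration only.
    demo = build_groups("example.k8s.local", {"m5.large": "0.05", "c5.xlarge": "0.09"})
    print(json.dumps(demo, indent=2))
```

With two machine configurations this yields four groups, which is the \(2 \times n\) layout described above.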

Cluster Autoscaler

Kops Instance Groups map to Auto Scaling Groups on AWS and to Instance Groups on GCE. [6] Cluster Autoscaler uses these vendor-dependent APIs to scale an Instance Group. It can work with many Instance Groups in a cluster, but it must be able to discover them.

To specify each Instance Group individually, we can pass it to Cluster Autoscaler with the --nodes flag. Alternatively, we can set up auto-discovery with --node-group-auto-discovery. As the documentation states: [7]

--nodes: sets min, max size and other configuration data for a node group in a format accepted by cloud provider. Can be used multiple times. Format: <min>:<max>:<other...>

--node-group-auto-discovery: One or more definition(s) of node group auto-discovery.
A definition is expressed <name of discoverer>:[<key>[=<value>]]
The aws, gce, and azure cloud providers are currently supported. AWS matches by ASG tags, e.g. asg:tag=tagKey,anotherTagKey
GCE matches by IG name prefix, and requires you to specify min and max nodes per IG, e.g. mig:namePrefix=pfx,min=0,max=10
Azure matches by tags on VMSS, e.g. label:foo=bar, and will auto-detect min and max tags on the VMSS to set scaling limits.
Can be used multiple times
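To make the two discovery styles concrete, here is a small sketch that assembles the flag values in the formats quoted above. The group names, sizes, and tag keys are hypothetical; on AWS the third field of --nodes is the Auto Scaling Group name, which may differ from the Kops Instance Group name.

```python
def explicit_nodes_flags(groups):
    """Build one --nodes flag per group from (min, max, name) tuples."""
    return [f"--nodes={low}:{high}:{name}" for low, high, name in groups]


def aws_auto_discovery_flag(tag_keys):
    """Build the AWS auto-discovery flag; ASGs carrying all given tag keys match."""
    return "--node-group-auto-discovery=asg:tag=" + ",".join(tag_keys)


if __name__ == "__main__":
    # Hypothetical group names and limits.
    for flag in explicit_nodes_flags([
        (0, 10, "m5-large-preemptible"),
        (0, 10, "m5-large-non-preemptible"),
    ]):
        print(flag)

    # Hypothetical tag keys; any keys work as long as the ASGs are tagged with them.
    print(aws_auto_discovery_flag([
        "k8s.io/cluster-autoscaler/enabled",
        "k8s.io/cluster-autoscaler/example.k8s.local",
    ]))
```

Explicit flags grow with every new pair of groups, while auto-discovery only needs the tags to be present on each group, which is why it fits the \(2 \times n\) layout better.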

To avoid setting labels for auto-discovery manually through the cloud vendor’s interface, use cloudLabels in the Instance Group spec. Once Cluster Autoscaler can discover these Instance Groups, we can move on to setting up a suitable expander. An expander defines the strategy used when autoscaling a cluster with many Instance Groups. [8]

Price-based expander

The price-based expander is only available for GCE/GKE clusters on GCP. It uses pseudo costs of instances to score Instance Groups. [9] Since it doesn’t use the actual cost matrix, it won’t work for our use case: it would presumably treat preemptible and non-preemptible instances as costing the same.

Priority-based expander

A priority-based expander uses a user-defined priority list to score the Instance Groups in the cluster. [10] Higher priority values win. To declare priorities, create a ConfigMap named cluster-autoscaler-priority-expander, e.g.

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    # Highest number wins; the values 10 and 50 are illustrative.
    10:
      - .*-non-preemptible # regex that matches IG names
    50:
      - .*-preemptible

To use this expander, deploy the ConfigMap and add the following argument to the autoscaler deployment

./cluster-autoscaler --expander=priority # ... other args
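One thing worth checking before deploying is whether the patterns overlap. The sketch below is a simplified model of priority matching, not the autoscaler's own code: it assumes each pattern is applied as an unanchored search and that a group takes the highest priority of any pattern it matches. Under those assumptions, .*-preemptible also matches names ending in -non-preemptible. The group names and the -spot / -on-demand alternative are hypothetical.

```python
import re


def best_priority(group_name, priorities):
    """Highest priority whose patterns match the name, or None if nothing matches."""
    matched = [
        prio
        for prio, patterns in priorities.items()
        if any(re.search(pattern, group_name) for pattern in patterns)
    ]
    return max(matched) if matched else None


def preferred_groups(group_names, priorities):
    """Return the groups sharing the top priority (all groups if none match)."""
    scored = {}
    for name in group_names:
        prio = best_priority(name, priorities)
        if prio is not None:
            scored[name] = prio
    if not scored:
        return sorted(group_names)
    top = max(scored.values())
    return sorted(name for name, prio in scored.items() if prio == top)


# Patterns as in the ConfigMap above; the numbers are illustrative.
OVERLAPPING = {10: [r".*-non-preemptible"], 50: [r".*-preemptible"]}
# Hypothetical naming where neither suffix contains the other.
DISTINCT = {10: [r".*-on-demand"], 50: [r".*-spot"]}

if __name__ == "__main__":
    print(preferred_groups(["m5-large-preemptible", "m5-large-non-preemptible"], OVERLAPPING))
    print(preferred_groups(["m5-large-spot", "m5-large-on-demand"], DISTINCT))
```

In this model the first call returns both groups, because the non-preemptible name also matches the priority-50 pattern, while the second returns only the spot group. If the real expander behaves the same way, choosing suffixes that don't contain each other is the simplest fix.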

Fine-tuning the autoscaler

The following cluster autoscaler parameters caught my eye. I’ll most definitely want to tweak these on a running cluster and observe their effects.

  • scan-interval: How often the cluster is reevaluated (default: 10 seconds). Reducing this may require more CPU, but it should decrease the autoscaler’s reaction time to instance preemption events.
  • max-node-provision-time: Maximum time CA waits for a node to be provisioned (default: 15 minutes). Fine-tune this to closely match the average time it takes to provision an instance: with a higher value, CA keeps waiting longer on nodes that never arrive, which may lead to resource starvation in the cluster during mass preemption events.
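To get a feel for how these two parameters interact, here is a back-of-the-envelope sketch. It assumes a deliberately simplified model that I have not verified on a running cluster: when a preemptible group cannot deliver a node, CA waits out max-node-provision-time and then needs up to one more scan before it tries another group. The tuned values are hypothetical.

```python
def worst_case_fallback_seconds(scan_interval_s, max_node_provision_time_s):
    """Rough upper bound on the delay before a fallback group is tried (simplified model)."""
    return max_node_provision_time_s + scan_interval_s


def autoscaler_flags(scan_interval_s, max_node_provision_time_s):
    """Render the two parameters as command-line flags, using durations in seconds."""
    return [
        f"--scan-interval={scan_interval_s}s",
        f"--max-node-provision-time={max_node_provision_time_s}s",
    ]


if __name__ == "__main__":
    defaults = (10, 15 * 60)  # documented defaults: 10 seconds, 15 minutes
    tuned = (10, 3 * 60)      # hypothetical tuning: give up on a node after 3 minutes

    for label, (scan, provision) in (("defaults", defaults), ("tuned", tuned)):
        delay = worst_case_fallback_seconds(scan, provision)
        print(label, autoscaler_flags(scan, provision), f"fallback after ~{delay}s")
```

Under this model, max-node-provision-time dominates the delay, so it is the first parameter I would tune when preemptions come in bursts.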

Additional Resources

There are still many things to consider to make this work smoothly. The following is a list of a few additional resources to take into consideration.