Custom Autoscaling with Kubernetes

Are you leveraging the extensibility of Kubernetes (K8s) to reduce cost and improve your users’ experience?

Kubernetes, with its REST APIs, is highly extensible. But are you building anything on top of it that helps you save infrastructure cost?
Or, more importantly, are you using K8s to scale your application only when traffic rises, so that it stays available to your users all the time?

Custom autoscaling with Kubernetes can make your life easier by helping you to scale based on your needs.

If you want the background on why you need Kubernetes in the first place, see the article ‘Why Kubernetes?’.

This is not a guide to using Kubernetes custom metric autoscaling. Instead, it is about building your own application on top of K8s to optimise cost and gain more control over how custom autoscaling should behave.


Many applications and projects are deployed on Kubernetes or OpenShift. Kubernetes, being a flexible and extensible open source container orchestration tool, provides a lot of opportunities to further optimise resource utilisation within a cluster.

One thing you can do with K8s, or with any platform that uses K8s at its core, is to custom autoscale your applications based on different kinds of metrics. Essentially, you scale the applications the way you want rather than relying only on the autoscaling feature that comes packed by default.

Scaling based on current traffic is the ultimate aim for any organisation that wants to keep infrastructure cost low and, more importantly, keep the application available to its users all the time. This is what makes custom autoscaling significant.

By default, Kubernetes can autoscale based on CPU and memory. There are a few reasons why you would want to go for custom autoscaling instead. It can be based on:
  • Requests per second
  • Transactions per second
  • Message queue length, and many more.

Why would you want Custom Autoscaling?

Ability to scale based on the requests coming into your system rather than just CPU and memory.

Wouldn’t it be nice to control the number of running pods based on the requests or transactions happening in the application?

Having control over resources (pods) based on the traffic in the system allows you to optimise the cost of the infrastructure, because you run exactly the number of pods the traffic requires, no fewer and no more.

This approach saves you from keeping pre-scaled pods around for cases where traffic may or may not increase.
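As a rough illustration of the idea, here is a minimal Python sketch that turns a request rate into a pod count. The per-pod capacity (`rps_per_pod`) and the `min_pods`/`max_pods` bounds are assumptions made for this example; in practice you would measure how many requests one pod of your application can serve.

```python
import math


def pods_for_traffic(requests_per_second: float,
                     rps_per_pod: float,
                     min_pods: int = 1,
                     max_pods: int = 20) -> int:
    """Return the pod count needed to serve the current request rate.

    rps_per_pod, min_pods and max_pods are hypothetical settings for
    this sketch, not values Kubernetes provides.
    """
    if rps_per_pod <= 0:
        raise ValueError("rps_per_pod must be positive")
    # Round up so that a partially loaded pod is still counted.
    needed = math.ceil(requests_per_second / rps_per_pod)
    # Clamp to the configured bounds.
    return max(min_pods, min(max_pods, needed))


if __name__ == "__main__":
    print(pods_for_traffic(230, rps_per_pod=50))  # 5
```

The clamp keeps the result within sensible bounds even if the metric source reports a zero or an unexpectedly large value.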

Ability to scale dependent applications based on a certain ratio.

You must have come across situations where one of your applications depends on another, and you want to be able to scale the dependent application whenever the master application scales.

For example, if activity on your frontend microservice is higher than usual, you know that you will soon need more pods for the backend microservice. And if you know that each request in the frontend app needs double the resources at the backend, you would want to scale in the ratio of 1:2.

So if you are writing your own custom autoscaler, you could configure it such that when the frontend grows by 1 pod, the backend is scaled by 2 pods.

This helps avoid a bottleneck forming at the backend, because its pods are pre-scaled to process the requests.
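The 1:2 rule can be sketched in a few lines of Python. The function name and the choice to apply the same ratio when the frontend scales down are assumptions made for this example.

```python
def backend_replicas(frontend_before: int,
                     frontend_after: int,
                     backend_current: int,
                     ratio: int = 2) -> int:
    """Return the new backend pod count after the frontend has scaled.

    ratio=2 mirrors the 1:2 example: every frontend pod added results
    in two backend pods being added. Applying the ratio on scale-down
    as well is an assumption of this sketch.
    """
    delta = frontend_after - frontend_before
    # Never drop below one backend pod.
    return max(1, backend_current + delta * ratio)


if __name__ == "__main__":
    # Frontend grew from 3 to 4 pods, so the backend grows from 6 to 8.
    print(backend_replicas(3, 4, backend_current=6))  # 8
```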

Ability to scale differently at peak and non-peak periods.

We all see the operations team being super active during peak times, especially during a sale or the festive season, and we all know why!

What if you could configure a rule ahead of time so that, during peak times, you scale by X pods more than during non-peak times?

During peak times, heavy traffic means the pods are scaled frequently, but each container needs its own start-up time before it can serve requests. This start-up time affects the SLA of the application. Hence it is beneficial to scale by extra pods during peak times only, to stay ahead of the game.
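A minimal Python sketch of such a rule follows. The peak window (`PEAK_WINDOWS`) and the step sizes are hypothetical configuration values chosen for this example.

```python
from datetime import time

# Hypothetical peak window: 18:00 to 23:00 every day.
PEAK_WINDOWS = [(time(18, 0), time(23, 0))]


def scale_step(now: time, base_step: int = 1, peak_extra: int = 2) -> int:
    """Return how many pods to add per scale-up at the given time of day.

    Outside the peak windows the step is base_step; inside them it is
    base_step + peak_extra (the "X pods more" of the rule above).
    """
    in_peak = any(start <= now < end for start, end in PEAK_WINDOWS)
    return base_step + peak_extra if in_peak else base_step


if __name__ == "__main__":
    print(scale_step(time(19, 30)))  # 3
    print(scale_step(time(9, 0)))    # 1
```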

Ability to scale consumers of a queuing system based on message lag.

Many applications now have a pub/sub system in place, where publishers send messages to a message queue and subscribers pick messages up from it.

Wouldn’t you want to scale up your consumers based on the queue length of the broker?

The benefit of having your own custom autoscaler is that you can scale based on any set of metrics. This could be a value from the DB (say, the number of pending orders) or the queue length from systems like Kafka or IBM MQ.

The consumers of a queuing system need to be scaled as the number of messages in the system increases. You can define rules such that for X messages there should be Y consumer instances, so that once X is hit, Y instances are available to process the messages from the queue.
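The "X messages, Y consumers" rule can be expressed as a small lookup table. The thresholds in `QUEUE_RULES` are made-up values for this sketch; real values depend on how fast one consumer drains the queue.

```python
# Hypothetical rules: (minimum queue length, consumer instances).
QUEUE_RULES = [(0, 1), (1000, 2), (5000, 4), (20000, 8)]


def consumers_for_queue(queue_length: int, rules=QUEUE_RULES) -> int:
    """Return the consumer count for the highest threshold that is reached."""
    ordered = sorted(rules)
    desired = ordered[0][1]
    for threshold, consumers in ordered:
        if queue_length >= threshold:
            desired = consumers
    return desired


if __name__ == "__main__":
    print(consumers_for_queue(7500))  # 4
```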

Ability to scale by X number of pods when the rate of traffic in the application is high.

The default autoscaling features scale based on CPU and memory when a certain percentage of resource utilisation is breached. But this does not address the rate at which the resource limits are reached.

At times of high traffic, each container breaches its resource utilisation threshold quickly. This can mean adding pods one at a time in quick succession, and since each container takes its own time to be up and running before it can serve requests, this could affect the SLA of the application.

Wouldn’t it be better to detect that traffic is coming into your application at a high rate and, based on that, scale by X pods rather than by just 1? X can be anything defined in configuration.

You can classify the surge in traffic as low, average or high, and define the value of X for each level. For example:
  • Low: 1
  • Average: 2
  • High: 3

So if the traffic surge is high, the autoscaler adds 3 pods in one step instead of 1.
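A minimal sketch of this in Python could look as follows. The step sizes come from the example above; the growth thresholds (20% and 50% between two consecutive readings) are assumptions made for illustration only.

```python
# Step sizes from the example above.
SURGE_STEP = {"low": 1, "average": 2, "high": 3}


def surge_level(previous_tps: float, current_tps: float) -> str:
    """Classify the traffic surge between two consecutive readings.

    The 20% and 50% growth thresholds are hypothetical.
    """
    if previous_tps <= 0:
        # No baseline to compare with: treat any traffic as a high surge.
        return "high" if current_tps > 0 else "low"
    growth = (current_tps - previous_tps) / previous_tps
    if growth >= 0.5:
        return "high"
    if growth >= 0.2:
        return "average"
    return "low"


def pods_to_add(previous_tps: float, current_tps: float) -> int:
    """Return X, the number of pods to add in one step.

    Intended to be called only once a scale-up has already been decided.
    """
    return SURGE_STEP[surge_level(previous_tps, current_tps)]


if __name__ == "__main__":
    print(pods_to_add(100, 200))  # 3
```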

Once you choose the route of custom autoscaling, the opportunities to scale pods based on different factors are endless. We have discussed 5 reasons to opt for custom autoscaling, but there can be many more.

How to achieve Custom autoscaling?

Most K8s clusters already have a monitoring system which keeps track of the current requests/transactions in the system, e.g. Prometheus or Cavisson. So you need a monitoring system for TPS and other metrics. Combined with the K8s REST endpoints, you are all set to build your own custom autoscaler.

In order to achieve custom autoscaling you need the following:

Metric Fetcher

A component which fetches the metric value (transactions per second, requests per second, queue depth or message lag) from the desired server. This can be Prometheus, a Cavisson server, IBM MQ or Kafka.

You just have to write the piece of code which connects to the server and gets the metric value for you. It keeps fetching the value at a regular interval.
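As one possible shape for a fetcher, the sketch below reads a single value from Prometheus through its HTTP API (`/api/v1/query`). The server address and the metric name `http_requests_total` are assumptions for this example; use whatever your own monitoring system exposes.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url: str, promql: str) -> str:
    """Build the instant-query URL for a Prometheus server."""
    query = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def parse_instant_value(payload: dict) -> float:
    """Extract the first sample value from an instant-query response."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error')}")
    result = payload["data"]["result"]
    if not result:
        return 0.0  # no series matched the query
    # Each sample is [timestamp, "value as string"].
    return float(result[0]["value"][1])


def fetch_metric(base_url: str, promql: str, timeout: float = 5.0) -> float:
    """Fetch the current value of a PromQL expression over HTTP."""
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_instant_value(json.load(resp))


if __name__ == "__main__":
    # Hypothetical server address and metric name.
    print(build_query_url("http://prometheus:9090",
                          "sum(rate(http_requests_total[1m]))"))
```

To fetch at a regular interval, call `fetch_metric` from a loop with `time.sleep` between iterations, or from a scheduler of your choice.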

If you want to build schedule-based scaling, i.e. the pods should scale at a given time of the day, you will not even need the fetcher component.

Instead, create a cron job in a language of your choice which scales the pods at the times specified in the configuration.
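The configuration for such a job can be as simple as a list of start times and replica counts. The sketch below shows the lookup a cron-triggered script might perform; the `SCHEDULE` values are hypothetical.

```python
from datetime import time

# Hypothetical schedule: (start time, replicas from that time onwards).
SCHEDULE = [(time(0, 0), 2), (time(8, 0), 6), (time(20, 0), 3)]


def replicas_at(now: time, schedule=SCHEDULE) -> int:
    """Return the replica count configured for the given time of day."""
    ordered = sorted(schedule)
    # Before the first entry of the day, the last entry of the
    # previous day is still in effect.
    desired = ordered[-1][1]
    for start, replicas in ordered:
        if now >= start:
            desired = replicas
    return desired


if __name__ == "__main__":
    print(replicas_at(time(9, 30)))  # 6
```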

Decision Maker

A component which receives the metric value and decides whether to scale the pods or not. It is responsible for the following:

  • Establishing the connection with your cluster.
  • Identifying the deployment which you want to scale.
  • Holding the rules which you set up for autoscaling, for example 2 pods for 100 TPS, 4 pods for 200 TPS, and so on.
  • Deciding, based on those rules, whether the pods need to be scaled or not.
  • If there is a need to scale, sending a request to Kubernetes via its REST endpoints. This scales the pods of the deployment in question.
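The responsibilities above can be sketched as follows. The rule table uses the 100 TPS and 200 TPS example from the list; the minimum of 1 pod, the API server address, the namespace, the deployment name and the token are placeholders assumed for this sketch. The request targets the `scale` subresource of a Deployment and is only built here, not sent.

```python
import json
import urllib.request

# (TPS threshold, pods) taken from the example rules in the list above.
RULES = [(100, 2), (200, 4)]
MIN_PODS = 1  # assumption: pod count below the first threshold


def desired_pods(tps: float, rules=RULES, min_pods: int = MIN_PODS) -> int:
    """Return the pod count for the highest TPS threshold that is reached."""
    desired = min_pods
    for threshold, pods in sorted(rules):
        if tps >= threshold:
            desired = pods
    return desired


def build_scale_request(api_server: str, namespace: str, deployment: str,
                        replicas: int, token: str) -> urllib.request.Request:
    """Build a PATCH request against the deployment's scale subresource."""
    url = (f"{api_server.rstrip('/')}/apis/apps/v1/namespaces/{namespace}"
           f"/deployments/{deployment}/scale")
    body = json.dumps({"spec": {"replicas": replicas}}).encode()
    return urllib.request.Request(
        url, data=body, method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/merge-patch+json",
        })


if __name__ == "__main__":
    current_pods = 2
    target = desired_pods(250)
    if target != current_pods:
        # Placeholder cluster address, namespace, deployment and token.
        req = build_scale_request("https://k8s.example:6443", "shop",
                                  "backend", target, token="<token>")
        print(req.get_method(), req.full_url)
```

Sending the request is a matter of passing it to `urllib.request.urlopen` together with an SSL context that trusts your cluster's CA. That part is left out here because it depends on how your cluster is set up.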


Persist your configuration in a DB. Save your metric values, alert history and scaling history in it as well.
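Any database will do. As a minimal sketch, here is how the scaling history could be stored in SQLite using Python's standard library. The table name and its columns are assumptions chosen for this example.

```python
import sqlite3
from datetime import datetime, timezone


def open_history(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the (hypothetical) history table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS scaling_history (
               recorded_at  TEXT    NOT NULL,
               deployment   TEXT    NOT NULL,
               metric_name  TEXT    NOT NULL,
               metric_value REAL    NOT NULL,
               old_replicas INTEGER NOT NULL,
               new_replicas INTEGER NOT NULL
           )""")
    return conn


def record_scaling(conn: sqlite3.Connection, deployment: str,
                   metric_name: str, metric_value: float,
                   old_replicas: int, new_replicas: int) -> None:
    """Store one scaling decision with the metric value that triggered it."""
    conn.execute(
        "INSERT INTO scaling_history VALUES (?, ?, ?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), deployment,
         metric_name, metric_value, old_replicas, new_replicas))
    conn.commit()


if __name__ == "__main__":
    conn = open_history()
    record_scaling(conn, "backend", "tps", 250.0, 2, 4)
    print(conn.execute("SELECT COUNT(*) FROM scaling_history").fetchone()[0])
```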

Tip: The data collected over time can be used to pre-scale the pods before a surge of traffic, for example with a model built using TensorFlow.

With the above components in place, you have a custom autoscaling system. It is basically an application, written in the language of your choice, which fetches the metric value from a server, processes it to decide whether scaling is needed, and passes the request to scale the pods to the K8s server. This gives you the flexibility to control everything, i.e. when to scale, which metric to scale on, and by how many pods.


With the extensibility provided by K8s via its REST endpoints, you can develop many features on top of it.
We have seen how custom autoscaling can be done with Kubernetes, but that is only one of many possibilities.

What are you building on top of K8s to help your customers and organisation? Let us know in the comments section.
