Booking system and auto scaling

A key issue that we are working to address is how to allocate a finite set of compute resources between end users, enabling multiple users to run multiple data analysis tasks on the platform while still getting a reasonable level of responsiveness in the interactive user interface.

The simplest way of implementing resource sharing is to have some form of job queue, where all jobs get added to the queue and processed in sequence. This works for a batch mode processing system, but is less suitable for an interactive notebook system, where the user is watching the program execute and waiting for the results to display on their screen.
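Purely as an illustration of that contrast, a batch-mode queue of this kind amounts to little more than a first-in-first-out worker loop; the job objects and worker below are hypothetical placeholders, not part of our platform.

```python
import queue
import threading

job_queue = queue.Queue()  # jobs wait here until the single worker is free

def worker():
    # Process jobs strictly in the order they were submitted.
    while True:
        job = job_queue.get()      # blocks until a job is available
        try:
            job()                  # run the analysis task
        finally:
            job_queue.task_done()

threading.Thread(target=worker, daemon=True).start()

# An interactive notebook cell submitted this way simply waits its turn,
# which is why a plain queue feels unresponsive to the user watching it.
job_queue.put(lambda: print("running analysis task"))
job_queue.join()
```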

The first phase of the project has focussed on developing a resource booking system, enabling end users to reserve computing resources for a period of time. This allows users to plan when they want to use the system and be able to get a predictable level of compute resources allocated to them.
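A minimal sketch of what one reservation record in such a booking system might look like; the field names and values here are illustrative assumptions rather than the project's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Booking:
    """One user's reservation of compute resources for a fixed period."""
    user: str
    start: datetime     # when the reserved resources become available
    end: datetime       # when they are released back to the shared pool
    cpu_cores: int      # size of the allocation requested
    memory_gb: int

# Example: reserve 8 cores and 32 GB for a two-hour analysis session.
booking = Booking(
    user="alice",
    start=datetime(2020, 11, 9, 14, 0),
    end=datetime(2020, 11, 9, 16, 0),
    cpu_cores=8,
    memory_gb=32,
)
```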

The aggregate numbers from the booking system enable the system administrators to see changes in the number of bookings and predict the level of resources that will be needed to meet demand over the next few days. This in turn enables the system administrators to request additional resources from the underlying cloud compute platform to cover periods of high demand.
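For example, summing the reservations that are active on each upcoming day gives a rough demand forecast to compare against the current quota; this sketch reuses the hypothetical `Booking` record shown above.

```python
from datetime import date, timedelta

def forecast_cores(bookings, days=7):
    """Total CPU cores reserved by bookings active on each of the next `days` days."""
    today = date.today()
    forecast = {}
    for offset in range(days):
        day = today + timedelta(days=offset)
        active = [b for b in bookings
                  if b.start.date() <= day <= b.end.date()]
        forecast[day] = sum(b.cpu_cores for b in active)
    return forecast

# Administrators compare the forecast against the current quota and
# request extra capacity from the cloud platform where it falls short.
```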

Phase II of the project will look at extending this functionality in a number of ways.

Extending the booking system to enable users to reserve blocks of resources will enable workshop tutors to reserve enough resources for their tutorial, ensuring that the whole class is able to use the system interactively at the same time.

Extending the booking system to enable users to specify more detail about the types of resources they can book will enable power users to tailor their bookings to meet the specific needs of the analysis tasks they want to run.
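One way such detail could be expressed is a resource specification attached to the booking; the fields shown below (GPUs, Spark executors, a free-form extras map) are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class ResourceSpec:
    """Details of the resources a power user wants to reserve."""
    cpu_cores: int
    memory_gb: int
    gpu_count: int = 0            # e.g. GPU-accelerated analysis
    spark_executors: int = 0      # size of the Spark cluster wanted
    extra: dict = field(default_factory=dict)  # anything platform-specific

@dataclass
class DetailedBooking:
    user: str
    start: datetime
    end: datetime
    resources: ResourceSpec

booking = DetailedBooking(
    user="bob",
    start=datetime(2020, 11, 10, 9, 0),
    end=datetime(2020, 11, 10, 17, 0),
    resources=ResourceSpec(cpu_cores=64, memory_gb=256,
                           gpu_count=2, spark_executors=16),
)
```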

Extending the booking system to integrate with the auto-scaling features available on the underlying cloud compute platform will allow the pool of resources itself to grow and shrink. The current system assumes a fixed allocation of compute resources and distributes these between the end-user reservations.

We have experimented with using auto-scaling provisioning of virtual machines to add nodes to the Kubernetes cluster in response to variation in demand. However, our initial experiments showed that a standard integration of the auto-scaling tools provided by the Openstack platform was not fast enough to respond to the very volatile changes in load produced by an interactive system.

The load is almost zero while the user edits their notebook, demand peaks as soon as they click the [run] button to initiate a Spark job running on all the nodes in the cluster, and then immediately drops back to zero once the analysis has completed.

The existing auto-scaling system available in the current Openstack/Kubernetes platform takes several seconds, sometimes minutes, to start up a new virtual machine and connect it to the Kubernetes cluster. This works for a large-scale system with thousands of users, where the variation in load averages out and the system has time to respond to changes in demand, but it is less effective at coping with the rapid changes in demand that our interactive system produces.
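As an illustration of where that delay comes from, the sketch below uses the openstacksdk to add a single virtual machine and waits for it to become active; the cloud name, image, flavour and network IDs are placeholders, and joining the new node to the Kubernetes cluster would add further time on top of the boot time measured here.

```python
import time
import openstack   # openstacksdk

conn = openstack.connect(cloud="mycloud")   # placeholder cloud name

started = time.monotonic()
server = conn.compute.create_server(
    name="k8s-worker-extra",
    image_id="IMAGE_UUID",             # placeholder
    flavor_id="FLAVOR_UUID",           # placeholder
    networks=[{"uuid": "NET_UUID"}],   # placeholder
)
# Boot time alone is typically tens of seconds to minutes, before the
# node has even started joining the Kubernetes cluster.
server = conn.compute.wait_for_server(server)
print(f"server ACTIVE after {time.monotonic() - started:.0f}s")
```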

To resolve this problem we plan to use the information available in our booking system to predict the expected demand and allocate compute resources accordingly before they are needed.
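A minimal sketch of the intended approach, assuming the hypothetical booking records above, a fixed node size, and a periodic task that provisions nodes: look a short period ahead in the booking list and bring up capacity before the reservations start.

```python
from datetime import datetime, timedelta

BOOT_LEAD_TIME = timedelta(minutes=15)   # assumed time to provision a node
CORES_PER_NODE = 16                      # assumed worker node size

def extra_nodes_needed(bookings, current_nodes):
    """How many extra worker nodes to start now so that bookings
    beginning within the lead time find their resources ready."""
    now = datetime.utcnow()
    horizon = now + BOOT_LEAD_TIME
    cores_reserved = sum(
        b.cpu_cores for b in bookings
        if b.start <= horizon and b.end >= now
    )
    nodes_required = -(-cores_reserved // CORES_PER_NODE)  # ceiling division
    return max(0, nodes_required - current_nodes)

# A periodic task would call this and, for each extra node needed, invoke
# a provisioning helper such as the openstacksdk sketch shown earlier.
```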

Initially the compute resources would be created from the fixed quota allocated to our project on the cloud compute platform (based on resource allocation requests like this).

The data from the booking system will also enable us to predict periods of very high demand, when the number of people requesting bookings indicates that we will need to request additional resources from the cloud platform.

Currently this means a manual process of negotiation between our project and the cloud platform providers to increase the quotas allocated to our project. We plan to work with the cloud platform providers to automate this process as much as possible.

At the UKRI Cloud Working Group meeting in March 2020[1], StackHPC outlined their plans to deploy the Blazar[2] resource booking component on the Cambridge Openstack platform.

We understand that StackHPC have received approval for this to go ahead in 2021, and we look forward to working with StackHPC to explore ways to integrate our end-user booking system with the Blazar resource booking system to automate the process of on demand resource allocation as much as possible.
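The shape of that integration might look something like the request below, which creates a Blazar lease covering a booked period via Blazar's REST API; the endpoint, token, and reservation parameters are placeholder assumptions rather than a tested integration with the Cambridge platform.

```python
import requests

BLAZAR_URL = "https://cloud.example.org:1234/v1"   # placeholder endpoint
TOKEN = "KEYSTONE_TOKEN"                           # placeholder auth token

lease = {
    "name": "workshop-tutorial",
    "start_date": "2021-03-15 09:00",
    "end_date": "2021-03-15 17:00",
    "reservations": [{
        "resource_type": "physical:host",   # reserve whole hypervisors
        "min": 2,
        "max": 2,
        "hypervisor_properties": "",
        "resource_properties": "",
    }],
    "events": [],
}

response = requests.post(
    f"{BLAZAR_URL}/leases",
    json=lease,
    headers={"X-Auth-Token": TOKEN},
)
response.raise_for_status()
print(response.json())
```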

The goal is to have a system that uses the available resources as efficiently as possible: using the booking system to predict the level of demand, allocating the required resources so that they are ready when needed, and returning them to the system when they are no longer required.

[1] https://cloud.ac.uk/workshops/mar2020/autoscaling-reservation-contention-and-preemption-the-coral-reef-cloud/

[2] https://docs.openstack.org/blazar/latest/