---
title: "Preemptible Instances in production at CERN"
date: 2020-09-28T13:00:00+02:00
author: Belmiro Moreira, Theodoros Tsioutsias
tags: ["openstack", "nova", "preemptibles", "aardvark"]
---
Cloud providers need to ensure that they have enough capacity available when users
request it. As a result, they need to keep spare capacity: unused servers. In other
words, they face exactly the same problem that they are trying to solve for their
IaaS customers.
Amazon Web Services (AWS) was the first public cloud provider to address this challenge.
In 2009 AWS released the "Spot Instances" marketplace. The idea is that unused server
capacity can be sold at a massive discount (up to 90% when compared with On-Demand
instances) with the caveat that the instances can be terminated at any time when AWS needs
the capacity for On-Demand or Reserved instances. For users this translates into massive
discounts for low-SLA instances.
Spot instances not only solved the AWS problem with unused capacity but also became
a very popular approach for workloads that are stateless and fault-tolerant.
All the other large public cloud providers followed this “spare capacity” model and
today most of them offer a similar solution. Google Cloud launched Preemptible VMs,
Azure launched Low-priority VMs and Alibaba Cloud launched ECS Spot Market. These are
very similar products and a clever way for public cloud providers to monetize
their spare capacity.
The CERN private cloud infrastructure has a similar challenge with spare capacity.
However, CERN doesn’t charge the Organization's users like public cloud providers do.
This means that the pure "spot instance" model can't be applied to our Cloud and
we needed to find a different approach.
When new resources are added to the Cloud, the new capacity is distributed per project
using quotas to ensure that the resources are allocated fairly, considering the project
scope and its importance/priority for the Organization.
Quotas are hard limits, and if projects are not using all the resources allowed by
their quota, we have spare capacity. At the same time, as a research organization we have
plenty of projects that would benefit from these additional resources; however, this
capacity can't easily be allocated to other projects without serious compromises.
- We can manually reduce the quota of the projects that are not using all their allocated
capacity and allocate it to other projects. However, it is likely that they will need the
removed capacity in the near future, and then it may be difficult to get it back.
- Another possibility is to overcommit quota and assume that not all projects will
use all their allocated quota at the same time. This is a strong and dangerous assumption
for our use cases.
The CERN Cloud Infrastructure has more than 3500 users and more than 5300 projects.
Maximizing cloud resource utilization when quotas are used to limit resource
consumption is a very difficult problem to solve.
Inspired by the spot instances model, we decided to develop a similar solution that
allows projects that have already exhausted their quota to continue to provision
low-SLA instances ("preemptible instances") that can be terminated at any time when
that capacity is required by projects that still have unused quota.
CERN Cloud Infrastructure is deployed using OpenStack. OpenStack doesn't offer a
native solution to provision preemptible instances. After discussing with the Nova
team how preemptible instances could be integrated, it was concluded that this
functionality should be managed by an external tool.
We started to develop a tool that can interact with Nova and Placement to manage
preemptible instances. We called it "Aardvark".
## Aardvark
Currently, OpenStack does not support preemptible instances. The prerequisites for
providing preemptible instances with OpenStack can be summed up as follows:
- Tagging instances as preemptible: in order to distinguish them from normal
instances, preemptible instances need to be tagged at creation time with an immutable
property.
- Access to preemptible instances: operators should be able to control which users/tenants
are allowed to spawn and use preemptible instances.
- Control of preemptible resources (optional): operators should have the ability to limit the
resources each user can use for creating preemptible instances, to avoid misuse of the
functionality. This is optional and depends on the financial model of each use case.
For example, it would be useful to control the preemptible resources in private clouds.
But in the case of public clouds, where users pay for the resources they use, this
is not relevant.
Since there was no service in the OpenStack environment providing support for preemptible
instances, we decided to design and develop Aardvark. Aardvark is an orchestrator for
preemptible instances. It is a fully compliant OpenStack service developed using the
community-provided tools and libraries.
Aardvark code is available at: https://gitlab.cern.ch/cloud-infrastructure/aardvark
### General
{{< figure src="../../img/post/aardvarkflow.png"
caption="Aardvark workflow"
class="caption"
>}}
We tried to address the requirements mentioned above in the following ways:
- Tagging instances as preemptible: instances inherit this property from the project to
which they belong (an illustrative example of setting such a property follows this list).
- Access to preemptible instances: by using preemptible tenants, operators can control which
users are able to spawn preemptible instances.
- Control of preemptible resources (optional): if this applies to the use case, the operator
can enforce quotas on the preemptible tenants to limit the resources each user can use
for preemptible instances.
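
As an illustration of the first point, a project could be marked with an extra property
through the Keystone API. The property name `preemptible`, the project name and the use of
python-keystoneclient below are assumptions made for the sake of the example; Aardvark's
actual way of identifying preemptible projects may differ.

```python
# Illustrative sketch only: mark a project with an extra "preemptible" property.
# The property name and this mechanism are assumptions, not Aardvark's actual code.
from keystoneauth1.identity import v3
from keystoneauth1 import session
from keystoneclient.v3 import client

auth = v3.Password(auth_url='https://keystone.example.org/v3',  # assumed endpoint
                   username='admin', password='secret',
                   project_name='admin',
                   user_domain_id='default', project_domain_id='default')
keystone = client.Client(session=session.Session(auth=auth))

project = keystone.projects.find(name='batch-preemptible')  # hypothetical project
# Keystone stores unknown keyword arguments as extra project attributes.
keystone.projects.update(project, preemptible=True)
```
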
The orchestrator is the component that knows where preemptible instances are located and also
knows the way that resources should be made available. So the idea is that the orchestrator
receives an event, translates the event into a resource request and then tries to free up the
resources needed.
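
At a very high level, that flow could be sketched as below. Every name here is a
hypothetical placeholder used only to illustrate the event-to-resource-request translation;
it is not Aardvark's real API.

```python
# High-level pseudocode of the flow described above; all names are placeholders.

def handle_event(event, strategy, delete_instance, rebuild_instance):
    """Translate an event into a resource request and free up capacity.

    `event` is assumed to carry the missing resources as a dict,
    e.g. {"VCPU": 4, "MEMORY_MB": 8192}, plus the UUID of the failed server.
    """
    request = event["missing_resources"]

    # Ask the configured strategy which preemptible instances to delete.
    victims = strategy(request)

    # Free the resources, then rebuild the server that failed to schedule
    # (this is what happens in the Nova-triggered mode described below).
    for victim in victims:
        delete_instance(victim)
    rebuild_instance(event["instance_uuid"])
```
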
### Strategies
To free up the resources needed, Aardvark uses the configured strategy. Currently, there are
two strategies:
1. Random strategy
Randomly selects preemptible VMs until there are enough resources to accommodate the request
(see the sketch after this list)
2. Strict strategy
Tries to find the best-fitting combination of preemptible VMs, leaving as few idle resources
as possible
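
A minimal sketch of what a random strategy could look like, assuming each preemptible
instance is represented by a dict of its resource usage; this only illustrates the idea
and is not Aardvark's implementation.

```python
import random

def random_strategy(preemptibles, request):
    """Pick preemptible instances at random until the request fits.

    `preemptibles` maps an instance UUID to its resource usage,
    e.g. {"uuid-1": {"VCPU": 2, "MEMORY_MB": 4096}, ...}; `request`
    is the missing amount per resource class. Both shapes are
    assumptions made for this sketch.
    """
    missing = dict(request)
    victims = []
    candidates = list(preemptibles.items())
    random.shuffle(candidates)
    for uuid, usage in candidates:
        if all(amount <= 0 for amount in missing.values()):
            break
        victims.append(uuid)
        for resource_class, amount in usage.items():
            if resource_class in missing:
                missing[resource_class] -= amount
    if any(amount > 0 for amount in missing.values()):
        return []  # not enough preemptible resources to satisfy the request
    return victims
```
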
### Notifiers
A pluggable framework of notifiers was implemented for Aardvark. This way the orchestrator can
notify users as well as the operator regarding actions taken:
1. Log notifier
Logs the actions taken while processing a resource request
2. Email notifier
Sends emails to the owners of preemptible VMs informing them about the deletion of their VMs.
If configured to do so, this notifier also notifies the operator about failures, including
useful information for debugging them.
3. Oslo notifier
Emits oslo notifications regarding Aardvark actions to the configured queue.
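
For example, emitting an oslo notification with oslo.messaging could look roughly like the
sketch below; the publisher id, event type and payload are made up for illustration and are
not Aardvark's actual notification schema.

```python
# Minimal sketch of emitting an oslo notification; event type and payload are
# illustrative only, not Aardvark's real notification schema.
from oslo_config import cfg
import oslo_messaging

transport = oslo_messaging.get_notification_transport(cfg.CONF)
notifier = oslo_messaging.Notifier(transport,
                                   publisher_id='aardvark',
                                   driver='messagingv2',
                                   topics=['notifications'])
notifier.info({}, 'aardvark.preemptible.terminated',
              {'instance_uuid': 'example-uuid',
               'reason': 'resources reclaimed for a non-preemptible request'})
```
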
### Supported Modes
Aardvark supports two modes of operation that can be configured depending on the operator’s
needs.
1. Nova-triggered mode
In this mode of operation, Aardvark gets triggered when the scheduling of a new instance
fails due to lack of resources. Upon being triggered, Aardvark analyzes the request
that failed, calculates the resources needed for the new server and tries to gather these
resources by deleting preemptible instances. Then Aardvark rebuilds the instance that
failed in the first place.
2. Watermark mode
In this mode of operation, operators configure the maximum level of utilization per
resource class (e.g. 95% usage). Aardvark periodically checks the current utilization level
for each resource. If the utilization threshold is reached, then Aardvark tries to lower
the utilization level by deleting preemptible instances (a sketch of such a check follows
below).
There is also a third, complementary mode of operation that can be combined with the two
previously mentioned modes. Here, the operator can configure a maximum lifespan for
preemptible instances, and Aardvark will periodically check for and delete long-lived
preemptible instances.
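
A minimal sketch of the kind of periodic check the watermark mode performs, assuming
hypothetical helpers `get_capacity_and_usage()` (per-resource-class totals gathered from
Placement) and `free_up()` (which applies the configured strategy); none of these names
come from Aardvark itself.

```python
# Hypothetical sketch of a watermark-style check; the helper names are assumptions.
WATERMARK = 0.95  # e.g. trigger when a resource class is more than 95% used

def watermark_check(get_capacity_and_usage, free_up):
    for resource_class, (capacity, used) in get_capacity_and_usage().items():
        utilization = used / capacity
        if utilization > WATERMARK:
            # Reclaim just enough to bring utilization back under the watermark.
            excess = used - int(WATERMARK * capacity)
            free_up(resource_class, excess)
```
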
### Flexible Configuration
Aardvark started out as a prototype, so our goal was to make it as abstract and as easy to
extend as we could. It has pluggable strategies, so operators can choose what
matches their use case, as well as a set of notifiers to inform users and operators about
actions performed by the service.
At the same time, operators can choose from different deployment options depending on their
needs:
1. Single binary
All functionality is in one binary.
2. Two binaries
One binary for the Service Manager, running the periodic tasks and the notification listener,
and one binary for the Reaper Manager, spawning the worker threads.
This deployment option needs a backend for task scheduling (e.g. ZooKeeper or Redis).
## Nova Changes
Aardvark was designed to be integrated with Nova in the least painful way, but we still
needed to introduce some changes on the Nova side.
### Hard deleting an instance
This change was needed in order to be able to remove all information regarding a failed
instance before rebuilding it. It is already merged upstream in OpenStack Nova.
### Pending state
When scheduling fails for an instance, it goes to the ERROR state. Since Aardvark gets triggered
at that point, we added a new state called “Pending” in which instances are placed while waiting
for the orchestrator to finish its processing. Depending on the outcome, Aardvark rebuilds the
instance or resets its state to ERROR.
The spec for the change that was submitted upstream can be found here:
https://review.opendev.org/#/c/648687/
### Rebuild PENDING instances
This change is needed in order to reschedule instances that were marked as Pending by Nova due
to lack of resources. This way, Aardvark gets triggered for a Pending instance, makes some
resources available by deleting preemptible VMs and then rebuilds the instance that failed.
The spec for the change that was submitted upstream can be found here:
https://review.opendev.org/#/c/648686/
## Moving Preemptible Instances to Production at CERN
Introducing a new project in a production Cloud is always challenging, and we were very careful
not to affect Cloud availability. The change was almost transparent to the users.
Users can now see their instances going to a "PENDING" state when Placement can't find
available resources, before Aardvark either sets them to "ERROR" or rebuilds them after deleting
the required preemptible instances.
### Technical Changes
#### Nova
We needed to patch the Nova code to introduce the "Pending" state and to rebuild "Pending" instances.
It is painful to maintain these code changes in a production infrastructure, but it was the only way to
make preemptible instances available in our Cloud. We hope this functionality can be introduced upstream soon.
#### Notifications
Aardvark listens for notifications from Nova. A first change that we needed to deploy was
to have separate RabbitMQ instances for the different regions of the CERN cloud. This was needed
because Aardvark is not region-aware and we previously had a common notification infrastructure
shared between regions.
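
To give an idea of what consuming Nova's versioned notifications looks like with
oslo.messaging, here is a minimal, self-contained sketch; the topic name, the event filter
and the payload handling are assumptions for illustration, not Aardvark's actual code.

```python
# Minimal sketch of an oslo.messaging notification listener; topic, filter and
# payload handling are illustrative assumptions, not Aardvark's real code.
import time

from oslo_config import cfg
import oslo_messaging


class InstanceUpdateEndpoint(object):
    # Only react to instance.update versioned notifications (assumed filter).
    filter_rule = oslo_messaging.NotificationFilter(event_type='instance.update')

    def info(self, ctxt, publisher_id, event_type, payload, metadata):
        # Versioned notification payloads carry the data under 'nova_object.data'.
        data = payload.get('nova_object.data', {})
        print('received %s for instance in state %s'
              % (event_type, data.get('state')))


transport = oslo_messaging.get_notification_transport(cfg.CONF)
targets = [oslo_messaging.Target(topic='aardvark_notifications')]  # example topic
listener = oslo_messaging.get_notification_listener(
    transport, targets, [InstanceUpdateEndpoint()], executor='threading')
listener.start()
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    listener.stop()
    listener.wait()
```
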
#### Aardvark Infrastructure
Currently Aardvark is deployed in a Kubernetes cluster. This choice was made to enable us to
iterate quickly. The cluster has five nodes with 16 pods running Aardvark. We
are using GitLab CI to build the images and validate each change.
#### Nova Config Options
Required Nova configuration (an example nova.conf snippet follows this list):
1. Notifications driver
The messagingv2 driver should be selected on the Nova side in order to send notifications in the
format that Aardvark recognizes.
2. Sending versioned notifications
Nova should be configured to send versioned notifications and not the legacy unversioned ones.
3. Versioned notifications topics
Operators should specify the topic for the versioned notifications. The nice thing here is that
Nova can send notifications to more than one topic. In our case, notifications were already
consumed by another service in our cloud, so we just added a second topic where Aardvark is the
only listener.
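
Putting the three options together, a nova.conf could look roughly like the snippet below;
the extra topic name `aardvark_notifications` is only an example, not the topic used at CERN.

```ini
# Example nova.conf snippet; the extra topic name is illustrative only.
[oslo_messaging_notifications]
driver = messagingv2

[notifications]
notification_format = versioned
versioned_notifications_topics = versioned_notifications,aardvark_notifications
```
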
### Preemptible Instances Users
The Batch service is the main user of preemptible instances in the CERN cloud. Initially, we
created a preemptible project that could spawn servers wherever there was free capacity in the
cloud. After a few days, we noticed an increase in CPU steal time in some service VMs, as described
below. To mitigate this issue, the project is now mapped to a part of the infrastructure where
there is low or even no overcommit of resources and no service VMs.
## Current Issues
### Long time in PENDING state...
The size of the Cloud Infrastructure can have a huge impact on the amount of time that
Aardvark takes to make a decision on deleting a preemptible instance to make
resources available for the new instance. This is not related to the calculations
required to assess which preemptible instances should be deleted, but to the number
of API calls that need to be performed against the Placement API to evaluate all the
resource providers.
With thousands of resource providers, this evaluation can take a few seconds.
We overcame this problem by allowing Aardvark to read
directly from the Placement database. As you can imagine, this considerably speeds up
the whole Aardvark evaluation.
Aardvark only uses a read-only user when accessing the Placement database, and this approach
should only be considered in infrastructures with thousands of nodes.
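
To illustrate where those API calls come from, the sketch below walks all resource providers
and fetches the usage of each one through the Placement API (one call per provider) using
keystoneauth; the endpoint, credentials and microversion are placeholders, and this is not
Aardvark's actual code.

```python
# Illustrative sketch of per-resource-provider Placement queries; credentials,
# endpoint and microversion are placeholders, not Aardvark's actual code.
from keystoneauth1.identity import v3
from keystoneauth1 import session

auth = v3.Password(auth_url='https://keystone.example.org/v3',  # assumed endpoint
                   username='aardvark', password='secret',
                   project_name='services',
                   user_domain_id='default', project_domain_id='default')
sess = session.Session(auth=auth)

placement = {'service_type': 'placement'}
headers = {'OpenStack-API-Version': 'placement 1.17'}  # example microversion

rps = sess.get('/resource_providers',
               endpoint_filter=placement,
               headers=headers).json()['resource_providers']

# One usages call per resource provider: with thousands of providers this adds up,
# which is why reading directly from the Placement database is much faster.
for rp in rps:
    usages = sess.get('/resource_providers/%s/usages' % rp['uuid'],
                      endpoint_filter=placement,
                      headers=headers).json()['usages']
    print(rp['name'], usages)
```
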
### CPU steal
Our biggest use case for preemptible instances is batch instances that process the data
from the Large Hadron Collider (LHC) experiments. This workload is stateless and fault
tolerant, ideal for preemptible instances. It is also extremely CPU intensive.
The resources that usually have available capacity are allocated for service instances,
which are usually not CPU intensive. For that reason, we overcommit the CPU of those resources.
When we introduced preemptible instances in the infrastructure, we were targeting these
available resources with the batch instances. However, we soon learned that this workload
was not compatible with the level of CPU overcommit that we were using.
Existing service instances started to report high CPU steal, creating a bad user experience.
For that reason, we are now limiting preemptible instances to a few cells.
## Conclusion and Next Steps
We have come a long way since we started the development of this project.
The project was initially sponsored by Huawei and done in collaboration with
SKA (Square Kilometre Array) and many others from the OpenStack Scientific SIG.
The prototype was presented at several conferences and we got great feedback from the community.
- Future Science on Future OpenStack developing next generation infrastructure at CERN and SKA (Sydney - 2017)
https://www.youtube.com/watch?v=XmQR06Mwd5g&t
- Containers on Baremetal and Preemptible VMs at CERN and SKA (Vancouver - 2018)
https://www.youtube.com/watch?v=K5N4LYrupSs
- Science Demonstrations Preemptible Instances at CERN and Bare Metal Containers for HPC at SKA (Berlin - 2018)
https://www.youtube.com/watch?v=d-qO1knInHM&t
We feel confident that this project can help a lot of different infrastructures to better
use their spare capacity.
Of course, there is still work to be done!
We are thinking about how we can limit the CPU steal time when extremely CPU-intensive
workloads are deployed as preemptible instances in an infrastructure that is configured
to allow CPU overcommit. Basically, how can we run preemptible instances in the part of our
infrastructure dedicated to services?
The solution that we are designing is to only allow the deployment of preemptible instances
on resources that are not overcommitted and, if other instances are deployed there,
to remove the preemptible instances as soon as the compute node becomes overcommitted.
This has been a fun project to develop.
If you have suggestions or you would like to collaborate, don't hesitate to contact us.