---
title: "Preemptible Instances in production at CERN"
date: 2020-09-28T13:00:00+02:00
author: Belmiro Moreira, Theodoros Tsioutsias
tags: ["openstack", "nova", "preemptibles", "aardvark"]
---
Cloud providers need to ensure that they have enough capacity available when users
request it. As a result, they need to keep spare capacity: unused servers. In other
words, they face exactly the same problem that they are trying to solve for their
IaaS customers.
Amazon Web Services (AWS) was the first public cloud provider to address this challenge.
In 2009 AWS released the "Spot Instances" marketplace. The idea is that unused server
capacity can be sold at a massive discount (up to 90% when compared with On-Demand
instances) with the caveat that the instances can be terminated at any time when AWS needs
the capacity for On-Demand or Reserved instances. For users this translates into massive
discounts for low-SLA instances.
Spot instances not only solved the AWS problem with unused capacity but also became
a very popular approach for workloads that are stateless and fault-tolerant.
All the other large public cloud providers followed this “spare capacity” model and
today most of them offer a similar solution. Google Cloud launched Preemptible VMs,
Azure launched Low-priority VMs and Alibaba Cloud launched ECS Spot Market. These are
very similar products and a clever way for public cloud providers to monetize
their spare capacity.
The CERN private cloud infrastructure has a similar challenge with spare capacity.
However, CERN doesn’t charge the Organization's users like public cloud providers do.
This means that the pure "spot instance" model can't be applied to our Cloud and
we needed to find a different approach.
When new resources are added to the Cloud, the new capacity is distributed per project
using quotas to ensure that the resources are allocated fairly, considering the project
scope and its importance/priority for the Organization.
Quotas are hard limits, and if projects are not using all the resources allowed by
their quota, we have spare capacity. At the same time, as a research organization we have
plenty of projects that would benefit from these additional resources; however, this
capacity can't easily be allocated to other projects without serious compromises.
- We can manually reduce the quota of the projects that are not using all their allocated
capacity and allocate it to other projects. However, it is likely that they will need the
removed capacity in the near future, and then it may be difficult to get it back.
- Another possibility is to overcommit quota and assume that not all projects will
use all their allocated quota at the same time. This is a strong and dangerous assumption
for our use cases.
The CERN Cloud Infrastructure has more than 3500 users and more than 5300 projects.
Maximizing cloud resource utilization when quotas are used to limit resource
consumption is a very difficult problem to solve.
Inspired by the spot instances model, we decided to develop a similar solution that
allows projects that have already exhausted their quota to continue to provision
low-SLA instances ("preemptible instances") that can be terminated at any time when
that capacity is required by projects that still have unused quota.
CERN Cloud Infrastructure is deployed using OpenStack. OpenStack doesn't offer a
native solution to provision preemptible instances. After discussing with the Nova
team how preemptible instances could be integrated, it was concluded that this
functionality should be managed by an external tool.
We started to develop a tool that can interact with Nova and Placement to manage
preemptible instances. We called it "Aardvark".
## Aardvark
Currently, OpenStack does not support preemptible instances. The prerequisites for
providing preemptible instances with OpenStack can be summed up as follows:
- Tagging instances as preemptible: in order to distinguish them from normal
instances, preemptible instances need to be tagged at creation time with an immutable
property.
- Access to preemptible instances: operators should be able to control which users/tenants
are allowed to spawn and use preemptible instances.
- Control of preemptible resources (optional): operators should have the ability to limit the
resources each user can use for creating preemptible instances, to avoid misuse of the
functionality. This is optional and depends on the financial model of each use case.
For example, it would be useful to control the preemptible resources in private clouds.
But in the case of public clouds, where users pay for the resources they use, this
is not relevant.
Since there was no service in the OpenStack environment providing support for preemptible
instances, we decided to design and develop Aardvark. Aardvark is an orchestrator for
preemptible instances. It is a fully compliant OpenStack service developed using the
community-provided tools and libraries.
Aardvark code is available at: https://gitlab.cern.ch/cloud-infrastructure/aardvark
### General
{{< figure src="../../img/post/aardvarkflow.png"
caption="Aardvark workflow"
class="caption"
>}}
We tried to address the requirements mentioned above in the following ways:
- Tagging instances as preemptible: instances inherit this property from the project to
which they belong (an illustrative example of setting such a property follows this list).
- Access to preemptible instances: by using preemptible tenants, operators can control which
users are able to spawn preemptible instances.
- Control of preemptible resources (optional): if this applies to the use case, the operator
can enforce quotas on the preemptible tenants to limit the resources each user can use
for preemptible instances.
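
As an illustration of the first point, a project could be marked with an extra property
through the Keystone API. The property name `preemptible`, the project name and the use of
python-keystoneclient below are assumptions made for the sake of the example; Aardvark's
actual way of identifying preemptible projects may differ.

```python
# Illustrative sketch only: mark a project with an extra "preemptible" property.
# The property name and this mechanism are assumptions, not Aardvark's actual code.
from keystoneauth1.identity import v3
from keystoneauth1 import session
from keystoneclient.v3 import client

auth = v3.Password(auth_url='https://keystone.example.org/v3',  # assumed endpoint
                   username='admin', password='secret',
                   project_name='admin',
                   user_domain_id='default', project_domain_id='default')
keystone = client.Client(session=session.Session(auth=auth))

project = keystone.projects.find(name='batch-preemptible')  # hypothetical project
# Keystone stores unknown keyword arguments as extra project attributes.
keystone.projects.update(project, preemptible=True)
```
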
The orchestrator is the component that knows where preemptible instances are located and also
knows the way that resources should be made available. So the idea is that the orchestrator
receives an event, translates the event into a resource request and then tries to free up the
resources needed.
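
At a very high level, that flow could be sketched as below. Every name here is a
hypothetical placeholder used only to illustrate the event-to-resource-request translation;
it is not Aardvark's real API.

```python
# High-level pseudocode of the flow described above; all names are placeholders.

def handle_event(event, strategy, delete_instance, rebuild_instance):
    """Translate an event into a resource request and free up capacity.

    `event` is assumed to carry the missing resources as a dict,
    e.g. {"VCPU": 4, "MEMORY_MB": 8192}, plus the UUID of the failed server.
    """
    request = event["missing_resources"]

    # Ask the configured strategy which preemptible instances to delete.
    victims = strategy(request)

    # Free the resources, then rebuild the server that failed to schedule
    # (this is what happens in the Nova-triggered mode described below).
    for victim in victims:
        delete_instance(victim)
    rebuild_instance(event["instance_uuid"])
```
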
### Strategies
To free up the resources needed, Aardvark uses the configured strategy. Currently, there are
two strategies:
1. Random strategy
Randomly selects preemptible VMs until there are enough resources to accommodate the request
(see the sketch after this list)
2. Strict strategy
Tries to find the best-fitting combination of preemptible VMs, leaving as few idle resources
as possible
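
A minimal sketch of what a random strategy could look like, assuming each preemptible
instance is represented by a dict of its resource usage; this only illustrates the idea
and is not Aardvark's implementation.

```python
import random

def random_strategy(preemptibles, request):
    """Pick preemptible instances at random until the request fits.

    `preemptibles` maps an instance UUID to its resource usage,
    e.g. {"uuid-1": {"VCPU": 2, "MEMORY_MB": 4096}, ...}; `request`
    is the missing amount per resource class. Both shapes are
    assumptions made for this sketch.
    """
    missing = dict(request)
    victims = []
    candidates = list(preemptibles.items())
    random.shuffle(candidates)
    for uuid, usage in candidates:
        if all(amount <= 0 for amount in missing.values()):
            break
        victims.append(uuid)
        for resource_class, amount in usage.items():
            if resource_class in missing:
                missing[resource_class] -= amount
    if any(amount > 0 for amount in missing.values()):
        return []  # not enough preemptible resources to satisfy the request
    return victims
```
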
### Notifiers
A pluggable framework of notifiers was implemented for Aardvark. This way the orchestrator can
notify users as well as the operator regarding actions taken:
1. Log notifier
Logs the actions taken while processing a resource request
2. Email notifier
Sends emails to the owners of preemptible VMs informing them about the deletion of their VMs.
If configured to do so, this notifier also notifies the operator about failures, including
useful information for debugging them.
3. Oslo notifier
Emits oslo notifications regarding Aardvark actions to the configured queue.
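
For example, emitting an oslo notification with oslo.messaging could look roughly like the
sketch below; the publisher id, event type and payload are made up for illustration and are
not Aardvark's actual notification schema.

```python
# Minimal sketch of emitting an oslo notification; event type and payload are
# illustrative only, not Aardvark's real notification schema.
from oslo_config import cfg
import oslo_messaging

transport = oslo_messaging.get_notification_transport(cfg.CONF)
notifier = oslo_messaging.Notifier(transport,
                                   publisher_id='aardvark',
                                   driver='messagingv2',
                                   topics=['notifications'])
notifier.info({}, 'aardvark.preemptible.terminated',
              {'instance_uuid': 'example-uuid',
               'reason': 'resources reclaimed for a non-preemptible request'})
```
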
### Supported Modes
Aardvark supports two modes of operation that can be configured depending on the operator’s
needs.
1. Nova-triggered mode
In this mode of operation, Aardvark gets triggered when the scheduling of a new instance
fails due to lack of resources. Upon being triggered, Aardvark analyzes the request
that failed, calculates the resources needed for the new server and tries to gather these
resources by deleting preemptible instances. Then Aardvark rebuilds the instance that
failed in the first place.
2. Watermark mode
In this mode of operation, operators configure the maximum level of utilization per
resource class (e.g. 95% usage). Aardvark periodically checks the current utilization level
for each resource. If the utilization threshold is reached, then Aardvark tries to lower
the utilization level by deleting preemptible instances (a sketch of such a check follows
below).
There is also a third, complementary mode of operation that can be combined with the two
previously mentioned modes. Here, the operator can configure a maximum lifespan for
preemptible instances, and Aardvark will periodically check for and delete long-lived
preemptible instances.
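
A minimal sketch of the kind of periodic check the watermark mode performs, assuming
hypothetical helpers `get_capacity_and_usage()` (per-resource-class totals gathered from
Placement) and `free_up()` (which applies the configured strategy); none of these names
come from Aardvark itself.

```python
# Hypothetical sketch of a watermark-style check; the helper names are assumptions.
WATERMARK = 0.95  # e.g. trigger when a resource class is more than 95% used

def watermark_check(get_capacity_and_usage, free_up):
    for resource_class, (capacity, used) in get_capacity_and_usage().items():
        utilization = used / capacity
        if utilization > WATERMARK:
            # Reclaim just enough to bring utilization back under the watermark.
            excess = used - int(WATERMARK * capacity)
            free_up(resource_class, excess)
```
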
### Flexible Configuration
Aardvark started out as a prototype, so our goal was to make it as abstract and as easy to
extend as we could. It has pluggable strategies, so operators can choose what
matches their use case, as well as a set of notifiers to inform users and operators about
actions performed by the service.
At the same time, operators can choose from different deployment options depending on their
needs:
1. Single binary
All functionality is in one binary.
2. Two binaries
One binary for the Service Manager, running the periodic tasks and the notification listener,
and one binary for the Reaper Manager, spawning the worker threads.
This deployment option needs a backend for task scheduling (e.g. ZooKeeper or Redis).
## Nova Changes
Aardvark was designed to be integrated with Nova in the least painful way, but we still
needed to introduce some changes on the Nova side.
### Hard deleting an instance
This change was needed in order to be able to remove all information regarding a failed
instance before rebuilding it. It is already merged upstream in OpenStack Nova.
### Pending state
When scheduling fails for an instance, it goes to the ERROR state. Since Aardvark gets triggered
at that point, we added a new state called “Pending” in which instances are placed while waiting
for the orchestrator to finish its processing. Depending on the outcome, Aardvark rebuilds the
instance or resets its state to ERROR.
The spec for the change that was submitted upstream can be found here:
https://review.opendev.org/#/c/648687/
### Rebuild PENDING instances
This change is needed in order to reschedule instances that were marked as Pending by Nova due
to lack of resources. This way, Aardvark gets triggered for a Pending instance, makes some
resources available by deleting preemptible VMs and then rebuilds the instance that failed.
The spec for the change that was submitted upstream can be found here:
https://review.opendev.org/#/c/648686/
## Moving Preemptible Instances to Production at CERN
Introducing a new project in a production Cloud is always challenging, and we were very careful
not to affect Cloud availability. The change was almost transparent to the users.
Users can now see their instances going to a "PENDING" state when Placement can't find
available resources, before Aardvark either sets them to "ERROR" or rebuilds them after deleting
the required preemptible instances.
### Technical Changes
#### Nova
We needed to patch the Nova code to introduce the "Pending" state and to rebuild "Pending" instances.
It is painful to maintain these code changes in a production infrastructure, but it was the only way to
make preemptible instances available in our Cloud. We hope this functionality can be introduced upstream soon.
#### Notifications
Aardvark listens for notifications from Nova. A first change that we needed to deploy was
to have separate RabbitMQ instances for the different regions of the CERN cloud. This was needed
because Aardvark is not region-aware and we previously had a common notification infrastructure
shared between regions.
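
To give an idea of what consuming Nova's versioned notifications looks like with
oslo.messaging, here is a minimal, self-contained sketch; the topic name, the event filter
and the payload handling are assumptions for illustration, not Aardvark's actual code.

```python
# Minimal sketch of an oslo.messaging notification listener; topic, filter and
# payload handling are illustrative assumptions, not Aardvark's real code.
import time

from oslo_config import cfg
import oslo_messaging


class InstanceUpdateEndpoint(object):
    # Only react to instance.update versioned notifications (assumed filter).
    filter_rule = oslo_messaging.NotificationFilter(event_type='instance.update')

    def info(self, ctxt, publisher_id, event_type, payload, metadata):
        # Versioned notification payloads carry the data under 'nova_object.data'.
        data = payload.get('nova_object.data', {})
        print('received %s for instance in state %s'
              % (event_type, data.get('state')))


transport = oslo_messaging.get_notification_transport(cfg.CONF)
targets = [oslo_messaging.Target(topic='aardvark_notifications')]  # example topic
listener = oslo_messaging.get_notification_listener(
    transport, targets, [InstanceUpdateEndpoint()], executor='threading')
listener.start()
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    listener.stop()
    listener.wait()
```
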
#### Aardvark Infrastructure
Currently Aardvark is deployed in a Kubernetes cluster. This choice was made to enable us to
iterate quickly. The cluster has five nodes with 16 pods running Aardvark. We
are using GitLab CI to build the images and validate each change.
#### Nova Config Options
Required Nova configuration (an example nova.conf snippet follows this list):
1. Notifications driver
The messagingv2 driver should be selected on the Nova side in order to send notifications in the
format that Aardvark recognizes.
2. Sending versioned notifications
Nova should be configured to send versioned notifications and not the legacy unversioned ones.
3. Versioned notifications topics
Operators should specify the topic for the versioned notifications. The nice thing here is that
Nova can send notifications to more than one topic. In our case, notifications were already
consumed by another service in our cloud, so we just added a second topic where Aardvark is the
only listener.
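
Putting the three options together, a nova.conf could look roughly like the snippet below;
the extra topic name `aardvark_notifications` is only an example, not the topic used at CERN.

```ini
# Example nova.conf snippet; the extra topic name is illustrative only.
[oslo_messaging_notifications]
driver = messagingv2

[notifications]
notification_format = versioned
versioned_notifications_topics = versioned_notifications,aardvark_notifications
```
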
### Preemptible Instances Users
The Batch service is the main user of preemptible instances in the CERN cloud. Initially, we
created a preemptible project that could spawn servers wherever there was free capacity in the
cloud. After a few days, we noticed an increase in CPU steal time in some service VMs, as described
below. To mitigate this issue, the project is now mapped to a part of the infrastructure where
there is low or even no overcommit of resources and no service VMs.
## Current Issues
### Long time in PENDING state...
The size of the Cloud Infrastructure can have a huge impact on the amount of time that
Aardvark takes to make a decision on deleting a preemptible instance to make
resources available for the new instance. This is not related to the calculations
required to assess which preemptible instances should be deleted, but to the number
of API calls that need to be performed against the Placement API to evaluate all the
resource providers.
With thousands of resource providers, this evaluation can take a few seconds.
We overcame this problem by allowing Aardvark to read
directly from the Placement database. As you can imagine, this considerably speeds up
the whole Aardvark evaluation.
Aardvark only uses a read-only user when accessing the Placement database, and this approach
should only be considered in infrastructures with thousands of nodes.
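
To illustrate where those API calls come from, the sketch below walks all resource providers
and fetches the usage of each one through the Placement API (one call per provider) using
keystoneauth; the endpoint, credentials and microversion are placeholders, and this is not
Aardvark's actual code.

```python
# Illustrative sketch of per-resource-provider Placement queries; credentials,
# endpoint and microversion are placeholders, not Aardvark's actual code.
from keystoneauth1.identity import v3
from keystoneauth1 import session

auth = v3.Password(auth_url='https://keystone.example.org/v3',  # assumed endpoint
                   username='aardvark', password='secret',
                   project_name='services',
                   user_domain_id='default', project_domain_id='default')
sess = session.Session(auth=auth)

placement = {'service_type': 'placement'}
headers = {'OpenStack-API-Version': 'placement 1.17'}  # example microversion

rps = sess.get('/resource_providers',
               endpoint_filter=placement,
               headers=headers).json()['resource_providers']

# One usages call per resource provider: with thousands of providers this adds up,
# which is why reading directly from the Placement database is much faster.
for rp in rps:
    usages = sess.get('/resource_providers/%s/usages' % rp['uuid'],
                      endpoint_filter=placement,
                      headers=headers).json()['usages']
    print(rp['name'], usages)
```
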
### CPU steal
Our biggest use case for preemptible instances is batch instances that process the data
from the Large Hadron Collider (LHC) experiments. This workload is stateless and fault
tolerant, ideal for preemptible instances. It is also extremely CPU intensive.
The resources that usually have available capacity are allocated for service instances,
which are usually not CPU intensive. For that reason, we overcommit the CPU of those resources.
When we introduced preemptible instances in the infrastructure, we were targeting these
available resources with the batch instances. However, we soon learned that this workload
was not compatible with the level of CPU overcommit that we were using.
Existing service instances started to report high CPU steal, creating a bad user experience.
For that reason, we are now limiting preemptible instances to a few cells.
## Conclusion and Next Steps
We have come a long way since we started the development of this project.
The project was initially sponsored by Huawei and done in collaboration with
SKA (Square Kilometre Array) and many others from the OpenStack Scientific SIG.
The prototype was presented at several conferences and we got great feedback from the community.
- Future Science on Future OpenStack developing next generation infrastructure at CERN and SKA (Sydney - 2017)
https://www.youtube.com/watch?v=XmQR06Mwd5g&t
- Containers on Baremetal and Preemptible VMs at CERN and SKA (Vancouver - 2018)
https://www.youtube.com/watch?v=K5N4LYrupSs
- Science Demonstrations Preemptible Instances at CERN and Bare Metal Containers for HPC at SKA (Berlin - 2018)
https://www.youtube.com/watch?v=d-qO1knInHM&t
We feel confident that this project can help a lot of different infrastructures to better
use their spare capacity.
Of course, there is still work to be done!
We are thinking about how we can limit the CPU steal time when extremely CPU-intensive
workloads are deployed as preemptible instances in an infrastructure that is configured
to allow CPU overcommit. Basically, how can we run preemptible instances in the part of our
infrastructure dedicated to services?
The solution that we are designing is to only allow the deployment of preemptible instances
on resources that are not overcommitted and, if other instances are deployed there,
to remove the preemptible instances as soon as the compute node becomes overcommitted.
This has been a fun project to develop.
If you have suggestions or you would like to collaborate, don't hesitate to contact us.