---
title: "Beyond live migrating Virtual Machines"
date: 2021-11-16T08:00:00+01:00
author: Belmiro Moreira, Jayaditya Gupta
tags: ["openinfra", "openstack", "cern", "live-migration"]
---
The CERN Cloud Infrastructure hosts thousands of instances that are critical for the Organization. Keeping all these instances available during the required maintenance operations is a big challenge. In this article we explore how the CERN Cloud Infrastructure takes advantage of live migration to minimise virtual machine disruption during planned maintenance operations. We also present the tool that we developed to orchestrate the continuous live migration of instances.
## Introduction
Maintenance is a common operation in a Cloud Infrastructure due to hardware failures, hardware decommissioning, security vulnerabilities, software upgrades, and so on.
Since maintenance is unavoidable, cloud operators try to mitigate and minimise the disruption to instance availability.
There are different ways to approach this problem; a common approach, however, is to use live migration to move the instances to different compute nodes before performing the maintenance operation.
“Live migration is the process of transferring a virtual machine from one physical node to another without disrupting its normal operation.”
Live migration allows cloud operators to “hide” most of the programmed maintenance operations from users.
At CERN we have been using live migration for a long time and we continue to explore new ways to leverage this functionality, which is very mature in both libvirt and OpenStack Nova.
The OpenStack Nova API makes it possible to live migrate individual instances. This is a very powerful functionality. However, to actually manage the life cycle of a maintenance operation, additional orchestration is required.
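As an illustration, here is a minimal sketch of triggering a single live migration through the API, assuming openstacksdk; the cloud name and instance UUID are placeholders.
```python
# Minimal sketch: trigger a live migration of one instance with openstacksdk
# (cloud name and instance UUID are placeholders).
import openstack

conn = openstack.connect(cloud="my-cloud")
server = conn.compute.get_server("0b3c59a1-0000-0000-0000-000000000000")

# Let the Nova scheduler choose the destination host; "auto" lets Nova decide
# between shared-storage and block live migration.
conn.compute.live_migrate_server(server, host=None, block_migration="auto")

# Wait until the instance settles back into ACTIVE; a more robust check would
# also verify that the hosting compute node actually changed.
server = conn.compute.wait_for_server(server, status="ACTIVE", wait=3600)
print(f"{server.name} is now running on {server.compute_host}")
```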
## Live migration use cases
The CERN Cloud Infrastructure runs multiple workloads. The compute resources are used for data processing, but also to run all the services of the Organization (for example: software build systems from the experiments, engineering applications, or even the pension fund, the hostel booking system and user desktops). Most of the services in the Organization are hosted in the Cloud Infrastructure, making it a very heterogeneous and complex environment. Therefore, any intrusive intervention on a compute node is likely to affect an important service.
Depending on the type of intervention, several teams can be involved (for example: cloud team, repair team, network team, ...). It’s essential to have clear procedures to make sure that the different teams don’t execute conflicting operations and, most importantly, that the user instances are kept safe and available.
Before starting a programmed intrusive intervention on a compute node there are a few operations that need to be performed. For example:
- disable the compute node in Nova (to avoid new instances being scheduled to it when the problem is first detected or during the intervention);
- disable the monitoring alarms in the compute node (to avoid raising alarms when the repair teams are intervening in the compute node);
Then we try to live migrate all the instances from the compute node.
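A minimal sketch of these preparation steps, assuming openstacksdk, could look as follows; the host name is a placeholder and the alarm masking is deployment specific, so it is only indicated as a comment.
```python
# Sketch of the pre-maintenance steps, assuming openstacksdk.
import openstack

conn = openstack.connect(cloud="my-cloud")
hypervisor = "compute-node-123.example.org"  # placeholder host name

# 1. Disable the nova-compute service so that no new instances are scheduled
#    on this node during the intervention.
service = next(
    s for s in conn.compute.services()
    if s.host == hypervisor and s.binary == "nova-compute"
)
conn.compute.disable_service(service, host=hypervisor, binary="nova-compute",
                             disabled_reason="planned hardware intervention")

# 2. Mask the monitoring alarms for this node (deployment specific, not shown).

# 3. Collect the instances that will have to be live migrated.
servers = list(conn.compute.servers(all_projects=True, host=hypervisor))
print(f"{len(servers)} instances to migrate from {hypervisor}")
```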
Let’s discuss some examples…
### Hardware repairs
Compute nodes break!
Fortunately, most of the time it is not that “dramatic” and instances continue to run. This means that we can try to live migrate them.
We use Rundeck (https://www.rundeck.com/open-source) for automation. Our Rundeck jobs have been developed over the years and manage all the orchestration required to perform an intervention on a compute node. They are also responsible for triggering the live migration of the instances.
Unfortunately, it’s not always possible to live migrate all the instances hosted on a compute node. Almost all the instances in the CERN Cloud Infrastructure are booted with an ephemeral root disk that is mapped to a file local to the compute node. As the size of the root disk increases, the probability of a successful live migration decreases. From experience we know that live migrating instances with root disks larger than 80 GB can be very challenging.
We explicitly skip the live migration of these large instances. When it is not possible to live migrate an instance, the Rundeck job sends an email to the owners of the instance so that they can acknowledge the service disruption during the hardware intervention.
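A sketch of this selection logic, continuing from the previous snippet, might look as follows; it assumes a Nova microversion (>= 2.47) where the flavor details, including the root disk size, are embedded in each server record.
```python
# Skip instances whose flavor defines a root disk larger than 80 GB;
# "servers" is the list gathered in the previous sketch. Assumes Nova
# microversion 2.47+ so that flavor details are embedded in the server.
ROOT_DISK_LIMIT_GB = 80

to_live_migrate, to_notify = [], []
for server in servers:
    root_gb = server.flavor["disk"]
    if root_gb and root_gb > ROOT_DISK_LIMIT_GB:
        to_notify.append(server)        # owners get an email instead
    else:
        to_live_migrate.append(server)

print(f"live migrating {len(to_live_migrate)} instances, "
      f"emailing owners of {len(to_notify)} large instances")
```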
When the repair team finishes the intervention, a different Rundeck job is executed to re-enable the compute node and its monitoring alarms.
### Hardware retirement
In the CERN Cloud Infrastructure the hardware retirement cycle is 3 to 5 years. When the compute nodes of a Nova cell need to be retired, we add the new replacement compute nodes to the same cell. The added compute nodes allow us to live migrate the instances from the old compute nodes. Usually we need to live migrate thousands of instances to empty the old compute nodes. It’s an extremely challenging operation, not only because of the number of instances, but also because of the different workloads and instance sizes involved. As we discussed previously, we avoid live migrating instances with a large ephemeral root disk, which means that a lot of planning and coordination with the owners of these large instances is required to cold migrate them.
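For those large instances, the cold migration itself can be triggered through the same API; a minimal sketch with openstacksdk (the instance UUID is a placeholder) could look like this.
```python
# Cold migrate a large instance: Nova shuts it down, copies the disk to
# another compute node and restarts it there, leaving it in VERIFY_RESIZE
# until the migration is confirmed. This is disruptive, so it is agreed with
# the instance owner beforehand.
server = conn.compute.get_server("0b3c59a1-0000-0000-0000-000000000000")

conn.compute.migrate_server(server)
server = conn.compute.wait_for_server(server, status="VERIFY_RESIZE", wait=7200)

# Confirm the migration so the allocation on the old compute node is released.
conn.compute.confirm_server_resize(server)
```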
You can learn more about hardware retirement from a previous blog post,
see: https://techblog.web.cern.ch/techblog/post/we-live-migrated-900-vms/
### Linux kernel upgrades
One of the challenges that we face in the Cloud Infrastructure is keeping the compute nodes' Linux kernel up to date without disrupting the hosted instances. A Linux kernel upgrade requires a system reboot (when not running a Linux kernel live patching solution), which is a very disruptive intervention for a compute node running production workloads.
Currently, we have compute nodes with uptimes of 3 years and more, meaning that the Linux kernel wasn’t upgraded during all this period. Even if this can have security implications, it is a trade-off between security and disruption to our cloud users.
In the past we have rebooted all the compute nodes of the Cloud Infrastructure at very short notice (remember Spectre/Meltdown? - https://techblog.web.cern.ch/techblog/post/keep-calm-and-reboot-patching-recent/).
However, we have avoided performing this complex operation frequently.
To overcome this issue we developed a small tool that allows us to live migrate the instances of a compute node and reboot it! Then we repeat the same process on another one… This cycle can take weeks to finish in our large cells, but it has proven that we can safely upgrade the Linux kernel of the Cloud Infrastructure compute nodes without any instance disruption.
We call it the “migration cycle” tool.
In fact, this tool is now the migration orchestration backend for all the operations that require live migration of instances.
## Migration Cycle
The “migration cycle” tool comes from the need to easily orchestrate the live migration of thousands of instances with minimal effort for the cloud operators. OpenStack Nova makes it possible to live migrate individual instances, however there are many deployment-specific operations to perform before and after a live migration. The migration cycle tool implements the CERN operational logic to automate the live migration of instances.
When discussing with several cloud operators we understood that different deployments have different needs, and that different tools have been used to solve this problem. When we started to look into this problem, OpenStack Mistral was our first option. However, after some initial prototypes using OpenStack Mistral we concluded that we needed a more flexible solution. We then decided to write a Python tool that uses the OpenStack API and CERN services APIs to orchestrate all the instance migrations.
It can also be used through a CLI for particular interventions.
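To give an idea of what the tool automates, here is an illustrative sketch of one iteration of the cycle (not the actual implementation, which lives in the repository linked below), assuming openstacksdk; the host names are placeholders and the reboot and health-check step is only a stub.
```python
# Illustrative sketch of the migration cycle: drain one compute node at a
# time, reboot it into the new kernel, re-enable it and move to the next one.
import openstack

conn = openstack.connect(cloud="my-cloud")

def drain(hypervisor):
    """Live migrate every instance away from the given compute node."""
    for server in conn.compute.servers(all_projects=True, host=hypervisor):
        conn.compute.live_migrate_server(server, host=None, block_migration="auto")
        conn.compute.wait_for_server(server, status="ACTIVE", wait=3600)

def reboot_and_wait(hypervisor):
    """Placeholder: reboot the node (e.g. via SSH or IPMI) and wait until it
    is healthy again. Deployment specific, not shown here."""
    raise NotImplementedError

for hypervisor in ["node-001.example.org", "node-002.example.org"]:  # placeholders
    service = next(
        s for s in conn.compute.services()
        if s.host == hypervisor and s.binary == "nova-compute"
    )
    conn.compute.disable_service(service, host=hypervisor, binary="nova-compute",
                                 disabled_reason="kernel upgrade")
    drain(hypervisor)
    reboot_and_wait(hypervisor)
    conn.compute.enable_service(service, host=hypervisor, binary="nova-compute")
```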
For more information about the migration cycle, see the following GitLab repo:
https://gitlab.cern.ch/cloud-infrastructure/migration_cycle
## Conclusion
The CERN Cloud Infrastructure leverages live-migration in several workflows. In this article we discussed the role of instance live-migration in hardware repairs, hardware retirement and Linux kernel upgrades on compute nodes.
The migration cycle tool was developed to ease the live migration orchestration in the CERN Cloud, but it can be adapted for any other deployment.
We continue to develop the migration cycle. Our next step is to automatically reinstall the compute nodes with a new operating system after all the instances have been live migrated. This will help us with the task of upgrading all the compute nodes from CentOS 7 to CentOS 8 Stream.