---
title: "We live migrated 900 VMs!"
date: 2021-04-23T12:00:00+02:00
author: Belmiro Moreira
tags: ["openinfra", "openstack", "cern"]
---
Last month we live migrated ~900 instances.
We have been using live migration since the early days of the CERN Cloud Infrastructure but never migrated this number of instances in such a short period of time. During the process we faced several challenges and learnt a lot… Here’s the story.
## The Motivation
In reality the title of this blog post should have been “How we decommission compute nodes from a cell”, but it’s not that catchy :) All 115 compute nodes of a Nova cell needed to be decommissioned and replaced by new hardware.
These compute nodes were added to the CERN Cloud Infrastructure 5 years ago, more precisely on March 15, 2016. They were very special at that time: they were the first nodes with SSDs available for service instances. They had 128GB of RAM, an Intel Xeon CPU E5-2630 v3 @ 2.40GHz and 2 SSDs of 900GB each, configured in RAID 1. They have now been replaced by 120 compute nodes with 192GB of RAM, an Intel Xeon Silver 4216 CPU @ 2.10GHz and 2 SSDs of 1.8TB each, again configured in RAID 1.
This significantly increased the capacity of the cell, and we started a campaign to migrate ~900 Virtual Machines from the old hardware to the new compute nodes.
## The Plan and Configuration
The plan to replace this hardware started a few months ago.
When the new hardware was ordered we stopped scheduling new instances to this particular cell. This prevented the creation of new instances but, most importantly, led to a reduction of the number of instances in the cell: more than 300 instances were naturally deleted by their users during the months that preceded the migration campaign. That meant 300 fewer instances to live migrate.
When the new compute nodes were installed they were configured to join the same cell and share the same network segment as the old compute nodes. When replacing hardware we don’t create new cells, instead the compute nodes are added/removed from the same cell to enable live migration between them.
Another important point for successful live migrations is how the CPU is exposed to the instances. At the beginning we used the default “host-model”, which should allow live migration as long as the CPU of the target compute node is of the same or a more recent generation. It’s a sensible default, but we were frequently hitting an unexpected problem. With CPU microcode/kernel/libvirt upgrades, new CPU features can be introduced. If a compute node reboots (usually because of a repair intervention), the running instances are then exposed to these new CPU features. In the end, the instances running on this “upgraded” compute node can’t be live migrated to any of the other nodes, because those still don’t have the new CPU features enabled. Unless they are also rebooted!
To avoid this problem we now configure all the compute nodes that host service instances to use a “custom” cpu_mode, and define a limited set of CPU features. This gives us a much more consistent behaviour when live migrating instances.
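As an illustration only (not our exact settings), pinning a custom CPU model with an explicit set of extra flags is done in the `[libvirt]` section of nova.conf; the model and flags below are placeholder values, and on older Nova releases the option is `cpu_model` instead of `cpu_models`:
```ini
[libvirt]
# Expose a fixed CPU model instead of the hypervisor's full "host-model"
cpu_mode = custom
# Baseline model available on both the old and the new hardware (example value)
cpu_models = Haswell-noTSX-IBRS
# Extra features enabled on top of the baseline (example values)
cpu_model_extra_flags = pcid, ssbd
```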
Last but not least, for performance reasons we prefer instances to have their root disk on the compute node’s local storage. However, this is not ideal for live migration: during a block live migration the entire root disk needs to be copied and synced to the target compute node. In most cases, the total time of the live migration is determined by the size of the root disk.
## The Execution
900 instances needed to be live migrated to the new compute nodes. Our goal was to complete this operation without affecting users or services.
In a list of 900 instances there are a few that stand out: instances with large flavors (>30GB of RAM, >16 vCPUs, >160GB of disk), instances that support important services in the Organization, and even some that run the Cloud control plane.
We started by recreating in other cells all the instances that we own ourselves, i.e. the Cloud control plane, to preserve the resilience of the service. The instances that run the Cloud control plane are spread across different availability zones (cells).
Then we focused on the very large instances and the instances running important services (~100 instances). We started to live migrate those, but soon discovered issues that affected the availability of some of the instances, or live migrations that failed in a very dramatic way! (more about this later). For these cases we decided to contact the users and plan the live migration time slot with them, to minimise possible problems. In some cases it was safer to cold migrate the instances.
Finally, all the other instances! On average we live migrated all the instances from 10 compute nodes per day. To help us with this task we adapted a tool that we were developing for a different use case (automatically migrating instances and rebooting the compute nodes for kernel upgrades). The “migration cycle”, as we call it, checks the state of each instance on the specified compute nodes, selects the appropriate migration method for each instance (cold migration for shut-off instances, live migration or block live migration for running instances), monitors the availability of the instance throughout the process and, at the end, provides some statistics. If an instance becomes “unpingable” or a migration fails, the cloud operators are notified. With this tool, we automated most of the process and got much better “observability” of the different steps of each instance’s migration.
We plan to write a dedicated blog post about this tool, soon.
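The snippet below is not that tool, just a minimal sketch (using the openstacksdk, assuming admin credentials and a hypothetical compute node name) of the method-selection logic described above:
```python
import openstack

conn = openstack.connect()  # credentials from clouds.yaml / OS_* variables


def drain_host(hostname):
    """Migrate every instance off a compute node, choosing a method per state."""
    for server in conn.compute.servers(all_projects=True, host=hostname):
        if server.status == 'SHUTOFF':
            # Stopped instances are cold migrated
            conn.compute.migrate_server(server)
        elif server.status == 'ACTIVE':
            # Running instances are live migrated; 'auto' lets Nova decide
            # between a block migration (local disk) and a plain live migration
            conn.compute.live_migrate_server(server, block_migration='auto')
        else:
            print(f"{server.id}: skipped, unexpected state {server.status}")


drain_host('old-compute-node-01')
```
The real tool additionally pings each instance during the migration and reports failures to the operators.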
For instances with small flavors (up to 50 GB of root disk) we didn’t identify any major issues during live migration.
{{< figure src="../../img/post/we-live-migrated-900-vms-1.png"
caption="Fig. 1 - Number of instances live migrated between the old and new compute nodes"
class="caption"
>}}
## The Issues
During this live migration campaign we faced several issues: some self-created, and others for which we still don’t understand the root cause. Let’s start with the easy ones...
Back in 2018, with the disclosure of the Meltdown and Spectre vulnerabilities, we disabled hyper-threading in all the compute nodes that host service instances. As a consequence, the overcommit ratio doubled (which created some performance issues), but this also resulted in some instances with more vCPUs than the physical CPUs exposed by the compute node.
Placement has the concept of “max_unit”, the maximum amount of a resource that can be allocated to a single instance. In the case of the CPU, the “max_unit” is the number of physical CPUs exposed by the compute node. This prevents a single instance from overcommitting the host. As a consequence, users could no longer use the largest instance flavors available. I agree that it’s a reasonable default! However, it would be nice to leave the final decision to the operators and have the possibility to configure this value.
(https://bugs.launchpad.net/nova/+bug/1918419)
Now, a few years later, we hit the same issue... But, why is it related to live migration?
When an instance is live migrated, a “migration” allocation is created against the source compute node. However, Placement fails to create this allocation for these large instances, because it can’t allocate more resources than defined by the “max_unit”.
As a result, these large instances can’t be live migrated, even if the target nodes can host them without being overcommitted!
We bypassed this limitation by changing the code in the compute nodes that reports the CPU “max_unit” to Placement (these compute nodes were removed afterwards, so this was acceptable). Another possibility would have been to change the “max_unit” directly in the Placement database and increase the compute node’s “periodic task” interval, so that the change is not overwritten before the live migration starts.
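For reference, the inventory (including max_unit) that a compute node reports to Placement can be inspected and amended with the osc-placement CLI plugin; the UUID and values below are placeholders, the `--amend` flag needs a recent osc-placement, and anything set by hand is overwritten again at the next periodic resource update unless that interval is increased:
```text
# List the inventory reported for a compute node, including max_unit
openstack resource provider inventory list <compute-node-uuid>

# Amend only the VCPU max_unit (example value)
openstack resource provider inventory set <compute-node-uuid> \
    --resource VCPU:max_unit=64 --amend
```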
The other issues that we faced are more complex, and they affected the instances’ availability.
We observed several “symptoms” during the live migration of large instances:
* Instances not “pingable” for several seconds or even minutes during the live migration process;
* Users reporting that their applications weren’t available for a few minutes or started to perform badly.
When we detected these “symptoms”, or the users reported them, we aborted the live migration (the live migration of a very large instance can take several hours). But if we failed to detect them, the live migration usually failed and the instance was left in ERROR state.
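As a reminder (the IDs below are placeholders, and a recent enough compute API microversion is assumed), an in-progress live migration can be aborted through the API, for example with the nova CLI:
```text
# Find the ID of the ongoing migration
nova server-migration-list <server-uuid>

# Abort it; the instance keeps running on the source compute node
nova live-migration-abort <server-uuid> <migration-id>
```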
Looking into the Nova and libvirt logs, we saw that the problem was related to a timeout in libvirt:
```text
libvirtError: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMemoryStats)
```
However, the biggest problem is that Nova doesn’t fail in a clean way!
(https://bugs.launchpad.net/nova/+bug/1924585)
During a live migration, Nova monitors the state of the migration by querying libvirt every 0.5 seconds. But if it fails to get an answer from libvirt, it exits the live migration.
The problem is that qemu continues to live migrate the instance! In the end the instance is running on the target compute node. Because we have a flat network and still use nova-network in this particular cell, the instance is reachable on the network without any further configuration.
As a result, we have the instance running on the target compute node. However, for Nova the instance is in ERROR state on the source compute node!
If we detect this in time, we can fix it either by removing the domain from the target compute node, or by updating the Nova DB entries and the Placement allocation to point to the target compute node. But if a user or a less experienced operator sees the instance in ERROR state and decides to hard reboot it, we end up with two copies of the instance running (on the source and target compute nodes), both available on the network!
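For the first option, and assuming the copy on the source compute node is the one that should keep running, the cleanup on the target hypervisor is a standard virsh operation (the domain name below is a placeholder):
```text
# On the target compute node: stop and remove the duplicated domain
virsh destroy instance-0000abcd
virsh undefine instance-0000abcd
```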
We still don’t know why libvirt eventually times out. Here are some observations from our investigation of this issue:
* The disk transfer rate of these instances was ~25MB/s. Much less than we expected for 10Gb/s network cards;
* We see that 1 CPU core is fully used during the disk transfer of the live migration. This may indicate that the disk transfer rate is bound by single-core performance. Live migrating instances of similar size on more recent hardware gives better disk transfer rates, which seems to confirm the suspicion;
* Qemu has several options to “tune” the live migration. However, they only apply to the memory transfer and not the disk transfer (block migration);
```text
virsh qemu-monitor-command <domain> --hmp --cmd "info migrate_capabilities"
virsh qemu-monitor-command <domain> --hmp --cmd "info migrate_parameters"
```
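For completeness (not part of the observations above), the progress of an ongoing migration, including the remaining disk data during a block migration, can also be followed directly through libvirt:
```text
virsh domjobinfo <domain>
```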
Finally, during a live migration that ends up being problematic, we see libvirt taking more and more time to answer requests. It’s now possible to monitor this behaviour in Nova.
(https://bugs.launchpad.net/nova/+bug/1916031)
There are some reports that the libvirt timeouts can be caused by monitoring sensors. We have libvirt monitoring sensors, but disabling them didn’t solve the problem.
We believe that this behaviour is related to the workload running inside the instance and, of course, having a large disk to sync is challenging in itself. Not all large instances were affected.
In the end we decided to schedule a downtime with the users of the very large instances and cold migrate them, in order to have a predictable downtime for the affected services.
For the live migrations we didn’t use “auto-convergence” or “post-copy”. Considering that the main issue was related to the disk transfer (block migration), we don’t believe these options would have given us a better result.
{{< figure src="../../img/post/we-live-migrated-900-vms-2.png"
caption="Fig. 2 - CPU utilization and network transfer rate (source compute node) during a live migration (block migrate) of a large instance - instance is idle"
class="caption"
>}}
## Wrap Up and What’s Next?
Live migration is an essential and powerful feature for operators. It can hide required maintenance operations in the infrastructure from users.
We successfully live migrated almost all the instances in a Nova cell, allowing us to decommission old compute nodes with very little impact for the users and services.
Live migration for instances with the root disk on shared storage (booted from volume), as expected, was extremely fast and without issues.
We found issues that affected a few very large instances during live migration (block migration). We are still investigating those and would appreciate your insights/comments on this issue.
Overall the experience was very positive.
We still have another 2 Nova cells running exactly the same old hardware, a total of 197 compute nodes that host ~1500 virtual machines. These will be replaced soon and we plan to use the same approach.
Finally, we developed a small tool to help us automate and monitor the whole process.
Happy live-migrations!
*Many thanks to Jan van Eldik and Jayaditya Gupta for all their contributions to this work.*