---
title: "Splitting the CERN OpenStack Cloud into Two Regions"
date: 2019-03-18T13:00:00+01:00
author: Belmiro Moreira, Ricardo Rocha
tags: ["openstack"]
---
## Overview
The CERN Cloud Infrastructure has been available to all CERN users since 2013. During
the last 6 years it has grown from a few hundred to more than 300,000 cores. The Cloud
Infrastructure is deployed in two data centres (Geneva, Switzerland and Budapest,
Hungary).
Back in 2013 we decided, for simplicity, to have only one region spanning both data
centres. We wanted to offer an extremely simple solution that our users could adopt
easily.
We expected to scale the Infrastructure using cells (at that time cellsV1) and
offer application availability using availability zones.
Ohh... and also, as new OpenStack operators,
> *"It was simpler to manage one small cloud than two small clouds"*
After 6 years building on top of this architecture model, we decided to split the
Infrastructure into two production regions.
## Why split the Infrastructure into two regions?
A lot has changed over the last few years. Have a look at the following graphs
showing the growth of the Infrastructure. Unfortunately, we don't have the data
from 2013.
{{% figure src="../../img/post/region-split-compute-nodes.png"
caption="Fig. 1 - Number of compute nodes in the CERN Cloud Infrastructure over the years"
class="caption"
%}}
{{% figure src="../../img/post/region-split-available-cores.png"
caption="Fig. 2 - Number of cores available in the CERN Cloud Infrastructure over the years"
class="caption"
%}}
We moved from 2 Nova cellsV1 to more than 70 Nova cellsV2, and we are in
the process of migrating old cells still using the deprecated nova-network to
Neutron.
Also, the use cases are now very well defined. We can group them into three categories:
- Personal Projects
- Service Projects
- Batch Processing
A CERN user who subscribes to the service gets a "Personal Project" with a
very small, fixed quota to deploy their personal virtual machines (VMs). A
service manager who wants to deploy a new service needs to request a new
"Service Project" with the desired quota. Both use cases are mapped to a subset
of Nova cells. All other cells are reserved for "Batch Processing", which
represents ~80% of our compute capacity.
Every "Batch Processing" project is dedicated to a cell and uses all of its
resources. The VMs deployed in these projects are responsible for processing the
data from the LHC experiments. In order not to lose capacity in the Batch System,
they are only recreated if they are in bad shape (an automated process recreates
them during the night).
With these three very well defined use cases in mind, we decided to split the
Infrastructure into two regions. The existing region (let's call it the main region)
will continue to host the "Personal" and "Service" projects, and a new region
(let's call it the batch region) will host the Batch Processing use case.
Splitting the Cloud into two regions will give us a more flexible and
agile Infrastructure:
- Faster rollout of configuration changes
- Easier upgrades (smaller footprint)
- Better RabbitMQ scalability for Neutron
- Better Placement scalability
But ultimately...
> *"It is simpler to manage two large clouds than one large cloud"*
## Region Split
The plan was to move the batch-dedicated resources (more than 6,000 compute nodes) to the
new region without any API downtime for the "Personal" and "Service" use cases.
Considering that the new region will only host the batch processing resources, we
didn't need to deploy all the OpenStack projects that we offer in the main region.
For the batch region we only deployed a new Nova and Neutron control plane. Glance
is shared by both regions to avoid image duplication.
The batch region was configured in Keystone and all the batch-dedicated projects
were mapped to a new endpoint group for this new region. All other projects are
mapped to the default main region. This means that, for now, we are not allowing a
project to have resources in both regions.
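To give an idea of the mechanism, the mapping relies on Keystone's endpoint filtering
(OS-EP-FILTER) API: an endpoint group with a "region_id" filter is created, and each
batch project is then associated with it. Below is a minimal sketch of those two calls
with plain HTTP requests; the Keystone URL, token, region name and project ID are
placeholders, not our actual values.
{{< highlight python >}}
import requests

# Placeholders, not our actual values.
KEYSTONE = "https://keystone.example.org:5000/v3"
HEADERS = {"X-Auth-Token": "ADMIN_TOKEN", "Content-Type": "application/json"}

# Create an endpoint group that matches every endpoint of the batch region.
resp = requests.post(
    f"{KEYSTONE}/OS-EP-FILTER/endpoint_groups",
    headers=HEADERS,
    json={"endpoint_group": {
        "name": "batch-region",
        "description": "All endpoints of the batch region",
        "filters": {"region_id": "batch"},
    }},
)
resp.raise_for_status()
group_id = resp.json()["endpoint_group"]["id"]

# Associate a batch project with the endpoint group, so that its service
# catalog only lists the endpoints of the batch region.
project_id = "ID_OF_A_BATCH_PROJECT"
resp = requests.put(
    f"{KEYSTONE}/OS-EP-FILTER/endpoint_groups/{group_id}/projects/{project_id}",
    headers=HEADERS,
)
resp.raise_for_status()
{{< / highlight >}}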
## How was it done?
### Day -1
New Nova and Neutron control planes were configured and made available for the
batch region.
We run each component on its own set of VMs for availability and scalability. All
the VMs for the control plane of the batch region are hosted in the main region.
These are the components that we run for the new control plane:
- Nova API
- Nova Conductor & Nova Scheduler
- Placement
- RabbitMQ cluster for Nova
- Neutron Server
- RabbitMQ cluster for Neutron
We needed to add region support to our internal dashboards, monitoring and
metric-gathering tools, as well as to our operations scripts (project creation,
deletion, quota management, ...). Also, we needed to add region support to our
Neutron plugin.
At this point the functionality was tested using fake databases.
### Day 0 (The Intervention Day)
During the intervention it was required that no VMs be created or deleted on the
resources that would be moved to the batch region. As mentioned previously,
a VM is only recreated on these resources during the night, and only if an issue is detected.
Anyway, we decided to disable all those projects during the intervention,
just in case. This didn't affect the availability of the batch service. Rally
was also disabled in the entire cloud to avoid noise and false positives in our
monitoring.
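Disabling a project is just an update of its "enabled" flag in Keystone. Here is a
minimal sketch with the openstacksdk, where the cloud name and project names are
purely illustrative:
{{< highlight python >}}
import openstack

# "main" is a placeholder clouds.yaml entry with admin credentials.
conn = openstack.connect(cloud="main")

# Disable the projects whose resources are about to move to the batch region;
# the project names below are hypothetical.
for name in ("batch-cell-001", "batch-cell-002"):
    project = conn.identity.find_project(name, ignore_missing=False)
    conn.identity.update_project(project, is_enabled=False)
{{< / highlight >}}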
All the other projects in the main region (representing the
majority of our users) were not affected by this intervention. The APIs were
always available.
The nova_api, nova_cell0 and neutron databases were cloned to different MySQL
instances. Afterwards, the control plane was restarted and we verified that the
new API servers in the batch region were responding correctly. The corresponding
Nova "cell_mappings" were deleted from the nova_api databases of the main region and
the batch region, to avoid the nova-schedulers cycling through cells that now
belong to a different region.
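A quick sanity check after this step is to list what is left in the "cell_mappings"
table of each region's nova_api database, for example:
{{< highlight sql >}}
-- run against the nova_api database of each region: only cell0 and the
-- cells that belong to that region should remain
select id, name, transport_url from cell_mappings;
{{< / highlight >}}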
Finally, the cell controllers and compute nodes were populated with the new
configuration. The configuration changes included the new "region_name" for
"keystone_authtoken" and the new endpoints for Placement and Neutron (including
the new RabbitMQ cluster for Neutron) for all nodes moving to the batch region.
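As an illustration, on a compute node that moves to the batch region the change can be
as small as a few nova.conf options along these lines (the region name is a placeholder,
and selecting the Placement and Neutron endpoints via "region_name" is just one possible
way to do it; the Neutron agents on the node are pointed to the new RabbitMQ cluster
through the transport_url in their own configuration):
{{< highlight ini >}}
# nova.conf fragment on a compute node moved to the batch region
# (the region name is a placeholder)
[keystone_authtoken]
region_name = batch

[placement]
region_name = batch

[neutron]
region_name = batch
{{< / highlight >}}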
In the following graphs we can see the nodes switching from the Placement service in
the main region to the one in the batch region.
{{% figure src="../../img/post/region-split-placement-1.png"
caption="Fig. 3 - Number of compute nodes in the main region"
class="caption"
%}}
{{% figure src="../../img/post/region-split-placement-2.png"
caption="Fig. 4 - Number of compute nodes in the batch region"
class="caption"
%}}
### Day +1
We left the database clean-up for the day after.
We decided to delete the entries directly from the database, because in most
cases the API would not allow us to delete a resource that is still in use.
Deleting entries in a production database is always stressful and carries a lot
of risk. For Nova we needed to remove all the entries belonging to the batch
region from the main region "nova_api" database, and all the entries belonging to
the main region from the batch region "nova_api" database.
Several tables needed to be touched:
- aggregates
- aggregate_hosts
- aggregate_metadata
- allocations
- cell_mappings
- host_mappings
- instance_mappings
- inventories
- placement_aggregates
- resource_provider_aggregates
- resource_providers
The first step was to recreate the "cell_mappings" entries with the same IDs but
with "fake" transport_url and database_connection values. Then we used the
following MySQL script (run once per cell to be removed from a given region's
nova_api database) to delete all the unwanted entries.
{{< highlight sql "linenos=table,linenostart=0" >}}
-- name of a cell that was moved to the other region
set @cell = 'CELL_NAME_MOVED_TO_THE_OTHER_REGION';

-- remove the placement inventories and allocations of the cell's resource providers
delete from inventories where resource_provider_id in
  (select resource_provider_id from resource_provider_aggregates where
    aggregate_id = (select id from placement_aggregates where
      uuid = (select uuid from aggregates where name = @cell)));
delete from allocations where resource_provider_id in
  (select resource_provider_id from resource_provider_aggregates where
    aggregate_id = (select id from placement_aggregates where
      uuid = (select uuid from aggregates where name = @cell)));

-- remove the placement aggregates of the cell
delete from resource_provider_aggregates where
  aggregate_id = (select id from placement_aggregates where
    uuid = (select uuid from aggregates where name = @cell));
delete from placement_aggregates where
  uuid = (select uuid from aggregates where name = @cell);

-- remove the instance and host mappings of the cell
delete from instance_mappings where
  cell_id in (select id from cell_mappings where name = @cell);
delete from host_mappings where
  cell_id in (select id from cell_mappings where name = @cell);

-- remove the nova aggregates and, finally, the cell mapping itself
delete from aggregate_hosts where
  aggregate_id in (select id from aggregates where name = @cell);
delete from aggregate_metadata where
  aggregate_id in (select id from aggregates where name = @cell);
delete from aggregates where name = @cell;
delete from cell_mappings where name = @cell;
{{< / highlight >}}
There was no need to clean up the nova_cell0 database.
For Neutron the process was similar. The ports and subnets tables were cleaned.
If something went wrong in this step, going back was possible... but would have
been very painful. Fortunately, everything worked as planned. \o/
## Wrapping up
We split the CERN Cloud Infrastructure into two different regions without API
downtime for the majority of our users.
We now have a region dedicated to the Batch Processing use case.
This solution will allow us to continue to grow the Infrastructure and make
future changes easier.