---
title: "Splitting the CERN OpenStack Cloud into Two Regions"
date: 2019-03-18T13:00:00+01:00
author: Belmiro Moreira, Ricardo Rocha
tags: ["openstack"]
---
## Overview
The CERN Cloud Infrastructure has been available to all CERN users since 2013. During
the last 6 years it has grown from a few hundred to more than 300,000 cores. The Cloud
Infrastructure is deployed in two data centres (Geneva, Switzerland and Budapest,
Hungary).
Back in 2013 we decided, for simplicity, to have only one region spanning both data
centres. We wanted to offer an extremely simple solution that our users could adopt
easily.
We expected to scale the Infrastructure using cells (at that time cellsV1) and
offer application availability using availability zones.
Ohh... and also, as new OpenStack operators,
> *"It was simpler to manage one small cloud than two small clouds"*
After 6 years building on top of this architecture model, we decided to split the
Infrastructure into two production regions.
## Why split the Infrastructure into two regions?
A lot has changed over the last few years. Have a look at the following graphs
showing the growth of the Infrastructure. Unfortunately, we don't have the data
from 2013.
{{% figure src="../../img/post/region-split-compute-nodes.png"
caption="Fig. 1 - Number of compute nodes in the CERN Cloud Infrastructure over the years"
class="caption"
%}}
{{% figure src="../../img/post/region-split-available-cores.png"
caption="Fig. 2 - Number of cores available in the CERN Cloud Infrastructure over the years"
class="caption"
%}}
We moved from 2 Nova cellsV1 to more than 70 Nova cellsV2, and we are in
the process of migrating old cells still using the deprecated nova-network to
Neutron.
Also, the use cases are now very well defined. We can group them into three categories:
- Personal Projects
- Service Projects
- Batch Processing
A CERN user who subscribes to the service gets a "Personal Project" with a
very small, fixed quota to deploy their personal virtual machines (VMs). A
service manager who wants to deploy a new service needs to request a new
"Service Project" with the desired quota. Both use cases are mapped to a subset
of Nova cells. All other cells are reserved for "Batch Processing", which
represents ~80% of our compute capacity.
Every "Batch Processing" project is dedicated to a cell and uses all of its
resources. The VMs deployed in these projects are responsible for processing the
data from the LHC experiments. In order not to lose capacity in the Batch System,
they are only recreated if they are in bad shape (an automated process recreates
them during the night).
With these three very well defined use cases in mind, we decided to split the
Infrastructure into two regions. The existing region (let's call it the main region)
will continue to host the "Personal" and "Service" projects, and a new region
(let's call it the batch region) will host the Batch Processing use case.
Splitting the Cloud into two regions will give us a more flexible and
agile Infrastructure:
- Faster rollout of configuration changes
- Easier upgrades (smaller footprint)
- Better RabbitMQ scalability for Neutron
- Better Placement scalability
But ultimately...
> *"It is simpler to manage two large clouds than one large cloud"*
## Region Split
The plan was to move the batch-dedicated resources (more than 6,000 compute nodes) to the
new region without any API downtime for the "Personal" and "Service" use cases.
Considering that the new region will only host the batch processing resources, we
didn't need to deploy all the OpenStack projects that we offer in the main region.
For the batch region we only deployed a new Nova and Neutron control plane. Glance
is shared by both regions to avoid image duplication.
The batch region was configured in Keystone and all the batch-dedicated projects
were mapped to a new endpoint group for this new region. All other projects are
mapped to the default main region. This means that, for now, we are not allowing a
project to have resources in both regions.
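To give an idea of the mechanism, the mapping relies on Keystone's endpoint filtering
(OS-EP-FILTER) API: an endpoint group with a "region_id" filter is created, and each
batch project is then associated with it. Below is a minimal sketch of those two calls
with plain HTTP requests; the Keystone URL, token, region name and project ID are
placeholders, not our actual values.
{{< highlight python >}}
import requests

# Placeholders, not our actual values.
KEYSTONE = "https://keystone.example.org:5000/v3"
HEADERS = {"X-Auth-Token": "ADMIN_TOKEN", "Content-Type": "application/json"}

# Create an endpoint group that matches every endpoint of the batch region.
resp = requests.post(
    f"{KEYSTONE}/OS-EP-FILTER/endpoint_groups",
    headers=HEADERS,
    json={"endpoint_group": {
        "name": "batch-region",
        "description": "All endpoints of the batch region",
        "filters": {"region_id": "batch"},
    }},
)
resp.raise_for_status()
group_id = resp.json()["endpoint_group"]["id"]

# Associate a batch project with the endpoint group, so that its service
# catalog only lists the endpoints of the batch region.
project_id = "ID_OF_A_BATCH_PROJECT"
resp = requests.put(
    f"{KEYSTONE}/OS-EP-FILTER/endpoint_groups/{group_id}/projects/{project_id}",
    headers=HEADERS,
)
resp.raise_for_status()
{{< / highlight >}}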
## How was it done?
### Day -1
New Nova and Neutron control planes were configured and made available for the
batch region.
We run each component on its own set of VMs for availability and scalability. All
the VMs for the control plane of the batch region are hosted in the main region.
These are the components that we run for the new control plane:
- Nova API
- Nova Conductor & Nova Scheduler
- Placement
- RabbitMQ cluster for Nova
- Neutron Server
- RabbitMQ cluster for Neutron
We needed to add region support to our internal dashboards, monitoring and
metric-gathering tools, as well as to our operations scripts (project creation,
deletion, quota management, ...). Also, we needed to add region support to our
Neutron plugin.
At this point the functionality was tested using fake databases.
### Day 0 (The Intervention Day)
During the intervention it was required that no VMs be created or deleted on the
resources that would be moved to the batch region. As mentioned previously,
a VM is only recreated on these resources during the night, and only if an issue is detected.
Anyway, we decided to disable all those projects during the intervention,
just in case. This didn't affect the availability of the batch service. Rally
was also disabled in the entire cloud to avoid noise and false positives in our
monitoring.
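Disabling a project is just an update of its "enabled" flag in Keystone. Here is a
minimal sketch with the openstacksdk, where the cloud name and project names are
purely illustrative:
{{< highlight python >}}
import openstack

# "main" is a placeholder clouds.yaml entry with admin credentials.
conn = openstack.connect(cloud="main")

# Disable the projects whose resources are about to move to the batch region;
# the project names below are hypothetical.
for name in ("batch-cell-001", "batch-cell-002"):
    project = conn.identity.find_project(name, ignore_missing=False)
    conn.identity.update_project(project, is_enabled=False)
{{< / highlight >}}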
All the other projects in the main region (representing the
majority of our users) were not affected by this intervention. The APIs were
always available.
The nova_api, nova_cell0 and neutron databases were cloned to different MySQL
instances. Afterwards, the control plane was restarted and we verified that the
new API servers in the batch region were responding correctly. The corresponding
Nova "cell_mappings" were deleted from the nova_api databases of the main region and
the batch region, to avoid the nova-schedulers cycling through cells that now
belong to a different region.
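A quick sanity check after this step is to list what is left in the "cell_mappings"
table of each region's nova_api database, for example:
{{< highlight sql >}}
-- run against the nova_api database of each region: only cell0 and the
-- cells that belong to that region should remain
select id, name, transport_url from cell_mappings;
{{< / highlight >}}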
Finally, the cell controllers and compute nodes were populated with the new
configuration. The configuration changes included the new "region_name" for
"keystone_authtoken" and the new endpoints for Placement and Neutron (including
the new RabbitMQ cluster for Neutron) for all nodes moving to the batch region.
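As an illustration, on a compute node that moves to the batch region the change can be
as small as a few nova.conf options along these lines (the region name is a placeholder,
and selecting the Placement and Neutron endpoints via "region_name" is just one possible
way to do it; the Neutron agents on the node are pointed to the new RabbitMQ cluster
through the transport_url in their own configuration):
{{< highlight ini >}}
# nova.conf fragment on a compute node moved to the batch region
# (the region name is a placeholder)
[keystone_authtoken]
region_name = batch

[placement]
region_name = batch

[neutron]
region_name = batch
{{< / highlight >}}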
In the following graphs we can see the nodes switching from the Placement service in
the main region to the one in the batch region.
{{% figure src="../../img/post/region-split-placement-1.png"
caption="Fig. 3 - Number of compute nodes in the main region"
class="caption"
%}}
{{% figure src="../../img/post/region-split-placement-2.png"
caption="Fig. 4 - Number of compute nodes in the batch region"
class="caption"
%}}
### Day +1
We left the database clean-up for the day after.
We decided to delete the entries directly from the database, because in most
cases the API would not allow us to delete a resource that is still in use.
Deleting entries in a production database is always stressful and carries a lot
of risk. For Nova we needed to remove all the entries belonging to the batch
region from the main region "nova_api" database, and all the entries belonging to
the main region from the batch region "nova_api" database.
Several tables needed to be touched:
- aggregates
- aggregate_hosts
- aggregate_metadata
- allocations
- cell_mappings
- host_mappings
- instance_mappings
- inventories
- placement_aggregates
- resource_provider_aggregates
- resource_providers
The first step was to recreate the "cell_mappings" entries with the same IDs but
with "fake" transport_url and database_connection values. Then we used the
following MySQL script (run once per cell to be removed from a given region's
nova_api database) to delete all the unwanted entries.
{{< highlight sql "linenos=table,linenostart=0" >}}
-- name of a cell that was moved to the other region
set @cell = 'CELL_NAME_MOVED_TO_THE_OTHER_REGION';

-- remove the placement inventories and allocations of the cell's resource providers
delete from inventories where resource_provider_id in
  (select resource_provider_id from resource_provider_aggregates where
    aggregate_id = (select id from placement_aggregates where
      uuid = (select uuid from aggregates where name = @cell)));
delete from allocations where resource_provider_id in
  (select resource_provider_id from resource_provider_aggregates where
    aggregate_id = (select id from placement_aggregates where
      uuid = (select uuid from aggregates where name = @cell)));

-- remove the placement aggregates of the cell
delete from resource_provider_aggregates where
  aggregate_id = (select id from placement_aggregates where
    uuid = (select uuid from aggregates where name = @cell));
delete from placement_aggregates where
  uuid = (select uuid from aggregates where name = @cell);

-- remove the instance and host mappings of the cell
delete from instance_mappings where
  cell_id in (select id from cell_mappings where name = @cell);
delete from host_mappings where
  cell_id in (select id from cell_mappings where name = @cell);

-- remove the nova aggregates and, finally, the cell mapping itself
delete from aggregate_hosts where
  aggregate_id in (select id from aggregates where name = @cell);
delete from aggregate_metadata where
  aggregate_id in (select id from aggregates where name = @cell);
delete from aggregates where name = @cell;
delete from cell_mappings where name = @cell;
{{< / highlight >}}
There was no need to clean up the nova_cell0 database.
For Neutron the process was similar. The ports and subnets tables were cleaned.
If something went wrong in this step, going back was possible... but would have
been very painful. Fortunately, everything worked as planned. \o/
## Wrapping up
We split the CERN Cloud Infrastructure into two different regions without API
downtime for the majority of our users.
We now have a region dedicated to the Batch Processing use case.
This solution will allow us to continue to grow the Infrastructure and make
future changes easier.