Commit 1a1cdcd1 authored by Georgios Bitzes

ci: make mkdocs official, drop gitbook

parent ea9611a8
Pipeline #1506697 passed in 98 minutes and 51 seconds
@@ -154,7 +154,7 @@ docs:
     - eosfusebind
     - yum install -y git tree
     - SNAPSHOT=$(date +%s)
-    - TARGET="/eos/project/q/quarkdb/www/mkdocs/${CI_COMMIT_REF_NAME}"
+    - TARGET="/eos/project/q/quarkdb/www/docs/${CI_COMMIT_REF_NAME}"
     - STAGING_AREA="$TARGET-${SNAPSHOT}"
     - tree
     - cp -r make-docs "$STAGING_AREA"
# Password authentication
## Configuration
Three configuration options control password authentication in QuarkDB. Note that
passwords must contain a minimum of 32 characters.
* __redis.password__: Clients must prove they know this password in order to
be able to connect. This includes other QuarkDB nodes: all QuarkDB
nodes that are part of the same cluster must be configured with the same password,
otherwise they won't be able to communicate.
* __redis.password_file__: An alternative to the above, except the password is
read from the specified file -- permissions must be `400` (`r--------`). Note that
any whitespace at the end of the password file contents is ignored, including
a trailing newline, if any. (See the sketch after this list for one way to create
such a file.) This means the following three password files will, in fact, yield
identical passwords:
```
$ cat file1
pickles<space><space><space>\n\n
```
```
$ cat file2
pickles\r\n
```
```
$ cat file3
pickles
```
This simplifies the common case of having a single line in the password file,
without having to worry about newlines and whitespace at the end, and makes it
easy to copy-paste the password into a `redis-cli` terminal.
* __redis.require_password_for_localhost__: By default, the requirement for
password authentication is lifted for localhost clients. Set this option to
true to require authentication at all times.
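As a sketch, a password file satisfying the above constraints could be created as follows (the path `/etc/quarkdb/password` and the `xrootd` owner are just examples):
```
# Generate a 64-character random password, comfortably above the 32-character minimum
openssl rand -hex 32 > /etc/quarkdb/password
# The file must be readable only by its owner (mode 400), which should be the
# user running the QuarkDB process (xrootd when using the bundled service file)
chown xrootd:xrootd /etc/quarkdb/password
chmod 400 /etc/quarkdb/password
```
The configuration file would then point to it with `redis.password_file /etc/quarkdb/password`.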
## Usage
An unauthenticated client will receive an error message for all issued commands:
```
some-host:7777> set mykey myvalue
(error) NOAUTH Authentication required.
```
There are two ways for a client to prove it knows the password: by simply sending
the password over the wire, like official redis does, or through a signing challenge.
### AUTH
Simply send the password over the wire towards the server, just like in official redis:
```
some-host:7777> auth some-password
OK
```
### Signing challenge
This method of authentication is only viable from scripts, not interactively.
It avoids sending the password in plaintext.
* The client asks the server to generate a signing challenge, providing 64 random
bytes:
```
eoshome-i01:7777> HMAC-AUTH-GENERATE-CHALLENGE <... 64 random bytes ...>
"<... long string to sign...>"
```
* The client then signs the server-provided string using HMAC SHA-256 and
the password it knows (see the sketch below):
```
eoshome-i01:7777> HMAC-AUTH-VALIDATE-CHALLENGE <... hmac signature ...>
OK
```
* The server computes its own signature, and if the two match, lets the
client through.
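The following sketch covers only the signing step. Both HMAC-AUTH commands have to be issued over the same connection, so in practice the whole exchange is driven from a script using a redis client library; whether the server expects the hex-encoded digest or raw bytes is an assumption to verify against the QuarkDB client code.
```
# STRING_TO_SIGN holds the reply of HMAC-AUTH-GENERATE-CHALLENGE,
# QDB_PASSWORD the shared password. Compute the HMAC-SHA256 signature:
printf '%s' "$STRING_TO_SIGN" | openssl dgst -sha256 -hmac "$QDB_PASSWORD"
```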
# Backup
Let's assume there's QuarkDB running at ```/var/lib/quarkdb``` - how do we back up
this directory?
First of all, it is a **bad** idea to directly copy the files of a running, live instance.
Please don't do that! Between the time you start the backup and the time it finishes,
the underlying SST files will likely have changed, resulting in a corrupted backup.
Instead, take a checkpoint by issuing ``raft-checkpoint /path/to/backup``,
which will create a point-in-time consistent snapshot, containing both state machine
and journal, from which you will be able to easily spin up another QDB instance if need be.
Please make sure that ```/var/lib/quarkdb``` is on the **same** physical filesystem
as ```/path/to/backup```. This allows hard-linking the SST files, resulting in a
backup that takes virtually no additional space on the disk, and takes a couple
of seconds to create, even if your DB is half a terabyte.
After that, you should be able to stream or rsync this directory over the network.
Once you're done, make sure to **remove** it. Otherwise, the contents of ```/var/lib/quarkdb```
and ```/path/to/backup``` will soon start diverging as QDB processes more writes,
and new SST files are written. In short: ```/path/to/backup``` will no longer be
"for free", and will start to consume actual disk space.
# Restore
Once we have a checkpoint, here are the steps to spin up an entirely new QuarkDB
instance out of it:
1. If the checkpoint was produced by a standalone instance, you can skip this
step. Otherwise, you need to change the hostname associated to the raft journal
by running the following command:
```
quarkdb-recovery --path /path/to/backup/current/raft-journal --command "recovery-force-reconfigure-journal localhost:8888| new-cluster-uuid"
```
This way, the new node will identify as ```localhost:8888```, for example, instead
of whatever hostname the machine we retrieved the backup from had. Replace
```new-cluster-uuid``` with a unique string, such as a UUID.
2. It should now be possible to directly spin up a new node from the checkpoint
directory - example configuration file for raft mode:
```
xrd.port 8888
xrd.protocol redis:8888 libXrdQuarkDB.so
redis.mode raft
redis.database /path/to/backup
redis.myself localhost:8888
```
Make sure to use the same hostname:port pair in the configuration file as
in the ```quarkdb-recovery``` command invocation.
The resulting cluster will be raft-enabled, and single-node. It's possible
to expand it through regular membership updates.
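Once the restored process is up, a quick sanity check could look like the following sketch, assuming the example `localhost:8888` identity from above:
```
# The restored node should report itself as a healthy single-node raft cluster
redis-cli -p 8888 raft-info
redis-cli -p 8888 quarkdb-info
# Spot-check that the data survived the trip - replace mykey with a key you know exists
redis-cli -p 8888 get mykey
```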
# Bulkload mode
QuarkDB supports a special mode during which writes are sped up significantly,
at the cost of disallowing reads. Preventing reads enables several optimizations,
such as not having to maintain an in-memory index in order to support key
lookups for memtables.
Bulkload mode is available _only_ for newly created instances. If you try to
open an instance which already contains data (either a standalone
_or_ a raft instance) in bulkload mode, the server will refuse to start.
## Starting a node in bulkload mode
1. Use quarkdb-create to make the database directory, specifying only the path.
```
quarkdb-create --path /var/lib/quarkdb/bulkload
```
2. Start the QuarkDB process, specify `bulkload` as the mode in the configuration:
```
xrd.port 4444
xrd.protocol redis:4444 libXrdQuarkDB.so
redis.mode bulkload
redis.database /var/lib/quarkdb/bulkload
```
3. Here are some example commands run against a bulkload node:
```
127.0.0.1:4444> quarkdb-info
1) MODE BULKLOAD
2) BASE-DIRECTORY /var/lib/quarkdb/bulkload
3) QUARKDB-VERSION 0.3.7
4) ROCKSDB-VERSION 5.18.3
5) MONITORS 0
6) BOOT-TIME 0 (0 seconds)
7) UPTIME 18 (18 seconds)
127.0.0.1:4444> set test 123
OK
127.0.0.1:4444> get test
(nil)
```
Note how attempting to read just-written values returns nothing: all
reads during bulkload receive empty values.
4. Now you are free to load QuarkDB with any data you wish.
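How you load the data is up to you. One convenient option - a sketch, not an officially prescribed workflow - is redis-cli's pipe mode, which streams pre-encoded commands to the server:
```
# commands.resp contains the write commands encoded in the Redis (RESP) protocol,
# as described in the Redis "mass insertion" documentation
cat commands.resp | redis-cli -p 4444 --pipe
```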
## Finalizing bulkload
Once your data is written, how do you turn this into a fully functioning raft cluster,
with reads enabled and replication?
1. Run ``quarkdb-bulkload-finalize`` - this needs to be the last command run
on the node; further writes are disallowed from this point on. Let the command
run to completion, which may take a while.
2. Once done, shut down the QuarkDB process.
3. To initialize a new raft cluster out of bulkloaded data:
1. Run ``quarkdb-create --path /var/lib/quarkdb/raft --clusterID my-cluster-id
--nodes node1:7777,node2:7777,node3:7777 --steal-state-machine /var/lib/quarkdb/bulkload/current/state-machine``.
2. ``/var/lib/quarkdb/raft`` now contains the full data. _Copy_ this directory
in full into _all_ nodes destined to be part of this new cluster: ``scp -r /var/lib/quarkdb/raft node1:/some/path``
3. Create the xrootd configuration files for each node, pointing to the location
where the above directory was copied into.
4. To initialize a new standalone node out of bulkloaded data:
1. Run ``quarkdb-create --path /var/lib/quarkdb/standalone --steal-state-machine /var/lib/quarkdb/bulkload/current/state-machine``
2. Create the xrootd configuration file as usual, pointing to ``/var/lib/quarkdb/standalone``.
# Configuration
In this example, we'll be configuring the following three nodes in our QuarkDB
cluster: `qdb-test-1.cern.ch:7777`, `qdb-test-2.cern.ch:7777`, `qdb-test-3.cern.ch:7777`.
We will also need an identifier string which uniquely identifies our cluster - a
[UUID](https://www.uuidgenerator.net) will do nicely. This prevents nodes from
different clusters from communicating by accident.
The first step is to initialize the database directory for each node in our cluster.
Run the following command on each node, only potentially changing `--path`.
For every single node in the cluster, including potential future additions,
the values used for `--clusterID` and `--nodes` must be consistent.
```
quarkdb-create --path /var/lib/quarkdb/node-1 --clusterID your-cluster-id --nodes qdb-test-1.cern.ch:7777,qdb-test-2.cern.ch:7777,qdb-test-3.cern.ch:7777
```
If you use the default systemd service file to run QuarkDB, you'll also need to
change the owner of the newly created files: ``chown -R xrootd:xrootd /var/lib/quarkdb/node-1``.
This is an example configuration file, running in Raft mode:
```
xrd.port 7777
xrd.protocol redis:7777 libXrdQuarkDB.so
redis.mode raft
redis.database /var/lib/quarkdb/node-1
redis.myself qdb-test-1.cern.ch:7777
```
Each node in the cluster has its own configuration file. The meaning of the above
directives is as follows:
* __xrd.port__: The port on which the xrootd server should listen.
* __xrd.protocol__: The protocol which the xrootd server should load, along with
  the associated dynamic library. If you compiled QuarkDB manually, you need
  to replace `libXrdQuarkDB.so` with the full path to the dynamic library,
  such as `/path/to/quarkdb/build/src/libXrdQuarkDB.so`.
* __redis.mode__: The mode in which to run. Possible values: _raft_ to run with
consensus replication, _standalone_ to simply have a single node running.
Please note that once you have an established setup, it's not easy to switch between the
two modes. We'll focus on Raft mode for now.
* __redis.database__: The local path in which to store the data - needs to exist
beforehand.
* __redis.myself__: Only used in Raft mode: identifies a particular node within the cluster.
A few important details:
* The specified port needs to be identical across `xrd.port`, `xrd.protocol`,
and `redis.myself`.
* `redis.database` needs to exist beforehand - initialize by running `quarkdb-create`,
as found above.
You probably want to use `systemd` to run QuarkDB as a daemon - there is already a generic
systemd service file bundled with XRootD. Store your configuration file in
`/etc/xrootd/xrootd-quarkdb.cfg`, then run `systemctl start xrootd@quarkdb` to start
the node. The logs can be found in `/var/log/xrootd/quarkdb/xrootd.log`.
Otherwise, it's also possible to run manually with `xrootd -c /etc/xrootd/xrootd-quarkdb.cfg`,
or whichever is the location of your configuration file.
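For reference, the systemd-based workflow described above boils down to the following sketch:
```
# Place your configuration in /etc/xrootd/xrootd-quarkdb.cfg first
systemctl start xrootd@quarkdb              # start the node
systemctl enable xrootd@quarkdb             # optionally, also start it at boot
tail -f /var/log/xrootd/quarkdb/xrootd.log  # follow the logs
```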
# Starting the cluster
To have a fully operational cluster, a majority (or *quorum*) of nodes need to
be alive and available: at least **2 out of 3**, 3 out of 5, 4 out of 7, etc.
Start at least two out of the three nodes in our test cluster. If all goes well,
they will hold an election with one becoming leader, and the others followers.
Using `redis-cli -p 7777`, it's possible to inspect the state of each node by issuing
`raft-info` and `quarkdb-info` commands. The output from `raft-info`
should look a bit like this:
```
127.0.0.1:7777> raft-info
1) TERM 6
2) LOG-START 0
3) LOG-SIZE 21
4) LEADER qdb-test-1.cern.ch:7777
5) CLUSTER-ID ed174a2c-3c2d-4155-85a4-36b7d1c841e5
6) COMMIT-INDEX 20
7) LAST-APPLIED 20
8) BLOCKED-WRITES 0
9) LAST-STATE-CHANGE 155053 (1 days, 19 hours, 4 minutes, 13 seconds)
10) ----------
11) MYSELF qdb-test-1.cern.ch:7777
12) STATUS LEADER
13) ----------
14) MEMBERSHIP-EPOCH 0
15) NODES qdb-test-1.cern.ch:7777,qdb-test-2.cern.ch:7777,qdb-test-3.cern.ch:7777
16) OBSERVERS
17) ----------
18) REPLICA qdb-test-2.cern.ch:7777 ONLINE | UP-TO-DATE | NEXT-INDEX 21
19) REPLICA qdb-test-3.cern.ch:7777 ONLINE | UP-TO-DATE | NEXT-INDEX 21
```
Verify that everything works by issuing a write towards the leader and
retrieving the data back:
```
qdb-test-1.cern.ch:7777> set mykey myval
OK
qdb-test-1.cern.ch:7777> get mykey
"myval"
```
# Fsync policy
In an ideal world, every write into QuarkDB would first be flushed to stable storage
(fsync'ed) before being acknowledged, thus ensuring maximum durability in
case of sudden power blackouts, or kernel crashes.
Given the very high cost of fsync (often in the tens of milliseconds),
QuarkDB does not fsync on every journal write by default, as that reduces throughput
by around a factor of 10 or more, and introduces excessive wear on the underlying
disk / SSD.
Instead, starting from version 0.4.1 the journal is always fsync'ed once per second
in a background thread. This limits any potential data loss during a power blackout
to the last second of writes.
In addition to the above, the journal can be configured with the following fsync
policies:
* async: Journal is synced at the discretion of the OS.
* sync-important-updates (**default**): Important writes relating to the raft state
are explicitly synced. This protects raft invariants after a blackout, ensuring
for example that no node votes twice for the same term due to forgetting its
vote after the blackout.
* always: Journal is synced for **each and every** write - expect a massive performance reduction.
The default is *sync-important-updates*, as it represents a good tradeoff: During
infrequent but critical raft events (voting, term changes, membership changes)
the journal is synced to ensure raft state remains sane even after simultaneous,
all-node power blackouts.
The current fsync policy can be viewed through ``raft-info``, and changed through ``raft-set-fsync-policy``:
```
some-host:7777> raft-set-fsync-policy sync-important-updates
OK
```
Things to note:
* The fsync policy is set on each node **separately**; remember to change
the policy on every node of your cluster (see the sketch below).
* A single node going down due to power blackout should not be a problem, as the
rest will re-populate any lost entries through replication. Data loss becomes
a possibility if multiple nodes are powered off *simultaneously*.
* Only power blackouts and kernel crashes pose such a problem. If just QuarkDB crashes
(even by ``kill -9``), no data loss will occur.
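Since the setting is per-node, applying it across the example cluster from the previous chapters could look like this sketch (hostnames are examples):
```
for host in qdb-test-1.cern.ch qdb-test-2.cern.ch qdb-test-3.cern.ch; do
    redis-cli -h "$host" -p 7777 raft-set-fsync-policy sync-important-updates
done
```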
# Getting started
After following the instructions in this chapter, you will have
a fully functional QuarkDB cluster. The steps are:
* [Installation](INSTALLATION.md): Install the QuarkDB binaries into your local system,
along with the necessary dependencies.
* [Configuration](CONFIGURATION.md): Decide which nodes will be part of the QuarkDB
cluster, and configure them.
* [Troubleshooting](TROUBLESHOOTING.md): A list of common errors encountered
during setup, and how to solve them.
# Installation
## Packages
There are several ways to obtain QuarkDB, the easiest being to install from an RPM.
If running CentOS 7, store the following in `/etc/yum.repos.d/quarkdb.repo`:
```
[quarkdb-stable]
name=QuarkDB repository [stable]
baseurl=http://storage-ci.web.cern.ch/storage-ci/quarkdb/tag/el7/x86_64/
enabled=1
gpgcheck=False
```
Then, run `yum install quarkdb quarkdb-debuginfo`, and you're done.
## Building from source
### Requirements / Dependencies
* Check out `utils/el7-packages.sh` for a list of build dependencies.
* Build will fail with older versions of gcc/gcc-c++
* On CC7, run `yum install centos-release-sc && yum install devtoolset-8 && source /opt/rh/devtoolset-8/enable`
The following will compile QuarkDB and run the tests.
```
git clone https://gitlab.cern.ch/eos/quarkdb.git && cd quarkdb
git submodule update --recursive --init
mkdir build && cd build
cmake ..
make
./test/quarkdb-tests
```
RocksDB is embedded as a submodule, but you can also compile it yourself
and pass `-DROCKSDB_ROOT_DIR` to the cmake invocation, in order to speed
things up if you do full recompilations of QuarkDB often.
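For example - the RocksDB installation path below is hypothetical:
```
cmake .. -DROCKSDB_ROOT_DIR=/opt/rocksdb
make
```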
# Journal trimming
As you already know, all writes into QuarkDB during raft mode are first recorded
into the raft journal. This is a necessary step during consensus and replication,
since all writes must first be stored into a quorum of nodes before being committed.
However, we cannot keep every write ever made; that would cause the raft journal
to grow out of control in size. It's necessary to occasionally trim it and only
keep the last N entries.
Two configuration options control trimming:
* The number of entries to keep in the raft journal. Default value is 50 million
journal entries. Values below 1 million are probably not reasonable for a production
deployment, and values below 100k are disallowed.
* The batch size to use during trimming: This many entries will be deleted at once
when it's time to apply trimming. Default value is 1 million. Aim for batch sizes
around `1/50` of the total number of entries to keep.
You can change the above values with the following command, issued towards the leader. This
sets the total number of entries to keep to 10M, and the batch size to 200k.
```
redis-cli -p 7777 config-set raft.trimming 10000000:200000
```
You can view all configuration options with the following:
```
redis-cli -p 7777 config-getall
```
Recommendation: Keep the default values, unless the raft-journal directory is
starting to consume too much space.
You can see current trimming status by running `raft-info`:
```
127.0.0.1:7777> raft-info
1) TERM 12259
2) LOG-START 35939700000
3) LOG-SIZE 35949744300
...
```
The above means that a total of `35949744300` writes have been recorded in the
journal so far, but the first `35939700000` have been trimmed already. Only
the last `35949744300 - 35939700000 = 10044300` entries are still available.
# Membership updates
QuarkDB supports dynamic changes to cluster membership without any impact on availability.
Caution needs to be taken that at any point in time, a quorum of nodes is
available and up-to-date for the cluster to function properly.
# Distinction between full nodes and observers
Consider the following:
1. We've been running a cluster in production consisting of nodes n1, n2, and n3.
1. One day n2 dies, so we add n4 to the cluster, without removing n2 first.
The leader starts the procedure of bringing n4 up-to-date. Quorum size becomes
three, since there are now four nodes in total.
1. Remember that writes have to be replicated to a quorum of nodes before they
are acknowledged to clients.
1. n2 remains dead, and n4 will take time to be brought up-to-date if the database
size is large. Writes can be replicated only to n1 and n3, which is less than the
quorum size, so they will all be **stalled** until n4 becomes up-to-date,
making the cluster unavailable for writes.
To prevent such an incident, QuarkDB distinguishes between two types of nodes:
1. **Full members** participate in voting rounds and are capable of becoming leaders.
1. **Observers** receive all replicated entries, just like full nodes, however they:
* do not affect quorums
* do not vote
* are not taken into consideration when deciding whether a write has been successful
* will never attempt to become leaders
The idea is to first add a node as an observer (which will *not* in any way
affect quorum size, or availability), then promote it to full member status
once it has been brought up to date.
QuarkDB will further make an effort to refuse membership updates which might
compromise availability, as a protection against operator error, but please
keep the above in mind.
# How to view current cluster membership
Issue the command `raft-info` using `redis-cli` to any of the nodes, and check the `NODES` and
`OBSERVERS` fields. It's perfectly valid if the list of observers is empty.
# How to add a node
Three steps:
1. Run `quarkdb-create --path /path/to/db --clusterID ... ` on the machine you
would like to add. Note the complete omission of `--nodes` in the above invocation.
This creates a node which is _in limbo_ - the node has no idea of the participants
in the cluster, and will simply wait until it is contacted.
2. Write the xrootd configuration file for the new node, and start the process.
You will notice it complaining in the logs that it is in limbo, which is
completely normal.
3. Run `raft-add-observer server_hostname:server_port` towards the current
leader. Immediately, you should notice that the new node is no longer complaining
in the logs about not receiving heartbeats. The leader will start the process
of bringing this new node up-to-date.
A new node must always be added as an observer; there's no way to directly add
it as a full member.
# How to promote an observer to full status
Issue `raft-promote-observer server_hostname:server_port` towards the current
leader.
Before promoting, make sure it is sufficiently up to date! Running `raft-info` on the leader
will show which replicas are online, up-to-date, or lagging.
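Putting the last two sections together, adding and then promoting a node looks roughly like the following sketch (hostnames and paths are examples):
```
# On the new machine:
quarkdb-create --path /var/lib/quarkdb/node-4 --clusterID your-cluster-id
chown -R xrootd:xrootd /var/lib/quarkdb/node-4
systemctl start xrootd@quarkdb

# Towards the current leader:
redis-cli -h qdb-test-1.cern.ch -p 7777 raft-add-observer qdb-test-4.cern.ch:7777

# Later, once raft-info on the leader shows the new replica as ONLINE | UP-TO-DATE:
redis-cli -h qdb-test-1.cern.ch -p 7777 raft-promote-observer qdb-test-4.cern.ch:7777
```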
# How to remove a node
Issue `raft-remove-member server_hostname:server_port` towards the current leader.
Works both on full members, as well as observers.
It's not possible to remove a node which is currently a leader. To do that, stop
the node, wait until the new leader emerges, and issue `raft-remove-member` towards
it.
A membership update is represented internally as a special kind of log entry.
This means that a removed node will often not know that it has been removed,
since the cluster stops replicating entries onto it. Such a node will also
stop receiving heartbeats, and thus trigger elections indefinitely.
There is built-in protection against such disruptive nodes, so this will not
affect the rest of the cluster, but it is highly recommended to stop QuarkDB
from running on removed nodes.
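A sketch of the removal sequence, using the example hostnames from before:
```
# Towards the current leader:
redis-cli -h qdb-test-1.cern.ch -p 7777 raft-remove-member qdb-test-3.cern.ch:7777
# On the removed node, stop the process so it doesn't keep triggering elections:
systemctl stop xrootd@quarkdb
```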
# Raft extensions in QuarkDB
Although we follow the raft algorithm closely, we have made several improvements.
Understanding the rest of this page requires a good understanding of
[raft](https://raft.github.io/raft.pdf) - please read the paper first.
1. An RPC which only serves as a heartbeat.
Even though _appendEntries_ serves as a heartbeat, it can be problematic: a
pipelined storm of gigantic _appendEntries_ messages will heavily
affect acknowledgement latencies. When using short raft
timeouts, this can easily lead to spurious timeouts and re-elections.
For this reason, we use a separate thread on the leader node which regularly
sends heartbeats, decoupled from replication.
The heartbeat request contains two fields:
* The raft term of the contacting node, for which it is a leader.
* The server identifier (host:port) of the contacting node.
The heartbeat response contains two fields as well:
* The raft term of the node being contacted.
* Whether or not the heartbeat was successful, that is, whether the contacted
node recognizes the sender as leader for the specified term.
The fact that this exchange doesn't access the journal at all makes this a
separate process from replication, as certain locks don't need to be taken,
and makes the cluster much more robust against spurious timeouts.
1. Veto responses on vote requests.
In QuarkDB, a node can reply in three different ways to a vote request:
* Granted
* Denied
* Veto
A veto is simply a stronger form of vote denial. The responder communicates to
the sender that, if they were to become a leader, a critical safety violation
would occur on raft invariants. Namely, if elected, the contacting node would
attempt to overwrite already committed journal entries with conflicting ones.
Clearly, raft already protects from the above scenario -- it's simply an extra,
paranoid precaution. It should never happen that a node receives a quorum of
positive votes, plus a non-zero number of vetoes. If it does happen, we print
a serious error in the logs, and the node does not become leader, despite
having received a quorum of positive votes.
However, this mechanism is quite useful in a different way as well: A node
receiving a veto _knows_ it cannot possibly be the next leader of this cluster,
as its journal does not contain all committed entries. Therefore, a veto
begins a period of election embargo, during which a node will willingly stop
attempting to elect itself, up until the moment it is contacted by a leader node.
This is useful in a number of ways:
* Entirely mitigates "Disruptive Servers" scenario (see section 4.2.3 of the
Raft PhD thesis), in which a node which is not aware it has been removed from