Commit 02d89142 authored by Georgios Bitzes

doc: make backup page more understandable

parent 399439c5
# Backup
## QuarkDB is replicated, do I really need backups?
Yes, absolutely! Replication is very different from a backup. While
replication will most likely protect your data against a failing disk,
it will not protect from software bugs or other accidents.
## Cool, how to take a backup?
1. Create a point-in-time checkpoint by issuing the ``quarkdb-checkpoint /path/to/backup`` redis command on the node you wish to back up. Both `state-machine` and `raft-journal` are included in the checkpoint.
2. Validate the newly created checkpoint using the command-line tool ``quarkdb-validate-checkpoint``.
3. Stream or rsync the contents of `/path/to/backup` over the network to the intended long-term storage destination.
4. Delete `/path/to/backup`.
Things to note:
- ``/path/to/backup`` should be on the same physical filesystem as the actual
data. This allows hard-linking of the underlying SST files, resulting in a backup
that consumes virtually no additional space on disk, and takes a couple of
seconds to create even for large datasets.
- ``/path/to/backup`` may not consume much disk space at first, but it will as
soon as the contents between ``/path/to/backup`` and your DB start to diverge
significantly. Make sure to delete it once the contents have been copied to
their final destination.
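
Putting the four steps above together, here is a minimal sketch of the whole workflow. The port, the backup path, and the rsync destination are assumptions for illustration, and the exact flags of ``quarkdb-validate-checkpoint`` may differ on your version (consult its ``--help``):

```
# Assumptions: the node listens on port 7777, and /var/lib/quarkdb-backup
# sits on the same physical filesystem as /var/lib/quarkdb.

# 1. Create the point-in-time checkpoint (hard-linked SST files).
redis-cli -p 7777 quarkdb-checkpoint /var/lib/quarkdb-backup

# 2. Validate the checkpoint before shipping it anywhere.
quarkdb-validate-checkpoint --path /var/lib/quarkdb-backup

# 3. Copy it to long-term storage (hypothetical destination).
rsync -a /var/lib/quarkdb-backup/ backup-host:/srv/qdb-backups/$(date +%F)/

# 4. Remove the local checkpoint so it stops consuming disk space.
rm -rf /var/lib/quarkdb-backup
```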
## Can't I just copy the main data directory? (i.e. ```/var/lib/quarkdb```)
Nooo. When directly copying the files of a running, live instance you are likely
to end up with a backup that is corrupted: between the time you start the copy
and the time it finishes, the underlying SST files will likely have changed.
If the node is not currently running, however, just copying the main data directory
is safe.
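
For the offline case, the copy can be as simple as the sketch below. The systemd unit name is an assumption; substitute whatever manages your instance.

```
# Assumed unit name; adapt to how your QuarkDB / xrootd instance is run.
systemctl stop xrootd@quarkdb

# With the node stopped, a plain copy of the data directory is consistent.
cp -a /var/lib/quarkdb /var/lib/quarkdb-offline-copy

systemctl start xrootd@quarkdb
```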
## How to restore?
Restore works by creating an entirely new cluster out of a checkpoint. However, the checkpoint
initially remembers the old cluster members --- we need to reconfigure it and specify
a new node and a new clusterID.
1. Change the cluster's `MEMBERS` and `CLUSTER-ID` using the `quarkdb-recovery` command-line
tool (if the checkpoint was produced by a standalone instance, you can skip this step):
```
quarkdb-recovery --path /path/to/backup/current/raft-journal --command "recovery-force-reconfigure-journal localhost:8888| new-cluster-uuid"
```
This way, the new cluster will be composed of a single new member, such as
```localhost:8888```, and will have a different clusterID. This is important to
prevent the old and new clusters from accidentally interfering with one another.
Replace ```new-cluster-uuid``` with a unique string, such as a UUID.
2. Spin up a new node from the checkpoint directory --- example configuration file:
```
xrd.port 8888
xrd.protocol redis:8888 libXrdQuarkDB.so
redis.mode raft
redis.database /path/to/backup
redis.myself localhost:8888
```
Use the same host:port pair in `redis.myself` as in the ```quarkdb-recovery``` command invocation.
The resulting cluster will be raft-enabled and single-node. It's possible
to expand it later through regular membership updates.
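
As a rough sketch of bringing the restored node online (the configuration file path and the xrootd invocation are assumptions; adapt them to how you normally run QuarkDB):

```
# Start the node using the configuration shown above (illustrative invocation).
xrootd -c /etc/xrootd/xrootd-quarkdb.cfg -l /var/log/quarkdb.log &

# Verify the single-node cluster is up and has elected itself leader.
redis-cli -p 8888 raft-info
```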