From 7881c3b60327f1a9f83b9b94b2d3f3ae4499802e Mon Sep 17 00:00:00 2001
From: Ben Morrice <ben.morrice@cern.ch>
Date: Thu, 6 Jan 2022 10:45:51 +0100
Subject: [PATCH] add nomad/troubleshooting with 'lost leader' details

---
 docs/nomad/access.md          | 19 ------------
 docs/nomad/troubleshooting.md | 58 +++++++++++++++++++++++++++++++++++
 mkdocs.yml                    |  1 +
 3 files changed, 59 insertions(+), 19 deletions(-)
 create mode 100644 docs/nomad/troubleshooting.md

diff --git a/docs/nomad/access.md b/docs/nomad/access.md
index dd17dcd..b6318ba 100644
--- a/docs/nomad/access.md
+++ b/docs/nomad/access.md
@@ -35,22 +35,3 @@ You can also check job logs on <https://es-linux6.cern.ch/kibana_private/goto/45
 
 All cronjobs' definitions can be found here:
 <https://gitlab.cern.ch/linuxsupport/cronjobs>
-
-## Troubleshooting
-
-If jobs are not starting due to placement failures (no resources on any node),
-the cluster might have gotten itself into a broken state. I've also seen problems
-with `docker pull` after S3 failures. Here are some things you can do to try to recover:
-
-!!! danger "Important Note"
-    Do this only to one server at a time, keeping the others running. Otherwise
-    you risk loosing the entire cluster, which would be a real PITA to recover.
-
-1. Restart Nomad: `service nomad restart`
-2. Stop Nomad, restart the docker daemon, restart nomad
-3. Stop Nomad, delete the client DB, restart Nomad:
-  `service nomad stop && rm -f /var/lib/nomad/client/state.db && service nomad start`
-  (Do this especially if nomad crashes on startup. There have been some bugs in
-  the past that prevented nomad from restarting cleanly)
-4. Reboot the node
-
diff --git a/docs/nomad/troubleshooting.md b/docs/nomad/troubleshooting.md
new file mode 100644
index 0000000..ccc0c1f
--- /dev/null
+++ b/docs/nomad/troubleshooting.md
@@ -0,0 +1,58 @@
+# Troubleshooting
+
+## Placement failures
+
+If jobs are not starting due to placement failures (no resources available on any node),
+the cluster might have gotten itself into a broken state. Problems with `docker pull` have
+also been seen after S3 failures. Here are some things you can try in order to recover:
+
+!!! danger "Important Note"
+    Do this on only one server at a time, keeping the others running. Otherwise
+    you risk losing the entire cluster, which would be a real pain to recover from.
+
+1. Restart Nomad: `service nomad restart`
+2. Stop Nomad, restart the Docker daemon, then start Nomad again
+3. Stop Nomad, delete the client state DB, then start Nomad again:
+   `service nomad stop && rm -f /var/lib/nomad/client/state.db && service nomad start`
+   (do this especially if Nomad crashes on startup; there have been bugs in the past
+   that prevented Nomad from restarting cleanly)
+4. Reboot the node
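+
+To confirm the symptom before (or after) trying any of these steps, the Nomad CLI can be
+queried from one of the servers. This is a minimal sketch, assuming the CLI environment is
+set up with `setNomad.sh` as in the "Lost leader" section below; replace `<jobname>` with
+the affected job:
+
+```
+# . setNomad.sh
+# /usr/local/bin/nomad node status           # all clients should be 'ready'
+# /usr/local/bin/nomad job status <jobname>  # look for reported placement failures
+```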
+
+## Lost leader
+
+Without a leader, Nomad is unable to schedule or execute any jobs. Leader election should happen automatically through the Raft protocol; however, it can happen that Nomad gets confused and the election gets stuck or flaps. If this happens, you will see the following:
+
+```
+# . setNomad.sh
+# /usr/local/bin/nomad server members
+Name                        Address                    Port  Status  Leader  Protocol  Build  Datacenter  Region
+lxsoftadm27.cern.ch.global  2001:1458:d00:39::100:469  4648  alive   false   2         1.1.4  meyrin      global
+lxsoftadm28.cern.ch.global  2001:1458:d00:16::3c       4648  alive   false   2         1.1.4  meyrin      global
+lxsoftadm29.cern.ch.global  2001:1458:d00:2d::100:4a   4648  alive   false   2         1.1.4  meyrin      global
+lxsoftadm30.cern.ch.global  2001:1458:d00:32::100:1c   4648  alive   false   2         1.1.4  meyrin      global
+lxsoftadm31.cern.ch.global  2001:1458:d00:12::3d       4648  alive   false   2         1.1.4  meyrin      global
+
+Error determining leaders: 1 error occurred:
+    * Region "global": Unexpected response code: 500 (No cluster leader)
+```
+
+To fix this issue, follow the [upstream instructions](https://learn.hashicorp.com/tutorials/nomad/outage-recovery?in=nomad/manage-clusters#manual-recovery-using-peers-json).
+
+This process is essentially:
+
+1. Stop Nomad on all servers: `systemctl stop nomad`
+2. Populate `/var/lib/nomad/server/raft/peers.json` on all servers
+3. Start Nomad on all servers: `systemctl start nomad`
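+
+Once Nomad has been started again on all servers, verify that a leader has been elected;
+one of the servers should now show `true` in the Leader column:
+
+```
+# . setNomad.sh
+# /usr/local/bin/nomad server members
+```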
+
+Notes:
+
+* `/var/lib/nomad/server/raft/peers.json` is removed once Nomad has started up
+* The upstream instructions use IPv4 for the `address` in `peers.json`. As we use IPv6, the `peers.json` addresses should be IPv6 as well
+* A script to assist with the creation of `peers.json` can be found [here](https://gitlab.cern.ch/-/snippets/1964); a rough sketch of the idea is shown at the end of this page
+
+The `peers.json` file is a JSON array with one entry per Nomad server, similar to the below (truncated to a single entry here):
+
+```
+[
+  {
+    "id": "f57cdac1-91eb-5415-0d12-01a5cb8a6570",
+    "address": "[2001:1458:d00:16::3c]:4647"
+  }
+]
+```
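+
+For illustration only (the linked snippet above is the reference, and may differ): a minimal
+sketch of how one such entry could be generated on each server, assuming the default data
+directory `/var/lib/nomad`, the default RPC port `4647`, and that the server's raft ID is
+stored in `/var/lib/nomad/server/node-id`. Collect the output from all servers and merge it
+into a single JSON array:
+
+```
+#!/bin/bash
+# Print one peers.json entry for this server (sketch only; verify the paths before use).
+ID=$(cat /var/lib/nomad/server/node-id)
+# Resolve this host's IPv6 address, since our cluster uses IPv6 for the raft addresses.
+ADDR=$(getent ahostsv6 "$(hostname -f)" | awk 'NR==1 {print $1}')
+printf '  {\n    "id": "%s",\n    "address": "[%s]:4647"\n  }\n' "$ID" "$ADDR"
+```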
diff --git a/mkdocs.yml b/mkdocs.yml
index a6a75c5..7fd4ff8 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -54,6 +54,7 @@ nav:
         - 'Mirroring new repos': nomad/mirroring.md
         - 'Crons': nomad/crons.md
         - 'Development': nomad/dev.md
+        - 'Troubleshooting': nomad/troubleshooting.md
     - 'AIMS2':
         - 'AIMS2 ': aims2/aims2.md
         - 'AIMS2 Architecture': aims2/aims2architecture.md
-- 
GitLab