
Nomad nodes

Merged Ghost User requested to merge nomad_nodes into master
# Nomad
Nomad is a workload orchestrator that we use for automating tasks.
## Access
### UI
To access Nomad's UI, point your browser to:
https://lxsoftadm.cern.ch:4646
Then get the admin ACL token:
```
tbag show nomad_submit_token --hg lxsoft/adm
```
### CLI
TODO
### Monitoring
Basic monitoring is available here:
https://kojimon.web.cern.ch/d/_ffC7H-ik/nomad?refresh=10s&orgId=1&from=now-7d&to=now
## Cronjobs
All cronjob definitions can be found here:
https://gitlab.cern.ch/linuxsupport/cronjobs
## Troubleshooting
If jobs are not starting due to placement failures (no resources on any node),
the cluster may have gotten into a broken state. I've also seen problems
with `docker pull` after S3 failures. Here are some things you can try in order to recover:
!!! danger "Important Note"
    Do this on only one server at a time, keeping the others running. Otherwise
    you risk losing the entire cluster, which would be a real PITA to recover.
1. Restart Nomad: `service nomad restart`
2. Stop Nomad, restart the Docker daemon, then start Nomad again
3. Stop Nomad, delete the client DB, and start Nomad again:
    `service nomad stop && rm -f /var/lib/nomad/client/state.db && service nomad start`
    (Do this especially if Nomad crashes on startup. Past bugs have prevented Nomad from restarting cleanly.)
4. Reboot the node
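
Steps 1 to 3 above could be wrapped in a small helper like the sketch below, which only prints the commands unless `RUN=1` is set. This is an illustration, not an existing script: the names `run`, `recover_step`, `RUN` and `STATE_DB` are hypothetical, and the Docker service name `docker` is an assumption, since the steps above do not name it. Run it on one node at a time.

```shell
#!/bin/sh
# Sketch of recovery steps 1-3 for a SINGLE node.
# Dry run by default: commands are printed, and executed only when RUN=1.

# Client state DB path, as quoted in step 3 above.
STATE_DB=/var/lib/nomad/client/state.db

run() {
    # Print the command; execute it only when RUN=1.
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    fi
}

recover_step() {
    case "$1" in
        1)  run service nomad restart ;;
        2)  run service nomad stop &&
            run service docker restart &&   # assumed service name
            run service nomad start ;;
        3)  run service nomad stop &&
            run rm -f "$STATE_DB" &&
            run service nomad start ;;
        *)  echo "usage: recover_step 1|2|3" >&2
            return 2 ;;
    esac
}
```

After sourcing the file, `recover_step 3` shows what step 3 would do. Exporting `RUN=1` first makes the same call execute the commands. Step 4 (reboot) is left out on purpose so that it stays a deliberate manual action.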