
Update collectd alarms - freeze, oldtmp, notupdated

Merged Marta Vila Fernandes requested to merge los1068 into master
+39 -10
@@ -19,30 +19,59 @@ Last update for [lxsoft](https://gitlab.cern.ch/ai/it-puppet-hostgroup-lxsoft/-/
Afterwards, run puppet and restart collectd on each lxsoft web node.
## c8_freeze
## freeze
This alarm indicates that the C8/CS8 snapshot process can't update the symlinks because
| Distribution | Freeze alarm   |
|--------------|----------------|
| CS8          | freeze-stream8 |
| CS9          | freeze-stream9 |
| ALMA8        | freeze-alma8   |
| ALMA9        | freeze-alma9   |
| RHEL8        | freeze-rhel8   |
| RHEL9        | freeze-rhel9   |
 
These alarms indicate that the respective snapshot process can't update the symlinks because
there's a `.freeze.*all` file. You should also have received an email explaining why the
process was stopped and what you need to do to resolve it.
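
As a rough aid when triaging a freeze alarm, the sketch below lists any freeze files in a given directory. It assumes `.freeze.*all` is a shell-style glob and takes the directory as a parameter, since the text does not say where the file lives. The function name `freeze_files` is hypothetical.

```python
import glob
import os
from typing import List


def freeze_files(directory: str) -> List[str]:
    """List files in `directory` matching `.freeze.*all`.

    Assumption: the pattern from the text is a shell glob, not a regex.
    """
    # glob matches dotfiles here because the pattern itself starts with a dot.
    return sorted(glob.glob(os.path.join(directory, ".freeze.*all")))
```

An empty result suggests the freeze file has already been removed. The email mentioned above remains the authoritative explanation of why the process stopped.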
## c8_notupdated
## notupdated
 
 
| Distribution | notupdated alarm   | Snapshot Nomad jobs | ES dashboard |
|--------------|--------------------|---------------------|--------------|
| CS8   | notupdated-stream8 | https://gitlab.cern.ch/linuxsupport/cronjobs/stream8_snapshots/ | https://es-linux.cern.ch/kibana/app/dashboards?security_tenant=internal#/view/6b8102c0-51cb-11eb-932f-51687e53f66a |
| CS9   | notupdated-stream9 | https://gitlab.cern.ch/linuxsupport/cronjobs/stream9_snapshots/ | https://es-linux.cern.ch/kibana/app/dashboards?security_tenant=internal#/view/25061790-0bf6-11ed-8484-13abadf100a6 |
| ALMA8 | notupdated-alma8   | https://gitlab.cern.ch/linuxsupport/cronjobs/alma_snapshots/ | https://es-linux.cern.ch/dashboards/goto/1bb9a3e765f7e99aa77fcbaf6f1994e5?security_tenant=internal |
| ALMA9 | notupdated-alma9   | https://gitlab.cern.ch/linuxsupport/cronjobs/alma_snapshots/ | https://es-linux.cern.ch/dashboards/goto/18a216d237182fc2b3529a17468dfa39?security_tenant=internal |
| RHEL8 | notupdated-rhel8   | https://gitlab.cern.ch/linuxsupport/cronjobs/rhel_snapshots/ | https://es-linux.cern.ch/dashboards/goto/9bff1a346435c88c080cc492fe34a276?security_tenant=internal |
| RHEL9 | notupdated-rhel9   | https://gitlab.cern.ch/linuxsupport/cronjobs/rhel_snapshots/ | https://es-linux.cern.ch/dashboards/goto/78ef38cf73cb832b6fa7f15c756869d7?security_tenant=internal |

This alarm is triggered when the `.8-latest` (or `.s8-latest`) symlink hasn't been updated in over 30 hours.
These alarms are triggered when the `.{8/9}-latest` (or `.{s8/s9}-latest`) symlink hasn't been updated in over 30 hours.
This symlink is always supposed to point to the latest snapshot and is supposed to be updated
every day. If it hasn't been updated, chances are something is wrong with the [centos8_snapshots](https://gitlab.cern.ch/linuxsupport/cronjobs/centos8_snapshots) or [stream8_snapshots](https://gitlab.cern.ch/linuxsupport/cronjobs/stream8_snapshots)
every day. If it hasn't been updated, chances are something is wrong with the snapshot Nomad job.
Nomad job. Check the logs in the ES dashboard for [CS8](https://es-linux.cern.ch/kibana/app/dashboards?security_tenant=internal#/view/6b8102c0-51cb-11eb-932f-51687e53f66a) or [CS9](https://es-linux.cern.ch/kibana/app/dashboards?security_tenant=internal#/view/25061790-0bf6-11ed-8484-13abadf100a6).
Check the logs in the ES dashboard.
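
To confirm by hand that a `-latest` symlink is stale, something like the following sketch may help. It assumes the age is measured from the modification time of the link itself, not of its target. How the actual collectd check computes the age is not shown here, and the function names are hypothetical.

```python
import os
import time
from typing import Optional

MAX_AGE_HOURS = 30  # threshold stated in the alarm description


def symlink_age_hours(link_path: str, now: Optional[float] = None) -> float:
    """Return the hours elapsed since the symlink itself was last modified."""
    if not os.path.islink(link_path):
        raise ValueError(f"{link_path} is not a symlink")
    now = time.time() if now is None else now
    # lstat() inspects the link rather than its target, so a symlink that was
    # re-created to point at a new snapshot reports a fresh mtime.
    return (now - os.lstat(link_path).st_mtime) / 3600.0


def is_stale(link_path: str, now: Optional[float] = None) -> bool:
    """True if the symlink has not been updated in over MAX_AGE_HOURS."""
    return symlink_age_hours(link_path, now) > MAX_AGE_HOURS
```

A stale result only confirms the symptom. The Nomad job logs are still where the cause has to be found.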
## c8_oldtmp
 
## oldtmp
!!! danger ""
    Before deleting _anything_, make sure you know what you're doing. If in doubt, double-check with the rest of the team.
This alarm indicates that there are old `.tmp.*` directories in `/mnt/data1/dist/cern/centos/*-snapshots/`.
These alarms indicate that there are old `.tmp.*` directories in `/mnt/data1/dist/cern/{dist}/{version}-snapshots/`.
Those directories are created when the snapshot is run, but they are renamed at the end of the process.
Those directories are created when the snapshot is running, but they are renamed at the end of the process.
If there are directories left over, it means something interrupted that day's snapshot and needs to be investigated.
If the snapshots are currently failing, don't delete today's `.tmp.*` snapshot, and **never** delete the `.*-latest` symlink
(or any symlink, for that matter). The latest symlink should always exist and point to something.
 
| Distribution | oldtmp alarm   | dist   | version |
|--------------|----------------|--------|---------|
| CS8          | oldtmp-stream8 | centos | s8      |
| CS9          | oldtmp-stream9 | centos | s9      |
| ALMA8        | oldtmp-alma8   | alma   | 8       |
| ALMA9        | oldtmp-alma9   | alma   | 9       |
| RHEL8        | oldtmp-rhel8   | rhel   | 8       |
| RHEL9        | oldtmp-rhel9   | rhel   | 9       |
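
The mapping from alarm name to snapshot directory can be sketched as below, together with a read-only listing of leftover `.tmp.*` directories. The 24-hour cutoff and the names `SNAPSHOT_DIRS`, `snapshot_dir` and `old_tmp_dirs` are assumptions for illustration; the threshold the alarm really uses is not stated here. The code deliberately deletes nothing.

```python
import glob
import os
import time
from typing import List, Optional

# (dist, version) per alarm, copied from the table above.
SNAPSHOT_DIRS = {
    "oldtmp-stream8": ("centos", "s8"),
    "oldtmp-stream9": ("centos", "s9"),
    "oldtmp-alma8": ("alma", "8"),
    "oldtmp-alma9": ("alma", "9"),
    "oldtmp-rhel8": ("rhel", "8"),
    "oldtmp-rhel9": ("rhel", "9"),
}


def snapshot_dir(alarm: str, base: str = "/mnt/data1/dist/cern") -> str:
    """Expand {dist}/{version}-snapshots for the given alarm name."""
    dist, version = SNAPSHOT_DIRS[alarm]
    return os.path.join(base, dist, f"{version}-snapshots")


def old_tmp_dirs(directory: str, max_age_hours: float = 24.0,
                 now: Optional[float] = None) -> List[str]:
    """List `.tmp.*` directories older than the cutoff (assumed, not official)."""
    now = time.time() if now is None else now
    cutoff = now - max_age_hours * 3600
    return sorted(
        path
        for path in glob.glob(os.path.join(directory, ".tmp.*"))
        # Skip symlinks entirely: they must never be touched.
        if os.path.isdir(path) and not os.path.islink(path)
        and os.lstat(path).st_mtime < cutoff
    )
```

For example, `old_tmp_dirs(snapshot_dir("oldtmp-alma9"))` would list candidates to investigate for ALMA9. Whether any of them can be removed is still a decision for a human.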
 
## repo_wrong
This alarm indicates an issue with the named repo, as indicated by `/usr/bin/repoquery --repofrompath=<repoid>,<repopath> -qa`.
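
To reproduce the check manually, the command above can be wrapped as in this sketch. It assumes a non-zero exit status (or a failure to run at all) is what signals a broken repo. How the alarm itself interprets the output is not described here, and the function names and the timeout are hypothetical.

```python
import subprocess
from typing import List


def repoquery_command(repoid: str, repopath: str) -> List[str]:
    """Build the repoquery invocation quoted in the alarm description."""
    return ["/usr/bin/repoquery", f"--repofrompath={repoid},{repopath}", "-qa"]


def repo_is_queryable(repoid: str, repopath: str, timeout: float = 300.0) -> bool:
    """Return True if repoquery exits with status 0 (assumed success criterion)."""
    try:
        result = subprocess.run(
            repoquery_command(repoid, repopath),
            stdout=subprocess.DEVNULL,  # the package list itself is not needed
            stderr=subprocess.PIPE,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Missing binary or a hung query both count as "not queryable" here.
        return False
    return result.returncode == 0
```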