implement DAQ Doctor's CrashDetector functionality
Created by: andreh12
see the original DAQ Doctor's code here.
Description from the DAQ Doctor documentation:
The CrashDetector helps to find crashed applications or crashed computers. In addition, it watches for faults of the disk controller (called “SAS controller”), which turned out to be a frequent problem on the computers deployed in the cluster. The failure mode of these controllers turns the disk into a read-only device, making it impossible to write any data. When the failure occurs, applications continue to run until they need to write data to disk. Since many of the DAQ applications do not write data to disk, they usually continue running until a state transition is requested by run-control (in this case log messages are usually written to a file).
Crashes of computers and crashes of the jobControl applications can be detected by jobControl flashlists not being updated any more. The CrashDetector searches for entries older than 100 seconds and reports these.
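The staleness check described above can be sketched as follows. This is only an illustration: the `FlashlistEntry` type, its field names, and the host names are hypothetical and not taken from the DAQ Doctor code; only the 100-second threshold comes from the description.

```python
from dataclasses import dataclass

# Threshold from the description: entries older than 100 seconds are reported.
STALE_AFTER_SECONDS = 100


@dataclass
class FlashlistEntry:
    hostname: str       # hypothetical field: host that published the entry
    last_update: float  # hypothetical field: Unix timestamp of the last update


def find_stale_entries(entries, now, max_age=STALE_AFTER_SECONDS):
    """Return the entries whose last update is more than max_age seconds old."""
    return [e for e in entries if now - e.last_update > max_age]


# Made-up example data: one fresh entry and one that is 150 seconds old.
entries = [
    FlashlistEntry("ru-01", last_update=1000.0),
    FlashlistEntry("bu-07", last_update=850.0),
]
stale = find_stale_entries(entries, now=1000.0)
print([e.hostname for e in stale])
```

Passing `now` explicitly instead of reading the clock inside the function keeps the check easy to test.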
To distinguish a crashed computer from a crashed application or a crashed jobControl, the CrashDetector tests, on all computers with old jobControl flashlists, whether the ssh daemon still answers. For this it tries to establish a connection to port 22 of the computer. It has been found empirically that this test is a reliable method to identify crashed computers. If a crashed RU or BUFU computer that is actively participating in the ongoing run is identified, it is blacklisted and a new configuration is generated and registered. The shifter is then informed that the next state transition will fail due to the broken computer and that the DAQ system needs to be re-cycled in order to pick up the new configuration.
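A minimal sketch of the port 22 test, assuming a plain TCP connection attempt is sufficient (no ssh handshake is performed). The function name and the 5-second timeout are assumptions made for illustration; the description does not specify a timeout.

```python
import socket


def ssh_port_answers(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port can be established.

    The timeout value is an assumption, not a documented setting.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

A host with a stale jobControl flashlist for which `ssh_port_answers(host)` returns `False` would be treated as a crashed computer; if it returns `True`, the problem is more likely a crashed jobControl or application.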
To find crashed applications, the jobControl flashlists are searched for jobs in a state different from “alive”. This test is performed in those phases of a run when all applications should be running.
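The job-state check could look roughly like this. The tuple layout `(hostname, job_name, state)`, the example job names, and the `all_apps_expected` flag are hypothetical; only the state string "alive" is taken from the description.

```python
def find_crashed_jobs(job_rows, all_apps_expected, alive_state="alive"):
    """Return (hostname, job_name) pairs for jobs that are not alive.

    job_rows: iterable of (hostname, job_name, state) tuples (assumed layout).
    all_apps_expected: False in run phases where applications may legitimately
    be stopped; the check is skipped in that case.
    """
    if not all_apps_expected:
        return []
    return [(host, job) for host, job, state in job_rows if state != alive_state]


# Made-up example rows.
rows = [
    ("ru-01", "evb", "alive"),
    ("bu-07", "evb", "dead"),
]
print(find_crashed_jobs(rows, all_apps_expected=True))
```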
Failures of a SAS controller are detected by system scripts which have been installed for this purpose and are run at regular intervals on all machines. The output of these scripts is aggregated in a file on a shared NFS file system. The DaqDoctor parses this file. Failed SAS controllers in RUs or BUFUs are handled in the same way as crashed computers.
The CrashDetector contains a hard-coded list of computers which are essential for global data taking but which are not part of the DAQ configurations (e.g. the computers of the monitoring system and the RunControl servers). The SAS checks are also performed on these computers.
For all checks, only those computers are considered which are actively taking part in the ongoing run. For example, computers in a masked slice are NOT considered.
To avoid a “storm” of new configurations with many computers blacklisted (e.g. in case of a cooling failure in a rack), the DaqDoctor is limited to generating at most one new configuration every 20 minutes.
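The 20-minute limit can be sketched as a small rate limiter. The class and method names are hypothetical, and this is one possible reading of the rule (a new configuration is allowed once at least 20 minutes have passed since the previous one), not the original implementation.

```python
class ReconfigurationLimiter:
    """Allow at most one new configuration per min_interval_seconds."""

    def __init__(self, min_interval_seconds=20 * 60):
        self.min_interval = min_interval_seconds
        self._last_time = None  # time of the last permitted reconfiguration

    def try_acquire(self, now):
        """Return True and record `now` if a new configuration is allowed."""
        if self._last_time is not None and now - self._last_time < self.min_interval:
            return False
        self._last_time = now
        return True


limiter = ReconfigurationLimiter()
print(limiter.try_acquire(0.0))     # first request
print(limiter.try_acquire(600.0))   # 10 minutes later
print(limiter.try_acquire(1200.0))  # 20 minutes after the first
```

In a long-running process, `now` would typically come from `time.monotonic()` so that wall-clock adjustments do not affect the limit.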