Keep cta-taped alive if at least one drive handler is alive
Description of the problem
If a drive fails critically (for example, it fails to unmount the tape), its drive handler requests a shutdown:
```cpp
SubprocessHandler::ProcessingStatus DriveHandler::processFatal(serializers::WatchdogMessage& message) {
  ...
  m_processingStatus.shutdownRequested = true;
  ...
}
```
The process manager then signals a shutdown to ALL subprocesses, and once every one of them has completed its shutdown, cta-taped exits:
```cpp
// Check the current statuses for shutdown requests
// If any process requests a shutdown, we will trigger it in all.
bool anyAskedShutdown = std::count_if(m_subprocessHandlers.cbegin(),
  m_subprocessHandlers.cend(),
  [&](const SubprocessAndStatus &i){
    if (i.status.shutdownRequested) {
      cta::log::ScopedParamContainer params(m_logContext);
      params.add("SubprocessName", i.handler->index);
      m_logContext.log(log::INFO, "Subprocess requested shutdown");
    }
    return i.status.shutdownRequested;
  });
if (anyAskedShutdown) {
  for(auto & sp: m_subprocessHandlers) {
    sp.status = sp.handler->shutdown();
    cta::log::ScopedParamContainer params(m_logContext);
    params.add("SubprocessName", sp.handler->index)
          .add("ShutdownComplete", sp.status.shutdownComplete);
    m_logContext.log(log::INFO, "Signaled shutdown to subprocess handler");
  }
}
// If all processes completed their shutdown, we can exit
bool shutdownComplete=true;
for (auto & sp: m_subprocessHandlers) { shutdownComplete &= sp.status.shutdownComplete; }
if (shutdownComplete) {
  m_logContext.log(log::INFO, "All subprocesses completed shutdown. Exiting.");
  RunPartStatus ret;
  ret.doExit = true;
  ret.exitCode = EXIT_SUCCESS;
  return ret;
}
```
This logic is a problem for multi-drive setups, where several drive handlers run in parallel: a single failing drive should not bring down the whole cta-taped process and, with it, every healthy drive.
Proposal
The process manager should count the running drive handler subprocesses and proceed with the shutdown only if all of them have requested it. Otherwise, it should remember which subprocess failed, not try to restart it, and keep the remaining subprocesses running.
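A minimal, self-contained sketch contrasting the current rule with the proposed one might look as follows. Everything in it is hypothetical: `Subprocess` is a simplified stand-in for an entry of `m_subprocessHandlers`, and the `isDriveHandler` and `failed` fields, as well as `currentRuleShutsDownAll`, `proposedRuleShutsDownAll` and `mayRestart`, are illustrative names, not existing CTA code. It also assumes that a shutdown request from a non-drive subprocess should keep today's behaviour and stop everything.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified view of one entry of m_subprocessHandlers.
// Only the fields needed for the shutdown decision are modelled.
struct Subprocess {
  std::string name;
  bool isDriveHandler = false;     // assumption: drive handlers can be told apart
  bool shutdownRequested = false;  // e.g. set by DriveHandler::processFatal()
  bool failed = false;             // hypothetical: remembered so it is never restarted
};

// Current rule (as quoted above): any single request shuts down everything.
bool currentRuleShutsDownAll(const std::vector<Subprocess>& sps) {
  for (const auto& sp : sps) {
    if (sp.shutdownRequested) return true;
  }
  return false;
}

// Proposed rule: a drive handler's request only marks that drive as failed;
// cta-taped shuts down once every drive handler has failed.
// Assumption: a request from a non-drive subprocess keeps the current behaviour.
bool proposedRuleShutsDownAll(std::vector<Subprocess>& sps) {
  std::size_t drives = 0;
  std::size_t failedDrives = 0;
  bool nonDriveRequested = false;
  for (auto& sp : sps) {
    if (!sp.isDriveHandler) {
      nonDriveRequested = nonDriveRequested || sp.shutdownRequested;
      continue;
    }
    ++drives;
    if (sp.shutdownRequested) sp.failed = true;  // remember which drive failed
    if (sp.failed) ++failedDrives;
  }
  assert(failedDrives <= drives);  // each drive handler is counted at most once
  return nonDriveRequested || (drives > 0 && failedDrives == drives);
}

// A failed drive handler must not be restarted; any other subprocess may be.
bool mayRestart(const Subprocess& sp) {
  return !(sp.isDriveHandler && sp.failed);
}
```

With two drive handlers where only one has called `processFatal()`, the current rule returns `true` (both drives go down), while the proposed rule returns `false`, marks just that drive as failed and excludes it from restarts; only when the last healthy drive also fails does the whole daemon exit.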