Investigate cta-taped Watchdog timer configuration
In CTA/tapeserver/daemon/TapedConfiguration.hpp, several parameters are set for the Watchdog. They are logged and saved to the drive configuration in the DB, but otherwise they don't appear to be used anywhere:
//----------------------------------------------------------------------------
// Watchdog: parameters for timeouts in various situations.
//----------------------------------------------------------------------------
/// Maximum time allowed to complete a single mount scheduling.
cta::SourcedParameter<time_t> wdScheduleMaxSecs{
"taped", "WatchdogScheduleMaxSecs", 60, "Compile time default"};
/// Maximum time allowed to complete mounting a tape.
cta::SourcedParameter<time_t> wdMountMaxSecs{
"taped", "WatchdogMountMaxSecs", 900, "Compile time default"};
/// Maximum time allowed after mounting without a single tape block move
cta::SourcedParameter<time_t> wdNoBlockMoveMaxSecs{
"taped", "WatchdogNoBlockMoveMaxSecs", 1800, "Compile time default"};
/// Time to wait after scheduling came up idle
cta::SourcedParameter<time_t> wdIdleSessionTimer{
"taped", "WatchdogIdleSessionTimer", 10, "Compile time default"};
There are also several constants set in ./tapeserver/castor/tape/tapeserver/daemon/Constants.hpp with the same values, but these do not appear to be used either:
/**
* The delay in seconds the master process of the tapeserverd daemon should
* wait before launching another transfer session whilst the corresponding
* drive is idle.
*/
const unsigned int TAPESERVER_TRANSFERSESSION_TIMER = 10;
/**
* The compile-time default value for the maximum time in seconds that the
* data-transfer session can take to get the transfer job from the client.
*/
const time_t TAPESERVER_WAITJOBTIMEOUT = 60; // 1 minute
/**
* The compile-time default value for the maximum time in seconds that the
* data-transfer session can take to mount a tape.
*/
const time_t TAPESERVER_MOUNTTIMEOUT = 900; // 15 minutes
/**
* The compile-time default value for the maximum time in seconds the
* data-transfer session of tapeserverd can cease to move data blocks.
*/
const time_t TAPESERVER_BLKMOVETIMEOUT = 1800; // 30 minutes
@smurray says that the most important one is the block move timeout:
This particular one is to spot a stuck transfer. Whether that means disk to memory, memory to tape, tape to memory, memory to disk, or all of these cases is not clear.
The point here is to monitor block movements and to NOT naively monitor file movements. Timing out on the transmission of a whole file is rather brutal and can make the wrong decision. There's a big difference between slow block movements and no block movements.
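To make the distinction between slow block movements and no block movements concrete, here is a minimal sketch of what such a check could look like. It is not existing CTA code: the class `BlockMoveWatchdog` and its methods `notifyBlockMoved` and `isStuck` are hypothetical names used for illustration only, and the sketch assumes a single caller (a real implementation would need to synchronise access between the transfer threads and the watchdog).

```cpp
#include <chrono>

// Hypothetical helper, NOT part of CTA: remembers when the last block moved
// and reports a stuck transfer only when nothing has moved for too long.
class BlockMoveWatchdog {
public:
  using Clock = std::chrono::steady_clock;

  // noBlockMoveMax would come from something like wdNoBlockMoveMaxSecs (assumption).
  BlockMoveWatchdog(std::chrono::seconds noBlockMoveMax, Clock::time_point start)
    : m_noBlockMoveMax(noBlockMoveMax), m_lastBlockMove(start) {}

  // To be called whenever a block moves, in any direction
  // (disk to memory, memory to tape, tape to memory, memory to disk).
  void notifyBlockMoved(Clock::time_point now) { m_lastBlockMove = now; }

  // True only if no block has moved for longer than the timeout.
  // A slow transfer keeps resetting the timer and is never reported.
  bool isStuck(Clock::time_point now) const {
    return now - m_lastBlockMove > m_noBlockMoveMax;
  }

private:
  std::chrono::seconds m_noBlockMoveMax;  // maximum tolerated gap between block moves
  Clock::time_point m_lastBlockMove;      // time of the most recent block move
};
```

With this shape, a very large file transferred slowly never trips the timer as long as individual blocks keep moving, whereas a per-file timeout would kill it. The time points are passed in explicitly so the logic can be tested without waiting for real time to elapse.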
A lot of effort was spent to collect and store this configuration value. Why is a block movement timeout not being used?
We should investigate the tape server code to see where this timeout should go.