Silent failure in grid job on "No space left on device"
Hi, we found that in our easyjet production on the grid, although all jobs finished successfully, some of the output files cannot be opened. After checking the logs, we found that the affected jobs should have failed because of a WriteBuffer ERROR, but they did not. Here is an example job: https://bigpanda.cern.ch/job/6588987280/. The relevant part of its log is attached below.
AthenaEventLoopMgr INFO ===>>> done processing event #123000, run #0 123001 events processed so far <<<===
AthenaEventLoopMgr INFO ===>>> start processing event #123500, run #0 123500 events processed so far <<<===
AthenaEventLoopMgr INFO ===>>> done processing event #123500, run #0 123501 events processed so far <<<===
AthenaEventLoopMgrWARNING INFO message limit (500) reached for AthenaEventLoopMgr. Suppressing further output.
ToolSvc.BookkeeperTool INFO Copying input containers for source ''
ToolSvc.BookkeeperTool INFO Preparing local cache for source '' with 1 variations
ToolSvc.TrigDecisionTool INFO Updating config in slot 0 with SMK: 2695 and L1PSK: 20511 and HLTPSK: 15177 and BGSK: 0 getForceConfigUpdate()=1 HLT Chains: 3059
ToolSvc.BookkeeperTool INFO Copying input containers for source ''
ToolSvc.BookkeeperTool INFO Preparing local cache for source '' with 1 variations
ToolSvc.TrigDecisionTool INFO Updating config in slot 0 with SMK: 2695 and L1PSK: 20511 and HLTPSK: 15177 and BGSK: 0 getForceConfigUpdate()=1 HLT Chains: 3059
ToolSvc.BookkeeperTool INFO Copying input containers for source ''
ToolSvc.BookkeeperTool INFO Preparing local cache for source '' with 1 variations
ToolSvc.TrigDecisionTool INFO Updating config in slot 0 with SMK: 2695 and L1PSK: 20511 and HLTPSK: 15177 and BGSK: 0 getForceConfigUpdate()=1 HLT Chains: 3059
ToolSvc.BookkeeperTool INFO Copying input containers for source ''
ToolSvc.BookkeeperTool INFO Preparing local cache for source '' with 1 variations
ToolSvc.TrigDecisionTool INFO Updating config in slot 0 with SMK: 2695 and L1PSK: 20511 and HLTPSK: 15177 and BGSK: 0 getForceConfigUpdate()=1 HLT Chains: 3059
ToolSvc.BookkeeperTool INFO Copying input containers for source ''
ToolSvc.BookkeeperTool INFO Preparing local cache for source '' with 1 variations
ToolSvc.TrigDecisionTool INFO Updating config in slot 0 with SMK: 2695 and L1PSK: 20511 and HLTPSK: 15177 and BGSK: 0 getForceConfigUpdate()=1 HLT Chains: 3059
ToolSvc.BookkeeperTool INFO Copying input containers for source ''
ToolSvc.BookkeeperTool INFO Preparing local cache for source '' with 1 variations
ToolSvc.TrigDecisionTool INFO Updating config in slot 0 with SMK: 2695 and L1PSK: 20511 and HLTPSK: 15177 and BGSK: 0 getForceConfigUpdate()=1 HLT Chains: 3059
TFile::WriteBuffer ERROR error writing all requested bytes to file output-tree.root, wrote 393 of 483
TBranchElement::WriteB... ERROR basket's WriteBuffer failed.
TFile::WriteBuffer ERROR error writing to file output-tree.root (-1) No space left on device
TBranchElement::WriteB... ERROR basket's WriteBuffer failed.
TBranchElement::WriteB... ERROR basket's WriteBuffer failed.
TBranchElement::WriteB... ERROR basket's WriteBuffer failed.
TBranchElement::WriteB... ERROR basket's WriteBuffer failed.
TBranchElement::WriteB... ERROR basket's WriteBuffer failed.
TBranchElement::WriteB... ERROR basket's WriteBuffer failed.
TBranchElement::WriteB... ERROR basket's WriteBuffer failed.
TBranchElement::WriteB... ERROR basket's WriteBuffer failed.
TBranchElement::WriteB... ERROR basket's WriteBuffer failed.
TBranchElement::WriteB... ERROR basket's WriteB
=== stderr ===
250406 03:41:49 2901625 secgsi_GetCACheck: CRL entry for '9c979c2b.0:1' needs refreshing: clean the related entry cache first (0x150b680693a0)
Info in <xAOD::TFileAccessTracer>: Sending file access statistics to http://rucio-lb-prod.cern.ch:18762/traces/
This seems to be an Athena-related or even ROOT-related issue. Thanks a lot @aad for agreeing to check!
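
Until the root cause is understood, one way to flag affected jobs on the user side might be to scan the job log for the write errors shown above. The following is only a minimal sketch: the function name `find_write_errors` is hypothetical, the patterns are copied from the messages in this log, and it assumes the log is available as plain text. It is not an Athena or ROOT feature.

```python
import re

# Patterns copied from the error messages in the log above (assumed to be
# representative; other ROOT write failures may be worded differently).
WRITE_ERROR_PATTERNS = [
    re.compile(r"TFile::WriteBuffer\s+ERROR"),
    re.compile(r"ERROR basket's WriteBuffer failed"),
    re.compile(r"No space left on device"),
]


def find_write_errors(lines):
    """Return (line_number, text) pairs for lines matching a write-error pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(pattern.search(line) for pattern in WRITE_ERROR_PATTERNS):
            hits.append((number, line.rstrip("\n")))
    return hits
```

The function accepts any iterable of lines, so an open log file can be passed directly, and a non-empty result could be used to mark the corresponding output file as suspect before it is used downstream.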