
Silent crash in grid job when No space left on device

Hi, we found that in our easyjet production on the grid, although all jobs finished successfully, some of the output files cannot be opened. After checking the logs, we found that the relevant jobs should have failed because of a WriteBuffer ERROR, but they did not. An example job: https://bigpanda.cern.ch/job/6588987280/. The logs are attached below.

AthenaEventLoopMgr                                   INFO   ===>>>  done processing event #123000, run #0 123001 events processed so far  <<<===
AthenaEventLoopMgr                                   INFO   ===>>>  start processing event #123500, run #0 123500 events processed so far  <<<===
AthenaEventLoopMgr                                   INFO   ===>>>  done processing event #123500, run #0 123501 events processed so far  <<<===
AthenaEventLoopMgrWARNING INFO message limit (500) reached for AthenaEventLoopMgr. Suppressing further output.
ToolSvc.BookkeeperTool                               INFO Copying input containers for source ''
ToolSvc.BookkeeperTool                               INFO Preparing local cache for source '' with 1 variations
ToolSvc.TrigDecisionTool                             INFO Updating config in slot 0 with SMK: 2695 and L1PSK: 20511 and HLTPSK: 15177 and BGSK: 0 getForceConfigUpdate()=1 HLT Chains: 3059
ToolSvc.BookkeeperTool                               INFO Copying input containers for source ''
ToolSvc.BookkeeperTool                               INFO Preparing local cache for source '' with 1 variations
ToolSvc.TrigDecisionTool                             INFO Updating config in slot 0 with SMK: 2695 and L1PSK: 20511 and HLTPSK: 15177 and BGSK: 0 getForceConfigUpdate()=1 HLT Chains: 3059
ToolSvc.BookkeeperTool                               INFO Copying input containers for source ''
ToolSvc.BookkeeperTool                               INFO Preparing local cache for source '' with 1 variations
ToolSvc.TrigDecisionTool                             INFO Updating config in slot 0 with SMK: 2695 and L1PSK: 20511 and HLTPSK: 15177 and BGSK: 0 getForceConfigUpdate()=1 HLT Chains: 3059
ToolSvc.BookkeeperTool                               INFO Copying input containers for source ''
ToolSvc.BookkeeperTool                               INFO Preparing local cache for source '' with 1 variations
ToolSvc.TrigDecisionTool                             INFO Updating config in slot 0 with SMK: 2695 and L1PSK: 20511 and HLTPSK: 15177 and BGSK: 0 getForceConfigUpdate()=1 HLT Chains: 3059
ToolSvc.BookkeeperTool                               INFO Copying input containers for source ''
ToolSvc.BookkeeperTool                               INFO Preparing local cache for source '' with 1 variations
ToolSvc.TrigDecisionTool                             INFO Updating config in slot 0 with SMK: 2695 and L1PSK: 20511 and HLTPSK: 15177 and BGSK: 0 getForceConfigUpdate()=1 HLT Chains: 3059
ToolSvc.BookkeeperTool                               INFO Copying input containers for source ''
ToolSvc.BookkeeperTool                               INFO Preparing local cache for source '' with 1 variations
ToolSvc.TrigDecisionTool                             INFO Updating config in slot 0 with SMK: 2695 and L1PSK: 20511 and HLTPSK: 15177 and BGSK: 0 getForceConfigUpdate()=1 HLT Chains: 3059
TFile::WriteBuffer        ERROR   error writing all requested bytes to file output-tree.root, wrote 393 of 483
TBranchElement::WriteB... ERROR   basket's WriteBuffer failed.
TFile::WriteBuffer        ERROR   error writing to file output-tree.root (-1) No space left on device
TBranchElement::WriteB... ERROR   basket's WriteBuffer failed.
TBranchElement::WriteB... ERROR   basket's WriteBuffer failed.
TBranchElement::WriteB... ERROR   basket's WriteBuffer failed.
TBranchElement::WriteB... ERROR   basket's WriteBuffer failed.
TBranchElement::WriteB... ERROR   basket's WriteBuffer failed.
TBranchElement::WriteB... ERROR   basket's WriteBuffer failed.
TBranchElement::WriteB... ERROR   basket's WriteBuffer failed.
TBranchElement::WriteB... ERROR   basket's WriteBuffer failed.
TBranchElement::WriteB... ERROR   basket's WriteBuffer failed.
TBranchElement::WriteB... ERROR   basket's WriteB

=== stderr ===
250406 03:41:49 2901625 secgsi_GetCACheck: CRL entry for '9c979c2b.0:1' needs refreshing: clean the related entry cache first (0x150b680693a0)
Info in <xAOD::TFileAccessTracer>: Sending file access statistics to http://rucio-lb-prod.cern.ch:18762/traces/

This seems to be an Athena-related or even ROOT-related issue. Thanks a lot @aad for agreeing to check!
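
Until the root cause is understood, a post-job check could flag affected jobs by scanning the payload log for the error lines shown above. The following is only a minimal sketch: the pattern list is taken from this one log excerpt and is probably not exhaustive, and the function names `find_write_errors` and `job_should_fail` are hypothetical helpers, not part of Athena, ROOT, or the pilot.

```python
import re

# Patterns copied from the log excerpt in this report.
# Assumption: other ROOT write failures may be worded differently.
WRITE_ERROR_PATTERNS = [
    re.compile(r"TFile::WriteBuffer\s+ERROR"),
    re.compile(r"WriteBuffer failed"),
    re.compile(r"No space left on device"),
]


def find_write_errors(log_lines):
    """Return (line_number, line) pairs matching any write-error pattern."""
    hits = []
    for number, line in enumerate(log_lines, start=1):
        if any(pattern.search(line) for pattern in WRITE_ERROR_PATTERNS):
            hits.append((number, line.rstrip("\n")))
    return hits


def job_should_fail(log_lines):
    """True if the log contains at least one write-error line."""
    return bool(find_write_errors(log_lines))
```

Passing an open log file (or any iterable of lines) to `job_should_fail` returns `True` for the log above, so a wrapper script could exit with a non-zero code in that case. This only detects the symptom in the log; it does not verify that the output file itself is readable.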