Apptainer occasionally fails to start containers
Apptainer occasionally fails to start containers, with a rather cryptic message:
ERROR : Failed to get file information for file descriptor 3: Bad file descriptor
ERROR : Could not write info to setgroups: Permission denied
This disrupts testing because we often need to rebuild (depending on which platform fails).
We should probably:
- Pass --debug to apptainer to get a sample of debug logs when the problem happens (to understand what this file descriptor is pointing to).
- Detect and handle the failure to start the container. (In the second example below, the build of a project fails, and the install and the rest of the projects happily continue.)
- Until the underlying problem is understood, we should probably retry running apptainer. We should consider whether to retry on any error: the exit code is 1 both when apptainer fails to start and when the build (ninja) fails for a genuine reason, so the exit code alone does not tell the two cases apart.
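
A minimal sketch of the retry idea, in Python. Since the exit code is ambiguous, it only retries when stderr contains one of the two error messages quoted above. The names `run_with_retry`, `is_startup_failure` and `with_debug` are hypothetical and not part of the existing nightly-build code. Capturing stderr instead of streaming it is a simplification, and placing `--debug` right after the executable assumes it is accepted as a global apptainer option.

```python
import subprocess
import time

# Signatures copied from the error messages reported in this issue.
STARTUP_FAILURE_SIGNATURES = (
    "Failed to get file information for file descriptor",
    "Could not write info to setgroups",
)


def is_startup_failure(returncode, stderr):
    """Return True if a failed run looks like the container never started."""
    return returncode != 0 and any(
        signature in stderr for signature in STARTUP_FAILURE_SIGNATURES
    )


def with_debug(cmd):
    """Return a copy of cmd with --debug inserted after the executable.

    Assumption: --debug is a global option placed before the subcommand.
    """
    return [cmd[0], "--debug", *cmd[1:]]


def run_with_retry(cmd, max_attempts=3, delay=1.0, runner=subprocess.run):
    """Run cmd, retrying only when stderr matches a known startup failure.

    Genuine failures (e.g. ninja exiting with 1) are returned immediately.
    `runner` is injectable so the logic can be exercised without apptainer.
    """
    result = None
    for attempt in range(1, max_attempts + 1):
        result = runner(
            cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
        )
        if not is_startup_failure(result.returncode, result.stderr):
            return result
        if attempt < max_attempts:
            time.sleep(delay)
    # Every attempt hit the startup failure; hand back the last result.
    return result
```

A caller could wrap the existing command list, for example `run_with_retry(with_debug(apptainer_cmd))`, and abort the remaining steps of the project when the returned `returncode` is non-zero.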
Some related discussions:
- https://github.com/apptainer/apptainer/issues/430
- https://github.com/apptainer/singularity/pull/4953
- https://github.com/apptainer/singularity/issues/5206
Examples
https://jenkins-lhcb-nightlies.web.cern.ch/job/nightly-builds/job/build/337143/consoleFull
2023-09-23 12:15:56,463:DEBUG : running cmake --install LHCb/build --prefix LHCb/InstallArea/x86_64_v2-centos7-clang12-opt
2023-09-23 12:15:56,463:DEBUG : apptainer command: /cvmfs/lhcbdev.cern.ch/nightly-environments/5ced04d9e2e8ad4a44b36e93198a8c2f88c23ebb2313ab50b202d5e241e8cb8d/bin/apptainer exec --contain --bind /cvmfs --bind /home/lblocal/jenkins-build/workspace/nightly-builds/build@2:/workspace --bind /home/lblocal/jenkins-build/workspace/nightly-builds/build@2 --pwd /workspace/build --env PATH=/cvmfs/lhcb.cern.ch/lib/bin/x86_64-centos7:/cvmfs/lhcb.cern.ch/lib/bin/x86_64-centos7:/cvmfs/lhcb.cern.ch/lib/bin/Linux-x86_64:/cvmfs/lhcb.cern.ch/lib/bin:/cvmfs/lhcbdev.cern.ch/nightly-environments/5ced04d9e2e8ad4a44b36e93198a8c2f88c23ebb2313ab50b202d5e241e8cb8d/bin:/cvmfs/lhcbdev.cern.ch/conda/miniconda/linux-64/1622055603/condabin:/usr/sue/bin:/usr/lib64/ccache:/usr/local/bin:/usr/bin:/opt/puppetlabs/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin /cvmfs/lhcb.cern.ch/containers/os-base/centos7-devel/prod/amd64 cmake --install LHCb/build --prefix LHCb/InstallArea/x86_64_v2-centos7-clang12-opt
2023-09-23 12:15:56,620:DEBUG : ERROR : Failed to get file information for file descriptor 3: Bad file descriptor
2023-09-23 12:15:56,621:DEBUG : ERROR : Could not write info to setgroups: Permission denied
2023-09-23 12:15:56,622:DEBUG : command exited with code 1
2023-09-23 12:15:56,622:DEBUG : Completed at: 2023-09-23 12:15:56.622240
https://jenkins-lhcb-nightlies.web.cern.ch/job/nightly-builds/job/build/337186/consoleFull
2023-09-23 19:48:31,458:DEBUG : running cmake --build Detector/build -j 10 -- -k0
2023-09-23 19:48:31,458:DEBUG : apptainer command: /cvmfs/lhcbdev.cern.ch/nightly-environments/5ced04d9e2e8ad4a44b36e93198a8c2f88c23ebb2313ab50b202d5e241e8cb8d/bin/apptainer exec --contain --bind /cvmfs --bind /home/lblocal/jenkins-build/workspace/nightly-builds/build@2:/workspace --bind /home/lblocal/jenkins-build/workspace/nightly-builds/build@2 --pwd /workspace/build --env PATH=/cvmfs/lhcb.cern.ch/lib/bin/x86_64-el9:/cvmfs/lhcb.cern.ch/lib/bin/x86_64-centos7:/cvmfs/lhcb.cern.ch/lib/bin/Linux-x86_64:/cvmfs/lhcb.cern.ch/lib/bin:/cvmfs/lhcbdev.cern.ch/nightly-environments/5ced04d9e2e8ad4a44b36e93198a8c2f88c23ebb2313ab50b202d5e241e8cb8d/bin:/cvmfs/lhcbdev.cern.ch/conda/miniconda/linux-64/1622055603/condabin:/usr/sue/bin:/usr/lib64/ccache:/usr/local/bin:/usr/bin:/opt/puppetlabs/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin /cvmfs/lhcb.cern.ch/containers/os-base/alma9-devel/prod/amd64 cmake --build Detector/build -j 10 -- -k0
2023-09-23 19:48:31,609:DEBUG : ERROR : Failed to get file information for file descriptor 3: Bad file descriptor
2023-09-23 19:48:31,609:DEBUG : ERROR : Could not write info to setgroups: Permission denied
2023-09-23 19:48:31,610:DEBUG : command exited with code 1
2023-09-23 19:48:31,615:DEBUG : running cmake --install Detector/build --prefix Detector/InstallArea/x86_64_v3-el9-gcc12+cuda12_1-opt+g