
correctly pass x509userproxy to condor

Closes ATLASG-2561. Tagging @krumnack. This should target main, but also 24.2, 22.2 (I don't see the tag, please clarify), and possibly also 21.2.

This MR fixes two problems listed in ATLASG-2561 that occur when submitting an EventLoop job without a shared filesystem using the condor driver:

  • failed to run condor_submit
  • failed to read the userproxy from /tmp

The first one occurs because, if the environment variable X509_USER_PROXY is not set, an empty line is written to the submission file. This MR adds a check: if the variable is not set, a warning is printed.
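As an illustration of this check, here is a minimal Python sketch. It is not the code in this MR (the actual change is in the EventLoop package). The helper name `proxy_submit_lines` is hypothetical, and skipping the line entirely when the variable is missing is an assumption about the intended behaviour.

```python
import logging
import os


def proxy_submit_lines(environ=os.environ):
    """Return the submit-file lines for the user proxy (hypothetical helper).

    Assumption: when X509_USER_PROXY is unset or empty, nothing is written,
    so the submit file never contains an empty x509userproxy entry.
    """
    proxy = environ.get("X509_USER_PROXY", "").strip()
    if not proxy:
        logging.warning("X509_USER_PROXY is not set; no user proxy will be passed to condor")
        return []
    return [f"x509userproxy = {proxy}"]
```

Passing the environment as an argument keeps the helper easy to exercise with and without the variable set.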

The second problem is due to an HTCondor behaviour that is not clear to me: although the official documentation does not mention it, condor cannot read from /tmp, where the userproxy is usually created. In fact, the official manual says

x509userproxy = <full-pathname>

Used to override the default path name for X.509 user certificates. The default location for X.509 proxies is the /tmp directory,

but I got an error

error reading from /tmp/x509up_u11547: (errno 2) No such file or directory; STARTER failed to receive file(s) from ...

even though the file exists and is readable. This problem is mentioned in the CERN manual:

If you provide $(Proxy_path) with the default location of your proxy in /tmp/x509up_u$(id -u), please note that that file is not readable for Condor:

In this MR the userproxy is copied into the submission folder.
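The copy step can be sketched in Python along the lines of the commented-out lines in the test patch in this description. This is only an illustration under stated assumptions, not the implementation in this MR. The helper name `stage_proxy` is hypothetical, and restricting the copy to mode 0600 is my addition, since `shutil.copyfile` does not preserve permissions and proxy tools normally expect the file to be private to its owner.

```python
import os
import shutil


def stage_proxy(proxy_path, submit_dir):
    """Copy the user proxy into submit_dir and return the new path (hypothetical helper)."""
    target = os.path.join(os.path.abspath(submit_dir), os.path.basename(proxy_path))
    shutil.copyfile(proxy_path, target)
    # copyfile copies only the contents, so set owner-only permissions explicitly
    os.chmod(target, 0o600)
    return target
```

The returned path is the one that would then go on the `x509userproxy = <full-pathname>` line of the submission file, instead of the original location in /tmp.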

To test

I used PhysicsAnalysis/Algorithms/AnalysisAlgorithmsConfig/share/FullCPAlgorithmsTest_eljob.py with the following patch

diff --git a/PhysicsAnalysis/Algorithms/AnalysisAlgorithmsConfig/share/FullCPAlgorithmsTest_eljob.py b/PhysicsAnalysis/Algorithms/AnalysisAlgorithmsConfig/share/FullCPAl
index 0be305f..f63b2ff 100755
--- a/PhysicsAnalysis/Algorithms/AnalysisAlgorithmsConfig/share/FullCPAlgorithmsTest_eljob.py
+++ b/PhysicsAnalysis/Algorithms/AnalysisAlgorithmsConfig/share/FullCPAlgorithmsTest_eljob.py
@@ -131,11 +131,23 @@ else :
 # way it tests whether the code works correctly with that driver,
 # which is a lot more similar to the way the batch/grid drivers work.
 driver = ROOT.EL.LocalDriver()
+driver = ROOT.EL.CondorDriver()
+job.options().setBool(ROOT.EL.Job.optBatchSharedFileSystem, False)
+job.options().setString(ROOT.EL.Job.optCondorConf, "RequestMemory=8GB")
+#certificate_path = os.path.expandvars('$X509_USER_PROXY') 
+#certificate_newpath = os.path.join(os.getcwd(), os.path.split(certificate_path)[1])
+#import shutil
+#logging.info("copying certificate from %s to %s", certificate_path, certificate_newpath)
+#shutil.copyfile(certificate_path, certificate_newpath)
+driver.shellInit = """
+pwd
+hostname
+ls
+echo "RUNNING THE TEST"
+voms-proxy-info -all
+"""
 
-if options.direct_driver :
-    # this is for testing purposes, as only the direct driver respects
-    # the limit on the number of events.
-    driver = ROOT.EL.DirectDriver()
 
 print ("submitting job now", flush=True)
-driver.submit( job, submitDir )
+driver.submitOnly( job, submitDir )

and with package_filter.txt:

+ PhysicsAnalysis/Algorithms/AsgAnalysisAlgorithms
+ PhysicsAnalysis/Algorithms/AnalysisAlgorithmsConfig
+ PhysicsAnalysis/D3PDTools/EventLoop
- .*

The code still does not work because of another bug (ATLASG-2563).

Edited by Ruggero Turra
