Implement Bulk Grid Submission, but in Go

This extends the idea in !904 by implementing the entire grid submission script with bulk submission, and in Go. Before I explain the reasoning behind this, please note that this merge request should be seen merely as a proposal: I am happy to rewrite everything in Python instead, as originally proposed, if there is clear opposition to this. I can also do so after we have merged this, should it become clear that it introduced more problems than advantages. I did think, however, that such a proposal would be easier to discuss with a concrete example already at hand.

The main reasons for trying to introduce Go here are:

  1. Error handling: As someone who has had to debug the grid submission script several times before, I can attest that it's not fun and that concise, human-readable error messages are much preferable to cryptic or silent bash errors. Python does this a bit better, of course, but stack traces are also not ideal and still force the user to debug the script instead of immediately identifying the source of the error. For example, see what happens when an incorrect dataset ID is passed to the grid-json-generate Python script from !904:
/home/jona/studies/phd/qt/training-dataset-dumper/./grid-json-generate:30: SyntaxWarning: invalid escape sequence '\.'
  dsid_re = re.compile('[^\.]*\.([0-9]{6,8})\..*')
Traceback (most recent call last):
  File "/home/jona/studies/phd/qt/training-dataset-dumper/./grid-json-generate", line 68, in <module>
    run()
  File "/home/jona/studies/phd/qt/training-dataset-dumper/./grid-json-generate", line 58, in run
    out_list = [get_dict(ds, **kwargs) for ds in args.datasets]
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jona/studies/phd/qt/training-dataset-dumper/./grid-json-generate", line 31, in get_dict
    dsid = dsid_re.match(ds).group(1)
           ^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'group'

The source of the error (the malformed dataset ID) can only be identified by looking at line 31, reading through the script to understand what line 31 does in context, and checking the docs for re.Pattern.match to see that match returns None when it fails to match the pattern. This can be better or worse depending on the specific error. Compare this to the output of the Go script in the equivalent case:

error: failed to build the output dataset ID from input dataset ID `mc20_13TeV.601589a.PhPy8EG_A14_ttbar_hdamp258p75_nonallhadron.deriv.DAOD_FTAG1.8547_s3797_r13144_p6698`

Caused by:
  0: failed to extract run period ID and atlas tags from input dataset ID `mc20_13TeV.601589a.PhPy8EG_A14_ttbar_hdamp258p75_nonallhadron.deriv.DAOD_FTAG1.8547_s3797_r13144_p6698`
  1: expected dataset ID `mc20_13TeV.601589a.PhPy8EG_A14_ttbar_hdamp258p75_nonallhadron.deriv.DAOD_FTAG1.8547_s3797_r13144_p6698` to match either `[^\.]+\.([0-9]{6,8})\..+\.((_?[a-z][0-9]{3,6}){3,})` or `[^\.]+\.periodAllYear\..+\.(grp[0-9]{2}_v[0-9]{2}_p[0-9]{4})`

exit status 1

This is not a cherry-picked example, by the way; I can confidently say that essentially all errors caused by wrong user input or a wrong environment are covered by human-readable error messages in the Go script. This is because Go explicitly returns an error from every function that can fail and makes it easy to wrap that error with human-readable context before it is passed up the call chain, which makes it very hard to 'accidentally' ignore an error. A minimal sketch of this pattern follows below.
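
To make the wrapping pattern concrete, here is a minimal, self-contained sketch of how output like the above can be produced. This is not the actual script code; the wrapped type, wrapf, extractTags and buildOutputDatasetID are illustrative names. Each layer attaches its own human-readable message, and the top level walks the chain with errors.Unwrap to print the numbered Caused by: list:

package main

import (
    "errors"
    "fmt"
    "os"
)

// wrapped keeps the message of each layer separate from its cause, so that
// every layer can later be printed on its own line.
type wrapped struct {
    msg   string
    cause error
}

func (w *wrapped) Error() string { return w.msg }
func (w *wrapped) Unwrap() error { return w.cause }

// wrapf attaches a human-readable message to err without repeating the text
// of the underlying cause.
func wrapf(err error, format string, args ...any) error {
    return &wrapped{msg: fmt.Sprintf(format, args...), cause: err}
}

// extractTags stands in for the real pattern matching and always fails here.
func extractTags(ds string) (string, error) {
    err := fmt.Errorf("expected dataset ID `%s` to match one of the known patterns", ds)
    return "", wrapf(err, "failed to extract run period ID and atlas tags from input dataset ID `%s`", ds)
}

func buildOutputDatasetID(ds string) (string, error) {
    tags, err := extractTags(ds)
    if err != nil {
        return "", wrapf(err, "failed to build the output dataset ID from input dataset ID `%s`", ds)
    }
    return tags, nil
}

func main() {
    if _, err := buildOutputDatasetID("mc20_13TeV.000000.not_a_real_dataset"); err != nil {
        // Print the outermost message, then every wrapped cause on its own line.
        fmt.Fprintf(os.Stderr, "error: %v\n\nCaused by:\n", err)
        for i, cause := 0, errors.Unwrap(err); cause != nil; i, cause = i+1, errors.Unwrap(cause) {
            fmt.Fprintf(os.Stderr, "  %d: %v\n", i, cause)
        }
        os.Exit(1)
    }
}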

  2. Testing out a modern language within ATLAS: ATLAS tools rely almost exclusively on Python, C++ and bash, while even gigantic, historically grown projects like the Linux kernel have started to adopt more modern languages. While that cannot easily be changed for the large frameworks, there is no reason to make our lives unnecessarily hard for smaller, newer or self-contained tools, if only to test whether such a change would be worthwhile.

  3. All the niceties that come with Go (and a lot of other modern languages), like speed, static typing with little overhead, good tooling, etc.

To anticipate some counter-arguments in advance:

  1. Isn't Go a compiled language and not a scripting language? Yes, Go is a compiled language, but it does support scripting through go run, which compiles Go files and runs them in one step without noticeable overhead (compilation of small files is very fast, possibly faster than Python startup times). Go files also support shebangs that use go run.

  2. Is Go even supported by the ATLAS environment? Yes, either as version 1.12.5 (which is a bit old for my taste) in /cvmfs/sft.cern.ch/lcg/releases/go/1.12.5-ff8e6/ or as version 1.18 as part of gcc under /cvmfs/sft.cern.ch/lcg/releases/gcc/13.1.0-b3d18/x86_64-el9/bin/go. As far as I could tell there is no lsetup go, so for now I have hardcoded /cvmfs/sft.cern.ch/lcg/releases/gcc/13.1.0-b3d18/x86_64-el9/bin/go in the shebang, together with some environment variables that need to be set for the gcc version. If we merge this, we might want to investigate this a bit further with people who know more about it than I do. Note, though, that I have already successfully submitted test jobs with the Go script in an Athena container, so everything works without problems.

  3. The main advantage of Go seems to be error handling, couldn't we do the same in Python? Technically yes, but in practice nobody does, and for good reasons. Say, for example, we want to handle a matching error for a regular expression dsid_re, i.e. dsid_re.match(ds). How would we handle this properly? In the following way:

try:
    matches = dsid_re.match(ds)
except re.error as e:
    raise Exception(f"failed to match {ds} with {dsid_re}: {e}")

Let's ignore proper error types and formatting for now and focus on the real issues. First of all, this is a bit less readable than the typical Go variant [1] (especially if the function can raise several different errors, in which case the Python version above becomes much more verbose, unlike the Go one):

matches, err := dsid_re.FindAllStringSubmatch(ds)
if err != nil {
    return fmt.Errorf("failed to match %s with %s: %w", ds, dsid_re, err)
}

but second, and more importantly, it requires us to know which errors the match function can raise in the first place. We don't want to catch just any exception, but we also don't want to miss an error. If, as is often the case, the raised error is not documented, there is basically nothing we can do to handle it properly. Overall, error handling is simply much better standardized in Go, whereas in Python it is merely treated as an inconvenience that had to be supported somehow. The sketch below shows what that standardization buys in practice.
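
Wrapping with %w and inspection with errors.Is (or errors.As) are part of the standard library, so a caller can handle exactly the conditions it cares about without knowing every concrete error a callee might produce, and pass everything else up with added context. A sketch, where ErrNoMatch, parseDatasetID and dsid are illustrative names rather than actual script code:

// errors.Is inspects the entire wrapped chain, so the caller only needs to
// know the sentinel values it explicitly wants to handle (ErrNoMatch and
// parseDatasetID are illustrative, not part of the script).
dsid, err := parseDatasetID(ds)
if errors.Is(err, ErrNoMatch) {
    // handle the one condition we care about explicitly ...
    return "", fmt.Errorf("`%s` is neither a valid MC nor a valid data dataset ID", ds)
} else if err != nil {
    // ... and pass everything else up with added context
    return "", fmt.Errorf("failed to parse dataset ID `%s`: %w", ds, err)
}
// from here on dsid can be used safely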

  4. Most physicists don't know Go, how are we going to maintain this? Go is close to C/C++, garbage collected and generally considered a very simple language with a small spec and only around 25 keywords (compared to 35 in Python and 63 in C++) that purposefully avoids unnecessary new features. It is genuinely easy to learn and understand. I personally learned a little bit of Go basics more than two years ago and then never touched it again until writing this script. To be completely fair, I will also point out that most physicists do not write involved bash scripts on a daily basis either, do not know what an associative array is, and cannot easily read stuff like
set -eu
IFS=',' read -r -a EXT_FILES_ARR <<< "${EXT_FILES}"
for key in "${INPUT_DATASETS[@]}"; do
    value="${FILES_TO_SUBMIT[$key]--1}"
    printf "%s:%s\n" "$key" "$value"
done | xargs -P 10 -I {} bash -c 'submit-job ${1%:*} ${1#*:}' bash {}
  5. Executing commands is really easy and has no overhead in bash, wouldn't it be better to keep parts of the script in bash? I would argue that it is easier to stick to a single language, and that executing external commands in Go is easy, but not too easy, so that we can still do some robust error handling. If you want the same simplicity as in bash, it suffices to define a small helper function for commands with UTF-8 output, which also incorporates stderr and stdout into the error:
// execute runs an external command and returns its whitespace-trimmed stdout;
// on failure, the returned error contains the full command line as well as the
// command's stderr and stdout.
func execute(name string, args ...string) (string, error) {
    command := exec.Command(name, args...)

    var stderr bytes.Buffer
    command.Stderr = &stderr

    outputData, err := command.Output()
    output := strings.TrimSpace(string(outputData))
    if err != nil {
        return output, fmt.Errorf(
            "failed to run `%s`|%w|stderr:\n%s\n\nstdout:\n%s",
            command.String(),
            err,
            stderr.String(),
            output,
        )
    }

    return output, nil
}

and then just e.g.

paths, err := execute("ls", "dir1", "dir2")
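
Since execute returns the error alongside the output, the caller can attach its own context in the usual way before passing it up (the message here is purely illustrative):

if err != nil {
    return fmt.Errorf("failed to list input directories: %w", err)
}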
  6. Will this break any existing usage pattern of the grid submission script? No, the script is still executed with source FTagDumper/grid/setup.sh && source build/x*/setup.sh && grid-submit <options> MODE, where <options> and MODE are the same as before.

Additional Remarks:

  1. The -n flag is not supported yet, since I wasn't sure how to implement it. -n gives one global event count, which may lead to a different number of files for each dataset in order to minimally reach that event count. As far as I can see, though, the --nFiles argument of prun applies globally to the entire grid submission, and there is no way (that I can see) to set the number of files individually per dataset. I have already implemented some rucio wrappers that can be used to determine the file count from the event count (currently commented out); a sketch of how such a count could be derived is shown after this list.
  2. --no-meta is not supported yet, since this would also need to be set per dataset ID.
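
As a reference for the first remark: once per-file event counts are available (e.g. from the rucio wrappers), the number of files needed to minimally reach a requested event count could be derived with something along these lines; the name and signature are illustrative and not part of the script yet:

// filesForEvents returns the minimal number of files whose summed event
// counts reach at least target, together with a flag indicating whether the
// dataset contains enough events at all.
func filesForEvents(eventsPerFile []int64, target int64) (int, bool) {
    var total int64
    for i, n := range eventsPerFile {
        total += n
        if total >= target {
            return i + 1, true
        }
    }
    // the dataset does not contain enough events to reach the requested count
    return len(eventsPerFile), false
}

The dataset's file list, in whatever order it would be submitted, could then simply be truncated to the returned count.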

Review checklist:

  • CI Passing
  • Comments addressed
  • Source branch is up to date with target
[1] This particular example works a bit differently in Go; I changed it to give a better idea of how this usually works.
