MC Job Options issues
https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/issues
2021-05-18T11:32:21+02:00

Follow-up from "LO EFT samples for 4top"
https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/issues/142
2021-05-18T11:32:21+02:00, Spyros Argyropoulos

The following discussions from !1161 should be addressed:
> This shouldn't be part of a JobOption. The first part was fixed properly in 21.6.60 and the second part is obviously gonna cause problems. `ATHENA_PROC_NUMBER` is set to 8 because the machine has 8 cores, it shouldn't be set to 80 in the JOs.

Should we add the following checks/changes:
- if ATHENA_PROC_NUMBER > 1 and release < 21.6.60 => ERROR
- if ATHENA_PROC_NUMBER > 1 => run only 1 event in CI
- change the way we check whether the jO changes ATHENA_PROC_NUMBER. This would only be safe to catch in the transform, but until it is implemented there we could change the check to forbid any use of ATHENA_PROC_NUMBER (not even printing it): e.g. look in the jO, and if there is an uncommented line containing "ATHENA_PROC_NUMBER", give an error
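
The last bullet could be sketched roughly as below. This is only an illustration, not the existing CI check: the helper name `find_proc_number_lines` is hypothetical, and the comment stripping is deliberately crude (a `#` inside a string literal would also truncate the line).

```python
# Hypothetical helper: name, signature and return shape are assumptions.
def find_proc_number_lines(jo_text, token="ATHENA_PROC_NUMBER"):
    """Return (line_number, line) pairs where `token` appears outside a comment."""
    hits = []
    for number, line in enumerate(jo_text.splitlines(), start=1):
        # Crude comment stripping: drop everything after the first '#'.
        code = line.split("#", 1)[0]
        if token in code:
            hits.append((number, line.strip()))
    return hits


# Made-up jO content, used only to exercise the helper.
example_jo = """\
import os
# os.environ['ATHENA_PROC_NUMBER'] = '80'
os.environ['ATHENA_PROC_NUMBER'] = '80'
print(os.environ.get('ATHENA_PROC_NUMBER'))
"""

for number, line in find_proc_number_lines(example_jo):
    print(f"ERROR: line {number} uses ATHENA_PROC_NUMBER: {line}")
```

Note that this strict variant also flags read-only uses (such as the `print` line), which is what the bullet asks for until the check lives in the transform.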
@cgutscho
S1.2021, Spyros Argyropoulos

Check for multiple instances of TestHepMC (and TestLHE?)
https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/issues/138
2021-04-01T14:20:41+02:00, Christian Gutschow

In general, the transform will create an instance of TestHepMC (and in the future also TestLHE) and run some checks as part of the job. For some setups the default thresholds used in these packages may be too strict and occasionally we get JOs that try to loosen them a bit, which is usually fine.
We recently had a case (!1066) where a fresh instance of TestHepMC was created, and the thresholds were tweaked on the new instance but not on the one that the transform had already created, which was then causing issues down the line.
Could we catch this sort of thing in the CI? I imagine it would just be a case of checking for a line like
```
genSeq += TestHepMC()
```
and throwing an error?
S1.2021, Spyros Argyropoulos

Sanity check for EVNT-to-EVNT transforms
https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/issues/135
2021-06-17T11:07:17+02:00, Christian Gutschow

Hi,
here's an [example JO](https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/blob/master/950xxx/950096/mc.Sh_2210_Zee_E2Etransform_valid.py) for an EVNT-to-EVNT transform.
This basically clones an input EVNT, but only copies the event if it passes some Athena filter, hence most of the logic being protected by the `if runArgs.trfSubstepName == 'afterburn':` statement.
Now, because it copies the original EVNT, the new EVNT would have the MC channel number (or run number in the HepMC GenEvent) set to the original DSID and not the new DSID (of the E2E transform JO).
This can now be patched using the `postSeq.CountHepMC.CorrectRunNumber = True` flag seen at the bottom. Could we use the CI to catch cases where such a JO is being added, but that tag is missing from the JO?
(In principle, there is a printout in the `log.afterburn` produced by an E2E transform which one could grep for, but the CI doesn't handle jobs without input EVNT files yet.)
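
A static check on the JO text might look like the sketch below. It assumes an EVNT-to-EVNT JO can be recognised by the `trfSubstepName == 'afterburn'` test mentioned above; the function name `missing_run_number_fix` and both regular expressions are hypothetical, not part of the existing CI.

```python
import re

# Assumption: an E2E-transform JO is identified by its 'afterburn' substep test.
AFTERBURN = re.compile(r"trfSubstepName\s*==\s*['\"]afterburn['\"]")
# The flag must start a line, so a commented-out copy does not count.
CORRECT_RUN = re.compile(
    r"^\s*postSeq\.CountHepMC\.CorrectRunNumber\s*=\s*True\b", re.MULTILINE
)


def missing_run_number_fix(jo_text):
    """True if the JO looks like an E2E transform but does not set the flag."""
    return bool(AFTERBURN.search(jo_text)) and not CORRECT_RUN.search(jo_text)


# Made-up JO snippets, used only to exercise the check.
good_jo = (
    "if runArgs.trfSubstepName == 'afterburn':\n"
    "    pass\n"
    "postSeq.CountHepMC.CorrectRunNumber = True\n"
)
bad_jo = (
    "if runArgs.trfSubstepName == 'afterburn':\n"
    "    pass\n"
    "# postSeq.CountHepMC.CorrectRunNumber = True\n"
)

for name, text in (("good_jo", good_jo), ("bad_jo", bad_jo)):
    print(name, "missing flag:", missing_run_number_fix(text))
```

This only inspects the JO itself, so it would not need an input EVNT file in the CI.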
Thoughts/ideas?
S1.2021, Spyros Argyropoulos

JO shouldn't hardcode ATHENA_PROC_NUMBER
https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/issues/125
2020-11-14T13:51:07+01:00, Christian Gutschow

The environment variable for multi-threading `ATHENA_PROC_NUMBER` should be set by prodsys, not the JOs.
Can we make the CI fail if the JOs try to assign a value to that? (The JOs are free to ask whether this environment variable exists and what its value is, e.g. to pass it into Madgraph, but they shouldn't try to overwrite its value.)
See e.g. MR !745, where this had to be corrected, and [this JO](https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/blob/master/421xxx/421006/mc.MGPy8EG_A14NNPDF23_tWgamma_art.py), where it's used in an acceptable way.
S2.2020, Spyros Argyropoulos

Allow runArgs to be referred to in JOs but not to be overwritten by JOs
https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/issues/120
2020-08-22T13:08:09+02:00, Christian Gutschow

See !631 for an example.
S2.2020, Spyros Argyropoulos, 2020-08-14

Add checks for `inputfilecheck` and `inputGeneratorFile`
https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/issues/118
2020-08-03T10:25:32+02:00, Christian Gutschow

Please see this test commit: 52aa8087
which has the following two lines in the JO:
```
evgenConfig.inputfilecheck = 'PhPy8EG_NNPDF30LO_EWK_ZZeeee'
runArgs.inputGeneratorFile = 'PhPy8EG_NNPDF30LO_EWK_ZZeeee._00052.events.tar.gz'
```
The first one I thought the CI would already be catching [along with `inputconfcheck`, no?] and the second one is clearly a problem for central production.
Can we catch these? I guess the logParser should already throw an error before the files are even committed to gitlab.
S2.2020, Spyros Argyropoulos

Check number of files in gridpack
https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/issues/117
2023-03-01T07:42:53+01:00, Christian Gutschow

The number of files in a gridpack shouldn't exceed 80k, otherwise some grid sites will crash. This has happened a number of times recently, e.g. for the FxFx job where the gridpack contained several files per Feynman diagram. MadGraph control cleans up logs and .o files in the latest release, but for older releases it would be good to have a dedicated pipeline step that throws an error if the number of files in the gridpack is larger than 80k. Probably something like `tar -ztvf *.tgz *.tar.gz` could work?
S2.2020, Spyros Argyropoulos

Improve `check_modified_files` behaviour
https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/issues/113
2020-08-04T10:02:29+02:00, Spyros Argyropoulos

Do a local rebase before checking what changed to avoid failed pipelines for commits that are behind master.
S2.2020, Spyros Argyropoulos

CI/logParser addition: maximum value for inputFilesPerJob
https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/issues/112
2020-05-19T10:36:36+02:00, Christian Gutschow

The maximum number of input LHE/EVNT files is `inputFilesPerJob=1000`.
Could this be added to the CI/logParser (whichever is best)?
Spyros Argyropoulos

Add checks for input files
https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/issues/102
2020-07-30T07:37:56+02:00, Spyros Argyropoulos

Add checks:
* [ ] no `evgenConfig.inputfilecheck`
* [ ] no `evgenConfig.inputconfcheck` allowed
both are always in the top JO
Also
* [ ] Restructure checks so that everything related to reading the jO is done in one place and everything related to reading the log is done in `logParser`
S2.2020

Long compilation time when running MadGraph in atlas/slc6-atlasos causing CI timeouts
https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/issues/101
2021-04-22T16:46:58+02:00, Jason Robert Veatch

## How to reproduce the problem
```
# Mount cvmfs
sudo mkdir -p /cvmfs/atlas.cern.ch
sudo mkdir -p /cvmfs/atlas-condb.cern.ch
sudo mkdir -p /cvmfs/grid.cern.ch
sudo mkdir -p /cvmfs/sft.cern.ch
sudo mount -t cvmfs atlas.cern.ch /cvmfs/atlas.cern.ch
sudo mount -t cvmfs atlas-condb.cern.ch /cvmfs/atlas-condb.cern.ch
sudo mount -t cvmfs grid.cern.ch /cvmfs/grid.cern.ch
sudo mount -t cvmfs sft.cern.ch /cvmfs/sft.cern.ch
# Get the docker image
docker pull atlas/slc6-atlasos
# Run image in a container and mount cvmfs
docker run -it -v /cvmfs:/cvmfs b4cfa1203c45
# Inside the docker container get the mcjoboptions repo (or alternatively you can copy it from your local area with docker cp)
kinit USER@CERN.CH
git clone https://:@gitlab.cern.ch:8443/atlas-physics/pmg/mcjoboptions.git
cd mcjoboptions
git checkout dsid_jveatch_500538
export ATLAS_LOCAL_ROOT_BASE=/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase
./scripts/run_athena.sh
```
## Debugging
#### Bottleneck: compilation time
Comparing the running times at several execution points on lxplus and in the container, the problem seems to lie in the compilation time:
```
Docker (running on a dual-core laptop with cvmfs mounted via fuse):
generate 19:24:54 INFO: Using LHAPDF v6.2.3 interface for PDFs
generate 19:26:19 INFO: Compiling source…
generate 19:31:53 INFO: ...done, continuing with P* directories => 334 sec
generate 19:31:53 INFO: Compiling StdHEP (can take a couple of minutes) ...
generate 19:45:23 INFO: …done. => 810 sec
generate 19:45:24 INFO: Compiling on 1 cores
generate 19:45:24 INFO: Compiling P0_gg_ttx...
generate 19:54:37 INFO: P0_gg_ttx done. => 553 sec
vs lxplus (interactive run)
10:15:08 INFO: Using LHAPDF v6.2.3 interface for PDFs
10:15:14 INFO: Compiling source...
10:15:26 INFO: ...done, continuing with P* directories => 12 sec
10:15:26 INFO: Compiling StdHEP (can take a couple of minutes) ...
10:16:04 INFO: …done. => 38 sec
10:16:05 INFO: Compiling on 1 cores
10:16:05 INFO: Compiling P0_gg_ttx...
10:16:45 INFO: P0_gg_ttx done. => 40 sec
```
#### Size/memory
The container has 53 GB of available space, while its size at the point where the compilation becomes slow is only ~230 MB => **disk size does not seem to be causing the slowdown**
The available memory was changed from 1GB to 8GB without any effect on the compilation time in the container.
#### Reading from cvmfs
I ran a script that 1) reads all the lines from a file that lives on cvmfs and 2) copies this script to a local directory and removes it.
The local run on my laptop (with cvmfs mounted with fuse) gives this:
```
Reading 500 times
real 0m21.504s
user 0m12.937s
sys 0m8.429s
Copying 500 times
real 0m4.993s
user 0m0.620s
sys 0m2.440s
```
Running the script from the container, where the locally available cvmfs directory (see above) is mounted to the container as a volume, gives this:
```
Reading 500 times
real 1m44.217s
user 0m18.329s
sys 0m20.376s
Copying 500 times
real 0m3.716s
user 0m0.570s
sys 0m0.981s
```
**So reading a file seems to be 5x slower when running from the docker container**
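
The script used for these timings is not included in the issue. A rough Python equivalent, for anyone wanting to repeat the comparison, might look like the sketch below; the function names are made up, and the demo runs on a throwaway local file rather than a real cvmfs path.

```python
import os
import shutil
import tempfile
import time


def time_reads(path, repeats=500):
    """Seconds spent reading every line of `path`, `repeats` times."""
    start = time.perf_counter()
    for _ in range(repeats):
        with open(path, "rb") as handle:
            for _line in handle:
                pass
    return time.perf_counter() - start


def time_copies(path, repeats=500):
    """Seconds spent copying `path` to a scratch directory and deleting the copy."""
    start = time.perf_counter()
    with tempfile.TemporaryDirectory() as scratch:
        target = os.path.join(scratch, os.path.basename(path))
        for _ in range(repeats):
            shutil.copy(path, target)
            os.remove(target)
    return time.perf_counter() - start


# Demo on a local temporary file; point `demo_path` at a file under /cvmfs instead
# to compare a laptop against a container.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as demo:
    demo.write("line\n" * 1000)
demo_path = demo.name

read_seconds = time_reads(demo_path, repeats=50)
copy_seconds = time_copies(demo_path, repeats=50)
os.remove(demo_path)
print(f"Reading 50 times: {read_seconds:.3f}s, copying 50 times: {copy_seconds:.3f}s")
```

Repeated reads of the same file may be served from a cache after the first pass, so the numbers are only meaningful as a relative comparison between environments.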
#### Next steps
* [ ] To debug further we would need to know exactly how cvmfs is mounted in the gitlab runner
* [ ] Also need to check whether there is any correlation between slow reading times on cvmfs and MG: does MG call the compilers from cvmfs or read any other info from cvmfs? Probably
---
Original report from Jason - similar issues observed with other processes which are apparently very different than this one (an NLO one and a LO one with a long decay chain)
Job [#7937441](https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/jobs/7937441) failed for 9a6a4445a5bcf7ae08ac81888cccd79ef4cc4af3:
Dear experts,
The run_athena job for my branch times out. I have been trying to debug this from my side, but I am at a loss about how to proceed. The estimated execution time from each log.generate.short is ~0.1 hours, so I wouldn't expect this to be an issue. Could you please advise?
Thanks in advance,
Jason
Future, Spyros Argyropoulos

Make -m obligatory in commit script
https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/issues/100
2020-04-28T18:32:35+02:00, Spyros Argyropoulos

* [x] Remove current parsing logic
* [x] Check that skipping athena,logParser works as before
S1.2020, Spyros Argyropoulos

Handling of jO files processed in non-Unix filesystems
https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/issues/97
2020-04-17T13:57:23+02:00, Spyros Argyropoulos

From @avroy
> calculating [nEvents](https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/blob/master/scripts/run_athena.sh#L70) failed with the following errors:
```
(standard_in) 1: illegal character: ^M
(standard_in) 1: illegal character: ^M
```
> Naively, this is due to carriage returns, which are not processed uniformly across operating systems
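
As an illustration of what the `^M` characters are, the sketch below detects and strips Windows-style line endings in Python. It is only a sketch of the idea (similar in spirit to `dos2unix`), not something the CI currently does, and the function names are hypothetical.

```python
def has_carriage_returns(data: bytes) -> bool:
    """True if the file content contains any carriage return (shown as ^M)."""
    return b"\r" in data


def strip_carriage_returns(data: bytes) -> bytes:
    """Convert CRLF line endings to plain LF."""
    return data.replace(b"\r\n", b"\n")


# Made-up jO content with Windows line endings.
windows_jo = b"evgenConfig.nEventsPerJob = 1000\r\nevgenConfig.keywords = ['top']\r\n"

print(has_carriage_returns(windows_jo))
print(strip_carriage_returns(windows_jo))
```

Working on bytes avoids Python's own newline translation, which would otherwise hide the carriage returns when a file is opened in text mode.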
## Todo
See how to handle this:
* during commit script? : probably not ideal since not everyone uses it
* doing a `dos2unix` in the CI? : might require special image - need to see if `dos2unix` is available in the images we use
* doing a `sed 's/^M//g'` as described [here](https://stackoverflow.com/questions/2658931/why-error-illegal-character-m?answertab=votes#tab-top) in all CI jobs? : @avroy can you test whether this works for you?
S1.2020, Spyros Argyropoulos

Add check to see if gridpack was used and if the grid pack is provided
https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/issues/95
2020-08-04T13:43:36+02:00, Spyros Argyropoulos

I was wondering how to catch such cases and avoid having pipeline jobs running for 1h and failing without apparent reason. We would need an indicator in log.generate that a gridpack was used.
I don't see PowhegConfig.gridpack printed in the log that Olga provided. I see
```
16:47:17 Py:PowhegControl INFO | powheginput keyword use-old-grid set to 1.0000000000000000
```
Does this tell us whether a gridpack was used?
Comment by @fsiegert
> Hi @sargyrop,
> I think there are things which we'll never be able to catch if requesters modify the DSID directory before submitting but after having run the evgen test. This is not only relevant for gridpacks, but also potentially removing include files etc. So I wouldn't put too much effort into catching these cases if it's not easy.
> We just need to educate users that they:
> * run the evgen test in a clean working directory
> * should not modify the DSID directory before submission
>
> Best,
> Frank
I think this is a pretty straightforward check: if ((gridpack used) && ! (gridpack present)) then ERROR. So I am only asking how to determine (gridpack used).
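
A minimal sketch of that check, using the MadGraph log message quoted under "Solution for Madgraph" in this issue as the indicator for (gridpack used). The file pattern `*.GRID.tar.gz` and all function names are assumptions made for illustration only; other generators would need their own log marker.

```python
import fnmatch

# Log message taken from the MadGraph example quoted in this issue.
GRIDPACK_LOG_MARKER = "Generating events from gridpack"


def gridpack_used(log_text):
    """(gridpack used): the generation log mentions running from a gridpack."""
    return GRIDPACK_LOG_MARKER in log_text


def gridpack_present(filenames, pattern="*.GRID.tar.gz"):
    """(gridpack present): some file in the DSID directory matches `pattern`.

    The default pattern is an assumption, not a documented convention.
    """
    return any(fnmatch.fnmatch(name, pattern) for name in filenames)


def check_gridpack(log_text, filenames):
    """Return an error message, or None if the check passes."""
    if gridpack_used(log_text) and not gridpack_present(filenames):
        return "ERROR: log reports a gridpack but none was found in the DSID directory"
    return None


example_log = "06:17:07 Py:MadGraphUtils INFO Generating events from gridpack\n"
print(check_gridpack(example_log, ["mc.MG_example.py"]))
```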
Comment by @amoroso :
> Hi @fsiegert, @sargyrop,
I wonder if we couldn't catch case 2 within the CI. We could add a checksum of the DSID directory to the Gen_tf output, and have a pipeline check that the checksum in the attached logfile and the one recomputed by the CI are the same.
cheers, Simone
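
The checksum idea could be sketched as follows. This is purely illustrative: the choice of SHA-256, the function name `directory_checksum`, and hashing relative paths together with file contents are all assumptions, and nothing here is part of Gen_tf.

```python
import hashlib
import os
import tempfile


def directory_checksum(root):
    """SHA-256 over the relative paths and contents of all files under `root`."""
    digest = hashlib.sha256()
    for current, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order reproducible
        for name in sorted(filenames):
            path = os.path.join(current, name)
            digest.update(os.path.relpath(path, root).encode("utf-8"))
            digest.update(b"\0")
            with open(path, "rb") as handle:
                for chunk in iter(lambda: handle.read(65536), b""):
                    digest.update(chunk)
    return digest.hexdigest()


# Demo: the checksum changes when a file is edited after the first computation.
with tempfile.TemporaryDirectory() as dsid_dir:
    jo_path = os.path.join(dsid_dir, "mc.example.py")
    with open(jo_path, "w") as handle:
        handle.write("evgenConfig.nEventsPerJob = 1000\n")
    before = directory_checksum(dsid_dir)
    with open(jo_path, "a") as handle:
        handle.write("# edited after the evgen test\n")
    after = directory_checksum(dsid_dir)

print("directory modified:", before != after)
```

Symbolic links to files are followed when reading, so a dangling link would raise an error; a real check would need to decide how to treat links to external files.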
## Solution for Madgraph
Gridpack usage can be identified by log lines like:
```
06:17:07 Py:MadGraphUtils INFO Generating events from gridpack
```
S2.2020, Spyros Argyropoulos

CI addition to JO name check: no minus signs allowed
https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/issues/93
2020-03-26T11:10:45+01:00, Christian Gutschow

It looks like the production system doesn't allow "-" in the JO name, can we get the CI to check this?
S1.2020, Spyros Argyropoulos

Allow jO files to be links in whitelist
https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/issues/87
2020-03-09T15:22:42+01:00, Spyros Argyropoulos

As needed in !265
`mc.*.py` should be allowed as a link too
S1.2020, Spyros Argyropoulos

Check that external files are world-readable
https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/issues/86
2020-03-12T17:18:40+01:00, Christian Gutschow

Can we implement a check that sym-linked files are world-readable with something like
```
"$(find "$filename" -perm -004)"
```
in case the cvmfs sync script cannot easily be patched? Not clear to me whether this is better done in the CI or as part of the commit script. If the latter is possible, perhaps that would be a good point to flag this up, but if people sneakily try to bypass the commit script, perhaps we should also check it in the CI?
S1.2020, Spyros Argyropoulos

Remove 95*xxx from check_modified_files
https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/issues/77
2020-02-23T10:34:18+01:00, Spyros Argyropoulos

Remove 95*xxx from check_modified_files
S1.2020, Spyros Argyropoulos

CI: how to check for modification of files
https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/issues/72
2020-02-04T11:55:18+01:00, Spyros Argyropoulos

Should we restrict `check_modified_file.sh` to only look for changes in `common` and DSID directories?
## Pros
* easier to use for developers (they don't have to commit with `[skip modfiles]`)
## Cons
* More unsafe (e.g. commit of jO with high priority comes in with checks in scripts that have been commented out due to failures, gets merged in master and as a result the checks are disabled for everyone)
S1.2020, Spyros Argyropoulos

Running pipelines in custom swarm runner
https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/issues/65
2021-06-17T13:04:50+02:00, Spyros Argyropoulos

Instructions from Lukas here: https://clouddocs.web.cern.ch/containers/tutorials/swarmgitlab.html
Idea would be that if we set this up we could run with the full number of events exactly as in production.
The instructions below work. We have to understand whether this is what we want:
* [ ] does it have access to cvmfs? If not how would we set it up so that it has?
* [ ] does it buy us anything compared to using the shared runners?
* [ ] how tough would the maintenance be?
* [ ] is it better to just set up a dedicated machine? Maybe we should ask someone from the CERN IT to do it?
Future