# MC Job Options issues
https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/issues

# Issue #101: Long compilation time when running MadGraph in atlas/slc6-atlasos causing CI timeouts
https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/issues/101 · reported by Jason Robert Veatch · updated 2021-04-22

## How to reproduce the problem
```
# Mount cvmfs
sudo mkdir -p /cvmfs/atlas.cern.ch
sudo mkdir -p /cvmfs/atlas-condb.cern.ch
sudo mkdir -p /cvmfs/grid.cern.ch
sudo mkdir -p /cvmfs/sft.cern.ch
sudo mount -t cvmfs atlas.cern.ch /cvmfs/atlas.cern.ch
sudo mount -t cvmfs atlas-condb.cern.ch /cvmfs/atlas-condb.cern.ch
sudo mount -t cvmfs grid.cern.ch /cvmfs/grid.cern.ch
sudo mount -t cvmfs sft.cern.ch /cvmfs/sft.cern.ch
# Get the docker image
docker pull atlas/slc6-atlasos
# Run the image in a container and mount cvmfs (b4cfa1203c45 is the local image ID; check yours with `docker images`)
docker run -it -v /cvmfs:/cvmfs b4cfa1203c45
# Inside the docker container get the mcjoboptions repo (or alternatively you can copy it from your local area with docker cp)
kinit USER@CERN.CH
git clone https://:@gitlab.cern.ch:8443/atlas-physics/pmg/mcjoboptions.git
cd mcjoboptions
git checkout dsid_jveatch_500538
export ATLAS_LOCAL_ROOT_BASE=/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase
./scripts/run_athena.sh
```
## Debugging
#### Bottleneck: compilation time
Comparing the running times at several execution points on lxplus and in the container, the problem seems to lie in the compilation times:
```
Docker (running on a dual-core laptop with cvmfs mounted via fuse):
generate 19:24:54 INFO: Using LHAPDF v6.2.3 interface for PDFs
generate 19:26:19 INFO: Compiling source…
generate 19:31:53 INFO: ...done, continuing with P* directories => 334 sec
generate 19:31:53 INFO: Compiling StdHEP (can take a couple of minutes) ...
generate 19:45:23 INFO: …done. => 810 sec
generate 19:45:24 INFO: Compiling on 1 cores
generate 19:45:24 INFO: Compiling P0_gg_ttx...
generate 19:54:37 INFO: P0_gg_ttx done. => 553 sec
vs lxplus (interactive run)
10:15:08 INFO: Using LHAPDF v6.2.3 interface for PDFs
10:15:14 INFO: Compiling source...
10:15:26 INFO: ...done, continuing with P* directories => 12 sec
10:15:26 INFO: Compiling StdHEP (can take a couple of minutes) ...
10:16:04 INFO: …done. => 38 sec
10:16:05 INFO: Compiling on 1 cores
10:16:05 INFO: Compiling P0_gg_ttx...
10:16:45 INFO: P0_gg_ttx done. => 40 sec
```
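For reference, the slowdown factors implied by the step durations above (a quick arithmetic cross-check, not part of the original report):

```python
# Compilation-step durations in seconds, read off the timestamps above:
# (container, lxplus interactive)
steps = {
    "source":     (334, 12),
    "StdHEP":     (810, 38),
    "P0_gg_ttx":  (553, 40),
}
for name, (docker_s, lxplus_s) in steps.items():
    print(f"{name}: {docker_s / lxplus_s:.0f}x slower in the container")
```

i.e. the container is roughly 28x, 21x and 14x slower for the three steps.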
#### Size/memory
The container has 53 GB of available space, and at the point where the compilation becomes slow the container only occupies ~230 MB, i.e. much less => **disk size does not seem to be causing the slowdown**
The available memory was changed from 1 GB to 8 GB without any effect on the compilation time in the container.
#### Reading from cvmfs
I ran a script that 1) reads all the lines from a file that lives on cvmfs and 2) copies the same file to a local directory and removes it.
A local run on my laptop (with cvmfs mounted via fuse) gives this:
```
Reading 500 times
real 0m21.504s
user 0m12.937s
sys 0m8.429s
Copying 500 times
real 0m4.993s
user 0m0.620s
sys 0m2.440s
```
Running the script from the container, where the locally available cvmfs directory (see above) is mounted to the container as a volume, gives this:
```
Reading 500 times
real 1m44.217s
user 0m18.329s
sys 0m20.376s
Copying 500 times
real 0m3.716s
user 0m0.570s
sys 0m0.981s
```
**So reading a file seems to be 5x slower when running from the docker container**
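The benchmark script itself is not attached; a minimal reconstruction of what it describes (the cvmfs file path in the usage comment is an assumption; any world-readable file on the mount would do):

```python
import os
import shutil
import time

def benchmark(src, n=500):
    """Time n full reads of src vs. n copy-then-delete cycles to the local directory."""
    start = time.time()
    for _ in range(n):
        with open(src) as f:
            f.readlines()                    # 1) read all the lines
    read_time = time.time() - start

    start = time.time()
    for _ in range(n):
        shutil.copy(src, "local_copy.tmp")   # 2) copy to a local directory...
        os.remove("local_copy.tmp")          # ...and remove the copy again
    copy_time = time.time() - start
    return read_time, copy_time

# Example (on a machine with cvmfs mounted; the exact file is hypothetical):
#   r, c = benchmark("/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/user/atlasLocalSetup.sh")
#   print(f"Reading 500 times: {r:.1f}s, copying 500 times: {c:.1f}s")
```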
#### Next steps
* [ ] To debug further we would need to know exactly how cvmfs is mounted in the gitlab runner
* [ ] Also need to check whether there is any correlation between the slow reading times on cvmfs and MG: does MG call the compilers from cvmfs or read any other information from cvmfs? Probably.
---
Original report from Jason. Similar issues were observed with other processes which are apparently very different from this one (an NLO one and a LO one with a long decay chain).
Job [#7937441](https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/jobs/7937441) failed for 9a6a4445a5bcf7ae08ac81888cccd79ef4cc4af3:
Dear experts,
The run_athena job for my branch times out. I have been trying to debug this from my side, but I am at a loss about how to proceed. The estimated execution time from each log.generate.short is ~0.1 hours, so I wouldn't expect this to be an issue. Could you please advise?
Thanks in advance,
Jason
(milestone: Future · assignee: Spyros Argyropoulos)

# Issue #102: Add checks for input files
https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/issues/102 · reported by Spyros Argyropoulos · updated 2020-07-30

Add checks:
* [ ] no `evgenConfig.inputfilecheck`
* [ ] no `evgenConfig.inputconfcheck` allowed
both are always in the top JO
Also
* [ ] Restructure checks so that everything related to reading the jO is done in one place and everything related to reading the log is done in `logParser`
(milestone: S2.2020)

# Issue #104: Harmonise whitelist with Gen_tf
https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/issues/104 · reported by Spyros Argyropoulos · updated 2021-01-04

Currently the transform allows setups which are explicitly excluded in the whitelist, e.g. `DSID/dat/*.dat`, which is excluded here: https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/blob/master/scripts/whitelist.sh#L9, as discussed in !298.
I no longer remember why we excluded some cases but we should definitely harmonise what is done in the transform and what is done in the CI.
@ewelina could you go through the whitelist and let me know what is treated differently there and in `Gen_tf`, so that we can harmonise them?
Tag @cgutscho @fsiegert
(milestone: S1.2021)

# Issue #105: int conversion of a string "nevents" that contains a float
https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/issues/105 · reported by Xiaohu Sun · updated 2020-04-29

Quite often people define nevents by multiplying a bunch of numbers (safety margin, truth efficiency etc.), so nevents ends up being a float. The log file then contains
20:49:39 Py:MadGraphUtils INFO Setting nevents = 11000.0.
where "11000.0" is picked up by logParser as a string.
Then in the check script
https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/blob/master/scripts/logParser.py#L271
neventsMG=int(generatorDict['nevents'][0])
will crash, as int("11000.0") would not work.
ValueError: invalid literal for int() with base 10
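One possible fix at the quoted line (a sketch, not the actual logParser patch): go through `float` before `int`, so both integer and float strings parse.

```python
def parse_nevents(value):
    """Parse an nevents string that may look like "11000" or "11000.0"."""
    # int("11000.0") raises ValueError, but going through float first works.
    return int(float(value))

assert parse_nevents("11000.0") == 11000
assert parse_nevents("11000") == 11000
```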
Could this be fixed? Thanks!
Best,
Xiaohu
(due 2020-04-30)

# Issue #106: Job Failed #8184903
https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/issues/106 · reported by Xiaohu Sun · updated 2020-04-29

Job [#8184903](https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/jobs/8184903) failed for 127510e74ddbca868a29efecbd1b8c6144bf63b8.

# Issue #107: logParser crash due to double print of nevents in MG
https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/issues/107 · reported by Spyros Argyropoulos · updated 2020-12-09

logParser failing again because of double print-out of nevents:
```
22:32:58 Py:MadGraphUtils INFO Setting nevents = 11000.
22:33:05 Py:MadGraphUtils INFO "nevents" = 11000
```
The first printout seems to be from the old implementation before the restructuring in rel. 21.6.23; however, I don't understand why both printouts appear now. Is this expected @zmarshal @hmildner @mcfayden?
The jO is attached (provided by @ewelina); this was run in 21.6.27.
[mc.MGPy8EG_A14NNPDF23_tWgamma.py](/uploads/66b17b0604410f93d826969cc504c7ef/mc.MGPy8EG_A14NNPDF23_tWgamma.py)
Just to say: if this is expected, we can easily change the behaviour to parse lines containing `"nevents"` (with quotes). Currently it tries to find lines containing `nevents` (without quotes), and since the two printouts differ (trailing dot), the first print-out is not parsed correctly.
(milestone: S1.2020 · assignee: Spyros Argyropoulos)

# Issue #108: logParser crash in CI not handled correctly
https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/issues/108 · reported by Spyros Argyropoulos · updated 2020-05-05

See https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/jobs/8197685
(milestone: S1.2020 · assignee: Spyros Argyropoulos)

# Issue #109: logParser fails in CI when run on MadGraph due to nevents check
https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/issues/109 · reported by Spyros Argyropoulos · updated 2020-05-17

As seen in !412, when running a jO with:
```
evgenConfig.nEventsPerJob = 10000
nevents = runArgs.maxEvents*1.2 if runArgs.maxEvents>0 else 1.1*evgenConfig.nEventsPerJob
```
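A sketch of the arithmetic behind the failure, assuming the expression multiplies `maxEvents` by 1.2 and `nEventsPerJob` by 1.1, and that the CI test ran with `maxEvents=100` (both assumptions, inferred from the "120" in the error below):

```python
nEventsPerJob = 10000
maxEvents = 100  # hypothetical value used by the CI test run

# The jO expression with the multiplications written out:
nevents = maxEvents * 1.2 if maxEvents > 0 else 1.1 * nEventsPerJob
print(int(nevents))              # 120 -- what the jO asks MG to generate in CI

# What the logParser check apparently expects as a minimum:
print(int(1.1 * nEventsPerJob))  # 11000 -- hence "increase nevents from 120 to 11000"
```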
`logParser` fails with
```
ERROR: Increase nevents to be generated in MG from 120 to 11000
```
(milestone: S1.2020 · assignee: Spyros Argyropoulos · due 2020-05-16)

# Issue #110: unwarranted logParser fail at commit-script stage?
https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/issues/110 · reported by Christian Gutschow · updated 2020-05-11

From @mgignac:
The commit script complained on [this line](https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/blob/master/scripts/logParser.py#L222), but I'm not sure that the logic is correct in the logParser. In my log file, I have a single message that's being flagged:
`Matrix_Element_Handler::GenerateOneEvent(): Point for '2_3__u__u__W+__d__u' exceeds maximum by 15.4543.`
And when the check on that line runs, it divides by zero, because `nEventsRequested` is not set.
```
Traceback (most recent call last):
File "scripts/logParser.py", line 624, in <module>
main()
File "scripts/logParser.py", line 485, in main
sherpaChecks(opts.INPUT_FILE)
File "scripts/logParser.py", line 223, in sherpaChecks
logwarn("","WARNING: be aware of: "+str(numexceeds*100./nEventsRequested)+"% of the event weights exceed the maximum by a factor of ten")
ZeroDivisionError: float division by zero
```

# Issue #111: commit_new_dsid.sh creates wrong links at the -n step
https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/issues/111 · reported by Judita Mamuzic · updated 2020-06-03

Dear Experts,
I would like to upload the new JO and the control file for a case where many DSIDs use the same control file, in SUSY. The folder structure is:
```
Dir1:
mc.1.py -> Control.py
Control.py
Dir2:
mc.2.py -> ../Dir1/Control.py
```
The jobs run successfully, and the first step of checks with commit_new_dsid.sh using --dry-run is also successful. However, when the option -n is used in the second step like:
```
./scripts/commit_new_dsid.sh -d=100001-100082 -n -m="SUSY direct stau, TFilt."
```
the linked files become wrong.
Initial input:
```
ls -lah 100xxx/100001/mc* 100xxx/100002/mc*
100xxx/100001/mc.MGPy8EG_StauStauDirect_120p0_1p0_TFilt.py -> SUSY_SimplifiedModel_StauStauDirect.py
100xxx/100002/mc.MGPy8EG_StauStauDirect_160p0_1p0_TFilt.py -> ../100001/SUSY_SimplifiedModel_StauStauDirect.py
```
After step -n:
```
ls -lah 501xxx/501047/mc* 501xxx/501048/mc*
501xxx/501047/mc.MGPy8EG_StauStauDirect_120p0_1p0_TFilt.py -> SUSY_SimplifiedModel_StauStauDirect.py
501xxx/501048/mc.MGPy8EG_StauStauDirect_160p0_1p0_TFilt.py -> ../../501xxx/501047/mc.MGPy8EG_StauStauDirect_160p0_1p0_TFilt.py
```
where the last file should be:
```
501xxx/501048/mc.MGPy8EG_StauStauDirect_160p0_1p0_TFilt.py -> ../501047/SUSY_SimplifiedModel_StauStauDirect.py
```
It seems there is a problem with copying files that are soft links.
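For illustration, a sketch of how a relative link target could be remapped when copying a DSID directory (the helper below is hypothetical, not the actual commit_new_dsid.sh logic; it only rewrites a leading `../<oldDSID>/` in the link target):

```python
import os

def copy_jo_link(src_link, dst_link, dsid_map):
    """Recreate a symlink, remapping any leading ../<old_dsid>/ in its target.

    dsid_map: e.g. {"100001": "501047", "100002": "501048"}.
    Hypothetical helper, not the real commit_new_dsid.sh implementation.
    """
    target = os.readlink(src_link)  # keep the link relative; do not resolve it
    parts = target.split("/")
    # "../100001/Control.py" should become "../501047/Control.py";
    # a plain "Control.py" (link within the same directory) is left untouched.
    if len(parts) >= 2 and parts[0] == ".." and parts[1] in dsid_map:
        parts[1] = dsid_map[parts[1]]
    os.symlink("/".join(parts), dst_link)
```

With the mapping above, `mc.2.py -> ../100001/Control.py` would be recreated as `-> ../501047/Control.py` instead of pointing at the other `mc.*` link.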
I attach here the reduced example.
Many thanks for your help.
Cheers,
Judita
/cc @gstark , @sargyrop , @wfawcett , @cgutscho
[100xxx_short.tar.gz](/uploads/2ac3b683ca55e00c418bc56071864798/100xxx_short.tar.gz)
(assignee: Spyros Argyropoulos)

# Issue #112: CI/logParser addition: maximum value for inputFilesPerJob
https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/issues/112 · reported by Christian Gutschow · updated 2020-05-19

The maximum number of input LHE/EVNT files is `inputFilesPerJob=1000`.
Could this be added to the CI/logParser (whichever is best)?
(assignee: Spyros Argyropoulos)

# Issue #113: Improve `check_modified_files` behaviour
https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/issues/113 · reported by Spyros Argyropoulos · updated 2020-08-04

Do a local rebase before checking what changed, to avoid failed pipelines for commits that are behind master.
(milestone: S2.2020 · assignee: Spyros Argyropoulos)

# Issue #114: Limit of inputFilesPerJob
https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/issues/114 · reported by Xiaohu Sun · updated 2021-01-21

In this ticket https://its.cern.ch/jira/browse/ATLMCPROD-8583 the request needs external LHE files.
In order to have 10000 events per job, we have to set inputFilesPerJob to 200 for some of the JOs. But this triggers an error in the logparser checks, where inputFilesPerJob is limited to at most 100.
We cannot cut 10000 events down to 5000 to bring inputFilesPerJob back within the limit, because that would touch the CPU-hour limit (5000 events in this JO take <1 hour to finish in this case).
Could you suggest how to proceed?
Thanks!
(assignee: Christian Gutschow)

# Issue #115: Wrong printing of branches using a DSID
https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/issues/115 · reported by Spyros Argyropoulos · updated 2020-08-01

I had a wrong error message when I tried to commit JOs for 421332:
the message I got was that dsid_jveatch_600076 already uses this DSID.
I have checked this branch and it was not the case.
I found that this DSID was used in one of the earlier branches awaiting approval.
I think the problem is that the list of branches
https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/blob/master/scripts/check_jo_consistency.py#L118
is ordered from the newest branch to the oldest, and when a new branch is submitted for merging it is updated with the changes introduced in other branches awaiting approval. This way the newest branch will always be pointed to as the one already using a given DSID (in case of conflict).
(milestone: S2.2020 · assignee: Spyros Argyropoulos)

# Issue #116: Don't commit emacs backup files
https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/issues/116 · reported by Christian Gutschow · updated 2020-06-23

Currently, files ending in `blah~` seem to be included by the commit scripts, see e.g. [here](https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/tree/master/500xxx/500908).
Cheers,
Chris
(assignee: Spyros Argyropoulos)

# Issue #117: Check number of files in gridpack
https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/issues/117 · reported by Christian Gutschow · updated 2023-03-01

The number of files in a gridpack shouldn't exceed 80k, otherwise some grid sites will crash. This has happened a number of times recently, e.g. for the FxFx job where the gridpack contained several files per Feynman diagram. MadGraph control cleans up logs and .o files in the latest release, but for older releases it would be good to have a dedicated pipeline step that throws an error if the number of files in the gridpack is larger than 80k. Probably something like `tar -ztvf *.tgz *.tar.gz` could work?
(milestone: S2.2020 · assignee: Spyros Argyropoulos)

# Issue #118: Add checks for `inputfilecheck` and `inputGeneratorFile`
https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/issues/118 · reported by Christian Gutschow · updated 2020-08-03

Please see this test commit: 52aa8087
which has the following two lines in the JO:
```
evgenConfig.inputfilecheck = 'PhPy8EG_NNPDF30LO_EWK_ZZeeee'
runArgs.inputGeneratorFile = 'PhPy8EG_NNPDF30LO_EWK_ZZeeee._00052.events.tar.gz'
```
I thought the CI would already be catching the first one [along with `inputconfcheck`, no?], and the second one is clearly a problem for central production.
Can we catch these? I guess the logParser should already throw an error before the files are even committed to gitlab.
(milestone: S2.2020 · assignee: Spyros Argyropoulos)

# Issue #119: Mentioning ATLMCPROD ticket in MR doesn't push link to Jira any longer
https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/issues/119 · reported by Christian Gutschow · updated 2020-07-31

... not sure there's much we can do about this though?
Any ideas anyone?

# Issue #120: Allow runArgs to be referred to in JOs but not to be overwritten by JOs
https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/issues/120 · reported by Christian Gutschow · updated 2020-08-22

See !631 for an example.
(milestone: S2.2020 · assignee: Spyros Argyropoulos · due 2020-08-14)

# Issue #121: logParser rejects logs with nEventsPerJob > 10k
https://gitlab.cern.ch/atlas-physics/pmg/mcjoboptions/-/issues/121 · reported by Christian Gutschow · updated 2020-08-28

Following the successful test in ATLMCPROD-8659, we should allow cases where `nEventsPerJob` is a multiple of 10k.
Currently it fails saying
```
- CountHepMC Events passing all checks and written = 20000 <-- ERROR: Not an acceptable number of events for production (1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10000)
```
(milestone: S2.2020 · assignee: Spyros Argyropoulos)
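A sketch of a relaxed check (a hypothetical helper, not the current logParser code): keep the existing 1/2/5 ladder and additionally accept any positive multiple of 10k.

```python
# The list currently hard-coded in the error message above.
ACCEPTED = [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10000]

def is_acceptable_nevents(n):
    """Accept the usual 1/2/5 ladder, plus any positive multiple of 10k."""
    return n in ACCEPTED or (n > 0 and n % 10000 == 0)

assert is_acceptable_nevents(20000)      # the case from ATLMCPROD-8659
assert not is_acceptable_nevents(15000)  # still rejected
```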