atlas-physics / pmg / MC Job Options / Issues / #101
Issue created Apr 14, 2020 by Jason Robert Veatch (@jveatch, Developer)
0 of 2 checklist items completed

Long compilation time when running MadGraph in atlas/slc6-atlasos causing CI timeouts

How to reproduce the problem

# Mount cvmfs
sudo mkdir -p /cvmfs/atlas.cern.ch
sudo mkdir -p /cvmfs/atlas-condb.cern.ch
sudo mkdir -p /cvmfs/grid.cern.ch
sudo mkdir -p /cvmfs/sft.cern.ch
sudo mount -t cvmfs atlas.cern.ch /cvmfs/atlas.cern.ch
sudo mount -t cvmfs atlas-condb.cern.ch /cvmfs/atlas-condb.cern.ch
sudo mount -t cvmfs grid.cern.ch /cvmfs/grid.cern.ch
sudo mount -t cvmfs sft.cern.ch /cvmfs/sft.cern.ch

# Get the docker image
docker pull atlas/slc6-atlasos 

# Run image in a container and mount cvmfs
# (b4cfa1203c45 is presumably the local ID of the image pulled above; check with "docker images")
docker run -it -v /cvmfs:/cvmfs b4cfa1203c45

# Inside the docker container get the mcjoboptions repo (or alternatively you can copy it from your local area with docker cp)
kinit USER@CERN.CH
git clone https://:@gitlab.cern.ch:8443/atlas-physics/pmg/mcjoboptions.git
cd mcjoboptions
git checkout dsid_jveatch_500538
export ATLAS_LOCAL_ROOT_BASE=/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase
./scripts/run_athena.sh

Debugging

Bottleneck: compilation time

Comparing the running times at several points of the execution on lxplus and in the container, the problem appears to lie in the compilation time:

Docker (running on a dual-core laptop with cvmfs mounted via fuse):
generate 19:24:54 INFO: Using LHAPDF v6.2.3 interface for PDFs
generate 19:26:19 INFO: Compiling source… 
generate 19:31:53 INFO:           ...done, continuing with P* directories => 334 sec
generate 19:31:53 INFO: Compiling StdHEP (can take a couple of minutes) ...
generate 19:45:23 INFO:           …done. => 810 sec
generate 19:45:24 INFO: Compiling on 1 cores
generate 19:45:24 INFO:  Compiling P0_gg_ttx...
generate 19:54:37 INFO:     P0_gg_ttx done. => 553 sec

lxplus (interactive run):
10:15:08 INFO: Using LHAPDF v6.2.3 interface for PDFs
10:15:14 INFO: Compiling source...
10:15:26 INFO:           ...done, continuing with P* directories => 12 sec
10:15:26 INFO: Compiling StdHEP (can take a couple of minutes) ...
10:16:04 INFO:           …done.  => 38 sec
10:16:05 INFO: Compiling on 1 cores
10:16:05 INFO:  Compiling P0_gg_ttx...
10:16:45 INFO:     P0_gg_ttx done. => 40 sec
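
The "=> N sec" annotations above are differences between consecutive log timestamps. As a rough sketch of how to recompute them and the per-step slowdown, the snippet below uses the timestamps copied from the two logs. The helper names `to_sec` and `elapsed` are made up for this illustration and are not part of the repository scripts.

```shell
#!/bin/sh
# Convert HH:MM:SS to seconds since midnight.
# awk treats "08" as the decimal number 8, so leading zeros are safe.
to_sec() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Elapsed seconds between two timestamps taken on the same day.
elapsed() {
    echo $(( $(to_sec "$2") - $(to_sec "$1") ))
}

# Columns: step, docker start, docker end, lxplus start, lxplus end
# (values copied from the logs above).
while read -r step d0 d1 l0 l1; do
    d=$(elapsed "$d0" "$d1")
    l=$(elapsed "$l0" "$l1")
    # Integer division, so the slowdown factor is rounded down.
    echo "$step docker=${d}s lxplus=${l}s slowdown=$(( d / l ))x"
done <<'EOF'
source 19:26:19 19:31:53 10:15:14 10:15:26
StdHEP 19:31:53 19:45:23 10:15:26 10:16:04
P0_gg_ttx 19:45:24 19:54:37 10:16:05 10:16:45
EOF
```

With the timestamps above this gives slowdowns of roughly 27x, 21x and 13x for the three compilation steps.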

Size/memory

The container has 53 GB of available space, and at the point where the compilation becomes slow the container occupies only ~230 MB, so disk space does not seem to be causing the slowdown.

Increasing the available memory from 1 GB to 8 GB had no effect on the compilation time in the container.

Reading from cvmfs

I ran a script that 1) reads all the lines of a file that lives on cvmfs and 2) copies that file to a local directory and removes the copy.

The local run on my laptop (with cvmfs mounted via fuse) gives this:

Reading 500 times
real	0m21.504s
user	0m12.937s
sys	0m8.429s
Copying 500 times
real	0m4.993s
user	0m0.620s
sys	0m2.440s

Running the script from the container, where the locally available cvmfs directory (see above) is mounted to the container as a volume, gives this:

Reading 500 times
real	1m44.217s
user	0m18.329s
sys	0m20.376s
Copying 500 times
real	0m3.716s
user	0m0.570s
sys	0m0.981s

So reading a file is about 5x slower in wall-clock time when running from the docker container (1m44s vs 21.5s), while copying is not slower.
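
The benchmark script itself is not attached to this issue. A minimal sketch of what such a test could look like is given below, assuming bash (for the `time` keyword) and a file path passed as the first argument. The function names `read_n` and `copy_n` are hypothetical, and no particular cvmfs file is implied.

```shell
#!/bin/bash
# Read every line of a file N times, discarding the content.
read_n() {
    file=$1; n=$2; i=0
    while [ "$i" -lt "$n" ]; do
        while IFS= read -r line; do :; done < "$file"
        i=$((i + 1))
    done
}

# Copy a file into a local directory and remove the copy, N times.
copy_n() {
    file=$1; n=$2; dest=$3; i=0
    while [ "$i" -lt "$n" ]; do
        cp "$file" "$dest/" && rm -f "$dest/$(basename "$file")"
        i=$((i + 1))
    done
}

# Only run the benchmark when a file path is given on the command line.
if [ -n "$1" ]; then
    tmp=$(mktemp -d)
    echo "Reading 500 times"; time read_n "$1" 500
    echo "Copying 500 times"; time copy_n "$1" 500 "$tmp"
    rmdir "$tmp"
fi
```

Running the same script once on the host and once inside the container, with the same file on cvmfs as argument, gives directly comparable real/user/sys numbers. Repeating it on a file that does not live on cvmfs would help separate a cvmfs effect from general container I/O overhead.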

Next steps

  • To debug further, we need to know exactly how cvmfs is mounted in the GitLab runner.
  • We also need to check whether there is any correlation between the slow cvmfs reads and MG: does MG call the compilers from cvmfs, or read any other information from cvmfs? Probably.
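
Both checks can be sketched with standard tools, to be run inside the container (or the runner) after the Athena environment is set up. This assumes a Linux `/proc/mounts` and that MadGraph picks up `gfortran` and `g++` from `PATH`. The function names are invented for this example.

```shell
#!/bin/sh
# Print how each cvmfs repository is mounted (device, mount point, type, options).
show_cvmfs_mounts() {
    grep cvmfs "${1:-/proc/mounts}"
}

# Succeed if the given path lives under /cvmfs.
is_cvmfs_path() {
    case $1 in
        /cvmfs/*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Report whether a command resolves to a path under /cvmfs.
# Returns 0 if on cvmfs, 1 if elsewhere, 2 if the command is not found.
on_cvmfs() {
    path=$(command -v "$1") || { echo "$1: not found"; return 2; }
    if is_cvmfs_path "$path"; then
        echo "$1: $path (on cvmfs)"
    else
        echo "$1: $path (not on cvmfs)"
        return 1
    fi
}

show_cvmfs_mounts || echo "no cvmfs entries found in /proc/mounts"
# Assumed tool list; adjust to whatever the MadGraph build actually invokes.
for tool in gfortran g++ python; do
    on_cvmfs "$tool"
done
```

If the compilers resolve to paths under /cvmfs, every compiler invocation goes through the cvmfs mount, which would be consistent with the slow reads measured above.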

Original report from Jason. Similar issues have been observed with other processes that are apparently very different from this one (an NLO one and a LO one with a long decay chain).

Job #7937441 failed for 9a6a4445:

Dear experts,

The run_athena job for my branch times out. I have been trying to debug this from my side, but I am at a loss about how to proceed. The estimated execution time from each log.generate.short is ~0.1 hours, so I wouldn't expect this to be an issue. Could you please advise?

Thanks in advance,

Jason

Edited Apr 14, 2020 by Spyros Argyropoulos