Skip to content

Introduce timing information in logParser [skip modfiles]

Spyros Argyropoulos requested to merge timing into master

This MR introduces timing information according to what was discussed in #45 (closed) #51 (closed) and will allow to reactivate the athena CI job.

Handling of CountHepMC == nEventsPerJob

The check that CountHepMC == nEventsPerJob has not been removed, because I think it might still be a good handle to catch unwanted bugs. Instead I added a flag in the commit script that can bypass this check. Here is what happens when trying to add 421305, run with --maxEvents=10:

mcjoboptions_3 sargyrop$ ./scripts/ -d=421305 -n
INFO: will use following remote for pushing: origin
New DSID directory: 421xxx/421305 ...
	OK: log.generate file found.
- CountHepMC Events passing all checks and written =   10  <-- ERROR: This is not equal to nEventsPerJob=1000 
Errors : 1 , Warnings : 5   -> Errors encountered! Not ready for production! 

	ERROR: log.generate contains errors
	Fix them before committing anything!

In this case users should make sure that disabling the CountHepMC check is what they really want and run with -t:

./scripts/ -d=421305 -t -n
INFO: will use following remote for pushing: origin
New DSID directory: 421xxx/421305 ...
	OK: log.generate file found.
	OK: log.generate file contains no errors
	OK: CI job expected to last less than 1h - time estimate: 0.17 hours
	Will now add files to git commit
		File: log.generate cannot be added to the commit. Skipping.
		Skipping: log.generate.short since the logParser CI job will not be run
		Added: mc_13TeV.Sh_228_Wmunu_EnhLogPtV.GRID.tar.gz

How timing works in logParser

There are 3 timing informations provided:

  1. if CountHepMC == nEventsPerJob: it just prints the timing information from log.generate (same as what we had before)
  2. if CountHepMC != nEventsPerJob: it will extrapolate the timing information using extrapCPU=float(generateDict['nEventsPerJob'])*cpuPerJob/float(CountHepMC)
  3. timing for CI job: this is something extra to help debugging and have more meaningful printouts, it will just extrapolate the timing information to max(10,0.01*nEventsPerJob), i.e. the number of events we will run in the CI

Timing limits

The actual (case 1) or extrapolated CPU time (case 2 above) lead to the following behaviour:

  • CPU < 1 h OR CPU > 18 h => ERROR
  • CPU between 8-12 h => GOOD
  • CPU=1-8 OR 12-18 h => WARNING
    i.e. currently the limits are quite loose and we would only cut jobs which last less than 1h or more than 18h.

For the CI (case 3) above we cut everything that is expected to last more than 1h.

Here is an example from logParser run on the above 421305 DSID with 10 events (nEventsPerJob=1000 in this case):

Performance metrics:
- actual CPU (10 events) = 0.17 hrs
- CPU extrapolated to 1000 events =  16.6 hrs  <-- WARNING: CPU time not optimal - should be between 8-12h. Adjust nEventsPerJob! 
- estimated CPU for CI job = 0.17 hrs
- Virtual memory =   2957.352 Mb 

Notice that the first line is the actual time it took to run the job (with --maxEvents=10), the 2nd is the extrapolation to nEventsPerJob=1000 and the 3rd is the extrapolation for the max(10,0.01*1000)=10 events that would run in the CI.

Note that logParser in CI is always run with -t so there should be no extra change needed in order to re-activate the athena CI jobs.

Closes #51 (closed) I think this also closes #15 (closed)

Edited by Spyros Argyropoulos

Merge request reports