2nd IML Workshop Challenge
The IML team has again prepared a challenge in the context of the 2nd IML workshop at CERN. Special thanks to Michele Selvaggi, who prepared the data samples. The challenge is to regress the soft drop mass of jets with a transverse momentum of several TeV. Such energies are expected at the FCC (Future Circular Collider). The mass of a jet is an important ingredient for identifying whether a heavy particle, for example a W boson, caused the jet. The simulated data were produced with Delphes using 1000 pileup events. The upgraded CMS (Phase II) detector configuration was used, as pileup events were not available for the FCC steering.
The dataset is available as a ROOT tree / NumPy recarray. The branches/fields present for each jet are:
- Generator-level soft drop mass (the regression target): genjet_sd_m
- Particle flow jet: pt, eta, phi, m (mass): recojet_pt, ...
- Particle flow soft drop jet: pt, eta, phi, m (mass): recojet_sd_pt, ...
Each jet consists of constituents, i.e. reconstructed particles that were clustered into the jet. All constituents are included, i.e. no soft-drop removal is applied. Each constituent contains the following information:
- Number of constituents: n_constituents
- Information per constituent: pt, eta, phi, charge sign, track impact parameters dXY and dZ (set to -999 for constituents without a track), and the energy deposited in the EM (constituents_Eem) or hadronic (constituents_Ehad) calorimeter
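Once loaded as a NumPy recarray, the jet-level fields above can be accessed by name. A minimal sketch (the file name and dtype subset here are illustrative, not the actual file layout):

```python
import numpy as np

# Hypothetical: in practice you would load the training file, e.g.
# jets = np.load("qcd_train.npy")  # structured array, one record per jet

# For illustration, build a tiny structured array with a few documented fields:
jets = np.zeros(3, dtype=[
    ("genjet_sd_m",    "f4"),  # regression target: generator-level soft drop mass
    ("recojet_pt",     "f4"),  # particle-flow jet transverse momentum (GeV)
    ("recojet_sd_pt",  "f4"),  # particle-flow soft-drop jet transverse momentum
    ("n_constituents", "i4"),  # number of constituents in the jet
])
jets["recojet_pt"] = [5500., 6200., 4800.]

# Fields are accessed by name; the 5-7 TeV selection becomes a boolean mask:
mask = (jets["recojet_pt"] > 5000.) & (jets["recojet_pt"] < 7000.)
selected = jets[mask]
print(len(selected))  # 2
```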
The training dataset is stored as a ROOT file and a NumPy file on the public IML EOS space, which can be accessed from within the CERN network with an EOS mount:
or through xrootd with kerberos ticket:
The files are now also public for download on Zenodo at the following link:
The test dataset is:
FAQ, missing permissions: Read access to the IML EOS storage is linked to the mailing list lhc-machinelearning-wg. You can sign up with this link. The access permission information is cached, so it may take up to 1 h until your subscription is checked again. If you still do not have access after 1 h, please get in contact with the coordinators.
The loss is evaluated using only jets with a transverse momentum between 5 and 7 TeV. The loss is defined as the size of the window around the median that incorporates 2/3 of the samples, i.e. analogous to the standard deviation of a Gaussian. Thus the aim is a small resolution. The window size is divided by the median to avoid gaining by simply scaling down the mass. Below is a NumPy function that computes the loss:
import numpy as np

def evaluate_loss(predictions, truth):
    ratio = predictions / truth
    a = np.nanpercentile(ratio, 84, interpolation='nearest')
    b = np.nanpercentile(ratio, 16, interpolation='nearest')
    c = np.nanpercentile(ratio, 50, interpolation='nearest')
    loss = (a - b) / (2. * c)
    return loss
Here, predictions and truth are NumPy arrays containing the predicted and true generator-level jet mass values, respectively. This definition of the resolution width has been used by the ATLAS collaboration.
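To illustrate the metric, here is the loss applied to toy predictions with a 10% Gaussian smearing (the percentile keyword is written as `method`, its name in NumPy ≥ 1.22; the toy mass range is arbitrary):

```python
import numpy as np

def evaluate_loss(predictions, truth):
    ratio = predictions / truth
    a = np.nanpercentile(ratio, 84, method='nearest')
    b = np.nanpercentile(ratio, 16, method='nearest')
    c = np.nanpercentile(ratio, 50, method='nearest')
    return (a - b) / (2. * c)

rng = np.random.default_rng(0)
truth = rng.uniform(50., 150., size=10000)               # toy true masses
predictions = truth * rng.normal(1.0, 0.1, size=10000)   # 10% relative smearing

loss = evaluate_loss(predictions, truth)
print(loss)  # roughly 0.1, i.e. the relative resolution of the smearing
```

Note that a globally scaled prediction (predictions * 0.5) gives the same loss, since the window is divided by the median; only the width of the ratio distribution matters.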
Some people have pointed out that the challenge metric only covers the resolution (width of the jet mass distribution), not the scale. This is intentional, as the central value can typically be calibrated, while we want to focus on how the resolution in high pileup can be reduced.
The winner will be determined on the 11th of April.
Participants need to submit their predictions as a NumPy array or ROOT tree for the qcd_test dataset by the 10th of April, 24:00 CERN time. If you have access to the CERN AFS filesystem (e.g. as a CERN account holder), please upload the file to
and send a mail to inform the IML coordinators using the subject 'IML challenge contribution'. If you are not able to upload the file, you can directly send a link to the file (e.g. a dropbox link) to the IML coordinators.
The winner will be listed on the public LPCC IML webpage and will receive a bottle of Champagne, as well as the chance to present their solution at the next IML meeting after the workshop.