Brief readme of the code examples used for the RSE interview portfolio:

ExportTrees (included here)

Collaborative effort with B. Ravina, C. Macdonald, S. Fracchia, A. Lopez-Solis (all Sheffield). Languages used: C++, Python (Bash for simple Linux execution).

ExportTrees: a package written in C++/Python (via the HEP ROOT interface) that processes files in which each event holds std::vector branches (i.e. variable-length n-tuples) into files containing only single-valued variables (flat n-tuples). The BDT classifier evaluation on a given input is implemented in ROOT/MVAVariables.cxx, which also defines the kinematic variables used as BDT inputs. These are predominantly the simple kinematic variables used in the tt+MET 0L analysis, such as the missing energy in the transverse plane, the 4-momenta of reconstructed objects in the event, and the simple kinematic relationships between them used in the previous paper publications.
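
To illustrate the flattening concept (a hedged sketch only, not the package's code, which does this in C++ via the ROOT TTree interface; the file, tree and branch names below are placeholders), the same idea in Python with uproot/awkward looks roughly like:

    # Sketch of turning a per-event std::vector branch (a "jagged" n-tuple) into a
    # single-valued ("flat") variable. All names are placeholders, not the package's.
    import uproot
    import awkward as ak

    with uproot.open("input.root") as fin:
        events = fin["nominal"].arrays(["jet_pt", "met_met"])  # jet_pt: vector per event

    flat = {
        # keep only the leading-jet pT: one number per event (or a sentinel if no jets)
        "jet_pt_leading": ak.fill_none(ak.firsts(events["jet_pt"]), -999.0),
        "met_met": events["met_met"],              # already flat, copied through
    }

    with uproot.recreate("flat.root") as fout:
        fout["flat_nominal"] = flat                # write the flat n-tuple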

The BDT implementation consists of two BDTs trained independently on odd- and even-numbered events (split on the event number), each applied to the opposite, statistically independent set (i.e. the even-trained BDT is applied to odd events and vice versa); the scores are then merged to produce a single variable, BDTG_highstop/BDT_highstop. This is also referred to as "two-fold cross-validation". Selection cuts are applied downstream in the user analysis code (not shown) to choose an optimal cut on the BDT score that best distinguishes background from signal.
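
Schematically, the evaluation side of the two-fold scheme works as below (a minimal sketch; bdt_even/bdt_odd and score() are illustrative names, not the package's actual interface):

    # Two-fold cross-validation at evaluation time: each BDT is only ever applied to
    # events it was NOT trained on, and the two scores are merged into one variable.
    def evaluate_bdt(event_number, features, bdt_even, bdt_odd):
        """bdt_even was trained on even-numbered events, bdt_odd on odd-numbered events."""
        if event_number % 2 == 0:
            return bdt_odd.score(features)   # even event -> apply the odd-trained BDT
        return bdt_even.score(features)      # odd event  -> apply the even-trained BDT

    # The merged score is stored as a single branch (BDTG_highstop/BDT_highstop) and the
    # downstream analysis code cuts on it to separate signal from background.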

Reproducibility/preservation measures used:

-> GitLab CI (.gitlab-ci.yml) set up for simple deployment/build testing as part of merge requests. The original parent repository has a full working pipeline setup.

-> The original parent repository is forked to a group-accessible (i.e. not tied to an individual user) area with persistent access, allowing a developer to re-use the code directly for an analysis and to debug common issues not seen during the original development phase.

-> Direct "push-to-master" disabled, so we submit merge requests, undergo code review etc. Branches are also saved normally, and gitlab UI allows for graphing of changes in merge requests.

-> Not shown here, but later added for analysis reproduction: the CI build now produces a Docker image for a given merge request as part of the YAML pipeline. We use a tool called RECAST with a YAML-based driving file (see below) to automate the entire analysis chain, so that a user can supply a new signal model (the production of which is itself tagged and documented by ATLAS, so that the input has the right variables) and extract a new confidence limit from it. This is particularly pertinent as many analyses are re-used by "combination" efforts and automated multi-analysis scans.

-> Other issues: BDT evaluation and training form a loop through this code, since outputs from this package are required as input to the BDT training. This is a substantial overhead in terms of reprocessing the same events, but appears to be unavoidable (we can circumvent it with a CLI option that skips the MVA evaluation when producing a training sample; a sketch of such a flag follows below). In production this loop only occurs once. Some ROOT macros (C++) were written to do the BDT evaluation directly, but setting these up for batch processing would have required substantial extra work.
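
Purely as an illustration of that kind of switch (the real package has its own option names, so the flag below is hypothetical):

    # Hypothetical CLI flag for skipping the MVA evaluation when producing a training
    # sample; illustration only, not the package's actual interface.
    import argparse

    parser = argparse.ArgumentParser(description="Flatten n-tuples, optionally evaluating the BDT")
    parser.add_argument("--skip-mva", action="store_true",
                        help="write the flat n-tuple without evaluating the BDT, "
                             "e.g. to produce a sample for training the BDT itself")
    args = parser.parse_args()

    if args.skip_mva:
        print("Skipping BDT evaluation (training-sample production)")
    else:
        print("Evaluating the trained BDT and storing its score in the output tree")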

RECAST (see link):

Language used: YAML (executing Bash/Python commands)

Collaborative effort with C. Macdonald (Sheffield). My work was in testing and validating each of these steps individually, and checking that the relevant inputs were available. https://gitlab.cern.ch/recast-atlas/susy/ana-susy-2018-12

Automated analysis of a given input file, driven by an automated YAML-based setup using RECAST (see the documentation for more details).

Steps correspond to:

  1. Grid-based production code -> converts ATLAS data formats (xAOD) into reduced-size files with only the relevant variables/objects saved (Python/C++). Parts of our analysis required different variables/skimming settings, so multiple steps correspond to this stage.

  2. Batch-based production code (ExportTrees, see above).

  3. Profile likelihood fits -> Monte Carlo background estimates are corrected in sideband regions to best reflect their true values in the signal region, including the impact of statistical and systematic uncertainties. Profile likelihood estimation is then used to extract the statistical significance of the result using asymptotic formulae and to set a confidence limit (CLs). A toy illustration of a CLs calculation is sketched below.
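
The fits themselves use the analysis' own profile-likelihood machinery (not shown). Purely to illustrate the CLs idea, a one-bin toy example with the pyhf package (made-up yields, not analysis numbers) looks like:

    # Toy CLs calculation with pyhf; the yields are invented and this is not the
    # analysis' actual fit code.
    import pyhf

    model = pyhf.simplemodels.uncorrelated_background(
        signal=[5.0], bkg=[50.0], bkg_uncertainty=[7.0]   # single-bin toy example
    )
    data = [52.0] + model.config.auxdata                  # observed count + auxiliary data

    cls_obs, cls_exp = pyhf.infer.hypotest(
        1.0, data, model,                                 # test signal strength mu = 1
        test_stat="qtilde", return_expected=True          # asymptotic formulae by default
    )
    print(f"Observed CLs = {float(cls_obs):.3f}, expected CLs = {float(cls_exp):.3f}")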

Together, these steps run the entire ATLAS SUSY stop0L analysis, including all production steps from input to final fit result, with as many automated components as possible. The aim of such a project is to be able to re-run a physics analysis even if the original developers are no longer available (to account for the relatively short duration of PhD/postdoc contracts).

-> Issues: the YAML file's configuration syntax for chaining the steps together is a bit tricky.

-> The original analysis was designed in discrete steps (each of which could be run separately and relatively quickly on either grid or batch computing to diagnose problems). A possible alternative would have been to run the analysis in fewer, automatically configured steps, but this would have differed substantially from the actual workflow used for the paper.

MVA Training (included in repo):

Language used: C++

Example TMVA code for MVA training using ROOT TMVA (originally pure C++; subsequent software versions allow PyROOT/Python-based configuration of the C++ machinery). TMVA was chosen at the time over pythonic solutions (e.g. XGBoost, pandas) because it can already read ROOT file formats (the ATLAS default data structure format) and is part of the core ROOT software release. In addition, TMVA can run multiple different MVA classifiers/settings in a readily configurable manner, with the output being a ROOT executable macro (.C) and an XML file describing the trained trees.

The example shown is a ROOT macro with configurable settings covering the BDT and BDTG options, in this case for a simple signal/background classification problem separating a SUSY signal from ttbar, using simple kinematics as input.
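
For reference, a roughly equivalent configuration in PyROOT (a sketch only; the file, tree and variable names are placeholders rather than those of the repository macro) would be:

    # PyROOT sketch of a TMVA BDT(G) training setup, roughly mirroring the C++ macro.
    # File/tree/variable names are placeholders.
    import ROOT

    ROOT.TMVA.Tools.Instance()
    out_file = ROOT.TFile.Open("tmva_output.root", "RECREATE")
    factory = ROOT.TMVA.Factory("TMVAClassification", out_file,
                                "!V:AnalysisType=Classification")
    loader = ROOT.TMVA.DataLoader("dataset")

    for var in ("met", "mt", "jet_pt_leading"):    # simple kinematic inputs
        loader.AddVariable(var, "F")

    sig_file = ROOT.TFile.Open("signal.root")
    bkg_file = ROOT.TFile.Open("ttbar.root")
    loader.AddSignalTree(sig_file.Get("nominal"), 1.0)
    loader.AddBackgroundTree(bkg_file.Get("nominal"), 1.0)
    loader.PrepareTrainingAndTestTree(ROOT.TCut(""), "SplitMode=Random:NormMode=NumEvents")

    # Gradient-boosted decision trees; TMVA writes the trained trees to an XML weights
    # file (plus a .C class file) under dataset/weights/.
    factory.BookMethod(loader, ROOT.TMVA.Types.kBDT, "BDTG",
                       "NTrees=400:BoostType=Grad:Shrinkage=0.1:MaxDepth=3")
    factory.TrainAllMethods()
    factory.TestAllMethods()
    factory.EvaluateAllMethods()
    out_file.Close()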

Also managed in a version controlled repository (not shown here).

NB: a neural network implementation directly via ROOT was not possible/functional at the time. ATLAS users prefer to use TensorFlow/Keras directly (via conda/virtualenv/Singularity or other setups) rather than TMVA, together with a purely pythonic package called "Uproot" (entirely independent of ROOT, so no C++ installation is required) to convert ROOT file formats to pandas DataFrames (and back again), so that the usual neural network toolkits can be used. This point relates to code I have used in my postdoc (and to the tutorial in the next item).
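
For context, the ROOT-to-pandas step is essentially a one-liner with uproot (sketch; file/tree/branch names are placeholders):

    # Read a ROOT n-tuple straight into a pandas DataFrame with uproot; no C++/ROOT
    # installation is needed. Names are placeholders.
    import uproot

    with uproot.open("flat.root") as fin:
        df = fin["flat_nominal"].arrays(["met_met", "jet_pt_leading"], library="pd")

    print(df.head())   # an ordinary DataFrame, ready for scikit-learn/Keras etc.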

Google Colab tutorial (see link)

Language used: Python

Delivered in conjunction with M. Sullivan (Liverpool) at the ATLAS UK meeting in 2021. The tutorial covers conversion from ROOT format to pandas DataFrames, setting up the NN in TensorFlow, activation functions, back-propagation, etc.
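
The network itself is built along these lines (a minimal sketch rather than the notebook's exact code; the layer sizes and the random placeholder data are illustrative):

    # Minimal Keras binary classifier of the kind covered in the tutorial.
    # Random placeholder features/labels stand in for the pandas DataFrame inputs.
    import numpy as np
    import tensorflow as tf

    X = np.random.rand(1000, 3).astype("float32")     # placeholder kinematic features
    y = np.random.randint(0, 2, size=1000)            # placeholder signal/background labels

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(3,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),   # output: signal probability
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X, y, epochs=5, batch_size=32, validation_split=0.2)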

My primary contribution was presenting the sections on MVA training and activation functions; I also contributed to the project setup.

This is a purely pythonic setup (TensorFlow itself has a C++ backend), with the notable upside of Google Colab being access to a cloud GPU at runtime. https://github.com/manthony-42/ATLAS_UK_NN_TUTORIAL/blob/main/DNN.ipynb

This tutorial also demonstrates the efficacy of Jupyter notebooks in AI/ML development: a suitable notebook environment sidesteps many "environment setup" issues (a common problem when setting up TensorFlow/Keras without a package manager, e.g. conda or a python3 virtualenv).

I think notebooks are great for testing/demonstrating ML projects without extensive user-side setup, but for production code I prefer CLI-executable scripts and containerised code (particularly on distributed systems such as HTCondor, or for large-scale file processing) to make the most efficient use of the available resources.

Other code portfolio items

I thought these were useful links to include as well (since they cover development on the core ATLAS Athena codebase, involving CI, nightly testing, validation and full merge reviews), but I think they are less pertinent to the direct scope of the interview since they are "non-ML" projects.

ATLAS ATHENA Event Reconstruction: jets/missing energy reconstruction (global particle flow).

Languages: C++/Python. Ongoing work as part of the ATLAS collaboration.

What this does: Flow Elements (and electrons or photons) have two kinds of C++ object associated to them, namely charged-particle tracks and calorimeter clusters (electrically charged objects have both; neutral objects only have a cluster).

To ensure that we are referencing the same track/cluster, we keep containers of tracks and clusters, and the electrons/photons/flow elements hold pointers into those containers. We loop over the flow-element and electron/photon containers and check whether a track or cluster is shared between them, based on an index match of the linked object. If so, we save this as a pointer link (in ATLAS Athena jargon, an "ElementLink") to the given container and element. In principle this matches electrons/photons to a given flow element, allowing flow elements corresponding to these objects to be removed.
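
The association logic itself is simple and, purely as an illustration (the real implementation is the Athena C++ linked below, and the attribute names here are invented), can be sketched as:

    # Illustrative sketch of the index-matching logic; not the Athena implementation.
    # A shared track/cluster index means the e/gamma object and the flow element
    # describe the same detector signal.
    def associate(flow_elements, egamma_objects):
        links = []
        for fe in flow_elements:
            for eg in egamma_objects:
                # Charged flow elements match by track index, neutral ones by cluster index.
                if fe.track_index is not None and fe.track_index == eg.track_index:
                    links.append((fe, eg, "track"))
                elif fe.cluster_index is not None and fe.cluster_index == eg.cluster_index:
                    links.append((fe, eg, "cluster"))
        return links   # stored on the flow elements as ElementLinks in Athena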

(Source code -> public) https://gitlab.cern.ch/atlas/athena/-/blob/master/Reconstruction/eflowRec/src/PFEGamFlowElementAssoc.cxx

(Header -> public) https://gitlab.cern.ch/atlas/athena/-/blob/master/Reconstruction/eflowRec/eflowRec/PFEGamFlowElementAssoc.h

Python driving implementation (in Athena this is set by “JobOptions”): https://gitlab.cern.ch/atlas/athena/-/blob/master/Reconstruction/eflowRec/python/PFCfg.py#L171

Physics validation (run on a “long term” basis with frequent validation campaigns) -> Public: https://gitlab.cern.ch/atlas/athena/-/blob/master/Reconstruction/PFlow/PFlowValidation/PFODQA/src/PhysValFE.cxx

Issues we encountered:

-> Core infrastructure: physics objects in ROOT have to be carefully loaded into CINT via a dictionary include. Flow Elements were considered "new" objects, and getting this right was a substantial learning process.

-> Code reusability. Since "FlowElement" is a C++ object type, we wanted to be able to use this code in as many applications as possible (see the discussion in the Python setup about the track-calo-cluster/TCC objects). This interoperability introduces new behaviour (such as an object not having an associated track if it was discarded at an earlier stage), which had to be handled with several additional nullptr checks and diagnostics.

-> Code reusability across software versions. This software was designed purely for the latest and greatest Athena release (Athena Release 22 / multi-threaded Athena).

What I would do differently:

-> Scope out the main use cases in advance and add more "future-proof", context-dependent checks, although these are hard to know in advance for a given academic project. Add more nullptr checks and diagnostic information that can be enabled in a more "context-dependent" manner.

ATLAS release tester

Languages: Python/Bash

Nightly testing of the FastChainPileup package using the ATLAS Release Tester (ART) framework, with a web display that produces diagnostic results from the nightly runs on the BNL computing system (ART jobs normally run on the grid, but to enforce an identical CPU configuration these are restricted to a specific computing site). Each test runs over a particular input, produces an output, and validates it against a given cross-reference. Since this is process-specific, it complements "merge request"-type tests such as Jenkins and CI, and allows a user to cross-check the physics impact (via plots) of any software changes. A toy illustration of this kind of reference comparison is sketched below.
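
As a toy illustration of that kind of reference comparison (this is not the ART test itself, which is the shell script linked below; names are placeholders):

    # Toy version of a "compare the nightly output to a fixed reference" check,
    # comparing one histogram bin by bin. Not the real ART machinery.
    import numpy as np
    import uproot

    new = uproot.open("nightly_output.root")["h_jet_pt"].values()
    ref = uproot.open("reference.root")["h_jet_pt"].values()

    if np.allclose(new, ref, rtol=1e-6):
        print("PASS: output matches the reference")
    else:
        print("FAIL: output differs from the reference")   # flagged on the web display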

See https://cds.cern.ch/record/2709656/files/ATL-SOFT-PROC-2020-012.pdf for more details.

Code examples I worked on (since updated by the ATLAS collaboration to suit their use case): https://gitlab.cern.ch/atlas/athena/-/blob/master/Simulation/FastSimulation/FastChainPileup/test/test_fastchain_mc16a_ttbar.sh

Originally used to produce a traffic-light interface for debugging common analysis issues in the FastChainPileup toolkit and for diagnosing complex issues within the runtime job configuration.