Generalize BDT in MVAUtils (!27921) · Merge requests · atlas / athena

Ruggero Turra requested to merge ATLAS-EGamma/athena:generalize_node_bdt_MVAUtils into 21.2 Nov 12, 2019

In this MR, MVAUtils is partially redesigned to support different kinds of BDT, trained with tools other than TMVA, for example lgbm and xgboost. An implementation for lgbm is already included.

This MR is needed by some analyses that would like to use lgbm (many already use xgboost directly through the xgboost C API).

The BDT class is now split into three layers:

BDT: this is the only class the user needs, and its interface is unchanged by this MR. It now reads the input weights and decides which implementation of the forest to instantiate.
Forest: this implements the forest of decision trees, but not the node of the BDT. Different Forest classes are provided because, for example, the way the trees are summed differs between TMVA and lgbm (for TMVA the classification output is a normalized weighted sum of the responses of all the trees, while for lgbm it is a simple sum followed by a sigmoid activation).
Node: this implements the logic at each node of the tree, which differs between TMVA and lgbm (e.g. TMVA uses >= as decision, lgbm uses <=, lgbm supports nan as input, ...).
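
To illustrate why separate Forest classes are needed, here is a minimal Python sketch of the two combination rules described above. The function names and the flat lists of per-tree responses are hypothetical and are not the MVAUtils interface; offsets and multiclass outputs are left out.

```python
import math


def tmva_classification(responses, weights):
    """TMVA-style output: normalized weighted sum of the per-tree responses."""
    weighted_sum = sum(w * r for w, r in zip(weights, responses))
    return weighted_sum / sum(weights)


def lgbm_binary_classification(responses):
    """lgbm-style output: plain sum of the per-tree responses, then a standard sigmoid."""
    raw_score = sum(responses)
    return 1.0 / (1.0 + math.exp(-raw_score))


# Two trees voting +1 and -1 with weights 3 and 1.
print(tmva_classification([1.0, -1.0], [3.0, 1.0]))
# Two raw scores that cancel: the sigmoid of 0.
print(lgbm_binary_classification([0.25, -0.25]))
```

The same trees therefore give different outputs depending on the forest type, which is why the choice is made once in BDT when the weights are read.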

Forest uses dynamic polymorphism (virtual classes), but the Node class does not (static polymorphism), to avoid querying a virtual table at each node evaluation.

The order in which the tree responses are summed is now reversed: from the last tree (usually the smallest response) to the first (usually the largest). This should improve numerical stability and does not impact performance.
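
The effect of the summation order can be reproduced with a toy example. The sketch below emulates single-precision accumulation with the standard `struct` module; whether MVAUtils accumulates in single precision is an assumption here, and the responses are invented to make the effect visible.

```python
import math
import struct


def f32(x):
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def sum_f32(values):
    """Accumulate in the given order, rounding to single precision after each addition."""
    total = 0.0
    for v in values:
        total = f32(total + f32(v))
    return total


# Toy forest: one large first response followed by many small ones.
responses = [1.0] + [3e-8] * 1000

first_to_last = sum_f32(responses)
last_to_first = sum_f32(reversed(responses))
reference = math.fsum(responses)  # accurate double-precision sum for comparison

print(abs(first_to_last - reference), abs(last_to_first - reference))
```

Summing from the first tree, every small response is lost because it is below half the spacing of single-precision numbers near 1. Summing from the last tree lets the small responses accumulate before the large one is added.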

Everything else is unchanged: the weights are still read from a TTree, and old TTrees work with the new implementation. Memory usage and CPU speed are identical.

Lgbm has many options (many more than TMVA). Only the most common or default options are supported. The new lgbm implementation supports only regression, multiclassification and binary classification. It supports continuous inputs, including nan as missing value (but not the other kinds of missing values available in lgbm). It does not support categorical input variables. It supports only the default activation functions (e.g. the standard sigmoid for binary classification).
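
As an illustration of the node logic with nan inputs, here is a minimal Python sketch of the two decision rules. The function names and the `nan_passes` flag are hypothetical; the sketch only says whether the test at a node passes, not which child is then visited.

```python
import math


def tmva_node_passes(value, cut):
    """TMVA-style test: the comparison is 'value >= cut'."""
    return value >= cut


def lgbm_node_passes(value, cut, nan_passes):
    """lgbm-style test: 'value <= cut', with nan routed by a per-node flag.

    `nan_passes` is a hypothetical name for the stored choice of what to do with nan.
    """
    if math.isnan(value):
        return nan_passes
    return value <= cut


print(tmva_node_passes(2.0, 1.5), lgbm_node_passes(2.0, 1.5, nan_passes=True))
print(lgbm_node_passes(float("nan"), 1.5, nan_passes=True))
```

Without the explicit nan branch, any comparison with nan evaluates to false, so the direction taken by missing values would be fixed by the comparison operator instead of by the trained model.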

A Python utility is provided to convert the weights saved by the lgbm training to our TTree format. It needs lgbm to be installed (writing a standalone parser would be too complicated and error prone, and would not guarantee compatibility with future lgbm formats, ...).

Performance

20% faster in the simple multiresponse case, where the time for the final softmax normalization is not negligible. The softmax function now has a better implementation: O(n) instead of O(n^2).
Otherwise the performance is the same (checked with Reconstruction/MVAUtils/util/check_timing_mvautils).
The size of the node classes is the same as before, even though they now encode what to do with nan inputs (used only for lgbm).
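
The MR does not show the old softmax code, so the following Python sketch is only a guess at how an O(n^2) cost can arise and how it is avoided. Both function names are hypothetical.

```python
import math


def softmax_quadratic(scores):
    """Recompute the normalization for every class: O(n^2) exponentials."""
    return [math.exp(s) / sum(math.exp(t) for t in scores) for s in scores]


def softmax_linear(scores):
    """Compute the exponentials and their sum only once: O(n)."""
    shift = max(scores)  # subtracting the maximum avoids overflow, result unchanged
    exps = [math.exp(s - shift) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


scores = [0.5, -1.0, 2.0]
print(softmax_quadratic(scores))
print(softmax_linear(scores))
```

The two functions agree up to rounding. The linear version evaluates n exponentials instead of n(n+1), which matters when the trees are shallow and the normalization is a visible fraction of the total time.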

Checks

Backward compatibility

Tested with Reconstruction/egamma/egammaMVACalib/test/ut_test.py (the package is responsible for the egamma energy calibration and uses MVAUtils). Tests comparing the expected output with the one computed using MVAUtils fail with the new implementation due to a tiny relative difference (below 1E-6).

======================================================================
FAIL: test_chain_electron (__main__.TestEgammaMVACalib)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "ut_test.py", line 170, in test_chain_electron
    self.assertAlmostEqual(expected_result, result, places=6)
AssertionError: 131959.0625 != 131959.015625 within 6 places

======================================================================
FAIL: test_chain_photon (__main__.TestEgammaMVACalib)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "ut_test.py", line 199, in test_chain_photon
    self.assertAlmostEqual(expected_result, result, places=6)
AssertionError: 167575.859375 != 167575.703125 within 6 places

----------------------------------------------------------------------

The threshold has been changed in the test.

Other checks

In the utility that converts the lgbm weights file to our TTree, a check is run depending on the task (binary classification / multiclassification / regression). It tests whether the output of lgbm (computed with the lgbm library) is the same as the one from MVAUtils for random inputs (it is possible, and recommended, to use inputs provided by the user instead; an option allows that).
Many automatic checks are in Reconstruction/MVAUtils/test/ut_test_MVAUtils.py. All the output functions are checked. Two kinds of checks are done. The first creates a very simple forest (with hard-coded weights) and checks that the output is the desired one for hard-coded inputs. The second loads the lgbm library (if possible), trains some BDTs (on the iris dataset) and then performs similar checks. If the lgbm library cannot be loaded, it tries to install it; if that also fails, the test is skipped without reporting a failure.
Checks comparing the output of the energy calibration fail due to differences of <0.1 MeV in the calibrated energy. The problem has been traced to the different way the response of each tree is summed.

     Before: ((offset + tree0) + tree1) + ...
     Now:     offset + ((tree0 + tree1) + tree2 + ...)

Since this is understood, the reference files have been updated (the order of the sum has also been reversed to improve numerical stability).
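
The size of the differences reported by the unit test is consistent with rounding in single precision. The sketch below assumes the calibrated energies are single-precision values (the printed numbers are all exactly representable as such) and uses an invented offset and invented tree responses to show the two groupings disagreeing.

```python
import struct


def f32(x):
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def f32_ulp(x):
    """Gap between positive x, rounded to single precision, and the next larger value."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits + 1))[0] - f32(x)


# Differences from the unit test log, in units of the single-precision spacing.
electron_ulps = (131959.0625 - 131959.015625) / f32_ulp(131959.0625)
photon_ulps = (167575.859375 - 167575.703125) / f32_ulp(167575.859375)

# Toy forest (invented numbers): the two groupings give different results.
offset, trees = 131072.0, [0.005] * 4

before = offset
for t in trees:  # ((offset + tree0) + tree1) + ...
    before = f32(before + f32(t))

tree_sum = 0.0
for t in trees:  # offset + (tree0 + tree1 + ...)
    tree_sum = f32(tree_sum + f32(t))
now = f32(offset + tree_sum)

print(electron_ulps, photon_ulps, before, now)
```

At this magnitude the spacing between single-precision numbers is 0.015625, so the two failing comparisons differ by only a few representable values.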

Edited Nov 13, 2019 by Ruggero Turra
