Highlight

After several throughput optimizations, the throughput on the A5000 is now better than with the hard-coded dxDy version, and the tracking performance of Matching With UT is now better than that of Matching No UT in 2024-expected MC (better in both efficiency and ghost rate).

HLT1 Tracking with UT preparation

This MR contains several changes to prepare for the August 2024 data-taking with the UT in HLT1:

Support UT per-sector dxDy information in HLT1:

- Add per-sector dxDy information to UTGeometry; versioning is introduced to maintain backward compatibility.
- Add per-sector dxDy decoding and necessary Gaudi tests to verify decoded UT hits.
- Adapt all tracking algorithms that use UT hits, such as VeloUT tracking, matching with UT, and downstream tracking.
- Introduce new SmartUTHitCache to replace the old UTHitCache in matching with UT and downstream tracking algorithms:
    1. Remove the global cache and use the original UTHits container when it doesn't fit in shared memory.
    2. Use CUDA SIMD intrinsics to read hit information from the shared memory cache, avoiding repeated shared memory access.
    3. Cache UT hit information into 64 bits (8 bytes) as follows:
      *************************************************************
      *    16 bits   *    16 bits   *    16 bits   *    16 bits   *
      *-----------------------------------------------------------*
      *    xAtYEq0   * zAtYEq0 - z0 *      yMid    * dxDy - dxDy0 *
      *              *              *              *    dY type   *
      *************************************************************
      Note: We use 16 bits (half_t) to store (dxDy - dxDy0), with the last
            bit used to store the dY type (type AB or type CD). This
            introduces a +/- 1 error in the (dxDy - dxDy0) result, but
            it should be generally harmless to our tracking results.
- Add a unit test sequence to validate the new UTHitCache performance.
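
The 64-bit cache layout described above can be sketched as follows. This is a minimal Python illustration, not the CUDA implementation in this MR: the function names, the order of the four 16-bit fields inside the 64-bit word, and the convention that a set flag bit means type CD are all assumptions made for the example.

```python
import struct


def float_to_half_bits(value: float) -> int:
    """Return the 16-bit IEEE 754 half-precision pattern of value."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


def half_bits_to_float(bits: int) -> float:
    """Decode a 16-bit half-precision pattern back to a Python float."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def pack_hit(x, z, y_mid, dxdy, z0, dxdy0, is_type_cd):
    """Pack one hit into a 64-bit integer (hypothetical field order)."""
    words = [
        float_to_half_bits(x),
        float_to_half_bits(z - z0),
        float_to_half_bits(y_mid),
        # Overwrite the last mantissa bit with the dY type flag (assumed: 1 = CD).
        (float_to_half_bits(dxdy - dxdy0) & 0xFFFE) | int(is_type_cd),
    ]
    packed = 0
    for i, word in enumerate(words):
        packed |= word << (16 * i)
    return packed


def unpack_hit(packed, z0, dxdy0):
    """Inverse of pack_hit; the flag bit is cleared before decoding dxDy."""
    words = [(packed >> (16 * i)) & 0xFFFF for i in range(4)]
    is_type_cd = bool(words[3] & 1)
    return (
        half_bits_to_float(words[0]),
        half_bits_to_float(words[1]) + z0,
        half_bits_to_float(words[2]),
        half_bits_to_float(words[3] & 0xFFFE) + dxdy0,
        is_type_cd,
    )


packed = pack_hit(700.0, 2330.5, -150.0, 0.0875, z0=2330.0, dxdy0=0.0872, is_type_cd=True)
x, z, y_mid, dxdy, is_type_cd = unpack_hit(packed, z0=2330.0, dxdy0=0.0872)
```

Clearing the flag bit before decoding keeps the error on (dxDy - dxDy0) at the level of the last mantissa bit, which is the effect the note above refers to.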

Update Matching with UT algorithm for 2024 data-taking (90% UT hit efficiency)

- Update the v2 NN for `matching with UT`: it is trained with 2024-expected MC simulation and uses both `num_ut_hits` and `ut_chi2_per_ndof`. Before the NN training, the `num_ut_hits` distribution is re-sampled to match the real data distribution obtained from dumped MEP files.
- Update the search window size for `matching with UT`: use a wide constant as a robust search window size, which minimizes the impact of imperfect UT alignment in data-taking.
- Add momentum parametrization versioning:
    0. Old HLT2 parameters for both MagUp and MagDown: A large bias in MagUp is observed in data-taking.
    1. Update v0 parameters with Run2 Magnetic Field Map: A small bias in both MagUp and MagDown is observed.
    2. Use v0 for MagDown and v1 for MagUp: A small bias in MagUp is observed.
    3. Add an additional offset to v1; use v1+offset for MagUp and v0 for MagDown: good mass peak in data-taking. (**Current setup**)
    4. Use v1+offset for both MagUp and MagDown: good mass peak in dumped MEPs. (Future setup)
    5. Update v1 with the Run3 Magnetic Field Map; the offset computation is not yet done. (Future setup)

Minor fixes:

- Replace the risky atomicAdd bound check with a safe one in both Matching With UT and Downstream, which should fix the instability of downstream tracking in the CI/CD tests.
- Optimize shared memory allocation based on CUDA occupancy, taking into account registers per thread, threads per block and CUDA Compute Capability. (Optimized mainly for the A5000)
- Code clean-up.

FAQ

How did you decide on the new max_tracks value? What does the distribution of the number of track candidates look like? from !1711 (comment 8226016)

Without clone killing and ghost killing, the number of long tracks gets close to 1000 tracks per event in 2024-expected MC.

The same check was done with real data, where the distribution looks better (due to the 90% UT hit efficiency).

So I reverted this change back to 1000 tracks/event, as this limit should never be reached with 90% UT hit efficiency.

Can you remind me what gamma is and why it needs updating to be more performant on data? from !1711 (comment 8226017)

The trajectory of a charged particle in a constant magnetic field can be approximated by a second-order polynomial, and gamma is its second-order coefficient. Updating gamma during pattern recognition is dangerous, since a single outlier hit can completely bias the trajectory. We therefore use a more robust update here: update_gamma = (old_gamma + new_gamma)/2, which improves the tracking performance in both data and MC.
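
To illustrate why the averaged update is more robust, here is a minimal Python sketch. The function names and the numerical values (arbitrary units) are made up for the example; only the averaging formula itself comes from the text above.

```python
def update_gamma(old_gamma: float, new_gamma: float) -> float:
    """Averaged update quoted above: mean of the old and the new estimate."""
    return (old_gamma + new_gamma) / 2


def run_updates(initial_gamma, estimates, averaged=True):
    """Apply a sequence of per-hit gamma estimates, with or without averaging."""
    gamma = initial_gamma
    for estimate in estimates:
        gamma = update_gamma(gamma, estimate) if averaged else estimate
    return gamma


# Hypothetical per-hit estimates; the last one comes from an outlier hit.
estimates = [1.0, 1.0, 5.0]

direct = run_updates(1.0, estimates, averaged=False)   # 5.0: fully follows the outlier
averaged = run_updates(1.0, estimates, averaged=True)  # 3.0: outlier pull is halved
```

Each averaged step halves the weight of all previous estimates, so the update behaves like an exponential moving average with weight 1/2: a single outlier moves gamma only halfway, and its influence halves again with every later good hit.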

How often do we end up in the case of too many tracks? Was this tested with the Velo and SciFi safety GEC cuts? from !1711 (comment 8226018)

In principle, we should never reach this limit in Matching no UT, whereas Matching with UT reaches it only rarely in 2024-expected MC. Currently, we are fine because of the low UT hit efficiency, but we need to check with data once the UT hit efficiency recovers to 99%.

Do I understand correctly that you don't remove track candidates with several shared UT hits anymore? from !1711 (comment 8226019)

Yes, we only do one clone killing, to make sure each Velo-SciFi pair matches only one UT segment. If two pairs match the same UT segment, it is too dangerous to kill one of them, since we allow a UT segment to have only 2 UT hits because of the low UT hit efficiency. I checked this with data, and killing one of them has a visible efficiency impact.

How many UT hits does this matching require for a VELO-SciFi track? from !1711 (comment 8297441)

In Matching with UT we require only 2 UT hits, which corresponds to 99% tracking efficiency when the UT hit efficiency is 90%. If we required 3 UT hits instead, the expected tracking efficiency would be 95%. This is now configurable as an algorithm property, so we can consider requiring 3 UT hits instead of 2 in the future if it is worth it.
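
The 99% and 95% figures are consistent with a simple binomial estimate. The sketch below assumes four UT layers, at most one hit per layer, and an independent 90% hit efficiency in each layer; it only counts hit availability and ignores any pattern-recognition inefficiency. The function name is hypothetical.

```python
from math import comb


def prob_at_least(min_hits: int, n_layers: int = 4, hit_efficiency: float = 0.9) -> float:
    """Probability that at least min_hits of n_layers layers have a hit."""
    return sum(
        comb(n_layers, k) * hit_efficiency**k * (1 - hit_efficiency) ** (n_layers - k)
        for k in range(min_hits, n_layers + 1)
    )


eff_two_hits = prob_at_least(2)    # about 0.9963
eff_three_hits = prob_at_least(3)  # about 0.9477
```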

Related to the HitCache: Can you please elaborate a bit @jzhuo ? Can this result in non-deterministic behavior? from !1739 (comment 8318527)

In principle, the transformation between float and half_t is deterministic, but converting float to half_t involves a loss of precision. What may happen is that two hits have different xAtYEq0 positions in float, but end up with the same xAtYEq0 position in half_t. This may cause non-deterministic behaviour because the tracking result will depend on the order of UT hits.

Note: the precision of half_t is approximately 4 significant digits. For values like xAtYEq0, yMin, and yMax, we do expect to see some different floats of hits in the outer region ending up converting into the same half_t.
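
The collapse of two distinct floats into the same half-precision value can be reproduced with a few lines of Python. The hit positions below are hypothetical, and `to_half` and `half_spacing` are helper names introduced only for this example.

```python
import math
import struct


def to_half(value: float) -> float:
    """Round value to the nearest IEEE 754 half-precision number."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def half_spacing(value: float) -> float:
    """Gap between neighbouring half-precision numbers around a normal value."""
    _, exponent = math.frexp(abs(value))
    return 2.0 ** (exponent - 11)


# Two hypothetical hits in the outer region, 0.1 mm apart in x.
x_a, x_b = 700.10, 700.20

spacing = half_spacing(700.0)              # 0.5 for values between 512 and 1024
collide = to_half(x_a) == to_half(x_b)     # True: both are stored as 700.0
```

With a spacing of 0.5 between 512 and 1024, the rounding error can reach 0.25, which matches the largest xAtYEq0 bias in the validation output below.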

Fortunately, we also have zAtYEq0. By caching (zAtYEq0 - z0) instead of zAtYEq0, where z0 represents the mean zAtYEq0 in each layer, we are able to improve the cache precision to the order of 10^-8 mm. Thus, even if two different UT hits have the same xAtYEq0 after converting float to half_t, the zAtYEq0 should be different.

This precision improvement technique is also applied to dxDy, where we cache dxDy-dxDy0. Although we use the last bit to store additional information, the cache precision of dxDy is still in the order of 10^-6.
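
The gain from caching the difference to a per-layer reference can be sketched as follows. The z values are hypothetical and `to_half` is a helper name introduced only for this example; it rounds through Python's half-precision `struct` format rather than the CUDA half type.

```python
import struct


def to_half(value: float) -> float:
    """Round value to the nearest IEEE 754 half-precision number."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


# Hypothetical hit z position and per-layer reference z0, both in mm.
z, z0 = 2331.37, 2330.0

# Caching z directly: half-precision numbers near 2331 are 2 apart.
direct_error = abs(to_half(z) - z)

# Caching (z - z0): the stored value is about 1.37, where the spacing is ~0.001.
offset_error = abs(to_half(z - z0) + z0 - z)
```

Because the spacing of half-precision numbers scales with the magnitude of the stored value, subtracting a reference close to the typical value reduces the rounding error by several orders of magnitude in this example.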

As I mentioned, we have a new validation sequence called ut_hit_cache_validation. Here is part of the output from it using real data:

2_Caching Bias, dxDy   , layer = 0: mean = 5.239284e-08, std = 4.529770e-13, min = -3.379770e-06, max = 3.269874e-06
2_Caching Bias, dxDy   , layer = 1: mean = 6.912754e-08, std = 3.854798e-13, min = -1.795590e-06, max = 1.758337e-06
2_Caching Bias, dxDy   , layer = 2: mean = -8.409276e-08, std = 5.640480e-13, min = -2.875924e-06, max = 3.352761e-06
2_Caching Bias, dxDy   , layer = 3: mean = 2.522947e-08, std = 4.861299e-13, min = -2.766028e-06, max = 2.533197e-06
2_Caching Bias, xAtYEq0, layer = 0: mean = -1.371311e-04, std = 3.998522e-03, min = -2.500000e-01, max = 2.500000e-01
2_Caching Bias, xAtYEq0, layer = 1: mean = 2.426437e-05, std = 4.245186e-03, min = -2.500000e-01, max = 2.500000e-01
2_Caching Bias, xAtYEq0, layer = 2: mean = -1.796699e-04, std = 5.110814e-03, min = -2.500000e-01, max = 2.500000e-01
2_Caching Bias, xAtYEq0, layer = 3: mean = 6.355754e-05, std = 5.373291e-03, min = -2.500000e-01, max = 2.500000e-01
2_Caching Bias, yMax   , layer = 0: mean = 2.794118e-03, std = 2.381422e-02, min = -7.503967e-01, max = 7.502441e-01
2_Caching Bias, yMax   , layer = 1: mean = -1.568776e-03, std = 2.229192e-02, min = -7.755127e-01, max = 7.758179e-01
2_Caching Bias, yMax   , layer = 2: mean = 1.895555e-03, std = 2.341687e-02, min = -7.901917e-01, max = 7.743530e-01
2_Caching Bias, yMax   , layer = 3: mean = 6.446959e-03, std = 2.492940e-02, min = -7.499390e-01, max = 7.498779e-01
2_Caching Bias, yMid   , layer = 0: mean = 2.923942e-03, std = 2.383094e-02, min = -7.500000e-01, max = 7.500000e-01
2_Caching Bias, yMid   , layer = 1: mean = -1.163184e-03, std = 2.202258e-02, min = -7.500000e-01, max = 7.500000e-01
2_Caching Bias, yMid   , layer = 2: mean = 1.415870e-03, std = 2.329405e-02, min = -7.499390e-01, max = 7.496948e-01
2_Caching Bias, yMid   , layer = 3: mean = 6.558019e-03, std = 2.494691e-02, min = -7.497559e-01, max = 7.500000e-01
2_Caching Bias, yMin   , layer = 0: mean = 3.054153e-03, std = 2.384936e-02, min = -7.505493e-01, max = 7.503052e-01
2_Caching Bias, yMin   , layer = 1: mean = -7.574534e-04, std = 2.205459e-02, min = -8.118286e-01, max = 7.605591e-01
2_Caching Bias, yMin   , layer = 2: mean = 9.366853e-04, std = 2.349043e-02, min = -7.950439e-01, max = 7.676392e-01
2_Caching Bias, yMin   , layer = 3: mean = 6.669103e-03, std = 2.496446e-02, min = -7.500000e-01, max = 7.503052e-01
2_Caching Bias, zAtYEq0, layer = 0: mean = 5.376643e-04, std = 5.289255e-06, min = -3.417969e-03, max = 3.417969e-03
2_Caching Bias, zAtYEq0, layer = 1: mean = -1.751014e-04, std = 3.806802e-06, min = -3.417969e-03, max = 2.929688e-03
2_Caching Bias, zAtYEq0, layer = 2: mean = -2.189049e-04, std = 4.322091e-06, min = -2.685547e-03, max = 2.929688e-03
2_Caching Bias, zAtYEq0, layer = 3: mean = 3.042924e-04, std = 1.911490e-06, min = -2.197266e-03, max = 1.953125e-03

This shows that the largest bias is of the order of 10^-6 for dxDy, ~0.25 mm for xAtYEq0, ~0.75 to 0.8 mm for the y quantities (yMin, yMid, yMax), and ~0.003 mm for zAtYEq0.

Therefore, the precision is sufficient for our tracking requirements. For non-deterministic behaviour to occur, two hits must be so close that the distance between them is smaller than this precision. In such cases, they will likely be classified as a single UT cluster instead of two UT hits (thanks to having UT clustering in HLT1!).

FYI: @dovombru @cagapopo @gligorov @adeoyang @dtou

Edited Aug 12, 2024 by Jiahui Zhuo
