
TrigConfHLTUtils: performance improvement to string2hash

Frank Winklmeier requested to merge fwinkl/athena:string2hash_cxxutils into 23.0

Background

We use the string2hash function to calculate a 32-bit hash for each HLT identifier (e.g. chain names), which is stored in the data instead of the plain string. This hash function has been unchanged since Run-1. While it would be both safer and faster to use a vectorized 64-bit hash (e.g. CxxUtils::crc64), that change will have to be postponed until Phase-II. In principle, the hash values should only be calculated and stored once during job startup. However, it is easy to mistakenly calculate and/or look up the hashes during execution (see ATR-27765).

Status quo

On each invocation of string2hash, the hash value is calculated, collision detection is performed, and, if not yet present, the hash->name mapping is stored in a tbb::concurrent_hash_map.
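
For illustration, a minimal sketch of this pattern (the hash algorithm, names and map layout below are placeholders, not the actual TrigConfHLTUtils code):

```cpp
// Illustrative sketch of the status-quo flow: the hash is recomputed and the
// collision check is done on every call, even for already-known strings.
#include <tbb/concurrent_hash_map.h>
#include <cstdint>
#include <stdexcept>
#include <string>

using Hash2Name_t = tbb::concurrent_hash_map<std::uint32_t, std::string>;
static Hash2Name_t s_hash2name;   // hash -> name

std::uint32_t string2hash_sketch(const std::string& name)
{
  // (1) hash calculation on every invocation (placeholder algorithm)
  std::uint32_t hash = 0;
  for (char c : name) hash = hash*31 + static_cast<unsigned char>(c);

  // (2) collision detection and (3) store hash->name if not there yet
  Hash2Name_t::accessor acc;
  if (s_hash2name.insert(acc, hash)) {
    acc->second = name;                 // first time we see this hash
  }
  else if (acc->second != name) {       // same hash, different string
    throw std::runtime_error("string2hash collision for " + name);
  }
  return hash;
}
```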

Improvement 1

In addition to the hash->name mapping, also store the reverse name->hash mapping. This avoids having to re-calculate the hash and perform the collision detection on every call. This is labeled "TBB (with lookup)" in the following.
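
A sketch of what this fast path could look like, continuing the placeholder code from the previous sketch (again illustrative, not the actual implementation):

```cpp
// Illustrative sketch of Improvement 1 ("TBB (with lookup)"): consult a
// name->hash map first, so hashing and collision detection happen only once
// per unique string.
#include <tbb/concurrent_hash_map.h>
#include <cstdint>
#include <string>

using Name2Hash_t = tbb::concurrent_hash_map<std::string, std::uint32_t>;
static Name2Hash_t s_name2hash;   // new reverse map: name -> hash

std::uint32_t string2hash_v2(const std::string& name)
{
  // Fast path: hash already known for this string.
  {
    Name2Hash_t::const_accessor acc;
    if (s_name2hash.find(acc, name)) return acc->second;
  }

  // Slow path (first call for this string): compute the hash and do the
  // hash->name collision check as in the previous sketch (omitted here) ...
  std::uint32_t hash = 0;
  for (char c : name) hash = hash*31 + static_cast<unsigned char>(c);

  // ... then remember the result for the next call.
  s_name2hash.insert({name, hash});
  return hash;
}
```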

Improvement 2

Replace the TBB hash map with the concurrent maps from CxxUtils (by @ssnyder), which are optimized for the write-once, read-often use case. This is labeled "CxxUtils (with lookup)" in the following.

Implementation details

  • CxxUtils::ConcurrentStrMap is used for the name->hash mapping
  • CxxUtils::ConcurrentMap is used for the reverse hash->name mapping. Note that this map only supports 64-bit (pointer-sized) values, so we cannot store the string directly but have to store a pointer to it.
  • An additional wrapper class (HashStore) is needed to trigger the memory cleanup at the end of the job (mostly cosmetic, to avoid one-time leak reports in valgrind etc.); see the sketch after this list.
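
A sketch of this layout, using std::unordered_map as a stand-in for the CxxUtils containers (the actual CxxUtils interfaces are not reproduced here); the point is the ownership model and the end-of-job cleanup:

```cpp
// Illustrative stand-in for the HashStore wrapper: the hash->name map can only
// hold 64-bit values, so it stores pointers to heap-allocated strings, and the
// wrapper deletes them at the end of the job so that valgrind & co. do not
// report one-time leaks.
#include <cstdint>
#include <string>
#include <unordered_map>

struct HashStore {
  // name -> hash (CxxUtils::ConcurrentStrMap in the real implementation)
  std::unordered_map<std::string, std::uint32_t> name2hash;

  // hash -> pointer-to-name (CxxUtils::ConcurrentMap, which only stores
  // 64-bit/pointer values, hence the indirection)
  std::unordered_map<std::uint32_t, const std::string*> hash2name;

  ~HashStore() {
    // One-time cleanup of the owned strings at the end of the job.
    for (auto& kv : hash2name) delete kv.second;
  }
};
```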

Performance comparison

The following plot shows the results of a synthetic benchmark. The ~10k unique identifiers (L1, HLT, leg names) were taken from the v1Dev menu. The hashes for all strings are calculated once and then read 1M times from 1-8 threads. The results show:

  • The additional lookup speeds up the TBB implementation by a factor of 2.7 with 1 thread
  • The CxxUtils and improved TBB implementations perform the same with 1 thread
  • For multiple threads the CxxUtils implementation vastly outperforms the (improved) TBB implementation (by a factor of 4-7)

[Plot: string2hash benchmark timing vs. number of threads]
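
For reference, a hypothetical sketch of how a benchmark of this kind could be structured (this is not the actual test code behind the plot, and the exact number of reads per thread is an assumption):

```cpp
// Hypothetical benchmark harness: hash ~10k identifiers once, then read them
// back many times from 1-8 threads and time each configuration.
#include <chrono>
#include <cstdint>
#include <iostream>
#include <string>
#include <thread>
#include <vector>

// Implementation under test, e.g. the string2hash_v2 sketch above.
std::uint32_t string2hash_v2(const std::string& name);

int main()
{
  // Stand-in for the ~10k unique L1/HLT/leg names from the v1Dev menu.
  std::vector<std::string> names;
  for (int i = 0; i < 10000; ++i) names.push_back("HLT_chain_" + std::to_string(i));

  // Calculate and store all hashes once (the "write-once" phase).
  for (const auto& n : names) string2hash_v2(n);

  for (unsigned nThreads = 1; nThreads <= 8; ++nThreads) {
    const auto t0 = std::chrono::steady_clock::now();
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nThreads; ++t) {
      workers.emplace_back([&names] {
        // Read-only phase: 1M lookups (per thread here), all hitting the
        // fast path.
        for (int i = 0; i < 1000000; ++i)
          (void)string2hash_v2(names[i % names.size()]);
      });
    }
    for (auto& w : workers) w.join();
    const std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
    std::cout << nThreads << " thread(s): " << dt.count() << " s\n";
  }
  return 0;
}
```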

Performance impact

Despite these nice results, the impact on a real trigger job is entirely negligible. Running an HLT job with 4 threads under VTune, I don't see a measurable difference between the two implementations. This is probably because, unlike in the synthetic benchmark, there is almost no concurrent access to the hash map in a real trigger job, so the TBB maps perform just fine.

Relates to ATR-27539 and cc @tbold @tamartin @smh @abarton.

