TrigConfHLTUtils: performance improvement to string2hash
Background
We use the string2hash function to calculate a 32-bit hash for each HLT identifier (e.g. chain names), which is stored in the data instead of the plain string. This hash function has been unchanged since Run-1. While it would be both safer and faster to use a vectorized 64-bit hash (e.g. CxxUtils::crc64), that change has to be postponed until Phase-II. In principle, the hash values should only be calculated and stored once during job startup. However, it is easy to mistakenly calculate and/or look up hashes during execution (see ATR-27765).
Status quo
On each invocation of string2hash, the hash value is calculated, collision detection is performed and, if not present yet, the hash->name mapping is stored in a tbb::concurrent_hash_map.
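The current flow can be sketched roughly as follows. This is a simplified, self-contained stand-in, not the actual TrigConfHLTUtils code: a std::unordered_map with a mutex replaces the tbb::concurrent_hash_map, and FNV-1a replaces the actual Run-1 hash function.

```cpp
#include <cstdint>
#include <mutex>
#include <stdexcept>
#include <string>
#include <unordered_map>

using HLTHash = uint32_t;

// Illustrative 32-bit FNV-1a; NOT necessarily the hash used since Run-1.
inline HLTHash calcHash(const std::string& s) {
  HLTHash h = 2166136261u;                    // FNV offset basis
  for (unsigned char c : s) { h ^= c; h *= 16777619u; }  // FNV prime
  return h;
}

std::unordered_map<HLTHash, std::string> hash2name;  // stand-in for the TBB map
std::mutex mapMutex;

HLTHash string2hash(const std::string& name) {
  const HLTHash hash = calcHash(name);        // 1) hash is always re-calculated
  std::lock_guard<std::mutex> lock(mapMutex);
  auto [it, inserted] = hash2name.try_emplace(hash, name);  // 3) store if new
  if (!inserted && it->second != name)        // 2) collision detection
    throw std::runtime_error("Hash collision between '" + it->second +
                             "' and '" + name + "'");
  return hash;
}
```

Note that steps 1 and 2 are paid on every call, even for names that were already hashed, which is what Improvement 1 addresses.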
Improvement 1
In addition to the hash->name mapping, also store the reverse name->hash mapping. This avoids re-calculating the hash and performing the collision detection on each call. This is labeled as "TBB (with lookup)" in the following.
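The lookup-first fast path can be sketched like this (again a simplified stand-in: a plain map and mutex instead of the concurrent containers, and an illustrative hash function):

```cpp
#include <cstdint>
#include <mutex>
#include <string>
#include <unordered_map>

using HLTHash = uint32_t;

std::unordered_map<std::string, HLTHash> name2hash;  // new: forward map
std::unordered_map<HLTHash, std::string> hash2name;  // existing reverse map
std::mutex mapMutex;

inline HLTHash calcHash(const std::string& s) {      // illustrative FNV-1a
  HLTHash h = 2166136261u;
  for (unsigned char c : s) { h ^= c; h *= 16777619u; }
  return h;
}

HLTHash string2hash(const std::string& name) {
  std::lock_guard<std::mutex> lock(mapMutex);
  if (auto it = name2hash.find(name); it != name2hash.end())
    return it->second;                 // fast path: no re-hashing, no collision check
  const HLTHash hash = calcHash(name); // slow path: first encounter only
  hash2name.emplace(hash, name);       // (collision check omitted in this sketch)
  name2hash.emplace(name, hash);
  return hash;
}
```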
Improvement 2
Replace the TBB hash map with the concurrent maps from CxxUtils (by @ssnyder) that are optimized for the write-once, read-often use-case. This is labeled as "CxxUtils (with lookup)" in the following.
Implementation details
- CxxUtils::ConcurrentStrMap is used for the name->hash mapping.
- CxxUtils::ConcurrentMap is used for the reverse hash->name mapping. Note that this map only supports storage of 64-bit (pointer) values, so we cannot store the string directly but have to store a pointer to a string.
- An additional wrapper class (HashStore) is needed to trigger the memory cleanup at the end of the job (this is mostly cosmetic, to avoid one-time leaks in valgrind, etc.).
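The ownership scheme can be sketched as follows. These are hypothetical simplified types, not the actual classes: plain std::unordered_map stands in for both CxxUtils containers, but the key point survives, namely that the hash->name direction stores a heap-allocated string via a pointer (which fits in a 64-bit value slot) and the wrapper's destructor releases those strings at end of job.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Simplified HashStore-like wrapper (names are assumptions, not the real API).
struct HashStore {
  std::unordered_map<std::string, uint32_t> name2hash;         // forward map
  std::unordered_map<uint32_t, const std::string*> hash2name;  // 64-bit payload
  ~HashStore() {                      // cosmetic cleanup for valgrind etc.
    for (auto& [h, s] : hash2name) delete s;
  }
};

void store(HashStore& hs, const std::string& name, uint32_t hash) {
  auto* owned = new std::string(name);  // pointer fits the 64-bit value slot
  hs.hash2name.emplace(hash, owned);
  hs.name2hash.emplace(name, hash);
}
```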
Performance comparison
The following plot shows the results of a synthetic benchmark. The ~10k unique identifiers (L1, HLT, leg names) were taken from the v1Dev menu. The hashes for all strings are calculated once and then read 1M times from 1-8 threads. The results show:
- The additional lookup speeds up the TBB implementation by a factor of 2.7 with 1 thread
- The CxxUtils and improved TBB implementations perform equally well with 1 thread
- With multiple threads, the CxxUtils implementation vastly outperforms the (improved) TBB implementation (by a factor of 4-7)
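The read phase of such a benchmark can be sketched like this (heavily simplified: a plain std::unordered_map accessed read-only, no timing instrumentation, and scaled-down counts; the real benchmark used ~10k menu names, 1M reads and 1-8 threads):

```cpp
#include <atomic>
#include <cstdint>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

std::atomic<uint64_t> checksum{0};  // prevents the reads being optimized away

// Fill the map once (not shown), then hammer it with concurrent lookups.
void runBenchmark(const std::unordered_map<std::string, uint32_t>& name2hash,
                  int nThreads, int nReads) {
  std::vector<std::thread> threads;
  for (int t = 0; t < nThreads; ++t) {
    threads.emplace_back([&] {
      uint64_t local = 0;
      for (int i = 0; i < nReads; ++i)
        for (const auto& [name, hash] : name2hash)
          local += name2hash.at(name);  // read-only concurrent access is safe
      checksum += local;
    });
  }
  for (auto& th : threads) th.join();
}
```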
Performance impact
Despite these nice results, the impact on a real trigger job is entirely negligible. Running an HLT job with 4 threads under VTune, I don't see a measurable difference between the two implementations. This is probably because, unlike in the synthetic benchmark, there is almost no concurrent access to the hash map in a real trigger job, so the TBB maps perform just fine.