Better jet hashing: More random random numbers

Dan Guest requested to merge dguest/training-dataset-dumper:morerand into main

As I mentioned on Mattermost and in some CodiMD notes, there's a problem with giving every jet in an event the same random number for fold selection: all jets from one event end up in the same fold. This adds some more sources of per-jet entropy.

There are a few problems with this implementation:

  • The biggest one (which I don't know how to solve) is that some jets in an event will always have identical properties, no matter how many properties we include in the hash. This means there's still a bias toward jets within the same event landing in the same fold. I'm thinking we might just have to accept that a bias toward the same fold is better than always using the same fold.
  • Right now, if two sources of random integers give, say, 25 and 12 respectively, the result is the same as for 12 and 25. We're losing some entropy there, which could be fixed with a per-key hash. Is it worth it?
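
To illustrate the second point, here is a minimal sketch in Python (not the code in this MR). The function names and key strings are made up for illustration, and I'm assuming the current combination behaves like a plain sum; any order-insensitive operation would show the same effect. The keyed variant uses SHA-256 only as a convenient, deterministic stand-in for whatever per-key hash we would pick.

```python
import hashlib


def combine_symmetric(counts):
    # Assumed stand-in for an order-insensitive combination:
    # swapping the values between two keys gives the same result.
    return sum(counts.values())


def combine_keyed(counts):
    # Hypothetical per-key hash: mix each key name in with its value,
    # so (constituents=25, ghost_tracks=12) differs from the swap.
    h = hashlib.sha256()
    for key in sorted(counts):
        h.update(f"{key}={counts[key]};".encode())
    # Keep 64 bits so the result fits a typical seed type.
    return int.from_bytes(h.digest()[:8], "little")


jet_a = {"constituents": 25, "ghost_tracks": 12}
jet_b = {"constituents": 12, "ghost_tracks": 25}

print(combine_symmetric(jet_a) == combine_symmetric(jet_b))
print(combine_keyed(jet_a) == combine_keyed(jet_b))
```

With the symmetric combination the two jets collide; with the keyed one they get different values, at the cost of one extra hash per source.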

I tried a few different associated object counts for the hashing.

config   N overlapping
egcps    1
egcp     32
egct     96
egc      529
ec       6863
e        61919
total    81707

where

variables
c    constituents
g    ghost tracks
t    ftag tracks
p    n pixel hits
s    n sct hits
e    event hash

So we get < 1% overlapping hashes (529 of 81707, about 0.65%) if we hash on the number of constituents and the number of ghost tracks in addition to the event hash. If we add the cone-associated track count (t) it drops to around 1 in a thousand (96 of 81707), but that variable is sensitive to the jet calibration. Adding track quality variables like n sct or n pixel hits lowers the number to the floor that I'm sensitive to.
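
For reference, the fractions quoted above can be reproduced from the table with a few lines of Python. This is only arithmetic on the numbers listed here, assuming "total" is the number of jets examined; the names `overlaps`, `total` and `overlap_fraction` are just for this sketch.

```python
# Counts copied from the "N overlapping" table.
overlaps = {
    "egcps": 1,
    "egcp": 32,
    "egct": 96,
    "egc": 529,
    "ec": 6863,
    "e": 61919,
}
total = 81707  # assumed to be the total number of jets


def overlap_fraction(config):
    # Fraction of jets with an overlapping hash for a given config.
    return overlaps[config] / total


for config in overlaps:
    print(f"{config:6s} {overlap_fraction(config):.4%}")
```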


Future work

There are a few things I don't love about this, which I plan to work on after this MR. I'll do all this work without changing the hashing behavior, though, so this should be fine to produce outputs.

  • There's too much map lookup. This came out of all the use of key arrays, which don't naturally lend themselves to structures that put related information in the same place.
  • I'm passing in a JSON string in one place, which is either pretty ugly or a great solution, depending on how much you love the Gaudi configuration. My plan is actually to move as much as possible over to the JSON, since I find the Gaudi interface super limiting.
Edited by Dan Guest