The mysterious halving half-precision z0
Context
As part of a much bigger cleanup effort (!811 (merged)) @npond came across a fun bug. With an updated AOD [1] for the trigger tests, two variables (signed and unsigned z0, i.e. `lifetimeSignedZ0SinTheta`) on one track are written out halved when we write them to half precision. The "magic" output value is -0.2499390692, which writes to half precision as -0.125.
Hints from further probing
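The magic value should round to -0.25 in half precision. Given that 0.25 and 0.125 are both powers of two, i.e. both are represented with a significand of exactly 1 (an all-zero stored mantissa plus the implicit leading bit) and exponents that differ by one, this looks weirdly like a single bit shift in the mantissa or a mysterious off-by-one in the exponent.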
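As a quick sanity check of the bit patterns involved, here is a minimal sketch using only Python's standard `struct` module, whose `e` format is IEEE 754 binary16. The helper names `to_half_bits` and `fields` are made up for this illustration and are not part of our code or of HDF5.

```python
import struct

def to_half_bits(x):
    """Round x to IEEE 754 binary16 and return the raw 16-bit pattern."""
    return struct.unpack("<H", struct.pack("<e", x))[0]

def fields(bits):
    """Split a binary16 pattern into (sign, biased exponent, stored mantissa)."""
    return bits >> 15, (bits >> 10) & 0x1F, bits & 0x3FF

# The magic value, what it should round to, and what we actually see on disk.
for value in (-0.2499390692, -0.25, -0.125):
    bits = to_half_bits(value)
    print(f"{value!r:>14} -> 0x{bits:04X} {fields(bits)}")
```

With a correct conversion the magic value lands on the same pattern as -0.25, and that pattern differs from the one for -0.125 only in the exponent field.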
Other things we noticed:
- The problem was there before @npond's MR; it's the updated test file that triggered it.
- We failed to reproduce the problem writing a full-precision float: moving the variable category or using the `-p` flag would make the problem vanish.
- Changing the variable (for this one track) upstream of writing by anything 4.16e-5 or smaller would leave the bug intact, but larger values (e.g. 1e-4) would make it go away.
- On the other hand, forcing every track to this magic value makes the problem go away.
- Shifting the z0 values for every other track (excluding the magic one) by 1e-4 also makes the problem go away.
- Disabling lossless compression has no effect.
- Saving more or fewer tracks has no effect.
- The problem appears on the 36th jet, but if I skip the first 30 it goes away.
- The problem remains if I remove every other track variable.
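To make the exponent hypothesis concrete, the following is a toy model only, not a claim about what HDF5 or our interface actually does. It sketches a float32 to half conversion with round-to-nearest-even, plus a hypothetical switch (`propagate_carry`) that drops the carry into the exponent when rounding overflows the mantissa. All names here are invented for the illustration. The notes above give only the magnitude of the shifts that were tried, so the loop tries both signs.

```python
import struct

def f32_bits(x):
    """Raw 32-bit pattern of x after rounding to float32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def half_value(bits):
    """Interpret a 16-bit pattern as an IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

def toy_f32_to_half(x, propagate_carry=True):
    """Round-to-nearest-even float32 -> binary16, normal in-range values only."""
    b = f32_bits(x)
    sign = b >> 31
    exp = ((b >> 23) & 0xFF) - 127 + 15   # rebias the exponent
    frac = b & 0x7FFFFF
    mant = frac >> 13                     # keep the top 10 mantissa bits
    rest = frac & 0x1FFF                  # the 13 bits being rounded away
    if rest > 0x1000 or (rest == 0x1000 and mant & 1):
        mant += 1
    if mant == 0x400:                     # rounding overflowed the mantissa
        mant = 0
        if propagate_carry:               # the hypothetical bug skips this step
            exp += 1
    return half_value((sign << 15) | (exp << 10) | mant)

magic = -0.2499390692
for shift in (0.0, -4.16e-5, 4.16e-5, -1e-4, 1e-4):
    x = magic + shift
    print(shift, toy_f32_to_half(x), toy_f32_to_half(x, propagate_carry=False))
```

In this toy, the magic value sits just above the halfway point between two neighbouring half-precision values, so correct rounding overflows the mantissa and bumps the exponent to give -0.25. With the carry dropped it gives -0.125 instead. This reproduces the halving for one value in isolation, but it does not by itself explain why the other tracks or the number of skipped jets matter.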
Next steps
We don't want to hold up !811 (merged) for this, but I have a few ideas for where to look:
- There might be something wrong on the HDF5 writing side. This could be in a few places:
  - Our interface with HDF5. I spent some time failing to reproduce this error starting with our existing unit tests, but there might be a bit more to do there. Previously I fixed some issues (atlas/athena!73569 (merged)) that seemed to suggest that the HDF5 interfaces won't always detect inputs that are clearly a bit weird (i.e. the size of the data doesn't match the size in memory).
  - A bug in the HDF5 library. This seems pretty unlikely given how ubiquitous the data format is and how little it has changed in over a decade. Our half precision floats are not a standard predefined type in the library, but the implementation is taken from `h5py` so it should be well tested.

  In either case we should probably update to the latest HDF5 version. Version 1.14 adds a predefined half-precision type which should, at the very least, make `h5ls` look nicer, but it's possible that the newer versions are more stringent with input validation or that they have fixed something. The update should be relatively easy: since the files already exist in the CERN external repository, it should just be a matter of updating a few lines of code.
- There could be memory corruption coming from somewhere else entirely. This seems very unlikely, but I have seen spooky errors caused by e.g. checking an uninitialized enum, so it's not completely impossible.
[1] r15280/AOD.601479.e8514_e8528_s4159_s4114_r15280_tid37668991_00.small.pool.root