RecJobTransforms + SimuJobTransforms: Switch the compression of all temporary files to ZLIB to speed up reading/writing
As we're discussing in ATEAM-656, this MR changes the compression algorithm for all temporary files to `ZLIB`. In this context, a file is temporary if either of the following two criteria is met:
- The transform runs a chained workflow but has no `--outputXYZFile` specified for the intermediate step(s), e.g. `RAWtoESD` followed by `ESDtoAOD` without specifying `--outputESDFile` (this is not being done)
- The files that are written out by the worker processes in `AthenaMP` (this is already being done)
In the first case, the output filename is set to be `tmp.XYZ`, where `XYZ` stands for the appropriate step, while in the second case `_000` is appended to the file name, both by convention.
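The two naming conventions above could be checked with a small helper along the following lines. This is only a sketch: `is_temporary_file` and `compression_for` are hypothetical names, not the functions touched by this MR, and the check assumes the `tmp.` prefix and the `_000` suffix appear exactly as described.

```python
import os


def is_temporary_file(path: str) -> bool:
    """Return True if the file name follows either temporary-file convention.

    Assumed conventions (taken from the description above):
      * intermediate files of chained workflows are named ``tmp.XYZ``
      * AthenaMP worker outputs have ``_000`` appended to the file name
    """
    name = os.path.basename(path)
    return name.startswith("tmp.") or name.endswith("_000")


def compression_for(path: str, permanent_default: str = "LZMA") -> str:
    """Pick ZLIB for temporary files, otherwise keep the permanent default."""
    return "ZLIB" if is_temporary_file(path) else permanent_default


if __name__ == "__main__":
    for candidate in ("tmp.ESD", "myESD.pool.root_000", "myAOD.pool.root"):
        print(candidate, "->", compression_for(candidate))
```

Matching on the file name alone keeps the rule cheap to evaluate, but it relies entirely on the naming conventions staying as they are.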
From a quick test based on `q431` w/ 50 events, here is the comparison of `StreamESD` performance, as well as resulting `ESD` file sizes, by different compression schemes (compression level is always set to 1):
| Compression | File Size [MB] | CPU-time [sec/evt] | Note |
|---|---|---|---|
| LZMA | 139 | 855 | Leading CPU consumer |
| ZLIB | 180 | 371 | 4th leading CPU consumer |
| ZSTD | 181 | 287 | 4th leading CPU consumer |
| LZ4 | 245 | 221 | 4th leading CPU consumer |
Again, we're not proposing to change the compression scheme for permanent files (which is `LZMA` for all upstream formats including `AODs` and, at least for the time being, `ZLIB` for `DAODs`), only for the temporary ones. Going from `ZLIB` to `LZ4` would increase the file size by about 35% while improving the `StreamESD` CPU performance by 40%. In all three cases, `ZLIB`, `ZSTD`, and `LZ4`, the `ESDtoAOD` performances are practically the same in this test.
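To make the trade-off explicit, the relative changes can be recomputed directly from the table above. The numbers below are copied from that table; the helper names are illustrative only.

```python
# (file size [MB], StreamESD CPU-time) per compression scheme,
# copied from the comparison table in this MR description.
RESULTS = {
    "LZMA": (139, 855),
    "ZLIB": (180, 371),
    "ZSTD": (181, 287),
    "LZ4": (245, 221),
}


def size_increase_pct(old: str, new: str) -> float:
    """Percentage growth in file size when switching from `old` to `new`."""
    old_size, new_size = RESULTS[old][0], RESULTS[new][0]
    return 100.0 * (new_size - old_size) / old_size


def cpu_reduction_pct(old: str, new: str) -> float:
    """Percentage drop in StreamESD CPU time when switching from `old` to `new`."""
    old_cpu, new_cpu = RESULTS[old][1], RESULTS[new][1]
    return 100.0 * (old_cpu - new_cpu) / old_cpu


if __name__ == "__main__":
    for old, new in [("LZMA", "ZLIB"), ("ZLIB", "LZ4")]:
        print(
            f"{old} -> {new}: "
            f"size +{size_increase_pct(old, new):.1f}%, "
            f"CPU -{cpu_reduction_pct(old, new):.1f}%"
        )
```

With these inputs, `ZLIB` to `LZ4` comes out at roughly +36% in size and -40% in CPU time, in line with the figures quoted above, while `LZMA` to `ZLIB` is roughly +29.5% in size and -57% in CPU time.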
This should especially help w/ high thread count `AthenaMT` jobs w/ chained workflows where the temporary intermediate files are currently being compressed w/ `LZMA`, which is very expensive.