Replace VTune by Linux Perf in throughput tests
VTune is replaced by Linux perf in the throughput tests.
Linux perf is supported in two modes:
- `fp` mode, which requires frame pointers to be available, e.g. through using a platform like `x86_64_v3-centos7-gcc10fp-opt`
- default is `dwarf` mode, which uses DWARF-based stack unwinding and needs debug symbols; thus you need a platform that compiles with `-g`
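As a rough illustration of the two modes, the sketch below builds the corresponding `perf record` command lines. `--call-graph fp` and `--call-graph dwarf[,size]` are standard `perf record` options; the helper name `perf_record_cmd`, the output file name, and the wrapped job command `./run_job.sh` are assumptions made for illustration and are not necessarily what the throughput test runs.

```python
from typing import List, Optional


def perf_record_cmd(job: List[str], mode: str = "dwarf",
                    stack_size: Optional[int] = None,
                    output: str = "perf.data") -> List[str]:
    """Build a `perf record` command line for the given unwinding mode.

    Hypothetical helper: only the perf options themselves are real.
    """
    if mode not in ("fp", "dwarf"):
        raise ValueError(f"unsupported mode: {mode!r}")
    if stack_size is not None and mode != "dwarf":
        raise ValueError("stack_size only applies to dwarf mode")

    # In dwarf mode perf copies a chunk of the user stack per sample;
    # the optional number is the size of that chunk in bytes.
    call_graph = mode if stack_size is None else f"dwarf,{stack_size}"
    return ["perf", "record", "--call-graph", call_graph,
            "-o", output, "--", *job]


if __name__ == "__main__":
    job = ["./run_job.sh"]  # placeholder for the actual throughput job
    print(" ".join(perf_record_cmd(job, mode="fp")))
    print(" ".join(perf_record_cmd(job, mode="dwarf")))
```

Returning the command as a list keeps it directly usable with `subprocess.run` without shell quoting issues.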
Both modes have very low overhead compared to VTune.
E.g. `hlt1_pp_default` can run on a single NUMA node at ~13.5 kHz while being profiled, compared to 13.65 kHz normally.
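To put the two rates into perspective, here is a small sketch of the relative throughput loss. The function name is made up for illustration, and since the profiled rate is only quoted approximately, the result is approximate as well.

```python
def relative_overhead(profiled_khz: float, baseline_khz: float) -> float:
    """Fractional throughput loss caused by profiling."""
    return (baseline_khz - profiled_khz) / baseline_khz


if __name__ == "__main__":
    # Rates quoted above: ~13.5 kHz profiled vs 13.65 kHz unprofiled.
    overhead = relative_overhead(13.5, 13.65)
    print(f"{overhead:.1%}")  # prints 1.1%
```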
Benefits:
- Linux perf is available by default, no need to install VTune
- low overhead, thus a more realistic measurement
- easier to use, no need for hacks like late attaching to the job, and no segfaults caused by profiling
Some Caveats:
- `fp`: we have a few broken stack frames that can't be avoided, because those are samples taken while we are in `libc`, which doesn't have frame pointers
- `dwarf`: very nice, detailed flamegraphs, but some inconsistencies because some algorithms seem to not be showing their complete ancestry (maybe broken DWARF, or an old kernel 🤷 ?!)
EDIT:
The caveat of `dwarf` mode is solved if we run with a stack-size setting of 64 kB.
But with that setting the mode dumps about 5 GB of data, so we should make sure the handler doesn't try to back that up.
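One way to keep the raw dump out of the backup is sketched below. This is not the existing handler code: the function name, the size cut-off, and the idea of filtering by file name are all assumptions. The only real detail is that `perf.data` is the default output name of `perf record`.

```python
from pathlib import Path
from typing import Iterable, List

MAX_BACKUP_BYTES = 100 * 1024**2  # hypothetical 100 MiB cut-off


def files_to_back_up(paths: Iterable[Path],
                     max_bytes: int = MAX_BACKUP_BYTES) -> List[Path]:
    """Select regular files worth backing up.

    Skips raw perf dumps by name and anything larger than max_bytes.
    """
    selected = []
    for path in paths:
        if not path.is_file():
            continue
        # perf writes perf.data by default (and perf.data.old on reruns).
        if path.name.startswith("perf.data"):
            continue
        if path.stat().st_size > max_bytes:
            continue
        selected.append(path)
    return selected
```

Derived outputs such as flamegraph SVGs are typically small, so they would pass such a filter while the multi-GB dump is left behind.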
I don't think this needs any changes on the database side, but we could at some point remove the sourcing of the Intel stuff.
cc @maszyman
Edited by Christoph Hasse