WIP: First prototype for a low overhead algorithm timer
This is meant as a first proposal and "discussion starter" for a minimal overhead timing of algorithms.
The timing is based on the Time Stamp Clock also know as TSC. See for example Intel's benchmarking paper
I've in the end decided to use the lfence;rdtsc;
approach which is used in the Linux Kernel. This seems to work quite well.
As to whether we actually need the lfence instruction or not... I am not 100% sure. I tested it and it didn't make to seem much difference.
Since this normally is only important in micro benchmarks where you want to make sure all instructions are actually executed between the rdtsc
calls we might be ok without them. No strong preference from my side...
As to the overhead of this timing, I'd refer to the Agner Fog's work who shows the latency of these calls on many architectures. For example the on a Broadwell machine the rdtsc
call has a 24 cycles latency and the lfence
instruction only takes four cycles.
Printing the timing in the end, I decided to stick to the cycles as this is what we measure quite precisely.
Converting these cycles is a bit of a pain as determining the actual frequency of the TSC is a bit of a pain on not easily done from userspace.
I do however print a rough approximation of the TSC frequency so people can reason about absolute timing.
Edit: fyi an example output could look like:
AlgorithmTimerSvc INFO AlgorithmTimerSvc Summary, the Time Stamp Clock is approx running at: 2194990 kHz
AlgorithmTimerSvc INFO Algorithm Name | Mean Cycles | Sigma | nEntries |
AlgorithmTimerSvc INFO ----------------------------------------------------------------------------------------------------
AlgorithmTimerSvc INFO FloatDataConsumer | 31362 | 22722 | 2 |
AlgorithmTimerSvc INFO Gaudi::Examples::ContextConsumer | 11171 | 5269 | 2 |
AlgorithmTimerSvc INFO Gaudi::Examples::ContextIntConsumer | 7391 | 821 | 2 |
AlgorithmTimerSvc INFO IntDataConsumer | 24598 | 18064 | 2 |
AlgorithmTimerSvc INFO IntDataProducer | 31269 | 22335 | 2 |
AlgorithmTimerSvc INFO IntIntToFloatFloatData | 24897 | 5017 | 2 |
AlgorithmTimerSvc INFO IntToFloatData | 14422 | 7002 | 2 |
AlgorithmTimerSvc INFO OtherIntDataProducer | 9402 | 1074 | 2 |