Skip to content

WIP: First prototype for a low overhead algorithm timer

Christoph Hasse requested to merge chasse/Gaudi:AlgorithmTimer into master

This is meant as a first proposal and "discussion starter" for a minimal overhead timing of algorithms.
The timing is based on the Time Stamp Clock also know as TSC. See for example Intel's benchmarking paper

I've in the end decided to use the lfence;rdtsc; approach which is used in the Linux Kernel. This seems to work quite well. As to whether we actually need the lfence instruction or not... I am not 100% sure. I tested it and it didn't make to seem much difference. Since this normally is only important in micro benchmarks where you want to make sure all instructions are actually executed between the rdtsc calls we might be ok without them. No strong preference from my side...

As to the overhead of this timing, I'd refer to the Agner Fog's work who shows the latency of these calls on many architectures. For example the on a Broadwell machine the rdtsc call has a 24 cycles latency and the lfence instruction only takes four cycles.

Printing the timing in the end, I decided to stick to the cycles as this is what we measure quite precisely.
Converting these cycles is a bit of a pain as determining the actual frequency of the TSC is a bit of a pain on not easily done from userspace.
I do however print a rough approximation of the TSC frequency so people can reason about absolute timing.

fyi @clemenci @sponce

Edit: fyi an example output could look like:

AlgorithmTimerSvc    INFO AlgorithmTimerSvc Summary, the Time Stamp Clock is approx running at: 2194990 kHz
AlgorithmTimerSvc    INFO                        Algorithm Name  |      Mean Cycles  |            Sigma  |         nEntries  |  
AlgorithmTimerSvc    INFO ----------------------------------------------------------------------------------------------------
AlgorithmTimerSvc    INFO                     FloatDataConsumer  |            31362  |            22722  |                2  |  
AlgorithmTimerSvc    INFO      Gaudi::Examples::ContextConsumer  |            11171  |             5269  |                2  |  
AlgorithmTimerSvc    INFO   Gaudi::Examples::ContextIntConsumer  |             7391  |              821  |                2  |  
AlgorithmTimerSvc    INFO                       IntDataConsumer  |            24598  |            18064  |                2  |  
AlgorithmTimerSvc    INFO                       IntDataProducer  |            31269  |            22335  |                2  |  
AlgorithmTimerSvc    INFO                IntIntToFloatFloatData  |            24897  |             5017  |                2  |  
AlgorithmTimerSvc    INFO                        IntToFloatData  |            14422  |             7002  |                2  |  
AlgorithmTimerSvc    INFO                  OtherIntDataProducer  |             9402  |             1074  |                2  | 
Edited by Christoph Hasse

Merge request reports