Refactor PV Beamline Peak
This MR refactors PV Beamline Peak by dividing the logic into two kernels.
PV Beamline Peak's time in the NVIDIA profiler is a red herring. It is an example where the latency is high, but the amount of GPU resources it takes is insignificant. This can be proven by running the same kernel twice in the algorithm, which results in a negligible time overhead.
Judged naively by the number of lines added or removed, the diff looks like a rewrite, but on closer inspection it seems to be a set of small technical changes. Is that right?
There doesn't seem to be any performance improvement from this?
cc @freiss
This doesn't improve performance yet, hence the WIP.
The idea behind this MR is to divide the logic of PV Beamline Peak into two kernels: pv_beamline_calculate_cluster_edges and pv_beamline_peak. If you look at the breakdown of the sequence, you can see that the remaining work should be done on pv_beamline_calculate_cluster_edges, which now takes 6.80 % versus only 0.90 % for pv_beamline_peak. In my opinion this alone would make the MR worth considering for merging, since it divides the logic into smaller pieces that are easier to optimize.
The optimizations done in pv_beamline_peak are varied. Essentially, they allow it to run in two block dimensions: X for the event, Y for threads within an event. atomicAdds are required now. All the logic remains the same though, and as a precondition it requires the cluster edges to be populated.
If we want to keep the logic of pv_beamline_calculate_cluster_edges as is, I can think of a way to further optimize it, but it is not trivial. Here is what I wrote about it as a TODO. I estimate it would still take a day or two to properly test:
// Start from the end, work through the list until the beginning, loading 33 elements at a time.
// Broadcast the condition empty != prevempty to all other threads.
// * If there are more than two 1s: It is possible to collect the thresholds,
//   create the masks and do a sum with intrinsics as many times as needed (all threads know).
// * If there is a single 1, or for the first 1: Keep that condition and sum all the previous elements,
//   keeping it for the next iteration (it's like a carry).
// The carry is initialized to 0.f on the first iteration.
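To make the TODO a bit more concrete, here is a minimal CPU-side sketch in C++ of the chunked idea. It is not the code in this MR. The names Edge, edges_sequential and edges_chunked are hypothetical, the per-segment sum is a simplified stand-in for whatever the real kernel accumulates, and the sketch walks forward through the histogram rather than from the end. Each chunk covers 32 bins and reads one extra neighbouring bin to evaluate empty != prevempty, and the bitmask plays the role that a warp vote such as __ballot_sync would presumably play on the GPU.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type: the bin where empty != prevempty, plus the sum of
// all bins seen since the previous transition.
struct Edge {
  std::size_t bin;
  float integral;
};

inline bool operator==(const Edge& a, const Edge& b) {
  return a.bin == b.bin && a.integral == b.integral;
}

// Sequential reference: one pass with a running prevempty flag.
std::vector<Edge> edges_sequential(const std::vector<float>& h, float threshold) {
  std::vector<Edge> out;
  bool prevempty = true;
  float integral = 0.f;
  for (std::size_t i = 0; i < h.size(); ++i) {
    const bool empty = h[i] < threshold;
    if (empty != prevempty) {
      out.push_back({i, integral});
      integral = 0.f;
      prevempty = empty;
    }
    integral += h[i];
  }
  return out;
}

// Chunked variant: every "lane" evaluates the transition condition on its own
// from bin i and bin i - 1, so no flag has to be carried between lanes. Only
// the partial sum (the carry) crosses chunk boundaries.
std::vector<Edge> edges_chunked(const std::vector<float>& h, float threshold) {
  constexpr std::size_t lanes = 32;
  std::vector<Edge> out;
  float carry = 0.f; // initialized to 0.f on the first iteration
  for (std::size_t base = 0; base < h.size(); base += lanes) {
    const std::size_t n = std::min(lanes, h.size() - base);
    assert(n >= 1 && n <= lanes);

    // Emulated vote: bit k is set if bin base + k starts a new segment.
    std::uint32_t mask = 0;
    for (std::size_t lane = 0; lane < n; ++lane) {
      const std::size_t i = base + lane;
      const bool empty = h[i] < threshold;
      const bool prevempty = (i == 0) ? true : (h[i - 1] < threshold);
      if (empty != prevempty) mask |= (std::uint32_t{1} << lane);
    }

    // Consume the mask: close a segment at every set bit, otherwise keep
    // accumulating into the carry for the next bit or the next chunk.
    for (std::size_t lane = 0; lane < n; ++lane) {
      if (mask & (std::uint32_t{1} << lane)) {
        out.push_back({base + lane, carry});
        carry = 0.f;
      }
      carry += h[base + lane];
    }
  }
  return out;
}
```

Both functions add the bins in the same order, so their results can be compared exactly. On the GPU the second loop is where the "sum with intrinsics" from the TODO would go, and that is the part that would need careful testing.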
A different and much simpler optimization would be to reduce BeamlinePVConstants::Common::Nbins. @freiss I assume 3200 bins are required, or can this be reduced?
Edited by Daniel Hugo Campora Perez
mentioned in issue Moore#313 (closed)
mentioned in issue Moore#316 (closed)
mentioned in merge request !764 (merged)