Make TopMatrixImpl iterative and move to header
This MR turns the implementation of TopMatrixImpl from a recursion into an iteration. This reduces the register footprint of the function in CUDA a fair bit.
However, I have also observed that, without several optimizations from other MRs, this implementation might be slightly slower. It might also be worthwhile to compare both implementations on the CPU.
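To illustrate the kind of change described here, the following is a minimal sketch of a recursion-to-iteration rewrite. It is not the actual TopMatrixImpl code: `Affine`, `Node`, `compose`, `topTransformRecursive` and `topTransformIterative` are hypothetical names, and a 1D affine map stands in for whatever matrix type the real function accumulates. The sketch assumes the function walks from a node up to the root and multiplies the transformations along the way.

```cpp
#include <vector>

// Hypothetical stand-in for a transformation matrix: x -> scale * x + offset.
struct Affine {
  double scale;
  double offset;
};

// Compose two maps so that the result applies `inner` first, then `outer`.
inline Affine compose(const Affine& outer, const Affine& inner) {
  return Affine{outer.scale * inner.scale,
                outer.scale * inner.offset + outer.offset};
}

// Hypothetical tree node: stores its parent index and its local transform.
struct Node {
  int parent;    // index of the parent node, -1 for the root
  Affine local;  // transform relative to the parent
};

// Recursive form: one call per level between node i and the root.
Affine topTransformRecursive(const std::vector<Node>& nodes, int i) {
  if (nodes[i].parent < 0) return nodes[i].local;
  return compose(topTransformRecursive(nodes, nodes[i].parent), nodes[i].local);
}

// Iterative form: walk up the parent chain and fold each ancestor's
// transform onto the left of the running result. No call stack is needed.
Affine topTransformIterative(const std::vector<Node>& nodes, int i) {
  Affine result = nodes[i].local;
  for (int p = nodes[i].parent; p >= 0; p = nodes[p].parent) {
    result = compose(nodes[p].local, result);
  }
  return result;
}
```

Both forms yield the same map because composition is associative; they differ only in the order in which the products are formed. Whether the iterative form is faster is exactly what the benchmarks in this MR are meant to establish.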
- Depends on !900 (merged)
- Benchmark this MR on GPU
- Benchmark this MR on CPU
Edited by Bernhard Manfred Gruber