Delayed selections
This MR implements delayed execution of the selection algorithms to improve their performance and scalability. The existing configurability is unchanged.
- Selections are executed later, in the Gather Selections algorithm, in a single kernel.
- Performance improves drastically. Up to 100 lines have been tested, with a performance impact of about 6% with respect to 1 line.
- All selection algorithm initializations have been moved to the kernel execution.
- All selection algorithm copies have been moved to GatherSelections.
- Separable compilation is now supported and enabled by default.
- An option to compile with / without separable compilation has been added. If separable compilation is disabled, a custom "unity" build is instantiated which joins all source files of the selections library.
- HIP does not support separable compilation at the moment, and hence must be compiled with separable compilation disabled (set at configuration time automatically).
- Code generation of a new file, ExternLines.cuh, is also necessary (unfortunately) to allow invoking a function defined in a separate compilation unit (see https://forums.developer.nvidia.com/t/consistency-of-functions-pointer/29325/6).
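The delayed-execution idea in the list above can be sketched on the host side as follows. This is only an illustration under assumptions, not the code in this MR: `Event`, `DelayedLine`, `high_pt_line`, `multiplicity_line` and `gather_selections` are hypothetical names, and the nested loop stands in for the single GPU kernel. Each line only records a function pointer and its parameter; the decisions are computed later, in one pass, and written into one contiguous buffer.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical event record; the real line inputs are not shown in this MR description.
struct Event {
  float pt;
  int n_tracks;
};

// A selection line reduced to a plain function pointer plus one parameter.
using LineFn = bool (*)(const Event&, float);

struct DelayedLine {
  LineFn fn;        // decision function, invoked later by the gathering step
  float threshold;  // parameter fixed at configuration time
};

// Two illustrative (hypothetical) line functions.
bool high_pt_line(const Event& e, float min_pt) { return e.pt > min_pt; }

bool multiplicity_line(const Event& e, float min_tracks)
{
  return static_cast<float>(e.n_tracks) >= min_tracks;
}

// Stand-in for the single gathering kernel: one pass evaluates every line on
// every event and writes the decisions in place into one contiguous,
// line-major buffer (index = line * n_events + event).
std::vector<char> gather_selections(
  const std::vector<DelayedLine>& lines,
  const std::vector<Event>& events)
{
  std::vector<char> decisions(lines.size() * events.size(), 0);
  for (std::size_t l = 0; l < lines.size(); ++l) {
    for (std::size_t e = 0; e < events.size(); ++e) {
      decisions[l * events.size() + e] = lines[l].fn(events[e], lines[l].threshold) ? 1 : 0;
    }
  }
  return decisions;
}
```

Calling `gather_selections` with a vector of `DelayedLine` entries and a vector of events returns one decision per (line, event) pair. On the GPU the function pointers would have to be device functions, which is presumably where the separable compilation and ExternLines.cuh items above come in.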
Done in collaboration with @ahennequ.
TODO:
- Optimize performance
- CPU compatibility
- Manage lifetime of objects used in selections
- Bring back monitoring functionality
- HIP build
- HIP runs
Performance of hlt1_pp_default:
Device-averaged speedup (% change): 1.0685950709991698 (6.859507099916984%)
NVIDIA RTX A5000 speedup (% change): 1.040242194593474 (4.024219459347389%)
NVIDIA RTX A6000 speedup (% change): 1.1261653368244064 (12.616533682440645%)
AMD EPYC 7502 32-Core speedup (% change): 0.9943645196516324 (-0.5635480348367583%)
NVIDIA GeForce RTX 2080 Ti speedup (% change): 1.0576478278517496 (5.76478278517496%)
NVIDIA GeForce RTX 3090 speedup (% change): 1.1245554760745868 (12.455547607458684%)
Edited by Daniel Hugo Campora Perez
Activity
added RTA label
- Resolved by Daniel Hugo Campora Perez
I like Easter eggs! Is this one addressing the scalability of selection lines?
added 1 commit
- f9969bd5 - Restore HLT1.py, unsuccessful attempt at making HIP work.
added 1 commit
- d4dbf321 - Created many hlt1 pp default sequences for tests.
added 11 commits
- 55634850 - Test 0.
- e25a0c8a - Fix some bugs.
- b95979ed - Remove unused arguments.
- fb699407 - Reduce size of parameters in lines.
- 331f0f14 - Refactored initialization.
- b9222e49 - Make all lines homogeneous.
- f84c050b - Add back inputs to lines that missed them.
- fc45a287 - Use proper bool statement.
- fe23045d - Write decisions in-place into contiguous memory
- ec07222c - Merge branch 'dcampora_delayed_fn_exec_lines_test_1' into dcampora_delayed_fn_exec_lines_test_0
- ee8c47b3 - All throughput improvements consolidated.
added 33 commits
- 4c655a11...009fc325 - 8 commits from branch master
- efb04c0b - First prototype (only CPU).
- 7bca17c3 - Attempt to use device functions defined on the host.
- 5080da1b - Test LTO in CUDA target.
- 8918f8db - Put back execution of lines.
- ec027691 - Put back line execution.
- 0cd233aa - Attempt to make single global kernel execute all lines.
- 061aae07 - Several attempts at making static pointer to CUDA function work.
- 75c73654 - Added better cuda separable compilation options.
- a2f4f0f0 - Fix CPU build.
- e5c1cdfd - Example using extern.
- e290c26f - First version running all lines.
- 2e9cbcc8 - Fixed formatting
- ff62b545 - Restore HLT1.py, unsuccessful attempt at making HIP work.
- f4b1cc35 - Created many hlt1 pp default sequences for tests.
- a9ccc94d - Test 0.
- d0e90ff3 - Fix some bugs.
- 190fd42f - Remove unused arguments.
- 25c2ded8 - Reduce size of parameters in lines.
- a176ca07 - Refactored initialization.
- 18b7bfbb - Make all lines homogeneous.
- 18fa9071 - Add back inputs to lines that missed them.
- f3084ed6 - Use proper bool statement.
- 9e70fb7d - Write decisions in-place into contiguous memory
- c6b5b3e3 - All throughput improvements consolidated.
- aaccffd0 - Restore configurations.
added 1 commit
- 18db7e58 - Add dependencies from selection algorithms to gather_selections. Add hack to...
added 1 commit
- 773f0d7a - Add MONITOR_SELECTIONS, init_monitor and monitor functionality.
added 1 commit
- 2e7bb1c8 - Extend option SEPARABLE_COMPILATION to other architectures. Bugfixes.
added 1 commit
- 0ab44bcf - Fix bug in Line.cuh process_line about indices.
added hlt1-throughput-decreased label
added 1 commit
- 78b75446 - Remove one more parameter, update documentation.
added 1 commit
- da4d6f0f - Fix bug introduced with host_decisions_sizes.
added 2 commits
removed hlt1-throughput-decreased label
added enhancement label
added 62 commits
- f9ccb09f...0cb3d9d3 - 14 commits from branch master
- 789d6eed - First prototype (only CPU).
- c06cef41 - Attempt to use device functions defined on the host.
- 5dbf33d3 - Test LTO in CUDA target.
- 340349ae - Put back execution of lines.
- 7538de9f - Put back line execution.
- de5fac20 - Attempt to make single global kernel execute all lines.
- 239956e9 - Several attempts at making static pointer to CUDA function work.
- 2353738a - Added better cuda separable compilation options.
- a40d5697 - Fix CPU build.
- 5e326108 - Example using extern.
- 0012a491 - First version running all lines.
- a4124c1a - Fixed formatting
- eb8b927a - Restore HLT1.py, unsuccessful attempt at making HIP work.
- fb982bda - Created many hlt1 pp default sequences for tests.
- 49bf1e9e - Test 0.
- d57b43a3 - Fix some bugs.
- 8801185c - Remove unused arguments.
- a212e84d - Reduce size of parameters in lines.
- 6e293227 - Refactored initialization.
- 529aacbb - Make all lines homogeneous.
- 1dd6719d - Add back inputs to lines that missed them.
- 3c335d18 - Use proper bool statement.
- 4b41e271 - Write decisions in-place into contiguous memory
- cc68f0fd - All throughput improvements consolidated.
- 6c8bfd25 - Restore configurations.
- f9cbe21e - Fix compilation after rebase.
- cc2af723 - Remove temporarily out deps.
- 82b10b95 - Add dependencies from selection algorithms to gather_selections. Add hack to...
- 113b7078 - Remove test hlt1 pp default seqs.
- 5dbb9d57 - Add MONITOR_SELECTIONS, init_monitor and monitor functionality.
- 554bcda0 - Steps toward hip building
- accf1d15 - Trying to use properties of derived_instance calls.
- 13c1e78a - Added monitoring functionality.
- 55df6974 - Attempt at making HIP work.
- fccd04e3 - Made HIP build unified builds.
- 307d0882 - Fixed formatting
- dc5f2311 - Extend option SEPARABLE_COMPILATION to other architectures. Bugfixes.
- ba5b814e - Better command to create unified filename.
- 6c089925 - Fix bug in Line.cuh process_line about indices.
- 6bda1606 - Make HIP work somehow.
- a4ae6f0d - Fixed formatting
- 029be560 - Remove one more parameter, update documentation.
- d2e308b2 - Fix TwoTrackMVALine.
- 6ee4f49f - Fixed formatting
- 8dd7e493 - Fix bug introduced with host_decisions_sizes.
- edcdd25e - Fixed formatting
- 1f56f383 - More sane setting for HIP.
- 37746fee - Adapt line to new selection model.
mentioned in issue Moore#426 (closed)
added 62 commits
- 37746fee...996ceb32 - 13 commits from branch master
- 10772704 - First prototype (only CPU).
- 55ec7242 - Attempt to use device functions defined on the host.
- 3ae7e285 - Test LTO in CUDA target.
- 4b4d55d2 - Put back execution of lines.
- 3e261a63 - Put back line execution.
- 6fed69c5 - Attempt to make single global kernel execute all lines.
- 23c36bd3 - Several attempts at making static pointer to CUDA function work.
- 49cee189 - Added better cuda separable compilation options.
- f837ae75 - Fix CPU build.
- ae815a24 - Example using extern.
- c874fe60 - First version running all lines.
- 0592cd61 - Fixed formatting
- d48285fd - Restore HLT1.py, unsuccessful attempt at making HIP work.
- 87406c45 - Created many hlt1 pp default sequences for tests.
- 8c0befb8 - Test 0.
- 58ba0e2f - Fix some bugs.
- a4c3a7f2 - Remove unused arguments.
- ae90117b - Reduce size of parameters in lines.
- 4786e703 - Refactored initialization.
- 0f39ca10 - Make all lines homogeneous.
- 7068abf3 - Add back inputs to lines that missed them.
- 42190137 - Use proper bool statement.
- 23bed3a3 - Write decisions in-place into contiguous memory
- f7d61d22 - All throughput improvements consolidated.
- 8611a492 - Restore configurations.
- 2ed79102 - Fix compilation after rebase.
- 22a48599 - Remove temporarily out deps.
- 679e4d49 - Add dependencies from selection algorithms to gather_selections. Add hack to...
- ae806526 - Remove test hlt1 pp default seqs.
- 288dd974 - Add MONITOR_SELECTIONS, init_monitor and monitor functionality.
- 1e21a439 - Steps toward hip building
- d6fadce2 - Trying to use properties of derived_instance calls.
- a5a3a220 - Added monitoring functionality.
- a1aa9b65 - Attempt at making HIP work.
- 3fc5d7d1 - Made HIP build unified builds.
- 261d5f06 - Fixed formatting
- 98fa34ef - Extend option SEPARABLE_COMPILATION to other architectures. Bugfixes.
- fd3523ca - Better command to create unified filename.
- 43c78c97 - Fix bug in Line.cuh process_line about indices.
- 9c0b173b - Make HIP work somehow.
- be36b417 - Fixed formatting
- c233f963 - Remove one more parameter, update documentation.
- b5039508 - Fix TwoTrackMVALine.
- f911bb1f - Fixed formatting
- 4f8e840f - Fix bug introduced with host_decisions_sizes.
- 6c15f857 - Fixed formatting
- 292c6fc1 - More sane setting for HIP.
- a5bec0d8 - Adapt line to new selection model.
- 8eb08af2 - Print using MB instead of MiB. Fix bug of not populating always particle...
added hlt1-throughput-decreased label