Profile using ncu instead of nsys. Use custom metric. Keep artifacts of profile. (!636) · Merge requests · LHCb / Allen

Daniel Hugo Campora Perez requested to merge dcampora_update_profiling into master Aug 27, 2021

This MR switches to a different profiler and a different metric for evaluating performance in Allen.

Until now, we were using nsys profile (and previously nvprof). This is typically good enough if one is interested in how long each individual kernel takes to execute. This latency-driven approach is, however, not the relevant one for Allen, where we execute the same kernels over and over and are instead interested in how much of the GPU's resources each kernel uses overall. This is similar to the throughput vs. latency consideration repeatedly brought up in DAQ.

I did not find a metric that does exactly what we want out of the box, but by combining SM Active Cycles and SM [%] one obtains numbers that are much closer to what we want. These metrics are described as:

SM Active Cycles: # of cycles with at least one warp in flight.
SM [%]: SM throughput assuming ideal load balancing across SMSPs.

The numbers obtained can be checked by running the kernel of a particular algorithm an additional time and looking at the impact this has on throughput.
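As a rough illustration of that check, the sketch below estimates the throughput one would expect after running a kernel additional times. It assumes that a kernel's share in the breakdown translates directly into its share of the total cost of the sequence, which is exactly the hypothesis the check is meant to test. The function name and the 10 % share are hypothetical and not part of this MR.

```python
def expected_throughput_ratio(share_percent, extra_runs=1):
    """Estimate new_throughput / old_throughput after running one kernel
    `extra_runs` additional times.

    Assumption (not from the MR): the kernel's share of the breakdown equals
    its share of the total cost, and throughput is inversely proportional
    to that total cost.
    """
    if share_percent < 0 or extra_runs < 0:
        raise ValueError("share_percent and extra_runs must be non-negative")
    share = share_percent / 100.0
    # The total cost grows from 1 to 1 + extra_runs * share.
    return 1.0 / (1.0 + extra_runs * share)


# Hypothetical kernel reported at 10 % of the breakdown, run one extra time.
ratio = expected_throughput_ratio(10.0)
print(f"expected throughput ratio: {ratio:.3f}")
```

If the measured throughput drops by a clearly different factor than this estimate, the reported share is a poor proxy for the cost of that kernel.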

Thus, this MR concretely does the following:

It uses ncu instead of nsys to perform the profiling on the selected CUDA architecture.
The profile file is kept as an artifact. It can be downloaded and opened with Nsight Compute for further analysis (for suggested material on how to do this, see e.g. https://indico.cern.ch/event/962112/contributions/4110591/attachments/2159863/3643851/CERN_Nsight_Compute.pdf).
The metric used in the reported Performance breakdown is now SM Active Cycles * SM [%].
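A minimal sketch of how such a breakdown could be computed from per-launch values of the two metrics is shown below. It is not the script used in this MR: the input format, the function name, and the choice to sum the products over repeated launches of the same kernel are all assumptions made for illustration.

```python
from collections import defaultdict


def performance_breakdown(samples):
    """Turn per-launch metrics into a percentage breakdown per kernel.

    `samples` is an iterable of (kernel_name, sm_active_cycles, sm_percent)
    tuples, one per kernel launch (hypothetical input format).
    Returns a list of (kernel_name, share_percent), largest share first.
    """
    weights = defaultdict(float)
    for name, active_cycles, sm_percent in samples:
        # Weight of one launch: SM Active Cycles * SM [%].
        # Assumption: launches of the same kernel are summed.
        weights[name] += active_cycles * sm_percent / 100.0

    total = sum(weights.values())
    if total == 0:
        return [(name, 0.0) for name in weights]

    shares = ((name, 100.0 * w / total) for name, w in weights.items())
    return sorted(shares, key=lambda item: item[1], reverse=True)


# Hypothetical numbers, only to show the shape of the computation.
example = [
    ("kernel_a", 1000, 50.0),
    ("kernel_b", 3000, 50.0),
    ("kernel_a", 1000, 50.0),
]
for name, share in performance_breakdown(example):
    print(f"{name:<12} {share:6.2f} %")
```

Since only the relative shares are reported, the constant factor of 100 in the weight does not affect the result.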

Breakdown of the sequence following this convention:

velo_estimate_input_size_kernel            │████████████████████████████████████████████     14.75 %
lf_triplet_seeding                         │███████████████████████████████████████          13.04 %
velo_search_by_triplet                     │█████████████████████████████████                11.07 %
velo_sort_by_phi                           │█████████████████████████                        8.42 %
ut_find_permutation                        │████████████████                                 5.39 %
velo_masked_clustering_kernel              │███████████████                                  5.05 %
ut_search_windows                          │███████████                                      3.84 %
muon_populate_hits                         │████████                                         2.74 %
scifi_direct_decoder_v4                    │████████                                         2.73 %
is_muon                                    │████████                                         2.68 %
lf_extend_tracks                           │███████                                          2.65 %
pv_beamline_multi_fitter                   │███████                                          2.62 %
pv_beamline_extrapolate                    │███████                                          2.40 %
two_track_evaluator                        │██████                                           2.30 %
compass_ut                                 │████                                             1.66 %
lf_search_initial_windows                  │████                                             1.64 %
lf_triplet_keep_best                       │████                                             1.60 %
ut_pre_decode                              │████                                             1.45 %
fit_secondary_vertices                     │███                                              1.24 %
ut_calculate_number_of_hits                │███                                              1.07 %
velo_consolidate_tracks                    │███                                              1.05 %
ut_decode_raw_banks_in_order               │███                                              1.04 %
scifi_raw_bank_decoder_v4                  │██                                               0.92 %
velo_kalman_filter                         │██                                               0.89 %
scifi_pre_decode_v4                        │██                                               0.86 %
scifi_calculate_cluster_count_v4           │██                                               0.77 %
muon_add_coords_crossing_maps              │█                                                0.67 %
lf_quality_filter                          │█                                                0.62 %
lf_quality_filter_length                   │█                                                0.61 %
muon_populate_tile_and_tdc                 │█                                                0.55 %
scifi_consolidate_tracks                   │█                                                0.54 %
pv_beamline_histo                          │█                                                0.35 %
muon_calculate_srq_size                    │█                                                0.34 %
filter_tracks                              │▌                                                0.25 %
ut_consolidate_tracks                      │▌                                                0.22 %
lf_calculate_parametrization               │▌                                                0.21 %
velo_pv_ip                                 │▌                                                0.18 %
kalman_velo_only                           │▌                                                0.16 %
ut_copy_track_hit_number                   │▌                                                0.14 %
ut_select_velo_tracks                      │▌                                                0.13 %
kalman_pv_ipchi2                           │▌                                                0.11 %
pv_beamline_calculate_denom                │▌                                                0.11 %
ut_select_velo_tracks_with_windows         │▌                                                0.10 %
scifi_copy_track_hit_number                │▌                                                0.10 %
velo_copy_track_hit_number                 │▌                                                0.10 %
velo_three_hit_tracks_filter               │▌                                                0.07 %
two_track_catboost_line_t                  │▌                                                0.05 %
d2kpi_line_t                               │▌                                                0.05 %
two_track_mva_line_t                       │▌                                                0.04 %
kstopipi_line_t                            │▌                                                0.04 %
d2kk_line_t                                │▌                                                0.04 %
d2pipi_line_t                              │▌                                                0.04 %
di_muon_soft_line_t                        │▌                                                0.04 %
low_pt_di_muon_line_t                      │▌                                                0.04 %
di_muon_mass_line_t                        │▌                                                0.04 %
postscaler                                 │▌                                                0.04 %
pv_beamline_cleanup                        │▌                                                0.03 %
pv_beamline_peak                           │▌                                                0.03 %
two_track_preprocess                       │▌                                                0.03 %
dec_reporter                               │▌                                                0.03 %
track_mva_line_t                           │▌                                                0.02 %
single_high_pt_muon_line_t                 │▌                                                0.02 %
track_muon_mva_line_t                      │▌                                                0.02 %
low_pt_muon_line_t                         │▌                                                0.01 %
global_decision                            │▌                                                0.00 %
velo_calculate_number_of_candidates_kernel │▌                                                0.00 %
beam_crossing_line_t                       │▌                                                0.00 %
odin_event_type_line_t                     │▌                                                0.00 %
velo_micro_bias_line_t                     │▌                                                0.00 %
passthrough_line_t                         │▌                                                0.00 %
                                           ┼──┴──┼──┴──┼──┴──┼──┴──┼──┴──┼──┴──┼──┴──┼──┴──┼
                                           0     2     4     6     8     10    12    14    16

Edited Aug 28, 2021 by Daniel Hugo Campora Perez

