Profile using ncu instead of nsys. Use custom metric. Keep artifacts of profile.

Daniel Campora Perez requested to merge dcampora_update_profiling into master

This MR switches the performance evaluation in Allen to a different profiler and a different metric.

Until now, we were using nsys profile (and previously nvprof). This is typically good enough if one is interested in how long each individual kernel takes to execute. This latency-driven approach is however irrelevant for Allen, where we execute the same kernels over and over and are instead interested in how many resources each kernel uses overall. This is similar to the throughput vs latency consideration repeatedly brought up in DAQ.

I did not find a single metric that does exactly what we want out of the box, but by combining SM Active Cycles and SM [%] one can obtain numbers that look much more like what we want. These metrics are described as:

  • SM Active Cycles: # of cycles with at least one warp in flight.
  • SM [%]: SM throughput assuming ideal load balancing across SMSPs.
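
As a rough illustration of how the two metrics could be combined, the sketch below multiplies SM Active Cycles by SM [%] for each kernel and normalizes the result to a percentage of the whole sequence. The exact formula is an assumption made for illustration and is not necessarily the one used in this MR; the function name `kernel_shares`, the kernel names and the input numbers are all hypothetical placeholders, not measurements.

```python
from typing import Dict, Tuple


def kernel_shares(metrics: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Return each kernel's share of the sequence, in percent.

    `metrics` maps a kernel name to (sm_active_cycles, sm_throughput_pct).
    ASSUMPTION: the per-kernel weight is taken as the product
    active_cycles * (sm_pct / 100); the MR only says the two are combined.
    """
    raw = {
        name: cycles * pct / 100.0
        for name, (cycles, pct) in metrics.items()
    }
    total = sum(raw.values())
    if total == 0:
        # Nothing measured: avoid dividing by zero.
        return {name: 0.0 for name in raw}
    return {name: 100.0 * weight / total for name, weight in raw.items()}


# Placeholder inputs, chosen only to make the arithmetic easy to follow.
example = {
    "kernel_a": (1000.0, 50.0),
    "kernel_b": (3000.0, 50.0),
}
shares = kernel_shares(example)

# Print the kernels sorted by decreasing share.
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:<12} {share:6.2f} %")
```

With these placeholder inputs both kernels have the same SM [%], so the shares follow the active cycles alone.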

The numbers obtained can be checked by running the kernel of a particular algorithm an additional time and looking at the impact this has on throughput.
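
The arithmetic behind such a check can be sketched as follows. It assumes that throughput is inversely proportional to the total resource usage of the sequence, so that running a kernel with share f one extra time scales throughput by 1 / (1 + f). That model, the function names and the numbers are assumptions for illustration only.

```python
def expected_throughput(baseline: float, share_pct: float, extra_runs: int = 1) -> float:
    """Throughput expected after running one kernel `extra_runs` more times.

    ASSUMPTION: throughput scales as 1 / (total cost), and the kernel
    accounts for `share_pct` percent of the baseline cost.
    """
    fraction = share_pct / 100.0
    return baseline / (1.0 + extra_runs * fraction)


def implied_share(baseline: float, measured: float, extra_runs: int = 1) -> float:
    """Invert the model: share (in percent) implied by a measured throughput drop."""
    return 100.0 * (baseline / measured - 1.0) / extra_runs


# Placeholder numbers in arbitrary units, not measurements.
baseline = 100.0
predicted = expected_throughput(baseline, share_pct=25.0)
recovered = implied_share(baseline, predicted)
print(f"predicted throughput: {predicted:.1f}, implied share: {recovered:.1f} %")
```

Comparing the share implied by the measured throughput drop with the share reported by the profiler gives a quick consistency check of the metric.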

Thus, this MR concretely does the following:

Breakdown of the sequence following this convention:

velo_estimate_input_size_kernel            │████████████████████████████████████████████     14.75 %
lf_triplet_seeding                         │███████████████████████████████████████          13.04 %
velo_search_by_triplet                     │█████████████████████████████████                11.07 %
velo_sort_by_phi                           │█████████████████████████                        8.42 %
ut_find_permutation                        │████████████████                                 5.39 %
velo_masked_clustering_kernel              │███████████████                                  5.05 %
ut_search_windows                          │███████████                                      3.84 %
muon_populate_hits                         │████████                                         2.74 %
scifi_direct_decoder_v4                    │████████                                         2.73 %
is_muon                                    │████████                                         2.68 %
lf_extend_tracks                           │███████                                          2.65 %
pv_beamline_multi_fitter                   │███████                                          2.62 %
pv_beamline_extrapolate                    │███████                                          2.40 %
two_track_evaluator                        │██████                                           2.30 %
compass_ut                                 │████                                             1.66 %
lf_search_initial_windows                  │████                                             1.64 %
lf_triplet_keep_best                       │████                                             1.60 %
ut_pre_decode                              │████                                             1.45 %
fit_secondary_vertices                     │███                                              1.24 %
ut_calculate_number_of_hits                │███                                              1.07 %
velo_consolidate_tracks                    │███                                              1.05 %
ut_decode_raw_banks_in_order               │███                                              1.04 %
scifi_raw_bank_decoder_v4                  │██                                               0.92 %
velo_kalman_filter                         │██                                               0.89 %
scifi_pre_decode_v4                        │██                                               0.86 %
scifi_calculate_cluster_count_v4           │██                                               0.77 %
muon_add_coords_crossing_maps              │█                                                0.67 %
lf_quality_filter                          │█                                                0.62 %
lf_quality_filter_length                   │█                                                0.61 %
muon_populate_tile_and_tdc                 │█                                                0.55 %
scifi_consolidate_tracks                   │█                                                0.54 %
pv_beamline_histo                          │█                                                0.35 %
muon_calculate_srq_size                    │█                                                0.34 %
filter_tracks                              │▌                                                0.25 %
ut_consolidate_tracks                      │▌                                                0.22 %
lf_calculate_parametrization               │▌                                                0.21 %
velo_pv_ip                                 │▌                                                0.18 %
kalman_velo_only                           │▌                                                0.16 %
ut_copy_track_hit_number                   │▌                                                0.14 %
ut_select_velo_tracks                      │▌                                                0.13 %
kalman_pv_ipchi2                           │▌                                                0.11 %
pv_beamline_calculate_denom                │▌                                                0.11 %
ut_select_velo_tracks_with_windows         │▌                                                0.10 %
scifi_copy_track_hit_number                │▌                                                0.10 %
velo_copy_track_hit_number                 │▌                                                0.10 %
velo_three_hit_tracks_filter               │▌                                                0.07 %
two_track_catboost_line_t                  │▌                                                0.05 %
d2kpi_line_t                               │▌                                                0.05 %
two_track_mva_line_t                       │▌                                                0.04 %
kstopipi_line_t                            │▌                                                0.04 %
d2kk_line_t                                │▌                                                0.04 %
d2pipi_line_t                              │▌                                                0.04 %
di_muon_soft_line_t                        │▌                                                0.04 %
low_pt_di_muon_line_t                      │▌                                                0.04 %
di_muon_mass_line_t                        │▌                                                0.04 %
postscaler                                 │▌                                                0.04 %
pv_beamline_cleanup                        │▌                                                0.03 %
pv_beamline_peak                           │▌                                                0.03 %
two_track_preprocess                       │▌                                                0.03 %
dec_reporter                               │▌                                                0.03 %
track_mva_line_t                           │▌                                                0.02 %
single_high_pt_muon_line_t                 │▌                                                0.02 %
track_muon_mva_line_t                      │▌                                                0.02 %
low_pt_muon_line_t                         │▌                                                0.01 %
global_decision                            │▌                                                0.00 %
velo_calculate_number_of_candidates_kernel │▌                                                0.00 %
beam_crossing_line_t                       │▌                                                0.00 %
odin_event_type_line_t                     │▌                                                0.00 %
velo_micro_bias_line_t                     │▌                                                0.00 %
passthrough_line_t                         │▌                                                0.00 %
                                           ┼──┴──┼──┴──┼──┴──┼──┴──┼──┴──┼──┴──┼──┴──┼──┴──┼
                                           0     2     4     6     8     10    12    14    16  
Edited by Daniel Campora Perez