Profile using ncu instead of nsys. Use custom metric. Keep artifacts of profile.
This MR switches to a different profiler and metric for evaluating performance in Allen.
Until now, we were using nsys profile (and previously nvprof). This is typically good enough if one is interested in how much time each individual kernel takes to execute. This latency-driven approach is, however, irrelevant for Allen, where we execute the same kernels over and over; instead, we are interested in how many resources each kernel uses overall. This is similar to the throughput vs latency consideration repeatedly brought up in DAQ.
While a priori I did not find a metric that would do exactly what we want out of the box, by combining SM Active Cycles and SM [%] one can obtain numbers that look much more like what we want. These metrics are described as:
- SM Active Cycles: number of cycles with at least one warp in flight.
- SM [%]: SM throughput assuming ideal load balancing across SMSPs.
The numbers obtained can be tested by rerunning the kernel of a particular algorithm and checking that the impact on throughput matches the share reported for that kernel.
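One way to read this check is sketched below in Python, under a simplifying assumption that is not stated in this MR: if a kernel accounts for a share f of the total weighted work, and throughput is inversely proportional to that total, then executing the kernel one extra time should scale throughput by roughly 1 / (1 + f). The function name `expected_throughput_ratio` is hypothetical; the 14.75 % input is the share reported for velo_estimate_input_size_kernel in the breakdown of this MR.

```python
def expected_throughput_ratio(share_pct, extra_runs=1):
    """Estimate new_throughput / old_throughput after rerunning one kernel.

    Assumption (not from the MR): throughput is inversely proportional to
    the total weighted work, and the kernel's share of that work is share_pct.
    """
    if share_pct < 0 or extra_runs < 0:
        raise ValueError("share_pct and extra_runs must be non-negative")
    # Each extra run adds share_pct / 100 of the original total work.
    return 1.0 / (1.0 + extra_runs * share_pct / 100.0)


# Share reported for velo_estimate_input_size_kernel in this MR.
ratio = expected_throughput_ratio(14.75)
print(f"expected throughput ratio: {ratio:.3f}")
```

A measured throughput ratio far from this estimate would suggest that the metric over- or underestimates the kernel's share, or that the proportionality assumption does not hold.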
Thus, this MR concretely does the following:
- It uses ncu instead of nsys to perform the profiling on the selected CUDA architecture.
- The profile file is kept as an artifact. It can be downloaded and opened with Nsight Compute for further analysis (e.g. some suggested material on this: https://indico.cern.ch/event/962112/contributions/4110591/attachments/2159863/3643851/CERN_Nsight_Compute.pdf).
- The metric used in the reported Performance breakdown is now SM Active Cycles * SM [%].
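As an illustration of this convention, the following is a minimal Python sketch of how a per-kernel breakdown could be derived. It assumes the two metric values have already been exported from the ncu report as (kernel name, SM Active Cycles, SM [%]) tuples. The function name `breakdown`, the tuple layout and the sample numbers are hypothetical and not taken from Allen's scripts.

```python
from collections import defaultdict


def breakdown(records):
    """Return {kernel: share in percent}, sorted by decreasing share.

    records: iterable of (kernel_name, sm_active_cycles, sm_pct) tuples,
    one per profiled kernel launch (hypothetical layout).
    """
    weights = defaultdict(float)
    for name, active_cycles, sm_pct in records:
        # Metric from this MR: SM Active Cycles * SM [%].
        # Launches of the same kernel are summed together.
        weights[name] += active_cycles * sm_pct
    total = sum(weights.values())
    if total == 0:
        return {}
    shares = {name: 100.0 * w / total for name, w in weights.items()}
    return dict(sorted(shares.items(), key=lambda kv: kv[1], reverse=True))


# Made-up sample values, for illustration only.
sample = [
    ("kernel_a", 1000.0, 50.0),
    ("kernel_b", 3000.0, 50.0),
    ("kernel_a", 1000.0, 50.0),
]
result = breakdown(sample)
for name, share in result.items():
    print(f"{name:<12} {share:6.2f} %")
```

Normalising by the sum over all kernels means the shares add up to 100 %, so any constant factor in SM [%] (percent vs fraction) cancels out.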
Breakdown of the sequence following this convention:
velo_estimate_input_size_kernel │████████████████████████████████████████████ 14.75 %
lf_triplet_seeding │███████████████████████████████████████ 13.04 %
velo_search_by_triplet │█████████████████████████████████ 11.07 %
velo_sort_by_phi │█████████████████████████ 8.42 %
ut_find_permutation │████████████████ 5.39 %
velo_masked_clustering_kernel │███████████████ 5.05 %
ut_search_windows │███████████ 3.84 %
muon_populate_hits │████████ 2.74 %
scifi_direct_decoder_v4 │████████ 2.73 %
is_muon │████████ 2.68 %
lf_extend_tracks │███████ 2.65 %
pv_beamline_multi_fitter │███████ 2.62 %
pv_beamline_extrapolate │███████ 2.40 %
two_track_evaluator │██████ 2.30 %
compass_ut │████ 1.66 %
lf_search_initial_windows │████ 1.64 %
lf_triplet_keep_best │████ 1.60 %
ut_pre_decode │████ 1.45 %
fit_secondary_vertices │███ 1.24 %
ut_calculate_number_of_hits │███ 1.07 %
velo_consolidate_tracks │███ 1.05 %
ut_decode_raw_banks_in_order │███ 1.04 %
scifi_raw_bank_decoder_v4 │██ 0.92 %
velo_kalman_filter │██ 0.89 %
scifi_pre_decode_v4 │██ 0.86 %
scifi_calculate_cluster_count_v4 │██ 0.77 %
muon_add_coords_crossing_maps │█ 0.67 %
lf_quality_filter │█ 0.62 %
lf_quality_filter_length │█ 0.61 %
muon_populate_tile_and_tdc │█ 0.55 %
scifi_consolidate_tracks │█ 0.54 %
pv_beamline_histo │█ 0.35 %
muon_calculate_srq_size │█ 0.34 %
filter_tracks │▌ 0.25 %
ut_consolidate_tracks │▌ 0.22 %
lf_calculate_parametrization │▌ 0.21 %
velo_pv_ip │▌ 0.18 %
kalman_velo_only │▌ 0.16 %
ut_copy_track_hit_number │▌ 0.14 %
ut_select_velo_tracks │▌ 0.13 %
kalman_pv_ipchi2 │▌ 0.11 %
pv_beamline_calculate_denom │▌ 0.11 %
ut_select_velo_tracks_with_windows │▌ 0.10 %
scifi_copy_track_hit_number │▌ 0.10 %
velo_copy_track_hit_number │▌ 0.10 %
velo_three_hit_tracks_filter │▌ 0.07 %
two_track_catboost_line_t │▌ 0.05 %
d2kpi_line_t │▌ 0.05 %
two_track_mva_line_t │▌ 0.04 %
kstopipi_line_t │▌ 0.04 %
d2kk_line_t │▌ 0.04 %
d2pipi_line_t │▌ 0.04 %
di_muon_soft_line_t │▌ 0.04 %
low_pt_di_muon_line_t │▌ 0.04 %
di_muon_mass_line_t │▌ 0.04 %
postscaler │▌ 0.04 %
pv_beamline_cleanup │▌ 0.03 %
pv_beamline_peak │▌ 0.03 %
two_track_preprocess │▌ 0.03 %
dec_reporter │▌ 0.03 %
track_mva_line_t │▌ 0.02 %
single_high_pt_muon_line_t │▌ 0.02 %
track_muon_mva_line_t │▌ 0.02 %
low_pt_muon_line_t │▌ 0.01 %
global_decision │▌ 0.00 %
velo_calculate_number_of_candidates_kernel │▌ 0.00 %
beam_crossing_line_t │▌ 0.00 %
odin_event_type_line_t │▌ 0.00 %
velo_micro_bias_line_t │▌ 0.00 %
passthrough_line_t │▌ 0.00 %
┼──┴──┼──┴──┼──┴──┼──┴──┼──┴──┼──┴──┼──┴──┼──┴──┼
0 2 4 6 8 10 12 14 16