Throughput decrease checker; submit build $OPTIONS to grafana

Merged Ryunosuke O'Neil requested to merge roneil/throughput-alarm into master

Fail publish throughput jobs and add a hlt1-throughput-decreased label if either of the following holds:

  • The device-averaged throughput change falls below a threshold (set by a variable in .gitlab-ci.yml, currently -2.5%)
  • The throughput change for any single device falls below a looser threshold of -7.5%

The percentage change in throughput is calculated from the speedup using this formula:

        change = (speedup - 1.0) * 100.0

i.e. a speedup of 0.96x translates to a -4% change.
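
As an illustration only, the check described above could be sketched in Python as follows. The function names, the constant names, the device labels, and the strict `<` comparison are assumptions made for this example and are not taken from the merge request; the speedup is assumed to be the ratio of new to reference throughput.

```python
from statistics import mean

# Thresholds quoted in the description above; the constant names are hypothetical.
AVG_THRESHOLD_PCT = -2.5      # limit on the device-averaged change
DEVICE_THRESHOLD_PCT = -7.5   # looser limit applied to each device


def percent_change(speedup: float) -> float:
    """Convert a speedup factor into a percentage change."""
    return (speedup - 1.0) * 100.0


def throughput_decreased(speedups: dict[str, float]) -> bool:
    """Return True if the average or any single device breaches its threshold."""
    changes = [percent_change(s) for s in speedups.values()]
    average_change = mean(changes)
    worst_change = min(changes)
    # Assumption: "below" is read as a strict comparison.
    return (average_change < AVG_THRESHOLD_PCT
            or worst_change < DEVICE_THRESHOLD_PCT)


if __name__ == "__main__":
    # Hypothetical device labels and speedups, for illustration only.
    example = {"gpu_a": 0.96, "gpu_b": 0.97}
    print(throughput_decreased(example))
```

Because the formula is linear, averaging the per-device changes gives the same result as converting the averaged speedup.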

cc @dovombru

Edited by Ryunosuke O'Neil

Merged by Rosen Matev 3 years ago (Dec 9, 2021 8:23am UTC)

Activity

    • Resolved by Dorothea Vom Bruch

      I remember we already had a discussion as to whether individual GPU throughput should also be checked. Was there a reason not to check for that in addition to the average? I think this could be useful, as sometimes a decrease is more pronounced in some architectures / GPU types.

  • Ryunosuke O'Neil added 2 commits

  • Ryunosuke O'Neil resolved all threads

  • added 1 commit

  • added 1 commit

    • d73e4019 - make threshold less sensitive at -7.5%. Catch throughput alarm properly

  • added 1 commit

  • Ryunosuke O'Neil marked this merge request as ready

  • Ryunosuke O'Neil added 3 commits

    • cbd9a6df - print error properly
    • 8a49976f - try not to print everything to make the log a bit more readable
    • ac92a420 - Improved logging output

  • Dorothea Vom Bruch resolved all threads

    • Resolved by Rosen Matev

      Thanks for adding this! The warning in the Mattermost channel is good. But what exactly do the percentages mean? "Device averaged speed-up (% change): 0.99 (0.78%)". I.e. what is the 0.99 and what is the 0.78?

      Could we also add a warning on the MR itself? This will make it easier for the shifter and maintainer to spot the decrease. I think in Moore an automatic label is added to the MR if the throughput decreases. Maybe we can do something similar?

  • assigned to @lpica

  • Ryunosuke O'Neil added 2 commits

    • 2a9ca9fd - grafana: also submit build options
    • ca60efbe - send "default" for empty buildopts string

  • Ryunosuke O'Neil changed title from Average throughput decrease checker to Average throughput decrease checker; submit build $OPTIONS to grafana

  • added 1 commit

    • f1cf6459 - work around unbound variable error
