Metrics: Add ability to only declare warning/error if relevant condition satisfied for multiple consecutive updates of metric value
From discussion with Karol in !50 (merged)
Implementation:
- Add a member function to the `Metric` class, which allows people to specify that warnings/errors should only be issued for that metric if the error thresholds are met for at least N seconds
- Update the `Metric` & `MetricSnapshot` classes to store the values from the last N seconds (rather than only keeping the most recent values)
- Update the `getStatus` methods of those classes to only return error/warning if all of the values from the last N seconds meet the error/warning condition
At the same time, should update SWATCH to automatically remove previous metric values at the end of FSM transitions, and ensure that the `getStatus` methods can't return warning/error for the first N seconds after the end of a transition.
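For discussion, a minimal sketch of the time-window proposal above. All names here (`WindowedMetric`, `Status`, `setValue`, `reset`) are hypothetical and not the existing SWATCH API. The sketch assumes "value above threshold" is the error/warning condition, that the metric is updated regularly within the window, and that timestamps are passed in explicitly to keep it testable.

```cpp
#include <cassert>
#include <chrono>
#include <deque>
#include <utility>

// Hypothetical status enum; the real SWATCH status type may differ.
enum class Status { kGood, kWarning, kError };

// Hypothetical stand-in for Metric: keeps values from the last N seconds.
class WindowedMetric {
public:
  using Clock = std::chrono::steady_clock;
  using TimePoint = Clock::time_point;

  WindowedMetric(double warnAbove, double errorAbove,
                 std::chrono::seconds window, TimePoint now)
    : warnAbove_(warnAbove), errorAbove_(errorAbove),
      window_(window), lastReset_(now)
  {
    assert(window.count() > 0);
    assert(warnAbove <= errorAbove);
  }

  // Record a new value and discard values older than the window.
  void setValue(double value, TimePoint now)
  {
    history_.emplace_back(now, value);
    prune(now);
  }

  // To be called at the end of an FSM transition: forget previous values.
  void reset(TimePoint now)
  {
    history_.clear();
    lastReset_ = now;
  }

  // Error/warning only if every value in the window meets the condition,
  // and never during the first N seconds after construction or reset.
  Status getStatus(TimePoint now)
  {
    prune(now);
    if (now - lastReset_ < window_ || history_.empty())
      return Status::kGood;

    bool allError = true;
    bool allWarning = true;
    for (const auto& entry : history_) {
      allError = allError && (entry.second > errorAbove_);
      allWarning = allWarning && (entry.second > warnAbove_);
    }
    if (allError)
      return Status::kError;
    return allWarning ? Status::kWarning : Status::kGood;
  }

private:
  // Drop entries that are strictly older than the window.
  void prune(TimePoint now)
  {
    while (!history_.empty() && now - history_.front().first > window_)
      history_.pop_front();
  }

  double warnAbove_;
  double errorAbove_;
  std::chrono::seconds window_;
  TimePoint lastReset_;
  std::deque<std::pair<TimePoint, double>> history_;
};
```

In this sketch the FSM code would call `reset` at the end of each transition. A `MetricSnapshot` equivalent would need to copy the history (or a precomputed status) rather than only the latest value.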
Comment from Karol:
If the monitoring loop iteration is used as the unit, then we don't even need to store the history of the metric values: we just need to count the number of consecutive iterations in which the metric value exceeded the error threshold (and, separately, the warning threshold), and reset each counter when the metric value drops below the corresponding threshold.
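A rough sketch of the counter-based variant described in this comment, for comparison with the history-based plan. The names (`ConsecutiveCounter`, `StatusFlag`, `update`, `reset`) are hypothetical, and "value above threshold" is again assumed to be the error/warning condition; `update` is assumed to be called once per monitoring loop iteration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical status enum; the real SWATCH status type may differ.
enum class StatusFlag { kGood, kWarning, kError };

// Counts consecutive updates above each threshold instead of storing values.
class ConsecutiveCounter {
public:
  ConsecutiveCounter(double warnAbove, double errorAbove, std::size_t required)
    : warnAbove_(warnAbove), errorAbove_(errorAbove), required_(required)
  {
    assert(required > 0);
    assert(warnAbove <= errorAbove);
  }

  // Call once per monitoring loop iteration with the new metric value.
  void update(double value)
  {
    // Each counter grows while its condition holds and resets otherwise.
    errorCount_ = (value > errorAbove_) ? errorCount_ + 1 : 0;
    warningCount_ = (value > warnAbove_) ? warningCount_ + 1 : 0;
  }

  // Call at the end of an FSM transition to forget earlier iterations.
  void reset()
  {
    errorCount_ = 0;
    warningCount_ = 0;
  }

  StatusFlag getStatus() const
  {
    if (errorCount_ >= required_)
      return StatusFlag::kError;
    if (warningCount_ >= required_)
      return StatusFlag::kWarning;
    return StatusFlag::kGood;
  }

private:
  double warnAbove_;
  double errorAbove_;
  std::size_t required_;
  std::size_t errorCount_ = 0;
  std::size_t warningCount_ = 0;
};
```

With this approach, resetting the counters at the end of a transition also gives the "no warning/error for the first N iterations" behaviour for free, since neither counter can reach `required_` before N further updates.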