
Add additional metrics for monitoring cta-taped data transfers

Currently, the IO metrics only track how many bytes/files were written to tape and how many bytes/files were written to disk. This paints only half the picture: we also need to know how much data was read (and whether any errors occurred). As such, the existing cta.taped.transfer.io and cta.taped.transfer.count metrics need to become more fine-grained.

These two metrics will be replaced by the following:

  • cta.taped.transfer.file.size -> Counter for number of bytes transferred
  • cta.taped.transfer.file.count -> Counter for number of files transferred

Both new metrics will have the following attributes:

  • cta.io.direction: read/write
  • cta.io.medium: disk/tape
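
To illustrate how the two counters and their attributes fit together, here is a minimal Python sketch. It is not the cta-taped implementation: AttributedCounter and record_transfer are hypothetical stand-ins for a real metrics library, and only the metric and attribute names come from this issue. Each distinct combination of cta.io.direction and cta.io.medium becomes its own time series.

```python
from collections import Counter

# Hypothetical in-memory stand-in for a metrics backend (assumption, not CTA code).
VALID_DIRECTIONS = {"read", "write"}
VALID_MEDIA = {"disk", "tape"}


class AttributedCounter:
    """Monotonic counter that keeps one time series per attribute combination."""

    def __init__(self, name):
        self.name = name
        self.series = Counter()

    def add(self, amount, direction, medium):
        if amount < 0:
            raise ValueError("a counter can only increase")
        if direction not in VALID_DIRECTIONS or medium not in VALID_MEDIA:
            raise ValueError("unknown attribute value")
        key = (("cta.io.direction", direction), ("cta.io.medium", medium))
        self.series[key] += amount


file_size = AttributedCounter("cta.taped.transfer.file.size")
file_count = AttributedCounter("cta.taped.transfer.file.count")


def record_transfer(num_bytes, direction, medium):
    """Record one transferred file: its bytes and a count of one."""
    file_size.add(num_bytes, direction, medium)
    file_count.add(1, direction, medium)


# Archiving one 1000-byte file: it is read from disk, then written to tape.
record_transfer(1000, "read", "disk")
record_transfer(1000, "write", "tape")
```

With two attributes of two values each, every metric has at most four time series, so read and write volumes can be compared per medium without adding new metric names.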

This leaves three additional metrics:

  • A counter for how many bytes/files were reported as being archived
    • Do/should we combine this with the other metrics? Conceptually they are different, but they could be combined by adding one more attribute that says whether the value comes from reporting or from something else. However, combining them would leave no room for other reporting metrics.
    • Possible names:
      • cta.report.bytes
      • cta.report.file.size -> a histogram instead of a counter, to gain insight into file sizes. We don't use histograms everywhere, because the buckets add a lot of cardinality
    • Attributes:
      • Either reuse the same attributes as for the metrics above, or stick with a simple "cta.report.type" attribute.
  • A Gauge/UpDownCounter for the size of the memory buffer in cta-taped
    • cta.taped.buffer.usage
    • cta.taped.buffer.limit
  • A Gauge/UpDownCounter for the number of active disk/tape threads
    • cta.taped.transfer.active
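
The difference from the counters above is that these values go down as well as up. The following Python sketch shows one way an UpDownCounter could track active threads and buffer usage. The UpDownCounter class and the active_transfer helper are hypothetical illustrations, and the 4096-byte block size is an arbitrary assumption; only the metric and attribute names are taken from this issue.

```python
import threading
from contextlib import contextmanager


class UpDownCounter:
    """Hypothetical counter whose value may increase and decrease."""

    def __init__(self, name):
        self.name = name
        self._values = {}
        self._lock = threading.Lock()

    def add(self, delta, attributes=None):
        key = tuple(sorted((attributes or {}).items()))
        with self._lock:
            self._values[key] = self._values.get(key, 0) + delta

    def value(self, attributes=None):
        key = tuple(sorted((attributes or {}).items()))
        with self._lock:
            return self._values.get(key, 0)


active = UpDownCounter("cta.taped.transfer.active")
buffer_usage = UpDownCounter("cta.taped.buffer.usage")


@contextmanager
def active_transfer(attributes):
    """Count a thread as active for the duration of the with-block."""
    active.add(1, attributes)
    try:
        yield
    finally:
        # Decrement even if the transfer raises, so the value cannot drift.
        active.add(-1, attributes)


attrs = {"cta.io.direction": "write", "cta.io.medium": "tape"}
with active_transfer(attrs):
    buffer_usage.add(4096)       # a memory block is taken (size is an assumption)
    during = active.value(attrs)
    buffer_usage.add(-4096)      # the block is released again
after = active.value(attrs)
```

Here during is 1 while the thread is working and after is 0 once it has finished. cta.taped.buffer.limit would more naturally be a plain gauge that is set once, since it does not change through increments.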

Example Archive Session

An archive session will produce the following time series:

  • cta.taped.transfer.file.size (cta.io.direction: write, cta.io.medium: tape)
  • cta.taped.transfer.file.size (cta.io.direction: read, cta.io.medium: disk)
  • cta.taped.transfer.file.count (cta.io.direction: write, cta.io.medium: tape)
  • cta.taped.transfer.file.count (cta.io.direction: read, cta.io.medium: disk)
  • cta.taped.transfer.active (cta.io.direction: write, cta.io.medium: tape)
  • cta.taped.transfer.active (cta.io.direction: read, cta.io.medium: disk)
  • cta.taped.buffer.usage
  • cta.taped.buffer.limit

Example Retrieve Session

A retrieve session will produce the following time series:

  • cta.taped.transfer.file.size (cta.io.direction: read, cta.io.medium: tape)
  • cta.taped.transfer.file.size (cta.io.direction: write, cta.io.medium: disk)
  • cta.taped.transfer.file.count (cta.io.direction: read, cta.io.medium: tape)
  • cta.taped.transfer.file.count (cta.io.direction: write, cta.io.medium: disk)
  • cta.taped.transfer.active (cta.io.direction: read, cta.io.medium: tape)
  • cta.taped.transfer.active (cta.io.direction: write, cta.io.medium: disk)
  • cta.taped.buffer.usage
  • cta.taped.buffer.limit
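
The two listings above follow the same pattern: tape is written and disk is read when archiving, and the directions swap when retrieving. As a rough sketch, the expected time series can be derived from the session type. The session_series function is a hypothetical helper for illustration, for example to check a dashboard or a test, and is not part of cta-taped.

```python
def session_series(session_type):
    """Return the (metric name, attributes) pairs expected for a session type."""
    if session_type == "archive":
        flows = [("write", "tape"), ("read", "disk")]
    elif session_type == "retrieve":
        flows = [("read", "tape"), ("write", "disk")]
    else:
        raise ValueError(f"unknown session type: {session_type}")

    # Metrics that carry the direction/medium attributes.
    per_flow = [
        "cta.taped.transfer.file.size",
        "cta.taped.transfer.file.count",
        "cta.taped.transfer.active",
    ]
    series = [
        (name, {"cta.io.direction": direction, "cta.io.medium": medium})
        for name in per_flow
        for direction, medium in flows
    ]
    # The buffer metrics have no attributes in either session type.
    series.append(("cta.taped.buffer.usage", {}))
    series.append(("cta.taped.buffer.limit", {}))
    return series


archive = session_series("archive")
retrieve = session_series("retrieve")
```

Both calls return eight entries, in the same order as the listings above.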
Edited by Niels Alexander Buegel