Fix device detection to align with nvidia-smi (!1752) · Merge requests · LHCb / Allen

Roel Aaij requested to merge ra-cleanup-device-selection into 2024-patches Aug 27, 2024

The CUDA calls were using the default device ordering, which is determined by a built-in heuristic. nvidia-smi orders devices by PCI bus ID, which is more predictable; using the same ordering makes device numbering consistent between the two.
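For illustration only: one documented way to ask the CUDA runtime for PCI bus ordering is the `CUDA_DEVICE_ORDER` environment variable, which has to be set before the runtime initializes. The description above does not say whether this merge request uses that mechanism, and the helper name `use_pci_bus_order` is hypothetical rather than taken from Allen. A minimal sketch, assuming a POSIX system:

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper, not taken from the Allen code base.
// Requests PCI bus ID ordering for CUDA device enumeration and returns
// the value of CUDA_DEVICE_ORDER that is in effect afterwards.
// Must run before the first CUDA runtime call in the process.
std::string use_pci_bus_order()
{
  // Third argument 0: keep any value the user has already exported.
  setenv("CUDA_DEVICE_ORDER", "PCI_BUS_ID", 0);
  const char* value = std::getenv("CUDA_DEVICE_ORDER");
  return value ? value : "";
}
```

Not overwriting an existing value is a design choice in this sketch: it lets a user who deliberately exported a different ordering keep it.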

The monitoring aggregation thread also needed a set_device call when selecting a device other than 0 without using CUDA_VISIBLE_DEVICES.
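The reason a separate thread needs its own call is that the CUDA runtime tracks the current device per host thread, and a new thread starts on device 0. The sketch below models that behaviour without CUDA: `current_device`, `set_device` and `device_seen_by_new_thread` are stand-ins invented for this example, not the Allen or CUDA functions.

```cpp
#include <thread>

// Stand-in for the CUDA runtime's per-thread "current device".
// Every new thread gets its own copy, initialized to 0.
thread_local int current_device = 0;

// Stand-in for a set_device call: affects the calling thread only.
void set_device(int device) { current_device = device; }

// Starts a worker thread and reports which device it ends up using.
int device_seen_by_new_thread(int selected, bool call_set_device)
{
  int seen = -1;
  std::thread worker([&seen, selected, call_set_device] {
    if (call_set_device) set_device(selected);
    seen = current_device; // the worker's own thread-local value
  });
  worker.join();
  return seen;
}
```

In this model, a worker that skips `set_device` uses device 0 even if the main thread selected another device. With CUDA_VISIBLE_DEVICES restricting the process to a single GPU, that GPU is device 0, so the missing call has no visible effect.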

An optimization that increases the number of hardware connections was also not being applied when running in production or on MEPs.

Edited Aug 27, 2024 by Roel Aaij

