Fix device detection to align with nvidia-smi

Roel Aaij requested to merge ra-cleanup-device-selection into 2024-patches

The CUDA calls were using the default device ordering, which is a built-in heuristic (fastest device first). nvidia-smi orders devices by PCI bus ID, which is more predictable; using the same ordering makes device indices consistent between the two.
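One way to request PCI bus ordering is the `CUDA_DEVICE_ORDER` environment variable, which the CUDA runtime reads when it initialises. The sketch below is illustrative and not necessarily how this merge request does it; the helper names `request_pci_bus_order` and `current_device_order` are hypothetical, and `setenv` assumes a POSIX system.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: ask CUDA to enumerate devices by PCI bus ID,
// matching the order shown by nvidia-smi. This has to run before the
// first CUDA runtime call in the process to have any effect.
bool request_pci_bus_order() {
  // The last argument (1) overwrites any value already in the environment.
  return setenv("CUDA_DEVICE_ORDER", "PCI_BUS_ID", 1) == 0;
}

// Hypothetical helper: report the ordering CUDA would use.
// FASTEST_FIRST is the documented default when the variable is unset.
std::string current_device_order() {
  const char* value = std::getenv("CUDA_DEVICE_ORDER");
  return value ? value : "FASTEST_FIRST";
}
```

Calling `request_pci_bus_order()` at the very start of `main` would be the natural place; once the runtime has enumerated devices, changing the variable no longer affects the ordering.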

The monitoring aggregation thread also needed a set_device call when selecting a device other than 0 without using CUDA_VISIBLE_DEVICES.
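The reason is that the CUDA runtime tracks the current device per host thread, and a new thread starts on device 0 until it calls `cudaSetDevice`. The sketch below only models that behaviour with a `thread_local` variable so it runs without a GPU; `set_device`, `get_device` and `device_seen_by_worker` are hypothetical stand-ins, not the project's code or the CUDA API.

```cpp
#include <thread>

// Model of the CUDA runtime's per-host-thread "current device".
// Every new thread starts at 0, as with the real runtime.
thread_local int current_device = 0;

void set_device(int id) { current_device = id; }  // stands in for cudaSetDevice
int get_device() { return current_device; }       // stands in for cudaGetDevice

// Start a worker thread and return the device it would end up using.
// Without its own set_device call the worker does not inherit the
// selection made on the thread that spawned it.
int device_seen_by_worker(int selected, bool worker_sets_device) {
  int seen = -1;
  std::thread worker([&] {
    if (worker_sets_device) set_device(selected);
    seen = get_device();
  });
  worker.join();
  return seen;
}
```

With device 2 selected on the main thread, a worker that skips the call still sees device 0 in this model, which matches the symptom described for the monitoring aggregation thread.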

An optimization to increase the number of hardware connections was missed when running in production or on MEPs.

Edited by Roel Aaij
