Don't free host memory within a sequence of algorithms
Until now, the host and device memory managers worked in exactly the same way: they freed a memory buffer as soon as the algorithms needing it had finished processing. However, this can cause problems in host memory when a GPU is used as the device. The reason is that device algorithms, as well as copies between device and host, are scheduled asynchronously: the calls run in the designated order on the device, but the host does not know when they have finished unless a synchronization, such as cudaStreamSynchronize(), is called. As a result, the host memory manager may free a buffer before a device-to-host copy has written to it, or the contents of a host buffer may be changed before they have been copied to the device.
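To make the hazard concrete, here is an illustrative sketch (not the actual memory manager code; the function and variable names are hypothetical) of why freeing a host buffer immediately after an asynchronous copy is unsafe:

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

// Hypothetical sketch: the async copy is only enqueued on the stream,
// so the host-side free below races with the device-to-host transfer.
void unsafe_copy_back(const float* d_result, std::size_t n, cudaStream_t stream) {
    float* h_buf = static_cast<float*>(std::malloc(n * sizeof(float)));

    // Returns immediately; the actual copy happens later on the stream.
    cudaMemcpyAsync(h_buf, d_result, n * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);

    // BUG: the copy may still be in flight, so the transfer would write
    // into memory that has already been freed.
    std::free(h_buf);

    // Safe variant: drain the stream before freeing the host buffer.
    // cudaStreamSynchronize(stream);
    // std::free(h_buf);
}
```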
To avoid this, with this merge request the host memory manager no longer frees buffers while a sequence is being processed. Host memory reserved for the execution of a sequence is released only once the processing of that sequence has finished.
An alternative way to handle the problem would be to synchronize before each device-to-host and host-to-device copy. An issue has been opened to study the performance impact of this solution: !458 (merged)