
InputCopyStream should not rely on incidents

The InputCopyStream algorithm inherits from OutputStream, and should persist all TES locations that are present in the input file. It collects the list of locations to propagate using the DataSvcFileEntriesTool tool. This tool scans the TES and caches the leaves associated with the input file, and relies on the incident service to clear this cache at the beginning of each event. Using incidents to clear the cache is problematic for schedulers that don't support the incident service: if the incident is never delivered, the cache is never cleared.
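To make the failure mode concrete, here is a toy model in plain Python. It is not Gaudi code: `LeafCacheTool`, `handle_begin_event` and the event dictionaries are hypothetical stand-ins, and the sketch assumes the cache is only ever cleared by the begin-event callback. It shows that the second event silently reuses the first event's leaf list when nothing delivers the incident.

```python
class LeafCacheTool:
    """Toy stand-in for a tool that caches the input-file leaves of an event."""

    def __init__(self):
        self._cache = None
        self.scans = 0  # number of times the store was actually traversed

    def handle_begin_event(self):
        # Incident-style callback: forget the previous event's leaves.
        self._cache = None

    def leaves(self, store):
        # Traverse the store only if nothing is cached yet.
        if self._cache is None:
            self.scans += 1
            self._cache = sorted(
                path for path, from_input in store.items() if from_input
            )
        return self._cache


# Each toy event maps a location to "did this come from the input file?".
event1 = {"/Event/Raw": True, "/Event/Rec": False}
event2 = {"/Event/Raw": True, "/Event/Sim": True}

# Scheduler that delivers the begin-event incident.
with_incidents = LeafCacheTool()
results_with = []
for event in (event1, event2):
    with_incidents.handle_begin_event()
    results_with.append(with_incidents.leaves(event))

# Scheduler that never delivers it: the cache from event1 is reused.
without_incidents = LeafCacheTool()
results_without = [without_incidents.leaves(event) for event in (event1, event2)]
```

In this model the second entry of `results_without` is missing `/Event/Sim`, because the tool never rescanned the store.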

A couple of possible solutions:

  1. Drop DataSvcFileEntriesTool and move the leaf traversal logic into InputCopyStream. This is clean, but 'inefficient' because multiple instances of InputCopyStream will each end up building the same list of leaves.
  2. Convert DataSvcFileEntriesTool to an algorithm which stores the leaf list on the TES, and make this list a data dependency of InputCopyStream.
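The second option could look roughly like the toy model below. Again this is plain Python rather than Gaudi code, and every name in it (`LeafListProducer`, `CopyStream`, the `/Event/InputFileLeaves` key, the `inputs`/`outputs` attributes) is a hypothetical placeholder. The sketch assumes the event store is fresh for each event, so the leaf list disappears with the event and nothing has to be cleared by an incident.

```python
LEAF_LIST_KEY = "/Event/InputFileLeaves"  # hypothetical TES location


class LeafListProducer:
    """Toy algorithm that writes the list of input-file leaves to the store."""

    outputs = (LEAF_LIST_KEY,)

    def __init__(self):
        self.runs = 0

    def execute(self, store, input_leaves):
        self.runs += 1
        store[LEAF_LIST_KEY] = sorted(input_leaves)


class CopyStream:
    """Toy stream that declares the leaf list as a data dependency."""

    inputs = (LEAF_LIST_KEY,)

    def __init__(self, name):
        self.name = name
        self.written = []  # one leaf list per processed event

    def execute(self, store):
        # The list was produced upstream; the stream only reads it.
        self.written.append(list(store[LEAF_LIST_KEY]))


producer = LeafListProducer()
streams = [CopyStream("StreamA"), CopyStream("StreamB")]

# Leaves present in the input file for two toy events.
events = [{"/Event/Raw"}, {"/Event/Raw", "/Event/Sim"}]

for input_leaves in events:
    store = {}  # a new store per event: no stale state to clear
    producer.execute(store, input_leaves)
    for stream in streams:
        # The data dependency guarantees the producer has already run.
        assert all(key in store for key in stream.inputs)
        stream.execute(store)
```

Here the traversal happens once per event regardless of how many streams consume the list, which is the caching benefit of this option without a tool-level cache.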

Number 1 is nice because the 'API' of InputCopyStream doesn't change. It's not nice because it has no caching at all.

Number 2 is nice because it avoids the use of a tool, and has caching.

Both solutions suffer from the problem of not being able to robustly check whether an OutputStream instance has been run before an InputCopyStream instance (this can cause problems, and was tracked by Savannah ID 76642).
