Switch to new training file format

We have discussed moving to a "new" training file format, which is actually just the TDD output format we know and love. The reasons for doing this are:

  • easier to work with the training files (e.g. running trainings with different variables from a single preprocessed file)
  • consistency of files (e.g. easy to plot variables in final training files, and train and test loops use the same dataloader)
  • storage size improvements (due to typed storage)
  • dataloader read performance improvements (due to above)

I am planning to make the switch soon in salt. The idea is to use the existing umami *-hybrid-resampled.h5 file, rather than *-hybrid-resampled_scaled_shuffled.h5. As far as I can tell, the resampled files are already shuffled. Variable normalisation will be handled on the fly in the dataloaders, which has a negligible impact on speed. This update should then be fully backward compatible.
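
To illustrate what on-the-fly normalisation with per-training variable selection could look like, here is a minimal sketch. It is not the actual salt dataloader: the structured array stands in for a batch read from the resampled file, and the field names, the `norm` dictionary layout, and the `load_inputs` helper are all hypothetical.

```python
import numpy as np

# Stand-in for a batch of a structured (typed) dataset read from the
# resampled file. Field names and values are hypothetical.
jets = np.array(
    [(50.0, 0.5, 5), (100.0, -1.0, 0), (150.0, 2.0, 4)],
    dtype=[("pt", "f4"), ("eta", "f4"), ("flavour_label", "i4")],
)

# Hypothetical normalisation parameters, e.g. loaded alongside the file.
norm = {
    "pt": {"mean": 100.0, "std": 50.0},
    "eta": {"mean": 0.0, "std": 2.0},
}


def load_inputs(batch, variables, norm):
    """Select the requested fields and normalise them on the fly."""
    cols = [
        (batch[v].astype(np.float32) - norm[v]["mean"]) / norm[v]["std"]
        for v in variables
    ]
    # Stack into a plain (n_jets, n_variables) float array for the model.
    return np.stack(cols, axis=-1)


# Two trainings with different input variables, same preprocessed file.
x_full = load_inputs(jets, ["pt", "eta"], norm)
x_pt = load_inputs(jets, ["pt"], norm)
```

Because the scaling is applied per batch at read time, the file on disk keeps the unscaled values, so the same file can be used for plotting and for trainings with different variable sets.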

Tagging @pgadow @alfroch in case they have any comments or concerns.

Edited by Samuel Van Stroud