Add benchmarking and physics validation of outputs to pipeline
Ideally using scikit-validate, the pipeline would benchmark its performance and also check that no distributions have changed (unless a change is expected). Since this is a nightly build and not a Merge Request, there'd need to be some consideration of how to report these results back: a nightly email? An issue opened automatically if things change in unexpected ways?
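
To make the idea concrete, here is a rough sketch of what the nightly check could look like. It is plain Python and does not use the scikit-validate API; the function names (`chi2_per_bin`, `changed_distributions`, `benchmark`), the histogram layout (lists of raw counts with identical binning) and the threshold of 2.0 are all assumptions for illustration, not anything the pipeline currently has.

```python
import time


def chi2_per_bin(reference, candidate):
    """Chi-square per used bin between two histograms of raw counts.

    Assumes identical binning and samples of comparable size (hypothetical
    setup); bins that are empty in both histograms are skipped.
    """
    if len(reference) != len(candidate):
        raise ValueError("histograms must share the same binning")
    chi2 = 0.0
    used = 0
    for r, c in zip(reference, candidate):
        if r + c == 0:
            continue
        # For Poisson counts the variance of (r - c) is roughly r + c.
        chi2 += (r - c) ** 2 / (r + c)
        used += 1
    return chi2 / used if used else 0.0


def changed_distributions(reference_hists, candidate_hists, threshold=2.0):
    """Return the sorted names of distributions that are missing or differ."""
    changed = []
    for name, ref in reference_hists.items():
        cand = candidate_hists.get(name)
        if cand is None or chi2_per_bin(ref, cand) > threshold:
            changed.append(name)
    return sorted(changed)


def benchmark(func, *args):
    """Run func(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Made-up counts, purely to show the call pattern.
    reference = {"pt": [100, 200, 100], "eta": [50, 50, 50]}
    candidate = {"pt": [100, 200, 100], "eta": [10, 50, 90]}
    changed, seconds = benchmark(changed_distributions, reference, candidate)
    print(f"changed: {changed}, validation took {seconds:.6f}s")
```

The nightly job could then open an issue or send an email only when the returned list is non-empty, with an allow-list for distributions that are expected to change.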