Draft: muP transfer of hyperparameters between models of different complexity
This MR adapts the salt codebase to support muP (maximal update parametrisation) transfer. This technique makes it possible to optimise hyperparameters on a model of lower complexity than the target model and then reuse them on the target. With the correct parametrisation and initialisation of the network, the performance ranking of hyperparameter configurations on the low-complexity model should match that on the high-complexity one. Furthermore, for a given architecture, the muP-parametrised network should perform as well as or better than the same network in the (current) standard parametrisation.
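As a rough illustration of why hyperparameters can transfer across widths, the sketch below shows two scaling rules commonly associated with muP when training with Adam: the learning rate of hidden (matrix-like) weights and the scale of the output layer both shrink with the width multiplier. The function names are hypothetical and are neither part of salt nor of the muP package API. This is a minimal sketch under the assumption that width is the only dimension being scaled.

```python
# Illustrative only: hypothetical helpers, not salt or muP package code.

def width_multiplier(width: int, base_width: int) -> float:
    """Ratio between the target model width and the tuned (base) model width."""
    return width / base_width


def hidden_adam_lr(base_lr: float, width: int, base_width: int) -> float:
    """Assumed muP rule for Adam: the learning rate of hidden weight matrices
    is divided by the width multiplier."""
    return base_lr / width_multiplier(width, base_width)


def readout_scale(width: int, base_width: int) -> float:
    """Assumed muP rule: the output (readout) layer is scaled down by the
    width multiplier."""
    return 1.0 / width_multiplier(width, base_width)


# Learning rate tuned on a width-64 proxy, reused for a width-256 target.
tuned_lr = 1e-3
target_lr = hidden_adam_lr(tuned_lr, width=256, base_width=64)
target_readout = readout_scale(width=256, base_width=64)
```

With these rules the single value tuned on the small model (`tuned_lr`) is kept as the shared hyperparameter, and the per-layer adjustments for the larger model follow from the width ratio alone.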
This work is based on this paper and uses this GitHub repository and the muP package