Stability / overtraining issue with TransformerV2
I've tried training a couple of GN2 models: one using the default Transformer and another using TransformerV2. The TransformerV2 model appeared to fail training at around epoch 15, and I've spoken to a couple of others who have run into the same problem. It would be worth investigating the cause. Possible mitigations (sketched below) are:
- Decreasing the maximum learning rate.
- Slightly increasing the `pct_start` of the learning rate scheduler.
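
For concreteness, here is a minimal sketch of where both knobs live, assuming the schedule in question is PyTorch's `torch.optim.lr_scheduler.OneCycleLR` (the standard scheduler that exposes a `pct_start` argument). The model, optimizer, and all numeric values below are placeholders for illustration, not the actual GN2 defaults:

```python
import torch
from torch import nn, optim

# Placeholder model and optimizer; stands in for the GN2 / TransformerV2 setup.
model = nn.Linear(32, 1)
optimizer = optim.AdamW(model.parameters(), lr=1e-4)

scheduler = optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=5e-4,        # mitigation 1: lower the peak LR (value is illustrative)
    pct_start=0.1,      # mitigation 2: spend a slightly larger fraction of
                        # training warming up to the peak (value is illustrative)
    total_steps=10_000, # placeholder; in practice epochs * steps_per_epoch
)

# Step the scheduler once per optimizer step, as usual for OneCycleLR.
```

Both changes act on the same failure mode: if the divergence coincides with the LR peak, a lower `max_lr` reduces the peak itself, while a larger `pct_start` makes the ramp to that peak more gradual.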