Fair points! Yes, PyTorch's experimental fp8 support does scale the gradients. Interesting point on a larger range for the forward pass and a smaller range for the gradients! I did not know that, so I learnt something today! Thanks! I'll definitely read that paper!