I mean, is this any different from standard gradient descent with something like Adam as the optimiser?
That’s my assumption based on the headline. But from the quick skim I gave the article, it seemed to discuss it only in the context of NLP. Not exactly my field of study.