After more testing, (1) I'm more confident it is indeed correct, and
(2) it is a significant speedup - we do a lot of those multiplications.

function                                     old     new   delta
sp_512to256_mont_reduce_8                    191     223     +32

Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>