RSA intrinsics [Was: RFR(L): 8069539: RSA acceleration] (original) (raw)

Andrew Haley aph at redhat.com
Thu Jun 4 18:41:53 UTC 2015


On 04/06/15 19:32, Anthony Scarpino wrote:

On 06/04/2015 10:08 AM, Andrew Haley wrote:

I'm sorry this is a rather long email, and I pray for your patience.

I'm getting close to something I can put forward for review. The performance is encouraging. [ Some background: The kernel of RSA and Diffie-Hellman key exchange is Montgomery multiplication. Optimizing RSA basically comes down to optimizing Montgomery multiplication. The core of OpenJDK's RSA is BigInteger.oddModPow(). ] My Montgomery multiply routine (mostly portable C, with a small assembly-language insert) executes a 1024-bit multiply/reduce in about 2000ns; the hand-coded assembly language equivalent in OpenSSL is faster (as you'd expect) but not very much faster: about 1700ns. In other words, compiled C is only about 20% slower. Firstly, this is pretty remarkable performance by GCC (Yay! Go GCC!) and it shows I'm on the right track. It also shows that there isn't a huge amount to be gained by hand-coding Montgomery multiplication, but anybody who fancies their hand can try to improve on GCC. This is also very nice because porting it to non-x86 hardware is fairly straightforward -- certainly far easier than writing a large assembly- language routine. I'm sure when I see the code it will become clearer, but I'm guessing you are taking a C version of Montgomery multiply, using GCC to turn it into assembly with some cpu acceleration instructions and putting that into an intrinsic?

I have written a Montgomery multiply routine: it is mostly C, with a tiny bit (really, just a few instructions) of inline assembly language. It's called from a HotSpot intrinsic.

It would be nice to keep all of the data in an array of jlongs for the duration of oddModPow(). Here's one idea: I could write a version of oddModPow() which is guaranteed never to use the Java version of the Montgomery multiply. This would use a JNI method which calls the native Montgomery multiply routine, guaranteeing that that we always use that native routine, even from the interpreter and C1. Then I could keep all the internal state in native word order, and all this word-swapping would just go away. This would have the additional benefit that it would be faster when using the interpreter and C1. So, we'd have two versions of oddModPow() in BigInteger, and switch between them depending on whether the platform had support for a native Montgomery multiplication. I'm assuming the JNI method is closer to OpenSSL numbers?

It's the same Montgomery multiplication as the intrinsic, but called from a JNI method instead of a HotSpot intrinsic.

The downside of having two versions of oddModPow() would, of course, be some code duplication.

Or am I just making too much fuss about this? Maybe I should be happy with what I've got. Looking at the 2k RSA sign numbers 2.3x better after your patch vs hs-comp. I'd be happy with that improvement.

OK. I kinda guessed that would be the response, really.

Andrew.



More information about the hotspot-compiler-dev mailing list