
Notes from a frustrated programmer December 25 Ok, it's Christmas Day and I've just implemented Montgomery Reduction in the bigint library. So what the hell am I doing, coding on Christmas Day? Actually I think I lead a fairly balanced life, but this crypto stuff is driving me crazy. I was planning to add lots of cool features, like improving/fixing the client side of the library. But once you start playing with bigint stuff and trying to improve the performance, the saga begins. Take the function: x = d^{e }mod m This is one of the key functions in RSA public key cryptography. A public encryption takes next to no time, due to the exponent being 0x10001  a very small number compared to the huge exponents when doing private decryption. And an SSL server does lots of these and so its one of the key performance times. Most of my algorithms came from "The Handbook of Applied Cryptography Ch 14 (1996)" which is a bit of a bible on the topic. All the exponentiation techniques consist of converting an exponent function into a number of multiplications (including squaring). So x^{5} is x^{2}x^{2}x^{1 }and consists of 2 squares and 1 multiply. I'm using righttoleft exponentiation for the Classical/Barrett algorithms and lefttoright exponentiation for Montgomery algorithm. I've looked at using addition chains, but there is no simple algorithm for working them out. They work basically on the fact that an exponent can use previous results to calculate successive result. So x^{12} is x^{1}x^{2}x^{3}x^{6} which results in 4 multiplications instead of 6. So there is possibly a 33% improvement here which is probably as good as it will get (there is no deterministic algorithm to work out the exact improvement). This algorithm has quite high memory requirements due to the storage of the intermediate multiplications. I'll also look at the sliding window algorithm which the GMP library uses. Since they get spectacular results, I'll look into this one soon. But for the time being its the basic "repeated square and multiply" algorithms. So now to the modular reduction functions. I won't go into how these algorithms work there are all fairly well known. But before I go on, I am also using Karatsuba multiplication which for large numbers results in some savings. It's main premise is to turn 4 multiplications instead of 3. The documentation says that instead of O(n^{2}) you get O(n^{1.589}) in improvement. So when you are talking a large n, this could mean a huge saving. But what they don't tell you about is all the additions/subtractions that you need (7 in all). Fortunately additions/subtractions are linear i.e. O(n), so the saving kicks in for the large numbers. So here are some times for the various algorithms for the 512/1024/2048/4096 modulii. The times were taken on an AMD3400+ 64 bit micro running Fedora Core 3 (32 bit) Linux. I've also included the times from GMP as a reference. Any precalculated stuff was not included in the timings (such as µ in Barrett reduction). All times are in milliseconds.
So a couple of observations. Firstly GMP kills us for performance. To be fair, they've been doing this since 1991, and have many of the key operations in assembler, so I'd expect some improvements. But they are over 10 times faster which I did not expect. Karatsuba gives a benefit of around 10% for Barrett 4096, but the Classical algorithm gets very little benefit. The Karatsuba threshold has been set at 20 components, so is only really of benefit to modulii of 2048 bits and above. Barrett is about 20% faster than Classical for the 4096 bit modulii. And the times for Montgomery algorithm is significantly slower than the Classical. Now, much the documentation (e.g. see here) says that Montgomery should be much faster. Which means I'll have to go through the code line by line and work out what is going on. Stay tuned. Now, since I've got a 64 bit processor, I tried the above on FC3 (64 bit) with no code changes, and got the following:
It doesn't take much to realize the a 64 bit system is much much faster  many of the figures are about 1/3 of the corresponding 32 bit versions. And the GMP times are again amazing. But I did notice the test code size go up about 1020% in size. I'm going to have to go over the mathematics of the above and see if I do the correct number of operations. Anyway my aim to be 2 times slower than GMP and then I'll stop. But its been an interesting adventure. Stay tuned. 27th December I've done some performance tweaks and profiling to reveal what's going on. I've changed Montgomery to use the benefits of Karatsuba multiplication (it will also compare the 3 algorithms better). So firstly the new times:
This is a 400ms improvement for Barrett 4096 and almost a 200ms improvement for Montgomery 4096  and just from changing a couple of lines of code. But Barrett still kills Montgomery for performance. I then did profiling using gprof on a single encryption/decryption on a 4096 bit modulus and they revealed the following::
So a couple of observations. Firstly, a single precision addition/subtraction is as expensive as as a single precision multiply (see bi_add() and bi_int_multiply() in Montgomery 4096) . This goes to show how fast the ALU's of CPU's are today. To be fair, I don't have access to the carry/overflow bit in C like I would in assembler, so this would make a big difference (I estimate a possible 2030% improvement here). An assembler version of bi_add() would greatly improve Montgomery, but a corresponding assembler version of bi_subtract() would also greatly improve Classical (to the extent I think they both would match Barrett  or at least be close). Secondly the literature out there (like the comparison of the 3 algorithms here, and even the Handbook of Applied Cryptography) is incorrect, in that they base the performance of the respective algorithms on the number of multiplies. This was probably ok 1020 years ago when these documents were written, but its not applicable now. I'm going out on a limb here, but I think that Montgomery has had its day in crypto software implementations (in hardware where addition is extremely fast, it may be a different story  as well as functions written in assembler). Finally, 10% gains are possible by subtly changing a single line of code as I've found. So all of the above is very dependent on the implementation and the compiler. But I think I'm in the ballpark. I'm now going to implement bi_square() which should improve all algorithms by about 510% (I've got to sort out an overflow issue first), and then sliding window exponentiation for which I'm expecting big improvements. I'm also going to check GMP's basic functions in C to see if I can get any more ideas. 31st December New Years Eve. I've added bi_square(), implemented sliding window exponentiation and done several tweaks. So here are the new times:
So a gain of more than 50% over the last few days (ok, I did go on holidays for a some of those days). Montgomery is still slower than classical, but at one stage I had Montgomery as faster (when I tweaked bi_add()). It still goes to show how implementation dependent this all is. The new code has meant a 10% increase in size (around 5kB, to 55kB now for the web server). I don't think I can go much faster in C. I think huge savings can be done in assembler  I'm predicting that's where GMP is getting most of its performance gains. They are still about 4 times faster, but I'm getting closer. I'll have a look and see what else I can tweak. 3rd January The new year must of sparked some inspiration  I've made some huge performance gains. Most of it was from implementing CRT (Chinese Remainder Theorem). I've also made all the algorithms configurable to see their effect on performance and code size. Just have a look at the following:
Just compare this with the previous table! The objective has now been reached  we are within 2 times the speed of GMP. And without a single line of assembler (ok, to be fair, GMP does not use CRT and I'd hate to see how fast it would be if it did). Unfortunately Montgomery doesn't work well with CRT so it reverts to classical when used. Here is a table that includes the feature (or lack thereof) and the code size of the performance tester binary which should give an idea of the changes in code size. All the disabled features are using Barrett reduction.
Karatsuba only works on 4096 bit keys and is only marginally effective on 2048 bit keys. Since this is an embedded toolkit, this feature is disabled by default (and it saves 2kB). Barrett also uses an extra 2kB compared to classical, but it gives more than a third in performance gain, so it is worth it. CRT gives a 3.3x improvement factor (close to the theoretical improvement of 4 times), and slidingwindow exponentiation gives an improvement factor of 2.2x. And both features are very cheap in terms of code size. Squaring gives an improvement factor of 1.15x. Montgomery can be ignored unless bigint addition can be improved greatly (and it's useless with CRT anyway). Anyway, mission accomplished. An embedded system that is 20 times slower should be able to accept just under 5 1024 bit connections per second. Just imagine how much faster the above would be with assembler (I'm predicting gains of over 100%). It's been quite a ride. So now I can go back to improving the client functionality... 15th January Since the makefiles have been adjusted to run under Win32, the bigint performance tests now run. So here are the results (using Barrett reduction, and no Karatsuba):
Ok, as much as I'd hate to admit it, this test shows how much more efficient the Microsoft compiler it is. In fact it's about 30% better. 
Copyright © Cameron HamiltonRich
2016
. All rights reserved. 