Notes from a frustrated programmer

December 25

Ok, it's Christmas Day and I've just implemented Montgomery Reduction in the bigint library. So what the hell am I doing, coding on Christmas Day? Actually I think I lead a fairly balanced life, but this crypto stuff is driving me crazy.

I was planning to add lots of cool features, like improving/fixing the client side of the library. But once you start playing with bigint stuff and trying to improve the performance, the saga begins.

Take the function:

x = de mod m

This is one of the key functions in RSA public key cryptography. A public encryption takes next to no time, due to the exponent being 0x10001 - a very small number compared to the huge exponents when doing private decryption. And an SSL server does lots of these and so its one of the key performance times.

Most of my algorithms came from "The Handbook of Applied Cryptography Ch 14 (1996)" which is a bit of a bible on the topic.

All the exponentiation techniques consist of converting an exponent function into a number of multiplications (including squaring). So x5 is x2x2x1 and consists of 2 squares and 1 multiply. I'm using right-to-left exponentiation for the Classical/Barrett algorithms and left-to-right exponentiation for Montgomery algorithm.

I've looked at using addition chains, but there is no simple algorithm for working them out. They work basically on the fact that an exponent can use previous results to calculate successive result. So x12 is x1x2x3x6 which results in 4 multiplications instead of 6. So there is possibly a 33% improvement here which is probably as good as it will get (there is no deterministic algorithm to work out the exact improvement). This algorithm has quite high memory requirements due to the storage of the intermediate multiplications.

I'll also look at the sliding window algorithm which the GMP library uses. Since they get spectacular results, I'll look into this one soon. But for the time being its the basic "repeated square and multiply" algorithms.

So now to the modular reduction functions. I won't go into how these algorithms work -there are all fairly well known. But before I go on,  I am also using Karatsuba multiplication which for large numbers results in some savings. It's main premise is to turn 4 multiplications instead of 3. The documentation says that instead of O(n2) you get O(n1.589) in improvement. So when you are talking a large n, this could mean a huge saving. But what they don't tell you about is all the additions/subtractions  that you need (7 in all). Fortunately additions/subtractions are linear i.e. O(n), so the saving kicks in for the large numbers.

So here are some times for the various algorithms for the 512/1024/2048/4096 modulii. The times were taken on an AMD3400+ 64 bit micro running  Fedora Core 3 (32 bit) Linux. I've also included the times from GMP as a reference. Any pre-calculated stuff was not included in the timings (such as µ in Barrett reduction). All times are in milliseconds.

 Algorithm 512 bits 1024 bits 2048 bits 4096 bits Classical 15 74 481 3385 Classical (no Karatsuba) 15 74 502 3779 Barrett 10 56 384 2753 Barrett (no Karatsuba) 11 57 405 3133 Montgomery 10 72 522 3954 GMP 1.5 5.6 34.4 234.8

So a couple of observations. Firstly GMP kills us for performance. To be fair, they've been doing this since 1991, and have many of the key operations in assembler, so I'd expect some improvements. But they are over 10 times faster which I did not expect.

Karatsuba gives a benefit of around 10% for Barrett 4096, but the Classical algorithm gets very little benefit. The Karatsuba threshold has been set at 20 components, so is only really of benefit to modulii of 2048 bits and above.

Barrett is about 20% faster than Classical for the 4096 bit modulii.

And the times for Montgomery algorithm is significantly slower than the Classical. Now, much the documentation (e.g. see here) says that Montgomery should be much faster. Which means I'll have to go through the code line by line and work out what is going on. Stay tuned.

Now, since I've got a 64 bit processor, I tried the above on FC3 (64 bit) with no code changes, and got the following:

 Algorithm 512 bits 1024 bits 2048 bits 4096 bits Classical 15 72 260 977 Barrett 9 54 215 981 Mongomery 4 26 300 1155 GMP 1.2 7.5 27.9 129.4

It doesn't take much to realize the a 64 bit system is much much faster - many of the figures are about 1/3 of the corresponding 32 bit versions. And the GMP times are again amazing. But I did notice the test code size go up about 10-20% in size.

I'm going to have to go over the mathematics of the above and see if I do the correct number of operations. Anyway my aim to be 2 times slower than GMP and then I'll stop. But its been an interesting adventure. Stay tuned.

27th December

I've done some performance tweaks and profiling to reveal what's going on. I've changed Montgomery to use the benefits of Karatsuba multiplication (it will also compare the 3 algorithms better). So firstly the new times:

 Algorithm 512 bits 1024 bits 2048 bits 4096 bits Barrett 10 51 333 2355 Montgomery 14 75 521 3785 Classical (no real change) 14 75 482 3390

This is a 400ms improvement for Barrett 4096 and almost a 200ms improvement for Montgomery 4096 - and just from changing a couple of lines of code.

But Barrett still kills Montgomery for performance. I then did profiling using gprof on a single encryption/decryption on a 4096 bit modulus and they revealed the following::

 Barrett 4096 Number Function Time 12328 partial_multiply 22.1 166376 regular_multiply 9.89 320424 bi_add 2.95 166716 bi_subtract 1.30

 Classical 4096 Number Function Time 944948 bi_subtract 14.63 788737 bi_int_multiply 13.92 166376 regular_multiply 9.61 320424 bi_add 2.67

 Montgomery 4096 Number Function Time 1110496 bi_add 29.88 789890 bi_int_multiply 13.58 166536 regular_multiply 10.03 950130 comp_left_shift 6.53

So a couple of observations. Firstly, a single precision addition/subtraction is as expensive as as a single precision multiply (see bi_add() and bi_int_multiply() in Montgomery 4096) . This goes to show how fast the ALU's of CPU's are today. To be fair, I don't have access to the carry/overflow bit in C like I would in assembler, so this would make a big difference (I estimate a possible 20-30% improvement here). An assembler version of bi_add() would greatly improve Montgomery, but a corresponding assembler version of bi_subtract() would also greatly improve Classical (to the extent I think they both would match Barrett - or at least be close).

Secondly the literature out there (like the comparison of the 3 algorithms here, and  even the Handbook of Applied Cryptography) is incorrect, in that they base the performance of the respective algorithms on the number of multiplies. This was probably ok 10-20 years ago when these documents were written, but its not applicable now.

I'm going out on a limb here, but I think that Montgomery has had its day in crypto software implementations (in hardware where addition is extremely fast, it may be a different story - as well as functions written in assembler).

Finally, 10% gains are possible by subtly changing a single line of code as I've found. So all of the above is very dependent on the implementation and the compiler. But I think I'm in the ballpark.

I'm now going to implement bi_square() which should improve all algorithms by about 5-10% (I've got to sort out an overflow issue first), and then sliding window exponentiation for which I'm expecting big improvements. I'm also going to check GMP's basic functions in C to see if I can get any more ideas.

31st December

New Years Eve. I've added bi_square(), implemented sliding window exponentiation and done several tweaks. So here are the new times:

 Algorithm 512 bits 1024 bits 2048 bits 4096 bits Barrett 8 23 159 1125 Montgomery 10 46 329 2420 Classical 11 46 312 2210

So a gain of more than 50% over the last few days (ok, I did go on holidays for a some of those days). Montgomery is still slower than classical, but at one stage I had Montgomery as faster (when I tweaked bi_add()). It still goes to show how implementation dependent this all is. The new code has meant a 10% increase in size (around 5kB, to 55kB now for the web server).

I don't think I can go much faster in C. I think huge savings can be done in assembler - I'm predicting that's where GMP is getting most of its performance gains. They are still about 4 times faster, but I'm getting closer. I'll have a look and see what else I can tweak.

3rd January

The new year must of sparked some inspiration - I've made some huge performance gains. Most of it was from implementing CRT (Chinese Remainder Theorem). I've also made all the algorithms configurable to see their effect on performance and code size.

Just have a look at the following:

 Algorithm 512 bits 1024 bits 2048 bits 4096 bits Barrett 5 12 50 349 Classical/Montgomery 6 19 84 548

Just compare this with the previous table! The objective has now been reached - we are within 2 times the speed of GMP. And without a single line of assembler (ok, to be fair, GMP does not use CRT and I'd hate to see how fast it would be if it did). Unfortunately Montgomery doesn't work well with CRT so it reverts to classical when used.

Here is a table that includes the feature (or lack thereof) and the code size of the performance tester binary which should give an idea of the changes in code size. All the disabled features are using Barrett reduction.

 Algorithm 512 bits 1024 bits 2048 bits 4096 bits Code Size Barrett 5 12 50 349 40819 Classical 6 19 84 548 38135 Montgomery 6 19 84 548 40783 No CRT 8 27 167 1163 40796 No Karatsuba 5 12 54 367 38073 No Sliding Window 7 22 119 788 40819 No Square 6 13 62 404 39423 All features off 9 43 303 2327 36776

Karatsuba only works on 4096 bit keys and is only marginally effective on 2048 bit keys. Since this is an embedded toolkit, this feature is disabled by default (and it saves 2kB). Barrett also uses an extra 2kB compared to classical, but it gives more than a third in performance gain, so it is worth it.

CRT gives a 3.3x improvement factor (close to the theoretical improvement of 4 times), and sliding-window exponentiation gives an improvement factor of 2.2x. And both features are very cheap in terms of code size. Squaring gives an improvement factor of 1.15x.

Montgomery can be ignored unless bigint addition can be improved greatly (and it's useless with CRT anyway).

Anyway, mission accomplished. An embedded system that is 20 times slower should be able to accept just under 5 1024 bit connections per second. Just imagine how much faster the above would be with assembler (I'm predicting gains of over 100%).

It's been quite a ride. So now I can go back to improving the client functionality...

15th January

Since the makefiles have been adjusted to run under Win32, the bigint performance tests now run. So here are the results (using Barrett reduction, and no Karatsuba):

 Algorithm 512 bits 1024 bits 2048 bits 4096 bits Barrett 1 6 37 261

Ok, as much as I'd hate to admit it, this test  shows how much more efficient the Microsoft compiler it is. In fact it's about 30% better.