Notes from a frustrated programmer

December 25

Ok, it's Christmas Day and I've just implemented Montgomery Reduction in the bigint library. So what the hell am I doing, coding on Christmas Day? Actually I think I lead a fairly balanced life, but this crypto stuff is driving me crazy.

I was planning to add lots of cool features, like improving/fixing the client side of the library. But once you start playing with bigint stuff and trying to improve the performance, the saga begins.

Take the function:

x = d^e mod m

This is one of the key operations in RSA public key cryptography. A public key encryption takes next to no time, due to the exponent being 0x10001 - a very small number compared to the huge exponents used when doing private key decryption. An SSL server does lots of these private decryptions, so it's one of the key performance areas.

Most of my algorithms came from the Handbook of Applied Cryptography, Ch. 14 (1996), which is a bit of a bible on the topic.

All the exponentiation techniques consist of converting an exponentiation into a number of multiplications (including squarings). So x^5 is x^2 * x^2 * x^1, which consists of 2 squarings and 1 multiply. I'm using right-to-left exponentiation for the Classical/Barrett algorithms and left-to-right exponentiation for the Montgomery algorithm.
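
To make that concrete, here is the basic right-to-left "repeated square and multiply" on single 64-bit words. It's only an illustration - the names are mine, the library works on multi-word bigints, and mulmod() leans on the __uint128_t extension in GCC/Clang:

    #include <stdint.h>
    #include <stdio.h>

    /* modular multiply via a 128-bit intermediate (GCC/Clang extension) */
    static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m)
    {
        return (uint64_t)((__uint128_t)a * b % m);
    }

    /* right-to-left repeated square and multiply: x^e mod m */
    static uint64_t modexp(uint64_t x, uint64_t e, uint64_t m)
    {
        uint64_t result = 1;
        x %= m;
        while (e) {
            if (e & 1)                      /* multiply when the bit is set */
                result = mulmod(result, x, m);
            x = mulmod(x, x, m);            /* square once per exponent bit */
            e >>= 1;
        }
        return result;
    }

    int main(void)
    {
        /* x^5 = x^2 * x^2 * x : two squarings and one multiply */
        printf("%llu\n", (unsigned long long)modexp(7, 5, 1000003));   /* prints 16807 */
        return 0;
    }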

I've looked at using addition chains, but there is no simple algorithm for working them out. They work on the fact that an exponentiation can reuse previous results to calculate successive ones. So x^12 can be computed via the chain x^1, x^2, x^3, x^6, x^12, which results in 4 multiplications instead of 6. So there is possibly a 33% improvement here, which is probably as good as it will get (there is no deterministic algorithm to work out the exact improvement). This technique has quite high memory requirements due to the storage of the intermediate results.
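
As a tiny illustration (plain machine words, nothing from the library), here is that chain for x^12:

    #include <stdint.h>
    #include <stdio.h>

    /* the addition chain 1, 2, 3, 6, 12: each exponent is the sum of two
       earlier ones, so each step is a single multiplication */
    int main(void)
    {
        uint64_t x   = 3;
        uint64_t x2  = x  * x;     /* x^2            */
        uint64_t x3  = x2 * x;     /* x^3 = x^2 * x  */
        uint64_t x6  = x3 * x3;    /* x^6            */
        uint64_t x12 = x6 * x6;    /* x^12           */
        printf("%llu\n", (unsigned long long)x12);   /* 3^12 = 531441 */
        return 0;
    }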

I'll also look at the sliding window algorithm which the GMP library uses. Since they get spectacular results, I'll look into this one soon. But for the time being it's the basic "repeated square and multiply" algorithms.
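
For the record, the left-to-right sliding window idea goes roughly like this - sketched on single 64-bit words with a fixed 4-bit window. The names, the window size and mulmod() are my own illustration, not GMP's code or the library's:

    #include <stdint.h>
    #include <stdio.h>

    #define K 4                                   /* window size in bits */

    static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m)
    {
        return (uint64_t)((__uint128_t)a * b % m);  /* GCC/Clang extension */
    }

    /* left-to-right sliding-window exponentiation: g^e mod m */
    static uint64_t modexp_sliding(uint64_t g, uint64_t e, uint64_t m)
    {
        uint64_t odd[1 << (K - 1)];               /* odd[i] = g^(2i+1) mod m */
        uint64_t g2 = mulmod(g, g, m);
        odd[0] = g % m;
        for (int i = 1; i < (1 << (K - 1)); i++)  /* precompute the odd powers */
            odd[i] = mulmod(odd[i - 1], g2, m);

        uint64_t result = 1;
        int i = 63;
        while (i >= 0 && !((e >> i) & 1))         /* skip leading zero bits */
            i--;

        while (i >= 0) {
            if (!((e >> i) & 1)) {                /* zero bit: just square */
                result = mulmod(result, result, m);
                i--;
            } else {
                /* longest window of at most K bits that ends in a 1 bit */
                int l = (i - K + 1 > 0) ? i - K + 1 : 0;
                while (!((e >> l) & 1))
                    l++;
                uint64_t window = (e >> l) & ((1ULL << (i - l + 1)) - 1);
                for (int j = 0; j < i - l + 1; j++)
                    result = mulmod(result, result, m);
                result = mulmod(result, odd[window >> 1], m);
                i = l - 1;
            }
        }
        return result;
    }

    int main(void)
    {
        printf("%llu\n", (unsigned long long)modexp_sliding(2, 10, 1000000));  /* 1024 */
        return 0;
    }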

So now to the modular reduction functions. I won't go into how these algorithms work - they are all fairly well known. But before I go on, I should mention that I am also using Karatsuba multiplication, which for large numbers results in some savings. Its main premise is to turn 4 multiplications into 3. The documentation says that instead of O(n^2) you get O(n^1.585), so when you are talking about a large n, this could mean a huge saving. But what they don't tell you about is all the extra additions/subtractions that you need (7 in all). Fortunately additions/subtractions are linear, i.e. O(n), so the saving kicks in for the large numbers.
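
To show where the "3 multiplications instead of 4" comes from, here is the Karatsuba identity applied to a single 64x64 bit multiply. This is only an illustration - the real saving comes from applying the identity recursively to bigint halves, where the multiplies are much more expensive than the extra additions:

    #include <stdint.h>
    #include <stdio.h>

    /* split x and y into 32-bit halves and multiply with 3 multiplications
       instead of the schoolbook 4, using
       (x1+x0)(y1+y0) - x1*y1 - x0*y0 = x1*y0 + x0*y1 */
    static __uint128_t karatsuba64(uint64_t x, uint64_t y)
    {
        uint64_t x1 = x >> 32, x0 = x & 0xffffffffULL;   /* x = x1*2^32 + x0 */
        uint64_t y1 = y >> 32, y0 = y & 0xffffffffULL;

        uint64_t z2 = x1 * y1;                           /* high product */
        uint64_t z0 = x0 * y0;                           /* low product  */
        __uint128_t mid = (__uint128_t)(x1 + x0) * (y1 + y0) - z2 - z0;

        return ((__uint128_t)z2 << 64) + (mid << 32) + z0;
    }

    int main(void)
    {
        uint64_t x = 0x123456789abcdef0ULL, y = 0xfedcba9876543210ULL;
        __uint128_t k = karatsuba64(x, y), d = (__uint128_t)x * y;
        printf("%s\n", k == d ? "match" : "mismatch");   /* should print "match" */
        return 0;
    }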

So here are some times for the various algorithms for 512/1024/2048/4096 bit moduli. The times were taken on an AMD 3400+ 64 bit machine running Fedora Core 3 (32 bit) Linux. I've also included the times from GMP as a reference. Any pre-calculated values were not included in the timings (such as µ in Barrett reduction). All times are in milliseconds.

Algorithm                  512 bits   1024 bits   2048 bits   4096 bits
Classical                        15          74         481        3385
Classical (no Karatsuba)         15          74         502        3779
Barrett                          10          56         384        2753
Barrett (no Karatsuba)           11          57         405        3133
Montgomery                       10          72         522        3954
GMP                             1.5         5.6        34.4       234.8

So a couple of observations. Firstly, GMP kills us for performance. To be fair, they've been doing this since 1991 and have many of the key operations in assembler, so I'd expect them to be faster. But they are over 10 times faster, which I did not expect.

Karatsuba gives a benefit of around 10% for Barrett at 4096 bits, but the Classical algorithm gets very little benefit. The Karatsuba threshold has been set at 20 components, so it is only really of benefit to moduli of 2048 bits and above.

Barrett is about 20% faster than Classical for the 4096 bit modulus.

And the times for the Montgomery algorithm are significantly slower than Classical. Now, much of the documentation (e.g. see here) says that Montgomery should be much faster, which means I'll have to go through the code line by line and work out what is going on. Stay tuned.
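
For anyone unfamiliar with it, here is roughly what the Montgomery reduction step looks like at the single-word level. This is only a sketch with made-up names - it assumes an odd modulus below 2^63 and the __uint128_t extension in GCC/Clang, whereas the real code does the same steps word by word on multi-word bigints:

    #include <stdint.h>
    #include <stdio.h>

    /* n^-1 mod 2^64 by Newton iteration (n must be odd) */
    static uint64_t inv64(uint64_t n)
    {
        uint64_t x = n;                 /* correct to 3 bits for odd n */
        for (int i = 0; i < 5; i++)
            x *= 2 - n * x;             /* each step doubles the correct bits */
        return x;
    }

    /* REDC with R = 2^64: for T < n * 2^64, return T * 2^-64 mod n */
    static uint64_t redc(__uint128_t T, uint64_t n, uint64_t nprime)
    {
        uint64_t m = (uint64_t)T * nprime;               /* T * (-n^-1) mod 2^64 */
        __uint128_t t = (T + (__uint128_t)m * n) >> 64;  /* low 64 bits cancel   */
        return (uint64_t)(t >= n ? t - n : t);
    }

    int main(void)
    {
        uint64_t n = 0x7fffffffffffffe7ULL;        /* an arbitrary odd modulus < 2^63 */
        uint64_t nprime = (uint64_t)0 - inv64(n);  /* -n^-1 mod 2^64 */
        uint64_t r2 = (uint64_t)(((__uint128_t)1 << 64) % n);
        r2 = (uint64_t)((__uint128_t)r2 * r2 % n); /* R^2 mod n */

        uint64_t a = 123456789, b = 987654321;
        uint64_t abar = redc((__uint128_t)a * r2, n, nprime);  /* to Montgomery form  */
        uint64_t bbar = redc((__uint128_t)b * r2, n, nprime);
        uint64_t cbar = redc((__uint128_t)abar * bbar, n, nprime);
        uint64_t c    = redc(cbar, n, nprime);                 /* back to normal form */

        printf("%s\n", c == (uint64_t)((__uint128_t)a * b % n) ? "match" : "mismatch");
        return 0;
    }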

Now, since I've got a 64 bit processor, I tried the above on FC3 (64 bit) with no code changes, and got the following:

Algorithm     512 bits   1024 bits   2048 bits   4096 bits
Classical           15          72         260         977
Barrett              9          54         215         981
Montgomery           4          26         300        1155
GMP                1.2         7.5        27.9       129.4

It doesn't take much to realize that a 64 bit system is much, much faster - many of the figures are about 1/3 of the corresponding 32 bit versions. And the GMP times are again amazing. But I did notice the test code go up by about 10-20% in size.

I'm going to have to go over the mathematics of the above and see if I'm doing the correct number of operations. Anyway, my aim is to be within 2 times the speed of GMP, and then I'll stop. But it's been an interesting adventure. Stay tuned.

27th December

I've done some performance tweaks and some profiling to reveal what's going on. I've changed Montgomery to take advantage of Karatsuba multiplication (which also makes the comparison between the 3 algorithms fairer). So firstly the new times:

Algorithm                    512 bits   1024 bits   2048 bits   4096 bits
Barrett                            10          51         333        2355
Montgomery                         14          75         521        3785
Classical (no real change)         14          75         482        3390

This is a 400ms improvement for Barrett 4096 and almost a 200ms improvement for Montgomery 4096 - and just from changing a couple of lines of code.

But Barrett still kills Montgomery for performance. I then did some profiling using gprof on a single encryption/decryption with a 4096 bit modulus, which revealed the following:

Barrett 4096

Number   Function           Time
 12328   partial_multiply   22.1
166376   regular_multiply   9.89
320424   bi_add             2.95
166716   bi_subtract        1.30

Classical 4096

Number   Function           Time
944948   bi_subtract        14.63
788737   bi_int_multiply    13.92
166376   regular_multiply   9.61
320424   bi_add             2.67

Montgomery 4096

 Number   Function           Time
1110496   bi_add             29.88
 789890   bi_int_multiply    13.58
 166536   regular_multiply   10.03
 950130   comp_left_shift    6.53

So a couple of observations. Firstly, a single precision addition/subtraction is as expensive as a single precision multiply (see bi_add() and bi_int_multiply() in Montgomery 4096). This goes to show how fast the ALUs of today's CPUs are. To be fair, I don't have access to the carry/overflow bit in C like I would in assembler, and that makes a big difference (I estimate a possible 20-30% improvement here). An assembler version of bi_add() would greatly improve Montgomery, but a corresponding assembler version of bi_subtract() would also greatly improve Classical (to the extent that I think they would both match Barrett - or at least be close).
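
To make that concrete, here is a sketch of what a portable multi-word add has to do in C. The component type and the names are assumptions, and the real bi_add() will differ in detail, but the shape is the same: every word needs explicit carry computation, where assembler would just chain add-with-carry instructions:

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    typedef uint32_t comp;               /* one component/word of a bigint (assumed) */

    /* r = a + b over n words, returning the final carry */
    static comp add_words(comp *r, const comp *a, const comp *b, size_t n)
    {
        comp carry = 0;
        for (size_t i = 0; i < n; i++) {
            comp s = a[i] + b[i];
            comp c1 = s < a[i];          /* carry out of a[i] + b[i]          */
            r[i] = s + carry;
            comp c2 = r[i] < s;          /* carry out of adding the old carry */
            carry = c1 | c2;             /* at most one of the two can be set */
        }
        return carry;
    }

    int main(void)
    {
        comp a[2] = { 0xffffffffu, 0x00000001u };   /* 0x1ffffffff */
        comp b[2] = { 0x00000001u, 0x00000000u };   /* 1           */
        comp r[2];
        comp carry = add_words(r, a, b, 2);
        /* prints 0000000200000000 carry 0 */
        printf("%08x%08x carry %u\n", (unsigned)r[1], (unsigned)r[0], (unsigned)carry);
        return 0;
    }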

Secondly, the literature out there (like the comparison of the 3 algorithms here, and even the Handbook of Applied Cryptography) is incorrect, in that it bases the relative performance of the algorithms on the number of multiplies. That was probably ok 10-20 years ago when these documents were written, but it's not applicable now.

I'm going out on a limb here, but I think that Montgomery has had its day in software crypto implementations (in hardware, where addition is extremely fast, it may be a different story - likewise for implementations written in assembler).

Finally, as I've found, 10% gains are possible by subtly changing a single line of code. So all of the above is very dependent on the implementation and the compiler. But I think I'm in the ballpark.

I'm now going to implement bi_square(), which should improve all the algorithms by about 5-10% (I've got to sort out an overflow issue first), and then sliding window exponentiation, for which I'm expecting big improvements. I'm also going to check GMP's basic functions in C to see if I can get any more ideas.
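
The reason a dedicated squaring routine helps is that in x*x the cross products appear twice, so they only need to be computed once and doubled, saving close to half the single-precision multiplies. A toy sketch with made-up names and 16-bit components (not the eventual bi_square()):

    #include <stdint.h>
    #include <stdio.h>

    #define N 4                              /* number of 16-bit components */

    /* square an N-component number into 2N components; 64-bit column
       accumulators keep the carry handling trivial for this sketch */
    static void square_sketch(const uint16_t a[N], uint16_t r[2 * N])
    {
        uint64_t col[2 * N] = {0};

        for (int i = 0; i < N; i++) {
            col[2 * i] += (uint64_t)a[i] * a[i];          /* diagonal terms      */
            for (int j = i + 1; j < N; j++)
                col[i + j] += 2 * (uint64_t)a[i] * a[j];  /* doubled cross terms */
        }

        uint64_t carry = 0;                  /* propagate carries once at the end */
        for (int k = 0; k < 2 * N; k++) {
            uint64_t v = col[k] + carry;
            r[k] = (uint16_t)v;
            carry = v >> 16;
        }
    }

    int main(void)
    {
        uint16_t a[N] = { 0xffff, 0xffff, 0xffff, 0xffff };   /* a = 2^64 - 1 */
        uint16_t r[2 * N];
        square_sketch(a, r);
        for (int k = 2 * N - 1; k >= 0; k--)
            printf("%04x", r[k]);            /* fffffffffffffffe0000000000000001 */
        printf("\n");
        return 0;
    }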

31st December

New Year's Eve. I've added bi_square(), implemented sliding window exponentiation and done several tweaks. So here are the new times:

Algorithm     512 bits   1024 bits   2048 bits   4096 bits
Barrett              8          23         159        1125
Montgomery          10          46         329        2420
Classical           11          46         312        2210

So that's a gain of more than 50% over the last few days (ok, I did go on holidays for some of those days). Montgomery is still slower than Classical, but at one stage I had Montgomery as the faster of the two (when I tweaked bi_add()). It still goes to show how implementation dependent this all is. The new code has meant a 10% increase in size (around 5kB, to 55kB now for the web server).

I don't think I can go much faster in C. I think huge savings can be made in assembler - I'm predicting that's where GMP gets most of its performance. They are still about 4 times faster, but I'm getting closer. I'll have a look and see what else I can tweak.

3rd January

The new year must have sparked some inspiration - I've made some huge performance gains. Most of it came from implementing CRT (the Chinese Remainder Theorem). I've also made all the algorithms configurable to see their effect on performance and code size.

Just have a look at the following:

Algorithm              512 bits   1024 bits   2048 bits   4096 bits
Barrett                       5          12          50         349
Classical/Montgomery          6          19          84         548

Just compare this with the previous table! The objective has now been reached - we are within 2 times the speed of GMP, and without a single line of assembler (ok, to be fair, GMP does not use CRT, and I'd hate to see how fast it would be if it did). Unfortunately Montgomery doesn't work well with CRT, so it reverts to Classical when CRT is used.
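
For anyone following along, CRT replaces one full-size private exponentiation modulo n with two half-size ones modulo p and q plus a cheap recombination. A sketch with the classic toy RSA numbers p = 61, q = 53, e = 17 (made-up example values, nothing from the library, and modexp() here is just the plain square-and-multiply):

    #include <stdint.h>
    #include <stdio.h>

    /* plain square-and-multiply, fine for these tiny numbers */
    static uint64_t modexp(uint64_t x, uint64_t e, uint64_t m)
    {
        uint64_t r = 1;
        x %= m;
        while (e) {
            if (e & 1) r = (r * x) % m;
            x = (x * x) % m;
            e >>= 1;
        }
        return r;
    }

    int main(void)
    {
        uint64_t p = 61, q = 53, n = p * q;            /* n = 3233                      */
        uint64_t e = 17, d = 2753;                     /* 17*2753 = 15*3120 + 1         */
        uint64_t dP = d % (p - 1), dQ = d % (q - 1);   /* CRT exponents                 */
        uint64_t qInv = 38;                            /* q^-1 mod p: 53*38 = 33*61 + 1 */

        uint64_t c = modexp(123, e, n);                /* encrypt the message 123       */

        uint64_t m1 = modexp(c, dP, p);                /* half-size exponentiation mod p */
        uint64_t m2 = modexp(c, dQ, q);                /* half-size exponentiation mod q */
        uint64_t h  = (qInv * ((m1 + p - m2) % p)) % p;
        uint64_t m  = m2 + h * q;                      /* recombine: m = m2 + h*q        */

        printf("recovered %llu\n", (unsigned long long)m);   /* prints "recovered 123"  */
        return 0;
    }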

Here is a table that shows each feature (or the lack thereof), along with the code size of the performance tester binary, which should give an idea of the changes in code size. All the disabled-feature rows use Barrett reduction.

Configuration       512 bits   1024 bits   2048 bits   4096 bits   Code size (bytes)
Barrett                    5          12          50         349               40819
Classical                  6          19          84         548               38135
Montgomery                 6          19          84         548               40783
No CRT                     8          27         167        1163               40796
No Karatsuba               5          12          54         367               38073
No Sliding Window          7          22         119         788               40819
No Square                  6          13          62         404               39423
All features off           9          43         303        2327               36776

Karatsuba only really pays off for 4096 bit keys and is only marginally effective on 2048 bit keys. Since this is an embedded toolkit, the feature is disabled by default (which saves 2kB). Barrett also uses an extra 2kB compared to Classical, but it gives more than a third in performance gain, so it is worth it.

CRT gives a 3.3x improvement factor, close to the theoretical factor of 4 (the decryption becomes two exponentiations with half-size moduli and half-size exponents, each costing roughly an eighth of the original). Sliding-window exponentiation gives an improvement factor of 2.2x, and both features are very cheap in terms of code size. Squaring gives an improvement factor of 1.15x.

Montgomery can be ignored unless bigint addition can be improved greatly (and it's useless with CRT anyway).

Anyway, mission accomplished. An embedded system that is 20 times slower than this box should still be able to accept just under five 1024 bit connections per second. Just imagine how much faster the above would be with assembler (I'm predicting gains of over 100%).

It's been quite a ride. So now I can go back to improving the client functionality...

15th January

Since the makefiles have been adjusted to run under Win32, the bigint performance tests now run. So here are the results (using Barrett reduction, and no Karatsuba):

Algorithm   512 bits   1024 bits   2048 bits   4096 bits
Barrett            1           6          37         261

Ok, as much as I hate to admit it, this test shows how much more efficient the Microsoft compiler is. In fact it's about 30% better.
