AES Algorithm Efficiency

This page gives the results I have obtained in implementing AES candidate algorithms from an efficiency perspective.

Results

The figures given in the table below are in clock cycles for key set-up and clock cycles per block for encryption and decryption. The rows giving speed figures provide encryption and decryption speeds (and their average) for 128 bit keys on the 200 MHz Pentium Pro reference platform without any I/O byte/word/block inversion costs. The key set-up timings are for computing both the encryption and decryption key schedules. For Rijndael and CRYPTON the cost of the key schedule for decryption is much higher than that for encryption so two figures are given for these algorithms (the encryption and decryption key schedule cycle counts respectively). The encryption and decryption speeds for 128 bit keys and full input/output byte order conversion are provided below the main figures where these are different (note, however, that these are not fair for comparison purposes since many of the AES algorithms chose arbitrary byte order conventions).

Here (as a result of many requests) are the source code files used to obtain these results:

For comparison DES in C (des.h, des.c and key.c), optimised for the Pentium Pro, requires 456 cycles for each 64 bit data block, corresponding to a speed of 28 Mbits/second on this platform running at 200 Mhz (faster than many AES design teams appear to believe).

Algorithm

RC6

Rijndael

MARS

Twofish

CRYPTON

CAST

Serpent

DFC

HPC

SAFER

LOKI

FROG

DEAL

MAGENTA

Source Code

Key Set-up (128)

1632

305:1389

4316

8414

531:1369

744:1270

4333

9473

2402

5222

120749

4278

7430

1416182

8635

Key Set-up (192)

1885

277:1595

4377

11628

539:1381

748:1284

4342

9540

2449

5203

120754

7426

7303

1422837

8653

Key Set-up (256)

1877

374:1960

4340

15457

552:1392

784:1323

4325

9913

2349

5177

120731

11313

7166

1423613

11698

Encrypt (128)

270

374

369

376

474

476

633

687

952

1203

1429

1722

2134

2417

2339

6539

Encrypt (192)

267

439

373

376

473

469

633

696

952

1288

1477

2555

2138

2433

2358

6531

Encrypt (256)

270

502

369

381

469

470

639

691

952

1178

1462

3391

2131

2440

3115

8711

Decrypt (128)

226

352

376

374

474

470

634

691

914

1244

1599

1709

2192

2227

2365

6534

Decrypt (192)

235

425

379

374

470

633

693

914

1235

1599

2530

2189

2255

2363

6528

Decrypt (256)

227

500

376

374

483

469

638

706

914

1226

1526

3338

2184

2240

3102

8705

Encryption speed (128)

94.8

68.4

69.4

68.1

54.1

53.8

40.4

37.3

26.9

21.3

17.9

14.9

12.0

10.6

10.9

3.9

Decryption speed (128)

113.3

72.7

68.1

68.4

54.1

54.5

40.4

37.0

28.0

20.6

16.0

15.0

11.7

11.5

10.8

3.9

Mean speed (128)

103.2

70.2

68.7

68.3

54.1

40.4

37.2

27.4

20.9

16.9

14.9

11.8

11.0

10.9

3.9

Encryption speed (128, full)

37.8

35.1

26.0

20.5

17.3

14.6

11.8

10.4

Decryption speed (128, full)

38.3

35.4

26.8

20.9

16.0

14.8

11.5

11.2

Min. Secure Rounds

11?

>36

>10

Min. Secure Cycles

245

344

240

300

440

550

590

483

1353

1466

1498

4840

2500

2550

11000

Diffusion, 1 round

Diffusion, 2 rounds

Overall Diffusion Count

420

332

370

388

288

495

468

940

192

512

468

328

376

147

Comments

1. The order of the algorithms in the table is based on mean encryption/decryption speed for 128 bit keys.

2. Different designers have chosen a different number of rounds for their algorithms and this clearly impacts in a major way on performance. This is illustrated by Serpent in particular, which has a much higher diffusion count than other algorithms.

3. Both DFC and HPC make use of 64 bit arithmetic. Since this will be the norm by the time the AES algorithm is selected it makes little sense to judge the speed of these algorithms on 32 bit architectures.

4. I suspect that there is quite a bit that can still be done for E2 and SAFER+ since I have not spent a great deal of time on these. I am grateful for suggestions from Kazumaro Aoki of NTT Japan for speeding up my E2 implementation.

5. For TWOFISH I have pushed encryption/decryption speed without concern for the impact on the cost of the key schedule or memory use (I employ 8K+ bytes of table space). There are many optimisation alternatives for this algorithm and major reductions in key schedule and memory cost can be achieved if high volume encryption/decryption speed is not the primary concern. My thanks to Niels Ferguson for suggesting some useful speed ups for my original implementation.

6. Where an algorithm meets a superset of the AES requirements I have only implemented the AES compliant elements. Hence Rijndael is only implemented for a block size of 128 bits and for HPC only HPC(medium) has been coded.

7. For Crypton I have implemented both the version submitted for AES and the revised version 1 (issued on December 21st 1998).

8. The DEAL timings are based on the DES source code in which the IP/FP permutations have been implemented in DEAL and not in DES (as described in the DEAL specification).

9. I have implemented Loki97, Frog and MAGENTA for completeness although it appears that they have all been broken.

Security versus Performance Issues (HEALTH WARNINGS)

The work presented here is primarily about performance but it is not sensible to consider this in complete isolation from security since there are inevitably trade-offs between these attributes in the real world. One obvious example is the number of rounds employed by each algorithm and another is the quality of each round in achieving the 'mixing' that the cipher is intended to provide. The lower rows in the table are a first attempt to consider the ways in which security and performance are being traded off by each of the AES algorithms.

Diffusion. The lower rows dealing with diffusion should not be seen as a direct measure of algorithm security. They simply give an indication of how good the algorithms are at spreading out the influence of input bits on output bits. Obviously, diffusion counts say nothing about the 'quality' of the diffusion achieved and this alone means that this sort of analysis needs to be treated with extreme caution.

Minimum Number of Secure Rounds. Eli Biham has suggested that AES algorithm performance should be measured by timing the 'minimum number of secure rounds' for each candidate - that is the estimated number of rounds needed to make a brute force key search the most efficient form of attack. This is a controversial suggestion that some AES participants do not accept even in principle (for example, see the paper by Bruce Schneier and the Twofish team). In my view, however, we have to look at this in two steps - it seems to me that the principle has merit but that there is (and may continue to be) a problem in obtaining impartial and widely accepted values for the minimum number of secure rounds for each of the AES algorithms.

At the moment the table uses figures equivalent to those suggested by Eli Biham. However as an AES algorithm contributor, Eli will not been seen by all as an impartial observer and it would hence be preferable to have a wider community basis for setting the values to be used in such a process. For the present therefore it is important to treat these figures (and any conclusions derived from them) with caution.

Implementation Assurance. It seems certain that there will be a significant number of the AES algorithms that will be judged to be secure at an algorithm level. In practice, however, most failures in cryptographic systems derive not from weaknesses in the algorithms used but rather from the exploitation of subtle flaws in the way the algorithms are implemented or through the exploitation of interactions between algorithm implementations and the environments in which they operate. This means that it is vital to judge implementations not simply from the perspective of the performance that they provide but also from the perspective of 'implementation assurance' - can we be confident that the implementation does what the algorithm requires, no more and no less, and can we also be confident that it does not interact with the wider environment in ways that can be exploited to undermine the security that the algorithm itself is otherwise capable of providing.

To answer these questions we need much more hard evidence derived from the implementation of the AES candidates in environments that are representative of practical use. In particular we have very little evidence at the moment of how the AES candidates behave in low end environments of the kind that may apply in much of electronic commerce and consumer applications.

We also need to understand whether the different AES algorithms are architecturally different in their ability to support implementation assurance techniques. None of the existing specifications are precise enough to support formal design and analysis methods and it even seems possible that at least some of the algorithms on offer are sufficiently complex to make these techniques difficult or even impossible to apply. This may also be true of other approaches for implementation assurance. This is an area where much more work is needed before any final algorithm choices are made.

Performance Measurement Approach

All the implementations provided and discussed here have been coded from scratch using the AES specification documents without prior reference to the source code provided by the design teams. However, comparisons with the reference code has sometimes been necessary because the specification and reference code for some algorithms are not consistent.

I have tried to keep a fairly consistent style across the source code to aid fair comparison. All the routines have been biased towards encryption and decryption speed without regard to key set-up or to memory use in tables. I have also used VTune, the Intel performance tuning software, to look for optimised code sequences. This was kindly supplied by Intel and I much appreciate their generous support. The code has been optimised for the Pentium Pro/II and it is hence likely to be relatively poor on the Pentium. Although I have pushed for speed, I have not used some techniques where the resulting source code has become especially obscure.

All the algorithms and results reported are based on implementation using Microsoft Visual C++ version 6. I have generally coded in standard C but I have used non-standard features where this is reasonable and would be done in practice. For example I have used the intrinsic circular rotate operations that Microsoft C provides. Some of the algorithms are endian neutral, others are definitively either big-endian or little-endian. Most of the difficulties I have had with implementation relate to confusion about endianness. I am running on a little endian machine (a Pentium II) and I have checked out the code in this context only. My inputs and outputs to algorithms are in 32 bit words, not bytes, and this impacts on the byte numbering conventions that are used. The algorithm source code provided below contains code to invert byte order on input and output to match the standard test values but timings do not include these sections of code.

Please bear in mind that there may be mistakes in the source code - please let me know if you find any. If you use any of it, be especially careful about endianess since I have not designed to eliminate such problems.

I am happy for the source code given here to be used provided that any conditions set by the algorithm designers are complied with. My only additional requirement if you do use it is that you acknowledge its origin. At the request of NTT, E2 source code is available for academic study purposes only.

Timing

Obtaining accurate and repeatable timings has proved more difficult than I expected. After many experiments I have settled on the use of the Pentium Pro/II cycle counter as the basis for timing. For each of the three routines in question (key set-up, encrypt and decrypt) I time two code sequences containing respectively single and double calls to the routines in question (with random key and data inputs). After running the routines to eliminate initialisation effects I then time these two sequences 100 times each and take the difference between their minimum timings as the cycle count for the routine in question. My thanks to Doug Whiting for a useful exchange of views on timing issues.

Fair Algorithm Comparison - Timing and Byte Order

There is scope for confusion in the definition of the AES interface on whether any required changes to byte order lie are internal or external to this interface. The programming interface is byte oriented and suggests that byte order changes are not needed. But the test vectors are 'numeric' and this suggests that such changes are needed as a part of the AES algorithms. I cover this in more detail here.

A consequence of this uncertainty is that some design teams have implemented byte order inversion for input and output blocks whereas others have not. Since byte order change involves processing costs this can impact on performance comparisons.

I am running on a little-endian machine (Pentium II) and this means that algorithms which involve byte, word or block order inversion ('big-endian' algorithms) are at a disadvantage if their speed is compared with 'little-endian' algorithms in my environment. Equally if I were using a big-endian machine it would be 'little-endian' AES algorithms would be put at a disadvantage. In order to obtain fair comparisons I have therefore chosen to base my timings on those that can be achieved when an algorithm is run on a machine that matches its own endianness. In practice this means that I do not include the processing cost of byte, word or block order inversions on input and output in my timing measurements. For some algorithms there are more deeply buried endian issues but, fortunately, these turn out to be small in comparison with the costs involved in input/output byte order changes. I hence believe that the timings given here are a fair assessment of what the AES algorithms can achieve when run in environments that match their own design endianess.

Since other comparisons, for example, that being compiled by Helger Lipmaa, include results for both big and little endian processors it is important that they use full timing data including endian conversion costs (however care has to be taken to ensure that results on big and little endian machines are given equal weight). The figures I have provided for Helger are hence full timings that include I/O conversion costs.

Diffusion

I have attempted to measure how effective the algorithms are at mixing up bits by measuring their 'diffusion'. I do this as follows. After setting up random key and data input blocks I run the encryption round function for the algorithm in question once or twice. I then change just one bit in the data input block and count how many bits in the output block change. I do this for each bit in the input block and I repeat this for a large number of random inputs to obtain average values. I then calculate for each input bit how many output bits change on average; I also calculate for each output bit how many input bits change its value (these two outputs are very similar on average but they are sometimes very different for particular bits). Since some algorithms operate using 'half-rounds' I use the best of the 1 and 2 round diffusion values to calculate an overall diffusion count for each algorithm.

Acknowledgements

A number of people have contributed ideas for improvements in the source code and the analysis presented here. I gratefully acknowledge the contributions made by the following people:

Changes:	29th October	added ranking rows to table and additional comments
	31st October	corrected Twofish cycle count per round figure (thanks to Doug Whiting for finding this)
	31st October	added diffusion per round values
	3rd November	changed diffusion rows and added a health warning
	12th November	added performance estimates for the minimum number of secure rounds
	30th November	added results for HPC and DEAL
	6th December	revised (improved) RC6 speed
	11th December	revised speed figures and simplified the table layout
	13th December	added Frog and Loki97
	15th December	added MAGENTA - completed implementation of all 15 AES candidates
	19th December	added improved Serpent figures based on new S box decompositions
	19th December	corrected CAST figures (error spotted by Helger Lipmaa)
	20th December	expanded on 'endian' timing issues (at Helger Lipmaa's suggestion)
	21st December	corrected error in the rc6 implementation and changed timing figures
	23rd December	added figures for the revised version of Crypton
	3rd January 1999	improved implementations of RC6 and MARS
	13th January 1999	corrected an error in MARS spotted by Louis Granboulan
	14th January 1999	published final (pre-AES2) timing data (including improved Twofish figures)
	20th January 1999	updates to timings (improved DEAL, DES and Rijndael)
	20th January 1999	changed page format to make results easier to access
	22nd January 1999	improved timings for Twofish key setup
	23rd January 1999	minor updates to timing data and added figures for 'full cost' encryption
	28th January 1999	added text on the (as yet) inadequately discussed issue of implementation assurance
	16th February 1999	improved DFC speed