Speed of AES Implementation

Question

I have written an C implementation of AES and have tried to make it as fast as possible (Im just starting out in Programming and have training in IT). I have achieved an Speed increase of around 600% so far but its still awfully slow. To Compare my AES-Implementation with something i have used the "openssl speed" command in the Linux-Terminal. In 3 seconds this implementation encrypts around 36 977 043 blocks (16byte). I am ~25 times slower (at 72 seconds for the 36... bytes) than that which kinda sucks. Im curious about 2 things.

What would be a good goal to achieve, how fast is a realistic goal to aim at.
Why is my Code so slow, and how can i change that.

To my code: I have tried to leave out on some of my functions so see how much faster the code gets without them. The full code took 72 seconds.

Without Mixcolumns 14 seconds #here is a big problem
Without Shiftrows 67 seconds
Without Subbytes 61 seconds

My encryption function:

uint32_t * encrypt(uint32_t * expkey,uint32_t state[4]){ uint32_t temp[4]; state[0] = state[0] ^ expkey[0]; state[1] = state[1] ^ expkey[1]; state[2] = state[2] ^ expkey[2]; state[3] = state[3] ^ expkey[3]; for(int round = 1; round < Nr; round++){ // Subbytes for (int c = 0; c < 4;c++){ temp[c] = ((sbox[state[c] >> 24 & 0xFF]) << 24 ) + ((sbox[state[c] >> 16 & 0xFF]) << 16 ) + ((sbox[state[c] >> 8 & 0xFF]) << 8 ) + (sbox[state[c] & 0xFF]); } // Shiftrows state[0] = (((temp[0] >> 24) & 0xFF) << 24) + (((temp[1] >> 16) & 0xFF) << 16) + (((temp[2] >> 8) & 0xFF) << 8) + (temp[3] & 0xFF); state[1] = (((temp[1] >> 24) & 0xFF) << 24) + (((temp[2] >> 16) & 0xFF) << 16) + (((temp[3] >> 8) & 0xFF) << 8) + (temp[0] & 0xFF); state[2] = (((temp[2] >> 24) & 0xFF) << 24) + (((temp[3] >> 16) & 0xFF) << 16) + (((temp[0] >> 8) & 0xFF) << 8) + (temp[1] & 0xFF); state[3] = (((temp[3] >> 24) & 0xFF) << 24) + (((temp[0] >> 16) & 0xFF) << 16) + (((temp[1] >> 8) & 0xFF) << 8) + (temp[2] & 0xFF); // Mixcolums for (int c = 0; c < 4;c++){ state[c] = ((xtime((state[c] >> 24) & 0xFF) ^ xtime3((state[c] >> 16) & 0xFF) ^ ((state[c] >> 8) & 0xFF) ^ (state[c] & 0xFF)) << 24) + ((((state[c] >> 24) & 0xFF) ^ xtime((state[c] >> 16) & 0xFF) ^ xtime3((state[c] >> 8) & 0xFF) ^ (state[c] & 0xFF)) << 16) + ((((state[c] >> 24) & 0xFF) ^ ((state[c] >> 16) & 0xFF) ^ xtime((state[c] >> 8) & 0xFF) ^ xtime3(state[c] & 0xFF)) << 8 ) + (xtime3((state[c] >> 24) & 0xFF) ^ ((state[c] >> 16) & 0xFF) ^ ((state[c] >> 8) & 0xFF) ^ xtime(state[c] & 0xFF)); } // Add Key state[0] = state[0] ^ expkey[round * 4]; state[1] = state[1] ^ expkey[round * 4 + 1]; state[2] = state[2] ^ expkey[round * 4 + 2]; state[3] = state[3] ^ expkey[round * 4 + 3]; } // Last Subbytes for (int c = 0; c < 4;c++){ temp[c] = ((sbox[state[c] >> 24 & 0xFF]) << 24 ) + ((sbox[state[c] >> 16 & 0xFF]) << 16 ) + ((sbox[state[c] >> 8 & 0xFF]) << 8 ) + (sbox[state[c] & 0xFF]); } */ // Last Shiftrow state[0] = (((temp[0] >> 24) & 0xFF) << 24) + (((temp[1] >> 16) & 0xFF) << 16) + (((temp[2] >> 8) & 0xFF) << 8) + (temp[3] & 0xFF); state[1] = (((temp[1] >> 24) & 0xFF) << 24) + (((temp[2] >> 16) & 0xFF) << 16) + (((temp[3] >> 8) & 0xFF) << 8) + (temp[0] & 0xFF); state[2] = (((temp[2] >> 24) & 0xFF) << 24) + (((temp[3] >> 16) & 0xFF) << 16) + (((temp[0] >> 8) & 0xFF) << 8) + (temp[1] & 0xFF); state[3] = (((temp[3] >> 24) & 0xFF) << 24) + (((temp[0] >> 16) & 0xFF) << 16) + (((temp[1] >> 8) & 0xFF) << 8) + (temp[2] & 0xFF); // Last Add Key state[0] = state[0] ^ expkey[Nr * 4]; state[1] = state[1] ^ expkey[Nr * 4 + 1]; state[2] = state[2] ^ expkey[Nr * 4 + 2]; state[3] = state[3] ^ expkey[Nr * 4 + 3]; return state; }

And the xtime function:

uint8_t xtime(uint8_t x){ return (x << 1) ^ (0x11b & -(x >> 7)); }

I am looking forward to all tips tricks and improvements.

Questions seeking feedback on working code, including performance improvements, would be more appropriate over at Code Review. — kaylum
– kaylum, Commented Feb 14, 2021 at 19:23
First of all, are you sure that openssl encrypted in user space? If it used AES-NI instructions that could explain why it's at least 10 times faster. — Cheatah
– Cheatah, Commented Feb 14, 2021 at 19:27
@Cheatah: Why do you think user space is relevant? The AES instructions just operate on data; they are not privileged. — Eric Postpischil
– Eric Postpischil, Commented Feb 14, 2021 at 20:07

Thom Wiggers · Accepted Answer · 2023-07-05 15:47:49Z

The OpenSSL is using the AES-NI where available.

openssl speed -evp aes-128-cbc

and outputs

 The 'numbers' are in 1000s of bytes per second processed. type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes aes-128-cbc 531549.19k 969335.21k 1045437.10k 1066826.75k 1054665.39k 1052120.41k

Since you are not using the AES-NI you need to compare it with the software version

 OPENSSL_ia32cap=”~0x200000200000000″ openssl speed -elapsed -evp aes-128-cbc

 The 'numbers' are in 1000s of bytes per second processed. type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes aes-128-cbc 143802.75k 161369.51k 165049.17k 166054.57k 166262.10k 166461.44k

if we compare the last column, you will see that AES-NI is ~6.3 times faster than the software version of the OpenSSL. This means that you are around 4 times slower than the software version.

In many cases, the compiler optimization parameters can also affect the speed, too. Look into the manual of your compiler, if you are using GCC then they are -O[0..3]

About the code;

If you look at the AES code of OpenSSL you will see that they use pre-computed tables and this is a very common technique.

The Subytes, Shiftrows and MixColums are turned into table lookup. The speed difference is these. And note that the table lookup is vulnerable to cache-timing attacks.

Not an exact answer to your problems, however, this should be in your mind.

Thom Wiggers · Accepted Answer · 2023-07-05 15:49:33Z

You have to optimize the code.

There are unnecessary shifts << and back >>. You can simply mask. Replace code like this:

= (temp[0] >> 24) & 0xFF) << 24) + (((temp[1] >> 16) & 0xFF) << 16) + (((temp[2] >> 8) & 0xFF) << 8) + (temp[3] & 0xFF);

with code like this:

= (temp[0] & 0xFF000000) + (temp[1] & 0x00FF0000) + (temp[2] & 0x0000FF00) << 8) + (temp[3] & 0xFF);

Replace this:

 ^ expkey[Nr * 4 ]; ^ expkey[Nr * 4 + 1]; ^ expkey[Nr * 4 + 2]; ^ expkey[Nr * 4 + 3];

with this:

ptr = expkey + (Nr << 2); // Shift 2 is equal to * 2 ^ ptr[0]; ^ ptr[1]; ^ ptr[2]; ^ ptr[3];

This way you will eliminate a lot of computations

Most of this should already be done by the compiler, and if not, be filed as compiler bugs.

Collectives™ on Stack Overflow

Speed of AES Implementation

2 Answers 2

1 Comment

1 Comment

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

1 Comment

1 Comment

Linked

Related