3

I have written an C implementation of AES and have tried to make it as fast as possible (Im just starting out in Programming and have training in IT). I have achieved an Speed increase of around 600% so far but its still awfully slow. To Compare my AES-Implementation with something i have used the "openssl speed" command in the Linux-Terminal. In 3 seconds this implementation encrypts around 36 977 043 blocks (16byte). I am ~25 times slower (at 72 seconds for the 36... bytes) than that which kinda sucks. Im curious about 2 things.

  1. What would be a good goal to achieve, how fast is a realistic goal to aim at.
  2. Why is my Code so slow, and how can i change that.

To my code: I have tried to leave out on some of my functions so see how much faster the code gets without them. The full code took 72 seconds.

  • Without Mixcolumns 14 seconds #here is a big problem
  • Without Shiftrows 67 seconds
  • Without Subbytes 61 seconds

My encryption function:

uint32_t * encrypt(uint32_t * expkey,uint32_t state[4]){ uint32_t temp[4]; state[0] = state[0] ^ expkey[0]; state[1] = state[1] ^ expkey[1]; state[2] = state[2] ^ expkey[2]; state[3] = state[3] ^ expkey[3]; for(int round = 1; round < Nr; round++){ // Subbytes for (int c = 0; c < 4;c++){ temp[c] = ((sbox[state[c] >> 24 & 0xFF]) << 24 ) + ((sbox[state[c] >> 16 & 0xFF]) << 16 ) + ((sbox[state[c] >> 8 & 0xFF]) << 8 ) + (sbox[state[c] & 0xFF]); } // Shiftrows state[0] = (((temp[0] >> 24) & 0xFF) << 24) + (((temp[1] >> 16) & 0xFF) << 16) + (((temp[2] >> 8) & 0xFF) << 8) + (temp[3] & 0xFF); state[1] = (((temp[1] >> 24) & 0xFF) << 24) + (((temp[2] >> 16) & 0xFF) << 16) + (((temp[3] >> 8) & 0xFF) << 8) + (temp[0] & 0xFF); state[2] = (((temp[2] >> 24) & 0xFF) << 24) + (((temp[3] >> 16) & 0xFF) << 16) + (((temp[0] >> 8) & 0xFF) << 8) + (temp[1] & 0xFF); state[3] = (((temp[3] >> 24) & 0xFF) << 24) + (((temp[0] >> 16) & 0xFF) << 16) + (((temp[1] >> 8) & 0xFF) << 8) + (temp[2] & 0xFF); // Mixcolums for (int c = 0; c < 4;c++){ state[c] = ((xtime((state[c] >> 24) & 0xFF) ^ xtime3((state[c] >> 16) & 0xFF) ^ ((state[c] >> 8) & 0xFF) ^ (state[c] & 0xFF)) << 24) + ((((state[c] >> 24) & 0xFF) ^ xtime((state[c] >> 16) & 0xFF) ^ xtime3((state[c] >> 8) & 0xFF) ^ (state[c] & 0xFF)) << 16) + ((((state[c] >> 24) & 0xFF) ^ ((state[c] >> 16) & 0xFF) ^ xtime((state[c] >> 8) & 0xFF) ^ xtime3(state[c] & 0xFF)) << 8 ) + (xtime3((state[c] >> 24) & 0xFF) ^ ((state[c] >> 16) & 0xFF) ^ ((state[c] >> 8) & 0xFF) ^ xtime(state[c] & 0xFF)); } // Add Key state[0] = state[0] ^ expkey[round * 4]; state[1] = state[1] ^ expkey[round * 4 + 1]; state[2] = state[2] ^ expkey[round * 4 + 2]; state[3] = state[3] ^ expkey[round * 4 + 3]; } // Last Subbytes for (int c = 0; c < 4;c++){ temp[c] = ((sbox[state[c] >> 24 & 0xFF]) << 24 ) + ((sbox[state[c] >> 16 & 0xFF]) << 16 ) + ((sbox[state[c] >> 8 & 0xFF]) << 8 ) + (sbox[state[c] & 0xFF]); } */ // Last Shiftrow state[0] = (((temp[0] >> 24) & 0xFF) << 24) + (((temp[1] >> 16) & 0xFF) << 16) + (((temp[2] >> 8) & 0xFF) << 8) + (temp[3] & 0xFF); state[1] = (((temp[1] >> 24) & 0xFF) << 24) + (((temp[2] >> 16) & 0xFF) << 16) + (((temp[3] >> 8) & 0xFF) << 8) + (temp[0] & 0xFF); state[2] = (((temp[2] >> 24) & 0xFF) << 24) + (((temp[3] >> 16) & 0xFF) << 16) + (((temp[0] >> 8) & 0xFF) << 8) + (temp[1] & 0xFF); state[3] = (((temp[3] >> 24) & 0xFF) << 24) + (((temp[0] >> 16) & 0xFF) << 16) + (((temp[1] >> 8) & 0xFF) << 8) + (temp[2] & 0xFF); // Last Add Key state[0] = state[0] ^ expkey[Nr * 4]; state[1] = state[1] ^ expkey[Nr * 4 + 1]; state[2] = state[2] ^ expkey[Nr * 4 + 2]; state[3] = state[3] ^ expkey[Nr * 4 + 3]; return state; } 

And the xtime function:

uint8_t xtime(uint8_t x){ return (x << 1) ^ (0x11b & -(x >> 7)); } 

I am looking forward to all tips tricks and improvements.

3
  • 9
    Questions seeking feedback on working code, including performance improvements, would be more appropriate over at Code Review. Commented Feb 14, 2021 at 19:23
  • First of all, are you sure that openssl encrypted in user space? If it used AES-NI instructions that could explain why it's at least 10 times faster. Commented Feb 14, 2021 at 19:27
  • 2
    @Cheatah: Why do you think user space is relevant? The AES instructions just operate on data; they are not privileged. Commented Feb 14, 2021 at 20:07

2 Answers 2

5

The OpenSSL is using the AES-NI where available.

openssl speed -evp aes-128-cbc 

and outputs

 The 'numbers' are in 1000s of bytes per second processed. type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes aes-128-cbc 531549.19k 969335.21k 1045437.10k 1066826.75k 1054665.39k 1052120.41k 

Since you are not using the AES-NI you need to compare it with the software version

 OPENSSL_ia32cap=”~0x200000200000000″ openssl speed -elapsed -evp aes-128-cbc 
 The 'numbers' are in 1000s of bytes per second processed. type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes aes-128-cbc 143802.75k 161369.51k 165049.17k 166054.57k 166262.10k 166461.44k 

if we compare the last column, you will see that AES-NI is ~6.3 times faster than the software version of the OpenSSL. This means that you are around 4 times slower than the software version.

In many cases, the compiler optimization parameters can also affect the speed, too. Look into the manual of your compiler, if you are using GCC then they are -O[0..3]


About the code;

If you look at the AES code of OpenSSL you will see that they use pre-computed tables and this is a very common technique.

The Subytes, Shiftrows and MixColums are turned into table lookup. The speed difference is these. And note that the table lookup is vulnerable to cache-timing attacks.

Sign up to request clarification or add additional context in comments.

1 Comment

Not an exact answer to your problems, however, this should be in your mind.
2

You have to optimize the code.

  1. There are unnecessary shifts << and back >>. You can simply mask. Replace code like this:
= (temp[0] >> 24) & 0xFF) << 24) + (((temp[1] >> 16) & 0xFF) << 16) + (((temp[2] >> 8) & 0xFF) << 8) + (temp[3] & 0xFF); 

with code like this:

= (temp[0] & 0xFF000000) + (temp[1] & 0x00FF0000) + (temp[2] & 0x0000FF00) << 8) + (temp[3] & 0xFF); 
  1. Replace this:
 ^ expkey[Nr * 4 ]; ^ expkey[Nr * 4 + 1]; ^ expkey[Nr * 4 + 2]; ^ expkey[Nr * 4 + 3]; 

with this:

ptr = expkey + (Nr << 2); // Shift 2 is equal to * 2 ^ ptr[0]; ^ ptr[1]; ^ ptr[2]; ^ ptr[3]; 

This way you will eliminate a lot of computations

1 Comment

Most of this should already be done by the compiler, and if not, be filed as compiler bugs.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.