10
\$\begingroup\$

I have written the following implementation of the Caesar cipher for an old Linux computer with a Pentium MMX chip. The code has been written with the following design goals in mind:

  • the code should perform well
  • to avoid timing side-channel attacks, the runtime should be independent of the data used
  • the code should run on a Pentium MMX

Please let me know of possible improvements and performance pitfalls I fell in.

 # MMX caeasar cipher implementation # for i586 with MMX # signature: # caesar(out, in, len, key) # key is between 0 and 25 .section .text .globl caesar .type caesar,@function .align 16 caesar: push %ebp mov %esp,%ebp push %ebx push %edi push %esi mov 8(%ebp),%edi # edi: destination mov 12(%ebp),%esi # esi: source mov 16(%ebp),%edx # edx: length and $~7,%edx # only process full qwords test %edx,%edx jz 1f xor %ecx,%ecx # ecx: index movd 20(%ebp),%mm5 punpcklbw %mm5,%mm5 punpcklwd %mm5,%mm5 punpckldq %mm5,%mm5 # mm5: key bytes movq %mm5,%mm4 psubb twentysix,%mm4 # mm4: key - 26 movq Amask,%mm6 psubb %mm4,%mm6 # mm6: 'A' - 1 - (key - 26) .align 16 0: movq (%esi,%ecx),%mm0 movq %mm0,%mm1 pand ucmask,%mm1 # mm1: xmm0 in upper case movq %mm1,%mm3 movq %mm1,%mm2 pcmpgtb Amask,%mm2 # mm2: 0xff where 'A' <= buf[i] pcmpgtb Zmask,%mm3 # mm3: 0xff where 'Z' < buf[i] pcmpgtb %mm6,%mm1 # mm1: 0xff where 'A' + key <= buf[i] pandn %mm1,%mm3 pand %mm4,%mm3 # mm3: key-26 where 'A' + (26 - key) <= buf[i] <= 'Z' pandn %mm2,%mm1 pand %mm5,%mm1 # mm1: key where 'A' <= buf[i] < 'A' + (26 - key) por %mm3,%mm1 # mm1: 0/key/key-26 paddb %mm1,%mm0 movq %mm0,(%edi,%ecx) add $8,%ecx cmp %edx,%ecx jb 0b emms add %ecx,%edi add %ecx,%esi 1: mov 16(%ebp),%edx # length mov 20(%ebp),%ecx # key and $7,%edx # loop tail # process remaining bytes add %edx,%edi # end of output buffer add %edx,%esi # end of input buffer neg %edx # index counts up to 0 lea -26(%ecx),%eax # al: key - 26 mov $'A'+26-1,%ah sub %cl,%ah # ah: threshold ('A' - 1 - (key - 26)) .align 16 0: mov (%esi,%edx),%bl and $~0x20,%bl # bl: uppercase c cmp $'Z'+1,%bl # cf: uc <= 'Z' sbb %ch,%ch # ch: 0xff if uc <= 'Z' cmp %bl,%ah # cf: thresh < uc sbb %bh,%bh # bh: 0xff if thresh < uc and %bh,%ch and %al,%ch # ch: key - 26 & isleZ & isgttthres not %bh # bh: ~isgtthresh cmp $'A',%bl cmc # cf: 'A' <= uc sbb %bl,%bl # bl: 0xff if 'A' <= uc and %bh,%bl and %cl,%bl # bl: key & isgeA & ~isgtthresh or %ch,%bl add (%esi,%edx),%bl mov %bl,(%edi,%edx) inc %edx jnz 0b 1: pop %esi pop %edi pop %ebx leave ret .size caesar,.-caesar .section .rodata.cst8,"aM",@progbits,8 .align 8 .type ucmask,@object ucmask: .fill 8, 1, ~0x20 .size ucmask, 8 .type Amask,@object Amask: .fill 8, 1, 'A' - 1 .size Amask, 8 .type Zmask,@object Zmask: .fill 8, 1, 'Z' .size Zmask, 8 .type twentysix,@object twentysix: .fill 8, 1, 26 .size twentysix, 8 

You can use this harness to test the code:

#define _POSIX_C_SOURCE 200809L #include <stdio.h> #include <stdlib.h> #include <string.h> extern void caesar(const char *out, char *in, size_t len, int key); extern int main(int argc, char *argv[]) { char *buf = NULL; size_t len = 0; ssize_t count; int key = 13; if (argc > 1) key = atoi(argv[1]) % 26; while (count = getline(&buf, &len, stdin), count != EOF) { caesar(buf, buf, strlen(buf), key); fputs(buf, stdout); } return (EXIT_SUCCESS); } 
\$\endgroup\$
6
  • 2
    \$\begingroup\$ Aren't there 8 mm registers? Since register<->register operations tend to be faster than register<->memory, could you use mm7 for Amask? \$\endgroup\$ Commented Apr 2, 2018 at 6:41
  • \$\begingroup\$ @DavidWohlferd Indeed, that seems useful. \$\endgroup\$ Commented Apr 2, 2018 at 8:56
  • \$\begingroup\$ So you want to tune for P5 (Pentium MMX)? You mention i686: the first P6 chip with MMX was Pentium II. (PPro didn't have MMX). Tuning for out-of-order P6 is significantly different from the in-order dual-issue pipeline in P5. On P5, most MMX instructions can pair in either the U or V pipes, so instruction scheduling is very important. On P6, decode bottlenecks and register-read stalls are going to be the major worries, not scheduling. (And early P6 doesn't have micro-fusion of memory operands; you get 2 uops.) \$\endgroup\$ Commented Dec 18, 2018 at 23:08
  • \$\begingroup\$ Can you require your in/out buffers are padded to a multiple of 8 bytes, so you can use SIMD without a scalar fallback? (You can already do that by using a possibly-overlapping unaligned last vector, unless the total length is less than 8, though. If operating in-place with src=dst needs to be supported, make sure you load that last maybe-overlapping source vector before the final main loop vector store.) \$\endgroup\$ Commented Dec 18, 2018 at 23:16
  • \$\begingroup\$ @PeterCordes I am not sure; I think I wrote i686 because I thought this was the first CPU to support MMX. This whole thing is more of an educational exercise / joke, but optimisation advice for either CPU would still be useful. Padding cannot be mandated though. \$\endgroup\$ Commented Dec 18, 2018 at 23:21

2 Answers 2

6
\$\begingroup\$

A big danger.

When the len parameter happens to be a muliple of 8 and all of the full qwords have been processed, there will be no remaining bytes to process.
Your code calculates the number of remaining bytes using

mov 16(%ebp),%edx # length ... and $7,%edx # loop tail 

but forgets to exit if this produces zero. The loop that follows is started anyway and at the bottom where it reads

inc %edx jnz 0b 

the increment of %edx will keep producing NZ for a very long time!

Resolve this by exiting if the and instruction sets ZF=1.

mov 16(%ebp), %edx # length ... and $7, %edx # loop tail jz 1f 

A small optimization

and $~7,%edx # only process full qwords test %edx,%edx jz 1f 

Because the and instruction defines the zero flag (ZF), there's no need to write test %edx, %edx.


Allow me

mov $'A'+26-1,%ah 

This struck me as being a bit too much of a complication. Why not simply write

mov $'Z', %ah 

Now if it was your intent to come close to the comment that follows in the next line
# ah: threshold ('A' - 1 - (key - 26))
then writing

lea -26(%ecx), %eax # al: key - 26 mov $'A'-1, %ah sub %al, %ah # ah: threshold ('A' - 1 - (key - 26)) 

would have nailed it.

\$\endgroup\$
0
0
\$\begingroup\$

After applying the advice I received, this is the final version of the code:

 # MMX caeasar chiffre implementation # for i686 with MMX # signature: # caesar(out, in, len, key) # key is between 0 and 25 .section .text .globl caesar .type caesar,@function .align 16 caesar: push %ebp mov %esp,%ebp push %ebx push %edi push %esi mov 8(%ebp),%edi # edi: destination mov 12(%ebp),%esi # esi: source mov 16(%ebp),%edx # edx: length and $~7,%edx # only process full qwords test %edx,%edx jz 1f xor %ecx,%ecx # ecx: index movd 20(%ebp),%mm5 punpcklbw %mm5,%mm5 punpcklwd %mm5,%mm5 punpckldq %mm5,%mm5 # mm5: key bytes movq Amask,%mm7 # for later use movq %mm5,%mm4 psubb twentysix,%mm4 # mm4: key - 26 movq %mm7,%mm6 psubb %mm4,%mm6 # mm6: 'A' - 1 - (key - 26) .align 16 0: movq (%esi,%ecx),%mm0 movq %mm0,%mm1 pand ucmask,%mm1 # mm1: xmm0 in upper case movq %mm1,%mm3 movq %mm1,%mm2 pcmpgtb %mm7,%mm2 # mm2: 0xff where 'A' <= buf[i] pcmpgtb Zmask,%mm3 # mm3: 0xff where 'Z' < buf[i] pcmpgtb %mm6,%mm1 # mm1: 0xff where 'A' + key <= buf[i] pandn %mm1,%mm3 pand %mm4,%mm3 # mm3: key-26 where 'A' + (26 - key) <= buf[i] <= 'Z' pandn %mm2,%mm1 pand %mm5,%mm1 # mm1: key where 'A' <= buf[i] < 'A' + (26 - key) por %mm3,%mm1 # mm1: 0/key/key-26 paddb %mm1,%mm0 movq %mm0,(%edi,%ecx) add $8,%ecx cmp %edx,%ecx jb 0b emms add %ecx,%edi add %ecx,%esi 1: mov 16(%ebp),%edx # length mov 20(%ebp),%ecx # key and $7,%edx # loop tail jz 1f # all bytes processed alread? # process remaining bytes add %edx,%edi # end of output buffer add %edx,%esi # end of input buffer neg %edx # index counts up to 0 lea -26(%ecx),%ebx # bl: key - 26 mov $'A'-1,%ah sub %bl,%ah # ah: threshold ('A' - 1 - (key - 26)) .align 16 0: mov (%esi,%edx),%al and $~0x20,%al # al: uppercase c cmp $'Z'+1,%al # cf: uc <= 'Z' sbb %ch,%ch # ch: 0xff if uc <= 'Z' cmp %al,%ah # cf: thresh < uc sbb %bh,%bh # bh: 0xff if thresh < uc and %bh,%ch and %al,%ch # ch: key - 26 & isleZ & isgttthres not %bh # bh: ~isgtthresh cmp $'A',%al cmc # cf: 'A' <= uc sbb %al,%al # al: 0xff if 'A' <= uc and %bh,%al and %cl,%al # al: key & isgeA & ~isgtthresh or %ch,%al add (%esi,%edx),%al mov %al,(%edi,%edx) inc %edx jnz 0b 1: pop %esi pop %edi pop %ebx leave ret .size caesar,.-caesar .section .rodata.cst8,"aM",@progbits,8 .align 8 .type ucmask,@object ucmask: .fill 8, 1, ~0x20 .size ucmask, 8 .type Amask,@object Amask: .fill 8, 1, 'A' - 1 .size Amask, 8 .type Zmask,@object Zmask: .fill 8, 1, 'Z' .size Zmask, 8 .type twentysix,@object twentysix: .fill 8, 1, 26 .size twentysix, 8 
\$\endgroup\$
1

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.