Caesar cipher in i686 assembly + MMX

Question

I have written the following implementation of the Caesar cipher for an old Linux computer with a Pentium MMX chip. The code has been written with the following design goals in mind:

the code should perform well
to avoid timing side-channel attacks, the runtime should be independent of the data used
the code should run on a Pentium MMX

Please let me know of possible improvements and performance pitfalls I fell in.

 # MMX caeasar cipher implementation # for i586 with MMX # signature: # caesar(out, in, len, key) # key is between 0 and 25 .section .text .globl caesar .type caesar,@function .align 16 caesar: push %ebp mov %esp,%ebp push %ebx push %edi push %esi mov 8(%ebp),%edi # edi: destination mov 12(%ebp),%esi # esi: source mov 16(%ebp),%edx # edx: length and $~7,%edx # only process full qwords test %edx,%edx jz 1f xor %ecx,%ecx # ecx: index movd 20(%ebp),%mm5 punpcklbw %mm5,%mm5 punpcklwd %mm5,%mm5 punpckldq %mm5,%mm5 # mm5: key bytes movq %mm5,%mm4 psubb twentysix,%mm4 # mm4: key - 26 movq Amask,%mm6 psubb %mm4,%mm6 # mm6: 'A' - 1 - (key - 26) .align 16 0: movq (%esi,%ecx),%mm0 movq %mm0,%mm1 pand ucmask,%mm1 # mm1: xmm0 in upper case movq %mm1,%mm3 movq %mm1,%mm2 pcmpgtb Amask,%mm2 # mm2: 0xff where 'A' <= buf[i] pcmpgtb Zmask,%mm3 # mm3: 0xff where 'Z' < buf[i] pcmpgtb %mm6,%mm1 # mm1: 0xff where 'A' + key <= buf[i] pandn %mm1,%mm3 pand %mm4,%mm3 # mm3: key-26 where 'A' + (26 - key) <= buf[i] <= 'Z' pandn %mm2,%mm1 pand %mm5,%mm1 # mm1: key where 'A' <= buf[i] < 'A' + (26 - key) por %mm3,%mm1 # mm1: 0/key/key-26 paddb %mm1,%mm0 movq %mm0,(%edi,%ecx) add $8,%ecx cmp %edx,%ecx jb 0b emms add %ecx,%edi add %ecx,%esi 1: mov 16(%ebp),%edx # length mov 20(%ebp),%ecx # key and $7,%edx # loop tail # process remaining bytes add %edx,%edi # end of output buffer add %edx,%esi # end of input buffer neg %edx # index counts up to 0 lea -26(%ecx),%eax # al: key - 26 mov $'A'+26-1,%ah sub %cl,%ah # ah: threshold ('A' - 1 - (key - 26)) .align 16 0: mov (%esi,%edx),%bl and $~0x20,%bl # bl: uppercase c cmp $'Z'+1,%bl # cf: uc <= 'Z' sbb %ch,%ch # ch: 0xff if uc <= 'Z' cmp %bl,%ah # cf: thresh < uc sbb %bh,%bh # bh: 0xff if thresh < uc and %bh,%ch and %al,%ch # ch: key - 26 & isleZ & isgttthres not %bh # bh: ~isgtthresh cmp $'A',%bl cmc # cf: 'A' <= uc sbb %bl,%bl # bl: 0xff if 'A' <= uc and %bh,%bl and %cl,%bl # bl: key & isgeA & ~isgtthresh or %ch,%bl add (%esi,%edx),%bl mov %bl,(%edi,%edx) inc %edx jnz 0b 1: pop %esi pop %edi pop %ebx leave ret .size caesar,.-caesar .section .rodata.cst8,"aM",@progbits,8 .align 8 .type ucmask,@object ucmask: .fill 8, 1, ~0x20 .size ucmask, 8 .type Amask,@object Amask: .fill 8, 1, 'A' - 1 .size Amask, 8 .type Zmask,@object Zmask: .fill 8, 1, 'Z' .size Zmask, 8 .type twentysix,@object twentysix: .fill 8, 1, 26 .size twentysix, 8

You can use this harness to test the code:

#define _POSIX_C_SOURCE 200809L #include <stdio.h> #include <stdlib.h> #include <string.h> extern void caesar(const char *out, char *in, size_t len, int key); extern int main(int argc, char *argv[]) { char *buf = NULL; size_t len = 0; ssize_t count; int key = 13; if (argc > 1) key = atoi(argv[1]) % 26; while (count = getline(&buf, &len, stdin), count != EOF) { caesar(buf, buf, strlen(buf), key); fputs(buf, stdout); } return (EXIT_SUCCESS); }

Aren't there 8 mm registers? Since register<->register operations tend to be faster than register<->memory, could you use mm7 for Amask? — David Wohlferd
– David Wohlferd, Commented Apr 2, 2018 at 6:41
So you want to tune for P5 (Pentium MMX)? You mention i686: the first P6 chip with MMX was Pentium II. (PPro didn't have MMX). Tuning for out-of-order P6 is significantly different from the in-order dual-issue pipeline in P5. On P5, most MMX instructions can pair in either the U or V pipes, so instruction scheduling is very important. On P6, decode bottlenecks and register-read stalls are going to be the major worries, not scheduling. (And early P6 doesn't have micro-fusion of memory operands; you get 2 uops.) — Peter Cordes
– Peter Cordes, Commented Dec 18, 2018 at 23:08
Can you require your in/out buffers are padded to a multiple of 8 bytes, so you can use SIMD without a scalar fallback? (You can already do that by using a possibly-overlapping unaligned last vector, unless the total length is less than 8, though. If operating in-place with src=dst needs to be supported, make sure you load that last maybe-overlapping source vector before the final main loop vector store.) — Peter Cordes
– Peter Cordes, Commented Dec 18, 2018 at 23:16
@PeterCordes I am not sure; I think I wrote i686 because I thought this was the first CPU to support MMX. This whole thing is more of an educational exercise / joke, but optimisation advice for either CPU would still be useful. Padding cannot be mandated though. — FUZxxl
– FUZxxl, Commented Dec 18, 2018 at 23:21

Sep Roland · Accepted Answer · 2018-04-08 15:29:08Z

A big danger.

When the len parameter happens to be a muliple of 8 and all of the full qwords have been processed, there will be no remaining bytes to process.
Your code calculates the number of remaining bytes using

mov 16(%ebp),%edx # length ... and $7,%edx # loop tail

but forgets to exit if this produces zero. The loop that follows is started anyway and at the bottom where it reads

inc %edx jnz 0b

the increment of %edx will keep producing NZ for a very long time!

Resolve this by exiting if the and instruction sets ZF=1.

mov 16(%ebp), %edx # length ... and $7, %edx # loop tail jz 1f

A small optimization

and $~7,%edx # only process full qwords test %edx,%edx jz 1f

Because the and instruction defines the zero flag (ZF), there's no need to write test %edx, %edx.

Allow me

mov $'A'+26-1,%ah

This struck me as being a bit too much of a complication. Why not simply write

mov $'Z', %ah

Now if it was your intent to come close to the comment that follows in the next line
# ah: threshold ('A' - 1 - (key - 26))
then writing

lea -26(%ecx), %eax # al: key - 26 mov $'A'-1, %ah sub %al, %ah # ah: threshold ('A' - 1 - (key - 26))

would have nailed it.

FUZxxl · Accepted Answer · 2018-12-18 20:54:19Z

After applying the advice I received, this is the final version of the code:

 # MMX caeasar chiffre implementation # for i686 with MMX # signature: # caesar(out, in, len, key) # key is between 0 and 25 .section .text .globl caesar .type caesar,@function .align 16 caesar: push %ebp mov %esp,%ebp push %ebx push %edi push %esi mov 8(%ebp),%edi # edi: destination mov 12(%ebp),%esi # esi: source mov 16(%ebp),%edx # edx: length and $~7,%edx # only process full qwords test %edx,%edx jz 1f xor %ecx,%ecx # ecx: index movd 20(%ebp),%mm5 punpcklbw %mm5,%mm5 punpcklwd %mm5,%mm5 punpckldq %mm5,%mm5 # mm5: key bytes movq Amask,%mm7 # for later use movq %mm5,%mm4 psubb twentysix,%mm4 # mm4: key - 26 movq %mm7,%mm6 psubb %mm4,%mm6 # mm6: 'A' - 1 - (key - 26) .align 16 0: movq (%esi,%ecx),%mm0 movq %mm0,%mm1 pand ucmask,%mm1 # mm1: xmm0 in upper case movq %mm1,%mm3 movq %mm1,%mm2 pcmpgtb %mm7,%mm2 # mm2: 0xff where 'A' <= buf[i] pcmpgtb Zmask,%mm3 # mm3: 0xff where 'Z' < buf[i] pcmpgtb %mm6,%mm1 # mm1: 0xff where 'A' + key <= buf[i] pandn %mm1,%mm3 pand %mm4,%mm3 # mm3: key-26 where 'A' + (26 - key) <= buf[i] <= 'Z' pandn %mm2,%mm1 pand %mm5,%mm1 # mm1: key where 'A' <= buf[i] < 'A' + (26 - key) por %mm3,%mm1 # mm1: 0/key/key-26 paddb %mm1,%mm0 movq %mm0,(%edi,%ecx) add $8,%ecx cmp %edx,%ecx jb 0b emms add %ecx,%edi add %ecx,%esi 1: mov 16(%ebp),%edx # length mov 20(%ebp),%ecx # key and $7,%edx # loop tail jz 1f # all bytes processed alread? # process remaining bytes add %edx,%edi # end of output buffer add %edx,%esi # end of input buffer neg %edx # index counts up to 0 lea -26(%ecx),%ebx # bl: key - 26 mov $'A'-1,%ah sub %bl,%ah # ah: threshold ('A' - 1 - (key - 26)) .align 16 0: mov (%esi,%edx),%al and $~0x20,%al # al: uppercase c cmp $'Z'+1,%al # cf: uc <= 'Z' sbb %ch,%ch # ch: 0xff if uc <= 'Z' cmp %al,%ah # cf: thresh < uc sbb %bh,%bh # bh: 0xff if thresh < uc and %bh,%ch and %al,%ch # ch: key - 26 & isleZ & isgttthres not %bh # bh: ~isgtthresh cmp $'A',%al cmc # cf: 'A' <= uc sbb %al,%al # al: 0xff if 'A' <= uc and %bh,%al and %cl,%al # al: key & isgeA & ~isgtthresh or %ch,%al add (%esi,%edx),%al mov %al,(%edi,%edx) inc %edx jnz 0b 1: pop %esi pop %edi pop %ebx leave ret .size caesar,.-caesar .section .rodata.cst8,"aM",@progbits,8 .align 8 .type ucmask,@object ucmask: .fill 8, 1, ~0x20 .size ucmask, 8 .type Amask,@object Amask: .fill 8, 1, 'A' - 1 .size Amask, 8 .type Zmask,@object Zmask: .fill 8, 1, 'Z' .size Zmask, 8 .type twentysix,@object twentysix: .fill 8, 1, 26 .size twentysix, 8

While OPs are encouraged to answer their own questions bear in mind that "Your answer must meet the standards of a Code Review answer, just like any other answer. Describe what you changed, and why.". Code-only answers that don't actually review the code are insufficient and are subject to deletion. — Sᴀᴍ Onᴇᴌᴀ
– Sᴀᴍ Onᴇᴌᴀ ♦, Commented May 19 at 15:53

Stack Exchange Network

Caesar cipher in i686 assembly + MMX

2 Answers 2

A big danger.

A small optimization

Allow me

You must log in to answer this question.

Hot Network Questions

Caesar cipher in i686 assembly + MMX

2 Answers 2

A big danger.

A small optimization

Allow me

You must log in to answer this question.

Related

Hot Network Questions