To keep n previous values, is memmove faster than a modulo operation?

Question

I'm writing a digital filter, and I need to keep the last X values and sum them all together.

There are two possible approaches. Either I shift the whole array using memmove to make room for the next value, and hard-code the array indexes in my summing algorithm:

    memmove(&Fifo[0], &Fifo[1], 12 * 4); // Shift array to the left
    Result += Factor[1] * (Fifo[5] + Fifo[7]);
    Result += Factor[2] * (Fifo[4] + Fifo[8]);
    Result += Factor[3] * (Fifo[3] + Fifo[9]);
    Result += Factor[4] * (Fifo[2] + Fifo[10]);
    Result += Factor[5] * (Fifo[1] + Fifo[11]);
    Result += Factor[6] * (Fifo[0] + Fifo[12]);

Or alternatively, I don't copy any memory, but increment a counter instead, and calculate each index from that using a modulo operation (like a circular buffer).

    i++; // Increment the index
    Result += Factor[1] * (Fifo[(i + 5) % 13] + Fifo[(i + 7) % 13]);
    Result += Factor[2] * (Fifo[(i + 4) % 13] + Fifo[(i + 8) % 13]);
    Result += Factor[3] * (Fifo[(i + 3) % 13] + Fifo[(i + 9) % 13]);
    Result += Factor[4] * (Fifo[(i + 2) % 13] + Fifo[(i + 10) % 13]);
    Result += Factor[5] * (Fifo[(i + 1) % 13] + Fifo[(i + 11) % 13]);
    Result += Factor[6] * (Fifo[(i + 0) % 13] + Fifo[(i + 12) % 13]);

Since it's an embedded ARM CPU, I was wondering which would be more efficient. Since I assume that the CPU has to move at least one 32-bit value internally to do the modulo operation, could it be that moving the whole array is just as fast as calculating the right indexes?

@JoachimPileborg Benchmarking would just tell me what approach is faster, but I'd like to understand WHY it would be faster.
– Maestro, Commented Sep 4, 2013 at 11:21
For that you might want to take a look at the generated code. And also remember that function calls can be expensive compared to other instructions.
– Some programmer dude, Commented Sep 4, 2013 at 11:23
The why will depend on your compiler and what it outputs. You simply may need to become an assembly novice to understand the specifics at this point.
– pattivacek, Commented Sep 4, 2013 at 13:17

user694733 · Accepted Answer · 2013-09-04 11:50:10Z

If you need to know which is faster, you need to benchmark. If you want to know why, you need to examine the assembly.

That being said, there is also a halfway solution that could be good enough: use a buffer larger than needed and only do the memmove when the buffer is full. That way you only have to keep track of the starting offset and do not have to worry about the problems that come with circular buffers. It does use more memory, though.

So if you want 5 elements and use a buffer of 10 elements, you only have to do a memmove every 5 insertions (except on the first pass, when you can do 10 insertions).
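A minimal sketch of that halfway solution, assuming a window of 5 values in a buffer of 10. The names `LazyFifo`, `lazy_fifo_push` and `lazy_fifo_window` are made up for illustration and are not from the answer:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WINDOW   5              /* values the filter actually needs */
#define CAPACITY (2 * WINDOW)   /* oversized backing store (assumed factor of 2) */

typedef struct {
    int32_t data[CAPACITY];
    size_t  count;              /* valid slots; the newest value is data[count - 1] */
} LazyFifo;

/* Append one value; compact only when the backing store is full. */
static void lazy_fifo_push(LazyFifo *f, int32_t value)
{
    if (f->count == CAPACITY) {
        /* Move the newest WINDOW values to the front of the array. */
        memmove(&f->data[0], &f->data[CAPACITY - WINDOW],
                WINDOW * sizeof f->data[0]);
        f->count = WINDOW;
    }
    f->data[f->count++] = value;
}

/* Pointer to the oldest of the last WINDOW values.
 * Only meaningful once at least WINDOW values have been pushed. */
static const int32_t *lazy_fifo_window(const LazyFifo *f)
{
    return &f->data[f->count - WINDOW];
}
```

The summing code can then index `lazy_fifo_window(&f)[0]` to `[WINDOW - 1]` with fixed offsets, as in the memmove version of the question, while the copy itself only happens on every fifth insertion after the first fill.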

starblue · Accepted Answer · 2013-09-22 08:03:07Z

I've done exactly that on a Cortex M0 (LPC11C14) for a FIR filter of size 15 (Savitzky-Golay for measuring line voltage).

I found that in my case copying was somewhat slower than using a circular buffer of size 16 and computing the indices using the modulo operator. Note that 16 is a power of two, which makes division very cheap.

I tried several variants and used a port pin to measure the execution time. I recommend you do the same.

In most cases, doing x mod y where y is some power of two is handled as x AND (y - 1), which is quite cheap indeed.
I thought that, too, but found that it is compiled into a left shift followed by a right shift, so the most significant bits fall off.
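To illustrate the power-of-two case discussed here, a small sketch of a 16-entry circular buffer that wraps its index with a mask. The names `Ring`, `ring_push` and `ring_get` are hypothetical. The index is deliberately unsigned: for unsigned operands `x % 16` and `x & 15` always give the same result, so no division is needed whichever instructions the compiler picks.

```c
#include <stdint.h>

#define RING_SIZE 16u                 /* must be a power of two */
#define RING_MASK (RING_SIZE - 1u)

typedef struct {
    int32_t  data[RING_SIZE];
    uint32_t head;                    /* running count; next write goes to head & RING_MASK */
} Ring;

/* Store a new value, overwriting the oldest one once the ring is full. */
static void ring_push(Ring *r, int32_t value)
{
    r->data[r->head & RING_MASK] = value;
    r->head++;                        /* unsigned overflow wraps, which is well defined */
}

/* age 0 is the newest value, age 1 the one before it, and so on
 * (age must be smaller than RING_SIZE). */
static int32_t ring_get(const Ring *r, uint32_t age)
{
    return r->data[(r->head - 1u - age) & RING_MASK];
}
```

Because 2^32 is a multiple of 16, letting `head` overflow does not disturb the sequence of masked indices. That would not hold for a size such as 13.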

Michel · Accepted Answer · 2013-09-04 11:43:32Z

Assuming 32-bit values, a modulo on ARM can be executed in 2 assembly instructions, but so can moving a value in memory (1 to load it into a register, 1 to store it back). So there is no definitive answer here; it will depend on the surrounding code.

My gut feeling says you should go for the circular buffer approach.

maep · Accepted Answer · 2013-09-04 12:39:09Z

There is a third way, involving two switch blocks, which requires neither memmove nor modulo. I'm too lazy to type it up, but the idea is that you calculate the offset, use the first switch to calculate one 'half' of the buffer, then recalculate the offset and use the second switch to calculate the other half of the buffer. You basically enter the second switch where the first one 'left off'. Note that in one of the switch blocks the instruction order would have to be reversed.
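The switch-based code itself is not given in the answer. As a rough, loop-based variant of the same idea (this is an assumption about the intent, not the answer's code), the circular buffer can be processed as two contiguous runs split at the wrap point, so no index needs a modulo. The function name `fir_two_runs` and the full 13-coefficient array are illustrative only:

```c
#include <stddef.h>
#include <stdint.h>

#define TAPS 13

/* Dot product of coeff[] with the ring contents, oldest sample first.
 * 'oldest' is the index of the oldest sample in ring[] (0 <= oldest < TAPS). */
static int32_t fir_two_runs(const int32_t ring[TAPS],
                            const int32_t coeff[TAPS],
                            size_t oldest)
{
    int32_t acc = 0;
    size_t k = 0;                     /* coefficient index, runs 0..TAPS-1 */

    /* First run: from the oldest sample up to the end of the array. */
    for (size_t i = oldest; i < TAPS; ++i)
        acc += coeff[k++] * ring[i];

    /* Second run: from the start of the array up to just before 'oldest'. */
    for (size_t i = 0; i < oldest; ++i)
        acc += coeff[k++] * ring[i];

    return acc;
}
```

A compiler may or may not unroll these loops. The switch version described above would amount to a manual unrolling with a computed entry point.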

Leeor · Accepted Answer · 2013-09-04 13:21:30Z

My intuition says that the memmove may cause all sorts of memory conflicts and prevent internal bypasses, since you load and store to the same area, perhaps even the same cache lines. Some processors would simply give up on optimizing this and defer all the memory operations, effectively serializing them. (An embedded CPU may be simple enough to do this anyway, but I'm talking about the general case; on x86 or even Cortex-A15 you may get a bigger penalty.)
