
I am extending an existing logging library. It is a system with two sides: the frontend, into which tasks write their log messages, and the backend, into which an application can plug listeners that forward those messages to different sinks. The backend used to be one hard-wired listener; I am now extending this for flexibility. The code is used exclusively on embedded devices, where high performance (measured in number of bytes forwarded per millisecond) is a very important design and implementation objective.

For performance reasons, messages are buffered, and forwarding is done in a background task. That task fetches a chunk of messages from a queue, formats them all, and then passes them to the listeners via registered functions. Those listeners will filter messages, and will only write those to their sink that pass the filter criterion.

Given this, I end up with N notification functions (the listeners) to send M messages to, a rather classic N*M problem. Now I have two possibilities: I can loop over the messages, and then loop over the notification functions, passing the message to each one.

    for (m in formatted_messages)
        for (n in notification_functions)
            n(m);

    void n(message) {
        if (filter(message))
            write(message);
    }

Or I could loop over all the notification functions, and pass them all the messages I have at once:

    for (n in notification_functions)
        n(formatted_messages);

    void n(messages) {
        for (m in messages)
            if (filter(m))
                write(m);
    }

Are there any fundamental considerations regarding which design is more likely to allow a higher number of messages to be processed per time slice? (Note how this question determines the listener's interface. This isn't a micro-optimization question, but one about how to make a design that does not hinder performance. I can only measure much later, and redesigning the listener interface then will be costly.)

Some considerations I have already made:

  • Those listeners need to write the messages somewhere, which is rather expensive, so the function calls by themselves might not be too important performance-wise.
  • In 95% of all cases, there will only be one listener.
  • Something tells me that choosing the loop over the functions as the outer loop (i.e. looping over the functions first, then over the messages) could be somewhat more performant, but maybe it's just wrong intuition and you shouldn't even be listening to me. Commented Jun 27, 2013 at 17:59
  • Can you combine multiple messages sent to the same listener? Commented Jun 27, 2013 at 17:59
  • The first approach is M*N, but the second seems like N to me. Could this be? Commented Jun 27, 2013 at 17:59
  • @StackedCrooked The second one is N*M. Commented Jun 27, 2013 at 18:00
  • How important is fairness? The first approach spreads the messages evenly to all listeners, while the second leaves longer gaps between calls to each listener. Commented Jun 27, 2013 at 18:02

4 Answers


Are there any fundamental considerations regarding which design is more likely to allow a higher number of messages to be processed per time slice?

In general, the considerations here boil down to two main things.

  1. If one of your loops is looping over objects which can potentially have good memory locality (such as looping over an array of values), keeping that portion in the inner loop can potentially keep the objects within the CPU cache, and improve performance.

  2. If you plan to try to parallelize the operation, keeping the "larger" (in terms of count) collection in the outer loop lets you parallelize the outer loop effectively without oversubscribing threads. It's typically simpler and cleaner to parallelize an algorithm at the outer level, so designing the loops with the potentially larger parallel "blocks" of work in the outer loop can simplify this, if it becomes a possibility later.

Those listeners need to write the messages somewhere, which is rather expensive, so the function calls by themselves might not be too important performance-wise.

This will probably completely negate any benefits of moving one loop outside of the other.

In 95% of all cases, there will only be one listener.

If this is the case, I would likely put the listener loop at the outer scope, unless you plan to parallelize this operation. Given that this is going to run in a background thread on an embedded device, parallelizing is unlikely, so having the listener loop as the outer loop should reduce the overall instruction count (it effectively becomes a loop over M operations, instead of M loops over a single operation).


2 Comments

Re #1: The "objects" are mainly strings plus some metadata (__FILE__, __LINE__, timestamp etc.), but my current approach is to turn them into strings ("formatting") once, in the background task, rather than individually in the listeners. So basically the "objects" are plain char buffers (usually between 0.1-1kB). What do you think about the locality of char buffers? Re #2: Wouldn't parallelization require that I forward the M messages to N parallel working tasks, which is just the problem I currently have?
@sbi If they're character buffers, there won't be much in the way of locality. Basically, if there's anything that's either large, or uses a pointer (even internally), figure locality goes out the window. If the character buffers are a constant size and stored in an array, there may be locality, but again, in this case, the fact that the listeners are doing IO probably means it really doesn't matter at all.

The order of the loops will probably have much less of an advantage than the change in the signature of the listener (note that whichever loop is outside, the listener could maintain the first interface, i.e. both loops can be in the caller).

The natural advantage of the second interface (i.e. sending a sequence of messages to each listener) is that it enables possible grouping in the implementation of the listener. For example, if writing to a device, the listener can pack multiple messages into a single write, while if the interface takes a single message, the listener either has to cache (which has a memory and CPU cost) or has to perform multiple writes, one per call.

1 Comment

Coalescing message write operations in the listeners is a very important consideration I hadn't thought of. Thanks for making this observation!

So, several factors will play in here:

How close together are the messages in the cache, and how much space do they take up? Locality works in your favor if they are relatively small (a few kilobytes or less) and close together (e.g. not a linked list whose nodes were allocated several seconds apart in a system that does lots of other memory allocation).

If they are close and small, then I believe the second option is more efficient, since the messages are going to be prefetched/cached together, whereas calling all n listener and filter functions (also assuming there are MANY functions, not one, two or three) may cause more "cache-throwout" of previous messages. This also depends on how complex the listener and filter functions actually are, of course. How much work do they do? If each function does quite a bit of work, the order probably doesn't matter much, because the difference will be marginal.

2 Comments

Branch target predictors seem to hate it when you call lots of different functions from the same instruction. But yeah, as you note, there's a tradeoff between good data cache usage and making good use of your CPU.
Messages are cut off at 1k, but 99.9% of them are 0.1k max anyway. They are stored in dynamically allocated buffers (1k each) that (usually) are not freed, but recycled. I could write an allocator that allocates a whole chunk of those buffers to improve their locality, but since in the frontend they're consumed by parallel threads, they end up scrambled anyway, which ruins their locality, so I don't think this would buy me much. As I wrote, usually (99%) it's one function, and more than two should be extremely rare. (Of course, users could use it in unforeseen ways.)

There aren't any "fundamental" reasons why one is a better design than the other. There are a few very minor speed differences that might come into play depending on how your library is used. I would personally prefer to iterate over listeners first and messages second.

I'm guessing the handler bodies are typically pretty fast. You'll probably want to iterate over the listeners as the outer loop so that you're calling the same code repeatedly. Stuff like indirect call prediction will work much better this way. Of course, you wind up making worse use of the data cache, but hopefully each message buffer is small enough to fit easily into L1.

Why not also make the listeners accept a const vector<message> & and have them do their own iteration? They can do whatever buffering is beneficial and only do a single expensive write at the end.

1 Comment

Wherever I wrote messages or formatted_messages, that might be a std::vector of buffers. (I'm not using std::string, though.) Each buffer is 1k, but usually messages use <0.1k. If by "the handler bodies" you refer to the listeners, then no, they are not really fast considering the requirements. Writing to internal flash memory is so slow (~100ms) that we use NVRAM as a cache and have yet another background task copy from there to flash memory...
