
I am working on fractal rendering software. The basic setup is that I have a big two-dimensional array (the picture) whose values are incremented.

The simple rendering process is

while( iteration < maxIteration ) { update some pixel in array; } 

This is stupidly simple to parallelize: just have several threads do this simultaneously, since each thread will (very likely) work on different pixels at the same time, and even if there is an update collision in the array, that is fine. The array is shared among the threads!

However, to keep track of the total number of iterations done, I need iteration to be volatile, which I suspect slows down the code a little.
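For concreteness, here is a minimal sketch of this setup (all names are placeholders; the random pixel pick stands in for the actual chaos-game step, and the iteration budget is split statically per worker here instead of going through the shared counter):

    import java.util.concurrent.ThreadLocalRandom;

    class RenderWorker implements Runnable {
        private final long[] histogram;          // shared among all workers
        private final long iterationsPerWorker;  // static split of the budget

        RenderWorker(long[] histogram, long iterationsPerWorker) {
            this.histogram = histogram;
            this.iterationsPerWorker = iterationsPerWorker;
        }

        @Override
        public void run() {
            ThreadLocalRandom rnd = ThreadLocalRandom.current();
            for (long i = 0; i < iterationsPerWorker; i++) {
                // Plain, unsynchronized increment: the rare lost update
                // between threads is tolerated, as described above.
                histogram[rnd.nextInt(histogram.length)]++;
            }
        }
    }

    class Renderer {
        public static void main(String[] args) throws InterruptedException {
            long[] histogram = new long[1920 * 1080 * 4];
            int nThreads = Runtime.getRuntime().availableProcessors();
            Thread[] workers = new Thread[nThreads];
            for (int t = 0; t < nThreads; t++) {
                workers[t] = new Thread(new RenderWorker(histogram, 100_000_000L / nThreads));
                workers[t].start();
            }
            for (Thread w : workers) {
                w.join();
            }
        }
    }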

What baffles me is that I get virtually the same speed for 4 threads and 16 threads, and I run this on a 64-core machine, as verified by Runtime.getRuntime().availableProcessors().

One issue is that I have no control over where in the array the threads work; hence, the problem might be a bad case of cache misses. The array is the size of a full-HD image: 1920x1080x4 longs.

Thus, I am looking for possible issues and solutions to them, since I think this might be a common type of problem.

Edit: The code I am trying to optimize is available here (sourceforge). The class ThreadComputator represents one thread, and all of these threads do iterations. The number of iterations done is stored in the shared variable currentIteration, which (in the current code) is incremented in a synchronized block.

All threads write to the Histogram object, which is essentially a big array of doubles. Writes to it do not need to be atomic, as overwrites will be rare and the resulting error is tolerated.

  • Why don't you have control over the chunk of the array each thread works on? This is a common data-parallelism problem (fork/join, for one) that has been addressed by many software products. Commented Oct 16, 2014 at 17:44
  • Because I implement the chaos game algorithm. This means that the next pixel I need to work on depends non-deterministically on the current pixel. The huge array is like a histogram for a random process. Think of it as each thread throwing darts at the array; if I only consider a smaller part, then all other thrown darts will be wasted. There is no way to "force" the algorithm to work on only a sub-part of the array without doing the calculations for the other part. en.wikipedia.org/wiki/Chaos_game Commented Oct 16, 2014 at 18:33
  • You probably shouldn't be using volatile. I suggest AtomicInteger instead (or AtomicLong). Commented Oct 16, 2014 at 20:21
  • @ngreen: That uses synchronization; isn't that slower than volatile? Commented Oct 16, 2014 at 20:25
  • No, it uses a CAS instruction. No locks at all. Commented Oct 16, 2014 at 20:32
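A minimal sketch of what the commenters suggest: replace the synchronized increment of currentIteration with an AtomicLong, whose getAndIncrement is lock-free (a CAS / fetch-and-add instruction rather than a monitor). The batch variant below is an extra assumption on my part, a common way to take pressure off a hot shared counter:

    import java.util.concurrent.atomic.AtomicLong;

    class IterationCounter {
        private final AtomicLong currentIteration = new AtomicLong();
        private final long maxIteration;

        IterationCounter(long maxIteration) {
            this.maxIteration = maxIteration;
        }

        // Lock-free alternative to the synchronized block.
        boolean tryClaimIteration() {
            return currentIteration.getAndIncrement() < maxIteration;
        }

        // Claim a whole batch per call so the shared counter is touched
        // far less often; returns how many iterations were granted.
        long claimBatch(long batchSize) {
            long start = currentIteration.getAndAdd(batchSize);
            return Math.max(0L, Math.min(batchSize, maxIteration - start));
        }
    }

The worker loop then becomes while (counter.tryClaimIteration()) { updatePixel(); } with no locking on the hot path.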

1 Answer


I think you've answered your own question.

Because I implement the chaos game algorithm. This means that the next pixel I need to work on depends non-deterministically on the current pixel.

And your computer's memory system is functionally random access, but the fastest performance is only possible when reads and writes are localized (within cache lines and pages).

I'd re-implement your algorithm like so (a sketch follows below):

  1. Collect all of your desired writes for a "time instant" and wrap them in a class / data structure such that they can be ordered and grouped by memory page / cache line.
  2. Generate the list of memory pages requiring access.
  3. Randomly assign a to-be-accessed memory page to a thread.
  4. Run all the updates for that page before that thread works on another memory page.

Yes, it won't be 100% random anymore; however, you can mitigate that by counting the "write time" and assuming that all writes in the same write time occurred simultaneously. It will still thrash your memory pretty badly, but at least it will thrash it somewhat less.
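Here is a rough, single-threaded sketch of steps 1-4 (all names are made up, and REGION_SHIFT is an assumed page-sized chunk; in the multi-threaded version each thread would take whole regions off a shared queue):

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    class BufferedHistogram {
        private static final int REGION_SHIFT = 9; // 512 doubles = 4 KiB, one typical page

        private static final class Write {
            final int index;
            final double value;
            Write(int index, double value) { this.index = index; this.value = value; }
        }

        private final double[] histogram;
        private final List<Write> pending = new ArrayList<>();

        BufferedHistogram(double[] histogram) {
            this.histogram = histogram;
        }

        // Step 1: buffer the write for this "time instant" instead of
        // applying it to the big array immediately.
        void deposit(int index, double value) {
            pending.add(new Write(index, value));
        }

        // Steps 2-4: group the buffered writes by region, then replay each
        // region's writes together, so consecutive stores land in the same
        // pages instead of hopping all over the array.
        void flush() {
            pending.sort(Comparator.comparingInt((Write w) -> w.index >>> REGION_SHIFT));
            for (Write w : pending) {
                histogram[w.index] += w.value;
            }
            pending.clear();
        }
    }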
