I made a program that allocates two arrays of C++ doubles. The first contains the values of sin(x) for x from 0 to pi (the code below uses STEP = pi/(SIZE-1)). Into the second one I write the second derivatives of sin(x), calculated using the three-point formula:

(vals[i-1]-2*vals[i]+vals[i+1])/(dx*dx) 
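
For reference, this is the standard central-difference approximation for the second derivative; a quick Taylor-series check (my addition, not part of the original question) shows why it works and why its error is O(dx^2):

    f(x \pm h) = f(x) \pm h\,f'(x) + \tfrac{h^2}{2} f''(x) \pm \tfrac{h^3}{6} f'''(x) + O(h^4)
    f(x+h) + f(x-h) = 2 f(x) + h^2 f''(x) + O(h^4)
    \implies f''(x) = \frac{f(x-h) - 2 f(x) + f(x+h)}{h^2} + O(h^2)

with h = dx.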

I perform these calculations for arrays of different sizes, repeat each one a few times, and output the average time per run. That way I get a nice graph showing the time needed to calculate the derivatives for an array of a given size.

It all sounds fairly easy, but I've encountered a weird problem. Take a look at the graph: [graph: average computation time vs. array size]. When the arrays are small, nothing weird happens. However, when they are larger than 10000 elements (with two arrays that means about 160 kB of memory), I get periodic peaks, each about 512 elements after the previous one. This happens on several CPUs (AMD Sempron 3850, Intel Q9400 and Intel Xeon 5310), and doesn't happen on others (Intel i5 5200U, Intel Nehalem-C).

If that wasn't enough, when the arrays reach about 65'000 elements, the AMD 3850 on Windows 7 shows a sudden jump in time. This doesn't happen on Debian.

I think it might be somehow related to CPU caching, given the 'magical' element counts, and to the OS's scheduling, but I can't think of a specific explanation or a way to confirm it (except for profiling the CPU cache hit/miss ratio).
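
On Linux, profiling that ratio takes little setup with perf; something along these lines (my suggestion, not a command from the original post) would confirm or rule out the caching theory:

    perf stat -e cache-references,cache-misses ./a.out > /dev/null

If the miss rate spikes at the same array sizes as the timing peaks, cache aliasing is the likely culprit.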

Code:

    #include <cmath>
    #include <cstdio>

    double msclock(); // clock()/CLOCKS_PER_SEC on Linux, QueryPerformanceCounter on Windows (see below)

    int main(int, char**) {
        double maxerr = 0.0;
        double avgerr = 0.0;
        for(unsigned SIZE = 5000u; SIZE < 100000u; SIZE += 1u) {
            const unsigned REPEATS = 10000000 / SIZE + 1;
            const double STEP = 1.0 / (SIZE - 1.0) * atan(1.0) * 4.0; // pi/(SIZE-1)
            double *vals;
            double *derivs;
            // Allocation
            vals   = new double[SIZE];
            derivs = new double[SIZE];
            // Initialization
            for(unsigned i = 0u; i < SIZE; ++i) {
                vals[i] = sin(i * STEP);
                derivs[i] = 0.0;
            }
            // Normal computation
            const double TIME_FPU_START = msclock();
            for(unsigned r = 0u; r < REPEATS; ++r) {
                const double STEP2RCP = 1.0 / (STEP * STEP);
                for(unsigned i = 1u; i < SIZE - 1u; ++i) {
                    derivs[i] = (vals[i - 1] - 2.0 * vals[i] + vals[i + 1u]) * STEP2RCP;
                }
            }
            const double TIME_FPU_END = msclock();
            const double TIME_FPU = (TIME_FPU_END - TIME_FPU_START) / REPEATS;
            maxerr = 0.0;
            avgerr = 0.0;
            for(unsigned i = 1u; i < SIZE - 1; ++i) {
                double err = fabs((derivs[i] + sin(i * STEP)) / (-sin(i * STEP)));
                avgerr += err / (SIZE - 2);
                if(err > maxerr) {
                    //printf("%f > %f\n", derivs[i], -sin(i*STEP));
                    maxerr = err;
                }
            }
            const double MAX_REL_ERR_FPU = maxerr;
            const double AVG_REL_ERR_FPU = avgerr;
            // Summary
            fprintf(stdout, "%u %.8f %.8f\n", SIZE, TIME_FPU, AVG_REL_ERR_FPU);
            fprintf(stderr, "%u %.8f %.8f\n", SIZE, TIME_FPU, AVG_REL_ERR_FPU);
            // Cleanup
            delete [] vals;
            delete [] derivs;
        }
        return 0;
    }

The "msclock" function is plain old clock()/CLOCKS_PER_SEC on linux, and QueryPerformanceTimer on Windows (clock() and all its variants on Windows seem to have resolution of about 16ms.

Data file: https://pastebin.com/4iDn5Y9k (the first column is the array size, the second the average time, and the last the average error of the derivative).

Additional info: I've checked the addresses of the arrays I'm getting. The first one (vals) stays at the same address throughout, while the second one (derivs) keeps increasing. This probably means that the beginning of the second array lands in the same block as the end of the first one. With a 4 KiB page size, an increase of 512 elements (512 × 8 B = 4 KiB) is exactly one more page.
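
For illustration, here is roughly how such a check can be done (a hypothetical snippet I'm adding; the exact code isn't in the post):

    #include <cstdint>
    #include <cstdio>

    // Print a buffer's address and its offset within a 4 KiB page. Two buffers
    // with identical page offsets can alias in the (virtually indexed) L1 cache,
    // which is the precondition for the slowdown suspected here.
    static void dump_addr(const char *name, const double *p) {
        std::printf("%s: %p (page offset %u)\n", name, (const void*)p,
                    unsigned(reinterpret_cast<std::uintptr_t>(p) % 4096u));
    }

    // Usage inside the loop: dump_addr("vals", vals); dump_addr("derivs", derivs);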

Additional info: On Debian, the peaks are closely related to the addresses of the arrays (columns 4 and 5):

    50670 0.00021090 0.00000016 0x55efa5e04c30 0x55efa5e67bb0
    50680 0.00021244 0.00000017 0x55efa5e04c30 0x55efa5e67c00
    50690 0.00170968 0.00000023 0x7f1617d0b010 0x7f1617ca7010 @
    50700 0.00021141 0.00000024 0x55efa5e04c30 0x55efa5e67ca0 @
    50710 0.00021873 0.00000027 0x55efa5e04c30 0x55efa5e67cf0
    50720 0.00021126 0.00000002 0x55efa5e04c30 0x55efa5e67d40

Additional info: For arrays smaller than 10'000 elements, both arrays always end up at the same addresses. For larger ones, the second array keeps moving around. That might be the reason I'm not getting peaks for small sizes.
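
The addresses above hint at what's moving: 0x55... is the ordinary sbrk heap, while 0x7f... is an mmap'd block, so glibc seems to switch strategies between iterations. One way to test that, assuming glibc (a hypothetical experiment on my part, not something from the post), is to raise the mmap threshold and see whether the peaks and the 0x7f... addresses disappear together:

    #include <malloc.h>   // glibc-specific header for mallopt()

    int main() {
        // Serve allocations up to 1 GiB from the normal heap instead of mmap,
        // so the arrays keep stable addresses across iterations.
        mallopt(M_MMAP_THRESHOLD, 1 << 30);
        // ... run the benchmark loop as before ...
        return 0;
    }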

Also, I've played around with "perf stat", as @geza suggested. First, the output when run for sizes from 5000 to 12930:

     Performance counter stats for './a.out':

          70733.635707      task-clock (msec)         #    0.999 CPUs utilized
                 6,008      context-switches          #    0.085 K/sec
                     3      cpu-migrations            #    0.000 K/sec
                 3,866      page-faults               #    0.055 K/sec
        90,800,323,159      cycles                    #    1.284 GHz                 (50.00%)
                     0      stalled-cycles-frontend                                  (50.01%)
                     0      stalled-cycles-backend    #    0.00% backend cycles idle (50.01%)
       160,806,960,842      instructions              #    1.77  insn per cycle      (50.00%)
        16,118,385,897      branches                  #  227.874 M/sec               (50.00%)
             5,818,878      branch-misses             #    0.04% of all branches     (49.99%)

          70.839631335 seconds time elapsed

...and for sizes from 12940 to 30000:

     Performance counter stats for './a.out':

         157468.056544      task-clock (msec)         #    0.999 CPUs utilized
                13,420      context-switches          #    0.085 K/sec
                    10      cpu-migrations            #    0.000 K/sec
                32,271      page-faults               #    0.205 K/sec
       201,597,569,555      cycles                    #    1.280 GHz                 (50.00%)
                     0      stalled-cycles-frontend                                  (50.00%)
                     0      stalled-cycles-backend    #    0.00% backend cycles idle (50.00%)
       351,405,781,351      instructions              #    1.74  insn per cycle      (50.00%)
        35,289,858,166      branches                  #  224.108 M/sec               (50.00%)
            10,494,566      branch-misses             #    0.03% of all branches     (50.00%)

         157.692170504 seconds time elapsed

The program ran for about twice as long, so twice as many context switches aren't surprising. On this range, however, I'm getting disproportionately more page faults: 32,271 versus 3,866, roughly 8x for a 2x longer run.
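
To break those page faults down by type, perf can count them directly (again my suggestion; the post above only shows plain perf stat):

    perf stat -e page-faults,minor-faults,major-faults ./a.out > /dev/null

A pile of minor faults would fit the mmap theory: freshly mapped pages have to be faulted in on first touch every iteration.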

Another interesting thing: when I restarted the program for a second run, the second array didn't move around in memory so much. It seems to start doing that only after the program has been running for a while.

The longer I'm testing things, the less I know ;_;

  • Considering that the usual size for a double is 8 bytes, 10000 double elements would be 80000 bytes, not 16 KiB; you're off by a factor of five. Not that it really matters, though, not on anything remotely modern and non-embedded, as it's still a drop in the ocean. Commented Dec 13, 2017 at 10:31
  • FWIW, 10752 elements == 86016 B == 21 pages exactly. Not sure of the relevance of that specific number, but I imagine it's a clue. Commented Dec 13, 2017 at 12:15
  • Given that the peaks are all at exact multiples of the page size, I suspect that pathological aliasing of the two arrays is the cause. The only real mystery is why this behaviour is seemingly masked below 10752. Commented Dec 13, 2017 at 12:56
  • It has something to do with address alignment. If I align the addresses to the page size, I can produce some weird behavior. Not exactly yours, but there is some irregularity: the time grows linearly until 8692 elements (4.4 µs), jumps to 6 µs until 8830, then falls back to 4.4 µs. And there are irregularities at other sizes as well. (Maybe the two arrays don't need to be page-aligned, but just need the same alignment within a page?) Commented Dec 13, 2017 at 13:33
  • @crueltear - OK, well, that explains it then. Overall, the behaviour is being governed by the whims of the memory allocation algorithm (nothing to do with HW, presumably). Beyond 10k, you're seeing pathological aliasing every time your arrays exactly align with pages. Commented Dec 13, 2017 at 13:48
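
If pathological page aliasing is indeed the cause, as the last two comments suggest, the usual workaround is to pad one allocation so the two arrays can't end up at the same offset within a page. A minimal sketch (my addition, not something proposed in the thread):

    // Over-allocate derivs and shift it by half a page (256 doubles = 2 KiB).
    // If the allocator hands back two blocks with identical page offsets,
    // the shift ensures vals[i] and derivs[i] still land in different sets.
    double *derivs_raw = new double[SIZE + 256];
    double *derivs     = derivs_raw + 256;
    // ... run the computation with derivs exactly as before ...
    delete [] derivs_raw;   // free the original pointer, not the shifted one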
