
I'm trying to implement Conway's Game of Life in C and show that it scales with OpenMP. The results in the picture below come from running it on one socket of a machine with 64 cores, evenly divided into 4 NUMA regions. They show a clear stall in performance at 42 cores and then a sharp drop as more cores are added. The benchmark is run with OMP_PLACES=cores and OMP_PROC_BIND=close.

[Figure: OpenMP scalability results]

I made a simplified and lighter version of the code, which obtains similar results, for this forum:

```c
#include <stdio.h>
#include <stdlib.h>

/* Note: is_alive, UPDATE_CELL, CPU_TIME and calculate_cache_padding are
 * defined elsewhere in the full code */

double tend;
double tstart;
double ex_time;

#define NROWS 3000
#define NCOLS 3000
#define NUM_STEPS 50
#define PROB 0.5

int count_alive_neighbours_multi(const unsigned char *restrict map,
                                 const int ncols, const int index) {
  int prev_row = index - ncols;
  int next_row = index + ncols;
  int count0 = is_alive(&map[index - 1]) + is_alive(&map[index + 1]);
  int count1 = is_alive(&map[prev_row]) + is_alive(&map[prev_row - 1]) +
               is_alive(&map[prev_row + 1]);
  int count2 = is_alive(&map[next_row]) + is_alive(&map[next_row - 1]) +
               is_alive(&map[next_row + 1]);
  return count0 + count1 + count2;
}

void smaller_static_evolution(int num_steps, unsigned char *restrict current,
                              int ncols, int nrows) {
  /* Performs a single step of the update of the map */
  /* Create copy of MPI local map (current) but initialize it with OpenMP to
   * place it in the correct threads */
  printf("Running smaller_static_evolution with \nnum_steps: %d\n", num_steps);
  printf("ncols: %d\n", ncols);
  printf("nrows: %d\n", nrows);

  unsigned char *sub_map = NULL;

#pragma omp parallel
  {
#pragma omp single
    {
      sub_map = (unsigned char *)malloc(ncols * nrows * sizeof(unsigned char));
      if (sub_map == NULL) {
        printf("Could not allocate memory for local maps\n");
        exit(1);
      }
    }

    int i = 0;

    /* Initialize local sub map with OpenMP, to warm up the data properly */
#pragma omp for schedule(static, 4)
    for (int i = 0; i < nrows * ncols; i++) {
      sub_map[i] = current[i];
    }

    for (int iter = 0; iter < num_steps; iter++) {
#pragma omp for schedule(static, 4) private(i)
      for (int row = 1; row < nrows - 1; row++) {
        /* Split work with OpenMP over rows */
        for (int col = 1; col < ncols - 1; col++) {
          i = row * ncols + col;
          int alive_counter = count_alive_neighbours_multi(sub_map, ncols, i);
          sub_map[i] += UPDATE_CELL(alive_counter);
        }
      }
    }

    /* Copy sub_map into current and free sub_map */
#pragma omp for schedule(static, 4)
    for (int i = 0; i < nrows * ncols; i++) {
      current[i] = sub_map[i];
    }
  }
  free(sub_map);
}

int main(int argc, char **argv) {
  int nrows = NROWS;
  int ncols = NCOLS + calculate_cache_padding(NCOLS);
  unsigned char *current =
      (unsigned char *)malloc(nrows * ncols * sizeof(unsigned char));

  /* Randomly initialize matrix */
  for (int i = 0; i < nrows * ncols; i++) {
    float random = (float)rand() / RAND_MAX;
    if (random < PROB)
      current[i] = 1;
    else
      current[i] = 0;
  }

  tstart = CPU_TIME;
  smaller_static_evolution(NUM_STEPS, current, ncols, nrows);
  free(current);
  tend = CPU_TIME;
  ex_time = tend - tstart;
  printf("\n\n%f\n\n", ex_time);
}
```

The function calculate_cache_padding() returns the number of extra columns needed to make the total number of columns a multiple of the cache-line size (64 bytes). Since the loop over rows is split between the OpenMP threads, I'm not expecting this to produce false sharing.

Is there anything that I'm clearly doing wrong? Does the plot suggest anything obvious or do you have any ideas on how I can debug this?

  • It looks like your code has a data race on sub_map. Fix the code before you try to interpret performance. Commented Jun 19 at 22:27
  • I am actually surprised that you can get a 30x speed-up on this code. Can you show the compilation options you are using? Commented Jun 20 at 7:16
  • @PierU other than warning options I used -fstrict-aliasing -fno-omit-frame-pointer -O3 -march=native --fast-math. The results from the image are from the whole code but this smaller example shows the same tendency. Anyway, @ViktorEijkhout's suggestion was correct and removed the big drop in performance. Commented Jun 20 at 8:05
  • Running the code with ThreadSanitizer highlights that there is a data race between the is_alive accesses in count_alive_neighbours_multi and the write performed with the result of UPDATE_CELL. Commented Jun 20 at 9:18

2 Answers


Using a single map is wrong even in sequential execution (you will work on partially updated cells), and it leads to a data race in parallel execution.

The main issue is that the code is memory bound. Accessing stencil values written by a thread on a different NUMA domain necessarily goes through memory. With smaller chunk sizes, you are increasing the number of cuts through the map and therefore the number of elements accessed by different threads. Elements accessed by only a single thread or by threads in the same NUMA domain can remain at higher cache levels and reduce the necessary memory accesses.

This (MWE) version resolves the data races and fixes the first-touch initialization, but still shows limited scalability:

```c
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define NROWS 3000
#define NCOLS 3000
#define NUM_STEPS 50
#define PROB 0.5

#define is_alive !!*
#define calculate_cache_padding(m) ((64 - (m % 64)) % 64)
#define CPU_TIME omp_get_wtime()

int count_alive_neighbours_multi(const unsigned char *restrict map,
                                 const int ncols, const int index) {
  int prev_row = index - ncols;
  int next_row = index + ncols;
  int count0 = is_alive(&map[index - 1]) + is_alive(&map[index + 1]);
  int count1 = is_alive(&map[prev_row]) + is_alive(&map[prev_row - 1]) +
               is_alive(&map[prev_row + 1]);
  int count2 = is_alive(&map[next_row]) + is_alive(&map[next_row - 1]) +
               is_alive(&map[next_row + 1]);
  int count = count0 + count1 + count2;
  /* Standard rule: birth with 3 neighbours, survival with 2 or 3 */
  if (count == 3) return 1;
  if (is_alive(&map[index]) && count == 2) return 1;
  return 0;
}

void smaller_static_evolution(int num_steps, unsigned char *restrict current,
                              int ncols, int nrows) {
  printf("Running smaller_static_evolution with \nnum_steps: %d\n", num_steps);
  printf("ncols: %d\n", ncols);
  printf("nrows: %d\n", nrows);

  unsigned char *sub_map =
      (unsigned char *)malloc(ncols * nrows * sizeof(unsigned char));
  if (sub_map == NULL) {
    printf("Could not allocate memory for local maps\n");
    exit(1);
  }
  unsigned char *old_map =
      (unsigned char *)malloc(ncols * nrows * sizeof(unsigned char));
  if (old_map == NULL) {
    printf("Could not allocate memory for local maps\n");
    exit(1);
  }

#pragma omp parallel firstprivate(old_map, sub_map)
  {
    /* Copy the fixed boundary rows into both buffers: boundary cells are
     * never recomputed but are read from whichever buffer is "old" */
#pragma omp masked
    for (int col = 0; col < ncols; col++) {
      int i = 0 * ncols + col;
      old_map[i] = current[i];
      sub_map[i] = current[i];
      i = (nrows - 1) * ncols + col;
      old_map[i] = current[i];
      sub_map[i] = current[i];
    }

    /* Initialize the maps with OpenMP, so first touch places each row's
     * pages on the NUMA domain of the thread that will work on it */
#pragma omp for schedule(static)
    for (int row = 1; row < nrows - 1; row++) {
      for (int col = 0; col < ncols; col++) {
        int i = row * ncols + col;
        old_map[i] = current[i];
        sub_map[i] = current[i]; /* boundary columns must be valid in both */
      }
    }

    for (int iter = 0; iter < num_steps; iter++) {
#pragma omp for schedule(static)
      for (int row = 1; row < nrows - 1; row++) {
        /* Split work with OpenMP over rows */
        for (int col = 1; col < ncols - 1; col++) {
          int i = row * ncols + col;
          sub_map[i] = count_alive_neighbours_multi(old_map, ncols, i);
        }
      }
      /* swap: after this, old_map holds the freshly computed state */
      unsigned char *tmp_map = old_map;
      old_map = sub_map;
      sub_map = tmp_map;
    }

    /* Copy the latest state (old_map after the final swap) back into current */
#pragma omp for schedule(static)
    for (int i = 0; i < nrows * ncols; i++) {
      current[i] = old_map[i];
    }
  }
  free(sub_map);
  free(old_map);
}

int main(int argc, char **argv) {
  int nrows = NROWS;
  int ncols = NCOLS + calculate_cache_padding(NCOLS);
  unsigned char *current =
      (unsigned char *)malloc(nrows * ncols * sizeof(unsigned char));

  // make distribution reproducible
  srand(1337);

  /* Randomly initialize matrix */
  for (int i = 0; i < nrows * ncols; i++) {
    float random = (float)rand() / RAND_MAX;
    if (random < PROB)
      current[i] = 1;
    else
      current[i] = 0;
  }

  double tstart = CPU_TIME;
  smaller_static_evolution(NUM_STEPS, current, ncols, nrows);
  free(current);
  double tend = CPU_TIME;
  double ex_time = tend - tstart;
  printf("\n\n%f\n\n", ex_time);
}
```

2 Comments

Thank you for your time, Joachim. I was using a single map variable because I wanted the application to use as little space as possible (a constraint I gave myself, not a necessary one). Therefore, I mapped old values to the MSB and new values to the LSB, shifting left at every iteration (I didn't include this detail in the example). I guess this idea was not so good after all and introduced the data races you are talking about?
If you really want to reduce the memory footprint, you could use a single bit for each cell, i.e., store 8 columns in one byte. It would make the code a bit more complex, but shrinking the memory footprint should also help at the memory-bandwidth bottleneck.

First: You should allocate sub_map only once per program run, outside the parallel region. The allocation inside the region is a sequential bottleneck.

Then: Do not use a chunk size in the static schedule. It makes the cores interleave the data they access, which can conflict with the close binding. Also, on the element-wise loops your 4-byte chunks are much smaller than a cache line, so you probably get false sharing and other conflicts.

2 Comments

Thank you, this seems to have solved the problem. I still don't really understand why the malloc can't be called inside the #pragma omp single region. I guess I don't see why I should do it either, but what's the mistake?
The malloc is called exactly once in the code shown. It doesn't matter for performance whether it's inside the parallel region or outside. If you moved it out of the time measurement, that would make a difference.
