C multithread performance issue

Question

I am writing a multi-threaded program to traverse an n x n matrix, where the elements in the main diagonal are processed in a parallel manner, as shown in the code below:

int main(int argc, char * argv[] ) { /* VARIABLES INITIALIZATION HERE */ gettimeofday(&start_t, NULL); //start timing for (int slice = 0; slice < 2 * n - 1; ++slice) { z = slice < n ? 0 : slice - n + 1; int L = 0; pthread_t threads[slice-z-z+1]; struct thread_data td[slice-z-z+1]; for (int j=z; j<=slice-z; ++j) { td[L].index= L; printf("create:%d\n", L ); pthread_create(&threads[L],NULL,mult_thread,(void *)&td[L]); L++; } for (int j=0; j<L; j++) { pthread_join(threads[j],NULL); } } gettimeofday(&end_t, NULL); printf("Total time taken by CPU: %ld \n", ( (end_t.tv_sec - start_t.tv_sec)*1000000 + end_t.tv_usec - start_t.tv_usec)); return (0); } void *mult_thread(void *t) { struct thread_data *my_data= (struct thread_data*) t; /* SOME ADDITIONAL CODE LINES HERE */ printf("ThreadFunction:%d\n", (*my_data).index ); return (NULL); }

The problem is that this multithreaded implementation gave me a very bad performance compared with the serial (naive) implementation.

Are there some adjustments that could be done to improve the performance of the multithreaded version ??

Have you attempted to identify those parts of the code that might benefit from multithreaded execution? Adding more threads to a program doesn't automatically make it faster. — Robert Harvey
– Robert Harvey, Commented Apr 19, 2015 at 22:34
is the task done by each thread large enough to overcome the overhead? Also it might be an idea to use a maximum amount of threads because your system can handle only 4, 8, 16 etc threads — Simon
– Simon, Commented Apr 19, 2015 at 22:39
I could imagine that the single-threaded version profits from caching in a way the multi-threaded one can't because memory accesses come in randomly. Have a look at this question. Maybe try storing the matrix as diagonals? — dddsnn
– dddsnn, Commented Apr 19, 2015 at 22:54
Create a new thread also have some cost . if the cost in "SOME ADDITIONAL CODE LINES HERE" is not much higher than the cost in thread creating , it may even make the performance worse.You may create some thread first (2 * cpu thread , in my experience),and write you task to a queue or some other structure. — Lumen
– Lumen, Commented Apr 19, 2015 at 23:29
You are missing the point that @RobertHarvey and the others are making. Using multithreading does not automatically increase performance. And in some cases it will even cause performance degradation. Show us the "ADDITIONAL CODE LINES". Without that no one can tell you whether there is something you are doing wrong that results in no performance gain or whether the problem you are trying to solve does not benefit from multithreading. — kaylum
– kaylum, Commented Apr 20, 2015 at 0:06

Lumen · Accepted Answer · 2015-04-23 13:43:47Z

a thread pool may make it better.

define a new struct type as follow.

typedef struct { struct thread_data * data; int status; // 0: ready // 1: adding data // 2: data handling, 3: done int next_free; } thread_node;

init :

size_t thread_size = 8; thread_node * nodes = (thread_node *)malloc(thread_size * sizeof(thread_node)); for(int i = 0 ; i < thread_size - 1 ; i++ ) { nodes[i].next_free = i + 1; nodes[i].status = 0 ; } nodes[thread_size - 1].next_free = -1; int current_free_node = 0 ; pthread_mutex_t mutex;

get thread :

int alloc() { pthread_mutex_lock(&mutex); int rt = current_free_node; if(current_free_node != -1) { current_free_node = nodes[current_free_node].next_free; nodes[rt].status = 1; } pthread_mutex_unlock(&mutex); return rt; }

return thread :

void back(int idx) { pthread_mutex_lock(&mutex); nodes[idx].next_free = current_free_node; current_free_node = idx; nodes[idx].status = 0; pthread_mutex_unlock(&mutex); }

create the threads first, and use alloc() to try to get a idle thread, update the pointer.

don't use join to judge the status.
modify your mult_thread as a loop and after the job finished , just change your status to 3
for each loop in the thread , you may give it more work

I wish it will give you some help.

------------ UPDATED Apr. 23, 2015 -------------------

here is a example.

compile & run with command $ g++ thread_pool.cc -o tp -pthread --std=c++

yu:thread_pool yu$ g++ tp.cc -o tp -pthread --std=c++11 && ./tp 1227135.147 1227176.546 1227217.944 1227259.340... time cost 1 : 1068.339091 ms 1227135.147 1227176.546 1227217.944 1227259.340... time cost 2 : 548.221607 ms

you may also remove timer and it can also compiled as a std c99 file.

In current , the thread size has been limited to 2. You may also adjust the parameter thread_size, and recompile & run again. More threads may give your some more advantage(in my pc, if I change the thread size to 4, the task will finish in 280ms), while too much thread number may not help you too much if you have no enough cpu thread.

As i am new in pthread programming, a faced some difficulties to understand some parts of your suggested updates: - Are you suggesting to use a fixed number of threads, which is 8 ? - How to call the functions alloc and back ? I would appreciate if you can explain the questions above
@MROF may I know the final result ? if it can answer your question, could you please click the 'accept' button?

Collectives™ on Stack Overflow

C multithread performance issue

1 Answer 1

2 Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

2 Comments

Linked

Related