I am running an AI inference program based on Tensorflow-gpu.
By running /usr/bin/strace -c -f /usr/bin/time ./program, I got the following output:
<my program's output> 367.91user 1032.14system 26:43.41elapsed 87%CPU (0avgtext+0avgdata 4158812maxresident)k <----OUTPUT OF TIME 1760inputs+24outputs (0major+4259846minor)pagefaults 0swaps % time seconds usecs/call calls errors syscall <----OUTPUT OF STRACE ------ ----------- ----------- --------- --------- ---------------- 88.45 34105.209840 7166 4759375 217371 futex 8.71 3356.854321 49032 68462 poll 2.68 1032.625515 2044803 505 wait4 0.05 21.006854 119 177269 clock_gettime <..and some other syscalls that didn't took very long...> ------ ----------- ----------- --------- --------- ---------------- 100.00 38559.038652 5385548 247845 total It shows that my program spent 34105 seconds in futex alone, which is 20 times longer than the elapsed time of 26:43.41.
I assumed that strace was recording the total system call time on all of my cores, so I re-experimented with only a single core enabled (using taskset), but the problem persisted.
Edit:
I did use taskset with the --all-tasks option:
/usr/bin/taskset --all-tasks --cpu-list 0 /usr/bin/strace -c -f /usr/bin/time ./program
tasksetto constrain where your process and its threads/children executed? For example, did you use the--all-tasksoption?pthreads that all try to grab apthread_mutex_t. Only the first thread to grab the lock willusleepfor 10 seconds in the critical section, causing the other 9 threads to block infutex. I ran it withtaskset,straceandtime, andstracereported that the system calls (especiallyfutex) only took 0.06 seconds in total, whentimereported 10 seconds elapsed. If your assumption were correct,futexshould be 90 seconds, right?