3

I am running an AI inference program based on Tensorflow-gpu.

By running /usr/bin/strace -c -f /usr/bin/time ./program, I got the following output:

<my program's output> 367.91user 1032.14system 26:43.41elapsed 87%CPU (0avgtext+0avgdata 4158812maxresident)k <----OUTPUT OF TIME 1760inputs+24outputs (0major+4259846minor)pagefaults 0swaps % time seconds usecs/call calls errors syscall <----OUTPUT OF STRACE ------ ----------- ----------- --------- --------- ---------------- 88.45 34105.209840 7166 4759375 217371 futex 8.71 3356.854321 49032 68462 poll 2.68 1032.625515 2044803 505 wait4 0.05 21.006854 119 177269 clock_gettime <..and some other syscalls that didn't took very long...> ------ ----------- ----------- --------- --------- ---------------- 100.00 38559.038652 5385548 247845 total 

It shows that my program spent 34105 seconds in futex alone, which is 20 times longer than the elapsed time of 26:43.41.

I assumed that strace was recording the total system call time on all of my cores, so I re-experimented with only a single core enabled (using taskset), but the problem persisted.

Edit:

I did use taskset with the --all-tasks option:

/usr/bin/taskset --all-tasks --cpu-list 0 /usr/bin/strace -c -f /usr/bin/time ./program 
6
  • 1
    Can you edit your question and include details on how you used taskset to constrain where your process and its threads/children executed? For example, did you use the --all-tasks option? Commented Mar 9, 2021 at 4:12
  • @AndyDalton Yes I did. I have edited my question. Commented Mar 9, 2021 at 7:25
  • I think the issue is that's a blocking call, so you can have multiple threads "in" that system call at the same time, even when everything is pinned to a single CPU. Commented Mar 9, 2021 at 19:31
  • 1
    @AndyDalton That was one of my assumptions, but a little experiment I did proved it wrong. I wrote a simple program where it starts 10 pthreads that all try to grab a pthread_mutex_t. Only the first thread to grab the lock will usleep for 10 seconds in the critical section, causing the other 9 threads to block in futex. I ran it with taskset, strace and time, and strace reported that the system calls (especially futex) only took 0.06 seconds in total, when time reported 10 seconds elapsed. If your assumption were correct, futex should be 90 seconds, right? Commented Mar 10, 2021 at 1:31
  • I did a similar experiment, and got the same results as you did. I'm out of ideas. Commented Mar 10, 2021 at 2:06

1 Answer 1

1

Looking at man strace I see:

-c --summary-only ... If -c is used with -f, only aggregate totals for all traced processes are kept. -f --follow-forks Trace child processes as they are created by currently traced processes as a result of the fork(2), vfork(2) and clone(2) system calls. ... 

Here, you're specifying both -c and -f. If your process forks or is multi-threaded, then the results are aggregated across the processes. If, for example, you had multiple threads all blocked in futex(), you could see the behavior that you're seeing.

The futex system call that you observe unusually long times in is a blocking call, so it doesn't actively take CPU time -- you can have multiple threads blocked that call at the same time, even if they're all pinned to the same CPU.

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.