Timeline for How do I find what's causing a task to be slow, when CPU, memory, disk and network are not used at 100%?
Current License: CC BY-SA 4.0
12 events
| when | what | by | license | comment |
|---|---|---|---|---|
| Nov 13 at 2:21 | audit | | | First answers |
| Oct 30 at 7:43 | comment added | haylem | | I'd actually recommend generating thread dumps and doing a sampling analysis. While profiling is precise and allows easy real-time monitoring, depending on the stack used it tends to generate false positives, since the observation itself changes the nature of the experiment. In comparison, generating thread dumps is less invasive (except for GC pauses / proc pauses on some stacks). The great thing is that it can be left running in the background for a while, and is equally interesting under normal usage or under load. Flamegraphs are handy (but not the only way). |
| Oct 27 at 18:37 | comment added | T.E.D. | | +1, saw this on the HNQ and "profiling" immediately leapt to mind. If you don't profile, the danger (actually almost a certainty) is that you'll end up wasting days or weeks trying to optimize something that isn't even the thing taking all the time. |
| Oct 25 at 21:07 | comment added | Peter Cordes | | @Basilevs: By CPU time, I mean what Linux perf stat calls task-clock. CPU time / wall-clock time = the number of CPU cores you're keeping busy, on average, over the time interval of that part of your workload. If that's much lower than you expected, that tells you something. Memory latency is one reason why CPU time might be higher than expected for the same number of instructions, once you drill down into IPC (instructions per cycle) to see whether the code has low or high throughput while it is running on CPU cores. |
| Oct 25 at 15:58 | comment added | Basilevs | | @BartvanIngenSchenau If by "creating single-threaded results" you mean measuring them, then I'm not equating those. CPU times can't be equated with throughput because of IO and memory latencies. I'm just stating that measuring CPU time or single-threaded throughput has no predictive power for multi-threaded performance. Single-threaded throughput is correlated with CPU times, but both mean nothing for the real throughput. |
| Oct 25 at 11:54 | comment added | Bart van Ingen Schenau | | @Basilevs, why do you equate measuring CPU time with creating single-threaded results? |
| Oct 25 at 11:41 | comment added | Basilevs | | @PeterCordes Common misconception. CPU seconds are not the thing you need to profile. IO, memory access, and memory caches affect throughput, and all of these depend severely on the threading configuration. Do not use single-threaded results for anything. CPU time is completely pointless on modern hardware. |
| Oct 25 at 10:43 | comment added | Peter Cordes | | @Basilevs: You can still profile a part, in terms of how many CPU seconds it took as well as wall-clock start/stop time, even if there is overlap with other parts. Also how long its threads spent sleeping, waiting for disk or waiting for a free CPU. (Those threads are potentially competing with other work on the system, specifically other parts, but you can at least rule out e.g. CPU contention if threads spend no time sleeping when they're ready to run and not in disk-sleep.) But yeah, calling them "steps" sounds optimistic and/or simplistic. |
| Oct 25 at 8:11 | comment added | Basilevs | | @PeterCordes Per-step profiling is generally bad advice due to synchronization issues. Multi-threaded executions can't be evaluated in parts. |
| Oct 25 at 3:22 | comment added | Peter Cordes | | The OP is probably expecting their workload to take advantage of multiple cores at some points. Profiling to check this would be a good idea. You suggest only logging/recording time at the start/end of each step, but you can also profile CPU utilization on a per-step basis, especially if the steps are non-overlapping. Finding out that some slow step is only using a single CPU core, or is I/O bound (on disk latency instead of throughput, perhaps), would suggest an avenue of attack for where to spend some effort parallelizing it or removing serial dependencies. |
| Oct 25 at 0:18 | history edited | candied_orange | CC BY-SA 4.0 | edited body |
| Oct 24 at 12:59 | history answered | Bart van Ingen Schenau | CC BY-SA 4.0 | |
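The CPU-time vs. wall-clock ratio discussed in the comments above (Peter Cordes' point that CPU time / wall-clock time equals the average number of cores kept busy) can be sketched per step without any external profiler. This is a minimal illustration, not the answer's method; the step names and the `measure_step` helper are hypothetical:

```python
import time

def measure_step(step):
    """Run one pipeline step and report wall-clock time, CPU time,
    and their ratio (the average number of CPU cores kept busy)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # CPU seconds, summed over all threads
    result = step()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    # cpu / wall ~ cores busy on average during this step:
    #   near 0  -> mostly waiting (I/O, locks, sleeps)
    #   near 1  -> single-threaded compute
    #   well > 1 -> parallel compute across cores
    print(f"{step.__name__}: wall={wall:.3f}s cpu={cpu:.3f}s cores~{cpu / wall:.2f}")
    return result

# A step that mostly sleeps shows a ratio near 0,
# flagging that it is waiting rather than computing.
def io_like_step():
    time.sleep(0.2)

measure_step(io_like_step)
```

A ratio far below the expected core count points at blocking (disk, network, locks) rather than compute, which is exactly the distinction the comment thread is debating.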
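haylem's suggestion of periodic thread dumps plus sampling analysis can also be sketched. This is a CPython-specific toy (it relies on `sys._current_frames()`, an implementation detail), and the `sample`/`slow_worker` names are hypothetical, but it shows the idea: dump every thread's stack on an interval and count where threads are sitting, with no instrumentation of the code under test:

```python
import sys
import threading
import time
import traceback
from collections import Counter

def dump_threads():
    """Take one 'thread dump': capture the current stack of every thread."""
    frames = sys._current_frames()  # {thread_id: topmost frame} (CPython)
    stacks = {}
    for thread in threading.enumerate():
        frame = frames.get(thread.ident)
        if frame is not None:
            stacks[thread.name] = traceback.extract_stack(frame)
    return stacks

def sample(duration=1.0, interval=0.05):
    """Poor man's sampling profiler: dump threads periodically and count
    which function each thread was in. Functions dominating the counts
    are where the time goes."""
    hot = Counter()
    end = time.monotonic() + duration
    while time.monotonic() < end:
        for name, stack in dump_threads().items():
            top = stack[-1]  # innermost frame
            hot[(name, top.name)] += 1
        time.sleep(interval)
    return hot

# A worker stuck in a slow function shows up in most samples.
def slow_worker():
    time.sleep(2)

t = threading.Thread(target=slow_worker, name="worker", daemon=True)
t.start()
counts = sample(duration=0.5)
print(counts.most_common(3))
```

Aggregating such samples over stacks rather than just the innermost frame is what a flamegraph visualizes; real tooling (jstack for the JVM, py-spy for Python, perf for native code) does the same thing with far less overhead.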