
I am trying to improve my understanding, following this (so far) unanswered question: Possible limiting factor during upgrade of Fedora VM - not disk or cpu or network?

I ran the following test load, which took 200 seconds to complete.

sudo perf trace -s time perf stat dnf -y --releasever=30 --installroot=$HOME/fedora-30 --disablerepo='*' --enablerepo=fedora --enablerepo=updates install systemd passwd dnf fedora-release vim-minimal

I am running this on a fairly default, straightforward install of Fedora Workstation 29. It is not a VM. The kernel version is 5.0.9-200.fc29.x86_64. The IO scheduler is mq-deadline.

I use LVM, and the ext4 filesystem. I am not using any encryption on my disk or filesystem. I do not have any network filesystem mounted at all, so I am not reading or writing a network filesystem.

I have 4 "CPUs": 2 cores with 2 threads each.

I have only one disk, /dev/sda, which is a SATA HDD. The HDD supports NCQ: cat /sys/class/block/sda/device/queue_depth shows 32.

vmstat 5 showed that non-idle CPU time sometimes rose to about one CPU, i.e. idle was as low as 75%.

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd    free   buff   cache   si   so    bi    bo    in    cs us sy id wa st
...
 1  1   3720 1600980 392948 3669196    0    0    14 30634  8769  1876  4  3 78 15  0
 1  1   3720 1600144 393128 3669884    0    0     0  3460  1406  1468  0  1 80 18  0
 0  1   3720 1599424 393356 3670532    0    0     0  6830  1416  1480  0  1 73 25  0
 0  1   3720 1598448 393700 3671108    0    0     0  7766  1420  1362  0  1 78 20  0
 0  1   3720 1597296 393940 3671616    0    0     0  5401  1374  1367  1  2 87 11  0
 0  1   3720 1596332 394312 3672416    0    0     0  1162  1647  1558  1  2 85 13  0
 3  0   3720 1652064 394564 3673540    0    0     0 17897 15406  1784  1  3 74 23  0
 0  0   3720 1972876 394600 3541064    0    0     0   522  8028 18327  3  3 84 10  0
 1  0   3720 1974152 394600 3541108    0    0     0     9   422   879  2  1 97  0  0
 0  0   3720 1974136 394600 3541120    0    0     0     0   204   455  0  0 99  0  0
(end of test)

And the "IO wait" time (the wa field under cpu in vmstat) rose as high as 25%. I think this means 100% of one CPU. But atopsar -d 5 showed the utilization of my disk did not directly match this. It was much less than 100%:

22:46:44  disk   busy  read/s  KB/read  writ/s  KB/writ  avque  avserv  _dsk_
...
22:49:34  sda      5%     0.4      4.0    69.5    413.0   36.9  0.68 ms
22:49:39  sda      8%     0.2     60.0   120.6     30.6   18.7  0.66 ms
22:49:44  sda      8%     0.0      0.0   136.2     16.7   20.4  0.61 ms
22:49:49  sda     10%     0.0      0.0   157.1     44.2   21.4  0.65 ms
22:49:54  sda      9%     0.0      0.0   196.4     39.3   48.0  0.47 ms
22:49:59  sda      9%     0.0      0.0   148.9     36.6   32.6  0.62 ms
22:50:04  sda     10%     0.0      0.0   137.3    130.6   37.2  0.70 ms
22:50:09  sda     11%     0.0      0.0   199.6      5.4   13.5  0.55 ms
22:50:14  sda      2%     0.0      0.0    50.2      4.5   11.8  0.32 ms
22:50:19  sda      0%     0.0      0.0     0.8     11.0   13.3  0.75 ms
(end of test)

How can "IO wait" time be higher than disk utilization?

Following is the definition taken from the sar manpage:

%iowait:

Percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request.

Therefore, %iowait means that from the CPU point of view, no tasks were runnable, but at least one I/O was in progress. iowait is simply a form of idle time when nothing could be scheduled. The value may or may not be useful in indicating a performance problem, but it does tell the user that the system is idle and could have taken more work.

https://support.hpe.com/hpsc/doc/public/display?docId=c02783994
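To make that definition concrete, here is a hedged sketch (not what sar itself runs) of how %iowait falls out of two snapshots of the aggregate "cpu" line in /proc/stat. The field order matches /proc/stat, but both snapshot lines and the resulting 25% are invented for illustration:

```python
# Hedged sketch: %iowait over an interval, from two invented snapshots of
# the aggregate "cpu" line in /proc/stat.
# Fields after "cpu": user nice system idle iowait irq softirq steal ...
def parse_cpu_line(line):
    fields = [int(x) for x in line.split()[1:]]
    return sum(fields), fields[4]          # (total ticks, iowait ticks)

def iowait_percent(before, after):
    total0, iow0 = parse_cpu_line(before)
    total1, iow1 = parse_cpu_line(after)
    return 100.0 * (iow1 - iow0) / (total1 - total0)

# Two made-up snapshots, 500 ticks apart, 125 of them spent in iowait:
t0 = "cpu  1000 0 500 8000 200 10 10 0 0 0"
t1 = "cpu  1110 0 555 8200 325 15 15 0 0 0"
print(round(iowait_percent(t0, t1), 1))    # 25.0
```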

"IO wait" is tricksy to define on multi-CPU systems. See How does a CPU know there is IO pending? . But even if you think I was wrong to multiply the above "IO wait" figure by 4, it would still be higher than the disk utilization figure!

I expect the disk utilization figure in atopsar -d (and equally in atop / sar -d / iostat -x / mxiostat.py) is calculated from one of the kernel iostat fields. The linked doc mentions "Field 10 -- # of milliseconds spent doing I/Os". There is a more detailed definition as well, although I am not sure that the functions it mentions still exist in the current multi-queue block layer.
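As a sketch of that calculation: the busy% these tools print is essentially the delta in field 10 divided by the interval. The field positions below match /proc/diskstats, but the two sample lines and the 5-second interval are invented:

```python
# Hedged sketch: deriving disk busy% from field 10 of /proc/diskstats
# ("# of milliseconds spent doing I/Os"). Sample lines are invented.
def io_ticks_ms(diskstats_line):
    # Tokens: major minor name stat1..stat11; field 10 is token index 12
    return int(diskstats_line.split()[12])

def busy_percent(before, after, interval_s):
    delta_ms = io_ticks_ms(after) - io_ticks_ms(before)
    return 100.0 * delta_ms / (interval_s * 1000.0)

t0 = "8 0 sda 52310 1200 2931712 48210 811230 31200 41627384 901200 0 612000 949410"
t1 = "8 0 sda 52310 1200 2931712 48210 811930 31260 41659384 905100 0 612400 953310"
# Only 400 ms of the 5000 ms interval had I/O in flight on the device:
print(busy_percent(t0, t1, 5))   # 8.0
```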


Thanks to perf trace in the test command, I can also report that dnf's fdatasync() calls were responsible for 81 of the 200 seconds of elapsed time. This evidence suggests to me that the "IO wait" figure gives a more accurate impression than the disk utilization figure.

 199.440461084 seconds time elapsed

  60.043226000 seconds user
  11.858057000 seconds sys

60.07user 12.17system 3:19.84elapsed 36%CPU (0avgtext+0avgdata 437844maxresident)k
496inputs+2562448outputs (25major+390722minor)pagefaults 0swaps

 Summary of events:
 ...
 dnf (6177), 2438150 events, 76.0%

   syscall            calls     total       min       avg       max    stddev
                               (msec)    (msec)    (msec)    (msec)       (%)
   --------------- -------- --------- --------- --------- --------- --------
   fdatasync           3157 81436.062     0.160    25.795   465.329     1.92%
   wait4                 51 43911.444     0.017   861.009 32770.388    74.93%
   poll               45188  6052.070     0.001     0.134    96.764     5.58%
   read              341567  2718.531     0.001     0.008  1372.100    50.63%
   write             120402  1616.062     0.002     0.013   155.142     9.61%
   getpid             50012   755.423     0.001     0.015   207.506    32.86%
   ...
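For comparison, here is a minimal sketch of the per-call measurement perf trace summarises above: timing a single fdatasync() in isolation. It uses os.fdatasync on a throwaway temp file and is not dnf's code:

```python
# Hedged sketch (not dnf's code): time one fdatasync() call, the way
# perf trace accounts per-call syscall latency above.
import os
import tempfile
import time

def time_fdatasync(nbytes=4096):
    """Write nbytes of dirty data, then time the fdatasync() that flushes it."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"x" * nbytes)
        start = time.perf_counter()
        os.fdatasync(fd)           # blocks until the data reaches the device
        return (time.perf_counter() - start) * 1000.0   # milliseconds
    finally:
        os.close(fd)
        os.unlink(path)

print(f"fdatasync of 4 KiB took {time_fdatasync():.3f} ms")
```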

1 Answer


"IO wait" time rose as high as 25%. I think this means 100% of one CPU".

No, that's basically where you start to go wrong. I/O waiting does not require a CPU core at all. I/O wait is basically a state in the kernel that means "don't waste the CPU on this process, it's still waiting for an external action to complete". When the kernel sees an I/O operation complete, it will find the waiting process and reschedule it.

As a result, you can have 100 processes waiting for I/O, and in one second of wall-clock time they will accumulate 100 seconds of I/O wait. Yet the CPU may be fully busy servicing the 101st process.
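The bookkeeping in that example is just multiplication, since each blocked task accumulates wait time independently. A toy sketch with invented numbers:

```python
# Toy illustration: iowait accumulates per waiting task, independently of
# CPU or device busy time. All numbers are invented.
wall_clock_s = 1.0
waiters = 100                     # tasks blocked in I/O for the whole second
iowait_accumulated_s = waiters * wall_clock_s
print(iowait_accumulated_s)       # 100.0 s of I/O wait in 1 s of wall clock
```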

Also, you compare I/O wait with disk utilization. While disks are one form of I/O, they are not the only one. Network sockets can also be waited on. And this type of I/O might be hidden when you are using a remote file system.

  • "%iowait means that from the CPU point of view, no tasks were runnable, but at least one I/O was in progress. iowait is simply a form of idle time when nothing could be scheduled. The value may or may not be useful in indicating a performance problem, but it does tell the user that the system is idle and could have taken more work." Commented May 3, 2019 at 14:36
  • To start with, consider a system with only one CPU and one disk. If "no tasks were runnable, but at least one I/O was in progress" for 50% of the time, then disk utilization should be at least 50%. Commented May 3, 2019 at 14:38
  • @sourcejedi: Well, there's the potential for some measurement noise. The process might not be scheduled back in at the very nanosecond when its I/O completes. And with I/O prioritization, an I/O first has to go into a priority queue before being picked up by the responsible I/O device driver, so there's some idle time there as well. More bookkeeping challenges: the largest I/O bottleneck with a conventional HDD is the mechanical seek between two reads. Who pays for that? The read before or the read after? Commented May 3, 2019 at 14:48
  • The default scheduler on Fedora 29 is currently mq-deadline. It is "work-conserving": mq-deadline always tries to fill the device queue first, and my SATA HDD has a standard NCQ with room for 15 requests. This is in contrast to schedulers like bfq or the old CFQ, which may choose to deliberately idle the device for a bit, in an attempt to get the best performance from sequential IO. Commented May 3, 2019 at 14:55
  • The second read pays for the seek in between the two reads. Commented May 3, 2019 at 14:57
