8

We have several 90TB servers (Areca RAID-6 partitioned into 10 ext4 partitions).

The application reads from a specialized hardware device and writes the data to disk at a constant 8 to 10 MBytes/second. (The app runs separate reader and writer threads with a memory buffer between them.)

Once in a while, file operations slow to a crawl, e.g., fwrite() taking 10-30 seconds to write 200 bytes of data, ftell() taking a similar time to get the file position, etc.

Sometimes (but not always), when the system slows down, top shows all processes with 0 %CPU, similar to this serverfault posting.

The slowdown will last a few minutes and then the machine will be fine. Later it will happen again.

There is no "daily frequency" to the slowdown, so I've ruled out anything running via cron, anacron, etc.

However, during a slowdown the system load goes from 0.4 (normal) to 3-4. iotop often shows kworker/u16:0+flush-8:112 with a very high I/O %. top, vmstat, etc. show very high %wa, so there is an I/O bottleneck somewhere.
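Since the high-I/O process is a flush kworker, one thing worth watching during a stall is how much dirty page cache the flusher is trying to push out; a quick sketch using the Dirty/Writeback fields of /proc/meminfo:

```shell
# Snapshot of how much dirty page cache is waiting to be written back;
# during a stall, watch whether Dirty drains steadily or keeps growing
# (e.g. wrap this in `watch -n1` while the slowdown is happening):
grep -E '^(Dirty|Writeback):' /proc/meminfo
```

A large Dirty figure that drains very slowly while %wa is pegged points at writeback pressure rather than the application itself.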

I've changed the I/O scheduler to mq-deadline.

Also, the application runs fsync every 5 seconds on the file system containing the open file being written.

I've spent days searching for what might be causing this slowdown and which parameters could be adjusted, but no luck so far.

Any suggestions on how to tune Linux for this constant-write scenario?
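For reference, these are the writeback tunables I've seen mentioned most often for steady-stream writers; so far I've only inspected them, not changed them (reading them straight out of /proc/sys/vm):

```shell
# Current values of the writeback knobs that control when the flusher
# (kworker/flush) kicks in; commonly suggested candidates for tuning a
# constant-write workload (shown here, not changed):
for k in dirty_background_ratio dirty_ratio \
         dirty_expire_centisecs dirty_writeback_centisecs; do
    printf '%s = %s\n' "$k" "$(cat /proc/sys/vm/$k)"
done
```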

Centos 8

# uname -a
Linux 4.18.0-147.el8.x86_64 #1 SMP Wed Dec 4 21:51:45 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

IOTOP OUTPUT

Total DISK READ :       161.32 K/s | Total DISK WRITE :       0.00 B/s
Actual DISK READ:         0.00 B/s | Actual DISK WRITE:       0.00 B/s
   PID  PRIO  USER     DISK READ   DISK WRITE  SWAPIN      IO    COMMAND
 10666  rt/0  root     161.32 K/s    0.00 B/s  0.00 %   4.43 %  MyWritingApp

Total DISK READ :       239.58 K/s | Total DISK WRITE :      78.27 M/s
Actual DISK READ:       239.58 K/s | Actual DISK WRITE:       2.20 M/s
   PID  PRIO  USER     DISK READ   DISK WRITE  SWAPIN      IO    COMMAND
431412  be/4  root     135.01 K/s    0.00 B/s  0.00 %  44.80 %  [kworker/u16:8-flush-8:32]
 10666  rt/0  root     104.57 K/s   78.11 M/s  0.00 %   1.88 %  MyWritingApp
  8853  be/3  root       0.00 B/s    9.27 K/s  0.00 %   0.01 %  [jbd2/sdh1-8]
  1123  ?dif  mysql      0.00 B/s  149.57 K/s  0.00 %   0.00 %  mysqld --basedir=/usr

Total DISK READ :       388.68 K/s | Total DISK WRITE :      14.98 M/s
Actual DISK READ:       469.32 K/s | Actual DISK WRITE:    1899.42 K/s
   PID  PRIO  USER     DISK READ   DISK WRITE  SWAPIN      IO    COMMAND
338764  be/4  root     388.68 K/s    0.00 B/s  0.00 %  89.25 %  [kworker/u16:0+flush-8:112]
  1123  ?dif  mysql      0.00 B/s  955.83 K/s  0.00 %   0.00 %  mysqld --basedir=/usr
  2740  be/4  apache     0.00 B/s  1353.76 B/s 0.00 %   0.00 %  httpd -DFOREGROUND
 10664  be/4  root       0.00 B/s  1353.76 B/s 0.00 %   0.00 %  php /usr/local/dvstor/bin64/DvStorStartRecording.php mode=ASI tsNum=0
 10666  rt/0  root       0.00 B/s   14.04 M/s  0.00 %   0.00 %  MyWritingApp

Total DISK READ :       609.63 K/s | Total DISK WRITE :      15.87 K/s
Actual DISK READ:       609.63 K/s | Actual DISK WRITE:      31.74 K/s
   PID  PRIO  USER     DISK READ   DISK WRITE  SWAPIN      IO    COMMAND
338764  be/4  root     601.70 K/s    0.00 B/s  0.00 %  99.43 %  [kworker/u16:0+flush-8:112]
   711  be/4  root       7.93 K/s    0.00 B/s  0.00 %   3.00 %  systemd-journald
  9246  be/3  root       0.00 B/s    7.93 K/s  0.00 %   0.00 %  [jbd2/sdi1-8]
   933  be/3  root       0.00 B/s  1354.15 B/s 0.00 %   0.00 %  auditd
  1966  be/4  root       0.00 B/s    2.64 K/s  0.00 %   0.00 %  rsyslogd -n
  2740  be/4  apache     0.00 B/s  1354.15 B/s 0.00 %   0.00 %  httpd -DFOREGROUND

VMSTAT OUTPUT

procs -----------------------memory---------------------- ---swap-- -----io---- -system-- --------cpu--------
 r  b   swpd   free   buff    cache    si   so   bi    bo    in    cs us sy id wa st
 0  2 194048 334164 612708 12537168    0    1   14  1114    13   217  1  2 95  1  0
 0  2 194048 329368 614000 12540308    0    0  646     0  5136  6152  1  1 78 19  0
 0  2 194048 324640 615288 12543276    0    0  622    96  5249  6541  1  2 79 18  0
 0  2 194048 321584 616624 12546188    0    0  664   512  5660  6815  2  2 81 15  0
 0  2 194048 317788 617884 12549124    0    0  626    52  6079  7983  2  3 80 15  0

     15930768 K total memory
      2445452 K used memory
      2673404 K active memory
     11678768 K inactive memory
       318308 K free memory
       617888 K buffer memory
     12549120 K swap cache
      8212476 K total swap
       194048 K used swap
      8018428 K free swap

I/O Scheduler

# for d in {b..k}; do echo -n "sd${d} " ; cat /sys/block/sd${d}/queue/scheduler; done
sdb [mq-deadline] kyber bfq none
sdc [mq-deadline] kyber bfq none
sdd [mq-deadline] kyber bfq none
sde [mq-deadline] kyber bfq none
sdf [mq-deadline] kyber bfq none
sdg [mq-deadline] kyber bfq none
sdh [mq-deadline] kyber bfq none
sdi [mq-deadline] kyber bfq none
sdj [mq-deadline] kyber bfq none
sdk [mq-deadline] kyber bfq none
2
  • +1. Have the same issue but with a 1 TB Hard Drive formatted with NTFS. Commented Jul 20, 2021 at 10:22
  • Don't know much about NTFS but see the new answer below about file-level fragmentation Commented Jul 21, 2021 at 11:40

3 Answers

5

In the end I believe the problem was related to file-level fragmentation.

While fsck did not report a high percentage of fragmentation, I ran another tool (sorry, I forget which one, perhaps filefrag) which showed very high fragmentation in the large media files.

(The file system held many 1 GB video files.)
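For anyone wanting to check their own files: filefrag (from e2fsprogs) prints the extent count per file. A tiny demo on a scratch file below; the real check is pointing it at the large media files (note filefrag needs a filesystem with FIEMAP/FIBMAP support):

```shell
# filefrag reports how many extents a file occupies: a contiguous file
# shows "1 extent found", while a badly fragmented 1 GB media file can
# show thousands. Demo on a small scratch file:
f=$(mktemp)
dd if=/dev/zero of="$f" bs=1M count=4 status=none
filefrag "$f" || echo "filefrag needs FIEMAP/FIBMAP support on this filesystem"
rm -f "$f"
```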

The system-level problem was solved by:

  • backing up all files and reformatting as xfs (I suppose it could also have been ext4 again, but I wanted to try xfs)
  • changing the application to preallocate 1 GB of space using the fallocate() system call
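The preallocation step can also be tried from the shell with the fallocate(1) utility before handing a file to the writer (1 MiB here to keep the demo small; the app asks for 1 GiB):

```shell
# Reserve the file's full size up front so the filesystem can hand out
# large contiguous extents, instead of growing the file through
# thousands of tiny appends (shell equivalent of the fallocate() call
# the application now makes):
f=$(mktemp)
fallocate -l 1M "$f"
stat -c '%s' "$f"    # 1048576 -- the space is reserved immediately
rm -f "$f"
```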
3

After no responses from the community, here's what I think has solved the problem:

Disabling the journal on the filesystem seems to have eliminated the episodes of high I/O load and %wa on all partitions. (Yes, I know that introduces other risks, but having the machine freeze for I/O is never acceptable in our application.)

I disabled the journal using tune2fs (the filesystem must be unmounted first):

tune2fs -O ^has_journal <device> 

Note: although the filesystem will still be ext4 after removing the journal, parted and file will now incorrectly report it as an ext2 volume. blkid will correctly report the volume as ext4.

Vojtech Trefny suggests this is a bug in libparted in Inconsistent filesystem type reported after disabling journalling.

EDIT I

Disabling journalling did not solve the problem for all machines.

1

In my case, flush threads like kworker/u16:3-flush-8:0 were stuck at 100% CPU. Many apps were hanging or crashing, sudo iotop reported zero I/O activity, and sudo sync hung forever.

The issue was caused by an old SSD drive with many bad sectors. Apps were trying to write to disk, but the disk refused, so the write operations hung forever.

Workaround: delete some files to free up disk space so the disk can re-use good sectors. In my case, reducing disk usage from 70% to 40% fixed the issue.

Alternative: ignore the issue, reboot the machine, and risk data loss or corruption.
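When a failing drive is suspected, a couple of quick checks can confirm it (smartctl is from smartmontools and may not be installed; /dev/sda is a placeholder device, and attribute names vary by drive):

```shell
# Look for block-layer I/O errors in the kernel log, then (if
# smartmontools is present) the sector-reallocation counters that
# typically betray a dying drive:
dmesg 2>/dev/null | grep -iE 'i/o error|blk_update_request' \
    || echo "no I/O errors in kernel log"
if command -v smartctl >/dev/null 2>&1; then
    smartctl -A /dev/sda | grep -iE 'realloc|pending|uncorrect' || true
fi
```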
