
Hardware:

  • Samsung 980 PRO M.2 NVMe SSD (MZ-V8P2T0BW) (2TB)
  • Beelink GTR6, with the SSD in the NVMe slot

Since the hardware arrived, I've installed Ubuntu Server on it, as well as a bunch of services (mostly in Docker: databases and services like Kafka).

After 2-3 days of uptime (the record is almost a week, but usually it's 2-3 days), I typically start getting buffer I/O errors on the NVMe drive (which is also the boot drive): screen1

If I'm quick enough, I can still log in via SSH, but the system becomes increasingly unstable until commands start failing with an I/O error. When I did manage to log in, the system did seem to think there were no NVMe SSDs connected: screen2

Another instance of the buffer I/O error on the NVMe slot: screen3
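The sort of quick check I mean over SSH is roughly this (a sketch; it assumes nvme-cli is installed, and the exact commands behind screen2 may have differed):

  sudo dmesg --ctime | grep -iE 'nvme|i/o error' | tail -n 50   # first errors in the chain
  sudo nvme list                                                # does the controller still enumerate?
  lsblk                                                         # is the block device still visible?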

Because of this, and to check everything I could think of, I ran fsck on boot to see if there was anything obvious - this is quite common after the hard reset:

# cat /run/initramfs/fsck.log
Log of fsck -C -f -y -V -t ext4 /dev/mapper/ubuntu--vg-ubuntu--lv
Fri Dec 30 17:26:21 2022

fsck from util-linux 2.37.2
[/usr/sbin/fsck.ext4 (1) -- /dev/mapper/ubuntu--vg-ubuntu--lv] fsck.ext4 -f -y -C0 /dev/mapper/ubuntu--vg-ubuntu--lv
e2fsck 1.46.5 (30-Dec-2021)
/dev/mapper/ubuntu--vg-ubuntu--lv: recovering journal
Clearing orphaned inode 524449 (uid=1000, gid=1000, mode=0100664, size=6216)
Pass 1: Checking inodes, blocks, and sizes
Inode 6947190 extent tree (at level 1) could be shorter. Optimize? yes
Inode 6947197 extent tree (at level 1) could be shorter. Optimize? yes
Inode 6947204 extent tree (at level 1) could be shorter. Optimize? yes
Inode 6947212 extent tree (at level 1) could be shorter. Optimize? yes
Inode 6947408 extent tree (at level 1) could be shorter. Optimize? yes
Inode 6947414 extent tree (at level 1) could be shorter. Optimize? yes
Inode 6947829 extent tree (at level 1) could be shorter. Optimize? yes
Inode 6947835 extent tree (at level 1) could be shorter. Optimize? yes
Inode 6947841 extent tree (at level 1) could be shorter. Optimize? yes
Pass 1E: Optimizing extent trees
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (401572584, counted=405399533). Fix? yes
Free inodes count wrong (121360470, counted=121358242). Fix? yes

/dev/mapper/ubuntu--vg-ubuntu--lv: ***** FILE SYSTEM WAS MODIFIED *****
/dev/mapper/ubuntu--vg-ubuntu--lv: 538718/121896960 files (0.2% non-contiguous), 82178067/487577600 blocks
fsck exited with status code 1
Fri Dec 30 17:26:25 2022
----------------

Running nvme smart-log doesn't seem to show anything concerning, other than the number of unsafe shutdowns (which matches the number of times this has happened so far)...

# nvme smart-log /dev/nvme0
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning                          : 0
temperature                               : 32 C (305 Kelvin)
available_spare                           : 100%
available_spare_threshold                 : 10%
percentage_used                           : 0%
endurance group critical warning summary  : 0
data_units_read                           : 8,544,896
data_units_written                        : 5,175,904
host_read_commands                        : 39,050,379
host_write_commands                       : 191,366,905
controller_busy_time                      : 1,069
power_cycles                              : 21
power_on_hours                            : 142
unsafe_shutdowns                          : 12
media_errors                              : 0
num_err_log_entries                       : 0
Warning Temperature Time                  : 0
Critical Composite Temperature Time       : 0
Temperature Sensor 1                      : 32 C (305 Kelvin)
Temperature Sensor 2                      : 36 C (309 Kelvin)
Thermal Management T1 Trans Count         : 0
Thermal Management T2 Trans Count         : 0
Thermal Management T1 Total Time          : 0
Thermal Management T2 Total Time          : 0
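Since the failure only shows up after days of uptime, one thing worth doing is recording the SMART data periodically so there's a reading from just before each crash. A sketch using cron (the log path is arbitrary, and since /var/log sits on the failing drive, writing to a second disk or a remote host would be safer):

  # /etc/cron.d/nvme-smart-log - every 10 minutes, as root
  */10 * * * * root (date; /usr/sbin/nvme smart-log /dev/nvme0) >> /var/log/nvme-smart.log 2>&1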

I reached out to support, and their initial suggestion (along with a bunch of questions) was to ask whether I had tried reinstalling the OS. I've given this a go too, formatting the drive and reinstalling the OS (Ubuntu Server 22 LTS).

After that, the issue stayed away for 4 days before it finally showed itself as a kernel panic.

Any ideas how I can identify whether the problem is with the SSD itself or with the hardware the SSD is slotted into (the GTR6)? I have until the 31st to return the SSD, so I'd love to pin down the most likely cause of the issue sooner rather than later...

I'm even more concerned after seeing reports that others are having serious health issues with the Samsung 990 Pro: https://www.reddit.com/r/hardware/comments/10jkwwh/samsung_990_pro_ssd_with_rapid_health_drops/

Edit: although I realised those reported issues are with the 990 Pro, not the 980 Pro that I have!

Edit2: someone on Overclockers was kind enough to suggest HD Sentinel, which does show a health metric, and it seems OK:

# ./hdsentinel-019c-x64
Hard Disk Sentinel for LINUX console 0.19c.9986 (c) 2021 [email protected]
Start with -r [reportfile] to save data to report, -h for help

Examining hard disk configuration ...
HDD Device 0: /dev/nvme0
HDD Model ID : Samsung SSD 980 PRO 2TB
HDD Serial No: S69ENL0T905031A
HDD Revision : 5B2QGXA7
HDD Size     : 1907729 MB
Interface    : NVMe
Temperature  : 41 °C
Highest Temp.: 41 °C
Health       : 99 %
Performance  : 100 %
Power on time: 21 days, 12 hours
Est. lifetime: more than 1000 days
Total written: 8.30 TB

The status of the solid state disk is PERFECT. Problematic or weak sectors were not found.
The health is determined by SSD specific S.M.A.R.T. attribute(s): Available Spare (Percent), Percentage Used
No actions needed.

Lastly, none of the tools I tried (such as nvme smart-log) seem to report anything like a single health metric. How can I check this in Ubuntu?
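A hedged sketch of one way to get at this in Ubuntu: smartctl from the smartmontools package understands NVMe and reports the same attributes HD Sentinel says its figure is based on (Available Spare and Percentage Used):

  sudo apt install smartmontools
  sudo smartctl -H /dev/nvme0   # overall PASSED/FAILED verdict
  sudo smartctl -a /dev/nvme0   # full attribute list, including lines like:
                                #   Available Spare:  100%
                                #   Percentage Used:  0%
  # there's no single "health %" figure; 100% minus Percentage Used is a rough equivalent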

Thanks!

4 Comments:
  • A USB enclosure can probably help you confirm whether or not the drive itself is okay. However, it will not confirm whether or not it has something to do with the nvme driver in the kernel, since you'll be using the uas driver then. Commented Jan 26, 2023 at 13:04
  • "it did seem to think there's no connected NVME SSDs" - You probably have bad hardware. Somewhere. First step is to make sure everything is seated properly. And that means everything. First thing I'd do after making sure everything is seated properly is get a known-good NVMe SSD, put that into the system, and start testing it pretty hard. And if possible put the Samsung 980 into another system that's known-good and test it on that system. See which one fails. Intermittent hardware problems are always fun fun fun. Commented Jan 27, 2023 at 15:24
  • If simply re-seating the NVMe drive does not help, buy a cheap NVMe drive and test with it. If it has the same issues, it's probably an issue with the NVMe slot. You could also check your syslog (or enable network logging; see the sketch after these comments) to catch the first message in the whole chain of errors that follows. Commented Jan 28, 2023 at 11:51
  • Thanks for the comments. Beelink accepted a return for replacement; hopefully this will just be a one-off problem with the hardware I received from Beelink! Commented Jan 30, 2023 at 11:10
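Following up on the network-logging suggestion above: a minimal sketch of forwarding syslog to another machine so the first error in the chain survives the crash (assuming rsyslog, which Ubuntu Server ships by default; 192.168.1.10 is a placeholder for the receiving host):

  # on the failing machine - /etc/rsyslog.d/90-remote.conf
  *.*  @192.168.1.10:514          # single @ = UDP; use @@ for TCP
  # then: sudo systemctl restart rsyslog

  # on the receiving machine - enable the UDP input in /etc/rsyslog.conf:
  #   module(load="imudp")
  #   input(type="imudp" port="514")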

2 Answers


I'm having the same issue, with a disappearing device... Immediately after boot it's usually there, but something then gives the kernel (or driver) reason to think it has disappeared.

When I did a complete block check in Windows, the drive stayed up for the whole 14+ hours and showed 0% bad blocks... My drive is just a month old, so I expect the hardware to still be good; it must be a driver or motherboard/BIOS interaction issue...

Example output:

[  646.205010] nvme nvme1: I/O 526 QID 2 timeout, aborting
[  646.205039] nvme nvme1: I/O 213 QID 5 timeout, aborting
[  646.264489] nvme nvme1: Abort status: 0x0
[  646.351285] nvme nvme1: Abort status: 0x0
[  676.924830] nvme nvme1: I/O 526 QID 2 timeout, reset controller
[  697.972569] nvme nvme1: Device not ready; aborting reset, CSTS=0x1
[  697.993956] pcieport 10000:e0:1b.0: can't derive routing for PCI INT A
[  697.993965] nvme 10000:e2:00.0: PCI INT A: no GSI
[  709.369577] wlp45s0: AP e0:cc:7a:98:7d:d4 changed bandwidth, new config is 2432.000 MHz, width 2 (2442.000/0 MHz)
[  718.496375] nvme nvme1: Device not ready; aborting reset, CSTS=0x1
[  718.496381] nvme nvme1: Removing after probe failure status: -19
[  739.020199] nvme nvme1: Device not ready; aborting reset, CSTS=0x1
[  739.020477] nvme1n1: detected capacity change from 2000409264 to 0

Now I have tried this: echo 10000:e2:00.0 >/sys/bus/pci/drivers/nvme/bind

When I then run lspci, the "missing" device is enumerated correctly (10000:e2:00.0 Non-Volatile memory controller: ADATA Technology Co., Ltd. Device 5766 (rev 01)).

But it doesn't show up in lsblk and I don't know how to proceed from here...

The dmesg output after re-binding the drive:

[14893.259570] nvme nvme2: pci function 10000:e2:00.0
[14893.259678] pcieport 10000:e0:1b.0: can't derive routing for PCI INT A
[14893.259685] nvme 10000:e2:00.0: PCI INT A: no GSI
[14913.760764] nvme nvme2: Device not ready; aborting reset, CSTS=0x1
[14913.760771] nvme nvme2: Removing after probe failure status: -19
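For completeness, another approach sometimes suggested for a device stuck in this state (a sketch, untested here) is removing it from the PCI bus and rescanning, rather than re-binding the driver - though with the controller never reporting ready (CSTS=0x1), it would likely end the same way:

  # device address taken from the lspci output above
  echo 1 | sudo tee /sys/bus/pci/devices/10000:e2:00.0/remove
  echo 1 | sudo tee /sys/bus/pci/rescan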

Eventually I got a new module, put it in the same slot (replacing the old one), and it is all working normally.

Conclusion: [highly likely] a bad NVMe stick. This happens, and I bet it was the same on your setup.

3 Comments:
  • If you have a new question, please ask it by clicking the Ask Question button. Include a link to this question if it helps provide context. - From Review. Commented Mar 28, 2023 at 13:51
  • At the moment I'm suspecting a hardware issue with the server this device is in. I've gone through the process of returning it for a replacement, and as before, the issue came back after a week on the new device as well. Maybe I'll try a second SSD of the exact same specs, in case that's related... Commented Mar 28, 2023 at 13:53
  • As it’s currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center. Commented Mar 28, 2023 at 14:54

Here's an alternative theory about the root cause, which may or may not fit this specific use case:

Some NVMe drives are known to have buggy firmware that implements the APST power-saving mode incorrectly. APST is implemented by the NVMe device's firmware, and some devices are racy and hang on unfortunate timing while entering or exiting a power-saving state. Use the kernel flag nvme_core.default_ps_max_latency_us=0 as an easy test to see if this is related to your issue. Monitor the NVMe device temperatures with this change, because it may prevent all of the drive's internal power-saving modes and produce as much heat as running the device constantly at max load.
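A minimal sketch of trying this on Ubuntu (the current value can be checked at runtime; the GRUB edit is the usual way to make the flag persistent):

  # current setting; 0 disables APST, the kernel default is typically 100000 (µs)
  cat /sys/module/nvme_core/parameters/default_ps_max_latency_us

  # persist it: add the flag to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, e.g.
  #   GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nvme_core.default_ps_max_latency_us=0"
  sudo update-grub && sudo reboot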

Another option is an incorrectly implemented PCIe ASPM power-saving mode in your hardware. I've found the mode called L0s especially problematic, so if your BIOS has an option to disable just that mode, maybe try it. Otherwise you could add the kernel flags pcie_aspm=off pcie_port_pm=off to totally disable all PCIe power saving. Again, monitor the storage device temperature, because this may increase power use depending on the drive implementation. If I've understood correctly, PCIe ASPM requires both the PCIe-attached device and the motherboard to handle the power-saving transitions correctly, so it may be unstable only for particular motherboard + PCIe device combinations.
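To see what ASPM is currently doing, and to apply the flags (a sketch; output varies by hardware):

  # kernel ASPM policy: [default] performance powersave powersupersave
  cat /sys/module/pcie_aspm/parameters/policy

  # per-device link capabilities and current state; look for the "ASPM" entries
  sudo lspci -vv | grep -E 'ASPM|LnkCtl:'

  # disabling entirely: add pcie_aspm=off pcie_port_pm=off to the same
  # GRUB_CMDLINE_LINUX_DEFAULT line in /etc/default/grub, then:
  sudo update-grub && sudo reboot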

If I've understood correctly, neither of the above should cause SATA devices to misbehave, so if you have only SATA devices in your system, the problem is caused by something else.

2 Comments:
  • I actually ended up returning the NVMe drive, and the replacement has worked flawlessly ever since. Something was wrong with that particular drive - hardware or firmware, no idea... Commented Feb 14 at 13:10
  • One theory that explains some failures is power-rail ripple as a result of poor interaction between the NVMe drive and the motherboard slot it's attached to. One or both parts fail to meet the official specs, but the parts might still work correctly when paired with some other part that has more margin to the spec limits. Proving this with modern parts is next to impossible for consumers, though, because you might need an oscilloscope in the THz range to measure the problem accurately. Commented Feb 15 at 10:49
