Is my NVME drive dying?

My laptop is working just fine. It’s from 2018 and it has an NVME drive.

It has an EFI boot partition and another partition with LUKS, with LVM on top of that.

Since this week I’ve been seeing these log messages from time to time:


Mar 07 17:31:14 almendra kernel: pcieport 0000:00:1d.6: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Mar 07 17:31:14 almendra kernel: pcieport 0000:00:1d.6:   device [8086:34b6] error status/mask=00000001/00002000
Mar 07 17:31:14 almendra kernel: pcieport 0000:00:1d.6:    [ 0] RxErr                  (First)
Mar 07 17:31:14 almendra kernel: pcieport 0000:00:1d.6: AER:   Error of this Agent is reported first
Mar 07 17:31:14 almendra kernel: nvme 0000:02:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Mar 07 17:31:14 almendra kernel: nvme 0000:02:00.0:   device [8086:0975] error status/mask=00000001/00002000
Mar 07 17:31:14 almendra kernel: nvme 0000:02:00.0:    [ 0] RxErr                  (First)

The devices are:


$ lspci -vv | grep 1d.6
00:1d.6 PCI bridge: Intel Corporation Device 34b6 (rev 30) (prog-if 00 [Normal decode])

$ lspci -vv | grep 02:00.0
02:00.0 Non-Volatile memory controller: Intel Corporation Optane NVME SSD H10 with Solid State Storage [Teton Glacier] (prog-if 02 [NVM Express])

The laptop works as usual, but I have the impression that the NVME drive is warning me about something.

It happens from time to time:


$ journalctl --since yesterday | grep -c "nvme 0000:02:00.0: PCIe Bus Error: severity=Corrected, type=Physical"
9

Do you know what it means?

possiblylinux127 , 5 months ago

Chances are it is. Always keep good backups.

Honestly, it’s good practice to replace your drives every 5 years. That’s not always necessary, but it can save you some headaches.

ryannathans , 5 months ago

Look at the SMART errors.

rotopenguin , 5 months ago

Given that it’s just an interface error, you could try turning everything off, taking the drive out, and hitting its contacts with electronics contact cleaner (I guess CRC brand is as good as any). Work it a little bit, and let it dry before putting it all back together.

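Another possibility is that power management is being naughty. Fiddle with ASPM (Active State Power Management, the PCIe link power saving) or APST (Autonomous Power State Transition, the NVMe drive’s own power-state switching).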

Oh and do a btrfs/zfs scrub to check that your data is correct.
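
On the ASPM suggestion above, here is a minimal sketch of how one might check which ASPM policy is currently active before changing anything. It assumes the kernel exposes the policy in `/sys/module/pcie_aspm/parameters/policy`, with the active value shown in square brackets; the helper name `active_aspm_policy` is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper (not part of any tool): print the active ASPM policy
# from a line such as "default performance [powersave] powersupersave"
# read on stdin. The active policy is the one in square brackets.
active_aspm_policy() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Usage on a live system (assumes the sysfs file exists on your kernel):
#   active_aspm_policy < /sys/module/pcie_aspm/parameters/policy
```

To test whether power management is involved at all, one common experiment is to boot once with the kernel parameter `pcie_aspm=off` (disables ASPM) or `nvme_core.default_ps_max_latency_us=0` (disables APST) and watch whether the corrected errors stop. Whether either helps on this particular laptop is an open question.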

possiblylinux127 , 5 months ago

Doing a scrub on bad hardware will make corruption worse in many cases. When you have faulty hardware, freeze everything.

This person has had the same device for 6 years. If the drive was used heavily, it probably just failed due to age.

rotopenguin , 5 months ago

Yeah, you’re probably right. I was thinking of a “not a RAID, no redundant copies available” scrub, where the main output would be a sanity check of the data checksums.

mvirts , 5 months ago

Don’t forget to blow on it.

vsis OP , 5 months ago

I used a hand dust blower intended for photography gear. I opened the laptop, blew out the dust, disconnected the SSD, and blew out the socket and its surroundings.

Now I will monitor the logs and see if it helps.

Thanks.

MangoPenguin , 5 months ago

Regardless of what it is, make sure your backups are working, running often (daily or better is good), and test your restore process fully.

waigl , 5 months ago

Smartctl works on nvme drives. Use it.

vsis OP , 5 months ago

I ran a short and a long self-test. Both look good:


$ sudo smartctl -l selftest /dev/nvme0
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.7.6-arch1-2] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
Self-test Log (NVMe Log 0x06)
Self-test status: No self-test in progress
Num  Test_Description  Status                       Power_on_Hours  Failing_LBA  NSID Seg SCT Code
 0   Extended          Completed without error                6334            -     -   -   -    -
 1   Short             Completed without error                6334            -     -   -   -    -

bizdelnick , 5 months ago

Also check the output of sudo smartctl -a /dev/nvme0
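
That output is long, so here is a rough sketch of a filter that keeps only the health fields most relevant to a "is it dying?" question. The helper name `nvme_health_summary` and the choice of fields are assumptions for this example; the field labels are the ones smartctl 7.4 prints for NVMe drives, as seen in this thread.

```shell
#!/bin/sh
# Hypothetical helper: keep only a few health fields from the output of
# `smartctl -a` read on stdin. The trailing ":" in the pattern keeps
# "Available Spare Threshold" out while still matching "Available Spare".
nvme_health_summary() {
    grep -E '^(Critical Warning|Available Spare|Percentage Used|Media and Data Integrity Errors|Error Information Log Entries):'
}

# Usage on a live system:
#   sudo smartctl -a /dev/nvme0 | nvme_health_summary
```

A non-zero Critical Warning or a growing Media and Data Integrity Errors count would point at the drive itself rather than at the PCIe link.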

vsis OP , 5 months ago

Here is the output:


$ sudo smartctl -a /dev/nvme0
[sudo] password for ****:
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.7.6-arch1-2] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       INTEL HBRPEKNX0202A
Serial Number:                      BTTE95101RQM512B-1
Firmware Version:                   G002
PCI Vendor/Subsystem ID:            0x8086
IEEE OUI Identifier:                0x5cd2e4
Controller ID:                      1
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          512,110,190,592 [512 GB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Fri Mar  8 12:09:53 2024 CET
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0016):   Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x0f):         S/H_per_NS Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size:         32 Pages
Warning  Comp. Temp. Threshold:     77 Celsius
Critical Comp. Temp. Threshold:     80 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     3.50W       -        -    0  0  0  0        0       0
 1 +     2.70W       -        -    1  1  1  1        0       0
 2 +     2.00W       -        -    2  2  2  2        0       0
 3 -   0.0250W       -        -    3  3  3  3     2000    5000
 4 -   0.0040W       -        -    4  4  4  4     5000    9000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        30 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    32%
Data Units Read:                    6,877,173 [3.52 TB]
Data Units Written:                 9,397,485 [4.81 TB]
Host Read Commands:                 54,359,124
Host Write Commands:                239,213,047
Controller Busy Time:               2,412
Power Cycles:                       536
Power On Hours:                     6,350
Unsafe Shutdowns:                   62
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged

Self-test Log (NVMe Log 0x06)
Self-test status: No self-test in progress
Num  Test_Description  Status                       Power_on_Hours  Failing_LBA  NSID Seg SCT Code
 0   Extended          Completed without error                6334            -     -   -   -    -
 1   Short             Completed without error                6334            -     -   -   -    -

Limonene , 5 months ago

The good news is: the error shown there was a PCIe bus error, which means the error is somewhere between the NVME controller and your processor’s PCIe interface. Also good news: the errors you experienced were fully corrected, so you probably lost no data.

So the flash memory in the drive isn’t failing. That’s good, because if the flash memory starts failing, it’s probably only going to fail more. In this case, the cause of your errors may be fixable: by replacing the motherboard, by replacing the processor, by reseating the NVME drive in its slot, by verifying that your power supply is reliable…

However, if your NVME controller actually does fail, it will be little consolation to tell you that your data is all still there on the flash chips, but with no way to get at it. So now might be a good time to make a backup. Any time is a good time to make a backup, but now is an especially good time.

If you keep getting these errors at the same rate, then you probably don’t need to do anything, since the errors are being corrected. If you’re worried, you could use BTRFS, which checksums data by default.
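
To judge whether the rate is staying the same, a per-day count is easier to read than a single total. The following is only a sketch: the helper name `corrected_errors_per_day` is made up, and it assumes journalctl’s default short output format, where each line starts with the month and day as in the log excerpt at the top of the thread.

```shell
#!/bin/sh
# Hypothetical helper: count corrected PCIe bus errors reported for an
# nvme device, grouped by day. Reads journal lines on stdin and assumes
# the first two fields are month and day ("Mar 07 ...").
corrected_errors_per_day() {
    awk '/nvme [0-9a-f:.]+: PCIe Bus Error: severity=Corrected/ {
             count[$1 " " $2]++
         }
         END { for (day in count) print day, count[day] }' | sort
}

# Usage on a live system:
#   journalctl --since "7 days ago" | corrected_errors_per_day
```

Each output line is a date followed by the number of corrected errors logged that day. The sort is plain text order, so a range spanning a month boundary will not be listed chronologically.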

vsis OP , 5 months ago

[…] by replacing the motherboard, by replacing the processor, by reseating the NVME drive in its slot, by verifying that your power supply is reliable…

I will start with the cheapest option 😅

I assume the power supply is reliable. Having a battery should make it more stable I guess.

h3ndrik , 5 months ago (edited 5 months ago)

And maybe clean the inside of your laptop; that’s probably the first thing that could solve the issue. See if all cables are still locked in their connectors. Maybe take out the SSD, clean the contacts, and use compressed air to clean the socket.

But be careful: you want to do it right or you might cause damage. No dampness or water; it has to be either isopropyl alcohol or dry. Don’t use a rag that introduces static electricity, and no workshop air compressor. Maybe something like a paintbrush is better suited. And don’t just shove the vacuum in. I’ve done that, and it can dislodge small components or keycaps and suck them in, and it’s a major annoyance to get them out of the vacuum cleaner bag 😆 Just be a bit careful.

I have had loose connectors or components cause random errors before, especially in equipment that is moved around or gets dropped occasionally. After 5 years, you might also find some dust inside. At least it used to be that way; it seems to be less of a problem with modern laptops, and more and more stuff gets soldered anyway.

And don’t do too much if you’re not comfortable with that. IMHO the SSD should be a safe thing to touch for most people. But it’s really easy to break or bend some tiny contacts from other components or ribbon cables. And there are consumer devices that aren’t really meant to be serviced. I wouldn’t disassemble such a model without prior experience. If it’s still working you might also leave it as is. Do backups. Storage devices often fail even without prior warning.

vsis OP , 5 months ago

OK. I’ll use a dust blower for photography gear. Thanks. Let’s see if it works.

possiblylinux127 , 5 months ago

Just don’t use a powerful one and keep the device powered off while you clean it.

vsis OP , 5 months ago (edited 5 months ago)

I opened it. All the cables looked good. I used a hand blower to clean out the dust, then took out the SSD and blew out the socket and everything around it.

Now I’m going to monitor whether it keeps happening.


$ journalctl --since yesterday | grep -c "nvme 0000:02:00.0: PCIe Bus Error: severity=Corrected, type=Physical"
16

MiltownClowns , 5 months ago

I’m not knowledgeable enough to tell you whether the drive is failing or not, but I just want to double-check that you have rolling backups of this drive right now. I’m just an idiot, but to me that drive looks unreliable.
