[Closed] Help/Advices about debugging zfs pool issues

Hi all,

UPDATE: I closed the post (the timebox I gave myself to understand the issue is now over). Thank you all for the help ^^

DISCLAIMER: The objective of this post is to understand how people would debug issues like these when real data is involved and get to the bottom of the problem. The objective is NOT to “restore service” but to understand what failed. The tone of the post is deliberately not serious, to keep it light.

I am playing a little with TrueNAS SCALE and ZFS. I was trying to use a second NVMe disk, attached via USB, as the target of a once-a-day replication of the main pool. However, this secondary pool keeps getting SUSPENDED for “too many errors”. The pool is not directly written to or read by users/apps; it is just there to be “replicated on” once a day.

Now, please, I know that using disks via USB is not advised. Also I am not interested in recovering the data, since there is nothing real on it. What I am doing is testing to see if the system is brittle, and if it is, how to debug if there is a real issue.

Now to the point. The pool is SUSPENDED. Good. Why? I mean, the real reason why. To see if the system can be used in real life it needs to be debuggable.

Let’s start. The pool is SUSPENDED:


      pool: tank-02
     state: SUSPENDED
    status: One or more devices are faulted in response to IO failures.
    action: Make sure the affected devices are connected, then run 'zpool clear'.
       see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-JQ
    config:

        NAME                   STATE     READ WRITE CKSUM
        tank-02                UNAVAIL      0     0     0  insufficient replicas
          xxx-xxx-xxx-xxx-xxx  FAULTED      3     0     0  too many errors

    errors: 4 data errors, use '-v' for a list

To which you may ask: why? Too many errors (the -v says nothing more). Well that doesn’t help, does it. When you run zpool clear:


    # zpool clear tank-02
    cannot clear errors for tank-02: I/O error

Incredibly useful as you can see. dmesg to the rescue?


    WARNING: Pool 'tank-02' has encountered an uncorrectable I/O failure and has been suspended.

Thanks? I guess. I know it is trying to safeguard data but again… why?

Before you ask:

SMART checks are good
Yes, I restarted the device. As soon as you try to use/mount/import you get to the same issues.
Nothing else peculiar in dmesg. I mean, there was a usb 2-4: USB disconnect, device number 12, whatever the reason. Beats me why TrueNAS SCALE decided that setting /sys/module/usbcore/parameters/autosuspend to 2 is a good idea, but again, that is not the point. I need ZFS to tell me what the issue is from its point of view.
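
For anyone who wants to check the autosuspend angle instead of guessing, here is a minimal Python sketch that reads the relevant sysfs files. The port name 2-4 comes from the dmesg line mentioned above. The per-device power/control and power/autosuspend_delay_ms files are standard Linux USB sysfs attributes, but whether they are present for this particular enclosure on this TrueNAS build is an assumption, and the function names are made up for illustration.

```python
from pathlib import Path


def read_sysfs_value(path):
    """Return the stripped contents of a sysfs file, or None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def usb_power_settings(port="2-4", sysfs_root="/sys"):
    """Collect the autosuspend settings that apply to one USB port.

    usbcore_default is the module-wide delay in seconds (negative disables it);
    device_control is "auto" when runtime suspend is allowed, "on" when it is not.
    """
    root = Path(sysfs_root)
    device_power = root / "bus/usb/devices" / port / "power"
    return {
        "usbcore_default": read_sysfs_value(
            root / "module/usbcore/parameters/autosuspend"),
        "device_control": read_sysfs_value(device_power / "control"),
        "device_delay_ms": read_sysfs_value(device_power / "autosuspend_delay_ms"),
    }


if __name__ == "__main__":
    for key, value in usb_power_settings().items():
        print(f"{key}: {value}")
```

A value of None only means the file could not be read (wrong port name, device unplugged, or a non-Linux system); it says nothing about the cause of the disconnects.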

I have read a lot online. Maybe it is the temperature (USB enclosure heating up), maybe it is the cable, power, “it is the USB controller”, or the chipset doing the USB -> NVMe conversion… However, they are not saying what to check. People are guessing. I have seen more tech behind reading tea leaves.

My question for you all is this: ZFS SUSPENDED one of my pools. It (seems to me) is refusing to fix it. Refusing to do anything with it and to tell me why. So, in a real world case, how to debug it? If I have to trust my data to it, I don’t want the only option to be “use many disks and just replace one and the cable when ZFS poo-poo”.

How to know the cause?

Thank you for the help.

PS: I am sure I am missing some very basic ZFS knowledge on the topic, so please let me know what else I can do to make ZFS talk to me.

Decronym Bot , 17 days ago (edited 17 days ago)

Acronyms, initialisms, abbreviations, contractions, and other phrases which expand to something larger, that I’ve seen in this thread:

Fewer Letters	More Letters
NAS	Network-Attached Storage
PCIe	Peripheral Component Interconnect Express
SATA	Serial AT Attachment interface for mass storage
ZFS	Solaris/Linux filesystem focusing on data integrity

4 acronyms in this thread; the most compressed thread commented on today has 7 acronyms.

[Thread #791 for this sub, first seen 8th Jun 2024, 16:55] [FAQ] [Full list] [Contact] [Source code]

pyrosis , 17 days ago

This takes a degree of understanding of what you are doing and why it fails.

I’ve done some research on this myself and the answer is the USB controller. Specifically the way the USB controller “shares” bandwidth. It is not the way a sata controller or a pci lane deals with this.

ZFS expects direct control of the disk to operate correctly and anything that gets in between the file system and the disk is a problem.

In the case of USB, let’s say you have two USB-to-NVMe adapters plugged into the same system in a basic ZFS mirror. ZFS will expect to mirror operations between these devices but will be interrupted by the USB controller constantly sharing bandwidth between these two devices.

A better but still bad solution would be something like a USB to SATA enclosure. In this situation if you installed a couple disks in a mirror on the enclosure… They would be using a single USB port and the controller would at least keep the data on one lane instead of constantly switching.

Regardless if you want to dive deeper you will need to do reading on USB controllers and bandwidth sharing.

If you want a stable system give zfs direct access to your disks and accept it will damage zfs operations over time if you do not.

justpassingby OP , 17 days ago

Hi,

I’ve done some research on this myself and the answer is the USB controller. Specifically the way the USB controller “shares” bandwidth. It is not the way a sata controller or a pci lane deals with this. ZFS expects direct control of the disk to operate correctly and anything that gets in between the file system and the disk is a problem.

Thanks for sharing. I agree with you 100% and I think everybody commenting here does. The whole point of the thread, however, was to understand if/how you can identify the location of the problem without guessing. The reality is I came to the conclusion that people… don’t. Like you said, people know ZFS is fussy about how it talks to the disks, and at the slightest issue it throws a tantrum. So people just switch things until they work (or buy expensive motherboards with many ports). I don’t like the idea of not knowing “why”, so I will just add to my notes that for my specific use case I cannot trust ZFS + OS (TrueNAS SCALE) to use the USB disk for backups via ZFS send/receive.

If you want a stable system give zfs direct access to your disks and accept it will damage zfs operations over time if you do not.

I would like to add that I am not trying to mirror my main disk with a USB one. I just wanted to copy the ZFS snapshots to the USB drive once a day at midnight. ZFS is just (don’t throw stones at me for this, it is just my opinion) too brittle to use this way either. I mean, when I try to clean/recover the pool it just refuses (and there is no one writing to it).

A better but still bad solution would be something like a USB to SATA enclosure. In this situation if you installed a couple disks in a mirror on the enclosure… They would be using a single USB port and the controller would at least keep the data on one lane instead of constantly switching.

In my case there was no switching, however. It was a single NVMe drive in an enclosure on a single USB line. It was a separate stripe, there just to receive data once a day.

Regardless if you want to dive deeper you will need to do reading on USB controllers and bandwidth sharing.

Not without good logs or debugging tools.

I decided I cannot trust it, so unfortunately I will take the USB enclosure with the NVMe, format it with ext4 and use Kopia to back up the datasets there once a day. It is not what I wanted but it is the best I can get for now.

About better solutions for my play-NAS in general, I am constrained by the ports I have. I (again personal choice - I understand people disagree with this) don’t want to go SATA. Unfortunately, since I could not find any PCIe switch with ASM2812I (www.asmedia.com.tw/product/…/7c5YQ79xz8urEGr1) I am unable to get more from my M.2 NVMe PCIe 3x4 (speed loss is not an issue for me, my main bottleneck is the network). It is interesting how you can find many more interesting attempts at it in the Pi ecosystem but not for mini PCs.

pyrosis , 17 days ago

Not without good logs or debugging tools.

You need to know what to observe. You are not going to get the information you are looking for directly from zfs or even system logs.

What I suggest stands. You have to understand the behavior of the USB controller. That information is acquired from researching USB itself.

Now if you intend to utilize something like a USB enclosure you indeed would be better off with something like ext4. However, keep in mind that this effect is not directly a file system issue. It’s an issue with how USB controllers interact with file systems.

That has been my experience from researching this matter. ZFS is simply more sensitive.

In my experience, even on motherboards that have port limitations it’s possible to take advantage of PCIe lanes and install an HBA with an onboard SATA controller. They also make PCIe devices that will accept NVMe drives.

Good luck with your experimentation and research.

corsicanguppy , 19 days ago

Help/Advices

Sure! ‘advice’ isn’t pluralized with an S. It’s like ‘traffic’; and you don’t say ‘traffics’ as a noun.

Happy to help!

manos_de_papel , 20 days ago

deleted_by_author

NeoNachtwaechter , 20 days ago

This, so much.

ZFS itself sticks to the error stubbornly but does not have any more info. SMART reports good drives.

This means: look elsewhere.

justpassingby OP , 19 days ago

Thanks, I understand the point of view. So maybe let me rephrase it. ZFS is not giving me more info than what I posted above (maybe this is all it sees, like you said). Do you know of any other way to make ZFS more verbose on the issue or to get more info out of it? If not, that is OK, but I have a second question: where would you look to find the culprit among “bad USB controller, firmware, cable, or driver” without swapping them out one by one? Thank you for your advice.
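
One thing that might be worth trying for the "more verbose" part: OpenZFS keeps a kernel event log that can be dumped with zpool events, and the -v flag prints the full payload of each event. Below is a minimal Python sketch that wraps that command. The helper names are hypothetical, and it is an assumption that the verbose payload contains anything beyond what zpool status already says for this pool.

```python
import shutil
import subprocess


def zpool_events_command(pool=None, verbose=True):
    """Build the argument list for `zpool events`, optionally limited to one pool."""
    cmd = ["zpool", "events"]
    if verbose:
        cmd.append("-v")
    if pool:
        cmd.append(pool)
    return cmd


def dump_zpool_events(pool=None):
    """Return the event log as text, or None when zpool is missing or the call fails."""
    if shutil.which("zpool") is None:
        return None
    result = subprocess.run(zpool_events_command(pool),
                            capture_output=True, text=True, check=False)
    return result.stdout if result.returncode == 0 else None


if __name__ == "__main__":
    output = dump_zpool_events("tank-02")
    print(output if output is not None else "zpool events not available")
```

The event log lives in kernel memory, so it is best captured before a reboot.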

manos_de_papel , 19 days ago

deleted_by_author

justpassingby OP , 18 days ago

Hi! Thanks for the pointers. Unfortunately dmesg and system logs were the first places I looked, but I found nothing at the time. I tried again now to give you the output after a zpool clear; you can obviously ignore the failed email attempt. journalctl:


    Jun 07 08:06:24 truenas kernel: WARNING: Pool 'tank-02' has encountered an uncorrectable I/O failure and has been suspended.
    Jun 07 08:06:24 truenas zed[799040]: eid=309 class=statechange pool='tank-02' vdev=xxx-xxx-xxx-xxx-xxx vdev_state=ONLINE
    Jun 07 08:06:24 truenas zed[799049]: eid=310 class=statechange pool='tank-02' vdev=xxx-xxx-xxx-xxx-xxx vdev_state=FAULTED
    Jun 07 08:06:24 truenas zed[799057]: eid=313 class=data pool='tank-02' priority=3 err=28 flags=0x20004000 bookmark=0:0:0:1
    Jun 07 08:06:24 truenas zed[799058]: eid=311 class=vdev_clear pool='tank-02' vdev=xxx-xxx-xxx-xxx-xxx vdev_state=FAULTED
    Jun 07 08:06:24 truenas zed[799067]: eid=312 class=data pool='tank-02' priority=3 err=28 flags=0x20004000 bookmark=0:62:0:0
    Jun 07 08:06:24 truenas zed[799081]: eid=316 class=io_failure pool='tank-02'
    Jun 07 08:06:24 truenas zed[799082]: eid=315 class=data pool='tank-02' priority=3 err=28 flags=0x20004000 bookmark=0:0:-1:0
    Jun 07 08:06:24 truenas zed[799090]: eid=314 class=data pool='tank-02' priority=3 err=28 flags=0x20004000 bookmark=0:0:1:0
    Jun 07 08:06:24 truenas find_alias_for_smtplib.py[799114]: sending mail to
                                                               To: root
                                                               Subject: ZFS device fault for pool tank-02 on truenas
                                                               MIME-Version: 1.0
                                                               Content-Type: text/plain; charset="ANSI_X3.4-1968"
                                                               Content-
    Jun 07 08:06:24 truenas find_alias_for_smtplib.py[799114]: No aliases found to send email to root
    Jun 07 08:06:24 truenas zed[799144]: error: statechange-notify.sh: eid=310: mail exit=1
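
The zed lines in this journal excerpt are not written in event order (eid 313 shows up before 311, for example), which makes the sequence hard to read. A minimal Python sketch that pulls the key=value fields out of such lines and sorts them by eid could look like this. The sample is a subset of the lines above, the function name is made up, and no attempt is made to interpret fields such as err or flags.

```python
import re
import shlex

# Only lines where zed logs an event payload, i.e. the message starts with "eid=".
ZED_LINE = re.compile(r"zed\[\d+\]: (eid=.*)$")

SAMPLE = """\
Jun 07 08:06:24 truenas zed[799040]: eid=309 class=statechange pool='tank-02' vdev=xxx-xxx-xxx-xxx-xxx vdev_state=ONLINE
Jun 07 08:06:24 truenas zed[799049]: eid=310 class=statechange pool='tank-02' vdev=xxx-xxx-xxx-xxx-xxx vdev_state=FAULTED
Jun 07 08:06:24 truenas zed[799057]: eid=313 class=data pool='tank-02' priority=3 err=28 flags=0x20004000 bookmark=0:0:0:1
Jun 07 08:06:24 truenas zed[799058]: eid=311 class=vdev_clear pool='tank-02' vdev=xxx-xxx-xxx-xxx-xxx vdev_state=FAULTED
Jun 07 08:06:24 truenas zed[799081]: eid=316 class=io_failure pool='tank-02'
Jun 07 08:06:24 truenas zed[799144]: error: statechange-notify.sh: eid=310: mail exit=1
"""


def parse_zed_events(journal_text):
    """Extract zed key=value events from journal text, ordered by event id."""
    events = []
    for line in journal_text.splitlines():
        match = ZED_LINE.search(line)
        if not match:
            continue  # kernel warnings, mail script output, zed error lines
        tokens = shlex.split(match.group(1))  # shlex strips the quotes around pool names
        fields = dict(token.split("=", 1) for token in tokens if "=" in token)
        fields["eid"] = int(fields["eid"])
        events.append(fields)
    return sorted(events, key=lambda event: event["eid"])


if __name__ == "__main__":
    for event in parse_zed_events(SAMPLE):
        print(event["eid"], event["class"], event.get("vdev_state", ""))
```

Sorted this way, the sample reads as: vdev ONLINE, vdev FAULTED, vdev_clear, a data error, then the pool-wide io_failure.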

dmesg says even less.

I also tried rebooting the machine with the drive detached and then attaching it at runtime while tailing dmesg and journalctl. They are pretty verbose, so I will only add the interesting parts here (I didn’t notice anything new, however):


    [...]
    [  221.952569] usb 2-4: Enable of device-initiated U1 failed.
    [  221.954164] usb 2-4: Enable of device-initiated U2 failed.
    [  221.965756] usbcore: registered new interface driver usb-storage
    [  221.983528] usb 2-4: Enable of device-initiated U1 failed.
    [  221.983997] usb 2-4: Enable of device-initiated U2 failed.
    [  221.987603] scsi host2: uas
    [  221.987831] usbcore: registered new interface driver uas
    [...]
    [  222.040564] sd 2:0:0:0: Attached scsi generic sg1 type 0
    [  222.049860] sd 2:0:0:0: [sdb] 1953525168 512-byte logical blocks: (1.00 TB/932 GiB)
    [  222.051867] sd 2:0:0:0: [sdb] Write Protect is off
    [  222.051879] sd 2:0:0:0: [sdb] Mode Sense: 37 00 00 08
    [  222.056719] sd 2:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
    [  222.058407] sd 2:0:0:0: [sdb] Preferred minimum I/O size 512 bytes
    [  222.058413] sd 2:0:0:0: [sdb] Optimal transfer size 33553920 bytes
    [  222.252607]  sdb: sdb1
    [  222.253015] sd 2:0:0:0: [sdb] Attached SCSI disk
    [  234.935926] usb 2-4: USB disconnect, device number 2
    [  234.983962] sd 2:0:0:0: [sdb] Synchronizing SCSI cache
    [  235.227936] sd 2:0:0:0: [sdb] Synchronize Cache(10) failed: Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
    [...]
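
One detail in this dmesg excerpt is measurable: the disk is attached at 222.253015 and the bus reports a disconnect at 234.935926. A minimal Python sketch for pulling that interval out of dmesg-style text, so that it can be compared across cables, ports, or autosuspend settings, might look like the following. The function names are hypothetical and the sample holds only three of the lines above.

```python
import re

# "[  234.935926] message" -> seconds since boot, message
DMESG_LINE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")

SAMPLE = """\
[...]
[  222.253015] sd 2:0:0:0: [sdb] Attached SCSI disk
[  234.935926] usb 2-4: USB disconnect, device number 2
[  235.227936] sd 2:0:0:0: [sdb] Synchronize Cache(10) failed: Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
"""


def parse_dmesg(text):
    """Return (seconds_since_boot, message) pairs, skipping lines without a timestamp."""
    entries = []
    for line in text.splitlines():
        match = DMESG_LINE.match(line.strip())
        if match:
            entries.append((float(match.group(1)), match.group(2)))
    return entries


def seconds_between(entries, first_marker, second_marker):
    """Time from the first line containing first_marker to the next line containing second_marker."""
    start = next((t for t, msg in entries if first_marker in msg), None)
    if start is None:
        return None
    end = next((t for t, msg in entries if t >= start and second_marker in msg), None)
    return None if end is None else end - start


if __name__ == "__main__":
    gap = seconds_between(parse_dmesg(SAMPLE), "Attached SCSI disk", "USB disconnect")
    print(f"disconnect {gap:.3f} s after attach")
```

For the sample this comes out at about 12.68 seconds. The number by itself does not identify a cause; it is only something concrete to compare between test runs.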

Thanks for the advice, it was worth another try. Anything more that comes to mind?

manos_de_papel , 18 days ago

deleted_by_author

justpassingby OP , 17 days ago

Hi.

There is one USB drive, in an NVMe enclosure without its own power supply. I know the brand and I can find the chipset; however, what I need is to understand the issue from the logs.

The error usb 2-4: Enable of device-initiated U1 failed. seems common for USB devices not working.

What does it point to and what to look for to understand it?

Thanks.

PS: Just out of curiosity I did swap the enclosure and the cable days ago but had the same issue, so the error message is not specific to them. Also, I was using this enclosure as the main disk for one of my Pis with no issue, so power via USB or the cable should not be the problem. Not that I want to use that as a metric; I need data/logs from the OS.

NeoNachtwaechter , 18 days ago

Do you know of any other way to make ZFS more verbose on the issue

ZFS is the wrong place to look.

Analogy:

Imagine there is an evil teacher in grammar school. Your kids are telling you, but they are unable to explain further what exactly is wrong with the teacher.

Then you don’t wait until your kids grow up and understand it all and can explain it all to you; you go directly to the school to find out what it is that the teacher is doing.

Paragone , 20 days ago

My experience is that USB storage sometimes breaks-connection for no discernable reason.

That if one REALLY wants to do USB storage, then put it inside the housing, and don’t use one of the external-connectors, use something you can permanently-fix, so nothing can even sneeze in its direction.

This mayn’t help you with your puzzle, but it’s bedrock and unchangeable, in my experience.

USB-storage is an unreliable joke.

ANY revision of it, that I’ve tried.

hth…

justpassingby OP , 19 days ago

Thanks. I am ok with accepting the fact USB storage with ZFS is unreliable. I am ok with not using it in real-world scenarios. My point stands, however: I want to understand what broke so I know what to look for and, should I be crazy enough to try something similar again in some use cases, know what to alert on. Call me curious. Everybody tells me it breaks, nobody tells me “look, it breaks here, and this is how you can see it”. I will try for another day or two and then will write it down in my notes as “unusable due to bad logging/debugging options”, not just because “it is USB”, if that makes sense.

farcaller , 20 days ago

Is there anything interesting at all reported in /proc/spl/kstat/zfs/dbgmsg?

justpassingby OP , 19 days ago

Thank you! A new path to check :) I hadn’t found this in my search until now, so I added it to my documentation.

Unfortunately it doesn’t tell me much, but I am really happy there is some more new info here. I can see some FAILED steps but it may be just connected to the fact it is a striped volume?


    1717612906   spa.c:6623:spa_import(): spa_import: importing tank-02
    1717612906   spa_misc.c:418:spa_load_note(): spa_load(tank-02, config trusted): LOADING
    1717612906   vdev.c:161:vdev_dbgmsg(): disk vdev '/dev/disk/by-partuuid/xxx-xxx-xxx-xxx-xxxx': best uberblock found for spa tank-02. txg 6462
    1717612906   spa_misc.c:418:spa_load_note(): spa_load(tank-02, config untrusted): using uberblock with txg=6462
    1717612906   spa.c:8925:spa_async_request(): spa=tank-02 async request task=4
    1717612906   spa_misc.c:404:spa_load_failed(): spa_load(tank-02, config trusted): FAILED: cannot open vdev tree after invalidating some vdevs
    1717612906   spa_misc.c:418:spa_load_note(): spa_load(tank-02, config trusted): UNLOADING
    1717612906   spa_misc.c:418:spa_load_note(): spa_load(tank-02, config trusted): spa_load_retry: rewind, max txg: 6461
    1717612906   spa_misc.c:418:spa_load_note(): spa_load(tank-02, config trusted): LOADING
    1717612907   spa_misc.c:418:spa_load_note(): spa_load(tank-02, config untrusted): vdev tree has 1 missing top-level vdevs.
    1717612907   spa_misc.c:418:spa_load_note(): spa_load(tank-02, config untrusted): current settings allow for maximum 0 missing top-level vdevs at this stage.
    1717612907   spa_misc.c:404:spa_load_failed(): spa_load(tank-02, config untrusted): FAILED: unable to open vdev tree [error=2]
    1717612907   spa_misc.c:418:spa_load_note(): spa_load(tank-02, config untrusted): UNLOADING

It goes on and after a while:


    1717614235   spa_misc.c:2311:spa_import_progress_set_notes_impl(): 'tank-02' Finished importing
    1717614235   spa.c:8925:spa_async_request(): spa=tank-02 async request task=2048
    1717614235   spa_misc.c:418:spa_load_note(): spa_load(tank-02, config trusted): LOADED
    1717614235   metaslab.c:2445:metaslab_load_impl(): metaslab_load: txg 6464, spa tank-02, vdev_id 0, ms_id 95, smp_length 0, unflushed_allocs 0, unflushed_frees 0, freed 0, defer 0 + 0, unloaded time 1362018 ms, loading_time 0 ms, ms_max_size 8589934592, max size error 8589934592, old_weight 840000000000001, new_weight 840000000000001

But I see no other issue otherwise. Any other new path/logs/ways I can query the system?
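
Since dbgmsg gets long quickly, here is a minimal Python sketch that splits each line into timestamp, source location, function and message, and pairs every FAILED entry with the note logged right before it. The sample is four lines copied from the output above, the line layout is inferred from that output only, and the function names are made up.

```python
import re
from datetime import datetime, timezone

# "<epoch seconds>   <file>:<line>:<function>(): <message>"
DBGMSG_LINE = re.compile(r"^(\d+)\s+(\S+?):(\d+):(\w+)\(\):\s+(.*)$")

SAMPLE = """\
1717612906   spa_misc.c:418:spa_load_note(): spa_load(tank-02, config trusted): LOADING
1717612906   spa_misc.c:404:spa_load_failed(): spa_load(tank-02, config trusted): FAILED: cannot open vdev tree after invalidating some vdevs
1717612907   spa_misc.c:418:spa_load_note(): spa_load(tank-02, config untrusted): vdev tree has 1 missing top-level vdevs.
1717612907   spa_misc.c:404:spa_load_failed(): spa_load(tank-02, config untrusted): FAILED: unable to open vdev tree [error=2]
"""


def parse_dbgmsg(text):
    """Turn dbgmsg lines into dicts; lines that do not fit the layout are skipped."""
    entries = []
    for line in text.splitlines():
        match = DBGMSG_LINE.match(line.strip())
        if not match:
            continue
        epoch, source, lineno, function, message = match.groups()
        entries.append({
            "time": datetime.fromtimestamp(int(epoch), tz=timezone.utc),
            "source": f"{source}:{lineno}",
            "function": function,
            "message": message,
        })
    return entries


def failures_with_context(entries):
    """Pair each FAILED message with the message logged immediately before it."""
    pairs = []
    for index, entry in enumerate(entries):
        if "FAILED" in entry["message"]:
            previous = entries[index - 1]["message"] if index > 0 else None
            pairs.append((previous, entry["message"]))
    return pairs


if __name__ == "__main__":
    for before, failure in failures_with_context(parse_dbgmsg(SAMPLE)):
        print(f"before: {before}\nfailed: {failure}\n")
```

On the sample, the second failure is preceded by the note that the vdev tree has 1 missing top-level vdev, which suggests the import attempt could not see the device at that moment.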