'boot-pool' has been suspended (uncorrectable I/O failure)

joyoftech

Dabbler
Joined
Feb 25, 2021
Messages
16
Hi all,

New here, config' is there but also below:
- ASrock B550M-ITX
- AMD 3700X
- Corsair H60 AIO
- PNY CS3030 512GB NVME (for jails)
- 128GB SanDisk USB key (boot). I planned to mirror it with another USB key (Kingston DataTraveler), but when I attached that one to the boot pool, it was reported as faulty
- 3*5TB WD RED
- 1*8TB WD White Air (shucked)
- The 4 disks make 1 raidz pool
- 156W AC/DC power brick
- 250W DC/DC PSU
- UDIMM ECC RAM: KSM32ES8/16ME (16GB)

Now, before you roast me for storing the system on USB keys, I know it's not recommended but I needed all SATA ports. I have a backup plan if you tell me the problem comes 100% from those keys.

Everything's fine until it's not: under prolonged load, like restoring my files (7TB) to my new config' over SMB, or sudden seemingly "heavy" load, e.g. mv multi_tb_folder newname, the UI suddenly hangs and everything is unresponsive until I hard-reboot it.
Looking at the system messages, I get this roughly every second:
Code:
Mar 10 23:59:39 truenas ugen0.5: <Unknown > at usbus0 (disconnected)
Mar 10 23:59:39 truenas uhub_reattach_port: could not allocate new device
Mar 10 23:59:47 truenas usb_alloc_device: set address 5 failed (USB_ERR_IOERROR, ignored)
Mar 10 23:59:47 truenas usbd_setup_device_desc: getting device descriptor at addr 5 failed, USB_ERR_IOERROR


And then, when the system hangs, I get:
Code:
Mar 11 00:00:00 truenas Solaris: WARNING: Pool 'boot-pool' has encountered an uncorrectable I/O failure and has been suspended.


In the UI, the boot-pool is marked as ONLINE / healthy and I get no alerts. This happens whether I mirror the system over a second key or not.

Now, usbconfig shows:
Code:
root@truenas[~]# usbconfig
ugen0.1: <0x1022 XHCI root HUB> at usbus0, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=SAVE (0mA)
ugen1.1: <0x1022 XHCI root HUB> at usbus1, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=SAVE (0mA)
ugen1.2: <USB SanDisk 3.2Gen1> at usbus1, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=ON (224mA)
ugen0.2: <EATON Eaton 3S> at usbus0, cfg=0 md=HOST spd=FULL (12Mbps) pwr=ON (20mA)
ugen0.3: <ASRock LED Controller> at usbus0, cfg=0 md=HOST spd=FULL (12Mbps) pwr=ON (100mA)
ugen0.4: <vendor 0x8087 product 0x0aa7> at usbus0, cfg=0 md=HOST spd=FULL (12Mbps) pwr=ON (100mA)


No mention of ugen0.5, so could this be unrelated to the USB key(s)?
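
One way to check this mechanically rather than by eye is sketched below. It assumes only the usbconfig output format shown above; the function name ugen_listed is made up for illustration.

```shell
# Succeed if the given ugen unit (e.g. ugen0.5) appears as the first
# field of a line in `usbconfig` output supplied on stdin.
ugen_listed() {
  awk -v unit="$1:" '$1 == unit { found = 1 } END { exit !found }'
}
```

On the NAS this could be run as `usbconfig | ugen_listed ugen0.5 || echo "not enumerated"`. A unit that shows up in the log but not in usbconfig most likely never finished enumerating, which on its own does not say which physical port or device is behind it.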

Thanks!
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
WARNING: Pool 'boot-pool' has encountered an uncorrectable I/O failure and has been suspended.
This may or may not have caused a problem with the OS.

Check zpool status -v and see if you have errors on the boot pool.

Take a config backup as soon as you can in any case. You may find yourself reinstalling at some point.
 

joyoftech

This may or may not have caused a problem with the OS.

What else?

Take a config backup as soon as you can in any case. You may find yourself reinstalling at some point.
I have plenty of backups, I've had to reinstall already for other reasons...

Output of zpool status -v

Code:
Warning: settings changed through the CLI are not written to
the configuration database and will be reset on reboot.

root@truenas[~]# zpool status -v
  pool: ADIMAJO
 state: ONLINE
  scan: resilvered 31.7M in 00:00:04 with 0 errors on Sat Mar  6 20:05:17 2021
config:

        NAME                                            STATE     READ WRITE CKSUM
        ADIMAJO                                         ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/a050950b-72e4-11eb-8e79-a8a1593f4201  ONLINE       0     0     0
            gptid/a0859a32-72e4-11eb-8e79-a8a1593f4201  ONLINE       0     0     0
            gptid/a0908309-72e4-11eb-8e79-a8a1593f4201  ONLINE       0     0     0
            gptid/a0cadfdf-72e4-11eb-8e79-a8a1593f4201  ONLINE       0     0     0

errors: No known data errors

  pool: NVME
 state: ONLINE
config:

        NAME                                          STATE     READ WRITE CKSUM
        NVME                                          ONLINE       0     0     0
          gptid/ff6b3a39-73c5-11eb-8e79-a8a1593f4201  ONLINE       0     0     0

errors: No known data errors

  pool: boot-pool
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-2Q
  scan: resilvered 0B in 00:15:29 with 0 errors on Wed Mar 10 19:31:49 2021
config:

        NAME                                            STATE     READ WRITE CKSUM
        boot-pool                                       DEGRADED     0     0     0
          mirror-0                                      DEGRADED     0     0     0
            gptid/1740d297-5dad-11eb-9421-0555f2f29729  ONLINE       0     0     0
            da1p2                                       UNAVAIL      0     0     0  cannot open

errors: No known data errors
root@truenas[~]#


So nothing of particular interest (the mirror da1p2 is currently not plugged in, and to me it seems unrelated to my problem).
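
For output as long as the above, a small filter can pick out only the pools that are not ONLINE. This is a minimal sketch that assumes the "pool:" / "state:" layout shown in the output; the function name unhealthy_pools is hypothetical.

```shell
# Print "pool state" for every pool in `zpool status` output (stdin)
# whose state is anything other than ONLINE.
unhealthy_pools() {
  awk '
    $1 == "pool:"                    { pool = $2 }
    $1 == "state:" && $2 != "ONLINE" { print pool, $2 }
  '
}
```

Usage would be `zpool status | unhealthy_pools`. Note that `zpool status -x` already limits the report to pools with problems, so the filter is mainly useful when a one-line-per-pool summary is wanted.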

What about this ugen0.5 that is not reported by usbconfig?
 

sretalla

So your boot pool consists of 2 USB sticks... one (ugen1.2) is there and the other is missing, presumably ugen0.5 (which in my opinion is the same as da1, the "missing" drive in the boot pool).

I'm not sure what you expect to find out other than that you need to find a better option than that USB drive which keeps failing.

My suggestion would be a USB to SATA converter and an SSD drive attached to that if you have no free SATA ports.

As far as boot solutions go, that would be reliable enough to eliminate the need for a second one for most people.
 

joyoftech

I'm not sure what you expect to find out other than that you need to find a better option than that USB drive which keeps failing.
I'm expecting to find out:
  • Is it really a problem related to one of those two keys?
    • Even if it's related to da1, the fact that da0 / ugen1.2 is online as a single boot device should be enough to operate TrueNAS: why would the OS crash? The purpose of mirroring is precisely that if one boot device fails, the other keeps the system working...
    • Re-attached da1, ran usbconfig:
Code:
ugen1.3: <Kingston DataTraveler 3.0> at usbus1, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=ON (126mA)

Can I conclude that it's unrelated to the USB keys?
  • If not related to one of the keys, then what?
    • What is ugen0.5? lsusb is not installed; how can I identify the device?
    • How can I disable it?
My suggestion would be a USB to SATA converter and an SSD drive attached to that if you have no free SATA ports.

If it's not related to the USB keys, then doing so wouldn't solve my problem.
If it is, my backup plan was rather to go for a PCIe adapter card with multiple M.2 slots, e.g. the HYPER M.2 X16 CARD V2.
I would have 1 small M.2 for boot, 2 SLOG, 1 L2ARC devices.
How does that sound?

EDIT: also, it's possible that the two messages (see 1st post) are unrelated, i.e. that the ugen0.5 error has nothing to do with the uncorrectable I/O failure. How could I find out more about the latter?
 

sretalla

Have a look at dmesg | grep ugen0.5
 

joyoftech

Just one of those system messages:
Code:
ugen0.5: <Unknown > at usbus0 (disconnected)


Thanks!
 

sretalla

How about dmesg | grep -i usb
 

joyoftech

How about dmesg | grep -i usb

A repetition of these lines (the block between the two (disconnected) messages):
Code:
ugen0.5: <Unknown > at usbus0 (disconnected)
usb_alloc_device: set address 5 failed (USB_ERR_IOERROR, ignored)
usbd_setup_device_desc: getting device descriptor at addr 5 failed, USB_ERR_IOERROR
usbd_req_re_enumerate: addr=5, set address failed! (USB_ERR_IOERROR, ignored)
usbd_setup_device_desc: getting device descriptor at addr 5 failed, USB_ERR_IOERROR
usbd_req_re_enumerate: addr=5, set address failed! (USB_ERR_IOERROR, ignored)
usbd_setup_device_desc: getting device descriptor at addr 5 failed, USB_ERR_IOERROR
usbd_req_re_enumerate: addr=5, set address failed! (USB_ERR_IOERROR, ignored)
usbd_setup_device_desc: getting device descriptor at addr 5 failed, USB_ERR_IOERROR
usbd_req_re_enumerate: addr=5, set address failed! (USB_ERR_IOERROR, ignored)
usbd_setup_device_desc: getting device descriptor at addr 5 failed, USB_ERR_IOERROR
ugen0.5: <Unknown > at usbus0 (disconnected)
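
To see how often each USB address fails, the messages can be tallied. The sketch below assumes only the three message forms quoted above ("set address 5 failed", "at addr 5 failed", "addr=5"); the function name count_usb_addr_errors is made up for illustration.

```shell
# Count USB_ERR_IOERROR lines per device address in log text on stdin.
count_usb_addr_errors() {
  awk '
    /USB_ERR_IOERROR/ {
      # Pull the address out of "addr 5", "address 5" or "addr=5".
      if (match($0, /addr(ess)?[ =][0-9]+/)) {
        s = substr($0, RSTART, RLENGTH)
        gsub(/[^0-9]/, "", s)      # keep only the digits
        count[s]++
      }
    }
    END { for (a in count) printf "addr %s: %d\n", a, count[a] }
  '
}
```

It could be fed from `dmesg | count_usb_addr_errors` or from /var/log/messages. If every failure lands on the same address, that points at one port or device repeatedly failing to enumerate rather than at the bus as a whole, though it does not prove it.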


For reference, I've found this issue, which seems quite similar, so I'll try setting these:
Code:
hw.usb.xhci.ctlstep=1
hw.usb.xhci.dma32=1
hw.usb.xhci.use_polling=1
hw.usb.xhci.xhci_port_route=1

in /boot/loader.conf... For now, the boot-pool is resilvering.
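
After the reboot it is worth confirming that the values actually took effect, since the login banner warns that CLI changes are not written to the configuration database (as far as I know, TrueNAS expects loader tunables to be added under System > Tunables instead). The sketch below compares a file of wanted "name=value" lines with `sysctl`-style "name: value" output; the function name check_tunables and the idea of keeping the four lines in a separate file are assumptions for illustration.

```shell
# Compare wanted "name=value" tunables (file given as $1) against
# `sysctl`-style "name: value" lines on stdin; print every tunable
# that is missing or has a different value.
check_tunables() {
  awk -v want_file="$1" '
    BEGIN {
      while ((getline line < want_file) > 0) {
        n = index(line, "=")
        if (n > 0) want[substr(line, 1, n - 1)] = substr(line, n + 1)
      }
    }
    {
      n = index($0, ": ")
      if (n > 0) have[substr($0, 1, n - 1)] = substr($0, n + 2)
    }
    END {
      for (k in want) {
        if (!(k in have)) {
          print k ": not reported"
        } else if (have[k] != want[k]) {
          print k ": have " have[k] ", want " want[k]
        }
      }
    }
  '
}
```

With the four lines saved in a file (say xhci.tunables, a hypothetical name), `sysctl hw.usb.xhci | check_tunables xhci.tunables` prints nothing when everything matches. Values are compared as plain text, so quoted values such as "1" would need the quotes stripped first.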

Thanks!
 

sretalla

OK, so if it's a Ryzen issue in particular, that's outside my experience, so I'll wish you luck.

I can suggest that a reset of the bus could cause other devices to become unstable, so perhaps that's it? Anyway, all the best.
 

joyoftech

OK, so if it's a Ryzen issue in particular, that's outside my experience, so I'll wish you luck.

I can suggest that a reset of the bus could cause other devices to become unstable, so perhaps that's it? Anyway, all the best.

Thanks for your message! I don't want to spend time and money figuring out how to ditch the USB keys if they're not causing this...
I'll try, once the resilvering is done:
  1. a "cold" boot: shut down and unplug the PSU
  2. tweaking /boot/loader.conf
  3. using my NVMe SSD (now hosting jails, but I haven't set anything up yet) as the boot drive and disconnecting all USB devices
For the record, my motherboard has 2 x USB 2.0 Type A, 2 x USB 3.2 Gen1 Type A, 1 x USB 3.2 Gen1 Type A below the Ethernet port and 1 x USB 3.2 Gen1 Type C (could it simply be that the Type C port is not recognized by FreeBSD or something?). It also has 2 headers for case ports: 2 x USB 2.0 (not plugged in) and 2 x USB 3.2 Gen1 Type A (plugged in, but there's only one USB port on the case).
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
The purpose of mirroring is precisely if one boot fails, the other should make the system work nevertheless...
Yes, but take into account that there's no "auto-failover" for boot devices provided in FreeNAS - this would be a BIOS function. If it doesn't provide such then you have to manually select the remaining good boot device when you experience a failure.
 

joyoftech

Yes, but take into account that there's no "auto-failover" for boot devices provided in FreeNAS - this would be a BIOS function. If it doesn't provide such then you have to manually select the remaining good boot device when you experience a failure.
You mean, at start-up? So when it detects a failure, it is paramount to Detach (and possibly Replace) before rebooting?
That's an important clarification, thank you.
 

Redcoat

it is paramount to Detach (and possibly Replace)
Not quite - you will need to enter your BIOS and ensure that the correct device is selected for boot.

From your mobo manual, looks like that would be here:

[Image: 1615481968420.png, screenshot from the motherboard manual]


I would test to see if it will indeed failover from option 1 to option 2 - it may not, only selecting #2 if the device is not present in #1
 

joyoftech

I would test to see if it will indeed failover from option 1 to option 2 - it may not, only selecting #2 if the device is not present in #1
Here's the catch: I have neither an APU nor a graphics card that works without a power connector, and my motherboard has no IPMI... So I cannot get into the BIOS (unless I buy a graphics card just for this).
I've experimented switching USB ports with one and both keys and it's always booting... I can't say more.
 

Redcoat

I've experimented switching USB ports with one and both keys and it's always booting... I can't say more.
So it seems yours is "failing over" if you have good USB thumb drives, which is good.
Experience here generally indicates that USB2 devices perform better than USB3.

EDIT - BTW, I can't imagine not being able to enter my BIOS (again that's an Intel-based experience remark, perhaps Ryzen is less subject to needed tweaks). Saying that reminds me that there are some posts here about Ryzen tweaks for stability (but I have no idea of their relevance to your system).
 

joyoftech

EDIT - BTW, I can't imagine not being able to enter my BIOS (again that's an Intel-based experience remark, perhaps Ryzen is less subject to needed tweaks). Saying that reminds me that there are some posts here about Ryzen tweaks for stability (but I have no idea of their relevance to your system).
Me too, but I hadn't checked whether I could use the graphics card I already have, and surprise! It needs a 6-pin power connector. I could put my TrueNAS next to my desktop and power the graphics card from the desktop's PSU, but I'd rather it not come to that.
I'll check these stability tweaks... I thought these things (CPU and RAM) were best left to "default" settings...
 

Redcoat


joyoftech

Code:
hw.usb.xhci.ctlstep=1
hw.usb.xhci.dma32=1
hw.usb.xhci.use_polling=1
hw.usb.xhci.xhci_port_route=1
Tried those, did not work; tried other USB ports, did not work; tried both keys in turn, did not work.
I resolved to sacrifice my NVMe disk for boot. The Solaris: WARNING: Pool 'boot-pool' has encountered an uncorrectable I/O failure message no longer appears while mv, cp or rsync'ing big directories, but the ugen0.5: <Unknown > error is still there and produces 5 lines every second in the system messages, which is quite annoying.

I was indeed warned that using USB keys is not recommended, but with my hardware it seems they simply don't work at all, despite my efforts to find fast keys (maybe stability-wise it would be better to use USB2 as suggested by @Redcoat! But what about performance?). Maybe the suggested BIOS tweaks (disabling C-states, AMD Cool & Quiet and ErP Ready) would have done the trick, but I'll never know (unless I encounter a problem which requires me to buy an old graphics card)!

I will thus use my main and only pool for jails as well... I'll try a PCIe adapter card with several M.2 slots (HYPER M.2 X16 CARD GEN 4) so that I can use a smaller M.2 drive for boot, and maybe play with L2ARC and SLOG, but I still need to understand how those work, i.e. how many SSDs can go there based on PCIe "bifurcation" and the CPU and motherboard specs (very obscure to me for now - could require entering the BIOS, etc.).
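
On the bifurcation question, a rough rule of thumb: as far as I know, cards of this type have no PCIe switch on board, so each M.2 slot only shows up if the slot's 16 lanes are split into separate links by the CPU and BIOS. Under that assumption, the number of SSDs the OS can see equals the number of links in the selected split. The function name drives_visible and the "x4x4x4x4" style labels are illustrative; BIOS menus word this differently, and whether a given board offers the setting at all has to be checked in its manual.

```shell
# Given a bifurcation mode written like x16, x8x8, x8x4x4 or x4x4x4x4,
# print the number of separate links, i.e. the most M.2 drives a
# switchless adapter card could expose in that mode.
drives_visible() {
  printf '%s\n' "$1" | awk -F'x' '{ print NF - 1 }'
}
```

For example, `drives_visible x4x4x4x4` gives 4 while `drives_visible x16` gives 1, which is why a four-slot card in a slot that cannot be split would show only its first SSD.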

Thanks for your help.
 