[Not Solved] New Kioxia NVMe DEGRADED state

Pippo1993

Dabbler
Joined
Nov 18, 2021
Messages
13
Hi, I have a brand-new Supermicro server with an HBA card passed through to TrueNAS, with four Kioxia KCD81RUG1T92 NVMe drives attached to it.
A few days ago all the disks reported checksum errors, and two of the four are now in a DEGRADED state.
Is it possible that the SMART tool doesn't support these disks and is degrading them prematurely?

At the moment I've disabled the SMART service and everything works fine. However, two disks are still in a DEGRADED state and I don't know whether they are working.

Thanks a lot.

Attached is a SMART test of one disk.
 

Attachments

  • SMART.txt
    2.3 KB · Views: 53

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
What kind of HBA? What do you mean by passthrough? Is this a virtualised TrueNAS installation? Please provide all details including the exact server model etc. about your hardware.
 

Pippo1993

Dabbler
Joined
Nov 18, 2021
Messages
13
What kind of HBA? What do you mean by passthrough? Is this a virtualised TrueNAS installation? Please provide all details including the exact server model etc. about your hardware.
Passthrough via VMware. Yes, TrueNAS is virtualized. The server is a Supermicro SuperServer X13.
I think the HBA card is unbranded; it presents each disk individually in the ESXi hardware list. It is an NVMe HBA card.

Is there anything else you need?

Thanks a lot
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Because NVMe SSDs are individual PCIe devices, they are usually passed through individually to TrueNAS in a VM scenario. This is different from SAS/SATA devices, where we pass the single PCIe drive controller through to avoid the VMware/hypervisor driver being a "layer of abstraction."

I think the HBA card is unbranded; it presents each disk individually in the ESXi hardware list. It is an NVMe HBA card.
Can you identify the motherboard inside your server, as well as more details about this card?
 

Dopamin3

Dabbler
Joined
Aug 18, 2017
Messages
46
Would the motherboard / PCIe slot need to support PCIe bifurcation too? Maybe if the PCIe slot isn't set in the BIOS for x4/x4/x4/x4 bifurcation you will encounter issues.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Would the motherboard / PCIe slot need to support PCIe bifurcation too? Maybe if the PCIe slot isn't set in the BIOS for x4/x4/x4/x4 bifurcation you will encounter issues.
That's not how it works. When it comes to bifurcation, devices either work or they don't; there's no "it'll enumerate and then fail at a later time" because of an incorrect bifurcation setting.
 

Pippo1993

Dabbler
Joined
Nov 18, 2021
Messages
13
Because NVMe SSDs are individual PCIe devices, they are usually passed through individually to TrueNAS in a VM scenario. This is different from SAS/SATA devices, where we pass the single PCIe drive controller through to avoid the VMware/hypervisor driver being a "layer of abstraction."


Can you identify the motherboard inside your server, as well as more details about this card?
Yes, the MB is X13SEI-TF
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Yes, the MB is X13SEI-TF
That's certainly new enough to support bifurcation. What is the card being used for access to the SSDs? Does it support PCIe 5.0, and can you adjust the slot that it's in to match?

CKSUM errors usually indicate a "cable failure" - in this case, they could be explained by PCIe timing issues.
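
To see which devices are accumulating CKSUM errors, the device table printed by `zpool status` can be scanned for rows with a nonzero CKSUM column. The following is only a minimal sketch: the pool name `tank`, the device names, and the error counts in `SAMPLE` are made up for illustration and are not taken from the system in this thread.

```python
# Minimal sketch: list devices whose CKSUM column in `zpool status` is nonzero.
# SAMPLE is illustrative text only (hypothetical pool and device names).

SAMPLE = """\
  pool: tank
 state: DEGRADED
config:

	NAME        STATE     READ WRITE CKSUM
	tank        DEGRADED     0     0     0
	  raidz1-0  DEGRADED     0     0     0
	    nvd0p2  DEGRADED     0     0    12  too many errors
	    nvd1p2  ONLINE       0     0     3
	    nvd2p2  ONLINE       0     0     0

errors: No known data errors
"""


def devices_with_cksum_errors(status_text):
    """Return {name: cksum_count_as_text} for table rows with a nonzero CKSUM."""
    flagged = {}
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_table = True  # the device table starts after this header
            continue
        if not in_table:
            continue
        if not fields:
            in_table = False  # a blank line ends the device table
            continue
        # Only rows that carry READ/WRITE/CKSUM counters are considered.
        if len(fields) < 5 or not all(f[0].isdigit() for f in fields[2:5]):
            continue
        if fields[4] != "0":
            flagged[fields[0]] = fields[4]
    return flagged


if __name__ == "__main__":
    for name, count in devices_with_cksum_errors(SAMPLE).items():
        print(f"{name}: CKSUM={count}")
```

On a real system the text would come from running `zpool status` on the pool in question rather than from `SAMPLE`. Counts are kept as text because larger values may be abbreviated in the output.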
 

Pippo1993

Dabbler
Joined
Nov 18, 2021
Messages
13
That's certainly new enough to support bifurcation. What is the card being used for access to the SSDs? Does it support PCIe 5.0, and can you adjust the slot that it's in to match?

CKSUM errors usually indicate a "cable failure" - in this case, they could be explained by PCIe timing issues.
I don't know the card type because there is no description. Yes, the card is PCIe 5.0 and it is installed in a PCIe 5.0 slot.

At the moment I've disabled the SMART service and enabled SMTP critical notifications on the Supermicro BMC in case a disk goes down.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I don't know the card type because there is no description.

Even a part number or some code may help here.

You may be able to issue a zpool clear to remove the DEGRADED status of the drives, but I would recommend a scrub afterwards to ensure data integrity.
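
As a rough sketch of that sequence (clear the error counters, start a scrub, then check the result), something like the following could be used. The pool name `tank` is a placeholder and not the pool from this thread; by default the script only prints the commands instead of running them.

```python
# Minimal sketch: clear error counters, scrub, then review status.
# "tank" is a hypothetical pool name; substitute the real one.
import subprocess


def clear_and_scrub_commands(pool):
    """Build the zpool commands for the clear -> scrub -> status sequence."""
    return [
        ["zpool", "clear", pool],         # reset error counters / DEGRADED flags
        ["zpool", "scrub", pool],         # re-read and verify all data in the pool
        ["zpool", "status", "-v", pool],  # show scrub progress and any errors
    ]


def run(pool, dry_run=True):
    """Print the commands, or execute them when dry_run is False."""
    for cmd in clear_and_scrub_commands(pool):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run("tank")  # dry run: prints the three commands only
```

The scrub runs in the background, so `zpool status` has to be checked again later to see the final result. If CKSUM errors reappear after the scrub, the underlying cause is still present.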

If you are experiencing PCIe signaling issues, you may need a card with an active redriver or retimer.

https://www.asteralabs.com/smart-re...imers-vs-redrivers-an-eye-popping-difference/

1701360337462.png
 

Pippo1993

Dabbler
Joined
Nov 18, 2021
Messages
13
Even a part number or some code may help here.

You may be able to issue a zpool clear to remove the DEGRADED status of the drives, but I would recommend a scrub afterwards to ensure data integrity.

If you are experiencing PCIe signaling issues, you may need a card with an active redriver or retimer.

https://www.asteralabs.com/smart-re...imers-vs-redrivers-an-eye-popping-difference/

View attachment 72978
I issued the command earlier and it worked fine. The scrub is running normally.

I don't understand what a redriver or retimer is. Is it a component?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I don't understand what a redriver or retimer is. Is it a component?
The linked article explains it in more detail, but basically think of them as "amplifiers" that take in the PCIe signal and re-transmit it. If your board, riser card, and cables result in a signal path that is "too long" for the PCIe signal to cover on its own, you may need a PCIe riser/breakout card that has a redriver or retimer on it.
 