zpool status write errors on single disk cause degraded pool, not reproducible on Proxmox

duncandoo

Cadet
Joined
Mar 5, 2022
Messages
9
I am new to this forum, new to TrueNAS SCALE, and setting up a brand-new machine, so please bear with me and forgive any stupid omissions.

In summary: why does the HDD that appears as /dev/sdc always show this error, regardless of which physical drive is connected there, or via which cables? And what can I do to fix it?

The problem: a few minutes after creating a brand-new RAIDZ2 pool of eight 4TB disks, the /dev/sdc disk shows write errors, gets faulted, and the pool is marked as degraded:
Code:
root@truenas[~]# zpool status -L bigPool
  pool: bigPool
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0B in 00:00:01 with 0 errors on Sat Mar  5 11:53:37 2022
config:

        NAME        STATE     READ WRITE CKSUM
        bigPool     DEGRADED     0     0     0
          raidz2-0  DEGRADED     0     0     0
            sdb     ONLINE       0     0     0
            sdc     FAULTED      0    12     0  too many errors
            sdd     ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0
            sdg     ONLINE       0     0     0
            sdh     ONLINE       0     0     0
            sdi     ONLINE       0     0     0

errors: No known data errors


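For reference, the way I confirm which physical drive currently holds the sdc name is by checking the serial number each time; nothing clever, just something like:
Code:
root@truenas[~]# lsblk -o NAME,MODEL,SERIAL /dev/sdc
root@truenas[~]# smartctl -i /dev/sdc | grep -i serial
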
Steps to replicate:
1. all brand new hardware
  • ASRock X570M Pro4 board (8 onboard SATA connectors)
  • AMD Ryzen 7 5700G
  • WD Red SN700 1TB NVMe M.2 SSD for the boot drive and hosting VMs
  • Corsair Vengeance LPX 64GB 3200MHz DDR4 (4x16GB)
  • SilverStone CS381 case with backplane and 8 hot-swap drive trays
  • 8x Seagate ST4000VN008-2DR166 4TB drives, bought from different vendors, with a variety of serial numbers and manufacture dates
2. Proxmox 7.1-10
3. Truenas VM
Code:
root@svr:~# qm config 113
agent: 1
balloon: 0
boot: order=ide2;scsi0;net0
cores: 2
cpu: host,flags=+ibpb;+virt-ssbd;+amd-ssbd
hostpci0: 0000:07:00,pcie=1
hostpci1: 0000:08:00,pcie=1
ide2: none,media=cdrom
machine: q35
memory: 8192
meta: creation-qemu=6.1.1,ctime=1646167873
name: truenas
net0: virtio=76:40:16:8E:E5:6E,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
scsi0: local-zfs:vm-113-disk-0,size=12G
scsihw: virtio-scsi-pci
smbios1: uuid=52a1fa13-3c86-499c-879e-de88eda4e8c1
sockets: 1
tpmstate0: local-zfs:vm-113-disk-2,size=4M,version=v2.0
vga: virtio
vmgenid: eafe45ee-cbdf-48fd-ba0d-dc1373633cdf

4. Install TrueNAS-SCALE-22.02.0
5. Go to the 'Create Pool' tab, select all 8 HDDs (listed as /dev/sd[b-i]), and create a RAIDZ2 pool.
6. Wait a few minutes for the error to appear.

Other symptoms:
1. Usually, but not always, running smartctl -t long /dev/sdX increases the UDMA_CRC_Error_Count on the disk attached to /dev/sdc by about 10-50 compared with the reading before the test (the exact check I run is sketched after this list).
2. zpool scrub and zpool clear temporarily clear the warning, which then reappears a few minutes later.
3. Destroying the pool and recreating it leads to the same result.
4. Destroying the pool and the VM, returning the disks to Proxmox, creating the same pool there, and running the SMART tests does NOT lead to the write errors or the UDMA_CRC_Error_Count increases. This is the big mystery right now.
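
The before/after comparison is nothing sophisticated, just reading SMART attribute 199 on either side of the self-test:
Code:
# raw UDMA CRC error count before the test
root@truenas[~]# smartctl -A /dev/sdc | grep UDMA_CRC_Error_Count
# start the long self-test, wait for it to finish, then read the counter again
root@truenas[~]# smartctl -t long /dev/sdc
root@truenas[~]# smartctl -A /dev/sdc | grep UDMA_CRC_Error_Count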

Things I've tried to fix it (powering off the system, making the change, and powering back on each time), but which don't fix it:
1. change the drive in the affected slot
2. change the SAS cables to the other half of the backplane
3. change the SAS cable tails between different SATA connectors on the MB
4. several direct SATA cables, to several different drives in turn, attached directly to the SATA_7 connector, with separate power from the PSU

In all these cases the error recurs specifically on the drive labelled /dev/sdc and physically attached to the SATA_7 connector, regardless of which actual drive it is. My conclusion so far is that it is not the cables, the drives, or the backplane. If it is a hardware problem, it is either the SATA_7 connector on the MB or something upstream of it.
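
One more data point I can collect without touching any hardware (a sketch; ata8 is the guest-side port that sdc sits on, as shown in the udevadm output below): the kernel log usually records the link resets and CRC complaints that go with this kind of fault.
Code:
# inside the TrueNAS VM: look for ATA errors or link resets on the suspect port
root@truenas[~]# dmesg | grep -iE 'ata8|sdc'
# on the Proxmox host, where the problem has never shown up, the same kind of search
root@svr:~# dmesg | grep -iE 'ata[0-9]+' | grep -iE 'error|reset|crc'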

The PCI bus doesn't offer an obvious answer: in Proxmox there are four SATA controllers, two of which each have four of the drives attached to them. The two relevant SATA controllers are each in their own IOMMU group with nothing else. Passing through those two SATA controllers (leaving the other two behind) doesn't seem to cause any issues. In TrueNAS, both controllers appear, each with four drives attached. The offending drive is always connected like this:
Code:
root@truenas[~]# udevadm info -q path -n /dev/sdc
/devices/pci0000:00/0000:00:1c.0/0000:01:00.0/ata8/host8/target8:0:0/8:0:0:0/block/sdc

which corresponds to the PCI device hostpci0: 0000:07:00,pcie=1 passed through above.
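
For completeness, this is roughly how I checked the controller and IOMMU side (a sketch; 0000:07:00 and 0000:08:00 are the addresses from my qm config above):
Code:
# inside the TrueNAS VM: which guest-side controller path owns sdc
root@truenas[~]# readlink -f /sys/block/sdc/device
# on the Proxmox host: list every IOMMU group and check that the two passed-through
# SATA controllers (07:00.0 and 08:00.0) each sit in a group of their own
root@svr:~# for d in /sys/kernel/iommu_groups/*/devices/*; do
    n=${d#*/iommu_groups/}; echo "group ${n%%/*}: $(lspci -nns ${d##*/})"
done | sort -V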

So my questions/thoughts that are outstanding:
1. If it is a hardware problem, like the connector labelled SATA_7 on the MB, why does the error only occur inside TrueNAS and not in Proxmox?
2. If it is a PCI passthrough problem, why only that one drive, and not the other three passed through in the same device?
3. If it is a software problem, what is TrueNAS doing differently from Proxmox, with the same zpool status commands etc., to get to this error?
4. What other combination of things, which my tiny mind cannot even conceive of, is going on to explain this?

There is no data on the machine yet, the VMs are backed up, and I've written long notes on the installation. So I'm not averse to taking it all apart and trying relatively extreme things to get the answer. From further reading it seems a Windows VM with GPU passthrough might work inside TrueNAS SCALE now, so I'm contemplating removing Proxmox altogether. If it is a hardware problem I'd like to get that pinned down soonish, so the part can go back to its vendor while I still have the option.

Thank you all in anticipation.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Probably there is a problem with the PCI passthrough of those SATA controllers. Can you try a dedicated LSI HBA instead? Possibly borrow one for tests before buying. Or maybe you do have some spare parts.
Or try ESXi instead of Proxmox. With virtualised TrueNAS there are dozens of components at play and every single one not playing perfectly well might lead to problems.

And probably first: try SCALE on the hardware without Proxmox. Just to diagnose what works and what doesn't. BTW, why Proxmox if you are running SCALE? SCALE comes with KVM ...

What is known to work is:
- sufficiently recent chipset
- LSI HBA for SATA or all NVMe drives instead (NVMe is PCIe, so you can passthrough these)
- ESXi as the hypervisor

Anything else is a grey area. You might be lucky, you might not.
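
If you want to see what the onboard controllers actually are before spending money, something like this on the Proxmox host lists every SATA controller (PCI class 0106) and the kernel driver that has claimed it:
Code:
root@svr:~# lspci -nnk -d ::0106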
 

duncandoo

Cadet
Joined
Mar 5, 2022
Messages
9
Thanks very much. These are very helpful ideas.

to reply:
1. my chipset is AMD X570, released 2019. That seems recent enough?
2. I don't have access to an HBA immediately, and all the slots are spoken for. I could put in an HBA card, but for a permanent solution it looks like I'd have to get an x1 NIC to free up the x16 slot to put in an HBA. And I'd like to understand why this is happening rather than just chuck money at the problem (which may be good money after bad!).
3. ESXi is not open source so less preferred for me
4. When I started researching this new machine it seemed the VM functionality in TrueNAS SCALE was quite flaky and the ecosystem for apps was pretty sparse, so I went with Proxmox as a better-known system. SCALE seems to have developed very rapidly since, which is why I was considering running it bare metal anyway. I'm a little put off by this problem, since it isn't clear why it is happening, so I can't be confident it won't recur: if it is something inside TrueNAS, then I'm further back than where I started, and without a Proxmox hypervisor.

If there is something wrong with the passthrough of the SATA controllers, why is only one drive having the problem? The other, identical controller passes through fine, and three out of four drives on the affected controller pass through just fine. What is the nature of the problem? Are there tests I can do, the software equivalent of the cable and drive swapping I've already done, to further narrow down where along the path the problem is occurring?
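
One software-only swap I can think of trying (just a sketch of an idea, using the PCI addresses from my qm config above, not something I've run yet): exchange which controller is hostpci0 and which is hostpci1, then see whether the fault stays with the drive on the physical SATA_7 port or moves with the passthrough slot.
Code:
# with VM 113 shut down, swap the two passed-through SATA controllers in one command
root@svr:~# qm shutdown 113
root@svr:~# qm set 113 --hostpci0 0000:08:00,pcie=1 --hostpci1 0000:07:00,pcie=1
root@svr:~# qm start 113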

Thanks again for your suggestions. My next step will be to go with your idea of putting SCALE directly on the hardware. Wish me luck.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
I'd guess the SATA_7 connector has a mechanical problem (e.g. a poor solder joint) => poor signal integrity and CRC errors.
It should be independent of the software... CORE, SCALE, Proxmox.
 

duncandoo

Cadet
Joined
Mar 5, 2022
Messages
9
Thanks to @morganL and @Patrick M. Hausen I have done some more testing, and things seem to be pointing back toward the SATA_7 connector. I installed TrueNAS SCALE directly on the hardware yesterday and set up the pool again. I ran long SMART tests overnight, and this morning.....

The drive (a different one in the pool) connected to SATA_7, now shown as sdb2 by TrueNAS, has increased its UDMA_CRC_Error_Count by about 80, and:

Code:
root@truenas[~]# zpool status -L bigPool
  pool: bigPool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 00:00:01 with 0 errors on Sat Mar  5 11:49:52 2022
config:

        NAME        STATE     READ WRITE CKSUM
        bigPool     ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            sde2    ONLINE       0     0     0
            sdb2    ONLINE       0   112     0
            sdf2    ONLINE       0     0     0
            sdh2    ONLINE       0     0     0
            sdd2    ONLINE       0     0     0
            sda2    ONLINE       0     0     0
            sdc2    ONLINE       0     0     0
            sdg2    ONLINE       0     0     0

errors: No known data errors


Interestingly, the drive is not faulted out and the pool is not marked as degraded, as they were before. It seems the motherboard will have to go back, which is a pain. Is there anything I can do to conclusively show it is the connector? Nothing that will invalidate my warranty, though.
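
The closest thing to a conclusive, non-invasive test I can come up with (a sketch only; sdb is the SATA_7 drive per the output above, and sde stands in for an arbitrary control drive on another port): keep the pool busy and log the raw value of SMART attribute 199 on both, to show that only the SATA_7 one climbs.
Code:
# log the UDMA CRC counters for the SATA_7 drive and a control drive once a minute
root@truenas[~]# while true; do
    date
    for d in /dev/sdb /dev/sde; do
        echo -n "$d: "; smartctl -A $d | awk '/UDMA_CRC_Error_Count/ {print $NF}'
    done
    sleep 60
done | tee -a /root/crc_log.txt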

A couple of loose ends to tie up for anyone playing along later:
1. Why is the drive not marked as faulted and the pool not degraded when there is apparently an unrecoverable error, given that the same software running virtualised yesterday on the same hardware did fault the drive and degrade the pool?
2. Why does the zpool status command in Proxmox not report the same thing as write errors?

Thanks again for the fantastic help.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
@duncandoo Running virtualized reduces visibility of the real drives and the real errors...so different decisions may be made.

If the error is intermittent, the pattern of failure will change and will have different impacts. Software has to be tolerant of some errors, but intolerant of lots of errors that might risk the data. The middle ground is poorly defined. Once you have a regular error source, you just have to fix it. In your case, you could fix the motherboard or simply not use the SATA_7 port.
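
If you want to see the individual errors ZFS actually counted, rather than just the totals in zpool status, the event log is worth a look; something like:
Code:
# dump the detailed ZFS event log, including each recorded I/O error report
root@truenas[~]# zpool events -v bigPool | less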
 

George Kyriazis

Dabbler
Joined
Sep 3, 2013
Messages
42
I have a similar question/issue. I have a virtualized instance of TrueNAS on Proxmox, but in my case the underlying storage medium is a Ceph RBD device. I am using this TrueNAS instance as a backup of another TrueNAS server (that one is on bare metal), so performance is not an issue.

I, too, have some write errors, but I am wondering what it really "means" to have write errors. Was the write retried, and did it eventually succeed? I don't get any read errors, and the scrubs show no repairs were needed. Can I just clear the errors?
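
What I'm inclined to try (just a sketch; 'backup' stands in for my actual pool name) is to clear the counters, run a scrub, and see whether the write errors come back:
Code:
root@truenas[~]# zpool clear backup
root@truenas[~]# zpool scrub backup
root@truenas[~]# zpool status -v backup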

Thanks,

George
 