duncandoo
I am new to this forum, new to TrueNAS Scale and setting up a brand new machine, so please bear with me and forgive any stupid omissions.
In summary: why does the HDD that shows up as /dev/sdc always show this error, regardless of which physical drive is connected there, or via which cables? And what can I do to fix it?
The problem: a few minutes after creating a brand new RAIDZ2 pool of 8 4TB disks, the disk at /dev/sdc shows write errors, gets faulted, and the pool is marked as degraded:
Code:
root@truenas[~]# zpool status -L bigPool
  pool: bigPool
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0B in 00:00:01 with 0 errors on Sat Mar 5 11:53:37 2022
config:

        NAME        STATE     READ WRITE CKSUM
        bigPool     DEGRADED     0     0     0
          raidz2-0  DEGRADED     0     0     0
            sdb     ONLINE       0     0     0
            sdc     FAULTED      0    12     0  too many errors
            sdd     ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0
            sdg     ONLINE       0     0     0
            sdh     ONLINE       0     0     0
            sdi     ONLINE       0     0     0

errors: No known data errors
Steps to replicate:
1. all brand new hardware
- ASRock X570M Pro4 board (8 onboard SATA connectors)
- AMD Ryzen 7 5700G
- WD Red SN700 NVMe SSD, 1TB M.2, for the boot drive and hosting VMs
- Corsair Vengeance LPX 64GB 3200MHz DDR4 (4x16GB)
- SilverStone CS381 case with backplane and 8 hot-swap drive trays
- 8x Seagate ST4000VN008-2DR166 4TB drives, from different vendors, with a variety of serial numbers and dates
2. TrueNAS VM:
Code:
root@svr:~# qm config 113
agent: 1
balloon: 0
boot: order=ide2;scsi0;net0
cores: 2
cpu: host,flags=+ibpb;+virt-ssbd;+amd-ssbd
hostpci0: 0000:07:00,pcie=1
hostpci1: 0000:08:00,pcie=1
ide2: none,media=cdrom
machine: q35
memory: 8192
meta: creation-qemu=6.1.1,ctime=1646167873
name: truenas
net0: virtio=76:40:16:8E:E5:6E,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
scsi0: local-zfs:vm-113-disk-0,size=12G
scsihw: virtio-scsi-pci
smbios1: uuid=52a1fa13-3c86-499c-879e-de88eda4e8c1
sockets: 1
tpmstate0: local-zfs:vm-113-disk-2,size=4M,version=v2.0
vga: virtio
vmgenid: eafe45ee-cbdf-48fd-ba0d-dc1373633cdf
3. Install TrueNAS-SCALE-22.02.0
4. Go to the 'create pool' tab, select all 8 HDDs, listed as /dev/sd[b-i], and create the RAIDZ2 pool (a rough CLI equivalent is sketched just after this list).
5. Wait a few minutes for the error to appear.
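For reference, my understanding is that the pool the GUI creates is roughly what this CLI would produce (only a sketch: the GUI actually partitions the disks and refers to them by partition id, and ashift=12 is my assumption rather than something I have confirmed it sets):
Code:
# approximate CLI equivalent of the 'create pool' step above
zpool create -o ashift=12 bigPool raidz2 \
    /dev/sdb /dev/sdc /dev/sdd /dev/sde \
    /dev/sdf /dev/sdg /dev/sdh /dev/sdi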
Other symptoms:
1. Usually, but not always, running smartctl -t long /dev/sdX yields an increase in UDMA_CRC_Error_Count of about 10-50 on the disk attached to /dev/sdc, compared to the results before the test (the exact check I run is sketched just after this list).
2. zpool scrub and zpool clear temporarily clear the warning, which then reappears a few minutes later.
3. destroying the pool and recreating it leads to the same result
4. destroying the pool and the VM, returning the disks to Proxmox, creating the same pool there, and running the SMART tests does NOT lead to the write errors or the UDMA_CRC_Error_Count increases. This is the big mystery right now.
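For what it's worth, this is how I have been reading that counter before and after each long test (the device name is whichever disk is sitting at /dev/sdc at the time; UDMA_CRC_Error_Count is SMART attribute 199 on these drives):
Code:
# kick off a long SMART self-test on the suspect disk
smartctl -t long /dev/sdc

# once it finishes, compare the raw value of UDMA_CRC_Error_Count (attribute 199)
# with the value noted down before the test
smartctl -A /dev/sdc | grep -i udma_crc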
Things I've tried to fix it (powering off the system, making the change, and powering back on each time), none of which fix it:
1. change the drive in the affected slot
2. change the SAS cables to the other half of the backplane
3. change the SAS cable tails between different SATA connectors on the MB
4. several different direct SATA cables, to several different drives in turn, attached directly to the SATA_7 connector, with a separate power lead from the PSU
In all these cases the error recurs specifically on the drive labelled /dev/sdc and physically attached to the SATA_7 connector, regardless of which actual drive it is. My conclusion so far is that it is not the wires, the drives or the backplane. If it is a hardware problem, it is either the SATA_7 connector on the MB, or something upstream of it.
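To convince myself it really is the same port each time, after each swap I have been mapping every pool disk back to its ATA link and controller with something along these lines (the /dev/sd[b-i] range matches my setup):
Code:
# show which ATA host / PCI controller each pool disk currently sits behind
for d in /dev/sd[b-i]; do
    echo "$d -> $(udevadm info -q path -n "$d")"
done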
The PCI bus doesn't offer an obvious answer: in Proxmox there are four SATA controllers, two of which each have four of the drives attached to them. The two relevant SATA controllers are each in their own IOMMU group with nothing else. Passing through those two controllers (leaving the other two behind) doesn't seem to cause any issues. In TrueNAS, the two controllers both appear, each with four drives attached. The offending drive is always connected like this:
Code:
root@truenas[~]# udevadm info -q path -n /dev/sdc
/devices/pci0000:00/0000:00:1c.0/0000:01:00.0/ata8/host8/target8:0:0/8:0:0:0/block/sdc
which corresponds to the PCI device
hostpci0: 0000:07:00,pcie=1
passed through above.
So, my outstanding questions and thoughts:
1. If it is a hardware problem, like the connector labelled SATA_7 on the MB, why does the error only occur inside TrueNAS and not in Proxmox?
2. If it is a PCI passthrough problem, why only that one drive, and not the other three passed through in the same device?
3. If it is a software problem, what is TrueNAS doing differently from Proxmox, with the same zpool commands etc., to end up with this error?
4. What other combination of things, which my tiny mind cannot even conceive of, is going on to explain this?
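One more thing I can do from my side is watch the kernel log inside the TrueNAS VM while the errors accumulate, something like this (ata8 is the link /dev/sdc appears on in the udevadm output above):
Code:
# follow kernel messages for the suspect link/disk as the write errors appear
dmesg -w | grep -iE 'ata8|sdc'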
There is no data on the machine yet, the VMs are backed up, and I've written long notes on the installation. So I'm not averse to taking it all apart and trying relatively extreme things to get the answer. From further reading it seems a Windows VM with GPU passthrough might now work inside TrueNAS SCALE, so I'm contemplating removing Proxmox altogether. If it is a hardware problem I'd like to get that pinned down soonish, so the part can go back to its vendor while I still have the option.
Thank you all in anticipation.