Pool Backup_Array state is ONLINE: One or more devices are faulted in response to IO failures.

Sprint

Explorer
Joined
Mar 30, 2019
Messages
72
Hi all

This is an odd one. I found a few threads with similar issues, but they didn't quite line up with my use case, as they either had errors against drives or the pool was marked "offline unhealthy" (mine is still showing "Online unhealthy").

So one of my TrueNAS boxes (this one is my primary backup; I have a secondary backup offsite too, so the data's safe) keeps stopping. I get an error, normally about "IO failure" or "IO suspended", and the pool goes offline, yet everything looks healthy and there are no errors against any drives?
The pool consists of:

vDev1 (RAIDZ1)
4x 8TB WD Reds
vDev2 (RAIDZ1)
4x 8TB WD Reds
vDev3 (RAIDZ1)
4x 4TB WD Reds

The server is virtualised, sitting on a Proxmox host, with two HBAs passed through, 64GB of RAM, and a boot medium of two mirrored SSDs within Proxmox.
It has 10 threads assigned from a 10-core Xeon. All other VMs are running without issue.
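For what it's worth, a quick sanity check that all twelve data disks are actually visible through the passed-through HBAs (illustrative commands, not output from my box):

camcontrol devlist   # every disk behind both passed-through HBAs should appear here
glabel status        # maps the gptid labels used in zpool status back to daX devices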

root@Plutonium[~]# zpool status -v
  pool: Backup_Array
 state: ONLINE
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-JQ
  scan: scrub repaired 0B in 22:23:24 with 0 errors on Wed Jun 1 22:23:28 2022
config:

    NAME                                            STATE     READ WRITE CKSUM
    Backup_Array                                    ONLINE       0     0     0
      raidz1-0                                      ONLINE       0     0     0
        gptid/ba485b25-63af-11eb-b69a-a0369f17c294  ONLINE       0     0     0
        gptid/ba7f43b0-63af-11eb-b69a-a0369f17c294  ONLINE       0     0     0
        gptid/ba84c80f-63af-11eb-b69a-a0369f17c294  ONLINE       0     0     0
        gptid/bb2a0f9d-63af-11eb-b69a-a0369f17c294  ONLINE       0     0     0
      raidz1-1                                      ONLINE       0     0     0
        gptid/0176dd85-9676-11ec-ba47-739745ebe144  ONLINE       0     0     0
        gptid/bb00a0dc-63af-11eb-b69a-a0369f17c294  ONLINE       0     0     0
        gptid/bb17532b-63af-11eb-b69a-a0369f17c294  ONLINE       0     0     0
        gptid/bb272b43-63af-11eb-b69a-a0369f17c294  ONLINE       0     0     0
      raidz1-2                                      ONLINE       0     0     0
        gptid/f808d018-63fb-11eb-a09d-a0369f17c294  ONLINE       0     0     0
        gptid/fad5a68a-63fb-11eb-a09d-a0369f17c294  ONLINE       0     0     0
        gptid/faff21cb-63fb-11eb-a09d-a0369f17c294  ONLINE       0     0     0
        gptid/faf320e9-63fb-11eb-a09d-a0369f17c294  ONLINE       0     0     0

errors: List of errors unavailable: pool I/O is currently suspended

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:01:27 with 0 errors on Sat Jun 18 03:46:27 2022
config:

    NAME        STATE     READ WRITE CKSUM
    boot-pool   ONLINE       0     0     0
      ada0p2    ONLINE       0     0     0

errors: No known data errors

The error I get in the GUI is:

CRITICAL

Pool Backup_Array state is ONLINE: One or more devices are faulted in response to IO failures.


Looking for some guidance as to what my next steps should be. I'm confident a reboot will bring it back online like it has in the past, but I want to get to the bottom of this, as this is at least the third time this machine has thrown its toys out of the pram.
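For reference, this is roughly what I understand the immediate recovery step to be, per the ZFS-8000-JQ action text above (commands are a sketch, happy to be corrected):

zpool clear Backup_Array     # re-attempt the suspended I/O once the disks are reachable again
dmesg | tail -n 100          # look for mps/mpr HBA resets or CAM timeouts around the failure
tail -f /var/log/messages    # leave this running under load to catch the next trigger as it happens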

Appreciate any and all feedback :)
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Full Hardware Spec please. (as per forum rules)
Also, what models of drives are you using?
 

Sprint

Explorer
Joined
Mar 30, 2019
Messages
72
Full Hardware Spec please. (as per forum rules)
Also, what models of drives are you using?
They're all CMR, if that's what you're thinking, but the full list is below:
4TB WD WD40EFRX
4TB WD WD40EFRX
4TB WD WD40EFRX
4TB WD WD40EFRX
8TB WD WD80EMAZ
8TB WD WD80EZAZ
8TB WD WD80EZAZ
8TB WD WD80EFZX
8TB WD WD80EZAZ
8TB WD WD80EMAZ
8TB WD WD80EZAZ
8TB WD WD80EMAZ

Xeon E5-2630V4
SuperMicro X10SRA-F
256GB DDR4
1000W PSU
LSI 9217-8i (PCI passed through)
LSI 9200-8i (PCI passed through)
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Hmmm (I was thinking SMR) - and it's not down to a single HBA, which would otherwise explain the problem. The PSU at least looks powerful enough.
I am assuming that SMART is clear.
Are the LSIs in IT mode? Are they on the correct firmware version?
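If you're not sure, something like this from the TrueNAS shell should show it (assuming SAS2-generation cards, so the sas2flash tool; adjust the controller index as needed):

sas2flash -listall    # firmware version for each controller
sas2flash -c 0 -list  # per-controller detail; the Firmware Product ID line shows IT vs IR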
 

Sprint

Explorer
Joined
Mar 30, 2019
Messages
72
Hmmm (I was thinking SMR) - and it's not down to a single HBA, which would otherwise explain the problem. The PSU at least looks powerful enough.
I am assuming that SMART is clear.
Are the LSIs in IT mode? Are they on the correct firmware version?
It's worth adding that this setup has been running flawlessly for nearly two years, so I don't believe it to be a firmware issue. The only thing that's changed is that I configured a 2TB LUN to be used via iSCSI by Proxmox, for a Proxmox Backup Server handling VM backups. It does a verification job every two weeks, and due to the deduplicated nature of the datastore, that job is very IO intensive... Is there a scenario where TrueNAS just isn't able to keep up, or is somehow tripping over the iSCSI workload? That's the only thing that's changed over the last few months (apart from upgrading to TrueNAS 13 and having to revert back to the latest version of 12, as it broke replication/NFS shares).
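Next time the verify job kicks off I'll try watching the pool while it's under that load; something like this should show whether the spindles are simply being saturated (a sketch, run from the TrueNAS shell):

zpool iostat -v Backup_Array 5   # per-vdev IOPS and bandwidth every 5 seconds
gstat -p                         # per-disk busy %; drives pegged near 100% would point at the iSCSI load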

(Oh, and yes, all SMART tests are passing. If a drive starts to spit errors it gets swapped out; I have cold spares on standby.)
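The sort of per-drive check I mean, with the device name just as an example:

smartctl -H /dev/da0                                              # overall health verdict
smartctl -a /dev/da0 | egrep -i "reallocated|pending|uncorrect"   # the raw counters I keep an eye on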
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Ahh, dedupe.
Dedupe can, will, and often does kill NAS systems due to the amount of overhead it produces.

Solution 1: Do not use dedupe.
Solution 2: Buy an Optane (preferably two, for a mirror) and move the dedupe tables onto the Optane in a special vdev. Not the cheap M10 Optane, but a 900p or better.

How much memory have you given TN?

The fundamental problem is that dedupe requires VAST amounts of random IO, which HDDs (and it's made even worse in a RAIDZ array) cannot possibly supply.
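To see whether a pool is actually carrying a dedup table, and what moving it to a special vdev looks like, something along these lines (pool and device names are examples only, and zdb may need -U pointed at the system's zpool.cache):

zpool list -o name,size,allocated,dedupratio Backup_Array   # a dedup ratio of 1.00x means no on-pool dedup
zdb -DD Backup_Array                                         # DDT histogram and its in-core size, if dedup is in use
zpool add Backup_Array special mirror nvd0 nvd1              # example: mirrored Optane special vdev to hold the DDT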
 
Last edited:

Sprint

Explorer
Joined
Mar 30, 2019
Messages
72
Interesting... so the dedup is done entirely within the Proxmox Backup Server VM; it's totally transparent to TrueNAS, as TrueNAS is just presenting a 2TB LUN.

I do run deduplicated datasets on my bigger, much more powerful primary TrueNAS box, but those are on an array of SSDs (2 vdevs of 3x 1TB SSDs in RAIDZ1), with two 900P drives hosting the dedup tables, so I know all about deduping properly within TrueNAS.

TrueNAS isn't being asked to store the deduplicated data; that should be stored and handled in the VM. It does raise the question, though: is the Proxmox Backup Server VM just thrashing the LUN too hard?
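Just to rule it out on my side, I'll double-check the zvol behind that LUN has dedup off in TrueNAS (dataset path is a made-up example, mine will differ):

zfs get dedup,compression,volblocksize Backup_Array/pbs-lun   # dedup should read 'off' since it was never enabled here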
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
I have no idea - that's beyond my comprehension, as I don't use Proxmox and keep TN on bare metal.
 

Sprint

Explorer
Joined
Mar 30, 2019
Messages
72
I have no idea - that's beyond my comprehension, as I don't use Proxmox and keep TN on bare metal.
Yeah, my bare-metal primary box also hosts a 2TB LUN for my second PBS backup server, and that's not had an issue, but that box has NVMe L2ARC, Optane SLOGs, 20 CPU cores, and 256GB dedicated to itself, so perhaps this is just exceeding the abilities of the much lighter-provisioned, disk-only backup TrueNAS VM...? :confused:

I have a Dell 510 not doing anything; maybe I'll fire up PBS there in a VM, but just give it the entire HBA and the 7x 1TB Dell enterprise drives, taking TrueNAS out of the equation and leaving TN to handle dataset replication...
 

Sprint

Explorer
Joined
Mar 30, 2019
Messages
72
So I've been reading through the Proxmox Backup Server documentation... and yep, I missed a few key things :P There are a few ways to provide it with storage, but it advises either hardware RAID or ZFS, ideally on SSDs; if you're using spinning disks (via an HBA), it STRONGLY advises adding a fast SSD as a special device. So clearly PBS suffers from the same deduplication requirements as TrueNAS (which, now I think about it, isn't surprising, as they both leverage ZFS). My primary server managed without issue because it's a monster, but my secondary, with nothing but spinning rust... well yeah, no wonder it spat its toys out!

Now rebuilding my PBS server onto a separate machine, which will include an SSD assigned as a "special device" :)
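For anyone doing the same, the rough shape of the new datastore pool on the PBS box itself (Linux-side ZFS; pool name and device names are placeholders, not my actual layout):

zpool create pbs-datastore raidz1 sda sdb sdc sdd special mirror nvme0n1 nvme1n1
zfs set special_small_blocks=4K pbs-datastore   # optionally push small data blocks onto the SSD mirror too (metadata goes there by default)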
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Glad you got a direction to go in.
 