Pool Backup Array state is ONLINE: One or more devices are faulted in response to IO failures.

Sprint · Jun 20, 2022

Hi all

This is an odd one, I found a few threads which had similar issues but they didn't quiet line up with my usecase as they either had errors against drives or the pool was marked "offlien unhealthy" (Mine is still showing "Online unhealthy".

So one of my TrueNas boxes (this one is my primary backup, I have a secondary backup offsite too so data's safe) keeps stopping. I get an error, normally about "IO failure" or "IO suspended", and the pool goes offline, but yet all looks healthy and there are no errors against any drives?
Pool consists of..

vDev1 (Raidz1)
4x8Tb WD Reds
vDev2 (Raidz1)
4x8Tb WD Reds
vDev3 (Raidz1)
4x4Tb WD Reds

Server is virtualised, sitting on a Proxmox host, with 2 HBAs passed through, 64Gb of ram, boot medium is a pair of mirrored SSD within Proxmox.
Has 10 Threads assigned of a 10 Core Xeon. All other VMs are running without issue.

root@Plutonium[~]# zpool status -v
pool: Backup_Array
state: ONLINE
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-JQ
scan: scrub repaired 0B in 22:23:24 with 0 errors on Wed Jun 1 22:23:28 2022
config:

NAME STATE READ WRITE CKS UM
Backup_Array ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
gptid/ba485b25-63af-11eb-b69a-a0369f17c294 ONLINE 0 0 0
gptid/ba7f43b0-63af-11eb-b69a-a0369f17c294 ONLINE 0 0 0
gptid/ba84c80f-63af-11eb-b69a-a0369f17c294 ONLINE 0 0 0
gptid/bb2a0f9d-63af-11eb-b69a-a0369f17c294 ONLINE 0 0 0
raidz1-1 ONLINE 0 0 0
gptid/0176dd85-9676-11ec-ba47-739745ebe144 ONLINE 0 0 0
gptid/bb00a0dc-63af-11eb-b69a-a0369f17c294 ONLINE 0 0 0
gptid/bb17532b-63af-11eb-b69a-a0369f17c294 ONLINE 0 0 0
gptid/bb272b43-63af-11eb-b69a-a0369f17c294 ONLINE 0 0 0
raidz1-2 ONLINE 0 0 0
gptid/f808d018-63fb-11eb-a09d-a0369f17c294 ONLINE 0 0 0
gptid/fad5a68a-63fb-11eb-a09d-a0369f17c294 ONLINE 0 0 0
gptid/faff21cb-63fb-11eb-a09d-a0369f17c294 ONLINE 0 0 0
gptid/faf320e9-63fb-11eb-a09d-a0369f17c294 ONLINE 0 0 0

errors: List of errors unavailable: pool I/O is currently suspended

pool: boot-pool
state: ONLINE
scan: scrub repaired 0B in 00:01:27 with 0 errors on Sat Jun 18 03:46:27 2022
config:

NAME STATE READ WRITE CKSUM
boot-pool ONLINE 0 0 0
ada0p2 ONLINE 0 0 0

errors: No known data errors

The error I get in the GUI is "

CRITICAL

Pool Backup_Array state is ONLINE: One or more devices are faulted in response to IO failures.

Looking for some guidance as to what my next steps should be? I'm confident a reboot will bring it back online like it has in the past, but i want to get to the bottom of this as this is at least the 3rd time this machines thrown its toys out of the pram.

Appreciate any and all feed back :)

NugentS · Jun 20, 2022

Full Hardware Spec please. (as per forum rules)
Also, what models of drives are you using?

Sprint · Jun 20, 2022

NugentS said:
Full Hardware Spec please. (as per forum rules)
Also, what models of drives are you using?

They're all CMR if thats what your thinking, but full is below

4Tb WD	WD40EFRX
4Tb WD	WD40EFRX
4Tb WD	WD40EFRX
4Tb WD	WD40EFRX
8Tb WD	WD80EMAZ
8Tb WD	WD80EZAZ
8Tb WD	WD80EZAZ
8Tb WD	WD80EFZX
8Tb WD	WD80EZAZ
8Tb WD	WD80EMAZ
8Tb WD	WD80EZAZ
8Tb WD	WD80EMAZ

Xeon E5-2630V4
SuperMicro X10SRA-F
256Gb DDR4
1000w PSU
LSI 9217-8i (PCI passed through)
LSI 9200-8i (PCI passed through)

NugentS · Jun 20, 2022

Hmmm (I was thinking SMR) - and its not 1 HBA to explain the problem. The PSU at least looks powerful enough.
I am assuming that SMART is clear
LSI's are in IT mode? - Are they on the correct firmware version?

Sprint · Jun 20, 2022

NugentS said:
Hmmm (I was thinking SMR) - and its not 1 HBA to explain the problem. The PSU at least looks powerful enough.
I am assuming that SMART is clear
LSI's are in IT mode? - Are they on the correct firmware version?

Its worth adding, this setup has been running flawlessly for nearly two years, so I don't believe it to be a firmware issue. The only thing that's change is that i configured a 2Tb LUN to be used via iSCSI by Proxmox for a Proxmox Backup Server for VM backups. It does a verification job every 2 weeks, and due to the deduplicated nature of the dataset, is very IO intensive.... is there a scenario where TrueNas just isn't able to keep up, or is somehow tripping over the iSCSI workload? That's the only thing that's changed over the last few months (apart from upgrading to TrueNas 13, and having to regress back to the latest version of 12 as it broke replication/NFS shares)

(oh and yes all smarts are passing. If a drive starts to spit errors, it gets swapped out, I have cold spares on standby)

NugentS · Jun 20, 2022

Ahh dedupe
Dedupe can, will and often does kill NAS's due to the amount of overhead it produces.

Solution 1: Do not use dedupe
Solution 2: Buy an Optane (preferably 2 for mirrors) and move the dedupe tables to the optane in a specialised vdev. Not the cheap M10 optane but a 900p or better

How much memory have you given TN?

The fundamental problem is that dedupe requires VAST amounts of random IO which HDD's (its made even worse in a RAIDZ array) cannot possibly supply.

NugentS · Jun 20, 2022

Take a look at

My experiments in building a home server capable of handling fast + consistent deduplication

AIM: To help people looking at deduplication on TrueNAS 12+, what I've found on the way making it work on mine. On sustained mixed loads, such as 50GB+ file copies and multiple transfers, using TrueNAS 12 with a deduped pool and default config...

www.truenas.com

It gives you some idea of what is required to run dedupe on TN

Sprint · Jun 20, 2022

Interesting... so the dedup is done entirely within the Proxmox backup server's VM, its totally transparent to Truenas as Truenas is just presenting a 2Tb LUN

I do run deduplicated datasets on my bigger, much more powerful primary Truenas box, but those are on an array of SSDs (2 Vdevs of 3x1Tb SSDs in Raidz1), with two 900P drives which host the dedup tables, so i know all about dedup'ing properly within Truenas.

TrueNas isn't being asked to store the DD data, that should be stored and handled on the VM. It does raise the question then, is the Proxmox backup server VM just thrashing the LUN to hard?

NugentS · Jun 20, 2022

I have no idea - and thats beyond my comprehension as I don't use proxmox and keep TN on metal.

Sprint · Jun 20, 2022

NugentS said:
I have no idea - and thats beyond my comprehension as I don't use proxmox and keep TN on metal.

Yeah my bare metal Primary box also hosts a 2Tb lun to my second PBS backup server, and thats not had an issue, but that has NVME L2ARC, Octane Slogs, and 20 CPU cores and 256Gb dedicated to itself, so perhaps this is just exceeding the ability of the much lighter provisioned, disk only Backup TrueNas VM...?

I have a Dell510 Not doing anything, maybe I'll fire up PBS on there in a VM, but just give it the entire HBA and the 7x1Tb Dell enterprise drives, take TN out of the equation, leaving TN to handle dataset replication...

Sprint · Jun 20, 2022

So I've been reading into the Proxmox Backup server documentation... and yep... I missed a few key things :P there's a few ways to provide it with storage, but it advises either using hardware raid, or ZFS ideally on SSDs, but if using disks (via a HBA) it STRONGLY advises using a fast SSD as a special drive, so clearly PBS suffers from the same deduplication requirements as TrueNas (which now I think about it, isn't surprising as they both leverage ZFS). My Primary server managed without issue because its a monster, but my secondary, with nothing but spinning rust, well yeah, no wonder it spat its toys out!

Now rebuilding my PBS server onto a separate machine, which will include a SSD assigned as a "Special device" :)

NugentS · Jun 20, 2022

Glad you got a direction to go in.

Sprint · Jun 21, 2022

NugentS said:
Glad you got a direction to go in.

Yep, thanks for your input, as that was the clue to needed :)

Important Announcement for the TrueNAS Community.

Pool Backup Array state is ONLINE: One or more devices are faulted in response to IO failures.

Sprint

Explorer

CRITICAL

Pool Backup_Array state is ONLINE: One or more devices are faulted in response to IO failures.

NugentS

MVP

Sprint

Explorer

NugentS

MVP

Sprint

Explorer

NugentS

MVP

NugentS

MVP

My experiments in building a home server capable of handling fast + consistent deduplication

Sprint

Explorer

NugentS

MVP

Sprint

Explorer

Sprint

Explorer

NugentS

MVP

Sprint

Explorer

Similar threads

Important Announcement for the TrueNAS Community.

Pool Backup Array state is ONLINE: One or more devices are faulted in response to IO failures.

Explorer

CRITICAL​

Pool Backup_Array state is ONLINE: One or more devices are faulted in response to IO failures.​

MVP

Explorer

MVP

Explorer

MVP

MVP

Explorer

MVP

Explorer

Explorer

MVP

Explorer

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Pool Backup Array state is ONLINE: One or more devices are faulted in response to IO failures."

Similar threads

CRITICAL

Pool Backup_Array state is ONLINE: One or more devices are faulted in response to IO failures.