drive failures across mirror vdevs

msokol

Dabbler
Joined
Jan 5, 2023
Messages
10
Hello everyone,

Since installing TrueNAS SCALE on both of my servers, one of which serves as a backup of the other, I've been experiencing problems with pool health. All of the storage pools on the main server are organized as mirror vdevs, three in total. The main TrueNAS server runs TrueNAS-SCALE-22.12.4.2 and is virtualized on Proxmox (which is installed on two Micron 5300 1.92TB SSDs in a mirror), with the hardware HBA passed through to the VM. The allocated memory is 12GB.

Almost all the time at least one of the pools is in a degraded state, with disks being faulted, unavailable, or removed, usually due to a high number of errors detected by the OS. SMART monitoring tools such as smartctl usually report errors during internal tests and/or a growing "UDMA_CRC_Error_Count" or "Hardware_ECC_Recovered" count. Below is the output for one of the affected disks:

Code:
Error 38492 occurred at disk power-on lifetime: 25091 hours (1045 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 51 00 00 00 00 00  Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 05 fe 00 00 00 40 00      00:03:20.036  SET FEATURES [Enable APM]
  60 08 00 10 78 c0 40 00      00:03:19.955  READ FPDMA QUEUED
  60 08 00 78 78 c0 40 00      00:03:19.955  READ FPDMA QUEUED
  60 08 00 38 78 c0 40 00      00:03:19.955  READ FPDMA QUEUED
  60 08 00 18 78 c0 40 00      00:03:19.955  READ FPDMA QUEUED
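
For reference, these are roughly the commands I use to pull that information (sdX is just a placeholder device name, not my actual disk):

Code:
# Full SMART report, including the error log excerpted above
smartctl -x /dev/sdX

# Only the attributes that usually point at cabling/transfer problems
smartctl -A /dev/sdX | grep -E 'UDMA_CRC_Error_Count|Hardware_ECC_Recovered'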



Once the drive is replaced and resilvering finishes, the pool reports healthy for some time, until the same disk or another one fails again. I should note that one of my disks did have bad sectors and was successfully replaced; all the other ones seem to have problems with data transfer. I can think of only two likely causes for such behaviour:

1. Problems with cables. These are relatively cheap mini SAS to 4x SATA breakout cables bought from China (AliExpress, eBay).
2. Problems with the HBA card. I have an LSI 9240-8i (recognized as an LSI SAS2008 by lspci), also purchased on AliExpress (see the command sketch below).
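
A quick way to double-check what the card itself reports, in case it is running IR (RAID) firmware rather than IT firmware, would be something like this (a sketch only; it assumes the LSI sas2flash utility is available on the system):

Code:
# Confirm how the OS sees the controller
lspci | grep -i lsi

# List the adapter(s), firmware version, and whether the card runs IR or IT firmware
sas2flash -listall
sas2flash -list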

On the backup server, the configuration is slightly different. There I have TrueNAS-SCALE-23.10.2 installed on bare metal (a consumer MSI AM1I board with 12GB RAM). The HDDs are organized as a raidz1 vdev of four 2TB disks, and the boot disks form a mirror vdev. The HDDs are connected through cables of the same quality to a similar HBA card; the boot mirror is connected directly to the motherboard. Here I consistently see failures on two disks. Apart from a "standard" faulty HDD, one of my SSDs has encountered errors as well. I attached it to another PC and checked SMART there, but could not confirm the smartctl results: according to CrystalDiskInfo, the disk is clean. So the two suspects I listed above are questionable.

I have one more observation. My current backup server used to serve as the main server, running OpenMediaVault on practically the same hardware. I also set up several mirror vdevs there, and the system ran smoothly for at least two(!) years, with many of those disks since migrated to the current main server.

If anyone has any idea how to trace the source of the problem without breaking the bank, it would be much appreciated.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703

msokol

Dabbler
Joined
Jan 5, 2023
Messages
10
I do have one 2TB disk, a WDC WD20EFAX, which is SMR, but I don't remember any problems with it. Even if it did conflict with ZFS, that would not explain the problems with the other disks, such as the SSDs connected directly to the motherboard, nor the smooth operation of the same disk for two years in a similar mirror vdev under OpenMediaVault.
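
For anyone who wants to double-check their own drives, the model string (which is what identifies the WD20EFAX as SMR) can be read like this; drive-managed SMR is not reported directly, so the model still has to be looked up against the manufacturer's published lists (sdX is a placeholder):

Code:
# Print the drive identity; the model number is what you look up for SMR vs CMR
smartctl -i /dev/sdX | grep -E 'Device Model|Model Family'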
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
in a similar mirror vdev under openmediavault
I wasn't aware that OMV did ZFS... (I suspect that it doesn't).

The way ZFS writes to your pool disks is vastly different from the way other filesystems do, which is why ZFS and SMR absolutely don't mix well.

When ZFS sends a transaction group out to the disks in a pool (every 5 seconds by default), it expects those commands to be carried out and completed before the next transaction group is sent.
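
(That interval is the OpenZFS transaction group timeout; on a SCALE box you can read it like this, and 5 is the default unless someone has tuned it.)

Code:
# Transaction group timeout in seconds (default 5)
cat /sys/module/zfs/parameters/zfs_txg_timeout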

SMR disks will sometimes return "busy" or not return an answer at all when they are busy dealing with the shingling tasks they take on as part of their design... ZFS has no patience for that (often just evicting that disk from the VDEV).
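
When that happens, the ZFS event log normally records it, so something like this is worth a look (not saying that's what you'll see, just where to look):

Code:
# Recent ZFS events - look for statechange / io / checksum entries
zpool events -v

# Current pool state and per-disk read/write/checksum error counters
zpool status -v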

If your controller is being overloaded with instructions by ZFS but still has a backlog of commands outstanding against an SMR disk, things can get a bit messy in its handling of not only that disk, but all disks.

Anyway, I'm not in the business of explaining why SMR and ZFS are crap together... they just are.

Since you clearly identified that you have put them together and don't see why that would be a problem, because the same disks worked together fine with another filesystem (correct me if OMV has ZFS and that's what you were using), I'll just say that you're mistaken in using that evidence to form that conclusion.

I agree that one single SMR disk and possible controller backlog may not be enough to explain the problems you see with other disks... but you've been unclear about exactly what you have, so I can't comment further.
 

msokol

Dabbler
Joined
Jan 5, 2023
Messages
10
I wasn't aware that OMV did ZFS... (I suspect that it doesn't)
Yes, it did support ZFS at that time, and I believe that support has not been dropped since I moved to a more powerful server.

Anyway, I'm not in the business of explaining why SMR and ZFS are crap together... they just are.
I am not an expert in this field, which is why I made my post here. I deeply appreciate the information you provided, and I'm not disputing you on this subject since, again, I'm not an expert in this area.

Since you clearly identified that you have put them together and don't see why that would be a problem, because the same disks worked together fine with another filesystem (correct me if OMV has ZFS and that's what you were using), I'll just say that you're mistaken in using that evidence to form that conclusion.
Yes, this disk was part of my disk array under OMV, with ZFS pools organized similarly (three mirrored pools with two disks in each). The thing is, this disk is now part of my backup server. I double-checked and could not find any SMR disk installed in my main server (apart from a 4TB HPE disk, model MB4000GEQNH, for which information is missing on the internet; still, since it is an enterprise disk, it most probably uses CMR technology).

However, my adventures with faulted disks continue. One disk was removed by the system from one of my pools and appeared as unused. While replacing it "with itself" and resilvering, I encountered a lot of errors, and the disk was labeled "Faulted" once again because of too many errors.

On the backup server there were two faulted drives: one from the main raidz1 pool (4 disks) and one SSD from the boot-pool (mirror vdev). There I will replace the SMR disk with a CMR one and see if it makes a difference. Let me know whether an SMR disk can impact other pools as well, since the "faulted" SSD shows good health with no errors according to the diagnostic tools on my Windows machine.
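
For completeness, the command-line equivalent of what I've been doing (re-checking and then replacing a faulted member) looks roughly like this (pool and device names below are placeholders, not my real ones):

Code:
# Which member is faulted and the per-disk error counters
zpool status -v tank

# Long SMART self-test on the suspect disk, then read the result once it finishes
smartctl -t long /dev/sdX
smartctl -a /dev/sdX

# Either clear the error counters and watch, or replace the faulted member
zpool clear tank
zpool replace tank /dev/disk/by-id/OLD-DISK /dev/disk/by-id/NEW-DISK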
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Hey @msokol

Can you describe the chassis this card is in? The LSI SAS HBAs, especially the earlier SAS2008 generation, tend to get rather warm under load, as they're designed to sit in servers with a high level of ambient airflow. So if your system is a standard desktop tower with little direct ventilation over the PCIe area, it could be a simple issue of physics - your card is overheating.
 

msokol

Dabbler
Joined
Jan 5, 2023
Messages
10
Hi @HoneyBadger

Can you describe the chassis this card is in?
Absolutely. I have everything mounted inside an Antec 1900 case. I know it is positioned as a gaming case and is not the best for server equipment, but it was sold at a huge discount and, most importantly, it can fit my E-ATX Supermicro motherboard.
It is equipped with three 120mm intake fans in front of the HDD cage, two 120mm top exhaust fans, and one 120mm exhaust fan at the rear of the chassis.

it could be a simple issue of physics - your card is overheating.
Actually, I had problems with my 10GbE NIC overheating. According to the sensors package installed on the host OS (Proxmox), I saw temperatures of up to 90°C while copying files. Unfortunately, my HBA card doesn't expose any sensors (or they can't communicate with the software). So to handle this issue I installed a large 200mm fan in front of all my cards (see the photo below).

20240227_232118_new.jpg


The temperature of the 10G fiber NIC immediately dropped to 60-70°C max. Hopefully it helped the other PCIe cards as well.
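
Since the HBA itself doesn't report a temperature, what I can spot-check from the Proxmox host is limited to the components that do expose one (sdX is a placeholder):

Code:
# Temperatures exposed by the motherboard/NIC drivers (lm-sensors)
sensors

# Drive temperature as reported by SMART
smartctl -A /dev/sdX | grep -i temperature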

Anyway, no thermal problem can explain how the same LSI card, installed inside a very "tight" SilverStone DS380B, was able to survive for at least two years, 24/7, with almost no airflow around it. In that chassis even my HDDs sometimes reached 45-47°C, and that was with two large fans in front of the HDD cage; the PCIe card had no direct airflow whatsoever. The only difference was the operating system: back then I used OMV, also with ZFS, instead of TrueNAS, as I already explained above, and had no issues.
 