Hello everyone,
Since installing TrueNAS SCALE on both of my servers (one serves as a backup of the other), I've been experiencing problems with pool health. All storage pools on the main server are organized as mirror vdevs, three in total. The main TrueNAS server runs TrueNAS-SCALE-22.12.4.2 and is virtualized on Proxmox (which is installed on two Micron 5300 1.92TB SSDs as a mirror vdev), with the hardware HBA passed through to the VM. The memory allocated to it is 12GB.
Almost all the time, at least one of the pools is in a degraded state, with disks being faulted, unavailable, or removed, usually due to a high number of errors detected by the OS. SMART monitoring tools such as smartctl usually report either errors during internal tests and/or a growing "UDMA_CRC_Error_Count" or "Hardware_ECC_Recovered" count. Below please find the output for one of the affected disks:
Code:
Error 38492 occurred at disk power-on lifetime: 25091 hours (1045 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 51 00 00 00 00 00  Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 05 fe 00 00 00 40 00      00:03:20.036  SET FEATURES [Enable APM]
  60 08 00 10 78 c0 40 00      00:03:19.955  READ FPDMA QUEUED
  60 08 00 78 78 c0 40 00      00:03:19.955  READ FPDMA QUEUED
  60 08 00 38 78 c0 40 00      00:03:19.955  READ FPDMA QUEUED
  60 08 00 18 78 c0 40 00      00:03:19.955  READ FPDMA QUEUED
Once a drive is replaced and resilvering finishes, the pool reports healthy for some time, until it or another pool fails again. I should say that one of my disks did have genuine bad sectors and was successfully replaced. All the others seem to have problems with data transfer. I could find only two likely causes for this behaviour:
1. Problems with the cables. These are relatively cheap mini-SAS-to-4×SATA cables bought from China (AliExpress, eBay).
2. Problems with the HBA card. I have an LSI 9240-8i (recognized as an LSI SAS2008 by lspci), also purchased on AliExpress.
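Since cable faults tend to show up as a rising UDMA_CRC_Error_Count while genuine media faults show up as reallocated sectors, I've started snapshotting the raw values over time to tell the two apart. Here is a minimal sketch of the parsing I do (the helper name and the attribute list are my own choices; in practice I feed it the output of `smartctl -A /dev/sdX` for each disk and diff the snapshots):

```python
# Hypothetical helper: pull selected SMART attribute raw values out of
# `smartctl -A` output. The attribute names below are the ones my drives
# report; adjust the list for your models.
WATCHED = ("UDMA_CRC_Error_Count", "Hardware_ECC_Recovered", "Reallocated_Sector_Ct")

def parse_smart_attributes(smartctl_output):
    """Return {attribute_name: raw_value} for the attributes in WATCHED."""
    values = {}
    for line in smartctl_output.splitlines():
        # smartctl -A table rows look like:
        # ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        fields = line.split()
        if len(fields) >= 10 and fields[1] in WATCHED:
            values[fields[1]] = int(fields[9])
    return values

# Example with a trimmed sample of smartctl -A output (made-up numbers):
sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       12734
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       57
"""
print(parse_smart_attributes(sample))
# → {'Reallocated_Sector_Ct': 0, 'Hardware_ECC_Recovered': 12734, 'UDMA_CRC_Error_Count': 57}
```

A growing CRC count with zero reallocated sectors points at the link (cable, backplane, or HBA port) rather than the platters, which is why I track both together.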
On the backup server, the configuration is slightly different. There I have TrueNAS-SCALE-23.10.2 installed on bare metal (a consumer MSI AM1I board with 12GB RAM). The HDDs are organized as a raidz1 array of four 2TB disks, with the boot disks set up as a mirror vdev. The HDDs are connected via cables of the same quality to a similar HBA card; the boot mirror is connected directly to the motherboard. Here I consistently see failures of two disks. Apart from a "standard" faulty HDD, one of my SSDs has encountered errors as well. I attached it to another PC and checked SMART there, but could not confirm the smartctl result: according to CrystalDiskInfo, the disk is clean. So the two causes I deduced above are questionable. I also have another observation: my current backup server used to serve as the main server, running openmediavault on practically the same hardware. I also set up several mirror vdevs there, and the system ran smoothly for at least two (!) years, with many of those disks later migrated to the current main server.
If anyone has an idea how to trace the source of the problem without breaking the bank, it would be much appreciated.