Help, ZFS keeps failing with pool I/O failures

Constantin.FF

Dabbler
Joined
Apr 6, 2022
Messages
13
Hi everyone,
I have run into a complicated issue with my TrueNAS.
SSD and HDD drives keep failing after a couple of months. I have tested and debugged what I can, without any result. I am posting the details of my configuration (an old PC) and what I have tried so far.

2xSSD - PNY CS900 120GB
2xSSD - Kingston 120GB A400
1xHDD - Toshiba N300 NAS 14 TB

Two SSDs are for the boot-pool. These are still stable and working one year later, with no errors so far.
The other two SSDs I am using for Applications and Kubernetes.
These started out working fine, but after two months one failed the SMART test.
The error was `CRC_Error_Count`. Shortly after that it stopped even being detected by the PC. (I assumed it was a cheap SSD and got it replaced.)
I have changed 3 more SSDs since then, and they fail with the same error.
Now the HDD gets the same error :/

SMART fails with "CRC_Error_Count" and the ZFS error is:
The number of I/O errors associated with a ZFS device exceeded acceptable levels. ZFS has marked the device as faulted.
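
For anyone following along, the raw counter behind that SMART failure can be read directly and compared between runs. This is only a minimal sketch, assuming smartmontools is installed and that the drive reports the counter as attribute ID 199 (often named UDMA_CRC_Error_Count or CRC_Error_Count, depending on the vendor). The helper name `crc_count` and the device path `/dev/sda` are made up for illustration.

```shell
#!/bin/sh
# crc_count: read `smartctl -A` output on stdin and print the raw value
# (10th column) of SMART attribute 199, the interface CRC error counter.
# Hypothetical helper, not part of smartmontools.
crc_count() {
    awk '$1 == 199 { print $10 }'
}

# Example use on a live system (device path is an assumption):
#   smartctl -A /dev/sda | crc_count
#   zpool status -v    # shows which device ZFS has marked as FAULTED
```

A counter that keeps climbing between two readings points at ongoing link errors rather than old ones, since the raw value is cumulative.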

I have tried/tested:
  1. Changed the SATA cables, since CRC_Error_Count should be related to the disk connection. The same error continues.
  2. Got a Dell H310 6Gbps SAS HBA (LSI 9211-8i) PCI-e card. Still the same errors.
  3. Ran MemTest86. RAM is OK.
  4. Booted "Try Ubuntu" and ran the Disk utility. SMART tests (short and long) are good.
  5. Using "Try Ubuntu", I formatted the disk as ext4 and filled it with data. No errors.
  6. After all that I ran TrueNAS, created a pool with the SSDs, and set up some Applications to test. All good, SMART tests ran without errors, until 7-8 hours later when the same errors started again.

Motherboard: ASUS B85M-G, Socket-1150
CPU: Intel i7 3.5 GHz
RAM: Patriot 4 x 8 GB DDR3, 1600 MHz / PC3-12800, CL10, 1.5 V, non ECC
PCI-e: Dell H310 6Gbps SAS HBA LSI 9211-8i
TrueNAS: SCALE-22.12.2

I am out of ideas now.
I would highly appreciate any suggestions.

Thank you.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Check the temps on your H310 using the finger test. These cards need airflow or they melt (figuratively, they shit themselves a long time before actually melting)

Also, it is in IT mode isn't it?
 

Constantin.FF

Temperatures of the H310 seem to be OK.
I believe the issue is not really the PCI-e SAS card, because I get the same result when using the motherboard SATA ports.
 

Constantin.FF

I made some progress here.
First I removed the H310 card, since it does not look like the problem comes from it.
I connected the SSDs to the motherboard again.

The change I made was to the kernel settings: I added libata.force=1.5G,noncq
24 hours of running, with a SMART test every hour, and all good so far.
After that, just as a test, I removed the setting and one SSD immediately failed a SMART test.

One more note: I have tried libata.force=noncq or libata.force=1.5G alone, but got the same errors.

What do you think the hardware problem is?
And would just adding the SATA throttling solve the system stability?
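
To confirm that the kernel parameter is actually in effect, it may help to check the running kernel's command line and the link speed each SATA port negotiated at boot. A rough sketch only: the helper names `libata_force` and `link_speeds` are made up, and the log format assumed is the usual libata line `ataN: SATA link up X Gbps`.

```shell
#!/bin/sh
# libata_force: read a kernel command line on stdin and print the value
# of the libata.force= parameter, if present. Hypothetical helper.
libata_force() {
    tr ' ' '\n' | sed -n 's/^libata\.force=//p'
}

# link_speeds: read kernel log lines on stdin and print "ataN <speed>"
# for every "SATA link up" message. Hypothetical helper.
link_speeds() {
    sed -n 's/.*\(ata[0-9][0-9]*\): SATA link up \([0-9.]* Gbps\).*/\1 \2/p'
}

# Example use on a live system:
#   libata_force < /proc/cmdline
#   dmesg | link_speeds
```

With `libata.force=1.5G` applied, the ports should report 1.5 Gbps instead of 3.0 or 6.0 Gbps.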


I have attached some of the error messages I see in the terminal.
Mainly:
  1. I/O error, dev sda
  2. WRITE FPDMA QUEUED
  3. Unaligned write command
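
For comparing runs with and without the kernel setting, the three messages above can be counted from the kernel log instead of read off screenshots. A minimal sketch, assuming the messages appear in `dmesg` with the wording listed above. The helper name `ata_error_summary` is made up.

```shell
#!/bin/sh
# ata_error_summary: read kernel log lines on stdin and count how many
# match each of the three messages from the list above.
# Hypothetical helper for illustration.
ata_error_summary() {
    awk '
        /I\/O error, dev sd/      { io++ }
        /WRITE FPDMA QUEUED/      { ncq++ }
        /Unaligned write command/ { unaligned++ }
        END {
            printf "io_error=%d fpdma_write=%d unaligned_write=%d\n", io, ncq, unaligned
        }'
}

# Example use on a live system:
#   dmesg | ata_error_summary
```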

Here are some of the links that helped me while looking for a solution:
 

Attachments

  • Screenshot from 2023-06-24 20-41-24.png (973.9 KB)
  • Screenshot from 2023-06-24 20-42-02.png (106.2 KB)
  • Screenshot from 2023-06-24 20-42-20.png (758.9 KB)
  • Screenshot from 2023-06-24 20-42-41.png (699.4 KB)
  • Screenshot from 2023-06-24 20-43-02.png (371 KB)

bew

Cadet
Joined
Nov 2, 2022
Messages
2
Hey Constantin, could you explain how you added these kernel parameters?
 

Constantin.FF

bew said: Hey Constantin, could you explain how you added these kernel parameters?

Here is the command I use so that it is persistent:

Code:
midclt call system.advanced.update '{ "kernel_extra_options": "libata.force=noncq" }'
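
The command above sets only `noncq`, whereas the combination reported as stable earlier in the thread was `libata.force=1.5G,noncq`. A small sketch for building the same JSON argument with a different value follows. The helper name `kernel_opts_payload` is made up, and reading the setting back with `system.advanced.config` is an assumption about the SCALE middleware worth checking on your version.

```shell
#!/bin/sh
# kernel_opts_payload: print the JSON argument for system.advanced.update
# with the given string as kernel_extra_options. Hypothetical helper;
# the value must not contain double quotes or backslashes.
kernel_opts_payload() {
    printf '{ "kernel_extra_options": "%s" }' "$1"
}

# Example use on TrueNAS SCALE (run as root):
#   midclt call system.advanced.update "$(kernel_opts_payload 'libata.force=1.5G,noncq')"
#   midclt call system.advanced.config    # look for kernel_extra_options
#   # after a reboot, the option should show up in: cat /proc/cmdline
```

Kernel command line changes only take effect after a reboot.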
 