Should I scrap my pool and start over?

hybris

Cadet
Joined
Aug 27, 2021
Messages
8
Hello TrueNAS forum!

This is my first post. I would like some advice on what to do with my 7-disk raidz2 pool, which keeps ending up degraded or faulted.

The server is a used Dell R720xd with 64 GB of ECC RAM and an H310 controller, running the latest TrueNAS SCALE beta.
I also have a 4-disk RAID 10 pool on the same server that scrubs with 0 errors.

What I have done is something like this:
1) Created a raidz2 with 5 Toshiba and 2 HGST disks and filled it to 70% with data. All are old, used 4 TB disks.
2) Got a few read/write errors on the 2 HGST disks, and also on one of the Toshiba disks.
3) Did a scrub. Did a clear. Did a scrub. Did a clear. The errors kept coming back. The short SMART test was OK in TrueNAS for all disks.
4) Booted SeaTools; the HGST disks would not let it run SMART tests. Ran long generic tests on all disks and they all passed with no errors.
5) Did a scrub and the errors came back on the HGST disks.
6) Swapped the 2 HGST disks for two new WD Red Pro drives.
7) Resilvered and got read/write/checksum errors on the WD Red Pro drives.
8) Cleared the pool and scrubbed again. This time I got read/write/checksum errors on the Toshiba drive from step 2.
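
To keep track of which disks pick up errors after each scrub or resilver in the steps above, something like the following Python sketch could help. It is only an illustration: it assumes the usual `NAME STATE READ WRITE CKSUM` table in `zpool status` output, and the function names `devices_with_errors` and `read_pool_status` are made up for this example.

```python
import subprocess


def devices_with_errors(status_text):
    """Return {name: (read, write, cksum)} for rows with a nonzero counter.

    Assumes the standard `zpool status` config table. Counters are kept as
    strings because zpool may abbreviate large values (for example "1.2K").
    """
    flagged = {}
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_table = True   # header row: device rows follow
            continue
        if not in_table:
            continue
        if not fields:        # a blank line ends the table
            in_table = False
            continue
        if len(fields) < 5:   # section labels such as "logs" or "spares"
            continue
        name, _state, read, write, cksum = fields[:5]
        if (read, write, cksum) != ("0", "0", "0"):
            flagged[name] = (read, write, cksum)
    return flagged


def read_pool_status(pool):
    """Hypothetical helper: run `zpool status <pool>` and return its text."""
    result = subprocess.run(
        ["zpool", "status", pool],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Calling `devices_with_errors(read_pool_status("tank"))` after each scrub (with `tank` standing in for the real pool name) would give a quick record of whether the errors follow particular disks or move around, which is the pattern that matters here.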

I am now running a long generic test (read-only) on the WD drives in SeaTools.
I will run a memory test tomorrow.

Could it be that the whole pool is messed up due to disk problems? Maybe I had 3 bad disks from the start...?
Other hardware issue?

Any advice?

Thank you.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Have you crossflashed the HBA to IT mode and firmware 20.00.07.00? Failure to do that sometimes causes data integrity issues.
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,829
Are there specific errors the degradation mentions in the logs? I wonder if you might be having cabling / connector issues.
 

hybris

Cadet
Joined
Aug 27, 2021
Messages
8
Have you crossflashed the HBA to IT mode and firmware 20.00.07.00? Failure to do that sometimes causes data integrity issues.

No, I had totally missed that. I have crossflashed it today. Thank you!
 

hybris

Cadet
Joined
Aug 27, 2021
Messages
8
Are there specific errors the degradation mentions in the logs? I wonder if you might be having cabling / connector issues.

Good point, I will check.

It looks like one of my new WD Red Pros had issues. I tested the drive using WD's own software on another machine and it failed a long SMART test.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
No, I had totally missed that. I have crossflashed it today. Thank you!
It looks like one of my new WD Red Pros had issues. I tested the drive using WD's own software on another machine and it failed a long SMART test.

These two things together are potentially a bad combination. One of the reasons we are so specific about a very particular version of the IT firmware, and running IT firmware, is because it's one of the few known combinations of things that properly handle errors and failing disks.

Running something other than an IT-mode HBA has often been found to lead to mishandled errors, which in turn cause a variety of subtle problems. On a normal non-ZFS server these might go unnoticed or be written off as a fluke cosmic ray, but ZFS is highly sensitive to errors.

There is a very good chance that your instability is coming from an interplay of these two issues.
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,829
Well, when you have a moment, let us know just what test it failed. Some tests are far more indicative of impending HDD failure than others. Also, it is fairly common to have early deaths in every HDD cohort. The classic failure curve for hard drives over time is shaped like a bathtub: a relatively high incidence rate early on, very few failures for a long time, and then increasing failures after a few years. Heat, vibration, etc. also have an influence on the long-term health of hard drives.
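
For anyone who wants to see the bathtub shape in numbers, here is a rough Python sketch. It models the failure rate as the sum of a decreasing Weibull hazard (early deaths), a constant floor (random failures), and an increasing Weibull hazard (wear-out). Every parameter below is an assumption picked only to produce the shape; none of it is measured drive data.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (k / lam) * (t / lam) ** (k - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t_years):
    """Illustrative failure rate per year at drive age t_years (t_years > 0).

    All constants are made-up values chosen to give a bathtub shape.
    """
    early = weibull_hazard(t_years, shape=0.5, scale=50.0)   # shape < 1: falls with age
    floor = 0.01                                             # constant random failures
    wear_out = weibull_hazard(t_years, shape=5.0, scale=8.0)  # shape > 1: rises with age
    return early + floor + wear_out


if __name__ == "__main__":
    for age in (0.1, 1, 3, 5, 8):
        print(f"age {age:>4} y: hazard {bathtub_hazard(age):.4f}")
```

With these assumed parameters the rate is high for a brand-new drive, drops to a low plateau in mid-life, and climbs again as the drive ages, which is why a dead drive straight out of a new batch is not that surprising.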
 

hybris

Cadet
Joined
Aug 27, 2021
Messages
8
I tested the WD Red Pro with Data Lifeguard Diagnostics on Windows and selected the long SMART test. It failed straight away with something like "reallocation error count 8 threshold 5 status poor". I will be sending that drive back.
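
The same kind of reallocation counter can also be read on the NAS itself. The Python sketch below is only an illustration and rests on a few assumptions: that `smartctl` from smartmontools 7.0 or newer is installed (for the `-j` JSON output), that the drive reports ATA attribute 5 (Reallocated_Sector_Ct), and that the helper names `reallocated_sectors` and `read_smart_attributes` are invented for this example. It does not reproduce how WD's tool decides "poor".

```python
import json
import subprocess

REALLOCATED_SECTOR_ID = 5  # ATA SMART attribute 5: Reallocated_Sector_Ct


def reallocated_sectors(smart_json):
    """Return the raw reallocated-sector count, or None if not reported.

    Expects the dict parsed from `smartctl -A -j <device>` output.
    """
    table = smart_json.get("ata_smart_attributes", {}).get("table", [])
    for attr in table:
        if attr.get("id") == REALLOCATED_SECTOR_ID:
            return attr.get("raw", {}).get("value")
    return None


def read_smart_attributes(device):
    """Hypothetical helper: run smartctl and parse its JSON output.

    smartctl encodes status bits in its exit code, so a nonzero exit code
    is not treated as an error here.
    """
    result = subprocess.run(
        ["smartctl", "-A", "-j", device],
        capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

For example, `reallocated_sectors(read_smart_attributes("/dev/sda"))` (device path assumed) returning a value that grows between checks would be a reason to look at that drive more closely.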

I have removed that drive and replaced it with another WD Red Pro (tested using the same software). All my WDs are from the same batch :-|

My pool is currently resilvering and I will know in about 7 hours if it has succeeded or not. This is also using the crossflashed H310.
I will see if I can find the logs for the previous failures and have a look at them.

I have never really used TrueNAS before, and in a way I don't mind the learning experience of seeing failure first hand.
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,829
*horrendously paraphrasing iconic movie scene in Apollo 13* This is not a failure, this is TrueNAS' finest moment.

Hardware failures, firmware mismatches, etc. are a fact of life. The payload is (still) alive. Once the resilver is complete and the pool is back to party mode, pop some champagne.
 

hybris

Cadet
Joined
Aug 27, 2021
Messages
8
*horrendously paraphrasing iconic movie scene in Apollo 13* This is not a failure, this is TrueNAS' finest moment.

Hardware failures, firmware mismatches, etc. are a fact of life. The payload is (still) alive. Once the resilver is complete and the pool is back to party mode, pop some champagne.

Haha! I have never seen it like that before :)

Everything is up and running now so I will go back to party mode! Thank you!
 