Pool online in unhealthy state

SiCwan

Dabbler
Joined
Dec 4, 2018
Messages
14
I just set up TrueNAS SCALE last night, created a 24-disk RAIDZ3 pool, and copied 2.8TB to it. When I looked at it this morning I had a warning message on it (shown below), but when I look at the disks I don't see anything that indicates which drive is affected, and the SMART tests all show the drives as "ok". I'm not sure how to troubleshoot this error. I'm currently running long tests on the drives, but they'll take an hour and a half to finish. Does anyone have any pointers on where I should be checking?

CRITICAL
Pool NAS24 state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.
2022-10-01 13:19:19 (America/Chicago)
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
First off, extremely wide vDevs are highly discouraged, as performance can become quite irregular as the pool fills. Having 2 x 12-disk RAID-Z3 or 2 x 12-disk RAID-Z2 vDevs is a much better option.
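To put rough numbers on the layouts mentioned above, here is a minimal sketch in Python. It only counts data disks versus parity disks and ignores RAID-Z padding, metadata, and reserved free space, so real usable capacity will be somewhat lower. The function name is made up for illustration.

```python
def raidz_data_disks(vdevs: int, disks_per_vdev: int, parity: int) -> int:
    """Return how many disks' worth of capacity hold data (not parity)
    in a pool built from identical RAID-Z vdevs."""
    if parity not in (1, 2, 3):
        raise ValueError("RAID-Z parity level must be 1, 2, or 3")
    if disks_per_vdev <= parity:
        raise ValueError("each vdev needs more disks than its parity level")
    return vdevs * (disks_per_vdev - parity)


# The three 24-disk layouts discussed in this thread.
layouts = {
    "1 x 24-disk RAID-Z3": raidz_data_disks(1, 24, 3),
    "2 x 12-disk RAID-Z3": raidz_data_disks(2, 12, 3),
    "2 x 12-disk RAID-Z2": raidz_data_disks(2, 12, 2),
}

for name, data_disks in layouts.items():
    print(f"{name}: {data_disks} data disks, {24 - data_disks} parity disks")
```

The single wide vDev gives the most raw space (21 data disks), while the two-vDev layouts trade 1 to 3 disks of capacity for narrower stripes.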

Second, with ZFS, automatic correction should occur during normal reads whenever a read or checksum error is detected and redundancy is available. Check the pool with zpool status NAS24, which should show you which disk had an issue.
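If the status output is long (24 disks is a lot of rows), a small script can pick out the rows with non-zero READ, WRITE, or CKSUM counters. The following is a minimal Python sketch, not an official tool. The sample text is invented for illustration (the device names sda and sdb and the error count are assumptions, not output from this pool), and the helper names are hypothetical.

```python
import subprocess
from typing import List, Tuple


def zpool_status(pool: str) -> str:
    """Run `zpool status <pool>` and return its text output."""
    result = subprocess.run(
        ["zpool", "status", pool], capture_output=True, text=True, check=True
    )
    return result.stdout


def devices_with_errors(status_text: str) -> List[Tuple[str, str, str, str]]:
    """Return (name, read, write, cksum) for each config row whose
    error counters are not all "0". Pool and vdev rows are included too."""
    flagged = []
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_table = True
            continue
        if not in_table:
            continue
        if fields and fields[0] == "errors:":
            break  # the "errors:" summary line follows the config table
        if len(fields) >= 5 and any(count != "0" for count in fields[2:5]):
            flagged.append((fields[0], fields[2], fields[3], fields[4]))
    return flagged


# Invented sample in the usual `zpool status` layout (illustration only).
SAMPLE = """\
  pool: NAS24
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        NAS24       ONLINE       0     0     0
          raidz3-0  ONLINE       0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     3

errors: No known data errors
"""

for name, read, write, cksum in devices_with_errors(SAMPLE):
    print(f"{name}: read={read} write={write} cksum={cksum}")
```

On the NAS itself, the same check would be devices_with_errors(zpool_status("NAS24")), run with enough privileges to call zpool.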

Disks contain error detection and correction codes, but if they are not good enough to recover a block, the unrecoverable error is returned to the OS.

As a side note, desktop drives can make extreme attempts at recovery, which can take a drive off-line for over a minute while it tries to re-read and recover the affected sector. However, modern RAID systems that find a disk off-line for too long don't want to wait any longer, and simply perform the redundancy recovery, which can complete in less than a second. Thus, NAS drives use time-limited error recovery, TLER (some vendors use a different name for the same concept). The limit is generally 7 seconds, unlike desktop disks, which can take over a minute.
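On SATA drives that support it, smartmontools exposes this timeout as SCT Error Recovery Control through smartctl -l scterc, with the read and write limits given in tenths of a second. That option is ATA-only, so it may not apply to SAS drives. The sketch below only builds the command line; the helper name and the /dev/sdX device path are placeholders, not anything from this thread.

```python
from typing import List, Optional


def scterc_args(device: str, seconds: Optional[float] = None) -> List[str]:
    """Build a smartctl command that reads (seconds=None) or sets the
    SCT Error Recovery Control read and write timeouts for a device."""
    if seconds is None:
        return ["smartctl", "-l", "scterc", device]
    if seconds < 0:
        raise ValueError("timeout must not be negative")
    # smartctl takes the timeouts in units of 100 ms (tenths of a second).
    deciseconds = round(seconds * 10)
    return ["smartctl", "-l", f"scterc,{deciseconds},{deciseconds}", device]


# Query the current setting, then request the typical 7-second limit.
print(" ".join(scterc_args("/dev/sdX")))
print(" ".join(scterc_args("/dev/sdX", 7)))
```

Whether a given drive accepts or remembers the setting across power cycles depends on the drive, so treat this as something to verify rather than assume.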
 

SiCwan

I'll have to make those changes once I get it back up and going. I think the controller I bought pre-flashed in IT mode has died: the server is now taking ~40 min to boot with the card in, and a few times I got a "PCIe link training failure was observed in PCIe Slot 6 and the link is disabled" message.

The drives in question are used enterprise drives, X422_TAL13600A10, which (if I read the S.M.A.R.T. data right) have ~1/2 of their 'max' life hours used; my plan is to slowly replace them over time. They are plugged into a Dell MD1220 via an LSI 9200-8e in IT mode, which I was told was originally a Dell 6Gbps SAS HBA.
 

Arwen

Some servers need better cooling around the LSI controllers. Those LSI controllers usually have a heat sink but no fan. Try adding a small fan to cool just that chip on the LSI controller.

One other note: SATA drives that have a bad block will always return the bad block response for that block until it's spared out (assuming there are spares available...). The sparing process for SATA is weird, but simple. Write the block, and the drive puts the written data into a spare block, then remaps access for the old block so the disk looks exactly the same.

SAS drives have a more complicated sparing process. But it should work with ZFS just fine too.
 

SiCwan

I had the card installed in an R7910, right where the fans blow over the quad (2*10g, 2*1g) network card, so it has some airflow. I'll see if there's a slot with better airflow.
I'm fairly certain the eBay seller sold me a dud card. Looking at this card and at other cards for sale, the chips look a lot different, and it does not look anything like the actual Dell-branded card it's supposed to be. I may just cut my losses, return the card, and see about finding one, or an equivalent, from an actual retailer.
 

Arwen

There have been cases and rumors of counterfeit cards, both LSI HBAs & Intel network cards. It appears that after manufacturers determine a chip has failed tests (like they can't stand the heat :smile:), they throw them away. It is thought that these counterfeiters take these defective chips, perform minimal tests, and make fake cards with those that pass.

There are also rumors of "midnight runs" of factories that produce components, that can then be sold without the primary designer & company knowing about them. And sold without quality control, (and in essence, stealing the raw materials...).

This is one case for not manufacturing in a place where you can't verify what happens in the factory.
 

SiCwan

Pretty sure I got a hold of one. This one doesn't even have the HBA addresses listed on it, there are chips where chips shouldn't be, and oddly enough there was a sticker (clearly put there deliberately) covering up the LSI stamping on the PCB. When I try to check the firmware on it, it reports back that the chip is in a fault state, and I'm not having much luck finding out whether I can fix that on this particular card.
 

SiCwan

I sent a message last night to the eBay seller asking which firmware they flashed on the card, and they canceled the order and sent me a refund, so that pretty much says it all. I decided to go back to my original idea of a 9300-8e (SAS3008) card. I'm looking at this one, since it has a really nice-looking heatsink on it and you mentioned something about them getting hot: https://www.amazon.com/Sparepart-Dell-PWA-12GB-SAS-HBA-T93GD/dp/B01DBRP0QM
 

Arwen

If you check the forums, I think there is a reliable LSI HBA reseller on Evil-Bay, who I think is called the "Art of the Server". Search the forums and read the posts that mention that phrase. (Meaning don't take my second-hand word... or if you do, it's at your own risk.)
 