Data Corruption in zpool - how to proceed?

phomchick

Dabbler
Joined
Oct 2, 2017
Messages
26
Good drives, a little aged and small on the cache side (16MB) but traditional CMR.



Unfortunately, these are SMR drives. They are not guaranteed to be responsible, and they don't have the known sector IDNF issues that the WD Red SMR drives do, but they are potentially a contributing factor if they are timing out commands under load.



May be an issue if it is behaving in RAID mode, still not as ideal as HBA330. Do you recall setting it to HBA Mode in the BIOS/EFI configuration?

dmesg | grep LSI should hopefully pull out the line that's showing the driver being loaded. I expect to see mrsas as the result.
dmesg | grep LSI returns...

Code:
msp0: <Avago Technologies (LSI) SAS2008> port 0xe000-0xe0ff mem 0xf7d00000-0xf7d03fff, 0xf7c80000-0xf7cbffff irq 16 at device 0,0 on pci1
 

phomchick

Dabbler
Joined
Oct 2, 2017
Messages
26
I had the server off over night, and when booting it this morning, there are no errors in any of the pools.

P.S. I really want to thank the folks who have contributed to this thread. You've been very knowledgeable and helpful.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
And the H730 is clearly in HBA mode as the above camcontrol devlist output shows.

I don't see anything that says that. Merely printing the device model is not any sort of proof that it is an HBA, as a number of RAID controllers do this as well (such as a bunch of the 3Ware and Highpoint cards, and I believe also the Areca). I explicitly state this in the HBA vs RAID card resource:

Note that having device names in "camcontrol devlist" and getting "smartctl" results is not any sort of proof that you do actually have an HBA instead of a RAID card. It's just an easy test that weeds out a wide range of RAID cards.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
msp0: <Avago Technologies (LSI) SAS2008> port 0xe000-0xe0ff mem 0xf7d00000-0xf7d03fff, 0xf7c80000-0xf7cbffff irq 16 at device 0,0 on pci1
An SAS2008? That's not the H730, it looks like you have a separate HBA in there, and that raises even more questions.
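
For anyone following along, the chip name in that probe line can be pulled out programmatically. This is only an illustrative sketch: `parse_probe_line` and the "generation" label are names made up here, and reading the first digit after "SAS" as the chip generation is a rough assumption, not a lookup against any vendor table.

```python
import re

# Probe line exactly as posted in this thread.
LINE = ("msp0: <Avago Technologies (LSI) SAS2008> port 0xe000-0xe0ff "
        "mem 0xf7d00000-0xf7d03fff, 0xf7c80000-0xf7cbffff irq 16 "
        "at device 0,0 on pci1")

# "<name><unit>: <description>" at the start of a probe line.
PROBE_RE = re.compile(r"^(?P<dev>[a-z]+)(?P<unit>\d+): <(?P<desc>[^>]+)>")
# A chip name such as SAS2008; the first digit is treated as the generation (assumption).
CHIP_RE = re.compile(r"\bSAS(?P<gen>\d)\d{3}\b")


def parse_probe_line(line):
    """Return device name, unit, description and SAS chip, or None if the line does not match."""
    m = PROBE_RE.match(line)
    if m is None:
        return None
    chip = CHIP_RE.search(m.group("desc"))
    return {
        "device": m.group("dev"),
        "unit": int(m.group("unit")),
        "description": m.group("desc"),
        "chip": chip.group(0) if chip else None,
        "generation": f"SAS{chip.group('gen')}" if chip else None,
    }


info = parse_probe_line(LINE)
print(info["chip"], info["generation"])
```

On the posted line this reports a SAS2008 chip, which is the detail that does not fit an H730.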
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
@Ericloewe is correct. That is reporting as a previous-generation card.
It also says

Which is probably the mini PCIe port (not sure, and I don't have an R730 to check)

But you clearly said it's:
A PERC H730 Mini.

I am unaware of any SAS2008 chip cards that fit in the mini port on the R730, only the R720.
Are you sure you don't have an R720?

The PERC H310 Mini and H710 Mini would report that way, and IIRC the mounting mechanism is different between the R720 and R730.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
The PERC H310 Mini and H710 Mini would report that way, and IIRC the mounting mechanism is different between the R720 and R730.
Yeah, the Gen 13s have a proprietary stacking connector, and I think the Gen 12s have a fairly standard PCIe slot, but internal with non-standard mounts?
 

phomchick

Dabbler
Joined
Oct 2, 2017
Messages
26
iDRAC reports the H730. I just ordered an HBA330, and when it gets here I'll take things apart and do a visual inventory. (I have a vertical rack, and it is a pain to get to and disassemble.)
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
While having a look is relevant, you can also piece this together with lspci, plus sas2ircu and sas3ircu; the latter two have an option to print out the SAS topology.
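
A rough way to gather that output in one go is sketched below. Treat it as an assumption-laden example: controller index `0` is a guess (the `LIST` subcommand shows the real indices), `DISPLAY` is the subcommand I would expect to show the attached devices, and `run_available` is just a helper name invented for this sketch. Tools that are not installed are skipped rather than treated as errors.

```python
import shutil
import subprocess

# Controller index 0 is an assumption; "LIST" shows the real indices.
COMMANDS = [
    ["lspci"],
    ["sas2ircu", "LIST"],
    ["sas2ircu", "0", "DISPLAY"],
    ["sas3ircu", "LIST"],
    ["sas3ircu", "0", "DISPLAY"],
]


def run_available(commands):
    """Run each command whose binary is on PATH.

    Returns a dict mapping the command string to its stdout,
    or to None when the tool is not installed.
    """
    results = {}
    for argv in commands:
        key = " ".join(argv)
        if shutil.which(argv[0]) is None:
            results[key] = None
            continue
        proc = subprocess.run(argv, capture_output=True, text=True, check=False)
        results[key] = proc.stdout
    return results


if __name__ == "__main__":
    for cmd, out in run_available(COMMANDS).items():
        print(f"== {cmd} ==")
        print(out if out is not None else "(not installed)")
```

On a FreeBSD-based system without lspci, `pciconf -lv` gives a comparable PCI device listing.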
 

phomchick

Dabbler
Joined
Oct 2, 2017
Messages
26
My HBA330 arrived and I opened up the server to replace the existing RAID controller, and it was an H730 Mini Mono just as reported. But replacing the H730 with an HBA330 doesn't seem to have changed anything; I still get checksum errors. And, very strangely, they seem to be related to one VDEV:

Code:
SCRUB
Status: FINISHED
Errors: 7
Date: 2023-07-20 11:06:56

Name      Read  Write  Checksum  Status
FREENAS
RAIDZ2
da4p2     0     0      0         ONLINE
da2p2     0     0      1         ONLINE
da7p2     0     0      1         ONLINE
da5p2     0     0      1         ONLINE
da3p2     0     0      1         ONLINE
da6p2     0     0      1         ONLINE
RAIDZ2
da8p2     0     0      0         ONLINE
da9p2     0     0      0         ONLINE
da12p2    0     0      0         ONLINE
da11p2    0     0      0         ONLINE
da10p2    0     0      0         ONLINE
da13p2    0     0      0         ONLINE


I guess the next thing to change is to move the drives around to other slots and see if the problem moves with the drives, or stays in the slots.
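
To make the "one VDEV" pattern easier to see, here is a small sketch that groups the device rows from the report above by vdev and adds up the checksum column. The function name `checksum_errors_by_vdev` and the whitespace-separated row layout are assumptions made for this example; it is not a parser for real `zpool status` output.

```python
# Device rows copied from the scrub report above:
# name, read, write, checksum, status.
REPORT = """\
RAIDZ2
da4p2 0 0 0 ONLINE
da2p2 0 0 1 ONLINE
da7p2 0 0 1 ONLINE
da5p2 0 0 1 ONLINE
da3p2 0 0 1 ONLINE
da6p2 0 0 1 ONLINE
RAIDZ2
da8p2 0 0 0 ONLINE
da9p2 0 0 0 ONLINE
da12p2 0 0 0 ONLINE
da11p2 0 0 0 ONLINE
da10p2 0 0 0 ONLINE
da13p2 0 0 0 ONLINE
"""


def checksum_errors_by_vdev(report):
    """Group device rows under each vdev header and collect devices with checksum errors."""
    vdevs = []
    for line in report.splitlines():
        fields = line.split()
        if not fields:
            continue
        if len(fields) == 1:
            # A line with a single field starts a new vdev (e.g. "RAIDZ2").
            vdevs.append({"type": fields[0], "total": 0, "bad": []})
        elif len(fields) == 5 and vdevs:
            name, _read, _write, cksum, _status = fields
            vdevs[-1]["total"] += int(cksum)
            if int(cksum) > 0:
                vdevs[-1]["bad"].append(name)
    return vdevs


for index, vdev in enumerate(checksum_errors_by_vdev(REPORT)):
    print(index, vdev["type"], vdev["total"], vdev["bad"])
```

Run against the rows as posted, every device with a non-zero checksum count falls in the first RAIDZ2 vdev and none in the second. Re-running the same grouping after moving drives between slots would show whether the errors follow the drives or stay with the bays.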
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Clear it and run a new scrub. It could be that there really was a checksum error caused by a burp of the RAID controller; this is not uncommon at least on the older LSI RAID controllers. It's one of the reasons we're so harsh on the use of RAID controllers. They do messed-up things. It could be that you are just nearing the end of your troubles.

If you can, you should really really REALLY pull all the data off this NAS, wipe the pool, create a new pool, and then reload your data. This is not always easy and I am aware of that. But it would be a really healthy, positive move.
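
The clear-and-rescrub step suggested above comes down to three commands. The sketch below only builds and prints them; `clear_and_rescrub` is a helper name invented here, and `FREENAS` is assumed to be the pool name because it is the top-level entry in the scrub report earlier in the thread.

```python
import shlex


def clear_and_rescrub(pool):
    """Return the zpool commands, in order: clear error counters, start a scrub, check status."""
    return [
        ["zpool", "clear", pool],
        ["zpool", "scrub", pool],
        ["zpool", "status", "-v", pool],
    ]


# "FREENAS" is assumed to be the pool name (top-level entry in the posted report).
for argv in clear_and_rescrub("FREENAS"):
    print(shlex.join(argv))
```

The scrub runs in the background, so the status command is only meaningful once the scrub has finished.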
 

phomchick

Dabbler
Joined
Oct 2, 2017
Messages
26
I cleared the errors and ran a new scrub. No luck. The same six drives still show checksum errors.

Most of the data on this server is backups from other computers, so I set up a replication job to copy the only unique data to another computer so I could zap this pool and start over, but the replication job core-dumped. I'm waiting for a diagnosis from a TrueNAS engineer.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Both CORE and SCALE recently had a small update related to replicating encrypted datasets.
 