Data Corruption in zpool - how to proceed?

phomchick · Jun 8, 2023

HoneyBadger said:
Good drives, a little aged and small on the cache side (16MB) but traditional CMR.

Unfortunately, these are SMR drives. Not guaranteed to be responsible, and don't have the known issues with sector IDNF that WD Red SMR does, but potentially a contributing factor if they are timing out commands under load.

May be an issue if it is behaving in RAID mode, still not as ideal as HBA330. Do you recall setting it to HBA Mode in the BIOS/EFI configuration?

dmesg | grep LSI should hopefully pull out the line that's showing the driver being loaded. I expect to see mrsas as the result.

dmesg | grep LSI returns...

Code:

msp0: <Avago Technologies (LSI) SAS2008> port 0xe000-0xe0ff mem 0xf7d00000-0xf7d03fff, 0xf7c80000-0xf7cbffff irq 16 at device 0,0 on pci1
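For anyone repeating this check, here is a minimal sketch of pulling the driver name out of an attach line like the one above. The helper name `parse_attach_line` and the regular expression are illustrative assumptions based on this single line, not part of any TrueNAS tooling; the sample is copied verbatim from the output above.

```python
import re

# Hypothetical helper: split a FreeBSD-style attach line into
# (driver, unit, description). The pattern is an assumption derived from
# the one line posted in this thread, not a general dmesg grammar.
ATTACH_RE = re.compile(r"^([A-Za-z_]+)(\d+): <([^>]*)>")


def parse_attach_line(line):
    """Return (driver, unit, description), or None if the line does not match."""
    match = ATTACH_RE.match(line.strip())
    if match is None:
        return None
    driver, unit, description = match.groups()
    return driver, int(unit), description


# Sample copied verbatim from the dmesg output in the post above.
sample = (
    "msp0: <Avago Technologies (LSI) SAS2008> port 0xe000-0xe0ff "
    "mem 0xf7d00000-0xf7d03fff, 0xf7c80000-0xf7cbffff irq 16 "
    "at device 0,0 on pci1"
)

parsed = parse_attach_line(sample)
print(parsed)

# The quoted post expected "mrsas" for the H730; flag anything else.
if parsed is not None and parsed[0] != "mrsas":
    print("driver is", parsed[0], "rather than the expected mrsas")
```

Comparing the reported driver and description against what the quoted post expected makes a mismatch easy to spot.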

phomchick · Jun 8, 2023

I had the server off overnight, and when I booted it this morning, none of the pools showed any errors.

P.S. I really want to thank the folks who have contributed to this thread. You've been very knowledgeable and helpful.

jgreco · Jun 8, 2023

phomchick said:
And the H730 is clearly in HBA mode as the above camcontrol devlist output shows.

I don't see anything that says that. Merely printing the device model is not any sort of proof that it is an HBA, as a number of RAID controllers do this as well (such as a bunch of the 3Ware and HighPoint cards, and I believe also the Areca). I explicitly state this in the HBA vs RAID card resource:

Note that having device names in "camcontrol devlist" and getting "smartctl" results is not any sort of proof that you do actually have an HBA instead of a RAID card. It's just an easy test that weeds out a wide range of RAID cards.

Ericloewe · Jun 8, 2023

phomchick said:
msp0: <Avago Technologies (LSI) SAS2008> port 0xe000-0xe0ff mem 0xf7d00000-0xf7d03fff, 0xf7c80000-0xf7cbffff irq 16 at device 0,0 on pci1

An SAS2008? That's not the H730, it looks like you have a separate HBA in there, and that raises even more questions.

NickF · Jun 8, 2023

@Ericloewe is correct. That is reporting as a previous-generation card.
It also says

pci1

Which is probably the mini PCIe port (not sure, and I don't have an R730 to check)

But you clearly said it's:

A PERC H730 Mini.

I am unaware of any SAS2008 chip cards that fit in the mini port on the R730, only the R720.
Are you sure you don't have an R720?

The PERC H310 Mini and H710 mini would report that way....and IIRC the mounting mechanism is different between R720 and R730.

Ericloewe · Jun 8, 2023

NickF said:
The PERC H310 Mini and H710 mini would report that way....and IIRC the mounting mechanism is different between R720 and R730.

Yeah, the Gen 13s have a proprietary stacking connector, and I think the Gen 12s have a fairly standard PCIe slot, but internal with non-standard mounts?

phomchick · Jun 9, 2023

iDRAC reports the H730. I just ordered an HBA330, and when it gets here I'll take things apart and do a visual inventory. (I have a vertical rack, and it is a pain to get to and disassemble.)

Ericloewe · Jun 10, 2023

While having a look is relevant, you can also piece this together with lspci, plus sas2ircu and sas3ircu; the latter two have an option to print out the SAS topology.

phomchick · Jul 21, 2023

My HBA330 arrived and I opened up the server to replace the existing RAID controller; it was an H730 Mini Mono, just as reported. But replacing the H730 with the HBA330 doesn't seem to have changed anything: I still get checksum errors. And, very strangely, they seem to be confined to one VDEV:

Code:

SCRUB
Status: FINISHED
Errors: 7
Date: 2023-07-20 11:06:56

Name      Read  Write  Checksum  Status
FREENAS
RAIDZ2
da4p2     0     0      0         ONLINE
da2p2     0     0      1         ONLINE
da7p2     0     0      1         ONLINE
da5p2     0     0      1         ONLINE
da3p2     0     0      1         ONLINE
da6p2     0     0      1         ONLINE
RAIDZ2
da8p2     0     0      0         ONLINE
da9p2     0     0      0         ONLINE
da12p2    0     0      0         ONLINE
da11p2    0     0      0         ONLINE
da10p2    0     0      0         ONLINE
da13p2    0     0      0         ONLINE

I guess the next thing to change is to move the drives around to other slots and see if the problem moves with the drives, or stays in the slots.
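To make the per-VDEV pattern easier to compare after drives are moved between slots, a small sketch like the following could total the checksum column for each RAIDZ group. The function name `checksum_errors_by_vdev` and the assumption that the report is whitespace-separated (name, read, write, checksum, status) are mine; the sample rows are copied from the scrub report above.

```python
# Sample rows copied from the scrub report above. The layout is assumed to be
# whitespace-separated columns: name, read, write, checksum, status.
SCRUB_TABLE = """\
Name     Read     Write     Checksum     Status
FREENAS
RAIDZ2
da4p2    0    0    0        ONLINE
da2p2    0    0    1        ONLINE
da7p2    0    0    1        ONLINE
da5p2    0    0    1        ONLINE
da3p2    0    0    1        ONLINE
da6p2    0    0    1        ONLINE
RAIDZ2
da8p2    0    0    0        ONLINE
da9p2    0    0    0        ONLINE
da12p2    0    0    0        ONLINE
da11p2    0    0    0        ONLINE
da10p2    0    0    0        ONLINE
da13p2    0    0    0        ONLINE
"""


def checksum_errors_by_vdev(table):
    """Sum the checksum column per RAIDZ group, in order of appearance.

    Both groups are labelled "RAIDZ2" in the report, so a running index is
    appended to tell them apart (RAIDZ2-0, RAIDZ2-1, ...).
    """
    totals = {}
    current = None
    for line in table.splitlines():
        fields = line.split()
        if not fields:
            continue
        if len(fields) == 1 and fields[0].upper().startswith("RAIDZ"):
            # A bare RAIDZ label starts a new group.
            current = f"{fields[0]}-{len(totals)}"
            totals[current] = 0
        elif current is not None and len(fields) == 5 and fields[3].isdigit():
            # A device row: add its checksum count to the current group.
            totals[current] += int(fields[3])
    return totals


print(checksum_errors_by_vdev(SCRUB_TABLE))
```

Running the same function on the report from a later scrub, after swapping slots, would show whether the errors follow the drives or stay with the first group.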

jgreco · Jul 21, 2023

Clear it and run a new scrub. It could be that there really was a checksum error caused by a burp of the RAID controller; this is not uncommon at least on the older LSI RAID controllers. It's one of the reasons we're so harsh on the use of RAID controllers. They do messed-up things. It could be that you are just nearing the end of your troubles.

If you can, you should really really REALLY pull all the data off this NAS, wipe the pool, create a new pool, and then reload your data. This is not always easy and I am aware of that. But it would be a really healthy, positive move.
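As a rough illustration of the clear-and-rescrub step, the sketch below builds the usual `zpool clear`, `zpool scrub`, and `zpool status -v` sequence. The pool name `FREENAS` is an assumption taken from the top row of the scrub report earlier in the thread, the helper names are hypothetical, and the script only prints the commands unless `dry_run=False` is passed explicitly.

```python
import subprocess

# Assumption: the pool is named as in the scrub report posted in this thread.
POOL = "FREENAS"


def clear_and_scrub_commands(pool):
    """Return the command sequence: clear error counters, scrub, show status."""
    return [
        ["zpool", "clear", pool],
        ["zpool", "scrub", pool],
        ["zpool", "status", "-v", pool],
    ]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print("+", " ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# Dry run by default: nothing is executed, the commands are only listed.
run(clear_and_scrub_commands(POOL))
```

The scrub runs in the background, so the status output right after starting it only shows progress; the checksum counters are worth reading once the scrub reports that it has finished.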

phomchick · Jul 29, 2023

I cleared the errors and ran a new scrub. No luck. The same six drives still show checksum errors.

Most of the data on this server is backups from other computers, so I set up a replication job to copy the only unique data to another computer. The plan was to zap this pool and start over, but the replication job core-dumped. I'm waiting for a diagnosis from a TrueNAS engineer.

NugentS · Jul 29, 2023

Both CORE and SCALE recently had a small update related to replicating encrypted datasets.
