Random checksum issues. Zpool status reports permanent errors


RAIDTester

Dabbler
Joined
Jan 23, 2017
Messages
45
We've been having checksum issues on iSCSI volumes created in our 60-disk Storinator.
Here's what we saw:
We first got checksum errors on the pool, on random disks, and they kept propagating through all the disks. No read or write errors, just CKSUM.
I destroyed the whole pool and recreated it. The errors continued. I left it running anyway and kept writing heavily to the disks, which caused watchdog restarts.
I then started to think it was a bad PSU, but I have no way to confirm that, since there are three in the box.
I destroyed and re-created the pool again and still had issues. Since the pool had 28 of its 30 disks on one Rocket 750 card, it was suggested that I move the card to another slot.
Before I got a chance to do that, I created another zvol on the disks behind the OTHER Rocket 750 card in the box. Again, zpool status -v showed permanent errors and checksum errors, as seen below.

Even after destroying the iSCSI extent, the error lingered (even after a restart). I believe that's the <0x3b>:<0x1> error.
I created another zvol and another iSCSI extent called r11, and it showed the same errors.
The errors don't go away after a scrub either.
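
For reference, this is roughly the sequence I've been running between attempts (a minimal sketch; the pool name r10 is mine, the commands are stock ZFS):
Code:
# Re-check the pool and the list of errored objects
zpool status -v r10

# Clear the error counters, then scrub and re-check
zpool clear r10
zpool scrub r10
zpool status -v r10   # the CKSUM counts and "permanent errors" come back after the scrub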

Can anyone help? Is something wrong?
PS: We recently upgraded from FreeNAS 9.3 to FreeNAS 9.10.

Code:
  pool: r10
 state: ONLINE
status: One or more devices has experienced an error resulting in data
		corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
		entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 26h49m with 0 errors on Mon Mar 27 02:49:22 2017
config:

		NAME											STATE	 READ WRITE CKSUM
		r10											 ONLINE	   0	 0	 1
		  mirror-0									  ONLINE	   0	 0	 0
			gptid/f67708c5-f2b8-11e6-9e1d-0cc47a7693ea  ONLINE	   0	 0	 0
			gptid/f6fefadb-f2b8-11e6-9e1d-0cc47a7693ea  ONLINE	   0	 0	 0
		  mirror-1									  ONLINE	   0	 0	 0
			gptid/f7a4b416-f2b8-11e6-9e1d-0cc47a7693ea  ONLINE	   0	 0	 0
			gptid/f8361f95-f2b8-11e6-9e1d-0cc47a7693ea  ONLINE	   0	 0	 0
		  mirror-2									  ONLINE	   0	 0	 0
			gptid/f8d20cd2-f2b8-11e6-9e1d-0cc47a7693ea  ONLINE	   0	 0	 0
			gptid/f964d99e-f2b8-11e6-9e1d-0cc47a7693ea  ONLINE	   0	 0	 0
		  mirror-3									  ONLINE	   0	 0	 0
			gptid/fa09005a-f2b8-11e6-9e1d-0cc47a7693ea  ONLINE	   0	 0	 0
			gptid/fa9c8222-f2b8-11e6-9e1d-0cc47a7693ea  ONLINE	   0	 0	 0
		  mirror-4									  ONLINE	   0	 0	 0
			gptid/fb43b7ae-f2b8-11e6-9e1d-0cc47a7693ea  ONLINE	   0	 0	 0
			gptid/fbdb9175-f2b8-11e6-9e1d-0cc47a7693ea  ONLINE	   0	 0	 0
		  mirror-5									  ONLINE	   0	 0	 0
			gptid/fc839c48-f2b8-11e6-9e1d-0cc47a7693ea  ONLINE	   0	 0	 0
			gptid/fd1841d8-f2b8-11e6-9e1d-0cc47a7693ea  ONLINE	   0	 0	 0
		  mirror-6									  ONLINE	   0	 0	 0
			gptid/fdae814b-f2b8-11e6-9e1d-0cc47a7693ea  ONLINE	   0	 0	 0
			gptid/fe433f02-f2b8-11e6-9e1d-0cc47a7693ea  ONLINE	   0	 0	 0
		  mirror-7									  ONLINE	   0	 0	 0
			gptid/fef93644-f2b8-11e6-9e1d-0cc47a7693ea  ONLINE	   0	 0	 0
			gptid/ff91f748-f2b8-11e6-9e1d-0cc47a7693ea  ONLINE	   0	 0	 0
		  mirror-8									  ONLINE	   0	 0	 2
			gptid/00430ca2-f2b9-11e6-9e1d-0cc47a7693ea  ONLINE	   0	 0	 2
			gptid/00dbc3eb-f2b9-11e6-9e1d-0cc47a7693ea  ONLINE	   0	 0	 2
		  mirror-9									  ONLINE	   0	 0	 0
			gptid/018e1a90-f2b9-11e6-9e1d-0cc47a7693ea  ONLINE	   0	 0	 0
			gptid/022bb531-f2b9-11e6-9e1d-0cc47a7693ea  ONLINE	   0	 0	 0
		  mirror-10									 ONLINE	   0	 0	 0
			gptid/02e30853-f2b9-11e6-9e1d-0cc47a7693ea  ONLINE	   0	 0	 0
			gptid/037e42d8-f2b9-11e6-9e1d-0cc47a7693ea  ONLINE	   0	 0	 0
		  mirror-11									 ONLINE	   0	 0	 0
			gptid/0437d874-f2b9-11e6-9e1d-0cc47a7693ea  ONLINE	   0	 0	 0
			gptid/04d9b66f-f2b9-11e6-9e1d-0cc47a7693ea  ONLINE	   0	 0	 0
		  mirror-12									 ONLINE	   0	 0	 0
			gptid/05945a15-f2b9-11e6-9e1d-0cc47a7693ea  ONLINE	   0	 0	 0
			gptid/063035f2-f2b9-11e6-9e1d-0cc47a7693ea  ONLINE	   0	 0	 0
		  mirror-13									 ONLINE	   0	 0	 0
			gptid/06f56fa9-f2b9-11e6-9e1d-0cc47a7693ea  ONLINE	   0	 0	 0
			gptid/079b0b24-f2b9-11e6-9e1d-0cc47a7693ea  ONLINE	   0	 0	 0
		  mirror-14									 ONLINE	   0	 0	 0
			gptid/08600f7c-f2b9-11e6-9e1d-0cc47a7693ea  ONLINE	   0	 0	 0
			gptid/08ffd886-f2b9-11e6-9e1d-0cc47a7693ea  ONLINE	   0	 0	 0

errors: Permanent errors have been detected in the following files:

		<0x3b>:<0x1>
		r10/r11:<0x1>


 

Vito Reiter

Wise in the Ways of Science
Joined
Jan 18, 2017
Messages
232
Can you list the rest of your hardware specs? Boards, RAM, CPU, etc. :) We might be able to help a little more.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Rocket 750. Yeah, that thing never ends well.
 
Vito Reiter

Wise in the Ways of Science
Joined
Jan 18, 2017
Messages
232
Aside from the possibility of a card or port failure, isn't it a little unsafe to run that many drives in RAID 10? If, by some awful stroke of luck, both disks in a mirror went bad (checksum errors and all :eek:), you'd lose the whole pool.
 

RAIDTester

Dabbler
Joined
Jan 23, 2017
Messages
45
Aside from the possibility of a card or port failure, isn't it a little unsafe to run that many drives in RAID 10? If, by some awful stroke of luck, both disks in a mirror went bad (checksum errors and all :eek:), you'd lose the whole pool.
We were testing RAID 10 because we had horrible performance with our previous setup using RAIDZ2. After panicking over exactly the issue you mention when I saw the failures, I destroyed the array and rebuilt it as RAIDZ2. I'm actually seeing surprisingly good performance (all things considered) from RAIDZ2 configured as 5 sets of 6 drives.
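
In case the layout matters, it's roughly this (a sketch with an assumed pool name and illustrative da0-da29 device names; the real vdevs were built from the GPT labels):
Code:
# 5 RAIDZ2 vdevs of 6 drives each (names are placeholders)
zpool create tank \
  raidz2 da0  da1  da2  da3  da4  da5  \
  raidz2 da6  da7  da8  da9  da10 da11 \
  raidz2 da12 da13 da14 da15 da16 da17 \
  raidz2 da18 da19 da20 da21 da22 da23 \
  raidz2 da24 da25 da26 da27 da28 da29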
 

RAIDTester

Dabbler
Joined
Jan 23, 2017
Messages
45
Rocket 750. Yeah, that thing never ends well.
45Drives actually sent us LSI cards to test with, but they don't do staggered spinup, so we would need to upgrade the PSUs. 60 drives pull far too much current when they power up simultaneously.
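
Rough numbers, using assumed per-drive figures rather than anything measured on this box:
Code:
# A typical 3.5" drive draws roughly 1.5-2 A on the 12 V rail while spinning up.
# Spinning up 60 drives at once:
#   60 drives x ~2 A x 12 V  ~=  1400+ W transient on the 12 V rail alone,
# before the board, CPUs, RAM and HBAs are even counted.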
 

Vito Reiter

Wise in the Ways of Science
Joined
Jan 18, 2017
Messages
232
We were testing RAID 10 because we had horrible performance with our previous setup using RAIDZ2. After panicking over exactly the issue you mention when I saw the failures, I destroyed the array and rebuilt it as RAIDZ2. I'm actually seeing surprisingly good performance (all things considered) from RAIDZ2 configured as 5 sets of 6 drives.

A 6 x 10-drive RAIDZ3 layout would be about the most secure you could get, and better than that setup. I mean, your setup must pull a ton of power too. It's definitely possible that those drives just failed together or wrote corrupted data to each other.
 

RAIDTester

Dabbler
Joined
Jan 23, 2017
Messages
45
A 6 x 10-drive RAIDZ3 layout would be about the most secure you could get, and better than that setup. I mean, your setup must pull a ton of power too. It's definitely possible that those drives just failed together or wrote corrupted data to each other.

We mostly use this machine with only 30 drives being hit hard at once, but we have used all 60 in the past without a problem. There are three PSUs in the machine; I believe only two are needed and the third is for redundancy, so it would take a dual PSU failure to cause this.

The checksum error is unclear to me, since it isn't in the read or write column. What exactly does it represent? The drives themselves all came back fine after running extended self-tests on every one of them.
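
For the record, the self-tests were along these lines (a sketch; /dev/da0 stands in for each disk in turn):
Code:
# Start a long (extended) self-test on a disk
smartctl -t long /dev/da0

# Later, check the self-test log and the overall health verdict
smartctl -l selftest /dev/da0
smartctl -H /dev/da0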
 

Vito Reiter

Wise in the Ways of Science
Joined
Jan 18, 2017
Messages
232
I didn't have a read or write error; it simply received corrupted data somehow and recognized it, but couldn't correct it.
 

RAIDTester

Dabbler
Joined
Jan 23, 2017
Messages
45
I didn't have a read or write error; it simply received corrupted data somehow and recognized it, but couldn't correct it.
That's weird. Could this be due to iSCSI sitting in between, or a network issue like iSCSI flipping between sessions on two different cards?
 

Vito Reiter

Wise in the Ways of Science
Joined
Jan 18, 2017
Messages
232
I don't know if that's the case; you'd need to switch drives between ports, swap cables, and move drives between cards before you could rule any of that out. You have ECC RAM, so we can assume the corruption happened after the data passed through that.
 

RAIDTester

Dabbler
Joined
Jan 23, 2017
Messages
45
I don't know if that's the case; you'd need to switch drives between ports, swap cables, and move drives between cards before you could rule any of that out. You have ECC RAM, so we can assume the corruption happened after the data passed through that.
Sounds fun...
I've been researching and checked my disk temps. I'm seeing a bunch of drives in the 40-44 °C range. Could high temperatures cause checksum errors?
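
In case it's useful, this is roughly how I pulled the temps (a sketch; it assumes smartmontools is installed and the disks are listed in kern.disks):
Code:
# Print the SMART temperature attribute for every disk
for d in $(sysctl -n kern.disks); do
  echo -n "$d: "
  smartctl -A /dev/$d | grep -i temperature_celsius
done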
 

Vito Reiter

Wise in the Ways of Science
Joined
Jan 18, 2017
Messages
232
Not necessarily in that range. ZFS is cool though: just move the drives that are showing errors to a different slot. If the drive that takes their place starts getting errors, or the drive that had errors stops getting them, you have a bad port or cable.
 

RAIDTester

Dabbler
Joined
Jan 23, 2017
Messages
45
There are no specific drives with errors. If I leave it long enough, every disk will show errors. It starts randomly and propagates through the whole pool.
 

Vito Reiter

Wise in the Ways of Science
Joined
Jan 18, 2017
Messages
232
Either an HBA card failed (unlikely, if it's happening on both cards), or you don't have enough power. It takes a ton of power to spin up that machine and feed the cards, two processors, and 128 GB of RAM. I used a PSU calculator and it came out to something like 400 W, which I don't trust at all, but you kind of have to eliminate things one by one. Have you noticed any data actually being corrupted? Also, check all the drive loads, CPU load, RAM usage, etc.
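
For the load checks, I just mean the stock tools (nothing exotic; the pool name r10 is taken from your status output above):
Code:
zpool iostat -v r10 5   # per-vdev I/O every 5 seconds while the workload runs
gstat -p                # per-disk busy percentage and latency
top -SH                 # CPU and memory use, including kernel/ZFS threads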
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
45Drives actually sent us LSI cards to test with, but they don't do staggered spinup
  1. Of course they do, it's in the boot ROM
  2. Staggered spinup is not a good thing to have to rely on
 

RAIDTester

Dabbler
Joined
Jan 23, 2017
Messages
45
Either an HBA card failed (unlikely, if it's happening on both cards), or you don't have enough power. It takes a ton of power to spin up that machine and feed the cards, two processors, and 128 GB of RAM. I used a PSU calculator and it came out to something like 400 W, which I don't trust at all, but you kind of have to eliminate things one by one. Have you noticed any data actually being corrupted? Also, check all the drive loads, CPU load, RAM usage, etc.

I think we're running 3 x 950 W PSUs. I haven't noticed any actual corruption. It's an iSCSI volume exported to Windows, with a .vhdx file storing the actual data for a VM. There are a lot of layers of abstraction, but the data seems OK.
 

RAIDTester

Dabbler
Joined
Jan 23, 2017
Messages
45
  1. Of course they do, it's in the boot ROM
  2. Staggered spinup is not a good thing to have to rely on

1. True, but the internal wiring in the Storinator won't work with the LSI style of staggered spinup; I forget the details. With the Rocket, the drives are actually held powered off until the Rocket turns them on with a reset or something similar.
2. Agreed, but we've had so much trouble getting this box to work that I doubt we're going to invest in upgrading the PSUs.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Your problem was the RAID cards. Throw them in the trash and use a proper HBA.

Sent from my Nexus 5X using Tapatalk
 