Pool keeps failing a specific disk, in a specific RAIDZ2

DumSkidderik

Cadet
Joined
Sep 25, 2020
Messages
3
Hi folks!

I'm trying to set up some storage on some old, obsolete SuperMicro hardware that I got for free by hauling it off the property so the former company didn't have to ( :D )
It was running Nexenta a few years ago, then got shut off due to being replaced with newer hardware.

chrome_jMthWIbX5M.png

It's composed entirely of 300 GB Seagate SAS drives, LSI controllers of some sort, and one SSD for log.
~90 GB RAM, Intel Xeon 8-core something or other from around 7 years ago.
Compression and dedupe are enabled.

The strange thing is, the last of the RAIDZ2 vdevs keeps having one disk go bad with checksum errors.
No SMART errors reported.
If I replace the disk, it'll resilver fine and then the checksum errors start again.
This will happen even if the disk is in another spot in the enclosure, or even in another enclosure!

Along with this, /var/log/messages gets spammed with this:
1601041428666.png


da12 -> da17 as far as I can tell, so it's the same RAIDZ2 as the slot that keeps failing.
Oddly enough, da37 doesn't show up in the logs.

I've opened up the enclosures and the server and checked for loose connections, dust, etc. It all looks good.

I'm at a loss here, guys.
 
Joined
Jan 7, 2015
Messages
1,150
My guess is it's losing power or its data connection momentarily. Without knowing how you are cabling and powering all these drives, I'd start there. Rule out power and cabling.
 
Joined
Jan 7, 2015
Messages
1,150
The resilver always completes? How many times/drives have you resilvered?
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,691
Has the pool gone through a scrub, and is it still showing checksum errors afterwards?
Which version of FreeNAS or TrueNAS are you running?
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Those errors mean you have a bad SATA connection. Get a new cable and double-check the connections.
 

DumSkidderik

Cadet
Joined
Sep 25, 2020
Messages
3
Cables are good. It's an old SC213 chassis with two LSI SAS controllers in each enclosure.
As I mentioned, I can choose any physical drive, on a completely different controller and/or enclosure, and it'll do the same thing.
The resilver always completes, then the disk starts tossing checksum errors after maybe 10-15 minutes.
Scrubbing completes but doesn't fix it.
FreeNAS 11.3 U4.1
For good measure, I powered it all down, replugged all the SAS connectors and Molex power, and reseated the drives. Still the same.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,691
I'd suggest filing a bug report, if you can provide remote access to your system. If neither resilver nor scrub fixes the problem, there is unusual behaviour here. It will probably be almost impossible to reproduce and may be hardware related.

The good news is that with Z2 your data is safe; you can still lose a drive while another one has checksum errors.

Does the checksum count increase over time? It might be only one bad block/checksum that gets counted each time it is accessed.

I assume RAM is ECC?
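
One way to answer the "does the count increase over time" question is to save the output of `zpool status` twice, some minutes apart, and compare the CKSUM column per device. The Python sketch below is only an illustration under assumptions: the pool name `tank`, the log device `da40`, and the sample counts are hypothetical, and the parser assumes the usual `NAME STATE READ WRITE CKSUM` table layout. Counts are kept as text rather than converted to numbers, since `zpool status` may abbreviate large values.

```python
def parse_cksum(status_text):
    """Map device name -> CKSUM column (kept as text) from `zpool status` output."""
    counts = {}
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_table = True          # header row of the device table
            continue
        if in_table and not fields:
            break                    # a blank line ends the device table
        if in_table and len(fields) >= 5:
            counts[fields[0]] = fields[4]  # rows like "logs" have fewer fields and are skipped
    return counts


def changed_cksum(before_text, after_text):
    """Return {device: (old, new)} for devices whose CKSUM field differs."""
    before = parse_cksum(before_text)
    after = parse_cksum(after_text)
    return {
        dev: (before.get(dev), new)
        for dev, new in after.items()
        if before.get(dev) != new
    }


# Hypothetical sample output; real output would come from running `zpool status`.
SAMPLE_BEFORE = """\
  pool: tank
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            da12    ONLINE       0     0     0
            da13    ONLINE       0     0    14
        logs
          da40      ONLINE       0     0     0

errors: No known data errors
"""
SAMPLE_AFTER = SAMPLE_BEFORE.replace("   14", "   37")

if __name__ == "__main__":
    print(changed_cksum(SAMPLE_BEFORE, SAMPLE_AFTER))  # {'da13': ('14', '37')}
```

If the same device shows up with a changed value on every comparison, the errors are ongoing rather than a single bad block being counted once.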
 

DumSkidderik

Cadet
Joined
Sep 25, 2020
Messages
3
I'll file a bug report!
It's increasing over time. The rate goes up when writes increase, which I guess makes sense.
RAM is ECC yes.
 