SOLVED Constant zpool degrading

Status
Not open for further replies.

Green750one

Dabbler
Joined
Mar 16, 2015
Messages
36
Hi,

I'm having recurrent issues with my zpool degrading.
Typically this is a single disk becoming unavailable, which replacing resolves for a couple of days, then it happens again.
I've RMA'd 4 disks so far, but I don't actually think there's any issue with the disks. A couple have non-significant errors, but both short & long SMART test cycles pass OK.
I've replaced all the SATA cables - twice - and I've switched ports around but it appears random.
My rig does run a little hot at around 36°C, but that's well within the operating range of the disks.
PSU is 900W with multiple 12v rails so again should be more than adequate even for 10 drives (+ 2 SSDs) which I have configured for staggered start.

This morning I got the following alerts...
The volume Main (ZFS) state is DEGRADED: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.

then 45 mins later...
Device: /dev/ada5, unable to open device (this was the drive that caused the 1st alert but was auto replaced by the spare)
The volume Main (ZFS) state is DEGRADED: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.
Device: /dev/ada3, unable to open device
Device: /dev/ada1, ATA error count increased from 2 to 3
Device: /dev/ada9, unable to open device

Any thoughts as to what might be happening?
Could this be the Marvell SATA ports on the MB playing up? I have flashed the latest 9230 firmware. This only seemed to start when I doubled the size of my zpool by adding an extra 4 drives.
Could it be impending PSU failure?
Could it be impending MB failure?

Many Thanks
Philip
 

Nick2253

Wizard
Joined
Apr 21, 2014
Messages
1,633
PSU is 900W with multiple 12v rails
Do you have your 12V load separated among your rails, or do you have it all on one rail? If you have it all on one rail, that could explain the problems, because one rail in a multi-rail PSU is probably not enough to power 10+ drives.

A couple have non-significant errors, but both short & long SMART test cycles pass OK.
Just because a SMART test passes does not mean a drive is good.

Could this be the Marvell SATA ports on the MB playing up?
Are the drives that fail always on the Marvell SATA ports?
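
One rough way to answer that is to tally the "unable to open device" alerts per controller. The sketch below is only an illustration: the alert lines are copied from the first post, but the device-to-controller mapping (and the labels "chipset" and "marvell") is hypothetical and would have to be filled in from the actual cabling.

```python
import re
from collections import Counter

# Alert lines copied from the first post in this thread.
ALERTS = """\
Device: /dev/ada5, unable to open device
Device: /dev/ada3, unable to open device
Device: /dev/ada1, ATA error count increased from 2 to 3
Device: /dev/ada9, unable to open device
"""

# HYPOTHETICAL mapping of device name -> controller it is cabled to.
# These assignments are made up for illustration; replace them with your own.
CONTROLLER_OF = {
    "ada1": "chipset",
    "ada3": "chipset",
    "ada5": "marvell",
    "ada9": "marvell",
}

# Only count drives that dropped out, not other alert types.
DROPOUT = re.compile(r"Device: /dev/(\w+), unable to open device")


def dropouts_by_controller(alert_text, controller_of):
    """Count 'unable to open device' alerts per controller."""
    counts = Counter()
    for dev in DROPOUT.findall(alert_text):
        counts[controller_of.get(dev, "unknown")] += 1
    return counts


if __name__ == "__main__":
    for controller, n in dropouts_by_controller(ALERTS, CONTROLLER_OF).items():
        print(f"{controller}: {n} dropout(s)")
```

If the dropouts spread across every controller, as in the made-up mapping here, that points away from a single controller and towards something shared such as power.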
 

Green750one

Dabbler
Joined
Mar 16, 2015
Messages
36
Hi and Thanks,
Do you have your 12V load separated among your rails, or do you have it all on one rail? If you have it all on one rail, that could explain the problems, because one rail in a multi-rail PSU is probably not enough to power 10+ drives.
Yes - load spread over 4 rails
Just because a SMART test passes, does not mean a drive is good.
OK. So what is a good indicator? How do I tell?
Are the drives that fail always on the Marvell SATA ports?
I was thinking this as I posted - I'm not sure so will check when I get home, but I don't believe so
 

Nick2253

Wizard
Joined
Apr 21, 2014
Messages
1,633
OK. So what is a good indicator? How do I tell?
Unfortunately, there is no universal indicator that a drive is bad, other than not being able to read from or write to the disk (obviously :D).

So what you're really looking for are indications that a drive is failing. Four of the most common indicators for us are: checksum errors on the drive, increased read/write latency, a failed SMART test, or adverse changes to SMART data (increasing bad sectors, for example). A failed SMART test is a very strong indicator that a drive is failing, and such a drive should be replaced immediately. The other three may or may not indicate failure, and should be judged based on their frequency and severity.

For example, a drive that has a checksum error once a quarter is probably not bad. We might expect unrecoverable read errors (UREs) at that frequency, especially if your NAS is in an electrically noisy environment. On the other hand, a drive that develops a hundred-plus checksum errors a week is probably a failing drive. Unfortunately, sometimes it's ambiguous: if a drive shows a spurt of bad sectors, along with a bunch of checksum errors, but then shows no other problems, it may have just hit a bad spot on the platter and has now resolved the issue. Or it might indicate a problem with the drive or controller.
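
To make the frequency argument concrete, here is a minimal sketch that turns a checksum error count into a weekly rate and sorts it into the three buckets described above. The two cut-offs are just the examples from this post (about one error a quarter, a hundred-plus a week), not established thresholds, and the function name is made up for illustration.

```python
WEEKS_PER_QUARTER = 13  # roughly 91 days

# Cut-offs taken from the examples in the post; they are rules of thumb,
# not vendor or ZFS thresholds.
FAILING_PER_WEEK = 100
FINE_PER_WEEK = 1 / WEEKS_PER_QUARTER


def triage_checksum_errors(error_count, days_observed):
    """Classify a drive by its checksum error rate over an observation window."""
    if days_observed <= 0:
        raise ValueError("days_observed must be positive")
    per_week = error_count * 7 / days_observed
    if per_week >= FAILING_PER_WEEK:
        return "probably failing"
    if per_week <= FINE_PER_WEEK:
        return "probably fine"
    # Everything in between needs judgment: watch the trend and SMART data.
    return "ambiguous"


if __name__ == "__main__":
    print(triage_checksum_errors(1, 180))   # one error in about two quarters
    print(triage_checksum_errors(150, 7))   # 150 errors in a week
    print(triage_checksum_errors(5, 7))     # a handful in a week
```

The wide "ambiguous" band is deliberate: most real cases land there and have to be judged alongside latency and SMART changes.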

Anyway, that's why we use RAIDZ(1,2,3) and ZFS: we have to worry less about possible problems, because ZFS will take care of us in the worst case that a sector is corrupted or a drive is lost.
 

Green750one

Dabbler
Joined
Mar 16, 2015
Messages
36
I really don't think I have a drive issue. I monitor them quite closely & I don't have checksum errors, bad sectors or anything obviously amiss. They just seem to become unavailable for no apparent reason.
I'm also trying to work out if it started (or got worse) after upgrading to 9.10.

Sent from my D6503 using Tapatalk
 

CraigD

Patron
Joined
Mar 8, 2016
Messages
343
It could be the drives, SATA power or data cables, controller(s), a lack of clean power, overheating, or a short of some kind but I doubt it

I used to see this all the time (20 years ago). My guess is you have a break in a power cable wire: it works until it heats up and the circuit is broken, then once everything cools down everything works again.

Swap known-good parts in and the problem will become obvious.

Badblocks is an HDD test that you could try. Note that its write-mode test is destructive, so only run that on a drive with no data you need.

Have Fun
 

Green750one

Dabbler
Joined
Mar 16, 2015
Messages
36
Checked my box and it's not just drives attached to the Marvell ports.
Will concentrate on power cables - powering 12 drives did take some ingenuity!

 

Mr_N

Patron
Joined
Aug 31, 2013
Messages
289
Who manufactures that PSU? I can't seem to find anything on it with a Google search...
 

Green750one

Dabbler
Joined
Mar 16, 2015
Messages
36
I believe it's a company called iCute and it's 80 plus certified.

 

Green750one

Dabbler
Joined
Mar 16, 2015
Messages
36
Here's the report

 

Attachments

  • SP346_iCute_AP-900AS_850W_Report.pdf
    252.6 KB · Views: 422

Z300M

Guru
Joined
Sep 9, 2011
Messages
882
I would try a different power supply -- from a reputable manufacturer such as SeaSonic.
 

Mr_N

Patron
Joined
Aug 31, 2013
Messages
289
Yeah, I couldn't agree more. Even with the name it's hard to get any info on their hardware. I certainly wouldn't be trusting it to run anything of mine...
 

Sakuru

Guru
Joined
Nov 20, 2015
Messages
527

Green750one

Dabbler
Joined
Mar 16, 2015
Messages
36
Good job my rig isn't mission critical then. It's primarily a media server running sickrage, couchpotato & plex. Anything I can't replace is backed up in several places.
I think it's the power cabling - I daisy-chained some SATA extension cables to give me 12 plugs. I've ordered a couple of 6-pin cables to plug directly into the PSU. I'll see how that goes before I splash out £150 on a Seasonic.

 

Mr_N

Patron
Joined
Aug 31, 2013
Messages
289
It's not about being mission critical; a crap PSU can damage all your hardware... I'm guessing £150 will start to look cheap when compared to the replacement cost for the whole rig :)
 

Green750one

Dabbler
Joined
Mar 16, 2015
Messages
36
Is 80plus verified not an indication that it's an ok psu? I always thought it was.

 

Mr_N

Patron
Joined
Aug 31, 2013
Messages
289
Not really, that just certifies some level of efficiency.
 

tvsjr

Guru
Joined
Aug 29, 2015
Messages
959
Is 80plus verified not an indication that it's an ok psu? I always thought it was.
Not in the least. It means that one or a handful of test units that were evaluated squeaked over some arbitrary bar.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,358
Sounds like a power issue.


SATA connectors should not be split more than once.

The 3 itty bitty pins for each voltage mean a total of 4.5A for 12V (1.5A per pin). A normal drive needs about 2A for spin-up, so two drives hanging off one connector will draw circa 4A. Three would draw about 6A and exceed the rating of the connector.
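
The same arithmetic as a quick sketch, in case you want to plug in your own drive's numbers. The 2A spin-up figure is the rough value used above and is an assumption here; check the 12V spin-up current on the drive's datasheet. The function names are made up for illustration.

```python
import math

PINS_PER_VOLTAGE = 3      # a SATA power connector has 3 pins per voltage
AMPS_PER_PIN = 1.5        # per-pin rating, giving 4.5 A on the 12 V pins
SPINUP_AMPS_12V = 2.0     # ASSUMED rough spin-up draw per drive; check the datasheet


def connector_limit_amps():
    """Total 12 V current one SATA power connector is rated for."""
    return PINS_PER_VOLTAGE * AMPS_PER_PIN


def spinup_draw_amps(drives, spinup_amps=SPINUP_AMPS_12V):
    """Worst-case 12 V draw if all drives on the connector spin up at once."""
    return drives * spinup_amps


def max_drives_per_connector(spinup_amps=SPINUP_AMPS_12V):
    """Largest number of drives whose combined spin-up draw stays within the rating."""
    return math.floor(connector_limit_amps() / spinup_amps)


if __name__ == "__main__":
    limit = connector_limit_amps()
    for n in (1, 2, 3):
        draw = spinup_draw_amps(n)
        verdict = "OK" if draw <= limit else "exceeds rating"
        print(f"{n} drive(s): {draw:.1f} A of {limit:.1f} A -> {verdict}")
    print("max drives per connector:", max_drives_per_connector())
```

This only covers the connector itself. Staggered spin-up lowers the simultaneous draw, but the wire loom and the rail feeding it still need checking separately.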

Then you need to worry about the current of the main wire loom, and what rail it's connected to.

I'd suggest doing yourself a favor and getting a good single rail modular psu.

EVGA, Corsair RMx, Seasonic etc.
 
