Zpool DEGRADED - Device Faulted (Read Write Checksum) too many errors

Status
Not open for further replies.

arameen

Contributor
Joined
Sep 4, 2014
Messages
145
Each drive manufacturer has periods when its products shine or fail. Seagate used to shine in the past, but right now its drives are not doing so well for longevity in general. If someone wants to go with Seagate drives, that is fine, and maybe they will have better luck; after all, Seagate's luck has to change sometime.

I have, but it was self-induced. This occurred in the early FreeNAS versions, 8.x I think, when I was doing a lot of testing. I forget what I was doing, probably just writing data to a sector on the hard drive to see if it noticed a corrupt file during a scrub.

Agreed.
Anyway, now I am stuck with my drives and their warranty; it's not like I would just throw away 13 drives to change brands. But if I notice too many problems in the future, I will reconsider other brands when it is time to buy new ones. Maybe by then Seagate will have better drives and a better reputation again :D
 

arameen

Contributor
Joined
Sep 4, 2014
Messages
145
The drive that gave me read, write and checksum errors has been removed from my machine. Now I am checking its health in Windows with different software, HD Tune for the moment. My drive reseller uses it, so I thought I would check the drive the same way to make sure they will replace it before I send it back; the RMA is already created.
But HD Tune shows the disk as healthy. It only showed a few dead sectors when I scanned the surface, but the overall health was still OK o_O
Now I am worried my reseller will consider my drive OK and not replace it, even though the drive didn't pass the Seagate SeaTools long generic test. In that case the same would apply to the other drives currently in my machine that are showing read, write and checksum errors. Have you guys found yourselves in this situation?

Is FreeNAS too sensitive when it comes to what it considers a bad disk? Or is HD Tune just bad software for checking a drive's health? Is there something I should explain to my reseller if they consider my disk OK and not yet replaceable :confused:?
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
The failed SMART test you posted in the first post is enough proof to RMA the drive ;)
 

hugovsky

Guru
Joined
Dec 12, 2011
Messages
567
The failed SMART test you posted in the first post is enough proof to RMA the drive ;)

Even if it shows:
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

I have had dead drives that showed that message.

Just to add my two cents: out of quite a few WD disks (100+) over 3 or 4 years, I had 2 DOA. No other problems.
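
The point above, that the overall-health line can read PASSED on a drive that is failing, is why the self-test log is worth reading separately. Here is a minimal Python sketch, assuming the text comes from `smartctl -a`. The sample text is illustrative only and was not captured from the drive in this thread; `summarize_smart` is a hypothetical helper name.

```python
import re

# Illustrative sample only; not output from the drive discussed in this thread.
SAMPLE = """\
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      1234         5678901
# 2  Short offline       Completed without error       00%      1200         -
"""

def summarize_smart(text):
    """Return (overall_health, failed_self_tests) from smartctl-style text."""
    m = re.search(r"overall-health self-assessment test result:\s*(\S+)", text)
    overall = m.group(1) if m else None
    failed = []
    for line in text.splitlines():
        # Self-test log rows start with "# <n>"; keep the ones reporting a failure.
        if re.match(r"#\s*\d+\s", line) and "failure" in line.lower():
            failed.append(line.strip())
    return overall, failed

overall, failed = summarize_smart(SAMPLE)
print("overall health:", overall)
print("failed self-tests:", len(failed))
```

In this made-up sample the overall line says PASSED while one extended self-test ended in a read failure, which is the combination described above.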
 

arameen

Contributor
Joined
Sep 4, 2014
Messages
145
I have run into some strange issues since I removed this drive from my machine.
I installed a new one and it started behaving the same way: faulted state and too many read, write and checksum errors until a server restart. This drive, which was connected to the same cable port on the SFF-8087 as the previous drive, has now also been removed from my machine. I am not risking anything.
So I installed another one, a third drive, still on the same cable port on the SFF-8087 as the previous two, and you know what? This one also started showing read, write and checksum errors.
So 2 additional, new SATA drives are showing faults. I find it unlikely to be a drive issue. Could it be that specific cable port? A cable issue? Both of the last 2 drives were new and unused.

Add to this that the server GUI has stopped responding several times, so I had to restart the server manually. During some of those restarts FreeNAS never finished booting and never got to the FreeNAS Console Setup Menu, forcing me to completely turn off my server for a minute or two before starting it again. (Does FreeNAS not boot up correctly because a drive is failing?)
To make my life harder, FreeNAS is showing devices as UNAVAIL even though the resilvering finished after the second replacement hard drive :mad::confused:o_O:(
I don't believe that 2 drives suddenly died in my pool and are no longer available: 1 replacement drive and one pool drive that had been working before. I double-checked the cable connections in case something came loose when I opened the case, but all the power and data cable connections are OK.
From GUI -> Shell -> zpool status - ... I get "replacing-10" on one of the UNAVAIL disks.
But from GUI -> Storage -> zpool -> View Disks I get "resilver complete and 2 UNAVAIL disks" but nothing about replacing?!
And both show 2 UNAVAIL disks :eek:
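
The mismatch described above is easier to reason about from the raw `zpool status` text. A `replacing-N` entry is a temporary vdev that groups the old and the new device while a replacement is in progress. Below is a minimal Python sketch that lists every entry whose state is not ONLINE. The sample text, pool name and device names are made up for illustration and are not this pool's output; `unhealthy_devices` is a hypothetical helper.

```python
# Illustrative sample only; pool and device names are hypothetical.
SAMPLE = """\
  pool: tank
 state: DEGRADED
config:

    NAME              STATE     READ WRITE CKSUM
    tank              DEGRADED     0     0     0
      raidz3-0        DEGRADED     0     0     0
        gptid/disk-a  ONLINE       0     0     0
        replacing-10  UNAVAIL      0     0     0
          gptid/old   UNAVAIL      0     0     0
          gptid/new   ONLINE       0     0     0
        gptid/disk-b  UNAVAIL      0     0     0
"""

STATES = {"ONLINE", "DEGRADED", "FAULTED", "OFFLINE", "UNAVAIL", "REMOVED"}

def unhealthy_devices(text):
    """Return (name, state) for every config row whose state is not ONLINE."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Config rows have a name, a state and three error counters.
        if len(fields) >= 5 and fields[1] in STATES and fields[1] != "ONLINE":
            rows.append((fields[0], fields[1]))
    return rows

for name, state in unhealthy_devices(SAMPLE):
    print(name, state)
```

The pool and vdev rows show up as well, since they inherit a degraded state from their members; the leaf rows under `replacing-10` show which half of the replacement is still missing.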

The only positive thing is that the pool is healthy and not showing any errors, so far.
I don't understand what is going on, what is wrong, and whether there is more than one fault to fix. But I am really worried about my pool: despite it being a raidz3 pool, now that 2 disks are missing, one error could be fatal. So I am waiting for your suggestions and advice, guys. Until then I am considering shutting off that pool so as not to risk any data corruption. (By shutting off I mean disconnecting the power cables from all drives in that pool, so I don't risk any hardware errors until I know what to do.)
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
I find it unlikely to be a drive issue. Could it be that specific cable port? A cable issue?

Yes, definitely a cable issue or port issue. Try with another cable.

During some of those restarts FreeNAS never finished booting and never got to the FreeNAS Console Setup Menu, forcing me to completely turn off my server for a minute or two before starting it again.

In case of problems the boot can be slow, very slow (it has to wait for timeouts, for example). Next time, just wait something like 10-15 min before powering it off.

Does FreeNAS not boot up correctly because a drive is failing?

Yes, it can. My first and only drive failure (for now...) did exactly that. When I tried to put the failed drive in my desktop, it wasn't even POSTing, so a drive can definitely prevent a server from booting.

I don't believe that 2 drives suddenly died in my pool and are no longer available: 1 replacement drive and one pool drive that had been working before. I double-checked the cable connections in case something came loose when I opened the case, but all the power and data cable connections are OK.
From GUI -> Shell -> zpool status - ... I get "replacing-10" on one of the UNAVAIL disks.
But from GUI -> Storage -> zpool -> View Disks I get "resilver complete and 2 UNAVAIL disks" but nothing about replacing?!
And both show 2 UNAVAIL disks :eek:

Mmhh... can you post a list of your hardware? Especially the PSU model and the number of drives you have in this server.

Until then I am considering shutting off that pool so as not to risk any data corruption. (By shutting off I mean disconnecting the power cables from all drives in that pool, so I don't risk any hardware errors until I know what to do.)

You should consider having backups; RAID never replaces backups... :)
 

arameen

Contributor
Joined
Sep 4, 2014
Messages
145
I have only one SFF-8087, so I will switch the cable port to another drive to see if the errors follow to the other connected drive.
I was hoping it would be possible to tell drive-related errors from cable-related errors using the SMART values. I guess it is not possible?

I gave FreeNAS more than 20 minutes, but it never seemed to finish booting and I couldn't access it through the GUI. For now it seems that only turning FreeNAS off completely for a few minutes lets it finish booting.
The reason I suspect the boot issue is caused by a drive issue is that I recognize the behaviour from Windows: when a drive was almost dead, it took a very long time to boot. Especially since the Boot Volume Condition is HEALTHY.
But as I wrote, it is only a guess on my part that a faulty or failing drive prevents FreeNAS from booting?

A complete hardware list is on the previous page of posts in this thread :)
 

wtfR6a

Explorer
Joined
Jan 9, 2016
Messages
88
hehehe, seagate seems really not popular here :) Not sure if they suddenly became so bad.

Nope, I've always had grief with them, from my first NAS setups a decade+ ago. I used to use RAID6 and kept a hot spare in the mail back and forth to Seagate for RMA until I saw the light and moved to WD.
 

arameen

Contributor
Joined
Sep 4, 2014
Messages
145
The first drive is RMAed and a new one is on its way.
The second and third drives seem OK when tested with Seagate SeaTools in Windows. I ran all the tests, including the long generic one.
I have a feeling something is or was wrong with the cable. For the moment I don't have any errors in that pool. But the question is what will happen when I reintroduce the second and third drives into the machine. As mentioned earlier, those 2 drives replaced the troublesome first drive and both of them showed errors, yet they seemed OK in Windows when I tested everything there is to test.
I can add something dramatic to this. At one point earlier this week 2 drives were UNAVAILABLE, and during the resilvering FreeNAS detected permanent errors in the pool. Now that the pool is resilvered and healthy, I will go through all the data to replace the corrupted files. It is not a big issue right now, because that pool contains mostly packed material, which will easily reveal any corruption while unpacking with WinRAR. And I needed to go through all the material and unpack it anyway. So no big deal. I am just happy that the material is there at all rather than gone completely, which is what I first thought would happen.

Back to the errors.
I suspect the cable, the SFF-8087. I assume power cables themselves cannot cause errors, only SATA cables. Or can they?
I noted very carefully which cable port of the SFF-8087 was connected to the drives that showed errors.
But my question remains unanswered:
Is there no other way to be sure the cable is OK and not causing issues than to just buy a new one and replace it?
Maybe by reading a specific SMART value, or something else that does not involve opening the case and disconnecting and reattaching a lot of drives?
 

rogerh

Guru
Joined
Apr 18, 2014
Messages
1,111
Back to the errors.
I suspect the cable, the SFF-8087. I assume power cables themselves cannot cause errors, only SATA cables. Or can they?
I noted very carefully which cable port of the SFF-8087 was connected to the drives that showed errors.
But my question remains unanswered:
Is there no other way to be sure the cable is OK and not causing issues than to just buy a new one and replace it?
Maybe by reading a specific SMART value, or something else that does not involve opening the case and disconnecting and reattaching a lot of drives?

As far as I know there is no conclusive way to distinguish drive errors from cable errors in software or firmware. And there is no utility in the drive firmware or available for the host that can test the cable specifically. Substitution (in the presence of errors) is probably the only way to test the cable; especially as cable errors usually relate to poor contacts at the termination rather than problems inside the cable, which could be tested quite easily with a multimeter. A cable could in theory just have really poor signal characteristics, resulting in the corruption of high speed serial data, but I don't think this is easily tested except by substitution with a reputable maker's cable.
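
The substitution approach can be written down as simple bookkeeping: record which drive sat on which port and whether errors appeared, then see what the errors follow. This is only a sketch of that reasoning in Python, not a diagnostic tool. The drive and port labels are hypothetical, `errors_follow` is a made-up helper, and a "port" verdict cannot separate the cable from the controller port or anything else tied to that position.

```python
from collections import defaultdict

def errors_follow(observations):
    """observations: iterable of (drive, port, had_errors) from swap tests.

    Returns "port or cable", "drive", or "inconclusive".
    """
    port_drives = defaultdict(set)   # port -> drives tried on it
    drive_ports = defaultdict(set)   # drive -> ports it was tried on
    port_clean = set()               # ports with at least one error-free run
    drive_clean = set()              # drives with at least one error-free run
    for drive, port, had_errors in observations:
        port_drives[port].add(drive)
        drive_ports[drive].add(port)
        if not had_errors:
            port_clean.add(port)
            drive_clean.add(drive)
    # Suspect a port only if several different drives all failed on it,
    # and a drive only if it failed on several different ports.
    bad_ports = [p for p, d in port_drives.items()
                 if len(d) >= 2 and p not in port_clean]
    bad_drives = [d for d, p in drive_ports.items()
                  if len(p) >= 2 and d not in drive_clean]
    if bad_ports and not bad_drives:
        return "port or cable"
    if bad_drives and not bad_ports:
        return "drive"
    return "inconclusive"

# Three different drives, all showing errors on the same (hypothetical) port.
runs = [("drive1", "port3", True), ("drive2", "port3", True), ("drive3", "port3", True)]
print(errors_follow(runs))
```

With three different drives all failing on one port, the bookkeeping points at the port or cable rather than the drives, which matches the suspicion raised earlier in the thread.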
 

arameen

Contributor
Joined
Sep 4, 2014
Messages
145
So there is no way to avoid detaching and reattaching SATA connections when troubleshooting to find out whether the errors are cable- or drive-related :(

But at least it cannot be power-related, i.e. the power cable, I guess?
 

rogerh

Guru
Joined
Apr 18, 2014
Messages
1,111
I've not personally seen intermittent SATA power cable problems, but it is certainly possible to have a poor connection due to a loosely crimped wire or bent connector. Or power problems due to a weak or defective PSU.
 

arameen

Contributor
Joined
Sep 4, 2014
Messages
145
In that case I will focus on drives or SATA cables in the future whenever I encounter data errors.
It is not easy to get the most out of the SMART values. I got one link, but would gladly read more if there is more to learn about SMART results/values.
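
As a starting point for reading SMART values, here is a minimal Python sketch that pulls the raw values of a few commonly watched attributes out of a `smartctl -A` style table. The sample table is invented for illustration, and `raw_values` and `WATCH` are hypothetical names. Reallocated and pending sector counts relate to the disk surface, while UDMA_CRC_Error_Count counts transfer errors on the link and is often read as a hint toward cabling, though, as said earlier in the thread, nothing here is conclusive.

```python
# Illustrative sample only; the numbers are invented.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       12
"""

WATCH = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
)

def raw_values(text):
    """Map attribute name -> raw value (as a string) for each table row."""
    values = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least ten columns.
        if len(fields) >= 10 and fields[0].isdigit():
            values[fields[1]] = fields[9]
    return values

nonzero = {k: v for k, v in raw_values(SAMPLE).items() if k in WATCH and v != "0"}
for name, raw in nonzero.items():
    print(name, raw)
```

Raw values are kept as strings because some drives report them in vendor-specific formats that do not read as plain counters.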

Thanks all for the contribution to this thread :)
 