Failing drives

lmannyr

Contributor
Joined
Oct 11, 2015
Messages
198
I got a message saying da1 and da7 have errors, so I ran reports on all my drives. Long tests are set to run twice a month and short tests once a week via the GUI.

I have 10 drives. Attached are the smartctl -a readouts.

What do you think?
 

Attachments

  • Smartctl Readout.txt
    63.6 KB · Views: 322

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,079
What do you think?
I think that you have neglected your system for at least 1464 hours, which is about 61 days, roughly two months. You need to set up email alerts and pay attention to your system if you care AT ALL for the data that is in it.

We need to know what the disk configuration is. Please show us the output of the command zpool status and just put it in a message using code tags. A code tag looks like this:
1546793156569.png

Resulting in a display like this:
Code:
da1

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       60%     27517         900512
# 2  Short offline       Completed: read failure       60%     27445         900512
# 3  Short offline       Completed: read failure       60%     27349         900512
# 4  Short offline       Completed: read failure       60%     27253         900512
# 5  Extended offline    Completed: read failure       90%     27240         900512
# 6  Short offline       Completed: read failure       60%     27157         900512
# 7  Short offline       Completed: read failure       60%     27061         900512
# 8  Short offline       Completed without error       00%     26965         -
# 9  Extended offline    Completed: read failure       90%     26904         900512
#10  Short offline       Completed: read failure       60%     26869         900512
#11  Short offline       Completed: read failure       60%     26773         900512
#12  Short offline       Completed: read failure       60%     26725         900512
#13  Short offline       Completed: read failure       60%     26630         900512
#14  Short offline       Completed: read failure       60%     26534         900512

da2

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
200 Multi_Zone_Error_Rate   0x0008   173   173   000    Old_age   Offline      -       10803

da7

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       12

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       50%     26476         132440
# 2  Short offline       Completed: read failure       60%     26404         129568
# 3  Short offline       Completed: read failure       60%     26308         132440
# 4  Short offline       Completed: read failure       60%     26212         135936
# 5  Extended offline    Completed: read failure       90%     26199         132440
# 6  Short offline       Completed: read failure       50%     26116         135936
# 7  Short offline       Completed: read failure       50%     26020         135936
# 8  Short offline       Completed without error       00%     25924         -
# 9  Extended offline    Completed: read failure       90%     25863         132440
#10  Short offline       Completed: read failure       50%     25828         132440
#11  Short offline       Completed: read failure       50%     25732         135936
#12  Short offline       Completed: read failure       50%     25684         135936
#13  Short offline       Completed: read failure       60%     25588         132440
#14  Short offline       Completed: read failure       60%     25493         135936
#15  Extended offline    Completed: read failure       90%     25480         132440
#16  Short offline       Completed: read failure       60%     25397         132440
#17  Short offline       Completed: read failure       60%     25301         132440
#18  Short offline       Completed: read failure       50%     25205         132440
#19  Extended offline    Completed: read failure       90%     25144         135936
#20  Short offline       Completed: read failure       50%     25109         135936
#21  Short offline       Completed: read failure       60%     25012         132440

These are some of the important errors from the text file you shared. The oldest failure listed is from about two months ago, but nothing indicates that the errors began only then; they may have persisted for longer.
Both da1 and da7 should be replaced as soon as possible. da2 is also showing problems that are reason for concern; I would replace it as well.
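For anyone wondering where the 1464-hour figure comes from: it appears to match the gap between the newest and oldest failed self-tests in da7's log (LifeTime 26476 versus 25012). Below is a minimal Python sketch of that arithmetic. The function names and the shortened log excerpt are illustrative assumptions, not part of smartctl or FreeNAS.

```python
import re

# Three rows copied from da7's self-test log above (newest, the one pass, oldest).
DA7_LOG_EXCERPT = """\
# 1  Short offline       Completed: read failure       50%     26476         132440
# 8  Short offline       Completed without error       00%     25924         -
#21  Short offline       Completed: read failure       60%     25012         132440
"""

# One row of the "SMART Self-test log" table as printed by smartctl -a.
ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+(?P<test>Short offline|Extended offline)\s+"
    r"(?P<status>Completed.*?)\s+(?P<remaining>\d+)%\s+"
    r"(?P<hours>\d+)\s+(?P<lba>\S+)\s*$"
)


def parse_selftest_log(text):
    """Return one dict per self-test row found in the text."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows.append({
                "num": int(m["num"]),
                "test": m["test"],
                "status": m["status"].strip(),
                "hours": int(m["hours"]),
                "lba": m["lba"],
            })
    return rows


def summarize(rows):
    """Count read failures and measure how long they span in power-on hours."""
    failed = [r for r in rows if "read failure" in r["status"]]
    if not failed:
        return {"failures": 0, "span_hours": 0, "span_days": 0.0}
    hours = [r["hours"] for r in failed]
    span = max(hours) - min(hours)
    return {"failures": len(failed), "span_hours": span, "span_days": span / 24}


if __name__ == "__main__":
    print(summarize(parse_selftest_log(DA7_LOG_EXCERPT)))
```

With the excerpt above the span works out to 26476 - 25012 = 1464 power-on hours, or 61 days. Because the log only keeps a limited number of entries, this is a lower bound on how long the drive has been failing its tests.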
 

lmannyr

Contributor
Joined
Oct 11, 2015
Messages
198
I tried to put the above smartctl output into a "code" window but got an error submitting the post saying post was too long.

Below is the "zpool status" output.

Code:
 pool: POOL
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
    still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
    the pool may no longer be accessible by software that does not support
    the features. See zpool-features(7) for details.
  scan: scrub repaired 3.48M in 0 days 04:25:28 with 0 errors on Sun Jan  6 04:25:30 2019
config:

    NAME                                            STATE     READ WRITE CKSUM
    POOL                                            ONLINE       0     0     0
      raidz2-0                                      ONLINE       0     0     0
        gptid/a30076fc-7a79-11e5-8637-0cc47a6ae0be  ONLINE       0     0     0
        gptid/a3bebf3c-7a79-11e5-8637-0cc47a6ae0be  ONLINE       0     0     0
        gptid/ed456456-e41f-11e6-9400-0cc47a6ae0be  ONLINE       0     0     0
        gptid/a5350cbb-7a79-11e5-8637-0cc47a6ae0be  ONLINE       0     0     0
        gptid/a5f1744d-7a79-11e5-8637-0cc47a6ae0be  ONLINE       0     0     0
        gptid/a6b36e4f-7a79-11e5-8637-0cc47a6ae0be  ONLINE       0     0     0
        gptid/a7706d59-7a79-11e5-8637-0cc47a6ae0be  ONLINE       0     0     0
        gptid/a82c560c-7a79-11e5-8637-0cc47a6ae0be  ONLINE       0     0     0
        gptid/a8e7d0ca-7a79-11e5-8637-0cc47a6ae0be  ONLINE       0     0     0
        gptid/a99ec722-7a79-11e5-8637-0cc47a6ae0be  ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: resilvered 788M in 0 days 00:06:50 with 0 errors on Sat Jan  5 23:25:02 2019
config:

    NAME        STATE     READ WRITE CKSUM
    freenas-boot  ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        da8p2   ONLINE       0     0     0
        da9p2   ONLINE       0     0     0

errors: No known data errors
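One detail in the output above is easy to miss: every READ/WRITE/CKSUM counter is 0 and the pool is ONLINE, yet the scan line says the last scrub repaired 3.48M. A rough Python sketch for pulling that out of the scan line follows. The regular expression only targets the wording shown in this output, and the function names are made up for illustration.

```python
import re

# The scan line from the POOL section above.
SCAN = "  scan: scrub repaired 3.48M in 0 days 04:25:28 with 0 errors on Sun Jan  6 04:25:30 2019"

# Matches only the "scrub repaired ..." wording seen in this thread.
SCAN_RE = re.compile(
    r"scan:\s+scrub repaired (?P<repaired>\S+) in (?P<duration>.+?) "
    r"with (?P<errors>\d+) errors on (?P<when>.+)$"
)


def parse_scan_line(line):
    """Return the scrub details as a dict, or None if the line is not a scrub report."""
    m = SCAN_RE.search(line)
    if m is None:
        return None
    return {
        "repaired": m["repaired"],
        "duration": m["duration"],
        "errors": int(m["errors"]),
        "when": m["when"].strip(),
    }


def scrub_repaired_data(line):
    """True if the scrub reports a non-zero amount of repaired data."""
    info = parse_scan_line(line)
    if info is None:
        return False
    amount = re.match(r"[\d.]+", info["repaired"])
    return amount is not None and float(amount.group()) > 0


if __name__ == "__main__":
    print(parse_scan_line(SCAN))
    print("scrub had to repair data:", scrub_repaired_data(SCAN))
```

A non-zero repaired amount with 0 errors means ZFS found bad data and fixed it from redundancy, so the pool is still intact but something underneath is returning bad reads.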
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,996
My 2 cents:
1) Back up your data. You have a RAIDZ2 with two drives on the way out and a possible third, and your scrub had to repair some data.
2) Replace drive da1 WD-WCC4E5UXCVCK due to failure to complete SMART testing. This means the drive has a physical defect.
3) Replace drive da7 WD-WCC4E4PDN2XP due to failure to complete SMART testing. This means the drive has a physical defect.
4) Keep an eye on drive da2 WD-WCC4E1FFRAJ2 due to the ID 1 and ID 200 values. Since the Extended test is passing you should be okay, and values in these two locations alone do not indicate failure. If the values continue to increase, consider replacing the drive when it's something you can afford. If it's under warranty then I'd RMA the drive now.
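The rules in the list above can be sketched roughly in Python as follows. This is only an illustration under stated assumptions: the DriveReport type, the "any non-zero raw value means watch" cutoff, and the function names are hypothetical. The raw values are the ID 200 numbers quoted earlier in the thread.

```python
from dataclasses import dataclass

RAIDZ2_PARITY = 2  # a RAIDZ2 vdev survives the loss of any two member drives


@dataclass
class DriveReport:
    name: str
    latest_selftest_failed: bool  # most recent SMART self-test ended in "read failure"
    attr200_raw: int              # raw value of ID 200 Multi_Zone_Error_Rate


def triage(drive):
    """Failed self-test means replace; a non-zero ID 200 alone means watch (assumed cutoff)."""
    if drive.latest_selftest_failed:
        return "replace"
    if drive.attr200_raw > 0:
        return "watch"
    return "ok"


def redundancy_left(drives, parity=RAIDZ2_PARITY):
    """How many more drives could fail if every 'replace' drive died today."""
    dying = sum(1 for d in drives if triage(d) == "replace")
    return parity - dying


# Values taken from the SMART output posted in this thread.
DRIVES = [
    DriveReport("da1", latest_selftest_failed=True, attr200_raw=1),
    DriveReport("da2", latest_selftest_failed=False, attr200_raw=10803),
    DriveReport("da7", latest_selftest_failed=True, attr200_raw=12),
]

if __name__ == "__main__":
    for d in DRIVES:
        print(d.name, triage(d))
    print("redundancy left if the 'replace' drives die:", redundancy_left(DRIVES))
```

With da1 and da7 both marked for replacement, a RAIDZ2 would have no redundancy left if both died, which is why the backup comes first.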
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,079
I tried to put the above smartctl output into a "code" window but got an error submitting the post saying post was too long.
You have a lot of drives, so it probably needed to be broken into a couple of posts, but your initial post was workable.
 

lmannyr

Contributor
Joined
Oct 11, 2015
Messages
198
Do you have monitoring scripts running on your system?

Github repository for FreeNAS scripts, including disk burnin
https://forums.freenas.org/index.ph...for-freenas-scripts-including-disk-burnin.28/

Do you have spare drives on hand? You might want to have a spare ready.
Before you add drives to your system, you should do burn-in testing on those drives.
Other than the monitor that 11.2-U6 provides in the GUI, no.

No spares. I have 3 drives on order from Amazon.

I don't have a free box to run a burn-in on.
 

lmannyr

Contributor
Joined
Oct 11, 2015
Messages
198
All 3 drives are RMA'd despite being 2 months out of warranty. BIG THUMBS UP to WESTERN DIGITAL!! I'm sure it will take a little longer than if shipped from Amazon though.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,996
All 3 drives are RMA'd despite being 2 months out of warranty. BIG THUMBS UP to WESTERN DIGITAL!! I'm sure it will take a little longer than if shipped from Amazon though.
If you are able to get them all replaced under RMA, at least you will have three spares should you need them. I'd recommend that you burn those in when you can.

As for the drives coming from Amazon, if your FreeNAS system is active (running) and you can't shut it down, then I'd run a quick burn-in on one drive and replace da1 or da7. Both are going to fail, but we can't say which one will go first. For me a quick burn-in would be a SMART Long test, one pass of badblocks, and another SMART Long test. If no errors are reported when you examine the SMART data, then replace da1 or da7. While resilvering is going on you should burn in your other two drives. Resilvering can take a long time and depends on how much data you have, so you may finish the burn-in before the resilvering is done.
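The quick burn-in sequence described above could be laid out like this. It is a dry-run sketch in Python that only prints the commands and runs nothing. The device name /dev/da10 and the 4096-byte block size are assumptions for illustration, and the function name is made up.

```python
import shlex


def burn_in_plan(device):
    """Return the quick burn-in steps, in order, as argument lists (nothing is executed)."""
    return [
        # 1. SMART long self-test. It runs inside the drive in the background,
        #    so wait for it to finish before starting the next step.
        ["smartctl", "-t", "long", device],
        # 2. One pass of badblocks in write mode (-w) with progress (-s).
        #    WARNING: -w overwrites the whole disk. Only use it on a drive
        #    that holds no data and is not a member of any pool.
        ["badblocks", "-b", "4096", "-ws", device],
        # 3. Second SMART long self-test, again waiting for completion.
        ["smartctl", "-t", "long", device],
        # 4. Read back the SMART data and self-test log to look for errors.
        ["smartctl", "-a", device],
    ]


if __name__ == "__main__":
    # /dev/da10 is a hypothetical name for a new, empty drive.
    for step in burn_in_plan("/dev/da10"):
        print(shlex.join(step))
```

Printing the plan first makes it easy to double-check the device name before running anything destructive by hand.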

Good luck!
 

lmannyr

Contributor
Joined
Oct 11, 2015
Messages
198
Since WD is RMAing all 3 drives, I canceled the amazon order. The RMAs will require me to return the 3 old ones back as well.

Thanks for the tips.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,996
Since WD is RMAing all 3 drives, I canceled the amazon order. The RMAs will require me to return the 3 old ones back as well.

Thanks for the tips.
So you did an Advanced RMA so you can get the drives in hand first?
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,079
You might want to consider getting a spare, for the next time.
 