S.M.A.R.T. Error help

urbansystems

Dabbler
Joined
Oct 22, 2019
Messages
10
Hi all.

I got an error email from my FreeNAS server:

Code:
Device: /dev/da43, Self-Test Log error count increased from 0 to 1


I ran a long test and got these results:

Code:
[root@Cvlt_CGI_Data_bck] ~# smartctl -a /dev/da43
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HUS726040AL4210
Revision:             A907
Compliance:           SPC-4
User Capacity:        4,000,787,030,016 bytes [4.00 TB]
Logical block size:   4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000cca2443067e4
Serial number:        N8GVM88Y
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Tue Oct 22 05:33:20 2019 PDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     33 C
Drive Trip Temperature:        85 C

Manufactured in week 03 of year 2017
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  18
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  187
Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 39325794304

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0     146707          0.400           0
write:         0      781         0       781      18774          0.185          67
verify:        0        0         0         0      48090          0.000           0

Non-medium error count:        0

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -   10868                 - [-   -    -]
# 2  Background short  Completed                   -   10862                 - [-   -    -]
# 3  Background short  Completed                   -   10856                 - [-   -    -]
# 4  Background long   Failed in segment -->       7   10852            180879 [0x3 0x5d 0x1]
# 5  Background short  Completed                   -   10850                 - [-   -    -]
# 6  Background short  Completed                   -   10844                 - [-   -    -]
# 7  Background short  Completed                   -   10838                 - [-   -    -]
# 8  Background short  Completed                   -   10832                 - [-   -    -]
# 9  Background short  Completed                   -   10826                 - [-   -    -]
#10  Background short  Completed                   -   10820                 - [-   -    -]
#11  Background short  Completed                   -   10814                 - [-   -    -]
#12  Background short  Completed                   -   10808                 - [-   -    -]
#13  Background short  Completed                   -   10802                 - [-   -    -]
#14  Background long   Failed in segment -->       7   10799            180879 [0x3 0x5d 0x1]
#15  Background short  Completed                   -   10796                 - [-   -    -]
#16  Background short  Completed                   -   10790                 - [-   -    -]
#17  Background short  Completed                   -   10784                 - [-   -    -]
#18  Background short  Completed                   -   10778                 - [-   -    -]
#19  Background short  Completed                   -   10772                 - [-   -    -]
#20  Background short  Completed                   -   10766                 - [-   -    -]

Long (extended) Self Test duration: 34237 seconds [570.6 minutes]


I am not entirely sure what to make of it. Should I replace the drive? The health status says OK, but the uncorrected write errors and the "Failed in segment" entries make me think I should.
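For anyone who wants to check this mechanically, here is a minimal Python sketch (not part of FreeNAS or smartmontools) that pulls the read/write/verify rows out of the SAS "Error counter log" and flags any row whose last column, total uncorrected errors, is non-zero. It assumes the seven-column layout printed by smartctl 6.5 above; other smartctl versions or drive types may lay the table out differently, and the function names are made up for illustration.

```python
import re


def parse_error_counter_log(smart_text: str) -> dict:
    """Parse the SAS 'Error counter log' rows from `smartctl -a` text.

    Assumes (hypothetically) the 7-column layout shown above:
    ECC fast, ECC delayed, rereads/rewrites, total corrected,
    correction algorithm invocations, gigabytes processed, total uncorrected.
    """
    result = {}
    for line in smart_text.splitlines():
        m = re.match(r"\s*(read|write|verify):\s+(.*)$", line)
        if not m:
            continue
        fields = m.group(2).split()
        if len(fields) != 7:
            continue  # not the layout this sketch expects, so skip it
        result[m.group(1)] = {
            "ecc_fast": int(fields[0]),
            "ecc_delayed": int(fields[1]),
            "rereads_rewrites": int(fields[2]),
            "total_corrected": int(fields[3]),
            "algorithm_invocations": int(fields[4]),
            "gigabytes_processed": float(fields[5]),  # units of 10^9 bytes
            "total_uncorrected": int(fields[6]),
        }
    return result


def uncorrected_rows(counters: dict) -> dict:
    """Keep only the rows whose 'total uncorrected errors' column is non-zero."""
    return {row: c["total_uncorrected"] for row, c in counters.items()
            if c["total_uncorrected"] > 0}


# The three table rows copied from the output above.
sample = """\
read:          0        0         0         0     146707          0.400           0
write:         0      781         0       781      18774          0.185          67
verify:        0        0         0         0      48090          0.000           0
"""

counters = parse_error_counter_log(sample)
print(uncorrected_rows(counters))  # {'write': 67}
```

On this drive only the write row trips the check (67 uncorrected write errors); read and verify are clean.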
 
Joined
Oct 18, 2018
Messages
969
If I were you, I'd replace it. Better to do that before the drive dies completely. It looks like you may have truncated part of the output, though, and the missing part may show other errors on the disk.
 

Fredda

Guru
Joined
Jul 9, 2019
Messages
608
It looks like you may have truncated part of the results though, which may indicate other errors on the disk.
No, I don't think so. It's a SAS drive, and SMART output for SAS drives looks like that. SATA drives give much more detailed information.
 
Joined
Oct 18, 2018
Messages
969
No, I don't think so. It's a SAS drive, and SMART output for SAS drives looks like that. SATA drives give much more detailed information.
Ah, right. That's what I get for reading too quickly. Thanks for the correction!
 

urbansystems

Thanks everyone. One thing to note.

Under View Multipaths, /dev/da43 shows up as multipath/disk21.

Then, when I go into Volume Status, multipath/disk21p2 is listed in the spares/stripe section instead of under a raidz#-#. Does this mean the disk is a spare and not currently being used in the RAID? I have two other disks in the spares/stripe section. This one shows up as AVAILABLE instead of ONLINE. Attached is a screengrab. So is this disk a spare, or is it used in striping? I just want to know how severe the issue is.

[Screenshot attached: Screen Shot 2019-10-22 at 12.19.13 PM.png]
 

urbansystems

Two consecutive long tests failed; agreed, you should replace it.
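The failed long tests can also be picked out of the self-test log with a short script. This is a hypothetical Python sketch that assumes the SAS self-test log layout printed above. Its decode tables hold only the codes seen on this drive: to my knowledge, the T10 SPC tables define sense key 0x3 as MEDIUM ERROR and ASC/ASCQ 0x5D/0x01 as MEDIA FAILURE PREDICTION THRESHOLD EXCEEDED, but check the spec before extending the tables.

```python
import re

# Partial, hypothetical decode tables: only the codes in this drive's log.
SENSE_KEYS = {0x3: "MEDIUM ERROR"}
ASC_ASCQ = {(0x5D, 0x01): "MEDIA FAILURE PREDICTION THRESHOLD EXCEEDED"}

# Matches a failed row such as:
# "# 4  Background long   Failed in segment -->       7   10852            180879 [0x3 0x5d 0x1]"
FAILED_ROW = re.compile(
    r"#\s*(\d+)\s+Background\s+(short|long)\s+Failed in segment -->\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"\[(0x[0-9a-fA-F]+)\s+(0x[0-9a-fA-F]+)\s+(0x[0-9a-fA-F]+)\]"
)


def failed_self_tests(smart_text: str) -> list:
    """Return one dict per failed entry in a SAS `smartctl -a` self-test log."""
    failures = []
    for m in FAILED_ROW.finditer(smart_text):
        sk, asc, ascq = (int(x, 16) for x in m.group(6, 7, 8))
        failures.append({
            "num": int(m.group(1)),
            "test": m.group(2),
            "segment": int(m.group(3)),
            "hours": int(m.group(4)),
            "lba_first_err": int(m.group(5)),
            # Fall back to the raw hex codes when a code is not in the tables.
            "sense_key": SENSE_KEYS.get(sk, hex(sk)),
            "asc_ascq": ASC_ASCQ.get((asc, ascq), f"{asc:#04x}/{ascq:#04x}"),
        })
    return failures


# A few rows copied from the self-test log above.
sample = """\
# 3  Background short  Completed                   -   10856                 - [-   -    -]
# 4  Background long   Failed in segment -->       7   10852            180879 [0x3 0x5d 0x1]
#14  Background long   Failed in segment -->       7   10799            180879 [0x3 0x5d 0x1]
"""

for f in failed_self_tests(sample):
    print(f["num"], f["test"], f["hours"], f["lba_first_err"], f["sense_key"], f["asc_ascq"])
```

Both failed long tests stop in segment 7 at the same first-error LBA (180879).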

Under View Multipaths, /dev/da43 shows up as multipath/disk21.

Am I reading the Volume Status correctly? Is this drive a spare?

Since I have two other spares, should I not worry too much?
 
Joined
Oct 18, 2018
Messages
969
Then when I go into Volume Status, multipath/disk21p2 is listed in the spares / stripe section instead of a raidz#-#. Does this mean that the disk is a spare, and not currently being used with the RAID?
Correct, the disk is a spare currently.

Since I have two other spares, should I not worry too much?
You could lose every single spare and still have nothing to worry about. The spares aren't holding any data, so as long as all of your data disks are in good shape, you aren't at any risk of data loss. A spare is only used if one of the main data drives completely bites the dust. Hot spares are useful if you can't replace failed drives quickly. Lots of folks skip hot spares and rely on cold spares instead: drives that have been burned in and then removed from the system, ready to replace any drive that is failing or has failed completely. The best solution depends on your situation.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,995
Agreed, your drive (S/N N8GVM88Y) is faulty and should be replaced as soon as possible. Because it is a spare, the system could try to pull it into the pool if another drive fails before it's replaced.

Just some friendly advice: when you design a server as large as this one appears to be, make sure you fully understand everything about it. I suspect you inherited this server after someone set it up and then maybe left your company. Best of luck to you.
 

urbansystems

I did inherit it. Luckily, after a little research, I found it's a FreeNAS certified system that is still under support. I've opened a ticket and hope to hear back soon.
 
Joined
Oct 18, 2018
Messages
969
Glad to hear you got a system still under support. :)
 

zenon1823

Explorer
Joined
Nov 13, 2018
Messages
66
Being a spare explains a lot. The first thing that caught my eye was that this drive had over 10,000 hours on it but had only written about 200 MB and read 400 MB in its whole life. That had me scratching my head until I read the rest of the thread. Congrats on inheriting a system with support; it's nice when you inherit something that was set up and supported properly.
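Those figures come straight from the "Gigabytes processed" column of the error counter log, which smartctl labels in units of 10^9 bytes. A minimal Python sketch of the arithmetic; the power-on figure is taken from the latest self-test entry (10868 hours), so it is a lower bound on the drive's actual age.

```python
# Figures copied from the "Error counter log" above; the
# "Gigabytes processed" column is in units of 10^9 bytes.
GB = 10**9
MB = 10**6

read_gb = 0.400
write_gb = 0.185
power_on_hours = 10868  # lifetime hours at the most recent self-test entry

read_mb = read_gb * GB / MB
write_mb = write_gb * GB / MB
# Average amount written per power-on hour, in kilobytes (10^3 bytes).
write_per_hour_kb = write_gb * GB / power_on_hours / 1000

print(f"read ~{read_mb:.0f} MB, written ~{write_mb:.0f} MB in {power_on_hours} h")
print(f"~{write_per_hour_kb:.1f} kB written per power-on hour")
```

Roughly 17 kB written per hour is consistent with a disk that has been sitting idle as a hot spare.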
 