SOLVED Drive failed/died - replacement suggestions and resolving issue

thecoffeeguy · Feb 16, 2020

Hey Folks.
Well, had my first drive fail on me last night, which was a new experience for me.
Quick summary: I use my FreeNAS box primarily for development/testing and for storing and backing up VMs.
Lately I have been toying with running VMs off FreeNAS from my vCenter server via iSCSI. It seemed to work OK, I think, but I could be wrong.

Anyway, I was doing some work on one of my VMs and noticed it was abnormally slow. I did some digging around and saw the IO was bad (this VM is being run off FreeNAS).
I logged into FreeNAS and was greeted with a lovely warning about a degraded pool.
After some googling and digging around, it looks like a drive has died or is dying.

I am not a FreeNAS expert, but I love the product (grew up on FreeBSD).
My questions for the folks here are:

1.) Is there any way I can figure out what caused the drive to die? Could there be any relation to the VM causing issues on the storage? The VM itself was doing some database work.

2.) I ordered a couple of disks as replacements and plan to plug them in when I get them. My question is, should I just add them to the pool? This is where my lack of experience comes in.

3.) lastly, does it matter if the drives are the same size? spec? model? These are older 2 TB drives (probably from 7 years ago or so). Here is smartctl output:

Code:

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Green
Device Model:     WDC WD20EARX-00PASB0
Serial Number:    WD-WMAZA7232126
LU WWN Device Id: 5 0014ee 206c3ed23
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sun Feb 16 13:31:16 2020 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

That's about it. I'm not terribly worried, as this is not production, but I want to take steps to resolve the issue and make it better.

Much obliged for the help.

TCG

sretalla · Feb 17, 2020

thecoffeeguy said:
Here is smartctl output

How about smartctl -a ?

thecoffeeguy said:
lastly, does it matter if the drives are the same size? spec? model?

Size will matter... must be at least as big as the disk it's replacing... can be bigger. Other things not so important... speed is a consideration as the pool performance will be driven by its slowest member.
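
To make the size rule concrete, here is a minimal Python sketch that reads the `User Capacity` line from smartctl output and checks whether a candidate disk is at least as large as the one it replaces. The function names and the two candidate capacities are made up for illustration; only the 2 TB byte count comes from the output posted above. It compares raw capacities only and ignores any partitioning overhead.

```python
import re


def capacity_bytes(user_capacity_line: str) -> int:
    """Extract the byte count from a smartctl 'User Capacity' line."""
    match = re.search(r"User Capacity:\s*([\d,]+)\s*bytes", user_capacity_line)
    if match is None:
        raise ValueError("not a 'User Capacity' line")
    return int(match.group(1).replace(",", ""))


def can_replace(old_bytes: int, new_bytes: int) -> bool:
    """The replacement must be at least as big as the disk it replaces."""
    return new_bytes >= old_bytes


# Line copied from the smartctl output earlier in the thread.
old = capacity_bytes("User Capacity:    2,000,398,934,016 bytes [2.00 TB]")

# Hypothetical candidate capacities, for illustration only.
same_size = 2_000_398_934_016
slightly_smaller = 1_999_000_000_000

print(can_replace(old, same_size))         # True
print(can_replace(old, slightly_smaller))  # False
```

A bigger disk passes the same check, which matches the "can be bigger" part of the rule.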

thecoffeeguy said:
Could there be any relation to the VM causing issues on the storage?

No.

thecoffeeguy said:
I ordered a couple of disks as replacements and plan to plug them in when I get them. My question is, should I just add them to the pool?

No. Read the manual about replacing failed disks. https://www.ixsystems.com/documentation/freenas/11.3-RELEASE/storage.html#replacing-a-failed-disk

thecoffeeguy · Feb 17, 2020

sretalla said:
How about smartctl -a ?

Yea, I kinda blew that one. :/
Here is the output

Code:

smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Green
Device Model:     WDC WD20EARX-00PASB0
Serial Number:    WD-WMAZA7232126
LU WWN Device Id: 5 0014ee 206c3ed23
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Mon Feb 17 08:52:25 2020 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (35880) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 346) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       1
  3 Spin_Up_Time            0x0027   169   169   021    Pre-fail  Always       -       6533
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       47
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       2131
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       45
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       18
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       40
194 Temperature_Celsius     0x0022   114   111   000    Old_age   Always       -       36
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.

sretalla said:
Size will matter... must be at least as big as the disk it's replacing... can be bigger. Other things not so important... speed is a consideration as the pool performance will be driven by its slowest member.

OK. I thought that might be the case. I ordered the same size disks, but a bit faster. I ordered 3 (2 TB) drives.

sretalla said:
No.

No. Read the manual about replacing failed disks. https://www.ixsystems.com/documentation/freenas/11.3-RELEASE/storage.html#replacing-a-failed-disk

Got it. I will review that today.
Thanks for the help!

thecoffeeguy · Feb 19, 2020

Morning folks.
So, something interesting happened this morning as I was working to add 2 drives to my FreeNAS setup to fix the above issue.
I bought (2) 2TB WD Red drives to pop into my FreeNAS box. I brought down my FreeNAS box, installed the 2 new drives, and left the other 2 in there (short story on why I didn't pull the 'bad' one: I was in a hurry and forgot to mark which one it was.....).

Anyway, I put the drives in, started up FreeNAS, and logged in. To my surprise, the one pool I have is green now and no longer degraded. As I poked around, the dashboard said the disks were 'resilvered' (new to me), and here is the output from zpool status:

Code:

# zpool status -v
  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:02:27 with 0 errors on Thu Feb 13 03:47:27 2020
config:

        NAME        STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          da0p2     ONLINE       0     0     0

errors: No known data errors

  pool: vNAS
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Feb 19 08:02:17 2020
        50.2G scanned at 121M/s, 26.7G issued at 64.4M/s, 50.2G total
        26.7G resilvered, 53.25% done, 0 days 00:06:13 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        vNAS                                            ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/250afd56-1067-11ea-946d-001517be098a  ONLINE       0     0    25
            gptid/27240b69-1067-11ea-946d-001517be098a  ONLINE       0     0     0

errors: No known data errors

So color me confused, but....what happened? I thought the drive was dead or dying, but it appears not to be the case?

Second, I have 4 more TB to use. Should I add them to the pool? Add a spare? Kinda cool, but I'm looking for suggestions.
I am going to get some more coffee and come back and figure out what happened.

Thanks everyone.

TCG

SweetAndLow · Feb 19, 2020

You need to be running automated SMART tests on your drives. This will test your drives and email you if they are dying.

The read error on the drive you posted the smart output for tells me that drive is on its way out.
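
For anyone who wants to pick values like that out of a long smartctl -a dump, below is a rough Python sketch that parses the attribute table and lists watched attributes with a nonzero raw value. The set of watched IDs is an assumption chosen for illustration, not an official list, and raw values are vendor specific, so treat a hit as a prompt to run a proper self-test rather than a verdict. The sample rows are copied from the output posted above.

```python
# Attribute IDs to report when their raw value is nonzero. This particular
# set is an assumption for illustration, not an official list.
WATCHED_IDS = {1, 5, 197, 198, 199}

SMART_SAMPLE = """\
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       1
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       2131
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""


def parse_attributes(text):
    """Yield (id, name, raw_value) for each SMART attribute row."""
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        raw = fields[9]
        if raw.isdigit():
            yield int(fields[0]), fields[1], int(raw)


def flagged(text):
    """Return watched attributes whose raw value is not zero."""
    return [
        (name, raw)
        for attr_id, name, raw in parse_attributes(text)
        if attr_id in WATCHED_IDS and raw != 0
    ]


print(flagged(SMART_SAMPLE))  # [('Raw_Read_Error_Rate', 1)]
```

On the posted output the only hit is the single raw read error; the reallocated and pending sector counts are still zero.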

Before you use those new drives, you need to burn them in. This takes a couple of days. Never use an untested disk in your system.

thecoffeeguy · Feb 19, 2020

SweetAndLow said:
You need to be running automated SMART tests on your drives. This will test your drives and email you if they are dying.

The read error on the drive you posted the smart output for tells me that drive is on its way out.

Before you use those new drives, you need to burn them in. This takes a couple of days. Never use an untested disk in your system.

Fair enough. I can get started on that.

I will look into how to properly burn in the new drives.
Thanks.

sretalla · Feb 19, 2020

You're on borrowed time with those drives... resilver will happen to a bad drive which hides the risk, but then your next scrub will find it out and the disk will be kicked out of the pool again. Replace the disks as you plan and sleep better.
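
The zpool status output posted earlier already shows which disk to keep an eye on: one mirror member has 25 in the CKSUM column. As a minimal sketch, the Python below scans the config table for devices with nonzero error counters. The function name is made up, the sample rows are copied from the thread, and the parser assumes plain integer counters (it skips rows it cannot read that way).

```python
ZPOOL_SAMPLE = """\
        NAME                                            STATE     READ WRITE CKSUM
        vNAS                                            ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/250afd56-1067-11ea-946d-001517be098a  ONLINE       0     0    25
            gptid/27240b69-1067-11ea-946d-001517be098a  ONLINE       0     0     0
"""


def devices_with_errors(status_text):
    """Return (name, read, write, cksum) for rows with any nonzero counter."""
    rows = []
    for line in status_text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue
        name, counters = fields[0], fields[2:5]
        if not all(c.isdigit() for c in counters):
            continue  # header row, or counters not shown as plain integers
        read, write, cksum = map(int, counters)
        if read or write or cksum:
            rows.append((name, read, write, cksum))
    return rows


for name, read, write, cksum in devices_with_errors(ZPOOL_SAMPLE):
    print(f"{name}: READ={read} WRITE={write} CKSUM={cksum}")
# gptid/250afd56-1067-11ea-946d-001517be098a: READ=0 WRITE=0 CKSUM=25
```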

SweetAndLow · Feb 19, 2020

sretalla said:
You're on borrowed time with those drives... resilver will happen to a bad drive which hides the risk, but then your next scrub will find it out and the disk will be kicked out of the pool again. Replace the disks as you plan and sleep better.

That's not how it works. The resilver happened because there were changes that happened while the disk was degraded. Also, a scrub and a resilver are basically the same thing; a scrub just doesn't write anything.

Only way to know status of disks is to run a smart long test.
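
As a rough sketch of what that looks like from a script, the Python below builds the two smartctl invocations involved: `-t long` starts the extended self-test and `-l selftest` shows the log where the result lands. The device name `/dev/ada1` and the helper names are hypothetical; the 346 minute figure is the extended self-test polling time from the smartctl output posted earlier. Nothing is executed unless `run()` is called, which would need root and smartmontools installed.

```python
import subprocess


def long_test_command(device):
    """argv that asks the drive to start an extended (long) self-test."""
    return ["smartctl", "-t", "long", device]


def selftest_log_command(device):
    """argv that prints the drive's self-test log, where results appear."""
    return ["smartctl", "-l", "selftest", device]


def run(argv):
    """Run a command and return its stdout (not called in this sketch)."""
    result = subprocess.run(argv, capture_output=True, text=True, check=False)
    return result.stdout


def as_hours_minutes(minutes):
    """Split a duration in minutes into (hours, minutes)."""
    return divmod(minutes, 60)


DEVICE = "/dev/ada1"  # hypothetical device name, adjust to your system

print(" ".join(long_test_command(DEVICE)))     # smartctl -t long /dev/ada1
print(" ".join(selftest_log_command(DEVICE)))  # smartctl -l selftest /dev/ada1
print(as_hours_minutes(346))                   # (5, 46)
```

So on this particular drive a long test should be expected to take close to six hours before the log is worth checking.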

sretalla · Feb 20, 2020

Although I wrote scrub, I meant to say long SMART test...

In that case specifically, I don't agree that a scrub and resilver are the same as a full scrub takes hours and those resilvers are done in minutes... a clear difference. (I don't dispute that the resilver process uses code that's shared between scrub and resilver, but running one is not identical to running the other with read/write the only difference)
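
The "done in minutes" point can be checked against the zpool status output posted earlier with a little arithmetic. This is only a sketch: it assumes the G and M suffixes are binary units (1 G = 1024 M) and uses the rounded figures as printed, so small differences from the reported 53.25% are expected.

```python
MIB_PER_GIB = 1024  # assumes zpool's G and M suffixes are binary units

# Figures copied from the "scan:" section of the posted zpool status output.
total_gib = 50.2
issued_gib = 26.7
issue_rate_mib_s = 64.4

percent_done = 100 * issued_gib / total_gib
remaining_s = (total_gib - issued_gib) * MIB_PER_GIB / issue_rate_mib_s
minutes, seconds = divmod(int(remaining_s), 60)

print(f"{percent_done:.1f}% done")               # 53.2% done
print(f"about {minutes} min {seconds} s to go")  # about 6 min 13 s to go
```

That lines up with the "0 days 00:06:13 to go" in the output: only about 50G had to be walked, which is why this resilver finished in minutes.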
