SOLVED Pool degraded, one drive dropped and reappeared again and error during bi-weekly scrub

Mannekino · Feb 17, 2019

Yesterday during a reboot of my server I saw some error messages when looking at the IPMI console. After FreeNAS was done booting I saw that my primary data pool was degraded and one drive was missing. After a few minutes the drive reappeared and the pool resilvered in a few seconds and it was healthy again.

Last night at 1:00 in the morning my bi-weekly scrub of this pool started and around 3:34 I got an e-mail alert the pool was degraded again. I got this in the mail

Code:

New alerts:
* The volume data state is DEGRADED: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.

The scrub just finished after about 6.5 hours. After the scrub was done it didn’t resilver like it did earlier yesterday, it was still degraded. I took this screenshot during the scrub (when it was at about 98% done). After the scrub was done I turned off the server and checked all the cables if they were properly fitted. Everything was in order, all the cables were plugged in all the way. When the server was back online the pool was healthy again.

Screenshot during scrub

After shutting down, checking cables and turning back on again

Pool status right now

Code:

 root@freenas[~]# zpool status
  pool: data
state: ONLINE
  scan: scrub repaired 356K in 0 days 06:32:00 with 0 errors on Mon Feb 18 07:32:01 2019
config:

        NAME                                            STATE     READ WRITE CKSUM
        data                                            ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/a18b468c-b464-11e8-9fd5-0025907457e1  ONLINE       0     0     0
            gptid/a2b7dae8-b464-11e8-9fd5-0025907457e1  ONLINE       0     0     0
            gptid/a3f67a48-b464-11e8-9fd5-0025907457e1  ONLINE       0     0     0
            gptid/a52943ef-b464-11e8-9fd5-0025907457e1  ONLINE       0     0     0

How should I proceed in troubleshooting this issue? I’ve been running bi-weekly scrubs of this pool and daily SMART short self-test. Also I would like to point to this post where I added an exhaust fan for my SAS HBA and applied the common old school 7 V mod to the fan to reduce the noise. One of the first things I did was to disconnect the exhaust fan this morning. I already ordered a simple fan controller for this to solve it properly. Also I plugged in the SATA cable supplying power to my drives to another output a couple of days ago when I attached the Molex power connectors for the new exhaust fan. I don’t know if any of this is relevant but I mention it anyway.

I’ve ordered a new drive just to be sure and I’m going to pick it up later today.

Mannekino · Feb 18, 2019

So it's been over 2 hours since I shutdown the server and turned it back on. No new errors. I've got the replacement drive.

I *think* I initiated a long self-test via the GUI. Is there any way for me to see how far along the self-test is? I've looked at the smartctl commands but so far can't figure out if there is a way to check.

I wasn't looking properly. I can see with smartctl -a that the long self-test is currently running on all 4 drives and they all show the same status and progress.

Code:

Self-test execution status:      ( 246) Self-test routine in progress...
                                        60% of test remaining.

When building my NAS I wrote down the serial numbers and the position they are in. I figured out what disk was degraded earlier. Should I just replace the disk to be sure? Should I wait until the long self-test is done?

Johnny Fartpants · Feb 18, 2019

Can you share the SMART stats of all the drives?

If you run smartctl -a /dev/da3 then that should show you progress of the long SMART tests you've set off.

Mannekino · Feb 18, 2019

Yeah, I just figured that out, I updated my previous comment. Thanks! I will post the stats of all the drives when the long self-test is done. Should be in 1 hour or so I think. I'm kinda leaning towards maybe an issue with the SAS HBA of SATA breakout cable? The drives are new (I bought them last year august) and the rest of my NAS is built from second hand parts.

Chris Moore · Feb 18, 2019

Mannekino said:
When building my NAS I wrote down the serial numbers and the position they are in. I figured out what disk was degraded earlier. Should I just replace the disk to be sure? Should I wait until the long self-test is done?

I would wait until the test is finished but with the way this drive is dropping out and then comes back, it might be a cabling issue. One of the users here had a difficult time tracking down a drive fault that ended up being a bad SATA power connection.

Mannekino · Feb 18, 2019

Well the long self-test is done and everything is in order. Any thoughts on how to proceed, should I just wait and see if it happens again? I got the spare drive anyway. Is the screenshot below enough to see, or do you guys need the complete output of smartctl -a /dev/da3?

Chris Moore · Feb 18, 2019

Mannekino said:
Is the screenshot below enough to see, or do you guys need the complete output of smartctl -a /dev/da3?

Sometimes there are details that help figure things out when you look at the full details, but what I would want to see is the output of
smartctl -x /dev/da3 but not from all the drives, just the one that blinked out on you.

Mannekino · Feb 18, 2019

Here it is, this is the drive that got degraded.

Code:

root@freenas[~]# smartctl -x /dev/da3
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD40EFRX-68N32N0
Serial Number:    WD-WCC7K1SHEXSP
LU WWN Device Id: 5 0014ee 2652c9583
Firmware Version: 82.00A82
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Feb 18 19:43:56 2019 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (43380) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 461) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x303d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    0
  3 Spin_Up_Time            POS--K   165   165   021    -    6725
  4 Start_Stop_Count        -O--CK   100   100   000    -    39
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
  7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
  9 Power_On_Hours          -O--CK   096   096   000    -    2931
 10 Spin_Retry_Count        -O--CK   100   253   000    -    0
 11 Calibration_Retry_Count -O--CK   100   253   000    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    39
192 Power-Off_Retract_Count -O--CK   200   200   000    -    37
193 Load_Cycle_Count        -O--CK   200   200   000    -    23
194 Temperature_Celsius     -O---K   119   109   000    -    31
196 Reallocated_Event_Count -O--CK   200   200   000    -    0
197 Current_Pending_Sector  -O--CK   200   200   000    -    0
198 Offline_Uncorrectable   ----CK   100   253   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
200 Multi_Zone_Error_Rate   ---R--   200   200   000    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      5  Comprehensive SMART error log
0x03       GPL     R/O      6  Ext. Comprehensive SMART error log
0x04       GPL,SL  R/O      8  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa0-0xa7  GPL,SL  VS      16  Device vendor specific log
0xa8-0xb6  GPL,SL  VS       1  Device vendor specific log
0xb7       GPL,SL  VS      56  Device vendor specific log
0xbd       GPL,SL  VS       1  Device vendor specific log
0xc0       GPL,SL  VS       1  Device vendor specific log
0xc1       GPL     VS      93  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      2929         -
# 2  Short offline       Completed without error       00%      2911         -
# 3  Short offline       Completed without error       00%      2887         -
# 4  Short offline       Completed without error       00%      2863         -
# 5  Short offline       Completed without error       00%      2839         -
# 6  Short offline       Completed without error       00%      2816         -
# 7  Short offline       Completed without error       00%      2792         -
# 8  Short offline       Completed without error       00%      2768         -
# 9  Short offline       Completed without error       00%      2744         -
#10  Short offline       Completed without error       00%      2721         -
#11  Short offline       Completed without error       00%      2696         -
#12  Short offline       Completed without error       00%      2652         -
#13  Short offline       Completed without error       00%      2628         -
#14  Short offline       Completed without error       00%      2604         -
#15  Short offline       Completed without error       00%      2580         -
#16  Short offline       Completed without error       00%      2556         -
#17  Short offline       Completed without error       00%      2532         -
#18  Short offline       Completed without error       00%      2508         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       258 (0x0102)
SCT Support Level:                   1
Device State:                        Active (0)
Current Temperature:                    31 Celsius
Power Cycle Min/Max Temperature:     31/34 Celsius
Lifetime    Min/Max Temperature:     19/41 Celsius
Under/Over Temperature Limit Count:   0/0
Vendor specific:
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/65 Celsius
Min/Max Temperature Limit:           -41/85 Celsius
Temperature History Size (Index):    478 (419)

Index    Estimated Time   Temperature Celsius
420    2019-02-18 11:46    32  *************
...    ..( 29 skipped).    ..  *************
450    2019-02-18 12:16    32  *************
451    2019-02-18 12:17    31  ************
...    ..( 67 skipped).    ..  ************
  41    2019-02-18 13:25    31  ************
  42    2019-02-18 13:26    34  ***************
...    ..( 77 skipped).    ..  ***************
120    2019-02-18 14:44    34  ***************
121    2019-02-18 14:45    33  **************
...    ..( 14 skipped).    ..  **************
136    2019-02-18 15:00    33  **************
137    2019-02-18 15:01    32  *************
...    ..(116 skipped).    ..  *************
254    2019-02-18 16:58    32  *************
255    2019-02-18 16:59    33  **************
...    ..(139 skipped).    ..  **************
395    2019-02-18 19:19    33  **************
396    2019-02-18 19:20    34  ***************
...    ..(  6 skipped).    ..  ***************
403    2019-02-18 19:27    34  ***************
404    2019-02-18 19:28    33  **************
...    ..(  7 skipped).    ..  **************
412    2019-02-18 19:36    33  **************
413    2019-02-18 19:37    32  *************
...    ..(  5 skipped).    ..  *************
419    2019-02-18 19:43    32  *************

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 1) ==
0x01  0x008  4              39  ---  Lifetime Power-On Resets
0x01  0x010  4            2931  ---  Power-on Hours
0x01  0x018  6     10310640191  ---  Logical Sectors Written
0x01  0x020  6        91363173  ---  Number of Write Commands
0x01  0x028  6     57052367713  ---  Logical Sectors Read
0x01  0x030  6       346760981  ---  Number of Read Commands
0x01  0x038  6      1961665408  ---  Date and Time TimeStamp
0x03  =====  =               =  ===  == Rotating Media Statistics (rev 1) ==
0x03  0x008  4            2915  ---  Spindle Motor Power-on Hours
0x03  0x010  4            2912  ---  Head Flying Hours
0x03  0x018  4              61  ---  Head Load Events
0x03  0x020  4               0  ---  Number of Reallocated Logical Sectors
0x03  0x028  4               6  ---  Read Recovery Attempts
0x03  0x030  4               0  ---  Number of Mechanical Start Failures
0x03  0x038  4               0  ---  Number of Realloc. Candidate Logical Sectors
0x03  0x040  4              37  ---  Number of High Priority Unload Events
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4               0  ---  Resets Between Cmd Acceptance and Completion
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              31  ---  Current Temperature
0x05  0x010  1              31  ---  Average Short Term Temperature
0x05  0x018  1              33  ---  Average Long Term Temperature
0x05  0x020  1              41  ---  Highest Temperature
0x05  0x028  1              22  ---  Lowest Temperature
0x05  0x030  1              37  ---  Highest Average Short Term Temperature
0x05  0x038  1              27  ---  Lowest Average Short Term Temperature
0x05  0x040  1              36  ---  Highest Average Long Term Temperature
0x05  0x048  1              33  ---  Lowest Average Long Term Temperature
0x05  0x050  4               0  ---  Time in Over-Temperature
0x05  0x058  1              65  ---  Specified Maximum Operating Temperature
0x05  0x060  4               0  ---  Time in Under-Temperature
0x05  0x068  1               0  ---  Specified Minimum Operating Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4           14214  ---  Number of Hardware Resets
0x06  0x010  4              61  ---  Number of ASR Events
0x06  0x018  4               0  ---  Number of Interface CRC Errors
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2            1  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            9  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000d  2            0  Non-CRC errors within host-to-device FIS
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x8000  4        43363  Vendor specific

Chris Moore · Feb 18, 2019

Mannekino said:
Here it is, this is the drive that got degraded.

I don't see any clues. Does it only happen when the drive is under a workload?

Mannekino · Feb 18, 2019

So far it happened two times past weekend. But it all happened after I added the PCI exhaust fan I talked about in the other thread. I since then disconnected it and waiting for a simple fan controller.

First time was when I rebooted the server and I saw some messages in the console that something couldn't be read. After the server booted the disk was missing but it reappeared after a couple minutes.

Second time was last night about halfway into the bi-weekly scrub.

I've been working on my NAS today all day checking stuff and configuring some more plugins and I ran the long self-tests but nothing super disk intensive I would say. Nothing happened for the past 13 hours or so.

Mannekino · Feb 18, 2019

It just happened again. I rebooted my NAS and now the pool is degraded again. Any thoughts on how to proceed?

And it's back again after a minute or so, also named /dev/da5 now.

This is in the console

Code:

Feb 18 23:03:12 freenas.fritz.box mps0: SAS Address for SATA device = 4f636454cbb8da90
Feb 18 23:03:12 freenas.fritz.box mps0: SAS Address from SATA device = 4f636454cbb8da90
Feb 18 23:03:12 freenas.fritz.box da5 at mps0 bus 0 scbus0 target 3 lun 0
Feb 18 23:03:12 freenas.fritz.box da5: <ATA WDC WD40EFRX-68N 0A82> Fixed Direct Access SPC-4 SCSI device
Feb 18 23:03:12 freenas.fritz.box da5: Serial Number WD-WCC7K1SHEXSP
Feb 18 23:03:12 freenas.fritz.box da5: 600.000MB/s transfers
Feb 18 23:03:12 freenas.fritz.box da5: Command Queueing enabled
Feb 18 23:03:12 freenas.fritz.box da5: 3815447MB (7814037168 512 byte sectors)
Feb 18 23:03:12 freenas.fritz.box da5: quirks=0x8<4K>
Feb 18 23:03:12 freenas.fritz.box ZFS: vdev state changed, pool_guid=11791038853189070155 vdev_guid=14250359306952643538
Feb 18 23:03:12 freenas.fritz.box ZFS: vdev state changed, pool_guid=11791038853189070155 vdev_guid=14949667362364330100
Feb 18 23:03:12 freenas.fritz.box ZFS: vdev state changed, pool_guid=11791038853189070155 vdev_guid=2841930983946545696
Feb 18 23:03:12 freenas.fritz.box ZFS: vdev state changed, pool_guid=11791038853189070155 vdev_guid=5356569320816614764
undefined

Chris Moore · Feb 18, 2019

Mannekino said:
And it's back again after a minute or so, also named /dev/da5 now.

This is behaving as if it is either loosing connectivity or loosing power and I am leaning more toward loosing power. Usually, if the drive data cable is bad, you will see some CRC errors but we are not seeing that in the SMART results. I would first try connecting the drive to a different power connection, just as a test, and part of the reason I am thinking it might be the power connector is because of the fact it happens under load, when the drive is pulling more current. After changing the connection, run another scrub, as a test, to see if the drive drops out again.
It could be there is something internal to the drive, but I hate to suggest sending it back when it is passing diagnostic tests.
Please be patient, finding an intermittent fault is one of the most difficult things there is.

Mannekino · Feb 19, 2019

Thanks for your reply. I decided this morning to replace the disk with the spare I bought yesterday. The re-silvering process is currently running. If this completes successfully I want to wait a bit before trying anything else, and to see if the problem still exists of course. I also contacted Seasonic on how I can order some additional PSU cables and I've been looking into where I can get new SAS to SATA breakout cables should the need arise. I have a spare PSU brand new still in the box a Cooler Master V Series V550 I could try also.

If the problem goes away after the re-silvering I will send the drive I took out back to the supplier I bought it from. It is still under warranty, I bought the drives last year late august.

But yes, dealing with such intermittent failures is a bit frustrating.

Mannekino · Feb 19, 2019

The re-silver proces completed successfully. I've since then rebooted the system a bunch of times because I also tried upgrading to U2 but ended up going back to U1 due to some bugs. Everything is working well so far.

Mannekino · Feb 22, 2019

@Chris Moore so I replaced the drive on Monday and did a re-silver. Haven't had any problems pop up this week until now. I just rebooted my FreeNAS server and after the reboot the data pool was degraded again, and it was also /dev/da3 the replaced drive.

On Tuesday I also ordered two new SATA power cables from Seasonic and I'm about to replace that. If this doesn't work out you got any other suggestions to further troubleshoot?

Redcoat · Feb 22, 2019

Furthr troubleshooting suggestion: Try swapping the mobo port to which the data cable is connected and see if the issue occurs with drive connected to the original port or moves with a currently affected drive.

Mannekino · Feb 22, 2019

@Redcoat I have a Dell PERC H200 controller with a SAS to SATA breakout cable. So the drive isn't plugged into a SATA port on the motherboard.

Redcoat · Feb 22, 2019

I should have looked back ...

Mannekino · Feb 22, 2019

OK, I replaced the power cable and the pool is healthy again but I do see an error. Should I run a scrub tonight when I sleep?

Or should I run zpool clear?

Mannekino · Feb 22, 2019

I'm getting a bit more worried and not sure how to proceed. The scrub is almost done and I got another e-mail around 03:00 (I started the scrub at 00:20 and it's 06:00 now. The error count has gone up.

So far I replaced the drive and replaced the power cable. Should I be looking at a possible failure of the datacable, HBA or PSU? The drives and PSU were the only new parts I bought last year. The rest was secondhand. The PSU is a Seasonic FOCUS PLUS 550 GOLD.

Can I detach that drive from the HBA and plug it directly into one of the SATA ports of the motherboard? That should still work for the pool right?

If the problem goes away then it should either be the data cable or HBA.

Important Announcement for the TrueNAS Community.

SOLVED Pool degraded, one drive dropped and reappeared again and error during bi-weekly scrub

Patron

Patron

Guru

Patron

Hall of Famer

Patron

Hall of Famer

Patron

Hall of Famer

Patron

Patron

Hall of Famer

Patron

Patron

Patron

MVP

Patron

MVP

Patron

Patron

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Pool degraded, one drive dropped and reappeared again and error during bi-weekly scrub"

Similar threads