ZFS mirror continually disengages one drive and resilvers it

hunter · Jul 16, 2016

After months of having a ZFS mirror of two drives work fine, for the last week my pool has continually disengaged one of the two drives (the same one each time) and shortly afterwards begun resilvering it. The resilver completes. Then in 1-2 days the same thing happens again. Some of the log entries I found are below. I looked at the SMART report for the drive that keeps getting disengaged and did not see errors in the stored SMART runs.

Can anyone suggest how to solve this problem? The drive that keeps disengaging is a 6 TB drive. I don't have a spare of that size, and I can't tell if anything is wrong with the drive.

Jul 16 06:41:33 freenas smartd[2489]: Device: /dev/ada0, Temperature 51 Celsius reached critical limit of 41 Celsius (Min/Max 48/54)
Jul 16 13:41:33 freenas smartd[2489]: Device: /dev/ada1, Temperature 50 Celsius reached critical limit of 41 Celsius (Min/Max 45/51)

Jul 16 13:54:14 freenas ada1 at ahcich4 bus 0 scbus4 target 0 lun 0
Jul 16 13:54:14 freenas ada1: <HGST HDN726060ALE610 APGNT517> s/n NCGTRHJS detached
Jul 16 13:54:14 freenas devd: Executing '[ -e /tmp/.sync_disk_done ] && LD_LIBRARY_PATH=/usr/local/lib /usr/local/bin/python /usr/local/www/freenasUI/tools/sync_disks.py && LD_LIBRARY_PATH=/usr/local/lib /usr/local/bin/python /usr/local/www/freenasUI/tools/smart_alert.py -d ada1'
Jul 16 13:54:14 freenas (ada1:ahcich4:0:0:0): Periph destroyed
Jul 16 13:54:15 freenas devd: Executing 'logger -p kern.notice -t ZFS 'vdev is removed, pool_guid=17897178385600871224 vdev_guid=2516134913956804389''
Jul 16 13:54:15 freenas ZFS: vdev is removed, pool_guid=17897178385600871224 vdev_guid=2516134913956804389
Jul 16 13:54:26 freenas ada1 at ahcich4 bus 0 scbus4 target 0 lun 0
Jul 16 13:54:26 freenas ada1: <HGST HDN726060ALE610 APGNT517> ACS-2 ATA SATA 3.x device
Jul 16 13:54:26 freenas ada1: Serial Number NCGTRHJS
Jul 16 13:54:26 freenas ada1: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
Jul 16 13:54:26 freenas ada1: Command Queueing enabled
Jul 16 13:54:26 freenas ada1: 5723166MB (11721045168 512 byte sectors)
Jul 16 13:54:26 freenas ada1: Previously was known as ad12
Jul 16 13:54:26 freenas devd: Executing '[ -e /tmp/.sync_disk_done ] && LD_LIBRARY_PATH=/usr/local/lib /usr/local/bin/python /usr/local/www/freenasUI/tools/sync_disks.py ada1'
Jul 16 13:54:26 freenas devd: Executing 'logger -p kern.notice -t ZFS 'vdev state changed, pool_guid=17897178385600871224 vdev_guid=2516134913956804389''
Jul 16 13:54:26 freenas ZFS: vdev state changed, pool_guid=17897178385600871224 vdev_guid=2516134913956804389
Jul 16 14:11:33 freenas smartd[2489]: Device: /dev/ada0, Temperature 51 Celsius reached critical limit of 41 Celsius (Min/Max 48/54)
Jul 16 14:11:33 freenas smartd[2489]: Device: /dev/ada0, Temperature 51 Celsius reached critical limit of 41 Celsius (Min/Max 48/54)
Jul 16 14:11:33 freenas smartd[2489]: Device: /dev/ada1, Temperature 50 Celsius reached critical limit of 41 Celsius (Min/Max 45/51)
Jul 16 14:16:42 freenas notifier: shutdown: [pid 48089]
Jul 16 14:16:42 freenas notifier: Shutdown NOW!
Jul 16 14:16:42 freenas shutdown: reboot by root:
Jul 16 14:16:42 freenas notifier: Shutdown NOW!
Jul 16 14:16:42 freenas notifier:
Jul 16 14:16:42 freenas notifier: System shutdown time has arrived^G^G
Jul 16 14:19:30 freenas VT(vga): resolution 640x480
Jul 16 14:19:30 freenas CPU: Intel(R) Xeon(R) CPU E3-1230 V2 @ 3.30GHz (3300.10-MHz K8-class CPU)
Jul 16 14:19:30 freenas Origin="GenuineIntel" Id=0x306a9 Family=0x6 Model=0x3a Stepping=9
Jul 16 14:19:30 freenas Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
Jul 16 14:19:30 freenas Features2=0x7fbae3ff<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND>
Jul 16 14:19:30 freenas AMD Features=0x28100800<SYSCALL,NX,RDTSCP,LM>
Jul 16 14:19:30 freenas AMD Features2=0x1<LAHF>
Jul 16 14:19:30 freenas Structured Extended Features=0x281<FSGSBASE,SMEP,ERMS>
Jul 16 14:19:30 freenas XSAVE Features=0x1<XSAVEOPT>
Jul 16 14:19:30 freenas VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID
Jul 16 14:19:30 freenas TSC: P-state invariant, performance statistics
Jul 16 14:19:30 freenas real memory = 17985175552 (17152 MB)
Jul 16 14:19:30 freenas avail memory = 16557699072 (15790 MB)
Jul 16 14:19:30 freenas Event timer "LAPIC" quality 600
Jul 16 14:19:30 freenas ACPI APIC Table: <SUPERM SMCI--MB>
Jul 16 14:19:30 freenas FreeBSD/SMP: Multiprocessor System Detected: 8 CPUs
Jul 16 14:19:30 freenas FreeBSD/SMP: 1 package(s) x 4 core(s) x 2 SMT threads

Jul 16 14:19:30 freenas Timecounter "HPET" frequency 14318180 Hz quality 950
Jul 16 14:19:30 freenas Event timer "HPET" frequency 14318180 Hz quality 550

Jul 16 14:19:30 freenas ada0: Previously was known as ad8
Jul 16 14:19:30 freenas ada1 at ahcich4 bus 0 scbus4 target 0 lun 0
Jul 16 14:19:30 freenas ada1: <HGST HDN726060ALE610 APGNT517> ACS-2 ATA SATA 3.x device
Jul 16 14:19:30 freenas ada1: Serial Number NCGTRHJS
Jul 16 14:19:30 freenas ada1: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
Jul 16 14:19:30 freenas ada1: Command Queueing enabled
Jul 16 14:19:30 freenas ada1: 5723166MB (11721045168 512 byte sectors)
Jul 16 14:19:30 freenas ada1: Previously was known as ad12
Jul 16 14:19:30 freenas ada2 at ahcich5 bus 0 scbus5 target 0 lun 0
Jul 16 14:19:30 freenas ada2: <WDC WD30EFRX-68EUZN0 80.00A80> ACS-2 ATA SATA 3.x device
Jul 16 14:19:30 freenas ada2: Serial Number WD-WMC4N2323512
Jul 16 14:19:30 freenas ada2: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
Jul 16 14:19:30 freenas ada2: Command Queueing enabled
Jul 16 14:19:30 freenas ada2: 2861588MB (5860533168 512 byte sectors)
Jul 16 14:19:30 freenas ada2: quirks=0x1<4K>
Jul 16 14:19:30 freenas ada2: Previously was known as ad14

Jul 16 14:49:38 freenas smartd[2490]: Device: /dev/ada0, Temperature 50 Celsius reached critical limit of 41 Celsius (Min/Max 50/50)
Jul 16 14:49:38 freenas smartd[2490]: Device: /dev/ada0, Temperature 50 Celsius reached critical limit of 41 Celsius (Min/Max 50/50)
Jul 16 14:49:38 freenas smartd[2490]: Device: /dev/ada1, Temperature 49 Celsius reached critical limit of 41 Celsius (Min/Max 49/49)
Jul 16 15:19:38 freenas smartd[2490]: Device: /dev/ada0, Temperature 50 Celsius reached critical limit of 41 Celsius (Min/Max 50/50)
Jul 16 16:19:38 freenas smartd[2490]: Device: /dev/ada0, Temperature 50 Celsius reached critical limit of 41 Celsius (Min/Max 50/50)
Jul 16 16:19:38 freenas smartd[2490]: Device: /dev/ada0, Temperature 50 Celsius reached critical limit of 41 Celsius (Min/Max 50/50)
Jul 16 16:19:38 freenas smartd[2490]: Device: /dev/ada1, Temperature 49 Celsius reached critical limit of 41 Celsius (Min/Max 49/49)
Jul 16 16:42:17 freenas notifier: Stopping smartd.
Jul 16 16:42:17 freenas notifier: Waiting for PIDS: 2490.
Jul 16 16:42:17 freenas notifier: smartd not running? (check /var/run/smartd.pid).
Jul 16 16:42:17 freenas notifier: Starting smartd.

Jul 16 18:59:57 freenas ada1 at ahcich4 bus 0 scbus4 target 0 lun 0
Jul 16 18:59:57 freenas ada1: <HGST HDN726060ALE610 APGNT517> s/n NCGTRHJS detached
Jul 16 18:59:57 freenas devd: Executing '[ -e /tmp/.sync_disk_done ] && LD_LIBRARY_PATH=/usr/local/lib /usr/local/bin/python /usr/local/www/freenasUI/tools/sync_disks.py && LD_LIBRARY_PATH=/usr/local/lib /usr/local/bin/python /usr/local/www/freenasUI/tools/smart_alert.py -d ada1'
Jul 16 18:59:57 freenas GEOM_ELI: Device ada1p1.eli destroyed.
Jul 16 18:59:57 freenas GEOM_ELI: Detached ada1p1.eli on last close.
Jul 16 18:59:57 freenas (ada1:ahcich4:0:0:0): Periph destroyed
Jul 16 18:59:58 freenas devd: Executing 'logger -p kern.notice -t ZFS 'vdev is removed, pool_guid=17897178385600871224 vdev_guid=2516134913956804389''
Jul 16 18:59:58 freenas ZFS: vdev is removed, pool_guid=17897178385600871224 vdev_guid=2516134913956804389
Jul 16 19:00:09 freenas ada1 at ahcich4 bus 0 scbus4 target 0 lun 0
Jul 16 19:00:09 freenas ada1: <HGST HDN726060ALE610 APGNT517> ACS-2 ATA SATA 3.x device
Jul 16 19:00:09 freenas ada1: Serial Number NCGTRHJS
Jul 16 19:00:09 freenas ada1: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
Jul 16 19:00:09 freenas ada1: Command Queueing enabled
Jul 16 19:00:09 freenas ada1: 5723166MB (11721045168 512 byte sectors)
Jul 16 19:00:09 freenas ada1: Previously was known as ad12
Jul 16 19:00:09 freenas devd: Executing '[ -e /tmp/.sync_disk_done ] && LD_LIBRARY_PATH=/usr/local/lib /usr/local/bin/python /usr/local/www/freenasUI/tools/sync_disks.py ada1'
Jul 16 19:00:10 freenas devd: Executing 'logger -p kern.notice -t ZFS 'vdev state changed, pool_guid=17897178385600871224 vdev_guid=2516134913956804389''
Jul 16 19:00:10 freenas ZFS: vdev state changed, pool_guid=17897178385600871224 vdev_guid=2516134913956804389
Jul 16 20:16:13 freenas notifier: Stopping smartd.
Jul 16 20:16:13 freenas notifier: Waiting for PIDS: 29537.
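
For anyone wanting to pull the drop-outs out of a longer log, here is a minimal Python sketch. It pairs each kernel "detached" notice with the next probe line for the same device and reports how long the disk was gone. The regular expressions are assumptions based only on the line formats shown in the log above, and the year is assumed because syslog timestamps do not carry one. The function names are made up for illustration.

```python
import re
from datetime import datetime

# Kernel notice logged when a disk drops off the bus, e.g.
# "... ada1: <HGST ...> s/n NCGTRHJS detached"
DETACH_RE = re.compile(
    r"^(\w{3} +\d+ \d\d:\d\d:\d\d) \S+ (ada\d+): <.*> s/n (\S+) detached"
)
# Probe line logged when a disk is (re)discovered, e.g.
# "... ada1: <HGST ...> ACS-2 ATA SATA 3.x device"
ATTACH_RE = re.compile(
    r"^(\w{3} +\d+ \d\d:\d\d:\d\d) \S+ (ada\d+): <.*> ACS-\d+ ATA SATA"
)


def parse_time(stamp, year=2016):
    """Parse a syslog timestamp (English month names); the year is assumed."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def find_dropouts(lines):
    """Return (device, serial, detached_at, seconds_until_reattach) tuples."""
    pending = {}  # device -> (serial, time of detach)
    dropouts = []
    for line in lines:
        m = DETACH_RE.match(line)
        if m:
            pending[m.group(2)] = (m.group(3), parse_time(m.group(1)))
            continue
        m = ATTACH_RE.match(line)
        # A probe line with no preceding detach is treated as a boot-time probe.
        if m and m.group(2) in pending:
            serial, start = pending.pop(m.group(2))
            gap = (parse_time(m.group(1)) - start).total_seconds()
            dropouts.append((m.group(2), serial, start, gap))
    return dropouts


# A few lines copied from the log above.
SAMPLE_LOG = [
    "Jul 16 13:54:14 freenas ada1: <HGST HDN726060ALE610 APGNT517> s/n NCGTRHJS detached",
    "Jul 16 13:54:26 freenas ada1: <HGST HDN726060ALE610 APGNT517> ACS-2 ATA SATA 3.x device",
    "Jul 16 14:19:30 freenas ada1: <HGST HDN726060ALE610 APGNT517> ACS-2 ATA SATA 3.x device",
    "Jul 16 18:59:57 freenas ada1: <HGST HDN726060ALE610 APGNT517> s/n NCGTRHJS detached",
    "Jul 16 19:00:09 freenas ada1: <HGST HDN726060ALE610 APGNT517> ACS-2 ATA SATA 3.x device",
]

if __name__ == "__main__":
    for dev, serial, when, gap in find_dropouts(SAMPLE_LOG):
        print(f"{dev} (s/n {serial}) dropped at {when:%H:%M:%S}, back after {gap:.0f} s")
```

On the sample lines this finds two drop-outs of ada1, each lasting about 12 seconds, which matches the detach and re-probe timestamps in the log.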

tvsjr · Jul 16, 2016

I'd start with a smartctl -x of the failing drive. Your drives are definitely too hot.

hunter · Jul 16, 2016

Thank you, I'll try it and post the results when I'm back home tomorrow night

Sent from my SPH-L710 using Tapatalk

joeschmuck · Jul 16, 2016

A few things real quick...

1) Do you run SMART tests frequently, both short and long test?
2) Have you tried to connect the drive to a different SATA port and use a different SATA cable?
3) Have you looked and tried the Hard Drive Troubleshooting Guide in these forums?
4) You are having problems with the drive on ada0, correct? I ask because, looking at the data, it only shows one entry where it states it was ad8, but I see nothing more about it.
5) If you don't have your data backed up, well, I'd get moving on that first. Save your data.

EDIT: I agree 100% that those drive temps are TOO HIGH! You are just asking for early drive failures once you start running them over 40C for any length of time. Figure out that cooling issue and fix it unless you want to replace drives frequently.
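
To see at a glance how far each drive is over that 40C mark, a short Python sketch like the one below can scan the smartd warnings. The regular expression is an assumption based on the smartd lines in the log posted above, the 40C target is simply the figure suggested in this reply, and the function names are hypothetical.

```python
import re

# Matches smartd temperature warnings as they appear in the posted log.
TEMP_RE = re.compile(
    r"Device: (/dev/\w+), Temperature (\d+) Celsius reached critical limit"
)


def hottest_readings(lines):
    """Map each device to the highest temperature smartd reported for it."""
    peaks = {}
    for line in lines:
        m = TEMP_RE.search(line)
        if m:
            dev, temp = m.group(1), int(m.group(2))
            peaks[dev] = max(temp, peaks.get(dev, temp))
    return peaks


def over_target(peaks, target=40):
    """Return the devices whose peak reading exceeds the target, sorted by name."""
    return sorted(dev for dev, temp in peaks.items() if temp > target)


# Lines copied from the log earlier in the thread.
SAMPLE_LOG = [
    "Jul 16 06:41:33 freenas smartd[2489]: Device: /dev/ada0, Temperature 51 Celsius reached critical limit of 41 Celsius (Min/Max 48/54)",
    "Jul 16 13:41:33 freenas smartd[2489]: Device: /dev/ada1, Temperature 50 Celsius reached critical limit of 41 Celsius (Min/Max 45/51)",
    "Jul 16 14:49:38 freenas smartd[2490]: Device: /dev/ada0, Temperature 50 Celsius reached critical limit of 41 Celsius (Min/Max 50/50)",
    "Jul 16 14:49:38 freenas smartd[2490]: Device: /dev/ada1, Temperature 49 Celsius reached critical limit of 41 Celsius (Min/Max 49/49)",
]

if __name__ == "__main__":
    peaks = hottest_readings(SAMPLE_LOG)
    for dev in over_target(peaks):
        print(f"{dev}: peak {peaks[dev]} C, {peaks[dev] - 40} C over the 40 C target")
```

This only looks at what smartd logged, so it says nothing about readings between polling intervals.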

hunter · Jul 18, 2016

Thank you both for the suggestions. I looked carefully at the output of smartctl -x on the "failing" drive and didn't see errors or recovered sectors, but the number of power cycles was way too high--considerably higher than another identical drive on the system that has been there longer. Both drives were plugged into the same rail on the power supply, so I tried tapping one of the unused rails for the 2nd drive. It has resilvered once now and just stopped. The higher temperatures were because, shortly after a power cycle, FreeNAS would restart the resilvering. So both drives had been running their fastest in a warm closet.

Now I am seeing 44 Celsius temperatures out of HGST Deskstar NAS 6 TB 7200 rpm drives. Does that seem too high? smartctl -x says:
Min/Max recommended Temperature: 0/60 Celsius
Min/Max Temperature Limit: -40/70 Celsius

Do you think the drives were running too hot at 50 Celsius given they were loaded and are 7200 RPM drives?

tvsjr · Jul 19, 2016

hunter said:
Do you think the drives were running too hot at 50 Celsius given they were loaded and are 7200 RPM drives?

Yes. I've got a similar situation I'm working through... I've got 15K SAS drives running 46-49C, and I feel that's too hot. They are probably similar to yours... in a closet at home. I didn't want to go to a full AC system due to cost and freeze-up issues (4x8x8 closet, not enough air space) so I'm doing a push-pull fan arrangement with air filtering. Making progress...

joeschmuck · Jul 19, 2016

hunter said:
Do you think the drives were running too hot at 50 Celsius given they were loaded and are 7200 RPM drives?

Yes.

You should try to get them down to 40C if possible. Placing a system in a closed closet is a bad idea. If possible, maybe you could cut a vent in the top of the closet and run a vent above the closet door so hot air can escape. If the bottom of the closet door doesn't have a 0.5" gap, then maybe you could cut off a small amount to make a gap that lets cool air in at the bottom. And if that isn't enough cooling, a few 240mm fans and a cheap wall wart could be mounted to the upper vent to force the hot air out.

Yea, if this is not your house then maybe it is not an option, but it's a possible solution if you have no other location for the hardware.

hunter · Jul 19, 2016

Thank you both for the feedback. It's not my house and the closet door is always open. I don't want to air condition the place during the day when I'm not home. I will try adding another 120mm fan to the case if it has a place for it. I wish I had bought 5400 rpm WD Reds again; those ran a lot cooler.

Jailer · Jul 19, 2016

Your problem is not your drive selection, your problem is inadequate cooling. You need to fix your cooling problem or you'll likely be spending a lot of money on drive replacement.

Penny wise, pound foolish.

hunter · Jul 24, 2016

Thank you everyone for your suggestions. My case already has two 120mm fans blowing air out the top and back of the case, and one 120mm fan blowing into the case and across the hard drives. I will try copper fin heatsinks, which Steve Gibson has said provide a surprising amount of cooling for hard drives. With everything set up the same way in the case, the WD Reds definitely ran below 40 Celsius. I guess I have inadequate cooling, but it sure wasn't inadequate until I went with the HGST 7200 RPM drives.

DrKK · Jul 24, 2016

I still would like to see the output of smartctl -x for the drive.

After all, we have seen many thousands of these from many customers, and have quite a bit of experience teasing out things from the SMART report that may not be discernible to most people.

hunter · Jul 25, 2016

Here is the output of smartctl -x for the drive. I didn't know how I could post it before. I redirected it into a file, then opened it with Windows Notepad and tried to add line breaks where they should be. If you know a better way, please say. I also attached a copy of the same smartctl output in the file FreeBSD created; maybe you have an editor that can read it properly. I appreciate whatever you can tell me about how the drive is running from the output.
----------------------------------------------------------------------------------------------------------

Code:

smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     HGST Deskstar NAS
Device Model:     HGST HDN726060ALE610
Serial Number:    NCGTRHJS
LU WWN Device Id: 5 000cca 24dcb3d5f
Firmware Version: APGNT517
User Capacity:    6,001,175,126,016 bytes [6.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Mon Jul 25 18:56:02 2016 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Disabled
Rd look-ahead is: Enabled
Write cache is:   Enabled
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Enabled


=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (  113) seconds.
Offline data collection
capabilities:              (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 713) minutes.
SCT capabilities:            (0x003d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     PO-R--   100   100   016    -    0
  2 Throughput_Performance  P-S---   138   138   054    -    100
  3 Spin_Up_Time            POS---   253   253   024    -    224 (Average 64)
  4 Start_Stop_Count        -O--C-   100   100   000    -    64
  5 Reallocated_Sector_Ct   PO--CK   100   100   005    -    0
  7 Seek_Error_Rate         PO-R--   100   100   067    -    0
  8 Seek_Time_Performance   P-S---   140   140   020    -    15
  9 Power_On_Hours          -O--C-   100   100   000    -    1482
 10 Spin_Retry_Count        PO--C-   100   100   060    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    64
192 Power-Off_Retract_Count -O--CK   100   100   000    -    121
193 Load_Cycle_Count        -O--C-   100   100   000    -    121
194 Temperature_Celsius     -O----   125   125   000    -    48 (Min/Max 25/51)
196 Reallocated_Event_Count -O--CK   100   100   000    -    0
197 Current_Pending_Sector  -O---K   100   100   000    -    0
198 Offline_Uncorrectable   ---R--   100   100   000    -    0
199 UDMA_CRC_Error_Count    -O-R--   200   200   000    -    1
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning


General Purpose Log Directory Version 1

SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      1  Comprehensive SMART error log
0x03       GPL     R/O      1  Ext. Comprehensive SMART error log
0x04       GPL,SL  R/O      8  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x08       GPL     R/O      2  Power Conditions log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  SATA NCQ Queued Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x12       GPL     R/O      1  SATA NCQ NON-DATA log
0x15       GPL,SL  R/W      1  SATA Rebuild Assist log
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x24       GPL     R/O    128  Current Device Internal Status Data log
0x25       GPL     R/O    128  Saved Device Internal Status Data log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (1 sectors)
Device Error Count: 1
    CR     = Command Register
    FEATR  = Features Register
    COUNT  = Count (was: Sector Count) Register
    LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
    LH     = LBA High (was: Cylinder High) Register    ]   LBA
    LM     = LBA Mid (was: Cylinder Low) Register      ] Register
    LL     = LBA Low (was: Sector Number) Register     ]
    DV     = Device (was: Device/Head) Register
    DC     = Device Control Register
    ER     = Error register
    ST     = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 [0] occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  84 -- 41 00 b1 00 00 b9 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0xb9000000 = 3103784960

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 01 00 00 28 00 00 b9 22 48 48 40 08     00:47:46.644  WRITE FPDMA QUEUED
  61 01 00 00 20 00 00 b9 22 47 48 40 08     00:47:46.641  WRITE FPDMA QUEUED
  61 01 00 00 18 00 00 b9 22 46 48 40 08     00:47:46.633  WRITE FPDMA QUEUED
  61 01 00 00 10 00 00 b9 22 45 48 40 08     00:47:46.633  WRITE FPDMA QUEUED
  61 01 00 00 08 00 00 b9 22 44 48 40 08     00:47:46.632  WRITE FPDMA QUEUED

SMART Extended Self-test Log Version: 1 (1 sectors)
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       256 (0x0100)
SCT Support Level:                   1
Device State:                        Active (0)
Current Temperature:                    48 Celsius
Power Cycle Min/Max Temperature:     33/48 Celsius
Lifetime    Min/Max Temperature:     25/51 Celsius
Under/Over Temperature Limit Count:   0/0

SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute

Min/Max recommended Temperature:      0/60 Celsius
Min/Max Temperature Limit:           -40/70 Celsius
Temperature History Size (Index):    128 (52)

Index    Estimated Time   Temperature Celsius
  53    2016-07-25 16:49    47  ****************************
 ...    ..( 37 skipped).    ..  ****************************
  91    2016-07-25 17:27    47  ****************************
  92    2016-07-25 17:28    48  *****************************
  93    2016-07-25 17:29    47  ****************************
 ...    ..( 18 skipped).    ..  ****************************
 112    2016-07-25 17:48    47  ****************************
 113    2016-07-25 17:49    48  *****************************
 ...    ..( 18 skipped).    ..  *****************************
   4    2016-07-25 18:08    48  *****************************
   5    2016-07-25 18:09    47  ****************************
   6    2016-07-25 18:10    48  *****************************
   7    2016-07-25 18:11    47  ****************************
   8    2016-07-25 18:12    48  *****************************
 ...    ..(  5 skipped).    ..  *****************************
  14    2016-07-25 18:18    48  *****************************
  15    2016-07-25 18:19    47  ****************************
 ...    ..( 20 skipped).    ..  ****************************
  36    2016-07-25 18:40    47  ****************************
  37    2016-07-25 18:41    48  *****************************
  38    2016-07-25 18:42    47  ****************************
  39    2016-07-25 18:43    47  ****************************
  40    2016-07-25 18:44    47  ****************************
  41    2016-07-25 18:45    48  *****************************
  42    2016-07-25 18:46    48  *****************************
  43    2016-07-25 18:47    48  *****************************
  44    2016-07-25 18:48    47  ****************************
  45    2016-07-25 18:49    48  *****************************
 ...    ..(  5 skipped).    ..  *****************************
  51    2016-07-25 18:55    48  *****************************
  52    2016-07-25 18:56    47  ****************************


SCT Error Recovery Control:
           Read: Disabled
          Write: Disabled


Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 2) ==
0x01  0x008  4              64  ---  Lifetime Power-On Resets
0x01  0x018  6      8903644812  ---  Logical Sectors Written
0x01  0x020  6        74202174  ---  Number of Write Commands
0x01  0x028  6     16860943534  ---  Logical Sectors Read
0x01  0x030  6        66827392  ---  Number of Read Commands
0x03  =====  =               =  ===  == Rotating Media Statistics (rev 1) ==
0x03  0x008  4            1452  ---  Spindle Motor Power-on Hours
0x03  0x010  4            1452  ---  Head Flying Hours
0x03  0x018  4             121  ---  Head Load Events
0x03  0x020  4               0  ---  Number of Reallocated Logical Sectors
0x03  0x028  4           29568  ---  Read Recovery Attempts
0x03  0x030  4               0  ---  Number of Mechanical Start Failures
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4               1  ---  Resets Between Cmd Acceptance and Completion
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              48  ---  Current Temperature
0x05  0x010  1              46  N--  Average Short Term Temperature
0x05  0x018  1              46  N--  Average Long Term Temperature
0x05  0x020  1              51  ---  Highest Temperature
0x05  0x028  1              25  ---  Lowest Temperature
0x05  0x030  1              50  N--  Highest Average Short Term Temperature
0x05  0x038  1              25  N--  Lowest Average Short Term Temperature
0x05  0x040  1              47  N--  Highest Average Long Term Temperature
0x05  0x048  1              25  N--  Lowest Average Long Term Temperature
0x05  0x050  4               0  ---  Time in Over-Temperature
0x05  0x058  1              60  ---  Specified Maximum Operating Temperature
0x05  0x060  4               0  ---  Time in Under-Temperature
0x05  0x068  1               0  ---  Specified Minimum Operating Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4               4  ---  Number of Hardware Resets
0x06  0x010  4              12  ---  Number of ASR Events
0x06  0x018  4               1  ---  Number of Interface CRC Errors
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2            5  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            4  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000d  2            0  Non-CRC errors within host-to-device FIS
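
Since the concern raised earlier in the thread was an unusually high power-cycle count, here is a rough Python sketch that pulls the raw values out of the SMART attribute table and works out the average number of powered-on hours per power cycle. The row format is assumed from the `smartctl -x` table in this post, the helper names are hypothetical, and the result is only a crude indicator for comparing this drive with the other one in the mirror. It is not a pass/fail test.

```python
import re

# One attribute row: ID, name, six flag characters, VALUE, WORST, THRESH,
# FAIL, then the raw value (anything after the first number is ignored).
ROW_RE = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+([A-Z-]{6})\s+(\d+)\s+(\d+)\s+(\d+)\s+(\S+)\s+(\d+)"
)


def parse_attributes(text):
    """Return {attribute_name: raw_value} for every row that matches."""
    attrs = {}
    for line in text.splitlines():
        m = ROW_RE.match(line)
        if m:
            attrs[m.group(2)] = int(m.group(8))
    return attrs


def hours_per_power_cycle(attrs):
    """Average powered-on hours between power cycles, or None if unknown."""
    cycles = attrs.get("Power_Cycle_Count", 0)
    hours = attrs.get("Power_On_Hours")
    if not cycles or hours is None:
        return None
    return hours / cycles


# Rows copied from the attribute table above.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  9 Power_On_Hours          -O--C-   100   100   000    -    1482
 12 Power_Cycle_Count       -O--CK   100   100   000    -    64
192 Power-Off_Retract_Count -O--CK   100   100   000    -    121
194 Temperature_Celsius     -O----   125   125   000    -    48 (Min/Max 25/51)
199 UDMA_CRC_Error_Count    -O-R--   200   200   000    -    1
"""

if __name__ == "__main__":
    attrs = parse_attributes(SAMPLE)
    print(f"Hours per power cycle: {hours_per_power_cycle(attrs):.1f}")
    print(f"Interface CRC errors:  {attrs['UDMA_CRC_Error_Count']}")
```

With the values posted here (1482 hours, 64 power cycles) that works out to roughly 23 hours per cycle. Running the same script against the other drive's output would show whether the two really differ as much as suspected.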

Robert Trevellyan · Jul 25, 2016

Code:

smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     HGST Deskstar NAS
Device Model:     HGST HDN726060ALE610
Serial Number:    NCGTRHJS
LU WWN Device Id: 5 000cca 24dcb3d5f
Firmware Version: APGNT517
User Capacity:    6,001,175,126,016 bytes [6.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Mon Jul 25 18:56:02 2016 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Disabled
Rd look-ahead is: Enabled
Write cache is:   Enabled
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (  113) seconds.
Offline data collection
capabilities:              (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 713) minutes.
SCT capabilities:            (0x003d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     PO-R--   100   100   016    -    0
  2 Throughput_Performance  P-S---   138   138   054    -    100
  3 Spin_Up_Time            POS---   253   253   024    -    224 (Average 64)
  4 Start_Stop_Count        -O--C-   100   100   000    -    64
  5 Reallocated_Sector_Ct   PO--CK   100   100   005    -    0
  7 Seek_Error_Rate         PO-R--   100   100   067    -    0
  8 Seek_Time_Performance   P-S---   140   140   020    -    15
  9 Power_On_Hours          -O--C-   100   100   000    -    1482
 10 Spin_Retry_Count        PO--C-   100   100   060    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    64
192 Power-Off_Retract_Count -O--CK   100   100   000    -    121
193 Load_Cycle_Count        -O--C-   100   100   000    -    121
194 Temperature_Celsius     -O----   125   125   000    -    48 (Min/Max 25/51)
196 Reallocated_Event_Count -O--CK   100   100   000    -    0
197 Current_Pending_Sector  -O---K   100   100   000    -    0
198 Offline_Uncorrectable   ---R--   100   100   000    -    0
199 UDMA_CRC_Error_Count    -O-R--   200   200   000    -    1
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      1  Comprehensive SMART error log
0x03       GPL     R/O      1  Ext. Comprehensive SMART error log
0x04       GPL,SL  R/O      8  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x08       GPL     R/O      2  Power Conditions log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  SATA NCQ Queued Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x12       GPL     R/O      1  SATA NCQ NON-DATA log
0x15       GPL,SL  R/W      1  SATA Rebuild Assist log
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x24       GPL     R/O    128  Current Device Internal Status Data log
0x25       GPL     R/O    128  Saved Device Internal Status Data log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (1 sectors)
Device Error Count: 1
    CR     = Command Register
    FEATR  = Features Register
    COUNT  = Count (was: Sector Count) Register
    LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
    LH     = LBA High (was: Cylinder High) Register    ]   LBA
    LM     = LBA Mid (was: Cylinder Low) Register      ] Register
    LL     = LBA Low (was: Sector Number) Register     ]
    DV     = Device (was: Device/Head) Register
    DC     = Device Control Register
    ER     = Error register
    ST     = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 [0] occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  84 -- 41 00 b1 00 00 b9 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0xb9000000 = 3103784960

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 01 00 00 28 00 00 b9 22 48 48 40 08     00:47:46.644  WRITE FPDMA QUEUED
  61 01 00 00 20 00 00 b9 22 47 48 40 08     00:47:46.641  WRITE FPDMA QUEUED
  61 01 00 00 18 00 00 b9 22 46 48 40 08     00:47:46.633  WRITE FPDMA QUEUED
  61 01 00 00 10 00 00 b9 22 45 48 40 08     00:47:46.633  WRITE FPDMA QUEUED
  61 01 00 00 08 00 00 b9 22 44 48 40 08     00:47:46.632  WRITE FPDMA QUEUED

SMART Extended Self-test Log Version: 1 (1 sectors)
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       256 (0x0100)
SCT Support Level:                   1
Device State:                        Active (0)
Current Temperature:                    48 Celsius
Power Cycle Min/Max Temperature:     33/48 Celsius
Lifetime    Min/Max Temperature:     25/51 Celsius
Under/Over Temperature Limit Count:   0/0

SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/60 Celsius
Min/Max Temperature Limit:           -40/70 Celsius
Temperature History Size (Index):    128 (52)

Index    Estimated Time   Temperature Celsius
  53    2016-07-25 16:49    47  ****************************
 ...    ..( 37 skipped).    ..  ****************************
  91    2016-07-25 17:27    47  ****************************
  92    2016-07-25 17:28    48  *****************************
  93    2016-07-25 17:29    47  ****************************
 ...    ..( 18 skipped).    ..  ****************************
 112    2016-07-25 17:48    47  ****************************
 113    2016-07-25 17:49    48  *****************************
 ...    ..( 18 skipped).    ..  *****************************
   4    2016-07-25 18:08    48  *****************************
   5    2016-07-25 18:09    47  ****************************
   6    2016-07-25 18:10    48  *****************************
   7    2016-07-25 18:11    47  ****************************
   8    2016-07-25 18:12    48  *****************************
 ...    ..(  5 skipped).    ..  *****************************
  14    2016-07-25 18:18    48  *****************************
  15    2016-07-25 18:19    47  ****************************
 ...    ..( 20 skipped).    ..  ****************************
  36    2016-07-25 18:40    47  ****************************
  37    2016-07-25 18:41    48  *****************************
  38    2016-07-25 18:42    47  ****************************
  39    2016-07-25 18:43    47  ****************************
  40    2016-07-25 18:44    47  ****************************
  41    2016-07-25 18:45    48  *****************************
  42    2016-07-25 18:46    48  *****************************
  43    2016-07-25 18:47    48  *****************************
  44    2016-07-25 18:48    47  ****************************
  45    2016-07-25 18:49    48  *****************************
 ...    ..(  5 skipped).    ..  *****************************
  51    2016-07-25 18:55    48  *****************************
  52    2016-07-25 18:56    47  ****************************

SCT Error Recovery Control:
           Read: Disabled
          Write: Disabled

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 2) ==
0x01  0x008  4              64  ---  Lifetime Power-On Resets
0x01  0x018  6      8903644812  ---  Logical Sectors Written
0x01  0x020  6        74202174  ---  Number of Write Commands
0x01  0x028  6     16860943534  ---  Logical Sectors Read
0x01  0x030  6        66827392  ---  Number of Read Commands
0x03  =====  =               =  ===  == Rotating Media Statistics (rev 1) ==
0x03  0x008  4            1452  ---  Spindle Motor Power-on Hours
0x03  0x010  4            1452  ---  Head Flying Hours
0x03  0x018  4             121  ---  Head Load Events
0x03  0x020  4               0  ---  Number of Reallocated Logical Sectors
0x03  0x028  4           29568  ---  Read Recovery Attempts
0x03  0x030  4               0  ---  Number of Mechanical Start Failures
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4               1  ---  Resets Between Cmd Acceptance and Completion
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              48  ---  Current Temperature
0x05  0x010  1              46  N--  Average Short Term Temperature
0x05  0x018  1              46  N--  Average Long Term Temperature
0x05  0x020  1              51  ---  Highest Temperature
0x05  0x028  1              25  ---  Lowest Temperature
0x05  0x030  1              50  N--  Highest Average Short Term Temperature
0x05  0x038  1              25  N--  Lowest Average Short Term Temperature
0x05  0x040  1              47  N--  Highest Average Long Term Temperature
0x05  0x048  1              25  N--  Lowest Average Long Term Temperature
0x05  0x050  4               0  ---  Time in Over-Temperature
0x05  0x058  1              60  ---  Specified Maximum Operating Temperature
0x05  0x060  4               0  ---  Time in Under-Temperature
0x05  0x068  1               0  ---  Specified Minimum Operating Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4               4  ---  Number of Hardware Resets
0x06  0x010  4              12  ---  Number of ASR Events
0x06  0x018  4               1  ---  Number of Interface CRC Errors
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2            5  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            4  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000d  2            0  Non-CRC errors within host-to-device FIS

A couple of observations:
1. As noted above, the drive is running hotter than generally recommended in these forums.
2. Are you really writing 2GB and reading 5GB per hour on that drive, or is my math off?
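
For anyone who wants to check the arithmetic in observation 2, here is a rough sketch using the counters from the Device Statistics log above. It assumes 512-byte logical sectors, which this part of the output does not state, so treat the result as an estimate only. The function name `gb_per_hour` is just an illustrative helper.

```python
# Rough check of average throughput from the Device Statistics counters above.
# ASSUMPTION: 512-byte logical sectors (not shown in this part of the output).
LOGICAL_SECTOR_BYTES = 512

sectors_written = 8_903_644_812   # "Logical Sectors Written"
sectors_read = 16_860_943_534     # "Logical Sectors Read"
power_on_hours = 1482             # SMART attribute 9, Power_On_Hours


def gb_per_hour(sectors: int, hours: int,
                sector_bytes: int = LOGICAL_SECTOR_BYTES) -> float:
    """Average decimal gigabytes transferred per power-on hour."""
    return sectors * sector_bytes / hours / 1e9


written_rate = gb_per_hour(sectors_written, power_on_hours)
read_rate = gb_per_hour(sectors_read, power_on_hours)
print(f"written: {written_rate:.1f} GB/h, read: {read_rate:.1f} GB/h")
```

Under that assumption the averages come out at roughly 3 GB written and 6 GB read per power-on hour, so the estimate in observation 2 is in the right range.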

hunter · Jul 26, 2016

Thank you for sharing your observations. Regarding #2, most of the time my drives are idle but when they are in use, the figures seem approximately what I'd expect as I often record and stream HD video.


Robert Trevellyan · Jul 26, 2016

OK, well other than the above, and the one error recorded at 0 lifetime hours, I don't see anything obviously wrong with the disk. However, you don't appear to have regular SMART tests running, so you might want to address that.
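
One way to scan a report like this for the counters that usually matter is sketched below. It parses rows in the attribute-table layout shown above and reports watched attributes with a non-zero raw value. The watch list and the helper name `nonzero_watched` are illustrative assumptions, not part of smartctl.

```python
# Flag SMART attributes often watched for media or link problems.
# ASSUMPTION: the watch list below is an illustrative choice, not an official one.
WATCHED_IDS = {5, 197, 198, 199}

# A few rows copied from the attribute table above.
SAMPLE = """\
  5 Reallocated_Sector_Ct   PO--CK   100   100   005    -    0
197 Current_Pending_Sector  -O---K   100   100   000    -    0
198 Offline_Uncorrectable   ---R--   100   100   000    -    0
199 UDMA_CRC_Error_Count    -O-R--   200   200   000    -    1
"""


def nonzero_watched(text: str) -> dict[str, int]:
    """Return {attribute_name: raw_value} for watched IDs with raw value > 0."""
    flagged = {}
    for line in text.splitlines():
        fields = line.split()
        # Expected columns: ID, NAME, FLAGS, VALUE, WORST, THRESH, FAIL, RAW_VALUE
        if len(fields) < 8 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[7]
        if attr_id in WATCHED_IDS and raw.isdigit() and int(raw) > 0:
            flagged[name] = int(raw)
    return flagged


flagged = nonzero_watched(SAMPLE)
print(flagged)
```

On these rows only UDMA_CRC_Error_Count is flagged, which lines up with the single ICRC error in the error log. CRC errors of that kind are generally associated with the link (cable, connector, or power) rather than the platters, though one event on its own is not conclusive.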

hunter · Jul 26, 2016

Thank you for looking and for the heads up. I intended for FreeNAS to run them monthly; I'll review everything to make sure they run as intended.


SweetAndLow · Jul 26, 2016

Your SMART long tests should run twice a month, and then you should set up short tests for every week. This is a rough guideline and lots of people do it differently, but you want some long tests and a few more short tests.

Green750one · Aug 22, 2016

Hi,

I had exactly the same issue and it was frustrating the h*ll out of me.

First thing, whilst I don't disagree with anything anyone has posted, I'd be surprised if it was inadequate cooling, drives failing or SATA ports or cables.
Even though it's usually a good idea to keep drives at or below 40 C, the HGST drives you mention have a maximum ambient operating temperature of 60 C - that's the air temp of the room, not the disk. The temp of the disk is going to be higher than that. There is also evidence to suggest that over-cooling kills disks just as much as over-heating.

What I'm interested in knowing is if the problem has resolved since you changed the power supply arrangement?
My issue was that I was overloading the 12 V rails on my PSU by having 4 disks on one cable - admittedly I had 9 disks and 2 SSDs on 3 cables, but now with no more than 2 drives per cable I've not had a single drive drop out of the pool. It might be that the power consumption of 6 TB 7200 rpm drives was too much for the single rail, especially if you had other devices hanging off it.
The other thing it might be is the first signs that your PSU is giving up.

DrKK · Aug 22, 2016

Green750one said:
First thing, whilst I don't disagree with anything anyone has posted, I'd be surprised if it was inadequate cooling, drives failing or SATA ports or cables.
Even though it's usually a good idea to keep drives at or below 40 C, the HGST drives you mention have a maximum ambient operating temperature of 60 C - that's the air temp of the room, not the disk. The temp of the disk is going to be higher than that. There is also evidence to suggest that over-cooling kills disks just as much as over-heating.

Indeed sir, no one will dispute this.

But I want to be very clear: what we do in these forums is drives. NAS. File storage. HDD arrays. Specifications on drives offer numbers like 60C ambient, 80C operating, and so on. That being said:

All of us in the FreeNAS community can *STRONGLY* advise everyone that regardless of the numbers listed on spec sheets, our users *CERTAINLY* experience vastly degraded longevity if the drives themselves are operating much out of the 40s. I would never *DREAM* of operating a drive at 40C or above in a 24/7 NAS, and even the most forgiving FreeNAS veteran will certainly not countenance anything above 45C or so.

It is true that there appears to be a (minor) drop off in longevity for drives run in the 20's (i.e., "too cold"), but the effect is very minor, and cannot necessarily be normalized to be purely causal based on the temperatures. On balance, running your drives around 27-37C seems to be absolutely ideal in our experience with small to medium arrays, and I think I speak for everyone when I say that regardless of spec sheets, our experience is that drives running well above 40C generally have many more problems, and less longevity, and that the effect is quite noticeable.

Damn the spec sheet---focus instead on what the collated and collected experience of a hundred thousand FreeNAS users says.

Run your drives in the 30's, at worst. Period.

Stux · Aug 22, 2016

But, yes.

Check your power supply, and its cable layout.
