Another failing drive?

Status
Not open for further replies.

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Hi, all,

after the disaster with the DOA WD drives I successfully replaced all disks of our small NAS.
Reference: https://forums.freenas.org/index.php?threads/wd-red-drives-not-spinning-up.25111/

Now, just one day later, I notice this - could someone please help me interpret the output? Should I suspect the connector in the enclosure or the controller?

APM set to 254 for all drives.

Thanks,
Patrick

Code:
(ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 50 c1 82 40 ba 00 00 00 00 00
(ada2:ahcich2:0:0:0): CAM status: ATA Status Error
(ada2:ahcich2:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
(ada2:ahcich2:0:0:0): RES: 41 40 50 c1 82 00 ba 00 00 08 00
(ada2:ahcich2:0:0:0): Retrying command
(ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 10 40 5c 99 40 d8 00 00 00 00 00
(ada2:ahcich2:0:0:0): CAM status: ATA Status Error
(ada2:ahcich2:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
(ada2:ahcich2:0:0:0): RES: 41 40 40 5c 99 00 d8 00 00 10 00
(ada2:ahcich2:0:0:0): Retrying command
...
(ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 10 40 5c 99 40 d8 00 00 00 00 00
(ada2:ahcich2:0:0:0): CAM status: ATA Status Error
(ada2:ahcich2:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
(ada2:ahcich2:0:0:0): RES: 41 40 48 5c 99 00 d8 00 00 10 00
(ada2:ahcich2:0:0:0): Error 5, Retries exhausted
...
ahcich2: Timeout on slot 29 port 0
ahcich2: is 00000000 cs 20000000 ss 00000000 rs 20000000 tfd c0 serr 00000000 cmd 0000fd17
ahcich2: Error while READ LOG EXT
(ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 20 91 99 40 d8 00 00 00 00 00
(ada2:ahcich2:0:0:0): CAM status: ATA Status Error
(ada2:ahcich2:0:0:0): ATA status: 00 ()
(ada2:ahcich2:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
(ada2:ahcich2:0:0:0): Retrying command
(ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 20 91 99 40 d8 00 00 00 00 00
(ada2:ahcich2:0:0:0): CAM status: ATA Status Error
(ada2:ahcich2:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
(ada2:ahcich2:0:0:0): RES: 41 40 20 91 99 00 d8 00 00 08 00
(ada2:ahcich2:0:0:0): Retrying command
...
(ada2:ahcich2:0:0:0): READ_DMA48. ACB: 25 00 50 a5 99 40 d8 00 00 00 08 00
(ada2:ahcich2:0:0:0): CAM status: ATA Status Error
(ada2:ahcich2:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 40 (UNC )
(ada2:ahcich2:0:0:0): RES: 51 40 50 a5 99 00 d8 00 00 00 00
(ada2:ahcich2:0:0:0): Error 5, Retries exhausted
(ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 20 c8 99 40 d8 00 00 00 00 00
(ada2:ahcich2:0:0:0): CAM status: ATA Status Error
(ada2:ahcich2:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
(ada2:ahcich2:0:0:0): RES: 41 40 20 c8 99 00 d8 00 00 08 00
(ada2:ahcich2:0:0:0): Retrying command
...
(ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 20 c8 99 40 d8 00 00 00 00 00
(ada2:ahcich2:0:0:0): CAM status: ATA Status Error
(ada2:ahcich2:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
(ada2:ahcich2:0:0:0): RES: 41 40 20 c8 99 00 d8 00 00 08 00
(ada2:ahcich2:0:0:0): Error 5, Retries exhausted
...
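The status and error bytes in these messages are bit fields from the ATA spec: status 41 is DRDY+ERR and error 40 is UNC (uncorrectable data), i.e. the drive itself is reporting the sectors as unreadable. A minimal decoding sketch (bit names are from the ATA spec; the helper itself is illustrative, not from any FreeBSD source):

```python
# Decode the ATA status/error bytes seen in the log above.
# Bit names follow the ATA spec; this helper is an illustrative sketch.

ATA_STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x10: "SERV",
                   0x08: "DRQ", 0x01: "ERR"}
ATA_ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}

def decode(byte, table):
    """Return the names of the bits set in `byte`."""
    return [name for bit, name in table.items() if byte & bit]

# "ATA status: 41 (DRDY ERR), error: 40 (UNC )" from the log:
print(decode(0x41, ATA_STATUS_BITS))  # ['DRDY', 'ERR']
print(decode(0x40, ATA_ERROR_BITS))   # ['UNC']
# and the later "ATA status: 51 (DRDY SERV ERR)":
print(decode(0x51, ATA_STATUS_BITS))  # ['DRDY', 'SERV', 'ERR']
```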


Code:
[root@freenas-je] ~# zpool status
  pool: zfs
 state: ONLINE
  scan: scrub in progress since Mon Dec  8 09:02:31 2014
        790G scanned out of 4.99T at 309M/s, 3h58m to go
        152K repaired, 15.45% done
config:

    NAME        STATE     READ WRITE CKSUM
    zfs         ONLINE       0     0     0
      raidz2-0  ONLINE       0     0     0
        ada0p2  ONLINE       0     0     0
        ada1p2  ONLINE       0     0     0
        ada2p2  ONLINE       0     0     0  (repairing)
        ada3p2  ONLINE       0     0     0

errors: No known data errors


Code:
[root@freenas-je] ~# smartctl -a /dev/ada2
smartctl 6.2 2013-07-26 r3841 [FreeBSD 9.2-RELEASE-p12 amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     ST4000VN000-1H4168
Serial Number:    Z302C6XN
LU WWN Device Id: 5 000c50 079481ce9
Firmware Version: SC44
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5900 rpm
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Mon Dec  8 09:50:37 2014 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (  107) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   1) minutes.
Extended self-test routine
recommended polling time:      ( 531) minutes.
Conveyance self-test routine
recommended polling time:      (   2) minutes.
SCT capabilities:            (0x10bd)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   115   092   006    Pre-fail  Always       -       90091760
  3 Spin_Up_Time            0x0003   096   096   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       4
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   065   060   030    Pre-fail  Always       -       3616389
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       89
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       4
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       139
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       1
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   066   065   045    Old_age   Always       -       34 (Min/Max 31/34)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       4
194 Temperature_Celsius     0x0022   034   040   000    Old_age   Always       -       34 (0 19 0 0 0)
197 Current_Pending_Sector  0x0012   100   099   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   099   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 139 (device log contains only the most recent five errors)
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 139 occurred at disk power-on lifetime: 88 hours (3 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 ff ff ff 4f 00      02:58:13.087  READ FPDMA QUEUED
  60 00 38 ff ff ff 4f 00      02:58:13.086  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00      02:58:13.064  READ LOG EXT
  60 00 08 ff ff ff 4f 00      02:58:09.128  READ FPDMA QUEUED
  60 00 38 ff ff ff 4f 00      02:58:09.127  READ FPDMA QUEUED

Error 138 occurred at disk power-on lifetime: 88 hours (3 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 ff ff ff 4f 00      02:58:09.128  READ FPDMA QUEUED
  60 00 38 ff ff ff 4f 00      02:58:09.127  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00      02:58:09.093  READ LOG EXT
  60 00 08 ff ff ff 4f 00      02:58:05.167  READ FPDMA QUEUED
  60 00 38 ff ff ff 4f 00      02:58:05.166  READ FPDMA QUEUED

Error 137 occurred at disk power-on lifetime: 88 hours (3 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 ff ff ff 4f 00      02:58:05.167  READ FPDMA QUEUED
  60 00 38 ff ff ff 4f 00      02:58:05.166  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00      02:58:05.133  READ LOG EXT
  61 00 28 ff ff ff 4f 00      02:58:01.212  WRITE FPDMA QUEUED
  60 00 08 ff ff ff 4f 00      02:58:01.212  READ FPDMA QUEUED

Error 136 occurred at disk power-on lifetime: 88 hours (3 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: WP at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 28 ff ff ff 4f 00      02:58:01.212  WRITE FPDMA QUEUED
  60 00 08 ff ff ff 4f 00      02:58:01.212  READ FPDMA QUEUED
  60 00 38 ff ff ff 4f 00      02:58:01.211  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00      02:58:01.134  READ LOG EXT
  60 00 08 ff ff ff 4f 00      02:57:55.804  READ FPDMA QUEUED

Error 135 occurred at disk power-on lifetime: 88 hours (3 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 ff ff ff 4f 00      02:57:55.804  READ FPDMA QUEUED
  60 00 38 ff ff ff 4f 00      02:57:55.796  READ FPDMA QUEUED
  60 00 10 ff ff ff 4f 00      02:57:55.794  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00      02:57:55.785  READ FPDMA QUEUED
  60 00 18 ff ff ff 4f 00      02:57:55.782  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Conveyance offline  Completed without error       00%        41         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
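The attributes worth alarming on in output like the above are the reallocated/pending/uncorrectable counters, the reported-uncorrect count, and the CRC count. A sketch of the kind of check a monitoring script might do (the zero-tolerance thresholds are a common rule of thumb, not a vendor spec):

```python
# SMART attributes commonly watched for early drive failure.
# Treating any nonzero raw value as a red flag is a house rule,
# not a manufacturer specification.
WATCH = {
    5:   "Reallocated_Sector_Ct",
    187: "Reported_Uncorrect",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
    199: "UDMA_CRC_Error_Count",
}

def failing(raw_values):
    """Return the watched attributes whose raw value is nonzero."""
    return {WATCH[i]: v for i, v in raw_values.items()
            if i in WATCH and v > 0}

# Raw values from the smartctl output above for ada2:
ada2 = {5: 0, 187: 139, 197: 0, 198: 0, 199: 0}
print(failing(ada2))  # {'Reported_Uncorrect': 139}
```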
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
Your failure is at a magical 0x0fffffff address. Somebody else had this a couple of weeks ago on a Seagate 3TB, and it seemed to be a drive controller problem. This is likely the same situation here, unless it is actually a problem with your mobo controller port. This means it is probably not a cable or connector problem.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I have to wonder if the Seagate forums would shed any light on this problem. Seems to be a recent problem that has come up for Seagates only.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
We replaced the drive again and everything seems fine, now.

Anyway, I'm really dumbfounded. 3 out of 5 WD RED drives DOA. 1 out of 5 Seagate drives faulty. Brand new products from a well established and reliable distributor.

Is this the current state of affairs in this industry? I don't want to think of the consequences if we had put e.g. 2 of the WD drives into a customer server with just gmirror for redundancy instead of a home NAS with 4 drives and ZFS. Luckily we use 7k2 "enterprise" drives for the servers only. Thanks to everyone working on FreeNAS, and on ZFS in FreeBSD in particular: we did not lose a single bit, and we know that for sure.

Kind regards
Patrick
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
You shouldn't have to set APM to anything; Disable should be fine.

What hardware are you running? Maybe something in your setup is causing the issue. As cyberjock said, maybe the Seagate forums will hold an answer, or you can Google it.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
HP ProLiant Micro Server
FreeNAS-9.2.1.8-RELEASE-x64 (e625626)
ECC RAM

Code:
CPU: AMD Athlon(tm) II Neo N36L Dual-Core Processor (1297.87-MHz K8-class CPU)
  Origin = "AuthenticAMD"  Id = 0x100f63  Family = 0x10  Model = 0x6  Stepping = 3
  Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
  Features2=0x802009<SSE3,MON,CX16,POPCNT>
  AMD Features=0xee500800<SYSCALL,NX,MMX+,FFXSR,Page1GB,RDTSCP,LM,3DNow!+,3DNow!>
  AMD Features2=0x8377f<LAHF,CMP,SVM,ExtAPIC,CR8,ABM,SSE4A,Prefetch,OSVW,IBS,SKINIT,WDT,NodeId>
  TSC: P-state invariant
real memory  = 17179869184 (16384 MB)
avail memory = 16396615680 (15637 MB)
...
ahci0: <ATI IXP700 AHCI SATA controller> port 0xd000-0xd007,0xc000-0xc003,0xb000-0xb007,0xa000-0xa003,0x9000-0x900f mem 0xfe6ffc00-0xfe6fffff irq 19 at device 17.0 on pci0
ahci0: AHCI v1.20 with 4 3Gbps ports, Port Multiplier supported
ahcich0: <AHCI channel> at channel 0 on ahci0
ahcich1: <AHCI channel> at channel 1 on ahci0
ahcich2: <AHCI channel> at channel 2 on ahci0
ahcich3: <AHCI channel> at channel 3 on ahci0
...
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <ST4000VN000-1H4168 SC44> ATA-9 SATA 3.x device
ada0: Serial Number Z302C950
ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 3815447MB (7814037168 512 byte sectors: 16H 63S/T 16383C)
ada0: Previously was known as ad4
ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
ada1: <ST4000VN000-1H4168 SC44> ATA-9 SATA 3.x device
ada1: Serial Number Z302C21X
ada1: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada1: Command Queueing enabled
ada1: 3815447MB (7814037168 512 byte sectors: 16H 63S/T 16383C)
ada1: Previously was known as ad6
ada2 at ahcich2 bus 0 scbus2 target 0 lun 0
ada2: <ST4000VN000-1H4168 SC44> ATA-9 SATA 3.x device
ada2: Serial Number Z302CA6G
ada2: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada2: Command Queueing enabled
ada2: 3815447MB (7814037168 512 byte sectors: 16H 63S/T 16383C)
ada2: Previously was known as ad8
ada3 at ahcich3 bus 0 scbus3 target 0 lun 0
ada3: <ST4000VN000-1H4168 SC44> ATA-9 SATA 3.x device
ada3: Serial Number Z302C7LV
ada3: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada3: Command Queueing enabled
ada3: 3815447MB (7814037168 512 byte sectors: 16H 63S/T 16383C)
ada3: Previously was known as ad10


As I wrote - after exchanging that last drive, everything looks OK now. I will configure outgoing email for this system, so I get the output of periodic zfs scrubs and smart tests ...

I'll have a look at the Seagate forums, though.

Thanks
Patrick
 

1.21gigawatts

Explorer
Joined
Jan 6, 2013
Messages
62
We replaced the drive again and everything seems fine, now.

Anyway, I'm really dumbfounded. 3 out of 5 WD RED drives DOA. 1 out of 5 Seagate drives faulty. Brand new products from a well established and reliable distributor.

Is this the current state of affairs in this industry? I don't want to think of the consequences if we had put e.g. 2 of the WD drives into a customer server with just gmirror for redundancy instead of a home NAS with 4 drives and ZFS. Luckily we use 7k2 "enterprise" drives for the servers only. Thanks to everyone working on FreeNAS, and on ZFS in FreeBSD in particular: we did not lose a single bit, and we know that for sure.

Kind regards
Patrick

About five years ago I bought 100 Seagate 1.5T drives. About a quarter of them were DOA, another quarter failed during burn-in, and I just replaced the last of them last week. This spurred long conversations with "sales engineers" at Seagate, WD, and Toshiba. All they could say was that the drives would be replaced under warranty. I told them I wasn't interested in replacements, I was interested in reliable hardware, and would gladly pay 5X the price for stuff that would last longer. They (all) told me that for more money I would get a drive with a longer warranty, but not to count on a longer life.

I've since been using WD RED drives, and they seem healthier (but I've only been through 25 or so). It is VERY disheartening to hear that these may be no better than the "old" days.
 

bestboy

Contributor
Joined
Jun 8, 2014
Messages
198
About five years ago I bought 100 Seagate 1.5T drives.
That kind of explains your nickname :)
It is VERY disheartening to hear that these may be no better than the "old" days.
That's why I'm hesitant to step over the imaginary "3TB border" in my head. The drives are not really getting more reliable, yet their data density increases steadily. URE probability and resilver times skyrocket. No warranty can compensate for that. I simply do not feel comfortable going forward :(.
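For a rough sense of the URE concern: at the commonly quoted consumer-drive spec of 1 unrecoverable error per 1e14 bits read (an assumed figure; datasheets vary), a single full read of a 4 TB drive already carries roughly a one-in-four chance of hitting at least one URE:

```python
import math

# Back-of-envelope URE expectation for a full-drive read, assuming the
# commonly quoted consumer spec of 1 unrecoverable error per 1e14 bits
# read (an assumption; check the actual datasheet).
URE_RATE = 1e-14          # errors per bit read
drive_bytes = 4e12        # a 4 TB drive

bits = drive_bytes * 8
expected_errors = bits * URE_RATE                # 0.32 expected UREs
p_at_least_one = 1 - math.exp(-expected_errors)  # Poisson approximation

print(f"expected UREs: {expected_errors:.2f}")
print(f"P(>=1 URE over a full read): {p_at_least_one:.0%}")  # ~27%
```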
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
With such a troublesome background I'd recommend doing thorough burn-in testing before going live.
The system has been in use for more than two years with 4 Seagate ST32000542AS without a hitch. One of the drives started to show uncorrectable errors, but nothing was ever fundamentally broken. So we thought we'd exchange the drives. And while at it, use new ones with double the capacity. I definitely did not expect this routine task to be so troublesome.

We will just have to monitor the system closely, now.

Kind regards
Patrick
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
A friend and I had discussed this a few years ago, but he has long since forgotten about it. He just ordered 48 hard drives for a business project. They came in two boxes, 24 drives per box. One box of drives was opened and put into use immediately upon arrival. The other box was left on a shelf for 24 hours. The friend lives in Montana and this was just last week (it's winter in Montana this time of year).

Well, the first box had almost 50% failures, either DOA or before 24 hours of burn-in had completed. The second box had 1 failure. Both being from what appears to be the same batch, you'd expect that the failure rates on a per-box basis would be nearly identical. But there is one thing these two boxes didn't have in common.

Remember how one box was shelved and the other box was immediately used? Those drives were cold when pulled out, because they had been in a UPS truck, which isn't exactly climate controlled. If you read the documentation on the hard drives, it says something about waiting so many hours for a drive to reach room temperature before using it. Maybe these temperatures are affecting them, and "we" as an industry should take more note of this? I personally always unbox the drives and let them sit to equalize for at least 4 hours (often overnight) before opening the sealed bag they come in. I realize this is hard to do because you want to play with the new toys, but maybe temps are a bigger factor than we realize? The air inside a hard drive has some moisture in it, and if it's condensing on the platters, and then on power-on the head has to deal with even a smidge of moisture on the platter... well, it doesn't take a rocket scientist to realize this is probably not a good scenario. On top of that, the platters are insulated by air, so even if the outside of the hard drive isn't particularly cold, the platters are possibly still very cold.

Think about how many times you've left a hard drive in your car, exposed to very cold (or very warm) temps, and thought nothing of it. Maybe this whole "temperature equalization" thing should be given a more serious look.

Many of you who read here regularly are aware that I've been extremely lucky with my hard drives the last 5 years or so (minus Seagate's AFU firmware, which I won't forgive them for), but I can't believe that I'm just *that* lucky, given that I've been involved in buying over 150 disks for friends over the last 3 years and we've had absolutely amazing success with disks from all brands. The only DOA I've had in the last 3 years from any manufacturer was a Seagate, and the packaging was very poor: the drive was shipped in one of those white sealed plastic bags, with nothing protecting it but the manufacturer's retail box.

It is possible that one box was drop-kicked or something but my friend says both boxes are in excellent condition.

Just something to think about....
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
It looks like Seagate drives keep track of Minimum and Maximum temperature? If so, maybe they know this is a factor and they like to be able to estimate the effect on incoming RMAs. If they are willing to put Helium in drives, perhaps the temperature of the air (too high or too low) is a factor, rather than the commonly-presumed issue of 'expansion' and 'contraction' of the platter.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
It looks like Seagate drives keep track of Minimum and Maximum temperature? If so, maybe they know this is a factor and they like to be able to estimate the effect on incoming RMAs. If they are willing to put Helium in drives, perhaps the temperature of the air (too high or too low) is a factor, rather than the commonly-presumed issue of 'expansion' and 'contraction' of the platter.

From what I read, Helium is meant to reduce shear stress caused by airflow over the platters. Since there are fewer particles to interact with the surfaces, the viscous forces are reduced, allowing for more platters without a crazy aerodynamic penalty.
 

DKarnov

Dabbler
Joined
Nov 25, 2014
Messages
44
From what I read, Helium is meant to reduce shear stress caused by airflow over the platters. Since there are fewer particles to interact with the surfaces, the viscous forces are reduced, allowing for more platters without a crazy aerodynamic penalty.

Mostly this; it allows reduced platter spacing and reduced motor torque:

The new drives are sealed and contain helium, rather than ordinary air. Because helium has one-seventh the density of air, reduced friction from the spinning disks lowers the electrical power consumed by the HDD’s motor. It also enables more disks to be packed closer together and allows data tracks to be spaced closer together on each platter, both of which increase a drive’s data-storage capacity. Helium HDDs also run quieter and cooler, requiring less external cooling.

As a result, HGST’s unique 7Stac™ technology enables a seven-disk, 6TB helium drive with the same physical size as today’s five-disk, air-filled 4TB HDD, but consumes 23% less power, or 49% fewer Watts per terabyte of data-storage capacity.

The key innovation in the helium HDD is HGST’s patented HelioSeal™ hermetic seal technology that prevents helium from escaping from the HDD and is compatible with high-volume manufacturing. This durable seal also permits their use with new immersion cooling systems, a new data center cooling technology. Conventional air-filled HDDs, which must “breathe” to accommodate changes in air pressure, cannot be immersed in any cooling liquid.

Also to a very small degree the hermetic seal ensures that no dust/particulate gets inside.

AFAIK all drives with SMART record max and min temperature, but (obviously) only when they're powered up. I doubt Seagate would care unless the temps were wildly outside the rated specs; the rated (running) spec on a Seagate NAS drive is 0 to 70C, and if you're outside those you're obviously doing something special. It also doesn't tell you anything about average temperatures, and all the reports say it's average temps that affect drive lifespan, not short-term rises and falls past absolute points.

That being said, rapid change in temp will kill moving machinery dead. Seagate gives a 20C/hr maximum dT spec for a running drive. If you do the math, that'd be quite easy to exceed if the drive is cold; if you're hotswapping the drive in a running server to immediately rebuild / resilver, and the other drives in the array are at an example temp of 35C, any drive that starts below 15C (~60F) will violate that limit. Seagate's limit for non-op dT is only 30C/hr. So like Cyberjock says, best to leave the stuff in the packaging to slowly thaw for a couple hours / overnight before jamming them into service. (This may also be part of the argument for, quite literally, hot spares.)
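The arithmetic in that example can be checked directly; a sketch, taking the 20 °C/hr figure from the post and assuming (simplistically) that the drive warms all the way to ambient within an hour:

```python
# Check a drive's warm-up against a thermal ramp limit.
# The 20 °C/hr operating figure is taken from the post above; the
# "warms to ambient within an hour" model is a deliberate simplification.
MAX_RAMP_C_PER_HR = 20.0

def violates_ramp(start_c, ambient_c, hours=1.0):
    """True if warming from start_c to ambient_c within `hours`
    exceeds the ramp limit."""
    return abs(ambient_c - start_c) / hours > MAX_RAMP_C_PER_HR

# A cold-shipped drive hot-swapped into a 35 °C array:
print(violates_ramp(start_c=14.0, ambient_c=35.0))  # True  (21 °C in an hour)
print(violates_ramp(start_c=16.0, ambient_c=35.0))  # False (19 °C in an hour)
```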
 

bestboy

Contributor
Joined
Jun 8, 2014
Messages
198
Yeah and when they advance to the ultimate reduction of molecular friction on the platter by going one last step to the left in the periodic table, then we can finally have hackable, exploding hard drives as seen in the movies \o/

 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Yeah and when they advance to the ultimate reduction of molecular friction on the platter by going one last step to the left in the periodic table, then we can finally have hackable, exploding hard drives as seen in the movies \o/


That's what happens if you use unpatched versions of Bash.
 

krikboh

Patron
Joined
Sep 21, 2013
Messages
209
Forget Hydrogen. How about a vacuum?


 

bestboy

Contributor
Joined
Jun 8, 2014
Messages
198
Sure, why not? You just need to find a way to defy gravity and keep the head from touching the platter while still being close enough to read from it.
And before you yell "Eureka! Magnets FTW!"
(diagram: electromagnetic suspension in a maglev train; source: http://ninpope-physics.comuv.com/maglev/howitworks.php)

remember that hard drives don't like magnets at all.

But if you can get it to work in a vacuum, you'll probably get very rich and become a hot candidate for the Nobel Prize in Physics :)
 