Failures Happen to All of Us - Example

joeschmuck · Feb 19, 2017

This posting is just to show that even someone with my experience has problems with hardware that seems to come out of nowhere.

First of all, as our forum rules dictate, I'll post my system configuration:
Physical configuration is in my signature for ESXi System 1 and FreeNAS VM on ESXi System 1.
ESXi Software: 6.5.0 (Build 4887370)
FreeNAS Software Version: FreeNAS-9.10.2-U1 (86c7ef5)

The problem started out with an email (@3:01AM 2/18/2017) stating I have some issue with my ada3 drive with Serial Number ending in 516. Two days ago I had deleted my MiniDLNA jail and rebooted the FreeNAS VM (I'm using ESXi per my system description).

Code:


freenas.local kernel log messages:

> ahcich4: Timeout on slot 14 port 0
> ahcich4: is 48000000 cs 00000000 ss 00004000 rs 00004000 tfd 8441 serr 00400000 cmd 0071ce17
> (ada3:ahcich4:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 70 33 b0 40 45 00 00 01 00 00
> (ada3:ahcich4:0:0:0): CAM status: Command timeout
> (ada3:ahcich4:0:0:0): Retrying command
> ahcich4: Timeout on slot 1 port 0
> ahcich4: is 48000000 cs 0000003c ss 0000003e rs 0000003e tfd 8441 serr 00400000 cmd 0071c117
> (ada3:ahcich4:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 38 60 fd d2 40 45 00 00 00 00 00
> (ada3:ahcich4:0:0:0): CAM status: Command timeout
> (ada3:ahcich4:0:0:0): Retrying command
> ahcich4: Timeout on slot 29 port 0
> ahcich4: is 48000000 cs c0000001 ss e0000001 rs e0000001 tfd 8441 serr 00400000 cmd 0071dd17
> (ada3:ahcich4:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 78 30 a4 e0 40 45 00 00 00 00 00
> (ada3:ahcich4:0:0:0): CAM status: Command timeout
> (ada3:ahcich4:0:0:0): Retrying command
> arp: 00:0c:29:54:b8:3b attempts to modify permanent entry for 192.168.1.1 on epair0b
> arp: 00:0c:29:54:b8:3b attempts to modify permanent entry for 192.168.1.1 on epair0b
> arp: 00:0c:29:54:b8:3b attempts to modify permanent entry for 192.168.1.1 on epair0b
> arp: 00:0c:29:54:b8:3b attempts to modify permanent entry for 192.168.1.1 on epair0b


-- End of security output --

I also received an email from my SMART monitoring script which gave me this output:

Code:

########## SMART status report summary for all drives ##########

+------+---------------+----+-----+-----+-----+-------+-------+--------+------+------+------+------+------+-------+----+
|Device|Serial         |Temp|Power|Start|Spin |ReAlloc|Current|Offline |Multi |Load  |UDMA  |Seek  |High  |Command|Last|
|      |               |    |On   |Stop |Retry|Sectors|Pending|Uncorrec|Zone  |Cycle |CRC   |Errors|Fly   |Timeout|Test|
|      |               |    |Hours|Count|Count|       |Sectors|Sectors |Errors|Count |Errors|      |Writes|Count  |Age |
+------+---------------+----+-----+-----+-----+-------+-------+--------+------+------+------+------+------+-------+----+
|ada0  |WD-WMC301176182| 32 |33880|  281|    0|      0|      0|       0|     0|   183|     0|   N/A|   N/A|    N/A|   1|
|ada1 ?|WD-WMC301183577| 36 |33426|  264|    0|      0|      0|       0|     0|   170|     0|   N/A|   N/A|    N/A|   1|
|ada2 ?|WD-WMC300411588| 36 |37329|  487|    0|      0|      0|       0|     0|   370|     0|   N/A|   N/A|    N/A|   1|
|ada3  |WD-WMC300411516| 34 |37333|  495|    0|      0|      0|       0|     0|   378|     4|   N/A|   N/A|    N/A|   1|
|ada4 ?|WD-WMC300412480| 36 |37313|  483|    0|      0|      0|       0|     0|   576|     0|   N/A|   N/A|    N/A|   1|
|ada5 ?|WD-WMC300410673| 37 |37337|  498|    0|      0|      0|       0|     0|   380|     0|   N/A|   N/A|    N/A|   1|
+------+---------------+----+-----+-----+-----+-------+-------+--------+------+------+------+------+------+-------+----+

Code:

########## SMART status report for ada3 drive (Western Digital Red: WD-WMC300411516) ##########

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       31
  3 Spin_Up_Time            0x0027   181   174   021    Pre-fail  Always       -       3908
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       495
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   049   049   000    Old_age   Always       -       37333
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       231
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       116
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       378
194 Temperature_Celsius     0x0022   113   107   000    Old_age   Always       -       34
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       4
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

No Errors Logged

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Extended offline    Completed without error       00%     37167         -
Short offline       Completed without error       00%     37306         -

As you can see, I have 4 errors in ID 199 (UDMA_CRC_Error_Count), which indicates a communication issue between the hard drive and the SATA controller, so the SATA cable is the first suspect item. I can state for a fact that this computer has not been physically touched in weeks, so this coming out of the blue surprised me. I would have expected a different drive failure, to be honest.

Note: ID 199 is not considered a critical error from the hard drive's perspective.
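For anyone who wants to pull the ID 199 raw value out of a report like the one above programmatically, here is a minimal Python sketch. It is an illustration only, not the monitoring script that produced these emails; the function name `parse_attributes` and the regular expression are assumptions based on the column layout shown in this post.

```python
import re

# One attribute row in the layout shown above, e.g.
# "199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       4"
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"\S+\s+\S+\s+\S+\s+(\d+)"
)

def parse_attributes(text):
    """Return {attribute_id: (name, raw_value)} for every row that parses."""
    attrs = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            attrs[int(m.group(1))] = (m.group(2), int(m.group(6)))
    return attrs

# Three rows copied from the ada3 report above.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       4
"""

attrs = parse_attributes(SAMPLE)
name, raw = attrs[199]
if raw > 0:
    print(f"ID 199 ({name}) raw value is {raw}: check the SATA cable and connectors first")
```

Only the raw value is checked here because the normalized VALUE column for ID 199 still reads 200 in the report even with 4 errors logged.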

I looked at dmesg and confirmed that those error messages were present, then shut down FreeNAS and reseated both ends of the SATA cable. I have no idea if this will fix the problem.
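Confirming the messages in a saved copy of the kernel log can be automated along these lines. This is a rough Python sketch, assuming the log text has already been captured to a string; `count_timeouts` is a hypothetical helper, and the pattern is derived only from the log lines quoted in this post.

```python
import re
from collections import Counter

# Matches lines such as "(ada3:ahcich4:0:0:0): CAM status: Command timeout"
TIMEOUT = re.compile(r"\((\w+):(\w+):[\d:]+\): CAM status: Command timeout")

def count_timeouts(log_text):
    """Count 'Command timeout' CAM status lines per (device, channel)."""
    counts = Counter()
    for line in log_text.splitlines():
        m = TIMEOUT.search(line)
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return counts

# Abbreviated excerpt of the kernel log quoted above.
LOG = """\
ahcich4: Timeout on slot 14 port 0
(ada3:ahcich4:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 70 33 b0 40 45 00 00 01 00 00
(ada3:ahcich4:0:0:0): CAM status: Command timeout
(ada3:ahcich4:0:0:0): Retrying command
ahcich4: Timeout on slot 1 port 0
(ada3:ahcich4:0:0:0): CAM status: Command timeout
"""

timeouts = count_timeouts(LOG)
for (device, channel), n in timeouts.items():
    print(f"{device} on {channel}: {n} command timeout(s)")
```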

Additionally, I received an error that my Extended SMART tests never ran for any drive on Sunday. I checked my cron and the job is still scheduled, so I have no idea what is happening in that respect, since it had worked previously. I did not make any changes to FreeNAS other than deleting the jail and rebooting on 17 Feb. Prior to that I upgraded FreeNAS to version 9.10.2-U1 on 29 Jan 2017. I did also upgrade ESXi to the current build on 11 February 2017; right now I do not see that as being the issue, but I need to remember it in case nothing else proves to be the cause.

My next step is to watch and see what happens. If I get the CRC error again for drive ada3 then I will swap the SATA cables between ada3 (S/N: 516) and an adjacent drive and see if the problem stays with the suspect drive or moves to the other drive. So the process is to identify the failing hardware through the process of elimination.
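The cable-swap test described above boils down to a small decision table. The Python sketch below is one way to write it down; the function name, the shortened serial strings, and the wording of each conclusion are illustrative assumptions, and the "several drives" branch goes beyond what the plan above states.

```python
def interpret_cable_swap(suspect_serial, serials_with_new_errors):
    """Interpret which drives logged new ID 199 errors after swapping cables.

    suspect_serial: serial of the drive that originally reported CRC errors.
    serials_with_new_errors: set of serials whose ID 199 count rose after the swap.
    """
    if not serials_with_new_errors:
        return "no new errors yet: keep watching"
    if serials_with_new_errors == {suspect_serial}:
        return "errors stayed with the drive: suspect the drive electronics"
    if suspect_serial not in serials_with_new_errors:
        return "errors moved with the cable: suspect the cable or the port it uses"
    return "errors on several drives: look at shared parts such as power or the controller"

# "516" is the suspect drive; "588" stands in for the adjacent drive.
print(interpret_cable_swap("516", {"588"}))
```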

Note that I'm using the hard drive serial number for tracking; the device name ada3 means nothing, since it can change. I won't consider this problem corrected until at least 1 month has passed after a change without the problem coming back. Additionally, I will manually kick off an Extended Test on the S/N: 516 hard drive periodically. While I don't believe there is a physical media defect, the problem could be in the hard drive electronics and this might help me identify an issue faster.

I'll update this post as this unfolds but the entire reason I'm posting this is to show that this does happen right out of the blue and to show the process of troubleshooting the issue.

joeschmuck · Feb 21, 2017

Okay, so it's time for an update, and I got lucky. I have not had any further issues with my drives at all. A few days means nothing yet, but my gut says I'm in the clear for now. Hopefully I don't eat my own words anytime soon. I honestly didn't expect this result; I expected to have replaced a cable or hard drive by now.

Remember to treat your system with respect and kindness and it may give you years of great friendship ;)

joeschmuck · Feb 26, 2017

EDIT: Sorry the formatting is a bit off, not sure what is going on with the editor today. Basically I had 41 MultiZone errors on ada0.

I spoke too soon. Now another drive issue but on a different drive so I ran another Extended Test after I got the error email. Here is the email I received this morning:

Code:


A SMART ERROR HAS BEEN IDENTIFIED, EXAMINE THIS DATA



########## SMART Status Quick Summary for all drives ##########

+------+---------------+----+-----+-----------+----+
|Device|Serial         |Temp|Power|  Sector   |Last|
|      |               |    |On   |  Errors   |Test|
|      |               |    |Hours|RE|PE|OL|MZ|Age |
+------+---------------+----+-----+--+--+--+--+----+
|ada0 !|WD-WMC301176182| 28 |34047| 0| 0| 0|41|   0|
|ada1  |WD-WMC301183577| 32 |33592| 0| 0| 0| 0|   0|
|ada2  |WD-WMC300411588| 32 |37495| 0| 0| 0| 0|   0|
|ada3  |WD-WMC300411516| 30 |37500| 0| 0| 0| 0|   0|
|ada4  |WD-WMC300412480| 32 |37480| 0| 0| 0| 0|   0|
|ada5  |WD-WMC300410673| 33 |37504| 0| 0| 0| 0|   0|
+------+---------------+----+-----+--+--+--+--+----+



########## SMART status report summary for all drives ##########

+------+---------------+----+-----+-----+-----+-----+-------+-------+--------+------+------+------+------+------+-------+----+
|Device|Serial         |Temp|Power|Power|Start|Spin |ReAlloc|Current|Offline |Multi |Load  |UDMA  |Seek  |High  |Command|Last|
|      |               |    |On   |On   |Stop |Retry|Sectors|Pending|Uncorrec|Zone  |Cycle |CRC   |Errors|Fly   |Timeout|Test|
|      |               |    |Hours|Years|Count|Count|       |Sectors|Sectors |Errors|Count |Errors|      |Writes|Count  |Age |
+------+---------------+----+-----+-----+-----+-----+-------+-------+--------+------+------+------+------+------+-------+----+
|ada0 !|WD-WMC301176182| 28 |34047|3.887|  286|    0|      0|      0|       0|    41|   188|     0|   N/A|   N/A|    N/A|   0|
|ada1  |WD-WMC301183577| 32 |33592|3.835|  269|    0|      0|      0|       0|     0|   175|     0|   N/A|   N/A|    N/A|   0|
|ada2  |WD-WMC300411588| 32 |37495|4.280|  492|    0|      0|      0|       0|     0|   375|     0|   N/A|   N/A|    N/A|   0|
|ada3  |WD-WMC300411516| 30 |37500|4.281|  500|    0|      0|      0|       0|     0|   383|     4|   N/A|   N/A|    N/A|   0|
|ada4  |WD-WMC300412480| 32 |37480|4.279|  488|    0|      0|      0|       0|     0|   581|     0|   N/A|   N/A|    N/A|   0|
|ada5  |WD-WMC300410673| 33 |37504|4.281|  503|    0|      0|      0|       0|     0|   385|     0|   N/A|   N/A|    N/A|   0|
+------+---------------+----+-----+-----+-----+-----+-------+-------+--------+------+------+------+------+------+-------+----+

Then I ran the second Extended Test and the Multi-Zone errors all went away.

Code:

########## SMART Status Quick Summary for all drives ##########

+------+---------------+----+-----+-----------+----+
|Device|Serial         |Temp|Power|  Sector   |Last|
|      |               |    |On   |  Errors   |Test|
|      |               |    |Hours|RE|PE|OL|MZ|Age |
+------+---------------+----+-----+--+--+--+--+----+
|ada0  |WD-WMC301176182| 30 |34054| 0| 0| 0| 0|   0|
|ada1  |WD-WMC301183577| 33 |33599| 0| 0| 0| 0|   0|
|ada2  |WD-WMC300411588| 32 |37503| 0| 0| 0| 0|   0|
|ada3  |WD-WMC300411516| 31 |37507| 0| 0| 0| 0|   0|
|ada4  |WD-WMC300412480| 33 |37487| 0| 0| 0| 0|   0|
|ada5  |WD-WMC300410673| 34 |37511| 0| 0| 0| 0|   0|
+------+---------------+----+-----+--+--+--+--+----+



########## SMART status report summary for all drives ##########

+------+---------------+----+-----+-----+-----+-----+-------+-------+--------+------+------+------+------+------+-------+----+
|Device|Serial         |Temp|Power|Power|Start|Spin |ReAlloc|Current|Offline |Multi |Load  |UDMA  |Seek  |High  |Command|Last|
|      |               |    |On   |On   |Stop |Retry|Sectors|Pending|Uncorrec|Zone  |Cycle |CRC   |Errors|Fly   |Timeout|Test|
|      |               |    |Hours|Years|Count|Count|       |Sectors|Sectors |Errors|Count |Errors|      |Writes|Count  |Age |
+------+---------------+----+-----+-----+-----+-----+-------+-------+--------+------+------+------+------+------+-------+----+
|ada0  |WD-WMC301176182| 30 |34054|3.887|  286|    0|      0|      0|       0|     0|   188|     0|   N/A|   N/A|    N/A|   0|
|ada1  |WD-WMC301183577| 33 |33599|3.836|  269|    0|      0|      0|       0|     0|   175|     0|   N/A|   N/A|    N/A|   0|
|ada2  |WD-WMC300411588| 32 |37503|4.281|  492|    0|      0|      0|       0|     0|   375|     0|   N/A|   N/A|    N/A|   0|
|ada3  |WD-WMC300411516| 31 |37507|4.282|  500|    0|      0|      0|       0|     0|   383|     4|   N/A|   N/A|    N/A|   0|
|ada4  |WD-WMC300412480| 33 |37487|4.279|  488|    0|      0|      0|       0|     0|   581|     0|   N/A|   N/A|    N/A|   0|
|ada5  |WD-WMC300410673| 34 |37511|4.282|  503|    0|      0|      0|       0|     0|   385|     0|   N/A|   N/A|    N/A|   0|
+------+---------------+----+-----+-----+-----+-----+-------+-------+--------+------+------+------+------+------+-------+----+

I'm not sure why I'm seeing issues but I'll have to keep track of things. Maybe I have a power supply failing. Note that the MultiZone errors are all internal to the hard drive, it's not an interface problem, but if power is not stable then that could cause issues. Wish I had an O'scope right now.

Additionally I am running a scrub right now. I'm sure it will be fine but it's just an extra precaution.

Cheers!

Robert Trevellyan · Feb 26, 2017

Drives can be weird. A client's laptop had become unusable, and when I SMART tested the hard drive it reported a read failure almost immediately. I cloned it with ddrescue and recovered all but about 1.5MB. The pending sector count was over 400, and the ddrescue mapfile showed many, many areas of unreadable sectors.

After installing a new drive, restoring the data and returning the laptop, I ran badblocks on the original drive. All the errors disappeared in the first pass. No pending sectors, no reallocation events, nothing - even after four passes.

Of course, I don't plan to deploy the drive...

joeschmuck · Feb 27, 2017

Makes you wonder what happened to the drive in the first place to cause the corrupt sectors. But yes, sometimes drives can be weird.

I ran a scrub and of course all was good. While I don't think it could be the issue, I did upgrade ESXi 6.5 to the first update a few days earlier. I don't see how they could be related but it's always worth noting.

Robert Trevellyan · Feb 27, 2017

The only rationalization I can come up with is areas of the drive that were written when it was new, then mostly sat idle, then some kind of calibration issue made them hard to read, but writing them again masked the issue. I imagine that if the drive remained in service, the problem would reappear.

joeschmuck · Apr 10, 2017

And another update:

Drive ada0 cleared up, no issues. Drive ada3, however, now has 27 UDMA CRC (ID 199) errors, no other errors, and the SMART Extended test passes with flying colors. And I did just update ESXi again. Related? So now I've replaced the SATA cable and will see what happens. A scrub tends to show me whether the problem is "fixed" or not. I'm halfway through a scrub and no further errors have occurred, yet. I'll run a SMART long test as well, leaving no stone unturned.

Code:

+------+---------------+----+-----+-----+-----+-----+-------+-------+--------+------+------+------+------+------+-------+----+
|Device|Serial         |Temp|Power|Power|Start|Spin |ReAlloc|Current|Offline |Multi |Load  |UDMA  |Seek  |High  |Command|Last|
|      |               |    |On   |On   |Stop |Retry|Sectors|Pending|Uncorrec|Zone  |Cycle |CRC   |Errors|Fly   |Timeout|Test|
|      |               |    |Hours|Years|Count|Count|       |Sectors|Sectors |Errors|Count |Errors|      |Writes|Count  |Age |
+------+---------------+----+-----+-----+-----+-----+-------+-------+--------+------+------+------+------+------+-------+----+
|ada0  |WD-WMC301176182| 29 |35076|4.004|  290|    0|      0|      0|       0|     0|   192|     0|   N/A|   N/A|    N/A|   0|
|ada1  |WD-WMC301183577| 34 |34622|3.952|  273|    0|      0|      0|       0|     0|   179|     0|   N/A|   N/A|    N/A|   0|
|ada2  |WD-WMC300411588| 35 |38525|4.398|  496|    0|      0|      0|       0|     0|   379|     0|   N/A|   N/A|    N/A|   0|
|ada3  |WD-WMC300411516| 31 |38529|4.398|  504|    0|      0|      0|       0|     0|   387|    27|   N/A|   N/A|    N/A|   0|
|ada4  |WD-WMC300412480| 35 |38510|4.396|  492|    0|      0|      0|       0|     0|   585|     0|   N/A|   N/A|    N/A|   0|
|ada5  |WD-WMC300410673| 34 |38533|4.399|  507|    0|      0|      0|       0|     0|   389|     0|   N/A|   N/A|    N/A|   0|
+------+---------------+----+-----+-----+-----+-----+-------+-------+--------+------+------+------+------+------+-------+----+

joeschmuck · Apr 15, 2017

Almost a week later and an update. The SATA cable seems to have been the culprit. Since I replaced the cable I've rebooted the entire machine, updated the BIOS, powered off the machine, and run a scrub a few times. The problem seems to be corrected. I still wish I could reset the UDMA CRC count; it wasn't a true hard drive issue, so why should it be retained? A quick investigation shows that this value is permanent, oh well.

joeschmuck · Oct 8, 2017

And we are back again: the same drive (using the serial number for tracking) is throwing sector errors. You can see below that I have errors in IDs 5, 196, 197, and 200, and it failed the Extended Test. I still feel I got a good deal since this drive is 4.9 years old. It's not record setting, but for a drive with a 3-year warranty, almost 5 is respectable.

Code:

########## SMART status report for ada4 drive (Western Digital Red: WD-WMC300411516) ##########
 
SMART overall-health self-assessment test result: PASSED
 
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       886
  3 Spin_Up_Time            0x0027   178   174   021    Pre-fail  Always       -       4066
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       528
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       1
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   042   042   000    Old_age   Always       -       42871
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       253
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       130
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       397
194 Temperature_Celsius     0x0022   118   107   000    Old_age   Always       -       29
196 Reallocated_Event_Count 0x0032   199   199   000    Old_age   Always       -       1
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       9
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       28
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       366

No Errors Logged

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Extended offline    Completed: read failure       60%     42868         1132696104
Short offline       Completed without error       00%     42869         -
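Note that the drive with serial WD-WMC300411516 was reported as ada3 in February and shows up as ada4 in the report above, which is why tracking by serial number matters. A minimal Python sketch of comparing two snapshots keyed by serial follows; the helper name `crc_change` and the dictionary layout are assumptions for illustration, while the device names and CRC counts are copied from the reports in this thread.

```python
def crc_change(before, after):
    """Per-serial change between two snapshots of {serial: (device, crc_count)}.

    Drives are matched on serial number only, so a drive that shows up under a
    different device name is still recognised as the same drive.
    """
    changes = {}
    for serial, (device_now, crc_now) in after.items():
        if serial in before:
            device_then, crc_then = before[serial]
            changes[serial] = {
                "device_then": device_then,
                "device_now": device_now,
                "new_crc_errors": crc_now - crc_then,
            }
    return changes

# February 2017 report versus the October 2017 report above.
feb_2017 = {"WD-WMC300411516": ("ada3", 4)}
oct_2017 = {"WD-WMC300411516": ("ada4", 28)}

changes = crc_change(feb_2017, oct_2017)
for serial, c in changes.items():
    print(f"{serial}: {c['device_then']} -> {c['device_now']}, "
          f"+{c['new_crc_errors']} CRC errors")
```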

So what to do now... Well, I'll replace it temporarily with another 2TB drive I have lying around that tested good using badblocks (I have a few 2TB drives just waiting to be used as needed). I plan to test the crap out of the failing drive and see what happens. Now that the first drive has failed, it's reasonable to assume the other drives will fall in line, so I need to buy replacement hard drives, but I'd like to reduce the overall drive count to 3 or 4. I'm looking at either four 6TB drives in a RAIDZ2 or three 10TB drives in a 3-way mirror; both result in approximately the same storage capacity. Currently I have ~5.8TB of usable storage, and while I'm only using ~60% of it, I'd like to increase that just a bit to ~8.7TB of usable storage. Over 50% of what I store is computer backup images. So I'm now waiting on a good sale for 6TB or 10TB drives. The 6TB drives will be cheaper to purchase than the 10TB drives, but we will have to see what sales crop up. If I could find 6TB drives for $160 each, I'd be ordering them right now.
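As a rough check on those capacity figures, the Python sketch below estimates usable space from drive size and layout. Two things here are assumptions, not statements from this post: that the current pool is six 2TB drives in RAIDZ2, and that "usable" means the data drives converted to TiB and then capped at 80% full. Under those assumptions the numbers land on the ~5.8 and ~8.7 quoted above; ZFS overhead is ignored.

```python
TB = 10**12   # drive vendors' decimal terabyte, in bytes
TIB = 2**40   # binary tebibyte, in bytes

def usable_tib(drive_tb, data_drives, fill_limit=0.8):
    """Rough usable space: data drives only, converted to TiB, capped at fill_limit.

    Ignores ZFS metadata, padding and swap, so treat the result as an estimate.
    """
    return drive_tb * TB * data_drives / TIB * fill_limit

layouts = {
    "six 2TB, RAIDZ2 (assumed current pool)": usable_tib(2, 6 - 2),
    "four 6TB, RAIDZ2": usable_tib(6, 4 - 2),
    "three 10TB, 3-way mirror": usable_tib(10, 1),
}
for name, tib in layouts.items():
    print(f"{name}: ~{tib:.1f} TiB")
```

With the same assumptions the 3-way mirror comes out nearer 7.3, so the two candidate layouts are in the same range rather than identical.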

I will update this posting as the hard drive testing continues. My hope is that my problems become solutions for other people who experience a drive failure.

guermantes · Oct 16, 2017

Thanks for documenting your troubleshooting approach so meticulously! I am just about to get my feet wet in FreeNAS territory, so I appreciate this kind of insight into how to make sense of strange errors. I just hope I won't fail in setting up all the reporting that is needed to catch problems before they become dire.

joeschmuck · Oct 16, 2017

FreeNAS is not terrible to set up; however, not everything is automatic. The main things are to set up email notifications and ensure they work, and to set up routine SMART tests.

To update my situation, I thoroughly tested my "failing" drive and I only have the one sector marked as bad, no other failures. I should have remapped that sector as good and retested the drive; however, I didn't. After 1 week of solid testing I have determined that the drive is fine for now and I have resilvered it back into the pool. I am now actively looking for a good sale on 6TB NAS drives, and once I find that awesome sale I will replace my pool with the new drives. In my morning automated drive test status email I still get a failure indication for the one drive, but that is only because I don't allow for any errors in my script. This is acceptable to me and I don't think I'll have to wait too long for a good sale. I'm hoping for $180 each but that may be wishful thinking for 2017.

Jailer · Oct 16, 2017

joeschmuck said:
I'm hoping for $180 each

You missed it. Newegg had one a few weeks back on the HGST NAS drives.

Inxsible · Oct 16, 2017

Jailer said:
You missed it. Newegg had one a few weeks back on the HGST NAS drives.

Yup, I got a couple of Ironwolf 6TB drives for $359.98 --- $179.99 each.

joeschmuck · Oct 16, 2017

Inxsible said:
Yup, I got a couple of Ironwolf 6TB drives for $359.98 --- $179.99 each.

It will come back around, I just need to be patient. My problem is I didn't have a questionable drive during that sale; now I do, and I hate laying out ~$800 for the drives. That is a lot of cash!

Jailer · Oct 16, 2017

Tell me about it. I had my first drive failure a week ago and had to pay more than I wanted to for the replacement so I could get one here and get it burned in. I'm still hoping to get 2 more years out of the existing drives but when the next sale comes up I will be purchasing a spare to have burned in ready to go.

joeschmuck · Nov 18, 2017

This is just a happy update to report that four of my six hard drives are now beyond the 5 year point with no issues. Well, I shouldn't say no issues: I've got 1 reallocated sector, which I believe I personally induced, and 28 CRC errors, which were induced by a bad cable. Unfortunately I cannot erase these issues; they are permanently recorded.
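The "5 year point" can be checked directly from the Power_On_Hours raw values. The Python sketch below assumes a plain division by 8760 hours per year; that divisor is an inference, chosen because it reproduces the "Power On Years" figures in the reports in this thread. The hour values are copied from the November report.

```python
HOURS_PER_YEAR = 24 * 365  # 8760; leap days are ignored

def power_on_years(hours):
    """Convert a Power_On_Hours raw value to years, rounded to 3 decimals."""
    return round(hours / HOURS_PER_YEAR, 3)

# Power-on hours copied from the November 2017 report.
drives = [("WD-WMC300411588", 43842), ("WD-WMC301176182", 40393)]
for serial, hours in drives:
    years = power_on_years(hours)
    status = "past" if years >= 5 else "not yet at"
    print(f"{serial}: {hours} h = {years} years ({status} the 5 year point)")
```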

Just a bit of history about the drive configuration. The first four drives were purchased for FreeNAS and I tested the heck out of them which is why you will see a lot of head loading cycles and spinup cycles. Eventually I settled on a configuration I really liked and purchased two more drives so I could have the desired capacity I needed and finalize my FreeNAS configuration. All my drives were adjusted in the drive firmware to disable the head park timer and the drives never sleep. What this means is for the majority of almost 5 years my hard drives have been spinning with the heads loaded on the platters. This is important to note because the longevity of these drives is fantastic. Now the only time the drives spin down is when I power off the computer to perform routine maintenance on it like cleaning the case of dust and spiders and ensure all the fans are working and no unwanted noises are being generated.

The only downside to these drives lasting longer than expected is I'm itching to replace them but until I have a failure I just can't see spending the money.

The output below is what I get every morning at 7AM, and the red highlighted data indicates that I should examine the data. Of course this is the one reallocated sector I caused, which is why this is a warning message. If I have a failure count of 2 or more this becomes an ERROR message in the title and the highlighted text changes to red.
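The warning and error thresholds described above amount to a small mapping from failure count to alert level. Here is a minimal Python sketch of that rule; it is not the actual script, the function name `subject_level` is hypothetical, and the "OK" level for a zero count is an assumption.

```python
def subject_level(failure_count):
    """Map a failure count to the alert level described above.

    1 failure gives a WARNING, 2 or more give an ERROR; "OK" for zero is an
    assumed placeholder for the no-problem case.
    """
    if failure_count >= 2:
        return "ERROR"
    if failure_count == 1:
        return "WARNING"
    return "OK"

# The single reallocated sector corresponds to a count of 1.
print(subject_level(1))
```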

Also, the drive temps are that way just due to location with respect to the cooling fans. Not all drives get the same airflow, but I don't think they have ever gone above 34C since they have been installed in this case.

My conclusion here is that WD created some very good hard drives and they were a very sound purchase.

Code:

To: JoeSchmuck@myhome
From: root@freenas.local
Subject: *WARNING*  FreeNAS Message Hard Drive Testing Message  *WARNING*
 
A SMART ERROR HAS BEEN IDENTIFIED, EXAMINE THIS DATA
 
 
########## SMART Status Quick Summary for all drives ##########
 
+------+---------------+----+-----+-----------+----+
|Device|Serial         |Temp|Power|  Sector   |Last|
|      |               |    |On   |  Errors   |Test|
|      |               |    |Hours|RE|PE|OL|MZ|Age |
+------+---------------+----+-----+--+--+--+--+----+
|ada0  |WD-WMC301176182| 25 |40393| 0| 0| 0| 0|   0|
|ada1  |WD-WMC301183577| 25 |39939| 0| 0| 0| 0|   0|
|ada2  |WD-WMC300411588| 27 |43842| 0| 0| 0| 0|   0|
|ada3  |WD-WMC300412480| 25 |43827| 0| 0| 0| 0|   0|
|ada4 ?|WD-WMC300411516| 28 |43846| 1| 0| 0| 0|   0|
|ada5  |WD-WMC300410673| 28 |43850| 0| 0| 0| 0|   0|
+------+---------------+----+-----+--+--+--+--+----+
 
 
 
########## SMART status report summary for all drives ##########
 
+------+---------------+----+-----+-----+-----+-----+-------+-------+--------+------+------+------+------+------+-------+----+
|Device|Serial         |Temp|Power|Power|Start|Spin |ReAlloc|Current|Offline |Multi |Load  |UDMA  |Seek  |High  |Command|Last|
|      |               |    |On   |On   |Stop |Retry|Sectors|Pending|Uncorrec|Zone  |Cycle |CRC   |Errors|Fly   |Timeout|Test|
|      |               |    |Hours|Years|Count|Count|       |Sectors|Sectors |Errors|Count |Errors|      |Writes|Count  |Age |
+------+---------------+----+-----+-----+-----+-----+-------+-------+--------+------+------+------+------+------+-------+----+
|ada0  |WD-WMC301176182| 25 |40393|4.611|  325|    0|      0|      0|       0|     0|   207|     0|   N/A|   N/A|    N/A|   0|
|ada1  |WD-WMC301183577| 25 |39939|4.559|  308|    0|      0|      0|       0|     0|   194|     0|   N/A|   N/A|    N/A|   0|
|ada2  |WD-WMC300411588| 27 |43842|5.005|  531|    0|      0|      0|       0|     0|   394|     0|   N/A|   N/A|    N/A|   0|
|ada3  |WD-WMC300412480| 25 |43827|5.003|  527|    0|      0|      0|       0|     0|   600|     0|   N/A|   N/A|    N/A|   0|
|ada4 ?|WD-WMC300411516| 28 |43846|5.005|  539|    0|      1|      0|       0|     0|   402|    28|   N/A|   N/A|    N/A|   0|
|ada5  |WD-WMC300410673| 28 |43850|5.006|  542|    0|      0|      0|       0|     0|   404|     0|   N/A|   N/A|    N/A|   0|
+------+---------------+----+-----+-----+-----+-----+-------+-------+--------+------+------+------+------+------+-------+----+
 
 
 
########## SMART status report for ada0 drive (Western Digital Red: WD-WMC301176182) ##########
 
SMART overall-health self-assessment test result: PASSED
 
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       1
  3 Spin_Up_Time            0x0027   174   172   021    Pre-fail  Always       -       4283
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       325
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   045   045   000    Old_age   Always       -       40393
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       218
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       117
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       207
194 Temperature_Celsius     0x0022   122   106   000    Old_age   Always       -       25
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

No Errors Logged

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Extended offline    Completed without error       00%     40393         -
Short offline       Completed without error       00%     40364         -
 
 
 
########## SMART status report for ada1 drive (Western Digital Red: WD-WMC301183577) ##########
 
SMART overall-health self-assessment test result: PASSED
 
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   175   171   021    Pre-fail  Always       -       4208
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       308
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   046   046   000    Old_age   Always       -       39939
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       205
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       113
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       194
194 Temperature_Celsius     0x0022   122   105   000    Old_age   Always       -       25
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

No Errors Logged

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Extended offline    Completed without error       00%     39938         -
Short offline       Completed without error       00%     39909         -
 
 
 
########## SMART status report for ada2 drive (Western Digital Red: WD-WMC300411588) ##########
 
SMART overall-health self-assessment test result: PASSED
 
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   177   173   021    Pre-fail  Always       -       4141
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       531
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   040   040   000    Old_age   Always       -       43842
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       263
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       136
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       394
194 Temperature_Celsius     0x0022   120   105   000    Old_age   Always       -       27
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

No Errors Logged

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Extended offline    Completed without error       00%     43842         -
Short offline       Completed without error       00%     43812         -
 
 
 
########## SMART status report for ada3 drive (Western Digital Red: WD-WMC300412480) ##########
 
SMART overall-health self-assessment test result: PASSED
 
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x002f   200   200   051	Pre-fail  Always	   -	   0
  3 Spin_Up_Time			0x0027   175   172   021	Pre-fail  Always	   -	   4225
  4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   527
  5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x002e   200   200   000	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0032   040   040   000	Old_age   Always	   -	   43827
 10 Spin_Retry_Count		0x0032   100   100   000	Old_age   Always	   -	   0
 11 Calibration_Retry_Count 0x0032   100   100   000	Old_age   Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   262
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   136
193 Load_Cycle_Count		0x0032   200   200   000	Old_age   Always	   -	   600
194 Temperature_Celsius	 0x0022   122   103   000	Old_age   Always	   -	   25
196 Reallocated_Event_Count 0x0032   200   200   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0030   100   253   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x0032   200   200   000	Old_age   Always	   -	   0
200 Multi_Zone_Error_Rate   0x0008   200   200   000	Old_age   Offline	  -	   0
 
No Errors Logged
 
Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
Extended offline	Completed without error	   00%	 43826		 -
Short offline	   Completed without error	   00%	 43797		 -
 
 
 
########## SMART status report for ada4 drive (Western Digital Red: WD-WMC300411516) ##########
 
SMART overall-health self-assessment test result: PASSED
 
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x002f   200   200   051	Pre-fail  Always	   -	   1084
  3 Spin_Up_Time			0x0027   178   174   021	Pre-fail  Always	   -	   4100
  4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   539
  5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   1
  7 Seek_Error_Rate		 0x002e   200   200   000	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0032   040   040   000	Old_age   Always	   -	   43846
 10 Spin_Retry_Count		0x0032   100   100   000	Old_age   Always	   -	   0
 11 Calibration_Retry_Count 0x0032   100   100   000	Old_age   Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   264
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   136
193 Load_Cycle_Count		0x0032   200   200   000	Old_age   Always	   -	   402
194 Temperature_Celsius	 0x0022   119   107   000	Old_age   Always	   -	   28
196 Reallocated_Event_Count 0x0032   199   199   000	Old_age   Always	   -	   1
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0030   100   253   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x0032   200   200   000	Old_age   Always	   -	   28
200 Multi_Zone_Error_Rate   0x0008   200   200   000	Old_age   Offline	  -	   0
 
No Errors Logged
 
Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
Extended offline	Completed without error	   00%	 43846		 -
Short offline	   Completed without error	   00%	 43816		 -
 
 
 
########## SMART status report for ada5 drive (Western Digital Red: WD-WMC300410673) ##########
 
SMART overall-health self-assessment test result: PASSED
 
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x002f   200   200   051	Pre-fail  Always	   -	   0
  3 Spin_Up_Time			0x0027   175   172   021	Pre-fail  Always	   -	   4208
  4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   542
  5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x002e   200   200   000	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0032   040   040   000	Old_age   Always	   -	   43850
 10 Spin_Retry_Count		0x0032   100   100   000	Old_age   Always	   -	   0
 11 Calibration_Retry_Count 0x0032   100   100   000	Old_age   Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   266
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   137
193 Load_Cycle_Count		0x0032   200   200   000	Old_age   Always	   -	   404
194 Temperature_Celsius	 0x0022   119   103   000	Old_age   Always	   -	   28
196 Reallocated_Event_Count 0x0032   200   200   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0030   100   253   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x0032   200   200   000	Old_age   Always	   -	   0
200 Multi_Zone_Error_Rate   0x0008   200   200   000	Old_age   Offline	  -	   0
 
No Errors Logged
 
Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
Extended offline	Completed without error	   00%	 43850		 -
Short offline	   Completed without error	   00%	 43821		 -
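
For anyone who would rather not eyeball every row of reports like these, here is a rough Python sketch that picks out the attributes that usually matter. It assumes the ten-column attribute layout shown in the reports above (ID, name, flag, value, worst, thresh, type, updated, when_failed, raw value). The function name `flag_attributes` and the watch list of IDs are my own picks for illustration; they are not part of FreeNAS or smartmontools.

```python
import re

# Attribute IDs worth a second look when their raw value is non-zero.
# This watch list is an assumption for illustration, not an official one.
WATCH_IDS = {5, 196, 197, 198, 199}

# One attribute row: ID, name, flag, value, worst, thresh,
# type, updated, when_failed, raw value (leading integer only).
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}\s+\d+\s+\d+\s+\d+\s+"
    r"\S+\s+\S+\s+\S+\s+(\d+)"
)


def flag_attributes(report):
    """Return (id, name, raw) for watched attributes with a non-zero raw value."""
    flagged = []
    for line in report.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header, blank line, or self-test log row
        attr_id, name, raw = int(m.group(1)), m.group(2), int(m.group(3))
        if attr_id in WATCH_IDS and raw != 0:
            flagged.append((attr_id, name, raw))
    return flagged


# Four rows copied from the ada4 report above (whitespace normalised).
sample = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140  Pre-fail  Always   -   1
196 Reallocated_Event_Count 0x0032   199   199   000  Old_age   Always   -   1
197 Current_Pending_Sector  0x0032   200   200   000  Old_age   Always   -   0
199 UDMA_CRC_Error_Count    0x0032   200   200   000  Old_age   Always   -   28
"""

for attr_id, name, raw in flag_attributes(sample):
    print(attr_id, name, raw)
```

Fed the ada4 rows, this flags attributes 5, 196 and 199 and skips 197 because its raw value is zero. Raw_Read_Error_Rate is deliberately left off the watch list because some drive makes report large raw values there during normal operation; add ID 1 if that does not apply to your drives.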
