blahhumbug
I recently built my first FreeNAS box using 8x WD Red 3TB (WD30EFRX) drives. As part of my initial system checks, I ran memtest86 for 48 hours and then ran the following tests in parallel across all 8 drives:
smartctl -t short
smartctl -t conveyance
smartctl -t long
smartctl -a
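For anyone repeating this burn-in, the per-drive commands above can be started on every drive with a small loop. This is only a sketch: the device names ada0 through ada7 are an assumption (only /dev/ada3 is named in this post), and `start_selftests` and the `SMARTCTL` override are hypothetical helper names, not part of smartmontools.

```shell
#!/bin/sh
# Sketch: start one SMART self-test type on several drives.
# SMARTCTL can be overridden (e.g. SMARTCTL=echo) for a dry run; it is a
# hypothetical convenience variable, not something smartmontools defines.
SMARTCTL="${SMARTCTL:-smartctl}"

start_selftests() {
    test_type="$1"
    shift
    for dev in "$@"; do
        # The self-test runs inside the drive, so smartctl returns right away
        # and the loop effectively starts the test on all drives in parallel.
        $SMARTCTL -t "$test_type" "/dev/$dev"
    done
}

# Example (assumed device names):
#   start_selftests short ada0 ada1 ada2 ada3 ada4 ada5 ada6 ada7
```

Each test type should be allowed to finish on a drive before the next one is started; `smartctl -a` reports progress under "Self-test execution status".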
smartctl reported no errors and all tests passed on all 8 drives. At this point I created a RAID-Z2 volume on the drives so that I could play around with configuring some jails. While I did this, I let jgreco's solnet-array-test-v2.sh script run, which performs many parallel dd read tasks. The script ran for ~48 hours before completing. I then halted my jails, detached the RAID-Z2 volume, and ran the following tests:
badblocks -ns #non-destructive test
smartctl -t long
smartctl -a
On 1 of the 8 drives, I got the following smartctl report.
Code:
=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD30EFRX-68EUZN0
Serial Number:    WD-************
LU WWN Device Id: * ****** *********
Firmware Version: 82.00A82
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Wed May  6 19:35:29 2015 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 113) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                (39540) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 397) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   100   253   021    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       2
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       145
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       2
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       42
194 Temperature_Celsius     0x0022   115   109   000    Old_age   Always       -       35
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       10%       138         901099712
# 2  Extended offline    Completed without error       00%        32         -
# 3  Conveyance offline  Completed without error       00%        24         -
# 4  Short offline       Completed without error       00%        24         -
SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
The SMART error log shows "No Errors Logged", but the self-test log lists the extended offline test as "Completed: read failure" and records the LBA of the first error. Multi_Zone_Error_Rate now has a RAW_VALUE of 1, although VALUE/WORST/THRESH are unchanged and match all the passing drives. I do not know what encoding WD uses for this attribute's raw value, so that number may not be a cause for concern. Does "No Errors Logged" indicate that a bad sector may have been remapped?
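To compare the sector-health attributes across all drives, the relevant rows can be pulled out of the attribute table. A minimal sketch, assuming the 10-column table layout shown in the report above; `smart_key_attrs` is a hypothetical helper name and the device names in the usage comment are assumed.

```shell
#!/bin/sh
# Sketch: print NAME=RAW_VALUE for attributes 5, 197, 198 and 200
# from `smartctl -A` output read on stdin.
smart_key_attrs() {
    # NF >= 10 skips any short line that merely starts with a matching number.
    awk 'NF >= 10 && ($1 == 5 || $1 == 197 || $1 == 198 || $1 == 200) { print $2 "=" $10 }'
}

# Example (assumed device names):
#   for d in ada0 ada1 ada2 ada3; do echo "$d:"; smartctl -A "/dev/$d" | smart_key_attrs; done
```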
I was expecting to see a non-zero Current_Pending_Sector count, but since it did not change, I'm unsure how serious this is. I ran the extended offline test once more and got the same failure at the same LBA. The error log still shows "No Errors Logged" and Multi_Zone_Error_Rate remains the same as above.
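For reference, the reported LBA can be placed on the disk with a little arithmetic. The sketch below uses only numbers from the report above (512-byte logical sectors, 4096-byte physical sectors, 3,000,592,982,016 bytes of capacity); the function names are hypothetical.

```shell
#!/bin/sh
# Sketch: locate a logical block address (LBA) on the disk.

# Byte offset of an LBA: lba * logical_sector_size
lba_to_bytes() { echo $(( $1 * $2 )); }

# Index of the physical sector holding an LBA:
# lba * logical_sector_size / physical_sector_size
lba_to_phys_sector() { echo $(( $1 * $2 / $3 )); }

# Position of the LBA as a whole-number percentage of the disk:
# lba * logical_sector_size * 100 / capacity_in_bytes
lba_percent() { echo $(( $1 * $2 * 100 / $3 )); }

# Example with the values from the report:
#   lba_to_bytes 901099712 512
#   lba_to_phys_sector 901099712 512 4096
#   lba_percent 901099712 512 3000592982016
```

With these inputs, the failing LBA sits at byte offset 461,363,052,544, is the first logical sector of physical sector 112,637,464, and lies roughly 15% of the way into the disk.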
My assumption is that this issue is worthy of an immediate RMA. But I wanted to ping the experts here to make sure I'm not misunderstanding the SMART results, and to ask whether there are any additional steps I should take or tests I should run before doing an RMA.
I saw a few threads elsewhere that describe using fdisk and a few other utilities to determine a block location and then using dd to get the drive to remap the bad block, but in those cases Current_Pending_Sector was non-zero, so I'm not sure that applies in this case.
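Before trying anything that writes to the disk, the suspect region could be re-read directly. The sketch below only prints two read-only commands for review rather than running them: a `dd` read of the physical sector that contains the LBA, and a SMART selective self-test over the same range. `print_recheck_cmds` is a hypothetical helper; the device and LBA come from the report above. It does not attempt the remap procedure from those threads.

```shell
#!/bin/sh
# Sketch: print read-only commands that re-check one suspect LBA.
# Arguments: device, lba, logical sector size, physical sector size.
print_recheck_cmds() {
    dev="$1"; lba="$2"; lsize="$3"; psize="$4"
    per_phys=$(( psize / lsize ))              # logical sectors per physical sector
    start=$(( lba / per_phys * per_phys ))     # round down to the physical sector start
    end=$(( start + per_phys - 1 ))
    # Read the whole physical sector and discard the data.
    echo "dd if=$dev of=/dev/null bs=$lsize skip=$start count=$per_phys"
    # Ask the drive to self-test only that LBA range.
    echo "smartctl -t select,$start-$end $dev"
}

# Example with the values from the report:
#   print_recheck_cmds /dev/ada3 901099712 512 4096
```

If the `dd` read returns an I/O error at the same location, that would be consistent with the self-test result; a clean read would suggest the failure is not reproducible from the host side.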
If needed, the full system hardware specs can be found in this post:
https://forums.freenas.org/index.php?threads/first-nas-build-avoton-8-core-and-24tb-wd-red.30460/
The drive in question is /dev/ada3, attached to SATA_3 on an ASRock C2750D4I (an Intel SATA 2.0 port, not one of the Marvell ports). It is in the 4th bay from the bottom of a Silverstone DS380B case.
After a few hours of stress testing with the parallel dd read script, I did observe that the drive reached 43C. This fluctuated a little during the dd reads, but mostly remained above 40C during random spot checks over the 2 days it ran. The drive at the top of the case also reached 43C, but all others remained under 40C. While this temperature is higher than I would like, I'm not sure it's an immediate cause for concern. When idle or under normal use-case load, the drive does stay between 35C and 39C.
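The temperature spot checks can be automated by reading attribute 194 from `smartctl -A`. A minimal sketch, assuming the attribute table layout shown in the report; `smart_temp` is a hypothetical helper, and the device names in the usage comment are assumed.

```shell
#!/bin/sh
# Sketch: print the raw Temperature_Celsius value (attribute 194)
# from `smartctl -A` output read on stdin.
smart_temp() {
    awk 'NF >= 10 && $1 == 194 { print $10 }'
}

# Example (assumed device names), one reading per drive:
#   for d in ada0 ada1 ada2 ada3; do
#       echo "$d: $(smartctl -A "/dev/$d" | smart_temp)C"
#   done
```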
			