blahhumbug
I recently built my first FreeNAS box using 8xWD Red 3TB (WD30EFRX) drives. As part of my initial system checks I ran memtest86 for 48 hours and then ran the following tests in parallel across all 8 drives:
smartctl -t short
smartctl -t conveyance
smartctl -t long
smartctl -a
smartctl reported no errors and all tests passed across all 8 drives. At this point I created a RAID-Z2 volume on the drives so that I could play around with configuring some jails. While I did this, I ran jgreco's solnet-array-test-v2.sh script, which runs many parallel dd read tasks. The script took ~48 hours to complete. I then halted my jails, detached the RAID-Z2 volume, and ran the following tests:
badblocks -ns #non-destructive test
smartctl -t long
smartctl -a
On 1 out of 8 of the drives, I got the following smartctl report.
Code:
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Red
Device Model: WDC WD30EFRX-68EUZN0
Serial Number: WD-************
LU WWN Device Id: * ****** *********
Firmware Version: 82.00A82
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Wed May 6 19:35:29 2015 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 113) The previous self-test completed having
the read element of the test failed.
Total time to complete Offline
data collection: (39540) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 397) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x703d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 100 253 021 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 2
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 145
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 2
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 0
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 42
194 Temperature_Celsius 0x0022 115 109 000 Old_age Always - 35
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 1
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 10% 138 901099712
# 2 Extended offline Completed without error 00% 32 -
# 3 Conveyance offline Completed without error 00% 24 -
# 4 Short offline Completed without error 00% 24 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
The SMART error log shows "No Errors Logged", but the latest extended offline test is listed as "Completed: read failure" with an LBA of first error. Multi_Zone_Error_Rate now has a raw value of 1, although value/worst/thresh are unchanged and match all the passing drives. Does "No Errors Logged" indicate that a bad sector may have been remapped? I also do not know what encoding WD uses for the Multi_Zone_Error_Rate raw value, so that number may not be a cause for concern.
I was expecting to see a non-zero Current_Pending_Sector count, but since it did not change, I'm unsure how serious this is. I ran the extended offline test once more and got the same failure at the same LBA. The error log still shows "No Errors Logged" and Multi_Zone_Error_Rate remained the same as above.
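For anyone wanting to re-check just the failing region instead of waiting ~6.5 hours for another full extended test, here is a rough sketch. The helper names `first_error_lba` and `selective_test_cmd` and the default margin of 1000 LBAs are my own assumptions for illustration; the only smartctl options relied on are `-l selftest` and `-t select,N-M`. The second helper only prints the command, it does not run it.

```sh
#!/bin/sh
# Hypothetical helper: read self-test log text on stdin and print the
# LBA_of_first_error from the first (most recent) "read failure" entry.
first_error_lba() {
    awk '/^#/ && /read failure/ { print $NF; exit }'
}

# Hypothetical helper: print (not run) a selective self-test command that
# covers `margin` LBAs on either side of the failing LBA.
# Usage: selective_test_cmd LBA DEVICE [MARGIN]
selective_test_cmd() {
    lba=$1
    dev=$2
    margin=${3:-1000}            # assumed default, not a recommended value
    if [ "$lba" -gt "$margin" ]; then
        start=$(( lba - margin ))
    else
        start=0                  # clamp at the start of the disk
    fi
    end=$(( lba + margin ))
    echo "smartctl -t select,${start}-${end} ${dev}"
}

# Example (commented out so nothing touches a real drive):
#   lba=$(smartctl -l selftest /dev/ada3 | first_error_lba)
#   selective_test_cmd "$lba" /dev/ada3
```

With the LBA from the log above (901099712) this prints a command for the span 901098712-901100712 on /dev/ada3.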
My assumption is that this is an issue worthy of an immediate RMA. But I wanted to ping the experts here to make sure I'm not misunderstanding the SMART results, and to ask whether there are any additional steps I should take or tests I should run before doing an RMA.
I saw a few threads elsewhere talking about using fdisk and a few other utilities to determine a block location, then using dd to get the drive to remap the bad block. In those cases Current_Pending_Sector was non-zero, though, so I'm not sure that applies here.
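As a sketch of the block-location arithmetic those threads describe: this drive reports 512-byte logical and 4096-byte physical sectors, so each physical sector holds 8 logical LBAs. The helper names below are hypothetical, and I am assuming the physical sectors are aligned to LBA 0. The dd command is read-only and is only printed, not executed; writing to that LBA instead would overwrite data.

```sh
#!/bin/sh
# Hypothetical helper: print the first and last logical LBA of the 4096-byte
# physical sector containing the given 512-byte logical LBA.
# Assumes physical sectors are aligned to LBA 0.
physical_sector_span() {
    lba=$1
    ratio=$(( 4096 / 512 ))              # 8 logical sectors per physical sector
    first=$(( lba - lba % ratio ))
    last=$(( first + ratio - 1 ))
    echo "$first $last"
}

# Hypothetical helper: print (not run) a read-only dd command that reads the
# single logical sector at the given LBA and discards the data.
read_check_cmd() {
    lba=$1
    dev=$2
    echo "dd if=${dev} of=/dev/null bs=512 skip=${lba} count=1"
}

# Example:
#   physical_sector_span 901099712
#   read_check_cmd 901099712 /dev/ada3
```

For the failing LBA above, 901099712 is already a multiple of 8, so the affected physical sector would span LBAs 901099712 through 901099719.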
If needed, the full system hardware specs can be found in this post:
https://forums.freenas.org/index.php?threads/first-nas-build-avoton-8-core-and-24tb-wd-red.30460/
The drive in question is /dev/ada3, attached to SATA_3 on an ASRock C2750D4I (an Intel SATA 2.0 port, not one of the Marvell ports). It is in the 4th bay from the bottom of a Silverstone DS380B case.
After a few hours of stress testing with the parallel dd read script, I did observe that the drive reached 43C. This fluctuated a little during the dd reads, but mostly remained above 40C during random spot checks over the 2 days it ran. The drive at the top of the case also reached 43C, but all others remained under 40C. While this temperature is higher than I would like, I'm not sure it's an immediate cause for concern. When idle or under normal use-case load, the drive does stay between 35C and 39C.
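The spot checks were manual. A small sketch of how the temperature could be pulled out of the attribute table for logging is below. The helper name `drive_temp_c` is hypothetical, and it assumes the table layout shown in the report above, where RAW_VALUE is the 10th whitespace-separated field.

```sh
#!/bin/sh
# Hypothetical helper: read `smartctl -A` output on stdin and print the raw
# Temperature_Celsius value (attribute 194).
# Assumes the 10-column attribute table layout shown in the report above.
drive_temp_c() {
    awk '$2 == "Temperature_Celsius" { print $10; exit }'
}

# Example (commented out so nothing touches a real drive):
#   smartctl -A /dev/ada3 | drive_temp_c
```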