Andrew McLean
Cadet
- Joined
- Feb 14, 2017
- Messages
- 1
A couple of years ago I built a FreeNAS machine around an HP Microserver GEN8 G2020T. It has 4 x 4 TB WD hard disks configured as a single raidz pool and boots from a MicroSD card. It had been working pretty much trouble-free, but it has recently started faulting the RAID pool with the message: "The volume RAID (ZFS) state is UNAVAIL: One or more devices are faulted in response to IO failures."
If I reboot the machine and then import the pool, everything is typically OK:
Code:
[root@freenas] ~# zpool status RAID
  pool: RAID
 state: ONLINE
  scan: scrub repaired 0 in 14h11m with 0 errors on Mon Feb 13 08:37:21 2017
config:

        NAME                                            STATE     READ WRITE CKSUM
        RAID                                            ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/8be90dc0-ea53-11e6-a7d1-a0481cb8579c  ONLINE       0     0     0
            gptid/286666d8-5f35-11e3-aab1-a0481cb8579c  ONLINE       0     0     0
            gptid/294514ff-5f35-11e3-aab1-a0481cb8579c  ONLINE       0     0     0
            gptid/2a284dd0-5f35-11e3-aab1-a0481cb8579c  ONLINE       0     0     0

errors: No known data errors
I'm struggling to work out what the problem is. Here's what I've tried:
One of the hard disks had persistently shown a "Current Pending Sector Count" of 1, generating messages like "Device: /dev/ada0, 1 Currently unreadable (pending) sectors". I replaced this disk, but that made no difference.
The SMART status of the disks looks OK, here is a typical example:
Code:
[root@freenas] ~# smartctl -a /dev/ada1
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD40EFRX-68WT0N0
Serial Number:    WD-WCC4E0266365
LU WWN Device Id: 5 0014ee 25e702e10
Firmware Version: 80.00A80
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Feb 14 18:53:29 2017 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (54780) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 548) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   248   177   021    Pre-fail  Always       -       4591
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       51
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   063   063   000    Old_age   Always       -       27368
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       51
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       38
193 Load_Cycle_Count        0x0032   043   043   000    Old_age   Always       -       473558
194 Temperature_Celsius     0x0022   124   114   000    Old_age   Always       -       28
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     27094         -
# 2  Short offline       Completed without error       00%     27089         -
# 3  Short offline       Completed without error       00%     27065         -
# 4  Short offline       Completed without error       00%     27041         -
# 5  Extended offline    Completed without error       00%     27026         -
# 6  Short offline       Completed without error       00%     26993         -
# 7  Short offline       Completed without error       00%     26969         -
# 8  Short offline       Completed without error       00%     26946         -
# 9  Short offline       Completed without error       00%     26921         -
#10  Short offline       Completed without error       00%     26900         -
#11  Short offline       Completed without error       00%     26873         -
#12  Extended offline    Completed without error       00%     26859         -
#13  Short offline       Completed without error       00%     26826         -
#14  Short offline       Completed without error       00%     26802         -
#15  Short offline       Completed without error       00%     26778         -
#16  Short offline       Completed without error       00%     26754         -
#17  Short offline       Completed without error       00%     26730         -
#18  Extended offline    Completed without error       00%     26715         -
#19  Short offline       Completed without error       00%     26658         -
#20  Short offline       Completed without error       00%     26634         -
#21  Short offline       Completed without error       00%     26610         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
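To make the "looks OK" check repeatable across all four disks, the attribute table can be parsed and a few error counters compared against zero. This is only a rough sketch: the names `WATCHED`, `parse_attributes` and `nonzero_watched` are made up for illustration, the choice of watched attributes is an assumption rather than an official list, and the sample rows are copied from the /dev/ada1 output in this post.

```python
import re

# Assumption: an illustrative selection of counters often checked for
# media or cabling trouble. It is not an official or complete list.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
}

# One attribute row: ID, name, flag, value, worst, thresh, type,
# updated, when_failed, raw value (leading integer only).
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(\S+)\s+(\S+)\s+(\S+)\s+(\d+)"
)

def parse_attributes(text):
    """Return {attribute_name: raw_value} for each attribute row found."""
    attrs = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            attrs[match.group(2)] = int(match.group(9))
    return attrs

def nonzero_watched(attrs):
    """Return the watched attributes whose raw value is not zero."""
    return {name: raw for name, raw in attrs.items()
            if name in WATCHED and raw != 0}

# Rows copied from the smartctl output for /dev/ada1.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
193 Load_Cycle_Count        0x0032   043   043   000    Old_age   Always       -       473558
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

attrs = parse_attributes(SAMPLE)
print(nonzero_watched(attrs))      # watched counters that are not zero
print(attrs["Load_Cycle_Count"])   # raw load cycle count
```

For this disk none of the watched counters is above zero, which matches the impression that the SMART data itself is clean.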
Do note the high Load_Cycle_Count, though; this is a known problem with this model. I upgraded the firmware a while ago to mitigate the issue.
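For a sense of scale, the Load_Cycle_Count can be set against Power_On_Hours from the same smartctl output. A minimal sketch follows; the function names are hypothetical. The lifetime figure includes the period before the firmware upgrade, so on its own it says nothing about the current rate.

```python
# Raw values copied from the smartctl output for /dev/ada1 in this post.
LOAD_CYCLE_COUNT = 473558
POWER_ON_HOURS = 27368

def lifetime_rate(load_cycles, power_on_hours):
    """Average load cycles per power-on hour over the drive's whole life."""
    return load_cycles / power_on_hours

def rate_between(earlier, later):
    """Load cycles per hour between two (load_cycles, power_on_hours) readings."""
    hours = later[1] - earlier[1]
    if hours <= 0:
        raise ValueError("the later reading must have more power-on hours")
    return (later[0] - earlier[0]) / hours

print(f"{lifetime_rate(LOAD_CYCLE_COUNT, POWER_ON_HOURS):.1f} load cycles per power-on hour")
```

That works out to roughly 17 load cycles per hour averaged over the drive's life. To see whether the firmware change actually helped, take a second reading a day or so later and pass both readings to `rate_between`.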
I'm wondering if it may be a problem with the server or its disk controller, but I can't think of a way of diagnosing this. Are there any logs I should be looking at?
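One possible starting point for the log question is to filter the system log for lines that mention the disks or their controller channels around the time the pool faults. The sketch below makes assumptions that would need checking on the actual machine: that kernel messages end up in /var/log/messages (the FreeBSD default), that the disks show up as adaN, and that the controller channels show up as ahcichN. The function names are hypothetical.

```python
import re

# Assumption: disks are named adaN and AHCI channels ahcichN.
DEVICE = re.compile(r"\b(ada\d+|ahcich\d+)\b")

def disk_lines(lines):
    """Return the lines that mention an ada disk or an ahcich channel."""
    return [line.rstrip("\n") for line in lines if DEVICE.search(line)]

def scan_file(path):
    """Read a log file and return its disk-related lines."""
    # errors="replace" keeps the scan going if the log has odd bytes in it.
    with open(path, errors="replace") as log:
        return disk_lines(log)

# Example call on the NAS itself (the path is an assumption):
#     for line in scan_file("/var/log/messages"):
#         print(line)
```

If lines for several disks turn up at the same timestamps, that would point more towards the controller, cabling, backplane or power than towards any single drive.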
Any help gratefully received.
Andrew