- Joined
- May 28, 2011
- Messages
- 10,996
This posting is just to show that even someone with my experience has problems with hardware that seems to come out of nowhere.
First of all like our forum rules dictate, I'll post my system configuration:
Physical configuration is in my signature for ESXi System 1 and FreeNAS VM on ESXi System 1.
ESXi Software: 6.5.0 (Build 4887370)
FreeNAS Software Version: FreeNAS-9.10-2-U1 (86c7ef5)
The problem started out with an email (@3:01AM 2/18/2017) stating I have some issue with my ada3 drive with Serial Number ending in 516. Two days ago I had deleted my MiniDLNA jail and rebooted the FreeNAS VM (I'm using ESXi per my system description).
I also received an email from my SMART monitoring script which gave me this output:
As you can see I have 4 errors in ID 199 which indicated a communication issue between the hard drive and SATA controller, basically the SATA cable is the first suspect item. I can state for a fact that this computer has not been physically touched in weeks so this coming out of the blue surprised me. I would have expected a different drive failure to be honest.
Note: ID 199 is not considered a critical error from the hard drives perspective.
What I did was looked at
Additionally I received an error that my Extended SMART tests never occurred for any drive on Sunday and I checked my CRON and it's there to be conducted so I have no idea what is happening right now in that respect since it had worked previously. I did not make any changes to FreeNAS other than the deletion of the jail and a reboot on 17 Feb. Prior to that I upgraded FreeNAS to version 9.10.2-U1 on 29 Jan 2017. However last week I did upgrade ESXi tot he current build version on 11 February 2017 but right now I do not see that as being the issue, but I still need to remember it just in case nothing else proves to be the issue.
My next step is to watch and see what happens. If I get the CRC error again for drive ada3 then I will swap the SATA cables between ada3 (S/N: 516) and an adjacent drive and see if the problem stays with the suspect drive or moves to the other drive. So the process is to identify the failing hardware through the process of elimination.
Note that I'm using the hard drive serial number for tracking, using ada3 means nothing. And I'm not considering this problem corrected for at least 1 month after I make a change and the problem does not come back. Additionally I will manually kick off an Extended Test on the S/N: 516 hard drive periodically. While I don't believe there is a physical media defect, the problem could be in the hard drive electronics and this might help me identify an issue faster.
I'll update this post as this unfolds but the entire reason I'm posting this is to show that this does happen right out of the blue and to show the process of troubleshooting the issue.
First of all like our forum rules dictate, I'll post my system configuration:
Physical configuration is in my signature for ESXi System 1 and FreeNAS VM on ESXi System 1.
ESXi Software: 6.5.0 (Build 4887370)
FreeNAS Software Version: FreeNAS-9.10-2-U1 (86c7ef5)
The problem started out with an email (@3:01AM 2/18/2017) stating I have some issue with my ada3 drive with Serial Number ending in 516. Two days ago I had deleted my MiniDLNA jail and rebooted the FreeNAS VM (I'm using ESXi per my system description).
Code:
freenas.local kernel log messages: > ahcich4: Timeout on slot 14 port 0 > ahcich4: is 48000000 cs 00000000 ss 00004000 rs 00004000 tfd 8441 serr 00400000 cmd 0071ce17 > (ada3:ahcich4:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 70 33 b0 40 45 00 00 01 00 00 > (ada3:ahcich4:0:0:0): CAM status: Command timeout > (ada3:ahcich4:0:0:0): Retrying command > ahcich4: Timeout on slot 1 port 0 > ahcich4: is 48000000 cs 0000003c ss 0000003e rs 0000003e tfd 8441 serr 00400000 cmd 0071c117 > (ada3:ahcich4:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 38 60 fd d2 40 45 00 00 00 00 00 > (ada3:ahcich4:0:0:0): CAM status: Command timeout > (ada3:ahcich4:0:0:0): Retrying command > ahcich4: Timeout on slot 29 port 0 > ahcich4: is 48000000 cs c0000001 ss e0000001 rs e0000001 tfd 8441 serr 00400000 cmd 0071dd17 > (ada3:ahcich4:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 78 30 a4 e0 40 45 00 00 00 00 00 > (ada3:ahcich4:0:0:0): CAM status: Command timeout > (ada3:ahcich4:0:0:0): Retrying command > arp: 00:0c:29:54:b8:3b attempts to modify permanent entry for 192.168.1.1 on epair0b > arp: 00:0c:29:54:b8:3b attempts to modify permanent entry for 192.168.1.1 on epair0b > arp: 00:0c:29:54:b8:3b attempts to modify permanent entry for 192.168.1.1 on epair0b > arp: 00:0c:29:54:b8:3b attempts to modify permanent entry for 192.168.1.1 on epair0b -- End of security output --
I also received an email from my SMART monitoring script which gave me this output:
Code:
########## SMART status report summary for all drives ########## +------+---------------+----+-----+-----+-----+-------+-------+--------+------+------+------+------+------+-------+----+ |Device|Serial |Temp|Power|Start|Spin |ReAlloc|Current|Offline |Multi |Load |UDMA |Seek |High |Command|Last| | | | |On |Stop |Retry|Sectors|Pending|Uncorrec|Zone |Cycle |CRC |Errors|Fly |Timeout|Test| | | | |Hours|Count|Count| |Sectors|Sectors |Errors|Count |Errors| |Writes|Count |Age | +------+---------------+----+-----+-----+-----+-------+-------+--------+------+------+------+------+------+-------+----+ |ada0 |WD-WMC301176182| 32 |33880| 281| 0| 0| 0| 0| 0| 183| 0| N/A| N/A| N/A| 1| |ada1 ?|WD-WMC301183577| 36 |33426| 264| 0| 0| 0| 0| 0| 170| 0| N/A| N/A| N/A| 1| |ada2 ?|WD-WMC300411588| 36 |37329| 487| 0| 0| 0| 0| 0| 370| 0| N/A| N/A| N/A| 1| |ada3 |WD-WMC300411516| 34 |37333| 495| 0| 0| 0| 0| 0| 378| 4| N/A| N/A| N/A| 1| |ada4 ?|WD-WMC300412480| 36 |37313| 483| 0| 0| 0| 0| 0| 576| 0| N/A| N/A| N/A| 1| |ada5 ?|WD-WMC300410673| 37 |37337| 498| 0| 0| 0| 0| 0| 380| 0| N/A| N/A| N/A| 1| +------+---------------+----+-----+-----+-----+-------+-------+--------+------+------+------+------+------+-------+----+
Code:
########## SMART status report for ada3 drive (Western Digital Red: WD-WMC300411516) ########## SMART overall-health self-assessment test result: PASSED ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 31 3 Spin_Up_Time 0x0027 181 174 021 Pre-fail Always - 3908 4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 495 5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0 7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0 9 Power_On_Hours 0x0032 049 049 000 Old_age Always - 37333 10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0 11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 0 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 231 192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 116 193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 378 194 Temperature_Celsius 0x0022 113 107 000 Old_age Always - 34 196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0 197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0 198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0 199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 4 200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0 No Errors Logged Test_Description Status Remaining LifeTime(hours) LBA_of_first_error Extended offline Completed without error 00% 37167 - Short offline Completed without error 00% 37306 -
As you can see I have 4 errors in ID 199 which indicated a communication issue between the hard drive and SATA controller, basically the SATA cable is the first suspect item. I can state for a fact that this computer has not been physically touched in weeks so this coming out of the blue surprised me. I would have expected a different drive failure to be honest.
Note: ID 199 is not considered a critical error from the hard drives perspective.
What I did was looked at
dmesg
and confirmed that those error messages were present, I then shutdown FreeNAS and I reseated both ends of the SATA cable. I have no idea if this will fix the problem.Additionally I received an error that my Extended SMART tests never occurred for any drive on Sunday and I checked my CRON and it's there to be conducted so I have no idea what is happening right now in that respect since it had worked previously. I did not make any changes to FreeNAS other than the deletion of the jail and a reboot on 17 Feb. Prior to that I upgraded FreeNAS to version 9.10.2-U1 on 29 Jan 2017. However last week I did upgrade ESXi tot he current build version on 11 February 2017 but right now I do not see that as being the issue, but I still need to remember it just in case nothing else proves to be the issue.
My next step is to watch and see what happens. If I get the CRC error again for drive ada3 then I will swap the SATA cables between ada3 (S/N: 516) and an adjacent drive and see if the problem stays with the suspect drive or moves to the other drive. So the process is to identify the failing hardware through the process of elimination.
Note that I'm using the hard drive serial number for tracking, using ada3 means nothing. And I'm not considering this problem corrected for at least 1 month after I make a change and the problem does not come back. Additionally I will manually kick off an Extended Test on the S/N: 516 hard drive periodically. While I don't believe there is a physical media defect, the problem could be in the hard drive electronics and this might help me identify an issue faster.
I'll update this post as this unfolds but the entire reason I'm posting this is to show that this does happen right out of the blue and to show the process of troubleshooting the issue.