UPDATE: 22 September 2018 - Added Drive Data Refreshing
UPDATE: 2 April 2017 - Added support for FreeNAS Corral (FreeNAS 10 and beyond)
UPDATE: 1 November 2020 - Added ID 1 and 7 description for Seagate drives at bottom of Appendix B
Because Bad Blocks generally takes several days to run (likely a full week) on a hard drive, I have broken the troubleshooting down into what I feel are reasonable steps in order to test the drive as quickly as possible. If you had a SMART Short or Long read test failure then we will test that section of the hard drive first, because if it keeps failing then the drive is not salvageable. If you did not have a SMART read failure and just want to test the entire drive, I've written the procedure to allow for that situation as well.
Record the failing LBA and then add at least 100,000 and subtract at least 100,000 to that count. These will become the ending and starting LBAs. You can subtract or add a larger value than 100,000 and it's not a bad idea if you get zero errors during your first run. Once you are all done with the troubled area you can run badblocks on the entire surface and ensure there are no other problem locations.
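The arithmetic above can be sketched in the shell. This is only an illustration using the example failing LBA of 1144448 from this guide and the minimum margin of 100,000; the variable names are made up for the example and nothing here touches a drive.

```shell
#!/bin/sh
# Sketch: work out the starting and ending LBAs around a failing LBA.
# failing_lba is the example value used in this guide; margin is the
# minimum suggested above. Both variable names are illustrative only.
failing_lba=1144448
margin=100000

start_lba=$((failing_lba - margin))
# A failure close to the start of the drive must not give a negative LBA.
if [ "$start_lba" -lt 0 ]; then
    start_lba=0
fi
end_lba=$((failing_lba + margin))

echo "starting LBA: $start_lba"
echo "ending LBA:   $end_lba"
```

For a wider second pass, change margin to 200000 and run it again.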
There is one assumption and that is that you are running this testing on your FreeNAS system. You can place your hard drive on any other computer and boot up FreeNAS on a USB Flash drive or use Ubuntu Live CD or some other piece of software, the instructions are basically the same.
After all of this testing and fixing of your hard drive, if you have another failure several months down the road, you can rest assured that drive is having physical component failure and you can toss the drive into the recycle bin, after you take it apart and get those high strength magnets out and stick them to your refrigerator. They are painful to get back off!
WARNING: THIS IS A DESTRUCTIVE TEST, VERIFY YOUR DRIVE BY THE SERIAL NUMBER!
Setup:
Because this is destructive you should take precautions to prevent accidental damage to your good hard drives containing data. Power off your system and physically disconnect all your good hard drives from the system, leaving the suspect drive connected. Use your serial number to ensure this. Now you can power on FreeNAS and your system should boot up. If desired you can boot to an Ubuntu Live CD/ISO if you like, open a terminal window, and then go to step 3 below. If you are using Ubuntu Live, I will assume you have a clue what you are doing and do not need step by step instructions so below is just a guide for those people.
1) Open up an SSH window.
2) Note added by @wblock 2018-01-22: this section recommended enabling the kern.geom.debugflags sysctl. Many people still think it has something to do with allowing raw writes. It does not. Instead, it disables a safety system that is intended to prevent writes to disks that are in use (say, by having a mounted filesystem). From man 4 geom:
2) First we need to allow you to perform RAW Write operations otherwise you will get an error message stating the operation is not permitted. Type
Now comes the destructive part...
3) If you did not have a SMART read failure or just want to run badblocks and walk away, go to step 8.
4) At the end of the command line is the ending LBA and then the starting LBA, in that order. We will run the test 10 times. Type
Note: The "-p 10" identifies how many times to run the test after a pass and you can increase or decrease this. I chose a value of 10 for no real reason other than I want to be very sure the surface holds up.
5) So let's say step 4 identifies a few more blocks and fixes (I use that term loosely) them. Next you should adjust the ending and starting LBA numbers to cover a larger area such as +/- 200,000. If you don't have any failures there, then go to step 6, otherwise go back to step 4 until you have no errors.
6) Now we need to run another SMART Long test to see if there are any more offending sectors which can be quickly picked up. Type
7) If the SMART Long test fails the Extended Read test again, jump back to step 4 using the new LBA, otherwise continue to step 8.
8) Now that we can get through an entire SMART Long Read test we are ready to run Bad Blocks on the entire hard drive surface. This testing will take considerable time to run, likely several days. Type
Once you are able to get through the entire badblocks program you can perform step 9 or reboot the machine, I prefer to reboot.
9) If you are not running Ubuntu Live, Type
Good Luck!
This guide covers the most routine single hard drive failures and is not meant to cover every situation. Specifically, we will check whether you have a physical drive failure or a communications error. If this guide fails to solve your problem, please open a new thread in the Help forum, list your hardware specs (FreeNAS version, Hardware Configuration), your failure and all indications, and state that you used this guide and, if appropriate, which step failed to help you. If you would like to report an error or suggest an improvement to this procedure, contact one of the forum moderators or the author and your input will be evaluated.
How to use this guide:
1. It is assumed you have some knowledge on how to open up a Shell window and perform some minor Linux/FreeBSD commands.
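2. All the steps in the main troubleshooting procedure are non-destructive so you can safely perform these steps without further risk to your data. The Bad Blocks test in Appendix C is the exception: it is destructive and is marked as such.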
3. We cannot take into account all formats of an error message, so we use “?” to indicate any value. If we list an error message format, please keep in mind that the format may change as the software changes, and we will not update the guide every time a minor change to a message format occurs.
4. The drive identifier in each command will be “ada0” however the user must enter the identifier for the suspect drive such as “ada4” or maybe “da4”. The failure message should indicate the drive identifier.
5. Once you have identified the failed drive serial number, write it down because drive identifiers “ada0” can change and the serial number is the best way to track and replace your drive if required.
6. You may be referred to the FreeNAS User Manual to conduct specific procedures.
7. Appendix A: Examples Error Messages
8. Appendix B: S.M.A.R.T. Data, What’s Important to Me?
9. Appendix C: Extra Troubleshooting - Drive Data Refreshing and Bad Blocks
Routine Procedures:
These few procedures are run often, so rather than repeating the steps throughout the guide they are written once here. Refer back to this section when directed to run one of them.
Output SMART Status Results
This procedure will display the hard drive data, including error information.
1) Open a shell (can be done via the GUI or SSH using something like PuTTY). If using FreeNAS 10 from the GUI Console, type "shell" to enter the shell and type "exit" when completed. For FreeNAS 11.x or greater you may select "Shell" from the left-hand pane.
2) Type
smartctl -a /dev/ada0
where “ada0” is the subject drive. If the output scrolls off the screen then enter
smartctl -a /dev/ada0 | more
and the screen will only fill one page at a time.
3) Note the items asked about in the troubleshooting text.
4) The following output does not mean the hard drive completely passed. It is a terrible summary, and all the data must be examined to ensure no errors exist(ed):
Code:
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
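Since the PASSED line alone tells you very little, it can help to pull out just the raw values this guide cares about. The following is a minimal sketch, assuming the attribute table is laid out as in the Appendix A example, with RAW_VALUE in the 10th column. The helper name smart_raw is made up for this example and is not part of smartctl, and the sample rows are copied from Appendix A.

```shell
#!/bin/sh
# Sketch: print the RAW_VALUE (10th column) of one SMART attribute ID.
# smart_raw is a hypothetical helper name, not a smartctl feature.
smart_raw() {
    # $1 = attribute ID; the attribute table is read from stdin
    awk -v id="$1" '$1 == id { print $10; exit }'
}

# Sample rows copied from the Appendix A example output.
sample='  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       16
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       42
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       42
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0'

for id in 5 197 198 199; do
    value=$(printf '%s\n' "$sample" | smart_raw "$id")
    printf 'ID %s raw value: %s\n' "$id" "$value"
done
```

On a live system the same helper could be fed from the drive directly, for example smartctl -A /dev/ada0 | smart_raw 197, where “ada0” is the suspect drive.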
Perform SMART Long Test
A SMART Long Test conducts a test of the drive electronics and a read of the entire drive surface. This test should be run periodically by scheduling it in the FreeNAS GUI. Different users have different opinions on how frequently this should be done; the author prefers once a week for the Long tests and daily for the Short tests.
1) Open a shell (can be done via the GUI or SSH using something like Putty).
2) Type
smartctl -t long /dev/ada0
where “ada0” is the drive identifier. Note how long it will take for the test to complete. You may still use your system, however doing so will slow down the testing.
3) Once the period of time has elapsed for the testing, obtain a SMART Status Result and return to the troubleshooting text.
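If you want to know ahead of time how long the Long test should take, the drive reports it as the extended self-test "recommended polling time". Here is a minimal sketch that reads that value, assuming the two-line layout smartctl normally prints for this item. The helper name long_test_minutes is made up for this example, and the sample text matches the values in Appendix A.

```shell
#!/bin/sh
# Sketch: extract the recommended polling time (in minutes) of the
# extended (long) self-test from smartctl capability output on stdin.
# long_test_minutes is a hypothetical helper name.
long_test_minutes() {
    awk '/^Extended self-test routine/ { found = 1; next }
         found { gsub(/[^0-9]/, ""); print; exit }'
}

# Sample text using the values from the Appendix A example output.
sample='Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 281) minutes.'

minutes=$(printf '%s\n' "$sample" | long_test_minutes)
echo "Long test needs about $minutes minutes ($((minutes / 60)) h $((minutes % 60)) min)"
```

On a live system the input would come from smartctl -c /dev/ada0 instead of the sample text.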
Troubleshooting Procedure
What type of failure did you receive?
1) Error stating:
a. ID 5 Reallocated Sector Count, ID 197 Current Pending Sector Count, ID 198 Offline Uncorrectable Sector Count, or (?da?:ata?:?:?:?): CAM status: ATA Status Error, Pool is Degraded, or if you just don't know where to start.
b. Timeout errors or any Communication Errors (ID 199 UDMA CRC Error Count).
2) If “a” then go to step 3. If “b” then go to step 4.
Physical Drive Failure
3) This procedure troubleshoots common physical drive failures.
a. Conduct Output SMART Status Results and record the drive Serial Number, IDs 5, 197, and 198. (Note: For detailed explanation of what each of these IDs represent, visit the S.M.A.R.T. Wiki website)
b. If any of the IDs are greater than zero (0) then the drive has failed for RMA purposes.
c. If ALL of the IDs are zero (0), then run a SMART Long Test and after the test has completed, conduct Output SMART Status Results. If ALL of the IDs are still at zero (0), ensure you are troubleshooting the correct drive and if you are, proceed to step 4 because the hard drive does not indicate a hardware failure at this point.
d. If any of the IDs are 1 to 5 then you may be able to retain the drive, however if you’re troubleshooting it, you probably do not want to keep a drive that is slowly failing. If you do retain it, it’s highly recommended that you run frequent SMART Long Tests on the drive to ensure the ID values do not increase. If they increase at all then replace the drive.
e. If replacing the drive follow the FreeNAS User Guide on how to replace a failed drive. If you have an encrypted drive, ensure you take appropriate precautions per the FreeNAS User Guide.
f. Exit this guide.
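The decision in step 3 can be sketched as a small shell function. This is only one reading of steps b through d, with the 1 to 5 range treated as "keep only with close monitoring". The function name physical_triage and the action strings are made up for this example.

```shell
#!/bin/sh
# Sketch of the step 3 decision. Takes the raw values of IDs 5, 197
# and 198 and prints a suggested action. physical_triage and the
# action strings are hypothetical names used only for illustration.
physical_triage() {
    worst=0
    for v in "$1" "$2" "$3"; do
        if [ "$v" -gt "$worst" ]; then
            worst=$v
        fi
    done
    if [ "$worst" -eq 0 ]; then
        echo "run-long-test"        # step c: all zero, test again, then step 4
    elif [ "$worst" -le 5 ]; then
        echo "monitor-or-replace"   # step d: 1 to 5, keep only with frequent Long tests
    else
        echo "replace"              # steps b and e: treat as failed and replace
    fi
}

physical_triage 0 0 0      # healthy-looking drive
physical_triage 2 0 0      # a few reallocated sectors
physical_triage 16 42 42   # the values from the Appendix A example
```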
Drive Communications Failure
4) This procedure troubleshoots common communications errors for a single drive failure.
a. Conduct Output SMART Status Results and record the drive Serial Number, IDs 5, 197, and 198.
b. Inspect IDs 5, 197, and 198 and if any value is greater than zero (0), the drive may have an unrelated failure. Go to step 3 after finishing this troubleshooting.
c. Replace the DATA cable between the hard drive (utilizing the serial number to identify the suspect drive) and controller. (Note: The data cable is the most common cause of drive communications errors.)
d. If the problem is not fixed, swap the DATA cables between the suspect drive and a nearby drive (at the drive connections). (Note: We are trying to isolate the problem to the hard drive or something else.)
e. If the problem goes away, it’s likely the DATA cable is still the cause, you may exit this procedure however keep an eye open for future failures. (Note: At times a poor connection may cause this error or a marginal data cable.)
f. If the problem still exists, run the Output SMART Status Results and verify the drive serial number.
g. If the drive serial number changed, continue with step h, if the serial number did not change continue with step j. (Note: If the drive serial number changed then the failure could be the DATA cable or drive controller.)
h. Relocate the DATA cable for the failing drive (remember to use the serial number) to another DATA port on the controller or motherboard that does appear to be working.
i. If the failure still exists then run the Output SMART Status Results and verify the drive serial number that failed. If it has not changed then the DATA cable is suspect. Goto step k.
j. If the problem remains with the same drive then the hard drive electronics are suspect and the drive can be considered defective, replace the drive.
k. If the problem still exists, this is not a common failure. Post your failure in the FreeNAS forums.
l. Exit this guide.
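Steps f, g and i depend on comparing the serial number of the failing drive before and after moving cables. A minimal sketch of that comparison follows, assuming the "Serial Number:" line format shown in Appendix A. The helper name serial_of is made up for this example, and the two snapshots are sample text rather than live output.

```shell
#!/bin/sh
# Sketch: print the serial number from smartctl information output.
# serial_of is a hypothetical helper name; input is read from stdin.
serial_of() {
    sed -n 's/^Serial Number:[[:space:]]*//p'
}

# Two snapshots of the failing drive, for example taken before and
# after swapping the DATA cables (sample text from Appendix A).
before='Device Model:     WDC WD20EFRX-68AX9N0
Serial Number:    WD-WMC300411000'
after='Device Model:     WDC WD20EFRX-68AX9N0
Serial Number:    WD-WMC300411000'

s1=$(printf '%s\n' "$before" | serial_of)
s2=$(printf '%s\n' "$after" | serial_of)
if [ "$s1" = "$s2" ]; then
    echo "serial number did not change: $s1"
else
    echo "serial number changed from $s1 to $s2"
fi
```

On a live system each snapshot could be taken with smartctl -i /dev/ada0 | serial_of, using the identifier named in the failure message.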
APPENDIX A
Example Error Messages
Hard Drive Failure Messages
Email Messages:
CRITICAL: Device: /dev/ada3, 1 Currently unreadable (pending) sectors
CRITICAL: Device: /dev/ada1, 817 Currently unreadable (pending) sectors
CRITICAL: Device: /dev/ada1, 2397 Offline uncorrectable sectors
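If you collect a lot of these emails, the drive identifier and sector count can be pulled out with sed. This is a minimal sketch that only understands the two message formats shown above, and as noted earlier the format may change between software versions. The helper name parse_alert is made up for this example.

```shell
#!/bin/sh
# Sketch: reduce an alert line to "device count".
# parse_alert is a hypothetical helper; the pattern only covers the
# two message formats shown in this appendix.
parse_alert() {
    sed -n -E 's#^CRITICAL: Device: /dev/([a-z]+[0-9]+), ([0-9]+) (Currently unreadable \(pending\)|Offline uncorrectable) sectors$#\1 \2#p'
}

# The three example messages from this appendix.
alerts='CRITICAL: Device: /dev/ada3, 1 Currently unreadable (pending) sectors
CRITICAL: Device: /dev/ada1, 817 Currently unreadable (pending) sectors
CRITICAL: Device: /dev/ada1, 2397 Offline uncorrectable sectors'

printf '%s\n' "$alerts" | parse_alert
```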
SMART Results Output
(Note: Items in red are failure indications)
Code:
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p28 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD20EFRX-68AX9N0
Serial Number:    WD-WMC300411000
LU WWN Device Id: 5 0014ee 6ad787ae3
Firmware Version: 80.00A80
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Wed Jan 27 15:41:21 2016 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (27840) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 281) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x70bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       31
  3 Spin_Up_Time            0x0027   176   174   021    Pre-fail  Always       -       4175
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       340
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       16
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   061   061   000    Old_age   Always       -       28532
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       148
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       61
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       278
194 Temperature_Celsius     0x0022   120   107   000    Old_age   Always       -       27
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       42
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       42
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     28522         -
# 2  Short offline       Completed without error       00%     28498         -
# 3  Short offline       Completed without error       00%     28474         -
# 4  Extended offline    Completed: read failure       70%     28455         - 543988376
# 5  Short offline       Completed without error       00%     28426         -
# 6  Short offline       Completed without error       00%     28330         -
# 7  Extended offline    Completed without error       00%     28312         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Hard Drive Communications Error Messages
(ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 90 b2 b9 40 2e 00 00 01 00 00
(ada1:ahcich1:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada1:ahcich1:0:0:0): Retrying command
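When the console or log is full of these messages, counting the CRC errors per drive shows quickly which identifier is affected. This is a minimal sketch, assuming the message format shown above. The helper name crc_errors_by_device is made up for this example.

```shell
#!/bin/sh
# Sketch: count "Uncorrectable parity/CRC error" lines per device.
# crc_errors_by_device is a hypothetical helper; log text is read
# from stdin and "device count" pairs are printed.
crc_errors_by_device() {
    sed -n 's/.*(\([a-z]*[0-9]*\):[^)]*): CAM status: Uncorrectable parity\/CRC error.*/\1/p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}

# The example messages from this appendix.
log='(ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 90 b2 b9 40 2e 00 00 01 00 00
(ada1:ahcich1:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada1:ahcich1:0:0:0): Retrying command'

printf '%s\n' "$log" | crc_errors_by_device
```

On a live system the input could come from dmesg instead of the sample text.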
APPENDIX B
S.M.A.R.T. Data, What’s Important to Me?
When troubleshooting a hard drive failure we use the built-in SMART diagnostics, which are part of every hard drive. These results can be used to justify an RMA as well. A SMART test will not isolate communication errors; it will only validate the physical hard drive. For good background information, visit the Wiki for S.M.A.R.T. at this link: https://en.wikipedia.org/wiki/S.M.A.R.T.
The important data we look at are as follows:
1) Serial Number
2) ID 5 Reallocated Sector Count
3) ID 197 Current Pending Sector Count
4) ID 198 Offline Uncorrectable Sector Count
5) ID 199 UDMA CRC Error Count
Other notable data are:
6) ID 194 Temperature
7) ID 200 MultiZone Error Rate
8) Extended Self-test Time (value in minutes)
9) SMART Self-test logs, specifically the results of the self tests
If IDs 5, 197, or 198 have any value greater than zero (0) then some defect has been identified in the media. If ID 194 Temperature is above 40 °C then you may have a cooling issue, and this could shorten the life of your drive. Many manufacturers will not accept an RMA if the temperature of the drive has exceeded a certain value (manufacturer specific) as this voids the warranty.
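The temperature check can be sketched the same way. This minimal example assumes the raw value of ID 194 is the temperature in Celsius, as in the Appendix A output, and uses the 40 degree figure from the paragraph above. The function name temp_status is made up for this example.

```shell
#!/bin/sh
# Sketch: warn when the raw value of ID 194 is above 40 C.
# temp_status is a hypothetical helper name.
temp_status() {
    # $1 = raw temperature in Celsius
    if [ "$1" -gt 40 ]; then
        echo "WARNING: ${1}C is above 40C, check cooling"
    else
        echo "OK: ${1}C"
    fi
}

# Sample row copied from the Appendix A example output.
row='194 Temperature_Celsius     0x0022   120   107   000    Old_age   Always       -       27'
temp=$(printf '%s\n' "$row" | awk '$1 == 194 { print $10 }')

temp_status "$temp"
temp_status 45
```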
ID 199 is a communications error between the drive electronics and the drive controller. The drive controller is part of your motherboard or an add-on card. Typically this error code results in replacement of the SATA cable to correct the situation. This wouldn't typically be a condition to RMA a drive however it is possible that the hard drive electronics has failed or someone broke the SATA connector on the drive, but that is not the typical failure we see.
ID 200 MultiZone Error Rate can be the cause of a drive failure although a value in this location doesn't always mean it's the fault. It is notable if there are no other failing indications.
Wear Level (ID# is manufacturer specific) indicates the percentage of write operations remaining. A typical SSD has approximately 2000 erase/write cycles per memory block (4k block), however better SSDs are being manufactured that will last longer, and over-provisioning creates the illusion of longer life. If your wear level drops to zero then you will not be able to write to your SSD again and it may fail to operate at all.
The SMART Self-test logs indicate the last time you conducted a SMART test, the type of test, its completion status, how far it completed, and the hours at which the result was logged (Hours is a value in relation to the ID 9 Power On Hours value).
It is always a good thing to run a SMART Long test if you doubt the integrity of your drive.
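To pick the failing LBA out of the self-test log, a short awk filter is enough. This is a minimal sketch, assuming the failing LBA is the last field of the "read failure" line as in the Appendix A example. The helper name failed_test_lbas is made up for this example.

```shell
#!/bin/sh
# Sketch: print the LBA of the first error for every self-test log
# entry that reports a read failure (the LBA is the last field).
# failed_test_lbas is a hypothetical helper name.
failed_test_lbas() {
    awk '/^#/ && /read failure/ { print $NF }'
}

# Sample log entries taken from the Appendix A example output.
log='# 3  Short offline       Completed without error       00%     28474         -
# 4  Extended offline    Completed: read failure       70%     28455         - 543988376
# 5  Short offline       Completed without error       00%     28426         -'

printf '%s\n' "$log" | failed_test_lbas
```

On a live system the log could be read with smartctl -l selftest /dev/ada0 and piped into the same filter.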
What data is not important?
Much of the other data is manufacturer specific, so even if it looks like it could be accurate data, odds are it is not important, since there are other well-known values that do maintain meaning. A great example of what I'm talking about is ID 1 - Raw Read Error Rate. This value represents errors in reading data, right? Yes, however how the value is counted differs between manufacturers and it's all due to design. Let me explain (this is likely not accurate but it's an attempt to show that drives handle internal functions differently):
Drive "A" is told to read XX sectors of data and it returns those sectors + the next YY sectors of data in its internal buffer.
Drive "B" is told to read XX sectors of data and it returns those sectors + the next YY sectors of data in its internal buffer.
Drive "A" next finds out that the extra YY sectors worth of data isn't required and since it provided exactly what was requested of it, all is good in the world and it completed the operation.
Drive "B" next finds out that the extra YY sectors worth of data isn't required but this drive is programmed differently and it has more data to provide and this creates an internal issue as the drive now thinks it read the wrong data or too much data, something. The end result is the ID 1 value is incremented.
If ID 1 is a low value then it is likely a value you could use as a failing indicator however if it's a high number that is always changing a lot, you can just ignore it.
There are other values you can ignore and some values you should pay attention to however the most important values are listed above.
UPDATE (1 Nov 2020): How to read Seagate drive IDs 1 and 7. Note that even if these values turn out to be greater than zero after the conversion, they are still not important unless you have other key indicators of failure:
APPENDIX C
Extra Testing - Drive Data Refreshing and Bad Blocks
Drive Data Refreshing
If you are having a few Pending Sector Errors during a SMART Test, then you could try to simply refresh the data on your hard drive. I can't tell you that this will make your hard drive all better, but it should not hurt, provided you enter the commands properly. What this does is read all the hard drive sectors and write them back in the same locations. This is important to know: it does not read just the data you have stored, it reads the entire hard drive surface area and writes it back. This means that it is a long process (many hours), and the time it takes to complete depends on the size of the hard drive, how fast the hard drive is at reading and writing, and any other operations the drive is doing.
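To get a feel for "many hours", here is a rough estimate sketch. The helper name estimate_hours is hypothetical, and the throughput figure is an assumption you supply yourself (an average combined read and write rate); nothing here is measured from the drive.

```shell
# Sketch only: rough duration of a full read-and-rewrite pass.
# estimate_hours is a hypothetical helper; the rate is your own assumption.
estimate_hours() {
    size_gb=$1       # drive capacity in GB (1 GB counted as 1000 MB)
    rate_mb_s=$2     # assumed average combined read+write rate in MB/s
    seconds=$(( size_gb * 1000 / rate_mb_s ))
    echo $(( seconds / 3600 ))   # whole hours, rounded down
}
```

For example, estimate_hours 4000 100 prints 11, meaning a 4 TB drive at an assumed 100 MB/s would need roughly 11 hours. Expect the real figure to be higher if the drive is busy with other work.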
1) First we need to disable the safety that prevents writes to a disk that is in use, otherwise you will get an error message stating the operation is not permitted. Type
sysctl kern.geom.debugflags=0x10
and hit Enter.
NOTE: The next step will take a while, and how long depends entirely on the hard drive size and speed, not on the amount of data stored.
2) Next we need to use the "dd" command to read the hard drive data and then write it back, thus refreshing the data. Type
dd if=/dev/ada0 of=/dev/ada0 bs=1m
and hit Enter.
3) Once you are done you need to either reboot FreeNAS or type
sysctl kern.geom.debugflags=0x00
and hit Enter to restore the write protection of the mounted drive. My personal preference is to reboot FreeNAS, as then you know that it has returned to normal operations.
BAD Blocks
While I personally would prefer to RMA my hard drive or install a new replacement, some people may decide that they want to run some further testing on their hard drives, such as Bad Blocks, because they have an issue which drove them to this hard drive troubleshooting in the first place. There is a nice thread here which documents quite a bit on how to run Bad Blocks for burn-in testing, but here are the instructions for just a single drive. I would also recommend that the drive you are testing is not part of an active pool/vdev; remove it first. Drive "ada0" will be used in our example here, with a Long Read Test failure at LBA 1144448. Also, read this entire section before running the test; there is nothing worse than destroying the wrong drive.
Because Bad Blocks generally takes several days to run (likely a full week) on a hard drive, I have broken down the troubleshooting into what I feel are reasonable steps in order to test the drive as quickly as possible. If you had a SMART Short or Long read test failure, then we will test that section of the hard drive first, because if it keeps failing then the drive is not salvageable. If you did not have a SMART read failure and just want to test the entire drive, I've written the procedure to allow for that situation as well.
Record the failing LBA, then add at least 100,000 to it and subtract at least 100,000 from it. These become the ending and starting LBAs respectively. You can add or subtract a larger value than 100,000, and it's not a bad idea if you get zero errors during your first run. Once you are all done with the troubled area, you can run badblocks on the entire surface and ensure there are no other problem locations.
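The arithmetic above can be sketched as a small helper. The name lba_range is hypothetical; the default margin of 100,000 and the example LBA come from this guide. The sketch passes the LBA through unchanged, as the guide does. badblocks counts in units of its -b block size, so if your drive reports LBAs in a different sector size the numbers may need scaling; treat that as something to verify for your own drive.

```shell
# Sketch only: compute the starting and ending LBA around a failing LBA.
# lba_range is a hypothetical helper name; margin defaults to 100000.
lba_range() {
    fail_lba=$1
    margin=${2:-100000}
    start=$(( fail_lba - margin ))
    # Never go below the first block of the drive.
    if [ "$start" -lt 0 ]; then
        start=0
    fi
    end=$(( fail_lba + margin ))
    echo "$start $end"    # printed as: start end
}
```

For the example failure at LBA 1144448, lba_range 1144448 prints 1044448 1244448. Remember that badblocks expects the ending value first and the starting value second on its command line.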
There is one assumption, and that is that you are running this testing on your FreeNAS system. You can place your hard drive in any other computer and boot FreeNAS from a USB Flash drive, or use an Ubuntu Live CD or some other piece of software; the instructions are basically the same.
After all of this testing and fixing of your hard drive, if you have another failure several months down the road, you can rest assured that drive is having physical component failure and you can toss the drive into the recycle bin, after you take it apart and get those high strength magnets out and stick them to your refrigerator. They are painful to get back off!
WARNING: THIS IS A DESTRUCTIVE TEST, VERIFY YOUR DRIVE BY THE SERIAL NUMBER!
Setup:
Because this is destructive, you should take precautions to prevent accidental damage to your good hard drives containing data. Power off your system and physically disconnect all your good hard drives from the system, leaving only the suspect drive connected. Use your serial number to confirm you have the right drive. Now you can power on FreeNAS and your system should boot up. If you prefer, you can boot to an Ubuntu Live CD/ISO, open a terminal window, and then go to step 3 below. If you are using Ubuntu Live, I will assume you know what you are doing and do not need step by step instructions, so the steps below are just a guide for those people.
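One way to check the serial number from the shell, as a minimal sketch: the helper name serial_from_smartctl is hypothetical, and it assumes the "Serial Number:" line that smartctl -i normally prints for a drive.

```shell
# Sketch only: pull the serial number out of "smartctl -i" output.
# serial_from_smartctl is a hypothetical helper name; it reads from stdin.
serial_from_smartctl() {
    # Split each line on the colon plus any following spaces,
    # then print the value from the first "Serial Number:" line.
    awk -F': *' '/^Serial Number:/ { print $2; exit }'
}
```

Typical use would be smartctl -i /dev/ada0 | serial_from_smartctl, then comparing the result with the serial number you wrote down and with the label on the drive itself.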
1) Open up an SSH window.
Note added by @wblock 2018-01-22: this section recommended enabling the kern.geom.debugflags sysctl. Many people still think it has something to do with allowing raw writes. It does not. Instead, it disables a safety system that is intended to prevent writes to disks that are in use (say, by having a mounted filesystem). From man 4 geom:
To summarize, this option should generally not be needed. It only makes it possible to harm data. Any disk you are going to overwrite with data should not be mounted or have anything you wish to keep. In fact, best practice is to not be erasing or stress-testing drives on a system that has actual data on it. Since those disks will not have mounted filesystems, this sysctl will not affect being able to write to them. In fact, it will only make it possible to blow away things that are in use.
2) Next we disable the safety that prevents writes to a disk that is in use (see the note above), otherwise you may get an error message stating the operation is not permitted. Type
sysctl kern.geom.debugflags=0x10
and hit Enter.
Now comes the destructive part...
3) If you did not have a SMART read failure or just want to run badblocks and walk away, go to step 8.
4) At the end of the command line is the ending LBA and then the starting LBA, in that order. We will keep testing until 10 passes in a row find no new bad blocks. Type
badblocks -b 4096 -wsv -c 64 -p 10 /dev/ada0 1244448 1044448
Note: The "-p 10" tells badblocks to keep repeating the scan until 10 consecutive passes find no new bad blocks, and you can increase or decrease this. I chose a value of 10 for no real reason other than I want to be very sure the surface holds up.
5) So let's say step 4 identifies a few more blocks and fixes (I use that term loosely) them. Next you should adjust the ending and starting LBA numbers to cover a larger area, such as +/- 200,000. If you don't have any failures there, then go to step 6, otherwise go back to step 4 until you have no errors.
6) Now we need to run another SMART Long test to see if there are any more offending sectors which can be quickly picked up. Type
smartctl -t long /dev/ada0
7) If the SMART Long test fails the Extended Read test again, jump back to step 4 using the new LBA, otherwise continue to step 8.
8) Now that we can get through an entire SMART Long Read test we are ready to run Bad Blocks on the entire hard drive surface. This testing will take considerable time to run, likely several days. Type
badblocks -b 4096 -wsv -c 64 /dev/ada0
Once you are able to get through the entire badblocks program you can perform step 9 or reboot the machine, I prefer to reboot.
9) If you are not running Ubuntu Live, type
sysctl kern.geom.debugflags=0x00
and hit Enter.
Good Luck!
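For step 7, the new failing LBA can be read from the SMART self-test log. The sketch below assumes the usual smartctl -l selftest layout, where each entry starts with "#" and the last column is LBA_of_first_error; the helper name first_failed_lba is hypothetical.

```shell
# Sketch only: print the LBA from the first "read failure" entry in the
# self-test log. first_failed_lba is a hypothetical helper; reads stdin.
first_failed_lba() {
    # Log entries start with "#"; the last column holds LBA_of_first_error.
    awk '/^#/ && /read failure/ { print $NF; exit }'
}
```

Typical use would be smartctl -l selftest /dev/ada0 | first_failed_lba. If it prints nothing, the log shows no read failure and you can continue to step 8.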
Resource icon by Evan-Amos @ Wikimedia Commons
solnet-array-test should test those large hard drives. Here is the resource link.
https://www.truenas.com/community/resources/solnet-array-test.1/