Acceptable S.M.A.R.T. results for new drive burn-in

Alan W. Smtih · Jul 18, 2017

I haven't been able to find a concise "Here's how to burn-in a new drive and confirm it's safe as far as we can tell" checklist. (As in, there's already a FreeNAS box in production and you just want to spin up another temporary one to prep replacement disks.)

Between this forum (including this post which you should read for a lot more detail) and the Wikipedia S.M.A.R.T. entry, I put together what I think is a core checklist:

(Assuming: logged in as a user other than 'root', with one hypothetical drive at ada0 that has a stripe size of 4096.)

1. Set the GEOM debug flag that lets badblocks write directly to the disk: sudo sysctl kern.geom.debugflags=0x10 (if you reboot before the badblocks run completes, re-run this command)

2. Run sudo smartctl --scan to get a list of drives (e.g. /dev/ada0)

3. Run sudo diskinfo -v ada0 for the drive and note the stripesize value, which is passed to -b in the next step

4. Run sudo badblocks -b 4096 -ws /dev/ada0 on the drive and allow it to complete (this is a destructive write test, and it can take a few days for multi-TB disks)

5. Run sudo smartctl -t long /dev/ada0 on the drive and allow it to complete (this takes several hours; you can track progress with the command from the next step, sudo smartctl -x /dev/ada0 | less, by looking for the line with % remaining)

6. Review the results of sudo smartctl -x /dev/ada0 | less and verify that every one of the following attributes that appears (not all of them will) has a RAW_VALUE of 0 (zero).

Code:

| ID  | Name                                        |
|-----|---------------------------------------------|
| 5   | Reallocated Sectors Count                   |
| 10  | Spin Retry Count                            |
| 184 | End-to-End error / IOEDC                    |
| 187 | Reported Uncorrectable Errors               |
| 188 | Command Timeout                             |
| 196 | Reallocation Event Count                    |
| 197 | Current Pending Sector Count                |
| 198 | (Offline) Uncorrectable Sector Count        |
| 201 | Soft Read Error Rate or TA Counter Detected |
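
The check in step 6 could be scripted roughly as follows. This is only a sketch: check_smart_attributes is a hypothetical helper name, and it assumes the ten-column attribute table that smartctl -A prints (ID in column 1, RAW_VALUE in column 10). The table inside smartctl -x output may use a shorter layout, so the column number would need adjusting there.

```shell
#!/bin/sh
# Attribute IDs from the checklist table above.
CRITICAL_IDS="5 10 184 187 188 196 197 198 201"

# Hypothetical helper: reads a smartctl attribute table on stdin and
# prints one FAIL line per checklist attribute whose RAW_VALUE is not 0.
# Assumes the ten-column layout of `smartctl -A` (RAW_VALUE in column 10).
# Exit status is 0 when every checklist attribute present is zero, else 1.
check_smart_attributes() {
    awk -v ids="$CRITICAL_IDS" '
        BEGIN {
            n = split(ids, list, " ")
            for (i = 1; i <= n; i++) want[list[i]] = 1
        }
        # Attribute rows start with a numeric ID and have at least 10 fields.
        $1 ~ /^[0-9]+$/ && NF >= 10 && ($1 in want) {
            if ($10 != "0") {
                printf "FAIL %s %s raw=%s\n", $1, $2, $10
                bad = 1
            }
        }
        END { exit (bad + 0) }
    '
}

# Example (not run here):
#   sudo smartctl -A /dev/ada0 | check_smart_attributes && echo "all zero"
```

Attributes that a drive does not report are simply skipped, which matches the "not all of them will appear" caveat in step 6.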

Is there anything else that needs to be added?

Stux · Jul 18, 2017

This one's fairly definitive:

https://forums.freenas.org/index.php?threads/how-to-hard-drive-burn-in-testing.21451/

Some drives have non-zero raw values all the time, e.g. Seagate.

joeschmuck · Jul 18, 2017

Take a look at my Hard Drive Troubleshooting Guide in the Resources (see link in my signature). It lists all the items that are of interest when checking SMART data. While ID 200 isn't typically a value that indicates failure, it can be one of the first signs of failure, and we have seen it here quite a bit on the forums. This resource does not cover SSDs, as they have some unique manufacturer-specific IDs. I also cover how to use badblocks to troubleshoot a specific LBA failure. If you feel that this should be updated, I'm all ears.

Also, a value in ID 197 doesn't necessarily mean there is a problem, provided it's a low count (1-5), there is nothing in ID 5, and a SMART Long Test doesn't fail. This could clear itself.
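
As a rough illustration of that rule of thumb, here is a minimal sketch. The function name triage_pending_sectors and the labels "ok", "watch" and "investigate" are made up for this example; only the 1-5 range and the conditions on ID 5 and the long test come from the paragraph above.

```shell
#!/bin/sh
# Hypothetical helper illustrating the rule of thumb above.
# Args: <ID 5 raw value> <ID 197 raw value> <long test result: passed|failed>
# Prints "ok", "watch" (low pending count that may clear itself),
# or "investigate" (anything else). The labels are illustrative only.
triage_pending_sectors() {
    realloc=$1
    pending=$2
    longtest=$3
    if [ "$realloc" -eq 0 ] && [ "$longtest" = "passed" ]; then
        if [ "$pending" -eq 0 ]; then
            echo "ok"
            return 0
        elif [ "$pending" -ge 1 ] && [ "$pending" -le 5 ]; then
            echo "watch"
            return 0
        fi
    fi
    echo "investigate"
}

# Example: triage_pending_sectors 0 3 passed
```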

I agree with @Stux that the link he provided for drive burn-in testing (we punt many users to this link) contains all the info and a lot of good personal pain from many users.

I'm glad that you went through the effort to lay out your steps and number them. I do that as well.

Alan W. Smtih · Jul 19, 2017

Stux said:
This one's fairly definitive:

https://forums.freenas.org/index.php?threads/how-to-hard-drive-burn-in-testing.21451/

Totally!

I love that one. It was the biggest help for me getting my head wrapped around this stuff.

(I almost linked to it the first time. I'm going to make an edit to add that.)

While it shows a ton of details (e.g. tmux commands) and different ways to do things (e.g. -ws vs -ns), my goal was a little different. I'm trying to boil it down to a concise checklist with only the core commands.

(I just read The Checklist Manifesto, and this is an attempt to apply the lessons from it.)

Also, while that post talks about stripesize and how it's pretty safe to assume -b 4096 for bigger disks, I added a step to explicitly make that determination.
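
For what it's worth, that determination could be automated along these lines. This is a sketch under assumptions: stripesize_from_diskinfo is a hypothetical name, and it assumes diskinfo -v prints one value per line followed by a "# name" comment (for example "4096 # stripesize"). If the drive reports no stripesize, or 0, the helper fails rather than guessing.

```shell
#!/bin/sh
# Hypothetical helper: reads `diskinfo -v` output on stdin and prints the
# stripesize value. Assumes lines shaped like "<value> # <name>".
# Exits non-zero if stripesize is missing or 0, leaving the choice of
# block size to the operator.
stripesize_from_diskinfo() {
    awk '
        $2 == "#" && $3 == "stripesize" { stripe = $1 }
        END {
            if (stripe + 0 > 0) print stripe
            else exit 1
        }
    '
}

# Example (not run here; badblocks -w is destructive):
#   bs=$(diskinfo -v ada0 | stripesize_from_diskinfo) &&
#       sudo badblocks -b "$bs" -ws /dev/ada0
```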

(Frankly, if that step was in the original, I probably wouldn't have posted this and just kept it for my personal checklist.)

Stux said:
Some drives have non-zero raw values all the time, e.g. Seagate.

Right on. I've seen some references to how those values can change.

I put all zeros because I assumed that's what they should be out-of-the-box. (The idea being that this checklist is for brand new drives. Not ones that have seen active duty.)

Or, more to the point, if you see non-zero values on a new disk after a badblocks run, should that be an "it might be okay, but why risk it" flag?

joeschmuck said:
Take a look at my Hard Drive Troubleshooting Guide in the Resources

That's fantastic! I can only imagine how much work has gone into it.

I've only had a chance to skim it today, but will dig in more when time allows.

Good stuff!

joeschmuck · Jul 19, 2017

Alan W. Smtih said:
That's fantastic! I can only imagine how much work has gone into it.

Not much really but thanks.

Stux · Jul 19, 2017

Alan W. Smtih said:
I put all zeros because I assumed that's what they should be out-of-the-box. (The idea being that this checklist is for brand new drives. Not ones that have seen active duty.)

No, out of the box maybe... but for example, the Seek Error RAW value on a Seagate increases continuously, because it actually tells you how many seeks have been performed and how many failed. Drives do a lot of seeking, so it's continuously going up. It's actually fairly neat to be able to tell how many bytes have been written, and all of that, from the RAW values, but they're encoded. Suffice to say, the raw values on a Seagate can be mostly useless for checking for errors.

Code:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   115   099   006    Pre-fail  Always       -       94262920
  3 Spin_Up_Time            0x0003   091   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       59
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   084   060   030    Pre-fail  Always       -       316356383
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       7667
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       59
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   062   056   045    Old_age   Always       -       38 (Min/Max 32/39)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       44
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       59
194 Temperature_Celsius     0x0022   038   044   000    Old_age   Always       -       38 (0 22 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

See what I mean? Seek_Error_Rate and Raw_Read_Error_Rate look very concerning... but they're actually capable of telling you how many seeks and reads have been performed... although the counter might wrap around... I forget :)

But you are right, the values you suggest checking are zero.
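
To make the "encoded" part concrete, here is a small sketch of the decoding that is commonly described in community write-ups. It is an assumption, not something documented in this thread: the raw value is treated as 48 bits, with the upper 16 bits counting errors and the lower 32 bits counting total operations. decode_seagate_raw is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: splits a Seagate-style raw value into
# "<errors> <total operations>".
# ASSUMED layout (community-reported, not confirmed here):
#   bits 32-47 = error count, bits 0-31 = operation count.
decode_seagate_raw() {
    raw=$1
    errors=$(( (raw >> 32) & 0xFFFF ))
    total=$(( raw & 0xFFFFFFFF ))
    echo "$errors $total"
}

# Example: decode_seagate_raw 316356383
```

Under that assumed layout, the Seek_Error_Rate raw value of 316356383 in the output above is below 2^32, so it would decode to zero errors over 316356383 seeks.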

Alan W. Smtih · Jul 19, 2017

Stux said:
the Seek Error RAW value on a seagate increases continuously, because it actually tells you how many seeks have been performed, and how many failed

That's wild!

(And, I'll admit it makes me slightly less likely to buy Seagate. Even though I now know better, the initial shock of seeing the number each time before remembering it doesn't necessarily correspond to the "Error" part of the title is not a feeling my brain likes. Of course, if the price is right, my brain will get over it :D)

Thanks for the info!

joeschmuck · Jul 19, 2017

As far as I recall, the seek errors on a Seagate drive are mainly related to read-ahead: the drive reads ahead, anticipating that the next sector is what will be asked for, and when that turns out not to be the data the system wants, it gets counted as a read error. This of course isn't a real read error; the data just wasn't requested, and the drive jumped the gun in an effort to speed up operations. It makes good sense to read ahead, but I think it would be better if this type of request were not recorded as an error.

Now in a WD Red, ID 1 with a value other than zero is a true error, but still it would not be the only indicator if the drive was failing.

Alan W. Smtih · Jul 19, 2017

@joeschmuck - minor note on the Hard Drive Troubleshooting Guide

There is a line in "APPENDIX A", under "SMART Results Output" that says:

Code:

(Note: Items in red are failure indications)

I'm not seeing anything red in my browser (Chrome 59 on macOS Sierra 10.12.5).

Might just be me, but figured I'd give you the heads up.

joeschmuck · Jul 19, 2017

Good catch. I guess it lost the red color when it was migrated to the resources area, it was originally just a normal thread. I'll have to fix that.

EDIT: I've updated it, not sure if the formatting will stick.

Alan W. Smtih · Jul 19, 2017

Confirmed I'm seeing it now

::thumbs-up::
