Acceptable S.M.A.R.T. results for new drive burn-in

Status
Not open for further replies.

Alan W. Smtih

Explorer
Joined
Aug 30, 2014
Messages
54
I haven't been able to find a concise "Here's how to burn-in a new drive and confirm it's safe as far as we can tell" checklist. (As in, there's already a FreeNAS box in production and you just want to spin up another temporary one to prep replacement disks.)

Between this forum (including this post which you should read for a lot more detail) and the Wikipedia S.M.A.R.T. entry, I put together what I think is a core checklist:

(Assuming: logged in as a user other than 'root', and one hypothetical drive at ada0 with a stripe size of 4096)

1. Enable the GEOM debug flag that permits raw writes to the disk with: sudo sysctl kern.geom.debugflags=0x10 (if you reboot before the badblocks run completes, re-run this command)

2. Run sudo smartctl --scan to get a list of drives (e.g. /dev/ada0)

3. Run sudo diskinfo -v ada0 for the drive and note the number for stripesize, which is passed to -b in the next step

4. Run sudo badblocks -b 4096 -ws /dev/ada0 on the drive and allow it to complete (which can take a few days for disks that reach into TB)

5. Run sudo smartctl -t long /dev/ada0 on the drive and allow it to complete (which takes several hours; you can track progress by running the command from the next step, sudo smartctl -x /dev/ada0 | less, and looking for the line with % remaining)

6. Review the results of sudo smartctl -x /dev/ada0 | less and verify that every one of the following items that appears (not all of them will) has a RAW_VALUE of 0 (zero).

Code:
| ID  | Name                                        |
|-----|---------------------------------------------|
| 5   | Reallocated Sectors Count                   |
| 10  | Spin Retry Count                            |
| 184 | End-to-End error / IOEDC                    |
| 187 | Reported Uncorrectable Errors               |
| 188 | Command Timeout                             |
| 196 | Reallocation Event Count                    |
| 197 | Current Pending Sector Count                |
| 198 | (Offline) Uncorrectable Sector Count        |
| 201 | Soft Read Error Rate or TA Counter Detected |
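
The zero check on the attributes listed above could be scripted roughly as follows. This is only a sketch: check_watched_attrs is a made-up name, and it assumes the default smartctl -A column layout, where RAW_VALUE starts in the tenth column. smartctl -x prints a briefer attribute layout, so the column numbers would differ there.

```shell
#!/bin/sh
# Sketch only: reads smartctl -A attribute rows on stdin and prints the ID,
# name, and RAW_VALUE of every watched attribute whose raw value is not 0.
# check_watched_attrs is a hypothetical helper; the ID list mirrors the table.
check_watched_attrs() {
    awk -v ids="5 10 184 187 188 196 197 198 201" '
        BEGIN {
            n = split(ids, list, " ")
            for (i = 1; i <= n; i++) watch[list[i]] = 1
        }
        # Column 1 is the attribute ID, column 3 the hex FLAG,
        # column 10 the start of RAW_VALUE (assumed layout).
        ($1 in watch) && ($3 ~ /^0x/) && ($10 + 0 != 0) { print $1, $2, $10 }
    '
}
```

Something like sudo smartctl -A /dev/ada0 | check_watched_attrs should then print nothing for a drive that passes the checklist, and one line per offending attribute otherwise.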


Is there anything else that needs to be added?
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Take a look at my Hard Drive Troubleshooting Guide in the Resources (see link in my signature). It lists all the items that are of interest when checking SMART data. While ID 200 isn't typically a value to indicate failure, it can be one of the first signs of failure and we have seen it here quite a bit on the forums. This resource does not include SSDs as there are some unique manufacturer-specific IDs. I also cover how to use badblocks to troubleshoot a specific LBA failure. If you feel that this should be updated, I'm all ears.

Also, a non-zero value in ID 197 doesn't necessarily mean there is a problem, provided it's a low count (1-5), there is nothing in ID 5, and a SMART Long Test doesn't fail. It could clear itself.
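
As a rough illustration of that rule of thumb, here is a minimal sketch. triage_pending and its three verdict words are made up for this example and are not part of smartmontools; the thresholds simply restate the 1-5 range mentioned above.

```shell
#!/bin/sh
# Sketch only: triage_pending is a hypothetical helper.
# Arguments: raw value of ID 197, raw value of ID 5, and "passed" or
# "failed" for the most recent SMART long test.
triage_pending() {
    tp_pending=$1
    tp_realloc=$2
    tp_long_test=$3
    if [ "$tp_realloc" -ne 0 ] || [ "$tp_long_test" != "passed" ]; then
        echo "investigate"      # reallocations or a failed long test
    elif [ "$tp_pending" -eq 0 ]; then
        echo "ok"               # nothing pending at all
    elif [ "$tp_pending" -le 5 ]; then
        echo "watch"            # low pending count; may clear itself
    else
        echo "investigate"      # pending count above the 1-5 range
    fi
}
```

For example, triage_pending 3 0 passed lands in the "watch" bucket, while any value in ID 5 or a failed long test goes straight to "investigate".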

I agree with @Stux that the link he provided for drive burn-in testing (we punt many users to this link) contains all the info and a lot of good personal pain from many users.

I'm glad that you went through the effort to lay out your steps and number them. I do that as well.
 

Alan W. Smtih

Explorer
Joined
Aug 30, 2014
Messages
54

Totally!

I love that one. It was the biggest help for me getting my head wrapped around this stuff.

(I almost linked to it the first time. I'm going to make an edit to add that.)

While it shows a ton of details (e.g. tmux commands) and different ways to do things (e.g. -ws vs -ns), my goal was a little different. I'm trying to boil it down to a concise checklist with only the core commands.

(I just read The Checklist Manifesto, and this is an attempt to apply the lessons from it.)

Also, while that post talks about stripesize and how it's pretty safe to assume -b 4096 for bigger disks, I added a step to explicitly make that determination.

(Frankly, if that step was in the original, I probably wouldn't have posted this and just kept it for my personal checklist.)
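
For what it's worth, that determination could be scripted along these lines. This is a sketch under assumptions: stripesize_of is a made-up name, and it assumes diskinfo -v prints each value on its own line in the form "<number>  # <label>".

```shell
#!/bin/sh
# Sketch only: stripesize_of is a hypothetical helper. It reads the text of
# `diskinfo -v` on stdin and prints the number on the "# stripesize" line.
# Assumption: value lines look like "<number>  # <label>".
stripesize_of() {
    awk '$2 == "#" && $3 == "stripesize" { print $1 }'
}
```

Usage would be something like sudo diskinfo -v ada0 | stripesize_of, with the printed number then passed to badblocks -b. It is still worth eyeballing the value before starting a multi-day run.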

Some drives have non-zero raw values all the time, e.g. Seagate.

Right on. I've seen some references to how those values can change.

I put all zeros because I assumed that's what they should be out-of-the-box. (The idea being that this checklist is for brand new drives. Not ones that have seen active duty.)

Or, more to the point, if you see non-zero values on a new disk after a badblocks run, should that be an "it might be okay, but why risk it" flag?

Take a look at my Hard Drive Troubleshooting Guide in the Resources

That's fantastic! I can only imagine how much work has gone into it.

I've only had a chance to skim it today, but will dig in more when time allows.

Good stuff!
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
I put all zeros because I assumed that's what they should be out-of-the-box. (The idea being that this checklist is for brand new drives. Not ones that have seen active duty.)

No, out of the box maybe... but for example, the Seek Error RAW value on a Seagate increases continuously, because it actually tells you how many seeks have been performed and how many failed. Drives do a lot of seeking, so it's continuously going up. It's actually fairly neat to be able to tell how many bytes have been written and so on, and that comes from the RAW values, but it's encoded. Suffice to say, the raw values on a Seagate can be mostly useless for checking for errors.

Code:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   115   099   006    Pre-fail  Always       -       94262920
  3 Spin_Up_Time            0x0003   091   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       59
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   084   060   030    Pre-fail  Always       -       316356383
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       7667
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       59
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   062   056   045    Old_age   Always       -       38 (Min/Max 32/39)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       44
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       59
194 Temperature_Celsius     0x0022   038   044   000    Old_age   Always       -       38 (0 22 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0


See what I mean? Seek_Error_Rate and Raw_Read_Error_Rate look very concerning... but they're actually capable of telling you how many seeks and reads have been performed... although the counter might wrap around... I forget :)

But you are right, the values you suggest checking are zero.
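
To make the "it's encoded" point concrete, here is a minimal sketch of one commonly cited but unofficial reading of these Seagate raw values: treat the number as 48 bits, with the upper 16 bits as the error count and the lower 32 bits as the operation count. Both that layout and the helper name decode_seagate_raw are assumptions for illustration, not something confirmed in this thread or by smartctl.

```shell
#!/bin/sh
# Sketch only: decode_seagate_raw is a hypothetical helper based on an
# unofficial interpretation of Seagate raw values:
#   upper 16 bits = error count, lower 32 bits = operation count.
# Needs a shell with 64-bit arithmetic.
decode_seagate_raw() {
    dsr_errors=$(( $1 >> 32 ))           # bits above the low 32
    dsr_ops=$(( $1 & 4294967295 ))       # low 32 bits (mask 0xFFFFFFFF)
    echo "errors=$dsr_errors operations=$dsr_ops"
}
```

Under that reading, decode_seagate_raw 316356383 (the Seek_Error_Rate raw value above) gives errors=0 operations=316356383, since the whole number fits in the lower 32 bits.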
 

Alan W. Smtih

Explorer
Joined
Aug 30, 2014
Messages
54
the Seek Error RAW value on a Seagate increases continuously, because it actually tells you how many seeks have been performed, and how many failed

That's wild!

(And, I'll admit it makes me slightly less likely to buy Seagate. Even though I now know better, the initial shock of seeing the number each time, before remembering it doesn't necessarily correspond to the "Error" part of the title, is not a feeling my brain likes. Of course, if the price is right, my brain will get over it :D)

Thanks for the info!
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
As far as I recall, the seek errors for a Seagate drive are mainly related to when the drive uses its technology to read ahead, anticipating that the next sector is what will be asked for. When that turns out not to be the data the system asks for, it's counted as a read error. This of course isn't a real read error; the data just wasn't requested, and the drive jumped the gun in an effort to speed up operations. It makes good sense to read ahead, but I think it would be better if this type of request were not recorded as an error.

Now in a WD Red, ID 1 with a value other than zero is a true error, but still it would not be the only indicator if the drive was failing.
 

Alan W. Smtih

Explorer
Joined
Aug 30, 2014
Messages
54
@joeschmuck - minor note on the Hard Drive Troubleshooting Guide

There is a line in "APPENDIX A", under "SMART Results Output" that says:

Code:
(Note: Items in red are failure indications)


I'm not seeing anything red in my browser (Chrome 59 on macOS Sierra 10.12.5).

Might just be me, but figured I'd give you the heads up.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Good catch. I guess it lost the red color when it was migrated to the resources area; it was originally just a normal thread. I'll have to fix that.

EDIT: I've updated it, not sure if the formatting will stick.
 