
Hard Drive Burn-In Testing - Discussion Thread

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,449
Well, it was my first disk replacement and I forgot how I tested all 5 drives before creating the pool a few years ago.

So I put the "da4" drive OFFLINE (out of 5 drives)
Put the new drive in the same bay (I forgot I had to test it)
Ran short test
Ran long test
Is running badblocks test

Now I realize I have a degraded zpool for as long as the 5th drive is not ready.

What should I do?
Should I just stop the tests, run the "replace" command, resilver and hope for the best?
What was the reason for offlining the drive?
If you haven't performed the replacement yet, maybe you could reinsert the old drive and resilver the pool. It may take less time, as a full resilver may not be necessary.
In that case, I would proceed with badblocks and wait for it to complete. Then, if all is good, you can replace the old drive with the new drive.
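
The order suggested above can be sketched as a dry run that only prints the commands. The pool name `tank` and the device names `da4` and `da5` are hypothetical, and the sketch assumes the old disk can be reattached next to the new one; the `badblocks` flags shown are one common destructive write-mode invocation, not something this thread prescribes.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands in order instead of running them.
# pool/old/new are hypothetical names; check them against `zpool status` first.
plan_replacement() {
    pool=$1; old=$2; new=$3
    # Bring the offlined old disk back so the pool is no longer degraded.
    echo "zpool online $pool $old"
    # SMART self-tests run inside the drive; wait for each to finish.
    echo "smartctl -t short /dev/$new"
    echo "smartctl -t long /dev/$new"
    # Destructive write test: run it on the NEW disk only.
    echo "badblocks -ws -b 4096 /dev/$new"
    # Only after the tests pass, swap the old disk for the new one.
    echo "zpool replace $pool $old $new"
}

plan_replacement tank da4 da5
```

Printing rather than executing keeps the sketch safe to try; each line would still need to be run by hand, one at a time, after checking the device names.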
 

phier

Patron
Joined
Dec 4, 2012
Messages
398
May I ask, regarding that drive burn-in: the solnet script contains only "read" operations, such as
dd if=/dev/ada0 of=/dev/null bs=1048576

So writing to the drive is not required as part of the burn-in?

thanks
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,107
That's what makes the test non-destructive and allows it to run on a live array.
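
As a minimal sketch of such a read-only pass, the `dd` command quoted above can be wrapped so that its exit status is reported. The function name `read_pass` is made up for illustration, and this is not how solnet-array-test itself is written.

```shell
#!/bin/sh
# Read every block of a device and discard the data; nothing is written to it.
# Returns non-zero if dd could not read the whole device.
read_pass() {
    dev=$1
    if dd if="$dev" of=/dev/null bs=1048576 2>/dev/null; then
        echo "$dev: read pass completed"
    else
        echo "$dev: read pass FAILED" >&2
        return 1
    fi
}
```

Usage would be something like `read_pass /dev/ada0`. A clean exit only says that every block could be read; it says nothing about how hard the drive had to work to read it.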
 

phier

Patron
Joined
Dec 4, 2012
Messages
398
@Etorix thanks, but it is still not clear to me: is a read-only operation enough for a burn-in? Also, I was wondering, once the test is over, i.e. every dd if=drive of=null has completed, how can I find out whether there was an issue or the drive is okay?

Thanks
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,107
"Burn-in" is whatever makes you feel confident before putting the hardware in production…
There are "recipes" but no real "gold standard". Badblocks—a legacy from the time when hard drives could not map their own defects—has been much abused for that. Solnet-array is a different take on the issue, and if it's good for @jgreco I'd say it's good enough for everyone.
 

phier

Patron
Joined
Dec 4, 2012
Messages
398
@Etorix I believe it, but I was wondering how to validate that action: you execute solnet-array, and once it's done, how do you evaluate the burn-in?
By running smartctl -t long and looking at the results?

thanks
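
One possible way to evaluate a drive after a pass, sketched here as an assumption rather than an answer given in this thread, is to look at the SMART error counters. The function name `check_smart_attrs` is hypothetical, and the attribute names are the ones commonly reported by ATA drives in `smartctl -A`; SAS drives and some vendors report differently.

```shell
#!/bin/sh
# Reads `smartctl -A` output on stdin and prints any of three common
# error counters whose raw value (10th column) is non-zero.
# Exit status is 1 if at least one counter was flagged, 0 otherwise.
check_smart_attrs() {
    awk '
        $2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ && $10 + 0 > 0 {
            print "WARN: " $2 " raw=" $10
            bad = 1
        }
        END { exit bad + 0 }
    '
}
```

It could be used as `smartctl -A /dev/ada0 | check_smart_attrs`, ideally once before and once after the burn-in so the two results can be compared. The outcome of a long self-test can be read separately with `smartctl -l selftest /dev/ada0`.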
 

sleeper52

Explorer
Joined
Nov 12, 2018
Messages
91
I'm currently running jgreco's Solnet-Array-Test in a tmux session, as recommended by a number of people in this thread. It is testing 3x Exos X18 18TB drives (ST18000NM000J-2TV103). I also have netdata as a monitoring tool. My question is: how long should I run the Solnet-Array-Test? Does it terminate by itself once it has scanned all sectors (like Badblocks) or does it run indefinitely?
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,903
Looking at your signature, my guess is that you want to add those 3 drives as another RAIDZ1. For drives of that size that means a pretty long resilvering period in RAIDZ?, if one of them dies. Therefore RAIDZ1 is not recommended for larger drives.

Personally, and I know this is purely anecdotal, I was not very lucky with my 8 Seagate Exos X16 16 TB drives, which I bought in September 2020. Just yesterday I returned the 4th drive to Seagate. It was the replacement for one of the 3 original drives that had died so far. And a 4th of the original 8 is also showing signs of dying. In fairness: the RMA process has always been perfect so far, absolutely nothing to complain about.

I am not saying this to bash Seagate but to underline that drives can die suddenly, even after months of burn-in. Especially so, if they are coming from the same batch.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
My question is: how long should I run the Solnet-Array-Test? Does it terminate by itself once it has scanned all sectors (like Badblocks) or does it run indefinitely?

The version I released to the community will make one pass through the drives. It runs several different phases that are intended to help identify various issues. In the old days, we had shared parallel SCSI busses and this could limit your bandwidth. The set of problems these days is a little bit different but really not that much different. The new SCALE-compatible version will be a little more chatty about what it is doing and why.
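
To illustrate the bandwidth point, here is a minimal sketch of reading several drives at the same time. It is not the actual solnet-array-test code; the function name `parallel_read` and the example device names are made up for illustration.

```shell
#!/bin/sh
# Start one read-only dd per device in the background, then wait for all.
# Returns non-zero if any of the reads failed.
parallel_read() {
    pids=""
    for dev in "$@"; do
        dd if="$dev" of=/dev/null bs=1048576 2>/dev/null &
        pids="$pids $!"
    done
    rc=0
    for pid in $pids; do
        wait "$pid" || rc=1
    done
    return $rc
}
```

Running `parallel_read /dev/da0 /dev/da1 /dev/da2` and comparing the elapsed time with a single-drive run would hint at whether a shared controller or bus limits aggregate throughput.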
 

sleeper52

Explorer
Joined
Nov 12, 2018
Messages
91
Yikes. Yes, that is my plan. I don't have the budget to change course at the moment. Fortunately, the data in the storage will mainly consist of media (movies and TV shows) for my Plex media server. Based on what you just posted, I will be leaving my more sensitive data on the smaller drives. Hopefully TrueNAS will offer RAIDZ expansion, including the ability to increase parity, in the near future. Thanks for sharing your experience, as this will certainly affect my upgrade decisions in the future.
We certainly appreciate the contributions you've made to the TrueNAS community. We honestly need something to replace Badblocks, as I also ran into some issues when attempting to run it as a burn-in test tool. It was through digging around the forums that I found your SAT tool. How long do you reckon it takes your test to complete an 18TB drive?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
How long do you reckon it takes your test to complete an 18TB drive?

A *long* time. It was originally made to burn in drives back in the late '90s, when sizes like 50GB and 73GB were the largest common drive sizes. However, it will do all your drives in parallel, simultaneously, so this may not be as terrible as it sounds. The thing that it is really testing is the ability of an HDD to accurately seek and read tracks.
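
For a rough sense of scale, the time for one sequential read of a drive can be estimated from its size and an average throughput. The 200 MB/s figure below is an assumption for illustration, not a measured value, and `pass_hours` is a hypothetical helper.

```shell
#!/bin/sh
# Whole hours for one sequential pass over a drive.
# $1 = capacity in decimal TB, $2 = assumed average throughput in MB/s.
pass_hours() {
    size_tb=$1; mb_per_s=$2
    # TB -> MB (x 1,000,000), divided by MB/s gives seconds, / 3600 gives hours.
    echo $(( size_tb * 1000000 / mb_per_s / 3600 ))
}

pass_hours 18 200   # prints 25
```

That is for a single sequential read only. A test that runs several phases, including seek-heavy ones, would take a multiple of this, but since the drives are read in parallel the total does not grow with the number of drives unless the controller becomes the bottleneck.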
 

sleeper52

Explorer
Joined
Nov 12, 2018
Messages
91
Hahaha. Yeah thankfully it is doing it in parallel.
The new version has a burn-in mode.
Looking forward to using the new version. I'll be sure to use it in my next pool upgrade.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
Looking forward to using the new version. I'll be sure to use it in my next pool upgrade.

Just now posted. See

 

sleeper52

Explorer
Joined
Nov 12, 2018
Messages
91
That is awesome. I just downloaded it onto my TrueNAS server. Unfortunately, I've been running v2 on my 3 drives for around 2 days now. I would've loved to have used your new v3. Do you reckon v3 runs the test faster, and that I should just cancel the burn-in using v2?
 

dxun

Explorer
Joined
Jan 24, 2016
Messages
52
I haven't read all 30 pages of this thread (apologies if this has already been asked and answered), but would an acceptable HDD burn-in test be to fill the pool as close to 100% as possible and then do a scrub? This is assuming RAM is tested and ECC is present and functioning correctly. Then, if the scrub passes, that would verify that each written block is readable without errors.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
Depends. Part of what tools like solnet-array-test do is perform large numbers of seeks, which helps to identify one of the common mechanical failures in HDDs. Another part is the statistical analysis done to check for slow drives or other similar issues. Burn-in probably should run out to a thousand hours as well, which is the window in which most infant mortality happens. I would take your fill and scrub as a "better than nothing" option, but I think there are good reasons to do a more formal burn-in.
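
The slow-drive check can be illustrated with a small sketch that compares per-drive pass times against their mean. This is not the analysis solnet-array-test actually performs; the function name `flag_slow` and the 20% threshold are arbitrary assumptions for illustration.

```shell
#!/bin/sh
# Reads lines of "<drive> <seconds>" on stdin and prints every drive whose
# pass took more than 20% longer than the mean of all drives.
flag_slow() {
    awk '
        { name[NR] = $1; secs[NR] = $2; total += $2 }
        END {
            if (NR == 0) exit 0
            mean = total / NR
            for (i = 1; i <= NR; i++)
                if (secs[i] > mean * 1.2)
                    print name[i] " is slow: " secs[i] "s vs mean " mean "s"
        }
    '
}
```

Feeding it timings collected from identical drives on the same controller would make an outlier stand out. With drives of different models or sizes the comparison means little.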
 

dxun

Explorer
Joined
Jan 24, 2016
Messages
52
Yeah, that makes sense - thank you.
 