
Hard Drive Burn-In Testing - Discussion Thread

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,475
Thanks for the tip on the boosted -c option, kicking off a 10TB Toshiba MG06ACA10TE presently.

I'm curious if anyone knows what this message below means:



It only shows up in a few posts elsewhere, and it's unclear if there's an issue.
I've seen that before, and it's never been a problem.

If you're running on FreeNAS/FreeBSD, did you remember to run this before testing?
Code:
sysctl kern.geom.debugflags=0x10
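(In case it helps anyone: to check the current value, or to put the flag back afterwards, something like this should work; treating 0 as the stock default is my assumption based on a default FreeBSD install.)
Code:
sysctl kern.geom.debugflags      # show the current value
sysctl kern.geom.debugflags=0    # restore the default once burn-in is finished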
 

thalf

Junior Member
Joined
Mar 1, 2014
Messages
14
I purchased two WD Ultrastar DC HC550 18TB drives and wanted to do some thorough testing before starting to use them, so I'm following the advice in qwertymodo's post that started this thread.

But... it's taking forever, even though I'm testing both disks in parallel.

The long S.M.A.R.T. test took roughly 30 hours. Then I started badblocks. The first pattern (0xaa) took roughly 46 hours. Now at 49h40m they're 16.6% and 20.4% along writing the 0x55 pattern, but at 100% they'll start "Reading and comparing" 0x55 before moving on to 0xff and then 0x00.

(I find it a bit interesting that one drive appears faster than the other, since I started badblocks on the two drives just 10 seconds apart.)

So badblocks looks like it'll take around 180 hours, or a bit over 7 days. That's a long wait. And after that qwertymodo's post says I should run another long S.M.A.R.T. test, for another 30 hours. That'll be a total of roughly 10 days' worth of testing the drives before starting to use them.
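As a sanity check on that estimate (assuming four patterns, each written and then read back, so eight full passes over the 18 TB drive):
Code:
# 4 patterns x (write + read) = 8 full passes over an 18 TB (decimal) drive
echo $(( 18000000000000 * 8 / (180 * 3600) ))   # ~222222222 bytes/s, i.e. roughly 220 MB/s average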

I didn't notice until around 30 hours in that the -c flag to badblocks can potentially make a difference, and at that point I was very hesitant to abort badblocks just to try out different -c values and see if I got a speedup. If there's no speedup, it means I'll have aborted and restarted badblocks for nothing, losing a lot of time.

How thoroughly do you guys test large drives before starting to use them? And what flags do you use?


Also, while on the subject of badblocks: for large drives a larger block size must be used, apparently due to some 32-bit limitations in the code. qwertymodo advises using -b 4096 for drives larger than 2TB. Well, I had to use -b 8192 for the 18TB drives, as I first got errors:

Code:
root@nas:~ # time badblocks -ws /dev/ada0
badblocks: Value too large to be stored in data type invalid end block (17578328064): must be 32-bit value
0.000u 0.003s 0:00.00 0.0%      0+0k 0+0io 0pf+0w
root@nas:~ # time badblocks -b 4096 -ws /dev/ada0
badblocks: Value too large to be stored in data type invalid end block (4394582016): must be 32-bit value
0.000u 0.005s 0:00.02 0.0%      0+0k 0+0io 10pf+0w
root@nas:~ # smartctl -a /dev/ada0 | grep 'Sector Sizes:'
Sector Sizes:     512 bytes logical, 4096 bytes physical

But I read somewhere (can't find the source now) that using a block size larger than the drive's sector size was bad/pointless, since that somehow wouldn't test the entire disk, i.e. using an 8192-byte block size on a drive with a 4096-byte sector size meant that only every other 4096 bytes would actually be tested.

Does anyone know if this is correct? If it is, it means I'm really just wasting time running badblocks.
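For what it's worth, this is what I'd use to double-check what the drive itself reports before picking -b (FreeBSD's diskinfo; /dev/ada0 is just the same example device as above):
Code:
diskinfo -v /dev/ada0 | grep -E 'sectorsize|stripesize'
# sectorsize should show the logical sector size (512 here), stripesize the physical one (4096)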
 

Mastakilla

Member
Joined
Jul 18, 2019
Messages
178
...
(I find it a bit interesting that one drive appears faster than the other, since I started badblocks on the two drives just 10 seconds apart.)
...
See above thread on this subject... I bumped into similar observations when burning in with the solnet-array-test script.
 

Apollo

Neophyte Sage
Joined
Jun 13, 2013
Messages
1,265
See above thread on this subject... I bumped into similar observations when burning in with the solnet-array-test script.
I doubt any kind of throttling should be implemented on such devices.
For time-critical systems, this becomes a liability and brings uncertainty, as the system becomes non-deterministic.
Normally, the hardware (PCB layout and thermal mitigation) should accommodate acceptable cooling performance at any load.
If the product has been designed properly and cooling is within specification, then such a condition shouldn't occur.
 

ChrisRJ

Neophyte Sage
Joined
Oct 23, 2020
Messages
622
Well, I do not consider 7 days long for a burn-in. But hey, I once lost about 800 KB of manual entries to a database. When I moved to my new NAS a few months ago, I first let it run as a backup system for the old NAS for about 6 weeks. Only then did I reverse things, so the old NAS is now the backup (using ZFS replication).
 

raz0r11

Newbie
Joined
Jun 8, 2021
Messages
3
Now, to bring this around to a vaguely relevant-to-the-forum post, I'll point out that it is interesting that Foxconn has come to Wisconsin, a site I expect was selected in large part due to the ready availability of water from the world's largest fresh water supply, which their manufacturing processes require a lot of. I am skeptical of the claims of lots of high paying jobs in the long term for Wisconsin, although I'd be happy to be wrong.

Imagine my surprise to find this in the middle of a drive burn-in testing discussion. For anyone who cares, there's an interesting article about what happened. And BTW, it was the promise of ~$3B in tax abatements, not fresh water, that drove the decision, a deal which has since shrunk along with the jobs originally promised.

OK, enough off-topic stuff.

TL;DR

Badblocks:
-c <number> can make a large difference vs. the default, so consider that.
-b <number> should be aligned with your actual block size (consider using diskinfo to figure it out).

Why?

Using the default -c (64) may make things slower, and misaligning your block size (512 vs. 520 vs. 4096 vs. 8192) will DEFINITELY slow things down...

Another thing to consider: enabling the write cache, as stated earlier in the thread (smartctl -s wcache,on /dev/<whatever>), though to be fair, YMMV... See the sketch below.
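A rough sketch of the above on FreeBSD (the device name /dev/ada1, the log file name, and the -c value are just examples; check your own drive's block size first):
Code:
diskinfo -v /dev/ada1 | grep -E 'sectorsize|stripesize'    # figure out what -b should be
smartctl -s wcache,on /dev/ada1                            # enable the write cache (YMMV)
badblocks -b 4096 -c 1024 -wsv -o badblocks-ada1.log /dev/ada1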
 

jgreco

Resident Grinch
Moderator
Joined
May 29, 2011
Messages
15,026
I may have forgotten to point out that I'm a resident of the region. The Verge article isn't entirely crap, but misses a lot of nuance and misunderstands some of what transpired.

interesting article about what happened. And BTW, it was the promise of ~$3B in tax abatements, not fresh water, that drove the decision, a deal which has since shrunk along with the jobs originally promised

Except that NO ONE who had the slightest clue and looked at the deal thought it was about the tax abatement. It never made any sense that Foxconn was going to do any significant manufacturing of LCD panels or anything else here. Producing panels here would be pointless, as the market for them is all back in Asia where the companies building TV's and monitors are.

So what was really going on? There are basically two sides to consider, three if you include the taxpayers:

Governor Walker and his pals were desperate for a super-positive talking point to win an election, and making it sound like Foxconn was going to make Wisconsin into an electronics manufacturing haven would have been a very big feather in his cap, swinging likely-Democratic votes in Southeastern Wisconsin to Republican. Unfortunately, the legitimacy and practicality of the deal fell under closer scrutiny than expected, and it became known that the only practical way for them to make panels at a reasonable cost was to do lights-out manufacturing, which punctured the bubble of "lots of jobs". Additionally, the total absence of anyone on this side of the planet to buy the panels became apparent.

From Foxconn's point of view, it is a little bit murkier, but a bunch of investigative reporting yielded clues that Foxconn had been shopping for properties with access to large amounts of freshwater, a major component to their manufacturing, and something that is sometimes in short supply in Asia. Siting a manufacturing plant within the boundaries of the Great Lakes Compact would give them a toehold to a new source for freshwater. This theory is backed up by the manner in which they made it known to the Walker administration that they needed both freshwater and exemptions from various environmental, wastewater, and pollution regulations, all of which were then rammed through. If they didn't need freshwater, then the question would need to be why they sought these exemptions. The amount of land that was cleared out in Mount Pleasant for Foxconn is immense, so it seems likely that they did want an option to build some significant manufacturing capacity.

Third, they also fought for - and got - a widening of Interstate 94, at taxpayer expense, in order to be able to run self-driving trucks between Racine and General Mitchell Airport. This was really a big clue as to what the future would hold for the project, as there is plenty of local workforce available to drive trucks.

So, like others in the region who were aware that this didn't make any sense, I predicted that this was either going to turn into lights-out manufacturing with maybe a thousand jobs, or fail entirely, well before ground was broken out in Mount Pleasant.

And here's the thing. They now do have the infrastructure, power, water, etc., and it would be relatively easy for them to move forward with a manufacturing facility, which could still happen, but it won't be bringing the thousands of jobs. My best guess is that they were covering bets as to business continuity in the face of global warming, and that they've got a place that they could light up in a year or two if need be.

The taxpayers of Wisconsin have been taken for a ride by both sides.
 

Etorix

Senior Member
Joined
Dec 30, 2020
Messages
498
Thanks for this enlightening analysis.
But while we're talking about "light"… you imply that Foxconn may yet "light up" a "lights-out" manufacturing facility. A nice practical lesson on verbal particles for non-native English speakers! o_O
 

jgreco

Resident Grinch
Moderator
Joined
May 29, 2011
Messages
15,026
Thanks for this enlightening analysis.
But while we're talking about "light"… you imply that Foxconn may yet "light up" a "lights-out" manufacturing facility. A nice practical lesson on verbal particles for non-native English speakers! o_O

Sorry. :smile:

"Lights-out manufacturing" refers to automated manufacturing facilities. When you do not have humans in a building, you may want to turn the lights off to save on energy. Few places are actually so automated that they don't have ANY people on site, so the term is somewhat aspirational, but the term is still used to mean a "low-to-no-human" facility, even if the lights are actually on.

I'm a network and infrastructure engineer, and to "light up" a building refers to bringing it on network, or, in a more abstract sense, to make the building useful. The process of lighting up a building involves bringing in the equipment and doing all the setup and stuff to take an empty building to a functional facility. There is probably a better term for this for a manufacturing facility, but I am not coming up with it even now, so my mind went to the closest parallel.

Anyways, I try to look for the truth behind the superficial news headlines. The Verge article isn't terrible, but it misses out on some stuff that can be inferred from what has actually transpired. Of course, it is easier for them to get this stuff right now that so much of this is in hindsight. I was skeptical back at the start of all of this, but that's mostly because I spend a lot of time trying to understand motives and practicalities.
 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,475
I purchased two WD Ultrastar DC HC550 18TB drives and wanted to do some thorough testing before starting to use them, so I'm following the advice in qwertymodo's post that started this thread.

But... it's taking forever, even though I'm testing both disks in parallel.

The long S.M.A.R.T. test took roughly 30 hours. Then I started badblocks. The first pattern (0xaa) took roughly 46 hours. Now at 49h40m they're 16.6% and 20.4% along writing the 0x55 pattern, but at 100% they'll start "Reading and comparing" 0x55 before moving on to 0xff and then 0x00.

(I find it a bit interesting that one drive appears faster than the other, since I started badblocks on the two drives just 10 seconds apart.)

So badblocks looks like it'll take around 180 hours, or a bit over 7 days. That's a long wait. And after that qwertymodo's post says I should run another long S.M.A.R.T. test, for another 30 hours. That'll be a total of roughly 10 days' worth of testing the drives before starting to use them.

I didn't notice until around 30 hours in that the -c flag to badblocks can potentially make a difference, and at that point I was very hesitant to abort badblocks just to try out different -c values and see if I got a speedup. If there's no speedup, it means I'll have aborted and restarted badblocks for nothing, losing a lot of time.

How thoroughly do you guys test large drives before starting to use them? And what flags do you use?


Also, while on the subject of badblocks: for large drives a larger block size must be used, apparently due to some 32-bit limitations in the code. qwertymodo advises using -b 4096 for drives larger than 2TB. Well, I had to use -b 8192 for the 18TB drives, as I first got errors:

Code:
root@nas:~ # time badblocks -ws /dev/ada0
badblocks: Value too large to be stored in data type invalid end block (17578328064): must be 32-bit value
0.000u 0.003s 0:00.00 0.0%      0+0k 0+0io 0pf+0w
root@nas:~ # time badblocks -b 4096 -ws /dev/ada0
badblocks: Value too large to be stored in data type invalid end block (4394582016): must be 32-bit value
0.000u 0.005s 0:00.02 0.0%      0+0k 0+0io 10pf+0w
root@nas:~ # smartctl -a /dev/ada0 | grep 'Sector Sizes:'
Sector Sizes:     512 bytes logical, 4096 bytes physical

But I read somewhere (can't find the source now) that using a block size larger than the drive's sector size was bad/pointless, since that somehow wouldn't test the entire disk, i.e. using an 8192-byte block size on a drive with a 4096-byte sector size meant that only every other 4096 bytes would actually be tested.

Does anyone know if this is correct? If it is, it means I'm really just wasting time running badblocks.
I use 32768 for the block size with a block count of 512 when I run badblocks to test disks, like this:
Code:
badblocks -b 32768 -c 512 -wsv -e 1 -o {block_log_file} /dev/{drive ID}
This uses more memory than the default settings and seems to speed up the process a little... but it's just the nature of things that it takes a long time to crank through every LBA on these huge disks!
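In case the extra flags aren't obvious, here's the same command annotated (same placeholders as above; the comments are just my reading of the badblocks man page):
Code:
# -b 32768  block size in bytes (a multiple of the 4096-byte physical sector)
# -c 512    number of blocks tested at a time
# -wsv      destructive write-mode test (wipes the drive!), show progress, verbose
# -e 1      abort as soon as the first bad block is found
# -o FILE   write any bad blocks found to FILE
badblocks -b 32768 -c 512 -wsv -e 1 -o {block_log_file} /dev/{drive ID}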
 

Etorix

Senior Member
Joined
Dec 30, 2020
Messages
498
"Lights-out" was clear from the context.
I would rather associate "up" with "in" or "on", and "out" with "down" or "off", so I found that the contradictory association of "light up" with "lights-out" made for an interesting stylistic effect. Maybe I got my semantic wrong (or upside-down)? Or I'm looking for style effects where none was intended (just pretend :wink: ).
Please do not refrain you writing skills.
 

ByteMan

Member
Joined
Nov 10, 2021
Messages
30
I use 32768 for the block size with a block count of 512 when I run badblocks to test disks, like this:
Code:
badblocks -b 32768 -c 512 -wsv -e 1 -o {block_log_file} /dev/{drive ID}
This uses more memory than the default settings and seems to speed up the process a little... but it's just the nature of things that it takes a long time to crank through every LBA on these huge disks!
I gave badblocks -b 32768 -c 512 -wsv a try on four 14TB 4kn/512e drives (WD Red Pro + WD DC HC530), but it seems to be slower than badblocks -b 4096 -wsv.
I aborted -b 4096 -wsv after 79 hours in order to re-run with -b 32768 -c 512 -wsv on the same drives. After 79 hours, there is a 12% difference in completion.
Summarizing:
badblocks -b 32768 -c 512 -wsv after 79h: 3rd pass @ 13% completion
badblocks -b 4096 -wsv after 79h: 3rd pass @ 25% completion

Is there a logical explanation for this?
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
9,145
Is there a logical explanation for this?
Of course there is.

I'll take a stab at it:
Your command with -b 32768 is not using the native block size (even though it is a multiple), and -c 512 means 512 blocks per operation, so it writes 16777216 bytes at a time.
Your command with -b 4096 alone specifies the native block size of the drive with the default of 64 blocks per operation, writing 262144 bytes at a time.

Stick with the native sizes of the drives if you want some speed. If you were to use -b 4096 -c 512, you might get a faster result, but will it be appreciable? If you run into bad sectors, I'm not sure whether a larger -c value will hurt you or not; I would suspect it would, as the program retries the entire write operation over and over again, but will it be a major slowdown when bad sectors are found? I actually do not know with certainty.

So my advice: stick with -b 4096 for true 4096-byte-sector drives.
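For the arithmetic behind those per-operation numbers (plain shell arithmetic, nothing drive-specific):
Code:
echo $(( 32768 * 512 ))   # 16777216 bytes = 16 MiB per operation with -b 32768 -c 512
echo $(( 4096 * 64 ))     # 262144 bytes = 256 KiB per operation with -b 4096 and the default -c 64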
 