
Hard Drive Burn-In Testing - Discussion Thread

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
Thanks for the tip on the boosted -c option, kicking off a 10TB Toshiba MG06ACA10TE presently.

I'm curious if anyone knows what this message below means:



It only shows up in a few posts elsewhere, and it's unclear whether there's an issue.
I've seen that before, and it's never been a problem.

If you're running on FreeNAS/FreeBSD, did you remember to run this before testing?
Code:
sysctl kern.geom.debugflags=0x10
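In case it helps, here's a minimal sketch of the full sequence (0 is the stock default; 0x10 allows writes to raw disk devices that GEOM would otherwise protect):

Code:
# Check the current GEOM debug flags (0 is the default)
sysctl kern.geom.debugflags
# Allow writes to otherwise-protected raw disk devices ("foot-shooting" mode)
sysctl kern.geom.debugflags=0x10
# Restore the default once burn-in is finished
sysctl kern.geom.debugflags=0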
 

thalf

Dabbler
Joined
Mar 1, 2014
Messages
19
I purchased two WD Ultrastar DC HC550 18TB drives and wanted to do some thorough testing before starting to use them, so I'm following the advice in qwertymodo's post that started this thread.

But... it's taking forever, even though I'm testing both disks in parallel.

The long S.M.A.R.T. test took roughly 30 hours. Then I started badblocks. The first pattern (0xaa) took roughly 46 hours. Now at 49h40m they're 16.6% and 20.4% along writing the 0x55 pattern, but at 100% they'll start "Reading and comparing" 0x55 before moving on to 0xff and then 0x00.

(I find it a bit interesting that one drive appears faster than the other, since I started badblocks on the two drives just 10 seconds apart.)

So badblocks looks like it'll take around 180 hours, or a bit over 7 days. That's a long wait. And after that qwertymodo's post says I should run another long S.M.A.R.T. test, for another 30 hours. That'll be a total of roughly 10 days' worth of testing the drives before starting to use them.
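(For anyone following along at home, a sketch of how those long tests are kicked off and checked, using the same device name as in the output further down:)

Code:
# Start the drive's built-in long self-test (it runs inside the drive firmware)
smartctl -t long /dev/ada0
# Check progress, and read the self-test log once it finishes
smartctl -a /dev/ada0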

I didn't notice the difference that the -c flag to badblocks can potentially make until around 30 hours in, and at that point I was very hesitant to abort badblocks to try out different -c values to see if I got a speedup. If there were no speedup, it would mean I'd aborted and restarted badblocks for nothing, losing a lot of time.

How thoroughly do you guys test large drives before starting to use them? And what flags do you use?


Also, while on the subject of badblocks: for large drives a larger block size must be used, apparently due to a 32-bit limitation in the code. qwertymodo advises using -b 4096 for drives larger than 2TB. Well, I had to use -b 8192 for the 18TB drives, as I first got errors:

Code:
root@nas:~ # time badblocks -ws /dev/ada0
badblocks: Value too large to be stored in data type invalid end block (17578328064): must be 32-bit value
0.000u 0.003s 0:00.00 0.0%      0+0k 0+0io 0pf+0w
root@nas:~ # time badblocks -b 4096 -ws /dev/ada0
badblocks: Value too large to be stored in data type invalid end block (4394582016): must be 32-bit value
0.000u 0.005s 0:00.02 0.0%      0+0k 0+0io 10pf+0w
root@nas:~ # smartctl -a /dev/ada0 | grep 'Sector Sizes:'
Sector Sizes:     512 bytes logical, 4096 bytes physical
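For what it's worth, here's the arithmetic behind those two errors (a sketch: badblocks defaults to 1024-byte blocks, and its block count must stay below 2^32 = 4294967296):

Code:
# 17578328064 default 1024-byte blocks on this 18TB drive
echo $((17578328064 / 4))   # -b 4096: 4394582016 blocks, still over 2^32
echo $((17578328064 / 8))   # -b 8192: 2197291008 blocks, fits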

But I read somewhere (can't find the source now) that using a block size larger than the drive's sector size is bad/pointless, since it somehow wouldn't test the entire disk, i.e. using an 8192-byte block size on a drive with a 4096-byte sector size would mean that only every other 4096-byte sector actually gets tested.

Does anyone know if this is correct? If it is, it means I'm really just wasting time running badblocks.
 

Mastakilla

Patron
Joined
Jul 18, 2019
Messages
202
...
(I find it a bit interesting that one drive appears faster than the other, since I started badblocks on the two drives just 10 seconds apart.)
...
See above thread on this subject... I bumped into similar observations when burning in with the solnet-array-test script.
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,449
See above thread on this subject... I bumped into similar observations when burning in with the solnet-array-test script.
I doubt any kind of throttling should be implemented on such devices.
For time-critical systems, it becomes a liability and brings uncertainty, as the system becomes non-deterministic.
Normally, the hardware (PCB layout and thermal mitigation) should provide acceptable cooling performance at any load.
If the product has been designed properly and cooling is within specification, then such a condition shouldn't occur.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,906
Well, I do not consider 7 days long for a burn-in. But hey, I once lost about 800 KB of manual entries to a database. When I moved to my new NAS a few months ago, I first let it run as a backup system for the old NAS for about 6 weeks. Only then did I reverse things, so the old NAS is now the backup (using ZFS replication).
 

raz0r11

Cadet
Joined
Jun 8, 2021
Messages
3
Now, to bring this around to a vaguely relevant-to-the-forum post, I'll point out that it is interesting that Foxconn has come to Wisconsin, a site I expect was selected in large part due to the ready availability of water from the world's largest fresh water supply, which their manufacturing processes require a lot of. I am skeptical of the claims of lots of high paying jobs in the long term for Wisconsin, although I'd be happy to be wrong.

Imagine my surprise to find this in the middle of a drive burn-in testing discussion. For anyone who cares, there's an interesting article about what happened. BTW, it was the promise of ~$3B in tax abatements, not fresh water, that drove the decision, which has since shrunk along with the jobs originally promised.

Ok enough off topic stuff.

TL;DR

Badblocks:
-c <number> can make a large difference vs. the default, so consider it.
-b <number>: make sure it's aligned with your drive's actual block size (consider using diskinfo to figure it out).

Why?

Using the default -c (64) may make things slower, and misaligning your block size (512 vs. 520 vs. 4096 vs. 8192) will DEFINITELY slow things down...

Other things to consider: enabling the write cache, as stated earlier in the thread (smartctl -s wcache,on /dev/<whatever>), though to be fair, YMMV...
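A minimal sketch of both of those checks on FreeBSD, assuming /dev/ada0 as the device name:

Code:
# Report mediasize, sectorsize (logical), and stripesize (physical sector size)
diskinfo -v /dev/ada0
# Enable the drive's write cache before the badblocks write passes
smartctl -s wcache,on /dev/ada0
# Confirm the setting took effect
smartctl -g wcache /dev/ada0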
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
I may have forgotten to point out that I'm a resident of the region. The Verge article isn't entirely crap, but misses a lot of nuance and misunderstands some of what transpired.

interesting article about what happened and BTW it was the promise of ~$3B in tax abatements, not fresh water, that drove the decision, which has since shrunk along with the jobs originally promised

Except that NO ONE who had the slightest clue and looked at the deal thought it was about the tax abatement. It never made any sense that Foxconn was going to do any significant manufacturing of LCD panels or anything else here. Producing panels here would be pointless, as the market for them is all back in Asia where the companies building TV's and monitors are.

So what was really going on? There are basically two sides to consider, three if you include the taxpayers:

Governor Walker and his pals were desperate for a super-positive talking point to win an election, and making it sound like Foxconn was going to turn Wisconsin into an electronics manufacturing haven would have been a very big feather in his cap, swinging likely-Democratic votes in Southeastern Wisconsin to Republican. Unfortunately, the legitimacy and practicality of the deal fell under closer scrutiny than expected, and it became known that the only practical way for them to make panels at a reasonable cost was to do lights-out manufacturing, which punctured the bubble of "lots of jobs". Additionally, the total absence of anyone on this side of the planet to buy the panels became apparent.

From Foxconn's point of view, it is a little bit murkier, but a bunch of investigative reporting yielded clues that Foxconn had been shopping for properties with access to large amounts of freshwater, a major component of their manufacturing process, and something that is sometimes in short supply in Asia. Siting a manufacturing plant within the boundaries of the Great Lakes Compact would give them a toehold on a new source of freshwater. This theory is backed up by the manner in which they made it known to the Walker administration that they needed both freshwater and exemptions from various environmental, wastewater, and pollution regulations, all of which were then rammed through. If they didn't need freshwater, then the question would be why they sought these exemptions. The amount of land that was cleared out in Mount Pleasant for Foxconn is immense, so it seems likely that they did want an option to build some significant manufacturing capacity.

Third, they also fought for - and got - a widening of Interstate 94, at taxpayer expense, in order to be able to run self-driving trucks between Racine and General Mitchell Airport. This was really a big clue as to what the future would hold for the project, as there is plenty of local workforce available to drive trucks.

So, as someone who lives in the region and was aware that this didn't make any sense, I predicted that this was either going to turn into lights-out manufacturing with maybe a thousand jobs, or fail entirely, well before ground was broken in Mount Pleasant.

And here's the thing. They now do have the infrastructure, power, water, etc., and it would be relatively easy for them to move forward with a manufacturing facility, which could still happen, but it won't be bringing the thousands of jobs. My best guess is that they were hedging their bets on business continuity in the face of global warming, and that they've got a place they could light up in a year or two if need be.

The taxpayers of Wisconsin have been taken for a ride by both sides.
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,110
Thanks for this enlightening analysis.
But while I'm talking about "light"… you imply that Foxconn may "light up" to "lights-out" manufacturing. A nice practical lesson on verbal particles for non-native English speakers! o_O
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
Thanks for this enlightening analysis.
But while I'm talking about "light"… you imply that Foxconn may "light up" to "lights-out" manufacturing. A nice practical lesson on verbal particles for non-native English speakers! o_O

Sorry. :smile:

"Lights-out manufacturing" refers to automated manufacturing facilities. When you do not have humans in a building, you may want to turn the lights off to save on energy. Few places are actually so automated that they don't have ANY people on site, so the term is somewhat aspirational, but the term is still used to mean a "low-to-no-human" facility, even if the lights are actually on.

I'm a network and infrastructure engineer, and to "light up" a building refers to bringing it on network, or, in a more abstract sense, to make the building useful. The process of lighting up a building involves bringing in the equipment and doing all the setup and stuff to take an empty building to a functional facility. There is probably a better term for this for a manufacturing facility, but I am not coming up with it even now, so my mind went to the closest parallel.

Anyways, I try to look for the truth behind the superficial news headlines. The Verge article isn't terrible, but it misses out on some stuff that can be inferred from what has actually transpired. Of course, it is easier for them to get this stuff right now that so much of this is in hindsight. I was skeptical back at the start of all of this, but that's mostly because I spend a lot of time trying to understand motives and practicalities.
 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
I purchased two WD Ultrastar DC HC550 18TB drives and wanted to do some thorough testing before starting to use them, so I'm following the advice in qwertymodo's post that started this thread.

But... it's taking forever, even though I'm testing both disks in parallel.

The long S.M.A.R.T. test took roughly 30 hours. Then I started badblocks. The first pattern (0xaa) took roughly 46 hours. Now at 49h40m they're 16.6% and 20.4% along writing the 0x55 pattern, but at 100% they'll start "Reading and comparing" 0x55 before moving on to 0xff and then 0x00.

(I find it a bit interesting that one drive appears faster than the other, since I started badblocks on the two drives just 10 seconds apart.)

So badblocks looks like it'll take around 180 hours, or a bit over 7 days. That's a long wait. And after that qwertymodo's post says I should run another long S.M.A.R.T. test, for another 30 hours. That'll be a total of roughly 10 days' worth of testing the drives before starting to use them.

I didn't notice the difference that the -c flag to badblocks can potentially make until around 30 hours in, and at that point I was very hesitant to abort badblocks to try out different -c values to see if I got a speedup. If there were no speedup, it would mean I'd aborted and restarted badblocks for nothing, losing a lot of time.

How thoroughly do you guys test large drives before starting to use them? And what flags do you use?


Also, while on the subject of badblocks: for large drives a larger block size must be used, apparently due to a 32-bit limitation in the code. qwertymodo advises using -b 4096 for drives larger than 2TB. Well, I had to use -b 8192 for the 18TB drives, as I first got errors:

Code:
root@nas:~ # time badblocks -ws /dev/ada0
badblocks: Value too large to be stored in data type invalid end block (17578328064): must be 32-bit value
0.000u 0.003s 0:00.00 0.0%      0+0k 0+0io 0pf+0w
root@nas:~ # time badblocks -b 4096 -ws /dev/ada0
badblocks: Value too large to be stored in data type invalid end block (4394582016): must be 32-bit value
0.000u 0.005s 0:00.02 0.0%      0+0k 0+0io 10pf+0w
root@nas:~ # smartctl -a /dev/ada0 | grep 'Sector Sizes:'
Sector Sizes:     512 bytes logical, 4096 bytes physical

But I read somewhere (can't find the source now) that using a block size larger than the drive's sector size is bad/pointless, since it somehow wouldn't test the entire disk, i.e. using an 8192-byte block size on a drive with a 4096-byte sector size would mean that only every other 4096-byte sector actually gets tested.

Does anyone know if this is correct? If it is, it means I'm really just wasting time running badblocks.
I use 32768 for the block size with a block count of 512 when I run badblocks to test disks, like this:
Code:
badblocks -b 32768 -c 512 -wsv -e 1 -o {block_log_file} /dev/{drive ID}
This uses more memory than the default settings and seems to speed up the process a little... but it's just the nature of things that it takes a long time to crank through every LBA on these huge disks!
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,110
"Lights-out" was clear from the context.
I would rather associate "up" with "in" or "on", and "out" with "down" or "off", so I found that the contradictory association of "light up" with "lights-out" made for an interesting stylistic effect. Maybe I got my semantics wrong (or upside-down)? Or I'm looking for style effects where none was intended (just pretend :wink: ).
Please do not restrain your writing skills.
 

ByteMan

Dabbler
Joined
Nov 10, 2021
Messages
32
I use 32768 for the block size with a block count of 512 when I run badblocks to test disks, like this:
Code:
badblocks -b 32768 -c 512 -wsv -e 1 -o {block_log_file} /dev/{drive ID}
This uses more memory than the default settings and seems to speed up the process a little... but it's just the nature of things that it takes a long time to crank through every LBA on these huge disks!
I gave badblocks -b 32768 -c 512 -wsv a try on four 14TB 4Kn/512e drives (WD Red Pro + WD DC HC530), but it seems to be slower compared to badblocks -b 4096 -wsv.
I aborted -b 4096 -wsv after 79hrs in order to re-run with -b 32768 -c 512 -wsv on the same drives. After 79hrs, there is a 12% difference in completion.
Summarizing:
badblocks -b 32768 -c 512 -wsv after 79h: 3rd pass @ 13% completion
badblocks -b 4096 -wsv after 79h: 3rd pass @ 25% completion

Is there a logical explanation for this?
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,970
Is there a logical explanation for this?
Of course there is.

I'll take a stab at it:
With -b 32768, you are not using a native block size (even though it is a multiple of one), and -c 512 means 512 blocks at once, so it writes 32768 × 512 = 16777216 bytes at a time.
With -b 4096 alone, you are specifying the native block size of the drive with the default count of 64 blocks, so it writes 4096 × 64 = 262144 bytes at a time.

Stick with the native sizes of the drives if you want some speed. If you were to use -b 4096 -c 512, you might get a faster result, but will it be appreciable? If you run into bad sectors, I'm not sure whether a larger -c value will hurt you or not. I suspect it would, as the program retries the entire write operation over and over again, but whether that is a major slowdown when bad sectors are found, I do not know with certainty.

So my advice, stick with -b 4096 for true 4096 sector drives.
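A minimal sketch of that middle ground, assuming /dev/ada0 and borrowing the -e/-o options from Spearfoot's command earlier in the thread:

Code:
# Native 4096-byte blocks, 512 blocks at once:
#   4096 x 512 = 2097152 bytes (2 MiB) per transfer,
#   vs. 16 MiB for -b 32768 -c 512 and 256 KiB for plain -b 4096
badblocks -b 4096 -c 512 -wsv -e 1 -o badblocks.log /dev/ada0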
 

rudds

Dabbler
Joined
Apr 17, 2018
Messages
34
I'm having the same issue as thalf, running badblocks on some 18TB WD Red Pros:

Code:
root@truenas[~]# badblocks -w -b 4096 -s /dev/da0
badblocks: Value too large to be stored in data type invalid end block (4394582016): must be 32-bit value


The native block size on these drives is 4096, and since best practice seems to be to stick to the correct block size, is there essentially no reliable way to run badblocks on drives this big and get accurate results? Should I just run a long SMART test on these new drives and call it a day? Or is even that worth doing? Given how long the testing is likely to take even if I could get it working, the temptation to just go ahead and create a new pool is strong.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
Failing to do burn-in testing on your drives is a bad idea. badblocks is certainly not the only test available.

 
Joined
Jan 27, 2020
Messages
577
I'm having the same issue as thalf, running badblocks on some 18TB WD Red Pros:

Code:
root@truenas[~]# badblocks -w -b 4096 -s /dev/da0
badblocks: Value too large to be stored in data type invalid end block (4394582016): must be 32-bit value


The native block size on these drives is 4096, and since best practice seems to be to stick to the correct block size, is there essentially no reliable way to run badblocks on drives this big and get accurate results? Should I just run a long SMART test on these new drives and call it a day? Or is even that worth doing? Given how long the testing is likely to take even if I could get it working, the temptation to just go ahead and create a new pool is strong.
The problem seems to be that badblocks was never intended to be used as an HDD burn-in tool. It was deliberately decided never to support block numbers larger than 2^32.

Code:
badblocks is pretty much deprecated at this point, and is almost certainly the wrong tool to be using for a situation like this.  It was designed in the floppy disk days, when loss of sectors was expected; today if you have user-visible bad blocks, your storage is bad, likely to get worse, and needs to be replaced immediately.  Tracking bad block locations won't do any good.

For example, from the upstream maintainer, on the mailing list:

Code:
"I think badblocks is vestigal at this point, and for huge disk
arrays, almost certainly block replacement will be handed at the LVM,
storage array, or HDD level.  So it might be better simply to have
mke2fs throw an error if there is an attempt to hand it a 64-bit block
number.

-Ted"

The 32-bit limit is intentional:

Code:
        /* ext2 badblocks file can't handle large values */
        if (last_block >> 32) {
                com_err(program_name, EOVERFLOW,
                        _("invalid end block (%llu): must be 32-bit value"),
                        last_block);
                exit(1);
        }

which came from:

Code:
commit d87f198ca3250c9dff6a4002cd2bbbb5ab6f113a
Author: Darrick J. Wong <darrick.wong@oracle.com>
Date:   Wed Oct 23 19:43:32 2013 -0400

    libext2fs: reject 64bit badblocks numbers

    Don't accept block numbers larger than 2^32 for the badblocks list,
    and don't run badblocks on them either.

If you *really* want to try your luck, you can specify a larger block size for testing, e.g. badblocks -b 16384. But I wouldn't really trust badblocks on large devices; it wasn't designed for this purpose.


 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
badblocks was never intended to be used as an HDD burn-in tool.

That is also correct. It comes from the days before intelligent disk controllers. SCSI disks, for example, have had defect lists since around the beginning; I remember wrestling with grown defect lists on CDC Wrens and Sun workstations in the late '80s and being amazed at the functionality.

One of the choices I made when writing the solnet-array-test tool was to focus on the functionality I needed, which was to diagnose controller communications, controller performance, drive burn-in, and SCSI bus issues. Surface scanning of the HDD's came along for free just because of the nature of the thing, but I was more interested in making sure we could find issues related to infant mortality, so you'll see that the array test focuses on issues such as intensive seeking.

I think I should get bonus points because I declined to write custom C code to do it, and managed it in an easily hackable shell script that should be relatively portable, with minor tweaking for OS-specific differences. I may have lucked out in that FreeBSD introduced the CAM subsystem in FreeBSD 3.0 (~1998), and that's still the de facto standard for I/O devices almost a quarter of a century later.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,175
Why, in Linuxland you'd have had to rewrite it in 1999, 2003, 2008, and three times in 2012, fix an obscure bug upstream in 2016, and send the Red Hat people a nasty email to have something not removed from the distribution, and you'd still end up with Debian!
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
Why, in Linuxland you'd have had to rewrite it in 1999, 2003, 2008, and three times in 2012, fix an obscure bug upstream in 2016, and send the Red Hat people a nasty email to have something not removed from the distribution, and you'd still end up with Debian!

You just made my day. Thanks, man! ;-)
 

rudds

Dabbler
Joined
Apr 17, 2018
Messages
34
Failing to do burn-in testing on your drives is a bad idea. badblocks is certainly not the only test available.


/incoming appears to be empty on that FTP, do you know somewhere else to get the script?

Edit: I was able to grab the file by manually requesting it with a CLI client; not sure why it wasn't visible in FileZilla.
 