
Hard Drive Burn-In Testing - Discussion Thread

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
Thanks for the tip on the boosted -c option, kicking off a 10TB Toshiba MG06ACA10TE presently.

I'm curious if anyone knows what this message below means:



It only shows up in a few posts elsewhere, and it's unclear whether there's an issue.
I've seen that before, and it's never been a problem.

If you're running on FreeNAS/FreeBSD, did you remember to run this before testing?
Code:
sysctl kern.geom.debugflags=0x10
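In case it helps, here's a minimal sketch of the full sequence (0 is the stock default; 0x10 allows writes to raw disk devices that GEOM would otherwise protect):

Code:
# Check the current GEOM debug flags (0 is the default)
sysctl kern.geom.debugflags
# Allow writes to otherwise-protected raw disk devices ("foot-shooting" mode)
sysctl kern.geom.debugflags=0x10
# Restore the default once burn-in is finished
sysctl kern.geom.debugflags=0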
 

thalf

Dabbler
Joined
Mar 1, 2014
Messages
19
I purchased two WD Ultrastar DC HC550 18TB drives and wanted to do some thorough testing before starting to use them, so I'm following the advice in qwertymodo's post that started this thread.

But... it's taking forever, even though I'm testing both disks in parallel.

The long S.M.A.R.T. test took roughly 30 hours. Then I started badblocks. The first pattern (0xaa) took roughly 46 hours. Now at 49h40m they're 16.6% and 20.4% along writing the 0x55 pattern, but at 100% they'll start "Reading and comparing" 0x55 before moving on to 0xff and then 0x00.

(I find it a bit interesting that one drive appears faster than the other, since I started badblocks on the two drives just 10 seconds apart.)

So badblocks looks like it'll take around 180 hours, or a bit over 7 days. That's a long wait. And after that qwertymodo's post says I should run another long S.M.A.R.T. test, for another 30 hours. That'll be a total of roughly 10 days' worth of testing the drives before starting to use them.
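(For anyone following along at home, a sketch of how those long tests are kicked off and checked, using the same device name as in the output further down:)

Code:
# Start the drive's built-in long self-test (it runs inside the drive firmware)
smartctl -t long /dev/ada0
# Check progress, and read the self-test log once it finishes
smartctl -a /dev/ada0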

I didn't notice the difference that the -c flag to badblocks can potentially make until around 30 hours in, and at that point I was very hesitant to abort badblocks to try out different -c values to see if I got a speedup. If there were no speedup, it would mean I'd aborted and restarted badblocks for nothing, losing a lot of time.

How thoroughly do you guys test large drives before starting to use them? And what flags do you use?


Also, while on the subject of badblocks: for large drives a larger block size must be used, apparently due to a 32-bit limitation in the code. qwertymodo advises using -b 4096 for drives larger than 2TB. Well, I had to use -b 8192 for the 18TB drives, as I first got errors:

Code:
root@nas:~ # time badblocks -ws /dev/ada0
badblocks: Value too large to be stored in data type invalid end block (17578328064): must be 32-bit value
0.000u 0.003s 0:00.00 0.0%      0+0k 0+0io 0pf+0w
root@nas:~ # time badblocks -b 4096 -ws /dev/ada0
badblocks: Value too large to be stored in data type invalid end block (4394582016): must be 32-bit value
0.000u 0.005s 0:00.02 0.0%      0+0k 0+0io 10pf+0w
root@nas:~ # smartctl -a /dev/ada0 | grep 'Sector Sizes:'
Sector Sizes:     512 bytes logical, 4096 bytes physical
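For what it's worth, here's the arithmetic behind those two errors (a sketch: badblocks defaults to 1024-byte blocks, and its block count must stay below 2^32 = 4294967296):

Code:
# 17578328064 default 1024-byte blocks on this 18TB drive
echo $((17578328064 / 4))   # -b 4096: 4394582016 blocks, still over 2^32
echo $((17578328064 / 8))   # -b 8192: 2197291008 blocks, fits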

But I read somewhere (can't find the source now) that using a block size larger than the drive's sector size is bad/pointless, since it somehow wouldn't test the entire disk, i.e. using an 8192-byte block size on a drive with a 4096-byte sector size would mean that only every other 4096-byte sector actually gets tested.

Does anyone know if this is correct? If it is, it means I'm really just wasting time running badblocks.
 

Mastakilla

Patron
Joined
Jul 18, 2019
Messages
202
...
(I find it a bit interesting that one drive appears faster than the other, since I started badblocks on the two drives just 10 seconds apart.)
...
See above thread on this subject... I bumped into similar observations when burning in with the solnet-array-test script.
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,449
See above thread on this subject... I bumped into similar observations when burning in with the solnet-array-test script.
I doubt any kind of throttling should be implemented on such devices.
For time-critical systems, it becomes a liability and brings uncertainty, as the system becomes non-deterministic.
Normally, the hardware (PCB layout and thermal mitigation) should provide acceptable cooling performance at any load.
If the product has been designed properly and cooling is within specification, then such a condition shouldn't occur.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,906
Well, I do not consider 7 days long for a burn-in. But hey, I once lost about 800 KB of manual entries to a database. When I moved to my new NAS a few months ago, I first let it run as a backup system for the old NAS for about 6 weeks. Only then did I reverse things, so the old NAS is now the backup (using ZFS replication).
 

raz0r11

Cadet
Joined
Jun 8, 2021
Messages
3
Now, to bring this around to a vaguely relevant-to-the-forum post, I'll point out that it is interesting that Foxconn has come to Wisconsin, a site I expect was selected in large part due to the ready availability of water from the world's largest fresh water supply, which their manufacturing processes require a lot of. I am skeptical of the claims of lots of high paying jobs in the long term for Wisconsin, although I'd be happy to be wrong.

Imagine my surprise to find this in the middle of a drive burn-in testing discussion. For anyone who cares, there's an interesting article about what happened. BTW, it was the promise of ~$3B in tax abatements, not fresh water, that drove the decision, which has since shrunk along with the jobs originally promised.

Ok enough off topic stuff.

TL;DR

Badblocks:
-c <number> can make a large difference vs. the default, so consider it.
-b <number>: make sure it's aligned with your drive's actual block size (consider using diskinfo to figure it out).

Why?

Using the default -c (64) may make things slower, and misaligning your block size (512 vs. 520 vs. 4096 vs. 8192) will DEFINITELY slow things down...

Other things to consider: enabling the write cache, as stated earlier in the thread (smartctl -s wcache,on /dev/<whatever>), though to be fair, YMMV...
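A minimal sketch of both of those checks on FreeBSD, assuming /dev/ada0 as the device name:

Code:
# Report mediasize, sectorsize (logical), and stripesize (physical sector size)
diskinfo -v /dev/ada0
# Enable the drive's write cache before the badblocks write passes
smartctl -s wcache,on /dev/ada0
# Confirm the setting took effect
smartctl -g wcache /dev/ada0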
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
I may have forgotten to point out that I'm a resident of the region. The Verge article isn't entirely crap, but misses a lot of nuance and misunderstands some of what transpired.

interesting article about what happened and BTW it was the promise of ~$3B in tax abatements, not fresh water, that drove the decision, which has since shrunk along with the jobs originally promised

Except that NO ONE who had the slightest clue and looked at the deal thought it was about the tax abatement. It never made any sense that Foxconn was going to do any significant manufacturing of LCD panels or anything else here. Producing panels here would be pointless, as the market for them is all back in Asia where the companies building TV's and monitors are.

So what was really going on? There are basically two sides to consider, three if you include the taxpayers:

Governor Walker and his pals were desperate for a super-positive talking point to win an election, and making it sound like Foxconn was going to turn Wisconsin into an electronics manufacturing haven would have been a very big feather in his cap, swinging likely-Democratic votes in Southeastern Wisconsin to Republican. Unfortunately, the legitimacy and practicality of the deal fell under closer scrutiny than expected, and it became known that the only practical way for them to make panels at a reasonable cost was to do lights-out manufacturing, which punctured the bubble of "lots of jobs". Additionally, the total absence of anyone on this side of the planet to buy the panels became apparent.

From Foxconn's point of view, it is a little bit murkier, but a bunch of investigative reporting yielded clues that Foxconn had been shopping for properties with access to large amounts of freshwater, a major component of their manufacturing process, and something that is sometimes in short supply in Asia. Siting a manufacturing plant within the boundaries of the Great Lakes Compact would give them a toehold on a new source of freshwater. This theory is backed up by the manner in which they made it known to the Walker administration that they needed both freshwater and exemptions from various environmental, wastewater, and pollution regulations, all of which were then rammed through. If they didn't need freshwater, then the question would be why they sought these exemptions. The amount of land that was cleared out in Mount Pleasant for Foxconn is immense, so it seems likely that they did want an option to build some significant manufacturing capacity.

Third, they also fought for - and got - a widening of Interstate 94, at taxpayer expense, in order to be able to run self-driving trucks between Racine and General Mitchell Airport. This was really a big clue as to what the future would hold for the project, as there is plenty of local workforce available to drive trucks.

So, as someone who lives in the region and was aware that this didn't make any sense, I predicted that this was either going to turn into lights-out manufacturing with maybe a thousand jobs, or fail entirely, well before ground was broken in Mount Pleasant.

And here's the thing. They now do have the infrastructure, power, water, etc., and it would be relatively easy for them to move forward with a manufacturing facility, which could still happen, but it won't be bringing the thousands of jobs. My best guess is that they were hedging their bets on business continuity in the face of global warming, and that they've got a place they could light up in a year or two if need be.

The taxpayers of Wisconsin have been taken for a ride by both sides.
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,110
Thanks for this enlightening analysis.
But while I'm talking about "light"… you imply that Foxconn may "light up" to "lights-out" manufacturing. A nice practical lesson on verbal particles for non-native English speakers! o_O
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
Thanks for this enlightening analysis.
But while I'm talking about "light"… you imply that Foxconn may "light up" to "lights-out" manufacturing. A nice practical lesson on verbal particles for non-native English speakers! o_O

Sorry. :smile:

"Lights-out manufacturing" refers to automated manufacturing facilities. When you do not have humans in a building, you may want to turn the lights off to save on energy. Few places are actually so automated that they don't have ANY people on site, so the term is somewhat aspirational, but the term is still used to mean a "low-to-no-human" facility, even if the lights are actually on.

I'm a network and infrastructure engineer, and to "light up" a building refers to bringing it on network, or, in a more abstract sense, to make the building useful. The process of lighting up a building involves bringing in the equipment and doing all the setup and stuff to take an empty building to a functional facility. There is probably a better term for this for a manufacturing facility, but I am not coming up with it even now, so my mind went to the closest parallel.

Anyways, I try to look for the truth behind the superficial news headlines. The Verge article isn't terrible, but it misses out on some stuff that can be inferred from what has actually transpired. Of course, it is easier for them to get this stuff right now that so much of this is in hindsight. I was skeptical back at the start of all of this, but that's mostly because I spend a lot of time trying to understand motives and practicalities.
 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
I purchased two WD Ultrastar DC HC550 18TB drives and wanted to do some thorough testing before starting to use them, so I'm following the advice in qwertymodo's post that started this thread.

But... it's taking forever, even though I'm testing both disks in parallel.

The long S.M.A.R.T. test took roughly 30 hours. Then I started badblocks. The first pattern (0xaa) took roughly 46 hours. Now at 49h40m they're 16.6% and 20.4% along writing the 0x55 pattern, but at 100% they'll start "Reading and comparing" 0x55 before moving on to 0xff and then 0x00.

(I find it a bit interesting that one drive appears faster than the other, since I started badblocks on the two drives just 10 seconds apart.)

So badblocks looks like it'll take around 180 hours, or a bit over 7 days. That's a long wait. And after that qwertymodo's post says I should run another long S.M.A.R.T. test, for another 30 hours. That'll be a total of roughly 10 days' worth of testing the drives before starting to use them.

I didn't notice the difference that the -c flag to badblocks can potentially make until around 30 hours in, and at that point I was very hesitant to abort badblocks to try out different -c values to see if I got a speedup. If there were no speedup, it would mean I'd aborted and restarted badblocks for nothing, losing a lot of time.

How thoroughly do you guys test large drives before starting to use them? And what flags do you use?


Also, while on the subject of badblocks: for large drives a larger block size must be used, apparently due to a 32-bit limitation in the code. qwertymodo advises using -b 4096 for drives larger than 2TB. Well, I had to use -b 8192 for the 18TB drives, as I first got errors:

Code:
root@nas:~ # time badblocks -ws /dev/ada0
badblocks: Value too large to be stored in data type invalid end block (17578328064): must be 32-bit value
0.000u 0.003s 0:00.00 0.0%      0+0k 0+0io 0pf+0w
root@nas:~ # time badblocks -b 4096 -ws /dev/ada0
badblocks: Value too large to be stored in data type invalid end block (4394582016): must be 32-bit value
0.000u 0.005s 0:00.02 0.0%      0+0k 0+0io 10pf+0w
root@nas:~ # smartctl -a /dev/ada0 | grep 'Sector Sizes:'
Sector Sizes:     512 bytes logical, 4096 bytes physical

But I read somewhere (can't find the source now) that using a block size larger than the drive's sector size is bad/pointless, since it somehow wouldn't test the entire disk, i.e. using an 8192-byte block size on a drive with a 4096-byte sector size would mean that only every other 4096-byte sector actually gets tested.

Does anyone know if this is correct? If it is, it means I'm really just wasting time running badblocks.
I use 32768 for the block size with a block count of 512 when I run badblocks to test disks, like this:
Code:
badblocks -b 32768 -c 512 -wsv -e 1 -o {block_log_file} /dev/{drive ID}
This uses more memory than the default settings and seems to speed up the process a little... but it's just the nature of things that it takes a long time to crank through every LBA on these huge disks!
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,110
"Lights-out" was clear from the context.
I would rather associate "up" with "in" or "on", and "out" with "down" or "off", so I found that the contradictory association of "light up" with "lights-out" made for an interesting stylistic effect. Maybe I got my semantics wrong (or upside-down)? Or I'm looking for style effects where none was intended (just pretend :wink: ).
Please do not restrain your writing skills.
 

ByteMan

Dabbler
Joined
Nov 10, 2021
Messages
32
I use 32768 for the block size with a block count of 512 when I run badblocks to test disks, like this:
Code:
badblocks -b 32768 -c 512 -wsv -e 1 -o {block_log_file} /dev/{drive ID}
This uses more memory than the default settings and seems to speed up the process a little... but it's just the nature of things that it takes a long time to crank through every LBA on these huge disks!
I gave badblocks -b 32768 -c 512 -wsv a try on four 14TB 4Kn/512e drives (WD Red Pro + WD DC HC530), but it seems to be slower compared to badblocks -b 4096 -wsv.
I aborted -b 4096 -wsv after 79hrs in order to re-run with -b 32768 -c 512 -wsv on the same drives. After 79hrs, there is a 12% difference in completion.
Summarizing:
badblocks -b 32768 -c 512 -wsv after 79h: 3rd pass @ 13% completion
badblocks -b 4096 -wsv after 79h: 3rd pass @ 25% completion

Is there a logical explanation for this?
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,970
Is there a logical explanation for this?
Of course there is.

I'll take a stab at it:
With -b 32768, you are not using a native block size (even though it is a multiple of one), and -c 512 means 512 blocks at once, so it writes 32768 × 512 = 16777216 bytes at a time.
With -b 4096 alone, you are specifying the native block size of the drive with the default count of 64 blocks, so it writes 4096 × 64 = 262144 bytes at a time.

Stick with the native sizes of the drives if you want some speed. If you were to use -b 4096 -c 512, you might get a faster result, but will it be appreciable? If you run into bad sectors, I'm not sure whether a larger -c value will hurt you or not. I suspect it would, as the program retries the entire write operation over and over again, but whether that is a major slowdown when bad sectors are found, I do not know with certainty.

So my advice, stick with -b 4096 for true 4096 sector drives.
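A minimal sketch of that middle ground, assuming /dev/ada0 and borrowing the -e/-o options from Spearfoot's command earlier in the thread:

Code:
# Native 4096-byte blocks, 512 blocks at once:
#   4096 x 512 = 2097152 bytes (2 MiB) per transfer,
#   vs. 16 MiB for -b 32768 -c 512 and 256 KiB for plain -b 4096
badblocks -b 4096 -c 512 -wsv -e 1 -o badblocks.log /dev/ada0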
 

rudds

Dabbler
Joined
Apr 17, 2018
Messages
34
I'm having the same issue as thalf, running badblocks on some 18TB WD Red Pros:

Code:
root@truenas[~]# badblocks -w -b 4096 -s /dev/da0
badblocks: Value too large to be stored in data type invalid end block (4394582016): must be 32-bit value


The native block size on these drives is 4096, and since best practice seems to be to stick to the correct block size, is there essentially no reliable way to run badblocks on drives this big and get accurate results? Should I just run a long SMART test on these new drives and call it a day? Or is even that worth doing? Given how long the testing is likely to take even if I could get it working, the temptation to just go ahead and create a new pool is strong.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
Failing to do burn-in testing on your drives is a bad idea. badblocks is certainly not the only test available.

 
Joined
Jan 27, 2020
Messages
577
I'm having the same issue as thalf, running badblocks on some 18TB WD Red Pros:

Code:
root@truenas[~]# badblocks -w -b 4096 -s /dev/da0
badblocks: Value too large to be stored in data type invalid end block (4394582016): must be 32-bit value


The native block size on these drives is 4096, and since best practice seems to be to stick to the correct block size, is there essentially no reliable way to run badblocks on drives this big and get accurate results? Should I just run a long SMART test on these new drives and call it a day? Or is even that worth doing? Given how long the testing is likely to take even if I could get it working, the temptation to just go ahead and create a new pool is strong.
The problem seems to be that badblocks was never intended to be used as an HDD burn-in tool. It was deliberately decided never to support block numbers larger than 2^32.

Code:
badblocks is pretty much deprecated at this point, and is almost certainly the wrong tool to be using for a situation like this.  It was designed in the floppy disk days, when loss of sectors was expected; today if you have user-visible bad blocks, your storage is bad, likely to get worse, and needs to be replaced immediately.  Tracking bad block locations won't do any good.

For example, from the upstream maintainer, on the mailing list:

Code:
"I think badblocks is vestigal at this point, and for huge disk
arrays, almost certainly block replacement will be handed at the LVM,
storage array, or HDD level.  So it might be better simply to have
mke2fs throw an error if there is an attempt to hand it a 64-bit block
number.

-Ted"

The 32-bit limit is intentional:

Code:
        /* ext2 badblocks file can't handle large values */
        if (last_block >> 32) {
                com_err(program_name, EOVERFLOW,
                        _("invalid end block (%llu): must be 32-bit value"),
                        last_block);
                exit(1);
        }

which came from:

Code:
commit d87f198ca3250c9dff6a4002cd2bbbb5ab6f113a
Author: Darrick J. Wong <darrick.wong@oracle.com>
Date:   Wed Oct 23 19:43:32 2013 -0400

    libext2fs: reject 64bit badblocks numbers

    Don't accept block numbers larger than 2^32 for the badblocks list,
    and don't run badblocks on them either.

If you *really* want to try your luck, you can specify a larger block size for testing, e.g. badblocks -b 16384. But I wouldn't really trust badblocks on large devices; it wasn't designed for this purpose.


 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
badblocks was never intended to be used as an HDD burn-in tool.

That is also correct. It comes from the days before intelligent disk controllers. SCSI disks, for example, have had defect lists since around the beginning; I remember wrestling with grown defect lists on CDC Wrens and Sun workstations in the late '80s and being amazed at the functionality.

One of the choices I made when writing the solnet-array-test tool was to focus on the functionality I needed, which was to diagnose controller communications, controller performance, drive burn-in, and SCSI bus issues. Surface scanning of the HDD's came along for free just because of the nature of the thing, but I was more interested in making sure we could find issues related to infant mortality, so you'll see that the array test focuses on issues such as intensive seeking.

I think I should get bonus points because I declined to write custom C code to do it, and managed it in an easily hackable shell script that should be relatively portable, with minor tweaking for OS-specific differences. I may have lucked out in that FreeBSD introduced the CAM subsystem in FreeBSD 3.0 (~1998), and that's still the de facto standard for I/O devices almost a quarter of a century later.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,175
Why, in Linuxland you'd have had to rewrite it in 1999, 2003, 2008, and three times in 2012, fix an obscure bug upstream in 2016, and send the Red Hat people a nasty email to have something not removed from the distribution, and you'd still end up with Debian!
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
Why, in Linuxland you'd have had to rewrite it in 1999, 2003, 2008, and three times in 2012, fix an obscure bug upstream in 2016, and send the Red Hat people a nasty email to have something not removed from the distribution, and you'd still end up with Debian!

You just made my day. Thanks, man! ;-)
 

rudds

Dabbler
Joined
Apr 17, 2018
Messages
34
Failing to do burn-in testing on your drives is a bad idea. badblocks is certainly not the only test available.


/incoming appears to be empty on that FTP, do you know somewhere else to get the script?

Edit: I was able to grab the file by manually requesting it with a CLI client; not sure why it wasn't visible in FileZilla.
 