Trying to make a comprehensive HDD burn-in

rcaron

Cadet
Joined
Jul 19, 2016
Messages
6
I've recently started using fio (in addition to SMART, badblocks/nwipe, and diskinfo) to do a better job of screening hard drives before adding them to a pool. In the past, disks screened without the 24hr fio run would pass everything, yet still start throwing errors (and sometimes even get kicked) hours or days after being added to a pool. I kept those disks around out of curiosity, but even they pass fio: no threads crash and there's no change in SMART wear parameters afterwards. So I feel like this burn-in is still lacking. What else can I add?

I've based my process on
https://www.truenas.com/docs/core/gettingstarted/corehardwareguide/ and
https://www.truenas.com/community/threads/how-do-you-burn-in-new-disks.67774/
specifically (the whole sequence is also sketched as a script right after this list):
  1. SMART extended to compare against ( smartctl -a /dev/adaX > adaX_1baseline.txt )
    • - reject condition: any non-zero pending or uncorrectable sectors
    • (unless I'm interested in whether any of the subsequent writing helps 'heal' the drive with sector reallocation)
  2. 24hr of small random I/O with fio to stress the drive ( fio --name=randrw --time_based --runtime=86400 --iodepth=64 --rw=randrw --bs=512 --direct=1 --numjobs=4 --filename=/dev/adaX )
    • - reject condition: if any threads die
  3. another SMART extended ( smartctl -a /dev/adaX > adaX_3post_fio.txt )
    • - this only takes a few minutes for such small SSDs
    • - reject condition: any wear parameters increase
  4. write latency consistency (diskinfo -wS)
    • - reject condition: unclear. informational.
  5. badblocks to look for bad regions ( time badblocks -b 4096 -c 16384 -p 1 -svw /dev/adaX )
    • - timing it also gives a feel for net sequential read/write performance
    • - reject condition: if any bad blocks are detected
    • - it's nice for the -b parameter to equal the physical sector size (hence the -b 4096 in the command above)
    • - the -b parameter × the -c parameter should roughly match the drive's cache size to maximize speed/stress (for HDDs I use -b 4096 -c 16384, i.e. 64MiB)
  6. one final SMART extended ( smartctl -a /dev/adaX > adaX_6final.txt )
    • - reject condition: if any wear parameters increase
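For clarity, here's roughly how the whole sequence strings together as a script. One thing the one-liners above gloss over is that smartctl -a only captures the report; the extended self-test itself gets kicked off with smartctl -t long and needs time to finish first. The device name, output filenames, and sleep durations below are placeholders, so use the "Extended self-test routine recommended polling time" that smartctl reports for your model.

#!/bin/sh
# Rough single-drive burn-in sketch; DRIVE and the sleep times are placeholders.
DRIVE=adaX

# 1. Start an extended self-test, let it finish, then capture the baseline report.
smartctl -t long /dev/$DRIVE
sleep 28800        # ~8 hours here; use the polling time smartctl reports for your drive
smartctl -a /dev/$DRIVE > ${DRIVE}_1baseline.txt

# 2. 24 hours of small random mixed I/O straight at the raw device.
fio --name=randrw --time_based --runtime=86400 --iodepth=64 --rw=randrw \
    --bs=512 --direct=1 --numjobs=4 --filename=/dev/$DRIVE

# 3. Another extended self-test, then diff the report against the baseline.
smartctl -t long /dev/$DRIVE
sleep 28800
smartctl -a /dev/$DRIVE > ${DRIVE}_3post_fio.txt
diff ${DRIVE}_1baseline.txt ${DRIVE}_3post_fio.txt

# 4. Write latency consistency check (FreeBSD).
diskinfo -wS /dev/$DRIVE

# 5. Destructive badblocks pass; -b times -c sized to roughly match the drive cache.
time badblocks -b 4096 -c 16384 -p 1 -svw /dev/$DRIVE

# 6. Final extended self-test and SMART capture.
smartctl -t long /dev/$DRIVE
sleep 28800
smartctl -a /dev/$DRIVE > ${DRIVE}_6final.txt

The reject conditions are the same as in the list; the script just makes the ordering and the self-test waits explicit.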
After re-reading the above threads, the only omission I see is thermal cycling. But the burn-in already runs at the same operating temperatures (37-43°C) as the pools these drives failed in. Should I deliberately let the drives get hotter during burn-in? How hot?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Step 5 is sufficient, maybe also 1 & 6, but your plan is massive overkill.

Should I deliberately let the drives get hotter during burn-in? How hot?
Never above the suggested operating range for the drive (can be 50°C on some, 60°C on others).

If you have a chassis with sufficient airflow, you wouldn't want to see that go above 40°C no matter the load.
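If you want to check what a drive actually does under load, just poll SMART while the burn-in is running (adaX being whatever device you're testing):

smartctl -A /dev/adaX | grep -i temp

Most drives report it as Temperature_Celsius (attribute 194) and/or Airflow_Temperature_Cel (190); the rated operating range is in the vendor datasheet.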
 

rcaron

Cadet
Joined
Jul 19, 2016
Messages
6
Thanks sretalla. My regular trays are in the 30s, but my swap bays definitely get toastier at 45°C. I will be adding more fans.

I've certainly weeded out many drives with just badblocks & SMART, but the surprising thing was that a few "good" drives would still misbehave once added to a pool. So while I appreciate TrueNAS's "drive pickiness", it can be annoying for a new pool to be on the brink, with multiple drives erroring during a scrub or resilver. So it may seem like massive overkill, but to me it's not comprehensive. I guess I'm unlucky enough to encounter some rare/corner cases.

So I'm still looking for more tests, something just as picky as TrueNAS itself. Perhaps I should just build a burn-in test pool...
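If I do go that route, the throwaway pool itself is only a handful of commands; something like this is what I have in mind (device names, vdev layout, and the fill size below are placeholders):

zpool create -f testpool raidz2 /dev/adaA /dev/adaB /dev/adaC /dev/adaD
dd if=/dev/urandom of=/testpool/fill.bin bs=1m count=100000   # roughly 100 GB of random data
zpool scrub testpool
zpool status -v testpool                                      # reject on any READ/WRITE/CKSUM errors
zpool destroy testpool

The idea being that a scrub over freshly written data exercises the same checksumming that catches these drives misbehaving in a real pool.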

I should also mention this well-written resource, which has 30 pages of discussion. There's no mention of fio at all, but there is a brief mention of iozone.

Some other interesting tidbits from that thread:

I'll be looking into Spearfoot's script, jgreco's solnet script, and iozone.
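I haven't dug into iozone parameters yet, but as a first pass I'd probably just point its automatic mode at a filesystem on the drive under test, something like this (the mount point and size cap are guesses on my part):

iozone -a -g 4g -f /mnt/testdrive/iozone.tmp

with -a running the built-in matrix of record and file sizes and -g capping the maximum file size.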



Tangentially, I also want to shout out nwipe & ShredOS (a fork of the old DBAN). It's a great tool when using a testbench PC to do the actual drive burn-in for your NAS. If you verify all passes, it is basically as thorough as badblocks, just with random test patterns. I like the temperature reporting, the multi-drive UI, and especially the ETA.
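For reference, the kind of invocation I mean (flag names from memory, so double-check against nwipe --help; sdX is a placeholder since the testbench box boots Linux):

nwipe --method=random --verify=all /dev/sdX

A PRNG pass with full verification covers much the same ground as a destructive badblocks run, and the ncurses UI adds the per-drive temperature readout and ETA I mentioned.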
 