
Building, Burn-In, and Testing your FreeNAS system

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
I've been meaning to post some guidance here for a while now. We frequently see people come to the forums with hardware problems that should have washed out in the system build process, but since many of the users here are DIY'ers without professional experience building servers, their systems go from parts-in-box to in-use in a few hours.

This process also needs to be repeated when you are upgrading your system by adding new parts or changing out existing parts. There's a little bit of "use your brain" in how strict you need to be, but doing stuff like just dumping more RAM into your box and then booting it with a production pool attached can lead to great sadness. Don't do that. Your hardware needs to be solid and validated, and if it isn't, you can scramble your bits, possibly irretrievably.

This isn't a complete or comprehensive guide, more like a work-in-progress.

The Build Process

It is tempting to just rip open all those boxes you got, put it together, and pray. Bad idea.

1) Set up a suitable workspace.

1a) If you are in a low humidity environment, pay extra special attention to environmental static controls, including making sure that the clothes you're wearing have been conditioned with fabric softener. Fabric softener can be diluted with water in a sprayer and applied to carpets, which also reduces the nasty winter zaps!

1b) All computer assembly work should be done on top of an anti-static mat. Fry's, NewEgg, and Amazon all have these for around $20. This is not optional! Static discharge can subtly damage your silicon.

1c) Assemble while wearing an anti-static wrist strap. These are available for $5-$20 at those retailers and elsewhere. Ideally you should also wear ESD gloves. Even the cheap $1-a-pair ones will not only help with ESD but will also keep skin oils off your surfaces and help reduce contaminants.

1d) Make sure you hook up the wires from your anti-static gear to a proper ground.

1e) Do not wear static-prone clothing, particularly many synthetics. Short sleeves and shorts can help reduce static!

Handle all components only in your anti-static environment, preferably by their edges, never by exposed contacts. A lot of people like to install their RAM and CPU on the mainboard prior to installing it in the case, but this increases the number of components in play at a time. It is ideal to install the mainboard in the chassis first, then ground the chassis, and then install components one at a time. In a small chassis build, this may be impractical, and even in a large chassis it could make installation of the CPU tricky. Only remove components from their packages as they are actually required. Resist the urge to unpack everything and spread it out. Extra handling is extra risk. Keep your mind on grounding and careful handling.

Make sure that when mounting your mainboard, it is securely supported at all points where the chassis offers screw holes. Make sure that the chassis doesn't have any screw standoffs in places where the motherboard does not have a hole; these can short out a motherboard.

Tighten all screws until you feel moderate resistance, then give just a bit more of a twist. You want a solidly seated screw, not loose, not stripped.

Smoke Test

Our name for the initial power-on test. Computers run on smoke. Once the smoke comes out, they stop working.

You should smoke test on the bench with the chassis open so that you can visually inspect.

All cards should be fully seated, meaning that among other things you should see an even amount of the board's copper fingers exposed along the length of the socket.

All DIMM modules should be fully seated. A properly seated DIMM will include the clips on the side being fully engaged, which is a fairly quick visual test. Also verify that the module's copper fingers appear to be uniform.

Power it on and make sure that the hardware manifest agrees with what the BIOS reports. Observe to make sure that all your fans are spinning, and take some time to address any buzzy or unpleasant vibration noise.

Configuration

In the BIOS:

Reset the BIOS settings to defaults.

Configure the server to power on after power loss (if this is desired).

Configure the power button to NOT shut off the host immediately upon press. A momentary press of the power button is supposed to send a signal to the OS to shut down cleanly and power off. It's a good idea to test that once the OS is installed.

Configure the boot priority to prioritize the USB or SATA DOM, and if possible, disable booting from the data disks.

Burn-In and Testing

The burn-in and testing may be separated out or done as kind of a commingled whole. It may disappoint you to discover that proper burn-in and testing actually requires more than a thousand hours of time - a month and a half. This is the time during which you want infant mortality to strike.

Be sure to run an exhaustive test of memtest86 on all the system memory. If you have chosen to play with fire by not buying an ECC system, this is really the only significant opportunity you have to discover problems without your pool being at risk. If you have ECC, try to identify the way in which your mainboard manufacturer provides failure logging (BIOS, BMC/IPMI, etc). Any memory failures should be addressed. Do not rely on ECC to repair problems in a problematic module. The memtest86 tests can be run several times throughout the burn-in period to validate your memory. Don't just run one pass. Run it for a week or two in aggregate, once early on, once later on.
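
If your board does have a BMC, one quick way to check for logged memory events is to pull the system event log. A hedged sketch, assuming ipmitool is available and that your BMC actually records ECC events there; the grep patterns are only examples:

Code:
ipmitool sel elist | grep -i -E 'ecc|memory'   # BMC system event log, if ECC events are logged there
grep -i -E 'mca|ecc' /var/log/messages         # kernel machine-check / memory reports, if any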

Find one of the CPU-stressing utilities such as "CPUstress" (http://www.ultimatebootcd.com/) and run it for at least a day, monitoring the temperature and fan behaviour in your chassis. Especially if you are attempting to build a "quiet" NAS, now is the opportunity to make sure that your cooling strategy is going to work. Heatsink compound takes a while to cure or "break in", typically around 150-250 hours of operation for materials such as AS5, and you should notice a temperature drop of five to ten degrees after that point. Just like with the memory, it's a great idea to run it for a day early on, and then for another day near the end of the burn-in.
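
While the stress load is running, keep an eye on the temperatures. A hedged sketch of one way to poll them, assuming you are running the load under FreeBSD/FreeNAS with the coretemp driver loaded (or can reach the BMC remotely with ipmitool); substitute whatever monitoring your platform actually provides:

Code:
while true; do
    date
    sysctl dev.cpu | grep temperature    # per-core temperatures via coretemp
    ipmitool sdr type Temperature        # board/chassis sensors via the BMC
    sleep 60
done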

Run SMART tests on all your drives, including the manufacturer's conveyance test and the long test. For this purpose it might be convenient to set up an instance of FreeNAS.
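
From the shell, smartctl can kick these off directly. A minimal sketch; /dev/da0 is only an example device name, and not every drive supports the conveyance test:

Code:
smartctl -t conveyance /dev/da0   # short transport/handling test (if the drive supports it)
smartctl -t long /dev/da0         # extended surface scan; takes hours
smartctl -a /dev/da0              # review the self-test results and SMART attributes afterwards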

SMART is not sufficient to weed out bad drives, though it is a good thing to run. It is unable to identify subtle problems in the SATA/SAS channels to your drives, for example. In order to validate both the drives and their ability to successfully shuffle data, be sure to run plenty of tests on them. I suggest the following (a rough combined example follows the list):

1) Individual sequential read and write tests. This is basically just using "dd if=/dev/da${n} of=/dev/null bs=1048576" to do a read test, and "dd if=/dev/zero of=/dev/da${n} bs=1048576" to do a write test. Note the reported read/write speeds at the end and compare them both to the other drives in the pool and to any benchmark stats you can find for that particular drive on the Internet. They should be very close.

2) Simultaneous sequential read and write tests. This is using those same tests in parallel.

3) I am kind of lazy so I will then have it do seek testing by starting multiple tests per drive and separating them by a minute or two each. The drives will do lots of seeking within a relatively small locality. Other people like to use different tools to do this; I have absolutely no objection but as I noted I'm kind of lazy.

4) I've recently provided a tool to assist: https://forums.freenas.org/index.php?resources/solnet-array-test.1/
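
As a rough combined example of tests 1 through 3 (an illustration only: the disk list is a placeholder, and the write variant destroys data, so only point it at empty drives you are burning in):

Code:
#!/bin/sh
# Example burn-in sketch - adjust DISKS; the write test DESTROYS DATA on these disks.
DISKS="da0 da1 da2 da3"

# 1) individual sequential reads (swap in if=/dev/zero of=/dev/$d for the write test)
for d in $DISKS; do
    dd if=/dev/$d of=/dev/null bs=1048576
done

# 2) simultaneous sequential reads across all drives
for d in $DISKS; do
    dd if=/dev/$d of=/dev/null bs=1048576 &
done
wait

# 3) crude seek stress: several readers per drive, staggered by a minute
for d in $DISKS; do
    for i in 1 2 3; do
        dd if=/dev/$d of=/dev/null bs=1048576 &
        sleep 60
    done
done
wait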

NO AMOUNT OF STRESS should result in the kernel reporting timeouts or other issues communicating with your drives. If it does - update drivers, reseat cables, Google for known problems with your drives or controller, upgrade drive firmware, and/or generally fix your busted hardware. You cannot use a system throwing spurious storage subsystem errors for ZFS.
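
A quick way to scan for that sort of complaint from the shell (the grep patterns are just examples of common CAM/storage error text, not an exhaustive list):

Code:
dmesg | grep -i -E 'timeout|CAM status|unrecovered|retrying'
grep -i -E 'timeout|CAM status|error' /var/log/messages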

Hard drives have a high infant mortality rate, and so you really want to get past the thousand hour mark without incident. It's better to be doing this as part of burn-in rather than "oh crud I just trusted all my data to this RAIDZ2 and now two of my drives are reporting problems."

Once you've hit a system with tests like these for a few weeks, confidence in the platform should be established. Then you move on to installing FreeNAS, setting up ZFS, and beating on that for a while... some comments on that and networking testing hopefully coming soon...
 
Last edited:

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
Well keeeeerap. That doesn't work. XenForo for the FAIL.

Reposted here.

Back in the late '90's, I was managing a bunch of large whitebox storage servers. For the largest of these, I had the pleasure of building and deploying a massive storage server, 8 shelves of 9 drives each, Seagate ST173404LW 73GB drives, a whopping 5TB ... (*grin*)

Part of the problem was burning in these systems, and so I devised some shell scripty stuff that the hardware techs could use. I've become convinced that a variation on this would be helpful in the FreeNAS community, so I'm playing with a stripped-down version that does some basic disk read testing (at the time of this writing). It is suitable for testing and burn-in use.

I've included just two main passes, a parallel read pass, and a parallel read pass with multiple accesses per disk. The script will do some rudimentary performance analysis and point out possible issues. It needs more work but here it is anyways. This script is expected to be safe to run on a live pool, even though that's not a good idea for performance testing purposes. As with anything you download onto your machine, you are expected to verify the safety to your own satisfaction. Note that the only things that touch the disks are "dd" and they're all structured as "dd if=/dev/${disk}".

Link to the current version of the script

To run it, download it onto a FreeNAS box and execute it as root.

Code:
cd /tmp
fetch ftp://ftp.sol.net/incoming/solnet-array-test-v2.sh
chmod +x solnet-array-test-v2.sh
./solnet-array-test-v2.sh


It will give you a simple menu

Code:
sol.net disk array test v2

1) Use all disks (from camcontrol)
2) Use selected disks (from camcontrol|grep)
3) Specify disks
4) Show camcontrol list

Option:


You probably want to look at the disk list (option 4), then pick your target disks with option 2 and an appropriate pattern. For a Seagate ST4000DM000, you could select "ST4000", for example.

The test will run a variety of things and report status. It takes a while. Be patient. It will never terminate as it is intended as a burn-in aid, but you do want to let it do its thing for at least a pass or two to get an idea of how your system performs.

It is best to do this while the system is not busy, and preferably before a pool is up and running. That said, it should be safe to use even on a busy filer. I've picked on a busy filer here to give an example of how this looks. Note that da14 is a spare drive, and that all the other drives are testing much slower (because they're in use). Also note that my numbers here are from a testing mode that doesn't have the script actually read the entire disk; real results would look a bit different and take forever.

Code:
sol.net disk array test v2

1) Use all disks (from camcontrol)
2) Use selected disks (from camcontrol|grep)
3) Specify disks
4) Show camcontrol list

Option: 2

Enter grep match pattern (e.g. ST150176): ST4

Selected disks: da3 da4 da5 da6 da7 da8 da9 da10 da11 da12 da13 da14
<ATA ST4000DM000-1F21 CC52>  at scbus3 target 44 lun 0 (da3,pass5)
<ATA ST4000DM000-1F21 CC52>  at scbus3 target 45 lun 0 (da4,pass6)
<ATA ST4000DM000-1F21 CC52>  at scbus3 target 46 lun 0 (da5,pass7)
<ATA ST4000DM000-1F21 CC51>  at scbus3 target 47 lun 0 (da6,pass8)
<ATA ST4000DM000-1F21 CC51>  at scbus3 target 48 lun 0 (da7,pass9)
<ATA ST4000DM000-1F21 CC51>  at scbus3 target 49 lun 0 (da8,pass10)
<ATA ST4000DM000-1F21 CC52>  at scbus3 target 50 lun 0 (da9,pass11)
<ATA ST4000DM000-1F21 CC51>  at scbus3 target 51 lun 0 (da10,pass12)
<ATA ST4000DM000-1F21 CC52>  at scbus3 target 52 lun 0 (da11,pass13)
<ATA ST4000DM000-1F21 CC52>  at scbus3 target 53 lun 0 (da12,pass14)
<ATA ST4000DM000-1F21 CC52>  at scbus3 target 54 lun 0 (da13,pass15)
<ATA ST4000DM000-1F21 CC52>  at scbus3 target 55 lun 0 (da14,pass16)
Is this correct? (y/N): y
Performing initial serial array read (baseline speeds)
Tue Oct 21 08:21:23 CDT 2014
Tue Oct 21 08:26:47 CDT 2014
Completed: initial serial array read (baseline speeds)

Array's average speed is 97.6883 MB/sec per disk

Disk    Disk Size  MB/sec %ofAvg
------- ---------- ------ ------
da3      3815447MB     98    100
da4      3815447MB     90     92
da5      3815447MB     98    100
da6      3815447MB     97     99
da7      3815447MB     95     97
da8      3815447MB     82     84 --SLOW--
da9      3815447MB     87     89 --SLOW--
da10     3815447MB     84     86 --SLOW--
da11     3815447MB     97     99
da12     3815447MB     92     94
da13     3815447MB    102    104
da14     3815447MB    151    155 ++FAST++

Performing initial parallel array read
Tue Oct 21 08:26:47 CDT 2014
The disk da3 appears to be 3815447 MB.
Disk is reading at about 74 MB/sec
This suggests that this pass may take around 860 minutes

                   Serial Parall % of
Disk    Disk Size  MB/sec MB/sec Serial
------- ---------- ------ ------ ------
da3      3815447MB     98     86     88 --SLOW--
da4      3815447MB     90     74     82 --SLOW--
da5      3815447MB     98     82     84 --SLOW--
da6      3815447MB     97     91     95
da7      3815447MB     95     72     76 --SLOW--
da8      3815447MB     82     80     97
da9      3815447MB     87     84     96
da10     3815447MB     84    111    133 ++FAST++
da11     3815447MB     97    120    124 ++FAST++
da12     3815447MB     92    116    126 ++FAST++
da13     3815447MB    102    123    121 ++FAST++
da14     3815447MB    151    144     95

Awaiting completion: initial parallel array read
Tue Oct 21 08:39:32 CDT 2014
Completed: initial parallel array read

Disk's average time is 741 seconds per disk

Disk    Bytes Transferred Seconds %ofAvg
------- ----------------- ------- ------
da3          104857600000     743    100
da4          104857600000     764    103
da5          104857600000     752    101
da6          104857600000     737     99
da7          104857600000     748    101
da8          104857600000     754    102
da9          104857600000     738    100
da10         104857600000     762    103
da11         104857600000     748    101
da12         104857600000     756    102
da13         104857600000     740    100
da14         104857600000     653     88 ++FAST++

Performing initial parallel seek-stress array read
Tue Oct 21 08:39:32 CDT 2014
The disk da3 appears to be 3815447 MB.
Disk is reading at about 58 MB/sec
This suggests that this pass may take around 1093 minutes

                   Serial Parall % of
Disk    Disk Size  MB/sec MB/sec Serial
------- ---------- ------ ------ ------
da3      3815447MB     98     52     53
da4      3815447MB     90     48     53
da5      3815447MB     98     50     51
da6      3815447MB     97     50     52
da7      3815447MB     95     48     50
da8      3815447MB     82     48     59
da9      3815447MB     87     54     62
da10     3815447MB     84     47     56
da11     3815447MB     97     49     50
da12     3815447MB     92     50     55
da13     3815447MB    102     49     48
da14     3815447MB    151     52     34

Awaiting completion: initial parallel seek-stress array read
 
Last edited:

pro lamer

Guru
Joined
Feb 16, 2018
Messages
626
I haven't seen anything in our forums that covers a UPS burn-in test (or rather a basic UPS-FreeNAS setup test). Is it too simple or too obvious?

Disclaimer: I just ran a "ups burn in" search in our forums, and the first posts/threads I found weren't promising in terms of the above.

Anyway, I wondered when to test the UPS setup: before adding the hard disks to the rig, after, or both... I guess both. (Concern: a UPS failure during burn-in might damage the drives. On the other hand, the drives consume some power, so testing a UPS without them is moot.)

 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
You cannot burn in a UPS, not really. Every deep cycle that a UPS goes through diminishes the remaining battery life. You are best off just doing whatever the manufacturer's recommended tests are.

Testing whether your UPS and FreeNAS are communicating is probably best done by tweaking the settings on the FreeNAS side to make it hyper-sensitive to power loss, and shutting down quickly after a power loss, thereby not drawing down the UPS battery much. This is perfectly fine to do.
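
For the communication check itself, the FreeNAS UPS service is based on NUT, so something along these lines can confirm the link and exercise the shutdown path. A hedged sketch: it assumes the UPS service is enabled and that the identifier configured for the UPS is "ups".

Code:
upsc ups@localhost                 # dump everything the NAS can read from the UPS
upsc ups@localhost battery.charge  # spot-check a single variable
upsmon -c fsd                      # optional drill: trigger the forced-shutdown sequence
                                   # (this WILL cleanly shut the host down)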
 

Mastakilla

Patron
Joined
Jul 18, 2019
Messages
202
I just started a bit of stress testing myself...

Annoyingly, it's currently winter over here, and we have no AC for the summer. So I had to get a bit creative in simulating a hot, sunny summer day while it's actually winter :)
Since not everyone would think of ambient temperature fluctuations, I thought it could be useful to share my endeavors :p

Currently my room is like 20°C. Exceptionally it can get up to 40°C (like 1 or 2 days a year) in here. So I needed to bump temps by 20°C...

Here is what I did:
[image: electric heater setup]

I also had the idea to put a soldering iron inside the case, but I couldn't find an entry point that would allow me to keep the case closed...

But actually the above electric heater works great!
* It doesn't seem to melt the case (yet? :p)
* It bumps up the case temps to exactly where I want them :)

Without the electric heater:
[image: temperatures without the electric heater]


With the electric heater:
[image: temperatures with the electric heater]


Here are the temps measured by the IPMI (while running MemTest86):
[image: IPMI temperature readings while running MemTest86]


As it's in my nature, I also couldn't resist overclocking the memory a little...

Stock, the memory runs at 1333MHz CL20, with the following performance:
[image: memory benchmark at stock settings]


After lots of tweaking and testing, I got it 100% stable at 1800MHz CL22, with the following performance:
[image: memory benchmark at 1800MHz CL22]


But just to be on the very safe side, I've decreased the frequency to 1700MHz and increased some secondary timings so that all timings are at least as high as the stock timings, with the following performance:
[image: memory benchmark at the final 1700MHz settings]


And that is what I'm stress testing now...

edit:
4x MemTest86 passed :)
In the end, I was able to force the case temp up to 47°C.
So I guess I can call that part a success.
 
Last edited:

LVLouisCyphre

Dabbler
Joined
Dec 22, 2019
Messages
16
One of the best stress tests for FreeBSD or *nix based OSes in general is to do a make buildworld and kernel configs. Those greatly exercise the silicon and storage. If you get a core dump and that's not a PEBKAC, you have problem hardware. I don't know if FreeNAS includes the full FreeBSD distribution where you can do a make buildworld and do kernel config. It's easy enough to drop FreeBSD on a system and do a kernel config and make buildworld for additional stress testing.
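
For anyone who wants to try that, a minimal sketch, assuming a stock FreeBSD install with the source tree populated in /usr/src:

Code:
cd /usr/src
make -j"$(sysctl -n hw.ncpu)" buildworld buildkernel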
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
One of the best stress tests for FreeBSD or *nix based OSes in general is to do a make buildworld and kernel configs. Those greatly exercise the silicon and storage. If you get a core dump and that's not a PEBKAC, you have problem hardware. I don't know if FreeNAS includes the full FreeBSD distribution where you can do a make buildworld and do kernel config. It's easy enough to drop FreeBSD on a system and do a kernel config and make buildworld for additional stress testing.

No, FreeNAS doesn't, and buildworld actually isn't particularly stressful in the grand scheme of things. There was a time when it was a substantial undertaking, like back in the '90's, but today it just isn't really adequate with fast machines and multiple cores and massive amounts of RAM.

And it totally misses the big thing.

A NAS is primarily an I/O device and a buildworld only puts out a handful of GB of traffic towards the disks. A proper stress test FOR A NAS includes actively trying to access every data block of every device, and doing that simultaneously, to identify issues with the storage devices (bad blocks, etc) or controller (throughput issues, etc). You also have to test the CPU and memory for more than the hour or two a buildworld takes. I typically suggest timeframes of weeks.
 

troudee

Explorer
Joined
Mar 26, 2020
Messages
69
(This is a cross posting from here)

I do not understand how I get "results" (being "drive BAD" or "drive not BAD yet") from the solnet script.

NO AMOUNT OF STRESS should result in the kernel reporting timeouts or other issues communicating with your drives.

Can I deduce from that, that I just have to leave it running with all the HDDs for a few days and check whether any error messages are printed to the screen (--> drive BAD)? Am I supposed to do anything with the numerical output the script gives me (apart from very clear things like "Wow, this number is ten times higher on this drive!")?
 

Mastakilla

Patron
Joined
Jul 18, 2019
Messages
202
I was wondering the same things, and besides looking for obvious error messages, I also tried looking at IO consistency during the test to determine whether everything was "normal" and "ok".

As far as I understand it, this is "ok" to do, but people did warn me not to be too "strict" when interpreting the results (a bit of inconsistency should be ok to ignore).

More details here:
 
Last edited:

troudee

Explorer
Joined
Mar 26, 2020
Messages
69
Hello Mastakilla,

thank you very much for your reply!

So the process would be to leave the solnet script running for a week (?) and check if the values are roughly consistent?

How could I check whether the kernel reports "timeouts or other issues communicating with [my] drives", as @jgreco said? If those messages just get printed to the screen, they would be gone after a week's runtime.
 

Mastakilla

Patron
Joined
Jul 18, 2019
Messages
202
I used both the values reported by the solnet script (mostly in seconds) and the FreeNAS IO reporting (screenshots in the link above).

/var/log/messages should be your friend for finding timeouts / other issues
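
One way to make sure nothing scrolls away during a week-long run is to capture the script's console output with script(1) and check the persisted kernel logs afterwards (a rough example; the log file name and grep patterns are only illustrative):

Code:
script /tmp/burnin.log ./solnet-array-test-v2.sh           # records everything the script prints
grep -i -E 'timeout|CAM status|error' /var/log/messages*   # kernel-side complaints persist here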
 

troudee

Explorer
Joined
Mar 26, 2020
Messages
69
Another question about the solnet script: @jgreco wrote in the description that there are only two passes:

I've included just two main passes, a parallel read pass, and a parallel read pass with multiple accesses per disk.

And as well:

The test will run a variety of things and report status. It takes a while. Be patient. It will never terminate as it is intended as a burn-in aid, but you do want to let it do its thing for at least a pass or two to get an idea of how your system performs.

I do not quite get these two statements together -- should it terminate or shouldn't it? For me, doing the test using option 3 (specify disks), it just did (it said "Completed: initial parallel seek-stress array read", gave me a stats table and exited), but I do not know if it should.
 

Mastakilla

Patron
Joined
Jul 18, 2019
Messages
202
As far as I understand the script code, it should terminate... I was also a bit confused by that statement... It does take "forever" to complete ;)
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
There are a few versions of the tool floating around. This originated around two decades ago when drive sizes were much smaller, and drives were faster... the ST150176 and ST173404 specifically.

So the thing here is that, for the random seek test, the tool is reading 1MB blocks. It starts five "threads" each of which is sequentially reading the whole disk, and these are separated by 60 seconds. Back in the day the 72GB ST173404LWV had a massive 16MB cache so one of my concerns was to make sure that everything didn't all end up in the drive's cache. I've been observing with some interest to see if anyone starts reporting "fast" random times with the tool but so far not seein' it.

HDD throughput is significantly disrupted by seek traffic, to the point where if you are asking for random 512-byte sectors, and your drive is only able to do maybe 200 seeks per second, your performance might drop down to ~100KB/sec. I wasn't really interested in an open-ended random seek test, so instead the tool just has five threads reading large blocks in roughly the same region of the disk, which was intended to exercise the mechanism. The 1.6" Seagates were very twitchy and sensitive to mishandling, and we were building an array of 72 of them...

Anyways, back then the tool might complete relatively quickly, but as disk sizes have increased and seek performance has stayed about the same, a single run can take quite a while... it isn't really the number of passes that matters, in my opinion, it's that the drive is able to seek reliably throughout the disk.
 

troudee

Explorer
Joined
Mar 26, 2020
Messages
69
Okay, so when the script finishes without warnings on screen or in /var/log/messages, it's good? Or should I run it again and again?
 

c77dk

Patron
Joined
Nov 27, 2019
Messages
467
I'm replacing my WD Red SMR (*censored*) drives with IronWolfs - and just have to confirm how important burn-in can be.

I started a "long" SMART test after "short" and "conveyance" had completed without errors... and then the "long" test failed with "read failure". So, time to see how good Seagate is with RMAs.
 

Elliott

Dabbler
Joined
Sep 13, 2019
Messages
40
This is a nice guide. Thanks for the test script. It would be good to add a write test too (with a warning not to use it on a live array). I'm going to work on this when I have time.
 