jgreco
Resident Grinch
I've been meaning to post some guidance here for a while now. We frequently see people come to the forums with hardware problems that should have washed out in the system build process, but since many of the users here are DIY'ers without professional experience building servers, their systems go from parts-in-box to in-use in a few hours.
This process also needs to be repeated when you are upgrading your system by adding new parts or changing out existing parts. There's a little bit of "use your brain" in how strict you need to be, but doing stuff like just dumping more RAM into your box and then booting it with a production pool attached can lead to great sadness. Don't do that. Your hardware needs to be solid and validated, and if it isn't, you can scramble your bits, possibly irretrievably.
This isn't a complete or comprehensive guide, more like a work-in-progress.
The Build Process
It is tempting to just rip open all those boxes you got, put it together, and pray. Bad idea.
1) Set up a suitable workspace.
1a) If you are in a low humidity environment, pay extra special attention to environmental static controls, including making sure that the clothes you're wearing have been conditioned with fabric softener. Fabric softener can be diluted with water in a sprayer and applied to carpets, which also reduces the nasty winter zaps!
1b) All computer assembly work should be done on top of an anti-static mat. Fry's, NewEgg, and Amazon all have these for around $20. This is not optional! Static discharge can subtly damage your silicon.
1c) Assemble while wearing an anti-static wrist strap. These are available from $5-$20 at those retailers and elsewhere. Ideally you should also wear some ESD gloves. Even the cheap $1-a-pair ones will not only help with ESD but will also keep skin oils off your surfaces and help reduce contaminants.
1d) Make sure you hook up the wires from your anti-static gear to a proper ground.
1e) Do not wear static-prone clothing, particularly many synthetics. Short sleeves and shorts can help reduce static!
Handle all components only in your anti-static environment, preferably by their edges, never by exposed contacts. A lot of people like to install their RAM and CPU on a mainboard prior to installing in the case, but this increases the number of components in play at a time. It is ideal to install the mainboard in the chassis first, then ground the chassis, and then install components one at a time. In a small chassis build, this may be impractical, and even in a large chassis it could make installation of the CPU tricky. Only remove components from the packages as they are actually required. Resist the urge to unpack everything and spread it out. Extra handling is extra risk. Keep your mind on grounding and careful handling.
When mounting your mainboard, make sure it is securely supported at all points where the chassis offers screw holes. Make sure that the chassis doesn't have any screw standoffs in places where the motherboard does not have a hole; these can short out a motherboard.
Tighten all screws until you feel moderate resistance, then give just a bit more of a twist. You want a solidly seated screw, not loose, not stripped.
Smoke Test
Our name for the initial power-on test. Computers run on smoke. Once the smoke comes out, they stop working.
You should smoke test on the bench with the chassis open so that you can inspect everything visually.
All cards should be fully seated, meaning that among other things you should see an even amount of the board's copper fingers exposed along the length of the socket.
All DIMM modules should be fully seated. A properly seated DIMM will include the clips on the side being fully engaged, which is a fairly quick visual test. Also verify that the module's copper fingers appear to be uniform.
Power it on and make sure that the hardware manifest agrees with what the BIOS reports. Observe to make sure that all your fans are spinning, and take some time to address any buzzy or unpleasant vibration noise.
Configuration
In the BIOS:
Reset the BIOS settings to defaults.
Configure the server to power on after power loss (if this is desired).
Configure power button to NOT shut off host immediately upon press. A momentary press of the power button is supposed to send a signal to the OS to shut down cleanly and power off. It's a good idea to test that once the OS is installed.
Configure the boot priority to prioritize the USB or SATA DOM, and if possible, disable booting from the data disks.
Burn-In and Testing
The burn-in and testing may be separated out or done as kind of a commingled whole. It may disappoint you to discover that proper burn-in and testing actually requires more than a thousand hours of time - a month and a half. This is the time during which you want infant mortality to strike.
Be sure to run an exhaustive test of memtest86 on all the system memory. If you have chosen to play with fire by not buying an ECC system, this is really the only significant opportunity you have to discover problems without your pool being at risk. If you have ECC, try to identify the way in which your mainboard manufacturer provides failure logging (BIOS, BMC/IPMI, etc). Any memory failures should be addressed. Do not rely on ECC to repair problems in a problematic module. The memtest86 tests can be run several times throughout the burn-in period to validate your memory. Don't just run one pass. Run it for a week or two in aggregate, once early on, once later on.
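If your board has a BMC/IPMI, one convenient way to check for logged ECC events from a shell is ipmitool's event log listing. A minimal sketch, assuming ipmitool is installed and can reach the local BMC; exact log contents and formats vary by vendor:

    # Clear the BMC's system event log before burn-in so that anything
    # appearing afterwards is known to be new:
    ipmitool sel clear

    # List logged events; corrected/uncorrected ECC errors recorded by
    # the firmware will typically show up here on server boards:
    ipmitool sel elist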
Find one of the CPU-stressing utilities such as "CPUstress" (http://www.ultimatebootcd.com/), run it for at least a day, and monitor the temperature and fan behaviour in your chassis. Especially if you are attempting to build a "quiet" NAS, now is the opportunity to make sure that your cooling strategy is going to work. Heatsink compound takes a while to cure or "break in", typically around 150-250 hours of operation for materials such as AS5, and you should notice a temperature drop of five to ten degrees after that point. Just like with the memory, it's a great idea to run it for a day early on, and then a day later on near the end of the burn-in.
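If you'd rather generate load from an installed OS than boot a utility CD, a crude alternative is one busy-loop per core while watching the CPU temperature. A rough sketch for an interactive FreeBSD/FreeNAS shell; the four-core count is an assumption, and reading the temperature sysctl requires the coretemp(4) or amdtemp(4) driver:

    # Spin up one busy-loop per core (adjust the count to your CPU):
    for i in 1 2 3 4; do
        ( while :; do :; done ) &
    done

    # Watch the temperature while the load runs:
    sysctl dev.cpu.0.temperature

    # When finished, kill the background loops:
    kill $(jobs -p)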
Run SMART tests on all your drives, including the manufacturer's conveyance test and the long test. For this purpose it might be convenient to set up an instance of FreeNAS.
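From a shell, smartmontools can kick those tests off directly. A sketch assuming the first drive shows up as /dev/da0 (substitute your own device names, and let each test finish before starting the next, since starting a new test aborts the one in progress):

    smartctl -t conveyance /dev/da0   # short test for shipping damage (ATA drives)
    smartctl -t long /dev/da0         # full surface scan; takes hours
    smartctl -a /dev/da0              # review results and attributes afterwards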
SMART is not sufficient to weed out bad drives, though it is a good thing to run. It is unable to identify subtle problems in the SATA/SAS channels to your drives, for example. In order to validate both the drives and their ability to successfully shuffle data, be sure to run plenty of tests on them. I suggest:
1) Individual sequential read and write tests. This is basically just using "dd if=/dev/da${n} of=/dev/null bs=1048576" to do a read test, and "dd if=/dev/zero of=/dev/da${n} bs=1048576" to do a write test; the write test destroys anything on the drive, so only run it before any pool exists (see the sketch after this list). Record the reported read/write speeds at the end and compare them both to the other drives in the pool and to any benchmark stats you can find for that particular drive on the Internet. They should be very close.
2) Simultaneous sequential read and write tests. This is using those same tests in parallel.
3) I am kind of lazy so I will then have it do seek testing by starting multiple tests per drive and separating them by a minute or two each. The drives will do lots of seeking within a relatively small locality. Other people like to use different tools to do this; I have absolutely no objection but as I noted I'm kind of lazy.
4) I've recently provided a tool to assist: https://forums.freenas.org/index.php?resources/solnet-array-test.1/
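For reference, here is the promised sketch of items 1 and 2 as shell loops. The four da0-da3 device names are assumptions; substitute your own, and remember that the write test destroys everything on the target drives:

    # Sequential read test, all drives in parallel (item 2); drop the
    # "&" and the "wait" to test a single drive at a time (item 1):
    for n in 0 1 2 3; do
        dd if=/dev/da${n} of=/dev/null bs=1048576 &
    done
    wait

    # DESTRUCTIVE sequential write test - only on drives with no data:
    for n in 0 1 2 3; do
        dd if=/dev/zero of=/dev/da${n} bs=1048576 &
    done
    wait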
NO AMOUNT OF STRESS should result in the kernel reporting timeouts or other issues communicating with your drives. If it does - update drivers, reseat cables, Google for known problems with your drives or controller, upgrade drive firmware, and/or generally fix your busted hardware. You cannot use a system throwing spurious storage subsystem errors for ZFS.
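One simple way to keep an eye out for these while the tests run is to watch the kernel message buffer and the system log. A sketch for FreeBSD/FreeNAS; the exact CAM error strings vary by controller and driver:

    # Scan recent kernel messages for storage-related complaints:
    dmesg | egrep -i 'cam status|timeout|unrecover|retrying'

    # Or follow the system log live while the dd tests run:
    tail -f /var/log/messages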
Hard drives have a high infant mortality rate, and so you really want to get past the thousand hour mark without incident. It's better to be doing this as part of burn-in rather than "oh crud I just trusted all my data to this RAIDZ2 and now two of my drives are reporting problems."
Once you've hit a system with tests like these for a few weeks, confidence in the platform should be established. Then you move on to installing FreeNAS, setting up ZFS, and beating on that for awhile... some comments on that and networking testing hopefully coming soon...