CoreyVidal
Dabbler
- Joined: Jun 19, 2020
- Messages: 14
Random reboots. Please help. Oh God, please send help. Or alcohol. Or a gun. I'm just feeling crushed by putting so much work into this.
Setting the scene: I'm a lifelong Windows-only user (25+ years) who was running a bunch of services like Plex off of my daily-usage desktop computer (like an IDIOT). A few months ago I got hit by hackers who encrypted all of my data. 17 TB worth (I have 2 large external DAS's; I'm a videomaker and video files are massive). I managed to recover my files but it taught me a valuable lesson: it was time to rethink the way I was organizing my digital life. I want/need a dedicated server, dedicated NAS, and off-site storage.
I built a small server off of a new Intel NUC (here's the hardware). I finally took the plunge into Linux (Ubuntu 20.04), working in the command line for the first time in my life. It's overwhelming, but I'm getting it done. Lots of Googling and forums. I learned all about Docker. Blew my mind. It's set up and humming along nicely with about 20 containers going. A month has gone by.
Then I started building my NAS. I decided to go with FreeNAS over Unraid to get better performance because of the needs of my large video files. I'm gonna turn my 2 DAS's (6 drives × 6 TB each) into one big NAS (12 × 6 TB)!
I transferred everything off of 1 of the 2 DAS's so I could disassemble it and put its 6 drives in my NAS.
Here's the hardware of my FreeNAS system. The motherboard, processors, SAS card, and RAM are used server parts from eBay. The case, PSU, and fans are all brand new.
Updated the BIOS and firmware. Followed instructions on flashing my SAS card's firmware to IT mode. Put in the 6 drives, and finally installed FreeNAS onto an additional internal dedicated 256 GB SSD.
That night, it randomly rebooted 3 times.
Thus began a month-long journey into investigating all my hardware. Here's everything I did/tested/ran:
CPU
(Booted Hiren's BootCD from USB into Windows 10)
OCCT:
Run Linpack: the latest version (for 3+ hours)
Run OCCT: small data set, no AVX (for 3+ hours)
Run OCCT: medium data set, no AVX (for 3+ hours)
Run OCCT: large data set, no AVX (for 3+ hours)
Prime95:
Run: Blend (for 12 hours)
Run: Small (for 6 hours)
————————————————————
Memory
memtest86+:
Ran it on all 12 of my sticks
A few hours in:
RANDOM REBOOT!
Took out all 12 sticks. Tested 4 at a time. Across those runs, it reported 5 non-critical (corrected) ECC errors.
NO RANDOM REBOOTS!
Also, my server is quieter now? It used to make a very quiet chugging sound that I didn't think anything of, but that has stopped. I wonder if having unplugged and replugged in all the sticks might have fixed something?
memtest86+ (take 2):
Ran it on all 12 of my sticks - again!
NO RANDOM REBOOT!
————————————————————
Storage
I ran this custom script by a FreeNAS user based on suggestions from this forum. It ran SMART tests and 4 passes of badblocks and more SMART tests on my drives. I first tried running it on all 6 drives at once...
RANDOM REBOOT!
Ran that script on 1 drive (took 72ish hours). No problems. So I ran that script on the other 5 drives (took 120ish hours). No problems!
I think maybe we're good to move on?!
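For anyone following along, a burn-in like that generally boils down to something like the sketch below. To be clear, this is NOT the script mentioned above, just a rough dry-run outline of the same idea (SMART tests, a destructive badblocks write test with its four patterns, then SMART again). The function name burnin_plan and the default device /dev/da0 are made up for illustration. It only prints the commands and runs nothing, because badblocks -w wipes the drive.

```shell
#!/bin/sh
# Dry-run sketch of a per-drive burn-in. It PRINTS the commands only.
# burnin_plan and the default device name are hypothetical.
# Note: SMART self-tests run in the background on the drive, so each one
# has to finish before the next step is started for real.

burnin_plan() {
    dev="$1"
    echo "smartctl -t short $dev"      # quick self-test
    echo "smartctl -t long $dev"       # full-surface self-test
    # -w = destructive write-mode test (four patterns), -s = show progress
    echo "badblocks -ws -b 4096 $dev"
    echo "smartctl -t long $dev"       # re-test after the write passes
    echo "smartctl -A $dev"            # print attributes to compare before/after
}

burnin_plan "${1:-/dev/da0}"
```

Printing the plan first makes it easy to double-check the device name before anything destructive gets pasted into a terminal.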
It's been weeks of testing, but we're back in FreeNAS. Let it just sit running for a day. No random reboots! I test copying files to it. Everything works great. I set up SMB. All good. Sharing to/from Windows. Wonderful. Stable for days. When I'm feeling confident, I plug my DAS into my NAS and use FreeNAS's Import Disk feature (Storage ➞ Import Disk) to copy everything.
RANDOM REBOOT.
"God dammit!" (← use your imagination)
I do a bunch of testing. Time passes. Life is pain and existence is a curse. I Google and try a bunch of different stuff. I check crash logs, but I don't really know what I'm looking for (I literally only learned the command line 2 months ago). I Google. I forum. I cry.
I extended the pool to add an additional SSD as a dedicated log drive. And somehow, magically, that instantly solved it. (EDITOR'S NOTE: I feel like this may be an important indicator of what might be wrong, maybe?!)
I try Import Disk again. It's ROCK SOLID! It takes a long time. The copy finishes. I now have a new copy of all my files on my FreeNAS, and have that same 17 TB of files still on my old DAS. I double-check and compare files on both to make sure everything's perfect. It is! I *NEED* to know that my NAS is rock solid. I don't trust it as the sole holder of this 17 TB yet. I'm gonna back up data to the cloud first, and then I'm going to take apart the remaining DAS and add it to my NAS.
In my Ubuntu server, I set up a permanent mount of my FreeNAS. All good. Gonna test some speed!
sync; dd if=/dev/zero of=tempfile bs=1M count=1024; sync
Result: 112 MB/s
dd if=tempfile of=/dev/null bs=1M count=1024
Result: 117 MB/s
sudo /sbin/sysctl -w vm.drop_caches=3
dd if=tempfile of=/dev/null bs=1M count=1024
Result: 117 MB/s
After displaying the last result, a few seconds go by...
RANDOM REBOOT.
The pain. The disappointment. It boots back up, I don't touch it. A few minutes later, another random reboot. Again, I don't touch it. 10 minutes later, another random reboot. Don't touch it. An hour later, another random reboot. 4 reboots in a row.
I gave up and started drinking. This was last night. I left it overnight. Now it hasn't rebooted in 12 hours.
Feeling bold, I just ran the 4 speed test commands above on it again, exactly the same way. This time? No random reboot.
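Since the reboots don't line up with any one command, it might help to timestamp each step next time, so a reboot on the NAS can be matched against what the Ubuntu box was doing at that moment. Here's a minimal sketch of that idea wrapping the same four commands. The names run_step, LOG, and DRY_RUN are just placeholders made up for this sketch, and the log goes to /tmp on the client so it doesn't live on the NAS mount. By default it only prints and logs the commands; set DRY_RUN=0 to actually run them.

```shell
#!/bin/sh
# Sketch: log a timestamp before and after each speed-test step.
# run_step, LOG and DRY_RUN are hypothetical names for illustration.

LOG="${LOG:-/tmp/nas-speedtest.log}"   # keep the log off the NAS mount
DRY_RUN="${DRY_RUN:-1}"                # 1 = print/log only, 0 = really run

run_step() {
    echo "$(date '+%F %T') START $*" >> "$LOG"
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        sh -c "$*"
    fi
    echo "$(date '+%F %T') END   $*" >> "$LOG"
}

# The same four steps as above, in the same order.
run_step "sync; dd if=/dev/zero of=tempfile bs=1M count=1024; sync"
run_step "dd if=tempfile of=/dev/null bs=1M count=1024"
run_step "sudo /sbin/sysctl -w vm.drop_caches=3"
run_step "dd if=tempfile of=/dev/null bs=1M count=1024"
```

A START line with no matching END would point at the step that was running when the client lost the NAS.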
An hour has passed as I've been writing this. But I'm too cynical to think this nightmare is over. I just don't really know what to do from here.