FreeNAS 11.3 Rebooting On Its Own

nickwebha

Dabbler
Joined
Sep 15, 2017
Messages
12
[Jump down to post #15 to see the most up-to-date information.]

  • FreeNAS version
    • 11.3
  • Motherboard make and model
    • ASRock C2750D4I
  • CPU make and model
    • Intel Octa Core Avoton C2750 Processor
  • RAM quantity
    • 32GB
  • Hard drives, quantity, model numbers, and RAID configuration, including boot drives
    • 12
    • Seagate ST6000DX000
    • raidz3
    • Dual Samsung Fit Plus 32GB
  • Hard disk controllers
    • Intel® C2750: 2 x SATA3 6.0 Gb/s, 4 x SATA2 3.0 Gb/s
    • Marvell SE9172: 2 x SATA3 6.0 Gb/s
    • Marvell SE9230: 4 x SATA3 6.0 Gb/s
  • Network cards
    • [Built-in/unknown] Dual gigabit

I have been running FreeNAS for a few years now (since 9.x?). Same hardware, and I have never had a problem besides a boot device dying here or there. Now the machine is rebooting every few days. As you can imagine, iSCSI is not a fan, to say the least.

I assume this is a kernel panic, but I am unsure how to check that on FreeBSD. What logs do I need to look at to see why the machine is rebooting on its own?
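
(Adding this after the fact for anyone who lands here later: these are the standard places to look on a default FreeBSD/FreeNAS install; paths may differ on other setups.)

Code:
# Saved crash dumps (FreeNAS writes them under /data/crash; stock FreeBSD uses /var/crash)
ls -lh /data/crash

# Any panic messages that made it into the system log before the reboot
grep -i panic /var/log/messages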

Thanks.
 
Last edited:

nickwebha

Dabbler
Joined
Sep 15, 2017
Messages
12
Check /data/crash

Here is the one from today (the most recent):
Code:
Dump header from device: /dev/ada11p1
  Architecture: amd64
  Architecture Version: 1
  Dump Length: 789504
  Blocksize: 512
  Dumptime: Tue Jul 7 01:01:29 2020
  Hostname: freenas.local
  Magic: FreeBSD Text Dump
  Version String: FreeBSD 11.3-RELEASE-p5 #0 r325575+8ed1cd24b60(HEAD): Mon Jan 27 18:07:23 UTC 2020
    root@tnbuild02.tn.ixsystems.com:/freenas-releng/freenas/_BE/objs/freenas-releng/freenas/_BE/os/sys/Free
  Panic String: double fault
  Dump Parity: 3411573524
  Bounds: 1
  Dump Status: good
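
If it helps anyone else: the panic string and backtrace live inside the textdump tarball that savecore drops next to that header. The filename below is a guess based on the "Bounds: 1" value and might be gzip-compressed on other systems.

Code:
cd /data/crash
tar -xvf textdump.tar.1               # newest textdump, per "Bounds: 1" above
less panic.txt msgbuf.txt ddb.txt     # panic string, kernel message buffer, ddb capture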
 

nickwebha

Dabbler
Joined
Sep 15, 2017
Messages
12
First impression is hardware. Check your IPMI log - ipmitool sel get all - I'm thinking RAM.

Happened again some time during the night.

Code:
1 | 07/08/2020 | 04:43:15 | System Event | Timestamp Clock Sync | Asserted
2 | 07/08/2020 | 04:43:15 | System Event #0xff | Timestamp Clock Sync | Asserted
3 | 07/08/2020 | 04:43:17 | System Event #0xff | Timestamp Clock Sync | Asserted
4 | 07/08/2020 | 04:43:17 | System Event | Timestamp Clock Sync | Asserted


Does this mean anything to anyone?
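
For reference, the ipmitool invocations to pull the full event log and the current sensor readings look roughly like this (run on the FreeNAS host itself, assuming the ipmi kernel driver is loaded):

Code:
ipmitool sel elist    # full system event log with decoded sensor names
ipmitool sensor       # current voltages, temperatures, and fan speeds with thresholds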
 

nickwebha

Dabbler
Joined
Sep 15, 2017
Messages
12
It just did it again right in front of me. It happened while unlocking an encrypted volume (although I suspect it finished unlocking and started mounting/fsck'ing due to the timing).

Got this: https://ibb.co/nPTNzmS
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
That's a bit more interesting. There's a bad vdev somewhere in one of your pools, and it could be in your boot pool.
 

colmconn

Contributor
Joined
Jul 28, 2015
Messages
174
That's a bit more interesting. There's a bad vdev somewhere in one of your pools, and it could be in your boot pool.
I'd vote for one or more failing USB boot drives. I'd replace one or both of those with SSDs. That said, in this case that will be difficult, as all the SATA ports on that board are in use (unless I've miscounted). It might be worth considering getting an HBA for that system, putting all your storage drives on the HBA, and using one or two of the SATA3 ports on the motherboard for boot drive(s).
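
If it isn't obvious which drives hang off which controller, something along these lines should show it with the stock FreeBSD tools (just a sketch, no extra packages needed):

Code:
camcontrol devlist -v               # each ATA/SCSI bus and the disks attached to it
pciconf -lv | grep -B3 -i marvell   # confirms which PCI devices are the Marvell controllers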
 

nickwebha

Dabbler
Joined
Sep 15, 2017
Messages
12
it could be in your boot pool.

I have been meaning to replace my boot pool. The current devices have no blinking lights, so I have no idea which one it is when one dies.

Plus, I do not trust these mini USB drives. They seem to die for no reason pretty quickly. I have tried so many different brands. I even stuck them on small USB extension cables, thinking it might have been the heat, but that made no difference.
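
(For my own reference: the boot pool status should at least name which member has faulted, and glabel can map that label back to a device node. Pool name assumed to be the FreeNAS 11.x default.)

Code:
zpool status -v freenas-boot    # shows which mirror member is FAULTED/UNAVAIL
glabel status                   # maps gptid/... labels back to da0, da1, etc.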
 

colmconn

Contributor
Joined
Jul 28, 2015
Messages
174
I have been meaning to replace my boot pool. The current devices have no blinking lights, so I have no idea which one it is when one dies.

Plus, I do not trust these mini USB drives. They seem to die for no reason pretty quickly. I have tried so many different brands. I even stuck them on small USB extension cables, thinking it might have been the heat, but that made no difference.
You could try a small SSD in a SATA-to-USB enclosure. It would probably be more reliable than those little USB thumb drives.
 

nickwebha

Dabbler
Joined
Sep 15, 2017
Messages
12
You could try a small SSD in a SATA-to-USB enclosure. It would probably be more reliable than those little USB thumb drives.

Thanks for the suggestion. I might try that.

So I replaced the boot vdev (fresh install, fresh out-of-the-package USB drives). I am still getting the same crash (https://ibb.co/nPTNzmS).

If it is volume1 (my only volume) causing this, how do I get my data (TBs) off it if it will not stay up long enough to transfer the data? I have a second, identical remote backup server. However, due to a bone-headed mistake, it contains no backup at the moment. I was just starting to rebuild the backup over a slow link (it is going to take a few months, which is fine compared to the cost of a faster link plus contract).

Could this be something else? Bad RAM, perhaps? These sticks are hard to find and expensive, so I would like to confirm before replacing them. Perhaps I will try memtest?
 

colmconn

Contributor
Joined
Jul 28, 2015
Messages
174
Have you checked all your SATA and power connections? Made sure they're properly connected to both drive and MB? It might be worth reseating them. What does smartctl report for each drive? The machine is well cooled, right? The drives are well ventilated? If I were you, I'd focus on getting the most important data off it, the data you don't want to live without. Running memtest is not a bad idea.
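
A quick sweep of every drive at once could look like this (a sketch only, run under sh; the USB boot sticks may need -d sat or may not answer SMART queries at all):

Code:
for disk in $(sysctl -n kern.disks); do
    echo "=== ${disk} ==="
    smartctl -H -l error /dev/${disk}    # overall health verdict plus the drive's error log
done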
 

nickwebha

Dabbler
Joined
Sep 15, 2017
Messages
12
[My very late reply is due to backing up my important, then less important, data over a very slow link. Now that it is all backed up and safe, I can proceed.]

I have checked every SATA and power connection; they were fine, but I reseated them just to be sure. All S.M.A.R.T. tests (short and long) come back fine on all disks, run one at a time. The machine is very well cooled (the entire front of the case is fans, plus others) and sits in a 24/7 air-conditioned room. The free version of Memtest86, which I let run for 24+ hours, also comes back clean.

So things have changed slightly, although I have changed nothing. It is no longer rebooting but is still giving error messages. If I just hammer the thing with reads and writes over NFS, it works great. I wrote a tiny script which I ran from multiple machines, each reading random files, writing random files of various sizes (into the hundreds of GB), and randomly deleting said files. I did this for almost a week, while still using the system normally (minus iSCSI), and it never complained about anything (although it was slow, as would be expected).

Not so with iSCSI. I have a single iSCSI target from which I run multiple (~5) containers hosted on another, dedicated machine (Ubuntu 18.04.5). If I run a container which does virtually no reads or writes (say, a VPN), everything looks fine after it starts. If I leave that container up and running for a week, everything still looks fine. If, however, I start a container that does a lot of reads or a lot of writes (not always even a lot, maybe as low as 5 MB/s), FreeNAS iSCSI goes nuts (see attached images). Sometimes it takes a few minutes, other times it takes a whole day, but it always happens. Ubuntu ends up marking the iSCSI disk as read-only, and the containers start freaking out. Restarting everything (Ubuntu, containers) works, but obviously that is not a real solution.

I have done my best to figure out these error messages. They seem to have something to do with not being able to write to the disks fast enough and timing out, which sounds crazy to me, since NFS handles beasts of files without pausing or anything.
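
(In case anyone wants numbers, these are the kinds of per-disk views I can capture the next time it happens; volume1 is my pool name.)

Code:
gstat -p                    # per-disk busy %, queue depth, and ms/write, refreshed live
zpool iostat -v volume1 5   # per-vdev throughput and IOPS every 5 seconds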

Not sure what my next steps are here. If anyone could lend me some suggestions I would appreciate it because I have no ideas.

Thank you.
 

Attachments

  • IMG_20201114_145148.jpg (229.5 KB)
  • IMG_20201114_145240.jpg (245.2 KB)
  • IMG_20201114_150717.jpg (239 KB)
  • IMG_20201114_154837.jpg (253.9 KB)
Last edited:

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
The iSCSI volblocksize being significantly smaller by default (16K) than the dataset recordsize (128K) means you're looking at 8x (or more) the individual "write commands" to your disks.
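
To compare the two, you can read the properties straight off the pool; the zvol path below is just a placeholder for your actual iSCSI extent.

Code:
zfs get volblocksize volume1/iscsi-zvol   # 'iscsi-zvol' is a placeholder name
zfs get recordsize volume1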

Question - is buying an LSI HBA (or reflashed OEM equivalent) a possibility for testing, to remove the Marvell controller from the mix?
 

nickwebha

Dabbler
Joined
Sep 15, 2017
Messages
12
Question - is buying an LSI HBA (or reflashed OEM equivalent) a possibility for testing, to remove the Marvell controller from the mix?
New hardware is out of reach right now. If, God forbid, the disks start failing, I would just shut the thing down until I could replace them, whenever that might be.

The iSCSI volblocksize being significantly smaller by default (16K) than the dataset (128K) means you're looking at 8x (or more) the individual "write commands" to your disks.
My volblocksize for that zvol is 64K. Are you suggesting changing that to 128K?

I am going to brush up on blocksize, volblocksize, and ashift, and on whatever the equivalent Linux settings are.
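
For my own notes, these look like the commands to read the current values back (the zpool.cache path is the FreeNAS default; correct me if I am wrong):

Code:
zdb -U /data/zfs/zpool.cache -C volume1 | grep ashift     # ashift per vdev
diskinfo -v /dev/ada0 | egrep 'sectorsize|stripesize'     # logical/physical sector size of one disk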

Edit
The oddest thing to me is that when this started months ago, I had not changed a thing. I had not even logged into either system, and everything had been running great.
 
Last edited: