FreeNAS 11.3 Rebooting On Its Own

nickwebha

Dabbler
Joined
Sep 15, 2017
Messages
12
[Jump down to post #15 to see the most up-to-date information.]

  • FreeNAS version
    • 11.3
  • Motherboard make and model
    • ASRock C2750D4I
  • CPU make and model
    • Intel Octa Core Avoton C2750 Processor
  • RAM quantity
    • 32GB
  • Hard drives, quantity, model numbers, and RAID configuration, including boot drives
    • 12
    • Seagate ST6000DX000
    • raidz3
    • Dual Samsung Fit Plus 32GB
  • Hard disk controllers
    • Intel® C2750: 2 x SATA3 6.0 Gb/s, 4 x SATA2 3.0 Gb/s
    • Marvell SE9172: 2 x SATA3 6.0 Gb/s
    • Marvell SE9230: 4 x SATA3 6.0 Gb/s
  • Network cards
    • [Built-in/unknown] Dual gigabit

I have been running FreeNAS for a few years now (since 9.x?). Same hardware, and I have never had a problem besides a boot device dying here or there. Now the machine is rebooting every few days. As you can imagine, iSCSI is not a fan, to say the least.

I assume this is a kernel panic, but I am unsure how to check that on FreeBSD. What logs do I need to look at to see why the machine is rebooting on its own?
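
(Adding this after the fact for anyone who lands here later: these are the standard places to look on a default FreeBSD/FreeNAS install; paths may differ on other setups.)

Code:
# Saved crash dumps (FreeNAS writes them under /data/crash; stock FreeBSD uses /var/crash)
ls -lh /data/crash

# Any panic messages that made it into the system log before the reboot
grep -i panic /var/log/messages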

Thanks.
 
Last edited:

nickwebha

Dabbler
Joined
Sep 15, 2017
Messages
12
Check /data/crash

Here is the one from today (the most recent):
Code:
Dump header from device: /dev/ada11p1
  Architecture: amd64
  Architecture Version: 1
  Dump Length: 789504
  Blocksize: 512
  Dumptime: Tue Jul 7 01:01:29 2020
  Hostname: freenas.local
  Magic: FreeBSD Text Dump
  Version String: FreeBSD 11.3-RELEASE-p5 #0 r325575+8ed1cd24b60(HEAD): Mon Jan 27 18:07:23 UTC 2020
    root@tnbuild02.tn.ixsystems.com:/freenas-releng/freenas/_BE/objs/freenas-releng/freenas/_BE/os/sys/Free
  Panic String: double fault
  Dump Parity: 3411573524
  Bounds: 1
  Dump Status: good
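
If it helps anyone else: the panic string and backtrace live inside the textdump tarball that savecore drops next to that header. The filename below is a guess based on the "Bounds: 1" value and might be gzip-compressed on other systems.

Code:
cd /data/crash
tar -xvf textdump.tar.1               # newest textdump, per "Bounds: 1" above
less panic.txt msgbuf.txt ddb.txt     # panic string, kernel message buffer, ddb capture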
 

nickwebha

Dabbler
Joined
Sep 15, 2017
Messages
12
First impression is hardware. Check your IPMI log - ipmitool sel get all - I'm thinking RAM.

Happened again some time during the night.

Code:
1 | 07/08/2020 | 04:43:15 | System Event | Timestamp Clock Sync | Asserted
2 | 07/08/2020 | 04:43:15 | System Event #0xff | Timestamp Clock Sync | Asserted
3 | 07/08/2020 | 04:43:17 | System Event #0xff | Timestamp Clock Sync | Asserted
4 | 07/08/2020 | 04:43:17 | System Event | Timestamp Clock Sync | Asserted


Does this mean anything to anyone?
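
For reference, the ipmitool invocations to pull the full event log and the current sensor readings look roughly like this (run on the FreeNAS host itself, assuming the ipmi kernel driver is loaded):

Code:
ipmitool sel elist    # full system event log with decoded sensor names
ipmitool sensor       # current voltages, temperatures, and fan speeds with thresholds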
 

nickwebha

Dabbler
Joined
Sep 15, 2017
Messages
12
It just did it again right in front of me. It happened while unlocking an encrypted volume (although I suspect it finished unlocking and started mounting/fsck'ing due to the timing).

Got this: https://ibb.co/nPTNzmS
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
That's a bit more interesting. There's a bad vdev somewhere in one of your pools, and it could be in your boot pool.
 

colmconn

Contributor
Joined
Jul 28, 2015
Messages
174
That's a bit more interesting. There's a bad vdev somewhere in one of your pools, and it could be in your boot pool.
I'd vote for one or more failing USB boot drives. I'd replace one or both of those with SSDs. That said, in this case that will be difficult, as all the SATA ports on that board are in use (unless I've miscounted). It might be worth considering getting an HBA for that system, putting all your storage drives on the HBA, and using one or two of the SATA3 ports on the motherboard for boot drive(s).
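
If it isn't obvious which drives hang off which controller, something along these lines should show it with the stock FreeBSD tools (just a sketch, no extra packages needed):

Code:
camcontrol devlist -v               # each ATA/SCSI bus and the disks attached to it
pciconf -lv | grep -B3 -i marvell   # confirms which PCI devices are the Marvell controllers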
 

nickwebha

Dabbler
Joined
Sep 15, 2017
Messages
12
it could be in your boot pool.

I have been meaning to replace my boot pool. The current devices have no blinking lights, so I have no idea which one it is when one dies.

Plus, I do not trust these mini USB drives. They seem to die for no reason pretty quickly. I have tried so many different brands. I even stuck them on small USB extension cables, thinking it might have been the heat, but that made no difference.
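
(For my own reference: the boot pool status should at least name which member has faulted, and glabel can map that label back to a device node. Pool name assumed to be the FreeNAS 11.x default.)

Code:
zpool status -v freenas-boot    # shows which mirror member is FAULTED/UNAVAIL
glabel status                   # maps gptid/... labels back to da0, da1, etc.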
 

colmconn

Contributor
Joined
Jul 28, 2015
Messages
174
I have been meaning to replace my boot pool. The current devices have no blinking lights, so I have no idea which one it is when one dies.

Plus, I do not trust these mini USB drives. They seem to die for no reason pretty quickly. I have tried so many different brands. I even stuck them on small USB extension cables, thinking it might have been the heat, but that made no difference.
You could try a small SSD in a SATA-to-USB enclosure. It would probably be more reliable than those little USB thumb drives.
 

nickwebha

Dabbler
Joined
Sep 15, 2017
Messages
12
You could try a small SSD in a SATA-to-USB enclosure. It would probably be more reliable than those little USB thumb drives.

Thanks for the suggestion. I might try that.

So I replaced the boot vdev (fresh install, fresh out-of-the-package USB drives). I am still getting the same crash (https://ibb.co/nPTNzmS).

If it is volume1 (my only volume) causing this, how do I get my data (TBs) off it if it will not stay up long enough to transfer the data? I have a second, identical remote backup server. However, due to a bone-headed mistake, it contains no backup at the moment. I was just starting to rebuild the backup over a slow link (it is going to take a few months, which is fine compared to the cost of a faster link plus contract).

Could this be something else? Bad RAM, perhaps? These sticks are hard to find and expensive, so I would like to confirm before replacing them. Perhaps I will try memtest?
 

colmconn

Contributor
Joined
Jul 28, 2015
Messages
174
Have you checked all your SATA and power connections? Made sure they're properly connected to both drive and MB? It might be worth reseating them. What does smartctl report for each drive? The machine is well cooled, right? The drives are well ventilated? If I were you, I'd focus on getting the most important data off it, the data you don't want to live without. Running memtest is not a bad idea.
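
A quick sweep of every drive at once could look like this (a sketch only, run under sh; the USB boot sticks may need -d sat or may not answer SMART queries at all):

Code:
for disk in $(sysctl -n kern.disks); do
    echo "=== ${disk} ==="
    smartctl -H -l error /dev/${disk}    # overall health verdict plus the drive's error log
done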
 

nickwebha

Dabbler
Joined
Sep 15, 2017
Messages
12
[My very late reply is due to backing up my important, then less important, data over a very slow link. Now that it is all backed up and safe, I can proceed.]

I have checked every SATA and power connection; they were fine, but I reseated them just to be sure. All S.M.A.R.T. tests (short and long) come back fine on all disks, run one at a time. The machine is very well cooled (the entire front of the case is fans, plus others) and sits in a 24/7 air-conditioned room. The free version of Memtest86, which I let run for 24+ hours, also comes back clean.

So things have changed slightly, although I have changed nothing. It is no longer rebooting but is still giving error messages. If I just hammer the thing with reads and writes over NFS, it works great. I wrote a tiny script which I ran from multiple machines, each reading random files, writing random files of various sizes (into the hundreds of GB), and randomly deleting said files. I did this for almost a week, while still using the system normally (minus iSCSI), and it never complained about anything (although it was slow, as would be expected).

Not so with iSCSI. I have a single iSCSI target from which I run multiple (~5) containers hosted on another, dedicated machine (Ubuntu 18.04.5). If I run a container which does virtually no reads or writes (say, a VPN), everything looks fine after it starts. If I leave that container up and running for a week, everything still looks fine. If, however, I start a container that does a lot of reads or a lot of writes (not always even a lot, maybe as low as 5 MB/s), FreeNAS iSCSI goes nuts (see attached images). Sometimes it takes a few minutes, other times it takes a whole day, but it always happens. Ubuntu ends up marking the iSCSI disk as read-only, and the containers start freaking out. Restarting everything (Ubuntu, containers) works, but obviously that is not a real solution.

I have done my best to figure out these error messages. They seem to have something to do with not being able to write to the disks fast enough and timing out, which sounds crazy to me, since NFS handles beasts of files without pausing or anything.
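
(In case anyone wants numbers, these are the kinds of per-disk views I can capture the next time it happens; volume1 is my pool name.)

Code:
gstat -p                    # per-disk busy %, queue depth, and ms/write, refreshed live
zpool iostat -v volume1 5   # per-vdev throughput and IOPS every 5 seconds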

Not sure what my next steps are here. If anyone could lend me some suggestions I would appreciate it because I have no ideas.

Thank you.
 

Attachments

  • IMG_20201114_145148.jpg (229.5 KB)
  • IMG_20201114_145240.jpg (245.2 KB)
  • IMG_20201114_150717.jpg (239 KB)
  • IMG_20201114_154837.jpg (253.9 KB)
Last edited:

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
The iSCSI volblocksize being significantly smaller by default (16K) than the dataset recordsize (128K) means you're looking at 8x (or more) the individual "write commands" to your disks.
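
To compare the two, you can read the properties straight off the pool; the zvol path below is just a placeholder for your actual iSCSI extent.

Code:
zfs get volblocksize volume1/iscsi-zvol   # 'iscsi-zvol' is a placeholder name
zfs get recordsize volume1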

Question - is buying an LSI HBA (or reflashed OEM equivalent) a possibility for testing, to remove the Marvell controller from the mix?
 

nickwebha

Dabbler
Joined
Sep 15, 2017
Messages
12
Question - is buying an LSI HBA (or reflashed OEM equivalent) a possibility for testing, to remove the Marvell controller from the mix?
New hardware is out of reach right now. If, God forbid, the disks start failing, I would just shut the thing down until I could replace them, whenever that might be.

The iSCSI volblocksize being significantly smaller by default (16K) than the dataset (128K) means you're looking at 8x (or more) the individual "write commands" to your disks.
My volblocksize for that zvol is 64K. Are you suggesting changing that to 128K?

I am going to brush up on blocksize, volblocksize, and ashift, and on whatever the equivalent Linux settings are.
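
For my own notes, these look like the commands to read the current values back (the zpool.cache path is the FreeNAS default; correct me if I am wrong):

Code:
zdb -U /data/zfs/zpool.cache -C volume1 | grep ashift     # ashift per vdev
diskinfo -v /dev/ada0 | egrep 'sectorsize|stripesize'     # logical/physical sector size of one disk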

Edit
The oddest thing to me is that when this started months ago, I had not changed a thing. I had not even logged into either system, and everything had been running great.
 
Last edited: