SOLVED TrueNAS keeps restarting every 6-10 min

Sawtaytoes · Dec 5, 2023

Just enabled logging of system boots:

This should help me track down when they occur.

I also had it turn off on AC Power Loss. That should also help identify what just happened.

On top of this, I've removed the NIC and 16e SAS controller since the primary data pool only needs 5x24i cards to operate.

---

Are these console errors related to my issue? I just noticed them after doing this UEFI update:

Sawtaytoes · Dec 5, 2023

Update again.

After removing the NIC and 16e card, I upgraded the UEFI but ran into issues, possibly from a config change relating to which drives were SATA vs NVMe. Either way, it wouldn't boot into the UEFI menu anymore, so I upgraded it again (which wipes the config), and now it's booting.

First thing I did was import my main zpool and start a scrub. I had it going for an hour before I said "lemme copy over some files now", and it died as soon as I started copying; this time, to the built-in 10Gb NIC.

I finally got some sort of health report this time:

Is it the PSUs then?

Why does this only happen when I copy files over the network? My PC is doing a backup operation every 12 hours and on idle, so I wonder if that's also triggering it.

Sawtaytoes · Dec 5, 2023

Swapped the PSUs with those in another server and got a bunch of error messages (time is now 1 hour ahead):

This proves a few things:

The system can run on 1 PSU even under load (`zpool scrub`).
It logs error messages if the PSUs get disconnected. If both get disconnected, that's something different though.

I'm still not sure if this is a software or hardware issue. I would assume it's a hardware issue, but it looks like something forced a reset, as if the reset pins on my motherboard were physically shorted, yet only when certain things happen in software. That's the weird thing about it.

Sawtaytoes · Dec 5, 2023

I'm getting closer to the actual issue.

Even after swapping the PSUs, copying files over SMB killed it again. It was running fine with just the scrub though.

Since scrub works, that means reads work. That doesn't test writes at all!

I tried a `fio` benchmark to see what happens and guess what? It rebooted again! Now we're seeing a pattern.

Code:

fio --ioengine=libaio --filename=/mnt/Bunnies/performanceTest --direct=1 --sync=0 --rw=readwrite --bs=16M --runtime=10 --size=20G --time_based --name=fio
rm /mnt/Bunnies/performanceTest

Any idea why only writes, not reads, are causing an entire system meltdown?
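
To check that, the read and write halves can be run separately instead of together with `--rw=readwrite`. Here's a rough sketch: `fio_cmd` is a made-up helper name (not part of fio) that only prints the command so it can be reviewed before pasting, and the flags are the same as above with only `--rw` and `--name` changed. One caveat: fio lays out the test file first if it doesn't exist yet, so a read run against a missing file still writes once.

```shell
# Hypothetical helper: print (not run) a fio command for one I/O direction.
# $1 = read | write, $2 = test file path. Flags mirror the command above.
fio_cmd() {
    mode=$1
    file=$2
    case "$mode" in
        read|write) ;;
        *) echo "usage: fio_cmd read|write FILE" >&2; return 1 ;;
    esac
    echo "fio --ioengine=libaio --filename=$file --direct=1 --sync=0" \
         "--rw=$mode --bs=16M --runtime=10 --size=20G --time_based --name=fio-$mode"
}
```

Running `fio_cmd read /mnt/Bunnies/performanceTest` and then `fio_cmd write /mnt/Bunnies/performanceTest` prints the two commands to paste. If only the second one triggers the reboot, that would point at the write path rather than at load in general.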

nKk · Dec 5, 2023

Can you try the fio test on the boot pool and on "/tmp" to check whether every write causes a system reboot, or only writes to your pool?

Sawtaytoes · Dec 5, 2023

What drive is `/tmp`?

I made the script a bit gentler this time:

Code:

# /tmp
fio --ioengine=libaio --filename=/tmp/performanceTest --sync=0 --rw=readwrite --bs=16M --runtime=10 --size=1G --time_based --name=fio
rm /tmp/performanceTest
   READ: bw=5843MiB/s (6126MB/s), 5843MiB/s-5843MiB/s (6126MB/s-6126MB/s), io=57.1GiB (61.3GB), run=10001-10001msec
  WRITE: bw=6055MiB/s (6350MB/s), 6055MiB/s-6055MiB/s (6350MB/s-6350MB/s), io=59.1GiB (63.5GB), run=10001-10001msec

# boot-pool
fio --ioengine=libaio --filename=./performanceTest --sync=0 --rw=readwrite --bs=16M --runtime=10 --size=1G --time_based --name=fio
rm ./performanceTest
   READ: bw=6664MiB/s (6988MB/s), 6664MiB/s-6664MiB/s (6988MB/s-6988MB/s), io=65.1GiB (69.9GB), run=10002-10002msec
  WRITE: bw=6839MiB/s (7171MB/s), 6839MiB/s-6839MiB/s (7171MB/s-7171MB/s), io=66.8GiB (71.7GB), run=10002-10002msec

# TrueNAS-Apps
fio --ioengine=libaio --filename=/mnt/TrueNAS-Apps/performanceTest --sync=0 --rw=readwrite --bs=16M --runtime=10 --size=1G --time_based --name=fio
rm /mnt/TrueNAS-Apps/performanceTest
   READ: bw=6134MiB/s (6432MB/s), 6134MiB/s-6134MiB/s (6432MB/s-6432MB/s), io=59.9GiB (64.3GB), run=10001-10001msec
  WRITE: bw=6316MiB/s (6623MB/s), 6316MiB/s-6316MiB/s (6623MB/s-6623MB/s), io=61.7GiB (66.2GB), run=10001-10001msec

Looks like it's hitting memory, but that's fine right? It should still be writing to the pool eventually right?

Tried doing this on the Bunnies pool (all SSDs) and this is what happened right before it died:

Sawtaytoes · Dec 5, 2023

It's literally any write to this pool. `rm` didn't cause problems, but `touch` did:

Sawtaytoes · Dec 5, 2023

This is someone else who had the same, or a very similar, issue as me:

And started to write/read from it, that creates the panic, but if the pool is mounted in read-only mode there is no panic/reboot

ZFS - zpool keeps rebooting the server

I have a server with a mix of pools ssd & SATA disks, I started to notice that the server began to reboot frequently, I manage to boot in single mode, and notice that no more reboots, then I boot the pool with the oldest SATA disks and after a while, it restarted (no logs in console) also no...

forums.freebsd.org

I think this is the issue I'm running into, but why would it randomly start a few days ago?

FreeBSD 13, after importing pool system panics · Issue #14973 · openzfs/zfs

System information FreeBSD - 13.2-RELEASE releng/13.2-n254617-525ecfdad597 GENERIC amd64 $ zfs version zfs-2.1.9-FreeBSD_g92e0d9d18 zfs-kmod-2.1.9-FreeBSD_g92e0d9d18 Describe the problem you're obs...

github.com

Sawtaytoes · Dec 6, 2023

I got a few ZFS error emails from TrueNAS:

ZFS has detected that a device was removed.

impact: Fault tolerance of the pool may be compromised.
eid: 11
class: statechange
state: UNAVAIL
host: storeman
time: 2023-12-06 01:29:52-0600
vpath: /dev/sdak2
vguid: 0x83BF13979724B965
pool: Bunnies (0x3B56F42B4AAFB1A2)

I wish it would send me serial numbers rather than GUIDs. I dunno which GUID goes to which drive, nor which drive `/dev/sdak` was at the time, and I can't verify it anymore because the drive letters change every reboot.
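
A lookup like this could be captured while the device name is still valid. This is only a sketch: `disk_of` and `serial_for` are hypothetical helper names, and it assumes `lsblk` reports a SERIAL column for these drives.

```shell
# Strip the directory and trailing partition number from a vpath such as
# /dev/sdak2 -> sdak. Only meant for sdX-style names, not nvme0n1p1.
disk_of() {
    name=${1##*/}
    echo "$name" | sed 's/[0-9]*$//'
}

# Read "NAME SERIAL" pairs on stdin and print the serial for device $1.
serial_for() {
    awk -v dev="$1" '$1 == dev { print $2 }'
}
```

On the live system that would be `lsblk -dn -o NAME,SERIAL | serial_for "$(disk_of /dev/sdak2)"`, run before the next reboot shuffles the letters. The symlinks under `/dev/disk/by-id` carry the same mapping.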

I got this email 4 times (for 4 separate drives), which means one of the SAS controller cables possibly had issues as I was moving things around.

Also strange that the errors came in 12:09a, 12:57a, 1:01a, and 1:30a. Very random times for these drives to randomly start failing. Is it that these four drives caused the system shutdowns I experienced earlier? Or are these drives the last ones to be written to before the system gives up and forces a reboot?

EDIT: There were more of these errors. 6 so far total as I was skimming emails; all on different drives, but after a reboot, the drive letters change, so it still could have been the same drive each time.

Sawtaytoes · Dec 6, 2023

I did a Memtest and got no errors. I skipped a few tests near the end there:

nKk · Dec 6, 2023

Perhaps SAS cables or overheating HBA.
Do you close the case after each manipulation, or do you test with an open case? With so many HBAs in the system, if the case is open the airflow can't reach the HBAs and they can overheat.

Sawtaytoes · Dec 6, 2023

nKk said:
Perhaps SAS cables or overheating HBA.
Do you close the case after each manipulation, or do you test with an open case? With so many HBAs in the system, if the case is open the airflow can't reach the HBAs and they can overheat.

Yeah, I close it back to reduce heat.

This time when I started my NAS, a bunch of drives weren't showing up. I found 7 of them failed, removed, or unavailable, all in the same group; possibly on the same SAS controller.

Sawtaytoes · Dec 6, 2023

I pulled 2 24i SAS cards and re-added the 16e card. Then I hooked up 3 x 24i SAS cards.

This is a start:

I find it distressing that both drives in a mirror are resilvering together:

None of these drives needed resilvering until something happened in the previous startup. It was a clean startup too, from when I shut down the system myself to check something.

I'm going to assume this was related to a SAS controller, but even after 10-15 minutes, the system still rebooted.

Something is still triggering it. It could be that multiple SAS controllers are bad, but I have no way of knowing. Any ideas?

I'm really upset this was working for months and suddenly decided to throw a fit and start rebooting itself.

Without touching anything, the system is currently resilvering for hopefully the next 6 hours. I'm 48 minutes and counting. Usually around an hour, it will reboot again, and I think that's because I had hourly snapshots. Those are disabled now though.

Once I can get this pool back to a working state, I can go back to finding out why writing to this pool causes issues.

Sawtaytoes · Dec 6, 2023

After the 12 hours of resilver and scrub completed, no errors.

My next plan is to use 5 SAS expanders and 1 SAS card to track down which is/are bad.

36 of my SSDs are unused right now, and I can use them to set up a new pool (exporting my main one).

Should be pretty quick to figure out which of the 5 SAS controllers is problematic at that point provided my SAS expanders are fine.

The rebooting on write is still suspect.

Sawtaytoes · Dec 6, 2023

I created a new pool and have written to it successfully twice now in different zpool configurations.

There was one thing I did before this issue began that I'd forgotten:
I rearranged all the drives so there were no gaps between which were in my main SSD pool and which were currently extra.

That tells me it's possible certain ports are problematic on certain SAS controllers. Also, some of my mirrors are filled up more than others, so I bet zero bytes are being written to the fuller ones, which means I can pinpoint which drives are more active than others and potentially problematic.

After I redid all the +5V power and got all 125 drives in here, I rearranged them again. It makes sense then that the SAS controllers could be the issue. It's possible I completely avoided one in this whole transition because of drive positions and how filled up each drive was in the zpool.

But... The reset only happens when writing, and it really seems like an OS issue like ZFS had an error and panicked.
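
If it is a panic, the previous boot's kernel log might still show it. A minimal sketch, assuming a persistent journal; `panic_lines` is a hypothetical helper, and the pattern list is my guess at the usual wording, not an exhaustive one.

```shell
# Keep only lines that look like a kernel or ZFS/SPL panic.
# Reads log text on stdin; prints nothing (and still succeeds) on no match.
panic_lines() {
    grep -E 'Kernel panic|PANIC at |VERIFY[0-9]*\(|BUG:|Oops' || true
}
```

`journalctl -k -b -1 | panic_lines` would scan the boot before the current one. With a hard reset the last few seconds may never reach disk, so an empty result doesn't rule a panic out.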

Sawtaytoes · Dec 6, 2023

Here's where I'm at now. STILL completely confused:

My main zpool (Bunnies) causes reboots only when writing to it. It would even reboot if I `touch`'d a new file, but now it takes more than that such as a 20GiB `fio` test.
I can create a new SSD zpool, and it works fine with both reads and writes. I even moved around 4 SAS ports and was still successful.
My HDD pool in another chassis (connected via 3 ports on the 16e), uses 2 SSDs for metadata, and writes to it work fine.

My main zpool has 4 x Intel Optane NVMe drives for metadata. Writes should hit those first, and then the SSDs. And since I was potentially having PCIe issues, I wonder if those drives are related. Maybe one of them went bad?

I tried to run SMART tests on them, but TrueNAS says SMART tests can't run on these drives.

I wanna write some sort of data to these drives to see if I can force a system reboot. TrueNAS should make 2 partitions, so I should be able to format and write to one to test it right? Or is there another way to test?
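
Short of writing to the drives, the NVMe health log can be read directly, assuming `nvme-cli` is installed. A sketch that flags the non-zero error counters; `nvme_flags` is a hypothetical helper, and the field names are assumed to match what `nvme smart-log` prints on this version.

```shell
# Read `nvme smart-log` style "key : value" lines on stdin and print the
# counters below when they are non-zero. Field names are an assumption;
# adjust them if your nvme-cli version labels them differently.
nvme_flags() {
    awk -F: '
        {
            key = $1; val = $2
            gsub(/[ \t]/, "", key)
            gsub(/[ \t]/, "", val)
        }
        (key == "critical_warning" || key == "media_errors" ||
         key == "num_err_log_entries") && val + 0 != 0 { print key "=" val }
    '
}
```

`nvme smart-log /dev/nvme1 | nvme_flags` prints nothing when all three counters are zero. `smartctl -a /dev/nvme1` shows similar health data even when scheduled SMART tests aren't available.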

Another thing I wanna do is test each SAS controller card 1-by-1, but so far, it seems to only occur when writing to my main zpool; so I really have no way of knowing the actual issue.

Sawtaytoes · Dec 6, 2023

After searching around, I found an unanswered StackOverflow question where someone asked to run:

Code:

journalctl | egrep 'kernel.*nvme'

Look what I found:

Code:

Dec 06 02:59:51 storeman kernel: nvme1n1: detected capacity change from 1875385008 to 0

I wonder if it's related.
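
To pull just those events out of the journal, something like this sketch could work (`capacity_drops` and `sectors_to_bytes` are hypothetical helper names). The kernel reports capacity in 512-byte sectors, so 1875385008 works out to roughly 960 GB, which lines up with the 960GB Optane drives.

```shell
# Print "date time device" for every NVMe namespace whose capacity dropped to 0.
# Reads journal text on stdin.
capacity_drops() {
    grep -E 'kernel: nvme[0-9]+n[0-9]+: detected capacity change from [0-9]+ to 0$' |
        awk '{ sub(":", "", $6); print $1, $2, $3, $6 }'
}

# Convert a count of 512-byte sectors to bytes.
sectors_to_bytes() {
    echo $(( $1 * 512 ))
}
```

`journalctl | capacity_drops` prints `Dec 06 02:59:51 nvme1n1` for the line above, and `sectors_to_bytes 1875385008` prints `960197124096`.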

Here's the rest of that log:

Code:

# journalctl | egrep 'kernel.*nvme'
Dec 06 02:56:09 storeman kernel: Command line: BOOT_IMAGE=/ROOT/23.10.0.1@/boot/vmlinuz-6.1.55-production+truenas root=ZFS=boot-pool/ROOT/23.10.0.1 ro libata.allow_tpm=1 amd_iommu=on iommu=pt kvm_amd.npt=1 kvm_amd.avic=1 intel_iommu=on zfsforce=1 nvme_core.multipath=N
Dec 06 02:56:09 storeman kernel: Kernel command line: BOOT_IMAGE=/ROOT/23.10.0.1@/boot/vmlinuz-6.1.55-production+truenas root=ZFS=boot-pool/ROOT/23.10.0.1 ro libata.allow_tpm=1 amd_iommu=on iommu=pt kvm_amd.npt=1 kvm_amd.avic=1 intel_iommu=on zfsforce=1 nvme_core.multipath=N
Dec 06 02:56:09 storeman kernel: nvme nvme0: pci function 0000:02:00.0
Dec 06 02:56:09 storeman kernel: nvme nvme1: pci function 0000:03:00.0
Dec 06 02:56:09 storeman kernel: nvme nvme2: pci function 0000:04:00.0
Dec 06 02:56:09 storeman kernel: nvme nvme3: pci function 0000:05:00.0
Dec 06 02:56:09 storeman kernel: nvme nvme3: 31/0/0 default/read/poll queues
Dec 06 02:56:09 storeman kernel: nvme nvme0: 31/0/0 default/read/poll queues
Dec 06 02:56:09 storeman kernel: nvme nvme1: 31/0/0 default/read/poll queues
Dec 06 02:56:09 storeman kernel: nvme nvme2: 31/0/0 default/read/poll queues
Dec 06 02:56:09 storeman kernel:  nvme1n1: p1
Dec 06 02:56:09 storeman kernel:  nvme2n1: p1
Dec 06 02:56:09 storeman kernel:  nvme3n1: p1
Dec 06 02:56:09 storeman kernel:  nvme0n1: p1
Dec 06 02:59:51 storeman kernel: nvme1n1: detected capacity change from 1875385008 to 0
Dec 06 02:59:56 storeman kernel: nvme nvme1: pci function 0000:03:00.0
Dec 06 02:59:56 storeman kernel: nvme 0000:03:00.0: enabling device (0000 -> 0002)
Dec 06 02:59:56 storeman kernel: nvme nvme1: 31/0/0 default/read/poll queues
Dec 06 02:59:56 storeman kernel:  nvme1n1: p1
Dec 06 03:18:56 storeman kernel: Command line: BOOT_IMAGE=/ROOT/23.10.0.1@/boot/vmlinuz-6.1.55-production+truenas root=ZFS=boot-pool/ROOT/23.10.0.1 ro libata.allow_tpm=1 amd_iommu=on iommu=pt kvm_amd.npt=1 kvm_amd.avic=1 intel_iommu=on zfsforce=1 nvme_core.multipath=N
Dec 06 03:18:56 storeman kernel: Kernel command line: BOOT_IMAGE=/ROOT/23.10.0.1@/boot/vmlinuz-6.1.55-production+truenas root=ZFS=boot-pool/ROOT/23.10.0.1 ro libata.allow_tpm=1 amd_iommu=on iommu=pt kvm_amd.npt=1 kvm_amd.avic=1 intel_iommu=on zfsforce=1 nvme_core.multipath=N
Dec 06 03:18:56 storeman kernel: nvme nvme0: pci function 0000:02:00.0
Dec 06 03:18:56 storeman kernel: nvme nvme1: pci function 0000:03:00.0
Dec 06 03:18:56 storeman kernel: nvme nvme2: pci function 0000:04:00.0
Dec 06 03:18:56 storeman kernel: nvme nvme3: pci function 0000:05:00.0
Dec 06 03:18:56 storeman kernel: nvme nvme0: 31/0/0 default/read/poll queues
Dec 06 03:18:56 storeman kernel: nvme nvme3: 31/0/0 default/read/poll queues
Dec 06 03:18:56 storeman kernel: nvme nvme1: 31/0/0 default/read/poll queues
Dec 06 03:18:56 storeman kernel: nvme nvme2: 31/0/0 default/read/poll queues
Dec 06 03:18:56 storeman kernel:  nvme1n1: p1
Dec 06 03:18:56 storeman kernel:  nvme2n1: p1
Dec 06 03:18:56 storeman kernel:  nvme0n1: p1
Dec 06 03:18:56 storeman kernel:  nvme3n1: p1
Dec 06 03:29:03 storeman kernel: Kernel command line: BOOT_IMAGE=/ROOT/23.10.0.1@/boot/vmlinuz-6.1.55-production+truenas root=ZFS=boot-pool/ROOT/23.10.0.1 ro libata.allow_tpm=1 amd_iommu=on iommu=pt kvm_amd.npt=1 kvm_amd.avic=1 intel_iommu=on zfsforce=1 nvme_core.multipath=N
Dec 06 03:29:03 storeman kernel: nvme nvme0: pci function 0000:02:00.0
Dec 06 03:29:03 storeman kernel: nvme nvme1: pci function 0000:03:00.0
Dec 06 03:29:03 storeman kernel: nvme nvme2: pci function 0000:04:00.0
Dec 06 03:29:03 storeman kernel: nvme nvme3: pci function 0000:05:00.0
Dec 06 03:29:03 storeman kernel: nvme nvme0: 31/0/0 default/read/poll queues
Dec 06 03:29:03 storeman kernel: nvme nvme1: 31/0/0 default/read/poll queues
Dec 06 03:29:03 storeman kernel: nvme nvme2: 31/0/0 default/read/poll queues
Dec 06 03:29:03 storeman kernel: nvme nvme3: 31/0/0 default/read/poll queues
Dec 06 03:29:03 storeman kernel:  nvme2n1: p1
Dec 06 03:29:03 storeman kernel:  nvme3n1: p1
Dec 06 03:29:03 storeman kernel:  nvme0n1: p1
Dec 06 03:29:03 storeman kernel:  nvme1n1: p1
Dec 06 04:05:58 storeman kernel: Kernel command line: BOOT_IMAGE=/ROOT/23.10.0.1@/boot/vmlinuz-6.1.55-production+truenas root=ZFS=boot-pool/ROOT/23.10.0.1 ro libata.allow_tpm=1 amd_iommu=on iommu=pt kvm_amd.npt=1 kvm_amd.avic=1 intel_iommu=on zfsforce=1 nvme_core.multipath=N
Dec 06 04:05:58 storeman kernel: nvme nvme0: pci function 0000:02:00.0
Dec 06 04:05:58 storeman kernel: nvme nvme1: pci function 0000:03:00.0
Dec 06 04:05:58 storeman kernel: nvme nvme2: pci function 0000:04:00.0
Dec 06 04:05:58 storeman kernel: nvme nvme3: pci function 0000:05:00.0
Dec 06 04:05:58 storeman kernel: nvme nvme0: 31/0/0 default/read/poll queues
Dec 06 04:05:58 storeman kernel: nvme nvme1: 31/0/0 default/read/poll queues
Dec 06 04:05:58 storeman kernel: nvme nvme2: 31/0/0 default/read/poll queues
Dec 06 04:05:58 storeman kernel: nvme nvme3: 31/0/0 default/read/poll queues
Dec 06 04:05:58 storeman kernel:  nvme2n1: p1
Dec 06 04:05:58 storeman kernel:  nvme3n1: p1
Dec 06 04:05:58 storeman kernel:  nvme0n1: p1
Dec 06 04:05:58 storeman kernel:  nvme1n1: p1
Dec 06 15:02:31 storeman kernel: Kernel command line: BOOT_IMAGE=/ROOT/23.10.0.1@/boot/vmlinuz-6.1.55-production+truenas root=ZFS=boot-pool/ROOT/23.10.0.1 ro libata.allow_tpm=1 amd_iommu=on iommu=pt kvm_amd.npt=1 kvm_amd.avic=1 intel_iommu=on zfsforce=1 nvme_core.multipath=N
Dec 06 15:02:32 storeman kernel: nvme nvme0: pci function 0000:02:00.0
Dec 06 15:02:32 storeman kernel: nvme nvme1: pci function 0000:03:00.0
Dec 06 15:02:32 storeman kernel: nvme nvme2: pci function 0000:04:00.0
Dec 06 15:02:32 storeman kernel: nvme nvme3: pci function 0000:05:00.0
Dec 06 15:02:32 storeman kernel: nvme nvme0: 31/0/0 default/read/poll queues
Dec 06 15:02:32 storeman kernel: nvme nvme2: 31/0/0 default/read/poll queues
Dec 06 15:02:32 storeman kernel: nvme nvme3: 31/0/0 default/read/poll queues
Dec 06 15:02:32 storeman kernel: nvme nvme1: 31/0/0 default/read/poll queues
Dec 06 15:02:32 storeman kernel:  nvme1n1: p1
Dec 06 15:02:32 storeman kernel:  nvme3n1: p1
Dec 06 15:02:32 storeman kernel:  nvme0n1: p1
Dec 06 15:02:32 storeman kernel:  nvme2n1: p1
Dec 06 15:48:25 storeman kernel: Kernel command line: BOOT_IMAGE=/ROOT/23.10.0.1@/boot/vmlinuz-6.1.55-production+truenas root=ZFS=boot-pool/ROOT/23.10.0.1 ro libata.allow_tpm=1 amd_iommu=on iommu=pt kvm_amd.npt=1 kvm_amd.avic=1 intel_iommu=on zfsforce=1 nvme_core.multipath=N
Dec 06 15:48:25 storeman kernel: nvme nvme0: pci function 0000:02:00.0
Dec 06 15:48:25 storeman kernel: nvme nvme1: pci function 0000:03:00.0
Dec 06 15:48:25 storeman kernel: nvme nvme2: pci function 0000:04:00.0
Dec 06 15:48:25 storeman kernel: nvme nvme3: pci function 0000:05:00.0
Dec 06 15:48:25 storeman kernel: nvme nvme0: 31/0/0 default/read/poll queues
Dec 06 15:48:25 storeman kernel: nvme nvme1: 31/0/0 default/read/poll queues
Dec 06 15:48:25 storeman kernel:  nvme1n1: p1
Dec 06 15:48:25 storeman kernel: nvme nvme2: 31/0/0 default/read/poll queues
Dec 06 15:48:25 storeman kernel:  nvme2n1: p1
Dec 06 15:48:25 storeman kernel:  nvme0n1: p1
Dec 06 15:48:25 storeman kernel: nvme nvme3: 31/0/0 default/read/poll queues
Dec 06 15:48:25 storeman kernel:  nvme3n1: p1
Dec 06 16:47:31 storeman kernel: Command line: BOOT_IMAGE=/ROOT/23.10.0.1@/boot/vmlinuz-6.1.55-production+truenas root=ZFS=boot-pool/ROOT/23.10.0.1 ro libata.allow_tpm=1 amd_iommu=on iommu=pt kvm_amd.npt=1 kvm_amd.avic=1 intel_iommu=on zfsforce=1 nvme_core.multipath=N
Dec 06 16:47:31 storeman kernel: Kernel command line: BOOT_IMAGE=/ROOT/23.10.0.1@/boot/vmlinuz-6.1.55-production+truenas root=ZFS=boot-pool/ROOT/23.10.0.1 ro libata.allow_tpm=1 amd_iommu=on iommu=pt kvm_amd.npt=1 kvm_amd.avic=1 intel_iommu=on zfsforce=1 nvme_core.multipath=N
Dec 06 16:47:31 storeman kernel: nvme nvme0: pci function 0000:02:00.0
Dec 06 16:47:31 storeman kernel: nvme nvme1: pci function 0000:03:00.0
Dec 06 16:47:31 storeman kernel: nvme nvme2: pci function 0000:04:00.0
Dec 06 16:47:31 storeman kernel: nvme nvme3: pci function 0000:05:00.0
Dec 06 16:47:31 storeman kernel: nvme nvme2: 31/0/0 default/read/poll queues
Dec 06 16:47:31 storeman kernel: nvme nvme1: 31/0/0 default/read/poll queues
Dec 06 16:47:31 storeman kernel: nvme nvme0: 31/0/0 default/read/poll queues
Dec 06 16:47:31 storeman kernel: nvme nvme3: 31/0/0 default/read/poll queues
Dec 06 16:47:31 storeman kernel:  nvme3n1: p1
Dec 06 16:47:31 storeman kernel:  nvme0n1: p1
Dec 06 16:47:31 storeman kernel:  nvme2n1: p1
Dec 06 16:47:31 storeman kernel:  nvme1n1: p1
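
Each boot logs one `Kernel command line:` entry, so the reboot times can be pulled from the same journal. A small sketch; `boot_times` is a hypothetical helper name.

```shell
# Print the timestamp of every "Kernel command line:" entry, one per boot.
# Reads journal text on stdin.
boot_times() {
    awk '/ kernel: Kernel command line: / { print $1, $2, $3 }'
}
```

Fed the excerpt above, it lists seven boots on Dec 06: 02:56:09, 03:18:56, 03:29:03, 04:05:58, 15:02:31, 15:48:25, and 16:47:31. `journalctl --list-boots` gives a similar summary without any parsing.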

Not sure if that capacity change means anything, because the drive clearly reports a size here:

Etorix · Dec 6, 2023

An Optane metadata vdev with a SSD pool? That's borderline crazy.

You have not fully described your system and how everything is wired and powered. From the symptoms, it could be that there's enough power for reading but not for writing.

Sawtaytoes · Dec 7, 2023

This was enough to cause it to reboot:

A few seconds after performing this task, that's when it rebooted. Is there something telling it "after writing a file, go ahead and force-reboot the system"?

Etorix said:
An Optane metadata vdev with a SSD pool? That's borderline crazy.

You have not fully described your system and how everything is wired and powered. From the symptoms, it could be that there's enough power for reading but not for writing.

Hardware Specs

Chassis: 45Drives Storinator XL60.
Motherboard: Supermicro H12SSL-NT.
CPU: AMD Epyc 7313P (16-core).
RAM: 256GB -> 8 sticks of 32GB DDR4 3200.
PCIe1: ConnectX-6 -> PCIe 4.0 x4 -> 2-port 25Gb SFP28.
PCIe2: LSI 9305 24i -> PCIe 3.0 x8 (all 6 ports plugged into direct-attach backplanes with 24 SSDs).
PCIe3: LSI 9305 24i -> PCIe 3.0 x8 (all 6 ports plugged into direct-attach backplanes with 24 SSDs).
PCIe4: LSI 9305 24i -> PCIe 3.0 x8 (all 6 ports plugged into direct-attach backplanes with 24 SSDs).
PCIe5: LSI 9305 24i -> PCIe 3.0 x8 (all 6 ports plugged into direct-attach backplanes with 24 SSDs).
PCIe6: LSI 9305 24i -> PCIe 3.0 x8 (all 6 ports plugged into direct-attach backplanes with 24 SSDs).
PCIe7: LSI 9305 16e -> PCIe 3.0 x8 (3 ports plugged into SAS expanders in another Storinator XL60 chassis with 60 HDDs).
NVMe1: 960GB Intel Optane 905p -> PCIe 3.0 x4
NVMe2: 960GB Intel Optane 905p -> PCIe 3.0 x4
SlimSAS1: SlimSAS to 2 x U.2 -> 2 x 960GB Intel Optane 905p -> PCIe 3.0 x4
SlimSAS2: SlimSAS to 2 x miniSAS HD (only 6 SSDs connected, but both cables plugged in)

There are no free PCIe ports in this system; although, if I wanted to use bifurcation, there are an additional 20 lanes on the x16 PCIe ports.

zpool Specs

My main Bunnies zpool is gonna change to a multi-vdev dRAID configuration once I figure out what's wrong. When copying snapshots to my HDD pool, some were missed, so I need to copy those again.

boot-pool -> 2 x 60GB Corsair Force SSDs
TrueNAS-Apps -> 2 x 2TB Crucial MX500 SSDs
Bunnies
-> 80 x 2TB and 4TB Crucial MX500 SSDs and 4 x 960GB Intel Optane 905p
-> This is where I store my files.
-> I back up this pool to Wolves and also to another zpool on an offsite NAS.
Wolves
-> 60 x 10TB HGST Helium HDDs and 2 x 2TB Crucial MX500 SSDs as metadata.
-> All data in this pool is a backup of Bunnies and another pool on my offsite NAS.

Code:

# zpool status -vL
  pool: Bunnies
 state: ONLINE
  scan: scrub repaired 0B in 08:58:36 with 0 errors on Wed Dec  6 14:05:11 2023
remove: Removal of vdev 1 copied 1.79T in 1h37m, completed on Tue Oct 31 07:02:38 2023
        955M memory used for removed device mappings
config:

        NAME           STATE     READ WRITE CKSUM
        Bunnies        ONLINE       0     0     0
          mirror-0     ONLINE       0     0     0
            sdch2      ONLINE       0     0     0
            sdcp2      ONLINE       0     0     0
          mirror-2     ONLINE       0     0     0
            sdfc2      ONLINE       0     0     0
            sdw2       ONLINE       0     0     0
          mirror-3     ONLINE       0     0     0
            sdg2       ONLINE       0     0     0
            sdi2       ONLINE       0     0     0
          mirror-4     ONLINE       0     0     0
            sdfj2      ONLINE       0     0     0
            sde2       ONLINE       0     0     0
          mirror-5     ONLINE       0     0     0
            sdfo2      ONLINE       0     0     0
            sdaa2      ONLINE       0     0     0
          mirror-8     ONLINE       0     0     0
            sdae2      ONLINE       0     0     0
            sdff2      ONLINE       0     0     0
          mirror-11    ONLINE       0     0     0
            sdfi2      ONLINE       0     0     0
            sdz2       ONLINE       0     0     0
          mirror-15    ONLINE       0     0     0
            sdfp2      ONLINE       0     0     0
            sdfe2      ONLINE       0     0     0
          mirror-16    ONLINE       0     0     0
            sdh2       ONLINE       0     0     0
            sdfr2      ONLINE       0     0     0
          mirror-17    ONLINE       0     0     0
            sdv2       ONLINE       0     0     0
            sdfg2      ONLINE       0     0     0
          mirror-18    ONLINE       0     0     0
            sdfh2      ONLINE       0     0     0
            sdfd2      ONLINE       0     0     0
          mirror-20    ONLINE       0     0     0
            sdf2       ONLINE       0     0     0
            sdfq2      ONLINE       0     0     0
          mirror-24    ONLINE       0     0     0
            sdk2       ONLINE       0     0     0
            sdcq2      ONLINE       0     0     0
          mirror-26    ONLINE       0     0     0
            sdad2      ONLINE       0     0     0
            sdd2       ONLINE       0     0     0
          mirror-27    ONLINE       0     0     0
            sdco2      ONLINE       0     0     0
            sdca2      ONLINE       0     0     0
          mirror-28    ONLINE       0     0     0
            sdcn2      ONLINE       0     0     0
            sdal2      ONLINE       0     0     0
          mirror-29    ONLINE       0     0     0
            sdce2      ONLINE       0     0     0
            sdn2       ONLINE       0     0     0
          mirror-31    ONLINE       0     0     0
            sdeh2      ONLINE       0     0     0
            sdan2      ONLINE       0     0     0
          mirror-32    ONLINE       0     0     0
            sdcg2      ONLINE       0     0     0
            sda2       ONLINE       0     0     0
          mirror-33    ONLINE       0     0     0
            sdc2       ONLINE       0     0     0
            sdb2       ONLINE       0     0     0
          mirror-34    ONLINE       0     0     0
            sdcb2      ONLINE       0     0     0
            sdp2       ONLINE       0     0     0
          mirror-35    ONLINE       0     0     0
            sdcf2      ONLINE       0     0     0
            sdq2       ONLINE       0     0     0
          mirror-36    ONLINE       0     0     0
            sdfb2      ONLINE       0     0     0
            sdao2      ONLINE       0     0     0
          mirror-37    ONLINE       0     0     0
            sdai2      ONLINE       0     0     0
            sdcv2      ONLINE       0     0     0
          mirror-38    ONLINE       0     0     0
            sdcl2      ONLINE       0     0     0
            sdaq2      ONLINE       0     0     0
          mirror-39    ONLINE       0     0     0
            sdap2      ONLINE       0     0     0
            sdfk2      ONLINE       0     0     0
          mirror-40    ONLINE       0     0     0
            sdfn2      ONLINE       0     0     0
            sdfl2      ONLINE       0     0     0
          mirror-41    ONLINE       0     0     0
            sdfm2      ONLINE       0     0     0
            sdcj2      ONLINE       0     0     0
          mirror-42    ONLINE       0     0     0
            sdcm2      ONLINE       0     0     0
            sdck2      ONLINE       0     0     0
          mirror-43    ONLINE       0     0     0
            sdj2       ONLINE       0     0     0
            sdah2      ONLINE       0     0     0
          mirror-44    ONLINE       0     0     0
            sdde2      ONLINE       0     0     0
            sdcu2      ONLINE       0     0     0
          mirror-45    ONLINE       0     0     0
            sdcs2      ONLINE       0     0     0
            sdeo2      ONLINE       0     0     0
          mirror-46    ONLINE       0     0     0
            sdaj2      ONLINE       0     0     0
            sdu2       ONLINE       0     0     0
          mirror-47    ONLINE       0     0     0
            sdt2       ONLINE       0     0     0
            sdac2      ONLINE       0     0     0
          mirror-48    ONLINE       0     0     0
            sdaf2      ONLINE       0     0     0
            sdcd2      ONLINE       0     0     0
          mirror-49    ONLINE       0     0     0
            sdci2      ONLINE       0     0     0
            sdak2      ONLINE       0     0     0
          mirror-50    ONLINE       0     0     0
            sdam2      ONLINE       0     0     0
            sdcc2      ONLINE       0     0     0
          mirror-51    ONLINE       0     0     0
            sdo2       ONLINE       0     0     0
            sdep2      ONLINE       0     0     0
          mirror-52    ONLINE       0     0     0
            sdr2       ONLINE       0     0     0
            sdag2      ONLINE       0     0     0
          mirror-53    ONLINE       0     0     0
            sdl2       ONLINE       0     0     0
            sdm2       ONLINE       0     0     0
        special
          mirror-13    ONLINE       0     0     0
            nvme1n1p1  ONLINE       0     0     0
            nvme0n1p1  ONLINE       0     0     0
          mirror-14    ONLINE       0     0     0
            nvme3n1p1  ONLINE       0     0     0
            nvme2n1p1  ONLINE       0     0     0
        spares
          sds2         AVAIL 

errors: No known data errors

  pool: TrueNAS-Apps
 state: ONLINE
  scan: resilvered 232K in 00:00:00 with 0 errors on Wed Dec  6 03:07:23 2023
config:

        NAME          STATE     READ WRITE CKSUM
        TrueNAS-Apps  ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            sdfw1     ONLINE       0     0     0
            sdfx2     ONLINE       0     0     0

errors: No known data errors

  pool: Wolves
 state: ONLINE
  scan: scrub repaired 0B in 05:11:50 with 0 errors on Thu Nov 23 10:07:44 2023
config:

        NAME                  STATE     READ WRITE CKSUM
        Wolves                ONLINE       0     0     0
          draid2:5d:15c:1s-0  ONLINE       0     0     0
            sdbt2             ONLINE       0     0     0
            sdbs2             ONLINE       0     0     0
            sdbu2             ONLINE       0     0     0
            sddv2             ONLINE       0     0     0
            sded2             ONLINE       0     0     0
            sdas2             ONLINE       0     0     0
            sddp2             ONLINE       0     0     0
            sdbg2             ONLINE       0     0     0
            sddn2             ONLINE       0     0     0
            sdbl2             ONLINE       0     0     0
            sdbm2             ONLINE       0     0     0
            sdbn2             ONLINE       0     0     0
            sdbi2             ONLINE       0     0     0
            sdbj2             ONLINE       0     0     0
            sdaw2             ONLINE       0     0     0
          draid2:5d:15c:1s-1  ONLINE       0     0     0
            sddx2             ONLINE       0     0     0
            sdax2             ONLINE       0     0     0
            sdda2             ONLINE       0     0     0
            sday2             ONLINE       0     0     0
            sdby2             ONLINE       0     0     0
            sdbv2             ONLINE       0     0     0
            sdbz2             ONLINE       0     0     0
            sdds2             ONLINE       0     0     0
            sdbw2             ONLINE       0     0     0
            sdbx2             ONLINE       0     0     0
            sddt2             ONLINE       0     0     0
            sddq2             ONLINE       0     0     0
            sdar2             ONLINE       0     0     0
            sdbc2             ONLINE       0     0     0
            sddo2             ONLINE       0     0     0
          draid2:5d:15c:1s-2  ONLINE       0     0     0
            sdaz2             ONLINE       0     0     0
            sdcx2             ONLINE       0     0     0
            sdba2             ONLINE       0     0     0
            sdbb2             ONLINE       0     0     0
            sddr2             ONLINE       0     0     0
            sdbd2             ONLINE       0     0     0
            sdbf2             ONLINE       0     0     0
            sdbe2             ONLINE       0     0     0
            sdcr2             ONLINE       0     0     0
            sdbk2             ONLINE       0     0     0
            sdbp2             ONLINE       0     0     0
            sdbh2             ONLINE       0     0     0
            sdea2             ONLINE       0     0     0
            sddz2             ONLINE       0     0     0
            sdee2             ONLINE       0     0     0
          draid2:5d:15c:1s-3  ONLINE       0     0     0
            sdeg2             ONLINE       0     0     0
            sdct2             ONLINE       0     0     0
            sdef2             ONLINE       0     0     0
            sdeq2             ONLINE       0     0     0
            sddu2             ONLINE       0     0     0
            sdei2             ONLINE       0     0     0
            sdat2             ONLINE       0     0     0
            sdec2             ONLINE       0     0     0
            sdau2             ONLINE       0     0     0
            sdeb2             ONLINE       0     0     0
            sdav2             ONLINE       0     0     0
            sddw2             ONLINE       0     0     0
            sdbo2             ONLINE       0     0     0
            sdbq2             ONLINE       0     0     0
            sdbr2             ONLINE       0     0     0
        special
          mirror-4            ONLINE       0     0     0
            sdab              ONLINE       0     0     0
            sdfs              ONLINE       0     0     0
        spares
          draid2-0-0          AVAIL 
          draid2-1-0          AVAIL 
          draid2-2-0          AVAIL 
          draid2-3-0          AVAIL 

errors: No known data errors

  pool: boot-pool
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 00:00:15 with 0 errors on Sat Dec  2 03:45:17 2023
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdft3   ONLINE       0     0     0
            sdfu3   ONLINE       0     0     0

errors: No known data errors

Summary of the reboot issue

Bunnies is the pool that triggers the forced reboot whenever it is written to. This began 3, maybe 4 days ago.

I thought it was a power issue

Two days before the reboots started, I had already removed 15 SSDs that I wasn't using. Once they started, I suspected a power issue, so I removed 48 SSDs (of 85), but still had the problem.

I already planned to add 128 SSD bays in here for my 123 SSDs, so I completely redid the power wiring and removed every one of the 128 SSD slots from the +5V rail on the PSUs. I also swapped the PSUs in this server with the ones in the other Storinator XL60 chassis, again thinking power was the issue.

I thought it was a heat issue

At some point, I noticed heat issues because I had switched all the fans to Noctua NF-12s during this 128 SSD bay transition. I put them all back to stock, and the heat issues went away. Still, I removed the ConnectX-6 card, and the reboots became less frequent.

I only recently found out why:
The real problem occurs when writing to Bunnies.

Because my PCs back up their data every 12 hours and on idle, whenever I stepped away from one, it would start writing data to Bunnies, which forced a reboot. When I removed the ConnectX-6 and switched to the onboard NICs, the DNS hostname no longer matched, so I could only access my NAS by IP. Because of this, my Windows boxes stopped backing up, which lengthened the time between reboots to whenever my snapshots ran.

It was the writes

Once the reboots started occurring about every hour, it was pretty clear the snapshots were causing them, since I take hourly snapshots.
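A quick way to sanity-check the "about every hour" pattern is to print the gap between consecutive boots. This is a minimal sketch, assuming the boot times have already been collected as epoch seconds (for example from the boot log enabled earlier, or from `journalctl --list-boots` if the journal survives reboots). The function name `boot_gaps` and the sample timestamps are made up for illustration and are not taken from this server.

```shell
#!/bin/sh
# Read boot times (epoch seconds, one per line, oldest first) on stdin
# and print the gap between consecutive boots in whole minutes.
boot_gaps() {
  awk 'NR > 1 { printf "%d\n", ($1 - prev) / 60 } { prev = $1 }'
}

# Illustrative timestamps only: three boots spaced one hour apart.
printf '%s\n' 1701900000 1701903600 1701907200 | boot_gaps
```

Gaps that cluster around 60 minutes would line up with the hourly snapshot task. Gaps that cluster around the backup interval would point at the Windows backups instead.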

With all snapshot and backup tasks disabled, and with Windows unable to reach the NAS, the server stayed up for over 10 hours running a `zpool scrub`. I'm now 100% certain that writes are the issue, and only writes to the Bunnies pool. I'm not yet certain whether the cause is physical (NVMe or SATA SSDs) or TrueNAS.

How to figure out what's wrong?

Is there a way to check this with `zdb`? If it's just a corrupt pool, I already planned to nix it and convert it to dRAID, but I first need to move some more recent snapshots off it. That requires manually running `zfs send` to Wolves.
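The manual `zfs send` to Wolves could look roughly like the sketch below. The dataset and snapshot names are hypothetical placeholders, as is the helper `build_send_cmd`. The script only prints the pipeline unless `RUN=1` is set. It sends an existing snapshot, which reads from Bunnies. Taking a new snapshot first would itself be a write to Bunnies.

```shell
#!/bin/sh
# Hypothetical names: substitute the real dataset and snapshot.
SRC_DATASET="Bunnies/example-dataset"
SNAPSHOT="example-snapshot"
DEST_DATASET="Wolves/bunnies-rescue/example-dataset"

# Build the send/receive pipeline as text so it can be reviewed first.
# -v prints progress on the send side; -u keeps the received dataset
# unmounted on the destination.
build_send_cmd() {
  printf 'zfs send -v %s@%s | zfs receive -u %s\n' "$1" "$2" "$3"
}

cmd=$(build_send_cmd "$SRC_DATASET" "$SNAPSHOT" "$DEST_DATASET")
echo "$cmd"

# Only execute when explicitly requested.
if [ "${RUN:-0}" = "1" ]; then
  sh -c "$cmd"
fi
```

Running it without `RUN=1` prints the command for a dry review. With `RUN=1` it executes the same line.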

nKk · Dec 7, 2023

To check whether TrueNAS itself is at fault, you can boot another Linux distro that uses the same or a newer version of ZFS, import the Bunnies pool, and test writes.
If the resets continue, TrueNAS is not the problem, but it could still be a software issue: ZFS or something else.
Or you can try TrueNAS Core, if Core's ZFS can import a pool from Scale, to check whether anything different shows up in the logs.
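On a live Linux environment with ZFS installed, that test might go roughly like the plan below. This is a sketch under assumptions: the altroot `/mnt/rescue`, the test file path, and the helper `write_test_plan` are hypothetical, and the script only prints the commands so they can be run one at a time. The file path assumes the pool's root dataset mounts at `<altroot>/Bunnies`, so check the `zfs list` output before writing. Avoid `zpool upgrade` on the other distro, since enabling newer features could stop TrueNAS from importing the pool again.

```shell
#!/bin/sh
POOL="Bunnies"
ALTROOT="/mnt/rescue"                      # hypothetical mount root
TEST_FILE="$ALTROOT/$POOL/write-test.bin"  # hypothetical path

# Print the manual steps in order; nothing is executed here.
write_test_plan() {
  pool=$1
  altroot=$2
  file=$3
  printf 'zpool import\n'                                # list importable pools
  printf 'zpool import -f -R %s %s\n' "$altroot" "$pool" # -f: pool was last used by another host
  printf 'zfs list -r -o name,mountpoint %s\n' "$pool"   # confirm where datasets mounted
  printf 'dd if=/dev/urandom of=%s bs=1M count=1024 conv=fsync\n' "$file"
  printf 'rm %s\n' "$file"
  printf 'zpool export %s\n' "$pool"                     # release the pool for TrueNAS
}

write_test_plan "$POOL" "$ALTROOT" "$TEST_FILE"
```

If the machine resets during the `dd` step on a different distro, that points away from TrueNAS itself and toward ZFS, the pool, or the hardware behind it.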
