Crash Under Heavy iSCSI Write Load

himebaugh

Cadet
Joined
Jan 4, 2024
Messages
8
Running TrueNAS-SCALE-23.10.0.1

PowerEdge R740xd
2x Intel Xeon Silver 4216 CPU @ 2.1 GHz
384 GB RAM

Boot Drives:
2x Dell 480 GB SSD (MZ7LH480HBHQ0D3)

Storage Drives:
9x WD Gold 12 TB HD (WD121KRYZ-01)

Read Cache:
2x WD Black 1TB SN750 NVMe SSD

SLOG:
3x Intel OPTANE SSD P1600X Series 118GB M.2

Hard Drive Controller:
Dell HBA330 Firmware 16.17.01.00

Storage Configuration:
4x Mirror, 2 Wide (12 TB WD Gold drives)

Log VDEVs 1x 118 GB Mirror, 3 wide (Optane)

Cache VDEVs 2x 1 TB (WD Black SSD)

Spare VDEVs 1x 12 TB (WD Gold)

Network Cards:
Onboard NIC (Intel X550 4 Port 10 GB)
PCI-E Card (Intel X710-T 4 Port 10 GB)

Using the system as an iSCSI target for VMware. Two VMware ESXi hosts are connected via 10 GB iSCSI (each host has two links, MPIO on, jumbo frames on). Two zvols have been set up (sparse). Both zvols have been set to "Sync always".
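For reference, the zvol settings can be confirmed from the shell; a minimal sketch, assuming a pool named tank and a zvol named vmware-zvol1 (both placeholder names, not from the actual system):

```shell
# Show the current sync, compression, and block-size settings on the zvol
zfs get sync,compression,volblocksize tank/vmware-zvol1

# sync=always forces every write through the ZIL (and thus the Optane SLOG),
# regardless of whether the initiator requested a synchronous write
zfs set sync=always tank/vmware-zvol1
```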

I've had this particular issue happen three times now across two different TrueNAS SCALE systems set up as listed above. Under heavy write utilization (a large storage vMotion), the iSCSI target stops functioning, along with the web interface. I ended up needing to reboot the TrueNAS server to restore iSCSI connectivity.

Image below is what shows on the console.

Utilization on the storage array is around 15% when this happens.

We are running ZFS Encryption on this volume.

Any suggestions?

Thank you for your time!

truenas-error.jpg
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
txg_sync tripping the kernel deadman timer is not a good thing, but your storage controller and layout don't suggest anything that would immediately make me think hardware.

Lightning round style:

Are you using deduplication?
How are the M.2 cards connected? Riser board with/without bifurcation? Do they have adequate cooling across them?
You've got a lot of NICs on the TrueNAS host - are they set up in individual subnets, LACP'd?
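On the deadman point: the timer that txg_sync tripped is tunable, and its current values can be read from sysfs on SCALE; a sketch, assuming stock OpenZFS module parameter names:

```shell
# Print every deadman-related ZFS module parameter with its current value
grep . /sys/module/zfs/parameters/zfs_deadman_*

# zfs_deadman_synctime_ms is the key one here: how long a sync operation
# may stall (default 600000 ms = 10 minutes) before the deadman fires
```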
 

himebaugh

Cadet
Joined
Jan 4, 2024
Messages
8
Thank you for the reply.

No, we are not using deduplication. We are using compression.

We have 2x 4-port M.2 cards on a riser board. Just using BIOS defaults - I didn't have to enable bifurcation. I believe they have adequate cooling. Looking at the temperature graphs for the M.2 cards, they normally run at 48 C. That said, one of them did climb to 58 C before this crash. I would think that is still within an acceptable range.

All the NICs are on individual subnets. No LACP. Basically 2x 10 GB iSCSI connections to 2 different VMware hosts (iSCSI MPIO), plus redundant management ports (the management port does have an active/failover bond).
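The reported M.2 temperatures can be cross-checked directly from each drive's SMART log with nvme-cli; a sketch, assuming the five NVMe devices (2x WD Black, 3x Optane) enumerate as /dev/nvme0 through /dev/nvme4:

```shell
# Dump the temperature and critical-warning fields for each NVMe device
for dev in /dev/nvme0 /dev/nvme1 /dev/nvme2 /dev/nvme3 /dev/nvme4; do
    echo "== $dev =="
    nvme smart-log "$dev" | grep -Ei 'temperature|critical_warning'
done
```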
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Darn, there goes the easy pointing the finger at dedup.

Can you link the riser boards being used for the M.2 cards? I think the 14th gen Dells might auto-bifurcate, but it's also possible the boards have a PCIe switch chip in them. If that chip is getting too toasty that might result in the drives dropping offline. I also saw some recent posts about specific WD drives having weird power state interactions with Linux where going into a sleep state they'd decide to drop offline.

Can you collect a debug from System -> Advanced -> Save Debug and attach to a ticket with the Report A Bug link at the top of the forums?
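Regarding the WD power-state reports: the usual check on Linux is whether the drive advertises autonomous power state transitions (APST), and the common workaround is to cap the allowed transition latency so the drive never enters its deepest states; a sketch using mainline nvme-cli and the nvme_core module parameter:

```shell
# apsta : 0x1 in the controller data means the drive supports APST
nvme id-ctrl /dev/nvme0 | grep -i apsta

# To disable APST, cap the permitted transition latency at 0 by adding
# this to the kernel command line and rebooting:
#   nvme_core.default_ps_max_latency_us=0
```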
 

himebaugh

Cadet
Joined
Jan 4, 2024
Messages
8
Here is the M.2 Card we are using.

AOC-SHG3-4M2P

I believe it has the associated PCIe switch in it, because I've used it in a system without bifurcation support before, and all four drives seemed to work fine.

The WD drives item is interesting as well - I'll have to look for those posts.

I'll grab a debug and submit it.

Thank you!
 

himebaugh

Cadet
Joined
Jan 4, 2024
Messages
8
You mentioned the M.2 card temperatures - I looked at that a bit closer. One of them normally runs at about 48 C. However, it climbed to 58 C about 4 hours before the crash, then stopped reporting its temperature until the system was rebooted after the crash. So I can't rule out that that particular M.2 drive (it was one of the drives in the SLOG mirror) overheated.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Ticket # NAS-126040 has been submitted.

Thank you, got it and the debug.

I did see a couple things that seemed to be reporting somewhat toasty temperatures, including what looks to be a Broadcom network device that also might have a potential interaction with the system watchdog timer.
 

himebaugh

Cadet
Joined
Jan 4, 2024
Messages
8
Okay, thanks. I do have two systems that are set up the same, but now that you pointed out the Broadcom network device I dug a little further.

The one I submitted the crash dump for does have different NICs than the other one. Crash dump one has:

2x Broadcom BCM57454 4x10G BT NICs.

That being said, I've had the issue on both systems (the one with the Intel NICs and the one with the Broadcom NICs).

Is it your opinion that this is likely either a heat issue with the M.2 cards or a driver/compatibility issue with the NIC? (Although if it were the NIC, the other system shouldn't have had this issue.) If there are known issues with the Broadcom NIC, we'll likely replace them anyway.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I'm thinking heat on the adapter card, potentially the M.2 devices, or a weird firmware interaction.

Can you check the iDRAC on the server and look at the logs there? I'm suspecting a watchdog reset timer - if it thinks a component or the system itself is non-responsive, it might decide to throw an NMI through IPMI. Try disabling the watchdog from the Dell BIOS/EFI.
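The BMC watchdog state can also be inspected from the OS side with ipmitool; a sketch, assuming the in-band IPMI driver is loaded on the TrueNAS host:

```shell
# Query whether the BMC watchdog is armed and what action it will take
ipmitool mc watchdog get

# Disarm a running watchdog countdown entirely
ipmitool mc watchdog off
```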
 

himebaugh

Cadet
Joined
Jan 4, 2024
Messages
8
There is nothing of note in the System Event Log or the Lifecycle log.

The iDRAC watchdog timer (I believe it's called "Auto System Recovery Action") was/is disabled. It could still be enabled in the BIOS. Still working on moving off some VMs so I can reboot it and check.
 

himebaugh

Cadet
Joined
Jan 4, 2024
Messages
8
Took a little while, but I was able to reproduce the error and get the full console log. Created ticket # NAS-127456 with the relevant information.
 