SOLVED TrueNAS SCALE 21.08 BETA-1 Detected Hardware Unit Hang [Intel® I219V]

EzEkil

Cadet
Joined
Jun 27, 2020
Messages
1
I've been getting this problem since I changed my hardware from my old platform (i5-4670K and Z87 chipset).
TrueNAS SCALE was a fresh install after the hardware change, and the pools were imported from my TrueNAS Core 12.0-U5.1 install.

When this happens I need to restart the whole system to recover from the hang, and the time before it hangs again varies. It is not due to being under load for too long: I tested a sustained transfer for an hour with no dropouts, and multiple Plex streams (direct and transcoded) did not trigger the hang either. The longest uptime I had was about 18 hours and my shortest was less than 1 hour. All components are brand new.

Specifications:
CPU: Intel Core i3-10100
Motherboard: MSI B460 Tomahawk
RAM: 16 GB Kingston HyperX Fury 2666 MHz

If any other information is needed for further troubleshooting, I'll be more than glad to provide it.
 

Attachments

  • IMG_20210929_165527.jpg (205 KB)

whodat

Dabbler
Joined
Apr 28, 2018
Messages
34
I also have an e1000e Detected Hardware Unit Hang issue:

Code:
Jul 11 10:21:25 truenas kernel: e1000e 0000:00:19.0 enp0s25: Detected Hardware Unit Hang:
  TDH                  <69>
  TDT                  <a7>
  next_to_use          <a7>
  next_to_clean        <68>
buffer_info[next_to_clean]:
  time_stamp           <100013b79>
  next_to_watch        <69>
  jiffies              <100013e31>
  next_to_watch.status <0>
MAC Status             <80083>
PHY Status             <796d>
PHY 1000BASE-T Status  <7c00>
PHY Extended Status    <3000>
PCI Status             <10>
Jul 11 10:21:27 truenas kernel: e1000e 0000:00:19.0 enp0s25: NIC Link is Down
Jul 11 10:21:28 truenas ntpd[5325]: Deleting interface #16 enp0s25, x.x.x.x#123, interface stats: received=0, sent=0, dropped=0, active_time=268 secs
Jul 11 10:21:28 truenas ntpd[5325]: Deleting interface #17 enp0s25, x::x:x:x:x%3#123, interface stats: received=0, sent=0, dropped=0, active_time=268 secs
Jul 11 10:21:28 truenas ntpd[5325]: Deleting interface #18 macvtap0, x::x:x:x:x%4#123, interface stats: received=0, sent=0, dropped=0, active_time=268 secs
Jul 11 10:21:30 truenas kernel: e1000e 0000:00:19.0 enp0s25: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

As far as I can tell from my testing, I am able to resolve it via the CLI with:
Code:
ethtool -K enp0s25 tso off gso off
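
The current offload state can be checked before and after with lowercase -k, which only queries the driver rather than changing anything (enp0s25 being my interface name):
Code:
ethtool -k enp0s25 | grep -E 'tcp-segmentation-offload|generic-segmentation-offload'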

However, I'm not sure how to correctly persist this across reboots on TrueNAS SCALE. Can someone please advise?

Google tells me this is how it's done on Proxmox, but it doesn't seem to work on TrueNAS SCALE:
https://forum.proxmox.com/threads/trap-error-on-e1000-network-adapter.105758/post-456790
you can make it "permanent" by adding it as a post-up in your /etc/network/interfaces file:
Code:
iface enp0s25 inet manual
# other configuration options here
# post-up goes below
post-up ethtool -K enp0s25 tso off gso off
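
On a plain Debian system, a systemd oneshot unit would be another generic way to persist this. A sketch (unit name is hypothetical, and I haven't verified it survives on SCALE, which manages its own network configuration):
Code:
# /etc/systemd/system/ethtool-offload.service (hypothetical name)
[Unit]
Description=Disable TSO/GSO on enp0s25 to work around e1000e hangs
After=network.target

[Service]
Type=oneshot
ExecStart=/usr/sbin/ethtool -K enp0s25 tso off gso off

[Install]
WantedBy=multi-user.target

It would then be enabled with systemctl enable ethtool-offload.service.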
 

whodat

Dabbler
Joined
Apr 28, 2018
Messages
34
Answering my own question above: in the TrueNAS SCALE GUI I used Settings > Advanced > Init/Shutdown Scripts to simply add a new Post-Init entry.
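
The same entry can presumably also be created from the shell via the middleware client (a sketch; the initshutdownscript field names are assumed from the GUI form):
Code:
midclt call initshutdownscript.create '{"type": "COMMAND", "command": "ethtool -K enp0s25 tso off gso off", "when": "POSTINIT", "enabled": true, "comment": "Disable TSO/GSO on e1000e NIC"}'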
Mod note: imgur is terrible in every way
 
Last edited by a moderator:

CacheMeIfYouCan

Dabbler
Joined
Oct 23, 2023
Messages
23
Hi EzEkil and @whodat, I too have this e1000e Detected Hardware Unit Hang with the Intel I219-V Ethernet interface on my ASRock H670M Pro RS motherboard.
I know it's been some time for you both, but have you gotten past this issue?

I am running TrueNAS SCALE 23.10.0.1 at the moment.

Code:
Jan 17 06:00:03 serveur4 systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
Jan 17 06:00:03 serveur4 systemd[1]: sysstat-collect.service: Deactivated successfully.
Jan 17 06:00:03 serveur4 systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
Jan 17 06:00:15 serveur4 kernel: e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
                                   TDH                  <c6>
                                   TDT                  <f3>
                                   next_to_use          <f3>
                                   next_to_clean        <c5>
                                 buffer_info[next_to_clean]:
                                   time_stamp           <101fb00cb>
                                   next_to_watch        <c6>
                                   jiffies              <101fb02b0>
                                   next_to_watch.status <0>
                                 MAC Status             <40080083>
                                 PHY Status             <796d>
                                 PHY 1000BASE-T Status  <3800>
                                 PHY Extended Status    <3000>
                                 PCI Status             <10>
Jan 17 06:00:17 serveur4 kernel: e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
                                   TDH                  <c6>
                                   TDT                  <f3>
                                   next_to_use          <f3>
                                   next_to_clean        <c5>
                                 buffer_info[next_to_clean]:
                                   time_stamp           <101fb00cb>
                                   next_to_watch        <c6>
                                   jiffies              <101fb04a8>
                                   next_to_watch.status <0>
                                 MAC Status             <40080083>
                                 PHY Status             <796d>
                                 PHY 1000BASE-T Status  <3800>
                                 PHY Extended Status    <3000>
                                 PCI Status             <10>
Jan 17 06:00:19 serveur4 kernel: e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
                                   TDH                  <c6>
                                   TDT                  <f3>
                                   next_to_use          <f3>
                                   next_to_clean        <c5>
                                 buffer_info[next_to_clean]:
                                   time_stamp           <101fb00cb>
                                   next_to_watch        <c6>
                                   jiffies              <101fb0698>
                                   next_to_watch.status <0>
                                 MAC Status             <40080083>
                                 PHY Status             <796d>
                                 PHY 1000BASE-T Status  <3800>
                                 PHY Extended Status    <3000>
                                 PCI Status             <10>
Jan 17 06:00:21 serveur4 kernel: e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
                                   TDH                  <c6>
                                   TDT                  <f3>
                                   next_to_use          <f3>
                                   next_to_clean        <c5>
                                 buffer_info[next_to_clean]:
                                   time_stamp           <101fb00cb>
                                   next_to_watch        <c6>
                                   jiffies              <101fb0890>
                                   next_to_watch.status <0>
                                 MAC Status             <40080083>
                                 PHY Status             <796d>
                                 PHY 1000BASE-T Status  <3800>
                                 PHY Extended Status    <3000>
                                 PCI Status             <10>
Jan 17 06:00:22 serveur4 kernel: ------------[ cut here ]------------
Jan 17 06:00:22 serveur4 kernel: NETDEV WATCHDOG: enp0s31f6 (e1000e): transmit queue 0 timed out
Jan 17 06:00:22 serveur4 kernel: WARNING: CPU: 6 PID: 0 at net/sched/sch_generic.c:525 dev_watchdog+0x207/0x210
Jan 17 06:00:22 serveur4 kernel: Modules linked in: rpcsec_gss_krb5(E) nls_ascii(E) nls_cp437(E) vfat(E) fat(E) xt_tcpudp(E) nft_log(E) nft_limit(E) xt_limit(E) xt_NFLOG(E) nfnetlink_log(E) xt_physdev(E) veth(E) tls(E) xt_multiport(E) xt_addrtype(E) ip_vs_rr(E) dummy(E) ipt_REJECT(E) nf_reject_ipv4(E) ip_set_hash_ipport(E) xt_nat(E) xt_ipvs(E) xt_set(E) ip_vs(E) ip_set_hash_ip(E) ip_set_hash_net(E) ip_set(E) xt_MASQUERADE(E) nft_chain_nat(E) xt_mark(E) xt_conntrack(E) xt_comment(E) nft_compat(E) nf_tables(E) nfnetlink(E) iptable_filter(E) iptable_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) overlay(E) br_netfilter(E) vhost_net(E) vhost(E) vhost_iotlb(E) tap(E) tun(E) scst_vdisk(OE) isert_scst(OE) iscsi_scst(OE) scst(OE) rdma_cm(E) iw_cm(E) ib_cm(E) ib_core(E) dlm(E) nvme_fabrics(E) binfmt_misc(E) bridge(E) stp(E) llc(E) ntb_netdev(E) ntb_transport(E) ntb_split(E) ntb(E) ioatdma(E) dca(E) essiv(E) authenc(E) dm_crypt(E) snd_hda_codec_hdmi(E) snd_hda_codec_realtek(E)
Jan 17 06:00:22 serveur4 kernel:  snd_hda_codec_generic(E) ledtrig_audio(E) intel_rapl_msr(E) intel_rapl_common(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) snd_sof_pci_intel_tgl(E) kvm_intel(E) snd_sof_intel_hda_common(E) snd_sof_intel_hda(E) kvm(E) snd_sof_pci(E) irqbypass(E) snd_sof_xtensa_dsp(E) snd_sof(E) snd_sof_utils(E) ghash_clmulni_intel(E) snd_soc_hdac_hda(E) snd_hda_ext_core(E) sha512_ssse3(E) snd_soc_acpi_intel_match(E) snd_soc_acpi(E) sha512_generic(E) i915(E) snd_soc_core(E) snd_compress(E) aesni_intel(E) snd_hda_intel(E) snd_intel_dspcfg(E) crypto_simd(E) drm_buddy(E) cryptd(E) snd_hda_codec(E) rapl(E) drm_display_helper(E) intel_cstate(E) snd_hda_core(E) cec(E) snd_hwdep(E) mei_hdcp(E) intel_uncore(E) rc_core(E) snd_pcm(E) ttm(E) wmi_bmof(E) snd_timer(E) pcspkr(E) iTCO_wdt(E) mei_me(E) snd(E) drm_kms_helper(E) intel_pmc_bxt(E) iTCO_vendor_support(E) i2c_algo_bit(E) watchdog(E) mei(E) soundcore(E) ee1004(E) intel_pmc_core(E) acpi_pad(E) acpi_tad(E) evdev(E) joydev(E) button(E)
Jan 17 06:00:22 serveur4 kernel:  sg(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) drm(E) sunrpc(E) fuse(E) loop(E) efi_pstore(E) dm_mod(E) configfs(E) ip_tables(E) x_tables(E) autofs4(E) zfs(POE) spl(OE) efivarfs(E) raid10(E) raid456(E) async_raid6_recov(E) async_memcpy(E) async_pq(E) async_xor(E) async_tx(E) xor(E) raid6_pq(E) libcrc32c(E) crc32c_generic(E) raid1(E) raid0(E) multipath(E) linear(E) md_mod(E) sd_mod(E) ses(E) enclosure(E) hid_generic(E) usbhid(E) hid(E) nvme(E) nvme_core(E) t10_pi(E) ahci(E) ahciem(E) mpt3sas(E) xhci_pci(E) libahci(E) crc64_rocksoft(E) raid_class(E) crc64(E) e1000e(E) crc_t10dif(E) scsi_transport_sas(E) xhci_hcd(E) crct10dif_generic(E) crc32_pclmul(E) libata(E) i2c_i801(E) intel_lpss_pci(E) ptp(E) crc32c_intel(E) crct10dif_pclmul(E) i2c_smbus(E) pps_core(E) usbcore(E) intel_lpss(E) scsi_mod(E) crct10dif_common(E) usb_common(E) idma64(E) scsi_common(E) video(E) wmi(E)
Jan 17 06:00:22 serveur4 kernel: CPU: 6 PID: 0 Comm: swapper/6 Tainted: P           OE      6.1.55-production+truenas #2
Jan 17 06:00:22 serveur4 kernel: Hardware name: To Be Filled By O.E.M. H670M Pro RS/H670M Pro RS, BIOS 9.02 09/07/2022
Jan 17 06:00:22 serveur4 kernel: RIP: 0010:dev_watchdog+0x207/0x210
Jan 17 06:00:22 serveur4 kernel: Code: 00 e9 40 ff ff ff 48 89 df c6 05 91 35 3f 01 01 e8 6e de fa ff 44 89 e9 48 89 de 48 c7 c7 38 a6 3f 89 48 89 c2 e8 69 af 88 ff <0f> 0b e9 22 ff ff ff 66 90 0f 1f 44 00 00 55 53 48 89 fb 48 8b 6f
Jan 17 06:00:22 serveur4 kernel: RSP: 0018:ffffae8c002b0e80 EFLAGS: 00010286
Jan 17 06:00:22 serveur4 kernel: RAX: 0000000000000000 RBX: ffff9af715130000 RCX: 0000000000000000
Jan 17 06:00:22 serveur4 kernel: RDX: 0000000000000104 RSI: ffffffff8938a806 RDI: 00000000ffffffff
Jan 17 06:00:22 serveur4 kernel: RBP: ffff9af715130488 R08: 0000000000000000 R09: ffffae8c002b0cf0
Jan 17 06:00:22 serveur4 kernel: R10: 0000000000000003 R11: ffffffff89ad40a8 R12: ffff9af7151303dc
Jan 17 06:00:22 serveur4 kernel: R13: 0000000000000000 R14: ffffffff888117b0 R15: ffff9af715130488
Jan 17 06:00:22 serveur4 kernel: FS:  0000000000000000(0000) GS:ffff9b064f780000(0000) knlGS:0000000000000000
Jan 17 06:00:22 serveur4 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 17 06:00:22 serveur4 kernel: CR2: 00007ff2324f3000 CR3: 0000000afb410000 CR4: 0000000000752ee0
Jan 17 06:00:22 serveur4 kernel: PKRU: 55555554
Jan 17 06:00:22 serveur4 kernel: Call Trace:
Jan 17 06:00:22 serveur4 kernel:  <IRQ>
Jan 17 06:00:22 serveur4 kernel:  ? __warn+0x7d/0xc0
Jan 17 06:00:22 serveur4 kernel:  ? dev_watchdog+0x207/0x210
Jan 17 06:00:22 serveur4 kernel:  ? report_bug+0xe6/0x170
Jan 17 06:00:22 serveur4 kernel:  ? irq_work_queue+0xa/0x50
Jan 17 06:00:22 serveur4 kernel:  ? handle_bug+0x41/0x70
Jan 17 06:00:22 serveur4 kernel:  ? exc_invalid_op+0x13/0x60
Jan 17 06:00:22 serveur4 kernel:  ? asm_exc_invalid_op+0x16/0x20
Jan 17 06:00:22 serveur4 kernel:  ? pfifo_fast_reset+0x140/0x140
Jan 17 06:00:22 serveur4 kernel:  ? dev_watchdog+0x207/0x210
Jan 17 06:00:22 serveur4 kernel:  ? pfifo_fast_reset+0x140/0x140
Jan 17 06:00:22 serveur4 kernel:  call_timer_fn+0x24/0x130
Jan 17 06:00:22 serveur4 kernel:  __run_timers+0x21c/0x2a0
Jan 17 06:00:22 serveur4 kernel:  run_timer_softirq+0x2b/0x50
Jan 17 06:00:22 serveur4 kernel:  __do_softirq+0xed/0x2fe
Jan 17 06:00:22 serveur4 kernel:  __irq_exit_rcu+0xc7/0x130
Jan 17 06:00:22 serveur4 kernel:  sysvec_apic_timer_interrupt+0x9e/0xc0
Jan 17 06:00:22 serveur4 kernel:  </IRQ>
Jan 17 06:00:22 serveur4 kernel:  <TASK>
Jan 17 06:00:22 serveur4 kernel:  asm_sysvec_apic_timer_interrupt+0x16/0x20
Jan 17 06:00:22 serveur4 kernel: RIP: 0010:cpuidle_enter_state+0xde/0x420
Jan 17 06:00:22 serveur4 kernel: Code: 00 00 31 ff e8 33 f9 97 ff 45 84 ff 74 16 9c 58 0f 1f 40 00 f6 c4 02 0f 85 25 03 00 00 31 ff e8 68 6c 9e ff fb 0f 1f 44 00 00 <45> 85 f6 0f 88 85 01 00 00 49 63 d6 48 8d 04 52 48 8d 04 82 49 8d
Jan 17 06:00:22 serveur4 kernel: RSP: 0018:ffffae8c00197e90 EFLAGS: 00000246
Jan 17 06:00:22 serveur4 kernel: RAX: ffff9b064f780000 RBX: ffff9b064f7bbe00 RCX: 0000000000000000
Jan 17 06:00:22 serveur4 kernel: RDX: 0000000000000006 RSI: ffffffff8938a806 RDI: ffffffff89364511
Jan 17 06:00:22 serveur4 kernel: RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000026c27b0a
Jan 17 06:00:22 serveur4 kernel: R10: 0000000000000018 R11: 0000000000002448 R12: ffffffff89b9efa0
Jan 17 06:00:22 serveur4 kernel: R13: 000079280728e19d R14: 0000000000000003 R15: 0000000000000000
Jan 17 06:00:22 serveur4 kernel:  cpuidle_enter+0x29/0x40
Jan 17 06:00:22 serveur4 kernel:  do_idle+0x20c/0x2b0
Jan 17 06:00:22 serveur4 kernel:  cpu_startup_entry+0x19/0x20
Jan 17 06:00:22 serveur4 kernel:  start_secondary+0x130/0x150
Jan 17 06:00:22 serveur4 kernel:  secondary_startup_64_no_verify+0xe5/0xeb
Jan 17 06:00:22 serveur4 kernel:  </TASK>
Jan 17 06:00:22 serveur4 kernel: ---[ end trace 0000000000000000 ]---
Jan 17 06:00:22 serveur4 kernel: e1000e 0000:00:1f.6 enp0s31f6: Reset adapter unexpectedly
Jan 17 06:00:22 serveur4 kernel: br0: port 1(enp0s31f6) entered disabled state
Jan 17 06:00:26 serveur4 kernel: e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 17 06:00:26 serveur4 kernel: br0: port 1(enp0s31f6) entered blocking state
Jan 17 06:00:26 serveur4 kernel: br0: port 1(enp0s31f6) entered listening state
Jan 17 06:00:27 serveur4 k3s[6163]: {"level":"warn","ts":"2024-01-17T06:00:27.856-0500","logger":"etcd-client","caller":"v3@v3.5.7-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000590000/kine.sock","attempt":0,"error":"rpc error: code = Unknown desc = no such table: dbstat"}
Jan 17 06:00:28 serveur4 kernel: br0: port 1(enp0s31f6) received tcn bpdu
Jan 17 06:00:28 serveur4 kernel: br0: topology change detected, propagating
Jan 17 06:00:28 serveur4 kernel: br0: port 1(enp0s31f6) received tcn bpdu
Jan 17 06:00:28 serveur4 kernel: br0: topology change detected, propagating
Jan 17 06:00:41 serveur4 kernel: br0: port 1(enp0s31f6) entered learning state
Jan 17 06:00:56 serveur4 kernel: br0: port 1(enp0s31f6) entered forwarding state
Jan 17 06:00:56 serveur4 kernel: br0: topology change detected, propagating
Jan 17 06:01:01 serveur4 k3s[6163]: time="2024-01-17T06:01:01-05:00" level=info msg="COMPACT compactRev=9430489 targetCompactRev=9431002 currentRev=9432002"
Jan 17 06:01:01 serveur4 k3s[6163]: time="2024-01-17T06:01:01-05:00" level=info msg="COMPACT deleted 513 rows from 513 revisions in 5.763759ms - compacted to 9431002/9432002"


On my system, the hang seems to be auto-detected and auto-healed by the driver, so it has far less impact: I get network disconnections for a few seconds and then everything returns to normal for a few hours or even a full day. The strangest thing is that the error occurs at exactly the same time of day: 6:01 AM and/or 8:01 AM (in other words, once or twice per day). That is when a VM (a Proxmox Backup Server) syncs its datastore over the network, so it is a peak of network activity; however, it is not the only time of day when this system has peaks of network activity.
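
To check whether each hang lines up with the scheduled sync, the kernel log can be filtered around those times, e.g.:
Code:
journalctl -k --since "2024-01-17 05:55" --until "2024-01-17 06:05" | grep -i 'unit hang'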
 

CacheMeIfYouCan

Dabbler
Joined
Oct 23, 2023
Messages
23
These exact symptoms we see with the e1000e driver have been reported before, even dating back to 2016, before the I219 existed.

I've found references on the Proxmox forums from users also using I219 chips.
 

CacheMeIfYouCan

Dabbler
Joined
Oct 23, 2023
Messages
23
Well, further searching led me to the following link, where a note by Intel is referenced.
We see that this issue that shows up in the e1000e logs goes back to the Intel 82579 as well as the I219. It may or may not be the same issue; nonetheless, the symptoms are the same.
Personally, I am not going to try to solve it with patches; I am going to add a new, hopefully more robust, network card.
 

whodat

Dabbler
Joined
Apr 28, 2018
Messages
34
@CacheMeIfYouCan It looks like the imgur link to the screenshot of my solution in my post above has expired.

To “fix” this, I created a new Post-Init entry under Settings > Advanced > Init/Shutdown Scripts:

Type: COMMAND
Description: ethtool -K enp0s25 tso off gso off
When: POSTINIT
Enabled: True
 

CacheMeIfYouCan

Dabbler
Joined
Oct 23, 2023
Messages
23
I see, thanks. This solution was proposed by some of the sources, although with an (unspecified) performance penalty, and it didn't work 100% for everyone. I will try it.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Yeah, please do not use imgur or any other external image hosts. They are nothing but a crutch for people looking to cheap out on their hosting by dumping the expense of hosting the images onto someone else.
 

CacheMeIfYouCan

Dabbler
Joined
Oct 23, 2023
Messages
23
I've done various iperf3 tests and measured only a very slight 15 Mbps degradation in throughput (935 Mbps vs 950 Mbps). Nothing I could even quantify in CPU load; it's the same. I should know within a few days whether the "Detected Hardware Unit Hang" errors are gone.
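
For anyone wanting to run a similar before/after comparison, a basic iperf3 test looks like this (hostname is a placeholder):
Code:
# on another machine on the LAN
iperf3 -s
# on the NAS: 30-second run, then repeat with the offloads toggled
iperf3 -c client.local -t 30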
 