Debug random hard locks not caused by hardware

polygonalsnow · Apr 16, 2020

Hi all,

I'm about at my wits end with my freeNAS box. I've replaced every bit of hardware to try and stop these hard locks, and yet they persist. (For reference, by hard lock I mean the system goes unresponsive to any input) They happen usually under heavy disk/IO load, which leads me to believe maybe it's a ZFS issue? You can see my current setup in my signature. Is there some setting I can turn on to get more verbose logging and maybe see where these locks are coming from? Please help with anything you can think of

polygonalsnow · Apr 17, 2020

If it's helpful, here is what 'top' shows when left on when it crashes:

Code:

last pid: 65033;  load averages:  0.36,  0.88,  0.70                                                                                               up 0+21:14:18  14:45:01
98 processes:  1 running, 97 sleeping
CPU:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
Mem: 494M Active, 2807M Inact, 12G Wired, 381M Free
ARC: 9580M Total, 211M MFU, 7688M MRU, 1053M Anon, 58M Header, 570M Other
     7532M Compressed, 7885M Uncompressed, 1.05:1 Ratio
Swap:

  PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
 6393 colton       10  20    0   326M   250M uwait   3  11:24   0.33% mono-sgen
 6127 root         34  20    0   427M   237M select  5   9:12   0.07% python2.7
30939 root          1  20    0  7940K  4604K CPU4    4   0:31   0.06% top
 6159 colton       25  20    0   534M   459M select  0  10:12   0.06% python2.7
 6449    972       13  52   15   245M   122M piperd  2   1:33   0.04% Plex Script Host
29735 root          1  20    0 11452K  7968K nanslp  7   0:17   0.02% gstat
 6263    972       21  23    0   581M   384M uwait   8  19:19   0.01% Plex Media Server
64481 root          1  20    0  6464K  2400K nanslp  6   0:00   0.01% cron
 6169 root          1  20    0  6464K  2392K nanslp 11   0:00   0.01% cron
 6401 root          1  20    0  6464K  2392K nanslp  5   0:00   0.01% cron
 3138 nobody        1  20    0  6928K  3240K select 14   0:01   0.01% mdnsd
 2105 nobody        1  20    0  6928K  3136K select 13   0:01   0.01% mdnsd
65028 root          1  20    0  6464K  2400K ppwait  5   0:00   0.01% cron
65031 root          1  20    0  6464K  2392K ppwait  3   0:00   0.01% cron
65030 root          1  20    0  6464K  2392K ppwait  6   0:00   0.01% cron
 6459    972       11  22    0 89168K 20120K usem    7   0:02   0.00% Plex Tuner Service
30904 root          1  20    0 12916K  8056K select  4   0:00   0.00% sshd
 4948 root          6  20    0   110M 30992K select  4   0:05   0.00% python2.7
 5916 colton       16  20    0   442M   370M uwait   4   8:57   0.00% mono-sgen
 6623 root          1  20    0 12916K  8328K select  5   0:04   0.00% sshd
30897 root          1  20    0 12916K  8116K select  6   0:02   0.00% sshd
 2301 root          2  20    0 20624K  9356K kqread  5   0:01   0.00% syslog-ng
  235 root         22  22    0   266M   212M kqread 12   5:17   0.00% python3.6
 3184 root          1  20    0   115M   100M kqread  4   0:04   0.00% uwsgi-3.6
 3025 root          8  20    0 35644K 12468K select 13   0:57   0.00% rrdcached
 2996 root          1  20    0 37456K 21368K select  5   0:01   0.00% winbindd
65032 root          1  20    0  6464K  2392K nanslp 10   0:00   0.00% cron
65029 root          1  20    0  6464K  2400K nanslp  7   0:00   0.00% cron
65033 root          1  20    0  6464K  2392K nanslp  3   0:00   0.00% cron
14484    972        1  20    0 25648K  6132K kqread 15   6:11   0.00% EasyAudioEncoder
 3392 root         11  20    0   102M 60888K nanslp 13   4:02   0.00% collectd
 3335 root         15  20    0   166M   134M umtxn  11   0:33   0.00% uwsgi-3.6
40935 root          1  20    0   170M   151M select  0   0:33   0.00% smbd
  345 root          2  20    0   123M   107M usem    4   0:17   0.00% python3.6
  348 root          2  20    0   123M   107M usem    3   0:17   0.00% python3.6
  347 root          2  20    0   123M   108M usem   14   0:16   0.00% python3.6
  346 root          2  20    0   123M   109M piperd  3   0:16   0.00% python3.6
  344 root          2  20    0   123M   107M usem   10   0:16   0.00% python3.6
64988 root          1  25    0  6296K  3040K zio->i  6   0:08   0.00% cp
 2784 root          1  20    0 12488K 12596K select  8   0:03   0.00% ntpd
 5595 root          1  52    0 65208K 59392K ttyin   6   0:01   0.00% python3.6
 2991 root          1  20    0   162M   145M select 14   0:01   0.00% smbd
 2988 root          1  20    0 29404K 16492K select 11   0:01   0.00% nmbd
 6625 root          1  20    0 12916K  8056K select 15   0:01   0.00% sshd
64313 root          1  52    0 45424K 39888K wait    0   0:01   0.00% python3.6
30799 root          1  20    0 12916K  8084K select  4   0:00   0.00% sshd
30806 root          1  20    0 12916K  8056K select 11   0:00   0.00% sshd

And also, this is what 'gstat -I ls -p' shows on crash:

Code:

dT: 1.018s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    0      0      0      0    0.0      0      0    0.0    0.0| ada0
    0      0      0      0    0.0      0      0    0.0    0.0| ada1
    9      0      0      0    0.0      0      0    0.0    0.0| ada2
    0      0      0      0    0.0      0      0    0.0    0.0| ada3
    0      0      0      0    0.0      0      0    0.0    0.0| ada4
    0      0      0      0    0.0      0      0    0.0    0.0| ada5
    0      0      0      0    0.0      0      0    0.0    0.0| ada6
    0      0      0      0    0.0      0      0    0.0    0.0| ada7
    0      0      0      0    0.0      0      0    0.0    0.0| ada8

I think this is interesting, as disk usage is extremely high (90-100% across all disks) up until the moment it crashes, at which point it drops to 0, as seen above

Apollo · Apr 17, 2020

Issue could be your SATA card.
If you have a look at Netdata and if you are able to replicate this, then Netdata might give you some insight to the root cause of the problem.
Also, what is the reason for 90-100% disk usage? what is causing it, jail such as Plex or going through ZFS maintenance such as Scrubing or Resilvering.

polygonalsnow · Apr 17, 2020

Apollo said:
Issue could be your SATA card.

It happened before I installed the SATA card too. Like I said, I'm sure it's not hardware, I've replaced literally every component with one that is different enough to rule out hardware as the issue.

I'll look into Netdata to see if I can find anything, and will report back.

Apollo said:
Also, what is the reason for 90-100% disk usage?

Copying large files both internally and externally will cause the crash. If I do a transfer of a large movie to the server over SMB or if I do just a generic cp from one folder to another, both end in hard locks after a few minutes. I'm just not sure how to debug this, as nothing gets printed to /var/log/messages (I've checked)

Apollo · Apr 17, 2020

I have seen similar behavior in earlier version of Freenas (11.1 around or so) and was caused by jails.

polygonalsnow · Apr 17, 2020

So I managed to induce a crash again, this time with Netdata running. Everything looks normal except right before the crash, wired RAM usage rises significantly, and committed memory approaches 2.6GB exactly. Committed memory hits 2.5926GB about 15sec before the crash, and is at 2.5986GB 1sec before the crash. Does this number mean anything to FreeNAS or FreeBSD? Memory as a whole is fine though, as 1sec before the crash there was still 5GB free.

Apollo said:
I have seen similar behavior in earlier version of Freenas (11.1 around or so) and was caused by jails.

As for jails, yes, I'm running a few: 1 warden jail (Plex), and 4 iocage jails. They are all running when the crash occurs. The cp command is being run in an iocage jail, but I don't believe SMB uses a jail. What do you think I should do with them?

Apollo · Apr 17, 2020

It would seem consistent with what I have experienced in the past. Not sure when there was enough RAM available though.
All the jails use network interface and I think this is where the issue lies. If you are up for it, then you could upgrade to latest Freenas build. You want to make sure you have a backup of your config just in case.

polygonalsnow · Apr 17, 2020

I'd rather avoid 11.3 unless I knew for sure it was stable and there was a good chance it would fix my problem.

This system has actually been plagued with this issue since 11.1-U2, but I guess that's still in line with your experience. Is there some way I can debug this though? Like a file where live zfs, jail or network debug output is printed? I'm still pretty new when it comes to FreeBSD (even though I've been battling this for almost 2 years now)

polygonalsnow · Apr 18, 2020

I swapped over my last jail to iocage last night, as well as converted them over to using VNET instead of shared IP. Still crashes. Here's the top -S output:

Code:

last pid:  9614;  load averages:  2.87,  3.69,  3.14                                                  up 0+00:46:13  17:15:26
100 processes: 5 running, 94 sleeping, 1 waiting
CPU:  0.0% user,  0.0% nice, 25.3% system,  2.0% interrupt, 72.7% idle
Mem: 1767M Active, 857M Inact, 20M Laundry, 12G Wired, 1084M Free
ARC: 11G Total, 2544M MFU, 7578M MRU, 743M Anon, 28M Header, 25M Other
     10G Compressed, 10G Uncompressed, 1.03:1 Ratio
Swap:

  PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
   11 root         16 155 ki31     0K   256K CPU0    0 616:00 738.79% idle
9613 root          1  23    0  8584K  4676K CPU12  12   0:02 100.21% smartctl
9612 root          1  25    0  8584K  4676K CPU3    3   0:02 100.21% smartctl
3398 root         17  20    0   104M 57192K nanslp  8   0:11  20.41% collectd
   12 root         53 -56    -     0K   848K WAIT   -1  10:10  20.06% intr
9614 root          1  52    0  8584K  4680K cbwait 10   0:00  20.02% smartctl
   16 root          1 -16    -     0K    16K CPU5    5   2:58   9.74% rand_harvestq
2315 root          2  23    0 20456K  9072K kqread  0   0:00   7.65% syslog-ng
8478 colton       25  20    0   753M   624M select  7  22:03   0.12% python2.7
6084 root          1  20    0  7940K  3828K CPU6    6   0:01   0.06% top
7270    972       14  52   15   117M 71620K piperd  4   0:05   0.05% Plex Script Host
   17 root          3 -16    -     0K    48K psleep 10   0:01   0.01% pagedaemon
7223    972       18  20    0   285M   167M uwait   0   0:08   0.01% Plex Media Server
7281    972       10  52    0 75680K 49396K piperd 12   0:02   0.01% Plex Script Host
7280    972       11  21    0 32280K 21332K usem   11   0:00   0.00% Plex Tuner Service
5140 root          1  21    0 12916K  8080K select  5   0:02   0.00% sshd
8862 colton       16  20    0   298M   223M uwait   4   1:30   0.00% mono-sgen
3189 root          1  20    0   115M   100M kqread 10   0:02   0.00% uwsgi-3.6
   15 root         15  -8    -     0K   272K t->zth 13   0:41   0.00% zfskern
3039 root          8  20    0 33596K 11516K select  5   0:02   0.00% rrdcached
  234 root         22  26    0   240M   185M kqread  8   0:17   0.00% python3.6
   23 root          1  16    -     0K    16K syncer 11   0:00   0.00% syncer
   20 root          1 -16    -     0K    16K -      10   0:00   0.00% bufspacedaemon
   21 root          1 -16    -     0K    16K psleep 11   0:00   0.00% bufdaemon
    0 root        804 -16    -     0K 12864K swapin 14  29:53   0.00% kernel
    4 root          4 -16    -     0K    64K -      10   0:51   0.00% cam
9555 colton        1  38    0 12848K  7528K zio->i  4   0:22   0.00% unrar
7643 colton       14  20    0   281M   221M pause  13   0:15   0.00% mono-sgen
  344 root          2  20    0   122M   106M usem    6   0:05   0.00% python3.6
  345 root          2  20    0   120M   106M piperd  8   0:04   0.00% python3.6
  346 root          2  20    0   121M   107M usem    0   0:04   0.00% python3.6
  348 root          2  20    0   120M   106M usem   14   0:04   0.00% python3.6
  347 root          2  20    0   123M   107M usem    6   0:04   0.00% python3.6
3341 root         15  20    0   158M   126M umtxn  11   0:03   0.00% uwsgi-3.6
3998 root          1  52    0 65208K 59388K ttyin   2   0:01   0.00% python3.6
3005 root          1  20    0   161M   145M select 15   0:00   0.00% smbd
2798 root          1  20    0 12488K 12596K select 11   0:00   0.00% ntpd
3143 nobody        1  20    0  6928K  3380K select  5   0:00   0.00% mdnsd
9136 root          1  20    0  7480K  4416K pause   3   0:00   0.00% csh
2127 nobody        1  20    0  6928K  3308K select  6   0:00   0.00% mdnsd
3014 root          1  20    0 10500K  5284K nanslp  4   0:00   0.00% smartd
3131 www           1  20    0 12296K  8152K kqread 11   0:00   0.00% nginx
   13 root          3  -8    -     0K    48K -      15   0:00   0.00% geom
9132 root          1  20    0 12916K  8056K select 12   0:00   0.00% sshd
  243 root          1  52    0 15532K 11400K piperd 14   0:00   0.00% python3.6
2005 root          1  20    0  9176K  5556K select 15   0:00   0.00% devd
3002 root          1  20    0 29404K 16488K select 11   0:00   0.00% nmbd
   14 root         10 -68    -     0K   160K -       1   0:00   0.00% usb
3010 root          1  20    0 37448K 21352K select  5   0:00   0.00% winbindd
5142 root          1  20    0 12916K  8056K select 10   0:00   0.00% sshd
3027 root          1  20    0 78668K 63052K select 10   0:00   0.00% winbindd
5139 root          1  20    0   164M   147M select 14   0:00   0.00% smbd
    1 root          1  23    0  5396K  1012K wait    0   0:00   0.00% init
9134 root          1  20    0 12916K  8056K select  8   0:00   0.00% sshd
5144 root          1  40    0  7480K  4524K pause   2   0:00   0.00% csh
9139 root          1  24    0  7480K  3416K pause  11   0:00   0.00% csh
5147 root          1  24    0  7480K  3416K pause  13   0:00   0.00% csh
   22 root          1 -16    -     0K    16K vlruwt  5   0:00   0.00% vnlru
3782 root          1  20    0  6488K  2516K nanslp  0   0:00   0.00% cron
3168 root          1  20    0 39156K 21784K select  3   0:00   0.00% winbindd
7232 root          1  20    0  6464K  2424K nanslp  1   0:00   0.00% cron
7650 root          1  20    0  6464K  2392K nanslp 11   0:00   0.00% cron
9419 root          1  52    0  7108K  3120K wait    2   0:00   0.00% sh
3170 root          1  20    0 37428K 21644K select  3   0:00   0.00% winbindd
8869 root          1  20    0  6464K  2392K nanslp 14   0:00   0.00% cron
8489 root          1  20    0  6464K  2392K nanslp  8   0:00   0.00% cron
3772 root          1  20    0  9012K  4904K select 12   0:00   0.00% zfsd
7589 root          1  20    0  6412K  2380K select  3   0:00   0.00% syslogd
7130 root          1  20    0  6420K  2420K zio->i  9   0:00   0.00% syslogd
3047 root          1  20    0   119M   103M select  3   0:00   0.00% smbd
3054 root          1  20    0   119M   102M select  7   0:00   0.00% smbd
8808 root          1  20    0  6412K  2380K select  0   0:00   0.00% syslogd
9141 root          1  23    0 10164K  5660K select  2   0:00   0.00% sftp-server
8425 root          1  20    0  6420K  2388K select 15   0:00   0.00% syslogd
9430 root          1  20    0  6900K  2972K ttyin   7   0:00   0.00% more

I noticed that the kernel is spending all its time in the swapin state, but have no clue what that means. I also see that top reports blank for the swap, but don't even remember if I gave the system swap or not so that may be normal? Is this something I should look into? If so, how?

vmstat -i shows:

Code:

root@freenas:~ # vmstat -i
interrupt                          total       rate
cpu0:timer                        230176        494
cpu14:timer                       197305        424
cpu7:timer                        149664        321
cpu5:timer                        149331        321
cpu4:timer                        151933        326
cpu3:timer                        183405        394
cpu2:timer                        179987        387
cpu8:timer                        147327        316
cpu9:timer                        144982        311
cpu11:timer                       135363        291
cpu13:timer                       150557        323
cpu12:timer                       145430        312
cpu1:timer                        162831        350
cpu10:timer                       192923        414
cpu6:timer                        151416        325
cpu15:timer                       131126        282
irq260: ahci0                     821424       1764
irq261: ahci1                     611056       1313
irq262: igb0:que 0                387602        833
irq263: igb0:que 1                348632        749
irq264: igb0:que 2                327273        703
irq265: igb0:que 3                356571        766
irq266: igb0:link                      4          0
irq272: xhci1                        126          0
irq276: ahci2:ch3                   9756         21
Total                            5466200      11741

Which I think is pretty normal, but maybe an expert can chime in if any of these look too large for a system that's been up for 10min

Apollo · Apr 18, 2020

Not sure.
Check the cache on the pool as it was a sign of a memory leak. If you see the swap using other than 0 it is most likely the issue.
You could stop the jails and see if loading your system as you did would cause the crash to occur.
Look at the cache in NetData. If it reaches max cache capacity and crash, that would be the reason.

You next option might just be to upgrade to 11.3

polygonalsnow · Apr 24, 2020

There doesn't seem to be a memory leak, as there's free memory all the way up until it crashes. Moving files is definitely the cause though, as I just induced it again, even without jails running. Something as simple as a root CP from one folder to another is causing it to go down. Can someone PLEASE help me out with a way to debug this?

polygonalsnow · Apr 29, 2020

Bump, any advice?

Newfoundland.Republic · Apr 29, 2020

This is a little out of my depth, but could it be that generic SATA with the Marvell 88SE9215?

I've seen several posts on using AMD (some get it to work, others seem to have nothing but sorrow, it seems - that is listed in the update for TrueNAS Core 12.0 (better AMD support).

Newfoundland.Republic · Apr 29, 2020

Another thought: Have you considered upgrading to 11.3 U2.1? Check the jira log and see if anything matches your issue

polygonalsnow · Apr 29, 2020

Newfoundland.Republic said:
This is a little out of my depth, but could it be that generic SATA with the Marvell 88SE9215?

As I stated earlier (post #4) and multiple times, hardware is not the issue. It happened before I installed the SATA card, and I've used a different passthrough RAID card to confirm it's not the issue.

Newfoundland.Republic said:
Another thought: Have you considered upgrading to 11.3 U2.1? Check the jira log and see if anything matches your issue

I've thought about it but given the bugs other people are having, I'd rather not add more issues to the mix. I checked jira but nothing about a core issue with moving files has come up (it's a NAS for crying out loud, why is it not capable of moving files?)

Fundamentally, I need advice on how to debug the software. How can I peer into what ZFS is doing under the hood when it crashes? dmesg certainly doesn't give me enough info

Newfoundland.Republic · Apr 30, 2020

Ok, I didn't see the part on not using that SATA card.

There is a U2.1 release for 11.3 that has more fixes. Did you check that? I've been running on 11.3 up to the currrent U2.1 release without issues. (I'll note that you will see probably orders of magnitude more posts on people with issues - people without issues don't post;))

I see that you say you've replaced just about everything but just to check: Did you try replacing the SATA cables? There could be something itermittent there. I had that happen to me where the problem would only show up on heavy disk I/O.

If you haven't replaced the RAM have you run a memtest? I think that RAM is used as cache for ZFS and I could see heavy transfers showing up any problems there, I'm not sure on that.

polygonalsnow · May 2, 2020

Just updated to 11.3U2.1, will report back if it fixes/doesn't fix the issue

polygonalsnow · May 3, 2020

Welp, it just crashed again, still no luck.

Newfoundland.Republic said:
I see that you say you've replaced just about everything but just to check: Did you try replacing the SATA cables? There could be something itermittent there. I had that happen to me where the problem would only show up on heavy disk I/O.

If you haven't replaced the RAM have you run a memtest? I think that RAM is used as cache for ZFS and I could see heavy transfers showing up any problems there, I'm not sure on that.

I've tried replacing the SATA cables and the memory, once again, I repeat that my issue is not hardware related, or replacing literally every part on the machine would've fixed it. The only case it may be is that it is FreeNAS's (or therefore FreeBSD's) support of Ryzen CPUs as a whole, as I've tried both a 1700 and 2700 to make sure it wasn't the weird first generation Ryzen bug. I find this unlikely as others run Ryzen-based systems without this issue.

Newfoundland.Republic · May 3, 2020

Well, if you are set on Ryzen, the only other option, if you have not replaced the motherboard, I can think of is a server Ryzen board such as the ASRock Rack X470D4U. It seems like a nice board (it has onboard dual NICs, 2 M.2 slots, IPMI and can use ECC RAM and none of the non-server stuff like audio which you can't use anyway). But, if you search the threads, Intel i3 and Xeon boards are more successful than AMD.

polygonalsnow · May 3, 2020

Is there any place I can find people with experience actually debugging issues instead of throwing more hardware at it? I have replaced the motherboard already, so I'd rather not drop another $250 on a new one when the one I have has no issues, and it's FreeNAS's problem. This hard lock really should be something that drops some hints in the software before occurring, I just need to know how to read them.

I can't go to the FreeBSD forums because they'll just say I should be using 12.x and then send me back here. Can someone with more experience throw me a bone?

Important Announcement for the TrueNAS Community.

Debug random hard locks not caused by hardware

Dabbler

Dabbler

Wizard

Dabbler

Wizard

Dabbler

Wizard

Dabbler

Dabbler

Wizard

Dabbler

Dabbler

Guru

Guru

Dabbler

Guru

Dabbler

Dabbler

Guru

Dabbler

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Debug random hard locks not caused by hardware"

Similar threads