Freezes and crashes after Upgrading to TrueNAS-12.0-U2.1

CFelix

Cadet
Joined
Nov 11, 2019
Messages
4
Hi,

we upgraded our FreeNAS-11.3-U5 to TrueNAS-12.0-U2.1 a few days ago. The Upgrade/Install as such was smooth and done within about 10 to 20 minutes.

TrueNAS was running fine for a few days and than started acting up all of a sudden. We had no SSH access, no GUI access and our Storage backups could not be written anymore. At first I didn't think much of it and classed it as a one time only freeze and classically restarted the machine. But last night it happened again and this time, during a resilvering, as we are just about to upgrade some of the disks in our pool.

The Pool mainly consisted of 6x6TB disks and we want to replace those with 8TB disks, so we added two new 8TB disks and started replacing the first 6TB disk of the volume. The resilvering process was already at 60.xx% on Friday and I thought, I could start the replacement of the second one on Monday, until TrueNAS crashed.

I'm a little worried about the resilvering process, because when I issue the CLI command zpool status, it still says "resilvering" but when I look at the GUI, it doesn't reflect that status.

This is a snippet from /var/log/messages at the time, the behavior started:

Code:
Mar 13 00:00:01 vm-freenas-a-1 syslog-ng[1402]: Configuration reload request received, reloading configuration;
Mar 13 00:00:01 vm-freenas-a-1 syslog-ng[1402]: Configuration reload finished;
Mar 13 02:01:48 vm-freenas-a-1 1 2021-03-13T02:01:48.620077+00:00 vm-freenas-a-1 collectd 1730 - - Traceback (most recent call last):

  File "/usr/local/lib/collectd_pyplugins/disktemp.py", line 62, in read

    with Client() as c:

  File "/usr/local/lib/python3.8/site-packages/middlewared/client/client.py", line 281, in __init__

    self._ws.connect()

  File "/usr/local/lib/python3.8/site-packages/middlewared/client/client.py", line 124, in connect

    rv = super(WSClient, self).connect()

  File "/usr/local/lib/python3.8/site-packages/ws4py/client/__init__.py", line 223, in connect

    bytes = self.sock.recv(128)

socket.timeout: timed out

Mar 13 16:56:49 vm-freenas-a-1 1 2021-03-13T16:56:48.865154+00:00 vm-freenas-a-1 collectd 1730 - - Traceback (most recent call last):

  File "/usr/local/lib/collectd_pyplugins/disktemp.py", line 62, in read

    with Client() as c:

  File "/usr/local/lib/python3.8/site-packages/middlewared/client/client.py", line 281, in __init__

    self._ws.connect()

  File "/usr/local/lib/python3.8/site-packages/middlewared/client/client.py", line 124, in connect

    rv = super(WSClient, self).connect()

  File "/usr/local/lib/python3.8/site-packages/ws4py/client/__init__.py", line 223, in connect

    bytes = self.sock.recv(128)

socket.timeout: timed out

Mar 13 22:03:47 vm-freenas-a-1 kernel: pid 70919 (rsync), jid 0, uid 0, was killed: out of swap space
Mar 13 22:03:47 vm-freenas-a-1 kernel[1402]: Last message 'pid 70919 (rsync), j' repeated 1 times, suppressed by syslog-ng on vm-freenas-a-1
Mar 13 22:03:47 vm-freenas-a-1 kernel: pid 70920 (rsync), jid 0, uid 0, was killed: out of swap space
Mar 13 22:03:47 vm-freenas-a-1 kernel[1402]: Last message 'pid 70920 (rsync), j' repeated 1 times, suppressed by syslog-ng on vm-freenas-a-1
Mar 13 22:03:47 vm-freenas-a-1 kernel: pid 1730 (collectd), jid 0, uid 0, was killed: out of swap space
Mar 13 22:03:47 vm-freenas-a-1 kernel[1402]: Last message 'pid 1730 (collectd),' repeated 1 times, suppressed by syslog-ng on vm-freenas-a-1
Mar 13 22:03:47 vm-freenas-a-1 kernel: pid 277 (python3.8), jid 0, uid 0, was killed: out of swap space
Mar 13 22:03:47 vm-freenas-a-1 kernel[1402]: Last message 'pid 277 (python3.8),' repeated 1 times, suppressed by syslog-ng on vm-freenas-a-1
Mar 13 22:03:47 vm-freenas-a-1 kernel: pid 278 (python3.8), jid 0, uid 0, was killed: out of swap space
Mar 13 22:04:47 vm-freenas-a-1 kernel[1402]: Last message 'pid 278 (python3.8),' repeated 1 times, suppressed by syslog-ng on vm-freenas-a-1


I did do some research about the out of swap space message but could only find old threads from 2017/2018, which did only allow me to check a few things being mentioned but couldn't really help me to figure out what is actually going on:

About the machine:

CPU: 4 Cores
RAM: 12 GB
SYS HDD: 16GB

Autotune: This was and has never been enabled
Tunables: None (since Autotune was never enabled)
Jails: None

vfs.zfs.arc_max: This had been mentioned quite a few times in the other threads I could find but, this is what I get, when I issue the sysctl command:

Code:
root@vm-freenas-a-1:~ # sysctl -a vfs.zfs.arc_max
vfs.zfs.arc_max: 0


Not sure what is going on, since we didn't have those problems with the FreeNAS-11.3-U5 version. Please could someone help us out? Any help is really appreciated.

If I missed any information, please let me know and I will provide that as soon as possible.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
12 GB of RAM is pretty low for TrueNAS 12.x. Your system is trying to swap to compensate, but one difference between 11.x and 12.x is in 12, all the swap partitions are bound into a single mirror, so your total swap is 2 GB; whereas in 11.x, swap is a stripe of mirrors, so total swap is 4 GB. I'd say you ran into memory starvation, and should look into reverting to 11.3 until you can upgrade your RAM.

Please provide all details for your hardware, and your pool.
 

CFelix

Cadet
Joined
Nov 11, 2019
Messages
4
Hi Samuel,

Thank you for your response! You were saying that 12GB of RAM is pretty low for TrueNAS 12.x - How much do you recommend as a minimum?

TrueNAS is running as VM, so it wouldn't be a problem to allocate more RAM at any point. Would this affect the current resilvering in any way, if I just shut down the VM and allocate more RAM?

Please see below the hardware and pool details. (If that's not what you wanted, please do tell me if there is a CLI command, that provides the required information and I'm happy to provide it)

Hardware details:
CPU: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz (2400.00-MHz K8-class CPU) (4 Cores)
RAM: 12GB DDR-4 1866 MHz
HDD: 1x16GB | 5x6TB + 1x8TB (RAIDZ1)

Code:
#: top

last pid:  5785;  load averages:  1.24,  1.18,  1.16                                                                                                                               up 0+03:24:07  14:33:41
48 processes:  1 running, 47 sleeping
CPU:  0.4% user,  0.0% nice, 24.8% system,  0.0% interrupt, 74.8% idle
Mem: 730M Active, 1190M Inact, 9608M Wired, 374M Free
ARC: 6966M Total, 4263M MFU, 1556M MRU, 1860K Anon, 392M Header, 749M Other
     5449M Compressed, 18G Uncompressed, 3.37:1 Ratio
Swap: 6144M Total, 6144M Free


Code:
#: camcontrol devlist

<NECVMWar VMware IDE CDR10 1.00>   at scbus1 target 0 lun 0 (pass0,cd0)
<VMware Virtual disk 2.0>          at scbus2 target 0 lun 0 (pass1,da0)
<ATA HGST HDN726060AL T517>        at scbus3 target 0 lun 0 (pass2,da1)
<ATA HGST HUS728T8TAL RT03>        at scbus3 target 1 lun 0 (pass3,da2)
<ATA HGST HDN726060AL T517>        at scbus3 target 2 lun 0 (pass4,da3)
<ATA HGST HDN726060AL T517>        at scbus3 target 3 lun 0 (pass5,da4)
<ATA HGST HDN726060AL T517>        at scbus3 target 4 lun 0 (pass6,da5)
<ATA HGST HDN726060AL T517>        at scbus3 target 5 lun 0 (pass7,da6)
<ATA HGST HUS728T8TAL RT04>        at scbus3 target 6 lun 0 (pass8,da7)
<ATA HGST HUS728T8TAL RT04>        at scbus3 target 8 lun 0 (pass9,da8)


Code:
#: glabel status

                                      Name  Status  Components
gptid/4d0c1e07-fb10-11e5-a2bf-005056a71516     N/A  da0p1
gptid/08b93e77-fb15-11e5-a5e5-005056a71516     N/A  da1p2
gptid/8ba6696c-49b5-11ea-bcb1-005056a71516     N/A  da2p2
gptid/098b110c-fb15-11e5-a5e5-005056a71516     N/A  da3p2
gptid/0854553f-fb15-11e5-a5e5-005056a71516     N/A  da4p2
gptid/07f4c2e2-fb15-11e5-a5e5-005056a71516     N/A  da5p2
gptid/09ee16e6-fb15-11e5-a5e5-005056a71516     N/A  da6p2
gptid/4a90f09d-819b-11eb-a990-005056a71516     N/A  da7p2
gptid/08b3d65a-fb15-11e5-a5e5-005056a71516     N/A  da1p1


We are currently replacing da1 (6TB) with da7 (8TB)

Pool details:
Code:
#: zpool status -v

pool: freenas-boot
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
    still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
    the pool may no longer be accessible by software that does not support
    the features. See zpool-features(5) for details.
  scan: scrub repaired 0B in 00:01:55 with 0 errors on Wed Mar 10 03:46:55 2021
config:

    NAME          STATE     READ WRITE CKSUM
    freenas-boot  ONLINE       0     0     0
      da0p2       ONLINE       0     0     0

errors: No known data errors

  pool: volume1
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Mar 10 12:22:35 2021
    15.9T scanned at 27.3M/s, 15.9T issued at 26.7M/s, 25.9T total
    2.71T resilvered, 61.59% done, 4 days 12:16:44 to go
config:

    NAME                                              STATE     READ WRITE CKSUM
    volume1                                           ONLINE       0     0     0
      raidz1-0                                        ONLINE       0     0     0
        gptid/07f4c2e2-fb15-11e5-a5e5-005056a71516    ONLINE       0     0     0
        gptid/0854553f-fb15-11e5-a5e5-005056a71516    ONLINE       0     0     0
        replacing-2                                   ONLINE       0     0     0
          gptid/08b93e77-fb15-11e5-a5e5-005056a71516  ONLINE       0     0     0
          gptid/4a90f09d-819b-11eb-a990-005056a71516  ONLINE       0     0     0  (resilvering)
        gptid/8ba6696c-49b5-11ea-bcb1-005056a71516    ONLINE       0     0     0
        gptid/098b110c-fb15-11e5-a5e5-005056a71516    ONLINE       0     0     0
        gptid/09ee16e6-fb15-11e5-a5e5-005056a71516    ONLINE       0     0     0

errors: No known data errors
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
You didn't mention you were running as a VM. How are you passing through the disks to TrueNAS? As individual disks, or did you pass through the whole controller?


You should wait until the resilver completes, just to be on the safe side. According to the Core Hardware Guide, 8 GB RAM is the absolute minimum, but 16 GB is a more realistic minimum. The more RAM you can throw at TrueNAS, the happier it will be with larger ARC available.
 

CFelix

Cadet
Joined
Nov 11, 2019
Messages
4
Sorry about the missing information, that TrueNAS is running as a VM. We pass through each single disk to the VM. (Raw Device Mapping) So far this was never an issue but reading the linked post, it does not seem recommended. We might have to rethink this strategy in the near future.

Regarding the RAM - The resilvering just finished successfully (I was a little worried after yesterdays crash), so I would now shutdown the VM and allocate at least 16GB and see how that goes. If that is still not enough, we will have to add more memory sticks in the server.
 
Top