SOLVED: Update from 11.3-U5 to 12, poor performance and odd memory usage

Jon Moog

Dabbler
Joined
Apr 24, 2017
Messages
21
I've updated two machines to the latest release of 12 and am seeing unusual behavior on one.

The first isn't used much and has 32 GB of RAM; it behaves the same now as it did before the update, and performance is familiar.

The second machine has 160 GB of RAM and shows 120+ GB free no matter how much disk activity there is, which is not what I'd expect from caching. Despite the large pool of free memory, the machine still lightly touches swap. The ARC sits at roughly 8 GB and is not growing. Performance acts as though there is a shortage of cache, with the disks being accessed far more frequently than was normal under 11.3. Monitoring the drives with gstat shows disks frequently pegged at 100% busy with high ops/s but low throughput. I've added an L2ARC, which grew for a while and then stalled well short of the device's capacity. Something appears to be wrong with this box, but I'm at a loss as to where to look. Other than performance, everything seems fine: VMs, jails, and file-sharing services all work as expected, just slower.

Any suggestions are welcome.

Code:
155 processes: 1 running, 154 sleeping
CPU:  0.4% user,  0.0% nice,  5.0% system,  0.1% interrupt, 94.5% idle
Mem: 10G Active, 7178M Inact, 487M Laundry, 16G Wired, 122G Free
ARC: 7984M Total, 1648M MFU, 1845M MRU, 52M Anon, 722M Header, 3717M Other
     773M Compressed, 2779M Uncompressed, 3.59:1 Ratio
Swap: 64G Total, 5458M Used, 59G Free, 8% Inuse, 16K In
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
Can you post complete system specs (doesn't have to be a dmesg, just a summary of CPU/board/memory/HBA/drives)?

Monitoring the drives with gstat shows disks frequently pegged at 100% busy with high ops/s but low throughput.

Specifically, I'm looking for the exact model numbers of these drives, please.

Regardless, it doesn't look like your system is putting the ARC to use. Do you have any autotune settings enabled?
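If you want a quick way to check from the shell, something along these lines should show the current limits and any leftover ARC settings (the file paths below are the usual FreeBSD locations; anything set through the GUI should also be visible under System > Tunables):

Code:
# Current ARC limits and the actual ARC size/target
sysctl vfs.zfs.arc_max vfs.zfs.arc_min
sysctl kstat.zfs.misc.arcstats.size kstat.zfs.misc.arcstats.c

# Any ARC-related settings left behind in the usual config files
grep -i arc /boot/loader.conf /boot/loader.conf.local /etc/sysctl.conf 2>/dev/null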
 

Jon Moog

Dabbler
Joined
Apr 24, 2017
Messages
21
Can you post complete system specs (doesn't have to be a dmesg, just a summary of CPU/board/memory/HBA/drives)?



Specifically, I'm looking for the exact model numbers of these drives, please.

Regardless, it doesn't look like your system is putting the ARC to use. Do you have any autotune settings enabled?

All of the rotating disks are the same WD model.

Code:
=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD40EFRX-68WT0N0
Serial Number:    WD-WCC4E0JF49Y0
LU WWN Device Id: 5 0014ee 20cec3e2e
Firmware Version: 82.00A82
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Oct 25 15:52:05 2020 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled


The disks seem to be performing okay, but without the ARC it's just not the same box it was on 11.3.

Code:
dT: 1.002s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    0      0      0      0    0.0      0      0    0.0    0.0| ada0
    4    406    234    970   14.8    173    826    0.1  100.3| da0
    3    395    265   1098   10.9    130    846    1.6   93.4| da1
    3    355    192    878   15.4    164    786    0.2   95.0| da2
    5    351    237   1066   13.9    115    758    1.2   99.3| da3
    3    376    236    978   13.6    141    778    0.9   98.3| da4
    3    460    325   1589    9.4    135    810    0.9  100.0| da5
    4    384    274   1210   11.2    110    798    2.0  100.4| da6
    3    412    283   1166   10.2    129    774    1.1   96.4| da7
    3    401    271   1226   12.3    130    794    0.8  100.0| da8
    3    380    258   1230   15.7    122    826    1.4   99.5| da9
    3    387    267   1130   11.8    120    834    1.3   99.6| da10
    3    354    221    938   12.2    134    838    1.1   88.1| da11
    4    360    225   1066   18.4    136    862    1.2  100.6| da12
    4    396    255   1054   12.4    141    822    1.1   99.2| da13
    3    374    253   1485   13.7    121    842    1.7  100.3| da14
    3    397    250   1026   13.1    147    830    1.0   99.5| da15
    3    285    230   1337   16.6     56    571    3.5  100.1| da16
    4    315    227   1305   19.2     89    539    0.2   99.3| da17
    4    340    237   1597   17.7    104    567    0.2   99.4| da18
    5    305    247   1573   16.5     59    567    2.0  100.3| da19
    3    282    222   1301   18.4     61    599    4.0   99.3| da20
    3    349    276   1717   13.2     73    615    2.0  100.4| da21
    3    293    223   1166   17.2     71    591    1.9   99.8| da22
    4    322    239   1437   14.7     84    611    1.5   99.2| da23
    3    311    235   1521   15.5     77    579    3.5   99.8| da24
    3    312    228   1305   16.9     85    583    1.4   98.8| da25
    3    330    253   1465   14.7     77    587    2.2   99.5| da26
    3    314    242   1441   14.2     73    587    1.8   98.8| da27
    3    332    250   1473   17.7     82    547    2.1   99.5| da28
    3    353    262   1465   15.0     91    547    1.3   99.3| da29
    3    343    277   1597   13.3     66    559    2.3  100.0| da30
    4    289    215   1329   16.5     75    571    2.3   99.0| da31
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
Not SMR, good to see.

Can you post the other specs? Heavy disk activity is expected during cache warmup, but if your ARC isn't actually growing, that's a bigger problem.

What does arc_summary.py show? If it's still just returning errors, try sysctl -a kstat.zfs.misc.arcstats instead. I'm interested to see whether a cap on your ARC size has somehow been put into play.
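If the full arcstats dump is too noisy, pulling just the sizing-related counters is enough to see whether a cap is in play (these are all standard entries in that kstat tree):

Code:
sysctl kstat.zfs.misc.arcstats.size \
       kstat.zfs.misc.arcstats.c \
       kstat.zfs.misc.arcstats.c_min \
       kstat.zfs.misc.arcstats.c_max \
       kstat.zfs.misc.arcstats.arc_no_grow \
       kstat.zfs.misc.arcstats.arc_meta_used \
       kstat.zfs.misc.arcstats.memory_free_bytes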
 

Jon Moog

Dabbler
Joined
Apr 24, 2017
Messages
21
Not SMR, good to see.

Can you post the other specs? Heavy disk activity is expected during cache warmup, but if your ARC isn't actually growing, that's a bigger problem.

What does arc_summary.py show? If it's still just returning errors, try sysctl -a kstat.zfs.misc.arcstats instead. I'm interested to see whether a cap on your ARC size has somehow been put into play.

I dumped the ZFS values from both systems and there is a notable difference. kstat.zfs.misc.arcstats.arc_no_grow seems relevant, but why would that ever be set to 1, and does it do what I think it does? There may be other issues I haven't noticed.
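For what it's worth, this is roughly how I collected the values from the two boxes for comparison (the hostnames are just placeholders):

Code:
# Dump the arcstats subtree on each machine, then diff the results
ssh affected-box 'sysctl kstat.zfs.misc.arcstats' > affected.txt
ssh working-box  'sysctl kstat.zfs.misc.arcstats' > working.txt
diff affected.txt working.txt | less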

Affected machine:
Code:
kstat.zfs.misc.arcstats.abd_chunk_waste_size: 452608
kstat.zfs.misc.arcstats.cached_only_in_progress: 85
kstat.zfs.misc.arcstats.arc_raw_size: 0
kstat.zfs.misc.arcstats.arc_sys_free: 0
kstat.zfs.misc.arcstats.arc_need_free: 155648
kstat.zfs.misc.arcstats.demand_hit_prescient_prefetch: 58950
kstat.zfs.misc.arcstats.demand_hit_predictive_prefetch: 820999
kstat.zfs.misc.arcstats.async_upgrade_sync: 82477
kstat.zfs.misc.arcstats.arc_meta_min: 16777216
kstat.zfs.misc.arcstats.arc_meta_max: 8572656760
kstat.zfs.misc.arcstats.arc_dnode_limit: 12791796326
kstat.zfs.misc.arcstats.arc_meta_limit: 127917963264
kstat.zfs.misc.arcstats.arc_meta_used: 8117863952
kstat.zfs.misc.arcstats.arc_prune: 0
kstat.zfs.misc.arcstats.arc_loaned_bytes: 0
kstat.zfs.misc.arcstats.arc_tempreserve: 185344
kstat.zfs.misc.arcstats.arc_no_grow: 1
kstat.zfs.misc.arcstats.memory_available_bytes: 113745129472
kstat.zfs.misc.arcstats.memory_free_bytes: 117309534208
kstat.zfs.misc.arcstats.memory_all_bytes: 171631026176
kstat.zfs.misc.arcstats.memory_indirect_count: 0
kstat.zfs.misc.arcstats.memory_direct_count: 0
kstat.zfs.misc.arcstats.memory_throttle_count: 0
kstat.zfs.misc.arcstats.l2_rebuild_log_blks: 0
kstat.zfs.misc.arcstats.l2_rebuild_bufs_precached: 0
kstat.zfs.misc.arcstats.l2_rebuild_bufs: 0
kstat.zfs.misc.arcstats.l2_rebuild_asize: 0
kstat.zfs.misc.arcstats.l2_rebuild_size: 0
kstat.zfs.misc.arcstats.l2_rebuild_lowmem: 0
kstat.zfs.misc.arcstats.l2_rebuild_cksum_lb_errors: 0
kstat.zfs.misc.arcstats.l2_rebuild_dh_errors: 0
kstat.zfs.misc.arcstats.l2_rebuild_io_errors: 0
kstat.zfs.misc.arcstats.l2_rebuild_unsupported: 0
kstat.zfs.misc.arcstats.l2_rebuild_success: 0
kstat.zfs.misc.arcstats.l2_data_to_meta_ratio: 2445
kstat.zfs.misc.arcstats.l2_log_blk_count: 10843
kstat.zfs.misc.arcstats.l2_log_blk_asize: 219041792
kstat.zfs.misc.arcstats.l2_log_blk_avg_asize: 20060
kstat.zfs.misc.arcstats.l2_log_blk_writes: 10843
kstat.zfs.misc.arcstats.l2_hdr_size: 671051520
kstat.zfs.misc.arcstats.l2_asize: 495721872896
kstat.zfs.misc.arcstats.l2_size: 527220996608
kstat.zfs.misc.arcstats.l2_io_error: 0
kstat.zfs.misc.arcstats.l2_cksum_bad: 0
kstat.zfs.misc.arcstats.l2_abort_lowmem: 0
kstat.zfs.misc.arcstats.l2_free_on_write: 41457
kstat.zfs.misc.arcstats.l2_evict_l1cached: 0
kstat.zfs.misc.arcstats.l2_evict_reading: 0
kstat.zfs.misc.arcstats.l2_evict_lock_retry: 0
kstat.zfs.misc.arcstats.l2_writes_lock_retry: 359
kstat.zfs.misc.arcstats.l2_writes_error: 0
kstat.zfs.misc.arcstats.l2_writes_done: 100874
kstat.zfs.misc.arcstats.l2_writes_sent: 100874
kstat.zfs.misc.arcstats.l2_write_bytes: 615419733504
kstat.zfs.misc.arcstats.l2_read_bytes: 98507731968
kstat.zfs.misc.arcstats.l2_rw_clash: 0
kstat.zfs.misc.arcstats.l2_feeds: 140329
kstat.zfs.misc.arcstats.l2_misses: 126628440
kstat.zfs.misc.arcstats.l2_hits: 3290398
kstat.zfs.misc.arcstats.mfu_ghost_evictable_metadata: 1700681728
kstat.zfs.misc.arcstats.mfu_ghost_evictable_data: 157092352
kstat.zfs.misc.arcstats.mfu_ghost_size: 1857774080
kstat.zfs.misc.arcstats.mfu_evictable_metadata: 23312384
kstat.zfs.misc.arcstats.mfu_evictable_data: 499712
kstat.zfs.misc.arcstats.mfu_size: 1800100352
kstat.zfs.misc.arcstats.mru_ghost_evictable_metadata: 3518428160
kstat.zfs.misc.arcstats.mru_ghost_evictable_data: 345755648
kstat.zfs.misc.arcstats.mru_ghost_size: 3864183808
kstat.zfs.misc.arcstats.mru_evictable_metadata: 0
kstat.zfs.misc.arcstats.mru_evictable_data: 0
kstat.zfs.misc.arcstats.mru_size: 1762763264
kstat.zfs.misc.arcstats.anon_evictable_metadata: 0
kstat.zfs.misc.arcstats.anon_evictable_data: 0
kstat.zfs.misc.arcstats.anon_size: 38584832
kstat.zfs.misc.arcstats.other_size: 3932283496
kstat.zfs.misc.arcstats.bonus_size: 838288000
kstat.zfs.misc.arcstats.dnode_size: 2181926880
kstat.zfs.misc.arcstats.dbuf_size: 912068616
kstat.zfs.misc.arcstats.metadata_size: 3410056704
kstat.zfs.misc.arcstats.data_size: 191334400
kstat.zfs.misc.arcstats.hdr_size: 104414568
kstat.zfs.misc.arcstats.overhead_size: 2846872576
kstat.zfs.misc.arcstats.uncompressed_size: 2894465536
kstat.zfs.misc.arcstats.compressed_size: 754518528
kstat.zfs.misc.arcstats.size: 8309588832
kstat.zfs.misc.arcstats.c_max: 170557284352
kstat.zfs.misc.arcstats.c_min: 5363469568
kstat.zfs.misc.arcstats.c: 5732397238
kstat.zfs.misc.arcstats.p: 358274827
kstat.zfs.misc.arcstats.hash_chain_max: 6
kstat.zfs.misc.arcstats.hash_chains: 700359
kstat.zfs.misc.arcstats.hash_collisions: 5885438
kstat.zfs.misc.arcstats.hash_elements_max: 7612091
kstat.zfs.misc.arcstats.hash_elements: 7369322
kstat.zfs.misc.arcstats.evict_l2_skip: 0
kstat.zfs.misc.arcstats.evict_l2_ineligible: 34596002304
kstat.zfs.misc.arcstats.evict_l2_eligible: 509930107392
kstat.zfs.misc.arcstats.evict_l2_cached: 2850497044992
kstat.zfs.misc.arcstats.evict_not_enough: 268934870
kstat.zfs.misc.arcstats.evict_skip: 295678753538
kstat.zfs.misc.arcstats.access_skip: 5
kstat.zfs.misc.arcstats.mutex_miss: 274466209
kstat.zfs.misc.arcstats.deleted: 15108576
kstat.zfs.misc.arcstats.mfu_ghost_hits: 120242256
kstat.zfs.misc.arcstats.mfu_hits: 2277244860
kstat.zfs.misc.arcstats.mru_ghost_hits: 3020365
kstat.zfs.misc.arcstats.mru_hits: 18320510
kstat.zfs.misc.arcstats.prefetch_metadata_misses: 119923407
kstat.zfs.misc.arcstats.prefetch_metadata_hits: 2164
kstat.zfs.misc.arcstats.prefetch_data_misses: 1212327
kstat.zfs.misc.arcstats.prefetch_data_hits: 0
kstat.zfs.misc.arcstats.demand_metadata_misses: 2654814
kstat.zfs.misc.arcstats.demand_metadata_hits: 2249051857
kstat.zfs.misc.arcstats.demand_data_misses: 6135515
kstat.zfs.misc.arcstats.demand_data_hits: 46511381
kstat.zfs.misc.arcstats.misses: 129926063
kstat.zfs.misc.arcstats.hits: 2295565402


Working machine:
Code:
kstat.zfs.misc.arcstats.abd_chunk_waste_size: 14371328
kstat.zfs.misc.arcstats.cached_only_in_progress: 0
kstat.zfs.misc.arcstats.arc_raw_size: 0
kstat.zfs.misc.arcstats.arc_sys_free: 0
kstat.zfs.misc.arcstats.arc_need_free: 0
kstat.zfs.misc.arcstats.demand_hit_prescient_prefetch: 28760863
kstat.zfs.misc.arcstats.demand_hit_predictive_prefetch: 9381
kstat.zfs.misc.arcstats.async_upgrade_sync: 253
kstat.zfs.misc.arcstats.arc_meta_min: 16777216
kstat.zfs.misc.arcstats.arc_meta_max: 17533671112
kstat.zfs.misc.arcstats.arc_dnode_limit: 2489555558
kstat.zfs.misc.arcstats.arc_meta_limit: 24895555584
kstat.zfs.misc.arcstats.arc_meta_used: 17518654424
kstat.zfs.misc.arcstats.arc_prune: 0
kstat.zfs.misc.arcstats.arc_loaned_bytes: 0
kstat.zfs.misc.arcstats.arc_tempreserve: 0
kstat.zfs.misc.arcstats.arc_no_grow: 0
kstat.zfs.misc.arcstats.memory_available_bytes: 3111428096
kstat.zfs.misc.arcstats.memory_free_bytes: 3822272512
kstat.zfs.misc.arcstats.memory_all_bytes: 34267815936
kstat.zfs.misc.arcstats.memory_indirect_count: 0
kstat.zfs.misc.arcstats.memory_direct_count: 0
kstat.zfs.misc.arcstats.memory_throttle_count: 0
kstat.zfs.misc.arcstats.l2_rebuild_log_blks: 0
kstat.zfs.misc.arcstats.l2_rebuild_bufs_precached: 0
kstat.zfs.misc.arcstats.l2_rebuild_bufs: 0
kstat.zfs.misc.arcstats.l2_rebuild_asize: 0
kstat.zfs.misc.arcstats.l2_rebuild_size: 0
kstat.zfs.misc.arcstats.l2_rebuild_lowmem: 0
kstat.zfs.misc.arcstats.l2_rebuild_cksum_lb_errors: 0
kstat.zfs.misc.arcstats.l2_rebuild_dh_errors: 0
kstat.zfs.misc.arcstats.l2_rebuild_io_errors: 0
kstat.zfs.misc.arcstats.l2_rebuild_unsupported: 0
kstat.zfs.misc.arcstats.l2_rebuild_success: 0
kstat.zfs.misc.arcstats.l2_data_to_meta_ratio: 0
kstat.zfs.misc.arcstats.l2_log_blk_count: 0
kstat.zfs.misc.arcstats.l2_log_blk_asize: 0
kstat.zfs.misc.arcstats.l2_log_blk_avg_asize: 0
kstat.zfs.misc.arcstats.l2_log_blk_writes: 0
kstat.zfs.misc.arcstats.l2_hdr_size: 0
kstat.zfs.misc.arcstats.l2_asize: 0
kstat.zfs.misc.arcstats.l2_size: 0
kstat.zfs.misc.arcstats.l2_io_error: 0
kstat.zfs.misc.arcstats.l2_cksum_bad: 0
kstat.zfs.misc.arcstats.l2_abort_lowmem: 0
kstat.zfs.misc.arcstats.l2_free_on_write: 0
kstat.zfs.misc.arcstats.l2_evict_l1cached: 0
kstat.zfs.misc.arcstats.l2_evict_reading: 0
kstat.zfs.misc.arcstats.l2_evict_lock_retry: 0
kstat.zfs.misc.arcstats.l2_writes_lock_retry: 0
kstat.zfs.misc.arcstats.l2_writes_error: 0
kstat.zfs.misc.arcstats.l2_writes_done: 0
kstat.zfs.misc.arcstats.l2_writes_sent: 0
kstat.zfs.misc.arcstats.l2_write_bytes: 0
kstat.zfs.misc.arcstats.l2_read_bytes: 0
kstat.zfs.misc.arcstats.l2_rw_clash: 0
kstat.zfs.misc.arcstats.l2_feeds: 0
kstat.zfs.misc.arcstats.l2_misses: 0
kstat.zfs.misc.arcstats.l2_hits: 0
kstat.zfs.misc.arcstats.mfu_ghost_evictable_metadata: 9678336
kstat.zfs.misc.arcstats.mfu_ghost_evictable_data: 306867200
kstat.zfs.misc.arcstats.mfu_ghost_size: 316545536
kstat.zfs.misc.arcstats.mfu_evictable_metadata: 16233607168
kstat.zfs.misc.arcstats.mfu_evictable_data: 1227021824
kstat.zfs.misc.arcstats.mfu_size: 18105137152
kstat.zfs.misc.arcstats.mru_ghost_evictable_metadata: 3354380288
kstat.zfs.misc.arcstats.mru_ghost_evictable_data: 15691944448
kstat.zfs.misc.arcstats.mru_ghost_size: 19046324736
kstat.zfs.misc.arcstats.mru_evictable_metadata: 50880512
kstat.zfs.misc.arcstats.mru_evictable_data: 5002455552
kstat.zfs.misc.arcstats.mru_size: 5333775872
kstat.zfs.misc.arcstats.anon_evictable_metadata: 0
kstat.zfs.misc.arcstats.anon_evictable_data: 0
kstat.zfs.misc.arcstats.anon_size: 4042752
kstat.zfs.misc.arcstats.other_size: 290706528
kstat.zfs.misc.arcstats.bonus_size: 52303040
kstat.zfs.misc.arcstats.dnode_size: 177022224
kstat.zfs.misc.arcstats.dbuf_size: 61381264
kstat.zfs.misc.arcstats.metadata_size: 16594470400
kstat.zfs.misc.arcstats.data_size: 6848485376
kstat.zfs.misc.arcstats.hdr_size: 633477496
kstat.zfs.misc.arcstats.overhead_size: 701383680
kstat.zfs.misc.arcstats.uncompressed_size: 62891822592
kstat.zfs.misc.arcstats.compressed_size: 22741572096
kstat.zfs.misc.arcstats.size: 24381511128
kstat.zfs.misc.arcstats.c_max: 33194074112
kstat.zfs.misc.arcstats.c_min: 1070869248
kstat.zfs.misc.arcstats.c: 24426201600
kstat.zfs.misc.arcstats.p: 21621393700
kstat.zfs.misc.arcstats.hash_chain_max: 8
kstat.zfs.misc.arcstats.hash_chains: 522297
kstat.zfs.misc.arcstats.hash_collisions: 1942092
kstat.zfs.misc.arcstats.hash_elements_max: 2548953
kstat.zfs.misc.arcstats.hash_elements: 2548181
kstat.zfs.misc.arcstats.evict_l2_skip: 0
kstat.zfs.misc.arcstats.evict_l2_ineligible: 1785614336
kstat.zfs.misc.arcstats.evict_l2_eligible: 98442561024
kstat.zfs.misc.arcstats.evict_l2_cached: 0
kstat.zfs.misc.arcstats.evict_not_enough: 1
kstat.zfs.misc.arcstats.evict_skip: 6130
kstat.zfs.misc.arcstats.access_skip: 46
kstat.zfs.misc.arcstats.mutex_miss: 78
kstat.zfs.misc.arcstats.deleted: 1004124
kstat.zfs.misc.arcstats.mfu_ghost_hits: 1545
kstat.zfs.misc.arcstats.mfu_hits: 528551563
kstat.zfs.misc.arcstats.mru_ghost_hits: 23778
kstat.zfs.misc.arcstats.mru_hits: 21292239
kstat.zfs.misc.arcstats.prefetch_metadata_misses: 2234183
kstat.zfs.misc.arcstats.prefetch_metadata_hits: 9198023
kstat.zfs.misc.arcstats.prefetch_data_misses: 4025
kstat.zfs.misc.arcstats.prefetch_data_hits: 0
kstat.zfs.misc.arcstats.demand_metadata_misses: 85398
kstat.zfs.misc.arcstats.demand_metadata_hits: 538159895
kstat.zfs.misc.arcstats.demand_data_misses: 30949
kstat.zfs.misc.arcstats.demand_data_hits: 2485884
kstat.zfs.misc.arcstats.misses: 2354555
kstat.zfs.misc.arcstats.hits: 549843802
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
kstat.zfs.misc.arcstats.arc_no_grow
Set that to 0, and if the ARC doesn't immediately start refilling, bump your vfs.zfs.arc_max up by a byte from where it is now. This looks like a recurrence of an OpenZFS bug where the arc_reclaim thread was looking at the wrong variable for "free memory" and so never cleared the flag that keeps the ARC from growing.
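Roughly like this from an sh-compatible shell (this assumes arc_max currently reports a non-zero value; if it shows 0, i.e. auto, set an explicit size instead):

Code:
# Note the current cap, then raise it by one byte to kick the ARC sizing logic
sysctl -n vfs.zfs.arc_max
sysctl vfs.zfs.arc_max=$(( $(sysctl -n vfs.zfs.arc_max) + 1 ))

# Watch whether the ARC starts refilling
sysctl kstat.zfs.misc.arcstats.arc_no_grow kstat.zfs.misc.arcstats.size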
 

Jon Moog

Dabbler
Joined
Apr 24, 2017
Messages
21
Set that to 0, and if the ARC doesn't immediately start refilling, bump your vfs.zfs.arc_max up by a byte from where it is now. This looks like a recurrence of an OpenZFS bug where the arc_reclaim thread was looking at the wrong variable for "free memory" and so never cleared the flag that keeps the ARC from growing.
The value of kstat.zfs.misc.arcstats.arc_no_grow is read-only, and I don't see a corresponding tunable under vfs.zfs. I did bump arc_max by +1, and that didn't seem to make much difference. I also tried setting vfs.zfs.arc.grow_retry to 1, which also appears to have made no difference. On closer inspection, though, I see kstat.zfs.misc.arcstats.arc_no_grow periodically flipping from 1 to 0 and back.

For some reason the box seems stuck with an ARC of about 8 GB that won't grow, which is likely the cause of the performance degradation relative to 11.3. Reboots have had no effect either. I'm open to any other suggestions on where to look and what to poke at.
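For the record, this is the simple loop I've been using to watch the flag and the ARC size (plain sh):

Code:
# Print the no_grow flag and current ARC size every 10 seconds
while :; do
    date
    sysctl kstat.zfs.misc.arcstats.arc_no_grow kstat.zfs.misc.arcstats.size
    sleep 10
done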
 

anodos

Sambassador
iXsystems
Joined
Mar 6, 2014
Messages
9,545
The value of kstat.zfs.misc.arcstats.arc_no_grow is read-only, and I don't see a corresponding tunable under vfs.zfs. I did bump arc_max by +1, and that didn't seem to make much difference. I also tried setting vfs.zfs.arc.grow_retry to 1, which also appears to have made no difference. On closer inspection, though, I see kstat.zfs.misc.arcstats.arc_no_grow periodically flipping from 1 to 0 and back.

For some reason the box seems stuck with an ARC of about 8 GB that won't grow, which is likely the cause of the performance degradation relative to 11.3. Reboots have had no effect either. I'm open to any other suggestions on where to look and what to poke at.
Can you PM me a debug?
 

Jon Moog

Dabbler
Joined
Apr 24, 2017
Messages
21
After a little over 4 days of uptime the ARC has grown a bit. It is now almost 13 GB, with 100 GB still free. The disk busy figures in gstat look roughly back to normal (pre-12 behavior) despite the ARC still appearing small. It's hard to believe an extra 4 GB makes that much difference, so perhaps something else is going on in ZFS land. The workload seen by the server hasn't changed in any meaningful way during this time, only the server's behavior. Without arc_summary.py working I don't have familiar values to compare 12 against 11.3.
 

Jon Moog

Dabbler
Joined
Apr 24, 2017
Messages
21
It's been almost a month and a couple of reboots, and still no change. The box refuses to use more than 5-10 GB for ARC, which is horrible for performance. I presume something is off in the OpenZFS-to-VFS glue, with the wrong values being used for ARC sizing, but I have not been able to determine the cause. Any suggestions on where to look are welcome.
 

Jon Moog

Dabbler
Joined
Apr 24, 2017
Messages
21
So I decided to poke at this some more and found a way to increase the ARC: by setting vfs.zfs.arc_min to a larger value I can force the system to expand the ARC to that point. This isn't ideal, but it has allowed server performance to be manually tuned back to previous levels. During monitoring I noticed the no_grow flag still toggling between 0 and 1 even when there is significant free memory available. I took a cursory glance at the OpenZFS commits that seemed relevant and didn't find anything I would consider applicable. As OpenZFS on FreeBSD matures I'm hoping the issue will be resolved; if not, I may have to dig deeper. The upcoming changes to the kstat.zfs sysctl tree that avoid dumping the huge structures will be a welcome help in sorting this out.
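In case anyone wants to try the same workaround, this is the general idea; the 64 GiB figure is just an example for my box, and on TrueNAS you would normally persist it as a sysctl tunable in the GUI rather than setting it by hand:

Code:
# Raise the ARC floor to ~64 GiB (example value) so it cannot shrink below that
sysctl vfs.zfs.arc_min=68719476736

# Confirm the new floor took effect and watch the ARC grow toward it
sysctl vfs.zfs.arc_min kstat.zfs.misc.arcstats.c_min kstat.zfs.misc.arcstats.size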
 

Jon Moog

Dabbler
Joined
Apr 24, 2017
Messages
21
In the interest of helping others who might hit the same problem: the cause and solution have been found. The machine in question is a dual-socket server and had erroneously been populated with an extremely unbalanced NUMA layout. Most of the RAM in the system was attached to one CPU and only a small portion to the other. When the CPU with the small memory footprint ran low on memory, ZFS tried to shrink the ARC, so in effect the maximum ARC size was limited by the CPU with the smaller memory configuration. Presumably the move to OpenZFS changed how the ARC reclaim logic interfaces with the kernel, which explains the behavior change from TrueNAS 11 to 12. The arc_min trick was just a band-aid for 12 and really didn't work at all on 13, which prompted a proper investigation and resolution. I hope this information helps others in the future. The moral of the story: a bad NUMA configuration can lead to more than just performance degradation.
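If you want to check your own hardware for the same imbalance, FreeBSD 12+ exposes per-domain memory counters; something like the following (sysctl names as I understand them, so verify on your release) shows how many NUMA domains the kernel sees and how much free memory each one has:

Code:
# How many NUMA domains the kernel sees
sysctl vm.ndomains

# Free pages per domain; multiply by the page size for bytes
sysctl vm.domain.0.stats.free_count vm.domain.1.stats.free_count
sysctl hw.pagesize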
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,924
Thank you for your thoughtfulness in closing this one out with good conclusion data.
 