Upgrading High End Hardware with P5800X

Chris Tobey

Contributor
Joined
Feb 11, 2014
Messages
114
Hi Everyone,

I currently have what I consider to be fairly high-end hardware in use as a TrueNAS 12.0-U5.1 fileserver.

Chassis: SuperMicro 846BA-R920B
Motherboard: SuperMicro X10DRi-T4+
CPUs: 2 x E5-2620 v3 @ 2.40 GHz
Memory: 12 x 16 GB (192 GB) DDR4-2133 ECC
Boot Drives: Mirrored Intel DC S3500 120GB SSDs
Storage Drives: 2 Pools, each pool has two vdevs, each vdev is RAIDZ2 with 6 x 16 TB Seagate EXOS 12Gb/s SAS drives (ST16000NM002G) - 128 TB per pool.
Storage Controller: LSI 9300-8i
SLOG: Intel P4800X 400 GB
Network: Chelsio 10 GbE (currently 1 of 2 interfaces in use)

I am looking to improve the performance for my use cases.

My main use case is serving as the fileserver for a continuous integration compute cluster.
  • All storage is connected to the CentOS 7.x servers via NFS.
  • We have repositories of between 100 and 50,000 files of code that are checked out from git or SVN and then compiled.
  • Files checked out from revision control are generally between 0 kB and 1 MB.
  • In the worst case there can be ~200 jobs of 50,000 files all trying to check out in parallel.
  • The jobs also try to do a "git clean" if a workspace already exists, which would scan the repo of 50,000 files.
  • We see a major performance hit when we do a lot of parallel "git clean" or git/svn checkouts.
More configuration details:
  • The use case described is using only poolB.
  • The Intel P4800X is set to 4 kB sectors with a single 400 GB partition and used as the SLOG for poolA.
  • There are 12 free DDR4 DIMM slots.
  • I can likely buy a P5800X (assuming it's available).
So the question is, what can be done to improve performance?
  1. Is it possible to partition the P4800X and use a SLOG for both poolA and poolB?
  2. What metrics should be looked at to see if the 192 GB of memory is currently enough for the ARC, or needs to be increased?
  3. What metrics should be looked at to see if the single 10 GbE is currently enough?
  4. Should I buy a P5800X as the SLOG for poolB? If so, how should it be configured?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
192GB of memory may be somewhat small, depending on the size of files and metadata. If you are running 200 jobs in parallel, this leaves about 1GB per 50,000-file job, so if your files were a mere 20KB each, that'd blow through all your available ARC and then a bit. The 2620 is a slow CPU, so you're also at a disadvantage computationally. You might want to look at what is being reported by "top" during peak churn times to see whether you are running out of memory. Swapping out CPUs for something with a higher core speed, like the E5-2643v3, or a higher core count, might help.
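
A quick way to watch memory and the ARC header during peak churn is top in batch mode (the interval and display count here are arbitrary):

Code:
# Print the Mem/ARC/Swap header lines every 5 seconds for about a minute
top -b -d 12 -s 5 | grep -E '^(Mem|ARC|Swap):'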

You need to be running the ARC summary tools such as arcstat.py and arc_summary.py to determine how well the ARC is doing.
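
For example (exact command names vary a little between releases; some ship with a .py suffix):

Code:
arcstat 5           # sample ARC hits/misses/size every 5 seconds
arc_summary | less  # full ARC report: hit ratios, MFU/MRU split, dnode/metadata limits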

The use of RAIDZ2 with no mention of L2ARC in this context is highly concerning. RAIDZ is optimized towards archival storage of large files, and lacks the parallelism necessary to handle crushing seek workloads such as thousands of small files. You would be better off with mirrors. The addition of L2ARC may help to significantly offset the handicap RAIDZ2 presents, depending on the specifics. I am quite frankly a bit surprised that your system as described doesn't simply go near-catatonic for an hour when running such a workload.

The best design for the system you describe would be a pool with a number of mirror vdevs, either two- or three-way mirrors, depending on your redundancy requirements. The mirror vdevs offer massive read IOPS, which can be further augmented with L2ARC if needed. Your write IOPS scale nearly linearly with the number of mirror vdevs you have, so it is easy to scale IOPS in such a design.
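
As a sketch only, with placeholder pool/device names (not your actual da numbers), twelve disks as six 2-way mirrors would be created like this:

Code:
zpool create tank \
  mirror da0 da1 \
  mirror da2 da3 \
  mirror da4 da5 \
  mirror da6 da7 \
  mirror da8 da9 \
  mirror da10 da11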

If you are writing a bunch of stuff back to the pool, which seems implied by "git/svn checkouts", RAIDZ is poorly optimized for this and you need to be making sure you have lots of free space on the pool.

What is your reasoning for the SLOG device? SLOG has specific use cases, and this doesn't appear to be a candidate for a mandatory SLOG. You could and probably should try disabling sync writes entirely to see what your maximum possible write speeds are, and then we can discuss why you're using a SLOG, and whether or not that's even appropriate.
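
The experiment itself is a one-liner and easy to revert; "poolB" below stands in for whatever your pool or NFS dataset is actually called:

Code:
zfs set sync=disabled poolB   # test only: shows the ceiling, at the cost of NFS crash consistency
zfs set sync=standard poolB   # put it back when the test is done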
 

Chris Tobey

Contributor
Joined
Feb 11, 2014
Messages
114
I am not familiar with arc_summary, but running that command gives me this:

Code:
# arc_summary

------------------------------------------------------------------------
ZFS Subsystem Report                            Wed Dec 01 22:55:48 2021
FreeBSD 12.2-RELEASE-p9                                    zpl version 5
Machine: stargate (amd64)          spa version 5000

ARC status:                                                      HEALTHY
        Memory throttle count:                                         0

ARC size (current):                                    76.2 %  145.4 GiB
        Target size (adaptive):                        76.3 %  145.5 GiB
        Min size (hard limit):                          3.1 %    6.0 GiB
        Max size (high water):                           31:1  190.8 GiB
        Most Frequently Used (MFU) cache size:         24.9 %   33.0 GiB
        Most Recently Used (MRU) cache size:           75.1 %   99.7 GiB
        Metadata cache size (hard limit):              75.0 %  143.1 GiB
        Metadata cache size (current):                 25.5 %   36.5 GiB
        Dnode cache size (hard limit):                 10.0 %   14.3 GiB
        Dnode cache size (current):                    38.8 %    5.5 GiB

ARC hash breakdown:
        Elements max:                                              31.4M
        Elements current:                              48.2 %      15.1M
        Collisions:                                                 1.7G
        Chain max:                                                    11
        Chains:                                                     2.5M

ARC misc:
        Deleted:                                                    1.9G
        Mutex misses:                                             842.5k
        Eviction skips:                                           708.6k

ARC total accesses (hits + misses):                               200.1G
        Cache hit ratio:                               99.4 %     199.0G
        Cache miss ratio:                               0.6 %       1.2G
        Actual hit ratio (MFU + MRU hits):             99.4 %     198.9G
        Data demand efficiency:                        97.5 %       3.2G
        Data prefetch efficiency:                       9.7 %     140.3M

Cache hits by cache type:
        Most frequently used (MFU):                    98.0 %     194.9G
        Most recently used (MRU):                       2.0 %       4.0G
        Most frequently used (MFU) ghost:               0.2 %     431.6M
        Most recently used (MRU) ghost:                 0.1 %     160.8M

Cache hits by data type:
        Demand data:                                    1.6 %       3.2G
        Demand prefetch data:                         < 0.1 %      13.6M
        Demand metadata:                               98.2 %     195.4G
        Demand prefetch metadata:                       0.2 %     341.3M

Cache misses by data type:
        Demand data:                                    6.7 %      79.9M
        Demand prefetch data:                          10.7 %     126.7M
        Demand metadata:                               34.5 %     410.1M
        Demand prefetch metadata:                      48.1 %     570.5M

DMU prefetch efficiency:                                            1.6G
        Hit ratio:                                     79.7 %       1.3G
        Miss ratio:                                    20.3 %     323.7M

L2ARC not detected, skipping section

Tunables:
        abd_chunk_size                                              4096
        abd_scatter_enabled                                            1
        allow_redacted_dataset_mount                                   0
        anon_data_esize                                                0
        anon_metadata_esize                                            0
        anon_size                                              268881920
        arc.average_blocksize                                       8192
        arc.dnode_limit                                                0
        arc.dnode_limit_percent                                       10
        arc.dnode_reduce_percent                                      10
        arc.evict_batch_limit                                         10
        arc.eviction_pct                                             200
        arc.grow_retry                                                 0
        arc.lotsfree_percent                                          10
        arc.max                                                        0
        arc.meta_adjust_restarts                                    4096
        arc.meta_limit                                                 0
        arc.meta_limit_percent                                        75
        arc.meta_min                                                   0
        arc.meta_prune                                             10000
        arc.meta_strategy                                              1
        arc.min                                                        0
        arc.min_prefetch_ms                                            0
        arc.min_prescient_prefetch_ms                                  0
        arc.p_dampener_disable                                         1
        arc.p_min_shift                                                0
        arc.pc_percent                                                 0
        arc.shrink_shift                                               0
        arc.sys_free                                                   0
        arc_free_target                                          1044488
        arc_max                                                        0
        arc_min                                                        0
        arc_no_grow_shift                                              5
        async_block_max_blocks                      18446744073709551615
        autoimport_disable                                             1
        ccw_retry_interval                                           300
        checksum_events_per_second                                    20
        commit_timeout_pct                                             5
        compressed_arc_enabled                                         1
        condense.indirect_commit_entry_delay_ms                        0
        condense.indirect_obsolete_pct                                25
        condense.indirect_vdevs_enable                                 1
        condense.max_obsolete_bytes                           1073741824
        condense.min_mapping_bytes                                131072
        condense_pct                                                 200
        crypt_sessions                                                 0
        dbgmsg_enable                                                  1
        dbgmsg_maxsize                                           4194304
        dbuf.cache_shift                                               5
        dbuf.metadata_cache_max_bytes               18446744073709551615
        dbuf.metadata_cache_shift                                      6
        dbuf_cache.hiwater_pct                                        10
        dbuf_cache.lowater_pct                                        10
        dbuf_cache.max_bytes                        18446744073709551615
        dbuf_state_index                                               0
        ddt_data_is_special                                            1
        deadman.checktime_ms                                       60000
        deadman.enabled                                                1
        deadman.failmode                                            wait
        deadman.synctime_ms                                       600000
        deadman.ziotime_ms                                        300000
        debug                                                          0
        debugflags                                                     0
        dedup.prefetch                                                 0
        default_bs                                                     9
        default_ibs                                                   15
        delay_min_dirty_percent                                       60
        delay_scale                                               500000
        dirty_data_max                                        4294967296
        dirty_data_max_max                                    4294967296
        dirty_data_max_max_percent                                    25
        dirty_data_max_percent                                        10
        dirty_data_sync_percent                                       20
        disable_ivset_guid_check                                       0
        dmu_object_alloc_chunk_shift                                   7
        dmu_offset_next_sync                                           0
        dmu_prefetch_max                                       134217728
        dtl_sm_blksz                                                4096
        flags                                                          0
        fletcher_4_impl [fastest] scalar superscalar superscalar4 sse2 ssse3 avx2
        free_bpobj_enabled                                             1
        free_leak_on_eio                                               0
        free_min_time_ms                                            1000
        history_output_max                                       1048576
        immediate_write_sz                                         32768
        initialize_chunk_size                                    1048576
        initialize_value                            16045690984833335022
        keep_log_spacemaps_at_export                                   0
        l2arc.feed_again                                               1
        l2arc.feed_min_ms                                            200
        l2arc.feed_secs                                                1
        l2arc.headroom                                                 2
        l2arc.headroom_boost                                         200
        l2arc.meta_percent                                            33
        l2arc.mfuonly                                                  0
        l2arc.noprefetch                                               1
        l2arc.norw                                                     0
        l2arc.rebuild_blocks_min_l2size                       1073741824
        l2arc.rebuild_enabled                                          0
        l2arc.trim_ahead                                               0
        l2arc.write_boost                                        8388608
        l2arc.write_max                                          8388608
        l2arc_feed_again                                               1
        l2arc_feed_min_ms                                            200
        l2arc_feed_secs                                                1
        l2arc_headroom                                                 2
        l2arc_noprefetch                                               1
        l2arc_norw                                                     0
        l2arc_write_boost                                        8388608
        l2arc_write_max                                          8388608
        l2c_only_size                                                  0
        livelist.condense.new_alloc                                    0
        livelist.condense.sync_cancel                                  0
        livelist.condense.sync_pause                                   0
        livelist.condense.zthr_cancel                                  0
        livelist.condense.zthr_pause                                   0
        livelist.max_entries                                      500000
        livelist.min_percent_shared                                   75
        lua.max_instrlimit                                     100000000
        lua.max_memlimit                                       104857600
        max_async_dedup_frees                                     100000
        max_auto_ashift                                               16
        max_dataset_nesting                                           50
        max_log_walking                                                5
        max_logsm_summary_length                                      10
        max_missing_tvds                                               0
        max_missing_tvds_cachefile                                     2
        max_missing_tvds_scan                                          0
        max_nvlist_src_size                                            0
        max_recordsize                                           1048576
        metaslab.aliquot                                          524288
        metaslab.bias_enabled                                          1
        metaslab.debug_load                                            0
        metaslab.debug_unload                                          0
        metaslab.df_alloc_threshold                               131072
        metaslab.df_free_pct                                           4
        metaslab.df_max_search                                  16777216
        metaslab.df_use_largest_segment                                0
        metaslab.force_ganging                                  16777217
        metaslab.fragmentation_factor_enabled                          1
        metaslab.fragmentation_threshold                              70
        metaslab.lba_weighting_enabled                                 1
        metaslab.load_pct                                             50
        metaslab.max_size_cache_sec                                 3600
        metaslab.mem_limit                                            75
        metaslab.preload_enabled                                       1
        metaslab.preload_limit                                        10
        metaslab.segment_weight_enabled                                1
        metaslab.sm_blksz_no_log                                   16384
        metaslab.sm_blksz_with_log                                131072
        metaslab.switch_threshold                                      2
        metaslab.unload_delay                                         32
        metaslab.unload_delay_ms                                  600000
        mfu_data_esize                                        9263039488
        mfu_ghost_data_esize                                  1267297792
        mfu_ghost_metadata_esize                            105654835712
        mfu_ghost_size                                      106922133504
        mfu_metadata_esize                                      10775040
        mfu_size                                             35457245696
        mg.fragmentation_threshold                                    95
        mg.noalloc_threshold                                           0
        min_auto_ashift                                                9
        min_metaslabs_to_flush                                         1
        mru_data_esize                                      101996753920
        mru_ghost_data_esize                                  2684892672
        mru_ghost_metadata_esize                             46227875328
        mru_ghost_size                                       48912768000
        mru_metadata_esize                                     451344896
        mru_size                                            107018419200
        multihost.fail_intervals                                      10
        multihost.history                                              0
        multihost.import_intervals                                    20
        multihost.interval                                          1000
        multilist_num_sublists                                         0
        no_scrub_io                                                    0
        no_scrub_prefetch                                              0
        nocacheflush                                                   0
        nopwrite_enabled                                               1
        obsolete_min_time_ms                                         500
        pd_bytes_max                                            52428800
        per_txg_dirty_frees_percent                                    5
        prefetch.array_rd_sz                                     1048576
        prefetch.disable                                               0
        prefetch.max_distance                                    8388608
        prefetch.max_idistance                                  67108864
        prefetch.max_streams                                           8
        prefetch.min_sec_reap                                          2
        read_history                                                   0
        read_history_hits                                              0
        rebuild_max_segment                                      1048576
        reconstruct.indirect_combinations_max                       4096
        recover                                                        0
        recv.queue_ff                                                 20
        recv.queue_length                                       16777216
        recv.write_batch_size                                    1048576
        reference_tracking_enable                                      0
        removal_suspend_progress                                       0
        remove_max_segment                                      16777216
        resilver_disable_defer                                         0
        resilver_min_time_ms                                        9000
        scan_checkpoint_intval                                      7200
        scan_fill_weight                                               3
        scan_ignore_errors                                             0
        scan_issue_strategy                                            0
        scan_legacy                                                    0
        scan_max_ext_gap                                         2097152
        scan_mem_lim_fact                                             20
        scan_mem_lim_soft_fact                                        20
        scan_strict_mem_lim                                            0
        scan_suspend_progress                                          0
        scan_vdev_limit                                          4194304
        scrub_min_time_ms                                           1000
        send.corrupt_data                                              0
        send.no_prefetch_queue_ff                                     20
        send.no_prefetch_queue_length                            1048576
        send.override_estimate_recordsize                              0
        send.queue_ff                                                 20
        send.queue_length                                       16777216
        send.unmodified_spill_blocks                                   1
        send_holes_without_birth_time                                  1
        slow_io_events_per_second                                     20
        spa.asize_inflation                                           24
        spa.discard_memory_limit                                16777216
        spa.load_print_vdev_tree                                       0
        spa.load_verify_data                                           1
        spa.load_verify_metadata                                       1
        spa.load_verify_shift                                          4
        spa.slop_shift                                                 5
        space_map_ibs                                                 14
        special_class_metadata_reserve_pct                            25
        standard_sm_blksz                                         131072
        super_owner                                                    0
        sync_pass_deferred_free                                        2
        sync_pass_dont_compress                                        8
        sync_pass_rewrite                                              2
        sync_taskq_batch_pct                                          75
        top_maxinflight                                             1000
        traverse_indirect_prefetch_limit                              32
        trim.extent_bytes_max                                  134217728
        trim.extent_bytes_min                                      32768
        trim.metaslab_skip                                             0
        trim.queue_limit                                              10
        trim.txg_batch                                                32
        txg.history                                                  100
        txg.timeout                                                    5
        unflushed_log_block_max                                   262144
        unflushed_log_block_min                                     1000
        unflushed_log_block_pct                                      400
        unflushed_max_mem_amt                                 1073741824
        unflushed_max_mem_ppm                                       1000
        user_indirect_is_special                                       1
        validate_skip                                                  0
        vdev.aggregate_trim                                            0
        vdev.aggregation_limit                                   1048576
        vdev.aggregation_limit_non_rotating                       131072
        vdev.async_read_max_active                                     3
        vdev.async_read_min_active                                     1
        vdev.async_write_active_max_dirty_percent                     60
        vdev.async_write_active_min_dirty_percent                     30
        vdev.async_write_max_active                                    5
        vdev.async_write_min_active                                    1
        vdev.bio_delete_disable                                        0
        vdev.bio_flush_disable                                         0
        vdev.cache_bshift                                             16
        vdev.cache_max                                             16384
        vdev.cache_size                                                0
        vdev.def_queue_depth                                          32
        vdev.default_ms_count                                        200
        vdev.default_ms_shift                                         29
        vdev.file.logical_ashift                                       9
        vdev.file.physical_ashift                                      9
        vdev.initializing_max_active                                   1
        vdev.initializing_min_active                                   1
        vdev.max_active                                             1000
        vdev.max_auto_ashift                                          16
        vdev.min_auto_ashift                                           9
        vdev.min_ms_count                                             16
        vdev.mirror.non_rotating_inc                                   0
        vdev.mirror.non_rotating_seek_inc                              1
        vdev.mirror.rotating_inc                                       0
        vdev.mirror.rotating_seek_inc                                  5
        vdev.mirror.rotating_seek_offset                         1048576
        vdev.ms_count_limit                                       131072
        vdev.nia_credit                                               10
        vdev.nia_delay                                                 2
        vdev.queue_depth_pct                                        1000
        vdev.read_gap_limit                                        32768
        vdev.rebuild_max_active                                        3
        vdev.rebuild_min_active                                        1
        vdev.removal_ignore_errors                                     0
        vdev.removal_max_active                                        2
        vdev.removal_max_span                                      32768
        vdev.removal_min_active                                        1
        vdev.removal_suspend_progress                                  0
        vdev.remove_max_segment                                 16777216
        vdev.scrub_max_active                                          8
        vdev.scrub_min_active                                          1
        vdev.sync_read_max_active                                     10
        vdev.sync_read_min_active                                     10
        vdev.sync_write_max_active                                    10
        vdev.sync_write_min_active                                    10
        vdev.trim_max_active                                           2
        vdev.trim_min_active                                           1
        vdev.validate_skip                                             0
        vdev.write_gap_limit                                        4096
        version.acl                                                    1
        version.ioctl                                                 15
        version.module v2021071201-zfs_f7ba541d64cbc60b21507bd7781331bea1abb12e
        version.spa                                                 5000
        version.zpl                                                    5
        vnops.read_chunk_size                                    1048576
        vol.mode                                                       2
        vol.recursive                                                  0
        vol.unmap_enabled                                              1
        zap_iterate_prefetch                                           1
        zevent.cols                                                   80
        zevent.console                                                 0
        zevent.len_max                                               512
        zevent.retain_expire_secs                                    900
        zevent.retain_max                                           2000
        zfetch.max_distance                                      8388608
        zfetch.max_idistance                                    67108864
        zil.clean_taskq_maxalloc                                 1048576
        zil.clean_taskq_minalloc                                    1024
        zil.clean_taskq_nthr_pct                                     100
        zil.maxblocksize                                          131072
        zil.nocacheflush                                               0
        zil.replay_disable                                             0
        zil.slog_bulk                                             786432
        zio.deadman_log_all                                            0
        zio.dva_throttle_enabled                                       1
        zio.exclude_metadata                                           0
        zio.requeue_io_start_cut_in_line                               1
        zio.slow_io_ms                                             30000
        zio.taskq_batch_pct                                           80
        zio.taskq_batch_tpq                                            0
        zio.use_uma                                                    1

VDEV cache disabled, skipping section

ZIL committed transactions:                                       845.4M
        Commit requests:                                          127.5M
        Flushes to stable storage:                                126.8M
        Transactions to SLOG storage pool:           21.4 TiB     295.1M
        Transactions to non-SLOG storage pool:      419.5 MiB     247.9k


In terms of CPU usage, I have not monitored via top, but the Reporting/CPU graphs appear to be OK during these times. I would guess that some of the cores are maxing out sometimes, so it may be an issue of single-thread speed rather than core count.

I do not have an L2ARC; maybe I should?
I have the SLOG because of NFS, wanting to keep synchronous writes without the latency penalty.
 

Chris Tobey

Contributor
Joined
Feb 11, 2014
Messages
114
Here is an example of top when the IO seems slow:

Code:
last pid: 44414;  load averages:  2.06,  2.15,  2.13                                        up 72+17:39:41  02:09:58
136 processes: 2 running, 134 sleeping
CPU:  1.6% user,  0.0% nice,  5.8% system,  2.0% interrupt, 90.6% idle
Mem: 510M Active, 8206M Inact, 318M Laundry, 169G Wired, 9405M Free
ARC: 146G Total, 37G MFU, 96G MRU, 581M Anon, 3509M Header, 8896M Other
     110G Compressed, 208G Uncompressed, 1.89:1 Ratio
Swap: 38G Total, 38G Free

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
 3760 root         24  20    0    11M  2728K zfsvfs   9 741.1H  45.79% nfsd
41442 root          1  83    0    24M    14M CPU17   17  23:49  43.31% ssh
40561 root          1  25    0   476M   446M kqread   1  54.0H   9.16% smbd
 1473 root         14  20    0  1017M   950M usem     6 639:15   8.48% python3.9
 1350 root         37  20    0  1071M   924M kqread  21  27.0H   8.28% python3.9
41443 root          5  22    0    19M  7892K tq_dra  22   9:19   7.08% zfs
 3792 root          1  20    0    83M    24M zfsvfs   8  21.5H   2.19% rpc.lockd
87859 root          1  20    0   258M   225M kqread   2  31:38   0.78% smbd
 3356    556       88  44    0    33G  4617M uwait   15 191:00   0.17% java
44396 root          1  20    0    14M  4268K CPU10   10   0:00   0.09% top
 3249 root          1  20    0   258M   225M kqread   3   1:46   0.06% smbd
 2374 root          1 -52   r0    11M    11M nanslp  18  11:43   0.04% watchdogd
87903 root          1  20    0   259M   226M kqread  20  59:45   0.02% smbd
81500 root          1  20    0   265M   228M kqread   9 241:17   0.02% smbd
37552 root          1  20    0    11M  2816K pause   18   0:04   0.01% iostat
80890 www           1  20    0    37M    10M kqread  13   0:07   0.01% nginx
 3748 root          1  20    0    84M    25M select  16   8:08   0.01% mountd
 9460 root          1  20    0    28M    14M select  13   0:35   0.01% sshd
 

Chris Tobey

Contributor
Joined
Feb 11, 2014
Messages
114
The ARC hits seem to be pretty good, but the ARC Requests prefetch_* have a lot of misses.

[Attached screenshots: TrueNAS reporting graphs of the ARC hit ratio and the ARC prefetch requests/misses over time]
 

Chris Tobey

Contributor
Joined
Feb 11, 2014
Messages
114
I would like to clarify that the SVN / Git repositories are externally hosted (on the same 10G network); we are checking out from these remote repositories onto the NFS shares on TrueNAS poolB.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
So, checkouts by clients resulting in lots of NFS writes, onto RAIDZ. Mmm.

There can be some disagreement as to how to assess ARC results, but my feeling is that in this case, the controlling result is that arc_summary is showing 99%+ ARC hit rates, which I read as meaning you either have sufficient ARC, or at least are not substantially short on ARC.

My favorite metric for "sufficient ARC+L2ARC" is when running "zpool iostat <pool> 1" to see a relatively small amount of pool reads (possibly as low as zero) happening. This means that almost all of your pool's IOPS capacity is available for writes. :smile:

Your graphs show that the 99% isn't a consistent result, so adding a little more RAM along with some L2ARC might eke out a few more percent, but since you're only dipping to 97%, I doubt you'd feel a performance difference. This is hard to judge, because it is really dependent on your actual working set size. The huge pool size makes it impractical to "shadow cache" a significant portion of your pool in L2ARC, but understanding whether that is an actual issue requires analysis and workload familiarity beyond the sort of help you can easily get from a forum post.

I think if you run "gstat" when there's a lot of traffic, that you're just going to find your disks are really busy, and if ARC is shouldering most of the read load, then the disks are busy with writes. That's going to be a natural performance limit in RAIDZ. Your primary way to affect that is going to be to have lots of free space available on the pool, so that ZFS has an easier time laying down transaction groups contiguously (reducing seeks).
 

blanchet

Guru
Joined
Apr 17, 2018
Messages
516
Step 1
You should change your layout to use a stripe of mirrors to get as many IOPS as possible.

Step 2
Try to avoid NFS for git, or for any other application that generates and scans millions of small files.

200 jobs with 50,000 files each means that you will have 10M files on your filesystem.
If you try to run a find on 10M files, you will see that NFS is very slow in such a situation.

Suggestions:
  • If your CentOS servers are virtual machines, then use vdisks instead of NFS shares.
  • If your CentOS servers are physical machines, then you may try iSCSI volumes (I have never tested this, but it should behave like vdisks).
At first sight it sounds strange, but applications do indeed run faster on vdisks backed by an NFS datastore than on direct NFS mountpoints.
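
If you try the iSCSI route, the TrueNAS side is just a zvol exposed as a block device (names and sizes below are only an example; the iSCSI target/extent itself is configured in the TrueNAS UI):

Code:
# Sparse 500G zvol with a 16K volblocksize -- example names/sizes only
zfs create -s -V 500G -o volblocksize=16K poolB/ci-workspace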
 

c77dk

Patron
Joined
Nov 27, 2019
Messages
468
Sometimes special vdevs (sVDEVs) for metadata can help when you have lots of files, and they also give you the option of offloading small files to them. But it's important to remember that sVDEVs are a one-way street, and they require the same level of redundancy as the data drives (for RAIDZ2, two sVDEV members should be able to fail; for me that means a 3-way mirror for my pool with 3x RAIDZ2).
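
As a rough sketch with placeholder device names, a 3-way mirrored sVDEV plus small-block offload would look like this:

Code:
# One-way street: with raidz data vdevs this cannot be removed afterwards
zpool add poolB special mirror nvd1 nvd2 nvd3
zfs set special_small_blocks=16K poolB   # blocks <= 16K (plus all metadata) land on the sVDEV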
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
You can also use L2ARC for metadata-only caching. That is what I plan on doing for my NAS.
secondarycache=all|none|metadata
Controls what is cached in the secondary cache (L2ARC). If this property is set to all, then both user
data and metadata is cached. If this property is set to none, then neither user data nor metadata is
cached. If this property is set to metadata, then only metadata is cached. The default value is all.
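
A minimal example, assuming a spare NVMe device (placeholder name) added to the busy pool:

Code:
zpool add poolB cache nvd1             # placeholder device name
zfs set secondarycache=metadata poolB  # L2ARC will then hold metadata only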
 

Chris Tobey

Contributor
Joined
Feb 11, 2014
Messages
114
My favorite metric for "sufficient ARC+L2ARC" is when running "zpool iostat <pool> 1" to see a relatively small amount of pool reads (possibly as low as zero) happening. This means that almost all of your pool's IOPS capacity is available for writes. :smile:
Code:
# zpool iostat SG2 1
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
SG2         90.0T  40.7T    713  1.68K  37.3M  41.8M
SG2         90.0T  40.7T  1.00K  5.39K  34.5M   191M
SG2         90.0T  40.7T    899    674  18.3M  4.72M
SG2         90.0T  40.7T  2.05K      0  70.1M      0
SG2         90.0T  40.7T  2.22K      0  78.9M      0
SG2         90.0T  40.7T  2.57K  4.45K  65.7M   159M
SG2         90.0T  40.7T    849  3.13K  12.4M  67.5M
SG2         90.0T  40.7T  1.69K     47  48.9M   191K
SG2         90.0T  40.7T  2.37K      0  78.0M      0
SG2         90.0T  40.7T  1.88K      0  34.3M      0
SG2         90.0T  40.7T  1.16K  4.75K  11.3M   163M
SG2         90.0T  40.7T    752  4.58K  27.4M   190M
SG2         90.0T  40.7T  1.07K  1.66K  28.5M  13.3M
SG2         90.0T  40.7T  1.89K      0  34.9M      0
SG2         90.0T  40.7T  1.97K      0  57.1M      0
SG2         90.0T  40.7T  1.62K  5.62K  45.7M  93.1M
SG2         90.0T  40.7T    862  6.18K  20.7M   248M
SG2         90.0T  40.7T    929    752  15.8M  4.50M
SG2         90.0T  40.7T  2.18K      0  63.3M      0
SG2         90.0T  40.7T  2.10K      0  42.1M      0
SG2         90.0T  40.7T  2.04K  4.97K  68.1M   156M
SG2         90.0T  40.7T    917  4.98K  17.0M   135M
SG2         90.0T  40.7T  1.32K     47  63.6M   191K
SG2         90.0T  40.7T  2.56K      0  55.1M      0
SG2         90.0T  40.7T  2.18K      0  69.2M      0
SG2         90.0T  40.7T  1.69K  5.24K  58.0M   217M
SG2         90.0T  40.7T    380  5.84K  3.00M   178M
SG2         90.0T  40.7T  1.75K     47  35.8M   191K
SG2         90.0T  40.7T  2.05K      0  38.0M      0
SG2         90.0T  40.7T  1.58K      0  21.1M      0
SG2         90.0T  40.7T  1.44K  6.02K  33.9M   249M
SG2         90.0T  40.7T    780  4.96K  15.1M  93.4M
SG2         90.0T  40.7T  1.31K     47  43.2M   190K
SG2         90.0T  40.7T  2.04K      0  48.9M      0
SG2         90.0T  40.7T  1.95K      0  74.1M      0
 

Chris Tobey

Contributor
Joined
Feb 11, 2014
Messages
114
I think if you run "gstat" when there's a lot of traffic, that you're just going to find your disks are really busy, and if ARC is shouldering most of the read load, then the disks are busy with writes. That's going to be a natural performance limit in RAIDZ. Your primary way to affect that is going to be to have lots of free space available on the pool, so that ZFS has an easier time laying down transaction groups contiguously (reducing seeks).
Code:
dT: 1.003s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    0      2      0      0    0.0      2      8    0.0    0.0| nvd0
    0      2      0      0    0.0      2      8    0.0    0.0| nvd0p1
    0      0      0      0    0.0      0      0    0.0    0.0| ada0
    0      2      0      0    0.0      2      8    0.0    0.0| gptid/0d56f7b3-c8f7-11e8-a885-0cc47a1e4734
    0      0      0      0    0.0      0      0    0.0    0.0| ada1
    3    365    154   4639   19.8    211   2632    0.3   83.6| da18
    0    846     57    810   16.7    789  18106    0.4   52.9| da12
    1    739     69    806   17.8    670  17812    0.6   68.7| da13
    1    650    117   3059   28.6    533  15909    0.8   88.6| da14
    4    649     81   2379   28.5    568  17562    1.0   86.1| da15
    1    706     81   2340   39.0    625  17959    0.8   80.3| da16
    4    732    114   1772   35.0    618  17796    0.8   93.7| da17
    1    813    143  15556   18.3    670  26797    0.6   88.7| da19
    6    702    142   5359   17.3    560  24565    0.6   85.5| da20
    1    790    133   7730   22.2    657  24986    0.6   88.6| da21
    1    656    144  13319   26.0    512  26742    1.0   89.9| da22
    7    790    174  10582   18.6    617  25645    0.8   91.2| da23



Obviously this is just a snapshot in time, but the disks were hitting 100%, and do appear to be quite busy.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
A question about the overall setup: Do I understand correctly that your TrueNAS provides (via NFS) disk space to your CI servers (basically the workspace directory, if we are talking Jenkins), with the VCS repos hosted elsewhere and the CentOS VMs running on a different physical box as well?

If that is so, why not remove TrueNAS from the equation entirely? The redundancy requirement of the CI workspace is not that high (at least for me) and the CentOS VMs will very likely run off of SSDs already.
 

Chris Tobey

Contributor
Joined
Feb 11, 2014
Messages
114
A few comments/replies.
  1. We have about 100 physical servers running CentOS 7.9 mounting the drives via NFS (I think v3). The servers and TrueNAS are both joined to an AD for credentials. How they are mounted does not really matter, as long as it works. Is NFSv4 worth exploring? iSCSI?
  2. The servers are part of a compute cluster running Sun Grid Engine (SGE) to load-balance/queue the jobs. We have users that submit builds, and CI that submits builds. The CI is Jenkins, where we have a small number of execution hosts whose only job is to check out the files from revision control (on the mounted TrueNAS drive), submit the job to SGE, and wait for it to finish.
  3. This chassis has room for 24 drives and is currently full. Changing the disk layout/vdevs is possible, but we still need redundancy and capacity. What would an ideal layout look like with 24 x 16 TB drives?
  4. The pools are currently configured for Sync=Standard and Compression=LZ4.
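They can be double-checked per dataset with:

Code:
zfs get -r sync,compression SG2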
 

Chris Tobey

Contributor
Joined
Feb 11, 2014
Messages
114
If that is so, why not remove TrueNAS from the equation entirely? The redundancy requirement of the CI workspace is not that high (at least for me) and the CentOS VMs will very likely run off of SSDs already.
The servers themselves currently only have 128 GB SSDs - this is possible to change, but requires a change in the architecture of the setup.
The SGE backend is used to share the compute cluster between users and CI, where the CI has lower priority and tends to run overnight while users get highest priority so their builds are done during the workday. We are talking 20 minute to 4 hour builds, some up to 48h, not short compiles.
We COULD cut the TrueNAS out of the equation, but that would mean dedicating servers to CI and removing them from the SGE cluster.
 

Chris Tobey

Contributor
Joined
Feb 11, 2014
Messages
114
Code:
# zpool list
NAME           SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
SG1            131T  47.3T  83.4T        -         -    16%    36%  1.00x    ONLINE  /mnt
SG2            131T  90.1T  40.6T        -         -    39%    68%  1.00x    ONLINE  /mnt
freenas-boot   111G  15.6G  95.4G        -         -      -    14%  1.00x    ONLINE  -

Should something be done about the fragmentation?
 

Chris Tobey

Contributor
Joined
Feb 11, 2014
Messages
114
Currently, I think we are at least going to double the RAM by filling the free DIMM slots with 12 x 16 GB DDR4.
I can also buy two P5800X drives to use as SLOGs (if needed), and move the P4800X to be the L2ARC.
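
Roughly, I would expect the device shuffle to look something like this (device names below are guesses; I would confirm the real names with zpool status first):

Code:
zpool remove SG1 nvd0p1    # detach the current P4800X log partition from poolA
zpool add SG1 log nvd1     # new P5800X as SLOG for poolA
zpool add SG2 log nvd2     # second P5800X as SLOG for poolB
zpool add SG2 cache nvd0   # repurpose the P4800X as L2ARC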
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
For me the current problem is that it is not clear at all what the bottleneck is. I would suggest building a list of potential candidates and coming up with ideas for how to confirm or exclude them. Examples:

  • RAIDZ2 is the problem: Have a 2-vdev mirrored pool with the same HDDs
  • HDD is the problem: Use a single SSD
  • RAM: Perhaps(!) reduce number of concurrent jobs
  • Architecture: Take one/a few servers out of the cluster and run jobs entirely locally (perhaps with bigger SSD)
  • ???
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Code:
# zpool list
NAME           SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
SG1            131T  47.3T  83.4T        -         -    16%    36%  1.00x    ONLINE  /mnt
SG2            131T  90.1T  40.6T        -         -    39%    68%  1.00x    ONLINE  /mnt
freenas-boot   111G  15.6G  95.4G        -         -      -    14%  1.00x    ONLINE  -

Should something be done about the fragmentation?

No, fragmentation is a natural side effect in this case.

Currently, I think we are at least going to double the RAM by filling the free DIMM slots with 12 x 16 GB DDR4.
I can also buy two P5800X drives to use as SLOGs (if needed), and move the P4800X to be the L2ARC.
The pools are currently configured for Sync=Standard and Compression=LZ4.

In my opinion, there's little to nothing I'm seeing that suggests a need for SLOG over just disabling sync writes. How heavily is your SLOG device being hit? Use gstat to check it. With sync set to standard, you're really only sync-writing metadata, and since the environment is relatively ephemeral, it isn't clear to me what data loss scenario is being addressed with the SLOG.
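
Your earlier gstat output shows the P4800X as nvd0/nvd0p1, so something like this should isolate it (the -f argument is a device-name filter):

Code:
gstat -f nvd0   # sustained w/s here means NFS sync writes really are flowing through the ZIL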

Obviously this is just a snapshot in time, but the disks were hitting 100%, and do appear to be quite busy.

And that's your basic bottleneck. This brings us back around to my initial analysis, which is that some combination of RAIDZ2 and lack of ARC/L2ARC is hurting you.

Code:
# zpool iostat SG2 1
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
SG2         90.0T  40.7T    713  1.68K  37.3M  41.8M
SG2         90.0T  40.7T  1.00K  5.39K  34.5M   191M
SG2         90.0T  40.7T    899    674  18.3M  4.72M
SG2         90.0T  40.7T  2.05K      0  70.1M      0
SG2         90.0T  40.7T  2.22K      0  78.9M      0
SG2         90.0T  40.7T  2.57K  4.45K  65.7M   159M
SG2         90.0T  40.7T    849  3.13K  12.4M  67.5M
SG2         90.0T  40.7T  1.69K     47  48.9M   191K
SG2         90.0T  40.7T  2.37K      0  78.0M      0
SG2         90.0T  40.7T  1.88K      0  34.3M      0
SG2         90.0T  40.7T  1.16K  4.75K  11.3M   163M
SG2         90.0T  40.7T    752  4.58K  27.4M   190M
SG2         90.0T  40.7T  1.07K  1.66K  28.5M  13.3M
SG2         90.0T  40.7T  1.89K      0  34.9M      0
SG2         90.0T  40.7T  1.97K      0  57.1M      0
SG2         90.0T  40.7T  1.62K  5.62K  45.7M  93.1M
SG2         90.0T  40.7T    862  6.18K  20.7M   248M
SG2         90.0T  40.7T    929    752  15.8M  4.50M
SG2         90.0T  40.7T  2.18K      0  63.3M      0
SG2         90.0T  40.7T  2.10K      0  42.1M      0
SG2         90.0T  40.7T  2.04K  4.97K  68.1M   156M
SG2         90.0T  40.7T    917  4.98K  17.0M   135M
SG2         90.0T  40.7T  1.32K     47  63.6M   191K
SG2         90.0T  40.7T  2.56K      0  55.1M      0
SG2         90.0T  40.7T  2.18K      0  69.2M      0
SG2         90.0T  40.7T  1.69K  5.24K  58.0M   217M
SG2         90.0T  40.7T    380  5.84K  3.00M   178M
SG2         90.0T  40.7T  1.75K     47  35.8M   191K
SG2         90.0T  40.7T  2.05K      0  38.0M      0
SG2         90.0T  40.7T  1.58K      0  21.1M      0
SG2         90.0T  40.7T  1.44K  6.02K  33.9M   249M
SG2         90.0T  40.7T    780  4.96K  15.1M  93.4M
SG2         90.0T  40.7T  1.31K     47  43.2M   190K
SG2         90.0T  40.7T  2.04K      0  48.9M      0
SG2         90.0T  40.7T  1.95K      0  74.1M      0

So in my opinion here's how this breaks down. There is a fairly consistent level of read activity going on from the pool, and very bursty clumps of writes. The write clumps will tend to swamp out the ability of the pool to fulfill read requests. But the writes are unavoidable in your environment as designed, unless you start looking at alternative architectures as suggested by another poster. I consider redesigning your environment to be out-of-scope for this forum and am just looking at how to address the workload, but redesigning may not be a terrible idea.

So this boils down to a few questions:

1) Can the reads be fulfilled by ARC/L2ARC instead of from the pool? If so, this increases performance, but it doesn't seem to me like it'd be a huge increase, because it looks to me like it's trying to write maybe 100M and read maybe 50M per second, on average. This also depends on whether it is reasonable to think that the workload could be cached if there was more ARC/L2ARC, which is an unknown.

2) Would a pool redesign afford better IOPS? Mirrors provide better IOPS. If the data on this pool is not irreplaceable, then going to mirror pairs would represent a significant IOPS increase for both writes and reads; 12 drives in six vdevs. If you need double redundancy, three-way mirrors still gives you four vdevs.

3) At the end of the day, HDD is always going to be seek-limited, and it may be useful to contemplate ways to move workload onto SSD, whether NAS-based or locally.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
About ARC/L2ARC: Would writing files via NFS put them into ARC implicitly? Or would an explicit read access be necessary?

If the latter is the case, the benefit of a bigger ARC could be rather small, since the checkout from VCS would always be a fresh one (which is usually a good idea).
 