SOLVED Swap Cache SSD, then move boot to Old SSD

mpyusko

Dabbler
Joined
Jul 5, 2019
Messages
49
This is my current build:
Dell R710
2X Intel L5640 Xeon CPUs (dual hexa-core, 24 threads)
96GB DDR3-1333 Reg ECC DIMM
250 GB Samsung 850 EVO (L2ARC)
1 TB Intel 660p (ZIL/SLOG) PCIe x4 bus
6x 3 TB WD RED NAS 5400 RPM (raidz2)
16 GB Sandisk Ultra USB 3.0 (internal USB 2.0 port)
LSI SAS-2011-8i HBA PCIe x8 @5GT/s
4x Broadcom Limited NetXtreme II BCM5709 Gigabit Ethernet

In the coming weeks I plan to
  1. upgrade the CPUs to X5660s
  2. increase RAM significantly (up to 160 GB RAM)
  3. Replace the L2ARC with a 500 GB 860 EVO
  4. Replace the Boot flash with the 250 GB 850 EVO
I have two persistent VMs that run on FreeNAS, and they utilize 16GB of RAM. One is for XO-ce to monitor/manage my XCP-ng hypervisors; the other is simply a Linux VM that I use to manipulate large amounts of data, sync Dropbox (the cloud plugin doesn't truly sync), and handle various other tasks which are greatly accelerated by running directly on the storage server vs. a hypervisor.

Now the hardware upgrades are pretty basic, but I'm not clear about upgrading the L2ARC and then using the old SSD for the boot device. As it sits, everything is running off the LSI card except the ZIL, which uses an NVMe-to-PCIe adapter. While I can use it as additional storage, the BIOS doesn't allow me to boot from it. This leaves me with the more traditional SATA or USB boot method. Seven of my 8i ports are populated, so I don't want to add the boot drive there, as it would prevent me from adding a mirror for the cache later. That leaves the on-board SATA-II controller that was previously occupied by the optical drive. (I pulled the optical drive and replaced it with a 2.5" drive adapter for the L2ARC.) My experience has always been to image or mirror the USB flash drive when swapping to a newer device.

Code:
root@cygnus[~]# zpool status ZFSvol
  pool: ZFSvol
state: ONLINE
  scan: scrub repaired 0 in 0 days 07:05:52 with 0 errors on Sun May 10 07:05:53 2020
config:

        NAME                                            STATE     READ WRITE CKSUM
        ZFSvol                                          ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/0018b084-d4be-11e9-a222-782bcb3282a1  ONLINE       0     0     0
            gptid/01ed4943-d4be-11e9-a222-782bcb3282a1  ONLINE       0     0     0
            gptid/0324d9c9-d4be-11e9-a222-782bcb3282a1  ONLINE       0     0     0
            gptid/044f8747-d4be-11e9-a222-782bcb3282a1  ONLINE       0     0     0
            gptid/05515586-d4be-11e9-a222-782bcb3282a1  ONLINE       0     0     0
            gptid/0659db94-d4be-11e9-a222-782bcb3282a1  ONLINE       0     0     0
        logs
          gptid/076fa604-d4be-11e9-a222-782bcb3282a1    ONLINE       0     0     0
        cache
          gptid/0702d603-d4be-11e9-a222-782bcb3282a1    ONLINE       0     0     0

errors: No known data errors
root@cygnus[~]#

Code:
root@cygnus[~]# zpool list
NAME           SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
ZFSvol        16.2T  9.78T  6.47T        -         -     9%    60%  1.00x  ONLINE  /mnt
freenas-boot    14G  3.54G  10.5G        -         -      -    25%  1.00x  ONLINE  -
root@cygnus[~]#


So here are my questions...

  1. To replace the L2ARC device:
    1. Remove the cache device in the WebGUI
    2. Replace the hardware
    3. Assign the new SSD as the cache in the WebGUI
  2. To replace the Boot device, can I:
    1. Add the old SSD to the integrated SATA-II controller
    2. Boot normally
    3. Add it as a mirror to the Boot volume in the WebGUI
    4. Once the resilvering is complete, remove the USB flash drive from the mirror set in the WebGUI
    5. Reboot
    6. Configure the BIOS to boot from the SSD instead
    7. Boot into FreeNAS?
  3. Since the 2 VMs are always persistent, should I reduce the ARC maximum by 16GB?
    1. If yes, how?
There are numerous examples of how to do this via the CLI but, in my experience, changes to FreeNAS via the CLI (especially when dealing with the cache devices) are not always persistent.
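For reference, my understanding of the rough CLI equivalent for the cache swap is below, although I'd rather do it through the WebGUI so the middleware stays aware of the change (the new device name is a placeholder):

Code:
# 1. Detach the current cache device (gptid of the cache vdev, from zpool status below)
zpool remove ZFSvol gptid/0702d603-d4be-11e9-a222-782bcb3282a1

# 2. Physically swap the SSD, then add the new one back as a cache vdev.
#    "da6" is a placeholder -- check `camcontrol devlist` for the actual device name.
zpool add ZFSvol cache da6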

Code:
root@cygnus[~]# zpool iostat -v
                                           capacity     operations    bandwidth
pool                                    alloc   free   read  write   read  write
--------------------------------------  -----  -----  -----  -----  -----  -----
ZFSvol                                  9.78T  6.47T    489     97  43.6M  1.28M
  raidz2                                9.78T  6.47T    489     84  43.6M   838K
    gptid/0018b084-d4be-11e9-a222-782bcb3282a1      -      -     30     23  11.0M   289K
    gptid/01ed4943-d4be-11e9-a222-782bcb3282a1      -      -     34     23  11.0M   291K
    gptid/0324d9c9-d4be-11e9-a222-782bcb3282a1      -      -     28     22  11.1M   285K
    gptid/044f8747-d4be-11e9-a222-782bcb3282a1      -      -     34     23  11.0M   289K
    gptid/05515586-d4be-11e9-a222-782bcb3282a1      -      -     58     24  11.0M   291K
    gptid/0659db94-d4be-11e9-a222-782bcb3282a1      -      -     58     23  11.0M   285K
logs                                        -      -      -      -      -      -
  gptid/076fa604-d4be-11e9-a222-782bcb3282a1  3.05M   952G      0     12      6   468K
cache                                       -      -      -      -      -      -
  gptid/0702d603-d4be-11e9-a222-782bcb3282a1   182G  51.0G      0     11  22.9K  1.22M
--------------------------------------  -----  -----  -----  -----  -----  -----
freenas-boot                            3.54G  10.5G      0     12  4.42K   189K
  da7p2                                 3.54G  10.5G      0     12  4.42K   189K
--------------------------------------  -----  -----  -----  -----  -----  -----

root@cygnus[~]#


Code:
root@cygnus[~]# zpool get all
NAME          PROPERTY                       VALUE                          SOURCE
ZFSvol        size                           16.2T                          -
ZFSvol        capacity                       60%                            -
ZFSvol        altroot                        /mnt                           local
ZFSvol        health                         ONLINE                         -
ZFSvol        guid                           16569247781209315036           default
ZFSvol        version                        -                              default
ZFSvol        bootfs                         -                              default
ZFSvol        delegation                     on                             default
ZFSvol        autoreplace                    off                            default
ZFSvol        cachefile                      /data/zfs/zpool.cache          local
ZFSvol        failmode                       continue                       local
ZFSvol        listsnapshots                  off                            default
ZFSvol        autoexpand                     on                             local
ZFSvol        dedupditto                     0                              default
ZFSvol        dedupratio                     1.00x                          -
ZFSvol        free                           6.47T                          -
ZFSvol        allocated                      9.78T                          -
ZFSvol        readonly                       off                            -
ZFSvol        comment                        -                              default
ZFSvol        expandsize                     -                              -
ZFSvol        freeing                        0                              default
ZFSvol        fragmentation                  9%                             -
ZFSvol        leaked                         0                              default
ZFSvol        bootsize                       -                              default
ZFSvol        checkpoint                     -                              -
ZFSvol        feature@async_destroy          enabled                        local
ZFSvol        feature@empty_bpobj            active                         local
ZFSvol        feature@lz4_compress           active                         local
ZFSvol        feature@multi_vdev_crash_dump  enabled                        local
ZFSvol        feature@spacemap_histogram     active                         local
ZFSvol        feature@enabled_txg            active                         local
ZFSvol        feature@hole_birth             active                         local
ZFSvol        feature@extensible_dataset     enabled                        local
ZFSvol        feature@embedded_data          active                         local
ZFSvol        feature@bookmarks              enabled                        local
ZFSvol        feature@filesystem_limits      enabled                        local
ZFSvol        feature@large_blocks           enabled                        local
ZFSvol        feature@sha512                 enabled                        local
ZFSvol        feature@skein                  enabled                        local
ZFSvol        feature@device_removal         enabled                        local
ZFSvol        feature@obsolete_counts        enabled                        local
ZFSvol        feature@zpool_checkpoint       enabled                        local
freenas-boot  size                           14G                            -
freenas-boot  capacity                       25%                            -
freenas-boot  altroot                        -                              default
freenas-boot  health                         ONLINE                         -
freenas-boot  guid                           1652143396639454651            default
freenas-boot  version                        -                              default
freenas-boot  bootfs                         freenas-boot/ROOT/11.2-U8      local
freenas-boot  delegation                     on                             default
freenas-boot  autoreplace                    off                            default
freenas-boot  cachefile                      -                              default
freenas-boot  failmode                       wait                           default
freenas-boot  listsnapshots                  off                            default
freenas-boot  autoexpand                     off                            default
freenas-boot  dedupditto                     0                              default
freenas-boot  dedupratio                     1.00x                          -
freenas-boot  free                           10.5G                          -
freenas-boot  allocated                      3.54G                          -
freenas-boot  readonly                       off                            -
freenas-boot  comment                        -                              default
freenas-boot  expandsize                     -                              -
freenas-boot  freeing                        0                              default
freenas-boot  fragmentation                  -                              -
freenas-boot  leaked                         0                              default
freenas-boot  bootsize                       -                              default
freenas-boot  checkpoint                     -                              -
freenas-boot  feature@async_destroy          enabled                        local
freenas-boot  feature@empty_bpobj            active                         local
freenas-boot  feature@lz4_compress           active                         local
freenas-boot  feature@multi_vdev_crash_dump  disabled                       local
freenas-boot  feature@spacemap_histogram     disabled                       local
freenas-boot  feature@enabled_txg            disabled                       local
freenas-boot  feature@hole_birth             disabled                       local
freenas-boot  feature@extensible_dataset     disabled                       local
freenas-boot  feature@embedded_data          disabled                       local
freenas-boot  feature@bookmarks              disabled                       local
freenas-boot  feature@filesystem_limits      disabled                       local
freenas-boot  feature@large_blocks           disabled                       local
freenas-boot  feature@sha512                 disabled                       local
freenas-boot  feature@skein                  disabled                       local
freenas-boot  feature@device_removal         disabled                       local
freenas-boot  feature@obsolete_counts        disabled                       local
freenas-boot  feature@zpool_checkpoint       disabled                       local
root@cygnus[~]#


Thanks.
 
Joined
Oct 18, 2018
Messages
969
Now the hardware upgrades are pretty basic, but I'm not clear about upgrading the L2ARC and then using the old SSD for the Boot.
If you're not seeing a low ARC hit ratio, adding L2ARC drives will not improve performance. If you are seeing low ARC hit ratios, max out RAM first (if you have not already done so) before adding an L2ARC device. The L2ARC's index lives in RAM, so adding too much or adding it prematurely can harm performance rather than help it.
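If you want to check that ratio, arc_summary.py (which ships with FreeNAS and shows up later in this thread) reports it; something along these lines:

Code:
# "Cache Hit Ratio" under "ARC Total accesses" is the number to watch
arc_summary.py | grep -A 3 "ARC Total accesses"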

1 TB Intel 660p (ZIL/SLOG) PCIe x4 bus
So, you may have heard this before, but this really is not an ideal SLOG device. For one, it is WAY larger than it needs to be. It also lacks Power Loss Protection (PLP). The purpose of the ZIL is to store transaction groups prior to their being fully committed to the pool. For sync writes, your system will wait until the data is fully written to the ZIL and confirmed to be there before acknowledging the write. This is important. For async writes the system doesn't wait, so async writes are going to be faster than sync writes. The benefit of the sync write approach is that if your system experiences an issue or a power loss event, any data stored in the ZIL is still there. On power-up your system checks the ZIL and replays any transaction groups there into your pool. However, you're not using a PLP device, so if you experience a power loss event while data is in transit to your device you could lose that transaction group. Read up on PLP devices and why folks recommend them for SLOG devices. You could get better performance without significant (or any) loss of data protection by turning off sync writes entirely and removing that SLOG device.
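For what it's worth, "turning off sync writes" is a per-dataset ZFS property rather than a global switch; a rough sketch (the dataset name is a placeholder, and sync=disabled means data in flight can be lost on power failure):

Code:
# Check the current behavior ("standard" honors sync requests from clients)
zfs get sync ZFSvol

# Disable sync writes for a single dataset only -- faster, but unsafe on power loss
zfs set sync=disabled ZFSvol/somedataset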

Since the 2 VMs are always persistent, should I reduce the ARC maximum by 16GB?
  1. If yes, how?
The ARC (Adaptive Replacement Cache) is ZFS's read cache. Reducing the ARC size will likely dramatically harm your read performance.

If you haven't already done so, I suggest you check out how the SLOG and L2ARC work to improve the performance of your system. Understand how they work, when to use them, and when to give them resources. Also, check out what the ARC is and why it matters. Understand how the ARC IS a read cache and the ZIL/SLOG is NOT a write cache.

As for the approaches you outlined above: those generally seem reasonable to me. If you have backups of your pool and you don't touch them, you should be fine to give what you suggested a shot. Be sure to keep backups of your system config as well as your encryption keys (if you use encryption). If things go south your pools will still be fine.
 
Last edited:

K_switch

Dabbler
Joined
Dec 18, 2019
Messages
44
As for the approaches you outlined above: those generally seem reasonable to me. If you have backups of your pool and you don't touch them, you should be fine to give what you suggested a shot. Be sure to keep backups of your system config as well as your encryption keys (if you use encryption). If things go south your pools will still be fine.
Couldn't agree more with the above statement... The beauty of FreeNAS is that even if all things go wrong, you should still be able to import your pool on a newly configured FreeNAS. I am by no means an expert yet; however, I will say that after reading this post my understanding of the SLOG/ZIL role in ZFS grew significantly. You should be fine to move forward with your steps... but do make sure you have backups.

You could get better performance without significant (or any) loss of data protection by turning off sync writes entirely and removing that SLOG device.
@PhiloEpisteme You have a vastly superior understanding of this world than I do; however, I have heard from several experts that the risk is never worth the gain. How significant, in your mind, is the potential for data loss by turning sync off when using NFS to share datastores to, say, a Proxmox or XCP-ng host? I appreciate the answer, as I am always trying to learn!
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
You could get better performance without significant (or any) loss of data protection by turning off sync writes entirely and removing that SLOG device.

Not quite; the 660p will honor sync writes and cache flush requests - it just won't be particularly fast at it, or last long under that workload. See this post here in the SLOG benchmark thread for how slow it is:


So right now, it's being safe (assuming that XCP-ng follows NFS standards in using sync writes) but slow. Disabling sync writes will make it fast and unsafe. If you want "faster" and safe - a faster SLOG is needed. I'd recommend one of the M.2 Optane M10 cards. Even the 16GB will be much, much faster than the 660p - the 32GB is better in terms of both performance and write endurance. (Sadly the M15 line was cancelled.)
 
Joined
Oct 18, 2018
Messages
969
Not quite; the 660p will honor sync writes and cache flush requests - it just won't be particularly fast at it, or last long under that workload. See this post here in the SLOG benchmark thread for how slow it is:
Interesting; the Intel doc lists no PLP. That thread is always a good place to check.
 
Joined
Oct 18, 2018
Messages
969
@PhiloEpisteme You have a vastly superior understanding of this world than I do; however, I have heard from several experts that the risk is never worth the gain. How significant, in your mind, is the potential for data loss by turning sync off when using NFS to share datastores to, say, a Proxmox or XCP-ng host? I appreciate the answer, as I am always trying to learn!
I generally find blanket statements of that sort to be difficult to agree with completely. This comes up in encryption discussions a fair bit as well. Whether a certain feature is worth it to you depends a lot on your use case. How important is your data? How likely are you to experience a power loss event or something else where the ZIL is necessary? How important is write performance? These are all questions with subjective answers. In general, what I advocate for is that folks try their best to understand the objective points and numbers to inform their subjective opinions on these types of matters. There are a lot of great resources out there detailing how ZFS writes data to the ZIL and then to the pool for sync writes and how that differs from async writes. I would advise folks to research that and then make an informed decision.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Interesting, the intel doc lists no PLP, that thread is always a good place to check.
SLOGs don't need full end-to-end PLP to be safe; they need it to be fast. A drive with PLP can treat its RAM cache as non-volatile and not have to flush the data to NAND immediately on request. The only way a drive is unsafe for SLOG is if it lies about its RAM cache or PLP state (like an OCZ Vertex drive I have).
 

mpyusko

Dabbler
Joined
Jul 5, 2019
Messages
49
I believe I have a fairly solid understanding of the ARC, L2ARC, ZIL, and SLOG. I've tried utilizing ZFS on other OSes too, like Ubuntu, and XCP-ng even has support for it. But when doing so, you have to define a certain amount of RAM to reserve for the OS so there isn't constant competition between the OS and the ARC. This is especially true of XCP-ng, where you want to limit the amount of RAM dedicated to Dom0 so you can keep space free to launch the VMs. Is this also true for FreeNAS? Should I limit the size of the ARC so it doesn't interfere with the RAM allocated to the 2 VMs?

Also, the L2ARC is on a SATA-III interface because its 540 MB/s R/W is far in excess of the dual gigabit NICs. Even the random R/W is sufficient to serve data to the NICs. In fact, given the current specs, CrystalDiskMark running on a Windows VM can peg ~120 MB/s in nearly every test. Monitoring the resources in the FreeNAS reports, there is still substantial room for storage I/O.

The ZIL/SLOG is a 1TB Intel 660p NVMe. Why?
1. The NVMe is insanely faster than the SATA SSD, especially in IOPS, which is where the ZIL/SLOG needs the performance. It is in an adapter that allows a direct interface to the PCIe bus at the native x4 speed. Being an SSD, it is non-volatile storage. The R710 is powered by two redundant 875-Watt PSUs, each connected to its own 900W/3000VA UPS. IF..... this server ever loses power unpredictably, then I have bigger issues than a write cache to worry about.
2. Unlike the ARC/L2ARC, the resulting I/O traffic happens within the SAN/NAS, and being able to flush/commit those writes as quickly as possible is key. I think we can agree that while the 660p is not, by any means, the fastest NVMe drive out there.... it is definitely not the bottleneck in this build.
3. I already had it kicking around from another build that never happened.
4. The 1 TB size plays into the wear leveling, enhancing the endurance of the drive.

Comparative specs for both....
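(Both captured with smartctl; roughly the following, with the NVMe device name assumed -- the SATA drive shows up as /dev/ada0 per the TBW note below.)

Code:
smartctl -x /dev/nvme0   # ZIL/SLOG (NVMe device name assumed)
smartctl -x /dev/ada0    # L2ARC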

ZIL/SLOG
Code:
=== START OF INFORMATION SECTION ===
Model Number:                       INTEL SSDPEKNW010T8
Serial Number:                     
Firmware Version:                   002C
PCI Vendor/Subsystem ID:            0x8086
IEEE OUI Identifier:                0x5cd2e4
Controller ID:                      1
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,024,209,543,168 [1.02 TB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Tue May 12 03:30:48 2020 EDT
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Maximum Data Transfer Size:         32 Pages
Warning  Comp. Temp. Threshold:     77 Celsius
Critical Comp. Temp. Threshold:     80 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
0 +     4.00W       -        -    0  0  0  0        0       0
1 +     3.00W       -        -    1  1  1  1        0       0
2 +     2.20W       -        -    2  2  2  2        0       0
3 -   0.0300W       -        -    3  3  3  3     5000    5000
4 -   0.0040W       -        -    4  4  4  4     5000    9000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
Critical Warning:                   0x00
Temperature:                        30 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    4%
Data Units Read:                    5,678 [2.90 GB]
Data Units Written:                 54,582,985 [27.9 TB]
Host Read Commands:                 29,716
Host Write Commands:                498,115,168
Controller Busy Time:               3,019
Power Cycles:                       25
Power On Hours:                     5,916
Unsafe Shutdowns:                   5
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Thermal Temp. 1 Transition Count:   4
Thermal Temp. 1 Total Time:         27

Error Information (NVMe Log 0x01, max 256 entries)
No Errors Logged

L2ARC
Code:
=== START OF INFORMATION SECTION ===
Device Model:     Samsung SSD 860 EVO 250GB
Serial Number:   
LU WWN Device Id: 5 002538 e4072a4d3
Firmware Version: RVT01B6Q
User Capacity:    250,059,350,016 bytes [250 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue May 12 03:34:00 2020 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Enabled

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 1) ==
0x01  0x008  4              23  ---  Lifetime Power-On Resets
0x01  0x010  4           13430  ---  Power-on Hours
0x01  0x018  6     70782775662  ---  Logical Sectors Written
0x01  0x020  6       332438084  ---  Number of Write Commands
0x01  0x028  6     25549351081  ---  Logical Sectors Read
0x01  0x030  6       173115586  ---  Number of Read Commands
0x01  0x038  6          131000  ---  Date and Time TimeStamp
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4               0  ---  Resets Between Cmd Acceptance and Completion
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              26  ---  Current Temperature
0x05  0x020  1              41  ---  Highest Temperature
0x05  0x028  1              13  ---  Lowest Temperature
0x05  0x058  1              70  ---  Specified Maximum Operating Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4             206  ---  Number of Hardware Resets
0x06  0x010  4               0  ---  Number of ASR Events
0x06  0x018  4               0  ---  Number of Interface CRC Errors
0x07  =====  =               =  ===  == Solid State Device Statistics (rev 1) ==
0x07  0x008  1               8  N--  Percentage Used Endurance Indicator
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

( /dev/ada0 TBW: 36.2418 TB)

I currently have 2 VMs performing their nightly virus scan.... (enough to feed 8x X5660 cores @ 100% utilization in each VM)
1589267119877.png


And on FreeNAS.....
1589267177826.png

1589267216779.png

(da1 through da5 are identical graphs to da0)
1589267244388.png

(da1 through da5 are identical graphs to da0)
1589267296072.png

The L2ARC will sawtooth between 200 and 500 IOPS for the next couple of hours while the virus scans run.
The ZIL/SLOG stays around 800 IOPS when the backup archives are pulled in at about 500 Mbps (my WAN bandwidth cap).

IMHO, having an NVMe drive for the L2ARC would be a waste given my setup, since the extra IOPS and R/W performance would go unused.
However, adding more RAM is always an improvement.
 
Last edited:

mpyusko

Dabbler
Joined
Jul 5, 2019
Messages
49
If you're not seeing a low ARC hit ratio, adding L2ARC drives will not improve performance. If you are seeing low ARC hit ratios, max out RAM first (if you have not already done so) before adding an L2ARC device. The L2ARC's index lives in RAM, so adding too much or adding it prematurely can harm performance rather than help it.

Currently both the ARC and L2ARC are full. Leaving 80 GB of RAM for the OS and ARC should be plenty to allow decent performance for both; however, I am going to increase the RAM to 160 GB, leaving 144 GB of RAM for ARC and L2ARC management. Please correct me if I'm wrong, but the purpose of the ARC/L2ARC is to be an accelerated block storage buffer so requests for recent/commonly read data are not passed to the spindle array. The WD NAS Reds, especially the 5400 RPM series, have extremely low random R/W performance, so having an L2ARC absorb those read requests increases performance on the whole. As long as there is enough RAM to spare to manage the L2ARC, you want that L2ARC to be as big as possible, which is why I want to swap in a 500 GB drive. The Dell R710 can extend to 288 GB RAM however 192 GB is where you'll see peak performance. This is because when you populate the third rank it drops the DDR3 speed from 1333 to 800 MHz.

I believe your recommendation applies to systems where the RAM is much closer to the 8GB/64GB minimum.

"As a general rule of thumb, an L2ARC should not be added to a system with less than 64 GB of RAM and the size of an L2ARC should not exceed 5x the amount of RAM. In some cases, it may be more efficient to have two separate pools: one on SSDs for active data and another on hard drives for rarely used content. After adding an L2ARC, monitor its effectiveness using tools such as arcstat. If you need to increase the size of an existing L2ARC, you can stripe another cache device using Volume Manager. The GUI will always stripe L2ARC, not mirror it, as the contents of L2ARC are recreated at boot. Losing an L2ARC device will not affect the integrity of the pool, but may have an impact on read performance, depending upon the workload and the ratio of dataset size to cache size. Note that a dedicated L2ARC device can not be shared between ZFS pools."

64GB recommended minimum for L2ARC (I have 144/160 GB)
5x 144GB = 720 GB (500 GB SSD is well within that threshold)

Note: I have an old FreeNAS 9.10 box maxed out at 8GB RAM running with a 240 GB SSD for the L2ARC with just shy of 17 TB written to it.....
View attachment 38490

What is strange is the ARC/L2ARC sizes reported there vs my 11.2 box.
(How can 8GB RAM equal 163 GB ARC and a 240 GB SSD equal a 443 GB L2ARC???)
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
@mpyusko - There's a lot to break down here, so I'll try to go piece by piece, with a summary in bold rather than nesting and breaking apart a pile of quotes. Before that though, important question: are you using NFS or iSCSI to present the storage to your XCP-ng hypervisors? This matters.

"Should I limit the size of the ARC so it doesn't interfere with the RAM allocated to the 2 VMs?"
Since these VMs are running on FreeNAS: yes, you should give your maximum ARC size a 16GB haircut (after upgrading the RAM, of course). While ZFS is generally good about releasing memory from the ARC quickly when it's demanded, it isn't instantaneous, and a malloc() from bhyve could stall for a bit or fail if it was expecting a RAM-speed response and it's delayed.
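For the "how": the ceiling is the vfs.zfs.arc_max sysctl, in bytes. A sketch, with the value purely illustrative for a 160GB box after setting aside the VM and OS reservations:

Code:
# One-off from the shell; if the sysctl is read-only on this build, use a loader tunable instead.
sysctl vfs.zfs.arc_max=137438953472    # 128 GiB, illustrative only

# To persist across reboots, add it under System -> Tunables in the WebGUI:
#   Variable: vfs.zfs.arc_max   Value: 137438953472   Type: sysctl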

"The ZIL/SLOG is a 1TB Intel 660p NVMe."
If you're getting good results and it's truly hitting the 660p, then you might not have enough writes going on to overwhelm the SLC-style caching of the 660p. Just be aware that it's still QLC; if you overwhelm the SLC buffer things will slow down quite a bit. And keep an eye on that endurance value.

"The Dell R710 can extend to 288 GB RAM however 192 GB is where you'll see peak performance. This is because when you populate the third rank it drops the DDR3 speed from 1333 to 800 MHz."
I'd wager (strongly) that the benefit of having an additional 96GB of ARC would far, far outweigh any performance potentially "lost" from downgrading to 800MHz. The other potential bottleneck is the slow clock speed of your L5640s - got any more of those X5660s you could drop into your FreeNAS box? ;)

"5x 144GB = 720 GB (500 GB SSD is well within that threshold)"
The "5x your RAM" thumbrule is dated. You can pretty safely go to 10x your RAM, but since you have L2ARC running right now, you can always run arc_summary.py from an SSH prompt and look at the L2ARC section under the total amount used and the header size (RAM consumed to index it) to see how it shakes out for your particular recordsize and workload.

"(How can 8GB RAM equal 163 GB ARC and a 240 GB SSD equal a 443 GB L2ARC???)"
Possibly a UI bug, although ARC and L2ARC both show values considering compression. 20:1 compression on your ARC would imply you've got something extremely compressible loaded there, though. arc_summary.py may show some insight there as well.
 
Joined
Oct 18, 2018
Messages
969
Please correct me if I'm wrong, but the purpose of the ARC/L2ARC is to be an accelerated block storage buffer so requests for recent/commonly read data are not passed to the spindle array. The WD NAS Reds, especially the 5400 RPM series, have extremely low random R/W performance, so having an L2ARC absorb those read requests increases performance on the whole. As long as there is enough RAM to spare to manage the L2ARC, you want that L2ARC to be as big as possible, which is why I want to swap in a 500 GB drive. The Dell R710 can extend to 288 GB RAM however 192 GB is where you'll see peak performance. This is because when you populate the third rank it drops the DDR3 speed from 1333 to 800 MHz.

I believe your recommendation applies to systems where the RAM is much closer to the 8GB/64GB minimum.

You are correct that the ARC/L2ARC are there to provide you with faster reads. The thing with L2ARC vs ARC is that the ARC will be faster than the L2ARC. Why? Because RAM is faster than an SSD, even if using NVMe. For this reason, folks typically max out their RAM first, because it is faster than L2ARC, and then start adding L2ARC devices if they are still getting a low ARC hit ratio.

Finally, re RAM speeds etc. Why would it drop from 1333MHz to 800MHz? Typically FreeNAS cares more about the volume of RAM than about speed or number of channels. I did a quick, half-assed search for the bandwidth of 800MHz DDR3 RAM and found that it is 6400 MB/s. That will still easily outclass your SSD, even if it is an NVMe drive.

Could you explain the reasoning why the arguments above would only apply to a system with less RAM?

Looking at the data you provided, I would say you want to add more RAM. Your ARC hit ratio is 20% and your L2ARC's is only 40%.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Finally, re RAM speeds etc. Why would it drop from 1333MHz to 800MHz?

Populating the third DIMM channel on Nehalem reduces the speed to 800MHz; it's an Intel limitation.
 

mpyusko

Dabbler
Joined
Jul 5, 2019
Messages
49
I'm in purple...

@mpyusko - There's a lot to break down here, so I'll try to go piece by piece, with a summary in bold rather than nesting and breaking apart a pile of quotes. Before that though, important question: are you using NFS or iSCSI to present the storage to your XCP-ng hypervisors? This matters.

It serves NFS, iSCSI, and SMB Storage Repositories to Xen. Hypervisors that flex in and out go in the NFS Repo, the "permanent" VMs are in iSCSI, and SMB is the ISO Repo. Both NFS and iSCSI are restricted to the LACP-bonded NICs. The SMB is tied to another NIC, and the fourth NIC is for LAN/WAN traffic. The VMs only see the LAN/WAN network. Traffic is handled by a Cisco Catalyst 3560G 48-port switch.

FreeNAS also serves SMB shares to my Windows network for shared storage. Those same shares are mounted via CIFS under Linux. Since SMB plays well with both Windows and Linux, I opted for that to be the shared storage. Also, the bandwidth requirements are lower so I wanted to keep NFS simply handling the NFS Storage Repository.


"The ZIL/SLOG is a 1TB Intel 660p NVMe."
If you're getting good results and it's truly hitting the 660p, then you might not have enough writes going on to overwhelm the SLC-style caching of the 660p. Just be aware that it's still QLC; if you overwhelm the SLC buffer things will slow down quite a bit. And keep an eye on that endurance value.

Without knowing exactly how much data actually gets written to the L2ARC and ZIL/SLOG until after it is running, it's hard to figure out how long the SSDs will last. Knowing this has been running for 8 months, and knowing the load on it, it is easier to estimate. The 660p has a very solid amount of SLC on the 1 TB model. I doubt, even at max load, the QLC even gets hit. SLC lasts longer than MLC, TLC, and QLC anyway.

"The Dell R710 can extend to 288 GB RAM however 192 GB is where you'll see peak performance. This is because when you populate the third rank it drops the DDR3 speed from 1333 to 800 MHz."
I'd wager (strongly) that the benefit of having an additional 96GB of ARC would far, far outweigh any performance potentially "lost" from downgrading to 800MHz. The other potential bottleneck is the slow clock speed of your L5640s - got any more of those X5660s you could drop into your FreeNAS box? ;)

See OP... yes, I have two that are going in with the RAM and other components.

"5x 144GB = 720 GB (500 GB SSD is well within that threshold)"
The "5x your RAM" thumbrule is dated. You can pretty safely go to 10x your RAM, but since you have L2ARC running right now, you can always run arc_summary.py from an SSH prompt and look at the L2ARC section under the total amount used and the header size (RAM consumed to index it) to see how it shakes out for your particular recordsize and workload.

Below...

"(How can 8GB RAM equal 163 GB ARC and a 240 GB SSD equal a 443 GB L2ARC???)"
Possibly a UI bug, although ARC and L2ARC both show values considering compression. 20:1 compression on your ARC would imply you've got something extremely compressible loaded there, though. arc_summary.py may show some insight there as well.

That box mostly handles emails and databases.

arc_summary output..... (Seems pretty well uninhibited to me.)
Code:
root@cygnus[~]# arc_summary.py
System Memory:

        1.80%   1.68    GiB Active,     10.22%  9.56    GiB Inact
        85.65%  80.12   GiB Wired,      0.00%   0       Bytes Cache
        2.33%   2.18    GiB Free,       -0.00%  -110592 Bytes Gap

        Real Installed:                         96.00   GiB
        Real Available:                 99.95%  95.95   GiB
        Real Managed:                   97.49%  93.54   GiB

        Logical Total:                          96.00   GiB
        Logical Used:                   87.77%  84.26   GiB
        Logical Free:                   12.23%  11.74   GiB

Kernel Memory:                                  1.17    GiB
        Data:                           96.45%  1.13    GiB
        Text:                           3.55%   42.55   MiB

Kernel Memory Map:                              93.54   GiB
        Size:                           5.51%   5.15    GiB
        Free:                           94.49%  88.39   GiB
                                                                Page:  1
------------------------------------------------------------------------

ARC Summary: (HEALTHY)
        Storage pool Version:                   5000
        Filesystem Version:                     5
        Memory Throttle Count:                  0

ARC Misc:
        Deleted:                                6.88m
        Mutex Misses:                           7.55k
        Evict Skips:                            7.55k

ARC Size:                               96.48%  67.88   GiB
        Target Size: (Adaptive)         96.20%  67.67   GiB
        Min Size (Hard Limit):          16.44%  11.57   GiB
        Max Size (High Water):          6:1     70.35   GiB

ARC Size Breakdown:
        Recently Used Cache Size:       93.47%  63.44   GiB
        Frequently Used Cache Size:     6.53%   4.43    GiB

ARC Hash Breakdown:
        Elements Max:                           4.38m
        Elements Current:               96.41%  4.22m
        Collisions:                             5.53m
        Chain Max:                              6
        Chains:                                 449.57k
                                                                Page:  2
------------------------------------------------------------------------

ARC Total accesses:                                     698.92m
        Cache Hit Ratio:                97.55%  681.80m
        Cache Miss Ratio:               2.45%   17.12m
        Actual Hit Ratio:               95.42%  666.87m

        Data Demand Efficiency:         91.47%  38.95m
        Data Prefetch Efficiency:       63.93%  14.02m

        CACHE HITS BY CACHE LIST:
          Anonymously Used:             2.15%   14.64m
          Most Recently Used:           3.11%   21.17m
          Most Frequently Used:         94.71%  645.70m
          Most Recently Used Ghost:     0.03%   198.21k
          Most Frequently Used Ghost:   0.01%   88.54k

        CACHE HITS BY DATA TYPE:
          Demand Data:                  5.23%   35.63m
          Prefetch Data:                1.31%   8.96m
          Demand Metadata:              88.27%  601.81m
          Prefetch Metadata:            5.19%   35.39m

        CACHE MISSES BY DATA TYPE:
          Demand Data:                  19.40%  3.32m
          Prefetch Data:                29.55%  5.06m
          Demand Metadata:              48.78%  8.35m
          Prefetch Metadata:            2.27%   388.77k
                                                                Page:  3
------------------------------------------------------------------------

L2 ARC Summary: (HEALTHY)
        Passed Headroom:                        828.45k
        Tried Lock Failures:                    44.47k
        IO In Progress:                         0
        Low Memory Aborts:                      2
        Free on Write:                          685
        Writes While Full:                      14.94k
        R/W Clashes:                            0
        Bad Checksums:                          0
        IO Errors:                              0
        SPA Mismatch:                           137.23m

L2 ARC Size: (Adaptive)                         315.53  GiB
        Compressed:                     71.12%  224.41  GiB
        Header Size:                    0.06%   199.28  MiB

L2 ARC Evicts:
        Lock Retries:                           33
        Upon Reading:                           0

L2 ARC Breakdown:                               17.11m
        Hit Ratio:                      23.54%  4.03m
        Miss Ratio:                     76.46%  13.08m
        Feeds:                                  310.42k

L2 ARC Buffer:
        Bytes Scanned:                          60.30   TiB
        Buffer Iterations:                      310.42k
        List Iterations:                        1.24m
        NULL List Iterations:                   912

L2 ARC Writes:
        Writes Sent:                    100.00% 86.38k
                                                                Page:  4
------------------------------------------------------------------------

DMU Prefetch Efficiency:                        81.83m
        Hit Ratio:                      4.53%   3.71m
        Miss Ratio:                     95.47%  78.12m

                                                                Page:  5
------------------------------------------------------------------------

                                                                Page:  6
------------------------------------------------------------------------

ZFS Tunable (sysctl):
        kern.maxusers                           6476
        vm.kmem_size                            100434550784
        vm.kmem_size_scale                      1
        vm.kmem_size_min                        0
        vm.kmem_size_max                        1319413950874
        vfs.zfs.vol.immediate_write_sz          32768
        vfs.zfs.vol.unmap_sync_enabled          0
        vfs.zfs.vol.unmap_enabled               1
        vfs.zfs.vol.recursive                   0
        vfs.zfs.vol.mode                        2
        vfs.zfs.sync_pass_rewrite               2
        vfs.zfs.sync_pass_dont_compress         5
        vfs.zfs.sync_pass_deferred_free         2
        vfs.zfs.zio.dva_throttle_enabled        1
        vfs.zfs.zio.exclude_metadata            0
        vfs.zfs.zio.use_uma                     1
        vfs.zfs.zil_slog_bulk                   786432
        vfs.zfs.cache_flush_disable             0
        vfs.zfs.zil_replay_disable              0
        vfs.zfs.version.zpl                     5
        vfs.zfs.version.spa                     5000
        vfs.zfs.version.acl                     1
        vfs.zfs.version.ioctl                   7
        vfs.zfs.debug                           0
        vfs.zfs.super_owner                     0
        vfs.zfs.immediate_write_sz              32768
        vfs.zfs.standard_sm_blksz               131072
        vfs.zfs.dtl_sm_blksz                    4096
        vfs.zfs.min_auto_ashift                 12
        vfs.zfs.max_auto_ashift                 13
        vfs.zfs.vdev.queue_depth_pct            1000
        vfs.zfs.vdev.write_gap_limit            4096
        vfs.zfs.vdev.read_gap_limit             32768
        vfs.zfs.vdev.aggregation_limit_non_rotating131072
        vfs.zfs.vdev.aggregation_limit          1048576
        vfs.zfs.vdev.trim_max_active            64
        vfs.zfs.vdev.trim_min_active            1
        vfs.zfs.vdev.scrub_max_active           2
        vfs.zfs.vdev.scrub_min_active           1
        vfs.zfs.vdev.async_write_max_active     10
        vfs.zfs.vdev.async_write_min_active     1
        vfs.zfs.vdev.async_read_max_active      3
        vfs.zfs.vdev.async_read_min_active      1
        vfs.zfs.vdev.sync_write_max_active      10
        vfs.zfs.vdev.sync_write_min_active      10
        vfs.zfs.vdev.sync_read_max_active       10
        vfs.zfs.vdev.sync_read_min_active       10
        vfs.zfs.vdev.max_active                 1000
        vfs.zfs.vdev.async_write_active_max_dirty_percent60
        vfs.zfs.vdev.async_write_active_min_dirty_percent30
        vfs.zfs.vdev.mirror.non_rotating_seek_inc1
        vfs.zfs.vdev.mirror.non_rotating_inc    0
        vfs.zfs.vdev.mirror.rotating_seek_offset1048576
        vfs.zfs.vdev.mirror.rotating_seek_inc   5
        vfs.zfs.vdev.mirror.rotating_inc        0
        vfs.zfs.vdev.trim_on_init               1
        vfs.zfs.vdev.bio_delete_disable         0
        vfs.zfs.vdev.bio_flush_disable          0
        vfs.zfs.vdev.cache.bshift               16
        vfs.zfs.vdev.cache.size                 0
        vfs.zfs.vdev.cache.max                  16384
        vfs.zfs.vdev.default_ms_shift           29
        vfs.zfs.vdev.min_ms_count               16
        vfs.zfs.vdev.max_ms_count               200
        vfs.zfs.vdev.trim_max_pending           10000
        vfs.zfs.txg.timeout                     5
        vfs.zfs.trim.enabled                    1
        vfs.zfs.trim.max_interval               1
        vfs.zfs.trim.timeout                    30
        vfs.zfs.trim.txg_delay                  32
        vfs.zfs.spa_min_slop                    134217728
        vfs.zfs.spa_slop_shift                  5
        vfs.zfs.spa_asize_inflation             24
        vfs.zfs.deadman_enabled                 1
        vfs.zfs.deadman_checktime_ms            5000
        vfs.zfs.deadman_synctime_ms             1000000
        vfs.zfs.debug_flags                     0
        vfs.zfs.debugflags                      0
        vfs.zfs.recover                         0
        vfs.zfs.spa_load_verify_data            1
        vfs.zfs.spa_load_verify_metadata        1
        vfs.zfs.spa_load_verify_maxinflight     10000
        vfs.zfs.max_missing_tvds_scan           0
        vfs.zfs.max_missing_tvds_cachefile      2
        vfs.zfs.max_missing_tvds                0
        vfs.zfs.spa_load_print_vdev_tree        0
        vfs.zfs.ccw_retry_interval              300
        vfs.zfs.check_hostid                    1
        vfs.zfs.mg_fragmentation_threshold      85
        vfs.zfs.mg_noalloc_threshold            0
        vfs.zfs.condense_pct                    200
        vfs.zfs.metaslab_sm_blksz               4096
        vfs.zfs.metaslab.bias_enabled           1
        vfs.zfs.metaslab.lba_weighting_enabled  1
        vfs.zfs.metaslab.fragmentation_factor_enabled1
        vfs.zfs.metaslab.preload_enabled        1
        vfs.zfs.metaslab.preload_limit          3
        vfs.zfs.metaslab.unload_delay           8
        vfs.zfs.metaslab.load_pct               50
        vfs.zfs.metaslab.min_alloc_size         33554432
        vfs.zfs.metaslab.df_free_pct            4
        vfs.zfs.metaslab.df_alloc_threshold     131072
        vfs.zfs.metaslab.debug_unload           0
        vfs.zfs.metaslab.debug_load             0
        vfs.zfs.metaslab.fragmentation_threshold70
        vfs.zfs.metaslab.force_ganging          16777217
        vfs.zfs.free_bpobj_enabled              1
        vfs.zfs.free_max_blocks                 18446744073709551615
        vfs.zfs.zfs_scan_checkpoint_interval    7200
        vfs.zfs.zfs_scan_legacy                 0
        vfs.zfs.no_scrub_prefetch               0
        vfs.zfs.no_scrub_io                     0
        vfs.zfs.resilver_min_time_ms            3000
        vfs.zfs.free_min_time_ms                1000
        vfs.zfs.scan_min_time_ms                1000
        vfs.zfs.scan_idle                       50
        vfs.zfs.scrub_delay                     4
        vfs.zfs.resilver_delay                  2
        vfs.zfs.top_maxinflight                 32
        vfs.zfs.delay_scale                     500000
        vfs.zfs.delay_min_dirty_percent         60
        vfs.zfs.dirty_data_sync                 67108864
        vfs.zfs.dirty_data_max_percent          10
        vfs.zfs.dirty_data_max_max              4294967296
        vfs.zfs.dirty_data_max                  4294967296
        vfs.zfs.max_recordsize                  1048576
        vfs.zfs.default_ibs                     15
        vfs.zfs.default_bs                      9
        vfs.zfs.zfetch.array_rd_sz              1048576
        vfs.zfs.zfetch.max_idistance            67108864
        vfs.zfs.zfetch.max_distance             33554432
        vfs.zfs.zfetch.min_sec_reap             2
        vfs.zfs.zfetch.max_streams              8
        vfs.zfs.prefetch_disable                0
        vfs.zfs.send_holes_without_birth_time   1
        vfs.zfs.mdcomp_disable                  0
        vfs.zfs.per_txg_dirty_frees_percent     30
        vfs.zfs.nopwrite_enabled                1
        vfs.zfs.dedup.prefetch                  1
        vfs.zfs.dbuf_cache_lowater_pct          10
        vfs.zfs.dbuf_cache_hiwater_pct          10
        vfs.zfs.dbuf_cache_shift                5
        vfs.zfs.dbuf_cache_max_bytes            3105025280
        vfs.zfs.arc_min_prescient_prefetch_ms   6
        vfs.zfs.arc_min_prefetch_ms             1
        vfs.zfs.l2c_only_size                   0
        vfs.zfs.mfu_ghost_data_esize            64196164096
        vfs.zfs.mfu_ghost_metadata_esize        0
        vfs.zfs.mfu_ghost_size                  64196164096
        vfs.zfs.mfu_data_esize                  2954125312
        vfs.zfs.mfu_metadata_esize              3245507072
        vfs.zfs.mfu_size                        7339530240
        vfs.zfs.mru_ghost_data_esize            8302808576
        vfs.zfs.mru_ghost_metadata_esize        0
        vfs.zfs.mru_ghost_size                  8302808576
        vfs.zfs.mru_data_esize                  61266748416
        vfs.zfs.mru_metadata_esize              74845696
        vfs.zfs.mru_size                        64785721344
        vfs.zfs.anon_data_esize                 0
        vfs.zfs.anon_metadata_esize             0
        vfs.zfs.anon_size                       1143296
        vfs.zfs.l2arc_norw                      0
        vfs.zfs.l2arc_feed_again                1
        vfs.zfs.l2arc_noprefetch                0
        vfs.zfs.l2arc_feed_min_ms               200
        vfs.zfs.l2arc_feed_secs                 1
        vfs.zfs.l2arc_headroom                  2
        vfs.zfs.l2arc_write_boost               40000000
        vfs.zfs.l2arc_write_max                 10000000
        vfs.zfs.arc_meta_limit                  18884532704
        vfs.zfs.arc_free_target                 522389
        vfs.zfs.arc_kmem_cache_reap_retry_ms    1000
        vfs.zfs.compressed_arc_enabled          1
        vfs.zfs.arc_grow_retry                  60
        vfs.zfs.arc_shrink_shift                7
        vfs.zfs.arc_average_blocksize           8192
        vfs.zfs.arc_no_grow_shift               5
        vfs.zfs.arc_min                         12420101120
        vfs.zfs.arc_max                         75538130816
        vfs.zfs.abd_chunk_size                  4096
                                                                Page:  7
------------------------------------------------------------------------

root@cygnus[~]#

 
Last edited:

mpyusko

Dabbler
Joined
Jul 5, 2019
Messages
49
Could you explain the reasoning why the arguments above would only apply to a system with less RAM?
Because when you have less RAM, losing any of it to manage an L2ARC can be a serious blow. 1 GB from 160 GB is hardly noticeable. 1 GB from 8 or 16.... that's huge. I stated I am running an L2ARC on an older box with only 8GB RAM (hard limit). We weighed the options, and ultimately it came down to being able to compensate for the limitations of the integrated SATA-II controller. Although the array is built out of 4 WD Golds (RAID10), the amount of I/O traffic from the steady flow of email and database transactions made it crawl. The L2ARC helps alleviate that.
Looking at the data you provided, I would say you want to add more RAM. Your ARC hit ratio is 20% and your L2ARC's is only 40%.
You're looking at the wrong reports. Those are the reports for the 9.10 box. The L2ARC stats for the box I'm working on are in the post above (and below). Notice the performance changes once the ClamAV scan starts on the two VMs at 3 AM (they finish around 7 AM).
1589308751839.png


Moving discussion about this 9.10 box to this thread....
 
Last edited:

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
It serves NFS, iSCSI, and SMB Storage Repositories to Xen. Hypervisors that flex in and out go in the NFS Repo, the "permanent" VMs are in iSCSI, and SMB is the ISO Repo. Both NFS and iSCSI are restricted to the LACP-bonded NICs. The SMB is tied to another NIC, and the fourth NIC is for LAN/WAN traffic. The VMs only see the LAN/WAN network. Traffic is handled by a Cisco Catalyst 3560G 48-port switch.

Can you explain what you mean by "hypervisors that flex" - as in "generate lots of traffic"? Not sure what that means. But I can tell you this - unless you manually set sync=always on your iSCSI ZVOLs, your "permanent VMs" likely aren't doing sync writes, so they're bypassing your 660p entirely.
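If you did want those zvols to go through the SLOG, it would look something like this (the zvol name is a placeholder for whatever backs your iSCSI extents):

Code:
# Force sync writes on an iSCSI zvol so the SLOG actually sees its traffic
zfs set sync=always ZFSvol/iscsi-vm-zvol
zfs get sync ZFSvol/iscsi-vm-zvol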

The network config of mixing NFS and iSCSI on the same ports also doesn't line up nicely - iSCSI prefers independent interfaces with MPIO providing redundancy and load balancing, rather than NFS which needs LACP (at least until NFSv4.1 works politely)

Also, the "LAN/WAN" interface - I assume you mean your regular LAN traffic that has a route to the WAN, and not a "direct public IP on your FreeNAS machine"?

Without knowing exactly how much data actually gets written to the L2ARC and ZIL/SLOG until after it is running, it's hard to figure out how long the SSDs will last. Knowing this has been running for 8 months, and knowing the load on it, it is easier to estimate. The 660p has a very solid amount of SLC on the 1 TB model. I doubt, even at max load, the QLC even gets hit. SLC lasts longer than MLC, TLC, and QLC anyway.

The 660p is a QLC drive - there's no actual SLC NAND on it. The controller cheats by treating QLC as SLC - normally it stores 4 bits and it has to be very precise with the voltage, but it can write a single bit much more quickly as accuracy of voltage is less important. You're still burning through the limited P/E cycles of that QLC NAND.

arc_summary output..... (Seems pretty well uninhibited to me.)

Code:
L2 ARC Size: (Adaptive)                         315.53  GiB
        Compressed:                     71.12%  224.41  GiB
        Header Size:                    0.06%   199.28  MiB

From this, you're using about 200M of RAM to index an effective 315G of space. I'd call that a very fair tradeoff. Can you show the arc_summary.pl results from your secondary/8GB-RAM machine?
 

mpyusko

Dabbler
Joined
Jul 5, 2019
Messages
49
Can you explain what you mean by "hypervisors that flex" - as in "generate lots of traffic"? Not sure what that means. But I can tell you this - unless you manually set sync=always on your iSCSI ZVOLs, your "permanent VMs" likely aren't doing sync writes, so they're bypassing your 660p entirely.

I misspoke.... I meant flexing VMs in and out.

The network config of mixing NFS and iSCSI on the same ports also doesn't line up nicely - iSCSI prefers independent interfaces with MPIO providing redundancy and load balancing, rather than NFS which needs LACP (at least until NFSv4.1 works politely)

Also, the "LAN/WAN" interface - I assume you mean your regular LAN traffic that has a route to the WAN, and not a "direct public IP on your FreeNAS machine"?

Yes, everything is NAT'd and firewalled. The management for FreeNAS is on the LAN/WAN interface. FreeNAS is not directly exposed with a public IP.

The 660p is a QLC drive - there's no actual SLC NAND on it. The controller cheats by treating QLC as SLC - normally it stores 4 bits and it has to be very precise with the voltage, but it can write a single bit much more quickly as accuracy of voltage is less important. You're still burning through the limited P/E cycles of that QLC NAND.

"The Intel SSD 660p employs a variable-size SLC cache, and all data written goes first to the SLC cache before being compacted and folded into QLC blocks. This means that the steady-state 100MB/s sequential write speed we've measured is significantly below what the drive could deliver if the writes went directly to the QLC without the extra SLC to QLC copying step getting in the way. When the drive is mostly empty, up to about half of the available flash memory cells will be treated as SLC NAND. As the drive fills up, blocks will be converted to QLC usage, shrinking the size of the cache and making it more likely that a real-world use case could write enough to fill that cache."

Again, the large size means the drive stays mostly empty for optimal performance and endurance.

Code:
L2 ARC Size: (Adaptive)                         315.53  GiB
        Compressed:                     71.12%  224.41  GiB
        Header Size:                    0.06%   199.28  MiB

From this, you're using about 200M of RAM to index an effective 315G of space. I'd call that a very fair tradeoff. Can you show the arc_summary.pl results from your secondary/8GB-RAM machine?

I've found including the 8GB-RAM machine stats in this thread were becoming confusing, so I'm moving that discussion to another thread....
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I misspoke.... I meant flexing VMs in and out.

In this case it's very likely that the machines on the NFS export are using sync but the ones on iSCSI aren't. If you do a bunch of disk activity on an iSCSI machine, does your nvd0 SLOG device see a bunch of activity? Since zilstat still seems to be spotty, you can watch in real-time from SSH with gstat -pf nvd0 and you should see large values in both w/s and kBps.
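Roughly, from an SSH session (nvd0 being the NVMe SLOG device node on this box):

Code:
# -p: only physical providers, -f: filter by device name; refreshes in place
gstat -pf nvd0
# Watch the w/s column and the write kBps column climb while the VM is busy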

When the drive is mostly empty, up to about half of the available flash memory cells will be treated as SLC NAND.

It's still not SLC NAND. It's being treated as such to mimic it, but it's not actually SLC. You're not putting a lot of writes against it (power-on hours shows about 247 days, with 28TB written, lining up to about 0.1 DWPD which is in line with QLC endurance) but this might also have to do with your iSCSI VMs potentially not using it at all.
 

mpyusko

Dabbler
Joined
Jul 5, 2019
Messages
49
In this case it's very likely that the machines on the NFS export are using sync but the ones on iSCSI aren't. If you do a bunch of disk activity on an iSCSI machine, does your nvd0 SLOG device see a bunch of activity? Since zilstat still seems to be spotty, you can watch in real-time from SSH with gstat -pf nvd0 and you should see large values in both w/s and kBps.
When running this test on an iSCSI VM, the w/s peaked in the 30s and the kBps got up to ~1,100; %busy peaked at 0.9.
1589315087774.png


When running this test on an NFS VM, the w/s peaked in the 3000s and the kBps got up to ~100,000+; %busy peaked at ~100.
1589315991955.png


Both VMs are win10 1909 with 4 cores (X5660), 8GB RAM and 200 GB Drive C:\ running on the same hypervisor.

It's still not SLC NAND. It's being treated as such to mimic it, but it's not actually SLC. You're not putting a lot of writes against it (power-on hours shows about 247 days, with 28TB written, lining up to about 0.1 DWPD which is in line with QLC endurance) but this might also have to do with your iSCSI VMs potentially not using it at all.
Or minimally..... The iSCSI VMs are virtual workstations so they spend a fair amount of time idle. The NFS VMs are either temporary windows based, or permanent Linux based. There are two OwnCloud servers that keep constant sync with numerous client workstations, and generally have light traffic (especially during the pandemic). Otherwise, each night it pulls in website and database server backup archives for redundant/off-site storage. This happens nightly and then they are scanned. From insight gained by your investigation and explanation, it seems the ZIL/SLOG has more utilization from the NFS traffic than the iSCSI traffic.

You can see here the IOPS for the ZIL (I ran the tests twice on each VM.)
1589316746249.png
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
When running this test on an iSCSI VM, the w/s peaked in the 30s and the kBps got up to ~1,100; %busy peaked at 0.9.

When running this test on an NFS VM, the w/s peaked in the 3000s and the kBps got up to ~100,000+; %busy peaked at ~100.

Both VMs are win10 1909 with 4 cores (X5660), 8GB RAM and 200 GB Drive C:\ running on the same hypervisor.

This means that your iSCSI VMs aren't actually issuing sync writes. Look at the difference in the small-block write speeds (4K QD32 T16) - 16.7MB/s over NFS, 108.75MB/s over iSCSI.
 