SOLVED Storage array redesign - metadata vdev on single Optane

dxun

Explorer
Joined
Jan 24, 2016
Messages
52
I am in the process of designing a storage array for a 5-year mission in my home(lab). Due to hardware limitations of my current platform (and the current/perceived lack of need for any changes), I am about to make a few compromises and would appreciate discussion, characterisation, opinions and input on the alternatives ahead of me. I have an existing ~12 TB pool which I plan to replicate to the backup instance of TrueNAS 12, then reconfigure the new pool and replicate the contents back to the new instance of TrueNAS 12.
Ideally, I'd like to prevent any costly issues I might be overlooking right now, rather than having to suffer through them later.

As you can see from my configuration (in my signature), this is a venerable platform whose limits I am reaching - but given the mission time frame, I believe it should serve me well.

The storage array will be backed either by the existing 4x10 TB Exos drives in RAIDZ1 or (if I happen to stumble upon a good Black Friday deal) by a 5x10 TB array (the hardware maximum, as it is a 5-bay drive enclosure).
The expected traffic will be an eclectic mix of:
- several (but < 10) Kubernetes cluster nodes (each node represented by a Proxmox VM)
- syslog/graylog traffic
- security camera surveillance feeds (BlueIris)
- some occasional light Plex duty (but no transcoding)

I _believe_ this storage array ought to be able to handle the traffic, but using the existing Optane 900p drive efficiently is what I am trying to optimise here.
Aside from myself (and the VMs) there won't be additional concurrent users.

My intent is to partition the drive into two (or perhaps even three) partitions and have it wear multiple caching hats:
- 32 GB SLOG partition
- (optional) 64 GB L2ARC partition
- (rest - between 200 and 260 GB) metadata vdev partition

From what I have read so far, this would be a tall order for most SSDs but Optane should be able to handle this (?).
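
To make that concrete, here is roughly what I picture doing on the command line - a sketch only, with placeholder partition sizes/labels and the Optane assumed to show up as nvd0 on TrueNAS 12 (FreeBSD). My understanding is that adding a non-redundant special vdev to a RAIDZ pool needs -f because of the mismatched redundancy:

Code:
# Sketch only - placeholder sizes/labels; Optane assumed to be nvd0
gpart create -s gpt nvd0
gpart add -t freebsd-zfs -s 32G -l optane-slog  nvd0   # SLOG partition
gpart add -t freebsd-zfs -s 64G -l optane-l2arc nvd0   # optional L2ARC partition
gpart add -t freebsd-zfs        -l optane-meta  nvd0   # remainder: metadata (special) vdev

zpool add Primary log   gpt/optane-slog
zpool add Primary cache gpt/optane-l2arc
zpool add -f Primary special gpt/optane-meta           # -f: single special vdev vs RAIDZ1 data vdev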

As I mentioned before, the hardware is already pretty much maxed out and has no room for much growth - e.g. I am using 12 out of a maximum of 16 PCIe lanes, and the RAM is maxed out at 32 GB.

My concerns/questions mostly revolve around that Optane - I know that ideally this should be a mirrored drive but I am not yet prepared to make that jump.

Here is what I see as my options:
1) partition the Optane as above, without any mirroring - introducing a single point of pool failure (if the Optane dies, the pool dies)
2) partition Optane with SLOG and L2ARC partitions only and use a pair of Samsung 870 EVO SATA SSDs for metadata vdev
3) partition Optane as above but with mirroring - the ideal solution

As of right now, I am mostly leaning towards 1) - I am prepared to take that risk as I'll have a backup of the whole pool (or at least of the critical data). I am not fully sure how inconsistent the I/O would be (given that the Optane shares the same NVMe namespace across three partitions with mixed reads/writes) - but I am willing to test this out.
I am not strongly in favour of 2) - I (intuitively) speculate that, in my case, the difference in likelihood between a single Optane failing and a total mirror failure of both Samsung 870 EVOs is not significant (am I wrong here?).
As for 3), the financial cost of purchasing a Squid PCIe carrier board and a second identical Optane 900p would exceed 600 USD, and I am not even sure that card would play nicely with the Supermicro board from 2014 (especially as I'd be using the full maximum of all available lanes, and there might be... unforeseen problems with that).
So that is my last resort.

As for the sizing of the metadata vdev: from what I have read and understood, that is still a bit of an art - nevertheless, I think a 200-260 GB partition should be able to host the metadata for ~30 TB of data (especially after I reconfigure the new pool with dataset record sizes appropriate to the file sizes).

I'd appreciate any thoughts and critiques of my thinking here.
To be totally clear - I am not so much concerned about speed as about the reliability reasoning. My understanding is that Optane should be able to handle it. Am I wildly over-estimating its reliability?

As a last item, I am posting the `zdb` output for my existing pool - I have tried to use it to estimate the future metadata vdev size, but even after reading Wendell's articles/posts, I am still unclear what my existing metadata size actually is - I am not sure what I am looking at or for. Help/guidance here would also be appreciated.

Code:
root@truenas[~]# zdb -U /data/zfs/zpool.cache -Lbbbs Primary

Traversing all blocks ...

11.7T completed (19535MB/s) estimated time remaining: 0hr 00min 00sec
        bp count:              68609607
        ganged count:                 0
        bp logical:       8716583483904      avg: 127046
        bp physical:      8554463491584      avg: 124683     compression:   1.02
        bp allocated:    12850438782976      avg: 187297     compression:   0.68
        bp deduped:                   0    ref>1:      0   deduplication:   1.00
        Normal class:    12850438782976     used: 42.89%

        additional, non-pointer bps of type 0:    1260670
         number of (compressed) bytes:  number of bps
                         14:    270 *
                         15:    174 *
                         16:     28 *
                         17:    451 *
                         18:    182 *
                         19:     98 *
                         20:     39 *
                         21:    179 *
                         22:    129 *
                         23:    207 *
                         24:     67 *
                         25:     52 *
                         26:    137 *
                         27:     83 *
                         28:   2919 *
                         29:   4984 *
                         30:     90 *
                         31:    186 *
                         32:    278 *
                         33:     94 *
                         34:    191 *
                         35:    150 *
                         36:   9095 **
                         37:   1758 *
                         38: 326526 ****************************************
                         39:  59143 ********
                         40:    188 *
                         41:    143 *
                         42:     98 *
                         43:    208 *
                         44:     52 *
                         45:   1229 *
                         46:    101 *
                         47:    316 *
                         48:    504 *
                         49:   1543 *
                         50: 148100 *******************
                         51:   3077 *
                         52:   3742 *
                         53:  20066 ***
                         54: 119891 ***************
                         55:  69274 *********
                         56: 128335 ****************
                         57: 307760 **************************************
                         58:    645 *
                         59:    852 *
                         60:    816 *
                         61:    891 *
                         62:   1180 *
                         63:    963 *
                         64:    876 *
                         65:    956 *
                         66:   1042 *
                         67:   1133 *
                         68:    895 *
                         69:    568 *
                         70:    767 *
                         71:    609 *
                         72:    645 *
                         73:    632 *
                         74:   1055 *
                         75:    581 *
                         76:   1052 *
                         77:   1680 *
                         78:    643 *
                         79:    513 *
                         80:    645 *
                         81:    948 *
                         82:    476 *
                         83:    604 *
                         84:   1328 *
                         85:    524 *
                         86:   1075 *
                         87:   2109 *
                         88:    629 *
                         89:    540 *
                         90:    499 *
                         91:    565 *
                         92:    492 *
                         93:    468 *
                         94:   1113 *
                         95:    537 *
                         96:    550 *
                         97:    900 *
                         98:    520 *
                         99:   3047 *
                        100:   6092 *
                        101:    499 *
                        102:    592 *
                        103:    660 *
                        104:    541 *
                        105:    350 *
                        106:    505 *
                        107:    428 *
                        108:    343 *
                        109:    410 *
                        110:    452 *
                        111:    706 *
                        112:    392 *
        Dittoed blocks on same vdev: 466811

Blocks  LSIZE   PSIZE   ASIZE     avg    comp   %Total  Type
     -      -       -       -       -       -        -  unallocated
     2    32K      8K     48K     24K    4.00     0.00  object directory
     2     1K      1K     48K     24K    1.00     0.00  object array
     1    16K      4K     24K     24K    4.00     0.00  packed nvlist
     -      -       -       -       -       -        -  packed nvlist size
     2    64K     24K    120K     60K    2.67     0.00      L1 bpobj
   422  52.6M   3.28M   19.7M   47.7K   16.06     0.00      L0 bpobj
   424  52.7M   3.30M   19.8M   47.8K   15.97     0.00  bpobj
     -      -       -       -       -       -        -  bpobj header
     -      -       -       -       -       -        -  SPA space map header
   108  2.80M    452K   2.65M   25.1K    6.34     0.00      L1 SPA space map
 2.22K  9.47M   9.21M   55.1M   24.8K    1.03     0.00      L0 SPA space map
 2.33K  12.3M   9.65M   57.7M   24.8K    1.27     0.00  SPA space map
     1    12K     12K     24K     24K    1.00     0.00  ZIL intent log
    28  3.50M    112K    448K     16K   32.00     0.00      L5 DMU dnode
    28  3.50M    112K    448K     16K   32.00     0.00      L4 DMU dnode
    28  3.50M    112K    448K     16K   32.00     0.00      L3 DMU dnode
    28  3.50M    112K    448K     16K   32.00     0.00      L2 DMU dnode
    97  12.1M   3.19M   9.82M    104K    3.80     0.00      L1 DMU dnode
 59.8K   957M    239M    957M   16.0K    4.00     0.01      L0 DMU dnode
 60.0K   983M    243M    969M   16.1K    4.05     0.01  DMU dnode
    29    58K     58K    472K   16.3K    1.00     0.00  DMU objset
     -      -       -       -       -       -        -  DSL directory
    25  13.5K      2K     48K   1.92K    6.75     0.00  DSL directory child map
     -      -       -       -       -       -        -  DSL dataset snap map
    26    44K      8K     48K   1.85K    5.50     0.00  DSL props
     -      -       -       -       -       -        -  DSL dataset
     -      -       -       -       -       -        -  ZFS znode
     -      -       -       -       -       -        -  ZFS V0 ACL
   141  4.41M    564K   2.20M     16K    8.00     0.00      L3 ZFS plain file
 8.20K   263M   37.8M    151M   18.4K    6.94     0.00      L2 ZFS plain file
  348K  10.9G   3.27G   13.1G   38.4K    3.33     0.11      L1 ZFS plain file
 64.3M  7.92T   7.78T   11.7T    186K    1.02    99.88      L0 ZFS plain file
 64.7M  7.93T   7.78T   11.7T    185K    1.02    99.99  ZFS plain file
 4.05K   130M   16.2M   64.8M   16.0K    8.00     0.00      L1 ZFS directory
  709K   511M   69.8M    558M     804    7.32     0.00      L0 ZFS directory
  713K   641M   86.1M    622M     893    7.45     0.01  ZFS directory
    22    22K     22K    352K     16K    1.00     0.00  ZFS master node
     -      -       -       -       -       -        -  ZFS delete queue
     -      -       -       -       -       -        -  zvol object
     -      -       -       -       -       -        -  zvol prop
     -      -       -       -       -       -        -  other uint8[]
     -      -       -       -       -       -        -  other uint64[]
     -      -       -       -       -       -        -  other ZAP
     -      -       -       -       -       -        -  persistent error log
     2   256K     20K    120K     60K   12.80     0.00  SPA history
     -      -       -       -       -       -        -  SPA history offsets
     -      -       -       -       -       -        -  Pool properties
     -      -       -       -       -       -        -  DSL permissions
     -      -       -       -       -       -        -  ZFS ACL
     -      -       -       -       -       -        -  ZFS SYSACL
     -      -       -       -       -       -        -  FUID table
     -      -       -       -       -       -        -  FUID table size
     1  1.50K   1.50K     24K     24K    1.00     0.00  DSL dataset next clones
     -      -       -       -       -       -        -  scan work queue
     -      -       -       -       -       -        -  ZFS user/group/project used
     -      -       -       -       -       -        -  ZFS user/group/project quota
     -      -       -       -       -       -        -  snapshot refcount tags
     -      -       -       -       -       -        -  DDT ZAP algorithm
     -      -       -       -       -       -        -  DDT statistics
     -      -       -       -       -       -        -  System attributes
     -      -       -       -       -       -        -  SA master node
    23  34.5K   34.5K    368K     16K    1.00     0.00  SA attr registration
    44   704K    176K    704K     16K    4.00     0.00  SA attr layouts
     -      -       -       -       -       -        -  scan translations
     -      -       -       -       -       -        -  deduplicated block
     -      -       -       -       -       -        -  DSL deadlist map
     -      -       -       -       -       -        -  DSL deadlist map hdr
     1  1.50K   1.50K     24K     24K    1.00     0.00  DSL dir clones
     -      -       -       -       -       -        -  bpobj subobj
    12   304K     48K    288K     24K    6.33     0.00      L1 deferred free
    19   402K     86K    528K   27.8K    4.67     0.00      L0 deferred free
    31   706K    134K    816K   26.3K    5.27     0.00  deferred free
     -      -       -       -       -       -        -  dedup ditto
    10  37.5K     15K    144K   14.4K    2.50     0.00  other
    28  3.50M    112K    448K     16K   32.00     0.00      L5 Total
    28  3.50M    112K    448K     16K   32.00     0.00      L4 Total
   169  7.91M    676K   2.64M     16K   11.98     0.00      L3 Total
 8.23K   266M   37.9M    151M   18.4K    7.01     0.00      L2 Total
  353K  11.0G   3.29G   13.1G   38.2K    3.35     0.11      L1 Total
 65.1M  7.92T   7.78T   11.7T    184K    1.02    99.89      L0 Total
 65.4M  7.93T   7.78T   11.7T    183K    1.02   100.00  Total

Block Size Histogram

  block   psize                lsize                asize
   size   Count   Size   Cum.  Count   Size   Cum.  Count   Size   Cum.
    512:   159K  79.4M  79.4M   159K  79.4M  79.4M      0      0      0
     1K:  84.6K   103M   182M  84.6K   103M   182M      0      0      0
     2K:  72.3K   189M   372M  72.3K   189M   372M      0      0      0
     4K:   359K  1.45G  1.81G  65.6K   362M   734M      0      0      0
     8K:   508K  5.34G  7.15G  68.8K   740M  1.44G   453K  3.54G  3.54G
    16K:   587K  12.6G  19.8G   112K  2.02G  3.46G   577K  10.8G  14.3G
    32K:   379K  16.7G  36.5G   392K  12.6G  16.1G   841K  35.1G  49.4G
    64K:   647K  58.6G  95.1G  24.3K  2.17G  18.3G   457K  40.5G  89.8G
   128K:  61.5M  7.69T  7.78T  63.3M  7.91T  7.93T  62.0M  11.6T  11.7T
   256K:      0      0  7.78T      0      0  7.93T      0      0  11.7T
   512K:      0      0  7.78T      0      0  7.93T      0      0  11.7T
     1M:      0      0  7.78T      0      0  7.93T      0      0  11.7T
     2M:      0      0  7.78T      0      0  7.93T      0      0  11.7T
     4M:      0      0  7.78T      0      0  7.93T      0      0  11.7T
     8M:      0      0  7.78T      0      0  7.93T      0      0  11.7T
    16M:      0      0  7.78T      0      0  7.93T      0      0  11.7T

                            capacity   operations   bandwidth  ---- errors ----
description                used avail  read write  read write  read write cksum
Primary                   11.7T 15.6T   924     0 6.14M     0     0     0     0
  raidz1                  11.7T 15.6T   924     0 6.14M     0     0     0     0
    /dev/gptid/73288854-d4ab-11e9-b86b-0cc47a0b7772.eli                     307     0 2.04M     0     0     0     0
    /dev/gptid/74117816-d4ab-11e9-b86b-0cc47a0b7772.eli                     309     0 2.05M     0     0     0     0
    /dev/gptid/74f89b34-d4ab-11e9-b86b-0cc47a0b7772.eli                     307     0 2.05M     0     0     0     0
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
As a general comment, it's recommended that metadata VDEVs match your pool redundancy level, or at the very least be mirrored, since the metadata VDEV is integral to the pool and losing it means losing the entire pool. Your option 2 accounts "correctly" for that.

You're already talking about RAIDZ1 to run very large drives... not recommended.

You are also proposing (if I understand it well) to run multiple VMs on that pool (RAIDZ1)... also not recommended. https://www.truenas.com/community/threads/the-path-to-success-for-block-storage.81165/

And, finally, it's not recommended to mix SLOG and L2ARC on the same physical device (although I see plenty of people doing what I'm sure is just reducing the life of their SLOG device by putting L2ARC there for probably no benefit at all... make sure you've maxed out your RAM and are sure you're missing a lot of ARC hits before doing that).
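
If you want a ballpark for your current metadata footprint from that zdb output, everything in the block-type table other than the "L0 ZFS plain file" row is what a special vdev would end up holding, so something like this (pool name taken from your own command) pulls out the relevant rows:

Code:
# Rows other than "L0 ZFS plain file" are metadata (indirect blocks, dnodes,
# directories, space maps...); their ASIZE column is the on-disk footprint.
zdb -U /data/zfs/zpool.cache -Lbbbs Primary | \
  grep -E 'DMU dnode|ZFS directory|ZFS plain file|SPA space map|Total'

Eyeballing the numbers you posted, that adds up to somewhere around 15 GB of metadata ASIZE against ~11.7 TB allocated, so a 200-260 GB partition has enormous headroom - unless you later start redirecting small blocks there with special_small_blocks.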
 

dxun

Explorer
Joined
Jan 24, 2016
Messages
52
As a general comment, it's recommended that metadata VDEVs match your pool redundancy level, or at the very least be mirrored, since the metadata VDEV is integral to the pool and losing it means losing the entire pool. Your option 2 accounts "correctly" for that.

You're correct on all counts, of course, and I fully agree with you. I am absolutely aware that I am departing from the standard recommendations. Sometimes, however, rules can be bent/broken, and I am looking to fully understand the consequences of what happens if I do. In particular, I have gotten the impression that with Optane and a good backup strategy, one can get away with certain things - if one's livelihood doesn't depend on it.

There is much I don't know about ZFS, but I find it a fascinating technology - all of this is a learning experience. With this thread I am mostly hoping to see what understanding/knowledge I am lacking thus far that is causing me to underestimate the risks involved in my plan.


You're already talking about RAIDZ1 to run very large drives... not recommended. You are also proposing (if I understand it well) to run multiple VMs on that pool (RAIDZ1)... also not recommended. https://www.truenas.com/community/threads/the-path-to-success-for-block-storage.81165/

Thanks for that article - I've read it multiple times, along with this one and this whitepaper. Based on all the info available and given the hardware at my disposal, here's the storage design I've landed on:

[Attached image: 1636261073705.png - proposed storage design diagram]



Based on what I've read, I've separated the bulk and iSCSI storage onto two separate machines. I've also decided to invest in another 10 TB drive and bring the new storage pool up to RAID-Z2, as suggested. Given the size of the pool, I was thinking of defining a 500 GB separate metadata vdev using garden-variety consumer drives, but set up in a 3-way mirror. The disks in that vdev are older but lightly used - given the reserves in both data written (Samsung drives) and the "overprovisioning" of that SanDisk SSD, they should last a long time.

That being said... I am still tempted to put that metadata vdev just on the DC4510. It's a beast of its own - both in terms of endurance rating and speed. I would not be surprised if it crushed that 3-way mirror on both counts (that amazing 840 PRO and its MLC might outlive the DC4510, though). I just can't get a clear picture in my head of what is more likely to fail - a trio of stout and venerable consumer SSDs or a single enterprise-level Intel SSD.
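
For reference, what I'm picturing for that pool is something along these lines (a sketch only - pool and device names are placeholders):

Code:
# Placeholder names: 5x10 TB in RAID-Z2 plus a 3-way mirrored special
# (metadata) vdev on the three consumer SSDs
zpool create tank \
  raidz2  da0 da1 da2 da3 da4 \
  special mirror ada0 ada1 ada2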

And, finally, it's not recommended to mix SLOG and L2ARC on the same physical device (although I see plenty of people doing what I'm sure is just reducing the life of their SLOG device by putting L2ARC there for probably no benefit at all... make sure you maxed out your RAM and sure you're missing a lot of ARC hits before doing that).

Sounds good - at present, I am not concerned about the lifetime of the Optane. Intel rates its endurance at a little over 5.1 PBW - perhaps I am wrong, but I just can't see that device approaching that limit over the lifetime of this SAN server. RAM is already maxed out on this one, and I'll at least reserve some space on the Optane so that I have the flexibility to configure an L2ARC and turn it on later if ARC misses become a problem. Also, the SLOG size is probably (too) generously defined, but short of defining a super-fast network scratch pool, I am not sure what else I could use the remaining Optane capacity for.
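
(For what it's worth, here is the back-of-the-envelope reasoning behind "generously defined" - assuming the SLOG only ever holds a few transaction groups' worth of incoming sync writes, and hypothetically a 10 GbE front end:)

Code:
# Rule of thumb (my assumption): the SLOG holds only a few seconds of sync writes
echo "10 GbE ~= $((10000 / 8)) MB/s; ~10 seconds of that ~= $((10000 / 8 * 10 / 1000)) GB"

So even 32 GB is more than that worst case would ever need.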

Questions:
- is there an ARC miss percentage at which an L2ARC usually becomes worth considering?
- can I use the remaining Optane/DC4510 drives more efficiently? What do people usually do with "spare space" on these units?

Appreciate any thoughts or answers!
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
- is there an ARC miss percentage at which an L2ARC usually becomes worth considering?
If it's over 10% for any good amount of time you can start thinking about how to deal with it (and if your RAM is already at maximum, then L2ARC is probably the answer).
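
If you want to keep an eye on it yourself, the ARC counters are exposed via sysctl on TrueNAS 12, so something like this gives a rough cumulative hit rate (tools like arc_summary present the same data more nicely):

Code:
# Rough cumulative ARC hit rate from the kstat counters (FreeBSD/TrueNAS CORE)
hits=$(sysctl -n kstat.zfs.misc.arcstats.hits)
misses=$(sysctl -n kstat.zfs.misc.arcstats.misses)
echo "ARC hit rate: $(( 100 * hits / (hits + misses) ))%"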

- can I use the remaining Optane/DC4510 drives more efficiently? What do people usually do with "spare space" on these units?
I'm hearing that this is just a lab with no data you really can't afford to lose, so you could use a partition there to do a pool for either jails/VMs or whatever else makes sense to have fast storage.
 

dxun

Explorer
Joined
Jan 24, 2016
Messages
52
If it's over 10% for any good amount of time you can start thinking about how to deal with it (and if your RAM is already at maximum, then L2ARC is probably the answer).


I'm hearing that this is just a lab with no data you really can't afford to lose, so you could use a partition there to do a pool for either jails/VMs or whatever else makes sense to have fast storage.

Yes, this is a lab, and no livelihood is attached to it. There is some data I would really like to retain, but I'll have it stashed on that cold storage pool, updated periodically and locked away.

Any comments on the storage design? Any changes that you would advise considering?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Generally speaking, if your data isn't that important and the VMs are to play/experiment with, then you're probably much better off setting sync=disabled and putting them in a pool on the Optane directly.
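
Something like this is all it would take (names are placeholders, and I'm assuming the whole Optane rather than your partition scheme):

Code:
# Placeholder names; assumes the Optane shows up as nvd0
zpool create fast nvd0
zfs create -o sync=disabled fast/vms   # writes never wait on stable storage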

You are going to get pretty ordinary performance out of spinning disks in RAIDZ1, which will soon throttle your throughput when the SLOG can't offload quickly enough to the pool backing it.

Maybe consider a couple of mirrors instead (even though you're going to complain about capacity) just for the IOPS for the VMs.
 

dxun

Explorer
Joined
Jan 24, 2016
Messages
52
Well, the VMs are not exactly discardable, but they're not mission critical either - I'll do what I can to preserve them with the hardware that I have, but I won't spend an extra thousand (or thousands of) dollars to harden them up to "datacentre-safe" levels.

If you look at the diagrams above, I've split the load between a filer/streaming machine (bulk storage for throughput), which I'll be upgrading to RAID-Z2 (so 5x10 TB disks), and a pure SAN machine for the VMs, backed by two 2-way mirrors (which favours IOPS).

A few things I don't fully understand:
- I intend to deploy the VMs on iSCSI shares; is there any benefit to also making the pool that hosts these VMs a fusion pool, so that its metadata vdev is hosted on a fast mirrored array of SSDs?
- I thought the purpose of a SLOG is to speed up slow (block) storage and get the ACK back to the client as soon as possible (but only as safely as the device(s) hosting the SLOG allow)? Why do we need to worry so much about the IOPS of the actual platter SAN when it's the SLOG that absorbs most of the hammering? What am I missing?
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
@dxun, would it be an option to have a local SSD on the VM host?
 

dxun

Explorer
Joined
Jan 24, 2016
Messages
52
@dxun, would it be an option to have a local SSD on the VM host?

Yes (in fact, that is what I am running now), but I'd rather not - for reasons of practicality and flexibility, mostly.
I am looking to consolidate all the hardware inside a SAN box - especially given the hardware already at my disposal.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
- I intend to deploy the VMs on iSCSI shares; is there any benefit to also making the pool that hosts these VMs a fusion pool, so that its metadata vdev is hosted on a fast mirrored array of SSDs?
I have seen indications from other forum members that it should provide some advantage when dealing with metadata for the blocks on the ZVOLs, so it's probably worth consideration.

- I thought the purpose of a SLOG is to speed up slow (block) storage and get the ACK back to the client as soon as possible (but only as safely as the device(s) hosting the SLOG allow)? Why do we need to worry so much about the IOPS of the actual platter SAN when it's the SLOG that absorbs most of the hammering? What am I missing?
As @jgreco says loudly and often, the fastest storage you can get is sync=disabled. As soon as you use sync (standard or always), you will have slower writes, which is where a SLOG can help to reduce the performance hit - but it does not in any way make the writes faster than they would be with sync=disabled. A SLOG is not a write cache (RAM... and only RAM... performs that role with ZFS), and if your pool (the HDDs) can't absorb the transactions being thrown at the SLOG in time (just a few seconds), the pool will go into a wait state (where your SLOG isn't helping either until things catch up).
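
(The "few seconds" comes from the transaction group machinery - on CORE you can see the relevant knobs with sysctl, not that I'm suggesting you tune them:)

Code:
# Seconds ZFS will batch writes before forcing a txg out to the pool,
# and how much dirty data it will hold in RAM before throttling writers
sysctl vfs.zfs.txg.timeout vfs.zfs.dirty_data_max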

As I suggested, if you want blistering-fast writes and loads of IOPS, use the Optane as a pool directly. You could then (since the contents aren't ultra-critical) take occasional snapshots and replicate them to a pool of HDDs for some level of security if the Optane were to go toes-up.
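
A minimal sketch of that, with placeholder pool/dataset names ("fast" for the Optane pool, a backup dataset on your HDD pool):

Code:
# Placeholder names - "fast" = Optane pool, "Primary/backup/vms" = target on the HDD pool
zfs snapshot -r fast/vms@nightly
zfs send -R fast/vms@nightly | zfs recv -F Primary/backup/vms
# later runs would send incrementally: zfs send -R -i @previous fast/vms@nightly | zfs recv -F Primary/backup/vms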

If you want to understand it better, I like this article: https://arstechnica.com/information...01-understanding-zfs-storage-and-performance/

But there's also plenty of good material that goes very deep if you want it - just google ZFS and Matt Ahrens (lots of videos there).
 