dxun
Explorer · Joined: Jan 24, 2016 · Messages: 52
I am in the process of designing a storage array on a 5-yr mission for my home(lab) and, due to hardware limitations of my current platform (and the current/perceived lack of need for any changes), I am about to make a few compromises and would appreciate a discussion/characterisation/opinions/inputs about the alternatives ahead of me. I have an existing ~12 TB pool which I plan on replicating to the backup instance of TrueNAS 12, then reconfiguring the new pool and replicating the pool contents back to the new TrueNAS 12 instance.
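For the shuffle itself, I am picturing something along these lines - a minimal sketch only, where "backup-host" and the "Backup" pool are placeholder names, and in practice I would probably drive this through the TrueNAS replication tasks rather than by hand:
Code:
# recursive snapshot of everything on the pool...
zfs snapshot -r Primary@pre-rebuild
# ...and push the whole tree to the backup box as one replication stream
zfs send -R Primary@pre-rebuild | ssh backup-host zfs recv -u Backup/Primary
# once the new Primary pool is built, run the same send/recv in the
# opposite direction (or restore via a TrueNAS replication task)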
Ideally, I'd like to prevent any costly issues I might be overlooking right now, rather than having to suffer through them later.
As you can see from my configuration (in signature), this is a venerable platform which I am reaching the limits of - but given the mission time frame, I believe it should serve me well.
The storage array will be backed by either the existing 4x10 TB Exos drives in RAIDZ1 or (if I happen to stumble upon a good Black Friday deal) a 5x10 TB array (which is the hardware maximum, as it is a 5-bay enclosure).
The expected traffic will be an eclectic mix of:
- several (but < 10) Kubernetes cluster nodes (each node represented by a Proxmox VM)
- syslog/graylog traffic
- security camera surveillance feeds (BlueIris)
- some occasional light Plex duty (but no transcoding)
I _believe_ this storage array ought to be able to handle the traffic, but what I am trying to optimise here is how to use the existing Optane 900p drive efficiently.
Aside from myself (and the VMs) there won't be additional concurrent users.
My intent is to partition the drive into two (or perhaps even three) partitions and have it wear multiple caching hats (a rough command sketch follows the list):
- 32 GB SLOG partition
- (optional) 64 GB L2ARC partition
- (rest - between 200 and 260 GB) metadata vdev partition
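Mechanically, this is roughly what I have in mind - a sketch only, done from the shell since (as far as I can tell) the TrueNAS UI won't build a split-device layout; the nvd0 device name and the GPT labels are assumptions, and the sizes mirror the list above:
Code:
# carve up the Optane (FreeBSD device name assumed to be nvd0)
gpart create -s gpt nvd0
gpart add -t freebsd-zfs -a 1m -s 32g -l slog  nvd0   # 32 GB SLOG
gpart add -t freebsd-zfs -a 1m -s 64g -l l2arc nvd0   # optional 64 GB L2ARC
gpart add -t freebsd-zfs -a 1m        -l meta  nvd0   # rest for metadata

# attach the partitions to the pool (this is option 1 - no redundancy!)
zpool add Primary log     gpt/slog
zpool add Primary cache   gpt/l2arc
zpool add Primary special gpt/meta   # zpool warns (needs -f) since it is unmirrored
# (options 2/3 would instead be: zpool add Primary special mirror <devA> <devB>)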
From what I have read so far, this would be a tall order for most SSDs but Optane should be able to handle this (?).
As I mentioned before, the hardware is already pretty much maxed out and has no room for much growth - e.g. I am using 12 out of a maximum of 16 PCIe lanes, and the RAM is maxed out at 32 GB.
My concerns/questions mostly revolve around that Optane - I know that ideally this should be a mirrored drive but I am not yet prepared to make that jump.
Here is what I see as my options:
1) partition the Optane as above but without any mirroring - introducing a single point of failure for the pool (if the Optane dies, the pool dies)
2) partition Optane with SLOG and L2ARC partitions only and use a pair of Samsung 870 EVO SATA SSDs for metadata vdev
3) partition Optane as above but with mirroring - the ideal solution
As of right now, I am mostly leaning towards 1) - I am prepared to take that risk as I'll have a backup of the whole pool (or at least the critical data). I am not fully sure how inconsistent the I/O would get (given the Optane sharing the same NVMe namespace across three partitions with mixed reads/writes) - but I am willing to test this out.
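To put some numbers on that before committing, I would probably hammer the bare partitions with something along these lines - a sketch only, run before the partitions are handed to ZFS (fio will happily destroy whatever is on them); posixaio because TrueNAS 12 is FreeBSD-based, and the nvd0pX device names are again an assumption:
Code:
# three concurrent jobs, one per partition, to see whether the mixed load
# makes latency fall apart (run BEFORE adding the partitions to the pool!)
fio --ioengine=posixaio --direct=1 --runtime=120 --time_based --group_reporting \
    --name=slog  --filename=/dev/nvd0p1 --rw=write    --bs=16k --iodepth=4  \
    --name=l2arc --filename=/dev/nvd0p2 --rw=randread --bs=64k --iodepth=16 \
    --name=meta  --filename=/dev/nvd0p3 --rw=randrw   --bs=4k  --iodepth=16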
I am not in strong favour of 2) - I (intuitively) speculate that, in my case, the difference in likelihood between a single Optane failing and a total mirror failure of both Samsung 870 EVOs is not significant (am I wrong here?).
As for 3), the financial cost of purchasing a Squid PCIe carrier board and a second identical Optane 900p would exceed 600 USD, and I am not even sure that card would play nicely with the Supermicro board from 2014 (especially since I'd be using every available lane, and there might be... unforeseen problems with that).
So that one is a last resort for me.
As for the sizing of the metadata vdev, from what I have read and understood, that is still a bit of an art - nevertheless, I think a 200-260 GB partition should be able to host the metadata for ~30 TB of data (especially after I reconfigure the new pool with dataset record sizes appropriate to the file sizes).
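As a rough sanity check on that partition size - assuming the ~0.3% of data size ballpark for metadata that gets quoted around special vdev sizing (it varies a lot with recordsize and file counts) - the arithmetic looks like this:
Code:
# back-of-envelope only; the 0.3% figure is a commonly quoted ballpark, not a law
echo "30 * 1024 * 0.003" | bc    # ~92 GB of metadata for ~30 TB of data
Even if that ballpark is off by 2x, 200-260 GB should still leave headroom.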
I'd appreciate any thoughts and critiques of my thinking here.
To be totally clear - I am not so much concerned about the speed as about the reliability reasoning. My understanding is that Optane should be able to handle it. Am I wildly over-estimating its reliability?
As a last item, I am posting the `zdb` output of my existing pool - I have tried to use it to estimate the future metadata vdev size, but even after reading Wendell's articles/posts, I am still unclear what my existing metadata size is - I am not sure what I am looking at or for. Help/guidance here would also be appreciated. (My tentative attempt at a calculation is below the output.)
Code:
root@truenas[~]# zdb -U /data/zfs/zpool.cache -Lbbbs Primary
Traversing all blocks ...
11.7T completed (19535MB/s) estimated time remaining: 0hr 00min 00sec
        bp count:              68609607
        ganged count:                 0
        bp logical:       8716583483904      avg: 127046
        bp physical:      8554463491584      avg: 124683     compression:   1.02
        bp allocated:    12850438782976      avg: 187297     compression:   0.68
        bp deduped:                   0    ref>1:      0   deduplication:   1.00
        Normal class:    12850438782976     used: 42.89%
        additional, non-pointer bps of type 0:    1260670
         number of (compressed) bytes:  number of bps
                         14:    270 *
                         15:    174 *
                         16:     28 *
                         17:    451 *
                         18:    182 *
                         19:     98 *
                         20:     39 *
                         21:    179 *
                         22:    129 *
                         23:    207 *
                         24:     67 *
                         25:     52 *
                         26:    137 *
                         27:     83 *
                         28:   2919 *
                         29:   4984 *
                         30:     90 *
                         31:    186 *
                         32:    278 *
                         33:     94 *
                         34:    191 *
                         35:    150 *
                         36:   9095 **
                         37:   1758 *
                         38: 326526 ****************************************
                         39:  59143 ********
                         40:    188 *
                         41:    143 *
                         42:     98 *
                         43:    208 *
                         44:     52 *
                         45:   1229 *
                         46:    101 *
                         47:    316 *
                         48:    504 *
                         49:   1543 *
                         50: 148100 *******************
                         51:   3077 *
                         52:   3742 *
                         53:  20066 ***
                         54: 119891 ***************
                         55:  69274 *********
                         56: 128335 ****************
                         57: 307760 **************************************
                         58:    645 *
                         59:    852 *
                         60:    816 *
                         61:    891 *
                         62:   1180 *
                         63:    963 *
                         64:    876 *
                         65:    956 *
                         66:   1042 *
                         67:   1133 *
                         68:    895 *
                         69:    568 *
                         70:    767 *
                         71:    609 *
                         72:    645 *
                         73:    632 *
                         74:   1055 *
                         75:    581 *
                         76:   1052 *
                         77:   1680 *
                         78:    643 *
                         79:    513 *
                         80:    645 *
                         81:    948 *
                         82:    476 *
                         83:    604 *
                         84:   1328 *
                         85:    524 *
                         86:   1075 *
                         87:   2109 *
                         88:    629 *
                         89:    540 *
                         90:    499 *
                         91:    565 *
                         92:    492 *
                         93:    468 *
                         94:   1113 *
                         95:    537 *
                         96:    550 *
                         97:    900 *
                         98:    520 *
                         99:   3047 *
                        100:   6092 *
                        101:    499 *
                        102:    592 *
                        103:    660 *
                        104:    541 *
                        105:    350 *
                        106:    505 *
                        107:    428 *
                        108:    343 *
                        109:    410 *
                        110:    452 *
                        111:    706 *
                        112:    392 *
        Dittoed blocks on same vdev: 466811
Blocks  LSIZE   PSIZE   ASIZE     avg    comp   %Total  Type
     -      -       -       -       -       -        -  unallocated
     2    32K      8K     48K     24K    4.00     0.00  object directory
     2     1K      1K     48K     24K    1.00     0.00  object array
     1    16K      4K     24K     24K    4.00     0.00  packed nvlist
     -      -       -       -       -       -        -  packed nvlist size
     2    64K     24K    120K     60K    2.67     0.00      L1 bpobj
   422  52.6M   3.28M   19.7M   47.7K   16.06     0.00      L0 bpobj
   424  52.7M   3.30M   19.8M   47.8K   15.97     0.00  bpobj
     -      -       -       -       -       -        -  bpobj header
     -      -       -       -       -       -        -  SPA space map header
   108  2.80M    452K   2.65M   25.1K    6.34     0.00      L1 SPA space map
 2.22K  9.47M   9.21M   55.1M   24.8K    1.03     0.00      L0 SPA space map
 2.33K  12.3M   9.65M   57.7M   24.8K    1.27     0.00  SPA space map
     1    12K     12K     24K     24K    1.00     0.00  ZIL intent log
    28  3.50M    112K    448K     16K   32.00     0.00      L5 DMU dnode
    28  3.50M    112K    448K     16K   32.00     0.00      L4 DMU dnode
    28  3.50M    112K    448K     16K   32.00     0.00      L3 DMU dnode
    28  3.50M    112K    448K     16K   32.00     0.00      L2 DMU dnode
    97  12.1M   3.19M   9.82M    104K    3.80     0.00      L1 DMU dnode
 59.8K   957M    239M    957M   16.0K    4.00     0.01      L0 DMU dnode
 60.0K   983M    243M    969M   16.1K    4.05     0.01  DMU dnode
    29    58K     58K    472K   16.3K    1.00     0.00  DMU objset
     -      -       -       -       -       -        -  DSL directory
    25  13.5K      2K     48K   1.92K    6.75     0.00  DSL directory child map
     -      -       -       -       -       -        -  DSL dataset snap map
    26    44K      8K     48K   1.85K    5.50     0.00  DSL props
     -      -       -       -       -       -        -  DSL dataset
     -      -       -       -       -       -        -  ZFS znode
     -      -       -       -       -       -        -  ZFS V0 ACL
   141  4.41M    564K   2.20M     16K    8.00     0.00      L3 ZFS plain file
 8.20K   263M   37.8M    151M   18.4K    6.94     0.00      L2 ZFS plain file
  348K  10.9G   3.27G   13.1G   38.4K    3.33     0.11      L1 ZFS plain file
 64.3M  7.92T   7.78T   11.7T    186K    1.02    99.88      L0 ZFS plain file
 64.7M  7.93T   7.78T   11.7T    185K    1.02    99.99  ZFS plain file
 4.05K   130M   16.2M   64.8M   16.0K    8.00     0.00      L1 ZFS directory
  709K   511M   69.8M    558M     804    7.32     0.00      L0 ZFS directory
  713K   641M   86.1M    622M     893    7.45     0.01  ZFS directory
    22    22K     22K    352K     16K    1.00     0.00  ZFS master node
     -      -       -       -       -       -        -  ZFS delete queue
     -      -       -       -       -       -        -  zvol object
     -      -       -       -       -       -        -  zvol prop
     -      -       -       -       -       -        -  other uint8[]
     -      -       -       -       -       -        -  other uint64[]
     -      -       -       -       -       -        -  other ZAP
     -      -       -       -       -       -        -  persistent error log
     2   256K     20K    120K     60K   12.80     0.00  SPA history
     -      -       -       -       -       -        -  SPA history offsets
     -      -       -       -       -       -        -  Pool properties
     -      -       -       -       -       -        -  DSL permissions
     -      -       -       -       -       -        -  ZFS ACL
     -      -       -       -       -       -        -  ZFS SYSACL
     -      -       -       -       -       -        -  FUID table
     -      -       -       -       -       -        -  FUID table size
     1  1.50K   1.50K     24K     24K    1.00     0.00  DSL dataset next clones
     -      -       -       -       -       -        -  scan work queue
     -      -       -       -       -       -        -  ZFS user/group/project used
     -      -       -       -       -       -        -  ZFS user/group/project quota
     -      -       -       -       -       -        -  snapshot refcount tags
     -      -       -       -       -       -        -  DDT ZAP algorithm
     -      -       -       -       -       -        -  DDT statistics
     -      -       -       -       -       -        -  System attributes
     -      -       -       -       -       -        -  SA master node
    23  34.5K   34.5K    368K     16K    1.00     0.00  SA attr registration
    44   704K    176K    704K     16K    4.00     0.00  SA attr layouts
     -      -       -       -       -       -        -  scan translations
     -      -       -       -       -       -        -  deduplicated block
     -      -       -       -       -       -        -  DSL deadlist map
     -      -       -       -       -       -        -  DSL deadlist map hdr
     1  1.50K   1.50K     24K     24K    1.00     0.00  DSL dir clones
     -      -       -       -       -       -        -  bpobj subobj
    12   304K     48K    288K     24K    6.33     0.00      L1 deferred free
    19   402K     86K    528K   27.8K    4.67     0.00      L0 deferred free
    31   706K    134K    816K   26.3K    5.27     0.00  deferred free
     -      -       -       -       -       -        -  dedup ditto
    10  37.5K     15K    144K   14.4K    2.50     0.00  other
    28  3.50M    112K    448K     16K   32.00     0.00      L5 Total
    28  3.50M    112K    448K     16K   32.00     0.00      L4 Total
   169  7.91M    676K   2.64M     16K   11.98     0.00      L3 Total
 8.23K   266M   37.9M    151M   18.4K    7.01     0.00      L2 Total
  353K  11.0G   3.29G   13.1G   38.2K    3.35     0.11      L1 Total
 65.1M  7.92T   7.78T   11.7T    184K    1.02    99.89      L0 Total
 65.4M  7.93T   7.78T   11.7T    183K    1.02   100.00  Total
Block Size Histogram
  block   psize                lsize                asize
   size   Count   Size   Cum.  Count   Size   Cum.  Count   Size   Cum.
    512:   159K  79.4M  79.4M   159K  79.4M  79.4M      0      0      0
     1K:  84.6K   103M   182M  84.6K   103M   182M      0      0      0
     2K:  72.3K   189M   372M  72.3K   189M   372M      0      0      0
     4K:   359K  1.45G  1.81G  65.6K   362M   734M      0      0      0
     8K:   508K  5.34G  7.15G  68.8K   740M  1.44G   453K  3.54G  3.54G
    16K:   587K  12.6G  19.8G   112K  2.02G  3.46G   577K  10.8G  14.3G
    32K:   379K  16.7G  36.5G   392K  12.6G  16.1G   841K  35.1G  49.4G
    64K:   647K  58.6G  95.1G  24.3K  2.17G  18.3G   457K  40.5G  89.8G
   128K:  61.5M  7.69T  7.78T  63.3M  7.91T  7.93T  62.0M  11.6T  11.7T
   256K:      0      0  7.78T      0      0  7.93T      0      0  11.7T
   512K:      0      0  7.78T      0      0  7.93T      0      0  11.7T
     1M:      0      0  7.78T      0      0  7.93T      0      0  11.7T
     2M:      0      0  7.78T      0      0  7.93T      0      0  11.7T
     4M:      0      0  7.78T      0      0  7.93T      0      0  11.7T
     8M:      0      0  7.78T      0      0  7.93T      0      0  11.7T
    16M:      0      0  7.78T      0      0  7.93T      0      0  11.7T
                            capacity   operations   bandwidth  ---- errors ----
description                used avail  read write  read write  read write cksum
Primary                   11.7T 15.6T   924     0 6.14M     0     0     0     0
  raidz1                  11.7T 15.6T   924     0 6.14M     0     0     0     0
    /dev/gptid/73288854-d4ab-11e9-b86b-0cc47a0b7772.eli                     307     0 2.04M     0     0     0     0
    /dev/gptid/74117816-d4ab-11e9-b86b-0cc47a0b7772.eli                     309     0 2.05M     0     0     0     0
    /dev/gptid/74f89b34-d4ab-11e9-b86b-0cc47a0b7772.eli                     307     0 2.05M     0     0     0     0
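For what it's worth, here is my tentative plan for turning that wall of numbers into a single figure - the idea (which may well be wrong, corrections welcome) being that everything in the block-type table that is not an L0 "ZFS plain file"/zvol block is roughly what would land on a metadata special vdev, and that re-running zdb with -P should give exact byte counts instead of the rounded units above:
Code:
# sum the ASIZE column (4th) of everything except the L0 data blocks;
# assumes -P keeps the same columns, just with unscaled numbers
zdb -U /data/zfs/zpool.cache -PLbbbs Primary | awk '
  NF == 8 && $8 == "Total"                      { total = $4 }   # grand total ASIZE
  $8 == "L0" && $9 == "ZFS"  && $10 == "plain"  { data += $4 }   # file data blocks
  $8 == "L0" && $9 == "zvol" && $10 == "object" { data += $4 }   # zvol data blocks
  END { printf "approx. metadata: %.1f GiB\n", (total - data) / 2^30 }'
Eyeballing the ASIZE column above the same way, the current ~12 TB pool seems to carry only something in the low tens of GB of metadata - but I would appreciate someone sanity-checking that reading.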