Pools, Tips/Tricks, Pitfalls - Supermicro Server with 2 JBOD Extensions as a VMware Datastore and Backup Store

Yves_

Dabbler
Joined
Jun 4, 2020
Messages
11
Hi fellow FreeNASers,

I am really new to this forum. So I hope I am doing everything according to the forum rules. Otherwise please tell me.

I just finished building my first FreeNAS lab storage with old parts I got for a very good price.

- Main Chassis: Supermicro SuperChassis 847E16-R1400LPB (Link) has two backplanes (Front: 24 drives | Back: 12 drives)
- Mainboard: Supermicro X9DR3-F/i (Link)
- CPU: 2x Intel(R) Xeon(R) CPU E5-2643 0 @ 3.30GHz (Link)
- RAM: 16x 16GB DDR3, 256GB total
- SystemDisk: Supermicro SATA-DOM 32GB
- Additional NIC: 1x Mellanox ConnectX3-Pro Dual Port 40Gbps (Link), 1x some Intel Dual Port 10Gbps SFP+
- SLOG / Cache Drive: Intel Optane 900p 280GB (with 2 partitions): 60GB for SLOG, the rest for Cache
- Controller: LSI SAS9305-24i (Link) with the Supermicro Bracket AOM-SAS3-16i16e-LP adapter so I can use the two SAS Backplanes from the SuperChassis 847E16-R1400LPB and the two JBOD Extensions SuperChassis 847E16-RJBOD1
- JBOD Extensions: Supermicro SuperChassis 847E16-RJBOD1 (Link) each has also two Backplanes (Front: 24 Drives | Back: 21 Drives)
- Disks: 52x Western Digital White Label 8TB WD80EZAZ SATA (I think they are 5400 rpm drives)

Everything is up and running and works flawlessly as far as I can tell.

My idea was the following:

The goal is a smaller but faster pool for the VMs (connected through iSCSI) with SLOG / Cache on the Intel Optane 900p, and a bigger pool without a dedicated SLOG / Cache drive for backups (mostly big single files), mainly through SMB. So I thought I would create a pool vm-datastore01 with two vdevs (each with 4 drives in RAIDZ2) and partition the Optane with 60GB for SLOG and the rest (around 200GB) for Cache. For the other pool, backup-store01, with 44 drives, I would create 4 vdevs of 11 drives each in RAIDZ1, with no Cache and no SLOG. Any objections?
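For reference, this is roughly what that layout would look like on the command line (just a sketch for discussion - the FreeNAS GUI does the partitioning and pool creation for you, and da0-da51 are placeholder device names):

Code:
# vm-datastore01: two 4-disk RAIDZ2 vdevs, Optane partitions as SLOG and L2ARC
zpool create vm-datastore01 \
    raidz2 da0 da1 da2 da3 \
    raidz2 da4 da5 da6 da7 \
    log nvd0p1 \
    cache nvd0p2

# backup-store01: four 11-disk RAIDZ1 vdevs, no SLOG / L2ARC
zpool create backup-store01 \
    raidz1 da8  da9  da10 da11 da12 da13 da14 da15 da16 da17 da18 \
    raidz1 da19 da20 da21 da22 da23 da24 da25 da26 da27 da28 da29 \
    raidz1 da30 da31 da32 da33 da34 da35 da36 da37 da38 da39 da40 \
    raidz1 da41 da42 da43 da44 da45 da46 da47 da48 da49 da50 da51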

I am also very unsure about the 24- and 12-drive backplane situation... what happens if a complete backplane fails? Should I only build data vdevs from disks on the same backplane? Does it matter? Should I keep vdevs within one complete system and build the next vdevs from the first JBOD? And how do I check whether my performance is in the range it should be?

Thanks for your feedback,
Yves
 

Yves_

Dabbler
Joined
Jun 4, 2020
Messages
11
Soooo, a little follow-up. I created the two pools (vm-datastore01 | bigdata01):

vm-datastore01: pool with the Intel 900p as SLOG (60GB) and Cache (200GB), and two RAIDZ2 vdevs
Code:
zpool status vm-datastore01
  pool: vm-datastore01
 state: ONLINE
  scan: none requested
config:

        NAME                                            STATE     READ WRITE CKSUM
        vm-datastore01                                  ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/83a3adc9-a59c-11ea-bef5-002590982d14  ONLINE       0     0     0
            gptid/84f4c955-a59c-11ea-bef5-002590982d14  ONLINE       0     0     0
            gptid/8510cf13-a59c-11ea-bef5-002590982d14  ONLINE       0     0     0
            gptid/84b4416c-a59c-11ea-bef5-002590982d14  ONLINE       0     0     0
          raidz2-1                                      ONLINE       0     0     0
            gptid/84ba0ab3-a59c-11ea-bef5-002590982d14  ONLINE       0     0     0
            gptid/852acce6-a59c-11ea-bef5-002590982d14  ONLINE       0     0     0
            gptid/852961fe-a59c-11ea-bef5-002590982d14  ONLINE       0     0     0
            gptid/8447f780-a59c-11ea-bef5-002590982d14  ONLINE       0     0     0
        logs
          nvd0p1                                        ONLINE       0     0     0
        cache
          nvd0p2                                        ONLINE       0     0     0

errors: No known data errors


bigdata01: pool without all of my disks yet. Currently 28 drives in 4 RAIDZ1 vdevs
Code:
zpool status bigdata01
  pool: bigdata01
 state: ONLINE
  scan: none requested
config:

        NAME                                                STATE     READ WRITE CKSUM
        bigdata01                                           ONLINE       0     0     0
          raidz1-0                                          ONLINE       0     0     0
            gptid/199e803b-a9b8-11ea-bef5-002590982d14.eli  ONLINE       0     0     0
            gptid/16fdb004-a9b8-11ea-bef5-002590982d14.eli  ONLINE       0     0     0
            gptid/1a31ffeb-a9b8-11ea-bef5-002590982d14.eli  ONLINE       0     0     0
            gptid/187f5690-a9b8-11ea-bef5-002590982d14.eli  ONLINE       0     0     0
            gptid/1aa8d8e0-a9b8-11ea-bef5-002590982d14.eli  ONLINE       0     0     0
            gptid/16ddcdac-a9b8-11ea-bef5-002590982d14.eli  ONLINE       0     0     0
            gptid/1aca088f-a9b8-11ea-bef5-002590982d14.eli  ONLINE       0     0     0
          raidz1-1                                          ONLINE       0     0     0
            gptid/15983367-a9b8-11ea-bef5-002590982d14.eli  ONLINE       0     0     0
            gptid/18c4e64c-a9b8-11ea-bef5-002590982d14.eli  ONLINE       0     0     0
            gptid/1b048556-a9b8-11ea-bef5-002590982d14.eli  ONLINE       0     0     0
            gptid/18970955-a9b8-11ea-bef5-002590982d14.eli  ONLINE       0     0     0
            gptid/18b88cc6-a9b8-11ea-bef5-002590982d14.eli  ONLINE       0     0     0
            gptid/19780e0b-a9b8-11ea-bef5-002590982d14.eli  ONLINE       0     0     0
            gptid/1b84128d-a9b8-11ea-bef5-002590982d14.eli  ONLINE       0     0     0
          raidz1-2                                          ONLINE       0     0     0
            gptid/17004377-a9b8-11ea-bef5-002590982d14.eli  ONLINE       0     0     0
            gptid/1a83a744-a9b8-11ea-bef5-002590982d14.eli  ONLINE       0     0     0
            gptid/2180e3a5-a9b8-11ea-bef5-002590982d14.eli  ONLINE       0     0     0
            gptid/246919cf-a9b8-11ea-bef5-002590982d14.eli  ONLINE       0     0     0
            gptid/240b702a-a9b8-11ea-bef5-002590982d14.eli  ONLINE       0     0     0
            gptid/249c9bb8-a9b8-11ea-bef5-002590982d14.eli  ONLINE       0     0     0
            gptid/246671b9-a9b8-11ea-bef5-002590982d14.eli  ONLINE       0     0     0
          raidz1-3                                          ONLINE       0     0     0
            gptid/232bfca3-a9b8-11ea-bef5-002590982d14.eli  ONLINE       0     0     0
            gptid/2361557d-a9b8-11ea-bef5-002590982d14.eli  ONLINE       0     0     0
            gptid/2408b798-a9b8-11ea-bef5-002590982d14.eli  ONLINE       0     0     0
            gptid/23f4cd5e-a9b8-11ea-bef5-002590982d14.eli  ONLINE       0     0     0
            gptid/24e0cfe3-a9b8-11ea-bef5-002590982d14.eli  ONLINE       0     0     0
            gptid/2564da74-a9b8-11ea-bef5-002590982d14.eli  ONLINE       0     0     0
            gptid/256bf0a2-a9b8-11ea-bef5-002590982d14.eli  ONLINE       0     0     0

errors: No known data errors


So after googling around a little, I read that it is recommended to do tests with dd instead of fio and that it's important to turn off compression. Here are my initial tests of vm-datastore01.
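Turning compression off for the test dataset is a one-liner; roughly like this (a sketch, assuming the tests go against the pool's root dataset):

Code:
# disable compression so dd's zeros actually hit the disks instead of compressing away
zfs set compression=off vm-datastore01
# ... run the dd tests ...
# switch it back on afterwards (lz4 is the usual default)
zfs set compression=lz4 vm-datastore01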

Code:
dd if=/dev/zero of=/mnt/vm-datastore01/testfile bs=1M count=50000
50000+0 records in
50000+0 records out
52428800000 bytes transferred in 99.031242 secs (529416767 bytes/sec)


gstat of the nvme
Code:
dT: 1.003s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    0   5016      0      0    0.0   5016 567207    0.2   27.0| nvd0


What I don't understand is why the device is only 27% busy. If I take the same Intel 900p out of the pool, create a pool only on the NVMe, and run the same test on it, it is 100% busy and performance is a lot better. Do I misunderstand how SLOG works?
Code:
dd if=/dev/zero of=/mnt/nvme/testfile bs=1M count=50000
50000+0 records in
50000+0 records out
52428800000 bytes transferred in 22.493678 secs (2330823819 bytes/sec)


Code:
dT: 1.003s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    9  17097      0      0    0.0  17097 2152748    0.5   98.7| nvd0
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
@jgreco has provided some excellent links already, I'd like to build on them for a little bit.

"Running/active VMs" and "Backup target" are two very, very different workloads; I'm glad to see you have two different pools for them, but as pointed out, the configuration of your vm-datastore01 really ought to be mirrors for the best performance. You won't really lose any usable space either (and might actually gain some if you've got a lot of 4K records that will be causing the use of padding)

Optane will tolerate being split into L2ARC and SLOG better than other devices thanks to its abundant performance, but it's still generally not recommended. L2ARC and SLOG are two very different workloads (L2ARC is "read-heavy" higher queues, SLOG is effectively "write-only read-never" low queue depth) and sharing a device compromises its ability to do a good job of each. I'd suggest you let the Optane be pure SLOG (with, say, 16GB allocated for that) and pick up a cheaper SSD for L2ARC duties.
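Roughly, from the shell, that change would look something like this (sketch only - you can do the same from the GUI, and ada1 stands in for whatever SSD you pick up for L2ARC):

Code:
# detach the current split log/cache partitions from the pool
zpool remove vm-datastore01 nvd0p1 nvd0p2

# re-partition the Optane with a single 16GB SLOG partition
gpart destroy -F nvd0
gpart create -s gpt nvd0
gpart add -t freebsd-zfs -s 16G -l slog0 nvd0

# add it back as a pure log device, and the separate SSD as L2ARC
zpool add vm-datastore01 log nvd0p1
zpool add vm-datastore01 cache ada1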

And finally, your SLOG itself isn't doing much of anything for iSCSI unless you've set sync=always on the ZVOLs you're presenting.
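For example (the zvol name here is just a placeholder for whatever backs your iSCSI extent):

Code:
# check what the zvol is currently set to
zfs get sync vm-datastore01/iscsi-zvol01
# force every write to hit stable storage (i.e. the SLOG) before it is acknowledged
zfs set sync=always vm-datastore01/iscsi-zvol01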
 

Yves_

Dabbler
Joined
Jun 4, 2020
Messages
11

Hi @jgreco, thanks a lot for your help and insight. I read both links. BIG SORRY that I did not do that upfront... If I understand your linked articles correctly, you would recommend a lot of mirrors instead of RAIDZ. So I changed the layout of vm-datastore01 to the following:

Code:
  pool: vm-datastore01
 state: ONLINE
  scan: none requested
config:

        NAME                                            STATE     READ WRITE CKSUM
        vm-datastore01                                  ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/25c00a5d-aa66-11ea-a970-002590982d14  ONLINE       0     0     0
            gptid/26928993-aa66-11ea-a970-002590982d14  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            gptid/2591e81b-aa66-11ea-a970-002590982d14  ONLINE       0     0     0
            gptid/2696058a-aa66-11ea-a970-002590982d14  ONLINE       0     0     0
          mirror-2                                      ONLINE       0     0     0
            gptid/26247d30-aa66-11ea-a970-002590982d14  ONLINE       0     0     0
            gptid/26d4c4b4-aa66-11ea-a970-002590982d14  ONLINE       0     0     0
          mirror-3                                      ONLINE       0     0     0
            gptid/25d5307a-aa66-11ea-a970-002590982d14  ONLINE       0     0     0
            gptid/26bc3f49-aa66-11ea-a970-002590982d14  ONLINE       0     0     0
        logs
          nvd0p1                                        ONLINE       0     0     0
        cache
          nvd0p2                                        ONLINE       0     0     0

errors: No known data errors


Here are some tests; I think I am still missing something. The results with Sync -> Disabled are a lot higher than with Always, which should use the Optane as a SLOG? Or is this a misconception on my part?

Pool settings: Sync -> Always, Compression Level -> off, Atime -> On, Dedupe -> Off.

1M Blocksize:
Code:
dd if=/dev/zero of=/mnt/vm-datastore01/testfile bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 13.223538 secs (792961739 bytes/sec)


4k Blocksize:
Code:
dd if=/dev/zero of=/mnt/vm-datastore01/testfile bs=4k count=100000
100000+0 records in
100000+0 records out
409600000 bytes transferred in 6.737311 secs (60795766 bytes/sec)


Pool settings: Sync -> Disabled, Compression Level -> off, Atime -> On, Dedupe -> Off.

1M Blocksize:
Code:
dd if=/dev/zero of=/mnt/vm-datastore01/testfile bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 9.536430 secs (1099547768 bytes/sec)


4k Blocksize:
Code:
dd if=/dev/zero of=/mnt/vm-datastore01/testfile bs=4k count=100000
100000+0 records in
100000+0 records out
409600000 bytes transferred in 0.716156 secs (571942426 bytes/sec)


@HoneyBadger: thanks also a lot for your input. So you would say rather use the Optane for SLOG and a normal SSD for L2ARC?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Here are some tests; I think I am still missing something. The results with Sync -> Disabled are a lot higher than with Always, which should use the Optane as a SLOG? Or is this a misconception on my part?

The missing piece here boils down to this: with sync=disabled, your data is not safe.

Without enforcing sync writes, your data is collected as a transaction group in RAM before being flushed to disk; but your client systems over iSCSI will think it is safe on disk. If you never crash, well, things will work great. But as soon as you do, any data from the past few seconds runs the risk of being permanently lost. After all, it was only in RAM.

Now, for some people, losing the data is acceptable. They have backups, they'll restore them, and the risk of downtime is outweighed by the increase in speed. But for transactional databases or other situations where data loss is unacceptable, you need to use sync=always and accept the performance hit.

So you would say rather use the Optane for SLOG and a normal SSD for L2ARC?

That would be my suggestion. Not necessarily right down to the cheapest consumer drive you can get but you certainly do not need Optane for L2ARC. A "mixed use" SSD is fine, likely rated between 1-3 DWPD (drive writes per day) as opposed to the 5-10 of "write intensive" solutions or the <1 of consumer-grade.
 

Yves_

Dabbler
Joined
Jun 4, 2020
Messages
11
@HoneyBadger: Again, big thanks for explaining everything in such detail! Makes total sense now. But shouldn't the SLOG drive (the Optane) make that all faster again? What about the 3rd setting -> Standard?

I have a few Intel S4600s lying around from lab tests. I think they would work well for that.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
But shouldn't the SLOG drive (the Optane) make that all faster again? What about the 3rd setting -> Standard?

You need to compare apples to apples. sync=disabled will always give you the fastest results because you are "writing to RAM" rather than to disk, but it's unsafe and puts your data at risk. sync=standard in the case of iSCSI will do much the same; client data is written to RAM, but internal ZFS metadata is written synchronously. You're still at risk.

The two scenarios you need to compare are "sync=always, with Optane SLOG" and "sync=always, without Optane SLOG".

Do a comparison between the two (although maybe with only 1000 of those 4K writes) and you'll see very quickly why the Optane is necessary. You're getting about 60MB/s of writes with the Optane SLOG at 4K; without it, I wager you'll be lucky to get 6MB/s.
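Something along these lines would do as a quick test (a sketch - nvd0p1 assumed to be your SLOG partition, with sync=always already set on the dataset):

Code:
# baseline: sync=always with the Optane SLOG in place
dd if=/dev/zero of=/mnt/vm-datastore01/testfile bs=4k count=1000

# pull the SLOG so the ZIL falls back onto the spinning disks, then repeat
zpool remove vm-datastore01 nvd0p1
dd if=/dev/zero of=/mnt/vm-datastore01/testfile bs=4k count=1000

# put the SLOG back when you're done
zpool add vm-datastore01 log nvd0p1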

I have a few Intel S4600s lying around from lab tests. I think they would work well for that.

Those are a good choice. You will need to add some tunables (of sysctl type) for vfs.zfs.l2arc_write_boost and vfs.zfs.l2arc_write_max - the defaults are an underwhelming 8MB/s. write_max is the limit for how much will be written to the L2ARC under normal situations, and write_boost is the "warmup" after first boot until you start seeing evictions from ARC. They can definitely be raised from the 8MB/s defaults, but make sure that you leave them low enough for the L2ARC to serve its primary purpose of reads. Some SSDs don't handle simultaneous reads and writes well, so increase slowly and keep an eye on your latency values.
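On FreeNAS these go under System -> Tunables (type "sysctl"); from a shell the equivalent would be something like this (the values are only a starting point, tune them for your SSD):

Code:
# raise the steady-state L2ARC fill rate from the 8MB/s default to 64MB/s
sysctl vfs.zfs.l2arc_write_max=67108864
# extra headroom added on top of write_max during warmup, until ARC evictions start
sysctl vfs.zfs.l2arc_write_boost=134217728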
 

Yves_

Dabbler
Joined
Jun 4, 2020
Messages
11
@HoneyBadger: You are of course totally right! The same pool with the Optane SLOG is, like you said, around 60MB/s, and with the Optane offlined and sync still on always I get around 0.38 MB/s...

So, if I understand you correctly (please tell me if not): the roughly 60MB/s with the SLOG on the Optane is a good value? Can it be improved without turning off sync=always? I mean, 60MB/s on 4k random writes is not bad... but compared to the almost 600MB/s with sync=standard / disabled it's a lot less...

What about NFS instead of iSCSI? Same issue? If sync is disabled you risk losing data, and if it's on always you lose speed? //edit: I think I already found the answer to my question here: Sync writes, or: Why is my ESXi NFS so slow, and why is iSCSI faster?

edit:// not really sure if there is an error somewhere... but these numbers look very nice currently (iSCSI over 10Gb, sync=always, Optane SLOG, Optane Cache, Atime on):

[Attached screenshot: benchmark results taken over Remote Desktop, 2020-06-16]

edit2:// One more idea I just had... I run two pools: one quite big one for backups and big files (ISOs, system images, etc.) and the "smaller" one as a VM datastore. Now what if I split the Optane into two parts, one SLOG for each pool, and split the S4600 into two parts, one Cache for each pool? Additionally, I currently run 4x 2-disk mirrors for the vm-datastore; would it improve performance if I went up to something like 8x 2-disk mirrors?
 
Last edited:

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Have a look at your FreeNAS host machine's BIOS for any power-saving settings, especially the ones related to PCIe link state power management, and disable them. I think you should be able to push an Optane 900p faster than that.

Splitting SLOG and L2ARC devices between pools (or even one pool with SLOG/L2ARC on the same device) is discouraged right now because there is no way to enforce a given performance ratio on the device, so the two workloads tend to stomp all over each other.

A few very limited and expensive NVMe devices support QoS on multiple namespaces (I only know of the Micron 9300 that supports both; the Intel DC P4510 supports namespaces in newer firmware but still not QoS), and if both are properly implemented you can assign performance to each namespace, e.g. create a small namespace with the lion's share of the write power and a larger one that gets the remainder. But for general devices, don't do that.

ZFS pool performance generally scales by vdev count, so adding more mirror pairs will improve performance. Your SLOG device or network latency may still be a bottleneck though at the small block sizes.
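Extending the pool is just a matter of adding more mirror vdevs, e.g. (a sketch, with placeholder device names):

Code:
# grow vm-datastore01 from 4 to 8 mirror pairs
zpool add vm-datastore01 \
    mirror da8 da9 \
    mirror da10 da11 \
    mirror da12 da13 \
    mirror da14 da15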
 

Yves_

Dabbler
Joined
Jun 4, 2020
Messages
11
sorry, busy week... no time to toy around.

So I removed the SLOG and ran diskinfo -wS on the Optane 900p, which gave me these results (I think something is broken since it never goes higher than 128 kbytes in these tests; I already started a thread about that).

Code:
diskinfo -wS /dev/nvd0
/dev/nvd0
        512             # sectorsize
        280065171456    # mediasize in bytes (261G)
        547002288       # mediasize in sectors
        0               # stripesize
        0               # stripeoffset
        INTEL SSDPED1D280GA     # Disk descr.
        PHMB742400E4280CGN      # Disk ident.
        Yes             # TRIM/UNMAP support
        0               # Rotation rate in RPM

Synchronous random writes:
         0.5 kbytes:     21.2 usec/IO =     23.0 Mbytes/s
           1 kbytes:     21.2 usec/IO =     46.1 Mbytes/s
           2 kbytes:     23.3 usec/IO =     83.9 Mbytes/s
           4 kbytes:     13.5 usec/IO =    289.3 Mbytes/s
           8 kbytes:     18.8 usec/IO =    414.8 Mbytes/s
          16 kbytes:     25.7 usec/IO =    606.8 Mbytes/s
          32 kbytes:     37.6 usec/IO =    830.5 Mbytes/s
          64 kbytes:     55.2 usec/IO =   1131.4 Mbytes/s
         128 kbytes:     88.4 usec/IO =   1413.5 Mbytes/s
         256 kbytes: diskinfo: AIO write submit error: Operation not supported


As far as I can see, 4K should be around 289.3 MBytes/s, so why am I only getting around 50-60 MBytes/s through the SLOG? Can I optimize the SLOG somehow? About the power saving: I don't think that is turned on on this old X9 board; at least I did not find an option for it. I can post BIOS screenshots if you want.

edit:// if I run gstat I can see that nvd0 is only 9-16% busy.
 
Last edited: