Considering A Flash Upgrade

HANDLEric · Nov 21, 2023

Hello,

we currently have a mild TrueNAS build that's been running in our office for ~4 years now and served us well providing storage for our VMWare environment. With that being said we're finally hitting some constraints and looking to expand or adjust to meet the current needs.

Here is the overview of the current build:

(2) Intel Xeon E5-2670
512GB RAM
(1) Intel DC P3700 partitioned for SLOG
LSI SAS9200 (Running in IT mode)
PowerVault MD1200 3.5" Drive Shelf
(7) 8TB Segate Exos 7E8 3.5" SAS drives in a RaidZ-3 pool
Pool is configured with lz4 compression and deduplication set to verify

We would like to potentially add a second shelf with SATA SSDs and weren't really sure if that would be a plug and play operation or if we need to consider some additional component upgrades as part of this initiative. I was also a bit curious if there were any best practice guidelines out here somewhere that talk about block sizing and/or any related zvol settings that should be considered for various workloads.

HANDLEric · Nov 22, 2023

I guess another thought here is that I assume since we're leveraging iSCSI block storage that SLOG would still be required. Is the P3700 still an ideal choice or should we move to something newer that might better align with an SSD pool?

NugentS · Nov 22, 2023

SSD's will need a 9300 LSI HBA, not a 9200. The 9300 has a much better throughput whilst the 9200 runs out of puff rapidly when dealing with SSD's

I would suggest swapping the 9200 for a 9300 (of an appropriate model) and running both shelves from that. There are newer models than 9300 (aks 9400, 9500, maybe even 9600 I dunno) - but storage tends to be very conservative - and newer models don't have the testing.

I am also concerned that you are running dedupe without a dedupe metadata special vdev - but if its working for you thats great
There is a thread somewhere here about appropriate SLOG's - but the precis is: Optane (and not the M10 varient) - you want 900p or better (other options are RMS200 or RMS300 - but I think you need to be cautious of the amount of memory on them.
[Optanes also make great dedupe metadata as well]

Lastly iSCSI on Z3 - thats a lot of IOPS you don't have. I guess the VM's are not that busy. iSCSI should be mirrors for proper performance.

HANDLEric · Nov 22, 2023

NugentS said:
SSD's will need a 9300 LSI HBA, not a 9200. The 9300 has a much better throughput whilst the 9200 runs out of puff rapidly when dealing with SSD's

I would suggest swapping the 9200 for a 9300 (of an appropriate model) and running both shelves from that. There are newer models than 9300 (aks 9400, 9500, maybe even 9600 I dunno) - but storage tends to be very conservative - and newer models don't have the testing.

I am also concerned that you are running dedupe without a dedupe metadata special vdev - but if its working for you thats great
There is a thread somewhere here about appropriate SLOG's - but the precis is: Optane (and not the M10 varient) - you want 900p or better (other options are RMS200 or RMS300 - but I think you need to be cautious of the amount of memory on them.
[Optanes also make great dedupe metadata as well]

Lastly iSCSI on Z3 - thats a lot of IOPS you don't have. I guess the VM's are not that busy. iSCSI should be mirrors for proper performance.

We can definitely swap the SLOG over to an Optane PCI-E drive, but when you talk about not running iSCSI on Z3 - we recently lost an entire pool due to a double drive failure which is how we ended up rebuilding on a Z3 in the first place.

Would you recommend changing the existing pool from the 7 drive Z3 layout to a stripe across (3) 2-disk mirrors with a hot spare and disabling deduplication?

NugentS · Nov 22, 2023

Is the existing pool working well for you - its definately not optimised, but if its working then great. However if you are planning SSD's because the pool performance isn't ideal then your Z3 is one reason for that. Dedupe being on will be another reason for that as the DDT's have to be stored on the HDD pool which I would have thought would chew up most of your IOPS.

SSD's don't fail the way that HDD's do (generally) - they tend to fail suddenly. Also rebuilding a vdev isn't so much of a strain on the remaining SSD(s) in the vdev because its just reading those devices and only writing to the new one. So SSD Mirrors are good.

In general terms pools for iSCSI and virtual disks should be mirrors, with a SLOG and if using dedupe then with a dedupe metatdata special as well.

How much capacity in the pool for iSCSI do you need? If you need 30+TB of space for VHD's then SSD's are going to be an expensive proposition. Also are you making (what I consider) the classic mistake of virtualising a file server and storing the files that way?

My general advice would be to do something like the following:

1. Have two pools for VM's. Pool1 of HDD's in mirrors (possibly 3 way mirrors, HDD's are cheap(ish)) + SLOG + dedupe special (if using dedupe). Pool2 of SSD's in mirrors + SLOG + Dedupe metadata special (if using dedupe)
2. Split the VM's into 2. SSD's get VM's needing high IOPS (databases and such). HDD's get VM's not needing high IOPS (domain controllers and such). Other stuff, you will have to decide
3. If you have a file server thats virtualised then put that function on either another NAS or on this NAS on a third pool (In RAIDZn) with no SLOG and a dedupe metadata special (if using dedupe). File serving does not need expensive disk.
4. Make sure the NAS has at least two NICs - one dedicated to iSCSI, on its own VLAN/switch. ESX hosts should match this. The other for other traffic

This should reduce the total space required on SSD (expensive) and make things more practical and performant. The issue here is not so much disk space, but slots for disks - but you are already considering an extra shelf. Your basic hardware seems good, with loads of RAM, good CPU's and lots of PCIe lanes for NVMe SLOG and metadata specials. The only issue I see with the hardware is that it isn't exactly new given the CPU's were discontinued Q2, 2015 and use DDR3.

For more specific advice - I would need to know a lot more details, and I mean a lot - much more than should be put on a public forum

As a question however, as I am intrigued - do you have sync on or off on the isci zvol?

ChrisRJ · Nov 22, 2023

You did not specify what constraints you are running into. Can you be more specific about that?

jenksdrummer · Nov 23, 2023

Dedupe without SSD or better being used as a special vdev will give you a bad time. Even with, over time, it can still induce pain.

Z* setups for IOPS it is not; mirrored pairs - effectively raid 10...delivers.

SLOG isn't a requirement. Data gets co-written to RAM as well as SLOG, before it does an organized write-blast to the underlying vdevs. As a result you need something fast, with crazy levels of write endurance/performance for a SLOG, and it would only be beneficial should there be an abend with in-flight data. To that, smaller SLOG is best, like at most 2x RAM, then under-provision it to 25%...so it can burn up cells and not die. Not sure what happens when a SLOG vdev dies...

Also, optane is mentioned; Intel has ended Optane.

NugentS · Nov 23, 2023

You can still get Optane - but it is becoming more difficult, but they haven't got to silly prices yet (other than the ones that were already silly prices)
If a SLOG dies (assuming not at a unexpected reboot) then the pool will just continue, but without the effect of the SLOG

@jenksdrummer - your comment "and it would only be beneficial should there be an abend with in-flight data" is not actually correct. The way sync writes work is that the data has to be written to a permanent storage device - which would normally be the pool. The data is held in memory, but to acknowledge the write the data has to be written to somewhere where it won't vanish in the event of a power loss. A SLOG acts as an alternative to the pool for the log - so it needs (as you say) to be very fast, low latency and high endurance with PLP. This will significantly accelerate sync writes, but not to the point/speed of async writes
[edit - I don't think I explained that very well]

jenksdrummer · Nov 26, 2023

NugentS said:
You can still get Optane - but it is becoming more difficult, but they haven't got to silly prices yet (other than the ones that were already silly prices)
If a SLOG dies (assuming not at a unexpected reboot) then the pool will just continue, but without the effect of the SLOG

@jenksdrummer - your comment "and it would only be beneficial should there be an abend with in-flight data" is not actually correct. The way sync writes work is that the data has to be written to a permanent storage device - which would normally be the pool. The data is held in memory, but to acknowledge the write the data has to be written to somewhere where it won't vanish in the event of a power loss. A SLOG acts as an alternative to the pool for the log - so it needs (as you say) to be very fast, low latency and high endurance with PLP. This will significantly accelerate sync writes, but not to the point/speed of async writes
[edit - I don't think I explained that very well]

Great reply, explained well

HANDLEric · Jan 1, 2024

ChrisRJ said:
You did not specify what constraints you are running into. Can you be more specific about that?

I think right now we're just hitting the limitations of a basic spinning disk array in that we're seeing latency hitting the sky and speeds plummet through the floor to the point we're seeing some more sensitive Linux systems crash. I think it's honestly just time to catch up a bit.

rvassar · Jan 1, 2024

HANDLEric said:
I think right now we're just hitting the limitations of a basic spinning disk array in that we're seeing latency hitting the sky and speeds plummet through the floor to the point we're seeing some more sensitive Linux systems crash. I think it's honestly just time to catch up a bit.

The problem with a RAIDZ pool for VM's is they generally have the write performance of the slowest single component device in the pool, and gain only marginally more than that single device in read performance. A simple mirror starts at 2x read and 1x write, as each mirror component can seek a different LBA on read. Adding a second mirror vdev and you get a force multiplier, 4x read and 2x write, and it goes up from there as you add vdev's.

You can do something similar by adding RAIDz2 vdev's but you end up needing a lot more devices and the math doesn't work out quite as clean. Basically you create a RAIDz2 and add another RAIDz2 vdev, and get 2+-ish times read and something slightly less than 2x write. But you start at 8 disks and go up 4 devices per vdev to get the gains. For VM's you're better off going the multi-mirror route and getting the fault tolerance you need with a massive I/O gains on reads.

You probably need to look at Enterprise NVMe drives for a commercial production environment. You'll have to ditch the disk shelf and go direct attach, but consider... 15Tb NVMe U.2 drives are $1400 these days, and I've personally held a 25Tb 2.5" U.2 in my hand.... Bigger stuff (61Tb!) is in the R&D pipe, all it takes is $$$$. I expect spinning disks to hold out as bulk storage until about 2030, and then disappear. NVMe drives are pushing HDD's out for VM block storage now.

HANDLEric · Jan 1, 2024

rvassar said:
The problem with a RAIDZ pool for VM's is they generally have the write performance of the slowest single component device in the pool, and gain only marginally more than that single device in read performance. A simple mirror starts at 2x read and 1x write, as each mirror component can seek a different LBA on read. Adding a second mirror vdev and you get a force multiplier, 4x read and 2x write, and it goes up from there as you add vdev's.

You can do something similar by adding RAIDz2 vdev's but you end up needing a lot more devices and the math doesn't work out quite as clean. Basically you create a RAIDz2 and add another RAIDz2 vdev, and get 2+-ish times read and something slightly less than 2x write. But you start at 8 disks and go up 4 devices per vdev to get the gains. For VM's you're better off going the multi-mirror route and getting the fault tolerance you need with a massive I/O gains on reads.

You probably need to look at Enterprise NVMe drives for a commercial production environment. You'll have to ditch the disk shelf and go direct attach, but consider... 15Tb NVMe U.2 drives are $1400 these days, and I've personally held a 25Tb 2.5" U.2 in my hand.... Bigger stuff (61Tb!) is in the R&D pipe, all it takes is $$$$. I expect spinning disks to hold out as bulk storage until about 2030, and then disappear. NVMe drives are pushing HDD's out for VM block storage now.

Makes sense and totally understood - from our perspective it's also necessary to consider the bus speed and throughput of a single disk as we look at HDD -vs- SSD -vs- NVMe and so on where that base pool speed and mirror multipliers carry more weight. I think we just want to make sure that we're making the right decisions.

It's also worth noting that this exact deployment ran like adream with nearly the same I/O 2 or so years ago which is what's throwing us off. Either way we just want to improve.

rvassar · Jan 1, 2024

HANDLEric said:
It's also worth noting that this exact deployment ran like adream with nearly the same I/O 2 or so years ago which is what's throwing us off. Either way we just want to improve.

Was that before or after you moved from mirrors to RAIDZ (November 22 post this thread)? It's quite a big difference in performance for iSCSI block storage.

HANDLEric · Jan 1, 2024

rvassar said:
Was that before or after you moved from mirrors to RAIDZ (November 22 post this thread)? It's quite a big difference in performance for iSCSI block storage.

We were actually on RAIDZ the entire time however we started on RAIDZ-1 (Single Parity) and moved to RAIDZ-3 (Triple Parity) which should have added a hair of read penalty but no additional write penalty if i'm not mistaken. We were already seeing performance degradation prior to that event which then just exaggerated it, thus the thought of transitioning to newer/faster hardware.

Important Announcement for the TrueNAS Community.

Considering A Flash Upgrade

HANDLEric

Dabbler

HANDLEric

Dabbler

NugentS

MVP

HANDLEric

Dabbler

NugentS

MVP

ChrisRJ

Wizard

jenksdrummer

Patron

NugentS

MVP

jenksdrummer

Patron

HANDLEric

Dabbler

rvassar

Guru

HANDLEric

Dabbler

rvassar

Guru

HANDLEric

Dabbler

Similar threads

Important Announcement for the TrueNAS Community.

Considering A Flash Upgrade

Dabbler

Dabbler

MVP

Dabbler

MVP

Wizard

Patron

MVP

Patron

Dabbler

Guru

Dabbler

Guru

Dabbler

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Considering A Flash Upgrade"

Similar threads