Considering A Flash Upgrade

HANDLEric

Dabbler
Joined
May 6, 2019
Messages
47
Hello,

We currently have a modest TrueNAS build that's been running in our office for ~4 years now and has served us well providing storage for our VMware environment. That said, we're finally hitting some constraints and are looking to expand or adjust to meet our current needs.

Here is the overview of the current build:
  1. (2) Intel Xeon E5-2670
  2. 512GB RAM
  3. (1) Intel DC P3700 partitioned for SLOG
  4. LSI SAS9200 (Running in IT mode)
  5. PowerVault MD1200 3.5" Drive Shelf
  6. (7) 8TB Seagate Exos 7E8 3.5" SAS drives in a RAIDZ3 pool
  7. Pool is configured with lz4 compression and deduplication set to verify

We would like to add a second shelf with SATA SSDs, but we weren't really sure if that would be a plug-and-play operation or if we need to consider some additional component upgrades as part of this initiative. I was also a bit curious whether there are any best-practice guidelines out there that cover block sizing and/or related zvol settings for various workloads.
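
For context, the sort of zvol settings I'm asking about would look something like this - the pool/zvol names and values here are just placeholders I've pieced together from other threads, not what we're running:

  # Hypothetical: thin-provisioned zvol for an iSCSI extent with a 16K volblocksize
  zfs create -s -V 2T -o volblocksize=16K -o compression=lz4 tank/vmware-zvol1

  # Check what an existing zvol is using today
  zfs get volblocksize,compression,dedup,sync tank/existing-zvol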
 

HANDLEric

Dabbler
Joined
May 6, 2019
Messages
47
I guess another thought here: since we're leveraging iSCSI block storage, I assume a SLOG would still be required. Is the P3700 still an ideal choice, or should we move to something newer that might better align with an SSD pool?
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
SSDs will need a 9300-series LSI HBA, not a 9200. The 9300 has much better throughput, whilst the 9200 runs out of puff rapidly when dealing with SSDs.

I would suggest swapping the 9200 for a 9300 (of an appropriate model) and running both shelves from that. There are newer models than the 9300 (aka the 9400, 9500, maybe even the 9600, I dunno) - but storage tends to be very conservative, and the newer models don't have the testing behind them.

I am also concerned that you are running dedupe without a dedupe metadata special vdev - but if it's working for you, that's great.
There is a thread somewhere here about appropriate SLOGs, but the précis is: Optane (and not the M10 variant) - you want a 900P or better. Other options are the RMS200 or RMS300, but I think you need to be cautious about the amount of memory on them.
[Optanes also make great dedupe metadata vdevs]

Lastly, iSCSI on Z3 - that's a lot of IOPS you don't have. I guess the VMs are not that busy. iSCSI should be on mirrors for proper performance.
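
Purely as an illustration (device and pool names below are placeholders, not a recommendation of specific parts) - adding a SLOG and a mirrored dedupe metadata special to an existing pool is roughly:

  # Add an NVMe/Optane SLOG device to the pool
  zpool add tank log nvd0

  # Add a mirrored dedup metadata special vdev (mirror it - losing it means losing the pool)
  zpool add tank dedup mirror nvd1 nvd2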
 

HANDLEric

Dabbler
Joined
May 6, 2019
Messages
47
SSDs will need a 9300-series LSI HBA, not a 9200. The 9300 has much better throughput, whilst the 9200 runs out of puff rapidly when dealing with SSDs.

I would suggest swapping the 9200 for a 9300 (of an appropriate model) and running both shelves from that. There are newer models than the 9300 (aka the 9400, 9500, maybe even the 9600, I dunno) - but storage tends to be very conservative, and the newer models don't have the testing behind them.

I am also concerned that you are running dedupe without a dedupe metadata special vdev - but if it's working for you, that's great.
There is a thread somewhere here about appropriate SLOGs, but the précis is: Optane (and not the M10 variant) - you want a 900P or better. Other options are the RMS200 or RMS300, but I think you need to be cautious about the amount of memory on them.
[Optanes also make great dedupe metadata vdevs]

Lastly, iSCSI on Z3 - that's a lot of IOPS you don't have. I guess the VMs are not that busy. iSCSI should be on mirrors for proper performance.
We can definitely swap the SLOG over to an Optane PCIe drive, but on the point about not running iSCSI on Z3 - we recently lost an entire pool to a double drive failure, which is how we ended up rebuilding on Z3 in the first place.

Would you recommend changing the existing pool from the 7-drive Z3 layout to a stripe across (3) 2-disk mirrors with a hot spare, and disabling deduplication?
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Is the existing pool working well for you? It's definitely not optimised, but if it's working then great. However, if you are planning SSDs because the pool performance isn't ideal, then your Z3 is one reason for that. Dedupe being on will be another reason, as the DDTs have to be stored on the HDD pool, which I would have thought would chew up most of your IOPS.
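
If you want to see how much the dedupe table is actually costing you, the DDT statistics are visible with something like this (pool name is a placeholder):

  # Show dedup table (DDT) statistics and histogram for the pool
  zpool status -D tank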

SSDs don't fail the way that HDDs do (generally) - they tend to fail suddenly. Also, rebuilding a vdev isn't so much of a strain on the remaining SSD(s) in the vdev, because it's just reading those devices and only writing to the new one. So SSD mirrors are good.

In general terms, pools for iSCSI and virtual disks should be mirrors, with a SLOG and, if using dedupe, with a dedupe metadata special as well.

How much capacity do you need in the pool for iSCSI? If you need 30+ TB of space for VHDs then SSDs are going to be an expensive proposition. Also, are you making (what I consider) the classic mistake of virtualising a file server and storing the files that way?

My general advice would be to do something like the following:

1. Have two pools for VMs. Pool1 of HDDs in mirrors (possibly 3-way mirrors, HDDs are cheap(ish)) + SLOG + dedupe metadata special (if using dedupe). Pool2 of SSDs in mirrors + SLOG + dedupe metadata special (if using dedupe). There's a rough sketch of this below.
2. Split the VMs into two groups. SSDs get VMs needing high IOPS (databases and such). HDDs get VMs not needing high IOPS (domain controllers and such). Other stuff, you will have to decide.
3. If you have a file server that's virtualised, then put that function on either another NAS or on this NAS on a third pool (in RAIDZn) with no SLOG and a dedupe metadata special (if using dedupe). File serving does not need expensive disk.
4. Make sure the NAS has at least two NICs - one dedicated to iSCSI, on its own VLAN/switch (ESX hosts should match this), and the other for all other traffic.

This should reduce the total space required on SSD (expensive) and make things more practical and performant. The issue here is not so much disk space as slots for disks - but you are already considering an extra shelf. Your basic hardware seems good, with loads of RAM, good CPUs and lots of PCIe lanes for NVMe SLOG and metadata specials. The only issue I see with the hardware is that it isn't exactly new, given the CPUs were discontinued in Q2 2015 and use DDR3.
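
As a very rough sketch of point 1 (illustrative only - all device and pool names are placeholders, and you'd size and choose the actual disks yourself):

  # Pool1 - HDD mirrors for the less demanding VMs, plus SLOG and dedupe metadata special
  zpool create pool1 mirror da0 da1 mirror da2 da3 mirror da4 da5 \
    log nvd0 dedup mirror nvd1 nvd2

  # Pool2 - SSD mirrors for the IOPS-hungry VMs, plus its own SLOG
  zpool create pool2 mirror da6 da7 mirror da8 da9 log nvd3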

For more specific advice I would need to know a lot more details - and I mean a lot, much more than should be put on a public forum.

As a question, however, because I am intrigued: do you have sync on or off on the iSCSI zvol?
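
Checking is just (substitute your own pool/zvol name):

  # Show the sync setting on the iSCSI zvol
  zfs get sync tank/iscsi-zvol

sync=standard honours sync requests from the initiator, sync=always forces every write through the ZIL/SLOG, and sync=disabled skips it entirely - fast, but unsafe for VM block storage.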
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
You did not specify what constraints you are running into. Can you be more specific about that?
 

jenksdrummer

Patron
Joined
Jun 7, 2011
Messages
250
Dedupe without an SSD or better being used as a special vdev will give you a bad time. Even with one, it can still induce pain over time.

RAIDZ* setups are not built for IOPS; mirrored pairs - effectively RAID 10 - deliver.

A SLOG isn't a requirement. Data gets co-written to RAM as well as to the SLOG, before it does an organized write-blast to the underlying vdevs. As a result you need something fast, with crazy levels of write endurance/performance, for a SLOG, and it would only be beneficial should there be an abend with in-flight data. To that end, a smaller SLOG is best, like at most 2x RAM, then under-provision it to 25%... so it can burn up cells and not die. Not sure what happens when a SLOG vdev dies...

Also, Optane is mentioned; Intel has ended the Optane line.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
You can still get Optane - but it is becoming more difficult - and they haven't got to silly prices yet (other than the ones that were already at silly prices).
If a SLOG dies (assuming not at an unexpected reboot), then the pool will just continue, but without the effect of the SLOG.

@jenksdrummer - your comment "and it would only be beneficial should there be an abend with in-flight data" is not actually correct. The way sync writes work is that the data has to be written to a permanent storage device - which would normally be the pool. The data is held in memory, but to acknowledge the write the data has to be written somewhere it won't vanish in the event of a power loss. A SLOG acts as an alternative to the pool for that log - so it needs (as you say) to be very fast, low latency and high endurance, with PLP. This will significantly accelerate sync writes, but not to the point/speed of async writes.
[edit - I don't think I explained that very well]
 

jenksdrummer

Patron
Joined
Jun 7, 2011
Messages
250
You can still get Optane - but it is becoming more difficult - and they haven't got to silly prices yet (other than the ones that were already at silly prices).
If a SLOG dies (assuming not at an unexpected reboot), then the pool will just continue, but without the effect of the SLOG.

@jenksdrummer - your comment "and it would only be beneficial should there be an abend with in-flight data" is not actually correct. The way sync writes work is that the data has to be written to a permanent storage device - which would normally be the pool. The data is held in memory, but to acknowledge the write the data has to be written somewhere it won't vanish in the event of a power loss. A SLOG acts as an alternative to the pool for that log - so it needs (as you say) to be very fast, low latency and high endurance, with PLP. This will significantly accelerate sync writes, but not to the point/speed of async writes.
[edit - I don't think I explained that very well]
Great reply, explained well
 

HANDLEric

Dabbler
Joined
May 6, 2019
Messages
47
You did not specify what constraints you are running into. Can you be more specific about that?
I think right now we're just hitting the limitations of a basic spinning-disk array: latency is going through the roof and speeds are plummeting through the floor, to the point that some of our more sensitive Linux systems crash. I think it's honestly just time to catch up a bit.
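
If it's useful, I can pull per-device latency numbers with something like the following (pool name is ours; the -l latency columns need a reasonably recent OpenZFS):

  # Per-vdev I/O and latency statistics, refreshed every 5 seconds
  zpool iostat -v -l tank 5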
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
I think right now we're just hitting the limitations of a basic spinning-disk array: latency is going through the roof and speeds are plummeting through the floor, to the point that some of our more sensitive Linux systems crash. I think it's honestly just time to catch up a bit.

The problem with a RAIDZ pool for VMs is that it generally has the write performance of the slowest single component device in the pool, and gains only marginally more than that single device in read performance. A simple mirror starts at 2x read and 1x write, as each mirror component can seek to a different LBA on read. Add a second mirror vdev and you get a force multiplier - 4x read and 2x write - and it goes up from there as you add vdevs.

You can do something similar by adding RAIDZ2 vdevs, but you end up needing a lot more devices and the math doesn't work out quite as cleanly. Basically you create a RAIDZ2, add another RAIDZ2 vdev, and get a bit over 2x read and something slightly less than 2x write. But you start at 8 disks and add 4 devices per vdev to get the gains. For VMs you're better off going the multi-mirror route and getting the fault tolerance you need with massive I/O gains on reads.
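
Rough back-of-napkin numbers just to make that scaling concrete, assuming something like 100-200 random IOPS per 7,200 RPM disk (a generic figure, not a measurement of your drives):

  7-disk RAIDZ3 (one vdev):    ~1x  ->  roughly 100-200 random write IOPS for the whole pool
  3x 2-way mirrors (3 vdevs):  ~3x write / ~6x read  ->  roughly 300-600 write, 600-1,200 read IOPS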

You probably need to look at enterprise NVMe drives for a commercial production environment. You'll have to ditch the disk shelf and go direct-attach, but consider... 15 TB NVMe U.2 drives are $1400 these days, and I've personally held a 25 TB 2.5" U.2 in my hand. Bigger stuff (61 TB!) is in the R&D pipe; all it takes is $$$$. I expect spinning disks to hold out as bulk storage until about 2030, and then disappear. NVMe drives are pushing HDDs out of VM block storage now.
 

HANDLEric

Dabbler
Joined
May 6, 2019
Messages
47
The problem with a RAIDZ pool for VMs is that it generally has the write performance of the slowest single component device in the pool, and gains only marginally more than that single device in read performance. A simple mirror starts at 2x read and 1x write, as each mirror component can seek to a different LBA on read. Add a second mirror vdev and you get a force multiplier - 4x read and 2x write - and it goes up from there as you add vdevs.

You can do something similar by adding RAIDZ2 vdevs, but you end up needing a lot more devices and the math doesn't work out quite as cleanly. Basically you create a RAIDZ2, add another RAIDZ2 vdev, and get a bit over 2x read and something slightly less than 2x write. But you start at 8 disks and add 4 devices per vdev to get the gains. For VMs you're better off going the multi-mirror route and getting the fault tolerance you need with massive I/O gains on reads.

You probably need to look at enterprise NVMe drives for a commercial production environment. You'll have to ditch the disk shelf and go direct-attach, but consider... 15 TB NVMe U.2 drives are $1400 these days, and I've personally held a 25 TB 2.5" U.2 in my hand. Bigger stuff (61 TB!) is in the R&D pipe; all it takes is $$$$. I expect spinning disks to hold out as bulk storage until about 2030, and then disappear. NVMe drives are pushing HDDs out of VM block storage now.
Makes sense and totally understood - from our perspective it's also necessary to consider the bus speed and throughput of a single disk as we look at HDD vs. SSD vs. NVMe, since that base per-disk speed is what the mirror multipliers act on. I think we just want to make sure that we're making the right decisions.

It's also worth noting that this exact deployment ran like a dream with nearly the same I/O load two or so years ago, which is what's throwing us off. Either way, we just want to improve.
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
It's also worth noting that this exact deployment ran like a dream with nearly the same I/O load two or so years ago, which is what's throwing us off. Either way, we just want to improve.

Was that before or after you moved from mirrors to RAIDZ (per your November '22 post in this thread)? It's quite a big difference in performance for iSCSI block storage.
 

HANDLEric

Dabbler
Joined
May 6, 2019
Messages
47
Was that before or after you moved from mirrors to RAIDZ (per your November '22 post in this thread)? It's quite a big difference in performance for iSCSI block storage.
We were actually on RAIDZ the entire time; however, we started on RAIDZ1 (single parity) and moved to RAIDZ3 (triple parity), which should have added a hair of read penalty but no additional write penalty, if I'm not mistaken. We were already seeing performance degradation prior to that change, which then just exacerbated it - thus the thought of transitioning to newer/faster hardware.
 