Slow transfers via FC when SLOG and sync enabled

CLEsportsfan

Dabbler
Joined
Nov 23, 2020
Messages
20
I have a Dell R720 server that I'm using as an ESXi datastore for a cluster of two servers. The ESXi boxes will be connected to my TrueNAS Core box via 8Gb Fibre Channel. Right now, I'm using some older Samsung 480GB SSDs (SATA), but I will be buying 10K spinning-disk SAS drives if this proof of concept works.

I enabled the FC cards via this post on the forum and the ESXi servers are able to see the zvol across the fibre, no problem. Not sure what speeds I should be seeing, but I was averaging about 134 MB/sec when copying locally from ESXi to the attached shared storage on TrueNAS via FC. On the ZFS pool, I have an Intel DC S3700 as the SLOG drive, but I noticed there was zero activity on that drive when copying the files. I thought the SLOG drive would be utilized automatically since it's not NFS connected but I guess not.

So, I ran zfs set sync=always on my ZFS pool and the performance dropped by about 50% to roughly 63 MB/sec, and the SLOG drive was now being used. Since this will be used for virtual machines, I'd need to have sync=always, otherwise they could get corrupted, right?
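For reference, the commands I ran were along these lines ("tank" is a stand-in for my actual pool name), with a second shell watching the log vdev while the copy was running:
Code:
# force sync writes for everything in the pool
zfs set sync=always tank
zfs get sync tank

# watch per-vdev activity (log vdev included), refreshing every 5 seconds
zpool iostat -v tank 5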

What things should I check on my TrueNAS server to increase performance? Would getting a better SLOG drive change things much?
 

jenksdrummer

Patron
Joined
Jun 7, 2011
Messages
250
SLOG is more or less an unorganized write-commit log that mirrors what has been written to RAM but not yet to disk; the actual writes to disk happen in ZFS about every 5 seconds and get organized into larger stripes. That 5-second gap is where a SLOG *could* (well, ideally... should) prevent data corruption, in that, as I understand it, it basically acts as a disk-based mirror of RAM. It will absolutely slow down your writes, but that's the penalty for pretty well guaranteed data.
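(If I have the tunables right, that roughly-5-second flush interval is visible as a sysctl on FreeBSD - purely illustrative:)
Code:
# seconds between transaction group commits to the pool disks
sysctl vfs.zfs.txg.timeout
vfs.zfs.txg.timeout: 5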

&& All of the above is "as I understand it" which can be absolutely wrong &&

Opinion: Odds are slim that data corruption will occur if one has server-grade hardware involved, a good backup, and is running Release + 1 versioning for the OS. I.e., I wouldn't really trust RELEASE, but I would put a bit more trust in U1 if it's a business-critical workload.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Since this will be used for virtual machines, I'd need to have sync=always, otherwise they could get corrupted, right?
True

What things should I check on my TrueNAS server to increase performance? Would getting a better SLOG drive change things much?
Look at your pool design, and review the performance options for SLOG if your pool is fast enough for it to matter.

I thought the SLOG drive would be utilized automatically since it's not NFS connected but I guess not.
Did you add it to your pool? That won't be automatic.
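Normally that's done through the UI (roughly Storage > Pools > gear icon > Add Vdevs on Core), but you can confirm from the shell whether a log vdev is actually attached - something like this, with the pool name as a placeholder:
Code:
zpool status tank
# a SLOG shows up under its own "logs" heading in the vdev tree, e.g.
#   logs
#     gptid/xxxxxxxx    ONLINE       0     0     0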

I'm not sure what you mean by it not being NFS connected.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
FC, like iSCSI, runs through the ctl (CAM target layer) service which does not issue sync writes by default.
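If you want to see what the zvol backing your FC extent is currently doing, the property is easy to check (dataset name here is a placeholder):
Code:
zfs get sync tank/esxi-zvol
# sync=standard  -> ctl's writes land as async writes
# sync=always    -> every write goes through the ZIL (and your SLOG)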

So let's pick at a few things in your POC setup.

TrueNAS-12.0-RELEASE - Core
Dell R720 w/ dual Intel(R) Xeon(R) CPU E5-2609 v2, 48gb RAM
Two LSI SAS 9207-8i cards
Two QLogic QLE2562 8Gb fibre channel cards
14x Samsung SSD SM843T 480gb in RAID-z3 array + 1 spare
Intel DC S3700 100gb for SLOG

First off, good chassis for a build. I'm guessing it's a 16-bay system based on drive count.

Add more RAM - as much as you can fit/afford based on DIMM slot count and density. 256GB is a good target here.

SAS and FC cards are good but we may end up pulling something if you need a PCIe slot here.

For storage, your SM843T's are going to give you way better performance than 10K SAS drives. Bear that in mind before you get too attached to the idea of cheap capacity. Consider sticking with the NAND. If there's budget/capacity/other reasons for going with spinners, I'd suggest we build in considerations for some flash for L2ARC and metadata vdevs.

As mentioned by @sretalla a single 14-wide RAIDZ3 isn't good for VM performance, even with SSDs. Mirrors will be best, but if you stick with SSDs, I suppose you could do very narrow RAIDZ1 (5x 3-wide) if the space is an absolute concern and there's no dollars for upgraded drives; although you'll be giving up performance for non-guaranteed returns on your space efficiency (small records like those used by block storage often don't mix well with RAIDZ and end up being as space-hungry as mirrors, depending on recordsize and RAIDZ width). Absolutely do not use RAIDZ if you're limited to spinning disks.
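If you're curious how that space amplification looks on your existing 14-wide RAIDZ3 zvol, compare logical vs. allocated space - roughly like this, dataset name being a placeholder:
Code:
zfs get volblocksize,logicalused,used,compressratio tank/esxi-zvol
# on wide RAIDZ with a small volblocksize, "used" can balloon well past
# "logicalused" due to padding/parity overhead on small blocks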

Finally, that S3700 was a great SLOG circa 2014; it's definitely not up to par today. If you can accept the possibility of downtime for a swap and can spare a PCIe slot, you'll make significant performance gains by adding an Optane card internally. If you've got to stick with SAS, pick up one of the 10 DWPD HGST DC SS 530 drives (search for "WUSTM3240ASS200") as they're shockingly fast for SAS. You'll lose a bit downshifting to 6G SAS vs the 12G that the drive can do, but they're top-class SAS SSDs (think 70-80% of an Optane)

And last but definitely not least - welcome to the forums! :)
 

CLEsportsfan

Dabbler
Joined
Nov 23, 2020
Messages
20
Thanks guys! I'm new to FreeNAS/TrueNAS and appreciate each response. Trying to grasp all you guys have written so far so bear with me.

I know spinning disks will slow things down, but I'm hoping with a working SLOG it won't seem as bad (wishful thinking?). Budget is tight and this is supposedly a "temporary" setup until sometime next year when we have the funds to add an enclosure to our current SAN. The SM843s I have to test here aren't great, as you can see below (the first diskinfo output is the Intel S3700 SLOG for comparison; the second is one of the SM843Ts).
Code:
root@truenas[~]# diskinfo -wS /dev/da15
/dev/da15
512 # sectorsize
100030242816 # mediasize in bytes (93G)
195371568 # mediasize in sectors
0 # stripesize
0 # stripeoffset
12161 # Cylinders according to firmware.
255 # Heads according to firmware.
63 # Sectors according to firmware.
ATA INTEL SSDSC2BA10 # Disk descr.
BTTV341009N0100FGN # Disk ident.
Yes # TRIM/UNMAP support
0 # Rotation rate in RPM
Not_Zoned # Zone Mode

Synchronous random writes:
0.5 kbytes: 155.9 usec/IO = 3.1 Mbytes/s
1 kbytes: 153.5 usec/IO = 6.4 Mbytes/s
2 kbytes: 154.6 usec/IO = 12.6 Mbytes/s
4 kbytes: 158.0 usec/IO = 24.7 Mbytes/s
8 kbytes: 166.4 usec/IO = 47.0 Mbytes/s
16 kbytes: 193.7 usec/IO = 80.7 Mbytes/s
32 kbytes: 229.5 usec/IO = 136.1 Mbytes/s
64 kbytes: 313.5 usec/IO = 199.4 Mbytes/s
128 kbytes: 624.1 usec/IO = 200.3 Mbytes/s
256 kbytes: 1244.6 usec/IO = 200.9 Mbytes/s
512 kbytes: 2480.9 usec/IO = 201.5 Mbytes/s
1024 kbytes: 4954.7 usec/IO = 201.8 Mbytes/s
2048 kbytes: 9962.1 usec/IO = 200.8 Mbytes/s
4096 kbytes: 19951.0 usec/IO = 200.5 Mbytes/s
8192 kbytes: 39986.3 usec/IO = 200.1 Mbytes/s
root@truenas[~]# diskinfo -wS /dev/da14
/dev/da14
512 # sectorsize
480103981056 # mediasize in bytes (447G)
937703088 # mediasize in sectors
0 # stripesize
0 # stripeoffset
58369 # Cylinders according to firmware.
255 # Heads according to firmware.
63 # Sectors according to firmware.
ATA VK0480GDPVT # Disk descr.
S1CYNYAF106147 # Disk ident.
Yes # TRIM/UNMAP support
0 # Rotation rate in RPM
Not_Zoned # Zone Mode

Synchronous random writes:
0.5 kbytes: 2046.3 usec/IO = 0.2 Mbytes/s
1 kbytes: 2123.7 usec/IO = 0.5 Mbytes/s
2 kbytes: 2081.2 usec/IO = 0.9 Mbytes/s
4 kbytes: 1956.0 usec/IO = 2.0 Mbytes/s
8 kbytes: 2768.7 usec/IO = 2.8 Mbytes/s
16 kbytes: 2807.0 usec/IO = 5.6 Mbytes/s
32 kbytes: 2866.6 usec/IO = 10.9 Mbytes/s
64 kbytes: 8833.1 usec/IO = 7.1 Mbytes/s
128 kbytes: 3445.1 usec/IO = 36.3 Mbytes/s
256 kbytes: 3914.4 usec/IO = 63.9 Mbytes/s
512 kbytes: 10128.3 usec/IO = 49.4 Mbytes/s
1024 kbytes: 29501.6 usec/IO = 33.9 Mbytes/s
2048 kbytes: 148797.5 usec/IO = 13.4 Mbytes/s
4096 kbytes: 142033.1 usec/IO = 28.2 Mbytes/s
8192 kbytes: 368267.3 usec/IO = 21.7 Mbytes/s
FC, like iSCSI, runs through the ctl (CAM target layer) service which does not issue sync writes by default.
So I'll need to run zfs set sync=always on the ZFS side for my VM datastore then, or does it not matter as much since FC is block-accessed like iSCSI? If I can leave it "standard", I think I'm OK. Setting it to always obviously kills the speed. Would you consider this to be the biggest slow-down right now?

@sretalla, I was getting NFS mixed up with iSCSI during all my reading on which one needed sync set. And yes, the SLOG drive was added to the pool as a log vdev. Thanks for those links too.

Finally, that S3700 was a great SLOG circa 2014; it's definitely not up to par today. If you can accept the possibility of downtime for a swap and can spare a PCIe slot, you'll make significant performance gains by adding an Optane card internally. If you've got to stick with SAS, pick up one of the 10 DWPD HGST DC SS 530 drives (search for "WUSTM3240ASS200") as they're shockingly fast for SAS. You'll lose a bit downshifting to 6G SAS vs the 12G that the drive can do, but they're top-class SAS SSDs (think 70-80% of an Optane)
Yes, I am planning on switching the SLOG SSD to something else; it just so happened I had an S3700 here and read an (older) post that said it works well as an SLOG drive. Thank you very much for the suggestion on that drive, I hadn't come across that specific one yet and I'll add it to my list. The ones I was considering were:
  • Hitachi HUSSL4020ASS600 200gb SSD (SLC)
  • Samsung 883 DCT Series SSD 480gb (MLC V-NAND)
  • Intel SSD D3-S4610 480gb (TLC 3D NAND)
For storage, your SM843T's are going to give you way better performance than 10K SAS drives. Bear that in mind before you get too attached to the idea of cheap capacity. Consider sticking with the NAND. If there's budget/capacity/other reasons for going with spinners, I'd suggest we build in considerations for some flash for L2ARC and metadata vdevs.

As mentioned by @sretalla a single 14-wide RAIDZ3 isn't good for VM performance, even with SSDs. Mirrors will be best, but if you stick with SSDs, I suppose you could do very narrow RAIDZ1 (5x 3-wide) if the space is an absolute concern and there's no dollars for upgraded drives; although you'll be giving up performance for non-guaranteed returns on your space efficiency (small records like those used by block storage often don't mix well with RAIDZ and end up being as space-hungry as mirrors, depending on recordsize and RAIDZ width). Absolutely do not use RAIDZ if you're limited to spinning disks.
I need a minimum of 7TB of space currently, which means I need at least 8.75TB available to stay under 80%, right? I meant to pick RAID-Z2 for this PoC when I built this array but selected the wrong option. Is there a good RAID calculator site out there that would help me figure out what size drives I'd need if I used multiple mirrors or RAIDZ vdevs in my pool? It'll start to get confusing if there are multiple. For the space I need, the SSDs I have now won't cut it (plus they're slow). I could add SSDs for L2ARC and metadata vdevs if it would help spinning disks and keep prices down. Considering all options.

Thanks for all the help!!
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Thanks guys! I'm new to FreeNAS/TrueNAS and appreciate each response. Trying to grasp all you guys have written so far so bear with me.

I know spinning disks will slow things down, but I'm hoping with a working SLOG it won't seem as bad (wishful thinking?). Budget is tight and this is supposedly a "temporary" setup until sometime next year when we have the funds to add an enclosure to our current SAN. The SM843s I have to test here aren't great, as you can see below (the first diskinfo output is the Intel S3700 SLOG for comparison; the second is one of the SM843Ts).

SLOG will help absorb your writes (to a degree) but it won't do anything to help with reads - that's where you'll need lots of RAM (for primary ARC) and a bit of help from more flash on the read side (L2ARC)

Those SM843T drives are just turning in awful sync-write numbers; they do have power-loss capacitors so they shouldn't be that bad. Even if sync writes are slow at small blocks, they should pick up at higher sizes. That feels like the drive write cache is disabled or something - are your two LSI 9207-8i's in pure IT firmware mode and set to enable on-disk caches (not "controller cache")?

So I'll need to run zfs set sync=always on the ZFS side for my VM datastore then, or does it not matter as much since FC is block-accessed like iSCSI? If I can leave it "standard", I think I'm OK. Setting it to always obviously kills the speed. Would you consider this to be the biggest slow-down right now?

You'll have to force sync=always here - with NFS it isn't necessary (ESXi already requests sync writes over NFS), but for FC and iSCSI it is.
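If you'd rather not slow down everything else on the pool, you can scope the property to just the zvol behind the FC extent - roughly like this, with placeholder names:
Code:
zfs set sync=always tank/esxi-zvol
# confirm what's set where
zfs get -r sync tank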

Yes, I am planning on switching the SLOG SSD to something else; it just so happened I had an S3700 here and read an (older) post that said it works well as an SLOG drive. Thank you very much for the suggestion on that drive, I hadn't come across that specific one yet and I'll add it to my list. The ones I was considering were:
  • Hitachi HUSSL4020ASS600 200gb SSD (SLC)
  • Samsung 883 DCT Series SSD 480gb (MLC V-NAND)
  • Intel SSD D3-S4610 480gb (TLC 3D NAND)
Drop the Samsung entirely, and consider the Hitachi HUSMH/HUSMM series (Ultrastar SSD1600) instead of the older HUSSL. The D3-S4610 is still a "mixed use" SSD - you really want to go for the ones labeled "Write Intensive" - even a larger S3700 or S3710 would do better. Also an option is the Toshiba PX04SM/PX05SM - again, look for the "write intensive" or "10DWPD" models. You need to be able to handle huge amounts of sustained writes for an SLOG device.

I need a minimum of 7TB of space currently, which means I need at least 8.75TB available to stay under 80%, right? I meant to pick RAID-Z2 for this PoC when I built this array but selected the wrong option. Is there a good RAID calculator site out there that would help me figure out what size drives I'd need if I used multiple mirrors or RAIDZ vdevs in my pool? It'll start to get confusing if there are multiple. For the space I need, the SSDs I have now won't cut it (plus they're slow). I could add SSDs for L2ARC and metadata vdevs if it would help spinning disks and keep prices down. Considering all options.

Thanks for all the help!!

A former forum member made this calculator - I'm not sure if it's still the new hotness, but it's what I have bookmarked. (It doesn't seem to do decimals properly though, and there are lots of 10K SAS 2.5" drives that come in 1.x TB sizes.)


Bear in mind that it's not a magic "works/doesn't work" switch at 80% utilization - performance in ZFS is highly correlated with free space on the pool, and as that free space drops so does performance. See discussion point #6 in the "Path to success for block storage" document linked previously.
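Keeping an eye on that is straightforward - capacity and fragmentation are both reported per pool, for example:
Code:
zpool list -o name,size,allocated,free,fragmentation,capacity tank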

Assuming 16 bays, I'd count on 12 bays for data vdevs, two bays for a metadata mirror (use a fast SSD for this) and two more that are either a set of mirrored SLOGs (with an internal NVMe L2ARC device) or an L2ARC + hotspare (if you're able to go Optane/other fast NVMe SLOG)
 

CLEsportsfan

Dabbler
Joined
Nov 23, 2020
Messages
20
Those SM843T drives are just turning in awful sync-write numbers; they do have power-loss capacitors so they shouldn't be that bad. Even if sync writes are slow at small blocks, they should pick up at higher sizes. That feels like the drive write cache is disabled or something - are your two LSI 9207-8i's in pure IT firmware mode and set to enable on-disk caches (not "controller cache")?
Yes, I am running the IT firmware mode, version 20.00.07.00. I couldn't access the settings from some LSI tool at bootup, but was able to access them through the Dell BIOS. On all the disks, the write cache is enabled. I didn't see any settings for controller cache so I'm not sure how to check that. The S3700 I tested was through the same card as the SM843T.

Drop the Samsung entirely, and consider the Hitachi HUSMH/HUSMM series (Ultrastar SSD1600) instead of the older HUSSL. The D3-S4610 is still a "mixed use" SSD - you really want to go for the ones labeled "Write Intensive" - even a larger S3700 or S3710 would do better. Also an option is the Toshiba PX04SM/PX05SM - again, look for the "write intensive" or "10DWPD" models. You need to be able to handle huge amounts of sustained writes for an SLOG device.
Thanks. I'll look for prices on the newer Hitachi drives. The others were reasonably priced at under $200. Hard to find SAS SSDs without paying an arm & a leg.

Assuming 16 bays, I'd count on 12 bays for data vdevs, two bays for a metadata mirror (use a fast SSD for this) and two more that are either a set of mirrored SLOGs (with an internal NVMe L2ARC device) or an L2ARC + hotspare (if you're able to go Optane/other fast NVMe SLOG)
Found a good RAID calculator here (assuming it's accurate) that helps me figure out multi-level RAIDs like you were suggesting. For 12 drives I could do four groups of 3-drive RAID-Z1 with 1.8TB drives, and it seems like that would be enough space. TrueNAS would stripe data writes across all 4 groups and perform better than a 14-drive RAID-Z2? I have open PCIe slots but the Optane cards I've seen thus far are a little pricey. I'll look around for an older, smaller-capacity model if they're out there. If all NVMe sticks are the same, those seem cheaper if there's a PCIe adapter for them that works in FreeBSD. Thank you for all the help!
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Yes, I am running the IT firmware mode, version 20.00.07.00. I couldn't access the settings from some LSI tool at bootup, but was able to access them through the Dell BIOS. On all the disks, the write cache is enabled. I didn't see any settings for controller cache so I'm not sure how to check that. The S3700 I tested was through the same card as the SM843T.

If you do a camcontrol identify daX against them, do you get something along the lines of the following in the output:

Code:
Feature                      Support  Enabled   Value           Vendor
read ahead                     yes      yes
write cache                    yes      yes
flush cache                    yes      yes


They might not get the write cache toggled on until they're put in a pool as well.

Thanks. I'll look for prices on the newer Hitachi drives. The others were reasonably priced at under $200. Hard to find SAS SSDs without paying an arm & a leg.

As the saying goes, "speed costs, how fast do you want to go?" You should be able to get 200G Ultrastar SSDs for around USD$50/ea as datacenter pulls. Mind their writes, as they've likely had rough lives, but also know that the "write intensive" ones go happily into the petabytes-written stage before failing gracefully.
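When pulls like that arrive, it's worth checking how hard a life they've had before trusting them with log/meta duty - something along these lines (device name is a placeholder):
Code:
smartctl -a /dev/da16 | grep -iE 'percentage used|wear|written'
# SAS SSDs typically report a "Percentage used endurance indicator";
# big lifetime-write totals are fine as long as that number stays low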

Found a good RAID calculator here (assuming it's accurate) that helps me figure out multi-level RAIDs like you were suggesting. For 12 drives I could do four groups of 3-drive RAID-Z1 with 1.8TB drives, and it seems like that would be enough space. TrueNAS would stripe data writes across all 4 groups and perform better than a 14-drive RAID-Z2?

I'll stress this again - don't do RAIDZ-anything with spinning disks for block storage. You'll have a bad time when it comes to the random access. Build mirrors, buy a JBOD if you have to for extra slots. 12x1.8T at 80% used gives you 7.5T of available space - and this is before compression which could easily net you another 25% if not more.

Cutting corners here will hurt the overall performance and there's no way to change a vdev layout after the fact.

I have open PCIe slots but the Optane cards I've seen thus far are a little pricey. I'll look around for an older, smaller-capacity model if they're out there. If all NVMe sticks are the same, those seem cheaper if there's a PCIe adapter for them that works in FreeBSD. Thank you for all the help!

See above re: the speed:price relationship. For a production system I wouldn't say anything under the P4801X 100G should be considered, the "consumer" cards won't have the write endurance to hold up, and neither will they have the warranty support. If this was a homelab I'd consider the 900p/905p - and if your risk tolerance is there you could do this as well, since you could probably afford mirrored consumer Optane vs. single enterprise Optane.

For PCIe to M.2 slot adaptors, they're all well-supported since they're essentially just traces on a PCB, but confirm that your card will physically fit the largest M.2 22110 form factor (110mm), as many are designed to hold only the 80mm "consumer grade" cards at most, and the P4801X is M.2 22110.
 

jenksdrummer

Patron
Joined
Jun 7, 2011
Messages
250
SLOG will help absorb your writes (to a degree) but it won't do anything to help with reads - that's where you'll need lots of RAM (for primary ARC) and a bit of help from more flash on the read side (L2ARC)

My understanding of a SLOG is that it does NOT contribute to write speeds, but it does contribute to data integrity in the gap between data written to RAM and data written to disk, which happens on a timer to help consolidate writes into larger stripes. I.e., what I've read is that a SLOG will never increase the speed of a pool.
 

CLEsportsfan

Dabbler
Joined
Nov 23, 2020
Messages
20
If you do a camcontrol identify daX against them, do you get something along the lines of the following in the output:

Code:
Feature                      Support  Enabled   Value           Vendor
read ahead                     yes      yes
write cache                    yes      yes
flush cache                    yes      yes


They might not get the write cache toggled on until they're put in a pool as well.
OK, here's what camcontrol identify da15 said for the SM843T drives:
Code:
Feature                      Support  Enabled   Value           Vendor
read ahead                     yes      yes
write cache                    yes      yes
flush cache                    yes      yes
So really, it appears those SM843T drives just stink after all. Same setting on the Intel DC S3700s, which performed better.

As the saying goes, "speed costs, how fast do you want to go?" You should be able to get 200G Ultrastar SSDs for around USD$50/ea as datacenter pulls. Mind their writes, as they've likely had rough lives, but also know that the "write intensive" ones go happily into the petabytes-written stage before failing gracefully.
Wow, that's a great going price for those drives. I'll hunt around eBay for the model number you posted earlier. That'll help with the tight budget for everything else we'll need to buy.
I'll stress this again - don't do RAIDZ-anything with spinning disks for block storage. You'll have a bad time when it comes to the random access. Build mirrors, buy a JBOD if you have to for extra slots. 12x1.8T at 80% used gives you 7.5T of available space - and this is before compression which could easily net you another 25% if not more.

Cutting corners here will hurt the overall performance and there's no way to change a vdev layout after the fact.
Budget for this project is no more than $4k USD for everything I'll need including RAM, cache devices, etc., so not a ton to work with. RAIDZ would have helped with drive costs, but I don't want to sacrifice any more speed than I already will be with spinners. So, a bunch of mirrors it is! With my budget, I don't think it'll be possible to use all SSDs. There are 2.4TB 10K drives available, so I might have to go that route over the 1.8TB to give us more usable space and somewhat better performance with more free space available.

After reading this path to block storage success link, I have a question on the free space needed for ZFS as far as performance goes... Does it go by unallocated space on the drives or unused space in allocated sections? So, if I have a 10TB pool and I create an 8TB volume for ESXi, but I only have 5TB of files stored on it, what does ZFS consider my "free space"? Either 2TB unallocated, 3TB unused in the volume, or the 5TB free between drive size and what's used on the drive? I guess based on your response, that will tell me whether it's better to have a volume size closer to the size of the files I'll store on it, or to create the biggest volume I can regardless of how many files I'll have.

See above re: the speed:price relationship. For a production system I wouldn't say anything under the P4801X 100G should be considered, the "consumer" cards won't have the write endurance to hold up, and neither will they have the warranty support. If this was a homelab I'd consider the 900p/905p - and if your risk tolerance is there you could do this as well, since you could probably afford mirrored consumer Optane vs. single enterprise Optane.

For PCIe to M.2 slot adaptors, they're all well-supported since they're essentially just traces on a PCB, but confirm that your card will physically fit the largest M.2 22110 form factor (110mm), as many are designed to hold only the 80mm "consumer grade" cards at most, and the P4801X is M.2 22110.
I don't see this as being a terribly busy box for our ESXi datastore, but then again, with needing sync=always, my SLOG will probably be busier than I think. All this box will need to do is get us by for a year at the most until we can expand our existing SAN.
Thank you for the info on the PCIe to M.2 slot adaptors. Since this is a full server, I would hope the adaptors fit but I'll measure to be sure. Is it only the SLOG vdev that I have to worry about having power loss protection or is it also L2ARC and metadata vdevs? For a metadata vdev, would an Intel DC S3700 like I have be ok for that, or should it be an NVMe like SLOG and L2ARC?

Thanks again for all your help and input!
 

CLEsportsfan

Dabbler
Joined
Nov 23, 2020
Messages
20
I hope it's OK to post a response asking for feedback on what I'll need before I buy more.
What's left in my budget is $3415.
Spent $285 to bring my RAM total to 192GB.
SLOG: Intel DC P3700 1.6TB - spent $300.
Still to buy:
L2ARC: Ultrastar SSD1600 200/400GB or Toshiba PX04SM 400GB (still recommended with 192GB RAM?) - $50-$100 on eBay
Metadata vdev: Buy a duplicate from above?
Pool drives: 14x 10K drives (7 vdevs of 2-drive mirrors). 1.8TB drives = 11TB (8.8TB usable) or 2.4TB drives = 14.7TB (11.7TB usable)
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Sorry about getting lost on this one. I answered the free space behavior in your other thread; I'll link to it here:


I don't see this as being a terribly busy box for our ESXi datastore, but then again, with needing sync=always, my SLOG will probably be busier than I think. All this box will need to do is get us by for a year at the most until we can expand our existing SAN.

sync=always will definitely cause your SLOG to be under heavy load. The P3700 1.6TB is a good choice from a speed perspective but you'll want to ensure you only set up a very small chunk of it through underprovisioning.

Instructions for that can be found in an Intel PDF: https://www.intel.com/content/dam/w...nd-based-ssds-better-endurance-whitepaper.pdf
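If running the Intel tool turns out to be a hassle from FreeBSD, a simpler way to get much the same effect is to partition only a small slice of the card and leave the rest untouched (as long as the remainder is never written, it effectively acts as extra spare area). A rough sketch, assuming the P3700 shows up as nvd0 - destructive, so double-check the device name first, and note that TrueNAS would normally handle the pool-side steps via the UI:
Code:
gpart create -s gpt nvd0
gpart add -t freebsd-zfs -s 16G -l slog0 nvd0
# remove the old S3700 log and attach just the small partition as the new one
zpool remove tank <old-log-device>
zpool add tank log gpt/slog0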

Since this is intending to replace/supplement a production SAN, you need to consider the impacts of downtime. PCIe SLOGs are definitely faster than the majority of less-expensive SATA/SAS ones, but NVMe hotswap seems to still only be "partial" in 12.0-U1 so you may end up with a downtime if it fails inside of that year - and hotswap is still a U.2 thing, don't try it with a conventional PCIe card.

Thank you for the info on the PCIe to M.2 slot adaptors. Since this is a full server, I would hope the adaptors fit but I'll measure to be sure. Is it only the SLOG vdev that I have to worry about having power loss protection or is it also L2ARC and metadata vdevs? For a metadata vdev, would an Intel DC S3700 like I have be ok for that, or should it be an NVMe like SLOG and L2ARC?

Power-loss-protection is only technically necessary for the SLOG device, and there it's from a performance standpoint - if the drive can safely ignore the cache flush, then it doesn't burn performance trying to do that. Metadata being as critical as it is though, PLP is a good idea there too. L2ARC definitely doesn't need it.

For your meta drives, you do want something fairly fast, so the Ultrastars are good here. I would look for the PX04SS (write-intensive) ones if you can find them, but the SM drives should be okay. Avoid the SV/SR as (unless I'm misremembering) they have poor sync-write performance. That might have been the PX02 series though.

For L2ARC, you might actually be better off to use the same model and size of drives across the board. They'll be overkill for L2ARC duties, but in a pinch you can use it as a spare drive to your meta SSDs. Redundancy, redundancy, redundancy. You're talking about something that's standing in for a year's time for a production SAN.

So to check on the vdev design - you've got sixteen 2.5" bays, correct? If so, I would build in this manner:

Data
12x 2.4TB drives in mirrors. Six 2.4TB mirror vdevs works out to roughly 13T of raw capacity; don't assign more than 10T to respect the 80% threshold, but expect performance to drop off well before that point due to fragmentation. Compression will get you some wins as described in the other thread but you're still on spinning disks.

Meta
2x 400GB SSDs in mirrors. Since you're not (and shouldn't be) using dedup, the "1% of logical data" estimate should be more than enough, even with small blocks, and that would allow for 40T of logical data. I don't think you'll get 4:1 compression, no matter how hyped people are for zstd.

L2ARC
1x 400GB SSD. We'll look at some tunables here (don't L2ARC prefetched data, maybe only feed from MFU tail) to ensure that it gets the high value data and doesn't churn too hard. It will slow down the perceived initial feed but with how "dumb" L2ARC is vs. main ARC, you need to give it all the help it can with the hit rate.

Spare bay + spares kit
Keep one bay open, and buy at least one extra 2.4TB drive and another 400GB SSD. This will let you jump on top of things quickly if there's a failure, even if you need to use remote hands. "Find empty bay. Insert (HDD/SSD) and make sure the light comes on. Thank you."

This also lets you keep a bay free in case Bad Things Happen and you need to push one of the Ultrastar SSDs into emergency SLOG duty until you can schedule downtime to swap the NVMe card.
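If it helps to visualize, the rough CLI equivalent of that layout looks something like the following - device names are placeholders, and in practice you'd build it through the TrueNAS pool wizard rather than by hand:
Code:
zpool create tank \
  mirror da0 da1   mirror da2 da3   mirror da4 da5 \
  mirror da6 da7   mirror da8 da9   mirror da10 da11 \
  special mirror da12 da13 \
  log nvd0p1 \
  cache da14

# the L2ARC behavior tweaks mentioned above map to tunables along these lines
# (exact sysctl names vary by OpenZFS release - verify before setting):
# sysctl vfs.zfs.l2arc.noprefetch=1   # don't feed L2ARC from prefetched data
# sysctl vfs.zfs.l2arc.mfuonly=1      # only feed from the MFU side of ARC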
 

CLEsportsfan

Dabbler
Joined
Nov 23, 2020
Messages
20
Thank you a TON for all your guidance, knowledge and experience! This will help me greatly. I really like your idea of leaving one bay available just in case - you never know when something will go belly-up. I bought an extra SSD, but the extra spinner will have to wait a few months.

Would it be helpful to overprovision the SSDs too? I realized only after I bought them that I ended up with the PX04SV drives instead of the SM that I searched for. Got a great price though, especially for brand new. The P3700 I'll overprovision; one of the threads mentioned setting theirs at 20% or 25%.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I believe the smallest PX04SV is 480G; you could limit it to 400G just for the sake of spare area. They are rated for mixed-use 3 DWPD, so I wouldn't expect them to burn out, but extra endurance never hurts. You can run the SLOG benchmark on a drive to check its sync-write performance as well, but I imagine any kind of decent "enterprise-grade NAND" will be an improvement over meta on HDDs.
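(The benchmark in question is the same diskinfo sync-write test shown earlier in the thread - run it against a drive that isn't in use:)
Code:
diskinfo -wS /dev/daX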
 