
SLOG benchmarking and finding the best SLOG

HoneyBadger

Mushroom! Mushroom!
Joined
Feb 6, 2014
Messages
3,331
3 GB/sec is kinda shorthand for "don't throttle it, and if the SSD doesn't live forever, so be it". It's not a statement of expected data transfer values ;-)
More so that non-Optane SSDs have increased read latencies as the write workload ramps up. The L2ARC write limits are there to protect the read latencies from spiking up too badly.

I can't find any reference to it on GitHub. Any link about what you are talking about?
It doesn't seem to be part of OpenZFS 2.0, nor does it seem to be in development at the moment.
This is likely referring to the "log-based dedup approach" that Matt Ahrens proposed back in 2017:


The good news is that the S3610 just makes your cutoff at 3 DWPD. The S3710 series does 10, though at a 25% premium over the cost of the 3610 series and the largest drive available in the S3710 series is 1.2TB vs. 1.6TB in the 3610 series.

eBay is filled with these kinds of drives now, and for $200 a 1.6TB enterprise-grade drive in a 3-way mirror seems like an OK risk/reward proposition. I may even buy a 4th to have a cold, proven spare here.
You can underprovision the S3610s down to the same 1.2TB and get a bit more effective life out of them if you're really concerned. But given your cold spare and what they cost, I'd say you're fine as long as you can read the media wearout indicators from SMART and replace proactively. Remember, if they're in mirrors, they're all getting the same amount of writes, so they'll burn out at roughly the same time.
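
To put rough numbers on the endurance trade-off above, here is a minimal sketch that turns the DWPD ratings into a total write budget. The 5-year rating period is an assumption (it is not stated in the thread), and the under-provisioning estimate only spreads the same budget over a smaller usable capacity; any extra benefit from reduced write amplification is ignored.

```python
# Rated endurance for the two drive options discussed above.
WARRANTY_YEARS = 5  # ASSUMPTION: rating period, not stated in the thread


def rated_writes_tb(capacity_tb: float, dwpd: float, years: float = WARRANTY_YEARS) -> float:
    """Total terabytes written implied by a DWPD rating over the rating period."""
    return capacity_tb * dwpd * 365 * years


s3610 = rated_writes_tb(1.6, 3)    # 1.6 TB at 3 DWPD
s3710 = rated_writes_tb(1.2, 10)   # 1.2 TB at 10 DWPD

# Under-provisioning the S3610 to 1.2 TB: same write budget, smaller usable
# capacity. Gains from lower write amplification are NOT modelled here.
effective_dwpd = s3610 / (1.2 * 365 * WARRANTY_YEARS)

print(f"S3610 1.6TB: {s3610:,.0f} TB written")
print(f"S3710 1.2TB: {s3710:,.0f} TB written")
print(f"S3610 limited to 1.2TB: about {effective_dwpd:.1f} DWPD")
```

Under these assumptions the under-provisioned S3610 lands well short of the S3710, which fits the "a bit more effective life" wording rather than suggesting the two become equivalent.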
 

rshakin

Junior Member
Joined
Apr 2, 2019
Messages
20
Hey just a short update on my RMS-200 almost a year later. Still chugging along in my system. Wanted to post the SMART STATS for it.

Code:
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
Critical Warning:                   0x00
Temperature:                        46 Celsius
Available Spare:                    0%
Available Spare Threshold:          0%
Percentage Used:                    0%
Data Units Read:                    95,513,598,780 [48.9 PB]
Data Units Written:                 96,577,229,991 [49.4 PB]
Host Read Commands:                 1,148,532,179
Host Write Commands:                1,220,615,011
Controller Busy Time:               430
Power Cycles:                       19
Power On Hours:                     3,293
Unsafe Shutdowns:                   14
Media and Data Integrity Errors:    0
Error Information Log Entries:      1


=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
Critical Warning:                   0x00
Temperature:                        55 Celsius
Available Spare:                    0%
Available Spare Threshold:          0%
Percentage Used:                    0%
Data Units Read:                    95,514,382,172 [48.9 PB]
Data Units Written:                 225,318,828,690 [115 PB]
Host Read Commands:                 1,148,541,024
Host Write Commands:                2,657,674,621
Controller Busy Time:               462,032
Power Cycles:                       30
Power On Hours:                     13,840
Unsafe Shutdowns:                   24
Media and Data Integrity Errors:    0
Error Information Log Entries:      1

Error Information (NVMe Log 0x01, max 63 entries)
No Errors Logged
 

HoneyBadger

Mushroom! Mushroom!
Joined
Feb 6, 2014
Messages
3,331
Data Units Written: 225,318,828,690 [115 PB]
100PB in a year? That's a busy system my friend, and I imagine you've saved the cost of that RMS-200 several times over by now versus conventional NAND (or even Optane) that would have burned out and required a swap. And if your chassis didn't have hotswap NVMe you'd be looking at downtime.
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,371
That's wild - over 13k power-on hours, indicating the thing is not even two years in use, yet 115PB written... so a petabyte every 5 days. WoW!
 

Asteroza

Junior Member
Joined
Feb 12, 2018
Messages
13
That's wild - over 13k power-on hours, indicating the thing is not even two years in use, yet 115PB written... so a petabyte every 5 days. WoW!

Er, the two listings are from one year prior and now, right? So that's 115.0-49.4=65.6PB written, probably less than 100TB read, and 10547 power-on hours (439.5 days, or 62.8 weeks, so a little over a year; the previous post was 15 months ago, so that roughly matches).

Still beastly at 9328 DWPD (roughly 150TB per day...)
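
The arithmetic above can be re-derived from the raw counters instead of the rounded PB figures. This is a minimal sketch, assuming the NVMe convention that one data unit is 1000 blocks of 512 bytes, and assuming a 16 GB device capacity. The capacity is not stated in the thread; it is simply the value that makes the quoted DWPD figure work out.

```python
# Re-derive the write volume between the two SMART snapshots posted above.
DATA_UNIT_BYTES = 512 * 1000   # NVMe "data unit" = 1000 blocks of 512 bytes
CAPACITY_BYTES = 16e9          # ASSUMPTION: 16 GB device, not stated in the thread

# (Data Units Written, Power On Hours) from the earlier and later listings
units_before, hours_before = 96_577_229_991, 3_293
units_after, hours_after = 225_318_828_690, 13_840

bytes_written = (units_after - units_before) * DATA_UNIT_BYTES
days = (hours_after - hours_before) / 24

pb_written = bytes_written / 1e15
tb_per_day = bytes_written / days / 1e12
dwpd = bytes_written / days / CAPACITY_BYTES

print(f"{pb_written:.1f} PB over {days:.1f} days")
print(f"{tb_per_day:.0f} TB/day, about {dwpd:.0f} drive writes per day")
```

Working from the raw counters gives a slightly higher total than subtracting the rounded PB values, but the conclusion is the same: roughly 150TB per day.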
 

rshakin

Junior Member
Joined
Apr 2, 2019
Messages
20
Yep, your math is correct. Very fun little toy, especially for the price; it fits my use case well.
 

CrimsonMars

Junior Member
Joined
Aug 7, 2020
Messages
24
I was benchmarking my SLOG and I see something strange...
My SLOG disks are a pair of 16G Optanes plus four 480G SSDs. Each has two 8G partitions, one for each of my two pools. I get writes of around 200-300, but the disks can do a lot more: in my own tests each disk provides roughly 500Mb/s, while on FreeNAS they show as roughly 200Mbytes/s. Since I only stress one pool at a time, I would expect 2x200+1x130=530 at minimum, but I only get roughly 300 out of them:

The Optanes are NVMe, so they connect directly to the motherboard; the others go through an IBM SAS expander connected to an HP H220.

Does anyone have any clue why this is? The tests were done from VMs over iSCSI with sync=always set at the zvol level. If I do async writes (changing the zvol to sync=standard) I get roughly 1.3-2GB/s.

Pools:
  pool: SAS
 state: ONLINE
  scan: scrub repaired 0B in 00:02:57 with 0 errors on Mon Aug 10 23:25:07 2020
config:

        NAME                                            STATE     READ WRITE CKSUM
        SAS                                             ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/50cad0f1-d0d7-11ea-8717-2cf05d07d39b  ONLINE       0     0     0
            gptid/511b4e3f-d0d7-11ea-8717-2cf05d07d39b  ONLINE       0     0     0
            gptid/51294ecd-d0d7-11ea-8717-2cf05d07d39b  ONLINE       0     0     0
            gptid/517b5371-d0d7-11ea-8717-2cf05d07d39b  ONLINE       0     0     0
            gptid/51907b32-d0d7-11ea-8717-2cf05d07d39b  ONLINE       0     0     0
            gptid/51924d04-d0d7-11ea-8717-2cf05d07d39b  ONLINE       0     0     0
            gptid/51a8084b-d0d7-11ea-8717-2cf05d07d39b  ONLINE       0     0     0
            gptid/5189a1b0-d0d7-11ea-8717-2cf05d07d39b  ONLINE       0     0     0
            gptid/51df64e8-d0d7-11ea-8717-2cf05d07d39b  ONLINE       0     0     0
            gptid/51e871f0-d0d7-11ea-8717-2cf05d07d39b  ONLINE       0     0     0
        logs
          mirror-3                                      ONLINE       0     0     0
            gptid/e9893949-e065-11ea-8ad9-2cf05d07d39b  ONLINE       0     0     0
            gptid/eca197e8-e065-11ea-8ad9-2cf05d07d39b  ONLINE       0     0     0
          mirror-5                                      ONLINE       0     0     0
            gptid/1e560354-e0cb-11ea-b1f6-2cf05d07d39b  ONLINE       0     0     0
            gptid/1f4fb511-e0cb-11ea-b1f6-2cf05d07d39b  ONLINE       0     0     0
          mirror-6                                      ONLINE       0     0     0
            gptid/1ab61b43-e0cb-11ea-b1f6-2cf05d07d39b  ONLINE       0     0     0
            gptid/1cd69291-e0cb-11ea-b1f6-2cf05d07d39b  ONLINE       0     0     0
        spares
          gptid/a8f6d298-d4d4-11ea-8717-2cf05d07d39b    AVAIL

errors: No known data errors

  pool: VM
 state: ONLINE
  scan: scrub repaired 0B in 22:11:05 with 0 errors on Sun Aug 2 22:11:06 2020
config:

        NAME                                            STATE     READ WRITE CKSUM
        VM                                              ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/30a41307-2f7f-11ea-850b-902b34d39495  ONLINE       0     0     0
            gptid/e5e1a6c5-2f61-11ea-bcc5-902b34d39495  ONLINE       0     0     0
            gptid/db50c760-2f74-11ea-9ab2-902b34d39495  ONLINE       0     0     0
        logs
          mirror-3                                      ONLINE       0     0     0
            gptid/dfd4c43d-e065-11ea-8ad9-2cf05d07d39b  ONLINE       0     0     0
            gptid/e26ced7c-e065-11ea-8ad9-2cf05d07d39b  ONLINE       0     0     0
          mirror-5                                      ONLINE       0     0     0
            gptid/9dd59435-e0ca-11ea-b1f6-2cf05d07d39b  ONLINE       0     0     0
            gptid/9ebb7ea4-e0ca-11ea-b1f6-2cf05d07d39b  ONLINE       0     0     0
          mirror-6                                      ONLINE       0     0     0
            gptid/9b94667f-e0ca-11ea-b1f6-2cf05d07d39b  ONLINE       0     0     0
            gptid/9d189c17-e0ca-11ea-b1f6-2cf05d07d39b  ONLINE       0     0     0

SMART data of one SSD:


=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Total time to complete Offline
data collection: (65535) seconds.
Offline data collection
capabilities: (0x00) Offline data collection not supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     -O--CK   100   100   000    -    0
  9 Power_On_Hours          -O--CK   100   100   000    -    1643
 12 Power_Cycle_Count       -O--CK   100   100   000    -    238
148 Unknown_Attribute       ------   100   100   000    -    0
149 Unknown_Attribute       ------   100   100   000    -    0
167 Write_Protect_Mode      ------   100   100   000    -    0
168 SATA_Phy_Error_Count    -O--C-   100   100   000    -    0
169 Bad_Block_Rate          ------   100   100   000    -    16
170 Bad_Blk_Ct_Erl/Lat      ------   100   100   010    -    0/16
172 Erase_Fail_Count        -O--CK   100   100   000    -    0
173 MaxAvgErase_Ct          ------   100   100   000    -    9 (Average 5)
181 Program_Fail_Count      -O--CK   100   100   000    -    0
182 Erase_Fail_Count        ------   100   100   000    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
192 Unsafe_Shutdown_Count   -O--C-   100   100   000    -    111
194 Temperature_Celsius     -O---K   034   042   000    -    34 (Min/Max 22/42)
196 Reallocated_Event_Count -O--CK   100   100   000    -    0
199 SATA_CRC_Error_Count    -O--CK   100   100   000    -    0
218 CRC_Error_Count         -O--CK   100   100   000    -    339
231 SSD_Life_Left           ------   099   099   000    -    99
233 Flash_Writes_GiB        -O--CK   100   100   000    -    1051
241 Lifetime_Writes_GiB     -O--CK   100   100   000    -    1085
242 Lifetime_Reads_GiB      -O--CK   100   100   000    -    1635
244 Average_Erase_Count     ------   100   100   000    -    5
245 Max_Erase_Count         ------   100   100   000    -    9
246 Total_Erase_Count       ------   100   100   000    -    56352
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART Log Directory Version 1 [multi-sector log support]
Address Access R/W Size Description
0x00 GPL,SL R/O 1 Log Directory
0x03 GPL R/O 64 Ext. Comprehensive SMART error log
0x04 GPL,SL R/O 8 Device Statistics log
0x06 SL R/O 1 SMART self-test log
0x07 GPL R/O 1 Extended self-test log
0x09 SL R/W 1 Selective self-test log
0x10 GPL R/O 1 NCQ Command Error log
0x11 GPL R/O 1 SATA Phy Event Counters log
0x30 GPL,SL R/O 9 IDENTIFY DEVICE data log
0x80-0x9f GPL,SL R/W 16 Host vendor specific log

SMART Extended Comprehensive Error Log Version: 1 (64 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
No self-tests have been logged. [To run self-tests, use: smartctl -t]

Selective Self-tests/Logging not supported

SCT Commands not supported

Device Statistics (GP Log 0x04)
Page Offset Size Value Flags Description
0x01 ===== = = === == General Statistics (rev 1) ==
0x01 0x008 4 238 --- Lifetime Power-On Resets
0x01 0x010 4 1643 --- Power-on Hours
0x01 0x018 6 2276613609 --- Logical Sectors Written
0x01 0x028 6 3430379015 --- Logical Sectors Read
0x04 ===== = = === == General Errors Statistics (rev 1) ==
0x04 0x008 4 0 --- Number of Reported Uncorrectable Errors
0x05 ===== = = === == Temperature Statistics (rev 1) ==
0x05 0x008 1 34 --- Current Temperature
0x05 0x020 1 42 --- Highest Temperature
0x05 0x028 1 22 --- Lowest Temperature
0x06 ===== = = === == Transport Statistics (rev 1) ==
0x06 0x018 4 339 --- Number of Interface CRC Errors
0x07 ===== = = === == Solid State Device Statistics (rev 1) ==
0x07 0x008 1 0 --- Percentage Used Endurance Indicator
|||_ C monitored condition met
||__ D supports DSN
|___ N normalized value

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID Size Value Description
0x0001 2 0 Command failed due to ICRC error
0x0003 2 0 R_ERR response for device-to-host data FIS
0x0004 2 0 R_ERR response for host-to-device data FIS
0x0006 2 0 R_ERR response for device-to-host non-data FIS
0x0007 2 0 R_ERR response for host-to-device non-data FIS
0x0008 2 0 Device-to-host non-data FIS retries
0x0009 4 0 Transition from drive PhyRdy to drive PhyNRdy
0x000a 4 1 Device-to-host register FISes sent due to a COMRESET
0x000f 2 0 R_ERR response for host-to-device data FIS, CRC
0x0010 2 0 R_ERR response for host-to-device data FIS, non-CRC
0x0012 2 0 R_ERR response for host-to-device non-data FIS, CRC
0x0013 2 0 R_ERR response for host-to-device non-data FIS, non-CRC


Performance of 1 SSD:

root@freenas[~]# diskinfo -wS /dev/da12
/dev/da12
512 # sectorsize
480103981056 # mediasize in bytes (447G)
937703088 # mediasize in sectors
0 # stripesize
0 # stripeoffset
58369 # Cylinders according to firmware.
255 # Heads according to firmware.
63 # Sectors according to firmware.
ATA KINGSTON SA400S3 # Disk descr.
50026B778306FD05 # Disk ident.
Yes # TRIM/UNMAP support
0 # Rotation rate in RPM
Not_Zoned # Zone Mode

Synchronous random writes:
0.5 kbytes: 1024.9 usec/IO = 0.5 Mbytes/s
1 kbytes: 950.4 usec/IO = 1.0 Mbytes/s
2 kbytes: 917.4 usec/IO = 2.1 Mbytes/s
4 kbytes: 696.6 usec/IO = 5.6 Mbytes/s
8 kbytes: 725.8 usec/IO = 10.8 Mbytes/s
16 kbytes: 760.4 usec/IO = 20.5 Mbytes/s
32 kbytes: 1127.2 usec/IO = 27.7 Mbytes/s
64 kbytes: 1569.7 usec/IO = 39.8 Mbytes/s
128 kbytes: 7820.5 usec/IO = 16.0 Mbytes/s
256 kbytes: 2745.9 usec/IO = 91.0 Mbytes/s
512 kbytes: 4634.5 usec/IO = 107.9 Mbytes/s
1024 kbytes: 6719.2 usec/IO = 148.8 Mbytes/s
2048 kbytes: 11237.5 usec/IO = 178.0 Mbytes/s
4096 kbytes: 20076.3 usec/IO = 199.2 Mbytes/s
8192 kbytes: 36941.6 usec/IO = 216.6 Mbytes/s
root@freenas[~]# gpart list da12 | egrep 'Mediasize|rawuuid'
Mediasize: 8589934592 (8.0G)
rawuuid: 9b94667f-e0ca-11ea-b1f6-2cf05d07d39b
Mediasize: 8589934592 (8.0G)
rawuuid: 1ab61b43-e0cb-11ea-b1f6-2cf05d07d39b
Mediasize: 480103981056 (447G)
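
The two columns in the synchronous write test above are directly related: with one IO in flight at a time, throughput is just IO size divided by latency. Here is a minimal sketch of that conversion, using a few rows copied from the output. It assumes diskinfo's "kbytes" are 1024 bytes and its "Mbytes" are 2^20 bytes, which is what makes the printed figures line up.

```python
# Convert diskinfo -wS latency figures (usec per IO) into throughput.
# Sample rows copied from the output above: (IO size in KiB, usec per IO).
samples = [(4, 696.6), (64, 1569.7), (128, 7820.5), (1024, 6719.2), (8192, 36941.6)]


def mib_per_sec(size_kib: float, usec_per_io: float) -> float:
    """Throughput of back-to-back synchronous writes with one IO in flight."""
    ios_per_sec = 1_000_000 / usec_per_io
    return ios_per_sec * size_kib / 1024


for size_kib, usec in samples:
    rate = mib_per_sec(size_kib, usec)
    print(f"{size_kib:>5} KiB: {usec:>8.1f} usec/IO = {rate:6.1f} MiB/s")
```

Read this way, the roughly 200 Mbytes/s figure only appears at multi-megabyte IO sizes; at the small sizes the per-IO latency of 0.7 to 1 ms is what limits this disk.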
 

Stilez

Senior Member
Joined
Apr 8, 2016
Messages
511
I can't find any reference to it on GitHub. Any link about what you are talking about?
It doesn't seem to be part of OpenZFS 2.0, nor does it seem to be in development at the moment.

Is it possible you are confusing some sort of Proof-of-Concept thing with actual development? There are a LOT of PoCs being done with ZFS; the time from PoC to actual PR is about a year, and the time from PoC to merge is about 3-5 years.
This is likely referring to the "log-based dedup approach" that Matt Ahrens proposed back in 2017:
Yes, and, I know what I mean ;-)

From the ZFS leadership team minutes, 28 April 2020, and Matt Ahrens' email of the minutes 30 April 2020 on the developer list (same content), this tantalising snippet, sounding like it's referring to some actual development not just PoC on Matt Ahrens' 2017 concept:

* "Status on dedup-log & DDT limit (Allan)"
* "Both coming along nicely."

*Now* can someone tell me more....? :)
 

Stilez

Senior Member
Joined
Apr 8, 2016
Messages
511
No, it isn't...

That PR is indeed "DDT limit". But the minutes state clearly that dedup-log is "coming along nicely". That's the really interesting one, and it's completely unambiguous: the term "dedup-log" hasn't been used for anything else AFAIK, and it is a very similar concept to log-spacemap, keeping a log of DDT changes and reducing write-out load for updates. The minutes wouldn't say "both" are coming along if they were the same thing, so that's not likely either.

I don't think ZFS folks like Allan Jude and Matt Ahrens would say that dedup-log is "coming along nicely", accidentally, so I take it at face value, and want to know more.
 

ornias

Neophyte Sage
Joined
Mar 6, 2020
Messages
1,461
No, it isn't...
Yes it is, it is just half of the answer you were looking for. I just didn't have time to look for the other one. I also never said they were the same thing, I just only answered half of them.
Having taken some time: I can't find any PR to confirm the statements Matt made. Not saying there isn't one. You can throw him an email if you'd like to know though!
(I think that's the easiest solution, instead of us digging through GitHub)

I don't think ZFS folks like Allan Jude and Matt Ahrens would say that dedup-log is "coming along nicely", accidentally, so I take it at face value, and want to know more.
Allan "PoC" Jude says and proposes a lot of things, many of which you shouldn't rely upon, and he has a significant history of throwing up Proofs of Concept which he then (mostly) ignores beyond the PoC stage, for others to finish years later.
The Leadership meeting also isn't always correct, including Allan and Matt.

"coming on nicely" is also often ZFS slang for: somewhere in the coming 3 years.

To be clear: I'm not saying there is bad intent from anyone here. Just stating some facts that should be taken into account, because I think your expectations are a bit too high when it comes to ZFS feature development speed. People like Matt and Allan are very busy and have A LOT of projects on their hands; even if things "come along nicely", that does not mean it's anywhere close to being merged.

Even if ZFS projects are "mostly finished" it might take up to a year to get merged, simply because maintainers don't have all the time in the world? Source?

So TLDR:
- If you want the link to the PR, just go and throw Matt an email, he often responds within a day or so.
- "coming on nicely" does not mean it's anywhere close to being merged. draid and zstd have also been "coming on nicely" for a few years now ;)
 

Stilez

Senior Member
Joined
Apr 8, 2016
Messages
511
Yes it is, it is just half of the answer you were looking for. I just didn't have time to look for the other one. I also never said they were the same thing, I just only answered half of them.
Having taken some time: I can't find any PR to confirm the statements Matt made. Not saying there isn't one. You can throw him an email if you'd like to know though!
(I think that's the easiest solution, instead of us digging through GitHub)


Allan "PoC" Jude says and proposes a lot of things, many of which you shouldn't rely upon, and he has a significant history of throwing up Proofs of Concept which he then (mostly) ignores beyond the PoC stage, for others to finish years later.
The Leadership meeting also isn't always correct, including Allan and Matt.

"coming on nicely" is also often ZFS slang for: somewhere in the coming 3 years.

To be clear: I'm not saying there is bad intent from anyone here. Just stating some facts that should be taken into account, because I think your expectations are a bit too high when it comes to ZFS feature development speed. People like Matt and Allan are very busy and have A LOT of projects on their hands; even if things "come along nicely", that does not mean it's anywhere close to being merged.

Even if ZFS projects are "mostly finished" it might take up to a year to get merged, simply because maintainers don't have all the time in the world? Source?

So TLDR:
- If you want the link to the PR, just go and throw Matt an email, he often responds within a day or so.
- "coming on nicely" does not mean it's anywhere close to being merged. draid and zstd have also been "coming on nicely" for a few years now ;)
Education indeed!!!!!
 

HoneyBadger

Mushroom! Mushroom!
Joined
Feb 6, 2014
Messages
3,331
draid and zstd have also been "coming on nicely" for a few years now ;)
Fair point on DRAID, but zstd just made an official commit into OpenZFS:


(Although with ZSTD being developed by Facebook originally, it may be slower as it scans all your data for personal information.)
 

HoneyBadger

Mushroom! Mushroom!
Joined
Feb 6, 2014
Messages
3,331
I was benchmarking my SLOG and I see something strange...
My SLOG disks are a pair of 16G Optanes plus four 480G SSDs. Each has two 8G partitions, one for each of my two pools. I get writes of around 200-300, but the disks can do a lot more: in my own tests each disk provides roughly 500Mb/s, while on FreeNAS they show as roughly 200Mbytes/s. Since I only stress one pool at a time, I would expect 2x200+1x130=530 at minimum, but I only get roughly 300 out of them:
What are you trying to do with your SLOG configuration here? If you want more performance, use better devices; lose the complicated multi-split setup.

Optane 16G sticks are okay, but the SATA SSDs are not suitable SLOG devices; A400s do not have the endurance or performance for this. You're also sharing them between multiple pools, which will degrade things further. Having ZFS flip between the different log mirrors for SLOG may also be resulting in some inconsistent throughput, depending on where your records land.

The VM pool is also a RAIDZ1 and parity performs poorly for block/small-recordsize IO. Mirrors are strongly suggested for that use case.
 

ornias

Neophyte Sage
Joined
Mar 6, 2020
Messages
1,461
Fair point on DRAID, but zstd just made an official commit into OpenZFS:
I know, as I was one of the people behind that PR, as well as the one who just submitted the PRs to add it to the TrueNAS 12 GUI ;)
I still stand by my argument, considering the timetable behind ZSTD...

It has been developed by Yann Collet, who was hired by Facebook to do it. Just so you know ;)
Yann and the folks at ZSTD also reviewed the ZFS implementation btw.
 

HoneyBadger

Mushroom! Mushroom!
Joined
Feb 6, 2014
Messages
3,331
I know, as I was one of the people behind that PR, as well as the one who just submitted the PRs to add it to the TrueNAS 12 GUI ;)
I still stand by my argument, considering the timetable behind ZSTD...
I defer to your knowledge of the subject regarding the timelines of course. "It's done when it's done" is a good metric, and historically I am very risk-averse about new features, especially when it comes to data integrity or safety. Doubly/triply so when it isn't my data at risk.

And the "scans your data" comment was intended as a jab at Facebook, not an attack on anyone's personal credibility. Just so that's clear. :)
 

ornias

Neophyte Sage
Joined
Mar 6, 2020
Messages
1,461
I defer to your knowledge of the subject regarding the timelines of course. "It's done when it's done" is a good metric, and historically I am very risk-averse about new features, especially when it comes to data integrity or safety. Doubly/triply so when it isn't my data at risk.
Yeah, no offense taken.
And yes: data safety is tricky. It had better be done right the first time around :)
 