Parity usage included in dataset size?

Status
Not open for further replies.

shog

Cadet
Joined
Oct 13, 2017
Messages
8
Hello,

Although I am phrasing this as a question, it is possibly more of an observation, though I would be happy for someone better versed in pool design/layout to confirm it. If true, then with a limited number of drives, mirrors are not as wasteful of space as one might otherwise think.

As an interim solution at work, I have created some FreeNAS appliances to act as a data warehouse DB store, because the existing storage is ~7 years old and lacks the grunt to drive the new DB. Mission accomplished: performance has been good. Unfortunately, all I have had to work with are Dell R730s (not the XD variant), so I am limited to 8 x 2.5" drive slots.

As I was told that the use pattern was likely to be random access, I went for 8 x 2TB Samsung SSDs in mirrored vdevs, although I have a suspicion that the access is a lot more sequential in nature - mainly from looking at ARC requests (prefetch data).

As the procurement process for the new storage system is going to take a while, I have been asked whether the current interim solution could host a second (smaller, about 50% of the size) data warehouse DB. I suggested that it most likely could, as I believed 2 x RAIDZ1, or even a single RAIDZ2, would probably offer good enough (possibly better) IO given the largely sequential type of access. The theory being: 4 x 2TB mirror pairs gives ~8TB of data space, whereas 2 x 4-disk RAIDZ1 gives ~12TB. Ideal, I thought.

Initial testing looked promising, performance likely to be adequate, so I went ahead and replicated the currently used datasets to the new appliance (which would then replace the original after a storage cutover).

Horror! The datasets consume about 1.44x as much space on the two RAIDZ1s as on the 4 x mirrors, pretty much completely negating the apparent increase in size:

Code:
Mirrors:
NAME                                                             USED  AVAIL  REFER  MOUNTPOINT
ssdpool8k/obi/mcoradata_obipre_12c_ssd8k@OBIPRE-20180126-KEEP     978M      -   533G  -
ssdpool8k/obi/mcoradata_obipre_12c_ssd8k@AftFEB18_pre.620OBI         0      -   533G  -
ssdpool8k/obi/mcoradata_obipre_12c_ssd8k@Post_Feb18_PSUOBI           0      -   533G  -
ssdpool8k/obi/mcoradata_obipre_12c_ssd8k@POST.FEB18.PSU.OBI          0      -   533G  -
ssdpool8k/obi/mcoradata_obipre_12c_ssd8k@PRE_ETL-20180201        8.33G      -   536G  -
ssdpool8k/obi/mcoradata_obipre_12c_ssd8k@POSTetl-20180205         120G      -   519G  -
ssdpool8k/obi/mcoradata_obipre_12c_ssd8k@Kayes_test              15.9G      -   530G  -
ssdpool8k/obi/mcoradata_obipre_12c_ssd8k@auto-20180403.0300-2d   7.10G      -   531G  -
ssdpool8k/obi/mcoradata_obipre_12c_ssd8k@auto-20180404.0300-2d   6.59G      -   531G  -
ssdpool8k/obi/mcoradata_obipre_12c_ssd8k@auto-20180405.0300-2d    169M      -   531G  -

Raidz1:
NAME                                                             USED  AVAIL  REFER  MOUNTPOINT
pressd01/obi/mcoradata_obipre_12c_ssd8k@OBIPRE-20180126-KEEP    1.35G      -   772G  -
pressd01/obi/mcoradata_obipre_12c_ssd8k@AftFEB18_pre.620OBI         0      -   772G  -
pressd01/obi/mcoradata_obipre_12c_ssd8k@Post_Feb18_PSUOBI           0      -   772G  -
pressd01/obi/mcoradata_obipre_12c_ssd8k@POST.FEB18.PSU.OBI          0      -   772G  -
pressd01/obi/mcoradata_obipre_12c_ssd8k@PRE_ETL-20180201        11.9G      -   777G  -
pressd01/obi/mcoradata_obipre_12c_ssd8k@POSTetl-20180205         173G      -   751G  -
pressd01/obi/mcoradata_obipre_12c_ssd8k@Kayes_test              22.5G      -   767G  -
pressd01/obi/mcoradata_obipre_12c_ssd8k@auto-20180403.0300-2d   10.2G      -   769G  -
pressd01/obi/mcoradata_obipre_12c_ssd8k@auto-20180404.0300-2d   9.49G      -   769G  -
pressd01/obi/mcoradata_obipre_12c_ssd8k@auto-20180405.0300-2d       0      -   769G  -

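The ~1.44x figure comes straight from the REFER columns above; as a trivial check, using the first snapshot of the same dataset on each pool:

Code:
# REFER of the same snapshot on each pool, taken from the listings above
mirror_refer_gib = 533   # dataset on the 4 x mirror pool
raidz1_refer_gib = 772   # same dataset replicated to the 2 x RAIDZ1 pool

print(f"expansion: {raidz1_refer_gib / mirror_refer_gib:.2f}x")  # ~1.45x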

So, to get any effective extra space, I believe I will have to configure the pool as a single 8-drive RAIDZ1 (7 data drives + 1 parity, which offends my sense of symmetry).

Would some kind and knowledgeable soul please confirm my theory is correct, and that I haven't done something daft?

Thanks a lot :)
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Just use zpool list to see your pool capacity data; it's a bit simpler.

A RAIDZ2 of the 2TB SSDs would yield ~10.9TB of storage, subtract 20% & Default Swap Space = ~8.8TB total usable space.

A RAIDZ1 would be the same as above but add 2TB to each total for a grand usable size of ~10.8TB.

I don't do mirrors myself but I think I have the math correct. A set of four mirrored pairs (four vdevs) would yield 1.8TB per pair * 4 = ~7.1TB; subtract the 20% & Default Swap Space = ~6.6TB.

The assumption here is your 2TB SSDs are actually 2TB in size and not 1.8TB each.
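If you want to run the numbers yourself, here is a rough Python sketch of that arithmetic. The assumptions are mine: a "2TB" SSD is 2e12 bytes (~1.82 TiB), the default swap partition is 2 GiB per drive, and the 20% is the usual free-space headroom, so it won't match my rounded figures exactly.

Code:
# Rough usable-space estimate: count only the data drives, subtract the
# default swap from each, then take 20% off as free-space headroom.
TIB = 2 ** 40
DRIVE_TIB = 2e12 / TIB         # a "2TB" SSD is ~1.82 TiB
SWAP_TIB = 2 * 2 ** 30 / TIB   # FreeNAS default 2 GiB swap partition per drive

def capacity_tib(total_drives, parity_drives):
    data_drives = total_drives - parity_drives
    raw = data_drives * (DRIVE_TIB - SWAP_TIB)
    return raw, raw * 0.8      # before and after the 20% headroom

for name, drives, parity in [("8-wide RAIDZ2", 8, 2),
                             ("8-wide RAIDZ1", 8, 1),
                             ("4 x 2-way mirrors", 8, 4)]:
    raw, after = capacity_tib(drives, parity)
    print(f"{name}: ~{raw:.1f} TiB data space, ~{after:.1f} TiB after headroom")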


You should post your system specs and someone might be able to offer some sound advice. Hopefully you have high-endurance SSDs, otherwise you will wear them out pretty quickly, I think. You can build a very fast system using spinning rust as well, and may even save some money.

Good Luck.
 

shog

Cadet
Joined
Oct 13, 2017
Messages
8
Just use zpool list to see your pool capacity data; it's a bit simpler.

A RAIDZ2 of the 2TB SSDs would yield ~10.9TB of storage, subtract 20% & Default Swap Space = ~8.8TB total usable space.

A RAIDZ1 would be the same as above but add 2TB to each total for a grand usable size of ~10.8TB.

I don't do mirrors myself but I think I have the math correct. A set of four mirrored pairs (four vdevs) would yield 1.8TB per pair * 4 = ~7.1TB; subtract the 20% & Default Swap Space = ~6.6TB.

The assumption here is your 2TB SSDs are actually 2TB in size and not 1.8TB each.

Good Luck.

Thanks very much for your reply, and your space calculations are pretty accurate :)

However, my issue is not with the available space, but with the additional space consumed by the same data.

On the 4 x mirrors, I have a ~1TB datafile. When replicated to the 2 x RAIDZ1 setup, it occupies ~1.4TB, and the individual snapshots are all proportionately bigger. Compression is on (lzjb) at both source and destination, with the same reported compressratio. Both pools were created with ashift=13.

So, although I have more space, I am using more of it to store the same amount of data. Going from 4 x 2TB of data disks (mirrored) to 6 x 2TB of data disks plus parity, you actually seem to lose most of the additional space to the added parity requirement.

Or am I missing something here? Thinking about it logically, it seems to me that I should have roughly 1.5x the space plus the parity drives, and so I do not understand the apparent increased usage for the same data.

Sorry for the confusion...
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
ashift = 13? why?

so I do not understand the apparent increased usage for the same data.

Please look at this thread because it's probably the main factor here, especially with ashift = 13.
 

shog

Cadet
Joined
Oct 13, 2017
Messages
8
ashift = 13? why?

Please look at this thread because it's probably the main factor here, especially with ashift = 13.

Thank you for the pointer; it is beginning to make sense. My google-fu did not find that thread, probably because I was not looking for wasted space as such! I have been edumacated!

I set ashift=13 because (I think) these SSDs have a native page size of 8K. Having said that, it is virtually impossible to find data telling you what the page size actually is, and they report themselves as 512-byte block size devices:

root@sscprodobizfs01:~ # camcontrol devlist
<ATA Samsung SSD 850 2B6Q> at scbus1 target 0 lun 0 (pass0,da0)
<ATA Samsung SSD 850 2B6Q> at scbus1 target 1 lun 0 (pass1,da1)
<ATA Samsung SSD 850 2B6Q> at scbus1 target 2 lun 0 (pass2,da2)
<ATA Samsung SSD 850 2B6Q> at scbus1 target 3 lun 0 (pass3,da3)
<ATA Samsung SSD 850 2B6Q> at scbus1 target 4 lun 0 (pass4,da4)
<ATA Samsung SSD 850 2B6Q> at scbus1 target 5 lun 0 (pass5,da5)
<ATA Samsung SSD 850 2B6Q> at scbus1 target 6 lun 0 (pass6,da6)
<ATA Samsung SSD 850 2B6Q> at scbus1 target 7 lun 0 (pass7,da7)
<DP BP13G+ 2.25> at scbus1 target 32 lun 0 (pass8,ses0)
<PLDS DVD+-RW DS-8ABSH LD51> at scbus11 target 0 lun 0 (pass9,cd0)
<iDRAC Virtual CD 0329> at scbus13 target 0 lun 0 (pass10,cd1)
<iDRAC Virtual Floppy 0329> at scbus13 target 0 lun 1 (pass11,da8)
<SanDisk Ultra Fit 1.00> at scbus14 target 0 lun 0 (pass12,da9)
<SanDisk Ultra Fit 1.00> at scbus15 target 0 lun 0 (pass13,da10)

root@sscprodobizfs01:~ # camcontrol identify da0
pass0: <Samsung SSD 850 EVO 2TB EMT02B6Q> ACS-2 ATA SATA 3.x device
pass0: 150.000MB/s transfers, Command Queueing Enabled

protocol ATA/ATAPI-9 SATA 3.x
device model Samsung SSD 850 EVO 2TB
firmware revision EMT02B6Q
serial number S2RMNB0J800810Z
WWN 5002538c407b19f9
cylinders 16383
heads 16
sectors/track 63
sector size logical 512, physical 512, offset 0
LBA supported 268435455 sectors
LBA48 supported 3907029168 sectors
PIO supported PIO4
DMA supported WDMA2 UDMA6
media RPM non-rotating

Feature Support Enabled Value Vendor
read ahead yes yes
write cache yes yes
flush cache yes yes
overlap no
Tagged Command Queuing (TCQ) no no
Native Command Queuing (NCQ) yes 32 tags
NCQ Queue Management no
NCQ Streaming no
Receive & Send FPDMA Queued yes
SMART yes yes
microcode download yes yes
security yes no
power management yes yes
advanced power management no no
automatic acoustic management no no
media status notification no no
power-up in Standby no no
write-read-verify yes no 0/0x0
unload no no
general purpose logging yes yes
free-fall no no
Data Set Management (DSM/TRIM) yes
DSM - max 512byte blocks yes 8
DSM - deterministic read no
Host Protected Area (HPA) yes no 3907029168/3892349104
HPA - Security n

I believe Wear_Leveling_Count is an indicator of wear (% used), which would make 7% since Christmas:

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
9 Power_On_Hours 0x0032 099 099 000 Old_age Always - 2737
12 Power_Cycle_Count 0x0032 099 099 000 Old_age Always - 10
177 Wear_Leveling_Count 0x0013 099 099 000 Pre-fail Always - 7

I understand one would not usually use consumer-grade SSDs for an enterprise data warehouse system (Oracle 12c), but there wasn't a lot of budget and the expected lifetime is on the order of a few months while the new storage appliances are procured. The only server hardware available were some Dell R730s with 8 x 2.5" slots - and the PERC, of course, in HBA mode. I still had to do battle with it to get write caching enabled on the drives: on the latest firmware revision it claimed caching was enabled, but the performance did not support that assertion, and downgrading the firmware to an older level largely sorted the issue.

These R730s have 2 x 20-thread Xeons and 1/2 TB of RAM, and the dataset largely fits in ARC, so they do fly - saving a Christmas go-live when nothing would work on the incumbent (7-year-old) NetApp.

I was asked if the systems would cope with an additional data warehouse type DB (also Oracle 12c), and was experimenting with the putative DR appliance to see if I could get more space. Interestingly, RAIDZ1/2 are not so bad for this application - I guess it is mostly doing sequential IO. They are down on writes, but beat the mirrors on reads, and 2 x RAIDZ1 seemed the better all-round performer. The trouble here is that there is never enough time to do proper testing before the deliverable date.

I did run some tests (using sio_ntap, NetApp's simulated IO tool) with ashift=12: performance was worse, and zpool iostat <pool> 5 showed uneven data rates across the 8 individual drives. I therefore took my 8K page-size assumption to be correct, and rebuilt the pool with ashift=13 in 3 different layouts:
4 x mirror
2 x raidz1
1 x raidz2

I had settled on 2 x RAIDZ1, until I noticed that I was using up pretty much all of the extra space I had created.

I pretty much followed the Oracle-on-ZFS recommendations for the different datasets. I was using an 8K recordsize for my testing, representing the data files, but apparently the DB is configured for 16KB. In fact, I now understand that a 32KB recordsize can be used, which would have alleviated a lot of the space problem.

The settings on the datasets are:
[root@sscprodobizfs01 ~]# for F in /mnt/prdssd01/obi/*; do zfs get recordsize,primarycache,logbias $F; done
NAME                                            PROPERTY      VALUE       SOURCE
prdssd01/obi/mcoraarch_obipre_12c_ssd8k         recordsize    128K        received
prdssd01/obi/mcoraarch_obipre_12c_ssd8k         primarycache  none        received
prdssd01/obi/mcoraarch_obipre_12c_ssd8k         logbias       throughput  received
NAME                                            PROPERTY      VALUE       SOURCE
prdssd01/obi/mcoracle_base_obipre_01_12c_ssd8k  recordsize    8K          received
prdssd01/obi/mcoracle_base_obipre_01_12c_ssd8k  primarycache  all         received
prdssd01/obi/mcoracle_base_obipre_01_12c_ssd8k  logbias       throughput  received
NAME                                            PROPERTY      VALUE       SOURCE
prdssd01/obi/mcoracle_base_obipre_02_12c_ssd8k  recordsize    8K          received
prdssd01/obi/mcoracle_base_obipre_02_12c_ssd8k  primarycache  all         received
prdssd01/obi/mcoracle_base_obipre_02_12c_ssd8k  logbias       throughput  received
NAME                                            PROPERTY      VALUE       SOURCE
prdssd01/obi/mcoradata_obipre_12c_ssd8k         recordsize    16K         received
prdssd01/obi/mcoradata_obipre_12c_ssd8k         primarycache  all         received
prdssd01/obi/mcoradata_obipre_12c_ssd8k         logbias       throughput  received
NAME                                            PROPERTY      VALUE       SOURCE
prdssd01/obi/mcoraredoa_obipre_12c_ssd8k        recordsize    128K        received
prdssd01/obi/mcoraredoa_obipre_12c_ssd8k        primarycache  none        received
prdssd01/obi/mcoraredoa_obipre_12c_ssd8k        logbias       latency     received
NAME                                            PROPERTY      VALUE       SOURCE
prdssd01/obi/mcoraredob_obipre_12c_ssd8k        recordsize    128K        received
prdssd01/obi/mcoraredob_obipre_12c_ssd8k        primarycache  none        received
prdssd01/obi/mcoraredob_obipre_12c_ssd8k        logbias       latency     received
NAME                                            PROPERTY      VALUE       SOURCE
prdssd01/obi/mcoratemp_obipre_12c_ssd8k         recordsize    128K        received
prdssd01/obi/mcoratemp_obipre_12c_ssd8k         primarycache  none        received
prdssd01/obi/mcoratemp_obipre_12c_ssd8k         logbias       throughput  received

Using your calculations, I think I see:
2 x RAID-Z1 of 4 drives, ashift=13, recordsize=8K: losing (2 - 1.334)/1.334 = 49.92%
RAID-Z2 of 8 drives, ashift=13, recordsize=8K: losing (3 - 1.334)/1.334 = 124% (ouch!)
2 x RAID-Z1 of 4 drives, ashift=13, recordsize=16K: losing (4 - 2.667)/2.667 = 49.98%
RAID-Z2 of 8 drives, ashift=13, recordsize=16K: losing (6 - 2.667)/2.667 = 125%
2 x RAID-Z1 of 4 drives, ashift=13, recordsize=32K: losing (6 - 5.34)/5.34 = 12.35%
RAID-Z2 of 8 drives, ashift=13, recordsize=32K: losing (6 - 5.34)/5.34 = 12.35%
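To make sure I was applying the rule consistently, I also scripted the allocation arithmetic as I understand it from that thread: parity is added per stripe of (width - parity) data sectors, and the total allocation is padded up to a multiple of (parity + 1) sectors. This is only my reading of the rule, so a rough sketch rather than gospel:

Code:
import math

SECTOR = 8 * 1024   # ashift=13, so 8 KiB sectors

def allocated_sectors(record_bytes, width, parity):
    """Sectors one record consumes on a RAIDZ vdev, as I understand the rule:
    data sectors, plus `parity` sectors per stripe of (width - parity) data
    sectors, padded up to a multiple of (parity + 1)."""
    data = math.ceil(record_bytes / SECTOR)
    stripes = math.ceil(data / (width - parity))
    total = data + parity * stripes
    pad = parity + 1
    return data, math.ceil(total / pad) * pad

def extra_loss(record_bytes, width, parity):
    data, alloc = allocated_sectors(record_bytes, width, parity)
    ideal = data * width / (width - parity)   # what the nominal parity ratio promises
    return (alloc - ideal) / ideal

for rs_kib in (8, 16, 32):
    rz1 = extra_loss(rs_kib * 1024, 4, 1)     # each 4-disk RAIDZ1 vdev
    rz2 = extra_loss(rs_kib * 1024, 8, 2)     # single 8-disk RAIDZ2 vdev
    print(f"recordsize {rs_kib:>2}K: RAIDZ1(4) +{rz1:.1%}  RAIDZ2(8) +{rz2:.1%}")

That prints ~50%/125% for 8K and 16K records and ~12.5% for 32K, which lines up with the figures above, give or take rounding.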

tbh, ~12% is not a show-stopper and I would have been happy to lose that much space overall. I wish I had had time to do more testing.

Do those calculations look right to you?
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Ok for the ashift.

Now, why is the recordsize so low?
 

shog

Cadet
Joined
Oct 13, 2017
Messages
8
Primarily because of Oracle's recommendations for running Oracle DB on ZFS, although those are obviously written for their own ZFS implementation. They recommend that the data dataset use the same recordsize as the DB block size itself:

OLTP-type DBs are recommended to use a block size of 8KB.
DW-type DBs are recommended to use 8KB or 16KB, rising to 32KB.

Our DBAs have already implemented them at 8KB and 16KB respectively, and although I can suggest larger sizes, I cannot force their adoption.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Ah ok, I see.

So yeah, you don't have much choice here.
 