Confused why I'm getting single-SSD speeds on writes to a 6-SSD mirror array.

coreyman

Dabbler
Joined
Apr 2, 2018
Messages
13
So I'm testing my ZFS pool using the following tests:
Code:
#This is my Crucial MX500 boot disk.

dd if=/dev/zero of=/tempfile bs=1M count=5k
5120+0 records in
5120+0 records out
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 14.0936 s, 381 MB/s

 dd if=/dev/zero of=/tempfile bs=1M count=5k oflag=direct
5120+0 records in
5120+0 records out
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 12.7091 s, 422 MB/s

#This is my array
zdb
kvm:
    version: 5000
    name: 'kvm'
    state: 0
    txg: 67
    pool_guid: 11505207513529118781
    errata: 0
    hostid: 4285015651
    hostname: 'prod'
    vdev_children: 3
    vdev_tree:
        type: 'root'
        id: 0
        guid: 11505207513529118781
        children[0]:
            type: 'mirror'
            id: 0
            guid: 13930234995489822907
            metaslab_array: 38
            metaslab_shift: 33
            ashift: 9
            asize: 1000189984768
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 17328537525774326006
                path: '/dev/disk/by-id/scsi-35002538e40c2eb67-part1'
                whole_disk: 1
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 7440773068342029255
                path: '/dev/disk/by-id/scsi-35002538e40e5c209-part1'
                whole_disk: 1
                create_txg: 4
        children[1]:
            type: 'mirror'
            id: 1
            guid: 13633949727752237663
            metaslab_array: 36
            metaslab_shift: 33
            ashift: 9
            asize: 1000189984768
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 13401956080122052633
                path: '/dev/disk/by-id/scsi-35002538e40da4ca2-part1'
                whole_disk: 1
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 9163985048442606291
                path: '/dev/disk/by-id/scsi-35002538e000cd7ac-part1'
                whole_disk: 1
                create_txg: 4
        children[2]:
            type: 'mirror'
            id: 2
            guid: 15210515774431942742
            metaslab_array: 34
            metaslab_shift: 33
            ashift: 9
            asize: 1000189984768
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 4547121569999765888
                path: '/dev/disk/by-id/scsi-35002538e40e0481c-part1'
                whole_disk: 1
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 12916523593271101090
                path: '/dev/disk/by-id/scsi-35002538e40e5d5d9-part1'
                whole_disk: 1
                create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data

#These are the speeds I'm getting on the array
root@prod:~# dd if=/dev/zero of=/kvm/testfile bs=1G count=5 oflag=direct
5+0 records in
5+0 records out
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 12.0899 s, 444 MB/s
root@prod:~# dd if=/dev/zero of=/kvm/testfile bs=1G count=5
5+0 records in
5+0 records out
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 14.778 s, 363 MB/s
root@prod:~# zfs set dedup=off kvm
root@prod:~# dd if=/dev/zero of=/kvm/testfile bs=1G count=5
5+0 records in
5+0 records out
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 14.6284 s, 367 MB/s

#Tried to show the pool settings here (the command should have been 'zfs get all kvm'); I get similar speeds whether dedup is on or off.

zfs get kvm
bad property list: invalid property 'kvm'
usage:
        get [-rHp] [-d max] [-o "all" | field[,...]]
            [-t type[,...]] [-s source[,...]]
            <"all" | property[,...]> [filesystem|volume|snapshot] ...

The following properties are supported:

        PROPERTY       EDIT  INHERIT   VALUES

        available        NO       NO   <size>
        clones           NO       NO   <dataset>[,...]
        compressratio    NO       NO   <1.00x or higher if compressed>
        creation         NO       NO   <date>
        defer_destroy    NO       NO   yes | no
        logicalreferenced  NO       NO   <size>
        logicalused      NO       NO   <size>
        mounted          NO       NO   yes | no
        origin           NO       NO   <snapshot>
        refcompressratio  NO       NO   <1.00x or higher if compressed>
        referenced       NO       NO   <size>
        type             NO       NO   filesystem | volume | snapshot | bookmark
        used             NO       NO   <size>
        usedbychildren   NO       NO   <size>
        usedbydataset    NO       NO   <size>
        usedbyrefreservation  NO       NO   <size>
        usedbysnapshots  NO       NO   <size>
        userrefs         NO       NO   <count>
        written          NO       NO   <size>
        aclinherit      YES      YES   discard | noallow | restricted | passthrough | passthrough-x
        acltype         YES      YES   noacl | posixacl
        atime           YES      YES   on | off
        canmount        YES       NO   on | off | noauto
        casesensitivity  NO      YES   sensitive | insensitive | mixed
        checksum        YES      YES   on | off | fletcher2 | fletcher4 | sha256
        compression     YES      YES   on | off | lzjb | gzip | gzip-[1-9] | zle | lz4
        context         YES       NO   <selinux context>
        copies          YES      YES   1 | 2 | 3
        dedup           YES      YES   on | off | verify | sha256[,verify]
        defcontext      YES       NO   <selinux defcontext>
        devices         YES      YES   on | off
        exec            YES      YES   on | off
        filesystem_count YES       NO   <count>
        filesystem_limit YES       NO   <count> | none
        fscontext       YES       NO   <selinux fscontext>
        logbias         YES      YES   latency | throughput
        mlslabel        YES      YES   <sensitivity label>
        mountpoint      YES      YES   <path> | legacy | none
        nbmand          YES      YES   on | off
        normalization    NO      YES   none | formC | formD | formKC | formKD
        overlay         YES      YES   on | off
        primarycache    YES      YES   all | none | metadata
        quota           YES       NO   <size> | none
        readonly        YES      YES   on | off
        recordsize      YES      YES   512 to 1M, power of 2
        redundant_metadata YES      YES   all | most
        refquota        YES       NO   <size> | none
        refreservation  YES       NO   <size> | none
        relatime        YES      YES   on | off
        reservation     YES       NO   <size> | none
        rootcontext     YES       NO   <selinux rootcontext>
        secondarycache  YES      YES   all | none | metadata
        setuid          YES      YES   on | off
        sharenfs        YES      YES   on | off | share(1M) options
        sharesmb        YES      YES   on | off | sharemgr(1M) options
        snapdev         YES      YES   hidden | visible
        snapdir         YES      YES   hidden | visible
        snapshot_count  YES       NO   <count>
        snapshot_limit  YES       NO   <count> | none
        sync            YES      YES   standard | always | disabled
        utf8only         NO      YES   on | off
        version         YES       NO   1 | 2 | 3 | 4 | 5 | current
        volblocksize     NO      YES   512 to 128k, power of 2
        volsize         YES       NO   <size>
        vscan           YES      YES   on | off
        xattr           YES      YES   on | off | dir | sa
        zoned           YES      YES   on | off
        userused@...     NO       NO   <size>
        groupused@...    NO       NO   <size>
        userquota@...   YES       NO   <size> | none
        groupquota@...  YES       NO   <size> | none
        written@<snap>   NO       NO   <size>


As you can see, my single Crucial MX500 is just as fast as six Samsung 860 Pro drives. I've done a ton of reading and I don't see the issue here.
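As an aside, a single dd stream of zeros is a fairly gentle test; a parallel write test with incompressible data says more about whether the pool scales across vdevs. A minimal sketch using fio (the tool and job parameters are assumptions, not something run in this thread):
Code:
# Sketch only: four parallel 1M sequential writers with O_DIRECT into the pool's mountpoint.
# fio must be installed; the directory and sizes are placeholders.
fio --name=seqwrite --directory=/kvm --rw=write --bs=1M --size=4G \
    --numjobs=4 --ioengine=libaio --direct=1 --group_reporting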
 

coreyman

Dabbler
Joined
Apr 2, 2018
Messages
13
Oh, as a follow-up: this is an idle machine with two E5-2643s, 128 GB of RAM, and the drives attached to a PERC H310 in passthrough mode.
 

coreyman

Dabbler
Joined
Apr 2, 2018
Messages
13
try three sets of mirror pairs

Isn't that what I have?

Code:
root@prod:~# zpool status
  pool: kvm
 state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Sun Dec  8 00:24:01 2019
config:

        NAME                        STATE     READ WRITE CKSUM
        kvm                         ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            scsi-35002538e40c2eb67  ONLINE       0     0     0
            scsi-35002538e40e5c209  ONLINE       0     0     0
          mirror-1                  ONLINE       0     0     0
            scsi-35002538e40da4ca2  ONLINE       0     0     0
            scsi-35002538e000cd7ac  ONLINE       0     0     0
          mirror-2                  ONLINE       0     0     0
            scsi-35002538e40e0481c  ONLINE       0     0     0
            scsi-35002538e40e5d5d9  ONLINE       0     0     0

errors: No known data errors
 

Rand

Guru
Joined
Dec 30, 2013
Messages
906
Are the results the same if you don't use a 1G blocksize but something more realistic like 1M?
 

coreyman

Dabbler
Joined
Apr 2, 2018
Messages
13
Are the results the same if you don't use a 1G blocksize but something more realistic like 1M?

Yes, single SSD ->
Code:
 dd if=/dev/zero of=/tempfile bs=1M count=5k oflag=direct
5120+0 records in
5120+0 records out
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 12.6762 s, 424 MB/s

6 Samsungs ->
Code:
dd if=/dev/zero of=/kvm/tempfile bs=1M count=5k oflag=direct
5120+0 records in
5120+0 records out
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 12.6006 s, 426 MB/s
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
PERC H310 in passthrough mode
The Dell PERC H310 has notoriously bad stock firmware that limits its queue depth to 25 I/Os across the entire adapter. This might be causing your problems. Try updating it to the LSI SAS2008 firmware - it's possible with the "mini-mono" type cards now, but the process is different, so make sure you're applying the right steps.
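Before reflashing, the card's current state can be confirmed from the OS. A quick sketch, where the device name and the presence of the LSI sas2flash utility are assumptions:
Code:
# Placeholder: sdb is one of the pool disks behind the H310.
cat /sys/block/sdb/device/queue_depth   # queue depth Linux sees for that disk
lspci | grep -i -e lsi -e sas2008       # confirm the controller chip
sas2flash -list                         # LSI utility: show the firmware currently on the card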
 

coreyman

Dabbler
Joined
Apr 2, 2018
Messages
13
The Dell PERC H310 has notoriously bad stock firmware that limits its queue depth to 25 I/Os across the entire adapter. This might be causing your problems. Try updating it to the LSI SAS2008 firmware - it's possible with the "mini-mono" type cards now, but the process is different, so make sure you're applying the right steps.

Should I also be using ashift=12? I'm not sure why the software chose ashift=9 by default.
 

coreyman

Dabbler
Joined
Apr 2, 2018
Messages
13
Definitely should be using ashift=12 - this pool was created through the webUI?

Flashed the card to IT mode and used ashift=12; I'm getting the performance I expect now.

Code:
dd if=/dev/zero of=/kvm/tempfile1 bs=1M count=5k
5120+0 records in
5120+0 records out
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 3.21709 s, 1.7 GB/s


dd if=/dev/zero of=/kvm/tempfile1 bs=1G count=5
5+0 records in
5+0 records out
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 3.11919 s, 1.7 GB/s


zfs set compression=lz4 kvm


dd if=/dev/zero of=/kvm/tempfile1 bs=1M count=5k

5120+0 records in
5120+0 records out
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 1.77144 s, 3.0 GB/s


 dd if=/dev/zero of=/kvm/tempfile1 bs=1M count=5k
5120+0 records in
5120+0 records out
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 1.71974 s, 3.1 GB/s
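
The recreate itself isn't shown above; a minimal sketch with an explicit ashift, reusing the by-id names from the earlier zpool status (whether these were the exact commands run is an assumption):
Code:
# Sketch only - destroys the pool and all data on it.
zpool destroy kvm
zpool create -o ashift=12 kvm \
    mirror /dev/disk/by-id/scsi-35002538e40c2eb67 /dev/disk/by-id/scsi-35002538e40e5c209 \
    mirror /dev/disk/by-id/scsi-35002538e40da4ca2 /dev/disk/by-id/scsi-35002538e000cd7ac \
    mirror /dev/disk/by-id/scsi-35002538e40e0481c /dev/disk/by-id/scsi-35002538e40e5d5d9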
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I'm reading it's common with the Samsung SSDs because they report 512 instead of 4096 block size.
Unless I'm mistaken, the default in FreeNAS (possibly OpenZFS as a whole? Edit: Not this, there's debate about it apparently) was set to use ashift=12 regardless of what the drive reported. Does smartctl show the drives reporting 512b logical and physical, or does it show 512b logical/4k physical? smartctl -a /dev/daX should tell you.
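On ZoL the members show up as /dev/sdX rather than /dev/daX; a couple of quick ways to check what the drives report, with the device name as a placeholder:
Code:
# Run against each member disk; /dev/sda is a placeholder.
smartctl -a /dev/sda | grep -i 'sector size'
lsblk -o NAME,LOG-SEC,PHY-SEC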
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
You caught me ;)
Your main pool being called "kvm" kind of gave me a hint. ;) At least now I won't be shy about giving you shell commands. No shame in using a different ZFS version; FreeNAS isn't always the answer (heresy, I know!), but the ashift default is an example of something that was "designed around" in this particular implementation.

There are some different default tunables as well.

I noticed you buried a reference to dedup in the original post as well - avoid that, with the exception of "you're on ZoL 0.8.2, have access to and are using special allocation classes, and have leveraged a sufficiently fast and redundant vdev to hold the DDT" - but even then, dedup is rarely the right answer.
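For reference, the special-allocation-class route described there is a dedicated dedup vdev on ZoL 0.8+; a minimal sketch with placeholder device names:
Code:
# Sketch only: requires a pool with ZoL 0.8+ feature flags.
# Device names are placeholders; use a seriously redundant mirror for this vdev.
zpool add kvm dedup mirror /dev/disk/by-id/FAST_SSD_A /dev/disk/by-id/FAST_SSD_B /dev/disk/by-id/FAST_SSD_C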
 

coreyman

Dabbler
Joined
Apr 2, 2018
Messages
13
Your main pool being called "kvm" kind of gave me a hint. ;) At least now I won't be shy about giving you shell commands. No shame in using a different ZFS version; FreeNAS isn't always the answer (heresy, I know!), but the ashift default is an example of something that was "designed around" in this particular implementation.

There are some different default tunables as well.

I noticed you buried a reference to dedup in the original post as well - avoid that, with the exception of "you're on ZoL 0.8.2, have access to and are using special allocation classes, and have leveraged a sufficiently fast and redundant vdev to hold the DDT" - but even then, dedup is rarely the right answer.

Yeah, I decided against dedup for my use case for now. I thought the DDT was stored in RAM? What are these default tunables that I should be looking at? I'm setting this one already ->
Code:
zfs set xattr=sa kvm
 

Jessep

Patron
Joined
Aug 19, 2018
Messages
379
I found this command for checking the ashift of existing pools:
Code:
zdb -U /data/zfs/zpool.cache | grep ashift
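
That path is the FreeNAS cachefile; on ZoL the default is /etc/zfs/zpool.cache, so the same check works without the -U flag (the zdb dump at the top of the thread is reading that cachefile already):
Code:
# On ZoL, plain zdb reads the default cachefile at /etc/zfs/zpool.cache.
zdb | grep ashift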
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Yeah, I decided against dedup for my use case for now. I thought the DDT was stored in RAM? What are these default tunables that I should be looking at? I'm setting this one already ->

It's less that there are tunables that need immediate adjusting and more that the defaults are different. E.g. the write throttle on ZoL has a dirty_data_max_max of 25% of system RAM vs. the fixed 4GB on FreeNAS.
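Those defaults live in the zfs kernel module parameters on ZoL; a quick way to inspect the current write-throttle limits (paths as on a stock ZoL install):
Code:
# Current dirty-data limits, in bytes:
cat /sys/module/zfs/parameters/zfs_dirty_data_max
cat /sys/module/zfs/parameters/zfs_dirty_data_max_max
# To pin a value across reboots, add a line like this to /etc/modprobe.d/zfs.conf:
# options zfs zfs_dirty_data_max=4294967296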

Regarding the DDT in RAM: the tables do get pulled into the ARC metadata space, but they also need to be kept on stable storage so that they aren't lost on shutdown, and updates to them are treated as sync writes. There's some pretty severe write amplification (1:3 I believe) for the metadata updates, so this causes a lot of churn on your vdevs even if the DDT fits entirely in RAM. Putting the DDTs on a separate SSD lets you redirect (not eliminate) the I/O amplification to other devices, but you then need to make sure that your dedup vdev is seriously redundant (think "triple mirror"), because losing it means your whole pool is toast.
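For anyone weighing that up, the dedup tables can be inspected directly before committing to a dedup vdev; a sketch against this pool:
Code:
# DDT histogram and in-core/on-disk entry sizes for the pool.
zpool status -D kvm
zdb -DD kvm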
 