Zpool and all devices UNAVAILABLE after reboot

Status
Not open for further replies.

ChrisH

Dabbler
Joined
Jan 29, 2014
Messages
11
This is new Supermicro hardware, and it's my first ZFS box. I've got 36 NL-SAS drives attached to six LSI SAS2308 controllers (one onboard, five HBAs) in a point-to-point configuration.

Specs:
* Supermicro 6047R-E1R36L
* Intel(R) Xeon(R) CPU E5-2630L v2 @ 2.40GHz (12 real cores)
* FreeNAS 9.2.0 64-bit
* 256GB RAM
* 10Gb Ethernet: Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 2.5.15
* SSD SLOG

I did read about a potential mps driver issue in "Known Issues", which might be the cause.

  • The mps driver for 6gbps LSI SAS HBAs is version 13, which requires phase 13 firmware on the controller. This is a hard requirement and running older firmware can cause many woes, including the failure to probe all of the attached disks, which can lead to degraded or unavailable arrays.
My LSI controllers are running phase 17 firmware, so I'm leaning on the word "older" in the quote above - I'm hoping that anything at or above phase 13 is okay.
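A quick sanity check is to look at what the mps driver reports at boot; this is a hedged sketch (the exact wording of the boot messages varies between driver versions), but the driver does log both its own version and the controller firmware version:

Code:
# hedged: look for lines of the form "mpsN: Firmware: <fw>, Driver: <drv>-fbsd"
[root@lens] ~# dmesg | grep -i mps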

I created the zpool at the command line, because it seemed a lot easier than the web interface. Basically, there are 34 drives in mirrored vdevs (details below), plus two hot spares. Once all the vdevs were the way I wanted them, I did a zpool export and an automatic import from the web GUI. This seemed to work fine. Also, this configuration may not be final - I plan to benchmark raidz* vdevs as well.
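For reference, the creation command was roughly along these lines - a reconstructed sketch rather than the exact shell history, with only the first two of the 17 mirror pairs shown, and using raw da device names rather than gptids:

Code:
[root@lens] ~# zpool create datapool \
                 mirror da2 da3 \
                 mirror da4 da5 \
                 log ada1p1 \
                 spare da0 da1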

Code:
[root@lens] ~# zpool status -v
  pool: datapool
state: ONLINE
  scan: none requested
config:
 
    NAME          STATE    READ WRITE CKSUM
    datapool      ONLINE      0    0    0
      mirror-0    ONLINE      0    0    0
        da2       ONLINE      0    0    0
        da3       ONLINE      0    0    0
      mirror-1    ONLINE      0    0    0
        da4       ONLINE      0    0    0
        da5       ONLINE      0    0    0
      mirror-2    ONLINE      0    0    0
        da6       ONLINE      0    0    0
        da7       ONLINE      0    0    0
      mirror-3    ONLINE      0    0    0
        da8       ONLINE      0    0    0
        da9       ONLINE      0    0    0
      mirror-4    ONLINE      0    0    0
        da10      ONLINE      0    0    0
        da11      ONLINE      0    0    0
      mirror-5    ONLINE      0    0    0
        da12      ONLINE      0    0    0
        da13      ONLINE      0    0    0
      mirror-6    ONLINE      0    0    0
        da14      ONLINE      0    0    0
        da15      ONLINE      0    0    0
      mirror-7    ONLINE      0    0    0
        da16      ONLINE      0    0    0
        da17      ONLINE      0    0    0
      mirror-8    ONLINE      0    0    0
        da18      ONLINE      0    0    0
        da19      ONLINE      0    0    0
      mirror-9    ONLINE      0    0    0
        da20      ONLINE      0    0    0
        da21      ONLINE      0    0    0
      mirror-10   ONLINE      0    0    0
        da22      ONLINE      0    0    0
        da23      ONLINE      0    0    0
      mirror-11   ONLINE      0    0    0
        da24      ONLINE      0    0    0
        da25      ONLINE      0    0    0
      mirror-12   ONLINE      0    0    0
        da26      ONLINE      0    0    0
        da27      ONLINE      0    0    0
      mirror-13   ONLINE      0    0    0
        da28      ONLINE      0    0    0
        da29      ONLINE      0    0    0
      mirror-14   ONLINE      0    0    0
        da30      ONLINE      0    0    0
        da31      ONLINE      0    0    0
      mirror-15   ONLINE      0    0    0
        da32      ONLINE      0    0    0
        da33      ONLINE      0    0    0
      mirror-16   ONLINE      0    0    0
        da34      ONLINE      0    0    0
        da35      ONLINE      0    0    0
    logs
      ada1p1      ONLINE      0    0    0
    spares
      da0         AVAIL
      da1         AVAIL
 
errors: No known data errors


I can then use the zpool and it seems to work (I haven't done thorough testing yet, but basic NFS reads and writes work fine).

After I reboot, the zpool is UNAVAILABLE:

Code:
[root@lens] ~# zpool status
  pool: datapool
state: UNAVAIL
status: One or more devices could not be opened.  There are insufficient
    replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
  see: http://illumos.org/msg/ZFS-8000-3C
  scan: none requested
config:
 
    NAME                      STATE    READ WRITE CKSUM
    datapool                  UNAVAIL      0    0    0
      mirror-0                UNAVAIL      0    0    0
        3592302713242575290  UNAVAIL      0    0    0  was /dev/da2
        2021266809768084791  UNAVAIL      0    0    0  was /dev/da3
      mirror-1                UNAVAIL      0    0    0
        5112586467922550107  UNAVAIL      0    0    0  was /dev/da4
        6419444988605564983  UNAVAIL      0    0    0  was /dev/da5
      mirror-2                UNAVAIL      0    0    0
        8652468216450752575  UNAVAIL      0    0    0  was /dev/da6
        14819342351271755606  UNAVAIL      0    0    0  was /dev/da7
      mirror-3                UNAVAIL      0    0    0
        15880097202006874895  UNAVAIL      0    0    0  was /dev/da8
        16907735780769163509  UNAVAIL      0    0    0  was /dev/da9
      mirror-4                UNAVAIL      0    0    0
        3071978661370149652  UNAVAIL      0    0    0  was /dev/da10
        1791407592833643381  UNAVAIL      0    0    0  was /dev/da11
      mirror-5                UNAVAIL      0    0    0
        14867928338641177295  UNAVAIL      0    0    0  was /dev/da12
        3150936061178659772  UNAVAIL      0    0    0  was /dev/da13
      mirror-6                UNAVAIL      0    0    0
        8316830207967157625  UNAVAIL      0    0    0  was /dev/da14
        6685696694650249293  UNAVAIL      0    0    0  was /dev/da15
      mirror-7                UNAVAIL      0    0    0
        17794825240028426542  UNAVAIL      0    0    0  was /dev/da16
        14952460444424399977  UNAVAIL      0    0    0  was /dev/da17
      mirror-8                UNAVAIL      0    0    0
        12167009551578455686  UNAVAIL      0    0    0  was /dev/da18
        2627324427696886329  UNAVAIL      0    0    0  was /dev/da19
      mirror-9                UNAVAIL      0    0    0
        4625506170048938841  UNAVAIL      0    0    0  was /dev/da20
        11965429751940287398  UNAVAIL      0    0    0  was /dev/da21
      mirror-10              UNAVAIL      0    0    0
        12210899300310530724  UNAVAIL      0    0    0  was /dev/da22
        12988575474745012328  UNAVAIL      0    0    0  was /dev/da23
      mirror-11              UNAVAIL      0    0    0
        17984353124639465830  UNAVAIL      0    0    0  was /dev/da24
        8748366759598853076  UNAVAIL      0    0    0  was /dev/da25
      mirror-12              UNAVAIL      0    0    0
        10315855113324936583  UNAVAIL      0    0    0  was /dev/da26
        8172850504687767722  UNAVAIL      0    0    0  was /dev/da27
      mirror-13              UNAVAIL      0    0    0
        918384368181134623    UNAVAIL      0    0    0  was /dev/da28
        14501400028025044371  UNAVAIL      0    0    0  was /dev/da29
      mirror-14              UNAVAIL      0    0    0
        11784271588938812269  UNAVAIL      0    0    0  was /dev/da30
        4281149748709097750  UNAVAIL      0    0    0  was /dev/da31
      mirror-15              UNAVAIL      0    0    0
        18083932246763186878  UNAVAIL      0    0    0  was /dev/da32
        4983784078936673443  UNAVAIL      0    0    0  was /dev/da33
      mirror-16              UNAVAIL      0    0    0
        13178871247697575703  UNAVAIL      0    0    0  was /dev/da34
        13024903489373090391  UNAVAIL      0    0    0  was /dev/da35


I have to manually delete the zpool from the web interface and then do an automatic import to get it working again. What's going on here? I'm fairly certain the geom names (e.g. da34) are not changing across reboots, and the problem recurs after every reboot.
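(Aside: as far as I can tell, the shell-level equivalent of that auto-import is just a plain import that rescans the devices - a hedged sketch, since I have been doing it through the GUI:)

Code:
[root@lens] ~# zpool import                   # scan /dev and list importable pools
[root@lens] ~# zpool import -R /mnt datapool  # import under the FreeNAS altroot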

I have searched for answers to this problem, but have not found anything yet.
About Me: New to ZFS and FreeNAS/FreeBSD, but otherwise tech savvy.
 

framewrangler

Dabbler
Joined
Nov 18, 2013
Messages
41
I had something similar happen during testing as well. Can you try recreating the pool from the GUI and testing after a reboot? After abandoning the CLI for zpool creation my issue went away.
 

ChrisH

Dabbler
Joined
Jan 29, 2014
Messages
11
I had something similar happen during testing as well. Can you try recreating the pool from the GUI and testing after a reboot? After abandoning the CLI for zpool creation my issue went away.


Thanks for the suggestion, framewrangler. I have done so, and immediately noticed that the disks are now referenced by gptid, unlike the pool I naively created. If it was a geom-name problem, that should fix it. I'm rebooting now, but won't be able to post again for a few hours. I will update the thread then. Thanks.

Code:
[root@lens] /mnt/lenspool/test#    zpool status
  pool: lenspool
state: ONLINE
  scan: none requested
config:
 
    NAME                                            STATE    READ WRITE CKSUM
    lenspool                                        ONLINE      0    0    0
      mirror-0                                      ONLINE      0    0    0
        gptid/c6404576-8952-11e3-b829-90e2ba50d4e4  ONLINE      0    0    0
        gptid/c6b5d13f-8952-11e3-b829-90e2ba50d4e4  ONLINE      0    0    0
      mirror-1                                      ONLINE      0    0    0
        gptid/c714682b-8952-11e3-b829-90e2ba50d4e4  ONLINE      0    0    0
        gptid/c786b0c1-8952-11e3-b829-90e2ba50d4e4  ONLINE      0    0    0
      mirror-2                                      ONLINE      0    0    0
        gptid/c7e8a2d8-8952-11e3-b829-90e2ba50d4e4  ONLINE      0    0    0
        gptid/c86dc83d-8952-11e3-b829-90e2ba50d4e4  ONLINE      0    0    0
      mirror-3                                      ONLINE      0    0    0
        gptid/c8d42ead-8952-11e3-b829-90e2ba50d4e4  ONLINE      0    0    0
        gptid/c937f3d2-8952-11e3-b829-90e2ba50d4e4  ONLINE      0    0    0
      mirror-4                                      ONLINE      0    0    0
        gptid/c99cd4c8-8952-11e3-b829-90e2ba50d4e4  ONLINE      0    0    0
        gptid/ca01498a-8952-11e3-b829-90e2ba50d4e4  ONLINE      0    0    0
      mirror-5                                      ONLINE      0    0    0
        gptid/ca6959d1-8952-11e3-b829-90e2ba50d4e4  ONLINE      0    0    0
        gptid/cacde466-8952-11e3-b829-90e2ba50d4e4  ONLINE      0    0    0
      mirror-6                                      ONLINE      0    0    0
        gptid/cb441eee-8952-11e3-b829-90e2ba50d4e4  ONLINE      0    0    0
        gptid/cba9fec8-8952-11e3-b829-90e2ba50d4e4  ONLINE      0    0    0
      mirror-7                                      ONLINE      0    0    0
        gptid/cc12ea5d-8952-11e3-b829-90e2ba50d4e4  ONLINE      0    0    0
        gptid/cc81028f-8952-11e3-b829-90e2ba50d4e4  ONLINE      0    0    0
      mirror-8                                      ONLINE      0    0    0
        gptid/cce969a9-8952-11e3-b829-90e2ba50d4e4  ONLINE      0    0    0
        gptid/cd63fdd4-8952-11e3-b829-90e2ba50d4e4  ONLINE      0    0    0
      mirror-9                                      ONLINE      0    0    0
        gptid/cde3d550-8952-11e3-b829-90e2ba50d4e4  ONLINE      0    0    0
        gptid/ce6d2355-8952-11e3-b829-90e2ba50d4e4  ONLINE      0    0    0
      mirror-10                                    ONLINE      0    0    0
        gptid/cee2cee8-8952-11e3-b829-90e2ba50d4e4  ONLINE      0    0    0
        gptid/cf4b85be-8952-11e3-b829-90e2ba50d4e4  ONLINE      0    0    0
      mirror-11                                    ONLINE      0    0    0
        gptid/cfcfcce5-8952-11e3-b829-90e2ba50d4e4  ONLINE      0    0    0
        gptid/d046c6e3-8952-11e3-b829-90e2ba50d4e4  ONLINE      0    0    0
      mirror-12                                    ONLINE      0    0    0
        gptid/d0c7ad9a-8952-11e3-b829-90e2ba50d4e4  ONLINE      0    0    0
        gptid/d1375c53-8952-11e3-b829-90e2ba50d4e4  ONLINE      0    0    0
      mirror-13                                    ONLINE      0    0    0
        gptid/d1d1febc-8952-11e3-b829-90e2ba50d4e4  ONLINE      0    0    0
        gptid/d23b53dd-8952-11e3-b829-90e2ba50d4e4  ONLINE      0    0    0
      mirror-14                                    ONLINE      0    0    0
        gptid/d2ae7b85-8952-11e3-b829-90e2ba50d4e4  ONLINE      0    0    0
        gptid/d318bb9d-8952-11e3-b829-90e2ba50d4e4  ONLINE      0    0    0
      mirror-15                                    ONLINE      0    0    0
        gptid/d38df23e-8952-11e3-b829-90e2ba50d4e4  ONLINE      0    0    0
        gptid/d3f986c6-8952-11e3-b829-90e2ba50d4e4  ONLINE      0    0    0
      mirror-16                                    ONLINE      0    0    0
        gptid/d4c3f10d-8952-11e3-b829-90e2ba50d4e4  ONLINE      0    0    0
        gptid/d5328d56-8952-11e3-b829-90e2ba50d4e4  ONLINE      0    0    0
    logs
      gptid/d5918c4e-8952-11e3-b829-90e2ba50d4e4    ONLINE      0    0    0
    spares
      gptid/c57cdfe5-8952-11e3-b829-90e2ba50d4e4    AVAIL 
      gptid/c5ddccf5-8952-11e3-b829-90e2ba50d4e4    AVAIL 
 
errors: No known data errors
 

ChrisH

Dabbler
Joined
Jan 29, 2014
Messages
11
Okay, so that did fix the UNAVAILABLE problem. So my next question is: why?

I saved zdb output before and after several reboots and it has always been identical - it still is, even after the latest zpool recreation. If the zdb output is identical, doesn't that mean the geom names never changed?

Code:
[root@lens] ~# zdb
datapool:
    version: 5000
    name: 'datapool'
    state: 0
    txg: 15
    pool_guid: 5444403026939293646
    hostid: 3647479095
    hostname: 'freenas.local'
    vdev_children: 17
    vdev_tree:
        type: 'root'
        id: 0
        guid: 5444403026939293646
        create_txg: 4
        children[0]:
            type: 'mirror'
            id: 0
            guid: 2886331068647639526
            metaslab_array: 52
            metaslab_shift: 35
            ashift: 12
            asize: 4000782221312
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 3592302713242575290
                path: '/dev/da2'
                whole_disk: 1
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 2021266809768084791
                path: '/dev/da3'
                whole_disk: 1
                create_txg: 4
        children[1]:
            type: 'mirror'
            id: 1
            guid: 2505359889235095376
            metaslab_array: 50
            metaslab_shift: 35
            ashift: 12
            asize: 4000782221312
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 5112586467922550107
                path: '/dev/da4'
                whole_disk: 1
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 6419444988605564983
                path: '/dev/da5'
                whole_disk: 1
                create_txg: 4
        children[2]:
            type: 'mirror'
            id: 2
            guid: 10421073651611869117
            metaslab_array: 49
            metaslab_shift: 35
            ashift: 12
            asize: 4000782221312
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 8652468216450752575
                path: '/dev/da6'
                whole_disk: 1
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 14819342351271755606
                path: '/dev/da7'
                whole_disk: 1
                create_txg: 4
        children[3]:
            type: 'mirror'
            id: 3
            guid: 10447122375613732985
            metaslab_array: 48
            metaslab_shift: 35
            ashift: 12
            asize: 4000782221312
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 15880097202006874895
                path: '/dev/da8'
                whole_disk: 1
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 16907735780769163509
                path: '/dev/da9'
                whole_disk: 1
                create_txg: 4
        children[4]:
            type: 'mirror'
            id: 4
            guid: 7961595152656346579
            metaslab_array: 47
            metaslab_shift: 35
            ashift: 12
            asize: 4000782221312
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 3071978661370149652
                path: '/dev/da10'
                whole_disk: 1
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 1791407592833643381
                path: '/dev/da11'
                whole_disk: 1
                create_txg: 4
        children[5]:
            type: 'mirror'
            id: 5
            guid: 12139764366068428165
            metaslab_array: 46
            metaslab_shift: 35
            ashift: 12
            asize: 4000782221312
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 14867928338641177295
                path: '/dev/da12'
                whole_disk: 1
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 3150936061178659772
                path: '/dev/da13'
                whole_disk: 1
                create_txg: 4
        children[6]:
            type: 'mirror'
            id: 6
            guid: 10005074899943732872
            metaslab_array: 45
            metaslab_shift: 35
            ashift: 12
            asize: 4000782221312
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 8316830207967157625
                path: '/dev/da14'
                whole_disk: 1
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 6685696694650249293
                path: '/dev/da15'
                whole_disk: 1
                create_txg: 4
        children[7]:
            type: 'mirror'
            id: 7
            guid: 9261199040083744418
            metaslab_array: 44
            metaslab_shift: 35
            ashift: 12
            asize: 4000782221312
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 17794825240028426542
                path: '/dev/da16'
                whole_disk: 1
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 14952460444424399977
                path: '/dev/da17'
                whole_disk: 1
                create_txg: 4
        children[8]:
            type: 'mirror'
            id: 8
            guid: 2281533990930977260
            metaslab_array: 43
            metaslab_shift: 35
            ashift: 12
            asize: 4000782221312
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 12167009551578455686
                path: '/dev/da18'
                whole_disk: 1
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 2627324427696886329
                path: '/dev/da19'
                whole_disk: 1
                create_txg: 4
        children[9]:
            type: 'mirror'
            id: 9
            guid: 12707581191148113
            metaslab_array: 42
            metaslab_shift: 35
            ashift: 12
            asize: 4000782221312
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 4625506170048938841
                path: '/dev/da20'
                whole_disk: 1
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 11965429751940287398
                path: '/dev/da21'
                whole_disk: 1
                create_txg: 4
        children[10]:
            type: 'mirror'
            id: 10
            guid: 11257593337302698635
            metaslab_array: 41
            metaslab_shift: 35
            ashift: 12
            asize: 4000782221312
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 12210899300310530724
                path: '/dev/da22'
                whole_disk: 1
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 12988575474745012328
                path: '/dev/da23'
                whole_disk: 1
                create_txg: 4
        children[11]:
            type: 'mirror'
            id: 11
            guid: 4913412708716834115
            metaslab_array: 40
            metaslab_shift: 35
            ashift: 12
            asize: 4000782221312
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 17984353124639465830
                path: '/dev/da24'
                whole_disk: 1
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 8748366759598853076
                path: '/dev/da25'
                whole_disk: 1
                create_txg: 4
        children[12]:
            type: 'mirror'
            id: 12
            guid: 14330913717439157264
            metaslab_array: 39
            metaslab_shift: 35
            ashift: 12
            asize: 4000782221312
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 10315855113324936583
                path: '/dev/da26'
                whole_disk: 1
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 8172850504687767722
                path: '/dev/da27'
                whole_disk: 1
                create_txg: 4
        children[13]:
            type: 'mirror'
            id: 13
            guid: 5731155082637580458
            metaslab_array: 38
            metaslab_shift: 35
            ashift: 12
            asize: 4000782221312
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 918384368181134623
                path: '/dev/da28'
                whole_disk: 1
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 14501400028025044371
                path: '/dev/da29'
                whole_disk: 1
                create_txg: 4
        children[14]:
            type: 'mirror'
            id: 14
            guid: 6117171966242349072
            metaslab_array: 37
            metaslab_shift: 35
            ashift: 12
            asize: 4000782221312
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 11784271588938812269
                path: '/dev/da30'
                whole_disk: 1
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 4281149748709097750
                path: '/dev/da31'
                whole_disk: 1
                create_txg: 4
        children[15]:
            type: 'mirror'
            id: 15
            guid: 7892927257024989257
            metaslab_array: 36
            metaslab_shift: 35
            ashift: 12
            asize: 4000782221312
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 18083932246763186878
                path: '/dev/da32'
                whole_disk: 1
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 4983784078936673443
                path: '/dev/da33'
                whole_disk: 1
                create_txg: 4
        children[16]:
            type: 'mirror'
            id: 16
            guid: 14787340581406451108
            metaslab_array: 34
            metaslab_shift: 35
            ashift: 12
            asize: 4000782221312
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 13178871247697575703
                path: '/dev/da34'
                whole_disk: 1
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 13024903489373090391
                path: '/dev/da35'
                whole_disk: 1
                create_txg: 4
    features_for_read:
 

framewrangler

Dabbler
Joined
Nov 18, 2013
Messages
41
I think the answer lies deeper in FreeNAS and its interaction with the ZFS topology as a whole, not ZFS as a standalone element. That said, I'm out of my element here and can't help. Sorry.
 

ChrisH

Dabbler
Joined
Jan 29, 2014
Messages
11
I think I stumbled onto the answer. I noticed in the logs that FreeNAS/django helpfully logs the command it uses to create a zpool:

zpool create -o cachefile=/data/zfs/zpool.cache -o failmode=continue -o autoexpand=on -O aclmode=passthrough -O aclinherit=passthrough -f -m /lenspool -o altroot=/mnt lenspool spare /dev/gptid/c57cdfe5-8952-11e3-b829-90e2ba50d4e4 /dev/gptid/c5ddccf5-8952-11e3-b829-90e2ba50d4e4 ...

I think the applicable flag here is
-o cachefile=/data/zfs/zpool.cache

A helpful post about cachefile says:
To be able to quickly import a pool without scanning all of the devices on your system, ZFS keeps a cache of information about the pool and the devices it's found on.
And:
any pool not using the system default cachefile is not automatically imported on boot

When I created the zpool on the CLI, I did not set the cachefile.
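So the takeaway, hedged because this is my reconstruction (in practice I just let the GUI recreate and import the pool): a hand-built pool needs to end up in /data/zfs/zpool.cache, either by passing that cachefile at creation time the way the GUI does, or by re-importing the pool with it afterwards, e.g.:

Code:
[root@lens] ~# zpool export lenspool
[root@lens] ~# zpool import -R /mnt -o cachefile=/data/zfs/zpool.cache lenspool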
 

indy

Patron
Joined
Dec 28, 2013
Messages
287
Just out of curiosity: is this a recommended setup with regards to data security?
Why did you choose it over 3x 11 disk raidz3?
 

ChrisH

Dabbler
Joined
Jan 29, 2014
Messages
11
Just out of curiosity: is this a recommended setup with regards to data security?
Why did you choose it over 3x 11 disk raidz3?


Data is actually more vulnerable in this configuration than in raidz2 or raidz3: lose both disks in any one mirror and the whole pool is gone. That said, I think striped mirrors are an acceptable risk in many cases - and like anything else, backups are really important. Another downside is that you lose storage space compared to raidz*.

The main reason to pick a striped-mirror configuration is performance. I can hopefully post some benchmarks in a bit - striped mirrors should be a lot faster than anything else (particularly in read-heavy environments) while still giving you reasonable reliability. A rough capacity comparison is below.
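To put rough numbers on the space trade-off (assuming 4TB drives, which is about what the asize in the zdb output works out to): 17 mirrored pairs give 17 x 4TB = 68TB of usable space from 34 data disks, 3 x 11-disk raidz3 would give 3 x (11 - 3) x 4TB = 96TB from 33 disks, and a 4 x 9-disk raidz2 layout gives 4 x (9 - 2) x 4TB = 112TB from 36 disks - all before ZFS overhead.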
 

indy

Patron
Joined
Dec 28, 2013
Messages
287
Benchmarks would indeed be interesting. ;)
 

ChrisH

Dabbler
Joined
Jan 29, 2014
Messages
11
Sorry this took a while. Here are the surprising (to me) iozone benchmark results.

Based on these benchmarks I decided that the performance benefit of mirrored vdevs isn't worth the lost storage space.

Three of the charts show performance with the ARC and SLOG enabled at full force; the final one shows performance with no SLOG and a very small ARC (how the ARC was capped is noted after the chart list).

chart1.png
chart2.png
chart3.png
small-cache.png
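For the small-cache run, the ARC was capped with the standard FreeBSD loader tunable - a sketch of the mechanism rather than my exact config; it can be set in /boot/loader.conf or through the FreeNAS Tunables page and takes effect after a reboot:

Code:
# /boot/loader.conf
vfs.zfs.arc_max="1073741824"   # cap the ARC at 1 GiB for the small-cache test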
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Ok.. Instead of providing the pretty charts, can you provide the actual command and raw output? 99% of the time charts are a bad way to go. They look pretty, but they water down the real information and end up adding no value. Solely based on your test results, I'd say you are definitely in that 99%. ;)
 

ChrisH

Dabbler
Joined
Jan 29, 2014
Messages
11
Well, charts do not tell the whole story, but I tried to tell it concisely without uploading the spreadsheets. All y-axes are in MiB/s, by the way (only one chart has that labeled).

I also collected more data which I'll share. Since you want all the detail, here ya go. I will agree to disagree with you on the value of the charts, but I also know how they were made so that may make them seem more useful to me than they are to others.
 

Attachments

  • sheets.zip
    458.5 KB · Views: 201

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
So, you are doing exactly what I thought you were doing before I even opened the charts. It's a common noob mistake.. don't feel bad about it. Here's your mistake.

I'm assuming you are still running 256GB of RAM, right? If so, all of your tests are instantly invalidated. Your test size must be at least two times your system RAM (ARC size technically, which is basically all of your system RAM minus a couple of GB). For a system of your size, this means the tests could potentially take days. The reason is that the ZFS cache is going to give you huge artificial performance gains. What you need to do is rerun the tests with the ARC forced down to something like 10GB and a 50GB test size - unless you want very realistic tests that take days. If your test size is less than your ARC size, the test file ends up entirely in RAM, which gives you artificially high values. I can run a test on a pool with a single disk that will make you think my pool does 4GB/sec.

Another thing you aren't recognizing is that a 64k record size is far, far more than your typical scenario will use, so that parameter needs to change. I'd expect something in the 8k to 16k range is more likely, which is going to seriously kill the benchmark values you get.
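Something along these lines is what I mean - a sketch, so adjust the path and sizes for your pool, and cap the ARC to ~10GB first:

Code:
# sequential write/read plus random I/O, 16k records, 50GB file (well beyond the capped ARC)
iozone -i 0 -i 1 -i 2 -r 16k -s 50g -f /mnt/lenspool/iozone.tmp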

And the last easy reason your test is invalidated is the L2ARC. L2ARCs are great. They can take an extremely poor performing pool and make it perform adequately, or even at rocket speeds. The problem, though, is that the L2ARC doesn't fill quickly (something like 8MB/sec maximum), it doesn't expire data as rapidly as you may need for the benchmarks to show a gain, it doesn't necessarily fill to capacity, the drive chosen may not be a good drive for an L2ARC (this can make a big difference too!), and no data goes into the L2ARC until it's been read something like 5 times. The L2ARC isn't the pedal-to-the-metal device people expect, and it's virtually invisible in benchmarks because of all of these design considerations. I explain the ARC and L2ARC with the analogy that the ARC is like a racecar in the red and the L2ARC is a Mack truck. You want the ARC to get you high speed short term and the L2ARC to get you high speed long term. A well-sized ARC and L2ARC can be amazing, but if you don't rightsize them for your workload, you can actually make things worse. Throwing more hardware at the problem because it's slow is a recipe for spending a lot of money on the wrong stuff. If you have money to burn you can PayPal it to me.. I can take care of it for you. ;)

iozone is no joke though. You have to know how to apply iozone properly for your situation, and you have to know how to interpret the responses. iozone also doesn't simulate all situations well. In many situations it's a useless benchmark. To be honest, when I do consultation work for small companies that are doing setups like yours I never do iozone. Why? Because it's virtually impossible to apply it properly and interpret it properly. iozone is not a one stop shop for benchmarking, and improperly applying it gives you bad information that will lead you down a path that is wrong. I rely on my prior experience and knowledge to create a server that should work fairly well for the user, has room for expansion in aspects that expansion may be likely or expected, and see where things go. I prefer to start with a relatively inexpensive server than to go all-out and throw $10k on a server.

It's easier to deal with a server when you run it in production, see what's slow and what's not, and add or change hardware as appropriate, than to try to get iozone to give you some synthetic benchmark that means something for your situation. I talked to someone a year or so ago about iozone and he said that he's only seen a handful of cases where iozone was properly applied to a server in almost 10 years of doing work on FreeBSD. He said it's so complex that there's no chance someone is going to do some reading and learn how to use iozone properly without a few years under their belt. I can believe it too, because although I don't consider myself an authority on iozone, if I can pick out a few of the common reasons why people fail these tests, who knows what else is AFU with the testing parameters.

If you are looking for someone to do some consultation send me a PM. Your situation is far from trivial and your situation is sufficiently complex I wouldn't dare try to dive into this problem in a forum setting. You will probably find my price very reasonable too. ;)
 

ChrisH

Dabbler
Joined
Jan 29, 2014
Messages
11
I'm assuming you are still running 256GB of RAM, right? If so, all of your tests are instantly invalidated. Your test size must be at least two times your system RAM (ARC size technically, which is basically all of your system RAM minus a couple of GB). For a system of your size, this means the tests could potentially take days. The reason is that the ZFS cache is going to give you huge artificial performance gains. What you need to do is rerun the tests with the ARC forced down to something like 10GB and a 50GB test size - unless you want very realistic tests that take days. If your test size is less than your ARC size, the test file ends up entirely in RAM, which gives you artificially high values. I can run a test on a pool with a single disk that will make you think my pool does 4GB/sec.

Because cache is used in production I wanted to run benchmarks against the cache. This wasn't a mistake, though you can argue about the merits of testing w/wo cache. You'll notice I also did additional testing with 1GB ARC, SLOG disabled and 10GB file size (so cache effectively not used). I care more about the results with cache because that's closer to the real world use case.

Another thing you aren't recognizing is that a 64k record size is far, far more than your typical scenario will use, so that parameter needs to change. I'd expect something in the 8k to 16k range is more likely, which is going to seriously kill the benchmark values you get.

This might be a good point. I picked 64k because that is the size of the NFS packets in production, and I wanted this server to be robust against large data transfers. I could be missing why 64k is a poor choice, but I haven't seen an explanation for why a lower number will be better for the use case I planned for.
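For context, the production clients move data over NFS with 64k transfer sizes - something like the mount line below (illustrative; the client-side mount point is made up):

Code:
# on an NFS client; "lens" is this server, /mnt/lens is an illustrative mount point
mount -t nfs -o rw,rsize=65536,wsize=65536 lens:/mnt/lenspool /mnt/lens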

And the last easy reason your test is invalidated is the L2ARC. L2ARCs are great.

I did not use an L2ARC - I'm not sure if you noticed that. It's all ARC, baby. ;) I followed what I took to be good advice from someone with ZFS experience who said: why bother with an L2ARC when you can just boost the RAM instead? Why buy a Mack truck when you can buy a racecar with room for storage?

iozone is no joke though. You have to know how to apply iozone properly for your situation, and you have to know how to interpret the responses. iozone also doesn't simulate all situations well. In many situations it's a useless benchmark.

There's definitely truth to this. Benchmarking is hard to do accurately.

The point is that I ran the same exact tests against a 17xmirror (34 disk) and a 4x9raidz2 (36 disk) system. As long as the tests are identical, even if you don't deeply understand the tests you at least learn something about the comparison between the two configurations. I was trying to figure out which configuration to use for my system, not to state the best solution for every system.

These are my tests for my expected workload. I found them useful. YMMV.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Because cache is used in production I wanted to run benchmarks against the cache. This wasn't a mistake, though you can argue about the merits of testing w/wo cache. You'll notice I also did additional testing with 1GB ARC, SLOG disabled and 10GB file size (so cache effectively not used). I care more about the results with cache because that's closer to the real world use case.

Ok, but you're testing a file that is (at maximum) 100GB. It'll fit 100% in the ARC (obviously). So you're really going to tell me that 100% of your real-world data will always be in the ARC? Because remember, iozone is about reproducing the real world, which means duplicating real-world conditions.

Money says you have far more than 240GB or so of data that you want to use over NFS (presumably VMs). If it were so little data, I'd tell you to buy the most expensive, awesome SSD you can find and put your data on that. It'll be very fast and you'd be very happy. But let's be real.. you have far more data than 240GB, don't you? So yes, your test is invalid and not indicative of the real world at all.

So I stand by my comments and example that your testing methods are in no way a reflection of the real world. Sorry, but it's a fundamental truth I've picked up after being here for more than 2 years. Take it or leave it. I was here hoping to teach you why your tests are invalid. You aren't going to convince me otherwise, because I clearly have years of experience over you and I see the same mistakes all the time. That's why I knew, before I even saw your little charts, that you were going to get it wrong. I've had this argument with many people over the years on this forum, and they always come crawling back with a "damn, you have your stuff together" and admit they were defeated by experience. Not that I'm a total god you should blindly listen to, but instead of arguing with me about why your test is valid, you should actually listen to why it's invalid and go do more research.

Anyway, good luck. I've spent about as much time on this thread as I feel like spending. I'm not a fan of writing novels I've already written in the past in the forums.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
The point is that I ran the same exact tests against a 17xmirror (34 disk) and a 4x9raidz2 (36 disk) system. As long as the tests are identical, even if you don't deeply understand the tests you at least learn something about the comparison between the two configurations.

The concern is that if you're just hammering ARC, you aren't actually testing anything different between the two configurations (RAIDZ vs mirrored) since you never stress them beyond that point.

To extend the car analogy, it's like saying that the racecar and the Mack truck perform equally well based on a city driving test where neither one is allowed to exceed 30mph. However, one is going to be able to push 200mph on the open salt flats, and the other will merrily haul 20 tons of cargo. Ask them to switch roles and they'll both explode horribly.
 