LSI 9201-16i resetting when receiving snapshots

Status
Not open for further replies.

herr_tichy

Dabbler
Joined
Jul 14, 2014
Messages
12
I have the following controller:

Code:
mps0@pci0:3:0:0:        class=0x010700 card=0x30c01000 chip=0x00641000 rev=0x02 hdr=0x00
    vendor     = 'LSI Logic / Symbios Logic'
    device     = 'SAS2116 PCI-Express Fusion-MPT SAS-2 [Meteor]'
    class      = mass storage
    subclass   = SAS



... which is an LSI 9201-16i with 14 Seagate Constellation ES.3 drives connected to it.
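
That listing is an excerpt of pciconf -lv output, by the way; to pull just the controller entry on another FreeBSD box, something like this works - the grep pattern is only an example:

Code:
# FreeBSD: list PCI devices with driver attachment and vendor/device
# strings; the mps0 block above is an excerpt of this output
pciconf -lv | grep -B1 -A4 SAS2116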

Now, when I receive a ZFS snapshot on the machine, the following happens after some time:

Code:
mps0: IOC Fault 0x40000d04, Resetting
mps0: Reinitializing controller,
mps0: Firmware: 19.00.00.00, Driver: 16.00.00.00-fbsd
mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
mps0: mps_reinit finished sc 0xffffff8000c7a000 post 4 free 3
        (probe6:mps0:0:6:0): TEST UNIT READY. CDB: 00 00 00 00 00 00 length 0 SMID 198 command timeout cm 0xffffff8000cc7db0 ccb 0xfffffe01650c6800
        (noperiph:mps0:0:4294967295:0): SMID 1 Aborting command 0xffffff8000cc7db0
        (probe6:mps0:0:6:0): TEST UNIT READY. CDB: 00 00 00 00 00 00 length 0 SMID 863 command timeout cm 0xffffff8000cfd1b8 ccb 0xfffffe01650c6800
        (noperiph:mps0:0:4294967295:0): SMID 2 Aborting command 0xffffff8000cfd1b8
        (probe6:mps0:0:6:0): TEST UNIT READY. CDB: 00 00 00 00 00 00 length 0 SMID 528 command timeout cm 0xffffff8000ce2480 ccb 0xfffffe01650c6800
        (noperiph:mps0:0:4294967295:0): SMID 3 Aborting command 0xffffff8000ce2480
        (probe6:mps0:0:6:0): TEST UNIT READY. CDB: 00 00 00 00 00 00 length 0 SMID 428 command timeout cm 0xffffff8000cda460 ccb 0xfffffe01650c6800
        (noperiph:mps0:0:4294967295:0): SMID 4 Aborting command 0xffffff8000cda460
        (probe6:mps0:0:6:0): TEST UNIT READY. CDB: 00 00 00 00 00 00 length 0 SMID 1013 command timeout cm 0xffffff8000d091e8 ccb 0xfffffe01650c6800
        (noperiph:mps0:0:4294967295:0): SMID 5 Aborting command 0xffffff8000d091e8
        (probe6:mps0:0:6:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID 312 command timeout cm 0xffffff8000cd0fc0 ccb 0xfffffe01650c6800
        (noperiph:mps0:0:4294967295:0): SMID 6 Aborting command 0xffffff8000cd0fc0
(probe6:mps0:0:6:0): INQUIRY. CDB: 12 00 00 00 24 00
(probe6:mps0:0:6:0): CAM status: Command timeout
(probe6:mps0:0:6:0): Retrying command
        (probe6:mps0:0:6:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID 213 command timeout cm 0xffffff8000cc90e8 ccb 0xfffffe01650c6800
        (noperiph:mps0:0:4294967295:0): SMID 7 Aborting command 0xffffff8000cc90e8
(probe6:mps0:0:6:0): INQUIRY. CDB: 12 00 00 00 24 00
(probe6:mps0:0:6:0): CAM status: Command timeout
(probe6:mps0:0:6:0): Retrying command
        (probe6:mps0:0:6:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID 657 command timeout cm 0xffffff8000cec9c8 ccb 0xfffffe01650c6800
        (noperiph:mps0:0:4294967295:0): SMID 8 Aborting command 0xffffff8000cec9c8
(probe6:mps0:0:6:0): INQUIRY. CDB: 12 00 00 00 24 00
(probe6:mps0:0:6:0): CAM status: Command timeout
(probe6:mps0:0:6:0): Retrying command
        (probe6:mps0:0:6:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID 532 command timeout cm 0xffffff8000ce29a0 ccb 0xfffffe01650c6800
        (noperiph:mps0:0:4294967295:0): SMID 9 Aborting command 0xffffff8000ce29a0
(probe6:mps0:0:6:0): INQUIRY. CDB: 12 00 00 00 24 00
(probe6:mps0:0:6:0): CAM status: Command timeout
(probe6:mps0:0:6:0): Retrying command
        (probe6:mps0:0:6:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID 329 command timeout cm 0xffffff8000cd2588 ccb 0xfffffe01650c6800
        (noperiph:mps0:0:4294967295:0): SMID 10 Aborting command 0xffffff8000cd2588
(probe6:mps0:0:6:0): INQUIRY. CDB: 12 00 00 00 24 00
(probe6:mps0:0:6:0): CAM status: Command timeout
(probe6:mps0:0:6:0): Error 5, Retries exhausted
da4 at mps0 bus 0 scbus0 target 6 lun 0
da4: <SEAGATE ST2000NM0023 A001> s/n Z1X04SVD0000S334R4W1 detached
        (da4:mps0:0:6:0): WRITE(10). CDB: 2a 00 ca dd 6b 40 00 01 00 00 length 131072 SMID 315 command timeout cm 0xffffff8000cd1398 ccb 0xfffffe001baf2800
        (noperiph:mps0:0:4294967295:0): SMID 11 Aborting command 0xffffff8000cd1398
(da4:mps0:0:6:0): WRITE(10). CDB: 2a 00 ca dd 6b 40 00 01 00 00
(da4:mps0:0:6:0): CAM status: Command timeout
(da4:mps0:0:6:0): Error 5, Periph was invalidated
(da4:mps0:0:6:0): Periph destroyed
[root@ltm-babylon] ~# zpool status
  pool: ltm_storage
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 0 in 9h13m with 0 errors on Mon Oct 20 20:57:32 2014
config:

        NAME                                            STATE     READ WRITE CKSUM
        ltm_storage                                     DEGRADED     0     0     0
          raidz3-0                                      DEGRADED     0     0     0
            gptid/410bd563-05ca-11e4-9087-001e6788f6c7  ONLINE       0     0     0
            gptid/a840a5a0-a78d-11e3-b33f-001e6788f6c7  ONLINE       0     0     0
            gptid/a881c1e9-a78d-11e3-b33f-001e6788f6c7  ONLINE       0     0     0
            gptid/a8c4ce09-a78d-11e3-b33f-001e6788f6c7  ONLINE       0     0     0
            gptid/1c9f179a-0c15-11e4-9cab-001e6788f6c7  ONLINE       0     0     0
            gptid/a94930d0-a78d-11e3-b33f-001e6788f6c7  ONLINE       0     0     0
            706702018939685941                          REMOVED      0     0     0  was /dev/gptid/a98c8616-a78d-11e3-b33f-001e6788f6c7
          raidz3-1                                      ONLINE       0     0     0
            gptid/a9d009d6-a78d-11e3-b33f-001e6788f6c7  ONLINE       0     0     0
            gptid/aa0fed44-a78d-11e3-b33f-001e6788f6c7  ONLINE       0     0     0
            gptid/aa52a552-a78d-11e3-b33f-001e6788f6c7  ONLINE       0     0     0
            gptid/aa9661dd-a78d-11e3-b33f-001e6788f6c7  ONLINE       0     0     0
            gptid/dd8716b4-06a1-11e4-9087-001e6788f6c7  ONLINE       0     0     0
            gptid/ab1f5e9c-a78d-11e3-b33f-001e6788f6c7  ONLINE       0     0     0
            gptid/ab667c1f-a78d-11e3-b33f-001e6788f6c7  ONLINE       0     0     0

errors: No known data errors
[root@ltm-babylon] ~#


When I reset the machine, everything is fine again.
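
Incidentally, the 'action' line in the zpool status above suggests the dropped disk can be brought back without a full reboot; using the gptid from that output, that should presumably be something like:

Code:
# reattach the REMOVED member without rebooting (gptid taken from the
# zpool status output above)
zpool online ltm_storage gptid/a98c8616-a78d-11e3-b33f-001e6788f6c7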

I've tried simulating the load of receiving a ZFS snapshot: another machine sends /dev/zero via nc to /dev/null at line speed, while a small chunk of random data (so it isn't compressible) from a tmpfs is appended onto a file on the ZFS pool over and over; a rough sketch of the setup follows below. That resulted in 112 MB/s on the network and about 1 GB/s of sustained writes on the ZFS. I've done that twice now for the entire capacity of the pool and it worked fine, yet when I start receiving a snapshot, the above happens again after a few GB.
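
This is roughly what the simulation looked like (the port number, paths, and sizes here are illustrative, not the exact values I used):

Code:
# --- network load: the other machine streams zeroes at line speed ---
# on this box: sink everything arriving on an arbitrary port
nc -l 3333 > /dev/null

# on the other machine: push zeroes over the wire
dd if=/dev/zero bs=1m | nc ltm-babylon 3333

# --- disk load: append incompressible data to the pool in a loop ---
# /tmp is assumed to be tmpfs-backed; chunk size and target path are
# illustrative
dd if=/dev/random of=/tmp/chunk bs=1m count=64
while true; do cat /tmp/chunk >> /mnt/ltm_storage/loadtest; done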

I'm all out of ideas on how to even start solving this problem. Halp?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
This sounds like a very specific area of /dev/da4 is damaged and it's being hit when you try to receive a snapshot.

Post the output of smartctl -a /dev/da4 in a code block, like you did for your zpool status. ;)
 

herr_tichy

Dabbler
Joined
Jul 14, 2014
Messages
12
Code:

[root@ltm-babylon] ~# smartctl -a /dev/da4
smartctl 6.2 2013-07-26 r3841 [FreeBSD 9.2-RELEASE-p12 amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST2000NM0023
Revision:             A001
User Capacity:        2,000,398,934,016 bytes [2.00 TB]
Logical block size:   512 bytes
Logical block provisioning type unreported, LBPME=0, LBPRZ=0
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c50056570a0b
Serial number:        Z1X04SVD0000S334R4W1
Device type:          disk
Transport protocol:   SAS
Local Time is:        Sun Oct 26 13:21:25 2014 CET
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     35 C
Drive Trip Temperature:        68 C

Manufactured in week @* of year 20#
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  75
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  75
Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 301554535
  Blocks received from initiator = 1034646598
  Blocks read from cache and sent to initiator = 293162844
  Number of read and write commands whose size <= segment size = 3613097
  Number of read and write commands whose size > segment size = 5816

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 9220.05
  number of minutes until next internal SMART test = 31

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   2714067489        0         0  2714067489          0       2353.419           0
write:         0        0         0         0          0       2729.319           0

Non-medium error count:        0


[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
No self-tests have been logged


I don't see any errors logged there, or am I overlooking something? The problem above isn't exclusive to da4, btw; it hits the other disks randomly. We had a few of them replaced under warranty before I noticed that the error hits new and old disks alike.
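
One thing I did notice in there: the GLTSD line means the drive isn't saving its log pages across power cycles, and smartctl itself names the fix, which should just be:

Code:
# enable saving of log pages across power cycles, per the GLTSD note
# in the smartctl output above
smartctl -S on /dev/da4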
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Well, if it's affecting other disks too, that would have been useful to know up front. You're going to need to post full system specs and the debug file. (The SMART info itself looks fine, but you should schedule SMART tests in the WebGUI; a manual one-off is sketched below.)

We may or may not be able to help you with this problem, but let's see. ;)
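
If you want to run one by hand before setting up the schedule, the underlying call is plain smartctl (the WebGUI scheduler is the proper way on FreeNAS; this is just the one-off equivalent):

Code:
# start a long SMART self-test on the suspect disk; the result shows
# up later in the self-test log of smartctl -a /dev/da4
smartctl -t long /dev/da4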
 

herr_tichy

Dabbler
Joined
Jul 14, 2014
Messages
12
Thank you. The problem has apparently decided that it suddenly doesn't want to be reproduced. I've been sending big snapshots to the machine all week long - no hangs, no controller resets, nothing. *sigh*
 