herr_tichy (Dabbler)
Joined: Jul 14, 2014 · Messages: 12
I have the following controller:
Code:
mps0@pci0:3:0:0:  class=0x010700 card=0x30c01000 chip=0x00641000 rev=0x02 hdr=0x00
    vendor     = 'LSI Logic / Symbios Logic'
    device     = 'SAS2116 PCI-Express Fusion-MPT SAS-2 [Meteor]'
    class      = mass storage
    subclass   = SAS
... which is an LSI 9201-16i with 14 Seagate Constellation ES.3 drives connected to it.
Now, when I receive a ZFS snapshot on the machine, the following happens after some time:
Code:
mps0: IOC Fault 0x40000d04, Resetting
mps0: Reinitializing controller,
mps0: Firmware: 19.00.00.00, Driver: 16.00.00.00-fbsd
mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
mps0: mps_reinit finished sc 0xffffff8000c7a000 post 4 free 3
(probe6:mps0:0:6:0): TEST UNIT READY. CDB: 00 00 00 00 00 00 length 0 SMID 198 command timeout cm 0xffffff8000cc7db0 ccb 0xfffffe01650c6800
(noperiph:mps0:0:4294967295:0): SMID 1 Aborting command 0xffffff8000cc7db0
(probe6:mps0:0:6:0): TEST UNIT READY. CDB: 00 00 00 00 00 00 length 0 SMID 863 command timeout cm 0xffffff8000cfd1b8 ccb 0xfffffe01650c6800
(noperiph:mps0:0:4294967295:0): SMID 2 Aborting command 0xffffff8000cfd1b8
(probe6:mps0:0:6:0): TEST UNIT READY. CDB: 00 00 00 00 00 00 length 0 SMID 528 command timeout cm 0xffffff8000ce2480 ccb 0xfffffe01650c6800
(noperiph:mps0:0:4294967295:0): SMID 3 Aborting command 0xffffff8000ce2480
(probe6:mps0:0:6:0): TEST UNIT READY. CDB: 00 00 00 00 00 00 length 0 SMID 428 command timeout cm 0xffffff8000cda460 ccb 0xfffffe01650c6800
(noperiph:mps0:0:4294967295:0): SMID 4 Aborting command 0xffffff8000cda460
(probe6:mps0:0:6:0): TEST UNIT READY. CDB: 00 00 00 00 00 00 length 0 SMID 1013 command timeout cm 0xffffff8000d091e8 ccb 0xfffffe01650c6800
(noperiph:mps0:0:4294967295:0): SMID 5 Aborting command 0xffffff8000d091e8
(probe6:mps0:0:6:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID 312 command timeout cm 0xffffff8000cd0fc0 ccb 0xfffffe01650c6800
(noperiph:mps0:0:4294967295:0): SMID 6 Aborting command 0xffffff8000cd0fc0
(probe6:mps0:0:6:0): INQUIRY. CDB: 12 00 00 00 24 00
(probe6:mps0:0:6:0): CAM status: Command timeout
(probe6:mps0:0:6:0): Retrying command
(probe6:mps0:0:6:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID 213 command timeout cm 0xffffff8000cc90e8 ccb 0xfffffe01650c6800
(noperiph:mps0:0:4294967295:0): SMID 7 Aborting command 0xffffff8000cc90e8
(probe6:mps0:0:6:0): INQUIRY. CDB: 12 00 00 00 24 00
(probe6:mps0:0:6:0): CAM status: Command timeout
(probe6:mps0:0:6:0): Retrying command
(probe6:mps0:0:6:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID 657 command timeout cm 0xffffff8000cec9c8 ccb 0xfffffe01650c6800
(noperiph:mps0:0:4294967295:0): SMID 8 Aborting command 0xffffff8000cec9c8
(probe6:mps0:0:6:0): INQUIRY. CDB: 12 00 00 00 24 00
(probe6:mps0:0:6:0): CAM status: Command timeout
(probe6:mps0:0:6:0): Retrying command
(probe6:mps0:0:6:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID 532 command timeout cm 0xffffff8000ce29a0 ccb 0xfffffe01650c6800
(noperiph:mps0:0:4294967295:0): SMID 9 Aborting command 0xffffff8000ce29a0
(probe6:mps0:0:6:0): INQUIRY. CDB: 12 00 00 00 24 00
(probe6:mps0:0:6:0): CAM status: Command timeout
(probe6:mps0:0:6:0): Retrying command
(probe6:mps0:0:6:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID 329 command timeout cm 0xffffff8000cd2588 ccb 0xfffffe01650c6800
(noperiph:mps0:0:4294967295:0): SMID 10 Aborting command 0xffffff8000cd2588
(probe6:mps0:0:6:0): INQUIRY. CDB: 12 00 00 00 24 00
(probe6:mps0:0:6:0): CAM status: Command timeout
(probe6:mps0:0:6:0): Error 5, Retries exhausted
da4 at mps0 bus 0 scbus0 target 6 lun 0
da4: <SEAGATE ST2000NM0023 A001> s/n Z1X04SVD0000S334R4W1 detached
(da4:mps0:0:6:0): WRITE(10). CDB: 2a 00 ca dd 6b 40 00 01 00 00 length 131072 SMID 315 command timeout cm 0xffffff8000cd1398 ccb 0xfffffe001baf2800
(noperiph:mps0:0:4294967295:0): SMID 11 Aborting command 0xffffff8000cd1398
(da4:mps0:0:6:0): WRITE(10). CDB: 2a 00 ca dd 6b 40 00 01 00 00
(da4:mps0:0:6:0): CAM status: Command timeout
(da4:mps0:0:6:0): Error 5, Periph was invalidated
(da4:mps0:0:6:0): Periph destroyed

[root@ltm-babylon] ~# zpool status
  pool: ltm_storage
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 0 in 9h13m with 0 errors on Mon Oct 20 20:57:32 2014
config:

        NAME                                            STATE     READ WRITE CKSUM
        ltm_storage                                     DEGRADED     0     0     0
          raidz3-0                                      DEGRADED     0     0     0
            gptid/410bd563-05ca-11e4-9087-001e6788f6c7  ONLINE       0     0     0
            gptid/a840a5a0-a78d-11e3-b33f-001e6788f6c7  ONLINE       0     0     0
            gptid/a881c1e9-a78d-11e3-b33f-001e6788f6c7  ONLINE       0     0     0
            gptid/a8c4ce09-a78d-11e3-b33f-001e6788f6c7  ONLINE       0     0     0
            gptid/1c9f179a-0c15-11e4-9cab-001e6788f6c7  ONLINE       0     0     0
            gptid/a94930d0-a78d-11e3-b33f-001e6788f6c7  ONLINE       0     0     0
            706702018939685941                          REMOVED      0     0     0  was /dev/gptid/a98c8616-a78d-11e3-b33f-001e6788f6c7
          raidz3-1                                      ONLINE       0     0     0
            gptid/a9d009d6-a78d-11e3-b33f-001e6788f6c7  ONLINE       0     0     0
            gptid/aa0fed44-a78d-11e3-b33f-001e6788f6c7  ONLINE       0     0     0
            gptid/aa52a552-a78d-11e3-b33f-001e6788f6c7  ONLINE       0     0     0
            gptid/aa9661dd-a78d-11e3-b33f-001e6788f6c7  ONLINE       0     0     0
            gptid/dd8716b4-06a1-11e4-9087-001e6788f6c7  ONLINE       0     0     0
            gptid/ab1f5e9c-a78d-11e3-b33f-001e6788f6c7  ONLINE       0     0     0
            gptid/ab667c1f-a78d-11e3-b33f-001e6788f6c7  ONLINE       0     0     0

errors: No known data errors
[root@ltm-babylon] ~#
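For what it's worth, the recovery path the status output's action line points at would look roughly like this, a sketch using the pool name and device GUID from the output above (only worth running once the controller fault is cleared and the disk itself checks out healthy):

```shell
# Reattach the removed pool member by the GUID shown in zpool status,
# then check that a resilver has started
zpool online ltm_storage 706702018939685941
zpool status ltm_storage
```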
When I reset the machine, everything is fine again.
I've tried simulating the load of receiving a ZFS snapshot: from another machine I sent /dev/zero via nc to /dev/null at line speed, while repeatedly appending a small amount of random data (so it isn't compressible) from a tmpfs onto a file on the ZFS pool. That resulted in 112 MB/s on the network and about 1 GB/s sustained writes on the ZFS. I've done that twice now for the entire capacity of the pool and it worked fine, yet when I start receiving a snapshot, the above happens again after a few GB.
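In case anyone wants to reproduce the load test, this is roughly what I mean, as a sketch; host name, port, chunk size, and paths here are made-up examples, not the exact values I used:

```shell
# Hypothetical sketch of the simulated load described above.

send_zeros() {
    # On the other machine: stream zeros at line speed to the storage box
    # ($1 = host, $2 = port)
    dd if=/dev/zero bs=1M | nc "$1" "$2"
}

sink_to_null() {
    # On the storage box: absorb the stream so only the NIC is exercised
    # ($1 = port)
    nc -l "$1" > /dev/null
}

append_random() {
    # Stage one 64 MB chunk of incompressible data in tmpfs, then append
    # it to a file on the pool over and over
    # ($1 = tmpfs chunk file, $2 = target file on the pool, $3 = repeats)
    dd if=/dev/urandom of="$1" bs=1M count=64 2>/dev/null
    i=0
    while [ "$i" -lt "$3" ]; do
        cat "$1" >> "$2"
        i=$((i + 1))
    done
}
```

Run `sink_to_null` and `send_zeros` on the two machines while `append_random` loops against the pool; that kept the NIC at line speed and the pool writing continuously, yet never triggered the fault.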
I'm all out of ideas on how to even start solving this problem. Halp?