I've got a home-built machine running on an AMD 350 board, with 6 drives attached directly and 8 attached to an IBM M1015 card that I bought pre-flashed from eBay. It's been running fairly well for the last 6 months, but a few days ago I happened to check my security run emails and noticed a fair number of errors like these in the kernel logs:
+(da2:mps0:0:3:0): READ(10). CDB: 28 0 86 e9 4f 30 0 0 28 0 length 20480 SMID 651 terminated ioc 804b scsi 0 state c xfer 0
+(da2:mps0:0:3:0): READ(10). CDB: 28 0 86 e9 46 c0 0 0 28 0 length 20480 SMID 370 terminated ioc 804b scsi 0 state c xfer 0
+(da2:mps0:0:3:0): READ(10). CDB: 28 0 86 e9 4e a8 0 0 30 0 length 24576 SMID 433 terminated ioc 804b scsi 0 state c xfer 16388
+(da2:mps0:0:3:0): READ(10). CDB: 28 0 86 e9 4e d8 0 0 28 0 length 20480 SMID 985 terminated ioc 804b scsi 0 state c xfer 0
+(da2:mps0:0:3:0): SYNCHRONIZE CACHE(10). CDB: 35 0 0 0 0 0 0 0 0 0
+(da2:mps0:0:3:0): CAM status: SCSI Status Error
+(da2:mps0:0:3:0): SCSI status: Check Condition
+(da2:mps0:0:3:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
+(da5:mps0:0:6:0): READ(10). CDB: 28 0 86 ea 91 c0 0 0 b0 0 length 90112 SMID 139 terminated ioc 804b scsi 0 state c xfer 0
+(da5:mps0:0:6:0): READ(10). CDB: 28 0 86 ea 91 90 0 0 30 0 length 24576 SMID 927 terminated ioc 804b scsi 0 state c xfer 0
+(da5:mps0:0:6:0): READ(10). CDB: 28 0 86 ea 90 28 0 0 e0 0 length 114688 SMID 110 terminated ioc 804b scsi 0 state c xfer 32772
+(da5:mps0:0:6:0): READ(10). CDB: 28 0 86 ea 91 38 0 0 58 0 length 45056 SMID 438 terminated ioc 804b scsi 0 state c xfer 0
+(da5:mps0:0:6:0): WRITE(10). CDB: 2a 0 8f 15 8e f0 0 0 30 0
+(da5:mps0:0:6:0): CAM status: SCSI Status Error
+(da5:mps0:0:6:0): SCSI status: Check Condition
+(da5:mps0:0:6:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
The timestamps showed the errors occurring every few hours, with no pattern that I could see. They only seem to occur on da1, da2, da3, and da5 (so if it's bad cabling, it must be both cables). SMART shows no issues (both long and short self-tests pass). zpool status shows a few CKSUM errors, but they don't correspond in time with the errors in the logs (i.e. there will be errors in the logs with no corresponding CKSUM errors in zpool status).
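To back up the "which drives" observation with numbers, the terminated commands can be tallied per device. This is only a minimal sketch: it assumes the log lines look exactly like the excerpt above (a `(daN:mpsN:...)` prefix and the words `terminated ioc`), and the function name `count_terminated` is made up for illustration.

```python
import re
from collections import Counter

# Matches the device name in lines such as
# "(da2:mps0:0:3:0): READ(10). CDB: ... SMID 651 terminated ioc 804b ..."
# The pattern is an assumption based on the excerpt in this post.
TERMINATED = re.compile(r"\((da\d+):mps\d+:\d+:\d+:\d+\): .* terminated ioc ")


def count_terminated(lines):
    """Return a Counter mapping device name to number of terminated commands."""
    counts = Counter()
    for line in lines:
        match = TERMINATED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Two lines copied from the excerpt above, used as sample input.
sample = [
    "+(da2:mps0:0:3:0): READ(10). CDB: 28 0 86 e9 4f 30 0 0 28 0 length 20480 SMID 651 terminated ioc 804b scsi 0 state c xfer 0",
    "+(da5:mps0:0:6:0): READ(10). CDB: 28 0 86 ea 91 c0 0 0 b0 0 length 90112 SMID 139 terminated ioc 804b scsi 0 state c xfer 0",
]
for device, count in count_terminated(sample).most_common():
    print(f"{device}: {count}")
```

Feeding it the full kernel log instead of the two sample lines (for example by passing an open file object to `count_terminated`) would show whether da5 really dominates and whether any other devices appear at all.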
This morning it seemed much worse: errors every few minutes, and accessing files over the network seemed to slow down or hitch a little whenever the errors were occurring. zpool status still shows no data loss and no increase in errors, and SMART continues to show no issues.
Do I have four drives failing silently on me, or could there be something else going on? I've got a spare drive and could try replacing the one that shows up most often in the logs (da5), but I'm kind of afraid that the stress of resilvering might cause others to fail if that's really the problem. Does anyone have any suggestions? I can provide more information if necessary.
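For anyone reading the excerpt, the CDB bytes in the READ(10)/WRITE(10) lines can be decoded to see which blocks the terminated commands were addressing. In a 10-byte read/write CDB, bytes 2 to 5 are the big-endian logical block address and bytes 7 to 8 are the transfer length in blocks. The sketch below assumes 512-byte logical blocks, which matches the logged lengths (0x28 blocks = 20480 bytes); the helper name `decode_rw10` is made up for illustration.

```python
def decode_rw10(cdb_text, block_size=512):
    """Decode a READ(10)/WRITE(10) CDB printed as space-separated hex bytes.

    Returns (lba, blocks, byte_length). block_size=512 is an assumption
    inferred from the "length" values in the log excerpt.
    """
    cdb = [int(byte, 16) for byte in cdb_text.split()]
    if len(cdb) != 10 or cdb[0] not in (0x28, 0x2A):
        raise ValueError("not a READ(10)/WRITE(10) CDB")
    lba = int.from_bytes(bytes(cdb[2:6]), "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(bytes(cdb[7:9]), "big")  # bytes 7-8: transfer length in blocks
    return lba, blocks, blocks * block_size


# First da2 line and first da5 line from the excerpt.
for cdb in ("28 0 86 e9 4f 30 0 0 28 0", "28 0 86 ea 91 c0 0 0 b0 0"):
    lba, blocks, length = decode_rw10(cdb)
    print(f"LBA {lba:#x}, {blocks} blocks, {length} bytes")
```

Comparing the decoded addresses across da2 and da5 gives one more data point on whether the affected commands cluster in one region of each disk. It does not by itself distinguish failing drives from a controller, cabling, or power problem.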