Drives failing or SAS disk shelf failing?

JaimieV · Feb 12, 2020

I'm currently running 21 disks across my pools, and occasionally a drive errors out and goes unavailable. I've been stacking them up in a pile as I replace them.

Couple of days ago I thought I'd check them. 18 failed disks over a year or so - quite a lot, but the disks weren't new when I got them, enterprise surplus.

Five fail their own firmware bootup self tests and spin themselves down within 20 seconds, toss those.
Three have SMART errors logged, bad blocks. Put those into a "bad" pile for tossing.
The other ten... zero SMART errors logged, pass the short and long tests, work fine on a Mac, and have just now survived a complete badblocks test run.

Why would they have been ejected in the first place? Here's a log snip from the most recent replacement, which is much like what I recall of all the ones I checked:

Code:

Jan 17 04:36:59 Sisyphus        (da6:mps0:0:36:0): READ(16). CDB: 88 00 00 00 00 01 62 3b 2c 10 00 00 00 80 00 00 length 65536 SMID 397 terminated ioc 804b loginfo 31120101 scsi 0 state c xfer 65536
Jan 17 04:36:59 Sisyphus        (da6:mps0:0:36:0): WRITE(10). CDB: 2a 00 12 4b a2 08 00 00 80 00 length 65536 SMID 1011 terminated ioc 804b loginfo 31120101 scsi 0 state c xfer 0
Jan 17 04:36:59 Sisyphus (da6:mps0:0:36:0): READ(16). CDB: 88 00 00 00 00 01 62 3b 2c 10 00 00 00 80 00 00
Jan 17 04:36:59 Sisyphus        (da6:mps0:0:36:0): READ(16). CDB: 88 00 00 00 00 01 62 3b 2d 10 00 00 00 80 00 00 length 65536 SMID 1097 terminated ioc 804b (da6:mps0:0:36:0): CAM status: CCB request completed with an error
Jan 17 04:36:59 Sisyphus loginfo 31120101 scsi 0 state c xfer 0
Jan 17 04:36:59 Sisyphus        (da6:mps0:0:36:0): READ(16). CDB: 88 00 00 00 00 01 62 3b 2c 90 00 00 00 80 00 00 length 65536 SMID 452 terminated ioc 804b loginfo 31120101 scsi 0 state c xfer 0
Jan 17 04:36:59 Sisyphus (da6:mps0:0:36:0): Retrying command
Jan 17 04:36:59 Sisyphus (da6:mps0:0:36:0): WRITE(10). CDB: 2a 00 12 4b a2 08 00 00 80 00
Jan 17 04:36:59 Sisyphus (da6:mps0:0:36:0): CAM status: CCB request completed with an error
Jan 17 04:36:59 Sisyphus (da6:mps0:0:36:0): Retrying command
Jan 17 04:36:59 Sisyphus (da6:mps0:0:36:0): READ(16). CDB: 88 00 00 00 00 01 62 3b 2d 10 00 00 00 80 00 00
Jan 17 04:36:59 Sisyphus (da6:mps0:0:36:0): CAM status: CCB request completed with an error
Jan 17 04:36:59 Sisyphus (da6:mps0:0:36:0): Retrying command
Jan 17 04:36:59 Sisyphus (da6:mps0:0:36:0): READ(16). CDB: 88 00 00 00 00 01 62 3b 2c 90 00 00 00 80 00 00
Jan 17 04:36:59 Sisyphus (da6:mps0:0:36:0): CAM status: CCB request completed with an error
Jan 17 04:36:59 Sisyphus (da6:mps0:0:36:0): Retrying command
Jan 17 04:36:59 Sisyphus (da6:mps0:0:36:0): READ(16). CDB: 88 00 00 00 00 01 62 3b 2c 10 00 00 00 80 00 00
Jan 17 04:36:59 Sisyphus (da6:mps0:0:36:0): CAM status: SCSI Status Error
Jan 17 04:36:59 Sisyphus (da6:mps0:0:36:0): SCSI status: Check Condition
Jan 17 04:36:59 Sisyphus (da6:mps0:0:36:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jan 17 04:36:59 Sisyphus (da6:mps0:0:36:0): Retrying command (per sense data)
Jan 17 04:37:00 Sisyphus (da6:mps0:0:36:0): READ(16). CDB: 88 00 00 00 00 01 62 3b 45 10 00 00 01 00 00 00
Jan 17 04:37:00 Sisyphus (da6:mps0:0:36:0): CAM status: SCSI Status Error
Jan 17 04:37:00 Sisyphus (da6:mps0:0:36:0): SCSI status: Check Condition
Jan 17 04:37:00 Sisyphus (da6:mps0:0:36:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jan 17 04:37:00 Sisyphus (da6:mps0:0:36:0): Retrying command (per sense data)

That happens intermittently, then eventually a long burst of them and

Code:

Jan 24 15:10:51 Sisyphus (da6:mps0:0:36:0): Error 6, Retries exhausted
Jan 24 15:10:51 Sisyphus (da6:mps0:0:36:0): Invalidating pack

and my FreeNAS emails me to say that the pool state is degraded as it's tossed da6 out.
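For anyone reading the log above, the CDB bytes can be decoded to see which blocks the failing commands were touching. This is a minimal sketch, not anything FreeNAS ships: `decode_cdb` is a hypothetical helper name, and the 512-byte block size is an assumption (it is consistent with the "length 65536" printed next to a 0x80-block transfer).

```python
def decode_cdb(hex_bytes: str, block_size: int = 512) -> dict:
    """Decode opcode, starting LBA and transfer length from a READ/WRITE CDB.

    block_size is an assumption; 512 matches the 'length' values in the log.
    """
    cdb = bytes.fromhex(hex_bytes)  # whitespace between byte pairs is ignored
    opcode = cdb[0]
    if opcode in (0x88, 0x8A):  # READ(16) / WRITE(16)
        lba = int.from_bytes(cdb[2:10], "big")      # 8-byte LBA
        blocks = int.from_bytes(cdb[10:14], "big")  # 4-byte transfer length
    elif opcode in (0x28, 0x2A):  # READ(10) / WRITE(10)
        lba = int.from_bytes(cdb[2:6], "big")       # 4-byte LBA
        blocks = int.from_bytes(cdb[7:9], "big")    # 2-byte transfer length
    else:
        raise ValueError(f"unsupported opcode 0x{opcode:02x}")
    return {
        "opcode": opcode,
        "lba": lba,
        "blocks": blocks,
        "bytes": blocks * block_size,
    }


# First READ(16) from the da6 log
print(decode_cdb("88 00 00 00 00 01 62 3b 2c 10 00 00 00 80 00 00"))
```

The reads in the burst sit at neighbouring LBAs, which looks like one sequential run of I/O being terminated together rather than scattered bad sectors, though that is an interpretation and not something the log states.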

Is it likely my disk shelf is the one with problems, not the disks? Dell Xyratex HB-1235, a 12 disk SAS box. My LSI 9207-4i4e is all set to default settings (fw 20.00.07.00). That single LSI card runs the internal 2xboot plus two HDDs on its internal socket, and the disk shelf on the external socket.

The 'failed' disks have all come from the shelf, but since there are only two HDDs in the server itself that's not statistically anomalous. The fails haven't all been from the same disk slot or anything so clear!

The shelf will show an error light on a sled when its HDD has terminally failed, like those five that spin down, but not for the SMART errors or the soft fails most of these disks have shown. This makes me think the problem is being generated up at the server end rather than by the shelf, but I'm not very experienced in troubleshooting these systems - it's my first shelf.

How would I check the health of the shelf? I don't have another one. I'm using both PSUs.

sretalla · Feb 12, 2020

The first thing to do in answering your question would be to refer to the SMART data for the drives.

If you have an option to connect some of the "failed" drives to another system and run smartctl -a on them you can confirm if there are drive-level errors for reading or writing, etc.

If they turn up clean in SMART (with recent test dates), then you're looking at HBA or backplane/cabling as the next port of call.

OK, so I really need to start reading OPs in full before I answer... I see you looked at SMART already, just didn't share the output.

JaimieV · Feb 12, 2020

:) Yeah, no interesting output from all the 'clean' drives, just the usual FreeNAS scheduled daily short and weekly long test successes.

The only cable involved is the external SFF-8088 between the LSI and the shelf - given that all traffic goes through that, it can't be *too* dodgy... and, more seriously, a bad cable would affect more than one drive at a time.

I'm hoping it's not the backplane, as it appears to be impossible to get into the Xyratex shelf - it's all riveted together. That it's sporadic, follows one drive at a time until failure, doesn't concentrate on any one slot, then goes quiet for a while... everything points to HDDs dying. But I've never seen an HDD get magically better before!
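One way to test the "one drive at a time, no particular slot" impression is to tally the CAM errors in the system log by device and target. A rough sketch, assuming log lines shaped like the ones posted in this thread; `tally_errors` is a hypothetical helper, and whether the target number maps to a fixed shelf slot is an assumption that would need checking against the shelf.

```python
import re
from collections import Counter

# CAM path prefix, e.g. "(da6:mps0:0:36:0):" = device:controller:bus:target:lun
PATH_RE = re.compile(r"\((\w+):(\w+):(\d+):(\d+):(\d+)\):")


def tally_errors(lines):
    """Count 'CAM status:' lines per (device, target) pair."""
    counts = Counter()
    for line in lines:
        if "CAM status:" not in line:
            continue  # skip retries, CDB dumps and unrelated messages
        match = PATH_RE.search(line)
        if match:
            device, target = match.group(1), int(match.group(4))
            counts[(device, target)] += 1
    return counts


sample = [
    "Jan 17 04:36:59 Sisyphus (da6:mps0:0:36:0): CAM status: CCB request completed with an error",
    "Jan 17 04:36:59 Sisyphus (da6:mps0:0:36:0): Retrying command",
    "Jan 17 04:36:59 Sisyphus (da6:mps0:0:36:0): CAM status: SCSI Status Error",
]
for (device, target), n in tally_errors(sample).items():
    print(device, target, n)
```

Run over a full `/var/log/messages` (path assumed), errors spread across many targets at the same timestamps would point away from individual disks, while errors that stay on one target at a time would fit the pattern described above.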

JaimieV · Feb 12, 2020

Just thought to check on my badblocks burnin machine - it's the SisyphusBackup hardware in sig, I pulled the backup pool out of it temporarily for this so all HDDs being tested are internal (although also hung off an LSI). That has logged a few CAM errors, just these from the whole process:

Code:

Feb 11 21:52:59 SisyphusBackup  (pass4:mps0:0:13:0): LOG SENSE. CDB: 4d 00 0d 00 00 00 00 00 40 00 length 64 SMID 475 Aborting command 0xfffffe0001017f70
Feb 11 21:52:59 SisyphusBackup mps0: Sending reset from mpssas_send_abort for target ID 13
Feb 11 21:53:00 SisyphusBackup  (da4:mps0:0:13:0): WRITE(16). CDB: 8a 00 00 00 00 01 ae 0c fe 00 00 00 01 00 00 00 length 131072 SMID 487 terminated ioc 804b loginfo 31140000 scsi 0 state c xfer 0
Feb 11 21:53:00 SisyphusBackup mps0: Unfreezing devq for target ID 13
Feb 11 21:53:00 SisyphusBackup (da4:mps0:0:13:0): WRITE(16). CDB: 8a 00 00 00 00 01 ae 0c fe 00 00 00 01 00 00 00
Feb 11 21:53:00 SisyphusBackup (da4:mps0:0:13:0): CAM status: CCB request completed with an error
Feb 11 21:53:00 SisyphusBackup (da4:mps0:0:13:0): Retrying command
Feb 11 21:53:01 SisyphusBackup (da4:mps0:0:13:0): WRITE(16). CDB: 8a 00 00 00 00 01 ae 0c fe 00 00 00 01 00 00 00
Feb 11 21:53:01 SisyphusBackup (da4:mps0:0:13:0): CAM status: SCSI Status Error
Feb 11 21:53:01 SisyphusBackup (da4:mps0:0:13:0): SCSI status: Check Condition
Feb 11 21:53:01 SisyphusBackup (da4:mps0:0:13:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Feb 11 21:53:01 SisyphusBackup (da4:mps0:0:13:0): Retrying command (per sense data)
Feb 11 21:53:01 SisyphusBackup (da4:mps0:0:13:0): WRITE(16). CDB: 8a 00 00 00 00 01 ae 0d 5a 00 00 00 01 00 00 00
Feb 11 21:53:01 SisyphusBackup (da4:mps0:0:13:0): CAM status: SCSI Status Error
Feb 11 21:53:01 SisyphusBackup (da4:mps0:0:13:0): SCSI status: Check Condition
Feb 11 21:53:01 SisyphusBackup (da4:mps0:0:13:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Feb 11 21:53:01 SisyphusBackup (da4:mps0:0:13:0): Retrying command (per sense data)

so those messages aren't shelf specific - ain't using one here. da4 has not logged any issues against badblocks or SMART and a SMART short selftest is clean.

Are they general BSD kernel errors, or LSI driver specific, or other?
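On the question of where the messages come from: the "CAM status" lines are FreeBSD's CAM layer talking, while the "terminated ioc ... loginfo ..." lines are the mps driver relaying a status word from the HBA firmware. The loginfo word can be split into fields. A minimal sketch, assuming the bit layout the Linux mpt3sas driver uses for SAS loginfo values (bus type, originator, code, subcode); `decode_loginfo` is a hypothetical name and no meaning is attached to the individual codes here.

```python
# Originator names as used by the Linux mpt3sas driver (assumed to apply here):
# IOP = I/O processor, PL = protocol layer, IR = integrated RAID
ORIGINATORS = {0: "IOP", 1: "PL", 2: "IR"}


def decode_loginfo(value: int) -> dict:
    """Split a 32-bit loginfo word into its four bit fields."""
    return {
        "bus_type": (value >> 28) & 0xF,   # 3 is SAS in this layout
        "originator": ORIGINATORS.get((value >> 24) & 0xF, "unknown"),
        "code": (value >> 16) & 0xFF,
        "subcode": value & 0xFFFF,
    }


# The two loginfo values seen in this thread
for word in (0x31120101, 0x31140000):
    print(hex(word), decode_loginfo(word))
```

Both values decode with the same bus type and originator, differing only in code and subcode, which is at least consistent with the shelf and the internal machine reporting the same class of event.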

sretalla · Feb 12, 2020

CAM is a part of FreeBSD.

Chapter 12. Common Access Method SCSI Controllers

www.freebsd.org

JaimieV · Feb 12, 2020

Thanks. Not indicative either way then, I guess.

JaimieV · Feb 15, 2020

Welp, just ran them through another couple of badblocks patterns and SMART long tests, they're still showing as good. I'll mark them and put them back into the "replacements" pile.

Still interested if anyone knows why they were ejected in the first place. IIRC there was text for at least some of them noting that they had been "removed by administrator" and it wasn't me, so it might have been a FreeNAS script hitting a level of intolerance rather than ZFS itself that did it?
