I'm currently running pools across 21 disks, and occasionally a drive errors out and goes unavailable. I've been stacking the pulled drives up in a pile as I replace them.
A couple of days ago I thought I'd check them: 18 failed disks over a year or so. That's quite a lot, but the disks weren't new when I got them - enterprise surplus.
Five fail their own firmware boot-up self-tests and spin themselves down within 20 seconds; toss those.
Three have SMART errors logged - bad blocks. Those went into a "bad" pile for tossing.
The other ten... zero SMART errors logged, they pass the short and long self-tests, work fine on a Mac, and have just survived a complete badblocks run.
Why would they have been ejected in the first place? Here's a log snippet from the most recent replacement, much like what I recall of all the ones I checked:
Code:
Jan 17 04:36:59 Sisyphus (da6:mps0:0:36:0): READ(16). CDB: 88 00 00 00 00 01 62 3b 2c 10 00 00 00 80 00 00 length 65536 SMID 397 terminated ioc 804b loginfo 31120101 scsi 0 state c xfer 65536
Jan 17 04:36:59 Sisyphus (da6:mps0:0:36:0): WRITE(10). CDB: 2a 00 12 4b a2 08 00 00 80 00 length 65536 SMID 1011 terminated ioc 804b loginfo 31120101 scsi 0 state c xfer 0
Jan 17 04:36:59 Sisyphus (da6:mps0:0:36:0): READ(16). CDB: 88 00 00 00 00 01 62 3b 2c 10 00 00 00 80 00 00
Jan 17 04:36:59 Sisyphus (da6:mps0:0:36:0): READ(16). CDB: 88 00 00 00 00 01 62 3b 2d 10 00 00 00 80 00 00 length 65536 SMID 1097 terminated ioc 804b
(da6:mps0:0:36:0): CAM status: CCB request completed with an error
Jan 17 04:36:59 Sisyphus loginfo 31120101 scsi 0 state c xfer 0
Jan 17 04:36:59 Sisyphus (da6:mps0:0:36:0): READ(16). CDB: 88 00 00 00 00 01 62 3b 2c 90 00 00 00 80 00 00 length 65536 SMID 452 terminated ioc 804b loginfo 31120101 scsi 0 state c xfer 0
Jan 17 04:36:59 Sisyphus (da6:mps0:0:36:0): Retrying command
Jan 17 04:36:59 Sisyphus (da6:mps0:0:36:0): WRITE(10). CDB: 2a 00 12 4b a2 08 00 00 80 00
Jan 17 04:36:59 Sisyphus (da6:mps0:0:36:0): CAM status: CCB request completed with an error
Jan 17 04:36:59 Sisyphus (da6:mps0:0:36:0): Retrying command
Jan 17 04:36:59 Sisyphus (da6:mps0:0:36:0): READ(16). CDB: 88 00 00 00 00 01 62 3b 2d 10 00 00 00 80 00 00
Jan 17 04:36:59 Sisyphus (da6:mps0:0:36:0): CAM status: CCB request completed with an error
Jan 17 04:36:59 Sisyphus (da6:mps0:0:36:0): Retrying command
Jan 17 04:36:59 Sisyphus (da6:mps0:0:36:0): READ(16). CDB: 88 00 00 00 00 01 62 3b 2c 90 00 00 00 80 00 00
Jan 17 04:36:59 Sisyphus (da6:mps0:0:36:0): CAM status: CCB request completed with an error
Jan 17 04:36:59 Sisyphus (da6:mps0:0:36:0): Retrying command
Jan 17 04:36:59 Sisyphus (da6:mps0:0:36:0): READ(16). CDB: 88 00 00 00 00 01 62 3b 2c 10 00 00 00 80 00 00
Jan 17 04:36:59 Sisyphus (da6:mps0:0:36:0): CAM status: SCSI Status Error
Jan 17 04:36:59 Sisyphus (da6:mps0:0:36:0): SCSI status: Check Condition
Jan 17 04:36:59 Sisyphus (da6:mps0:0:36:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jan 17 04:36:59 Sisyphus (da6:mps0:0:36:0): Retrying command (per sense data)
Jan 17 04:37:00 Sisyphus (da6:mps0:0:36:0): READ(16). CDB: 88 00 00 00 00 01 62 3b 45 10 00 00 01 00 00 00
Jan 17 04:37:00 Sisyphus (da6:mps0:0:36:0): CAM status: SCSI Status Error
Jan 17 04:37:00 Sisyphus (da6:mps0:0:36:0): SCSI status: Check Condition
Jan 17 04:37:00 Sisyphus (da6:mps0:0:36:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jan 17 04:37:00 Sisyphus (da6:mps0:0:36:0): Retrying command (per sense data)
That happens intermittently, then eventually a long burst of them and
Code:
Jan 24 15:10:51 Sisyphus (da6:mps0:0:36:0): Error 6, Retries exhausted
Jan 24 15:10:51 Sisyphus (da6:mps0:0:36:0): Invalidating pack
and my FreeNAS emails me to say the pool state is DEGRADED, as it has tossed da6 out.
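The repeated `loginfo 31120101` in those lines can be split into fields. This is a sketch using the field layout the Linux mpt2sas/mpt3sas drivers apply when printing these words; that the FreeBSD mps driver reports the same raw layout is my assumption, as is the field naming:

```python
# Decode an LSI/Avago "loginfo" word into its fields, following the
# bit layout used by the Linux mpt2sas/mpt3sas drivers (assumption:
# the FreeBSD mps driver's word uses the same layout).

ORIGINATORS = {0x0: "IOP", 0x1: "PL (protocol layer)", 0x2: "IR (RAID)"}

def decode_loginfo(loginfo: int) -> dict:
    """Split a 32-bit loginfo value into type, originator, code and sub-code."""
    return {
        "type": (loginfo & 0xF0000000) >> 28,  # 0x3 = SAS
        "originator": ORIGINATORS.get((loginfo & 0x0F000000) >> 24, "unknown"),
        "code": (loginfo & 0x00FF0000) >> 16,
        "subcode": loginfo & 0x0000FFFF,
    }

print(decode_loginfo(0x31120101))
```

If that layout holds, the originator here is PL, the SAS protocol layer - i.e. the link conversation rather than the drive's media - which is commonly read as pointing at cabling, expander, or backplane trouble rather than bad platters.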
Is it likely my disk shelf is the one with problems, rather than the disks? It's a Dell/Xyratex HB-1235, a 12-bay SAS enclosure. My LSI 9207-4i4e is at default settings (firmware 20.00.07.00). That single card runs the two boot devices plus two HDDs on its internal port, and the disk shelf on the external port.
The 'failed' disks have all come from the shelf, but since only two HDDs live in the server itself, that's not statistically anomalous. The failures haven't all been in the same slot or anything so clear-cut!
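As a back-of-envelope check on that: with 12 disks in the shelf and 2 in the server, and assuming (my assumption) that every disk is equally likely to fail, the chance that all 18 failures land in the shelf is (12/14)^18, about 6% - suggestive but hardly damning:

```python
# Probability that all n failures come from the shelf, assuming each
# failure independently hits any of the 14 disks with equal probability.
shelf_disks = 12
server_disks = 2
failures = 18

p_shelf = shelf_disks / (shelf_disks + server_disks)
p_all_from_shelf = p_shelf ** failures
print(f"{p_all_from_shelf:.3f}")  # → 0.062
```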
The shelf will light an error LED on a sled when its HDD has terminally failed, like the five that spin down, but it shows nothing for the SMART errors or the soft failures most of these disks produced. That makes me think the problem is being generated at the server end rather than in the shelf, but I'm not much experienced at troubleshooting these systems - it's my first shelf.
How would I check the health of the shelf? I don't have another one to swap against, and I'm already using both PSUs.
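One non-invasive check that doesn't need a second shelf: SAS drives keep per-PHY link error counters (invalid DWORDs, running disparity errors, loss of DWORD sync) that `smartctl -x` prints for each port. If those counters climb on many disks at once, the shared plumbing (cable, expander, backplane) is suspect; counters pinned at zero while ejections continue points back at the HBA end. A sketch of pulling them out of smartctl output - the sample text below is illustrative, and the exact wording can vary between smartctl versions:

```python
import re

# Illustrative sample of the SAS phy error counter section that
# smartctl -x prints for a SAS drive (assumption: wording varies
# slightly across smartctl versions).
SAMPLE = """\
relative target port id = 1
  Invalid DWORD count = 1024
  Running disparity error count = 980
  Loss of DWORD synchronization = 12
  Phy reset problem = 0
"""

def phy_counters(text: str) -> dict:
    """Extract indented 'name = value' link error counters from smartctl -x text."""
    return {
        name.strip(): int(value)
        for name, value in re.findall(r"^[ \t]+([A-Za-z ]+?)\s*=\s*(\d+)\s*$", text, re.M)
    }

print(phy_counters(SAMPLE))
```

Run against every da device on the shelf and compare: link errors spread across most slots would implicate the shelf or cabling rather than the individual drives.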