BlueMagician (Explorer, joined Apr 24, 2015, 56 messages)
Dear all,
I've posted a thread regarding this before, but that was towards the start of the year, so I figured I'd create a new one.
About 75% of the time, during (or just after) a scrub, I see the following errors:
Code:
[root@freenas] ~# zpool status
  pool: Chamber1
 state: ONLINE
  scan: scrub in progress since Sat Dec 23 01:00:01 2017
        15.6T scanned out of 21.9T at 504M/s, 3h37m to go
        32K repaired, 71.29% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        Chamber1                                        ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/7c55507f-006a-11e5-9af6-001e67aa46b9  ONLINE       0     0     0
            gptid/01b99fdb-bf4c-11e7-8ea3-001e67aa46b9  ONLINE       0     0     0
            gptid/7dd4abd6-006a-11e5-9af6-001e67aa46b9  ONLINE       0     0     0
            gptid/7e95631f-006a-11e5-9af6-001e67aa46b9  ONLINE       0     0     0
            gptid/7f54f268-006a-11e5-9af6-001e67aa46b9  ONLINE       0     0     0
            gptid/80137822-006a-11e5-9af6-001e67aa46b9  ONLINE       0     0     0  (repairing)

errors: No known data errors
And via email or in dmesg I see:
Code:
(da4:mps0:0:5:0): READ(16). CDB: 88 00 00 00 00 01 99 37 26 d0 00 00 00 40 00 00 length 32768 SMID 772 terminated ioc 804b scsi 0 state 0 xfer 0
(da4:mps0:0:5:0): READ(16). CDB: 88 00 00 00 00 01 99 37 26 d0 00 00 00 40 00 00
(da4:mps0:0:5:0): CAM status: CCB request completed with an error
(da4:mps0:0:5:0): Retrying command
(da4:mps0:0:5:0): READ(16). CDB: 88 00 00 00 00 01 99 37 27 50 00 00 00 40 00 00
(da4:mps0:0:5:0): CAM status: SCSI Status Error
(da4:mps0:0:5:0): SCSI status: Check Condition
(da4:mps0:0:5:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da4:mps0:0:5:0): Info: 0x199372750
(da4:mps0:0:5:0): Error 5, Unretryable error
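For anyone wanting to read the CDB bytes in that output, here is a minimal Python sketch that decodes a READ(16) command into its starting LBA and transfer length. The function name `decode_read16` is made up for illustration, and the 512-byte logical block size is an assumption (it is at least consistent with the logged `length 32768` for a 64-block read).

```python
def decode_read16(cdb_hex, block_size=512):
    """Decode a READ(16) CDB given as space-separated hex bytes.

    block_size is an assumption (512-byte logical blocks).
    Returns (starting LBA, number of blocks, number of bytes).
    """
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a 16-byte READ(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")      # bytes 2-9: logical block address
    blocks = int.from_bytes(cdb[10:14], "big")  # bytes 10-13: transfer length in blocks
    return lba, blocks, blocks * block_size


# The two CDBs from the dmesg output above.
first = decode_read16("88 00 00 00 00 01 99 37 26 d0 00 00 00 40 00 00")
second = decode_read16("88 00 00 00 00 01 99 37 27 50 00 00 00 40 00 00")

print(hex(first[0]), first[1], first[2])
print(hex(second[0]), second[1], second[2])
```

Decoded this way, the second READ starts at LBA 0x199372750, which is the same address the drive reports in the `Info:` line of the MEDIUM ERROR sense data.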
Over the months that this has been happening, the device number/partition ID named in the error has changed occasionally - as has the amount of data repaired, ranging from 64K to 128K etc.
The error does not seem to be connected to a particular physical drive, OR particular HBA port, OR even a common PSU power branch.
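One way to check that impression against saved logs, rather than memory, would be to tally sense errors per CAM path. This is only a sketch: the helper name `tally_sense_errors` and the regular expression are illustrative, and it assumes log lines shaped like the dmesg output above, i.e. `(device:controller:bus:target:lun): SCSI sense: ...`. Note that `daN` numbers can change between reboots, so counts are only comparable within one boot unless they are mapped to drive serial numbers.

```python
import re
from collections import Counter

# Matches e.g. "(da4:mps0:0:5:0): SCSI sense: MEDIUM ERROR asc:11,0 (...)"
SENSE_RE = re.compile(r"\((da\d+):(mps\d+):\d+:(\d+):\d+\): SCSI sense: ")


def tally_sense_errors(log_text):
    """Count 'SCSI sense' lines per (device, controller + target) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        match = SENSE_RE.search(line)
        if match:
            device, controller, target = match.groups()
            counts[(device, f"{controller} target {target}")] += 1
    return counts


# Sample lines copied from the dmesg output in this post.
sample = """\
(da4:mps0:0:5:0): CAM status: SCSI Status Error
(da4:mps0:0:5:0): SCSI status: Check Condition
(da4:mps0:0:5:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da4:mps0:0:5:0): Error 5, Unretryable error
"""

for key, count in tally_sense_errors(sample).items():
    print(key, count)
```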
At one point, I thought the error was being caused by a particular drive - it was the latest model revision and I had seen another post from someone having very similar issues with that exact revision of drive. I replaced that drive on warranty. I got one good clean scrub, but on subsequent scrubs soon after, the error returned (on the new drive, on the same port).
So I replaced the SAS data cables, but the error continued.
So then I started to suspect the physical port was faulty, so I changed my HBA port arrangement so that particular port was avoided. I got one clean scrub, thought the issue was resolved, but then it returned...
So I went hardcore -- replaced the HBA entirely with a new card (identical model) - this time flashed to the latest 20.x.07 IT-mode firmware (the previous was 20.x.04 IT mode).
Today, the first scrub to complete using this latest arrangement has thrown up the error again - but on a completely different port, relating to a completely different drive - and I'm starting to lose the plot.
With the lack of any supporting bad SMART data, am I just seeing random and infrequent drive-based UREs that are being dealt with by ZFS, and that are 'nothing to worry about'? Or is there something strange going on here that I've completely overlooked..?
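As a rough sanity check on the "random UREs" theory, a back-of-envelope estimate can compare the amount of data a scrub reads with a drive's quoted unrecoverable-read-error rate. The rates below (one error per 10^14 or 10^15 bits read) are assumed typical datasheet figures, not the spec of the drives in this system, and the 21.9T from `zpool status` is assumed to be in TiB.

```python
TIB = 2 ** 40

# Data read by one full scrub, taken from the zpool status output (assumed TiB).
scrub_bytes = 21.9 * TIB
scrub_bits = scrub_bytes * 8


def expected_ures(bits_read, bits_per_error):
    """Expected number of UREs if errors occurred exactly at the quoted rate."""
    return bits_read / bits_per_error


# Assumed datasheet-style rates: 1 error per 1e14 bits and per 1e15 bits.
for bits_per_error in (1e14, 1e15):
    estimate = expected_ures(scrub_bits, bits_per_error)
    print(f"1 per {bits_per_error:.0e} bits: ~{estimate:.2f} expected UREs per scrub")
```

Under those assumptions a full scrub reads roughly 1.9 x 10^14 bits, so about two UREs per scrub at the 10^14 rate, or about one every five scrubs at the 10^15 rate. Datasheet figures are usually worst-case bounds, so this only gives a sense of scale.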
My system spec is in my signature.
Apologies for the wall of text - I hope it makes sense - this has been going on for so long (on and off) that I've only got random notes to work from to paint a picture...
Thanks for any thoughts and advice,
S.