Advice - ZFS Clear or Replace?

TooMuchData · May 27, 2016

I am currently running my pool (6 x 6TB WD Reds in raidz2) in an external box, connected via an LSI 9200-8e (P20 firmware) to a Lenovo TS140 running the latest 9.3.1.

Yesterday I received a Critical Alert email stating "The volume zVol (ZFS) state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state."

The system log showed:

(Sorry, the first da0 line is missing; it was similar to the first da1 line.)
May 26 20:18:37 freenas (da0:mps0:0:8:0): CAM status: SCSI Status Error
May 26 20:18:37 freenas (da0:mps0:0:8:0): SCSI status: Check Condition
May 26 20:18:37 freenas (da0:mps0:0:8:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
May 26 20:18:37 freenas (da0:mps0:0:8:0): Info: 0x28fbb3f58
May 26 20:18:37 freenas (da0:mps0:0:8:0): Error 22, Unretryable error
May 26 22:19:07 freenas (da1:mps0:0:9:0): WRITE(16). CDB: 8a 00 00 00 00 02 8b 6f e3 c8 00 00 00 08 00 00
May 26 22:19:07 freenas (da1:mps0:0:9:0): CAM status: SCSI Status Error
May 26 22:19:07 freenas (da1:mps0:0:9:0): SCSI status: Check Condition
May 26 22:19:07 freenas (da1:mps0:0:9:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
May 26 22:19:07 freenas (da1:mps0:0:9:0): Info: 0x28b6fe3c8
May 26 22:19:07 freenas (da1:mps0:0:9:0): Error 22, Unretryable error
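For readers who want to check what the da1 command was actually asking for, here is a rough sketch that decodes the WRITE(16) CDB from the log. The byte layout (opcode, 8-byte big-endian LBA in bytes 2-9, 4-byte transfer length in bytes 10-13) is the standard 16-byte READ/WRITE layout. The sector count is NOT from this thread: ASSUMED_SECTORS is a placeholder typical of a 6 TB drive with 512-byte logical sectors, so substitute the real figure for your disk (for example, the User Capacity reported by smartctl divided by the logical sector size).

```python
# Bytes copied verbatim from the da1 WRITE(16) log line.
CDB_HEX = "8a 00 00 00 00 02 8b 6f e3 c8 00 00 00 08 00 00"

# ASSUMPTION: placeholder sector count for a 6 TB drive with 512-byte
# logical sectors. This value does not appear in the thread.
ASSUMED_SECTORS = 11_721_045_168


def decode_rw16(cdb_hex):
    """Return (lba, block_count) from a 16-byte READ(16)/WRITE(16) CDB."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] not in (0x88, 0x8A):
        raise ValueError("not a READ(16)/WRITE(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")      # bytes 2-9: logical block address
    blocks = int.from_bytes(cdb[10:14], "big")  # bytes 10-13: transfer length
    return lba, blocks


def in_range(lba, blocks, total_sectors):
    """True if the whole transfer ends at or before the last sector."""
    return lba + blocks <= total_sectors


lba, blocks = decode_rw16(CDB_HEX)
print(f"LBA {lba:#x} ({lba}), {blocks} blocks")
print("inside assumed capacity:", in_range(lba, blocks, ASSUMED_SECTORS))
```

The decoded LBA matches the "Info:" value the driver logged for da1. If the request turns out to lie inside the disk's real capacity, an "out of range" complaint would be at least consistent with a problem somewhere other than the platters, but that is only a hint, not a diagnosis.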

A zpool report stated:

  pool: zVol
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 8.11M in 0h0m with 0 errors on Thu May 26 22:19:15 2016
config:

        NAME                                            STATE     READ WRITE CKSUM
        zVol                                            ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/831ef1ef-a343-11e4-adaf-d050995044d1  ONLINE       0     0     0
            gptid/83ec38c1-a343-11e4-adaf-d050995044d1  ONLINE       0     0     0
            gptid/84b80ccd-a343-11e4-adaf-d050995044d1  ONLINE       0     0     0
            gptid/85856a70-a343-11e4-adaf-d050995044d1  ONLINE       0     1     0
            gptid/86542f6a-a343-11e4-adaf-d050995044d1  ONLINE       0     0     0
            gptid/8726b5b1-a343-11e4-adaf-d050995044d1  ONLINE       0     0     0

errors: No known data errors
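In a long config listing the nonzero counters are easy to miss, so here is a minimal sketch that picks them out of the report text. It assumes the five-column NAME / STATE / READ / WRITE / CKSUM layout shown above and plain integer counters; abbreviated counts such as "1.2K" are simply skipped. The function name is made up for illustration.

```python
# Config rows copied from the report above (indentation does not matter here).
STATUS_TEXT = """\
NAME STATE READ WRITE CKSUM
zVol ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
gptid/831ef1ef-a343-11e4-adaf-d050995044d1 ONLINE 0 0 0
gptid/83ec38c1-a343-11e4-adaf-d050995044d1 ONLINE 0 0 0
gptid/84b80ccd-a343-11e4-adaf-d050995044d1 ONLINE 0 0 0
gptid/85856a70-a343-11e4-adaf-d050995044d1 ONLINE 0 1 0
gptid/86542f6a-a343-11e4-adaf-d050995044d1 ONLINE 0 0 0
gptid/8726b5b1-a343-11e4-adaf-d050995044d1 ONLINE 0 1 0
"""


def devices_with_errors(status_text):
    """Return (name, read, write, cksum) for rows with any nonzero counter."""
    flagged = []
    for line in status_text.splitlines():
        fields = line.split()
        if len(fields) != 5 or fields[0] == "NAME":
            continue  # not a config row, or the header row
        name, _state, *counts = fields
        if not all(c.isdigit() for c in counts):
            continue  # skip abbreviated counters like "1.2K"
        read, write, cksum = (int(c) for c in counts)
        if read or write or cksum:
            flagged.append((name, read, write, cksum))
    return flagged


for name, read, write, cksum in devices_with_errors(STATUS_TEXT):
    print(f"{name}: read={read} write={write} cksum={cksum}")
```

On the report above this flags the two gptid entries that each show a single write error.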

The six WD Reds get weekly short SMART tests and bi-weekly long tests. None of the drives has logged an error, and all tests have completed satisfactorily. The drives are about 20 months old and have been in continuous service. I have one spare drive available.

Given that two disks logged the same error within about two hours of each other, I suspect a common component (the 9200-8e or the cabling) rather than the disks themselves. So, I'm inclined to do "zpool clear", but would appreciate any comments or suggestions before I do so.
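The common-component idea can be checked mechanically by grouping the logged errors by controller. A minimal sketch, assuming the "(device:controller:bus:target:lun)" prefix format seen in the log lines above; the function name is hypothetical and the two sample lines are copied from the log.

```python
import re
from collections import defaultdict

# Two of the sense lines from the system log above.
LOG_LINES = [
    "May 26 20:18:37 freenas (da0:mps0:0:8:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)",
    "May 26 22:19:07 freenas (da1:mps0:0:9:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)",
]

# Matches the CAM path prefix, e.g. "(da1:mps0:0:9:0)".
PATH = re.compile(r"\((?P<dev>\w+):(?P<ctrl>\w+):\d+:\d+:\d+\)")


def devices_by_controller(lines):
    """Map each controller name to the sorted list of devices that logged errors."""
    groups = defaultdict(set)
    for line in lines:
        match = PATH.search(line)
        if match:
            groups[match["ctrl"]].add(match["dev"])
    return {ctrl: sorted(devs) for ctrl, devs in groups.items()}


print(devices_by_controller(LOG_LINES))
```

Both affected disks sit behind the same mps0 controller. Since every disk in the pool shares that HBA, this does not prove the HBA or cabling is at fault, but it does not contradict the suspicion either.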

Thanks for your help.

Sakuru · May 27, 2016

Please post the output of "smartctl -a /dev/XXX" for all of your drives in code tags.
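If typing that command for each disk is tedious, something like the following could gather all the reports in one go. This is only a sketch: the device names da0 through da5 are an assumption (the thread confirms only da0 and da1), and the helper names are made up for illustration.

```python
import shutil
import subprocess

# ASSUMPTION: six disks at /dev/da0 .. /dev/da5. Adjust to match your system.
DEVICES = [f"/dev/da{i}" for i in range(6)]


def smartctl_command(device):
    """Build the argument list for 'smartctl -a <device>'."""
    return ["smartctl", "-a", device]


def collect_reports(devices):
    """Run smartctl on each device and return {device: report text}."""
    if shutil.which("smartctl") is None:
        raise RuntimeError("smartctl not found on PATH")
    reports = {}
    for device in devices:
        # smartctl uses a nonzero exit status as a bitmask of findings,
        # so the return code is deliberately not treated as a failure.
        result = subprocess.run(
            smartctl_command(device), capture_output=True, text=True
        )
        reports[device] = result.stdout
    return reports


if __name__ == "__main__":
    for device, report in collect_reports(DEVICES).items():
        print(f"===== {device} =====")
        print(report)
```

Run as root on the NAS itself, the combined output can be pasted into a single pair of code tags.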

TooMuchData · May 27, 2016

How about as text files?

Sakuru · May 27, 2016

Eh, that works, but I'd prefer code tags (hit the button to the left of the save button in the toolbar) or something like pastebin. That way I don't have to download all those files :)

Sakuru · May 27, 2016

Hmm, your SMART stats look great.
Are you using any of the Marvell ports on that board?

TooMuchData · May 27, 2016

Thank you, Sakuru.
Please see the initial post. No Marvell ports are involved in this error.

Sakuru · May 27, 2016

Ah, so this isn't the one in your signature.
What is the "external box" that you refer to?

Mirfster · May 27, 2016

Check the LSI downloads. They released another update (04/04/2016, if I recall), version 20.00.07.00. It isn't really required for everyone, but it does include a fix for those seeing a particular issue. I don't recall what the issue was off the top of my head, but it might be worth checking out.

TooMuchData · May 27, 2016

Thank you, Mirfster and Sakuru.
The LSI has 20.00.07.00.
The external box is another Node 304 with an SSR-450RM PSU for the disks and an SFF-8088 external connector that fans out to eight internal SAS/SATA cables, six of which are connected to the disks. A CSE-PTJBOD-CB1 ATX board handles power on/off control.

Any reason why I should not do zpool clear?

My replacement C2750D4I arrived today, so I plan to put all the disks back into my other Node 304 in a few days.

Mirfster · May 28, 2016

TooMuchData said:
Any reason why I should not do zpool clear?

I personally would perform a system reboot first and see if things are cleared up. I am not one to just clear messages ;)

cyberjock · May 28, 2016

A reboot will clear the errors since the numbers are reset on an unmount. ;P

Mirfster · May 28, 2016

Ahh, it is good to be reassured by a more knowledgeable member. ;)
