Advice - ZFS Clear or Replace?

Status
Not open for further replies.

TooMuchData

Contributor
Joined
Jan 4, 2015
Messages
188
I am currently running my pool (6 x 6TB WD Reds in raidz2) in an external box via an LSI 9200-8e (P20) in a Lenovo TS140 (latest 9.3.1).

Yesterday I received a Critical Alert email stating "The volume zVol (ZFS) state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state."

The system log showed:
(Sorry, one line is missing - similar to the first da1 line)
May 26 20:18:37 freenas (da0:mps0:0:8:0): CAM status: SCSI Status Error
May 26 20:18:37 freenas (da0:mps0:0:8:0): SCSI status: Check Condition
May 26 20:18:37 freenas (da0:mps0:0:8:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
May 26 20:18:37 freenas (da0:mps0:0:8:0): Info: 0x28fbb3f58
May 26 20:18:37 freenas (da0:mps0:0:8:0): Error 22, Unretryable error
May 26 22:19:07 freenas (da1:mps0:0:9:0): WRITE(16). CDB: 8a 00 00 00 00 02 8b 6f e3 c8 00 00 00 08 00 00
May 26 22:19:07 freenas (da1:mps0:0:9:0): CAM status: SCSI Status Error
May 26 22:19:07 freenas (da1:mps0:0:9:0): SCSI status: Check Condition
May 26 22:19:07 freenas (da1:mps0:0:9:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
May 26 22:19:07 freenas (da1:mps0:0:9:0): Info: 0x28b6fe3c8
May 26 22:19:07 freenas (da1:mps0:0:9:0): Error 22, Unretryable error
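As a rough cross-check of the da1 entry, here is a minimal sketch that decodes the logged CDB. In a 16-byte WRITE(16) command, bytes 2-9 carry the starting LBA and bytes 10-13 the transfer length, both big-endian. The capacity figure is an assumption (a nominal 6 TB drive with 512-byte logical sectors), not a value read from these disks.

```python
# Decode the WRITE(16) CDB copied from the da1 log line above.
CDB_HEX = "8a 00 00 00 00 02 8b 6f e3 c8 00 00 00 08 00 00"
INFO_LBA = 0x28b6fe3c8  # "Info:" field from the same log entry

# ASSUMPTION: nominal 6 TB capacity and 512-byte logical sectors.
ASSUMED_SECTORS = 6_000_000_000_000 // 512


def decode_write16(cdb_hex):
    """Return (starting LBA, block count) from a WRITE(16) CDB hex string."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x8A:
        raise ValueError("not a 16-byte WRITE(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")      # bytes 2-9: starting LBA
    blocks = int.from_bytes(cdb[10:14], "big")  # bytes 10-13: transfer length
    return lba, blocks


lba, blocks = decode_write16(CDB_HEX)
print(hex(lba), blocks)
print("matches Info field:", lba == INFO_LBA)
print("inside assumed capacity:", lba + blocks <= ASSUMED_SECTORS)
```

Under those assumptions the requested LBA equals the Info value and lies inside the drive's address range, so the "out of range" sense data does not look like a genuinely out-of-bounds write request.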

A zpool status report stated:
  pool: zVol
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 8.11M in 0h0m with 0 errors on Thu May 26 22:19:15 2016
config:

        NAME                                            STATE     READ WRITE CKSUM
        zVol                                            ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/831ef1ef-a343-11e4-adaf-d050995044d1  ONLINE       0     0     0
            gptid/83ec38c1-a343-11e4-adaf-d050995044d1  ONLINE       0     0     0
            gptid/84b80ccd-a343-11e4-adaf-d050995044d1  ONLINE       0     0     0
            gptid/85856a70-a343-11e4-adaf-d050995044d1  ONLINE       0     1     0
            gptid/86542f6a-a343-11e4-adaf-d050995044d1  ONLINE       0     0     0
            gptid/8726b5b1-a343-11e4-adaf-d050995044d1  ONLINE       0     0     0

errors: No known data errors
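To pick out the affected members without reading the table by eye, a minimal sketch along these lines could scan the config section for rows with a nonzero READ, WRITE, or CKSUM counter. The function name and the sample text are illustrative only; the sample is an abridged copy of the report above.

```python
# Abridged copy of the config section from the report above.
SAMPLE_CONFIG = """\
NAME STATE READ WRITE CKSUM
zVol ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
gptid/831ef1ef-a343-11e4-adaf-d050995044d1 ONLINE 0 0 0
gptid/85856a70-a343-11e4-adaf-d050995044d1 ONLINE 0 1 0
gptid/86542f6a-a343-11e4-adaf-d050995044d1 ONLINE 0 0 0
gptid/8726b5b1-a343-11e4-adaf-d050995044d1 ONLINE 0 1 0
"""


def devices_with_errors(config_text):
    """Map device name -> counters for rows where any counter is not '0'."""
    flagged = {}
    for line in config_text.splitlines():
        fields = line.split()
        # Data rows have exactly five columns; skip the header and anything else.
        if len(fields) != 5 or fields[0] == "NAME":
            continue
        name, _state, read, write, cksum = fields
        # Counters are kept as strings so that any non-"0" value is flagged.
        if (read, write, cksum) != ("0", "0", "0"):
            flagged[name] = {"read": read, "write": write, "cksum": cksum}
    return flagged


for name, counters in devices_with_errors(SAMPLE_CONFIG).items():
    print(name, counters)
```

Because it splits on whitespace, the same function works whether or not the column alignment of the report survived copy and paste.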

The six WD Reds get weekly short SMART tests and bi-weekly long tests. None of the drives has logged an error, and all tests have completed satisfactorily. The drives are about 20 months old and have been in continuous service. I have one spare drive available.

Given that two disks experienced the same problem moments apart, I'm inclined to suspect another, common component (the 9200-8e or the cabling). So, I'm inclined to do "zpool clear", but would appreciate any comments or suggestions before I do so.

Thanks for your help.
 

Sakuru

Guru
Joined
Nov 20, 2015
Messages
527
Please post the output of "smartctl -a /dev/XXX" for all of your drives in code tags.
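One way to gather all of the reports into a single paste is sketched below. The device list (da0 through da5) is an assumption about how the six disks are named on this system, and the helper names are made up for illustration.

```python
import subprocess

# ASSUMPTION: the six pool disks appear as /dev/da0 .. /dev/da5.
DEVICES = [f"/dev/da{n}" for n in range(6)]


def smartctl_command(device):
    """Build the argument list for a full SMART report on one device."""
    return ["smartctl", "-a", device]


def collect_reports(devices, run=subprocess.run):
    """Run smartctl on each device and join the outputs under headers."""
    sections = []
    for dev in devices:
        # No check=True: smartctl's exit status is a bitmask and can be
        # nonzero even when the report itself was produced.
        result = run(smartctl_command(dev), capture_output=True, text=True)
        sections.append(f"===== {dev} =====\n{result.stdout}")
    return "\n".join(sections)


# Usage on the NAS itself (needs root):
#   print(collect_reports(DEVICES))
```

The combined text can then be pasted between code tags in one go.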
 

TooMuchData

Contributor
Joined
Jan 4, 2015
Messages
188
How about as text files?
 

Attachments

  • da0.txt
    6.1 KB · Views: 209
  • da1.txt
    6.1 KB · Views: 293
  • da2.txt
    6.1 KB · Views: 185
  • da3.txt
    6.1 KB · Views: 237
  • da4.txt
    6.1 KB · Views: 226
  • da5.txt
    6.1 KB · Views: 253

Sakuru

Guru
Joined
Nov 20, 2015
Messages
527
Eh, that works, but I'd prefer code tags (hit the button to the left of the save button in the toolbar) or something like pastebin. That way I don't have to download all those files :)
 

Sakuru

Guru
Joined
Nov 20, 2015
Messages
527
Hmm, your SMART stats look great.
Are you using any of the Marvell ports on that board?
 

TooMuchData

Contributor
Joined
Jan 4, 2015
Messages
188
Thank you, Sakuru.
Please see initial post. No Marvell ports are involved in this error.
 

Sakuru

Guru
Joined
Nov 20, 2015
Messages
527
Ah, so this isn't the one in your signature.
What is the "external box" that you refer to?
 

Mirfster

Doesn't know what he's talking about
Joined
Oct 2, 2015
Messages
3,215
Check the LSI downloads; they released another update (04/04/2016, if I recall), version 20.00.07.00. It isn't really required for everyone, but it does include a fix, mainly for those seeing a particular issue. I don't recall what it was off the top of my head, but it might be worth checking out.
 

TooMuchData

Contributor
Joined
Jan 4, 2015
Messages
188
Thank you, Mirfster and Sakuru.
The LSI has 20.00.07.00.
The external box is another Node 304 with an SSR-450RM PSU for the disks and an SFF-8088 external connector that fans out to eight internal SAS/SATA cables, six of which are connected to the disks. A CSE-PTJBOD-CB1 ATX board handles power on/off control.

Any reason why I should not do zpool clear?

My replacement C2750D4I arrived today, so I plan to put all the disks back into my other Node 304 in a few days.
 

Mirfster

Doesn't know what he's talking about
Joined
Oct 2, 2015
Messages
3,215

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
A reboot will clear the errors since the numbers are reset on an unmount. ;P
 