One of my disks is resilvering without obvious reason?

Status
Not open for further replies.

Stilez

Guru
Joined
Apr 8, 2016
Messages
529
Quick summary:

Hardware is good (Supermicro/ECC/Xeon v3/enterprise drives) and the server seems fine. The pool, which I'm expanding, is made up of 4 two-way mirrors (call them vdev1:{d1,d2}, vdev2:{d1,d2}, vdev3:{d1,d2} and vdev4:{d1,d2}). I'm upgrading them to 3-way mirrors (I like mirrors, and 2-way isn't redundant enough any more).

I added 4 disks to the array (3 x mirrors to vdev1 and 1 x mirror to vdev2). When they finish resilvering, I'll detach 2 of the old drives from vdev1 and reuse them elsewhere.

What I'm noticing is odd behaviour with vdev4:d1.

While watching the resilver progress, I noticed that zpool status -v listed zero for all R/W/CHK (READ/WRITE/CKSUM) counters on every disk except vdev4:d1, which listed R=1, W=114, CHK=0. I wasn't too worried, as this is often a symptom of a bad cable, so I decided to power the system down to swap the cable and change the port. But I also saw that the output showed vdev4:d1 as resilvering, which was odd: I certainly hadn't told it to do so, and I could find no log reports of other faults (including SMART faults) that definitively showed a disk error.

I shut down, swapped the cable and port, and rebooted and all was then normal. This was about 4 hours ago. After rebooting, zpool status -v showed zero R/W/CHK errors for all drives (including that one), it showed just 3 new drives resilvering (correct), and vdev4:d1 had zero errors and was no longer listed as resilvering. After a while of no issues, I got on with other things and thought no more of it until now.

I probably kept an eye on zpool status for a while after reboot, but can't be sure how long. But just now, I rechecked. R/W/CHK errors are still all zero for all drives, but now vdev4:d1 shows once again as spontaneously "resilvering" without being told to and without obvious reason.
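
Since I can't be sure how long I actually watched it, something like the following could do the watching instead. This is only a rough sketch, assuming Python 3 is available on the box and that the output of zpool status -v has been captured as text. The pool name, device names and layout in the sample are made up for illustration and are not my real pool; the function name is also just a placeholder.

```python
# States that a pool, vdev or disk line can show in the config table.
STATES = {"ONLINE", "DEGRADED", "FAULTED", "OFFLINE", "UNAVAIL", "REMOVED"}

def suspicious_devices(status_text):
    """Return (name, read, write, cksum, resilvering) for every config line
    that has a non-zero error counter or a '(resilvering)' note."""
    found = []
    for line in status_text.splitlines():
        parts = line.split()
        # Config rows look like: NAME STATE READ WRITE CKSUM [note]
        if len(parts) < 5 or parts[1] not in STATES:
            continue
        name, counters = parts[0], parts[2:5]
        resilvering = "(resilvering)" in line
        # Counters are compared as text, so abbreviated values also count.
        has_errors = any(c != "0" for c in counters)
        if has_errors or resilvering:
            found.append((name, *counters, resilvering))
    return found

# Hypothetical sample text; names and layout are invented for illustration.
SAMPLE = """\
  pool: tank
 state: ONLINE
  scan: resilver in progress since Mon Nov 27 15:00:00 2017
config:

\tNAME        STATE     READ WRITE CKSUM
\ttank        ONLINE       0     0     0
\t  mirror-3  ONLINE       0     0     0
\t    da0     ONLINE       1   114     0  (resilvering)
\t    da7     ONLINE       0     0     0
"""

if __name__ == "__main__":
    for dev in suspicious_devices(SAMPLE):
        print(dev)
```

Run from cron against real captured output and logged with a timestamp, this would at least show when a disk starts being flagged as resilvering, which is the part I'm currently missing.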

Now I'm distinctly disturbed.

smartctl -a shows healthy, plus quite a few fast ECC corrected errors and a "non-medium error count" of 4. I'm not sure which SMART report to run; if one would be helpful, I'll run it.

What should I make of this, and what if any action is appropriate?

Stilez

Guru
Joined
Apr 8, 2016
Messages
529
What is the controller and disk model?
The device is on my system as da0, it's a SEAGATE ST6000NM0054 SAS, connected to an LSI 9211 OEM. The motherboard is a Supermicro X10SRi-F.

The HBA is a reflashed OEM 9211 (probably a Dell H200 originally but without shutting down I can't be sure if the original is a Dell or some other OEM version of the 9211). Either way it's running p20 firmware and all 8 ports on it are in use (2 x 4 fanouts, Adaptec originals); that's the only disk attached to it that's behaving oddly and I don't have any known issues with the card.

The disk itself has a bit of an odd history. I log disk purchases/RMAs/swapouts, so I remember this disk well. I sent a new 6TB SAS disk back as DOA early this year, and this was the replacement.

(I didn't know it at the time, but the issue was the cable, not the disk: the SAS cable worked fine with SATA but not with SAS. So, predictably, the replacement failed too, and Seagate agreed it was probably the disk, on the basis that every other [SATA] disk attached worked fine on that cable. Once I swapped the cable from cheapo-SAS to Adaptec, I never had another problem with it.)

So when the replacement disk also didn't spin up, I called Seagate Tech, who emailed me back that they couldn't help on it, because the replacement disk they'd sent for my Seagate drive wasn't in fact a Seagate - it was something I've never heard of. This was their email:

This Drive is a Seagate drive sold to XYRATEX. It is an OEM, XYRATEX owned it and might have added firmware to the drive. We at Segate have no access to the XYRATEX information, therefore we are not able to exchange this drive, it needs to be handled with XYRATEX support. XYRATEX is a devision of Seagate but has a different support handling as the drives do contain XYRATEX specific firmware.

I have a large sticky note on the drive to the effect that if it ever needs an RMA, don't accept any BS to the effect this isn't a Seagate drive, because the original was a Seagate off Amazon that they did accept for RMA, and this was the RMA replacement Seagate sent me for it.

That said, I have always found Seagate support very reliable and helpful, so I don't expect an issue, and I'm pretty sure they would immediately swap it if needed. But that's off topic. Anyway, that's the odd background to this HDD.

rs225

Guru
Joined
Jun 28, 2014
Messages
878
Try posting the smartctl -x for the drive in a code block. Somebody might recognize something.

No other ideas at this time, and if nothing else turns up, I would swap the drive and see if the replacement holds. Resilver without any obvious cause is strange.
 

Stilez

Guru
Joined
Apr 8, 2016
Messages
529
# smartctl -x /dev/da0
Code:
smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.0-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:			   SEAGATE
Product:			  ST6000NM0054
Revision:			 ET05
Compliance:		   SPC-4
User Capacity:		6,001,175,126,016 bytes [6.00 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is fully provisioned
Rotation Rate:		7200 rpm
Form Factor:		  3.5 inches
Logical Unit id:	  0x5000********
Serial number:		********
Device type:		  disk
Transport protocol:   SAS (SPL-3)
Local Time is:		Mon Nov 27 19:07:38 2017 GMT
SMART support is:	 Available - device has SMART capability.
SMART support is:	 Enabled
Temperature Warning:  Enabled
Read Cache is:		Enabled
Writeback Cache is:   Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:	 33 C
Drive Trip Temperature:		60 C

Manufactured in week 21 of year 2016
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  51
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  11289
Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 415372232
  Blocks received from initiator = 101070608
  Blocks read from cache and sent to initiator = 5839710
  Number of read and write commands whose size <= segment size = 482849
  Number of read and write commands whose size > segment size = 235

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 5144.58
  number of minutes until next internal SMART test = 11

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:    1221418        0         0   1221418          0        212.671           0
write:         0        0         0         0          0         51.772           0

Non-medium error count:		4


[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -    5132                 - [-   -    -]
# 2  Background long   Completed                   -     137                 - [-   -    -]

Long (extended) Self Test duration: 37912 seconds [631.9 minutes]

Background scan results log
  Status: no scans active
	Accumulated power on time, hours:minutes 5144:35 [308675 minutes]
	Number of background scans performed: 0,  scan progress: 0.00%
	Number of background medium scans performed: 0

Protocol Specific port log page for SAS SSP

relative target port id = 1
  generation code = 0
  number of phys = 1
  phy identifier = 0
	attached device type: SAS or SATA device
	attached reason: unknown
	reason: unknown
	negotiated logical link rate: phy enabled; 6 Gbps
	attached initiator port: ssp=1 stp=1 smp=1
	attached target port: ssp=0 stp=0 smp=0
	SAS address = 0x5000********
	attached SAS address = 0x5000********
	attached phy identifier = 6
	Invalid DWORD count = 0
	Running disparity error count = 0
	Loss of DWORD synchronization = 1
	Phy reset problem = 0
	Phy event descriptors:
	 Invalid word count: 0
	 Running disparity error count: 0
	 Loss of dword synchronization count: 1
	 Phy reset problem count: 0

relative target port id = 2
  generation code = 0
  number of phys = 1
  phy identifier = 1
	attached device type: no device attached
	attached reason: unknown
	reason: unknown
	negotiated logical link rate: phy enabled; unknown
	attached initiator port: ssp=0 stp=0 smp=0
	attached target port: ssp=0 stp=0 smp=0
	SAS address = 0x5000********
	attached SAS address = 0x0
	attached phy identifier = 0
	Invalid DWORD count = 0
	Running disparity error count = 0
	Loss of DWORD synchronization = 0
	Phy reset problem = 0
	Phy event descriptors:
	 Invalid word count: 0
	 Running disparity error count: 0
	 Loss of dword synchronization count: 0
	 Phy reset problem count: 0


I've got a spare 6TB if the consensus is to swap it out and RMA it. I'd rather not have to swap it in (I'm using it as a test drive), and I'm not sure they would accept a return without any detected underlying issue relating to the drive itself, but I could swap it at a pinch.
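
To see whether any of the counters in the output above actually move over time (which would be the kind of evidence an RMA might need), the interesting ones could be pulled out of saved smartctl -x output with something like the sketch below. It assumes Python 3 and the SAS/SCSI output layout shown above. The function name and dictionary keys are my own invention, and the sample text is just the relevant lines copied from the output above.

```python
import re

# Single-value counters, matched against the labels smartctl prints for SAS drives.
PATTERNS = {
    "grown_defects": r"Elements in grown defect list:\s*(\d+)",
    "non_medium_errors": r"Non-medium error count:\s*(\d+)",
    # Printed once per SAS port, so the matches are summed.
    "dword_sync_losses": r"Loss of DWORD synchronization\s*=\s*(\d+)",
}

def smart_counters(text):
    """Extract a few error counters from `smartctl -x` text for a SAS drive."""
    counters = {}
    for key, pattern in PATTERNS.items():
        counters[key] = sum(int(v) for v in re.findall(pattern, text))
    # In the error counter log, the first column is fast ECC corrections
    # and the last column is total uncorrected errors.
    for op in ("read", "write"):
        m = re.search(rf"^{op}:\s+(.*)$", text, re.MULTILINE)
        if m:
            cols = m.group(1).split()
            counters[f"{op}_ecc_fast"] = int(cols[0])
            counters[f"{op}_uncorrected"] = int(cols[-1])
    return counters

# Relevant lines copied from the smartctl output above.
SAMPLE = """\
Elements in grown defect list: 0
read:    1221418        0         0   1221418          0        212.671           0
write:         0        0         0         0          0         51.772           0
Non-medium error count:        4
Loss of DWORD synchronization = 1
Loss of DWORD synchronization = 0
"""

if __name__ == "__main__":
    for name, value in smart_counters(SAMPLE).items():
        print(name, value)
```

Comparing two such snapshots taken a day apart would show whether the non-medium error count or the loss of dword synchronization count is still climbing after the cable and port change.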

Even so, I would want to see if there's a reason why resilvering could be stated to be occurring without being told to do so and without a change to the vdev disks such as one being attached/replaced. That's plain odd.