Second drive "resilvering" when other drive replaced

Rexbard · Nov 12, 2016

I had 9 of 2-3Gb drives from old systems and unused Drobos, and decided to build a large FreeNAS server with RAIDZ3 -- partly to experiment, but also for a large backup server for my family's computers, and storage for an EyeTV DVR and Plex. Since a few of these drives are a bit aged, I'm slowly replacing them with new or larger drives.

On the last two drive replacements, when the new drive is resilvering, FreeNAS' daily email reports a second drive also resilvering. The same one did it on both replacements. I expect this means the second drive is starting to fail and takes a hit during the intensive resilvering process. However, I get zero errors or warnings from that drive during daily use.

Is there another reason this could be resilvering? Perhaps part of the reorganization of data or checksums?

Thanks for your help.

P.S. The motherboard is aging, as well, so that may also be a project soon. However, we have relatively light usage. There are occasional backups. Two of the family occasionally use the FreeNAS for Plex, with little local encoding needed. The biggest user is one that stores a lot of their games here, and plays them from their LAN PC. He built a sweet gaming PC, but skimped on the HD size. He's inherited the previous 2GB I replaced, so that should change soon.

FreeNAS-9.10.1-U4 (ec9a7d3)

Sakuru · Nov 12, 2016

Please post those messages in [CODE][/CODE] tags if you can please. Also post the output of zpool status.

Rexbard · Nov 12, 2016

Here's the first status email I referenced:

Code:

Checking status of zfs pools:
NAME  SIZE  ALLOC  FREE  EXPANDSZ  FRAG  CAP  DEDUP  HEALTH  ALTROOT
freenas-boot  29.8G  2.42G  27.3G  -  -  8%  1.00x  ONLINE  -
wodeyun  16.2T  13.5T  2.72T  -  23%  83%  1.00x  ONLINE  /mnt

 pool: wodeyun
state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scan: resilver in progress since Mon Oct 10 17:07:27 2016
 10.3T scanned out of 13.5T at 303M/s, 3h4m to go
 1.10T resilvered, 76.30% done
config:

	NAME  STATE  READ WRITE CKSUM
	wodeyun  ONLINE  0  0  0
	 raidz3-0  ONLINE  0  0  0
	 gptid/b0683d7b-f1ce-11e5-9025-90e6baeef758  ONLINE  0  0  0
	 gptid/b12494e0-f1ce-11e5-9025-90e6baeef758  ONLINE  0  0  0  (resilvering)
	 gptid/b24e48aa-f1ce-11e5-9025-90e6baeef758  ONLINE  0  0  0
	 gptid/b310f427-f1ce-11e5-9025-90e6baeef758  ONLINE  0  0  0
	 gptid/b3db0a36-f1ce-11e5-9025-90e6baeef758  ONLINE  0  0  0
	 gptid/90746b63-8f10-11e6-bf1b-90e6baeef758  ONLINE  0  0  0  (resilvering)
	 gptid/b6325863-f1ce-11e5-9025-90e6baeef758  ONLINE  0  0  0
	 gptid/b7adbd82-f1ce-11e5-9025-90e6baeef758  ONLINE  0  0  0
	 gptid/b919a53f-f1ce-11e5-9025-90e6baeef758  ONLINE  0  0  0

errors: No known data errors

and here's the second:

Code:

Checking status of zfs pools:
NAME  SIZE  ALLOC  FREE  EXPANDSZ  FRAG  CAP  DEDUP  HEALTH  ALTROOT
freenas-boot  29.8G  3.04G  26.7G  -  -  10%  1.00x  ONLINE  -
wodeyun  16.2T  13.4T  2.85T  -  22%  82%  1.00x  ONLINE  /mnt

 pool: wodeyun
state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scan: resilver in progress since Fri Nov 11 11:22:21 2016
 12.9T scanned out of 13.4T at 241M/s, 0h33m to go
 1.38T resilvered, 96.59% done
config:

	NAME  STATE  READ WRITE CKSUM
	wodeyun  ONLINE  0  0  0
	 raidz3-0  ONLINE  0  0  0
	 gptid/b0683d7b-f1ce-11e5-9025-90e6baeef758  ONLINE  0  0  0
	 gptid/b12494e0-f1ce-11e5-9025-90e6baeef758  ONLINE  0  0  0  (resilvering)
	 gptid/b24e48aa-f1ce-11e5-9025-90e6baeef758  ONLINE  0  0  0
	 gptid/b310f427-f1ce-11e5-9025-90e6baeef758  ONLINE  0  0  0
	 gptid/b3db0a36-f1ce-11e5-9025-90e6baeef758  ONLINE  0  0  0
	 gptid/90746b63-8f10-11e6-bf1b-90e6baeef758  ONLINE  0  0  0
	 gptid/b6325863-f1ce-11e5-9025-90e6baeef758  ONLINE  0  0  0
	 gptid/b7adbd82-f1ce-11e5-9025-90e6baeef758  ONLINE  0  0  0
	 gptid/e82ecfe6-a832-11e6-9d5d-90e6baeef758  ONLINE  0  0  0  (resilvering)

errors: No known data errors

Rexbard · Nov 12, 2016

I did a "zpool status" earlier, although I can't guarantee where it was in the drive-replacement process.

Code:

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0h2m with 0 errors on Wed Oct 26 03:47:04 2016
config:

	NAME										  STATE	 READ WRITE CKSUM
	freenas-boot								  ONLINE	   0	 0	 0
	  gptid/c1712f73-cc3b-11e5-95e9-78e3b55a2d62  ONLINE	   0	 0	 0

errors: No known data errors

  pool: wodeyun
 state: ONLINE
  scan: resilvered 1.43T in 16h52m with 0 errors on Sat Nov 12 04:14:55 2016
config:

	NAME											STATE	 READ WRITE CKSUM
	wodeyun										 ONLINE	   0	 0	 0
	  raidz3-0									  ONLINE	   0	 0	 0
		gptid/b0683d7b-f1ce-11e5-9025-90e6baeef758  ONLINE	   0	 0	 0
		gptid/b12494e0-f1ce-11e5-9025-90e6baeef758  ONLINE	   0	 0	 0
		gptid/b24e48aa-f1ce-11e5-9025-90e6baeef758  ONLINE	   0	 0	 0
		gptid/b310f427-f1ce-11e5-9025-90e6baeef758  ONLINE	   0	 0	 0
		gptid/b3db0a36-f1ce-11e5-9025-90e6baeef758  ONLINE	   0	 0	 0
		gptid/90746b63-8f10-11e6-bf1b-90e6baeef758  ONLINE	   0	 0	 0
		gptid/b6325863-f1ce-11e5-9025-90e6baeef758  ONLINE	   0	 0	 0
		gptid/b7adbd82-f1ce-11e5-9025-90e6baeef758  ONLINE	   0	 0	 0
		gptid/e82ecfe6-a832-11e6-9d5d-90e6baeef758  ONLINE	   0	 0	 0

errors: No known data errors

and here's one now, after the drive-replacement resilvering appears to be done.

Code:

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0h2m with 0 errors on Wed Oct 26 03:47:04 2016
config:

	NAME										  STATE	 READ WRITE CKSUM
	freenas-boot								  ONLINE	   0	 0	 0
	  gptid/c1712f73-cc3b-11e5-95e9-78e3b55a2d62  ONLINE	   0	 0	 0

errors: No known data errors

  pool: wodeyun
 state: ONLINE
  scan: resilvered 1.43T in 16h52m with 0 errors on Sat Nov 12 04:14:55 2016
config:

	NAME											STATE	 READ WRITE CKSUM
	wodeyun										 ONLINE	   0	 0	 0
	  raidz3-0									  ONLINE	   0	 0	 0
		gptid/b0683d7b-f1ce-11e5-9025-90e6baeef758  ONLINE	   0	 0	 0
		gptid/b12494e0-f1ce-11e5-9025-90e6baeef758  ONLINE	   0	 0	 0
		gptid/b24e48aa-f1ce-11e5-9025-90e6baeef758  ONLINE	   0	 0	 0
		gptid/b310f427-f1ce-11e5-9025-90e6baeef758  ONLINE	   0	 0	 0
		gptid/b3db0a36-f1ce-11e5-9025-90e6baeef758  ONLINE	   0	 0	 0
		gptid/90746b63-8f10-11e6-bf1b-90e6baeef758  ONLINE	   0	 0	 0
		gptid/b6325863-f1ce-11e5-9025-90e6baeef758  ONLINE	   0	 0	 0
		gptid/b7adbd82-f1ce-11e5-9025-90e6baeef758  ONLINE	   0	 0	 0
		gptid/e82ecfe6-a832-11e6-9d5d-90e6baeef758  ONLINE	   0	 0	 0

errors: No known data errors

*edit: deleted stray empty lines from CODE.

Sakuru · Nov 12, 2016

Huh, that is strange.
Figure out which drive that is with this command:

 

glabel status | grep gptid/b12494e0-f1ce-11e5-9025-90e6baeef758

That should return something like:

 

gptid/b12494e0-f1ce-11e5-9025-90e6baeef758	 N/A  da0p2

That would mean the funky drive is da0.
Please post the SMART info for that drive with the following command, replacing da0 with your drive:

 

smartctl -a /dev/da0

Rexbard · Nov 12, 2016

Code:

# smartctl -a /dev/ada1 | tee 2016-11-12-ada1-smartctl.txt
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Western Digital Red
Device Model:	 WDC WD30EFRX-68EUZN0
Serial Number:	WD-WCC4N4XR5JRF
LU WWN Device Id: 5 0014ee 2b7723284
Firmware Version: 82.00A82
User Capacity:	3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	5400 rpm
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:	Sat Nov 12 17:45:34 2016 MST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:	  (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection:		 (39600) seconds.
Offline data collection
capabilities:			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:			(0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:		(0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time:	 (   2) minutes.
Extended self-test routine
recommended polling time:	 ( 398) minutes.
Conveyance self-test routine
recommended polling time:	 (   5) minutes.
SCT capabilities:			(0x703d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x002f   200   200   051	Pre-fail  Always	   -	   0
  3 Spin_Up_Time			0x0027   182   182   021	Pre-fail  Always	   -	   5900
  4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   13
  5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x002e   100   253   000	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0032   093   093   000	Old_age   Always	   -	   5378
 10 Spin_Retry_Count		0x0032   100   253   000	Old_age   Always	   -	   0
 11 Calibration_Retry_Count 0x0032   100   253   000	Old_age   Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   13
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   3
193 Load_Cycle_Count		0x0032   200   200   000	Old_age   Always	   -	   135
194 Temperature_Celsius	 0x0022   121   119   000	Old_age   Always	   -	   29
196 Reallocated_Event_Count 0x0032   200   200   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0030   100   253   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x0032   200   200   000	Old_age   Always	   -	   0
200 Multi_Zone_Error_Rate   0x0008   200   200   000	Old_age   Offline	  -	   0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed without error	   00%	  5105		 -
# 2  Extended offline	Completed without error	   00%	  4260		 -
# 3  Short offline	   Completed without error	   00%	  4091		 -
# 4  Short offline	   Completed without error	   00%	  3108		 -
# 5  Short offline	   Completed without error	   00%	  2604		 -
# 6  Short offline	   Completed without error	   00%	  2365		 -
# 7  Extended offline	Completed without error	   00%	  2157		 -
# 8  Short offline	   Completed without error	   00%	  1885		 -
# 9  Short offline	   Completed without error	   00%	  1646		 -
#10  Extended offline	Completed without error	   00%	  1624		 -
#11  Short offline	   Completed without error	   00%	  1615		 -
#12  Extended offline	Completed without error	   00%	  1438		 -
#13  Short offline	   Completed without error	   00%	  1142		 -
#14  Short offline	   Completed without error	   00%	   903		 -
#15  Extended offline	Interrupted (host reset)	  10%	   694		 -
#16  Short offline	   Completed without error	   00%	   443		 -
#17  Extended offline	Completed without error	   00%	   235		 -
#18  Extended offline	Completed without error	   00%		33		 -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Sakuru · Nov 12, 2016

Huh, that drive looks fine to me. It should not be resilvering. How have you been performing the drive replacements?

Stux · Nov 12, 2016

Are you hot swapping drives?

Rexbard · Nov 13, 2016

Not hot-swapping. I followed the guide.

- Storage->Volumes->View Volumes, noted Serial number
- Storage->Volumes->View Volumes, selected volume, clicked [Volume Status] icon, selected drive to replace, clicked [Offline]
- Powered down, swapped drive with matching Serial number (and device port), powered back up
- Storage->Volumes->View Volumes, selected volume, clicked [Volume Status] icon, selected new drive, clicked [Replace]

Resilvering started, and I received the noted emails at some point in the process.

Rexbard · Nov 24, 2016

I'll probably pick up another drive this weekend. Is there anything I can monitor or run before/during/after the swap to help figure out what's going on? Or is there any additional logging you'd like to turn on, if you think this may be a FreeNAS bug?

I'll also check the guide again to make sure I'm doing the swap right.

Rexbard · Dec 6, 2016

I replaced another drive. I setup a crontab to periodically log "zpool status" and "smartctl -x", which shows resilvering went for about 12 hours and when something happened (last log showed 73% complete). The resilvering appeared to begin again (multipass?). Then, about 2 hours after that, something else happened (not sure what, yet). A few hours later, FreeNAS restarted and the crontab was gone, as expected.

When I started logs again, *3* drives are now resilvering, including the new one. Currently at 89% done; hopefully it completes without another failure.

I'll post more as I figure things out, in case someone else every has a similar episode.

Robert Trevellyan · Dec 7, 2016

Rexbard said:
*3* drives are now resilvering, including the new one

This should not happen, and may indicate an underlying problem with the hardware or the pool.

Rexbard · Dec 10, 2016

Thanks, Robert. I have scrubs setup via the FreeNAS GUI and periodically run pool status. I'm finding the "zpool status" reports often have small, but troubling resilvering "scan" comments, such as todays "resilvered 8K in 0h0m with 0 errors on Sat Dec 10 18:17:22 2016". Is there anything else I can run or that I can look at to verify my pool health? I've yet to find a command or log to show which drive was resilvered, so suggestions on that would be well appreciated, too.

I found a couple of interesting things in the logs and in recent FreeNAS daily reports that might indicate hardware issues.

1) I occasionally receive a report stating "Device: /dev/ada1, Read SMART Error Log Failed". However, running short and long tests at the command line gave reports that I thought were okay (smartctl noob; learning about the reports as I go). I discovered today that when a drive is replaced, the short and long tests I have setup via the FreeNAS GUI have to be re-established for the drives. I've now done so, and will see if this "error log" comment continues.

1b) In a "smartctl -x" today, I noticed non-zero "Transition from drive PhyRdy to drive PhyNRdy" values for most of the drives. ada1 happens to have the most at 308. I'll search the forums for suggestions on what IBM M1015 breakout cables to use, as cable issues is a common suggestion. I'm currently using what came with my CoolerMaster case to attach to the motherboard, and a breakout cable from Monoprice (may be http://www.monoprice.com/product?p_id=8186).

2) Parsing the message and dmesg logs, I'm also getting an occasional "ahcich1: Timeout on slot N port 0" and "(aprobe0:ahcich1:0:0:0): CAM status: Command timeout" (error 5) messages. They don't seem to happen coincidental to the FreeNAS restarts (the software restarts, not the server), and the restarts often happen without these errors occurring in-between. Again, I'm reading up on this error, but it, also, seems to be indicative of a hardware issue.

Could my ASUS motherboard and AMD be contributing to this? Although some like @joeschmuck has used an ASUS motherboard and AMD cpu, its looks like its been a while and the new Hardware Recommendations don't mention AMD. Since I want to setup and play with an lightweight ESXi server, I've been thinking of moving my FreeNAS to SuperMicro/Intel and using the current HW for the ESXi server.

Thank you for your help (and @Stux and @Sakuru). The support and info here is stellar -- way richer than I've seen on other forums.

Stux · Dec 11, 2016

You have a hardware issue :(

CPU, ram, PSU, mobo, HBA, cables or HD ;)

Or a combination :(

Robert Trevellyan · Dec 11, 2016

Rexbard said:
Could my ASUS motherboard and AMD be contributing to this?

Yes, but it could be almost anything. The only way to figure it out is methodically, step by step, changing one thing at a time. You can start with simple things like cables and work your way from there.

With 9 drives, are you sure your PSU is up to the job?

Rexbard · Dec 11, 2016

Thanks, Stux:

Since Robert asked, I double-checked the PSU on a couple of calculators, and am above their suggested wattages of 356/379W (PSU is 650W). I'll add the PSU to my sig.

Robert Trevellyan · Dec 12, 2016

According to the specs for your mobo, the SATA ports are driven by its "SB750 Chipset". Have you tested without using the onboard ports? Your M1015 should be able to run all your drives if you have enough breakout cables.

Rexbard · Dec 12, 2016

Are you watching me? ;)

I was using my main computer when I received a zpool alert. "zpool status" showed ada1 had a 3 read errors. I'm not sure if I did something to cause it, but running the status a few minutes later to get the drive info, and the READ errors were zeroed again. FYI: I ran "geom status" and the incorrect "zpool -v status" between.

I was going to only move ada1 to the M1015, but then saw your message. So, all but ada0 is off the motherboard. As I understand it, I need a SAS Expander to go beyond 2x4port breakout cables, so it can only support 8 for me right now.

There was a momentary panic when it didn't boot, but the BIOS was detecting drives in a new order, bumping the USB out of the boot slot. That fixed, and all came back up fine.

One other tidbit for future reference, the suspect drive's smart report accumulated 50 "Transition from drive PhyRdy to drive PhyNRdy" in one day.

Rexbard · Dec 16, 2016

No tiny-resilvers, "aprobe0:ahcich1" messages, or FreeBSD restarts since I moved four of the 5 drives from the motherboard to the M1015. Only 4 days, but since it was happening multiple times a day beforehand, I consider this a resounding success.

Having something new to work on and learn was one of the reasons I built a FreeNAS system, and its definitely been "fun" and educational. Now to start planning for my AMD->Intel upgrade.

Thank you all for your help Sakuru, Stux, and Robert Trevellyan.

Important Announcement for the TrueNAS Community.

Second drive "resilvering" when other drive replaced

Dabbler

Guru

Dabbler

Dabbler

Guru

Dabbler

Guru

MVP

Dabbler

Dabbler

Dabbler

Pony Wrangler

Dabbler

MVP

Pony Wrangler

Dabbler

Pony Wrangler

Dabbler

Dabbler

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Second drive "resilvering" when other drive replaced"

Similar threads