zpool scrub issues


petr

Contributor
Joined
Jun 13, 2013
Messages
142
I can't seem to complete a scrub on one of my volumes. I noticed it had been stuck at 62.65% for a very long time (20+ days), so I stopped the scrub (which went OK, albeit taking about 30s) and rebooted the machine. Then I started the scrub again; it got to 62.65% pretty fast (as fast as expected) but then did not move any further.

Restarting the scrub again yields zero progress, and it never moves (see the status output below).
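
For reference, by "stopping" and "restarting" I mean the standard zpool commands (roughly this, from the CLI):

Code:
zpool scrub -s mainsafe   # cancel the scrub that is in progress
zpool scrub mainsafe      # start a fresh scrub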

No errors are reported, all SMART info shows "PASSED" and the system is performing just fine (data access via NFS and AFP, jails).

Details of my config are in my signature - the volume in question is made up of 7x8TB WD drives connected to a backplane that is shared by all my drives.

Please see below what the status says after stopping and re-running. I'm also a bit confused by the "issued at" figure.

Code:
root@freenas:~ # zpool status mainsafe
  pool: mainsafe
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
	still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
	the pool may no longer be accessible by software that does not support
	the features. See zpool-features(7) for details.
  scan: scrub in progress since Wed Jun 27 14:25:46 2018
	0 scanned at 0/s, 0 issued at 0/s, 25.2T total
	0 repaired, 0.00% done, no estimated completion time
config:

	NAME											STATE	 READ WRITE CKSUM
	mainsafe										ONLINE	   0	 0	 0
	  raidz3-0									  ONLINE	   0	 0	 0
		gptid/a4b8bb14-17d7-11e8-a2dd-000c293a32fe  ONLINE	   0	 0	 0
		gptid/a5859ce3-17d7-11e8-a2dd-000c293a32fe  ONLINE	   0	 0	 0
		gptid/a647bcbb-17d7-11e8-a2dd-000c293a32fe  ONLINE	   0	 0	 0
		gptid/a70459ab-17d7-11e8-a2dd-000c293a32fe  ONLINE	   0	 0	 0
		gptid/a7c75791-17d7-11e8-a2dd-000c293a32fe  ONLINE	   0	 0	 0
		gptid/a88e1cd5-17d7-11e8-a2dd-000c293a32fe  ONLINE	   0	 0	 0
		gptid/a94e05d1-17d7-11e8-a2dd-000c293a32fe  ONLINE	   0	 0	 0



Running FreeNAS-11.1-U5 with all available updates applied.

EDIT: Further information:
- Hypervisor - VMware ESXi
- 24 GB RAM available
- all storage hardware (M1015) is passed through directly to the VM

(The setup is not new - I have been running FreeNAS like this since 2012.)
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
Additional information from your sig that you should have added. This is going to make it a bit more difficult to diagnose.
Running inside VMware ESXi 6.5 with VMXNET3 interface
Additional pass-through devices:
- IBM M1015 in IR mode hooked up to the chassis backplane
 

petr

Contributor
Joined
Jun 13, 2013
Messages
142
Additional information from your sig that you should have added. This is going to make it a bit more difficult to diagnose.

Thank you, I've added more information to my post above. Why do you think it's going to make things more difficult to diagnose? AFAIK anything to do with drive access is the same as native, and networking is not an issue (the VMXNET3 interface is used alongside a pass-through 10Gb NIC that handles the bulk of the data transfers).

Also, scrubs for other volumes connected to exactly the same HW go through just fine, which leads me to believe that the issue may be somewhere on the config/software side.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
What's the output of camcontrol devlist?
 

petr

Contributor
Joined
Jun 13, 2013
Messages
142
What's the output of camcontrol devlist?

Code:
root@freenas:~ # camcontrol devlist
<VMware Virtual disk 2.0>		  at scbus2 target 0 lun 0 (pass0,da0)
<NECVMWar VMware SATA CD00 1.00>   at scbus3 target 0 lun 0 (cd0,pass1)
<ATA WDC WD80EFZX-68U 0A83>		at scbus33 target 10 lun 0 (pass2,da1)
<LSI SAS3x36 0601>				 at scbus33 target 11 lun 0 (ses0,pass3)
<ATA WDC WD80EFZX-68U 0A83>		at scbus33 target 12 lun 0 (pass4,da2)
<ATA WDC WD80EFZX-68U 0A83>		at scbus33 target 13 lun 0 (pass5,da3)
<ATA WDC WD80EFZX-68U 0A83>		at scbus33 target 14 lun 0 (pass6,da4)
<ATA WDC WD40EFRX-68W 0A82>		at scbus33 target 15 lun 0 (pass7,da5)
<ATA WDC WD40EFRX-68N 0A82>		at scbus33 target 16 lun 0 (pass8,da6)
<ATA WDC WD80EFZX-68U 0A83>		at scbus33 target 17 lun 0 (pass9,da7)
<ATA WDC WD80EFZX-68U 0A83>		at scbus33 target 18 lun 0 (pass10,da8)
<ATA WDC WD80EFZX-68U 0A83>		at scbus33 target 19 lun 0 (pass11,da9)
<ATA Crucial_CT500MX2 MU02>		at scbus33 target 20 lun 0 (pass12,da10)
<ATA Samsung SSD 850 1B6Q>		 at scbus33 target 21 lun 0 (pass13,da11)
<ATA TOSHIBA THNSNJ12 0101>		at scbus33 target 22 lun 0 (pass14,da12)
<ATA Samsung SSD 850 1B6Q>		 at scbus33 target 23 lun 0 (pass15,da13)
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
Why do you think it's going to make things more difficult to diagnose?
Your SAS controller listed in IR mode is what I was referring to. IR mode tends to mask SMART data and make diagnosing drive problems difficult.
 

petr

Contributor
Joined
Jun 13, 2013
Messages
142
Your SAS controller listed in IR mode is what I was referring to. IR mode tends to mask SMART data and make diagnosing drive problems difficult.

I am not setting up the drives using the IR mode functions at all - my understanding is that any drive not configured into an array on the controller is handled essentially the same as with IT mode firmware. I've never had any problems accessing a drive's SMART data as long as it wasn't set up into a volume on the controller (which these are not). In the camcontrol devlist I can even see the backplane port numbers, which leads me to believe that there is fairly little, if any, interference from the HBA.
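
For example, a quick check like this comes back with full identity and health data for every drive behind the M1015 (just an illustration of what I mean):

Code:
smartctl -i /dev/da1   # drive identity as reported through the HBA
smartctl -H /dev/da1   # overall health self-assessment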
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
IR should be fine, but possibly slower than IT. It's MegaRAID mode that's a problem.

root@freenas:~ # camcontrol devlist
<VMware Virtual disk 2.0> at scbus2 target 0 lun 0 (pass0,da0)
<NECVMWar VMware SATA CD00 1.00> at scbus3 target 0 lun 0 (cd0,pass1)
<ATA WDC WD80EFZX-68U 0A83> at scbus33 target 10 lun 0 (pass2,da1)
<LSI SAS3x36 0601> at scbus33 target 11 lun 0 (ses0,pass3)
<ATA WDC WD80EFZX-68U 0A83> at scbus33 target 12 lun 0 (pass4,da2)
<ATA WDC WD80EFZX-68U 0A83> at scbus33 target 13 lun 0 (pass5,da3)
<ATA WDC WD80EFZX-68U 0A83> at scbus33 target 14 lun 0 (pass6,da4)
<ATA WDC WD40EFRX-68W 0A82> at scbus33 target 15 lun 0 (pass7,da5)
<ATA WDC WD40EFRX-68N 0A82> at scbus33 target 16 lun 0 (pass8,da6)
<ATA WDC WD80EFZX-68U 0A83> at scbus33 target 17 lun 0 (pass9,da7)
<ATA WDC WD80EFZX-68U 0A83> at scbus33 target 18 lun 0 (pass10,da8)
<ATA WDC WD80EFZX-68U 0A83> at scbus33 target 19 lun 0 (pass11,da9)
<ATA Crucial_CT500MX2 MU02> at scbus33 target 20 lun 0 (pass12,da10)
<ATA Samsung SSD 850 1B6Q> at scbus33 target 21 lun 0 (pass13,da11)
<ATA TOSHIBA THNSNJ12 0101> at scbus33 target 22 lun 0 (pass14,da12)
<ATA Samsung SSD 850 1B6Q> at scbus33 target 23 lun 0 (pass15,da13)
Those look fine.

So, that leaves the question of what is going on. What's the output of smartctl -x /dev/daX for all disks?
all SMART info shows "PASSED"
That indication is basically meaningless. A disk could be failing disastrously and that indication would still show passed.
 

petr

Contributor
Joined
Jun 13, 2013
Messages
142
IR should be fine, but possibly slower than IT. It's MegaRAID mode that's a problem.


Those look fine.

So, that leaves the question of what is going on. What's the output of smartctl -x /dev/daX for all disks?

That indication is basically meaningless. A disk could be failing disastrously and that indication would still show passed.

Here it is - I concatenated it all into one file:

Code:
smartctl -x /dev/da1 > disks.txt
smartctl -x /dev/da2 >> disks.txt
smartctl -x /dev/da3 >> disks.txt
smartctl -x /dev/da4 >> disks.txt
smartctl -x /dev/da7 >> disks.txt
smartctl -x /dev/da8 >> disks.txt
smartctl -x /dev/da9 >> disks.txt


The output is here: link
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
Those disks have all had just a single short test run on them since new. Run a long test on all of them and report back.
 

petr

Contributor
Joined
Jun 13, 2013
Messages
142
Those disks have all had just a single short test run on them since new. Run a long test on all of them and report back.

Would that be just smartctl -t long /dev/da1 on each disk?
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
Correct.
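
Something like this will kick them off on all the pool disks and let you check on progress afterwards (a rough sketch - assumes a Bourne-style shell and the same da1-da4/da7-da9 disks from your earlier commands):

Code:
for d in da1 da2 da3 da4 da7 da8 da9; do
    smartctl -t long /dev/$d    # start the extended self-test on each disk
done

# check how far along a given disk is
smartctl -c /dev/da1 | grep -A1 "Self-test execution status"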
 

petr

Contributor
Joined
Jun 13, 2013
Messages
142

Looks like there is something odd happening - I ran the long self-test on two of the drives; both got to 90%, and the last 10% has not moved in the last week:

Code:
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 241) Self-test routine in progress...
                                        10% of test remaining.
Total time to complete Offline
data collection:                (  101) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (1342) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
I would RMA that drive.
 

petr

Contributor
Joined
Jun 13, 2013
Messages
142
I would RMA that drive.

I would, if the same thing were not happening on multiple drives... it seems suspect that they would all develop exactly the same fault.
 
Joined
Dec 29, 2014
Messages
1,135
I would, if the same thing were not happening on multiple drives... it seems suspect that they would all develop exactly the same fault.

I would check cabling or the power supply.
 

petr

Contributor
Joined
Jun 13, 2013
Messages
142
I would check cabling or the power supply.

Thank you - will do! Though the odd thing is that the drives are on the same backplane as other arrays that perform absolutely fine.

Another odd thing - the whole array's activity lights are pretty much lit up all the time.
 
Joined
Dec 29, 2014
Messages
1,135
Another odd thing - the whole array's activity lights are pretty much lit up all the time.

If it impacts all the drives at the same time, it might be the controller as well. Just trying to think of all the shared resources that are in play for the problem components. I would also try to get a good backup as soon as possible. If it is doing a scrub, that would keep all the drives that are part of your pool pretty busy.
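
If you want to see how busy each disk actually is while this is going on, something like this will show it (just a sketch - gstat is part of FreeBSD and zpool iostat is standard ZFS):

Code:
gstat -p                      # per-disk I/O load, physical devices only
zpool iostat -v mainsafe 5    # per-disk activity for just this pool, every 5 seconds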
 

wblock

Documentation Engineer
Joined
Nov 14, 2014
Messages
1,506
Both short and long SMART tests run in the background on the drive itself. If there is real data transfer going on at the same time, they can become very slow due to head contention. The SMART test might not time out, which would be consistent with what you are seeing. But it's also consistent with a drive that has a lot of errors and retries.
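
If you want to take that contention out of the picture, one option (a sketch - only if you can live without the scrub for a while) is to cancel it and let the long tests finish on their own:

Code:
zpool scrub -s mainsafe    # cancel the in-progress scrub; restart later with 'zpool scrub mainsafe'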

Here it is - I concatenated it all into one file:
The output is here: link

I tried to read that, and even after allowing a lot of JavaScript (which is normally blocked as a security measure), I still couldn't read it. Posting it in code tags here should work.

Anyway, I was going to look to see whether the problem drives had already accumulated some bad blocks, which would be an indicator they were on the way out.
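
Roughly what I'd be grepping for (a sketch - the attribute names below are the usual smartctl ones, so adjust to what your drives actually report; assumes a Bourne-style shell and the same disk names as above):

Code:
for d in da1 da2 da3 da4 da7 da8 da9; do
    echo "== /dev/$d =="
    smartctl -A /dev/$d | egrep "Reallocated_Sector|Current_Pending_Sector|Offline_Uncorrectable"
done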
 

petr

Contributor
Joined
Jun 13, 2013
Messages
142
If it impacts all the drives at the same time, it might be the controller as well. Just trying to think of all the shared resources that are in play for the problem components. I would also try to get a good backup as soon as possible. If it is doing a scrub, that would keep all the drives that are part of your pool pretty busy.

Re. the controller - wouldn't other volumes on the same controller also be experiencing problems? Everything else seems to be working just fine.
 