DrunkenPeleg
Cadet
- Joined: Apr 25, 2019
- Messages: 2
Persistent little issue I've got here: a storage pool randomly faults and becomes unavailable. This started about two weeks ago; the pool would go out every 6-10 hours. A reboot brought the pool back up, but it had to be a power cycle, since a normal restart failed due to a hanging process.
OK, so let's get started.
Here is an example of what happens with the pool:
Code:
root@freenas[~]# zpool status -v POOL4
  pool: POOL4
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://illumos.org/msg/ZFS-8000-JQ
  scan: scrub in progress since Mon Aug 10 12:11:15 2020
        5.24T scanned at 1.57G/s, 2.35T issued at 722M/s, 30.3T total
        0 repaired, 7.75% done, 0 days 11:17:04 to go
config:
        NAME                      STATE     READ WRITE CKSUM
        POOL4                     UNAVAIL      0     0     0
          raidz2-0                UNAVAIL     44     0     0
            12611917014333540652  REMOVED      0     0     0  was /dev/gptid/9feee305-50eb-11ea-ad9a-002590343322
            1567844855632180418   REMOVED      0     0     0  was /dev/gptid/adf3da24-50eb-11ea-ad9a-002590343322
            10092631138471912601  REMOVED      0     0     0  was /dev/gptid/af868d3e-af7c-11ea-800a-002590343322
            12748854986319037125  REMOVED      0     0     0  was /dev/gptid/c9b0f525-50eb-11ea-ad9a-002590343322
            6683381636331138817   REMOVED      0     0     0  was /dev/gptid/d7735d16-50eb-11ea-ad9a-002590343322
            7872409557343776049   REMOVED      0     0     0  was /dev/gptid/43c07d3d-caf8-11ea-969f-002590343322
            8424755186125633080   REMOVED      0     0     0  was /dev/gptid/e689752c-50eb-11ea-ad9a-002590343322
            10321425597778855069  REMOVED      0     0     0  was /dev/gptid/f46ad081-50eb-11ea-ad9a-002590343322
            11260363593274240624  REMOVED      0     0     0  was /dev/gptid/022be270-50ec-11ea-ad9a-002590343322
            6997537682382531171   REMOVED      0     0     0  was /dev/gptid/10ab5145-50ec-11ea-ad9a-002590343322
errors: Permanent errors have been detected in the following files:
        <metadata>:<0x0>
        <metadata>:<0x1>
        <metadata>:<0x48>
        POOL4:<0x13003>
        POOL4:<0x82d2>

At this point, I can't do anything with the pool since it just gives me the 'I/O is currently suspended' error.
So, I initiate a reboot (I've tried a regular shutdown as well). After some time the process hangs; here's the tail end of things:
Code:
Stopping ntpd.
Waiting for PIDS: 1672, 1672.
Shutting down local daemons:.
Stopping lockd.
Waiting for PIDS: 1628.
Stopping statd.
Waiting for PIDS: 1625.
Stopping nfsd.
Waiting for PIDS: 1616 1617.
Stopping mountd.
Waiting for PIDS: 1610.
Stopping watchdogd.
Waiting for PIDS: 1550.
Stopping rpcbind.
Waiting for PIDS: 1395.
Writing entropy file:.
Writing early boot entropy file:.
Aug 10 14:23:46 freenas init: timeout expired for /etc/rc.shutdown: Interrupted system call: going to single user mode
Aug 10 14:23:46 freenas init: timeout expired for /etc/rc.shutdown: Interrupted system call: going to single user mode
Aug 10 14:24:06 init: some processes would not die: ps axl advised
My guess is that the jails are at least partly preventing shutdown, since they have mount points on the affected pool.
When I had this issue two weeks ago, it was "resolved" by checking all cable connections, reseating the controller card, and running a scrub. I figured it was just a loose connection, since I had recently swapped out a drive and probably jiggled something loose.
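For reference, the recovery steps I ran were along these lines (only the pool name is from my system; run them in this order after reseating the hardware):

```shell
zpool clear POOL4       # clear the fault/suspended state after reseating hardware
zpool scrub POOL4       # start a full scrub of the pool
zpool status -v POOL4   # watch scrub progress and per-device READ/WRITE/CKSUM counters
```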
As you can see in the logs above, my attempts to run a scrub now are being foiled by the pool going out before the scrub can finish.
I think the restart/shutdown issue is secondary to whatever is going on with the pool. I should note that prior to this occurring again, there were no read/write errors, and SMART tests did not return any issues.
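The SMART tests I mention were along these lines (the /dev/da0 name here is just a placeholder for one of the pool's drives, not my actual device):

```shell
smartctl -t short /dev/da0   # queue a short self-test on one drive
smartctl -a /dev/da0         # review attributes and self-test results afterwards
```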
Should I chalk this up to bad cables, or perhaps the controller card? What's my best option to narrow this down?