SOLVED Stuck in resilvering loop, is my pool in a death spiral?

I3lackR0se

Dabbler
Joined
May 22, 2015
Messages
16
Hey all. I'm running into a difficult situation that my google-fu isn't helping me with, and I'm worried I'm about to lose my pool. I'm running Freenas 11.2, and have two 6-drive vdevs, with 2 drives dead on one vdev. My system is constantly trying and failing to replace the first of those two drives, leaving it stuck in an eternal resilver loop. With two drives dead and constant pressure being put on the drive with back-to-back resilvers... my fear is a 3rd drive will give out soon and I'll be in a tough spot. I can't seem to figure out what the issue is, or how to safely pause the automatic resilvering. I am getting a cycle of errors clearly related to the drives being disconnected in my main log. I've attach it and my dmesg output below (and my build is in my signature).

Thank you very much in advance if anyone is able to put me on the right path to saving this.
 

Attachments

  • dmesg.txt
    220.4 KB · Views: 121
  • console.txt
    219.6 KB · Views: 97
Joined
Jan 7, 2015
Messages
1,150
You should definitely be backing up anything mission critical. Stop all jails VMs and services but ssh during a resilver. New data be written or accessed will at very least slow things way down.

You will want to offline the bad discs before resilvering/replacing. And if you have the parity you can do multiple disks at once.

So the disks that you have that are bad are-
da8: <ATA WDC WD80EMAZ-00W 0A83> s/n 7SHAMBHC
da10: <ATA WDC WD80EMAZ-00W 0A83> s/n 7HJX56YF

So what seems to be happening is they are getting booted then reattaching triggering a resilver. Offlining then replacing with new discs should get it, if not there could be more issues.
 

I3lackR0se

Dabbler
Joined
May 22, 2015
Messages
16
Thank you so much! I was able to fix it with a slight adjustment to your insight. I'll document my fix below for anyone who runs into a similar issue.

In retrospect after fixing, potentially related hardware information with my situation:
I had two 6 drive vdevs, with 4 additional unused drives available to replace broken ones as needed. All of these drives were connected directly to a RES2SV240 SAS Expander with breakout cables, and individually powered with molex power cables from the power supply.

Why is that information potentially important? Well, to disconnect the drives John Digital suggested, I needed to verify that the labels on my enclosure matched the drive serials I needed. To pull those drives out, I needed to pull them all out, and in the process reseated most of the power and data cables in place. Low and behold, before I even changed anything the resilver process got much, much farther. It either got to the very end of fixing the first drive, or within a couple percent of it. By this point I thought I just had a loose cable issue, however ultimately it did fail as before, it just got much further.

As I was about to pull 7SHAMBHC and 7HJX56YF, I noticed that my drives that were a part of the vdev and needed to be replaced were da8 (which was now 7HJX56YF) and da11 (which was now 7HJY3BJF). This was a head scratcher, so I looked at my logs to see where John was coming from.

The first relevant looking error in the log on a given restart was a couple like this:
Code:
Aug 26 10:58:41 freenas syslog-ng[4036]: Error processing log message: <ATA WDC WD80EFAX-68L 0A83> Fixed Direct Access SPC-4 SCSI device
da2: Serial Number 7SGGWWXC           
da2: 600.000MB/s transfers
da2: Command Queueing enabled
da2: 7630885MB (15628053168 512 byte sectors)
da8 at mps0 bus 0 scbus0 target 13 lun 0
da7 at mps0 bus 0 scbus0 target 12 lun 0
da8: <ATA WDC WD80EMAZ-00W 0A83> Fixed Direct Access SPC-4 SCSI device
da8: Serial Number 7SHAMBHC           
da8: 600.000MB/s transfers
da8: Command Queueing enabled
da8: 7630885MB (15628053168 512 byte sectors)
da7: <ATA WDC WD80EMAZ-00W 0A83> Fixed Direct Access SPC-4 SCSI device
da10 at mps0 bus 0 scbus0 target 15 lun 0
da10: <ATA WDC WD80EMAZ-00W 0A83> Fixed Direct Access SPC-4 SCSI device
da7: Serial Number 1EJ7VL2Z           
da7: 600.000MB/s transfers
da7: Command Queueing enabled
da7: 7630885MB (15628053168 512 byte sectors)
da9 at mps0 bus 0 scbus0 target 14 lun 0
da10: Serial Number 7HJX56YF           
da10: 600.000MB/s transfers
da10: Command Queueing enabled
da10: 7630885MB (15628053168 512 byte sectors)
da9: <ATA WDC WD80EMAZ-00W 0A83> Fixed Direct Access SPC-4 SCSI device
da1 at mps0 bus 0 scbus0 target 1 lun 0
da9: Serial Number 7HJY3BJF           
da9: 600.000MB/s transfers
da9: Command Queueing enabled
da9: 7630885MB (15628053168 512 byte sectors)
da5 at mps0 bus 0 scbus0 target 10 lun 0
da1: da5: <ATA WDC WD80EDAZ-11T 0A81> Fixed Direct Access SPC-4 SCSI device
da5: Serial Number VGH2YX1G           
da5: 600.000MB/s transfers
da5: Command Queueing enabled
da5: 7630885MB (15628053168 512 byte sectors)
<ATA WDC WD40EFRX-68W 0A82> Fixed Direct Access SPC-4 SCSI device
da1: Serial Number WD-WCC4E4AVL5Y1
da1: 600.000MB/s transfers
da1: Command Queueing enabled
da1: 3815447MB (7814037168 512 byte sectors)
da1: quirks=0x8<4K>
da3 at mps0 bus 0 scbus0 target 8 lun 0
da3: <ATA WDC WD40EFRX-68W 0A82> Fixed Direct Access SPC-4 SCSI device
da3: Serial Number WD-WCC4E5UPZV6N
da3: 600.000MB/s transfers
da3: Command Queueing enabled
da4 at mps0 bus 0 scbus0 target 9 lun 0
da4: <ATA WDC WD


Then a semi-recurring mixed pattern of errors that looked like so:
Code:
Aug 26 11:18:35 freenas da10: <ATA WDC WD80EMAZ-00W 0A83> s/n 7HJX56YF             detached
Aug 26 11:18:35 freenas GEOM_MIRROR: Device swap0: provider da10p1 disconnected.
Aug 26 11:18:35 freenas (da10:mps0:0:15:0): Periph destroyed
Aug 26 11:18:35 freenas mps0: Unfreezing devq for target ID 15
Aug 26 11:18:44 freenas mps0: mpssas_prepare_remove: Sending reset for target ID 13
Aug 26 11:18:44 freenas da8 at mps0 bus 0 scbus0 target 13 lun 0
Aug 26 11:18:44 freenas da8: <ATA WDC WD80EMAZ-00W 0A83> s/n 7SHAMBHC             detached
Aug 26 11:18:44 freenas GEOM_MIRROR: Device swap0: provider da8p1 disconnected.
Aug 26 11:18:44 freenas (da8:mps0:0:13:0): Periph destroyed
Aug 26 11:18:44 freenas GEOM_MIRROR: Device swap0: provider destroyed.
Aug 26 11:18:44 freenas GEOM_MIRROR: Device swap0 destroyed.
Aug 26 11:18:44 freenas mps0: Unfreezing devq for target ID 13
Aug 26 11:18:44 freenas GEOM_ELI: Device mirror/swap0.eli destroyed.
Aug 26 11:18:44 freenas GEOM_ELI: Detached mirror/swap0.eli on last close.
Aug 26 11:18:51 freenas mps0: SAS Address for SATA device = d137314b9db98e6c
Aug 26 11:18:51 freenas mps0: SAS Address from SATA device = d137314b9db98e6c
Aug 26 11:18:53 freenas ses0: da8,pass8: Element descriptor: 'ArrayDevice09'
Aug 26 11:18:53 freenas ses0: da8,pass8: SAS Device Slot Element: 1 Phys at Slot 9
Aug 26 11:18:53 freenas ses0:  phy 0: SATA device
Aug 26 11:18:53 freenas ses0:  phy 0: parent 5001e677bba8ffff addr 5001e677bba8ffe9
Aug 26 11:18:53 freenas da8 at mps0 bus 0 scbus0 target 13 lun 0
Aug 26 11:18:53 freenas da8: <ATA WDC WD80EMAZ-00W 0A83> Fixed Direct Access SPC-4 SCSI device
Aug 26 11:18:53 freenas da8: Serial Number 7SHAMBHC           
Aug 26 11:18:53 freenas da8: 600.000MB/s transfers
Aug 26 11:18:53 freenas da8: Command Queueing enabled
Aug 26 11:18:53 freenas da8: 7630885MB (15628053168 512 byte sectors)
Aug 26 11:18:53 freenas ZFS: vdev state changed, pool_guid=5287701622901903645 vdev_guid=13252120772642032451
Aug 26 11:18:53 freenas ZFS: vdev state changed, pool_guid=5287701622901903645 vdev_guid=2852271544820749817
Aug 26 11:18:53 freenas ZFS: vdev state changed, pool_guid=5287701622901903645 vdev_guid=4682476836420478665
Aug 26 11:18:54 freenas ZFS: vdev state changed, pool_guid=5287701622901903645 vdev_guid=12030264751771965849
Aug 26 11:18:54 freenas ZFS: vdev state changed, pool_guid=5287701622901903645 vdev_guid=11816524228526246329
Aug 26 11:18:54 freenas ZFS: vdev state changed, pool_guid=5287701622901903645 vdev_guid=7767667901200424987
Aug 26 11:18:54 freenas ZFS: vdev state changed, pool_guid=5287701622901903645 vdev_guid=15456278399355232182
Aug 26 11:19:10 freenas GEOM_MIRROR: Device mirror/swap0 launched (2/2).
Aug 26 11:19:10 freenas GEOM_ELI: Device mirror/swap0.eli created.
Aug 26 11:19:10 freenas GEOM_ELI: Encryption: AES-XTS 128
Aug 26 11:19:10 freenas GEOM_ELI:     Crypto: hardware
Aug 26 11:19:34 freenas mpssas_get_sata_identify: error reading SATA PASSTHRU; iocstatus = 0x47
Aug 26 11:19:34 freenas mpssas_get_sata_identify: error reading SATA PASSTHRU; iocstatus = 0x804b
Aug 26 11:19:34 freenas mps0: Sleeping 3 seconds after SATA ID error to wait for spinup
Aug 26 11:19:37 freenas mps0: SAS Address for SATA device = 392b424e9dae9083
Aug 26 11:19:37 freenas mps0: SAS Address from SATA device = 392b424e9dae9083
Aug 26 11:19:37 freenas ses0: da10,pass10: Element descriptor: 'ArrayDevice0B'
Aug 26 11:19:37 freenas ses0: da10,pass10: SAS Device Slot Element: 1 Phys at Slot 11
Aug 26 11:19:37 freenas ses0:  phy 0: SATA device
Aug 26 11:19:37 freenas ses0:  phy 0: parent 5001e677bba8ffff addr 5001e677bba8ffeb
Aug 26 11:19:37 freenas da10 at mps0 bus 0 scbus0 target 15 lun 0
Aug 26 11:19:37 freenas da10: <ATA WDC WD80EMAZ-00W 0A83> Fixed Direct Access SPC-4 SCSI device
Aug 26 11:19:37 freenas da10: Serial Number 7HJX56YF           
Aug 26 11:19:37 freenas da10: 600.000MB/s transfers
Aug 26 11:19:37 freenas da10: Command Queueing enabled
Aug 26 11:19:37 freenas da10: 7630885MB (15628053168 512 byte sectors)
Aug 26 11:20:20 freenas mps0: mpssas_prepare_remove: Sending reset for target ID 15
Aug 26 11:20:20 freenas da10 at mps0 bus 0 scbus0 target 15 lun 0
Aug 26 11:20:20 freenas da10: <ATA WDC WD80EMAZ-00W 0A83> s/n 7HJX56YF             detached
Aug 26 11:20:20 freenas (da10:mps0:0:15:0): Periph destroyed
Aug 26 11:20:20 freenas mps0: Unfreezing devq for target ID 15
Aug 26 11:20:22 freenas     (da8:mps0:0:13:0): READ(10). CDB: 28 00 d8 32 e2 08 00 00 40 00 length 32768 SMID 880 terminated ioc 804b loginfo 31170000 scsi 0 state c xfer 0
Aug 26 11:20:22 freenas     (da8:mps0:0:13:0): READ(10). CDB: 28 00 d8 32 e1 c8 00 00 40 00 length 32768 SMID 756 terminated ioc 804b loginfo 31170000 scsi 0 state c xfer 0
Aug 26 11:20:22 freenas (da8:mps0:0:13:0): READ(10). CDB: 28 00 d8 32 e2 08 00 00 40 00
Aug 26 11:20:22 freenas (da8:mps0:0:13:0): CAM status: CCB request completed with an error
Aug 26 11:20:22 freenas (da8:mps0:0:13:0): Retrying command
Aug 26 11:20:22 freenas (da8:mps0:0:13:0): READ(10). CDB: 28 00 d8 32 e1 c8 00 00 40 00
Aug 26 11:20:22 freenas (da8:mps0:0:13:0): CAM status: CCB request completed with an error
Aug 26 11:20:22 freenas (da8:mps0:0:13:0): Retrying command
Aug 26 11:20:24 freenas     (da8:mps0:0:13:0): READ(10). CDB: 28 00 d8 32 e1 c8 00 00 40 00 length 32768 SMID 789 terminated ioc 804b loginfo 31170000 scsi 0 state 0 xfer 0
Aug 26 11:20:24 freenas     (da8:mps0:0:13:0): READ(10). CDB: 28 00 d8 32 e2 08 00 00 40 00 length 32768 SMID 329 terminated ioc 804b loginfo 31170000 sc(da8:mps0:0:13:0): READ(10). CDB: 28 00 d8 32 e1 c8 00 00 40 00
Aug 26 11:20:24 freenas si 0 state 0 xfer 0
Aug 26 11:20:24 freenas (da8:mps0:0:13:0): CAM status: CCB request completed with an error
Aug 26 11:20:24 freenas (da8:mps0:0:13:0): Retrying command
Aug 26 11:20:24 freenas (da8:mps0:0:13:0): READ(10). CDB: 28 00 d8 32 e2 08 00 00 40 00
Aug 26 11:20:24 freenas (da8:mps0:0:13:0): CAM status: CCB request completed with an error
Aug 26 11:20:24 freenas (da8:mps0:0:13:0): Retrying command
Aug 26 11:20:25 freenas mps0: mpssas_prepare_remove: Sending reset for target ID 13
Aug 26 11:20:25 freenas da8 at mps0 bus 0 scbus0 target 13 lun 0
Aug 26 11:20:25 freenas da8: <ATA WDC WD80EMAZ-00W 0A83> s/n 7SHAMBHC             detached
Aug 26 11:20:25 freenas (da8:mps0:0:13:0): READ(10). CDB: 28 00 d8 32 e2 08 00 00 40 00
Aug 26 11:20:25 freenas (da8:mps0:0:13:0): CAM status: CCB request aborted by the host
Aug 26 11:20:25 freenas (da8:mps0:0:13:0): Error 5, Periph was invalidated
Aug 26 11:20:25 freenas mps0: Unfreezing devq for target ID 13
Aug 26 11:20:25 freenas (da8:mps0:0:13:0): READ(10). CDB: 28 00 d8 32 e1 c8 00 00 40 00
Aug 26 11:20:25 freenas (da8:mps0:0:13:0): CAM status: CCB request aborted by the host
Aug 26 11:20:25 freenas (da8:mps0:0:13:0): Error 5, Periph was invalidated
Aug 26 11:20:25 freenas GEOM_MIRROR: Device swap0: provider da8p1 disconnected.
Aug 26 11:20:25 freenas ZFS: vdev state changed, pool_guid=5287701622901903645 vdev_guid=15456278399355232182
Aug 26 11:20:25 freenas (da8:mps0:0:13:0): Periph destroyed
Aug 26 11:20:25 freenas ZFS: vdev state changed, pool_guid=5287701622901903645 vdev_guid=15456278399355232182
Aug 26 11:20:33 freenas mps0: SAS Address for SATA device = d137314b9db98e6c
Aug 26 11:20:33 freenas mps0: SAS Address from SATA device = d137314b9db98e6c
Aug 26 11:20:34 freenas ses0: da8,pass8: Element descriptor: 'ArrayDevice09'
Aug 26 11:20:34 freenas ses0: da8,pass8: SAS Device Slot Element: 1 Phys at Slot 9
Aug 26 11:20:34 freenas ses0:  phy 0: SATA device
Aug 26 11:20:34 freenas ses0:  phy 0: parent 5001e677bba8ffff addr 5001e677bba8ffe9
Aug 26 11:20:34 freenas da8 at mps0 bus 0 scbus0 target 13 lun 0
Aug 26 11:20:34 freenas da8: <ATA WDC WD80EMAZ-00W 0A83> Fixed Direct Access SPC-4 SCSI device
Aug 26 11:20:34 freenas da8: Serial Number 7SHAMBHC           
Aug 26 11:20:34 freenas da8: 600.000MB/s transfers
Aug 26 11:20:34 freenas da8: Command Queueing enabled
Aug 26 11:20:34 freenas da8: 7630885MB (15628053168 512 byte sectors)
Aug 26 11:20:35 freenas ZFS: vdev state changed, pool_guid=5287701622901903645 vdev_guid=13252120772642032451
Aug 26 11:20:35 freenas ZFS: vdev state changed, pool_guid=5287701622901903645 vdev_guid=2852271544820749817
Aug 26 11:20:35 freenas ZFS: vdev state changed, pool_guid=5287701622901903645 vdev_guid=4682476836420478665
Aug 26 11:20:35 freenas ZFS: vdev state changed, pool_guid=5287701622901903645 vdev_guid=12030264751771965849
Aug 26 11:20:35 freenas ZFS: vdev state changed, pool_guid=5287701622901903645 vdev_guid=11816524228526246329
Aug 26 11:20:35 freenas ZFS: vdev state changed, pool_guid=5287701622901903645 vdev_guid=7767667901200424987
Aug 26 11:20:35 freenas ZFS: vdev state changed, pool_guid=5287701622901903645 vdev_guid=15456278399355232182
Aug 26 11:20:44 freenas mps0: mpssas_prepare_remove: Sending reset for target ID 13
Aug 26 11:20:44 freenas da8 at mps0 bus 0 scbus0 target 13 lun 0


Searching the logs for 7SHAMBHC and 7HJX56YF, I found them throughout these logs, though with many other drive serials in their place (some a part of the pool, others unused trying to replace it). I figured my whole batch of used and unused drives couldn't be bad, so I made a judgement call to only disconnect the bad drives that were a part of the pool (rather than letting the swap occur naturally). This ultimately solved the problem, and I was able to replace both missing drives and now have a completely healthy pool once again!

So in conclusion, if someone comes across this, I reseated all my power and data connections, and then physically removed the drives that the system was attempting to replace with automatic resilvers. Thanks again John Digital, I might not of figured this out in time if you hadn't pointed me down the right path!
 
Joined
Jan 7, 2015
Messages
1,150
Im always on the lookout for the disks that say "detached" and note the serial, these are generally your all the way dead ones. Once this and or CAM errors get going its replace time.

Also youll generally have several system alerts by the time you get to this point.
 
Top