6 of 11 drives resilvering and then crashing with reboot

DCswitch

Explorer
Joined
Dec 20, 2013
Messages
58
I posted this under hardware, but no one is responding to help me. Here is the link:
https://www.ixsystems.com/community...ilvering-and-then-crashing-with-reboot.76143/

I have purchased new hardware w/ 11 new drives and cloned them with CloneZilla. I have not turned on the FreeNAS with the new hardware because I'm waiting for some help before I do so.
My new (used) hardware:
Supermicro X9SRL-F Motherboard, Socket LGA2011 (updated firmware)
Intel Xeon E5-2603 v2 CPU
128 GB Samsung ECC DDR3 memory
11 WD 8TB NAS HDDs (shucked from external enclosures)
LSI SAS 9207-8i firmware 20 IT mode
IBM SAS Expander 46M0997 firmware 364a

Having the drives cloned will give me an unlimited amount of tries to get this right. I could really use your help on how to solve this.
 

Attachments

  • six drives resilvering.jpeg

adrianwi

Guru
Joined
Oct 15, 2013
Messages
1,231
It's not really clear what you are trying to do. What drives/data have you cloned with CloneZilla? What drives are you trying to use with FreeNAS? Would you be better off configuring FreeNAS with the drive structure you require, and then transferring the data over?
 

DCswitch

Explorer
Joined
Dec 20, 2013
Messages
58
@adrianwi Because the FreeNAS kept crashing (and because I realized I had no backup), I stopped booting into FreeNAS and cloned all 11 drives onto 11 brand new drives. I did that so that when I install them in the new hardware (new motherboard, ECC memory, etc.) and IF it keeps crashing, it won't be crashing the original drives. That gives me a much better chance of solving this problem without losing all my data.

Ultimately, what I'm trying to do is figure out why all these drives are resilvering. I used this FreeNAS for years with no problems. It wasn't until I added the LSI controller and IBM expander that it started doing all this resilvering. Now I've purchased all new (used) hardware, including cables. I hope it's as simple as a bad cable or a bad motherboard, but I'm asking for help so I have a plan in place when I boot it up. That includes gathering info, diagnostics, etc.

It ran for about 5 to 10 minutes before crashing, so that doesn't give me a lot of time to gather info. 5 of the 11 drives were not resilvering, so at a minimum I'd like to figure out how to match each GPTID to the HDD's serial number, so I can determine which drives are resilvering and which aren't.
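One way to match each gptid label to a drive serial number is to join the output of glabel status with smartctl -i. The following is only a sketch: it assumes the usual three-column glabel status layout (Name, Status, Components), SATA drives that report a "Serial Number:" line, and smartmontools being installed. The function names gptid_to_disk and print_serials are made up for illustration.

```shell
#!/bin/sh
# Read `glabel status` output on stdin and print "gptid-label disk" pairs.
# Assumes the usual three columns: Name, Status, Components.
gptid_to_disk() {
    awk '$1 ~ /^gptid\// {
        disk = $3
        sub(/p[0-9]+$/, "", disk)   # strip partition suffix: da4p2 -> da4
        print $1, disk
    }'
}

# Print "gptid-label disk serial" for every labelled disk (run as root).
print_serials() {
    glabel status | gptid_to_disk | while read -r label disk; do
        serial=$(smartctl -i "/dev/$disk" | awk -F': *' '/^Serial Number/ {print $2}')
        echo "$label $disk $serial"
    done
}
```

Running print_serials on the live system and saving the result gives a table that can be compared against the gptid names shown by zpool status.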
 

toadman

Guru
Joined
Jun 4, 2013
Messages
619
I've had issues with multiple drives acting flaky and resilvering that were caused by a power issue (usually a power supply dropping voltage; in my case, a bad SATA power extender). In your case it would seem to be HBA and/or expander related, given that was the change in the system.

Interesting that you have the pool with the ashift configured to 512 but the drives are 4096. Did you grow this pool up from older drives that were originally 512?
 

DCswitch

Explorer
Joined
Dec 20, 2013
Messages
58
@toadman Yes, I grew it. The original drives were 4 TB. These are all 8 TB now. Since this is for minor home use, I didn't think it was a big deal. I actually appreciated the extra space. Is it a big deal? The one thing that bothers me is that before I added the LSI HBA and IBM expander cards, I remember this happening when I moved drives to a different SATA port on the Asus motherboard. I have read over and over that the order shouldn't matter, but ages ago I put two of the drives into the wrong SATA ports and both drives ended up resilvering. Since this pool was grown from an older setup, is it possible that the order of the drives used to matter and somehow still does in my case? Had I remembered that detail before I put in the HBA/expander cards, I would have carefully noted which drive went with which SATA port so I could revert. Unfortunately, it's too late now.
 

toadman

Guru
Joined
Jun 4, 2013
Messages
619
re: 512, it depends on your typical workload, but you can see some throughput issues if the pool is not aligned to 4k. It's not clear whether it would be noticeable to you. If the sectors are misaligned, yes, you'll see a substantial performance difference. Do you know if you are misaligned to a 4k boundary?

The SATA port ordering should not matter (to my knowledge), as ZFS uses GUIDs on each disk to identify the disk, vdev, pool, etc.
 

DCswitch

Explorer
Joined
Dec 20, 2013
Messages
58
@toadman I don't know if I'm misaligned to a 4k boundary. Is there a way to tell?
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Are the drives that are in the system now the cloned drives or the original drives? The block size error is strange. Also, when lots of drives like this seem to mess up at the same time, it's usually an HBA, cable, backplane or power supply problem. One thing you can do is get the SMART data for each drive and see if they are actually going bad. If they are not going bad, look at replacing each part in the system.
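Checking SMART data on 11 drives by hand is tedious, so a small filter can help. This sketch assumes the classic ATA attribute table printed by smartctl -A, where the raw value is the 10th column, and it only looks at four attributes that commonly point to a failing disk or a bad cable. The function names smart_warnings and check_all_disks are hypothetical.

```shell
#!/bin/sh
# Read `smartctl -A` output on stdin and print selected attributes,
# but only when their raw value is non-zero.
# Assumes the classic ATA attribute table where RAW_VALUE is the 10th column.
smart_warnings() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count)$/ && $10 + 0 > 0 {
        print $2, $10
    }'
}

# Check every da* disk in turn (run as root; adjust the glob for ada* devices).
check_all_disks() {
    for dev in /dev/da[0-9] /dev/da[0-9][0-9]; do
        [ -e "$dev" ] || continue
        echo "== $dev"
        smartctl -A "$dev" | smart_warnings
    done
}
```

A non-zero UDMA_CRC_Error_Count usually points at the link (cable, backplane, expander) rather than the disk itself, which fits the suspicion about the HBA and cabling.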
 

DCswitch

Explorer
Joined
Dec 20, 2013
Messages
58
@SweetAndLow It took three days, but I just finished cloning all the drives. I'm going to install the cloned drives first and see how it goes. I never received a SMART error/alert on the original drives, so I doubt the drives are going bad. I also bought an SSD to use as the boot drive (I previously used a USB thumb drive). I have already bought and installed all new hardware, and within the next hour or two I'll be turning it on and seeing what happens. I have a question: since I'm going to do a fresh OS install, which OS version would you recommend (which do you think is the most stable)? Do you think I should install 11.2 U4.1, 11.1 U7, or something else?
 

toadman

Guru
Joined
Jun 4, 2013
Messages
619
re: 4k alignment, run gpart show and you will see the start sector of your freebsd-zfs partitions. They should be divisible by 8. I think FreeNAS puts its first partition on a disk at sector 128 (normally swap) and the second one at 2048 (data, but it could be a different number depending on the required size of swap). Mine are all at 128, as I don't have a swap partition on my disks. But I wouldn't worry too much about that right now. You have a potentially bigger issue to deal with in the resilvering.
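The divisible-by-8 check can be automated. This is a minimal sketch that assumes 512-byte logical sectors (so 8 sectors make one 4 KiB block) and the standard gpart show layout; check_zfs_alignment is a made-up name.

```shell
#!/bin/sh
# Read `gpart show` output on stdin and report whether each freebsd-zfs
# partition starts on a 4 KiB boundary (start sector divisible by 8,
# assuming 512-byte logical sectors).
check_zfs_alignment() {
    awk '
        $1 == "=>" { disk = $4 }        # header line: => start size disk scheme
        $4 == "freebsd-zfs" {
            status = ($1 % 8 == 0) ? "aligned" : "MISALIGNED"
            print disk "p" $3, "start=" $1, status
        }
    '
}
```

Usage would be gpart show | check_zfs_alignment. Note that this only checks partition alignment; it says nothing about the pool's ashift value, which is a separate setting.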

SweetAndLow is correct on the usual reasons for multiple disks resilvering at once.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
What version did you run before? Run that version if you can, just to eliminate any ZFS changes. Otherwise you can use 11.2-U4.1, since it's the most recent stable version.
 

DCswitch

Explorer
Joined
Dec 20, 2013
Messages
58
I had upgraded to 11.4 U2 or U3 and it was running stably until I started getting crashes when doing transfers from a macOS computer. I reached out to the forum and someone said they were aware of the issue and that it would be addressed. So I went back to 11.1 U5 and had no more issues with macOS. 11.2 U4.1 may have fixed that macOS transfer issue by now, which is why I was asking. I was thinking I should just go to the newest stable version. I don't need to use macOS to save my data; I can back up with a Windows PC (assuming I make it past this resilvering problem). I'm getting close to firing this up with all the new hardware. Once it's up and running, I'm going to use these commands for more info:
glabel status
to identify drives and use

smartctl -a /dev/ada5 | grep "Serial Number"
to figure which drives are resilvering and which are not

gpart show
4k alignment info

Are there any other commands or suggestions to run once it's up and running? I remember someone mentioning creating a debug file. Is that under System/Advanced? Thank you for all the help and suggestions.
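With only 5 to 10 minutes before a possible crash, it may help to capture everything to files in one go instead of typing commands by hand. This is a rough sketch, not a tested FreeNAS tool: the names run_and_log and collect_diag and the default directory /root/diag are assumptions, and the output directory should be somewhere off the pool that is resilvering.

```shell
#!/bin/sh
# Run a command and save its output (stdout and stderr) to OUTDIR/NAME.txt,
# then flush to disk so the file survives a sudden crash or reboot.
run_and_log() {
    outdir=$1; name=$2; shift 2
    mkdir -p "$outdir" || return 1
    "$@" > "$outdir/$name.txt" 2>&1
    rc=$?
    sync
    return "$rc"
}

# Collect the diagnostics discussed in this thread (run as root).
# /root/diag is a hypothetical default; pass another directory if preferred.
collect_diag() {
    diagdir=${1:-/root/diag}
    run_and_log "$diagdir" zpool_status zpool status -v
    run_and_log "$diagdir" glabel_status glabel status
    run_and_log "$diagdir" gpart_show gpart show
    run_and_log "$diagdir" dmesg dmesg
}
```

Calling collect_diag right after boot leaves one text file per command, which can be read at leisure or posted here even if the box reboots a few minutes later.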
 

DCswitch

Explorer
Joined
Dec 20, 2013
Messages
58
@Chris Moore @toadman @SweetAndLow @adrianwi I installed the new hardware with the cloned drives, and for the first time ever it didn't automatically import the pool. I was a bit spooked, but when I did the manual import, it worked. It automatically started the resilver and so far so good. I can finally breathe. This time it's only resilvering one drive (not six) and so far there are far fewer cksum errors. Before, it was crashing before the resilver reached 2%; it's currently around 7.52%. It's also good to know that cloning with Clonezilla works. I wasn't even sure it would.

Code:
root@freenas[~]# zpool status -v
  pool: Jan2017
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu May  2 18:36:50 2019
        10.9T scanned at 2.97G/s, 3.09T issued at 859M/s, 41.0T total
        831M resilvered, 7.52% done, 0 days 12:51:50 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        Jan2017                                         DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/8b2ce50f-0b2c-11e9-9686-002655d6e06d  ONLINE       0     0     1  block size: 512B configured, 4096B native
            da4p2                                       ONLINE       0     0     0  block size: 512B configured, 4096B native
            gptid/e9e4e0b2-0b87-11e9-93e2-002655d6e06d  ONLINE       0     0     0  block size: 512B configured, 4096B native
            14960535744802862746                        UNAVAIL      0     0     0  was /dev/da4p2
            da3p2                                       ONLINE       0     0     6  block size: 512B configured, 4096B native
            gptid/90a06eba-099d-11e9-ad19-002655d6e06d  ONLINE       0     0     0  block size: 512B configured, 4096B native
            gptid/080e58bc-09eb-11e9-9fb4-002655d6e06d  ONLINE       0     0     0  block size: 512B configured, 4096B native
            gptid/17138ffd-0a29-11e9-bb96-002655d6e06d  ONLINE       0     0     0  block size: 512B configured, 4096B native
            gptid/2615b617-0a58-11e9-845a-002655d6e06d  ONLINE       0     0     0  block size: 512B configured, 4096B native
            gptid/9f3ce1f9-0a8c-11e9-a370-002655d6e06d  ONLINE       0     0     0  block size: 512B configured, 4096B native
            gptid/610c26f9-0ac3-11e9-b671-002655d6e06d  ONLINE       0     0     0  block size: 512B configured, 4096B native

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          da0p2       ONLINE       0     0     0

errors: No known data errors
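
To keep an eye on which members are collecting errors during a long resilver, the status output can be filtered down to the rows that matter. A small sketch, assuming the standard NAME/STATE/READ/WRITE/CKSUM column order; zpool_problem_devices is a made-up name.

```shell
#!/bin/sh
# Read `zpool status` output on stdin and print every row of the config
# table whose state is not ONLINE or whose READ, WRITE or CKSUM counter
# is not "0". Abbreviated counters (for example 1.2K) also count as non-zero.
zpool_problem_devices() {
    awk '$2 ~ /^(ONLINE|DEGRADED|FAULTED|UNAVAIL|OFFLINE|REMOVED)$/ && NF >= 5 {
        if ($2 != "ONLINE" || $3 != "0" || $4 != "0" || $5 != "0")
            print $1, $2, "read=" $3, "write=" $4, "cksum=" $5
    }'
}
```

Usage would be zpool status -v | zpool_problem_devices, repeated every so often to see whether the checksum counts on particular members keep climbing.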


This is the gpart show output. Please let me know if my ashift 512 (non-4k) is okay:

Code:
root@freenas[~]# gpart show
=>      40  78165280  da0  GPT  (37G)
        40    532480    1  efi  (260M)
    532520  77627392    2  freebsd-zfs  (37G)
  78159912      5408       - free -  (2.6M)

=>         40  15628053088  da1  GPT  (7.3T)
           40           88       - free -  (44K)
          128      4194304    1  freebsd-swap  (2.0G)
      4194432  15623858688    2  freebsd-zfs  (7.3T)
  15628053120            8       - free -  (4.0K)

=>         40  15628053088  da2  GPT  (7.3T)
           40           88       - free -  (44K)
          128      4194304    1  freebsd-swap  (2.0G)
      4194432  15623858688    2  freebsd-zfs  (7.3T)
  15628053120            8       - free -  (4.0K)

=>         40  15628053088  da3  GPT  (7.3T)
           40           88       - free -  (44K)
          128      4194304    1  freebsd-swap  (2.0G)
      4194432  15623858696    2  freebsd-zfs  (7.3T)

=>         40  15628053088  da4  GPT  (7.3T)
           40           88       - free -  (44K)
          128      4194304    1  freebsd-swap  (2.0G)
      4194432  15623858688    2  freebsd-zfs  (7.3T)
  15628053120            8       - free -  (4.0K)

=>         40  15628053088  da5  GPT  (7.3T)
           40           88       - free -  (44K)
          128      4194304    1  freebsd-swap  (2.0G)
      4194432  15623858688    2  freebsd-zfs  (7.3T)
  15628053120            8       - free -  (4.0K)

=>         40  15628053088  da6  GPT  (7.3T)
           40           88       - free -  (44K)
          128      4194304    1  freebsd-swap  (2.0G)
      4194432  15623858688    2  freebsd-zfs  (7.3T)
  15628053120            8       - free -  (4.0K)

=>         40  15628053088  da7  GPT  (7.3T)
           40           88       - free -  (44K)
          128      4194304    1  freebsd-swap  (2.0G)
      4194432  15623858688    2  freebsd-zfs  (7.3T)
  15628053120            8       - free -  (4.0K)

=>         40  15628053088  da8  GPT  (7.3T)
           40           88       - free -  (44K)
          128      4194304    1  freebsd-swap  (2.0G)
      4194432  15623858688    2  freebsd-zfs  (7.3T)
  15628053120            8       - free -  (4.0K)

=>         40  15628053088  da9  GPT  (7.3T)
           40           88       - free -  (44K)
          128      4194304    1  freebsd-swap  (2.0G)
      4194432  15623858688    2  freebsd-zfs  (7.3T)
  15628053120            8       - free -  (4.0K)

=>         40  15628053088  da10  GPT  (7.3T)
           40           88        - free -  (44K)
          128      4194304     1  freebsd-swap  (2.0G)
      4194432  15623858688     2  freebsd-zfs  (7.3T)
  15628053120            8        - free -  (4.0K)

=>         40  15628053088  da11  GPT  (7.3T)
           40           88        - free -  (44K)
          128      4194304     1  freebsd-swap  (2.0G)
      4194432  15623858688     2  freebsd-zfs  (7.3T)
  15628053120            8        - free -  (4.0K)

root@freenas[~]#


After this scare I decided to build two FreeNAS systems, keeping the second one at a different location and offline (safe from ransomware, fire, or theft). I already have it built in my living room. I'm also making a third backup of the really important things.
 

toadman

Guru
Joined
Jun 4, 2013
Messages
619
Very odd that da3 and da4 are not listed with a gptid.

And disk 14960535744802862746 was da4p2, but now da4p2 is in the pool as another disk? That seems very weird to me.
 

DCswitch

Explorer
Joined
Dec 20, 2013
Messages
58
@toadman Not sure why either. After this is backed up, I think it would be best to blow out the whole thing and start from scratch.
 

DCswitch

Explorer
Joined
Dec 20, 2013
Messages
58
That resilver went through without error. I've replaced da4, so that's resilvering now. After this is finished and backed up, I need to decide how to do the next setup. I know that when I go to 4K instead of 512 B I'm going to lose disk space, but will compression work more efficiently and offset some of that loss? This vdev is 11 drives in RAIDZ2, and I'm thinking about going to a vdev of 13 drives in RAIDZ3. Is this okay (I know 13 is pushing it)? Keep in mind I'll be creating a second FreeNAS to back this one up to, and this is for minor home use.
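
For a first comparison of the two layouts, the data capacity of a single vdev is roughly the number of drives minus the parity drives, times the drive size. The sketch below is back-of-the-envelope arithmetic only: it ignores ZFS metadata, allocation padding (which depends on ashift and record size) and the TB versus TiB difference, and raidz_data_tb is a made-up helper name.

```shell
#!/bin/sh
# Rough data capacity of a single RAIDZ vdev: (drives - parity) * drive size.
# Ignores ZFS metadata, padding and the TB-vs-TiB difference, so real usable
# space will be lower.
raidz_data_tb() {
    drives=$1; parity=$2; size_tb=$3
    echo $(( (drives - parity) * size_tb ))
}

raidz_data_tb 11 2 8   # current layout: 11 x 8 TB, two parity drives
raidz_data_tb 13 3 8   # proposed layout: 13 x 8 TB, three parity drives
```

By this estimate the current layout gives about 72 TB of data capacity and the proposed one about 80 TB, so the two extra drives buy one more drive of data space plus a third level of parity.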
 