Help to understand pool state after disk replacement

gbernardes

Cadet
Joined
Jan 5, 2022
Messages
6
Hi,

I have a pool using RAID-Z1 with 15 disks + 1 spare, on FreeNAS 11.1-U6.

I replaced a disk that had CRC errors, and after the replacement the resilver found errors on another disk, which I replaced as well. However, after that resilver finished the pool still shows errors, and now I have no idea what the problem is. Below is the output of 'zpool status':


Code:
    NAME                                                STATE     READ WRITE CKSUM
    Storage_BKP1                                        DEGRADED     0     0     2
      raidz1-0                                          DEGRADED     0     0     4
        gptid/34d1a834-b3f2-11e8-b7b0-1418774a5e7c      ONLINE       0     0     0
        gptid/354a3b61-b3f2-11e8-b7b0-1418774a5e7c      ONLINE       0     0     0
        gptid/35c5b4a9-b3f2-11e8-b7b0-1418774a5e7c      ONLINE       0     0     0
        gptid/ddd1184e-5e65-11ec-aa6a-1418774a5e7c      DEGRADED     0     0 5.66K  too many errors
        spare-4                                         DEGRADED     0     0     0
          replacing-0                                   UNAVAIL      0     0     0
            15713876987547349613                        UNAVAIL      0     0     0  was /dev/gptid/36fe33b8-b3f2-11e8-b7b0-1418774a5e7c
            gptid/5e864d34-68a3-11ec-885f-1418774a5e7c  ONLINE       0     0     0
          gptid/3e56a4f3-b3f2-11e8-b7b0-1418774a5e7c    ONLINE       0     0     0
        gptid/37be42bc-b3f2-11e8-b7b0-1418774a5e7c      ONLINE       0     0     0
        gptid/9ea27df4-a140-11eb-9b79-1418774a5e7c      ONLINE       0     0     0
        gptid/3927cffa-b3f2-11e8-b7b0-1418774a5e7c      ONLINE       0     0     0
        gptid/78b6ad6c-e14e-11e9-9615-1418774a5e7c      ONLINE       0     0     0
        gptid/3a720849-b3f2-11e8-b7b0-1418774a5e7c      ONLINE       0     0     0
        gptid/3b0e4da8-b3f2-11e8-b7b0-1418774a5e7c      ONLINE       0     0     0
        gptid/3bae0862-b3f2-11e8-b7b0-1418774a5e7c      ONLINE       0     0     0
        gptid/3c35db28-b3f2-11e8-b7b0-1418774a5e7c      ONLINE       0     0     0
        gptid/3cbb7063-b3f2-11e8-b7b0-1418774a5e7c      ONLINE       0     0     0
        gptid/3d49b011-b3f2-11e8-b7b0-1418774a5e7c      ONLINE       0     0     0
    logs
      gptid/3eab0ef3-b3f2-11e8-b7b0-1418774a5e7c        ONLINE       0     0     0
    spares
      6428965284576591425                               INUSE     was /dev/gptid/3e56a4f3-b3f2-11e8-b7b0-1418774a5e7c
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
If I am reading that output right, you replaced the same disk slot twice, and only one of the replacements was successful.

Next, the disk with 5.66K checksum errors on a RAID-Z1 pool means that one more error elsewhere and you might lose data. So you need to replace this disk too.
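For the record, a replacement done from the shell generally looks like the command below; both gptids are placeholders only, and on FreeNAS the usual route is the web UI's disk replace function, which creates and labels the new GPT partition for you:
Code:
# zpool replace Storage_BKP1 gptid/<old-partition-gptid> gptid/<new-partition-gptid>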

One of the things you should have set up is regular ZFS pool scrubs. Some people do them weekly, twice a month, or monthly. What this covers is finding previously unknown bad blocks. During a RAID-Z1 re-silver, ZFS needs to read all of the used data on the other disks to recover the failed disk. Thus, if you did not run scrubs and had unknown bad blocks, that is exactly what you would see: a second disk failing.
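For reference, a manual scrub can be started from the shell, and 'zpool status' then reports its progress on the "scan:" line (a sketch only; on FreeNAS, recurring scrubs are normally scheduled from the web UI rather than run by hand):
Code:
# zpool scrub Storage_BKP1
# zpool status Storage_BKP1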

But it does appear that the bad blocks on the disk with 5.66K checksum errors are in different places from what was failing on the other bad disk (except 2 reads...). With the 2 checksum errors at the pool level, you may have lost data. The full output of 'zpool status -v Storage_BKP1' will give better information.
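For reference, the extra section at the end of the -v output lists any files with unrecoverable blocks. It looks something like this, where the file path is purely illustrative and not from this pool:
Code:
# zpool status -v Storage_BKP1
...
errors: Permanent errors have been detected in the following files:

        /mnt/Storage_BKP1/path/to/affected-file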
 

gbernardes

Cadet
Joined
Jan 5, 2022
Messages
6
Hi @Arwen,

I removed the lost file indicated by 'zpool status -v' (it was not essential). Which disk should I replace so that the pool is healthy again?
 

gbernardes

Cadet
Joined
Jan 5, 2022
Messages
6
Hi,

If I understand correctly, I need to replace the same disk, in this case gptid/ddd1184e-5e65-11ec-aa6a-1418774a5e7c, which is da10 according to these commands:
Code:
# glabel status | grep ddd1184e-5e65-11ec-aa6a-1418774a5e7c
gptid/ddd1184e-5e65-11ec-aa6a-1418774a5e7c     N/A  da10p2
# smartctl -a /dev/da10 | grep ^Serial
Serial Number:    WSE0KRMC



Is that correct?
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Sorry, I can't answer that. I'm not used to seeing a hot spare in use with output like that.



One unrelated comment: in general, a single vdev of RAID-Zx (1/2/3) should not be 15 disks wide. This can cause certain things to be slow, including re-silvers. The general rule of thumb is somewhere between 8 and 12 disks, depending on how many disk slots are in the chassis.

Meaning, if you have exactly twelve 3.5" disk slots in a rack server, it may make sense to use all 12 in a single vdev, and then accept that some things are slower.

Next, a 15-disk RAID-Z1 is more prone to hitting a second error during a re-silver, as you have found. Depending on the size of the disks (basically anything greater than 1TB or 2TB), wide RAID-Z1 is really not recommended.
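Purely as an illustration of that rule of thumb, 15 bays laid out fresh would more commonly be split into two narrower vdevs, for example two 7-disk RAID-Z2 vdevs plus a spare (device names below are placeholders, and on FreeNAS the pool would normally be built from the web UI rather than with zpool create directly):
Code:
# zpool create tank \
    raidz2 da0 da1 da2 da3 da4 da5 da6 \
    raidz2 da7 da8 da9 da10 da11 da12 da13 \
    spare da14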
 