Slow resilver (1,000+ days)

squirecounty

Cadet
Joined
Aug 25, 2021
Messages
5
We have a system with the following hardware:

Motherboard: SuperMicro X10DRL (DDR4)
Processor: 2x Intel Xeon E5-2620 v4
Network: Intel 10GbE X540-T2BLK
Memory: 128GB
HBA: LSI 9305-16i & LSI 9305-24i
OS Version: FreeNAS-11.3-U5
Boot drives: Mirrored SSDs (120GB)
Data drives: Seagate ST14000NM0048 (14TB @ 7,200rpm)

The system has been in place for a few years and we have had drives fail in the past. Now another drive has failed and has been replaced with a new drive, but the resilver estimate started at 29 days (already huge) and a day later it is over 1,400 days.

The disk looks good; no SMART errors are detected. The system has been rebooted.

I assume it is possible for this new drive to be faulty, but I'd have expected to have seen SMART errors or alerts for it.

Any thoughts?

Code:
zpool status VOL1
  pool: VOL1
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Aug 25 16:05:31 2021
        2.44T scanned at 39.2M/s, 34.4G issued at 554K/s, 66.6T total
        995M resilvered, 0.05% done, no estimated completion time
config:

        NAME                                              STATE     READ WRITE CKSUM
        VOL1                                              ONLINE       0     0     0
          raidz2-0                                        ONLINE       0     0     0
            gptid/6f9cb65a-9efd-11e9-943c-ac1f6b7b5820    ONLINE       0     0     0
            gptid/70c7e9c9-9efd-11e9-943c-ac1f6b7b5820    ONLINE       0     0     0
            gptid/721da7f8-9efd-11e9-943c-ac1f6b7b5820    ONLINE       0     0     0
            spare-3                                       ONLINE       0     0     0
              gptid/a016eabf-00f5-11ec-90e3-ac1f6b7b5820  ONLINE       0     0     0
              gptid/7bebff29-9efd-11e9-943c-ac1f6b7b5820  ONLINE       0     0     0
            gptid/749fb3e0-9efd-11e9-943c-ac1f6b7b5820    ONLINE       0     0     0
            gptid/7602f1a8-9efd-11e9-943c-ac1f6b7b5820    ONLINE       0     0     0
            gptid/776c5869-9efd-11e9-943c-ac1f6b7b5820    ONLINE       0     0     0
            gptid/78b34585-9efd-11e9-943c-ac1f6b7b5820    ONLINE       0     0     0
            gptid/7a27c511-9efd-11e9-943c-ac1f6b7b5820    ONLINE       0     0     0
        spares
          5247034995803090637                             INUSE     was /dev/gptid/7bebff29-9efd-11e9-943c-ac1f6b7b5820

errors: No known data errors
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Sometimes the slowdown is another disk. Check all the non-failed disks for excessive retries using "smartctl". If a disk is retrying a lot, it can cause a ZFS slowdown.
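
For example, a quick loop like this will dump the error counter log for every drive (this assumes the data disks all show up as /dev/daXX; the glob may also catch partition nodes, so adjust it for your system):

Code:
# Dump the SAS error counter log for every da device
for d in /dev/da*; do
  echo "=== $d ==="
  smartctl -a "$d" | grep -A6 "Error counter log"
done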

One thing ZFS likes to handle is its own error detection and correction. So when using disks with consumer settings for TLER (Time Limited Error Recovery), the recovery time can be too long. In general, ZFS would like a 7 second TLER. Consumer settings can be much higher (I don't recall exactly, but it could be 1 minute or more). Note that some vendors call TLER something else, but it performs a similar function.
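
On SATA drives you can usually query and set this with smartctl's scterc option (values are in tenths of a second, so 70 = 7 seconds); SAS drives like these Seagates expose the equivalent setting through mode pages instead. Using a hypothetical /dev/da0:

Code:
smartctl -l scterc /dev/da0          # query the current error recovery settings
smartctl -l scterc,70,70 /dev/da0    # set read and write recovery to 7 seconds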

With a low TLER on a RAID-Z2 pool, any block that fails to read during a disk recovery can be re-created from RAID-Z2's additional redundancy. So if ZFS gets a 7 second error recovery timeout, it will simply re-create the block and re-write it to the pool, even during a resilver.

Normally you would see problems during a pool scrub with a pool & disk issue like I have described above. The scrubs would take excessive time, and you would see read errors that were recoverable. I don't know why you did not see them.
 

blanchet

Guru
Joined
Apr 17, 2018
Messages
516
The only time I have encountered such an excessively long resilver time, I had a faulty drive (in my case SMART reported a low helium level), so I wonder why you do not have at least a SMART error.

If SMART reports nothing, you can check the health of your new disk with Seagate SeaTools.
I prefer using the DOS version that comes with Ultimate Boot CD, but there is also a version for Windows.

If SeaTools reports no error, it may be worth checking your SATA/SAS cables.
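
Another easy check that doesn't require taking the box down is a long SMART self-test (substitute the new disk's actual device node):

Code:
smartctl -t long /dev/daXX    # runs in the background; check results later with smartctl -a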
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
It could be one of the older disks being on the borderline of failing.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
While I agree that another disk being marginal is a likely cause, no one has asked about the other major controlling factor here -- what's being stored on this RAIDZ2?

Millions of small files or usage as block storage can result in excessive fragmentation which is also a classic cause of outsized resilver estimates.
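
If it helps, ZFS reports free-space fragmentation at the pool and vdev level, which is a rough proxy for this:

Code:
zpool list -v VOL1             # the FRAG column shows free-space fragmentation
zpool get fragmentation VOL1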
 

squirecounty

Cadet
Joined
Aug 25, 2021
Messages
5
While I agree that another disk being marginal is a likely cause, no one has asked about the other major controlling factor here -- what's being stored on this RAIDZ2?

Millions of small files or usage as block storage can result in excessive fragmentation which is also a classic cause of outsized resilver estimates.

They are flat files, very large in size - Veeam backup target.
 

squirecounty

Cadet
Joined
Aug 25, 2021
Messages
5
I've run smartctl this morning and that drive looks OK, unless I've completely missed something, which at the moment wouldn't surprise me:

Code:
smartctl -a /dev/da44
smartctl 7.0 2018-12-30 r4883 [FreeBSD 11.3-RELEASE-p14 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST14000NM0048
Revision:             E002
Compliance:           SPC-5
User Capacity:        14,000,519,643,136 bytes [14.0 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c500ad90b9bf
Serial number:        ZHZ2E7LG0000C9247RS2
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Tue Aug 31 09:16:35 2021 BST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned <not available>
Power on minutes since format <not available>
Current Drive Temperature:     29 C
Drive Trip Temperature:        60 C

Manufactured in week 10 of year 2019
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  106
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  101
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 38187016
  Blocks received from initiator = 3797425008
  Blocks read from cache and sent to initiator = 1086643
  Number of read and write commands whose size <= segment size = 1642198
  Number of read and write commands whose size > segment size = 347834

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 282.70
  number of minutes until next internal SMART test = 9

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0       10         0        10         10         17.396           0
write:         0        0         0         0          0       2078.800           0
verify:        0        0         0         0          0          2.156           0

Non-medium error count:        0


[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
No Self-tests have been logged
 

squirecounty

Cadet
Joined
Aug 25, 2021
Messages
5
I've started seeing this alert over the weekend:

Device /dev/gptid/a016eabf-00f5-11ec-90e3-ac1f6b7b5820 is causing slow I/O on pool VOL1

However according to this output that disk is a spare:

Code:
  pool: VOL1
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Aug 26 16:42:32 2021
        2.41T scanned at 11.7M/s, 176G issued at 860K/s, 66.7T total
        34.3G resilvered, 0.26% done, no estimated completion time
config:

        NAME                                              STATE     READ WRITE CKSUM
        VOL1                                              ONLINE       0     0     0
          raidz2-0                                        ONLINE       0     0     0
            gptid/6f9cb65a-9efd-11e9-943c-ac1f6b7b5820    ONLINE       0     0     0
            gptid/70c7e9c9-9efd-11e9-943c-ac1f6b7b5820    ONLINE       0     0     0
            gptid/721da7f8-9efd-11e9-943c-ac1f6b7b5820    ONLINE       0     0     0
            spare-3                                       ONLINE       0     0     0
              gptid/a016eabf-00f5-11ec-90e3-ac1f6b7b5820  ONLINE       0     0     0
              gptid/7bebff29-9efd-11e9-943c-ac1f6b7b5820  ONLINE       0     0     0
            gptid/749fb3e0-9efd-11e9-943c-ac1f6b7b5820    ONLINE       0     0     0
            gptid/7602f1a8-9efd-11e9-943c-ac1f6b7b5820    ONLINE       0     0     0
            gptid/776c5869-9efd-11e9-943c-ac1f6b7b5820    ONLINE       0     0     0
            gptid/78b34585-9efd-11e9-943c-ac1f6b7b5820    ONLINE       0     0     0
            gptid/7a27c511-9efd-11e9-943c-ac1f6b7b5820    ONLINE       0     0     0
        spares
          5247034995803090637                             INUSE     was /dev/gptid/7bebff29-9efd-11e9-943c-ac1f6b7b5820

errors: No known data errors


So should a poorly performing spare even be causing this issue?

Do I just detach the slow-performing disk (if that's even possible)?
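
I'm guessing at something like this (gptid taken from the alert above), but I haven't tried it yet:

Code:
glabel status | grep a016eabf    # map the gptid from the alert back to a da device
zpool detach VOL1 gptid/a016eabf-00f5-11ec-90e3-ac1f6b7b5820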

Any thoughts?

Thanks,
 

squirecounty

Cadet
Joined
Aug 25, 2021
Messages
5
Consider this resolved: I've swapped the disk and the resilver completed within 2 days.

Thanks for any input anyone gave to this.
 