Resilver very slow - no apparent progress

rivimey

Dabbler
Joined
Dec 12, 2023
Messages
20
I had an existing group-of-mirrors array with 2 vdevs, each of 2x8T drives, which I recently moved into my new TrueNAS box. While I did have some issues, the array moved over OK and I can see the filesystem. I then added a pair of new 8T drives as a third vdev, but during the process one of the old drives got inadvertently disconnected. The resilver of that mirror is what I'm asking about.

Running watch zpool status shows the pool alternating between:

scan: resilvered 1.93M in 00:00:01 with 0 errors on Mon Dec 18 20:55:23 2023

and:

status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Dec 18 20:55:28 2023
        337M / 11.3T scanned, 1.88M / 11.3T issued at 1.88M/s
        1.93M resilvered, 0.00% done, no estimated completion time
        scan warning: skipping blocks that are only referenced by the checkpoint.

and the date (here Mon Dec 18 20:55...) is constantly updated to almost-now.

I am currently interpreting this as: the resilver starts and then more-or-less immediately stops for unexplained reasons. Does that make sense? Following the process with zpool iostat I can see a burst of activity (e.g. a few kilobytes or a megabyte), then nothing, repeated every 5 seconds or so.

I have run a smartctl long self-test on all the disks in the system and all report themselves healthy with no errors. I also ran a scrub on the pool before this happened, and again everything was reported fine.
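For reference, the checks were roughly along these lines (the device name here is a placeholder, not one of my actual drives):

# long SMART self-test on each data disk, then review the result once it finishes
$ smartctl -t long /dev/sdX
$ smartctl -a /dev/sdX

# scrub the pool and check for errors afterwards
$ zpool scrub part1
$ zpool status -v part1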

I can see no messages (at all) in the kernel log or elsewhere, and am a bit stumped. Any ideas?
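The places I know to check are roughly these - happy to be pointed at anything else:

# kernel messages and recent kernel log entries
$ dmesg | tail -n 50
$ journalctl -k --since "1 hour ago"

# ZFS's own event stream, which should include resilver start/finish events
$ zpool events -v | tail -n 40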

Would it be a good idea
 

rivimey

Dabbler
Joined
Dec 12, 2023
Messages
20
I thought a complete 'status' would be a good idea:

zpool status part1
  pool: part1
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Dec 18 21:03:31 2023
        337M / 11.3T scanned, 0B / 11.3T issued
        0B resilvered, 0.00% done, no estimated completion time
        scan warning: skipping blocks that are only referenced by the checkpoint.
checkpoint: created Sat Dec  9 14:27:25 2023, consumes 1.84G
config:

        NAME                                      STATE     READ WRITE CKSUM
        part1                                     ONLINE       0     0     0
          mirror-0                                ONLINE       0     0     0
            ff8a6aae-6254-7549-8264-128e0997e9ad  ONLINE       0     0     0
            751ead49-43d8-0d43-94e7-c21a985cc1fd  ONLINE       0     0     0
          mirror-1                                ONLINE       0     0     0
            e1d14753-a637-1e49-abe4-4031f184067b  ONLINE       0     0     0
            42de08dd-55d1-624d-907c-6d563a36f633  ONLINE       0     0     0
          mirror-2                                ONLINE       0     0     0
            ea541032-c99f-49be-8331-bc3ac0ae4061  ONLINE       0     0     0
            530538a0-ffb8-40ff-ad20-298b50b8f1c9  ONLINE       0     0     0
        cache
          nvme0n1p1                               ONLINE       0     0     0
        spares
          sdj1                                    FAULTED   corrupted data

errors: No known data errors


... I don't know why my spare (a new 8T drive) is faulted or how to find out.
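My best guess at how to dig into the spare - treating the sdj device name as an assumption on my part - is something like:

# SMART health, attributes and error log for the spare
$ smartctl -a /dev/sdj

# any ZFS events mentioning faults
$ zpool events -v | grep -i fault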

The iostat behaviour:

$ zpool iostat part1 1
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
part1       11.3T  10.5T     39    171   396K  3.63M
part1       11.3T  10.5T    233    584  2.09M  9.97M
part1       11.3T  10.5T      0    185      0  3.87M
part1       11.3T  10.5T      0      0      0      0
part1       11.3T  10.5T      0      0      0      0
part1       11.3T  10.5T      0      0      0      0
part1       11.3T  10.5T      0      0      0      0
part1       11.3T  10.5T    228    743  2.09M  17.7M
part1       11.3T  10.5T      0    247      0  5.34M
part1       11.3T  10.5T      0     23      0  95.3K
part1       11.3T  10.5T      0      0      0      0
part1       11.3T  10.5T      0      0      0      0
part1       11.3T  10.5T      0      0      0      0
part1       11.3T  10.5T      0    399      0  12.6M
part1       11.3T  10.5T    225    649  2.09M  10.9M
part1       11.3T  10.5T      0      0      0      0
part1       11.3T  10.5T      0      0      0      0
part1       11.3T  10.5T      0      0      0      0
part1       11.3T  10.5T      0      0      0      0
part1       11.3T  10.5T      0     77      0  2.67M
part1       11.3T  10.5T      0    319      0  8.31M
part1       11.3T  10.5T      0      0      0      0
part1       11.3T  10.5T    161    160  1.69M  3.92M
part1       11.3T  10.5T     64    520   407K  8.51M
part1       11.3T  10.5T      0      0      0      0
part1       11.3T  10.5T      0      0      0      0
part1       11.3T  10.5T      0      0      0      0
part1       11.3T  10.5T      0      0      0      0
part1       11.3T  10.5T      0    317      0  10.8M
part1       11.3T  10.5T    233    610  2.03M  11.3M
part1       11.3T  10.5T      0     98      0  1.71M
part1       11.3T  10.5T      0      0      0      0
part1       11.3T  10.5T      0      0      0      0
part1       11.3T  10.5T      0      0      0      0
part1       11.3T  10.5T      0      0      0      0
part1       11.3T  10.5T    228    802  2.05M  18.4M
 

rivimey

Dabbler
Joined
Dec 12, 2023
Messages
20
Found more evidence: 'zed' does seem to be persistently restarting the resilver:

Dec 18 22:00:40 erin kernel: e1000e 0000:00:19.0 eno1: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Dec 18 22:00:43 erin zed[396357]: eid=97252 class=resilver_start pool='part1'
Dec 18 22:00:49 erin zed[396369]: eid=97256 class=resilver_start pool='part1'
Dec 18 22:00:55 erin zed[396388]: eid=97260 class=resilver_start pool='part1'
Dec 18 22:01:02 erin zed[396402]: eid=97264 class=resilver_start pool='part1'
Dec 18 22:01:08 erin zed[396415]: eid=97268 class=resilver_start pool='part1'
Dec 18 22:01:15 erin zed[396427]: eid=97272 class=resilver_start pool='part1'
Dec 18 22:01:21 erin zed[396439]: eid=97276 class=resilver_start pool='part1'
Dec 18 22:01:28 erin zed[396451]: eid=97280 class=resilver_start pool='part1'
Dec 18 22:01:35 erin zed[396463]: eid=97284 class=resilver_start pool='part1'
Dec 18 22:01:41 erin zed[396475]: eid=97288 class=resilver_start pool='part1'
Dec 18 22:01:48 erin zed[396487]: eid=97292 class=resilver_start pool='part1'
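For anyone wanting to watch this live, something like the following should work (the zed service name may differ by system):

# follow the ZFS event daemon's log as events arrive
$ journalctl -fu zfs-zed

# or watch the raw ZFS event stream directly
$ zpool events -f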
 

Pitfrr

Wizard
Joined
Feb 10, 2014
Messages
1,531
Hello,

You don't list the hardware you're using, so it's difficult to guess, but could this be due to SMR drives?
I don't have first-hand experience of the problem, but an unending (or very slow) resilver is one of the symptoms.
Well, that is where I would start before checking anything else.
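I don't know of a foolproof command for this, but as a first pass you could check the drive models and the kernel's zoned flag (drive-managed SMR usually still reports "none", so looking up the model number is the more reliable route):

# list drive models to look up CMR/SMR status
$ lsblk -d -o NAME,MODEL,SIZE

# "host-managed" or "host-aware" indicates SMR; drive-managed SMR typically shows "none"
$ cat /sys/block/sdX/queue/zoned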
 

rivimey

Dabbler
Joined
Dec 12, 2023
Messages
20
The drives are Toshiba N300 8TB NAS drives; the older ones are HDWN180 and the new ones HDWG480. I have a couple of other (WD, Seagate) NAS drives connected, but they're not in pool part1.

I can't find a detailed datasheet, but similar drives are CMR, not SMR. I did find this:

https://smarthdd.com/database/TOSHIBA-HDWG480/0601/

I found one retailer asserting it's a CMR drive and one reviewer asserting it's part of the MG08-D series, which are enterprise drives.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
I suggest removing the spare, testing it, and then maybe re-adding it - assuming you can with a resilver in progress. The thing is that the zpool status doesn't seem to indicate that the resilver is doing anything; the only fault is with the spare. So remove the spare, test it offline and, on principle, reboot the NAS.
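Untested, but from the command line the rough sequence would be something like this (sdj and the partuuid path are placeholders taken from your status output - adjust as needed):

# drop the faulted hot spare from the pool
$ zpool remove part1 sdj1

# long SMART self-test on the spare, then check the result
$ smartctl -t long /dev/sdj
$ smartctl -a /dev/sdj

# if it looks healthy, add it back as a spare
$ zpool add part1 spare /dev/disk/by-partuuid/<spare-partuuid>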
 

rivimey

Dabbler
Joined
Dec 12, 2023
Messages
20
Using the UI, I tried replacing the spare, but it said:

Error: [EZFS_ISSPARE] cannot replace 12426937301463792562 with /dev/disk/by-partuuid/82b125dc-5417-483d-9656-f38ba5177767: device is reserved as a hot spare

If I try to remove it in the UI I get:

Error: 2077 is not a valid Error
 

Attachments

  • zfs.jpg (141.5 KB)

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Try removing it from the command line?
Of course you could perhaps physically remove it and then logically remove it - maybe, possibly after a reboot. I do, however, suggest a backup of the data first - cos I am paranoid.

I'm not entirely convinced by the PCIe cards you have installed. I know that most are utter crap - no idea about yours - but it doesn't look good to me. You also do not have enough memory to even begin to use a 1TB cache. It's not helping, and probably hindering.
 

rivimey

Dabbler
Joined
Dec 12, 2023
Messages
20
I'm not entirely convinced by the PCIe cards you have installed. I know that most are utter crap - no idea about yours - but it doesn't look good to me. You also do not have enough memory to even begin to use a 1TB cache. It's not helping, and probably hindering.

No argument there, hence getting the 9400 card..! :smile:

I had the SSD available so added it... where would I RTFM on the proper setup of a cache drive? Could I have set up 1/2 of the SSD as cache and 1/2 as something else?
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Err, nope. A vdev uses the entire disk; splitting it means hacking things around, which is highly not recommended or supported. It's the sort of thing where, if you need to ask how, you don't need to know. It's something I have done before, but have now stopped doing.

L2ARC = cache. This doesn't come into play until you have probably 64GB of RAM. It's also very unlikely to have much effect for a single- (or few-) user system. Also, 1TB is too high, as an L2ARC should be at most 5x ARC (ARC = RAM; on SCALE, ARC = 50% of RAM).
I would suggest you remove it (it's safe to remove) - you can then use the NVMe as an app pool (preferably with a mirror, if you can).
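Removing a cache vdev is an online, safe operation - roughly this (device name taken from your zpool status above):

# detach the L2ARC device; it only holds a read cache, no pool data is lost
$ zpool remove part1 nvme0n1p1

As a worked example of the 5x rule: on SCALE with, say, 32GB of RAM you'd have roughly 16GB of ARC, so the sensible L2ARC ceiling would be around 80GB - nowhere near 1TB.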

I would make more detailed suggestions, but you haven't posted your full hardware - so I can't. Actually, I apologise, you have, but it would help if you expanded your sig to include the rest of the hardware, as it's easier to find there.

""As a general rule of thumb, an L2ARC should not be added to a system with less than 64 GB of RAM and the size of an L2ARC should not exceed 5x the amount of RAM. In some cases, it may be more efficient to have two separate pools: one on SSDs for active data and another on hard drives for rarely used content."
 