Resilver very slow - no apparent progress

rivimey

Dabbler
Joined
Dec 12, 2023
Messages
20
I had an existing group-of-mirrors array with 2 vdevs, each of 2x8T drives, which I recently moved into my new TrueNAS box. While I did have some issues, the array moved over OK and I can see the filesystem. I then added a pair of new 8T drives as a third vdev, but during the process one of the old drives got inadvertently disconnected. The resilver of that mirror is what I'm asking about.

Running watch zpool status shows the pool alternating between:

scan: resilvered 1.93M in 00:00:01 with 0 errors on Mon Dec 18 20:55:23 2023

and:

status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Dec 18 20:55:28 2023
        337M / 11.3T scanned, 1.88M / 11.3T issued at 1.88M/s
        1.93M resilvered, 0.00% done, no estimated completion time
        scan warning: skipping blocks that are only referenced by the checkpoint.

and the date (here Mon Dec 18 20:55...) is constantly updated to almost-now.

I am currently interpreting this as: the resilver starts and then more-or-less immediately stops for unexplained reasons. Does that make sense? Following the process with zpool iostat I can see a burst of activity (e.g. a few kilobytes or a megabyte), then nothing, repeated every 5 seconds or so.

I have run a smartctl long self-test on all the disks in the system and all report themselves healthy with no errors. I also ran a scrub on the pool before this happened, and again everything was reported fine.
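For reference, the checks were roughly along these lines (the device name here is a placeholder, not one of my actual drives):

# long SMART self-test on each data disk, then review the result once it finishes
$ smartctl -t long /dev/sdX
$ smartctl -a /dev/sdX

# scrub the pool and check for errors afterwards
$ zpool scrub part1
$ zpool status -v part1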

I can see no messages (at all) in the kernel log or elsewhere, and am a bit stumped. Any ideas?
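The places I know to check are roughly these - happy to be pointed at anything else:

# kernel messages and recent kernel log entries
$ dmesg | tail -n 50
$ journalctl -k --since "1 hour ago"

# ZFS's own event stream, which should include resilver start/finish events
$ zpool events -v | tail -n 40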

Would it be a good idea
 

rivimey

Dabbler
Joined
Dec 12, 2023
Messages
20
I thought a complete 'status' would be a good idea:

zpool status part1
  pool: part1
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Dec 18 21:03:31 2023
        337M / 11.3T scanned, 0B / 11.3T issued
        0B resilvered, 0.00% done, no estimated completion time
        scan warning: skipping blocks that are only referenced by the checkpoint.
checkpoint: created Sat Dec  9 14:27:25 2023, consumes 1.84G
config:

        NAME                                      STATE     READ WRITE CKSUM
        part1                                     ONLINE       0     0     0
          mirror-0                                ONLINE       0     0     0
            ff8a6aae-6254-7549-8264-128e0997e9ad  ONLINE       0     0     0
            751ead49-43d8-0d43-94e7-c21a985cc1fd  ONLINE       0     0     0
          mirror-1                                ONLINE       0     0     0
            e1d14753-a637-1e49-abe4-4031f184067b  ONLINE       0     0     0
            42de08dd-55d1-624d-907c-6d563a36f633  ONLINE       0     0     0
          mirror-2                                ONLINE       0     0     0
            ea541032-c99f-49be-8331-bc3ac0ae4061  ONLINE       0     0     0
            530538a0-ffb8-40ff-ad20-298b50b8f1c9  ONLINE       0     0     0
        cache
          nvme0n1p1                               ONLINE       0     0     0
        spares
          sdj1                                    FAULTED   corrupted data

errors: No known data errors


... I don't know why my spare (a new 8T drive) is faulted or how to find out.
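My best guess at how to dig into the spare - treating the sdj device name as an assumption on my part - is something like:

# SMART health, attributes and error log for the spare
$ smartctl -a /dev/sdj

# any ZFS events mentioning faults
$ zpool events -v | grep -i fault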

The iostat behaviour:

$ zpool iostat part1 1
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
part1       11.3T  10.5T     39    171   396K  3.63M
part1       11.3T  10.5T    233    584  2.09M  9.97M
part1       11.3T  10.5T      0    185      0  3.87M
part1       11.3T  10.5T      0      0      0      0
part1       11.3T  10.5T      0      0      0      0
part1       11.3T  10.5T      0      0      0      0
part1       11.3T  10.5T      0      0      0      0
part1       11.3T  10.5T    228    743  2.09M  17.7M
part1       11.3T  10.5T      0    247      0  5.34M
part1       11.3T  10.5T      0     23      0  95.3K
part1       11.3T  10.5T      0      0      0      0
part1       11.3T  10.5T      0      0      0      0
part1       11.3T  10.5T      0      0      0      0
part1       11.3T  10.5T      0    399      0  12.6M
part1       11.3T  10.5T    225    649  2.09M  10.9M
part1       11.3T  10.5T      0      0      0      0
part1       11.3T  10.5T      0      0      0      0
part1       11.3T  10.5T      0      0      0      0
part1       11.3T  10.5T      0      0      0      0
part1       11.3T  10.5T      0     77      0  2.67M
part1       11.3T  10.5T      0    319      0  8.31M
part1       11.3T  10.5T      0      0      0      0
part1       11.3T  10.5T    161    160  1.69M  3.92M
part1       11.3T  10.5T     64    520   407K  8.51M
part1       11.3T  10.5T      0      0      0      0
part1       11.3T  10.5T      0      0      0      0
part1       11.3T  10.5T      0      0      0      0
part1       11.3T  10.5T      0      0      0      0
part1       11.3T  10.5T      0    317      0  10.8M
part1       11.3T  10.5T    233    610  2.03M  11.3M
part1       11.3T  10.5T      0     98      0  1.71M
part1       11.3T  10.5T      0      0      0      0
part1       11.3T  10.5T      0      0      0      0
part1       11.3T  10.5T      0      0      0      0
part1       11.3T  10.5T      0      0      0      0
part1       11.3T  10.5T    228    802  2.05M  18.4M
 

rivimey

Dabbler
Joined
Dec 12, 2023
Messages
20
Found more evidence: 'zed' does seem to be persistently restarting the resilver:

Dec 18 22:00:40 erin kernel: e1000e 0000:00:19.0 eno1: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Dec 18 22:00:43 erin zed[396357]: eid=97252 class=resilver_start pool='part1'
Dec 18 22:00:49 erin zed[396369]: eid=97256 class=resilver_start pool='part1'
Dec 18 22:00:55 erin zed[396388]: eid=97260 class=resilver_start pool='part1'
Dec 18 22:01:02 erin zed[396402]: eid=97264 class=resilver_start pool='part1'
Dec 18 22:01:08 erin zed[396415]: eid=97268 class=resilver_start pool='part1'
Dec 18 22:01:15 erin zed[396427]: eid=97272 class=resilver_start pool='part1'
Dec 18 22:01:21 erin zed[396439]: eid=97276 class=resilver_start pool='part1'
Dec 18 22:01:28 erin zed[396451]: eid=97280 class=resilver_start pool='part1'
Dec 18 22:01:35 erin zed[396463]: eid=97284 class=resilver_start pool='part1'
Dec 18 22:01:41 erin zed[396475]: eid=97288 class=resilver_start pool='part1'
Dec 18 22:01:48 erin zed[396487]: eid=97292 class=resilver_start pool='part1'
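For anyone wanting to watch this live, something like the following should work (the zed service name may differ by system):

# follow the ZFS event daemon's log as events arrive
$ journalctl -fu zfs-zed

# or watch the raw ZFS event stream directly
$ zpool events -f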
 

Pitfrr

Wizard
Joined
Feb 10, 2014
Messages
1,531
Hello,

You don't list the hardware you're using, so it's difficult to guess, but could this be due to SMR drives?
I don't have first-hand experience of the problem, but an unending (or very slow) resilver is one of the symptoms.
Well, that is where I would start before checking anything else.
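I don't know of a foolproof command for this, but as a first pass you could check the drive models and the kernel's zoned flag (drive-managed SMR usually still reports "none", so looking up the model number is the more reliable route):

# list drive models to look up CMR/SMR status
$ lsblk -d -o NAME,MODEL,SIZE

# "host-managed" or "host-aware" indicates SMR; drive-managed SMR typically shows "none"
$ cat /sys/block/sdX/queue/zoned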
 

rivimey

Dabbler
Joined
Dec 12, 2023
Messages
20
The drives are Toshiba N300 8TB NAS drives; the older ones are HDWN180 and the new ones HDWG480. I have a couple of other (WD, Seagate) NAS drives connected, but they're not in pool part1.

I can't find a detailed datasheet, but similar drives are CMR, not SMR. I did find this:

https://smarthdd.com/database/TOSHIBA-HDWG480/0601/

I found one retailer asserting it's a CMR drive and one reviewer asserting it's part of the MG08-D series, which are enterprise drives.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
I suggest removing the spare, testing it, and then maybe re-adding it - assuming you can with a resilver in progress. The thing is that the zpool status doesn't seem to indicate that the resilver is doing anything; the only fault is with the spare. So remove the spare, test it offline and, on principle, reboot the NAS.
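Untested, but from the command line the rough sequence would be something like this (sdj and the partuuid path are placeholders taken from your status output - adjust as needed):

# drop the faulted hot spare from the pool
$ zpool remove part1 sdj1

# long SMART self-test on the spare, then check the result
$ smartctl -t long /dev/sdj
$ smartctl -a /dev/sdj

# if it looks healthy, add it back as a spare
$ zpool add part1 spare /dev/disk/by-partuuid/<spare-partuuid>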
 

rivimey

Dabbler
Joined
Dec 12, 2023
Messages
20
Using the UI, I tried replacing the spare, but it said:

Error: [EZFS_ISSPARE] cannot replace 12426937301463792562 with /dev/disk/by-partuuid/82b125dc-5417-483d-9656-f38ba5177767: device is reserved as a hot spare

If I try to remove it in the UI I get:

Error: 2077 is not a valid Error
 

Attachments

  • zfs.jpg (141.5 KB)

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Try removing it from the command line?
Of course you could perhaps physically remove it and then logically remove it - maybe, possibly after a reboot. I do, however, suggest a backup of the data first - cos I am paranoid.

I'm not entirely convinced by the PCIe cards you have installed. I know that most are utter crap - no idea about yours - but it doesn't look good to me. You also do not have enough memory to even begin to use a 1TB cache. It's not helping, and probably hindering.
 

rivimey

Dabbler
Joined
Dec 12, 2023
Messages
20
I'm not entirely convinced by the PCIe cards you have installed. I know that most are utter crap - no idea about yours - but it doesn't look good to me. You also do not have enough memory to even begin to use a 1TB cache. It's not helping, and probably hindering.

No argument there, hence getting the 9400 card..! :smile:

I had the SSD available so added it... where would I RTFM on the proper setup of a cache drive? Could I have set up 1/2 of the SSD as cache and 1/2 as something else?
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Err, nope. A vdev uses the entire disk; splitting it means hacking things around, which is highly not recommended or supported. It's the sort of thing where, if you need to ask how, you don't need to know. It's something I have done before, but have now stopped doing.

L2ARC = cache. This doesn't come into play until you have probably 64GB of RAM. It's also very unlikely to have much effect for a single- (or few-) user system. Also, 1TB is too high, as an L2ARC should be at most 5x ARC (ARC = RAM; on SCALE, ARC = 50% of RAM).
I would suggest you remove it (it's safe to remove) - you can then use the NVMe as an app pool (preferably with a mirror, if you can).
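Removing a cache vdev is an online, safe operation - roughly this (device name taken from your zpool status above):

# detach the L2ARC device; it only holds a read cache, no pool data is lost
$ zpool remove part1 nvme0n1p1

As a worked example of the 5x rule: on SCALE with, say, 32GB of RAM you'd have roughly 16GB of ARC, so the sensible L2ARC ceiling would be around 80GB - nowhere near 1TB.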

I would make more detailed suggestions, but you haven't posted your full hardware - so I can't. Actually, I apologise, you have, but it would help if you expanded your sig to include the rest of the hardware, as it's easier to find there.

""As a general rule of thumb, an L2ARC should not be added to a system with less than 64 GB of RAM and the size of an L2ARC should not exceed 5x the amount of RAM. In some cases, it may be more efficient to have two separate pools: one on SSDs for active data and another on hard drives for rarely used content."
 