z2pool gone crazy after stupid tinkering: no data-related issues, but inaccessible GUI options, an eternal resilver loop and wrong disk identifiers

ElGusto

Explorer
Joined
Mar 19, 2015
Messages
72
Hello! I have a z2pool running through an IBM M1015 flashed to LSI 9211.
The hard disks are four 8 TB Seagate Archive SMR drives, which had run flawlessly until recently.

This seems to be a pure configuration problem, because the disks are now running fine in a new case with better cooling.
The trigger was obviously the heat wave we had here, which caused the hard drives to overheat and time out.
I had to offline/online the disks and use the replace feature several times to get the pool working again.
In the meantime I have moved all the hardware to a better-ventilated case, and the overheating issues are gone. There seem to be no permanent errors or hardware issues left.

But now there are some strange issues (probably related to my actions when trying to reintegrate the disks that had stalled during the heat wave):
- When I boot the system, it automatically starts resilvering every time:
root@truenas[~]# zpool status z2pool
  pool: z2pool
 state: ONLINE
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Aug 12 19:11:15 2022
        917G scanned at 1.04G/s, 131G issued at 152M/s, 23.3T total
        13.0G resilvered, 0.55% done, 1 days 20:24:32 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        z2pool                                          ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            da1p2                                       ONLINE       0     0     0  (resilvering)
            gptid/10046937-6b06-11eb-9b87-000c29077bb3  ONLINE       0     0     0
            gptid/aa7c274d-0bc3-11ed-bd90-000c29077bb3  ONLINE       0     0     0  (resilvering)
            gptid/9ce5247f-331b-3148-9b38-c958d2bd057a  ONLINE       0     0     0
As you can see, for some reason the first hard drive is shown by its device name instead of its gptid, and there are no errors reported. When I click on "Edit" for the disk da1p2, nothing happens: the Edit window does not open, while it works fine for the other drives.

[Screenshot: Bildschirmfoto_2022-08-12_21-58-39.jpg]

(da0 is not part of the pool; it is the boot drive.)
When the resilver has run through, with no errors, it just starts over instantly.
I tried some commands which I found in similar topics, like:
glabel status
                                      Name  Status  Components
gptid/4457da9a-9c9a-11eb-b2ad-000c29077bb3     N/A  da0p1
gptid/aa7c274d-0bc3-11ed-bd90-000c29077bb3     N/A  da2p2
gptid/10046937-6b06-11eb-9b87-000c29077bb3     N/A  da3p2
gptid/9ce5247f-331b-3148-9b38-c958d2bd057a     N/A  da4p1
gptid/e632cfa5-0957-6344-b0ae-c7e0ecefabbc     N/A  da4p9

Strangely enough, there's no da1 there, but da4 is listed twice. Also, the partition numbers seem to be all over the place.
Can you please help me get this mess sorted?
I think it's crucial for getting rid of the eternal resilver loop and for making the WebUI "Edit" option work for the first disk again. Also, I can't scrub because of the running resilver. And having this issue cleared up is probably much better for future handling in case of data integrity errors.
Thank you!
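
The two listings in this post can be cross-checked mechanically. What follows is a minimal Python sketch, not part of the original post, that maps each pool member from `zpool status` to its partition using `glabel status`. The names `parse_glabel` and `map_members` are hypothetical, and the sample text is copied from the output above. One possible reading, which is an assumption and not confirmed anywhere in this thread, is that da1p2 was attached by its raw device name, and that FreeBSD stops listing a gptid label while the underlying partition is in use under another name, which would explain why da1 is absent from `glabel status`.

```python
# Rows copied from the `glabel status` output in the post (header omitted).
GLABEL_STATUS = """\
gptid/4457da9a-9c9a-11eb-b2ad-000c29077bb3 N/A da0p1
gptid/aa7c274d-0bc3-11ed-bd90-000c29077bb3 N/A da2p2
gptid/10046937-6b06-11eb-9b87-000c29077bb3 N/A da3p2
gptid/9ce5247f-331b-3148-9b38-c958d2bd057a N/A da4p1
gptid/e632cfa5-0957-6344-b0ae-c7e0ecefabbc N/A da4p9
"""

# Member names copied from the raidz2-0 vdev in the `zpool status` output.
POOL_MEMBERS = [
    "da1p2",
    "gptid/10046937-6b06-11eb-9b87-000c29077bb3",
    "gptid/aa7c274d-0bc3-11ed-bd90-000c29077bb3",
    "gptid/9ce5247f-331b-3148-9b38-c958d2bd057a",
]


def parse_glabel(text):
    """Return {label: partition} from the rows of `glabel status` output."""
    labels = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have three columns: Name, Status, Components.
        if len(fields) == 3 and fields[0] != "Name":
            labels[fields[0]] = fields[2]
    return labels


def map_members(members, labels):
    """Map each pool member to (partition, referenced_by_label)."""
    result = {}
    for member in members:
        if member in labels:
            result[member] = (labels[member], True)
        else:
            # Not a known label: assume it is a raw device node name.
            result[member] = (member, False)
    return result


labels = parse_glabel(GLABEL_STATUS)
mapping = map_members(POOL_MEMBERS, labels)
for member, (partition, by_label) in mapping.items():
    how = "gptid label" if by_label else "raw device name"
    print(f"{partition:6} <- {member} ({how})")
```

With the data above, the pool members resolve to da1p2, da3p2, da2p2 and da4p1, so the second da4 entry (da4p9) is a label on a partition that is not a pool member, and only da1p2 is referenced by raw device name.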
 

homer27081990

Patron
Joined
Aug 9, 2022
Messages
321
Maybe your HBA took some damage from the overheating? Could a cable be half-loose? I would first re-seat everything (including RAM sticks) and then try another HBA.
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
ElGusto said: "I think it's crucial for getting rid of the eternal resilver loop"
Isn't this likely due to the SMR drives, irrespective of any other aspect? Do you have a fan on the heatsink of the IBM 1015? They run hot normally.
 

ElGusto

Hello! I already reseated everything when I moved all the hardware to another case. And there's no checksum error or anything else reported. So I don't think the issue is related to transfer or hard disk errors; I am rather sure it stems from me fiddling around, wildly removing and adding disks to get the pool running again after they had gone into failure mode due to overheating.
 

ElGusto

Redcoat said: "Isn't this likely due to the SMR drives, irrespective of any other aspect? Do you have a fan on the heatsink of the IBM 1015? They run hot normally."
Improbable, because it had been working perfectly fine for years, and this chaos only started after I fiddled around with the drives to reintegrate them after they stalled from overheating. There are no data integrity errors at all, and all the drives seem to run well. So I think it's my fault.
I assume the solution would roughly consist of removing all the drives from the pool, maybe modifying/fixing something on the drives, and re-adding them, but I have no clue how to do this the right way.
And, I mean: this strange issue with da4 having two gptids and da1 having none really looks like some administrative/configuration problem.
 

homer27081990

Just a hunch, but sometimes those things help. Can you (maybe) try taking the array offline, then bringing it back up after one reboot with no drives attached and another with all of them? Maybe someone more experienced than me with arrays in TrueNAS can say whether it is a good idea?
 

ElGusto

Redcoat said: "Isn't this likely due to the SMR drives, irrespective of any other aspect? Do you have a fan on the heatsink of the IBM 1015? They run hot normally."
My HBA doesn't have additional cooling, but there's no data or data transfer problem at all, so this doesn't seem to be caused by the HBA. And now it's in a new, big case with a lot of cooling. Everything points to this being a logical/configuration problem resulting from my stupid tinkering :( .
 

ElGusto

homer27081990 said: "Just a hunch, but sometimes those things help. Can you (maybe) try taking the array offline, then bringing it back up after one reboot with no drives attached and another with all of them? Maybe someone more experienced than me with arrays in TrueNAS can say whether it is a good idea?"
Taking the whole array offline - do you mean exporting? I already did this before, and it didn't change the chaos, sadly.
 

homer27081990

Yes, but followed by the reboots. That might let it re-import the array without any residual old config.
 

Redcoat

And what about the SMR drives? These are typically considered a no-no for TrueNAS because of the issues associated with long resilver times. This is from https://www.truenas.com/community/resources/list-of-known-smr-drives.141/ :

"SMR has worse sustained write performance than CMR, which can cause severe issues during resilver or other write-intensive operations, up to and including failure of that resilver. It is often desirable to choose a CMR drive instead. This thread attempts to pull together known SMR drives, and the sources for that information."
 

ElGusto

Thanks! It didn't help, though :( After reimporting it's the same as before:
1. Naming is weird, as shown in the screenshot
2. Cannot edit disk options for da1
3. Pool is shown as unhealthy in the UI, but it's perfectly fine according to the output of zpool status, with no errors
4. An unneeded resilver started as soon as the pool was reimported
 

ElGusto

I have some more info, there seem to be weird things going on with da4, too:
root@truenas[~]# gpart show /dev/da1
=>         40  15628053088  da1  GPT  (7.3T)
           40           88       - free -  (44K)
          128      4194304    1  freebsd-swap  (2.0G)
      4194432  15623858696    2  freebsd-zfs  (7.3T)

root@truenas[~]# gpart show /dev/da2
=>         40  15628053088  da2  GPT  (7.3T)
           40           88       - free -  (44K)
          128      4194304    1  freebsd-swap  (2.0G)
      4194432  15623858696    2  freebsd-zfs  (7.3T)

root@truenas[~]# gpart show /dev/da3
=>         40  15628053088  da3  GPT  (7.3T)
           40           88       - free -  (44K)
          128      4194304    1  freebsd-swap  (2.0G)
      4194432  15623858696    2  freebsd-zfs  (7.3T)

root@truenas[~]# gpart show /dev/da4
=>         34  15628053101  da4  GPT  (7.3T)
           34         2014       - free -  (1.0M)
         2048  15628034048    1  !6a898cc3-1dd2-11b2-99a6-080020736631  (7.3T)
  15628036096        16384    9  !6a945a3b-1dd2-11b2-99a6-080020736631  (8.0M)
  15628052480          655       - free -  (328K)
Or is this normal?
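
The difference between da4 and the other three disks can be picked out with a short script. The Python sketch below is not from the original post: `parse_gpart_show`, `odd_disks` and `EXPECTED_LAYOUT` are hypothetical names, and the expected layout (a swap partition followed by a freebsd-zfs partition) is simply what da1 to da3 show above. As far as I know, the type GUID `6a898cc3-1dd2-11b2-99a6-080020736631` is the Solaris /usr (Apple ZFS) type and `6a945a3b-1dd2-11b2-99a6-080020736631` is the Solaris reserved type, which is the layout ZFS tends to write when it is handed a whole disk. That would fit da4 having been replaced from the command line instead of through the GUI, but this is an assumption.

```python
# Sample copied from the gpart output above (da2 and da3 match da1).
GPART_SHOW = """\
=> 40 15628053088 da1 GPT (7.3T)
40 88 - free - (44K)
128 4194304 1 freebsd-swap (2.0G)
4194432 15623858696 2 freebsd-zfs (7.3T)

=> 34 15628053101 da4 GPT (7.3T)
34 2014 - free - (1.0M)
2048 15628034048 1 !6a898cc3-1dd2-11b2-99a6-080020736631 (7.3T)
15628036096 16384 9 !6a945a3b-1dd2-11b2-99a6-080020736631 (8.0M)
15628052480 655 - free - (328K)
"""

# Layout seen on da1, da2 and da3: (partition index, partition type).
EXPECTED_LAYOUT = [(1, "freebsd-swap"), (2, "freebsd-zfs")]


def parse_gpart_show(text):
    """Return {disk: [(index, type), ...]} from `gpart show` output."""
    disks = {}
    current = None
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        if fields[0] == "=>":
            # Header row: "=> start size name scheme (size)".
            current = fields[3]
            disks[current] = []
        elif current is not None and fields[2].isdigit():
            # Partition row: "start size index type (size)".
            # Free-space rows have "-" in the index column and are skipped.
            disks[current].append((int(fields[2]), fields[3]))
    return disks


def odd_disks(disks, expected=EXPECTED_LAYOUT):
    """List disks whose partition layout differs from the expected one."""
    return sorted(name for name, parts in disks.items() if parts != expected)


disks = parse_gpart_show(GPART_SHOW)
print(odd_disks(disks))
```

For the sample above, only da4 is reported as different.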
 

homer27081990

ElGusto said:
"Thanks! It didn't help, though :( After reimporting it's the same as before:
1. Naming is weird, as shown in the screenshot
2. Cannot edit disk options for da1
3. Pool is shown as unhealthy in the UI, but it's perfectly fine according to the output of zpool status, with no errors
4. An unneeded resilver started as soon as the pool was reimported"
Hmmm... Do you have any other drives lying around? If only just to rule out your controller?
 

homer27081990

ElGusto said: "I have some more info, there seem to be weird things going on with da4, too: [...] Or is this normal?"
Wait. Everything has a block size of 128 except the last entry that has 2K? Can someone knowledgeable shed some light on that? Also, what is the function of each of the drives?
 

ElGusto

They are all equal members of my z2pool. No special function. I once had jails installed but removed them again, and then there's the forced 2 GB swap partition that FreeNAS created on each drive automatically. Or what do you mean by 'what's the function'?
 

homer27081990

ElGusto said: "They are all equal members of my z2pool. No special function. I once had jails installed but removed them again, and then there's the forced 2 GB swap partition that FreeNAS created on each drive automatically. Or what do you mean by 'what's the function'?"
No, that is exactly what I meant. Could this be the first time the drives had to resilver?
 