Resilver caused a reboot loop and pool fail

AlanC

Cadet
Joined
Apr 2, 2021
Messages
3
Hi
I am a first-time user of a forum, so I apologize in advance for any mistakes I may make.

I upgraded my old FreeNAS to TrueNAS in January, and it has been serving all my media files, films and pictures with only minor hiccups. However, last week, toward the end of watching a film late one night, it started what I can only call buffering and freezing. I thought nothing of it and went to bed. In the morning I checked whether all was well with TrueNAS and found it had gone into a constant reboot loop.

I shut down the server and disconnected all the drives, and TrueNAS came back up. I reconnected the drives one at a time, but could only connect 2 at any one time. I loaded a clean install of TrueNAS onto a spare PC I had and connected the 3x 4TB hard drives that were causing the boot loop. The spare PC with the clean install booted OK. I ran a manual S.M.A.R.T. test on all 3 drives, which came back OK with no faults.

I ran zpool import with -f -F; this just caused the system to reboot and the import failed. I then tried zpool import without -f -F, and this server went into a reboot loop as well. I have searched and read dozens of potential fixes, but with no success. I took a picture of the final screen before it reboots (see below).
2021-04-01 09.09.17.jpg
2021-04-02 15.00.29.jpg
The pool seems to be in perfect working order apart from showing a resilver error (see below).

zpool-capture.PNG


I downloaded and installed UFS Explorer, thinking that I would need to do data recovery. However, UFS Explorer can see the drives and the data, and reports that there are no problems with the drives or the pool.

Can anyone suggest a way to get the pool and data back without having to use the UFS recovery tool?
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
In general you should follow "Forum rules": https://www.truenas.com/community/threads/forum-rules.45124/

Many of these types of problems are caused by some type of hardware failure, so the type and status of the hardware is important.

My initial questions:

Which version of TrueNAS were you running initially on your NAS system?
Which version did you use on the spare pc?

After removing the old drives from the NAS, was that NAS OK, or were there hardware issues?
Did the previous NAS have ECC memory?
 

AlanC

Sorry

Original System
P8H61-MX-USB (ASUS)
Intel i5-2300
16GB RAM, non-ECC
3x 4TB Seagate Barracuda HDD, Media pool (now won't import)
3x 2TB Seagate Barracuda HDD, Data pool (will load if the Media pool drives are not connected)
100/1000 onboard network (no cards)

Second (test) System
H81M-Plus (ASUS)
Intel i3-4160
16GB RAM, non-ECC
10/1000 onboard network (no cards)

I have since moved the Data pool to the second test machine without any problem. However the Media pool will not transfer without causing a reboot loop on all other hardware.

TrueNAS-12.0-Release (on original)
TrueNAS 12.0-Release (on spare pc)

When the drives were removed from the original NAS it booted up ok, no errors.
 

morganL

You could try 12.0-U2.1 on the new PC.

Unfortunately, I suspect you might have had corrupt RAM (non-ECC), so it wrote corrupted data to ZFS. That makes the drives carry the corruption with them to the new machine. It's very hard to detect or fix that problem. We'll see if anyone else has better ideas.
 

oumpa31

Patron
Joined
Apr 7, 2015
Messages
253
I just updated to 12.0-U3.1 and mine just did the same thing to about 18 TB of movies, TV shows and family pictures. I was just about to get externals to back everything up to.
 

morganL

oumpa31 said:
"I just updated to 12.0-U3.1 and mine just did the same thing to about 18 TB of movies, TV shows and family pictures. I was just about to get externals to back everything up to."
Can you document your system hardware and the errors? Are they identical?
Was there any resilvering going on?
If it's family photos, it's worth doing what you can to see if there is a solution.
 

AlanC

Just to update on what I ended up doing.

Having checked all my hardware for faults and failures and finding none, I ended up downloading UFS Raid Recovery (free) to test whether the data was still on the HDDs and was recoverable. This will not recover your data, but will tell you if it is recoverable.

After completing a scan of the drives in my RAID (I had 3x 4TB drives), it reported that all was OK. I tried again many times to import them into TrueNAS-12.0-Release (on the original machine), but was unable to each time. I ended up biting the bullet and buying the software to run the data recovery to a brand new set of 4TB HDDs. This took quite a while but was successful, and the drives are running without problems (touch wood). I should say that the recovery software found that 2 files had been corrupted but still managed to recover them. I am assuming that this corruption was caused by the resilver process. Since I found no problems with my old drives, other than the resilver corruption, I am using them as backup drives (just in case).
After recovering my files to my original version of TrueNAS (TrueNAS-12.0-Release, on the original machine), I upgraded once, to TrueNAS-12.0-U2.1. I have not upgraded since, as all seems well so far.

This was not my preferred solution, but it was the only one that I felt I had left, having tried all the other possible solutions that I could find on the forums.

I would suggest that you try testing all your hardware, drives and memory, as well as trying the zpool command options, as was suggested to me, before going down the route that I took.
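
For anyone following that suggestion, here is a minimal sketch of a drive health check with smartctl. The device names (/dev/ada0 and so on) and the function names are hypothetical, and the grep pattern assumes ATA drives, whose smartctl -H output contains "test result: PASSED"; SAS drives report health differently. Memory needs a separate bootable tester such as MemTest86+ and is not covered here.

```shell
#!/bin/sh
# Sketch only: device paths and function names are illustrative assumptions.

# Return 0 when smartctl -H reports the ATA overall-health result as PASSED.
disk_healthy() {
    smartctl -H "$1" | grep -q 'test result: PASSED'
}

# Print a one-line verdict for every device given as an argument.
report_disks() {
    for dev in "$@"; do
        if disk_healthy "$dev"; then
            echo "$dev: PASSED"
        else
            echo "$dev: CHECK"
        fi
    done
}

# Example (run as root on the NAS):
#   smartctl -t long /dev/ada0     # start a long self-test first
#   report_disks /dev/ada0 /dev/ada1 /dev/ada2
```

A PASSED result only says the drive's own diagnostics found nothing. As this thread shows, a pool can still refuse to import on drives that look healthy.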
 

oumpa31

morganL said:
"Can you document your system hardware and the errors? Are they identical?
Was there any resilvering going on?
If it's family photos, it's worth doing what you can to see if there is a solution."
The code I get before the system restarts looks identical. I've been running this machine for years and never had a single issue.

I'll have to see what happened. My only problem is that all my disks are in disk shelves.
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
AlanC said:
"I am assuming that this corruption was caused by the re-silver process."
That would be an incorrect assumption. A resilver (or scrub) finds corruption; it does not cause it. If this data is important to you, you should purchase appropriate hardware to run your server on.
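
To make the distinction concrete: a scrub reads every block and verifies it against its checksum, and zpool status then reports what was found. A rough sketch, where the pool name tank and the function names are placeholders rather than anything from this thread:

```shell
#!/bin/sh
# Sketch only: "tank" and the function names are placeholders.

# Ask ZFS to read and verify every block in the pool.
start_scrub() {
    zpool scrub "$1"
}

# Return 0 when zpool status reports no known data errors for the pool.
pool_is_clean() {
    zpool status -v "$1" | grep -q 'errors: No known data errors'
}

# Example:
#   start_scrub tank
#   pool_is_clean tank || zpool status -v tank   # -v lists affected files
```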
 

oumpa31

Gigabyte X399 Aorus Pro
AMD Ryzen Threadripper 2920X 12-Core
128GB non-ECC RAM ("ya ya I know I should have ECC, just hard to figure out which type")
12x 14TB shucked WD drives
12x 4TB drives (the good ones)
12x 2TB SSDs
LSI 9201-16e
24-bay 2.5" shelf
24-bay 3.5" shelf
2-port 10Gb NIC

With zpool import I get this:
1620478616575.png

So it doesn't look like it was resilvering or anything.

I used zpool import -R /mnt -f -o readonly=on 2948359633811826928 to see if I could get it to mount, and it does. I can SSH into the box and see all my data; the only thing that doesn't show up is the zvol on that pool for my cameras. I guess I figured out what caused the problem.
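
A rough sketch of that read-only import followed by copying the files elsewhere. The import flags are the ones quoted above; the source and destination paths and the function names are hypothetical, and rsync is only one possible way to do the copy.

```shell
#!/bin/sh
# Sketch only: paths and function names are hypothetical.

# Import by pool ID under /mnt without allowing any writes to the pool.
import_readonly() {
    zpool import -R /mnt -f -o readonly=on "$1"
}

# Copy a directory tree off the read-only pool, preserving attributes.
copy_off() {
    rsync -a "$1/" "$2/"
}

# Example (pool and dataset names are made up):
#   import_readonly 2948359633811826928
#   copy_off /mnt/tank/media /mnt/backup/media
```

Because the pool is imported with readonly=on, nothing in this sequence writes to the damaged pool; all writes go to the destination.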

Is there a way to delete that zvol so the pool can be imported? Otherwise I'll be going to get drives.
 

morganL

I don't know why the camera zvol would cause problems. I would be interested to see if anyone else experiences this problem with ECC. That would tend to indicate there is a bug rather than an unfortunate sequence of events.

In the meantime, glad to hear the data is salvageable....
 

oumpa31

morganL said:
"I don't know why the camera zvol would cause problems. I would be interested to see if anyone else experiences this problem with ECC. That would tend to indicate there is a bug rather than an unfortunate sequence of events.

In the meantime, glad to hear the data is salvageable...."

The only thing I can think of is that a bit flipped while it was recording footage from my security cameras, because that zvol was connected to a Windows VM. I'm looking into ECC RAM for this machine; thank god AMD has the capability across the platform. I'm also working on a custom front for my server case (Rosewill RSV-L4412) so I can water-cool this CPU a bit better.
 

oumpa31

After more investigation I think I found the root of my problem. I had 3 2TB SSDs as log drives attached to my massive main pool (140TB), and one of them had unrecoverable errors. I didn't notice this until I had destroyed the whole pool and was putting it back together. I think this is the drive that wrote the incorrect information to the main part of the pool. I guess the only good part of this is that it forced me to get drives that I can back up my data to.
 

morganL

Possible. Generally, the SLOG drives are only read after a power failure or reboot. I would not expect one to corrupt the pool unless you were very unlucky.

Did you have them arranged as a mirror or a stripe? If it's a mirror, then a checksum would protect you. If it's a stripe, then the ZIL can be corrupted and unrecoverable.
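
One way to answer that is to look at the logs section of zpool status: mirrored log devices appear under a mirror-N entry, while striped ones are listed directly. A small sketch under that assumption; the pool name tank and the function name are placeholders, and the parsing is deliberately simplistic.

```shell
#!/bin/sh
# Sketch only: "tank" and the function name are placeholders.

# Return 0 when the "logs" section of zpool status contains a mirror vdev.
slog_is_mirrored() {
    zpool status "$1" | awk '
        $1 == "logs" { in_logs = 1; next }
        $1 == "cache" || $1 == "spares" || $1 == "errors:" { in_logs = 0 }
        in_logs && $1 ~ /^mirror-/ { found = 1 }
        END { exit !found }
    '
}

# Example:
#   if slog_is_mirrored tank; then echo "SLOG is mirrored"; else echo "SLOG is not mirrored"; fi
```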
 

oumpa31

My main pool combines a 12x 4TB RAIDZ3 vdev and a 12x 14TB RAIDZ3 vdev.
 