Panic + Bootloop

Decess

Dabbler
Joined
Feb 14, 2023
Messages
16
Hey, everyone. Need some help. I'm a layman in this matter, so please "treat me like a nine-year-old".

General Info:

I'm running

M5A78L-M Motherboard
AMD FX-8320E CPU
8GB of RAM, not sure what model and specs
5x WD Red NAS 6TB WD60EFAX
1x SSD PNY cs900 120GB SSD7CS900-120-RB

SSD and HDDs are all "new" (bought last November).

Running TrueNAS 13.0

My pool is configured to have one drive as a redundancy feature. I mean instead of roughly 5x6=~30TB, I had 24TB of storage, because one drive is supposed to be used for redundancy. I believe this is called RAID5.
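
The capacity arithmetic above can be sketched as a quick check. This is only a rough illustration: `usable_tb` is a hypothetical helper, not a TrueNAS tool, and it ignores ZFS overhead and the TB vs. TiB difference, so the real figure shown in the web interface will be somewhat lower. (In ZFS terms, a single-parity layout like this is called RAIDZ1, which is the rough equivalent of RAID5.)

```shell
# Hypothetical helper: raw capacity left after setting aside parity drives.
# Arguments: number of drives, size of each drive in TB, number of parity drives.
usable_tb() {
    drives="$1"
    size_tb="$2"
    parity="$3"
    echo $(( (drives - parity) * size_tb ))
}

# Five 6 TB drives with one drive's worth of parity
usable_tb 5 6 1   # prints 24
```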

I'm not sure about this, but I believe my pool is encrypted. If it is, I have the password for it. I used the same password for everything.


The Problem:

A month and a half ago I had a power outage. When I got my power back and turned everything on, I had a problem with "Pool Offline." I troubleshot it in the STORAGE subforum here and managed to get everything working once again by running "zpool import -F tank". It said I was losing 20 seconds of data, but that wasn't relevant to me. Problem was fixed.

I ended up not powering my NAS on anymore until I got a nobreak (UPS) for it. So a couple of weeks pass, I get a nobreak and turn everything on. But it couldn't handle my PC+NAS at the same time and it shut down on me.

This happened 3 weeks ago. So I unplugged everything from the nobreak, plugged it back into the wall and just kept using my PC normally without even touching my NAS.

I decided to turn it on to check on some files a week ago and I was having the exact problem I had before, with the difference that this time it said I was going to lose 80 something seconds of data, which wasn't relevant to me. My pool "tank" was basically offline again. I repeated all the checks from my previous problem and compared the results with the ones I posted before on the other thread I mentioned. Everything seemed to be the same, so I tried the same solution. "zpool import -F tank" it was.

But this time it didn't work. It started doing whatever it does but then restarted the system, landing me back where I started. The same 84 seconds of lost data were reported to me, so I tried the -F again. It again tried doing something but just restarted one more time.

So I decided to record my NAS screen to be able to pause the video and read it to try to figure out what was happening before it restarted itself. I set up a phone recording it and hit the import -F again. This time something different happened. I got stuck in a bootloop.

And I don't know what to do.

Here's a video of the bootloop:



Additional Information and Questions:

I tried troubleshooting this in the same thread I previously had open in the "STORAGE" subforum. I got replies from a couple of people, so here are some things that I think are relevant:

1) I was told to turn any over-clocking off. I didn't have any on in the first place, but anyhow I did a reset cmos/bios. Didn't help.

2) I was told to get a new network card. I didn't do this, but I'm not sure how this could be a problem when I can't even boot my system. Anyway I tried turning off the onboard network card in my bios. Didn't help.

3) I was told to get an LSI HBA card. These are very expensive in the third world country where I live. I have lots of files on my NAS, but most are completely redownloadable. Will take some time, but it's manageable. There are very few more important files, but it wouldn't be the end of the world losing them. I'd like to try and salvage them if possible and then find another storage solution. So I'm not sure spending that money is adequate. Maybe as a last resort, IDK. If I were sure that this would fix my problem, I'd buy the card. But it's hard to be sure of anything.

4) I tried booting with all HDDs unplugged, with only the SSD with the OS in it running. It worked. I managed to boot and could connect to truenas through the web interface. Is this behavior expected? Does this indicate what could be going wrong? I tried unplugging HDDs one by one (supposedly one could break and my system would still run because of raid redundancy), but with only 4 out of 5 HDDs plugged in, no matter which 4, the bootloop happened again.

5) Could I reinstall TrueNAS to my SSD and see if it fixes things? I don't think it would, since it boots with HDDs off, but it's worth asking.

6) What if I remove my boot SSD and all HDDs and plug them in my main computer? Could it be a hardware problem with the NAS machine that wouldn't happen in my main PC? Can the TrueNAS OS manage the change in hardware without it needing to be reinstalled? Could this have any chance of working?

7) After searching here on forums, I found reference to this: https://www.klennet.com/zfs-recovery/remote-assistance.aspx Can I plug my 5 HDDs in my main Windows computer and try running this to recover my data?

Any tips?

I thank you very much for your time and attention, and I'm sorry for bothering you with these problems. I tried doing something that exceeded my tech knowledge and ended up screwing myself and am now wasting your time, so I'm sorry.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Hello @Decess

I did see that in your original thread it was mentioned that you have "Shingled Magnetic Recording" or "SMR" drives - this might be contributing to the problem, as the WD SMR drives are specifically known to have issues with heavy load (they return IDNF errors trying to read a sector) but in this case you may also have issues with your general system stability as you're getting a kernel panic:

decess.jpg


Are you able to run a copy of MemTest86 or a similar program to ensure that your CPU/motherboard/RAM are stable?

Once overall system stability is confirmed, we can try to boot without importing the pool (with the drives disconnected) and then attempt to import it read-only through the command-line (similar to how you used the -F flag previously) and determine if it is able to come online in this manner.
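
The read-only import described above could look roughly like the sketch below, run from the TrueNAS shell after booting with the drives disconnected and then reattaching them. The function names are hypothetical wrappers for illustration; the `zpool` options inside them are standard ones, and the pool name `tank` and the `/mnt` alternate root are assumptions based on this thread and the usual TrueNAS layout.

```shell
# Hypothetical wrappers around standard zpool commands.

# Show which pools the system can see, without importing anything.
list_importable() {
    zpool import
}

# Attempt a read-only import so nothing new is written to the disks.
#   -o readonly=on  import the pool read-only
#   -f              force the import if the pool looks like it is still in use
#   -R /mnt         mount it under /mnt (assumed here, as TrueNAS normally does)
import_readonly() {
    zpool import -o readonly=on -f -R /mnt "$1"
}
```

Usage would be `list_importable` first, then `import_readonly tank` if the pool shows up in the list.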
 

Decess

Thank you very much for your reply!

Am running MemTest86 right now. Will leave it overnight. ATM it shows 2 passes with 0 errors.
 

Decess

Hey

Did a little over 19 hours of testing. The first 2 passes went pretty quick, about an hour each. Then they slowed down once errors started showing up.

1.jpg


2.jpg


3.jpg


These are pictures taken at different times: 8, 13 and 19 hours of running time.

So clearly there is something wrong with my system. In my ignorance I ask:

1) is this definitely RAM? Couldn't it be the motherboard?

2) Are my panic/bootloop issues definitely due to these hardware problems? Because it's not like I'm getting errors all the time. I sometimes got passes. Wouldn't my system sometimes boot too? Especially considering the system ran fine for a couple of months with 0 issues and only exploded after power outages.

3) Should my first attempt at fixing this be finding new RAM?

4) I loop back to a previous question I asked in my original post here: what if I remove my boot SSD and all HDDs from my NAS machine and plug them in my main computer? Could it be a hardware problem with the NAS machine that wouldn't happen in my main PC? Can the TrueNAS OS manage the change in hardware without it needing to be reinstalled? Could this have any chance of working?


I thank you again for your time and attention.
 

HoneyBadger

1) It could be the RAM, the board, or the chip - given the delayed onset of problems (two successful passes, then errors begin) it could also be a thermal issue. As you suggest in your point #3, the RAM is the easiest to check - removing a stick or two of RAM and retesting could shed some light on whether a particular stick is bad.

2) It's not a 100% thing. Certainly, unreliable hardware will have trouble importing a pool - but if the same hardware managed to corrupt metadata on your pool, whether because of failing RAM or the data-integrity issues around the WD SMR drives, it might have permanently written some bad data there, which is causing the panic on import. The failure after power outages makes me think it is related to the SMR drives though, as if they were in the midst of shuffling data from their "cache" area to the permanent home on the disk when the power outages occurred, this is an easy way to cause an inconsistent data state.

3) See #1 - if you can narrow down if you have one bad stick or another first, that is probably better than buying new hardware immediately.

4) This is also a good way to identify if the problem is hardware-based or pool-based. You would likely have to reconfigure your network settings, but if you're able to get your system to boot to the TrueNAS console, you can then use that to go to the shell/prompt, and perform a simple zpool status -v check - if your pool was successfully imported, then it's a good sign that your pool is healthy and the hardware is the problem. If your main/desktop computer fails to boot, and bootloops in the same manner - that's unfortunately a sign that the problem is the pool itself, and we might need to find a way to get it to import read-only, and/or go back a few transaction groups similar to what you've done in the past.
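
As a rough sketch of the checks mentioned in point 4, assuming the pool is still named `tank`: the function names below are hypothetical wrappers, while the `zpool` commands inside them are standard. The second one only asks whether a rewind to an earlier transaction group would make the pool importable; with `-n` added to `-F` it does not perform the recovery.

```shell
# Hypothetical wrappers around standard zpool commands.

# If the pool imported, show its health plus any files with known errors.
check_pool() {
    zpool status -v "$1"
}

# Dry run of the recovery import: -F asks for a rewind to an earlier
# transaction group, and -n reports whether that would work without doing it.
rewind_dry_run() {
    zpool import -F -n "$1"
}
```

Usage would be `check_pool tank` after a successful boot, or `rewind_dry_run tank` if the pool refuses to import.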

Ultimately, those SMR drives do need to go away though. As mentioned in point #2, they're capable of doing things to the data on disk without ZFS's knowledge - and that's a bad thing for the safety of your storage.
 