Pool degrades then shuts down

Stickeris

Dabbler
Joined
Sep 3, 2022
Messages
18
Hey y'all, I am having a weird issue. I woke up to find my server's services had shut down. I checked the box itself and it was hung on a reboot. I gave it a hand and it came back all fine and dandy. Within 10 minutes of checking apps to make sure they were running, I started to see a pool error, and then read and write errors across the board.

My shell displays:
md/raid1:md127: Disk failure on sdf1, disabling device.
md/raid1:md127: Operation continuing on 1 devices.

and the pool status looks like this:
[Attached screenshot: Screen Shot 2022-09-04 at 10.15.59 AM.png]

And a few minutes later the entire server crashes and gets stuck in a reboot hang with this output:
process 1661 has rlimit-core set to 1
aborting core

I don't know what to do, as I can't get it to stay up long enough to resilver.

These are almost brand-new hard drives, only 3 months old: IronWolf 8TB. I only have 4 of them in RAIDZ1, plus one NVMe cache.

My hardware is a bit old, but core temps and the BIOS are showing no issues:
  • Intel E52650I-v2
  • Asus P9X79 WS
  • 64 GB RAM
Any advice? From what I'm reading (in a panic), I may have to start with fresh hardware as well. Any help would be appreciated.
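
For readers hitting similar messages: the two shell lines quoted in this post already name the array and the member the kernel dropped. Below is a minimal Python sketch of pulling those names out of a kernel log. The regular expression is written against the two lines above only, and the function name `find_md_failures` is made up for illustration; other kernel versions may word the message differently.

```python
import re

# Matches kernel lines such as:
#   md/raid1:md127: Disk failure on sdf1, disabling device.
# The pattern is an assumption based only on the lines quoted in this thread.
MD_FAILURE = re.compile(
    r"md/(?P<level>raid\d+):(?P<array>md\d+): Disk failure on (?P<member>\w+)"
)


def find_md_failures(log_text):
    """Return (array, level, member) tuples for each md disk-failure line."""
    return [
        (m.group("array"), m.group("level"), m.group("member"))
        for m in MD_FAILURE.finditer(log_text)
    ]


# The two lines from the post above, used as sample input.
sample = """\
md/raid1:md127: Disk failure on sdf1, disabling device.
md/raid1:md127: Operation continuing on 1 devices.
"""

for array, level, member in find_md_failures(sample):
    print(f"{array} ({level}) dropped member {member}")
```

In practice the text would come from something like saved `dmesg` output. Mapping the member name (here `sdf1`) back to a physical drive, cable, and port is a separate step.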
 

indivision

Guru
Joined
Jan 4, 2013
Messages
806
Troubleshooting hardware can be a time-consuming pain. But, you want to remain calm, methodical and patient.

I think the first thing to test would be the individual drives. Power down the server. One at a time, move each drive over to another PC temporarily and run diagnostics on it. See if the faults are reproduced in a different environment. If not, then move on to looking at other parts in the server: PSU, RAM, etc.
 

Stickeris

Dabbler
Joined
Sep 3, 2022
Messages
18
Does what I listed sound hardware related?

Either way, I will move forward with individual testing in case I need to RMA anything. Any advice on good testing software?
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Suggest you specify the software version... and provide the system history. Was it running well before?
 

indivision

Guru
Joined
Jan 4, 2013
Messages
806
Does what I listed sound hardware related?

That's my impression. But, I don't really have much info about your system or things you might have tried/changed, etc. I recommend following up with the info morganL requested.

Any advice on good testing software?

I haven't had to do this directly myself. But, someone else recommended this for a similar situation the other day:

 

Stickeris

Dabbler
Joined
Sep 3, 2022
Messages
18
Sorry, I was running TrueNAS-SCALE-22.02.3. It has been running without error for the past 3 weeks. Yesterday was the first time I had to power down (a few times), to install a USB PCIe card. Other than that, no issues.

I was running one Ubuntu server VM, plus Traefik, Nextcloud, Mosquitto, and Pi-hole, all from the TrueCharts catalog and all up to date.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
So probably either the USB card or a hardware failure... the latter is more likely, would be my guess.
 

Stickeris

Dabbler
Joined
Sep 3, 2022
Messages
18
Hey y'all, I took my pool drives out and ran several long tests with Seagate's native diagnostic tool. One test was interrupted and I had to run a "long fix"; the whole process took a week. After that I replaced the SATA cables with some silly threaded ones I found on Amazon (just trying to get something that said "this isn't poorly made"), and then booted up. Everything works fine. After 48 hours of uptime, I am showing 0 issues.

That said, I am more or less playing a waiting game to see if my mobo starts to kick out. If so, it looks like I'll have to upgrade. Thank you everyone for the help and advice.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Sounds like the cables? Or was it just the one drive?
 

Stickeris

Dabbler
Joined
Sep 3, 2022
Messages
18
Sorry for the belated response. I will be posting again soon since I'm, kinda?, having the same issue again. I think it's the SATA controller on this old-ass motherboard, because the cables failing seems honestly ridiculous.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
If it failed again after replacing cables, I would tend to agree.

It's often not the cables failing, but just being out of spec for the system's needs.
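
One rough way to tell a link problem (cable or controller) from a drive media problem is to compare SMART counters: a rising `UDMA_CRC_Error_Count` is commonly read as a sign of link trouble, while reallocated or pending sectors point at the drive itself. The Python sketch below parses an attribute table in the layout printed by `smartctl -A`. The sample table is hypothetical and is not taken from the drives in this thread, and the function names and the simple "greater than zero" rule are assumptions for illustration only.

```python
# Attributes commonly read as link/cable trouble vs. drive media trouble.
# This grouping is a rule of thumb, not a definitive diagnosis.
LINK_ATTRS = {"UDMA_CRC_Error_Count"}
MEDIA_ATTRS = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}


def parse_raw_values(smart_table):
    """Map attribute name -> integer raw value from `smartctl -A` style text."""
    values = {}
    for line in smart_table.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # rows whose raw value is not a plain integer are skipped.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            values[fields[1]] = int(fields[9])
    return values


def suspects(smart_table):
    """Return a list of suspected problem areas based on nonzero counters."""
    values = parse_raw_values(smart_table)
    found = []
    if any(values.get(name, 0) > 0 for name in LINK_ATTRS):
        found.append("cable/controller link")
    if any(values.get(name, 0) > 0 for name in MEDIA_ATTRS):
        found.append("drive media")
    return found


# Hypothetical sample output, NOT from the drives discussed in this thread.
sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       37
"""

print(suspects(sample))
```

The CRC counter does not reset when a cable is swapped, so what matters is whether it keeps climbing afterwards. Comparing readings taken before and after a cable or controller change says more than a single snapshot.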
 

Stickeris

Dabbler
Joined
Sep 3, 2022
Messages
18
I'm waiting on the result of a SMART test on one faulted drive, but my friend is going to lend me a 9207-8i RAID card to see if that helps. I will post an update in this or a new thread. THANK YOU for the advice and help.
 