Multiple disk failures confusion

Etorix · Sep 12, 2022

superfour said:
So... let me say that I still cannot comprehend how a 2-year-old system used by one user only, never stressed, will fail with such a manner. What factor can *actually break* a disk, or break 8 disks in 60 minutes for that matter? Few RAM!? Three controllers simultaneously having strokes?

One of the two add-in, and low end, SATA controllers failing to cope with the load put on it by ZFS would be enough.
It's good that you could put the disks in a system with some more RAM, but you need to get rid of these SATA cards (SATA ports from the Intel chipset are fine).

FreeNAS/TrueNAS CORE are not Linux (I probably wouldn't be here if it were!). And they are not so "friendly to old hardware". There's some sort of qualified hardware list which has to be followed. ECC RAM is nice to have, but not essential. Having proper drive controllers IS essential.

joeschmuck · Sep 12, 2022

@superfour
First I'd like to say that I appreciate you remaining polite through all the interactions with the other forum members. it's good to be polite for everyone, even when we do not understand the rationale behind someone's thinking.

I could tell several of the forum users trying to help you were really looking for complete details about your problem so they could provide you good sound advice. You did not provide good details about your problem, your hardware, I'm still not sure what version of TrueNAS you are running but I suspect 12.x. Just some friendly advice for any future problem you are asking for help with... Post your entire system specs, tell it all and do not assume we know what you have or assume it doesn't matter. Let us decide what matters. All too often it's a little detail that is the root of all evil.

Something you should have also posted, the complete output of

Code:

smartctl -a /dev/ada1

and do this for all drives that are having problems. This can provide a wealth of information.

I am glad that you have another system to place the drives into to let them resilver.

My advice:
1) Let the drives resilver, if they resilver. While this is happening, if there is data you need off the pool, grab it now.
2) When the pool is healthy again, run a SMART long/extended test of every drive and report the findings. (unless they finish from what you are doing now)
3) If you place your drives back into the original system, I'd recommend you do full burn-in test for the CPU and RAM. Do a good one and make sure your system is stable. Do it with all the drives connected, you want the full power load.
4) If all that passes and when you get TrueNAS running, periodically check the SWAP Space, make sure zero (0) has been used. If any SWAP Space is being used then you have run out of RAM and you should increase your RAM.

As for the SATA drive controller card, well there are some good points made above. My first thought is to check the SATA cables, or maybe even replace them. Often a poorly connected data cable will product UDMA_CRC errors for the hard drives. These errors once recorded do not go away. But if yo find out that all the drives attached to the same controller are failing, then you have a possible smoking gun. Something about SATA controllers, they may stop being supported by the drivers in the TrueNAS application someday. I personally take that risk everytime I update my software because I use a standard 4 port SATA controller card, but I researched the card for driver compatibility, well that was many years ago.

As for when or how a system fails... It just happens due to component failure. Age does not matter. You could be having a power supply issue that affects the drives, and that is actually a reasonable explanation.

Most of the folks I know responding to you are very experienced and are trying to give you good advice. They do not have an agenda to tell you how bad your system is and leave you hanging. your system is substandard by today's standards for TrueNAS 12, even if it does work. If you do not want to change out your system, that is fine, but be diligent in providing details in order to isolate the issue, or if they give up on you, then you will need to do it on your own.

I wish you luck and hope that you have not lost any data.

superfour · Sep 13, 2022

@joeschmuck thank you for your time!

My news today after several hours of SMART (I didn't put them all at once) is that 9 out of 10 drives pass the long SMART test. I guess that's somewhat good news. Regardless of attitudes, I'm trying to follow advice whenever I can. Now for example, I have no options left. Need to ask around for decent SATA controllers to replace mine; but what is decent?! :D The array (config'd in one pool) is still unusable. Pool status write "resilver, status finished, errors 0". Kinda funny. Let me remind your that in both systems I need 3 controllers (including the mobo one). From those 2 extra controllers, one I moved from the old system to the new. So new system has the old 4-port Marvell, a second identical Marvell, and the onboard (Asus mobo). Again, disks reported degraded across all controllers.

This system's swap is at 4GB. After this discussion I think that's there's not much point in trying it again with the old system. I've already ripped it apart anyway (quite old, power hungry, etc). Is systemctl output for all drives useful here? Shall I try a zpool clear or will it make things worse?

ChrisRJ · Sep 13, 2022

superfour said:
Need to ask around for decent SATA controllers to replace mine; but what is decent?! :D

The recommended approach is to get a used plain LSI HBA and flash it to IT mode. There are (or at least used to be, when I last looked) many of those available on eBay for very little money. I got IBM branded ones (ServeRAID M1015, also see e.g. https://www.servethehome.com/ibm-m1015-part-1-started-lsi-92208i/) for less than 30 Euros.

superfour said:
Let me remind your that in both systems I need 3 controllers (including the mono one).

Can you (possibly again) tell us why you need so many per system?

superfour · Sep 13, 2022

ChrisRJ said:
Can you (possibly again) tell us why you need so many per system?

It's 10 disks, never got my hands on a big controller... Put my faith in FreeNAS as I was explaining above, simple & hardware-friendly.
(Ebay / US sellers is not a option unfortunately around here - PCIx8 also not an option for simple mobos like mine)

Davvo · Sep 13, 2022

You could try to use 2TB ones instead, and that would half your required sata ports.

superfour said:
Is systemctl output for all drives useful here?

Could be.

ChrisRJ · Sep 13, 2022

superfour said:
It's 10 disks, never got my hands on a big controller... Put my faith in FreeNAS as I was explaining above, simple & hardware-friendly.
(Ebay / US sellers is not a option unfortunately around here - PCIx8 also not an option for simple mobos like mine)

Well, a typical HBA has 8 ports and I would assume that your motherboard has at least 2. That means one HBA would suffice.

I don't understand why you mention US sellers. My assumption had been that you look for someone from your country.

Redcoat · Sep 13, 2022

superfour said:
Ebay / US sellers is not a option unfortunately around here

Why not? Art of Server on eBay https://www.ebay.com/str/theartofserver?&_stpos=44105&_fcid=1 has a good reputation here - 2 port (for 8 HDD directly) IT mode LSI HBA's today are <$75.

joeschmuck · Sep 13, 2022

superfour said:
Is systemctl output for all drives useful here?

That is 'smartctl', not 'systemctl'. Close.
Yes, you could post it or at least examine it yourself if you know what you are looking for. Take a gander at the hard Drive Troubleshooting Guide in my links below if you need to understand the basics but if you want a fool proof way to know for certain there are no issues with your hard drives, post the output for each drive, in code brackets and preferably separate (one drive at a time). We will give it a look and several of us will tell you if all is good or if something is wrong.

And I agree with the other advice you are getting, a single HBA along with two MB connectors should be fine. Add one more motherboard connector for your SSD boot device.

joeschmuck · Sep 13, 2022

superfour said:
This system's swap is at 4GB.

Yikes, that is very bad actually. I was thinking maybe 8KB, but not 4GB. Yes, if you cannot add more RAM, a new motherboard is in your future.

superfour said:
Shall I try a zpool clear or will it make things worse?

This will not damage the pool. If you have rebooted the system and the pool is still showing degraded, while I'm not sure it will fix the errors you could try the command and see. The worst case is you have to destroy your pool and recreate it. So if you have any important data, grab a copy of it if you can.

Also, what is the current output for zpool status -v

Etorix · Sep 13, 2022

Redcoat said:
Why not?

OP is in Greece, so anything coming from outside the EU is probably not attractive due to costs (shipping and customs).
China may, or may not, be the happy exception, depending how customs work on the receiving end.

superfour said:
PCIx8 also not an option for simple mobos like mine)

These will work in any PCIe slot which can fit the card. But SAS HBAs are pretty much the only option for additional controllers with TrueNAS.
If getting a LSI 9200-8i (with its breakout cables) is not an option, then you'd better forget about TrueNAS. "Hardware-friendly" was a wrong assumption.

superfour · Sep 14, 2022

Thank you all yet again!

@Etorix @Redcoat indeed, and I don't want to digress but variety in hardware and prices is not an option here. We even lost UK; Brexit. Only source now is Germany, but again P&P and impossible delivery times (a greek problem) make things worse.

So, after rebooting and having again 8 disks degraded, after a zpool clear ...... all looks fine(!!

) I don't know where to begin with new questions here. What on earth does that command do actually?! Array is fully usable, all disks show ONLINE.

(SMART long test fails in only one disk, and I'll post smartctl results later.)

Davvo · Sep 14, 2022

superfour said:
What on earth does that command do actually?!

Basically reset all errors.

joeschmuck · Sep 14, 2022

superfour said:
So, after rebooting and having again 8 disks degraded, after a zpool clear ...... all looks fine(!! ) I don't know where to begin with new questions here. What on earth does that command do actually?! Array is fully usable, all disks show ONLINE.

Great news. I suspect your issues were from low RAM but I couldn't prove it. The disk with failures might have contributed some but I'm leaning more that it probably didn't, not considering all the failures you had. It's all speculation on my part, nothing factual to go on other than you had a 4GB Swap file, that is far from good.

superfour · Sep 14, 2022

Great news indeed! It's a pity that I didn't try a zpool clear on the first system. By bet is that it would have straighten things out, as @joeschmuck says nothing can be proven. It was a "great" experience anyway. I am examining the first failed drive which I removed and started this mess, and looks fine (apart from expected old_age SMART results). Find it quite strange: why TrueNAS does not perform a zpool clear each time it restarts? Why put the user in such agony beleiving that the whole system is failing?

PS. Sorry about that systemctl typo; I'm typing several systemctl's daily at work - force of habit (: All disks have mostly old age results. Expected of course.

joeschmuck · Sep 14, 2022

superfour said:
Find it quite strange: why TrueNAS does not perform a zpool clear each time it restarts? Why put the user in such agony beleiving that the whole system is failing?

Probably so it's not hiding a potential problem. You should keep in mind that ZFS is a file system built to ensure file integrity. If anything goes wrong, I'm sure the company using it wants to know about it. If we automatically cleared all the errors on each reboot, that could hide a lot of problems and what starts as an intermittent problem often results in a catastrophic problem.

One other thing... When FreeNAS 8.0 started up, it was not meant for any novice users to use without some adequate knowledge level of FreeBSD, and ZFS file systems. It was not a pretty sight. Over time the software has matured and having adequate knowledge level has darn near gone away. I don't personally like that because then people get into a pickle and don't have a clue how to get out of it. It's not their fault either but they scramble to fix a system they do not understand and hope data isn't lost forever. So reading the user manual two times minimum use to be my answer to many people before using FreeNAS. We are now at TrueNAS, and I still think people should be reading that user guide several times before getting into ZFS. And they should contemplate how they will fix a drive failure well in advance of the failure occurring. When I first got into this, I created a drive replacement procedure for myself and taped it to the side of the server. it have every single step that I needed to accomplish in order to recover from a failed hard drive. I used it once to test and once for a real drive failure. While it didn't save my butt (only because I was very familiar by then) it did make things go smoother.

And I'm not calling anyone here stupid or an idiot so please do not take my comments that way. I just fell people need to be educated on the systems they run.

Etorix · Sep 14, 2022

superfour said:
Great news indeed! It's a pity that I didn't try a zpool clear on the first system.

zpool clear does nothing more than clearing error reports; it does not solve any actual issue, just record that the administrator acknowledge having dealt with it. (And so, ZFS will never run this command on its own.)
So enjoy the "clean pool" but do check and double-check that the data is there, uncorrupted. A scrub would be on order…
And prepare for future issues: Backup, server upgrade.

If it was "just" the low RAM, the new server may have solved it. If it was a misbehaving SATA card, the issue will pop up again.

Davvo · Sep 14, 2022

As a side note, you might find the multi_report script quite useful in this situation. I suggest you to check it.

joeschmuck · Sep 14, 2022

Davvo said:
As a side note, you might find the multi_report script quite useful in this situation. I suggest you to check it.

I like the plug! @ChrisRJ is helping me understand more about BASH scripts to make the script more efficient and help make a better overall product. I'm going to tweak the script to recognize more SAS hard drives and hopefully have that done in the next few days, well that is the goal without breaking something else.

superfour · Oct 1, 2022

For the record & all those interested, after extensive testing, it was after all a faulty controller. Which was a good first diagnosis by @ChrisRJ .

Important Announcement for the TrueNAS Community.

Multiple disk failures confusion

Wizard

Old Man

Dabbler

Wizard

Dabbler

MVP

Wizard

MVP

Old Man

Old Man

Wizard

Dabbler

MVP

Old Man

Dabbler

Old Man

Wizard

MVP

Old Man

Dabbler

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Multiple disk failures confusion"

Similar threads