Hard Drive Troubleshooting - Massive Failures - Need Help Isolating the Problem(s)

Andy C · Oct 12, 2017

Inxsible said:
That is very difficult given the umpteen number of options that are available in every server. For eg. Some server boards have IPMI, some don't. Some have multi-processors, some don't. Some have multiple FreeNAS systems using replication, some don't. So on and so forth...

But the basic guideline (once the server is built and is up and running) is pretty clear.

You need to have SMART running in regular intervals (I run short every day, long every 3 days)

You need to have regular scrubs of your pool. (My boot pool is every 5 days, tank is every 12 days)

You need to have regular snapshots set up in case crap hits the fan.

You ABSOLUTELY must set up the email in FreeNAS so you can be informed of the various things that FreeNAS is doing.

Set up ssh, so you have access to the box in case the GUI doesn't work for some reason

Once all that is done, you probably won't even need to login to the GUI or into the freenas box.

This is essentially my basic setup (ZFS, RAID-Z2 for my media and backups).

Before I invested in a FreeNAS, I researched it, read the FAQ'as and helpful information on this forum. Once I setup my system, I have subsequently tweaked it and added features like periodic snapshots but essentially my setup is unchanged from day 1. My main cause of instability is myself (I am a serial experimenter so I ran Corral beta and now FN11 nightlies) and a pesky powerline adaptor that occasionally drops the connection (wife doesn't want the box in the bedroom).

I knew U*ix pretty well before FreeNAS but I can't believe people would invest time and money without doing some preparation by reading around the subject.

Not meaning to be patronising here and I hope the OP resolves his problems and rest assured, I'd be first to post a plaintive, desperate plea for help on these forums if I ever, God forbid, get a hardware problem or an obtuse SMART error but sometimes you reap what you sow.

arameen · Oct 12, 2017

rs225 said:
Since da2 can't complete a SMART test, definitely plan to replace it next, as I outlined in #61.

Check your power lines, and if you can balance the load any better, do so. I do think the power supply is undersized for the system plus two pools. Although the wattage may seem fine mathematically, each individual rail coming out of the PSU can only provide a certain portion of that power, not the total. So there could be overload on the rails. The 8TB which complains about spin-up could be an indicator, as the larger drives usually have high spin-up power requirements. But if the drive were to start working in a different power situation, it might actually be okay.

Long tests will probably not complete tomorrow, you've got resilvers going that take two days. :)

That drive will get replaced, I already have a new drive waiting. just don't want to replace it now, wanna see if we can figure out something from the pool status while troubleshooting what hardware is not feeling good.
Once the resilvering is done, i will turn off the server and check the powercables. Try to balance drives avoiding too many drives on one line. Actually this is something I have been thinking about before. Even the facet, that I didnt mentione yet, that allmost all 8TB NAS drives I recently tried to add were causing different issues while trying to replace 4TB drives temporary. This somehow is pointing towards your theory that it could be related to the PSU.
Same time, larger drives, and drives in general, dont require much. not even during spinup, but again, I am not the expert here :)
I can not find the calculation done before on required Watt for my server, but It remember it was covering those disc with no problem. But again, as I mentioned earlier, the pool was complaining about at least 4 new 8TB drives and my thought were if it was powerrelated. When i checked the powerusage was not so much higher for this disk compared to the 4TB disks why I was not suspecting totalt systempower usage as the problem.
What effect do you recommend for my system if you think my could be undersized for this ?

arameen · Oct 12, 2017

joeschmuck said:
Whoa, this is one long thread for a very basic problem. I'm glad to see a lot of people offering assistance.

@arameen Please take the advice of the group who have posted that you should be doing some more reading up on how FreeNAS works and how to maintain it. Unfortunately there is not a nice single guide on each and every step you must do to configure your FreeNAS system properly to make it a virtually hands-free system. FreeNAS was never originally intended to be used buy the typical novice and required the user to do a lot of work investigating how to make the system work for you. Thankfully our forum members are here to help us all out when we are in need. Also I know that some of the postings here read as if the poster is geting a little frustrated, both them and you, I can relate to be honest, I think we have all been on both sides of a problem. I am glad to see it looks like you are getting your problematic pool recovered.

If you do not know how to setup routine SMART tests then read the User Guide, it has a section on it. Once you have your pool issues solved then I'd establish those settings as follows: Smart Short Test = Once a Day on Sunday through Friday, Smart Long Test = Once on Saturday. Do not run a long test on Sunday because the automatic scrub runs on a sunday and then the hard drive is fighting for time between running both tests. And as previously stated, ensure your email notification is setup properly becasue you will get an email when a drive experiences a problem that you should take care of.

Other things to look at while you are troubleshooting, ensure all the fans in your computer are spinning, including your power supply.

You have several drives which have UDMA errors (ID 199). These are caused by communications errors, typically a bad cable or controller. Unfortunately these values never reset back to zero so my advice is to just print out the SMART data and keep it on hand. If you notice the UDMA CRC errors increase then you have a problem with a cable or controller. While the problem "could" be in the hard drive itself, it's very unlikely.

So lets see if I understand your present hardware configuration...
1) The "Main" pool has been disconnected from the motherboard SATA connectors.
Q1) Are the drives still plugged in to power or are they disconnected?
2) The pool having issues is not disconnected from the HBA card and plugged directly into the motherboard SATA connectors.
Q2) Is the HBA still plugged into the motherboard?

If you are not using the HBA and your pool is fine once you have completed all the testing, then you can validate if the HBA is causing it by powering down your system and using the HBA SATA ports and powering back up, test out your system, run a scrub, after that check the SMART data for any changes. Obvioulsy you can't use just a single data point, you need to give the system some time to fail or pass but this is a good quick check. If you have new failures then it's highly suspect that the HBA is your problem. You can rule out the power supply too by connecting the other main pool hard drives up to the power, assuming they were disconnected. If they have been connected this entire time then I doubt it's the power supply at fault.

My last piece of advice is to take your steps slow, do not rush into anything. While you can check SMART data at anytime, you shouldn't be resilvering one drive while running a smart long test on another drive at the same time, it will slow down the resilvering and smart test and the drive under the smart test will hate you. I'm not telling you to stop any test if you have it running, but it will take a lot longer during a resilver or scrub operation.

Stick with it and best of luck to you,
-Joe

Well I hope its a basic and simple problem :) but it feels a bit more complicated than basic.
I really appreciate all the help am getting here, and ofcourse am taking all advices with great respect. Once I found out what is faulty and replaced it, I will go through all guides after I set up regular SMART tests, long as short, besides the regular scrubs. This has been pointed at me so many times in this thread that there is no way to miss it :)
I think that those UDMA errors are older ones, but now that I learned more about those numbers I will save the numbers and keep track of them to see if they change and when they change.

Well as of now, the data is backed up. The pool is not in good shape, but that doesnt matter much. What matters is to find the cause of this. What needs replacment, get it fast and get my server up. I need my main pool online. And it is not now as am troubleshooting this secondary pool.

Present hardware configuration is (since last week):
1) The "Main" pool has been disconnected from the motherboard SATA connectors.
Q1) The drives of the main pool are not plugged in to power
2) The pool having issues, called secondary, is disconnected from the HBA card and plugged directly into the motherboard SATA connectors.
Q2) The HBA is still plugged into the motherboard but not beeing used. (Not sure if pulling it out would benefit while troubleshooting why I didnt think about it at all)

Well I am not sure what "the pool is fine" means here. Sure the pool is there and I can access the data, but the statues of the pool looks no good. Still some drives getting into faulty status and new data corruption (this datacorruption is new and I just observed it). And this happend with the new configuration mentioned under present hardware configuration. This is how the pool looks like as of now

Code:

  pool: Secondary_Raidz3																											
state: DEGRADED																													
status: One or more devices is currently being resilvered.  The pool will														   
		continue to function, possibly in a degraded state.																		 
action: Wait for the resilver to complete.																						 
  scan: resilver in progress since Wed Oct 11 23:32:57 2017																		 
		20.5T scanned out of 34.8T at 243M/s, 17h10m to go																		 
		71.5G resilvered, 58.80% done																							   
config:																															 
																																	
		NAME											STATE	 READ WRITE CKSUM												 
		Secondary_Raidz3								DEGRADED	 0	 0 13.7K												 
		  raidz3-0									  DEGRADED	 0	 0 54.9K												 
			gptid/8275e396-a83c-11e7-9cee-002590f5b804  ONLINE	   0	 0	 0  (resilvering)								   
			gptid/3a44142c-931c-11e7-b895-002590f5b804  ONLINE	   0	 0	 0												 
			gptid/33c047e7-2292-11e7-9626-002590f5b804  ONLINE	   0	 0	 0												 
			gptid/34749735-2292-11e7-9626-002590f5b804  ONLINE	   0	 0	 0												 
			6370505857967419013						 UNAVAIL	  0	 0	 0  was /dev/gptid/3536bf51-2292-11e7-9626-002590f5b
804																																 
			gptid/35e2d6ec-2292-11e7-9626-002590f5b804  ONLINE	   0	 0	 0												 
			gptid/368b679d-2292-11e7-9626-002590f5b804  ONLINE	   0	 0	 0												 
			gptid/3730ee56-2292-11e7-9626-002590f5b804  ONLINE	   0	 0	 0												 
			gptid/37de7e53-2292-11e7-9626-002590f5b804  ONLINE	   0	 0	 0												 
			replacing-9								 UNAVAIL	  0   113	 0												 
			  5660221525628801207					   UNAVAIL	  0	 0	 0  was /dev/da8p2								 
			  da0p2									 FAULTED	  0 7.71K	 0  too many errors  (resilvering)				 
			gptid/39778368-2292-11e7-9626-002590f5b804  FAULTED	 36   115	 0  too many errors								 
																																	
errors: Permanent errors have been detected in the following files:																 
																																	
		Secondary_Raidz3:<0x0>																									 
		/mnt/Secondary_Raidz3/Action/Thumbs.db

by the ways, thanks :)
fun to see even moderators getting involved in this

rs225 · Oct 12, 2017

The zpool status you posted yesterday showed a resilver had recently finished, and since the drives showed 0 errors, I thought the pool would be able to recover. However, your status today shows top level checksum errors again, and failures of another drive. If you can shut down your system and re-balance the power load, maybe the pool can be restored. Removing the HBA may held your situation with the drives. If not, and if you have a good backup, I would say it looks time to give it up. There is no reason to let the resilver finish with those errors.

I think you need a larger wattage PSU, designed toward a large storage server. The important thing to look for is lots of individual drive power cables built-in. Each is designed to carry a certain power load without sagging. After all, it is a safety feature that too much power can't go through those small wires.

arameen · Oct 12, 2017

rs225 said:
The zpool status you posted yesterday showed a resilver had recently finished, and since the drives showed 0 errors, I thought the pool would be able to recover. However, your status today shows top level checksum errors again, and failures of another drive. If you can shut down your system and re-balance the power load, maybe the pool can be restored. Removing the HBA may held your situation with the drives. If not, and if you have a good backup, I would say it looks time to give it up. There is no reason to let the resilver finish with those errors.

I think you need a larger wattage PSU, designed toward a large storage server. The important thing to look for is lots of individual drive power cables built-in. Each is designed to carry a certain power load without sagging. After all, it is a safety feature that too much power can't go through those small wires.

Sure the new checksum errors on top level show something is still wrong despite bypassing the HBA.
As soon as the SMART long tests are done on the previously interrupted drives, 4 of them. I will turn off the machine, change the powercables trying to distribute as much and even as possible and start up. Then maybe do long tests again on all drives and see if any get interrupted? or is there a better test to do to see that the powercablerearranging did help ?

giving up the pool itself is ok, as I have a fresh backup of all contents there. But I dont want to create a new pool without knowing what the problem is. I want it solved so the new start will be a good start :)

as of bigger wattage PSU. what do you recommend of Wattage ? considering this system in normal operation has one pool of 5 x 8TB seagate ironwolf 7200 RPM drives and another pool of 11 x 4TB Seagate Ironwolf 5400 RPM (those may in future get upgraded to the 8TB drives, meaning more power and RPM - seems hard to find bigger disks with lower RPM).
Anyway, with that disk setup and my hardware in the signature, what PSU wattage do you suggest ? keeping in mind that it would be able to handle even bigger drives in future.

ethereal · Oct 12, 2017

arameen said:
as of bigger wattage PSU. what do you recommend of Wattage ?

pinned here for 2 YEARS - https://forums.freenas.org/index.php?threads/proper-power-supply-sizing-guidance.38811/

rs225 · Oct 13, 2017

Right now, you just want to see if balancing the power and removing the HBA will get rid of the checksum errors. If they don't go away, abandon the pool. If you have hotswap bays, pulling the 8TB disks, and then plugging them in a couple seconds apart after turning on the system power might also help. (This means forget/cancel the SMART tests until the pool is abandoned, or healthy)

As for power supply, you will need to buy whatever power supply has enough rails, each rail rated for peak startup amperage of whatever number of drives you plan to attach to each rail. Your drives usually have the 5volt and 12volt amperages on the label. If not, 0.7 on 5volt and 0.5 on 12volt is reasonable approximation per disk, but I would specifically verify those 8tb disks.

As for your problem, I would assume it is power. Until power is eliminated as a cause, you will be chasing your tail on everything else. Nobody generally has that many independent problems, but power can cause all kinds of problems.

arameen · Oct 13, 2017

I have been waiting for the resilvering to finish, so the SMART Tests finish too. So I can post them, turn off the server and try redistribute the drives as even and fair as possible. Compared to building a PC with a powerfull videocard, that needs lots of powers, i never thought this could be an issue here. I mean, am not using a cheap or bad PSU, but I counted on 750W beeing enough.
What I didnt think of is how may drives are connected to one trail only (there is 5 trails and all are plugged in, but can not see clearly how distributed they are), add to that I have several fans connected to the same trail. I will look into that carefully once the machine is off, maybe I see too many drives and fans on one trail.

When I did the calculation once I built the sever, I came to conclusion that total Wattage was more than enough.
I don't know if it is not ?
Or if this is about some trails beeing overloaded (as I never before thought calculating on each trail

). searching in the PSU manual to see if I find information about this

Once I did the cablerearrangement, will it be possible to know that this was related to the PSU without recreating the pool, only by looking into this current ones status ?

Bidule0hm · Oct 13, 2017

What's the brand and model of your PSU?

ethereal · Oct 13, 2017

if the psu is faulty then it may not matter which wattage you have - you'll need a new one.

i had a psu which was really overkill for my server - but when if failed it was subtle everything seemed normal. but i'm sure the irregular voltages damaged my m/b because the ipmi does not work now. i had a cheap psu tester which said everything was okay but i tried a different psu and my problems went away (apart from my blown ipmi)

i'd get a different psu and try that

arameen · Oct 13, 2017

Bidule0hm said:
What's the brand and model of your PSU?

BE Quiet! Dark Power PRO BQT P10-750W

arameen · Oct 13, 2017

ethereal said:
if the psu is faulty then it may not matter which wattage you have - you'll need a new one.

i had a psu which was really overkill for my server - but when if failed it was subtle everything seemed normal. but i'm sure the irregular voltages damaged my m/b because the ipmi does not work now. i had a cheap psu tester which said everything was okay but i tried a different psu and my problems went away (apart from my blown ipmi)

i'd get a different psu and try that

You are totally right about that.
But getting a new PSU this time means propably a Seasonic. And those are not cheap, already looked a little into that.
It would be good to have some kind of proof or at least more pointing towards the PSU as the problem here. Remember that main pool had no issue despite it replaced alll drives months ago with faster and bigger 7200RPM 8TB drives, all those drives have been connected to the same PSU. I want to avoid ending with buying new PSU, HBA and worst case motherboard too, except lots of new drives ofcourse.

rs225 · Oct 13, 2017

If there are any fans you don't need temporarily, you can disconnect those too. The power supply seems pretty decent. You could be pushing the limit or have a problem on the 5V though, considering there is only one 5V for the entire system. The 12V looks very good, as long as the load is balanced.

If the pool doesn't recover, you can create a new one, and continue to do testing on the hardware. Switch everything over to the HBA, complete all the SMART tests, and see what happens with some test data on it.

Johnnie Black · Oct 13, 2017

You shroud really avoid a multi-rail power supply for storage servers, depending on how the rails are distributed you may not have enough 12V amps, e.g., this is a typical 4 rail power supply:

12v1 is CPU
12v2 peripherals
12v3 and 12v4 GPUs

Leaving you with 25 Amps for all the disks.

danb35 · Oct 13, 2017

rs225 said:
If not, 0.7 on 5volt and 0.5 on 12volt is reasonable approximation per disk

Two amps per disk on 12v for startup would be a better planning number.

joeschmuck · Oct 13, 2017

arameen said:
fun to see even moderators getting involved in this

The only reason we are moderators is because we have been involved and do good things for people, and we have a level head, so you will see the moderators being active in the FreeNAS forum. We also are watching over the forums but that is not my primary job here, my job is to help people out and police the forums second. I've been on forums in the past where moderators seemed to police the forums all the time, never having much time to do much of anything else. We are not like that here. Policing the forums happens when we see an infraction, we don't go surfing the forums looking specifically for infractions.

Bidule0hm · Oct 13, 2017

Ok, I saw some reviews and it's a pretty good PSU. Unless it's old or there is a defect of some sort it should not be the PSU. The thing is, unless you can test with another PSU and/or you can do some measurements with an oscilloscope then you can not be sure it's not the PSU.

I agree with @danb35 I've made the measurements myself, 3 A total for both 5 and 12 V for a few seconds is common for the drives used in our NAS, the details are in this thread if you want more :)

joeschmuck · Oct 13, 2017

And of course I agree too, in order to troubleshoot a power supply problem you really need another power supply or some test equipment. Trust me, a new power supply is cheaper. And we here like Seasonic. You don't need a Gold or Platnum power supply, a Bronze rated one is good too. Look for a sale and buy one. Right now I'm not thinking it's your power supply, likely your HBA, but one never knows until you prove it. So if your hard drives are doing what is expected and the pool recovers and operates fine and looks proper, then I'd move that pool to the HBA connectors and use it for a while, testing it out to ensure it works fine. I suspect it will start throwing errors again.

If you need access to your "Main" pool, you can connect it to the motherboard SATA connectors again but it doesn't help the troubleshooting efforts at all right now. If you have another computer that has 8GB RAM, you could use that to host your main pool, but I don't know if you have one.

arameen · Oct 13, 2017

Johnnie Black said:
You shroud really avoid a multi-rail power supply for storage servers, depending on how the rails are distributed you may not have enough 12V amps, e.g., this is a typical 4 rail power supply:

12v1 is CPU
12v2 peripherals
12v3 and 12v4 GPUs

Leaving you with 25 Amps for all the disks.

I read a bit about that, it seems some people prefer one-rail PSU while others prefer multirail PSU.
If you would recommend a one-rail PSU, what would that be ? :)

arameen · Oct 13, 2017

Bidule0hm said:
Ok, I saw some reviews and it's a pretty good PSU. Unless it's old or there is a defect of some sort it should not be the PSU. The thing is, unless you can test with another PSU and/or you can do some measurements with an oscilloscope then you can not be sure it's not the PSU.

I agree with @danb35 I've made the measurements myself, 3 A total for both 5 and 12 V for a few seconds is common for the drives used in our NAS, the details are in this thread if you want more :)

That thread is way to advanced for me, but I can see the you love the topic :)
Sure the PSU is one of the best out there, at least when I bought it, read many reviews before getting it.

Not sure if I followed you with "3 A total for both 5 and 12 V for a few seconds is common for the drives used in our NAS". But do you think too many harddrives and some fans on one PSU trail could be a problem and causing this ?

Important Announcement for the TrueNAS Community.

Hard Drive Troubleshooting - Massive Failures - Need Help Isolating the Problem(s)

Explorer

Contributor

Contributor

Guru

Contributor

Guru

Guru

Contributor

Server Electronics Sorcerer

Guru

Contributor

Contributor

Guru

Guru

Hall of Famer

Old Man

Server Electronics Sorcerer

Old Man

Contributor

Contributor

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Hard Drive Troubleshooting - Massive Failures - Need Help Isolating the Problem(s)"

Similar threads