Hard Drive Troubleshooting - Massive Failures - Need Help Isolating the Problem(s)

arameen · Oct 14, 2017

Thanks for the input so far, guys. Lots of good input. I am using every piece of advice and not omitting any suggestion :D

I will use your plan, joeschmuck, with some additional steps of my own to make sure the HBA is not the issue and to try to figure out why the SMART long tests are sometimes interrupted by the host.
I already disconnected the troubled pool and reconnected the main, healthy pool. It's up and running.
I also put the HBA back in the case and connected one drive from the healthy pool to it. So 4 drives are connected to the motherboard while 1 drive is connected to the HBA. This drive is being powered from another rail, and its SATA cable goes to the HBA.
MEMtest has been run before for more than 6 passes on the ECC memory and there was no issue.

Next I will do the CPU test, just to make sure.
After the CPU test I will run long tests on the healthy main pool. I want to see whether those tests are interrupted at all, or interrupted only for the one drive that is connected to the HBA. We discussed earlier that some long tests are interrupted by the system and we could not find out why.
This way I will be running 2 tests at the same time: the HBA and the interrupted long tests. The reason I combine them (HBA connected and in use) is that I don't want this troubleshooting to take months.
If tests are interrupted, however, I will remove the HBA again and test with the motherboard only.
Next I will connect another drive to the HBA, making it 2 HBA-connected drives and 3 motherboard-connected drives. I will even make sure to power those 2 HBA-connected drives from the same power that fed the disks of the troubled pool before (to make sure this is not somehow related to the power extenders). Or I may connect 4 drives to the HBA to make sure every channel and power connection on the HBA is working. Hopefully that is not risky and will not corrupt the main pool if anything is wrong with the HBA?
Then I will run new long tests to see if any are interrupted. Hopefully none are. Depending on how long this takes, I may do a scrub as a final test.
Then I can assume that the HBA is not the problem, and neither is the motherboard or CPU: the healthy pool works with the HBA and no long tests are interrupted. However, if anything is interrupted, the tests above will have to be broken into smaller tests, which will take much longer. I hope I can avoid that situation and much more downtime.

Something else I will be testing is moving the USB sticks away from the front USB ports whenever there is an issue. I will then use the USB ports on the motherboard instead of the ones on the case. Now I want to ask: is there a way to know which USB stick is faulty when you have 2 sticks of the same brand and size? I searched many times for an answer but found nothing, so for now I am using a workaround. But maybe there is a way to know after all, without risking removing the healthy one or needing workarounds?
I will be doing a second and final scrub of the main pool before assuming that the USB is OK, the HBA is OK and the PSU is OK with 5 drives. At the same time I will try to put additional load on the server by copying files, to make sure I can proceed with the next steps.

If all the tests above pass, I will skip the step of installing FreeNAS on an SSD and continue with step 8 in joeschmuck's plan instead, after dropping the troubled pool and creating a new one :)

joeschmuck · Oct 14, 2017

arameen said:
and try to figure out why the SMART long tests sometimes are interrupted by the host.

This occurs only in two situations that I can think of right now. You run smartctl -X or use the force switch to abort it, or, what is most likely for you, the power is cycled to the hard drives (powered down even for a split second). The way the SMART tests work is that you send a command to run the test to the hard drive electronics and then the hard drive runs the test internally; no outside communication is required after that. You could send the command to start the test and then unplug the SATA data cable and the test will continue; the drive only needs power. What you can do is, before you start the long test, look at SMART attribute ID 12 Power_Cycle_Count and record it (well, I'd do a screenshot to retain all the data or dump it to a file; PuTTY does this well if you set it up to), and then after the test is finished, or more likely fails to finish, check this value again. If it incremented then you have another data point. And while I'm thinking of it, you don't have the drives set up to sleep, do you?
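The before/after comparison described above could be sketched roughly like this in Python. It assumes you have saved the attribute table from `smartctl -A` to text before and after the long test; the function names and the two sample rows are made up for illustration and are not taken from the drives in this thread.

```python
def power_cycle_count(smart_attributes_text):
    """Return the raw value of SMART attribute ID 12, or None if it is absent."""
    for line in smart_attributes_text.splitlines():
        fields = line.split()
        # Attribute rows start with the numeric ID; RAW_VALUE is the 10th column.
        if len(fields) >= 10 and fields[0] == "12":
            return int(fields[9])
    return None


def drive_lost_power(before_text, after_text):
    """True if Power_Cycle_Count went up between the two captures."""
    before = power_cycle_count(before_text)
    after = power_cycle_count(after_text)
    if before is None or after is None:
        raise ValueError("attribute ID 12 not found in one of the captures")
    return after > before


# Illustrative rows only (hypothetical values).
BEFORE = " 12 Power_Cycle_Count  0x0032  100  100  020  Old_age  Always  -  57"
AFTER = " 12 Power_Cycle_Count  0x0032  100  100  020  Old_age  Always  -  58"

if __name__ == "__main__":
    print("power was cycled during the test:", drive_lost_power(BEFORE, AFTER))
```

If the count goes up while a long test is running, that supports the power-cycling explanation; if it stays the same, the interruption has to come from somewhere else.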

arameen said:
MEMtest has been run before for more than 6 passes on the ECC memory and there was no issue.

How recently? Since the problems started? These tests not only test the RAM (which is not suspect) but the entire motherboard and CPU, and if you have flaky power, it tends to show up on these tests.

arameen said:
Next I will do the CPU test, just to make sure.

Good because this will put a strain on your power supply, and again, you should have all your hard drives connected to power so you are under the same conditions in which the problem originated.

arameen said:
Hopefully that is not risky and will not corrupt the main pool if anything is wrong with the HBA?

Yes, this is very risky. Make sure you have a backup before proceeding.

arameen said:
Now I want to ask: is there a way to know which USB stick is faulty when you have 2 sticks of the same brand and size? I searched many times for an answer but found nothing

Sure, do a Google search for "freenas identify usb stick", should be the first hit.

So I've gone through your drives' SMART data that you posted. I think I got all of them; you tell me. Here are my conclusions, listed by serial number:

ZA161YG1 = ID 10 Spin_Retry_Count always seems to indicate that it is FAILING. I would replace that drive and then test it separately on a different computer system to see if this changes; if not, I'd RMA the drive. My first thought was you don't have enough power to make this thing spin up, but it was the only drive with this error condition so it is likely going bad.

Z301NHXV = Looks good, Needs Extended Test

S300VSWF = Looks good, Needs Extended Test

S3019679 = Looks good, Needs Extended Test

S3019PGW = Okay

S3019PAC = Okay

WDH304RQ = New drive, needs both Short and Extended SMART Tests

ZDHZ217CP = Okay

WCC4E0020880 = Okay

W30124AF = Okay

ZA158SD6 = Looks good, Needs Extended Test

You have a lot of errors due to communications issues. These will never clear (become zero again) and are likely due to a SATA cable or HBA issue. The HBA could be the add-on card or your motherboard, but I think our preliminary results point to the add-on card. Pay attention to your SMART data and note the differences. My guide tells you that there are values to ignore, such as raw read error rates, so don't worry about that data. But where a value should be zero and it's in the single or double digits, that may be of concern. Feel free to ask about it, but please post the entire output of the SMART data. And FYI, with the exception of ZA161YG1, I don't think you have any actual hard drives failing. Due to this analysis I would focus less on running the SMART Extended/Long tests and more on scrubs. Don't get me wrong, you need to get at least one good Extended Test for each drive, but after that I'd forget about it and concentrate on the data failures.
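As a rough illustration of "a value that should be zero but isn't", here is a small Python sketch that scans a saved `smartctl -A` attribute table. The choice of watched attribute IDs is an assumption made for this example (it is not the list from the guide mentioned above), and the sample rows are invented.

```python
# Attribute IDs whose raw value is normally expected to stay at zero.
# ASSUMPTION: this selection is only an example, not an authoritative list.
WATCHED_IDS = {
    5: "Reallocated_Sector_Ct",
    10: "Spin_Retry_Count",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
    199: "UDMA_CRC_Error_Count",
}


def nonzero_watched_attributes(smart_attributes_text):
    """Return {attribute_id: raw_value} for watched attributes that are not zero."""
    findings = {}
    for line in smart_attributes_text.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a 10-column attribute row.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, raw = int(fields[0]), fields[9]
        if attr_id in WATCHED_IDS and raw.isdigit() and int(raw) != 0:
            findings[attr_id] = int(raw)
    return findings


# Illustrative rows only (hypothetical values).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always   -           123456789
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always   -           0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always   -           37
"""

if __name__ == "__main__":
    for attr_id, raw in nonzero_watched_attributes(SAMPLE).items():
        print(f"ID {attr_id} {WATCHED_IDS[attr_id]} raw value = {raw}")
```

The large Raw_Read_Error_Rate number is deliberately ignored here, in line with the advice above; only the watched counters are reported.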

I hope you don't spend all of your Saturday troubleshooting this problem, go enjoy the weekend.

rogerh · Oct 14, 2017

Can I just support the idea of changing all your SATA cables if you haven't done so? It's just possible you got a bunch of counterfeit or defective cables, and this is one thing that could cause most of your symptoms. A bit expensive, but you only have to get some name brand, not especially expensive ones.

farmerpling2 · Oct 14, 2017

Bidule0hm said:
@farmerpling2 Are those numbers from the web or are they real measurements?

Those are from company literature. As I mentioned, Seagate actually shows a strip chart of startup power usage on 5V/12V for their drives. I know the other companies do it also, but getting your hands on it is much harder.

wblock · Oct 15, 2017

chipped said:
Download Parted Magic and boot from that, use the Disk Health app, double click each drive individually and check if any of the drives have attributes that are Pre Failure or near end of life.

Also check the Error Log for each drive and see if there are any.

There is no need to download anything. FreeNAS already handles this SMART data reporting, both within the GUI and from the command line.

arameen · Oct 15, 2017

wblock said:
There is no need to download anything. FreeNAS already handles this SMART data reporting, both within the GUI and from the command line.

Well, that program is not even free :) unlike FreeNAS

arameen · Oct 15, 2017

I did not do the tests I suggested in my previous post using my main pool and the HBA. As joeschmuck mentioned that it could hurt my healthy pool, there was no need to risk that and complicate this further :) I had thought that as long as the pool is not written to, it wouldn't be hurt by a faulty HBA, if the HBA is in fact the issue.

I have now completed long tests on all the main pool's drives, with the HBA connected but without the drives of the second pool. No long tests were interrupted and all results are still OK.
I wrote down all the test results, and from now on I will start saving them to keep an eye on rising fault counts for all drives. I have seen some posts about scripts to save the results, but so far I haven't found one that emails the complete SMART output every time a test is done :) ?

Then I connected all drives and fans and booted Ultimate Boot CD from USB. It took some time because there was some stupid, hard-to-find BIOS setting on the Supermicro motherboard about booting from USB.
I ran a CPU stress test for 1.5 hours and the temperature did not rise above 66 degrees Celsius at most, and I use the cooler shipped with the CPU. I consider the temperature OK.
The system did not crash or restart during this period.
And as mentioned, this was done with everything connected as suggested. I assume 1.5 hours is OK for the CPU stress test ?!

What I also did was move the drives of the second pool around internally. Within the same pool, I swapped the 4 drives previously connected to the HBA with 4 connected to the motherboard. That way I can maybe see whether any issues follow a drive or stay with the HBA on the newly positioned drives. ZA161YG1 is one of the drives I moved from the HBA to a motherboard connection.
FreeNAS is up and running now with both pools. The second one is resilvering, as usual.
I will try to run long tests on all those drives again, considering that last time some tests were interrupted and we didn't know why.

What is new and surprising now is:
Under the GUI's volume status tab for the troubled pool it shows this:
Resilver
Status: Completed
Errors: 1236812 Date: Sun Oct 15 23:38:36 2017

And if I check zpool status I get an almost never-ending scrolling output of corrupted files.

But what I don't get is how this changed so fast, from having a few errors to having so many errors as of now.

Surely it's time to destroy this troubled, never-healing pool, create a new one with the same specifications, fill it with some data,
and do some new tests to try to figure out what hardware is malfunctioning?

By the way.
MemTest was done around a week ago, when these issues started. That was one of the first things I did :)
And it was many passes, more than 6 passes. Don't remember exactly how many I did. But it passed all without errors.

joeschmuck · Oct 16, 2017

arameen said:
I have seen some posts about scripts to save the results, but so far I haven't found one that emails the complete SMART output every time a test is done :) ?

Use this link to get started.
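For a rough idea of what such a script does, here is a minimal Python sketch that only builds the report: a dated file name for saving the output and an email message carrying the complete SMART text. The function names, addresses and serial are placeholders invented for this example, and it assumes you already captured the output of `smartctl -a` for the drive as a string.

```python
from datetime import datetime
from email.message import EmailMessage


def report_filename(serial, when):
    """Dated file name so successive reports for one drive can be compared."""
    return f"smart_{serial}_{when:%Y%m%d_%H%M}.txt"


def build_smart_report(serial, smart_output, sender, recipient, when=None):
    """Wrap the complete SMART output of one drive in an email message."""
    when = when or datetime.now()
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = f"SMART report for {serial} ({when:%Y-%m-%d %H:%M})"
    message.set_content(smart_output)
    return message


if __name__ == "__main__":
    # Placeholder values; replace with your own drive serial and addresses.
    msg = build_smart_report(
        "EXAMPLE0001",
        "(full smartctl output would go here)\n",
        "nas@example.com",
        "admin@example.com",
    )
    print(msg["Subject"])
```

Actually sending the message is left out on purpose. With Python's standard library that would be `smtplib.SMTP(host).send_message(msg)`, where the mail server host is something only you know.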

arameen said:
assume 1.5 hours is OK for the CPU stress test ?!

Yes, great results.

I still say that you need to replace the ZA161YG1 drive. Didn't you previously replace a few drives that you thought were failing? Did those really fail?

arameen · Oct 16, 2017

New update

It started with FreeNAS rebooting by itself twice this evening, not sure why.
My first guess was the failing drive ZA161YG1.
I have noticed before that once there is a serious issue with a drive, FreeNAS never finishes booting and gives those messages about CAM status and retrying command ...
This happened now and FreeNAS never finished booting.
FreeNAS boots fast and it's hard to see the serial of the drive it was complaining about, so I had to record my IPMI screen :) (I don't have a list of dev names and corresponding hard drive serials when FreeNAS is not up and running.)
Apparently it wasn't ZA161YG1 it was complaining about; it was S3019679.
I disconnected S3019679 to see if FreeNAS would boot, and it did.
I can add that neither of these hard drives is connected to the HBA. They are connected directly to the motherboard.
One of the messages I got about S3019679 was:

Code:

The primary GPT table is corrupt or invalid.
using the secondary instead -- recovery strongly advised

Not sure how this fits into the bigger picture.
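On the point above about not having a list of device names and their serial numbers: such a list can be put together while the system is up and kept for later. The sketch below is Python and only does the parsing. It assumes you collected the output of `smartctl -i` for each device into a dictionary; the device names and serials in the sample are placeholders, not the real mapping of this system.

```python
import re

# smartctl -i prints a line of the form "Serial Number:    XXXXXXXX".
SERIAL_LINE = re.compile(r"^Serial Number:\s*(\S+)", re.MULTILINE)


def serial_from_info(info_text):
    """Extract the serial number from one drive's `smartctl -i` output."""
    match = SERIAL_LINE.search(info_text)
    return match.group(1) if match else None


def device_serial_map(info_by_device):
    """Map each device name to the serial number found in its info text."""
    return {dev: serial_from_info(text) for dev, text in info_by_device.items()}


# Placeholder data for illustration only.
SAMPLE_INFO = {
    "ada0": "Device Model:     EXAMPLE MODEL\nSerial Number:    EXAMPLE0001\n",
    "da3": "Device Model:     EXAMPLE MODEL\nSerial Number:    EXAMPLE0002\n",
}

if __name__ == "__main__":
    for dev, serial in sorted(device_serial_map(SAMPLE_INFO).items()):
        print(f"{dev}: {serial}")
```

Keep in mind that device names can change between boots, for example after moving a drive to a different port, so a saved list is only a hint and the serial number stays the reliable identifier.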

Regardless. I turned off the machine.
- Replaced ZA161YG1 with a new drive (going to test ZA161YG1 in my client machine so I can RMA it)
- Destroyed the troubled pool
- Was going to create a new pool with the same specifications and drives (except for ZA161YG1, which is replaced now) BUT one drive connected to the HBA was missing.
I was supposed to have 11 drives, 4 of them connected to the HBA.
But what I got was 10 drives; only 3 of the 4 HBA-connected drives were detected/visible in the GUI :S

So seconds earlier, while the troubled pool was still there, I could see all 11 drives, but the pool was failing.
Once I destroyed the pool to create a new one with the same specification, there were only 10 drives. 1 of the 4 drives connected to the HBA was not there, neither in the GUI drive list nor under the volume manager.
Either it tells us something about the HBA,
or it tells us something about the drive that is not showing up. OK SMART results are no guarantee that a drive is not failing.

And moments ago FreeNAS emailed me warning about another drive, Z301NHXV, Failed SMART usage Attribute: 10 Spin_Retry_Count

So for now I will just run long tests on the remaining 10 drives and see if any of them are interrupted like before without us knowing why.
Any opinions so far?

((By the way, I was so close to finding the big problem last night.
I was looking at my HBA and guess what I saw?
It was broken, physically broken. I could see it clearly. I thought, at last I have figured this out.
Suddenly I woke up and realized I was just dreaming))

I guess I am thinking hard about this, and it's being a pain in the ass, haunting me in my dreams :)
Or my dreams were telling me: Luke, it is the HBA! Use the force if you are not sure.
Anyway, I am happy to be getting a lot of help on the forum and confident I will find the problem IRL with your help, guys.

dirkme · Oct 16, 2017

I had bad cabling and had the same issue. If you download your encryption key and your config, you can start a USB stick installation from scratch; it recognizes your pool, you upload the encryption key, and you are up and running again.

To me, FreeNAS is absolutely rock solid and a great NAS system, the best actually. Even if your PC breaks down, put the USB sticks with FreeNAS on them in a new PC and you are up and running again.

joeschmuck · Oct 17, 2017

For the issue with only 10 of the 11 drives being seen, I would locate the drive which is not seen and swap its SATA connector with a different drive's. When you start back up again, do you still have a drive issue, and if yes, is it the same drive (by serial number) or a different drive? Does it follow the SATA cable?
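The swap test above boils down to comparing which serial numbers are missing before and after the swap. A minimal Python sketch of that reasoning follows; the function names, result labels and serials are made up for illustration.

```python
def missing_serials(expected, detected):
    """Serial numbers that should be present but were not detected."""
    return sorted(set(expected) - set(detected))


def fault_follows(missing_before, missing_after, swap_partner):
    """Interpret a connector swap between two drives.

    missing_before: serial of the drive that was not seen before the swap
    missing_after:  list of serials not seen after the swap
    swap_partner:   serial of the drive whose connector was exchanged with it
    """
    if not missing_after:
        return "cleared"        # everything is visible again
    if missing_before in missing_after:
        return "drive"          # same drive still missing on the other cable
    if swap_partner in missing_after:
        return "cable_or_port"  # the problem stayed with the connector
    return "inconclusive"       # some other drive dropped out


if __name__ == "__main__":
    expected = ["DRIVE_A", "DRIVE_B", "DRIVE_C"]  # placeholder serials
    after_swap = missing_serials(expected, ["DRIVE_A", "DRIVE_C"])
    print(fault_follows("DRIVE_A", after_swap, "DRIVE_B"))
```

A "cleared" result is the least informative one, since simply reseating a connector can hide an intermittent fault for a while.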

If all of your drives have had the SMART Extended test run on them this week and there were no issues with it, I'd stop running the long test; it's wasting your time at this point. It was good that you noticed that a drive was not being recognized.

arameen · Oct 17, 2017

dirkme said:
I had bad cabling and had the same issue. If you download your encryption key and your config, you can start a USB stick installation from scratch; it recognizes your pool, you upload the encryption key, and you are up and running again.

To me, FreeNAS is absolutely rock solid and a great NAS system, the best actually. Even if your PC breaks down, put the USB sticks with FreeNAS on them in a new PC and you are up and running again.

Neither pool is encrypted, so I'm not sure what kind of encryption key you are talking about here.
Well, the cable was replaced not so long ago.

arameen · Oct 17, 2017

joeschmuck said:
Use this link to get started.

Yes, great results.

I still say that you need to replace the ZA161YG1 drive. Didn't you previously replace a few drives that you thought were failing? Did those really fail?

Yes I did, but I guess I will never find out if those drives had any issues. I cannot put them back.

arameen · Oct 17, 2017

joeschmuck said:
For the issue with only 10 of the 11 drives being seen, I would locate the drive which is not seen and swap its SATA connector with a different drive's. When you start back up again, do you still have a drive issue, and if yes, is it the same drive (by serial number) or a different drive? Does it follow the SATA cable?

If all of your drives have had the SMART Extended test run on them this week and there were no issues with it, I'd stop running the long test; it's wasting your time at this point. It was good that you noticed that a drive was not being recognized.

I did create a new test pool using all 11 drives.

There are 2 drives currently having some kind of issue. I have noted those drives by serial, of course.
Neither of them finishes its long SMART test: "Interrupted (host reset)".
One of the 2 drives was being reported as admin removed without me touching anything, and was sometimes reported as zero in size, so I could not create a pool with it.
And sometimes it's just in a normal state.
The other drive sometimes shows up and then disappears. As of now that drive is not visible at all.
However, I am glad I got a window with all drives online and managed to create the test pool with all 11 drives.
So all 11 drives are being used now in a test pool, and the pool is now degraded because one of the troubled drives is not visible.
I also connected one of those 2 drives to the motherboard; the issues followed the drive and do not seem related to the HBA.
I will connect the other one to the motherboard later as well.
But as it looks now, the problem points at 2 drives, even if I am not yet ready to consider the HBA cleared. As I wrote, there could be more than one issue.
The HBA has 2 other drives connected to it, and those seem to work, at least with the test pool created.
Just to point it out, I already swapped the SATA connections on both drives without any improvement.
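To keep track of which drives report the "Interrupted (host reset)" status mentioned above, the self-test log from `smartctl -l selftest` can be scanned instead of read by eye. The Python sketch below assumes the usual column layout of that log; the two sample rows are invented for illustration and are not from these drives.

```python
import re

# One row of the self-test log: "# 1  <test>  <status>  <remaining>%  <hours>  <LBA>"
LOG_ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+"
    r"(?P<test>.+?)\s{2,}"
    r"(?P<status>.+?)\s+"
    r"(?P<remaining>\d+)%\s+"
    r"(?P<hours>\d+)\s+"
    r"(?P<lba>\S+)\s*$"
)


def interrupted_tests(selftest_log_text):
    """Return (number, test, status) for every self-test that was interrupted."""
    found = []
    for line in selftest_log_text.splitlines():
        match = LOG_ROW.match(line.strip())
        if match and match.group("status").startswith("Interrupted"):
            found.append((int(match.group("num")), match.group("test"), match.group("status")))
    return found


# Illustrative log rows only (hypothetical values).
SAMPLE_LOG = """\
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Interrupted (host reset)      90%      1234         -
# 2  Short offline       Completed without error       00%      1200         -
"""

if __name__ == "__main__":
    for num, test, status in interrupted_tests(SAMPLE_LOG):
        print(f"test #{num}: {test}: {status}")
```

Combined with the Power_Cycle_Count check discussed earlier in the thread, this would show whether an interruption lines up with a power cycle on the same drive.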

But with the pool empty and no real activity, it is hard to say whether the remaining 2 HBA-connected hard drives work.
So now it's time to do some tests, but I'm not sure what kind of tests to do.
Should I fill the pool with lots of data and do scrubs?
The size of the test pool is 23TB, so filling it even halfway will take days. But maybe there is no need to fill it that much to be able to do some more tests and see how the drives react?

And as of now, I am out of replacement drives, so I cannot replace those 2 drives to confirm they are the problem.
I got several new drives recently. I am waiting for 2 more, but those will not arrive until 10 days from now at the earliest.
But since the test pool is there, surely I can do lots of tests without having to wait for the replacement drives :) ?

joeschmuck · Oct 17, 2017

I'd rebuild the test pool using only the 9 good drives and place most of those drives on the HBA, then start testing. Do not test using an 11 drive pool with two drives removed, while it may yield some result, it likely wouldn't be definitive.

arameen · Oct 17, 2017

joeschmuck said:
I'd rebuild the test pool using only the 9 good drives and place most of those drives on the HBA, then start testing. Do not test using an 11 drive pool with two drives removed, while it may yield some result, it likely wouldn't be definitive.

Well, considering the test pool is a raidz3, I assumed losing those 2 drives wouldn't affect the tests?

Why would you use the 9 drives that are OK instead?
Is your strategy to first make sure that the HBA is working,
and later move on to testing the other 2 drives?
And when we say test, is it filling the pool with data, doing scrubs and some more reads and writes?

wblock · Oct 17, 2017

This thread has seven pages, most of which have nothing to do with the title. Is there a reason to keep it open, knowing it will not be read by most people who could benefit from it? Can we create a new thread with a title that has something about drive testing, say?

arameen · Oct 17, 2017

wblock said:
This thread has seven pages, most of which have nothing to do with the title. Is there a reason to keep it open, knowing it will not be read by most people who could benefit from it? Can we create a new thread with a title that has something about drive testing, say?

Well, we have been investigating different parts of the system throughout the thread; we just got back to checking the drives and HBA.
As long as I get help or advice, I don't care where the thread is located or what name it has.

wblock · Oct 17, 2017

arameen said:
As long as I get help or advice, I don't care where the thread is located or what name it has.

Well, sure. But the point of these forums is to help lots of people. And if useful information is hidden in threads with misleading subject names, that only helps you. Please start a new thread with a title that is about what it contains. That will also get more and better answers, as people interested in that topic will look at it rather than skipping over it because the title is not about anything that interests them.

arameen · Oct 17, 2017

wblock said:
Well, sure. But the point of these forums is to help lots of people. And if useful information is hidden in threads with misleading subject names, that only helps you. Please start a new thread with a title that is about what it contains. That will also get more and better answers, as people interested in that topic will look at it rather than skipping over it because the title is not about anything that interests them.

By "I don't care" I meant it's OK if you want to rename the thread :)
By no means do I want to be the only one getting help.
I am just so surprised and happy that I got lots of help in this thread.
I will create a new thread as you suggest, with a name related to the hard drive testing, and link to it from this thread.
And I will even point back to this one from there as a kind of background and detailed story; I guess that is OK?
