arameen
Contributor
- Joined
- Sep 4, 2014
- Messages
- 145
Thanks for the input so far, guys. Lots of good input. I am using every piece of advice and not omitting any suggestion :D
I will use your plan, joeschmuck, with some additional steps of my own to make sure the HBA is not the issue and to try to figure out why the SMART long tests are sometimes interrupted by the host.
I already disconnected the troubled pool and reconnected the main, healthy pool. It's up and running.
I also put the HBA back in the case and connected one drive from the healthy pool to it. So 4 drives are connected to the motherboard while 1 drive is connected to the HBA. This drive is being powered from a different rail and its SATA cable goes to the HBA.
MEMtest has been run before for more than 6 passes on the ECC memory and there was no issue.
Next I will do the CPU test, just to make sure.
After the CPU test I will run long tests on the healthy main pool. I want to see whether those tests are interrupted at all, or interrupted only for the one drive that is connected to the HBA. We discussed earlier that some long tests are interrupted by the system, and we never found out why.
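A minimal sketch of how interrupted long tests could be spotted per drive, assuming the self-test log comes from `smartctl -l selftest` and has the usual column layout. The sample log values and the device path in the usage note are made up for illustration, not taken from this server.

```python
import re
import subprocess

# Illustrative sample of `smartctl -l selftest` output (values are invented).
SAMPLE_LOG = """\
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Interrupted (host reset)      90%     12345         -
# 2  Short offline       Completed without error       00%     12340         -
# 3  Extended offline    Completed without error       00%     12300         -
"""

# One log row: "# N  <test>  <status>  NN%  <hours>  <lba>"
ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+"
    r"(?P<test>.+?)\s{2,}"
    r"(?P<status>.+?)\s+"
    r"(?P<remaining>\d+)%\s+"
    r"(?P<hours>\d+)\s+"
    r"(?P<lba>\S+)\s*$"
)


def interrupted_tests(log_text):
    """Return (num, test, status, power_on_hours) for every interrupted self-test."""
    hits = []
    for line in log_text.splitlines():
        m = ROW.match(line)
        if m and m.group("status").lower().startswith("interrupted"):
            hits.append(
                (int(m.group("num")), m.group("test"),
                 m.group("status"), int(m.group("hours")))
            )
    return hits


def read_selftest_log(device):
    """Fetch the self-test log for one drive (requires smartctl and root)."""
    # smartctl uses a bitmask exit status, so a nonzero code is not treated as fatal.
    result = subprocess.run(
        ["smartctl", "-l", "selftest", device],
        capture_output=True, text=True,
    )
    return result.stdout


if __name__ == "__main__":
    for row in interrupted_tests(SAMPLE_LOG):
        print(row)
```

On the server itself, `interrupted_tests(read_selftest_log("/dev/ada0"))` could be run for each drive (the device path is a placeholder). Comparing the results for the HBA-connected drive against the motherboard-connected ones would show whether the interruptions follow the controller.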
This way I will be doing 2 tests at the same time: the HBA and the interrupted long tests. The reason I combine them (with the HBA connected and in use) is that I don't want this troubleshooting to take months.
If, however, tests are interrupted, I will remove the HBA again and repeat the tests with the motherboard only.
Next I will connect another drive to the HBA, making it 2 HBA-connected drives and 3 motherboard-connected drives. I will also make sure to power those 2 HBA-connected drives from the same power cables that were used for the disks of the troubled pool (to make sure this is not somehow related to the power extenders). Or I may connect 4 drives to the HBA, to make sure every channel and power connection on the HBA is working. Hopefully that is not risky and will not corrupt the main pool if anything is wrong with the HBA?
Then I will run new long tests to see if any are interrupted. Hopefully none are. Depending on how long this takes, I may do a scrub as a final test.
Then I can assume that the HBA is not the problem, and neither is the motherboard or CPU: the healthy pool is working with the HBA and no long tests are interrupted. However, if anything is interrupted, then the tests above have to be broken into smaller tests, which will take much longer. I hope I can avoid that situation and a lot more downtime.
Something else I will be testing is moving the USB sticks away from the USB ports on the front of the case whenever there is an issue, and using the USB ports on the motherboard instead. Now I want to ask: is there a way to know which USB stick is faulty when you have 2 sticks of the same brand and size? I searched many times for an answer but found nothing, so for now I am using a workaround. But maybe there is a way to know after all, so that I don't risk removing the healthy one or need workarounds.
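One possible approach, sketched under assumptions: if the two sticks form a mirrored ZFS boot pool, `zpool status` for that pool lists each stick as its own device (for example `da0p2` and `da1p2`) with separate error counters. The pool name, device names and counts in the sample below are hypothetical.

```python
# Illustrative `zpool status` output for a mirrored boot pool (values are invented).
SAMPLE_STATUS = """\
  pool: freenas-boot
 state: DEGRADED
config:

    NAME            STATE     READ WRITE CKSUM
    freenas-boot    DEGRADED     0     0     0
      mirror-0      DEGRADED     0     0     0
        da0p2       ONLINE       0     0     0
        da1p2       DEGRADED     0     0    12  too many errors

errors: No known data errors
"""


def devices_with_errors(status_text):
    """Return leaf devices that are not ONLINE or have nonzero error counters."""
    rows = []
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_table = True          # device table starts after the header row
            continue
        if in_table and not fields:
            in_table = False         # a blank line ends the device table
            continue
        if in_table and len(fields) >= 5:
            indent = len(line) - len(line.lstrip())
            rows.append((indent, tuple(fields[:5])))

    bad = []
    for i, (indent, row) in enumerate(rows):
        # A leaf device is a row that is not followed by a deeper-indented row.
        is_leaf = i + 1 == len(rows) or rows[i + 1][0] <= indent
        name, state, read, write, cksum = row
        if is_leaf and (state != "ONLINE" or (read, write, cksum) != ("0", "0", "0")):
            bad.append(row)
    return bad


if __name__ == "__main__":
    for row in devices_with_errors(SAMPLE_STATUS):
        print(row)
```

This only gives the device name. To match it to a physical stick, the serial number may help: on FreeBSD, `camcontrol inquiry da1 -S` should print it, provided the stick reports one, and two sticks of the same brand and size normally still have different serials.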
I will do a second and final scrub of the main pool before assuming that the USB is OK, the HBA is OK and the PSU is OK with 5 drives. At the same time I will put additional load on the server by copying files, to make sure I can proceed with the next steps.
If all the tests above pass, I will skip the step of installing FreeNAS on an SSD and continue with step 8 of joeschmuck's plan instead, after dropping the troubled pool and creating a new one :)