SOLVED Multiple drive failures

Status
Not open for further replies.

wokka

Dabbler
Joined
Aug 13, 2013
Messages
16
I had a FreeNAS 9.3 system running with 8 drives for a couple of years with no problems. A few months ago I needed more space, and the Supermicro chassis I had was maxed out at 8 drives, so I bought a Supermicro 24-slot setup. I backed up my FreeNAS install, installed a fresh 11.1 on the new system, restored the config, and moved the drives over. Everything came up nicely. I added two more drives to the array and all was good. Total of 10 drives, 4tb each running RAID 10. I'm booting from a USB thumb drive.

A day or two later, I started seeing these errors and drive failures, especially under heavy load (copying lots of files). My first instinct was heat issues in the new chassis, or hardware issues (new controller, new backplane, new mobo). Heat was the easiest to test, so I pointed a high-velocity fan at the drives, but that hasn't changed anything. The failed drive is always a different one, and taking it offline and back online (or rebooting) resolves it. This seems very similar to https://forums.freenas.org/index.php?posts/476695/

Here is an excerpt of the dmesg : https://gist.github.com/wokka1/2a49d7093613115e8cd79d535de71ba7

I came here to ask for help on troubleshooting the hardware, but after seeing the same problem from someone else, could it be something else?

I've had drives fail, we all have, but this isn't symptomatic of a drive failure. I'd expect it to always be the same drive failing, not 5 or 6 out of the 10 over the course of a week.
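
For anyone wanting to check the same thing on their own box, here is a minimal Python sketch that tallies error lines per disk from dmesg-style text. It assumes the usual FreeBSD CAM prefix of the form (daN:driver:bus:target:lun) and counts only lines containing "CAM status". The sample log lines, the mps0 controller name, and the function name are made up for illustration; they are not taken from the gist above.

```python
import re
from collections import Counter

# Matches the "(daN:driverN:bus:target:lun)" prefix on FreeBSD CAM disk messages.
CAM_PREFIX = re.compile(r"\((a?da\d+):([a-z]+\d+):\d+:\d+:\d+\)")

# Hypothetical sample lines, invented for this sketch (not from the real dmesg).
SAMPLE_LOG = """\
(da3:mps0:0:3:0): READ(10). CDB: 28 00 1a 2b 3c 40 00 01 00 00
(da3:mps0:0:3:0): CAM status: SCSI Status Error
(da7:mps0:0:7:0): WRITE(10). CDB: 2a 00 0c 11 22 80 00 00 80 00
(da7:mps0:0:7:0): CAM status: CCB request completed with an error
(da7:mps0:0:7:0): Retrying command
(da1:mps0:0:1:0): CAM status: CCB request completed with an error
(da7:mps0:0:7:0): CAM status: CCB request completed with an error
"""


def errors_per_device(log_text):
    """Count 'CAM status' lines per disk device name (da0, da1, ...)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = CAM_PREFIX.search(line)
        if match and "CAM status" in line:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Print devices from most to fewest errors.
    for device, count in errors_per_device(SAMPLE_LOG).most_common():
        print(device, count)
```

If the counts are spread across many disks behind the same controller rather than piled on one disk, that points away from a single bad drive and toward something shared, such as the controller, cabling, or backplane.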

Also, I was gone for 10 days on vacation and had no drives fail, so they are fine when idle. The problem only seems to appear under higher load.

The server has dual 1100w PSUs, and I only have 11 drives in it (the 11th is for a Time Machine setup). Nothing in the errors points to a PSU, and the IPMI isn't reporting any power problems. I could understand a single PSU failure, but not two at the same time.

Thanks for your help.

EDIT (TL;DR): A bad controller was causing the issues. I replaced it and there have been no more errors.
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
Please provide your full system details, including the Supermicro backplane model.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Total of 10 drives, 4tb each running RAID 10.
Can we get some model number information, not just on the drives but on the rest of the hardware in the chain between the system board and the drives.
I have seen several other users reporting similar issues and I am trying to find some link.
 

wokka

Dabbler
Joined
Aug 13, 2013
Messages
16
Supermicro 6047R-E1R24N with dual E5-2620 CPUs, 192 GB of RAM, and an LSI/Avago 9240-8i controller. The chassis has the BPN-SAS2-846EL backplane and dual 920W power supplies.

Device Model: WDC WD4002FYYZ-01B7CB0
Device Model: WDC WD4002FYYZ-01B7CB0
Device Model: WDC WD4000F9YZ-09N20L1
Device Model: WDC WD4000F9YZ-09N20L1
Device Model: WDC WD4000F9YZ-09N20L1
Device Model: WDC WD4000F9YZ-09N20L1
Device Model: ST4000NM0033-9ZM170
Device Model: WDC WD4002FYYZ-01B7CB1
Device Model: ST4000NM0033-9ZM170
Device Model: ST4000NM0033-9ZM170
Device Model: ST8000NM0055-1RM112
 

wokka

Dabbler
Joined
Aug 13, 2013
Messages
16
Can we get some model number information, not just on the drives but on the rest of the hardware in the chain between the system board and the drives.
I have seen several other users reporting similar issues and I am trying to find some link.

Added in the requested info. Thanks!
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
LSI/Avago 9240-8i controller.
Did you check what firmware it is running? Also, that is a PCIe 2.0 card, and the system board is able to support a PCIe 3.0 card, so you might want to upgrade to a more modern controller. I am kind of suspecting a controller issue, so swapping it would also be a good troubleshooting step. This is the kind I use and I run two 24 slot backplanes from it with no trouble:
https://www.ebay.com/itm/HP-H220-6G...0-IT-Mode-for-ZFS-FreeNAS-unRAID/162862201664

A few years ago, I had a SAS controller that started giving me trouble after it overheated. It never worked properly again, even with plenty of airflow, so if this one got too hot, adding air afterwards isn't going to bring it back. It could have cooked internally. Did you slow the fans on the 24 bay chassis to make it quiet? That is what I did. After I replaced the SAS controller, I added a fan that blows directly on the SAS card and all has been good since.
 

wokka

Dabbler
Joined
Aug 13, 2013
Messages
16
Interesting. My controller has the latest firmware (I don't have the version handy without rebooting); I put the latest IT firmware on it when I built the system. I had the fans in Standard speed mode (not quiet) and just today put them into Heavy I/O mode, but I can also do Full Speed mode if needed (the noise doesn't matter).

I'll order another controller to see if that helps any.
 

wokka

Dabbler
Joined
Aug 13, 2013
Messages
16
Update: I put my new controller in a couple of hours ago and have been trying to hammer my NAS to see if an error will appear. Knock on wood, nothing so far. Will update in a couple of days if this is resolved.
 

wokka

Dabbler
Joined
Aug 13, 2013
Messages
16
I think this is resolved. Funny enough, now that the bad controller isn't spamming my logs, I do have a drive with an unrecoverable sector showing up, so RMA'ing that.
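
In case it helps someone else spot a genuinely bad drive once the log noise is gone, here is a small Python sketch that scans the attribute table printed by `smartctl -A` for non-zero sector counters. It assumes the standard ten-column ATA attribute layout and the common attribute names Reallocated_Sector_Ct, Current_Pending_Sector, and Offline_Uncorrectable (names can differ between vendors). The sample output and the function name are invented for illustration and are not from my drive.

```python
# Attribute names commonly associated with bad or unreadable sectors.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}

# Hypothetical `smartctl -A` excerpt, invented for this sketch.
SAMPLE_SMART = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   062   062   000    Old_age   Always       -       28104
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       1
"""


def nonzero_sector_attributes(smartctl_output):
    """Return {attribute_name: raw_value} for watched attributes with raw_value > 0."""
    found = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have 10 columns and start with a numeric ID.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            raw = fields[9]
            if raw.isdigit() and int(raw) > 0:
                found[fields[1]] = int(raw)
    return found


if __name__ == "__main__":
    for name, raw in nonzero_sector_attributes(SAMPLE_SMART).items():
        print(name, raw)
```

Unlike the controller errors, a non-zero count here stays tied to one specific disk, which is what an actual drive problem looks like.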
 