Weird disk IO performance issue

Status
Not open for further replies.

zfrogz

Dabbler
Joined
Jun 29, 2012
Messages
43
Setup:
FreeNAS 9.2.1.8
ASRock C2550D4I Motherboard
32GB ECC RAM
6 x WD60EFRX 6TB Drive w/ 68MYMN1 Firmware -> RAIDZ2

I'm having a strange problem that I'd love some help solving. I recently picked up a bunch of the new Western Digital 6TB Red drives and have been replacing the drives in an existing pool one at a time. Everything went great for the first 5 drives. The 6th one started off well and was resilvering faster than any of the previous drives, now that all six faster drives are in place. The issue is that after the first 20 minutes or so, throughput drops from about 100MB/s to about 10MB/s. If I reboot, performance starts off good but then drops again after a few minutes.
[Attached image: 5e248f4f8b6d5704445457a81f34e2195dc67770.jpg]


No obvious errors in FreeNAS.

Ran short and long SMART tests on the drives. All are clean.

I did a full format on the last drive to see if it would slow down after a few minutes like it does with FreeNAS. No problems there.

I've tried swapping out all of the SATA cables but no change in behavior.

I ended up swapping that last drive back to one of the old 3TB drives, then pulled a different WD60EFRX out of the pool and replaced it with the one that was giving me trouble. No issue. Then I took the known-good WD60EFRX that I had pulled and used it to replace the 3TB drive, and the issue returned.

Everything seems to be in perfect health. I'm not sure if there is an issue with this drive/firmware and FreeNAS or something else but I know there are other users with these (like Cyberjock - although older firmware).

Are there any logs I can dig into that might explain what is happening when the disk IO takes a nose dive? It feels like FreeNAS is deliberately choking the drives for some unknown reason but I'm not sure where to look to figure out why. Any help would be greatly appreciated.
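
One way to pin down exactly when the drop happens is to log throughput at a fixed interval (for example by recording the numbers that `zpool iostat` prints) and flag the first sample that falls well below the starting rate. The following is only a minimal sketch: the function name, the 25% threshold, and the sample readings are made up for illustration and are not taken from this system.

```python
from statistics import mean


def find_throughput_drop(samples, baseline_count=5, drop_ratio=0.25):
    """Return the index of the first sample that falls below
    drop_ratio * baseline, or None if no such sample exists.

    samples        -- throughput readings in MB/s, taken at a fixed interval
    baseline_count -- how many leading samples define the "healthy" rate
    drop_ratio     -- fraction of the baseline that counts as a drop
    """
    if len(samples) <= baseline_count:
        return None
    baseline = mean(samples[:baseline_count])
    threshold = baseline * drop_ratio
    for i in range(baseline_count, len(samples)):
        if samples[i] < threshold:
            return i
    return None


# Hypothetical readings: roughly 100 MB/s, then a collapse to roughly 10 MB/s.
readings = [98, 102, 100, 101, 99, 97, 100, 12, 10, 11]
drop_index = find_throughput_drop(readings)
print(drop_index)
```

Multiplying the returned index by the sampling interval gives a timestamp that can be compared against anything written to the system log around the same moment.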
 

zambanini

Patron
Joined
Sep 11, 2013
Messages
479
Without your hardware specs and knowing how you test your setup, we can only guess, which I dislike. Maybe it is just your ARC that has filled up, so the system needs less data from the disks.
 

JackShine

Dabbler
Joined
Nov 13, 2014
Messages
27
It's probably a motherboard issue; very few can handle 6 x 6TB.

BTW, WHAT THE HELL ARE YOU STORING?
 

zfrogz

Dabbler
Joined
Jun 29, 2012
Messages
43
I posted the relevant hardware up above but let me know if more detail is needed on any of those parts. Not doing anything fancy with it at the moment. I've tried enabling/disabling autotune - no change in disk IO behavior. Only services I have turned on are CIFS, SMART, SSH, and UPS. The CPU isn't the issue but does show the same drop when the disk IO plummets.
[Attached image: 9e245323fa13d83843583f97a2540d22ad0f102c.jpg]

I can't imagine it's a MB issue. Shouldn't be hitting any limits there. Here are the settings enabled by autotune. I've left them enabled for now but disabling them makes no difference.
[Attached image: e694fd9cd74a58391727917b839387b73b43a716.jpg]

[Attached image: a2e7ac574031f73d40f4c17d10ad3228c85a2b07.jpg]

I'm surprised that there is no log hiding somewhere that can explain what is happening. Anybody got any other ideas?
 

zfrogz

Dabbler
Joined
Jun 29, 2012
Messages
43
Also, I've tried moving some of the drives off the Intel SATA controller to the Marvell controllers. This board has 3 controllers with 12 SATA ports. I originally had the Intel controller handling all 6 drives but have since moved 2 drives to each controller in case I was hitting some sort of limit within the controller. Behavior did not change.
 

bestboy

Contributor
Joined
Jun 8, 2014
Messages
198
I have seen similar resilvering behavior growing a mirror: Reads and writes all over the place. I think it's just normal. I just let it do its thing and it turned out fine.

btw: how long does it take to resilver a 6TB drive with raidz2? 2 days?
 

zfrogz

Dabbler
Joined
Jun 29, 2012
Messages
43
Well, the first 5 drives took about 12 hours each (reading from slower 3TB drives). The last drive, which caused me to start this thread, would have taken about two days if I had just let it go at 10MB/s, but I rebooted every time I saw its performance drop, which made it take less time. It's not that it's all over the place. The issue is that it starts off great and then flatlines and never improves.

Now that all drives are in, running a scrub (read only) is going at about 600MB/s with completion time of about 6-7 hours but that's with everything idle and not accessing any shares.
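
As a rough sanity check on those figures, the arithmetic is just data = rate x time. The sketch below uses only the numbers quoted above (10MB/s for about two days, 600MB/s for 6-7 hours), assumes decimal units and a constant rate, which a real resilver or scrub will not hold, and uses helper names invented for this example.

```python
def tb_moved(rate_mb_s, hours):
    """Terabytes transferred at rate_mb_s (MB/s) over the given hours."""
    return rate_mb_s * hours * 3600 / 1_000_000


def hours_to_transfer(data_tb, rate_mb_s):
    """Hours needed to move data_tb terabytes at rate_mb_s (MB/s)."""
    return data_tb * 1_000_000 / rate_mb_s / 3600


# Two days at 10 MB/s implies roughly this much data written to the new drive.
resilver_tb = tb_moved(10, 48)

# The same amount of data at a sustained 100 MB/s.
fast_hours = hours_to_transfer(resilver_tb, 100)

# A scrub at 600 MB/s for 6.5 hours reads roughly this much in total.
scrub_tb = tb_moved(600, 6.5)

print(round(resilver_tb, 2), round(fast_hours, 1), round(scrub_tb, 2))
```

Under these assumptions the slow resilver works out to about 1.7TB written to the new drive, which would take under five hours at a sustained 100MB/s, and the scrub estimate corresponds to roughly 14TB read across the pool.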

Even though the scrub is running fast and everything seems great, I can't help but think that there is still some sort of issue (probably in the drive's firmware) that could potentially affect performance down the road. I would guess that it has something to do with the other 5 drives throwing data at the 6th faster than it can write it, so it forces the other 5 to slow down (way down). I just wish I could prove it somehow so I can call Western Digital support and open a case to have them fix their firmware. I'm assuming it's not on the FreeNAS side but haven't found a way to verify that. Any help proving or disproving my theory would be greatly appreciated.
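
One possible way to test that theory is to compare per-disk busy percentages during the slow phase (on FreeBSD, `gstat` reports a %busy figure per device). If the drive being resilvered sits near 100% busy while the other five are mostly idle, that is at least consistent with the new drive being the limit; if every disk is idle, the limit is probably somewhere else. The sketch below is only an illustration: the device names and readings are hypothetical, not measurements from this system.

```python
from statistics import mean


def busiest_device(busy_samples):
    """Return (device, mean_busy) for the device with the highest mean %busy.

    busy_samples -- dict mapping a device name to a list of %busy readings
    """
    averages = {dev: mean(vals) for dev, vals in busy_samples.items() if vals}
    device = max(averages, key=averages.get)
    return device, averages[device]


# Hypothetical %busy readings collected while the resilver is crawling.
samples = {
    "ada0": [9, 11, 10],
    "ada1": [8, 12, 10],
    "ada2": [10, 10, 10],
    "ada3": [11, 9, 10],
    "ada4": [12, 8, 10],
    "ada5": [98, 100, 99],  # assumed to be the drive being resilvered
}

device, busy = busiest_device(samples)
print(device, busy)
```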
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,996
Did you ever fix this?

I read a few things which concerned me...

You installed 6TB drives in your entire pool and then shortly after you started putting back 3TB drives. You cannot do that safely, and if your pool is not corrupt, you're one lucky SOB. Once you have added the last drive to the pool, the system will start to expand your total capacity; in your case it will double it. You already saw that resilvering takes considerable time; you need to let the system do its thing. Let it run and stop rebooting it.
 

zfrogz

Dabbler
Joined
Jun 29, 2012
Messages
43
I keep good backups so I wasn't too worried if something went wrong. I offlined and pulled the last drive before the resilvering finished so it never expanded the volume and allowed me to put a 3TB back so I could continue testing. I haven't diagnosed the performance issue yet. It appears to be working normally now but I suspect that if I pull any one of the six drives, wipe it, and resilver it, I would run into this again.

I was really hoping there was some way to see what was causing the resilvering to flatline, but it doesn't sound like there are any logs or ways to see what is triggering it, or someone would have spoken up by now. I'm just hoping it doesn't bite me down the road, because eventually a drive is going to fail and the last thing you want is to have the resilvering drag on forever (outside of controlled testing).
 

zfrogz

Dabbler
Joined
Jun 29, 2012
Messages
43
Autotune being enabled was a legacy setting from older hardware. It has since been disabled, and the config changes it made have been removed. There was no noticeable difference in behavior or performance with it on or off while I was running these tests. I do read the forums and am well aware that it has been deprecated and can cause lower performance on systems with plenty of RAM.
 