Pool problem, 100% disk usage

Status
Not open for further replies.

skmattias

Cadet
Joined
Dec 29, 2017
Messages
5
Hello!

I've got a problem with my FreeNAS storage. I'm running FreeNAS virtualized on an ESXi box. I've given it 30 GB of RAM and 2 cores from my Xeon E5-2620 v2. It's got access to one of the motherboard's (Supermicro X9SRE-3F) storage units (Intel Patsburg Dual 4-port SATA/SAS Storage Control Unit) via hardware passthrough. I've got one RAIDZ1 pool on 3 WD Red 1TB hard drives.

Regularly, maybe once every few days, my storage will get painfully slow. When I check the reporting section in the web UI, I see that all disks in the pool have 100% disk usage. Screenshots below.

Screenshots: busy.png, io.png, latency.png, operations.png

100% disk usage, 1 pending IO request, about a second of latency and 1-2 operations/second. Disk I/O shows about 15 kb/s for each disk. All other graphs are unaffected. Only the drives in this pool are affected.
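As a rough sanity check on these figures (an editorial sketch, not something from the original post), Little's law ties together requests in flight, latency, and throughput. The function names below are made up for illustration, and "kb/s" is assumed to mean kilobytes per second.

```python
def expected_iops(pending_requests, latency_seconds):
    """Little's law: throughput = requests in flight / time per request."""
    if latency_seconds <= 0:
        raise ValueError("latency_seconds must be positive")
    return pending_requests / latency_seconds


def avg_request_size_kb(throughput_kb_per_s, iops):
    """Average size of one request, given throughput and operations per second."""
    if iops <= 0:
        raise ValueError("iops must be positive")
    return throughput_kb_per_s / iops


if __name__ == "__main__":
    # Figures reported in the post: 1 pending request, about 1 s of latency,
    # 1-2 operations/second, about 15 kb/s per disk (unit assumed to be KB/s).
    print(expected_iops(1, 1.0))          # ops/s predicted from queue depth and latency
    print(avg_request_size_kb(15, 2))     # KB per request at 2 ops/s
    print(avg_request_size_kb(15, 1))     # KB per request at 1 op/s
```

With one request in flight and roughly a second per request, about one operation per second is all a disk can complete, which matches the reported 1-2 operations/second. Read this way, the graphs suggest that individual small requests are stalling rather than that the disks are under heavy load, so "100% busy" most likely means a request is always outstanding.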

It seems to go away if I disconnect all clients from the share, but as soon as someone connects, it comes back. The only way to fix it seems to be to pull the plug on the host and then start it up again.

I've had the problem for a few months. Last week, I decided to reinstall FreeNAS and start over from scratch. When I imported the volume, however, the problem was still there. I then erased the drives and created a brand new pool altogether. I thought that I had it solved, but this morning it happened again.

Does anyone have any idea about what it could be?
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
What version of FreeNAS are you running?
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
Looks like it is writes, so the problem would have to be sync writes.
 

Xelas

Explorer
Joined
Sep 10, 2013
Messages
97
Did you run a long SMART test to confirm that it isn't one of the drives dying?
 

skmattias

Cadet
Joined
Dec 29, 2017
Messages
5
rs225 said: "Looks like it is writes, so the problem would have to be sync writes."
It's the same thing with reads. Seconds of disk latency and 100% disk usage for any small thing. The load just happened to be writes when I took the screenshots.
Xelas said: "Did you run a long SMART test to confirm that it isn't one of the drives dying?"
Just ran short SMART tests on all drives with no problems. However, when I started the long SMART tests, the problem reemerged about halfway through and the tests couldn't even complete. Would the problem disappear after restarting the host if a drive were dying, though?

Since the problem disappears for a while when I restart the host, I'm beginning to think that it's got something to do with the HBA and the hardware passthrough.
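For reference, a minimal sketch of how the long SMART tests discussed here could be started and checked with `smartctl` from smartmontools. The device names are assumptions (they may be `da0` and so on, depending on the controller), and the helper names are hypothetical. The script only prints the commands by default; running them needs root.

```python
import subprocess
from typing import Dict, List

# Assumed device names; check the actual ones on your system first.
DEVICES = ["/dev/ada0", "/dev/ada1", "/dev/ada2"]


def smart_commands(device: str) -> Dict[str, List[str]]:
    """Build (but do not run) the smartctl invocations for one drive."""
    return {
        "start_long_test": ["smartctl", "-t", "long", device],
        "self_test_log": ["smartctl", "-l", "selftest", device],
        "attributes": ["smartctl", "-A", device],
    }


def run(cmd: List[str]) -> str:
    """Run one command and return its stdout; needs root and smartmontools."""
    return subprocess.run(cmd, capture_output=True, text=True, check=False).stdout


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for dev in DEVICES:
        for name, cmd in smart_commands(dev).items():
            print(name, " ".join(cmd))
```

The long test runs inside the drive, so the usual pattern is to start it, wait for the estimated duration that `smartctl` reports, and then read the self-test log. A test that aborts or never finishes, as described above, is itself a useful data point.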
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Do these drives have important data or can we do destructive testing?

I would test the drives with a full / long format to write every sector, followed by a long SMART test, and do it from bare metal to ensure that the hardware passthrough is not part of the puzzle.

That's going to tell us if both the drives and the controller are working properly. If this works, it should be repeatable from inside the VM. If you can't do it inside the VM, it might be that the controller is not being passed through properly.
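The reasoning in this test plan can be sketched as a small decision function. This is only an illustration: the function name and the wording of the results are made up here, and the case where both runs pass is not covered in the post, so it is marked as inconclusive.

```python
def interpret(bare_metal_ok: bool, vm_ok: bool) -> str:
    """Map the two test outcomes to the most likely suspect.

    bare_metal_ok: full write plus long SMART test completed on bare metal.
    vm_ok: the same test completed inside the FreeNAS VM.
    """
    if not bare_metal_ok:
        # Fails even without the hypervisor: hardware is the suspect.
        return "drives or controller suspect"
    if not vm_ok:
        # Works on bare metal but not in the VM: passthrough is the suspect.
        return "passthrough suspect"
    # Assumption: both passing proves little for an intermittent problem.
    return "inconclusive"


if __name__ == "__main__":
    print(interpret(bare_metal_ok=True, vm_ok=False))
```

Because the slowdown only shows up every few days, a single passing run in either environment is weaker evidence than a failing one.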


jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
You'll want to ditch the hypervisor and see if you can duplicate the problem on the bare metal. You don't have an HBA if you are using the Pats SCU controller, and the SCUs have been a bit problematic in the past for this sort of abuse. I'd wager that you'll either need to ditch the hypervisor or add a known-to-work-with-virtualization 9211-8i HBA to resolve, but what do I know. :)
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
jgreco said: "You'll want to ditch the hypervisor and see if you can duplicate the problem on the bare metal. You don't have an HBA if you are using the Pats SCU controller, and the SCUs have been a bit problematic in the past for this sort of abuse. I'd wager that you'll either need to ditch the hypervisor or add a known-to-work-with-virtualization 9211-8i HBA to resolve, but what do I know. :)"
That's what I thought.

 

skmattias

Cadet
Joined
Dec 29, 2017
Messages
5
jgreco said: "You'll want to ditch the hypervisor and see if you can duplicate the problem on the bare metal. You don't have an HBA if you are using the Pats SCU controller, and the SCUs have been a bit problematic in the past for this sort of abuse. I'd wager that you'll either need to ditch the hypervisor or add a known-to-work-with-virtualization 9211-8i HBA to resolve, but what do I know. :)"

Yes, you're right, I'm using the built-in Patsburg SCU!

Chris Moore said: "Do these drives have important data or can we do destructive testing?

"I would test the drives with a full / long format to write every sector, followed by a long SMART test, and do it from bare metal to ensure that the hardware passthrough is not part of the puzzle.

"That's going to tell us if both the drives and the controller are working properly. If this works, it should be repeatable from inside the VM. If you can't do it inside the VM, it might be that the controller is not being passed through properly."

The drives do have important data, but not a lot. So I could just get a new drive to move the data to while testing. I'll do this some time in the upcoming days!
 

skmattias

Cadet
Joined
Dec 29, 2017
Messages
5
It turned out that the hardware passthrough of the built-in SCU was the problem. I went to eBay and got myself an LSI 9211-8i, and now everything seems to be working flawlessly!
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
skmattias said: "It turned out that the hardware passthrough of the built-in SCU was the problem. I went to eBay and got myself an LSI 9211-8i, and now everything seems to be working flawlessly!"

jgreco said: "I'd wager that you'll either need to ditch the hypervisor or add a known-to-work-with-virtualization 9211-8i HBA to resolve, but what do I know. :)"

"Do I win a prize?" ;-)
 