[SOLVED] One of the 30 disks in my Z2 array is behaving unlike any other.


VictorR

Contributor
Joined
Dec 9, 2015
Messages
143
I'm hoping some of you could tell me if this is normal, or give me some pointers on how to figure out what is going on.

FreeNAS-9.3-STABLE-201602031011
45 Drives Q30:
SuperMicro X10DRL
2x Intel Xeon E5-2620 v3 2.4GHz CPU
2x 120GB SSD Boot Drive
256GB RAM
2x LSI 9201 HBA
30x WD Re 4TB drives
3x Intel X540-T2 (X540T2BLK) dual-port 10GbE NIC

3x 10-drive RAIDZ2 vdevs

Everything has been great since January, when I finally found and replaced a "bad HD" that actually turned out to be a bad HBA cable. At least, I've received no error report emails and no alerts in the FreeNAS GUI.

Anyway, a casual check of the array tonight showed every disk online:

[screenshot: 844q6Vq.png]


But when I checked Reporting > Disk tonight, 29 of the disks showed the exact same I/O report, while the 30th was completely different.

[screenshot: Bo8iu6w.png]


Looking a little closer, it shows this behavior:

[screenshot: R2GRwyo.png]


A one-week view looks like this:

[screenshot: sWLd9Hj.png]


The four-week view:

[screenshot: 6U9YlBX.png]


The preceding four weeks:

[screenshot: 7mHChmt.png]


And back to when the array was brought online:

[screenshot: 6Ewfpyt.png]
 

MrToddsFriends

Documentation Browser
Joined
Jan 12, 2015
Messages
1,338
Are you sure that da30 is not a boot device (possibly holding the system dataset, too)? How about your second boot device? "2x 120GB SSD Boot Drive" suggests you are using two mirrored boot devices.
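Easy enough to check from the console, something like this (a sketch; freenas-boot is the default boot pool name on 9.3):

    # Which devices make up the boot pool?
    zpool status freenas-boot

    # Map gptid/... labels back to daX/adaX device names
    glabel status

    # Is the system dataset (.system) living on the boot pool?
    zfs list -r freenas-boot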
 

INCSlayer

Contributor
Joined
Apr 4, 2015
Messages
197
A cursory look at your vdevs: I should point out that device numbering starts at 0, so your 30 disks are da0-da29. Disk da30 is not in the pool.
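You can confirm it from the shell with something like this (the pool name tank below is a placeholder for yours):

    # Lists every disk the HBAs expose; note the numbering starts at da0
    camcontrol devlist

    # Map the pool's gptid labels back to daX device names
    glabel status

    # da30 (via its gptid) won't appear among the pool's members
    zpool status tank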
 

VictorR

Contributor
Joined
Dec 9, 2015
Messages
143
A cursory look at your vdevs: I should point out that device numbering starts at 0, so your 30 disks are da0-da29. Disk da30 is not in the pool.

Thank you, what a foolish tired newbie mistake!

[screenshot: 4d3LmGn.png]


Are you sure that da30 is not a boot device (possibly holding the system dataset, too)? How about your second boot device? "2x 120GB SSD Boot Drive" suggests you are using two mirrored boot devices.

Yes, that's likely what is going on. I need to get home and go to bed.
 

wtfR6a

Explorer
Joined
Jan 9, 2016
Messages
88
I chuckled halfway through reading that because I made that mistake myself recently, but caught it before I posted. Thanks for making me feel "normal" again! :)
 

VictorR

Contributor
Joined
Dec 9, 2015
Messages
143
I chuckled halfway through reading that because I made that mistake myself recently, but caught it before I posted. Thanks for making me feel "normal" again! :)

I am so glad to hear someone else has made that mistake! I was pretty embarrassed after having my error pointed out.

We've all made 'em. Decades doing this professionally and I did something not too terribly far off this morning myself.

jgreco, I think you also warned me months ago about not using LAGG in a small 10GbE LAN. Well, we're regularly getting wide swings in bandwidth to several of our 10 clients. Normal speeds are ~720MB/sec writes and ~600MB/sec reads. But occasionally, different clients will drop to 100MB/sec or even lower. Sometimes zero. Restarting the client sometimes solves the problem. Other times, restarting networking on the LAGG works better, if no one else is in the middle of editing.
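For anyone following along, the state of the aggregate is easy to inspect before resorting to a restart (lagg0 is an assumed interface name, and restarting netif briefly drops every connection on it):

    # Show laggproto, member ports, and which ones are ACTIVE
    ifconfig lagg0

    # Bounce just the lagg interface, not the whole network stack
    service netif restart lagg0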

It's more of an annoyance right now. But when the studio goes to full production, with 10-15 editors pulling 2K/4K/6K streams simultaneously, it will be intolerable. So, I'm going to give up on LAGG and just assign each of the six 10GbE ports its own subnet address (10.0.1.2, 10.0.2.2, 10.0.3.2, etc.) and create corresponding VLAN gateways (10.0.1.1, 10.0.2.1, 10.0.3.1, etc.) on the 24-port 10GbE Netgear XS728T switch. Each port/VLAN will get 2-3 clients.
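In shell terms, the plan would look roughly like this (a sketch only; ix0-ix5 assumes the X540 ports come up under FreeBSD's ixgbe driver, and on FreeNAS the interfaces should really be configured through the GUI so the settings persist across reboots):

    # One /24 per 10GbE port: ix0 -> 10.0.1.2, ix1 -> 10.0.2.2, ...
    i=0
    for ifc in ix0 ix1 ix2 ix3 ix4 ix5; do
        i=$((i + 1))
        ifconfig "$ifc" inet "10.0.$i.2" netmask 255.255.255.0
    done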

Does this sound reasonable? Or, am I over-thinking this?

The reason is that, depending on resolution, frame rate, and compression, raw 2K/4K/6K video streams can run up to 150MB/sec, and there can be several camera shots/streams per scene, so even three simultaneous streams from one client can approach 450MB/sec, a sizable fraction of a 10GbE link. It's not a constant bombardment of the NAS, though, since the editing software caches previews of each stream on the client.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I chuckled halfway through reading that because I made that mistake myself recently, but caught it before I posted. Thanks for making me feel "normal" again! :)

Actually, my mistake was more along the lines of: I went to test some performance numbers to show someone the difference between cache and no cache on a RAID controller. So I go to one of our old ones and get 10MB/sec for HDD and 50MB/sec for SSD, as expected. Then I go to one of our new-gen ones, with the nice LSI 3108 and supercap, and... HDD at 10MB/sec?!?!?

I was insufficiently caffeinated, so the obvious answer (that we had pulled the supercap on that unit a while back because of a crisis) didn't occur to me. So I actually get up and go look at the hypervisor to figure out what's wrong, at which point I see the replacement cable and supercap sitting there with a note for the next downtime.

Go ahead and laugh.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I think you also warned me months ago about not using LAGG in a small 10GbE LAN. Well, we're regularly getting wide swings in bandwidth to several of our 10 clients. Normal speeds are ~720MB/sec writes and ~600MB/sec reads. But occasionally, different clients will drop to 100MB/sec or even lower. Sometimes zero. Restarting the client sometimes solves the problem. Other times, restarting networking on the LAGG works better, if no one else is in the middle of editing.

It's more of an annoyance right now. But when the studio goes to full production, with 10-15 editors pulling 2K/4K/6K streams simultaneously, it will be intolerable. So, I'm going to give up on LAGG and just assign each of the six 10GbE ports its own subnet address (10.0.1.2, 10.0.2.2, 10.0.3.2, etc.) and create corresponding VLAN gateways (10.0.1.1, 10.0.2.1, 10.0.3.1, etc.) on the 24-port 10GbE Netgear XS728T switch. Each port/VLAN will get 2-3 clients.

Does this sound reasonable? Or, am I over-thinking this?

I think you may be overthinking it, but it is what it is. Sometimes things get ugly in network operations.

The good part is that you're actually able to leverage the layer 3 capabilities of that switch in a useful manner... no need to experimentally buy other hardware only to find out the problem is somewhere else.

The first bad part is that it's really difficult to know whether your idea will fix things. It could be FreeNAS itself suffering under load, for example.

The second bad part is that you've merely compartmentalized the problem, but that may be okay. If that actually works out for you, great. If it merely leads to similar problems down the road, not-so-great.

Looking forward, you might want to contemplate 40GbE. The Chelsio 10G stuff that's swimmingly well supported under FreeNAS (T420, T520, etc) also supports the T580 with the same driver. I have a small number of 40G ports available here, but they're all infrastructure. I haven't tried a Chelsio 40G card (yet). But getting yourself something like a Chelsio T580 card and 10/40 switch like the Dell N4032 or N4064 seems like it might be somewhat more workable for your use model.
 

VictorR

Contributor
Joined
Dec 9, 2015
Messages
143
The first bad part is that it's really difficult to know whether your idea will fix things. It could be FreeNAS itself suffering under load, for example.

Yeah, I came to the realization a few weeks back that this could be a FreeNAS thing, a Netgear thing (the XS728T is their foray into affordable 10GbE), or a Sonnet Thunderbolt-to-10GbE Ethernet converter thing. Even worse, it could be a problem between any two or three of those.

Although, another post of yours on LAGG/LACP made me think that maybe it's a problem of each company having a slightly different implementation of jumbo frames. I think my first experiment should be doing away with those and seeing what happens.
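The experiment itself is simple (interface name assumed; the MTU has to change on the switch and the clients at the same time):

    # Drop the NAS interface back to standard 1500-byte frames
    ifconfig lagg0 mtu 1500

    # Then verify from a client: 1472 bytes of payload plus 28 bytes of
    # ICMP/IP headers exactly fills a 1500-byte MTU; -D forbids fragmentation
    ping -D -s 1472 10.0.1.2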

The second bad part is that you've merely compartmentalized the problem, but that may be okay. If that actually works out for you, great. If it merely leads to similar problems down the road, not-so-great.

That was my goal. Hopefully, I can isolate which stations the problem originates from.

BUT, here is the interesting thing. I've had a suspicion for several weeks that the instability might be caused by editors leaving their clients mounted to the shares overnight (or for days at a time) with project files/video open in Adobe Premiere. I don't know why I thought that, but it had always been in the back of my mind. So, I started staying late, closing Premiere, unmounting all shares, and even shutting down the Mac Pro/iMac clients. Lo and behold, no network instability the next couple of days! The one day we did have problems, I quickly discovered someone had left Premiere open with a share mounted. That night I unmounted all clients, and there have been no problems since, even with multiple days of all clients hitting the NAS pretty hard all day long.
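For anyone else chasing this: if the shares are SMB, the NAS can show who still has sessions or files open, which beats walking the office (AFP shares would need a different check):

    # List active SMB sessions, connected shares, and locked files
    smbstatus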

Another issue is that scrubs are taking 9.5 hours on the 40TB of data we have, which extends into business hours of the next day. Luckily, this only happens on the 1st & 16th of each month.
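While a scrub runs, its rate and estimated time remaining show up in the pool status (tank is a placeholder name):

    # The scan: line reports throughput and time to go
    zpool status tank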

Looking forward, you might want to contemplate 40GbE. The Chelsio 10G stuff that's swimmingly well supported under FreeNAS (T420, T520, etc) also supports the T580 with the same driver. I have a small number of 40G ports available here, but they're all infrastructure. I haven't tried a Chelsio 40G card (yet). But getting yourself something like a Chelsio T580 card and 10/40 switch like the Dell N4032 or N4064 seems like it might be somewhat more workable for your use model.

That is a definite possibility in the future, but I've got to go with the hardware I have for now.
 