[SOLVED] One of the 30 disks in my Z2 array is behaving unlike any other.


VictorR

Contributor
Joined
Dec 9, 2015
Messages
143
I'm hoping some of you could tell me if this is normal, or give me some pointers on how to figure out what is going on.

FreeNAS-9.3-STABLE-201602031011
45 Drives Q30:
SuperMicro X10DRL
2x Intel Xeon E5-2620 v3 2.4GHz CPU
2x 120GB SSD Boot Drive
256GB RAM
2x LSI 9201 HBA
30x WD Re 4TB drives
3x Intel X540-T2 (X540T2BLK) dual-port 10GbE NIC

3x 10-drive RAIDZ2 vdevs

Everything has been great since January, when I finally found and replaced a "bad HD" that actually turned out to be a bad HBA cable. At least, I've received no error report emails and no alerts in the FreeNAS GUI.

Anyway, a casual check of the array tonight showed every disk online:

[screenshot: 844q6Vq.png]


But when I checked Reporting > Disk tonight, 29 of the disks showed the exact same I/O report, while the 30th was completely different.

[screenshot: Bo8iu6w.png]


Looking a little closer, it shows this behavior:

[screenshot: R2GRwyo.png]


A one-week view looks like this:

[screenshot: sWLd9Hj.png]


The four-week view:

[screenshot: 6U9YlBX.png]


The preceding four weeks:

[screenshot: 7mHChmt.png]


And back to when the array was brought online:

[screenshot: 6Ewfpyt.png]
 

MrToddsFriends

Documentation Browser
Joined
Jan 12, 2015
Messages
1,338
Are you sure that da30 is not a boot device (possibly holding the system dataset, too)? How about your second boot device? "2x 120GB SSD Boot Drive" suggests you are using two mirrored boot devices.
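Easy enough to check from the console, something like this (a sketch; freenas-boot is the default boot pool name on 9.3):

    # Which devices make up the boot pool?
    zpool status freenas-boot

    # Map gptid/... labels back to daX/adaX device names
    glabel status

    # Is the system dataset (.system) living on the boot pool?
    zfs list -r freenas-boot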
 

INCSlayer

Contributor
Joined
Apr 4, 2015
Messages
197
A cursory look at your vdevs: I should point out that device numbering starts at 0, so your 30 disks are da0-da29. Disk da30 is not in the pool.
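You can confirm it from the shell with something like this (the pool name tank below is a placeholder for yours):

    # Lists every disk the HBAs expose; note the numbering starts at da0
    camcontrol devlist

    # Map the pool's gptid labels back to daX device names
    glabel status

    # da30 (via its gptid) won't appear among the pool's members
    zpool status tank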
 

VictorR

Contributor
Joined
Dec 9, 2015
Messages
143
A cursory look at your vdevs: I should point out that device numbering starts at 0, so your 30 disks are da0-da29. Disk da30 is not in the pool.

Thank you, what a foolish tired newbie mistake!

[screenshot: 4d3LmGn.png]


Are you sure that da30 is not a boot device (possibly holding the system dataset, too)? How about your second boot device? "2x 120GB SSD Boot Drive" suggests you are using two mirrored boot devices.

Yes, that's likely what is going on. I need to get home and go to bed.
 

wtfR6a

Explorer
Joined
Jan 9, 2016
Messages
88
I chuckled halfway through reading that because I made that mistake myself recently, but caught it before I posted. Thanks for making me feel "normal" again! :)
 

VictorR

Contributor
Joined
Dec 9, 2015
Messages
143
I chuckled halfway through reading that because I made that mistake myself recently, but caught it before I posted. Thanks for making me feel "normal" again! :)

I am so glad to hear someone else has made that mistake! I was pretty embarrassed after having my error pointed out.

We've all made 'em. Decades doing this professionally and I did something not too terribly far off this morning myself.

jgreco, I think you also warned me months ago about not using LAGG in a small 10GbE LAN. Well, we're regularly getting wide swings in bandwidth to several of our 10 clients. Normal speeds are ~720MB/sec writes and ~600MB/sec reads. But occasionally, different clients will drop to 100MB/sec or even lower. Sometimes zero. Restarting the client sometimes solves the problem. Other times, restarting networking on the LAGG works better, if no one else is in the middle of editing.
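For anyone following along, the state of the aggregate is easy to inspect before resorting to a restart (lagg0 is an assumed interface name, and restarting netif briefly drops every connection on it):

    # Show laggproto, member ports, and which ones are ACTIVE
    ifconfig lagg0

    # Bounce just the lagg interface, not the whole network stack
    service netif restart lagg0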

It's more of an annoyance right now. But when the studio goes to full production, with 10-15 editors pulling 2K/4K/6K streams simultaneously, it will be intolerable. So, I'm going to give up on LAGG and just assign each of the six 10GbE ports its own subnet address (10.0.1.2, 10.0.2.2, 10.0.3.2, etc.) and create corresponding VLAN gateways (10.0.1.1, 10.0.2.1, 10.0.3.1, etc.) on the 24-port 10GbE Netgear XS728T switch. Each port/VLAN will get 2-3 clients.
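In shell terms, the plan would look roughly like this (a sketch only; ix0-ix5 assumes the X540 ports come up under FreeBSD's ixgbe driver, and on FreeNAS the interfaces should really be configured through the GUI so the settings persist across reboots):

    # One /24 per 10GbE port: ix0 -> 10.0.1.2, ix1 -> 10.0.2.2, ...
    i=0
    for ifc in ix0 ix1 ix2 ix3 ix4 ix5; do
        i=$((i + 1))
        ifconfig "$ifc" inet "10.0.$i.2" netmask 255.255.255.0
    done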

Does this sound reasonable? Or, am I over-thinking this?

The reason is that, depending on resolution, frame rate, and compression, raw 2K/4K/6K video streams can run up to 150MB/sec, and there can be several camera shots/streams per scene, so even three simultaneous streams from one client can approach 450MB/sec, a sizable fraction of a 10GbE link. It's not a constant bombardment of the NAS, though, since the editing software caches previews of each stream on the client.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I chuckled halfway through reading that because I made that mistake myself recently, but caught it before I posted. Thanks for making me feel "normal" again! :)

Actually, my mistake was more along the lines of: I went to test some performance numbers to show someone the difference between cache and no cache on a RAID controller. So I go to one of our old ones and get 10MB/sec for HDD and 50MB/sec for SSD, as expected. Then I go to one of our new-gen ones, with the nice LSI 3108 and supercap, and... HDD at 10MB/sec?!?!?

I was insufficiently caffeinated, so the obvious answer (that we had pulled the supercap on that unit a while back because of a crisis) didn't occur to me. So I actually get up and go look at the hypervisor to figure out what's wrong, at which point I see the replacement cable and supercap sitting there with a note for the next downtime.

Go ahead and laugh.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I think you also warned me months ago about not using LAGG in a small 10GbE LAN. Well, we're regularly getting wide swings in bandwidth to several of our 10 clients. Normal speeds are ~720MB/sec writes and ~600MB/sec reads. But occasionally, different clients will drop to 100MB/sec or even lower. Sometimes zero. Restarting the client sometimes solves the problem. Other times, restarting networking on the LAGG works better, if no one else is in the middle of editing.

It's more of an annoyance right now. But when the studio goes to full production, with 10-15 editors pulling 2K/4K/6K streams simultaneously, it will be intolerable. So, I'm going to give up on LAGG and just assign each of the six 10GbE ports its own subnet address (10.0.1.2, 10.0.2.2, 10.0.3.2, etc.) and create corresponding VLAN gateways (10.0.1.1, 10.0.2.1, 10.0.3.1, etc.) on the 24-port 10GbE Netgear XS728T switch. Each port/VLAN will get 2-3 clients.

Does this sound reasonable? Or, am I over-thinking this?

I think you may be overthinking it, but it is what it is. Sometimes things get ugly in network operations.

The good part is that you're actually able to leverage the layer 3 capabilities of that switch in a useful manner... no need to experimentally buy other hardware only to find out the problem is somewhere else.

The first bad part is that it's really difficult to know whether your idea will fix things. It could be FreeNAS itself suffering under load, for example.

The second bad part is that you've merely compartmentalized the problem, but that may be okay. If that actually works out for you, great. If it merely leads to similar problems down the road, not-so-great.

Looking forward, you might want to contemplate 40GbE. The Chelsio 10G stuff that's swimmingly well supported under FreeNAS (T420, T520, etc) also supports the T580 with the same driver. I have a small number of 40G ports available here, but they're all infrastructure. I haven't tried a Chelsio 40G card (yet). But getting yourself something like a Chelsio T580 card and 10/40 switch like the Dell N4032 or N4064 seems like it might be somewhat more workable for your use model.
 

VictorR

Contributor
Joined
Dec 9, 2015
Messages
143
The first bad part is that it's really difficult to know whether your idea will fix things. It could be FreeNAS itself suffering under load, for example.

Yeah, I came to the realization a few weeks back that this could be a FreeNAS thing, a Netgear thing (the XS728T is their foray into affordable 10GbE), or a Sonnet Thunderbolt-to-10GbE Ethernet converter thing. Even worse, it could be a problem between any two or three of those.

Although, another post of yours on LAGG/LACP made me think that maybe it's a problem of each company having a slightly different implementation of jumbo frames. I think my first experiment should be doing away with those and seeing what happens.
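The experiment itself is simple (interface name assumed; the MTU has to change on the switch and the clients at the same time):

    # Drop the NAS interface back to standard 1500-byte frames
    ifconfig lagg0 mtu 1500

    # Then verify from a client: 1472 bytes of payload plus 28 bytes of
    # ICMP/IP headers exactly fills a 1500-byte MTU; -D forbids fragmentation
    ping -D -s 1472 10.0.1.2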

The second bad part is that you've merely compartmentalized the problem, but that may be okay. If that actually works out for you, great. If it merely leads to similar problems down the road, not-so-great.

That was my goal. Hopefully, I can isolate which stations the problem originates from.

BUT, here is the interesting thing. I've had a suspicion for several weeks that the instability might be caused by editors leaving their clients mounted to the shares overnight (or for days at a time) with project files/video open in Adobe Premiere. I don't know why I thought that, but it had always been in the back of my mind. So, I started staying late, closing Premiere, unmounting all shares, and even shutting down the Mac Pro/iMac clients. Lo and behold, no network instability the next couple of days! The one day we did have problems, I quickly discovered someone had left Premiere open with a share mounted. That night I unmounted all clients, and there have been no problems since, even with multiple days of all clients hitting the NAS pretty hard all day long.
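For anyone else chasing this: if the shares are SMB, the NAS can show who still has sessions or files open, which beats walking the office (AFP shares would need a different check):

    # List active SMB sessions, connected shares, and locked files
    smbstatus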

Another issue is that scrubs are taking 9.5 hours on the 40TB of data we have, which extends into business hours of the next day. Luckily, this only happens on the 1st & 16th of each month.
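While a scrub runs, its rate and estimated time remaining show up in the pool status (tank is a placeholder name):

    # The scan: line reports throughput and time to go
    zpool status tank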

Looking forward, you might want to contemplate 40GbE. The Chelsio 10G stuff that's swimmingly well supported under FreeNAS (T420, T520, etc) also supports the T580 with the same driver. I have a small number of 40G ports available here, but they're all infrastructure. I haven't tried a Chelsio 40G card (yet). But getting yourself something like a Chelsio T580 card and 10/40 switch like the Dell N4032 or N4064 seems like it might be somewhat more workable for your use model.

That is a definite possibility in the future, but I've got to go with the hardware I have for now.
 