Freezes, HDs and headaches Oh My!


RichR
Explorer · Joined Oct 20, 2011 · Messages: 77
Sorry for the title - and this post might be all over the place....

First - the details -
Two of what we call storage pods, with only a couple of hardware differences (listed underneath), based on http://blog.backblaze.com/2011/07/20/petabytes-on-a-budget-v2-0revealing-more-secrets/

FreeNAS 8.0.2 amd64 (8288)
i3 540 @ 3.07 GHz
16GB
45 x 3TB drives (Pod 1: Hitachi Deskstar) (Pod 2: WD Green)
Pod 1 has 2 PSUs, Pod 2 has 1 1200W PSU
(A note about power: I used a watt meter several times during startup of Pod 2 - the max was 760W, and then it settled down to 250-300ish, so the math pretty much shows the issues are not related to lack of power after startup)
Both are configured as RAIDZ2 with 14 drives + 1 spare per volume (3 volumes per pod, totaling 45 drives) - a rough command-line equivalent of the layout is sketched right after this list
The volumes are set up this way (the GUI "Active Volumes" says the size is 30.7TiB while the GUI "Reporting" total is 35TB, but whatever...):
On Pod1 we use the terms for Volumes: Store1, Store2, Store3
On Pod2 we use the terms: Store4, Store5, Store6
From time to time I do get siis timeout errors
Backblaze is using Linux -- we're not!
The pods are in a data center
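
For anyone who wants the layout spelled out, each volume is roughly the command-line equivalent of this (I actually built them through the GUI, and the device names here are made up):

# One volume = a 14-drive RAIDZ2 vdev plus one hot spare (x3 per pod = 45 drives)
zpool create Store1 raidz2 ada0 ada1 ada2 ada3 ada4 ada5 ada6 ada7 ada8 ada9 ada10 ada11 ada12 ada13 spare ada14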

So, interestingly enough, I rsync'd about 17TB of total data (in many, many smaller chunks) from a Linux box to Pod 1, divided amongst the 3 volumes at about 9.0TB, 5.1TB, and 3.7TB. No issues, no hang-ups, and it seems to be performing fine, although the siis warnings in the "pod1.local security run output" bother me from time to time.

many, many of these......
+siisch2: Timeout on slot 26
+siisch2: siis_timeout is 00040000 ss 7fffffff rs 7fffffff es 00000000 sts 800a2000 serr 00000000
+siisch2: ... waiting for slots 7bffffff
+siisch2: Timeout on slot 6
+siisch2: siis_timeout is 00040000 ss 7fffffff rs 7fffffff es 00000000 sts 800a2000 serr 00000000
+siisch2: ... waiting for slots 7bffffbf
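
Something I still need to do properly is map siisch2 back to a physical disk. As far as I understand, this should show which devices hang off which bus (just a sketch, the numbering will differ on your box):

# Show each disk and the scbus/channel it is attached to
camcontrol devlist -v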

I think Pod 2 is where my immediate trouble lies.
In trying to sync Pod 1 to Pod 2, Pod 2 freezes, with zpool problems and possible disk problems (all new disks on both pods, btw). Syncing individual parts from Store1 to Store4 went fine until a rather large directory. I should state that none of the files are large (the largest is 5MB). Rsync crapped out, the machine (Pod 2) basically froze, and I was unable to ssh back in. Luckily we have IPMI - if you don't, you should!!! Anyway, I rebooted, and all is fine. No zpool errors, all is dandy. Rsync again, and it goes fine.
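
For what it's worth, the pod-to-pod syncs are plain rsync over ssh, roughly along these lines (the directory name here is made up):

# Example only - pulling one directory tree from Pod 1's Store1 into Pod 2's Store4
rsync -avP root@pod1.local:/mnt/Store1/somedir/ /mnt/Store4/somedir/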

Then I start syncing Store2 to Store5. This is 5TB of total data, but I was doing the directory that has about 3TB. Hours go by at around 250-300M average transfer speed (from Interface Traffic in the GUI), then it freezes again, but this time I'm already logged in. zpool status -x shows problems on 3 drives (separated by 2 drives, so by the configuration I know it's not a single backplane issue). Reboot, and zpool status -x shows it resilvering a single drive, then a couple of minutes later it starts resilvering another drive (of course the time to completion is not going up), then resilvering the third.....
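
I've been watching the resilvers with nothing fancier than zpool status, e.g.:

# Show only pools with problems
zpool status -x
# Full detail on one pool, including resilver progress and per-drive error counters
zpool status -v Store5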

Needless to say, this morning, it's all peaches and cream, and no one is complaining. However, I have not attempted to transfer any data!!!

The only major difference between the way I transferred data from the Linux box to Pod 1 (totally successful) and from Pod 1 to Pod 2 (not going well) is the network. For all of the data between Linux and FreeNAS I used em0 over a gigabit switch. From Pod 1 to Pod 2 I sometimes used em0 (which goes through the switch) and sometimes em1 (on both FreeNAS boxes), which is a direct connection between the two boxes using the second (non-IPMI) gigabit Ethernet port.
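
If the direct link is suspect, I figure the interface error counters should tell me - something like this on both boxes (em1 being the direct-connect port in my setup):

# Per-interface error/drop counters
netstat -i
# Link state, speed and duplex on the direct connection
ifconfig em1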

The drives are different. I think this "MIGHT" be the issue. I've seen both the good and the bad about green drives. I don't think sleep would be the issue because it happened in the middle of a transfer. Connecting the power question with the green-drive question: no box startup ever drew more than 750W, since the green drives do not all spin up immediately with an inrush of current, so I don't think anything in the system is "fried."
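
One thing on my list for the WD Greens is to check the head-parking counter, since that's the usual complaint with them. Something like this (the device name is just an example):

# Attribute 193 (Load_Cycle_Count) climbs quickly if the Greens are parking heads aggressively
smartctl -a /dev/ada20 | grep -i load_cycle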

After this trouble, I read about ZFS tuning, and although I'm certainly not a ZFS guru, I have no problem with the command line. I added the following to loader.conf, per protsd @ http://forums.freenas.org/archive/index.php/t-511.html. The reboot was fine.

# Kernel memory available to ZFS (and the rest of the kernel)
vm.kmem_size="1536M"
vm.kmem_size_max="2048M"
# Lower/upper bounds on the ARC (ZFS read cache)
vfs.zfs.arc_min="256M"
vfs.zfs.arc_max="1024M"
# Left commented out, so the defaults apply (prefetch on, ZIL enabled)
#vfs.zfs.prefetch_disable="1"
#vfs.zfs.zil_disable="0"
# Seconds between transaction group commits
vfs.zfs.txg.timeout="30"
# Max/min queued I/Os per vdev
vfs.zfs.vdev.max_pending="35"
vfs.zfs.vdev.min_pending="4"
# Cap each transaction group at 1GiB of dirty data
vfs.zfs.txg.write_limit_override=1073741824
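
After the reboot, the tunables can be spot-checked with something like:

# Confirm the loader.conf values were picked up at boot
sysctl vm.kmem_size vfs.zfs.arc_min vfs.zfs.arc_max vfs.zfs.txg.timeout vfs.zfs.vdev.max_pending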


So, putting a good load on both systems has not gone well for Pod 2.

The power is different.

Everything else is the same (that I can think of)

So...... where do I start? Needless to say, I'm nervous about Pod2, and at this point can't trust it.

Any suggestions are appreciated,

Rich
 

labtopia
Dabbler · Joined May 31, 2011 · Messages: 47
Hey, I've done the same thing - I love my storage pods! I haven't seen issues like this with any of our pods. I was stable at 8.0.2 amd64, though I did see memory panics when I went up to the 8.0.4 beta (b2).

What controllers did you end up with? Also, I have read that there are big problems with mixing the onboard disk controller with PCI cards in the same box - evidently an incompatibility between the Intel onboard controllers and the SiI cards. My config uses 2TB Seagate greenies mixed between the 3Gb/s and 6Gb/s versions, and I have heard of issues with Hitachis.

Also, I have read that memory tuning is unnecessary on a 64-bit install because it's supposed to manage the memory automatically.

Not sure if it helps, but if you sort it out I'd like to hear how. BTW, I've been running ours for almost a year with little trouble. I think I'm going to stick to release builds though, as upgrading to the current beta destabilized that pod.

dave
 