Large transfers over CIFS always fail.

Status
Not open for further replies.

Kayman

Dabbler
Joined
Aug 2, 2014
Messages
23
Hello.

Hate for my first post to be about a problem, but I really need help here. First of all, let me say I'm on my second FreeNAS build. The first one I built was a sort of proof of concept. Specs were an i7 920 CPU, a basic motherboard with 10 onboard SATA ports, and 24GB RAM (6 x 4GB). It had 2 separate pools, (6 x 1.5TB) Z2 and (3 x 1TB) Z1. It ran build 9.1.0, I think. Long story short, it ran like a dream; you can read about it or skip to the next paragraph.

I got it running, set the static IP, set up the pools, set up the shares, and that was it. It just worked beautifully from day 1. All transfers were consistently in the 75-90MB/s range, peaking at 115-120MB/s. My torrent computer's download directory was set to one of the dataset folders. I regularly had 20+ torrents going, all downloading/seeding, plus me and others streaming high-bitrate Blu-ray movies from it. It never missed a beat. I even randomly unplugged drives just to see what would happen. Even the Z1 pool successfully resilvered multiple times. I never lost the pools or any data. Huge file transfers always finished eventually, without pauses or dropouts. It was just bulletproof; in the 2-3 months I had it running, it never once crashed or became unresponsive. I didn't fiddle with any settings or anything, all I did was scrub the pools manually every 1-2 weeks, and it continued working flawlessly.

So I was sold. FreeNAS is great. Time to go out and make a larger-scale NAS and put all my drives in it. I believe in do it once, do it right, so I decided to go down the route of getting proper server gear. Here's what I got, all 2nd hand except the motherboard, which I got brand new.

Xeon E3-1220 v2
Supermicro X9SCM-F
Kingston KVR16E11K4/32 (32GB 1600MHz ECC unbuffered)
LSI 9211-8i (P16 firmware, IT mode)
Drives: currently 14 x 1.5TB WD Greens.

Here's my problem. I've set up a brand new, empty pool and I can't even fill it up. I try to copy something to a CIFS share and it will copy anywhere from a few hundred MB to 10-20GB (seems completely random), then it will simply stop. Network traffic will drop to zero, the share will become unresponsive, and the transfer will fail. Meanwhile the web GUI will continue to work as normal, still responsive; even the shell works smoothly, and there are no errors in the console, nothing. TeraCopy just sits there doing nothing and eventually comes up with the error "the specified network name is no longer available". About 10-20 minutes pass (again, totally random), then out of nowhere it starts to copy again, another few hundred MB to 10-20GB, and so it goes on. Whenever it actually is transferring, the share is accessible; when the transfer stops, the share also drops out.

Going through all the menus, there doesn't really seem to be anything you can change or adjust to try and fix this. I've tried running different versions of FreeNAS, even the old 8.x.x. I've tried different pool configurations. Always the same problem.

Now I know the hardware is 2nd hand and I have no idea how old it is or what it's been through, but I did check it. Installed Windows and ran tests: Prime95 and Memtest86, plus CrystalDiskMark on each channel of the HBA. Nothing was out of the ordinary. The disks, yes, the disks are ancient (in computer years), and yes, some of them have crazy high values for (C1) Load Cycle Count. But all drives have zero Reallocated Sector Count, zero Reallocated Event Count, zero Current Pending Sector, and zero Offline Uncorrectable. I'm pretty sure everything hardware-wise is 100% working, so the problem isn't the hardware.

Whenever it actually is transferring, the speeds are very good, around 100MB/s. Scrub speed is also very good at around 740MB/s. Currently running the latest version, 9.2.1.6. The current pool configuration is 6+6 Z2 striped vdevs with 2 drives disconnected as spares. Ideally I would like to keep this setup. I could go 7+7 Z3 striped or a single 13-disk Z3 vdev, but I decided to keep the option of expandability open. I can easily get 4 more drives and add another 6-disk vdev.

So please, what do I do to make this work? I just want a basic CIFS share to work properly. On the old NAS I would get a 2TB backup disk and just dump it into a share, and in about 9-10 hours it was done. Now I've spent the entire weekend and can barely even get 250GB transferred. Any info you guys need I'll gladly supply, just tell me what commands to type and where to get it.

Thanks in advance.




 

indy

Patron
Joined
Dec 28, 2013
Messages
287
Does the client use a Realtek chipset?
I had the same symptoms with my old motherboard (Realtek NIC) and have not been able to replicate the problem with the new one (Intel i217).

edit:
However the motherboard change coincided with some other changes so consider it a guess.
 
Last edited:

c32767a

Patron
Joined
Dec 13, 2012
Messages
371
Hello.


So please, what do I do to make this work? I just want a basic CIFS share to work properly. On the old NAS I would get a 2TB backup disk and just dump it into a share, and in about 9-10 hours it was done. Now I've spent the entire weekend and can barely even get 250GB transferred. Any info you guys need I'll gladly supply, just tell me what commands to type and where to get it.

Thanks in advance.


For testing purposes, can you try transferring files using some other method? NFS? AFP? FTP?
Do you have another client you can try from as well?

Do you know how to use iperf? It would also be good to test the network performance between your client and server with iperf, just to validate that it's working.
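A minimal run might look like this, assuming iperf 2.x on both ends (it should already be present on FreeNAS 9.x; the IP below is just a placeholder for the FreeNAS box):

```shell
# On the FreeNAS box (server side):
#   iperf -s
# On the Windows client (any iperf 2.x build; 192.168.1.100 is a placeholder):
#   iperf -c 192.168.1.100 -t 30 -i 5
#
# A healthy gigabit link should report somewhere around 900+ Mbits/sec.
# This awk pulls the bandwidth figure out of a result line for a quick check:
result='[  3]  0.0-30.0 sec  3.28 GBytes   940 Mbits/sec'
echo "$result" | awk '{print $(NF-1), $NF}'
```

If the iperf number is solid but CIFS still stalls, the network itself is probably off the hook.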
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
I didn't see this in your posting and if I overlooked it, well....

Did you try a new/different Ethernet cable and a direct connection to your computer? What is your computer/OS? Sometimes the failure is the computer, or the cable(s), switch, router, etc... I've seen it several times where someone blames FreeNAS and it was actually a piece of hardware. I hope it's a cable, since those are easy fixes.

You said this was a new MB; did you check to ensure the BIOS firmware was updated?

This does act like a Realtek NIC issue, but I didn't think those are installed on your MB, though I could be wrong. If you are running a Realtek NIC, drop in an Intel NIC and your problem is likely to be fixed. If you have a Broadcom or Intel NIC, I think you should be fine, short of the NIC being damaged.

Speaking of the BIOS, did you make any changes in there other than possibly RAM settings?

Now let's talk about the used parts... If you ran Memtest86 for 2 days solid and had no issues, then it sounds like the MB, CPU, RAM, and PSU are okay. I'm glad you ran Prime95 too, just don't run it over 2 hours or you could lead your CPU to a premature death. But this also validated your CPU, CPU heatsink, and PSU. The LSI card... I think you may need to do a little experiment, which I don't know if you can support or not. Remove the LSI card, connect one or more hard drives to the MB SATA connectors, and create a new pool. Try to repeat the failure. If it doesn't fail, then you have isolated the failure to the LSI card. I'm not saying the LSI card is faulty, I have no experience with them, but maybe there is some setting, IRQ conflict, hell I don't know, but it could either prove the LSI card is not the issue or that it is.

I'm out of ideas at this point but good luck.
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
I feel like this is some kind of overheating chipset. Possibly the NIC, possibly the LSI card. Everything he's describing---semi-random, but not-so-random times to lockup, responsiveness returning after a cool-down period, etc. I feel like when he's pounding the I/O, maybe something is getting too hot, or something had been previously overheated by the previous owner and is now overly sensitive (which may be why it was sold in the first place), etc.
 

iSCSIinitiator

Dabbler
Joined
Jul 17, 2014
Messages
16
Just a quick thought: I'd definitely check to see if it's a CIFS issue. There are known issues in CIFS that were fixed in 9.2.1.6 by having CIFS default to SMB2 for the maximum protocol. However, if you do a clean install and create a CIFS share, 9.2.1.6 will set SMB2 as the default, but if you upgrade from an earlier 9.2.1.X, I'm pretty sure I've seen FreeNAS keep systems on SMB3, which is buggy for some users.

The simple fix is to check that SMB2 is set as the max protocol for CIFS in the CIFS config GUI. Just a quick guess; could be a number of things, but this is easy to test.
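One way to verify what the server is actually set to, sketched from the FreeNAS shell (this assumes the stock testparm that ships with Samba is on the path):

```shell
# testparm prints the effective Samba config; grep for the protocol line:
#   testparm -s 2>/dev/null | grep -i 'max protocol'
# A clean 9.2.1.6 install should show:
#   max protocol = SMB2
# The same grep, demonstrated on a captured config line:
echo '        max protocol = SMB2' | grep -io 'max protocol = .*'
```

If that shows SMB3 (or nothing at all), setting the max protocol explicitly in the CIFS config GUI is the quick test.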

I re-read and see that you've used earlier FreeNAS versions prior to Samba 4, so it's likely another issue.

Good luck.
 

Kayman

Dabbler
Joined
Aug 2, 2014
Messages
23
Thanks for the replies, guys.

First of all, the cables/connections. The new NAS sits in exactly the same spot as the old one did, connected with the same cable to the same port of the same router. All cables are high-quality Cat6. I'm using the onboard NICs of the Supermicro board (Intel 82579LM and 82574L, 2x Gigabit Ethernet LAN ports). The old NAS used Realteks initially, and yes, they were crap; I switched to an Intel PRO/1000 PT dual port and speeds improved. Two client computers use Intel 82574L NICs. One other computer uses an onboard NIC, but it is Intel based. The rest of the computers are wireless, and I couldn't care less about them because I don't use the wireless computers.

The motherboard. The only settings I changed were forcing the boot device to USB and setting the SATA controller to AHCI mode. Memory speed is auto, and it runs at 1600MHz with auto timings. I haven't flashed the BIOS, so I will look into that; maybe there is a newer version.

Overheating. I have a small fan cable-tied to the LSI card, so I have made sure it runs cool. The CPU fan is set to automatically increase RPM when temps go up. When I ran Prime95 in Windows I confirmed this feature was working, because the CPU did warm up and the fan spun faster. While FreeNAS is running, the fan always stays at minimum RPM, which I assume means the CPU stays cool.

iperf, I'm not familiar with it, but I'll try and look into it. Also, all my computers are Windows based, so what would be an easy way for me to try the other share types: NFS? AFP? FTP?

The LSI 9211-8i card was bought off eBay from China, dunno, maybe it was even new; it looked legit in a new box. I couldn't find a cheap M1015, so I just got it; maybe it is dodgy. I will make a pool using just the motherboard SATA ports. Scrubs seem to complete fine without any pauses, and I thought they were even more stressful on the I/O than just straight writes.

I also still have the Intel PRO/1000 PT NIC from the old NAS that I could use in the new one.

OK, once again thanks for the replies, I will try and get back to you guys as soon as I can.
 
Last edited:

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
The Intel NIC is the easiest and quickest test, but I'm sure you know that. Also, and I'm not saying this is your problem, even though you are using the same cables, router, switches, etc., it doesn't mean something hasn't gone wrong. Call it bad luck, but we have seen it before. This is why it's a great test to connect it directly to the computer you are transferring files to/from.
 

Kayman

Dabbler
Joined
Aug 2, 2014
Messages
23
Alright guys, I've spent 2 solid days fighting with this, but I got to the bottom of it.

Here's the story.

Pulled out the LSI 9211-8i card. Made a pool of 4 drives in Z2 with just the motherboard SATA ports and, lo and behold, it works. Got 50GB of test data (SSD source), a good mix of file sizes. Transferred without a hitch both ways. All CRC checks in TeraCopy came back good. So it looks like everything is fine with the CPU, motherboard, and network.

Now I plug the LSI 9211-8i card back in and make 8 pools of 1 drive each, using the da drives. (Here's the drive list so it's clear.)

[root@NAS ~]# camcontrol dev list
<ATA WDC WD15EADS-00P 0A01> at scbus0 target 1 lun 0 (da0,pass0)
<ATA WDC WD15EADS-00S 0A01> at scbus0 target 2 lun 0 (da1,pass1)
<ATA WDC WD15EADS-00S 0A01> at scbus0 target 5 lun 0 (da2,pass2)
<ATA WDC WD15EADS-00S 0A01> at scbus0 target 6 lun 0 (da3,pass3)
<ATA WDC WD15EADS-00S 0A01> at scbus0 target 7 lun 0 (da4,pass4)
<ATA WDC WD15EADS-00R 0A01> at scbus0 target 8 lun 0 (da5,pass5)
<ATA WDC WD15EADS-00S 5G04> at scbus0 target 9 lun 0 (da6,pass6)
<ATA WDC WD15EADS-00S 5G04> at scbus0 target 10 lun 0 (da7,pass7)
<WDC WD15EADS-00P8B0 01.00A01> at scbus3 target 0 lun 0 (ada0,pass8)
<WDC WD15EARS-00Z5B1 80.00A80> at scbus4 target 0 lun 0 (ada1,pass9)
<WDC WD15EARS-00Z5B1 80.00A80> at scbus5 target 0 lun 0 (ada2,pass10)
<WDC WD15EARS-00J2GB0 80.00A80> at scbus6 target 0 lun 0 (ada3,pass11)
<SanDisk Cruzer Blade 1.26> at scbus8 target 0 lun 0 (pass12,da8)
[root@NAS ~]#

Using the same 50GB of test data, I tried to fill the pools up one at a time. Transfers using CIFS were all successful (CRC verified in TeraCopy) on da1-da7. da0 got to about 10-15GB and then it failed. The more times I tried, the less it got. Eventually it wouldn't even transfer a few hundred MB. So I've left the da0 pool and deleted its CIFS share. Next I deleted all the other pools and used the remaining 11 drives to make a Z3 pool. Now this pool worked; all CIFS transfers worked, big or small.

Now I noticed this: under Reporting, FreeNAS is showing constant writes to drive da0 (the one with no share). Not just a few spikes here or there; the graph is solid green with 80k on the y axis, even after a reboot. I'm thinking, WTF is going on?

At this point I thought it was pretty clear that channel 1 on the LSI 9211-8i card is rooted. But to be sure, I unplugged the drive (da0) and connected it to a spare motherboard SATA port. Reboot, made a pool with this 1 drive, got my 50GB of test data, and tried a CIFS transfer. 500MB and it failed; tried a few more times and it wouldn't even start, the share just went dead.

So I removed the drive and put one of the spares in its place. Made a single-drive pool again, tried the 50GB of test data, and it works. Then I took this drive, unplugged it from the motherboard, and plugged it back into channel 1 of the LSI 9211-8i card. Again it works; the 50GB transfer is successful.
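(For anyone repeating this kind of isolation test: a purely local write takes the network and Samba out of the equation entirely. "testpool" below is a placeholder pool name, not one from this thread.)

```shell
# Write a large file straight to the suspect single-drive pool from the
# FreeNAS shell:
#   dd if=/dev/zero of=/mnt/testpool/testfile bs=1m count=10240
# and watch per-disk activity in a second session with:
#   gstat -p
# A drive that stalls CIFS should stall the local dd the same way, which
# separates a disk/HBA problem from a network/Samba one.
bs_mb=1; count=10240
echo "test file size: $((bs_mb * count)) MB"
```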

I've since gone back to my original 6+6 Z2 pool configuration, and so far everything seems to be working. But my confidence in FreeNAS has been greatly shaken. How can this happen? On my old NAS I pulled the power plug on one of the drives in the pool while it was in use, and it kept on working; it didn't even seem to notice. Only after a reboot did the pool show up as degraded. Here, 1 dud drive has basically stopped the entire pool from functioning, without a single hint of anything being reported as wrong. Here is the SMART data from the drive in question. Here it's connected to the motherboard, so it's ada0.

[root@NAS ~]# smartctl -a /dev/ada0 | more

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 187 182 021 Pre-fail Always - 5616
4 Start_Stop_Count 0x0032 099 099 000 Old_age Always - 1298
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 085 085 000 Old_age Always - 11534
10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 099 099 000 Old_age Always - 1289
192 Power-Off_Retract_Count 0x0032 199 199 000 Old_age Always - 1246
193 Load_Cycle_Count 0x0032 163 163 000 Old_age Always - 112736
194 Temperature_Celsius 0x0022 124 101 000 Old_age Always - 26
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0

SMART Error Log Version: 1
No Errors Logged
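The zero raw values above can be checked mechanically. Here's a small filter (a sketch, not a FreeNAS tool) that flags nonzero raw values on the attributes that usually matter (5, 197, 198, 199) when fed the output of `smartctl -a`:

```shell
# Prints any of the usual trouble attributes whose raw value is nonzero.
# Feed it smartctl attribute output, e.g.: smartctl -a /dev/ada0 | smart_flags
smart_flags() {
  awk '$1 ~ /^(5|197|198|199)$/ && $NF != 0 {print $2, "raw =", $NF}'
}
# Demo on a made-up attribute line (this thread's drive would print nothing):
echo '197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 8' | smart_flags
```

As the thread shows, though, a drive can pass this check and still be dying.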

Anyone care to share some insight? What has gone wrong here? I thought that the fundamental purpose of RAID was to protect against drive failures. It clearly hasn't done that here. Yes, if the pool was full, all the data would have still been there, but I could barely write to (and I'm assuming read from) the pool, so what could I have done? Have I been a total noob and missed something? What's to stop this from happening again when another drive goes bad in the pool?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
So it was probably stuck at 80k because when the disk detached from the system, the reporting system didn't know how to handle it, so the same value kept getting carried forward because that's the number it had previously.

As for your comments about this:

I've since gone back to my original 6+6 Z2 pool configuration, and so far everything seems to be working. But my confidence in FreeNAS has been greatly shaken. How can this happen? On my old NAS I pulled the power plug on one of the drives in the pool while it was in use, and it kept on working; it didn't even seem to notice. Only after a reboot did the pool show up as degraded. Here, 1 dud drive has basically stopped the entire pool from functioning, without a single hint of anything being reported as wrong. Here is the SMART data from the drive in question. Here it's connected to the motherboard, so it's ada0.

This is the Achilles' heel of ZFS. When a disk starts failing, a number of different things can cause failures, and each of those can cause different results at the HBA. It can range from a disconnect where the disk is never reconnected, to garbage across the SATA cable to the HBA, to disconnects and reconnects every few seconds. The key is to use an HBA that can handle these problems properly, so that in most (and hopefully all) cases a failed disk doesn't take the entire pool down with it. Disconnecting a drive while it's in use (do not try this at home with real pools... it can be bad... only for testing with test pools and test data) is the "best case" scenario: the HBA sees a disconnect and that's it. The crappy scenarios are when a disk is intermittently failing. The HBA has to deal with that... somehow.

I don't know about the LSI 9211, but I know the M1015 lets you control when it drops disks from the controller. There's a menu of options (I can't remember if it's a hidden menu or one you can find easily), and you can change some of the settings to make them more aggressive. This may fail out disks that aren't bad, so you must be smart about it. I'd assume it exists on the 9211, since it's basically the same controller.

So your hardware is pretty good. No complaints or improvements I'd make. I deal with this very easily. When a disk starts failing, once the disk is disconnected it will be removed from the pool until you either go to the CLI and online it manually or reboot the server (the act of onlining the pool will put the "dropped" disk back in the pool if it is found). Not all disks will disconnect, though. So I have my SMART script that emails me nightly, plus SMART monitoring. If a disk starts failing, it typically triggers SMART monitoring first, so I know which disk is the problem. If it doesn't trip SMART, there are other indicators. If you notice your pool suddenly sucking horribly in performance where performance existed just hours or minutes before, the first thing I look for is a bad disk.
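As a rough sketch of those checks from the FreeNAS shell ("tank" is a placeholder pool name):

```shell
# Two quick looks usually find the culprit when performance suddenly tanks:
#   zpool status -v tank   # any disk FAULTED/REMOVED? checksum errors?
#   gstat -p               # one disk pegged near 100% busy while its
#                          # siblings sit idle is the classic bad-disk sign
# Grepping the status output for any state that isn't ONLINE, e.g.:
echo '    da0  FAULTED  0  0  0  too many errors' \
  | grep -E 'FAULTED|DEGRADED|REMOVED|UNAVAIL'
```

Note that an intermittently failing disk (like the da0 in this thread) may still show ONLINE, which is when the gstat busy-percentage view earns its keep.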

The "best" way to deal with this is to buy those ultra-expensive disks that have TLER and use those. But if you are only a mere mortal like me, you'll have to settle for Greens or Reds, and if/when this happens, know that you need to take action. This really isn't "bad" for the pool, as it hasn't put anyone's data at risk as far as I know. But it does kill pool performance until you deal with the issue.

So it's not the end of the world that you might think it is, but it is something you learn to deal with; accept the fact that you built a server with less than enterprise-grade (and enterprise-priced) hardware, and you have a quirk to deal with. ;)

Now to change the subject a little... that SMART info you provided looks fine. It has ~1,300 boots in ~11,500 hours (one about every 9 hours... ouch!), and you need to use the wdidle tool on your obviously Green drives to fix your load cycle count so it doesn't go off the scale. (There's a how-to in the forum guides section, so go there and read about that.)

So the disk "looks" fine. I'd put it in another machine and do a badblocks test on it, followed by SMART short, conveyance, and long tests. If all of those pass, then I'd argue it's not the disk, despite the overwhelming evidence that it is. It could just be an old disk that is doing crappy things in its firmware and somehow not triggering any of the monitored parameters in SMART. Other than that, you can just discard the disk and be done with it (or try to do an RMA).
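A sketch of that test sequence, assuming the disk shows up as /dev/da0 in the test machine and holds no data you care about:

```shell
# badblocks -ws is DESTRUCTIVE -- it wipes the disk:
#   badblocks -ws /dev/da0           # 4 patterns, write + read-back each
#   smartctl -t short /dev/da0       # then the SMART self-tests,
#   smartctl -t conveyance /dev/da0  # one at a time
#   smartctl -t long /dev/da0
#   smartctl -l selftest /dev/da0    # read the results once each finishes
# Rough runtime: -ws makes 8 full-disk passes. For a 1.5TB Green at
# ~80 MB/s sustained, that is on the order of:
echo "$(( 8 * 1500000 / 80 / 3600 )) hours"
```

So budget a couple of days per drive for the full badblocks pass alone.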
 

Kayman

Dabbler
Joined
Aug 2, 2014
Messages
23
Thanks for the reply, cyberjock. I feel honored that you of all people have chimed in.

First of all.
I don't know about the LSI 9211, but I know the M1015 lets you control when it drops disks from the controller. There's a menu of options (I can't remember if it's a hidden menu or one you can find easily), and you can change some of the settings to make them more aggressive. This may fail out disks that aren't bad, so you must be smart about it. I'd assume it exists on the 9211, since it's basically the same controller.
Where do I look to find this menu? Is it somewhere within the FreeNAS GUI, or do I have to get the spare monitor out, hook it up, and look for it during boot? What exactly am I looking to adjust in this menu?
Now to change the subject a little... that SMART info you provided looks fine. It has ~1,300 boots in ~11,500 hours (one about every 9 hours... ouch!), and you need to use the wdidle tool on your obviously Green drives to fix your load cycle count so it doesn't go off the scale. (There's a how-to in the forum guides section, so go there and read about that.)
I am aware of this, and I've set the timer to 300 on all drives in the pool. I got them all second hand, and they came like this; I can't undo what previous owners have done. It really does help you get a low price, though. You point it out to people and they assume the drive is worth nothing, so they sell it to you for stupid cheap. I have one drive that's actually part of the pool that's on 1.6 million load cycles, and ironically it's not the one that failed.
So it's not the end of the world that you might think it is, but it is something you learn to deal with; accept the fact that you built a server with less than enterprise-grade (and enterprise-priced) hardware, and you have a quirk to deal with. ;)
Definitely not the end of the world. I'm just glad I learned this now, when I'd just put the pool together, it was pretty much empty, and I hadn't lost anything. I just wrongly assumed that you're 100% safe against drives failing as long as too many don't fail at once. Looks like I have to take some extra precautions with your suggestions.

Based on this assumption, I chose to go down the route of investing a lot in proper hardware and using cheap disks; most people I know with a limited budget go the other way: get the big expensive disks and use a crappy old computer to run them. Maybe that was a mistake, but I'm stuck with what I've got now. I've stuck by the old 1.5TB drives because, being an odd, obsolete size that as far as I'm aware isn't even made anymore, they are stupid cheap. I've just bought 6 more for $150; that's $25 a drive. At that price they pretty much are expendable. Also, there are a lot of them out there, because for a long time when they were new, the 1.5TB disks enjoyed the position 3TB disks do today: if you divide the price by the capacity, you get the lowest cost per TB, making them the drives to get.

Once again, thank you to everyone who has chimed in to help me. Let the lessons I learnt be a lesson to you all. I didn't even consider that a drive that's going bad, as opposed to one that's gone completely bad, could compromise the entire pool like it did for me.
 

Kayman

Dabbler
Joined
Aug 2, 2014
Messages
23
One more thing. I've got 6 more drives on the way, and to use them I'm going to need another HBA card. Is the LSI 9211-8i still recommended as an alternative to an M1015? ZFS file servers have really caught on in popularity lately, and with the M1015 being so heavily recommended for ZFS, it's become really hard to get, or the price is stupidly inflated due to the demand.
 
Last edited:

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Thanks for the reply, cyberjock. I feel honored that you of all people have chimed in.
Where's the love man, where is the love;)
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Where's the love man, where is the love;)

People named Joe Schmuck are the butt of jokes (seriously!) around here. When shit goes wrong we always say "it's joe schmuck's fault". I'm not even joking either.

@Kayman

The menu is in the controller itself if you didn't reflash it without the menu.
 

Kayman

Dabbler
Joined
Aug 2, 2014
Messages
23
The menu is in the controller itself if you didn't reflash it without the menu.

OK, looks like I had a blank BIOS and I had to reflash the card. For some stupid reason this is recommended by most of the flashing guides, and I must have followed them. Anyway, I think I've gotten to the menu you were referring to.

It looks like this.

MAX INT 13 devices for this adapter. 24
IO Timeout for block devices 10
IO Timeout for block devices removable 10
IO Timeout for sequential devices 10
IO Timeout for other devices 10
LUNs to scan for block devices ALL
LUNs to scan for block devices removable ALL
LUNs to scan for sequential devices ALL
LUNs to scan for other devices ALL
Removable media support NONE

Now, you said to be smart about this, but I really have no idea about all these settings. Can you give me some starting values to try, and a rough guide as to which settings are very conservative, conservative, middle ground, aggressive, very aggressive, and too aggressive?

Since I've still got an empty pool, I would like to remake it with the bad drive and recreate the problem scenario. Hopefully, through some trial and error, I can make the HBA successfully deal with the bad drive. Just for peace of mind's sake, before I really fill the pool up and leave it long term.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
People named Joe Schmuck are the butt of jokes (seriously!) around here. When shit goes wrong we always say "it's joe schmuck's fault". I'm not even joking either.
Someone needs to take the fall.
 

Kayman

Dabbler
Joined
Aug 2, 2014
Messages
23
Any update with regards to the LSI 9211-8i settings? I really don't want to have to guess here. I want to be sure it's not gonna come back and bite me later on.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Honestly, cookbooking someone else's settings isn't a good idea. Those settings depend on your threshold for pain, your choice of hardware, and how your hardware performs. It's as personal as underwear, and I'm pretty sure you wouldn't want to wear the same underwear as someone else because "it works for them".

To be frank, if you don't know what all that stuff means, you are 4-6 months away from understanding the disk subsystem well enough to actually have an informed opinion on how it all works together. Sorry, a weekend of Googling is NOT going to give you the education you seek.
 

Kayman

Dabbler
Joined
Aug 2, 2014
Messages
23
Honestly, cookbooking someone else's settings isn't a good idea. Those settings depend on your threshold for pain, your choice of hardware, and how your hardware performs. It's as personal as underwear, and I'm pretty sure you wouldn't want to wear the same underwear as someone else because "it works for them".

To be frank, if you don't know what all that stuff means, you are 4-6 months away from understanding the disk subsystem well enough to actually have an informed opinion on how it all works together. Sorry, a weekend of Googling is NOT going to give you the education you seek.

Why would you point all this out to me if you were gonna come back with a reply like that? I didn't ask for you to give me settings that I expect to work and then come back and complain when things go wrong again. You told me to try more aggressive settings. I asked for a rough guide of which settings are more aggressive and how to incrementally make them more aggressive. Then I said I intend to recreate the problem scenario, adjust the settings, see what happens, and make up my own mind about what to do.

You know the whole story of how this transpired, all my hardware, and how it performs when it's working properly. I can understand slow performance, but I had no performance; the shares were completely dead. I can live with a slow pool, but I expect consistency. The share associated with the pool should always be available, not completely drop out for 10-20 minutes at a time.

I'm not going to spend 4-6 months (potentially with a FreeNAS system that doesn't work) educating myself just so that I can understand everything and then attempt to troubleshoot a problem that shouldn't exist in the first place. I've already spent months researching RAID arrays, vdevs, zpools, ZFS, etc. I went out and got proper server gear. I thought I did everything right and I was covered. Nowhere, EVER, in any explanation of RAID5/Z1 or RAID6/Z2 did it say "oh hey, this is only gonna work in a best-case scenario where your drive makes it absolutely clear that it has failed; otherwise your whole array will be unusable". I would think it's safe to assume that in nearly all situations a drive will start to fail intermittently; it won't go out like a light bulb.

Just to be clear, the drive is pretty much confirmed as the cause. It passed a long SMART test, but I put it in a Windows machine and ran the basic benchmark in HD Tune. In places this drive is barely even returning 1MB/s read/write, and the results are in no way consistent from run to run. Meanwhile, the NAS is running now with a few TB of data in the pool, being frequently accessed, and all is still well.

I deal with this very easily. When a disk starts failing, once the disk is disconnected it will be removed from the pool until you either go to the CLI and online it manually or reboot the server (the act of onlining the pool will put the "dropped" disk back in the pool if it is found). Not all disks will disconnect, though.
Based on this, it seems you've seen this problem before and clearly know that it's easy to deal with. So please, either tell me how you apply this very easy solution, or tell me how I find a dead drive that isn't dead. Because at this point there is nothing to stop this from happening to me again, and next time the pool will be full.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
First, I was telling you about that menu. I wasn't telling you what to set or anything else, just leaving the door open. If you want to spend a few months learning how all this stuff works (or not), that's totally your business. I didn't change my settings, and even if I did I wouldn't share them, because they almost certainly wouldn't apply unless you chose to match my hardware perfectly, down to the hard drive firmware version. Some of us will gladly spend months doing research, as we find file servers to be a very fulfilling hobby. Others don't care. Those who don't care either accept the limitations of whatever they have, or they go with something like a Synology, which has even more limitations. Pick one. ;)

Second, I already told you how I deal with it: find the failing disk and simply replace it. Now, your disk is misbehaving for reasons that don't appear in SMART. So either something is wrong with the disk itself, or something else is indirectly affecting how the disk operates. I can't tell you which, because I don't know. I don't have your hardware, so I can't troubleshoot the problem to that level.
 
Status
Not open for further replies.
Top