Borked Installation or Maybe a Bad Array

Status
Not open for further replies.

MeCJay12

Dabbler
Joined
Jul 20, 2018
Messages
18
Hello, I have a FreeNAS machine I installed about a year ago. The install is a VM with two LSI 9211s passed through, 32 GB of DDR3 ECC RAM, and 4 vCores from dual X5660s. I have two arrays: ESXi and Storage. ESXi is working fine; it's a RaidZ of 4 x 256GB SSDs. Storage is now the problem child. It was originally 8 x 1TB drives in a RaidZ2; I added another 4 x 2TB drives later for space.

About two months ago, things started to go sideways without me realizing it. One drive failed. I got the email from FreeNAS, shut the system down, swapped the drive (a 1TB for a 2TB), started the system back up, and clicked replace. It seemed to go OK and I continued with my life. Over the next month, I started getting a lot of weird errors around storage, but no two seemed to be the same, so it never clicked that the problem was FreeNAS and not the individual machines, and the read/write speeds never declined. The errors were things like my NVR complaining about disconnecting from the SMB share often (I thought network issues, since other VMs weren't complaining), my gaming VM downloading slowly (always a troubled VM), a sync machine disconnecting a bit (the OwnCloud sync tool has always been finicky), etc.

About a month ago, I got an email from FreeNAS that another drive had failed; it was the one I had replaced a month prior. It didn't fail in any way I had seen before; it failed from checksum errors. I thought the drive was bad, so I swapped another 2TB in and sent the drive in for RMA. The second drive had checksum errors immediately, but FreeNAS didn't say the array was degraded, so I left it. Another week later FreeNAS changed its mind and marked the drive bad.

I started thinking cables, so I swapped the problem 2TB drive into 4 different bays. In each new bay, the problem drive got its checksum errors and the other drive would be OK. I moved it to bays on different cables, SAS ports, and controllers. I started to think it was because I had swapped a 1TB drive for a 2TB, so I put all the drives back in their original bays and swapped the problem drive for a new 1TB. The new 1TB got the same errors.

I shut the host down and booted into Ubuntu to run extended SMART tests on all the drives. They all passed. When I booted back into FreeNAS, all the checksum errors were gone and the array is healthy again, but every other machine is still having the weird issues. It's at the point now where I'm having trouble copying files off FreeNAS (my off-site backup failed during this problem), Steam downloads are limited by the speed of FreeNAS rather than my 80Mbps Internet, etc. What do I do?
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
What do I do?
How is the network set up? Include details about your vSwitches, uplinks, network card models, physical switch, router, modem, etc.
Also, on the FreeNAS side: what version of FreeNAS are you running? What's the guest OS selection on the VM, what type of vNIC is configured, how full is the pool, and how did you add the 4x2TB drives? Are your LSI controllers on the same firmware, and what version are they on?
Can you DETAIL the model of EVERY component in the system including the power supply?
 

MeCJay12

Dabbler
Joined
Jul 20, 2018
Messages
18
The network is plain. The vSwitch is standard: 1500 MTU, 5 NICs (4x 1Gb, 1x 10Gb), security all set to reject, NIC Teaming is Route based load balancing, Link Status failover, yes for notify switches and failback, the failover order is the 10Gb NIC as active and everything else standby, traffic shaping disabled. The 10Gb NIC and one of the 1Gb NICs go to the same switch, a Ubiquiti US-48-750W. The other ESXi host is connected to the switch exactly the same way. The gaming VM is on the second ESXi host; everything else is on the first host with FreeNAS. The switch has an uplink to pfSense, which then connects directly to my modem.

I have been on the same FreeNAS version through this whole issue. The guest OS selection is Other -> FreeBSD (64-bit). It has one e1000 NIC. The pool is 58% used according to the Storage tab, but I know it is actually less than that due to how it accounts for iSCSI devices as 100%. To be honest, I don't remember how I added the second part of the array anymore, beyond that it is also a RaidZ2. The controllers are on the same firmware: 20.00.07.00.

Dual Xeon X5660 @ 2.80GHz, a Supermicro X8DTE (reporting as X8DT6 in ESXi), 96 GB (12x8GB) of DDR3 ECC RAM, 2 x LSI 9211-8i (reporting as LSI2008), a single-port Mellanox ConnectX-2, a quad-port Intel 82571EB Gb NIC, and dual PWS-801-1R 800W PSUs. The chassis is a Coraid EtherDrive SRX2800-S2, which is a rebranded Supermicro chassis.
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
NIC Teaming is Route based load balancing
That's not an option on a standard vSwitch. Is it routed based on virtual port ID, Source MAC hash, or SRC/DST IP hash?
It has one e1000 NIC.
You should configure that with a vmxnet3 NIC.
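If it helps, a quick sanity check from the FreeNAS shell after switching the adapter type (this assumes a FreeNAS/FreeBSD build recent enough to ship the vmx(4) driver, which 11.x does):

Code:
# After changing the VM's adapter to vmxnet3 and rebooting the guest:
pciconf -lv | grep -B3 VMXNET3   # the virtual NIC should enumerate as a VMware VMXNET3 device
ifconfig vmx0                    # and attach as vmx0 instead of em0 (the e1000 driver)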
As for other speed issues, take a look at some of the benchmarking topics in the resource section of the forum. You need a baseline of your FreeNAS system. Also, how does the storage latency look from ESXi when things are running slow? Generally, I would not expect much from 8 x 1TB drives in a RaidZ2, not in a virtual environment anyway.
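For the latency question, one rough way to watch it live is esxtop from an SSH session on the host (assuming SSH/ESXi Shell is enabled; the thresholds here are just rules of thumb):

Code:
# On the ESXi host, while a slow transfer is running:
esxtop          # press 'u' for the disk-device view, or 'v' for per-VM virtual disks
# Watch DAVG/cmd (device latency) and GAVG/cmd (latency as the guest sees it);
# sustained values much above ~20-30 ms point at the storage rather than the network.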
 
Joined
Dec 29, 2014
Messages
1,135
That's not an option on a standard vSwitch. Is it routed based on virtual port ID, Source MAC hash, or SRC/DST IP hash?

Not only that, your switch has to be configured to match (meaning it must be a managed switch) if you use IP Hash, and probably if you use source MAC hash. You can use 'route based on originating virtual port' or failover without matching configuration in the switch.
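If you want to double-check what the vSwitch is actually set to from the host itself, something like this should show the teaming policy (assuming the vSwitch in question is vSwitch0):

Code:
esxcli network vswitch standard policy failover get --vswitch-name=vSwitch0   # shows load balancing, failover order, notify/failback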
 

MeCJay12

Dabbler
Joined
Jul 20, 2018
Messages
18
Yes, "Route Based on Originating Port ID" I was abbreviating. I'd like to point out the networking hasn't changed since these issues started.

I don't have any baselines, and the tools I previously used to benchmark won't work in my install's current state. If there is something I can run now, please be more specific about what you are looking for. The main indicators I have are constant failures when copying data to and from the SMB share, and Steam downloads to the iSCSI volume maxing out the disk around 4MB/s. I haven't been able to run a Veeam backup in a while because of these drops.
 
Joined
Dec 29, 2014
Messages
1,135
Please post the output of zpool status -v, ideally inside of CODE tags.
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
Also under the VM monitoring, what is the maximum latency seen during a slow transfer? Preferably one that takes at least 40 seconds.
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
You might also try something to the effect of dd if=/dev/zvol/ZVOL-NAME of=/dev/null bs=1M count=1000 on FreeNAS to assess raw sequential read speed, and in a VM you can run CrystalDiskMark to tell us what the VM sees. Please run CrystalDiskMark from each host.
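For example, something along these lines, where Storage/iscsi-vol is just a stand-in for whatever your zvol is actually called (zfs list -t volume will show the real name):

Code:
zfs list -t volume                                                # find the zvol's name
dd if=/dev/zvol/Storage/iscsi-vol of=/dev/null bs=1M count=1000   # read-only, so safe to run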
 

MeCJay12

Dabbler
Joined
Jul 20, 2018
Messages
18
All the VMs use the ESXi array, which is doing fine. The Storage array has been completely offline since I tried deleting some large files this morning (I didn't realize it was offline when I originally posted). I did notice a scrub started, so that may be related. Either way, I can't run CrystalDiskMark against the SMB share now. Here are some CrystalDiskMark results and other related docs from two weeks ago when I asked on Reddit.

Code:
root@freenas:~ # zpool status -v
  pool: ESXi
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:07:06 with 0 errors on Sun Aug 19 00:07:06 2018
config:

		NAME											STATE	 READ WRITE CKSUM
		ESXi											ONLINE	   0	 0	 0
		  raidz1-0									  ONLINE	   0	 0	 0
			gptid/114a0892-23f4-11e8-97f9-000c29412593  ONLINE	   0	 0	 0
			gptid/11ad119a-23f4-11e8-97f9-000c29412593  ONLINE	   0	 0	 0
			gptid/11fe48a6-23f4-11e8-97f9-000c29412593  ONLINE	   0	 0	 0
			gptid/125e8169-23f4-11e8-97f9-000c29412593  ONLINE	   0	 0	 0

errors: No known data errors

  pool: Storage
 state: ONLINE
  scan: scrub in progress since Tue Aug 28 09:39:44 2018
		180K scanned at 18/s, 0 issued at 0/s, 8.49T total
		0 repaired, 0.00% done, no estimated completion time
config:

		NAME											STATE	 READ WRITE CKSUM
		Storage										 ONLINE	   0	 0	 0
		  raidz2-0									  ONLINE	   0	 0	 0
			gptid/10423df5-3838-11e8-b57e-000c29412593  ONLINE	   0	 0	 0
			gptid/e1e21a68-24c6-11e8-8808-000c29412593  ONLINE	   0	 0	 0
			gptid/e2b72606-24c6-11e8-8808-000c29412593  ONLINE	   0	 0	 0
			gptid/e3d793b6-24c6-11e8-8808-000c29412593  ONLINE	   0	 0	 0
		  raidz2-1									  ONLINE	   0	 0	 0
			gptid/e6bd3ba9-24c6-11e8-8808-000c29412593  ONLINE	   0	 0	 0
			gptid/e7ff184b-24c6-11e8-8808-000c29412593  ONLINE	   0	 0	 0
			gptid/e98315ee-24c6-11e8-8808-000c29412593  ONLINE	   0	 0	 0
			gptid/ea812566-24c6-11e8-8808-000c29412593  ONLINE	   0	 0	 0
			gptid/ec0113f5-24c6-11e8-8808-000c29412593  ONLINE	   0	 0	 0
			gptid/edef2bfb-24c6-11e8-8808-000c29412593  ONLINE	   0	 0	 0
			gptid/f0453447-24c6-11e8-8808-000c29412593  ONLINE	   0	 0	 0
			gptid/f182e5f2-a5b0-11e8-a8ef-000c29412593  ONLINE	   0	 0	 0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:00:26 with 0 errors on Mon Aug 27 03:45:26 2018
config:

		NAME		STATE	 READ WRITE CKSUM
		freenas-boot  ONLINE	   0	 0	 0
		  da0p2	 ONLINE	   0	 0	 0

errors: No known data errors
 

MeCJay12

Dabbler
Joined
Jul 20, 2018
Messages
18
Also

Code:
root@freenas:~ # dd if=/dev/zvol/Storage of=/dev/null bs=1M count=1000
0+1 records in
0+1 records out
44 bytes transferred in 0.000037 secs (1179846 bytes/sec)
 
Joined
Dec 29, 2014
Messages
1,135
It sure sounds like something is wrong with that pool if it only read 44 bytes out of the device for the zvol.
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
Well, I was thinking about your iSCSI share. I doubt you have your SMB share on a zvol; that wouldn't make any sense. For your SMB share I'm guessing you're just sharing a folder or dataset on the pool. If that's the case, you can still do a simple dd benchmark, but you would need to disable lz4 compression on the dataset. In that case the command would be dd if=/dev/zero of=/mnt/POOL/DATASET/test.dd, which will give us the write speed. dd if=/mnt/POOL/DATASET/test.dd of=/dev/null will test the read speed.
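The compression part would look roughly like this, with POOL/DATASET standing in for the real names (check the current setting first rather than assuming it's lz4):

Code:
zfs get compression POOL/DATASET     # note what it's currently set to
zfs set compression=off POOL/DATASET # run the dd tests, then set it back afterwards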
 

MeCJay12

Dabbler
Joined
Jul 20, 2018
Messages
18
Actually, if you look back at that zpool status, you can see the scrub hasn't made any progress in what is now 7 hours.
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
Ha, yeah, I rushed that last one. You might want to cancel it; I didn't specify a count...

The first one should be dd if=/dev/zero of=/mnt/POOL/DATASET/test.dd bs=1M count=1000; this will write 1000 1M "blocks".
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
The second command is fine. Don't forget to delete the test.dd file when done.
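Putting the corrected pieces together, the whole run would look roughly like this (POOL/DATASET are still placeholders; the bs=1M on the read is optional and just cuts down per-record overhead):

Code:
zfs set compression=off POOL/DATASET
dd if=/dev/zero of=/mnt/POOL/DATASET/test.dd bs=1M count=1000   # write test: 1000 x 1M blocks
dd if=/mnt/POOL/DATASET/test.dd of=/dev/null bs=1M              # read the same file back
rm /mnt/POOL/DATASET/test.dd                                    # clean up the test file
zfs set compression=lz4 POOL/DATASET                            # restore compression (if it was lz4)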
 

MeCJay12

Dabbler
Joined
Jul 20, 2018
Messages
18
Good news, bad news. The system crashed while I was trying to escape the dd command that had no count. I did get the tests run after the crash, and the SMB share came back (not good, but I can access it again). You can see what I was referring to above: the read/write speeds seem fine, it just keeps dropping connections, so I can't really use it. If we think it's a problem with the array, I have an external drive with enough space to hold everything; I just need to know how to copy stuff to it. There are some files that I really, really don't want to lose, some I'd rather not lose, and some I couldn't care less about. If I can get those critical files off, I don't mind blowing the rest away and starting over.

Code:
root@freenas:~ # dd if=/dev/zero of=/mnt/Storage/Data/test.dd bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes transferred in 0.454919 secs (2304970688 bytes/sec)
root@freenas:~ # dd if=/mnt/Storage/Data/test.dd of=/dev/null
2048000+0 records in
2048000+0 records out
1048576000 bytes transferred in 68.825854 secs (15235205 bytes/sec)
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
root@freenas:~ # dd if=/mnt/Storage/Data/test.dd of=/dev/null 2048000+0 records in 2048000+0 records out 1048576000 bytes transferred in 68.825854 secs (15235205 bytes/sec)
Yeah, that's abysmal. Something's not right. Try running dd if=/mnt/Storage/Data/test.dd of=/dev/null in the background, immediately followed by zpool iostat -v 5; this will show us what the disks are doing during the read.
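Something like this, assuming the test.dd file from before is still there (the & backgrounds the dd so the iostat can start right away):

Code:
dd if=/mnt/Storage/Data/test.dd of=/dev/null bs=1M &
zpool iostat -v Storage 5    # per-vdev read/write ops and bandwidth, sampled every 5 seconds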
 