FreeNAS 9.10 - Slow performance that ended up being a faulty disk even though no errors reported

Status
Not open for further replies.

Morbo

Cadet
Joined
Jun 1, 2014
Messages
8
Edit: I should really have titled this thread "ZFS slow performance" rather than "FreeNAS 9.10 slow performance", as it's more of a ZFS thing. Oh well :)

This is not a question, just a log of a recent event with my server that I thought I would share, for two reasons:

  1. As a log/reference for others (myself included) if you are ever in the same boat.
  2. I learned some lessons that you might want to proactively employ before you are in that boat. In hindsight they are obvious, but that's hindsight for you.
TL;DR:

  1. Be sure you have a replica (or even better, a replica AND a backup) somewhere before you have a disk issue or a degraded pool. I do, and thankfully I didn't need to use it, but even with the assurance that provided, this experience was still a bit disconcerting.
  2. Record the mapping of FreeBSD disk IDs to physical serial numbers.
Yes, these seem obvious, but I bet you haven't done number 2 ;-)

---
My FreeNAS 9.10 U2 server (HP N40L) started performing uncharacteristically slowly. This wasn't a slow degradation; it just happened in the middle of the day. SMB shares took forever to mount and access in Windows, and streaming movies buffered horrendously both in Plex and over SMB into VLC. Copies were sub 10MB/sec (normally 90-110MB/sec). This behaviour seemed to be isolated to a particular pool.

No FreeNAS performance graphs indicated any issue whatsoever; the Load Average, CPU and Disk I/O graphs were all effectively idle. Importantly, "zpool status" showed absolutely no errors on any disk and the pool was fine. Top showed nothing exciting either.

Zpool output:


[root@bender] ~# zpool status
pool: datastore
state: ONLINE
scan: scrub repaired 828K in 24h27m with 0 errors on Mon Feb 20 00:29:01 2017
config:

NAME STATE READ WRITE CKSUM
datastore ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
gptid/50b11673-fe09-11e3-bd33-a0b3cceb0c67 ONLINE 0 0 0
gptid/51964e48-fe09-11e3-bd33-a0b3cceb0c67 ONLINE 0 0 0
gptid/5224aada-fe09-11e3-bd33-a0b3cceb0c67 ONLINE 0 0 0
gptid/52be3684-fe09-11e3-bd33-a0b3cceb0c67 ONLINE 0 0 0



I had recently patched, so I thought it could be that (although I would have expected the behaviour to appear immediately). I rebooted onto the old version, which seemed OK, but performance degraded again within minutes. I then tried a cold power-off to ensure the disk and SATA bus buffers, queues etc. were totally reset, as a reboot may not do that (very unlikely to be the cause, but an easy check). Again it seemed OK initially, but performance degraded to the same levels within minutes.

Watching gstat, though, I could see what I felt was strange behaviour. Usually when streaming a large data file you would see all disks in the RAIDZ1 being equally utilized (which makes sense); on this system that should be ada0, ada1, ada2 and ada3. In this instance, when trying to pull some decent I/O (playing a large mkv file), I would see all disks in the pool fire briefly, but then ada2 would show stats like those below in bursts (not consistently busy) while no other disks were being touched.


dT: 1.003s w: 1.000s
L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
0 0 0 0 0.0 0 0 0.0 0.0| da0
0 0 0 0 0.0 0 0 0.0 0.0| da1
0 0 0 0 0.0 0 0 0.0 0.0| ada0
0 0 0 0 0.0 0 0 0.0 0.0| ada1
4 8 0 0 0.0 8 211 1617 175.0| ada2
0 0 0 0 0.0 0 0 0.0 0.0| ada3
0 0 0 0 0.0 0 0 0.0 0.0| ada4
0 0 0 0 0.0 0 0 0.0 0.0| ada5
0 0 0 0 0.0 0 0 0.0 0.0| da0p1
0 0 0 0 0.0 0 0 0.0 0.0| da0p2
0 0 0 0 0.0 0 0 0.0 0.0| gptid/d43c3d98-8d7e-11e4-a3a8-a0b3cceb0c67
0 0 0 0 0.0 0 0 0.0 0.0| gptid/d4454cb9-8d7e-11e4-a3a8-a0b3cceb0c67
0 0 0 0 0.0 0 0 0.0 0.0| da1p1
0 0 0 0 0.0 0 0 0.0 0.0| da1p2
0 0 0 0 0.0 0 0 0.0 0.0| ada0p1
0 0 0 0 0.0 0 0 0.0 0.0| ada0p2
0 0 0 0 0.0 0 0 0.0 0.0| ada1p1
0 0 0 0 0.0 0 0 0.0 0.0| ada1p2
0 0 0 0 0.0 0 0 0.0 0.0| ada2p1
4 8 0 0 0.0 8 211 1617 175.0| ada2p2
0 0 0 0 0.0 0 0 0.0 0.0| ada3p1
0 0 0 0 0.0 0 0 0.0 0.0| ada3p2
0 0 0 0 0.0 0 0 0.0 0.0| ada4p1
0 0 0 0 0.0 0 0 0.0 0.0| ada4p2
0 0 0 0 0.0 0 0 0.0 0.0| ada5p1
0 0 0 0 0.0 0 0 0.0 0.0| ada5p2
0 0 0 0 0.0 0 0 0.0 0.0| gptid/9a1159a4-b6b2-11e5-82a6-a0b3cceb0c67
0 0 0 0 0.0 0 0 0.0 0.0| gptid/9a2ce9cf-b6b2-11e5-82a6-a0b3cceb0c67
0 0 0 0 0.0 0 0 0.0 0.0| gptid/50b11673-fe09-11e3-bd33-a0b3cceb0c67
0 0 0 0 0.0 0 0 0.0 0.0| gptid/51964e48-fe09-11e3-bd33-a0b3cceb0c67
4 8 0 0 0.0 8 211 1617 175.0| gptid/5224aada-fe09-11e3-bd33-a0b3cceb0c67
0 0 0 0 0.0 0 0 0.0 0.0| gptid/52be3684-fe09-11e3-bd33-a0b3cceb0c67
0 0 0 0 0.0 0 0 0.0 0.0| gptid/89c20a0a-471c-11e6-aee6-a0b3cceb0c67
0 0 0 0 0.0 0 0 0.0 0.0| gptid/89ce24b5-471c-11e6-aee6-a0b3cceb0c67
0 0 0 0 0.0 0 0 0.0 0.0| gptid/32a70474-e545-11e4-965c-a0b3cceb0c67
0 0 0 0 0.0 0 0 0.0 0.0| ada1p1.eli
0 0 0 0 0.0 0 0 0.0 0.0| ada2p1.eli
0 0 0 0 0.0 0 0 0.0 0.0| ada3p1.eli
0 0 0 0 0.0 0 0 0.0 0.0| ada5p1.eli
0 0 0 0 0.0 0 0 0.0 0.0| ada0p1.eli
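Reading that gstat batch, the smoking gun is the ms/w column (column 8): 1617 ms per write on ada2 versus effectively zero everywhere else. A small filter of my own (not from the original troubleshooting, just a sketch) can flag rows like that; on a live box you could feed it real batches with `gstat -b -I 1s | awk '$8+0 > 100 {print "SLOW:", $NF}'`. Here the filter is demonstrated on two captured rows from the output above:

```shell
# Flag any gstat provider whose per-write latency (ms/w, column 8)
# exceeds 100 ms. The trailing "|" sticks to the %busy field, so the
# provider name is the last field ($NF).
flag_slow() { awk '$8+0 > 100 {print "SLOW:", $NF}'; }

out=$(printf '%s\n%s\n' \
  '4 8 0 0 0.0 8 211 1617 175.0| ada2' \
  '0 455 105 5207 4.0 351 18529 1.3 58.0| ada3' | flag_slow)
echo "$out"   # only ada2 trips the threshold
```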



My first thought was that this was the only disk working properly, but the more I observed it, the more obvious it became (in my mind at least) that there must be something wrong with this disk. The likelihood of 3 disks having issues vs 1 didn't add up, although I did consider a SATA controller issue (since no disks were reporting errors). Besides, if the majority of disks had significant issues, surely my pool would not be healthy at all.

LESSON 1: It occurred to me at this point that I was very grateful I had a replicated copy of the important datasets from this pool. Sure, the pool was still available, but copying large volumes off at this stage would be extremely slow, and many tools would likely time out or error due to the erratic performance, making a last-minute copy/replication very painful.

I still wasn't convinced about ada2 (not a lot of hard evidence), but at that point the system stopped responding on the network, so I rebooted it. As it came up I saw a lot of disk-related messages on the console, but I couldn't catch them fast enough. FreeNAS then paused on importing the pool for a LONG time... which was expected given the performance characteristics I'd observed, but still a little disconcerting.

LESSON 2: At this point I really wished I knew the mapping of device numbers to disks, specifically FreeBSD IDs to serial numbers. If the pool import was going to take days (12TB pool), or not complete at all, then I would likely be replacing a disk based on the information in the boot log, which is based on bus IDs etc. and prone to human error. (I wish I had the output of all the commands that I ran from here!)
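This mapping is trivial to capture while everything is healthy. A minimal sketch (my own, with the live loop shown in comments since it needs real /dev/ada* devices; the serial extraction itself is demonstrated on captured smartctl output):

```shell
# On the live system, something like this records device -> serial:
#
#   for d in /dev/ada?; do
#     printf '%s\t' "$d"
#     smartctl -i "$d" | awk -F': +' '/^Serial Number/{print $2}'
#   done > /root/disk-serials.txt
#
# The extraction step, demonstrated on a captured smartctl line:
extract_serial() { awk -F': +' '/^Serial Number/{print $2}'; }

s=$(printf 'Serial Number:    WD-WMC4N1733188\n' | extract_serial)
echo "$s"
```

Pairing that file with `glabel status` (gptid to partition) gives the full chain from ZFS label to physical disk.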

I let the system boot overnight and thankfully it came up. Still no errors were being reported anywhere, but the behaviour was the same: gstat showed the same pattern on ada2, so I figured it was time to investigate that disk.

First I listed the disks to get their device names. The ones in this pool are ada0p2 through ada3p2 (grouped together below).


[root@bender] ~# glabel status
Name Status Components
gptid/d43c3d98-8d7e-11e4-a3a8-a0b3cceb0c67 N/A da0p1
gptid/d4454cb9-8d7e-11e4-a3a8-a0b3cceb0c67 N/A da0p2
gptid/9a1159a4-b6b2-11e5-82a6-a0b3cceb0c67 N/A da1p1
gptid/9a2ce9cf-b6b2-11e5-82a6-a0b3cceb0c67 N/A da1p2
gptid/50b11673-fe09-11e3-bd33-a0b3cceb0c67 N/A ada0p2
gptid/51964e48-fe09-11e3-bd33-a0b3cceb0c67 N/A ada1p2
gptid/5224aada-fe09-11e3-bd33-a0b3cceb0c67 N/A ada2p2
gptid/52be3684-fe09-11e3-bd33-a0b3cceb0c67 N/A ada3p2

gptid/89c20a0a-471c-11e6-aee6-a0b3cceb0c67 N/A ada4p1
gptid/89ce24b5-471c-11e6-aee6-a0b3cceb0c67 N/A ada4p2
gptid/32a70474-e545-11e4-965c-a0b3cceb0c67 N/A ada5p2




Then I ran smartctl to get more info on each one. Given my earlier lesson, I ran it against all of them and recorded the output. Note that you can also use "camcontrol identify <device>", but I like this output better.


[root@bender] ~# smartctl -i /dev/ada0
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Green
Device Model: WDC WD30EZRX-00D8PB0
Serial Number: WD-WMC4N1733188
LU WWN Device Id: 5 0014ee 003caa2b7
Firmware Version: 80.00A80
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Fri Mar 31 06:46:08 2017 AEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

[root@bender] ~# smartctl -i /dev/ada1
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Green
Device Model: WDC WD30EZRX-00D8PB0
Serial Number: WD-WMC4N2408094
LU WWN Device Id: 5 0014ee 0591fb324
Firmware Version: 80.00A80
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Fri Mar 31 06:46:38 2017 AEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

[root@bender] ~# smartctl -i /dev/ada2
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Green
Device Model: WDC WD30EZRX-00DC0B0
Serial Number: WD-WMC1T0706688
LU WWN Device Id: 5 0014ee 003704421
Firmware Version: 80.00A80
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Fri Mar 31 06:46:57 2017 AEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

[root@bender] ~# smartctl -i /dev/ada3
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Green
Device Model: WDC WD30EZRX-00DC0B0
Serial Number: WD-WMC1T1854862
LU WWN Device Id: 5 0014ee 0ae2d7a01
Firmware Version: 80.00A80
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Fri Mar 31 06:47:01 2017 AEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled




The big revelation here was that commands against ada2 took significantly longer than against the other disks. Using the "time" command I could quantify it. Output for smartctl:



[root@bender] ~# time smartctl -i /dev/ada1
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Green
Device Model: WDC WD30EZRX-00D8PB0
Serial Number: WD-WMC4N2408094
LU WWN Device Id: 5 0014ee 0591fb324
Firmware Version: 80.00A80
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Fri Mar 31 07:34:21 2017 AEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

0.028u 0.000s 0:00.02 100.0% 976+416k 0+0io 0pf+0w

[root@bender] ~# time smartctl -i /dev/ada2
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Green
Device Model: WDC WD30EZRX-00DC0B0
Serial Number: WD-WMC1T0706688
LU WWN Device Id: 5 0014ee 003704421
Firmware Version: 80.00A80
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Fri Mar 31 07:34:25 2017 AEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

0.028u 0.000s 0:01.95 1.0% 732+312k 0+0io 0pf+0w





Output for camcontrol:



[root@bender] ~# time camcontrol identify ada1
pass1: <WDC WD30EZRX-00D8PB0 80.00A80> ACS-2 ATA SATA 3.x device
pass1: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)

protocol ATA/ATAPI-9 SATA 3.x
device model WDC WD30EZRX-00D8PB0
firmware revision 80.00A80
serial number WD-WMC4N2408094
WWN 50014ee0591fb324
cylinders 16383
heads 16
sectors/track 63
sector size logical 512, physical 4096, offset 0
LBA supported 268435455 sectors
LBA48 supported 5860533168 sectors
PIO supported PIO4
DMA supported WDMA2 UDMA6
media RPM 5400

Feature Support Enabled Value Vendor
read ahead yes yes
write cache yes yes
flush cache yes yes
overlap no
Tagged Command Queuing (TCQ) no no
Native Command Queuing (NCQ) yes 32 tags
NCQ Queue Management no
NCQ Streaming no
Receive & Send FPDMA Queued no
SMART yes yes
microcode download yes yes
security yes no
power management yes yes
advanced power management no no
automatic acoustic management no no
media status notification no no
power-up in Standby yes no
write-read-verify no no
unload no no
general purpose logging yes yes
free-fall no no
Data Set Management (DSM/TRIM) no
Host Protected Area (HPA) yes no 5860533168/5860533168
HPA - Security no
0.000u 0.002s 0:00.00 0.0% 0+0k 0+0io 0pf+0w


[root@bender] ~# time camcontrol identify ada2
pass2: <WDC WD30EZRX-00DC0B0 80.00A80> ACS-2 ATA SATA 3.x device
pass2: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)

protocol ATA/ATAPI-9 SATA 3.x
device model WDC WD30EZRX-00DC0B0
firmware revision 80.00A80
serial number WD-WMC1T0706688
WWN 50014ee003704421
cylinders 16383
heads 16
sectors/track 63
sector size logical 512, physical 4096, offset 0
LBA supported 268435455 sectors
LBA48 supported 5860533168 sectors
PIO supported PIO4
DMA supported WDMA2 UDMA6

Feature Support Enabled Value Vendor
read ahead yes yes
write cache yes yes
flush cache yes yes
overlap no
Tagged Command Queuing (TCQ) no no
Native Command Queuing (NCQ) yes 32 tags
NCQ Queue Management no
NCQ Streaming no
Receive & Send FPDMA Queued no
SMART yes yes
microcode download yes yes
security yes no
power management yes yes
advanced power management no no
automatic acoustic management no no
media status notification no no
power-up in Standby yes no
write-read-verify no no
unload no no
general purpose logging yes yes
free-fall no no
Data Set Management (DSM/TRIM) no
Host Protected Area (HPA) yes no 5860533168/5860533168
HPA - Security no
0.000u 0.002s 0:03.98 0.0% 0+0k 0+0io 0pf+0w



Commands against every disk except ada2 returned almost instantly; ada2 took a few seconds to come back. Now I felt a lot more confident that this disk was the root cause (or at least a contributor), so I decided to offline it, which would deliberately degrade the pool.

IMPORTANT: You need to be sure about this and committed to the consequences if you get it wrong. The behaviour I was expecting was that the pool would become degraded but perform normally, because ada2 would no longer be present. If that didn't turn out to be the case, the result would be a degraded pool that still performed badly and would likely take a LONG time to recover when I onlined or replaced ada2.

I offlined the disk using its label (this took a few minutes):


[root@bender] ~# zpool offline datastore gptid/5224aada-fe09-11e3-bd33-a0b3cceb0c67


Confirmed the status of the pool:



[root@bender] ~# zpool status
pool: datastore
state: DEGRADED
status: One or more devices has been taken offline by the administrator.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Online the device using 'zpool online' or replace the device with
'zpool replace'.
scan: scrub repaired 828K in 24h27m with 0 errors on Mon Feb 20 00:29:01 2017
config:

NAME STATE READ WRITE CKSUM
datastore DEGRADED 0 0 0
raidz1-0 DEGRADED 0 0 0
gptid/50b11673-fe09-11e3-bd33-a0b3cceb0c67 ONLINE 0 0 0
gptid/51964e48-fe09-11e3-bd33-a0b3cceb0c67 ONLINE 0 0 0
12312091975756753222 OFFLINE 0 0 0 was /dev/gptid/5224aada-fe09-11e3-bd33-a0b3cceb0c67
gptid/52be3684-fe09-11e3-bd33-a0b3cceb0c67 ONLINE 0 0 0

errors: No known data errors



All was as expected: the system immediately started performing correctly, and gstat showed I/O being distributed across the 3 remaining disks (ada0, ada1, ada3). WINNING!


dT: 1.004s w: 1.000s
L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
0 0 0 0 0.0 0 0 0.0 0.0| da0
0 0 0 0 0.0 0 0 0.0 0.0| da1
5 458 96 5199 14.3 363 18956 2.1 89.1| ada0
0 455 102 5538 11.8 354 18617 2.0 83.2| ada1
0 0 0 0 0.0 0 0 0.0 0.0| ada2
0 455 105 5207 4.0 351 18529 1.3 58.0| ada3
0 0 0 0 0.0 0 0 0.0 0.0| ada4
0 0 0 0 0.0 0 0 0.0 0.0| ada5
0 0 0 0 0.0 0 0 0.0 0.0| da0p1
0 0 0 0 0.0 0 0 0.0 0.0| da0p2
0 0 0 0 0.0 0 0 0.0 0.0| gptid/d43c3d98-8d7e-11e4-a3a8-a0b3cceb0c67
0 0 0 0 0.0 0 0 0.0 0.0| gptid/d4454cb9-8d7e-11e4-a3a8-a0b3cceb0c67
0 0 0 0 0.0 0 0 0.0 0.0| da1p1
0 0 0 0 0.0 0 0 0.0 0.0| da1p2
0 0 0 0 0.0 0 0 0.0 0.0| ada0p1
5 458 96 5199 14.3 363 18956 2.1 89.3| ada0p2
0 0 0 0 0.0 0 0 0.0 0.0| ada1p1
0 455 102 5538 11.8 354 18617 2.0 83.4| ada1p2
0 0 0 0 0.0 0 0 0.0 0.0| ada2p1
0 0 0 0 0.0 0 0 0.0 0.0| ada2p2
0 0 0 0 0.0 0 0 0.0 0.0| ada3p1
0 455 105 5207 4.0 351 18529 1.3 58.3| ada3p2
0 0 0 0 0.0 0 0 0.0 0.0| ada4p1
0 0 0 0 0.0 0 0 0.0 0.0| ada4p2
0 0 0 0 0.0 0 0 0.0 0.0| ada5p1
0 0 0 0 0.0 0 0 0.0 0.0| ada5p2
0 0 0 0 0.0 0 0 0.0 0.0| gptid/9a1159a4-b6b2-11e5-82a6-a0b3cceb0c67
0 0 0 0 0.0 0 0 0.0 0.0| gptid/9a2ce9cf-b6b2-11e5-82a6-a0b3cceb0c67
5 458 96 5199 14.3 363 18956 2.1 89.4| gptid/50b11673-fe09-11e3-bd33-a0b3cceb0c67
0 455 102 5538 11.8 354 18617 2.0 83.6| gptid/51964e48-fe09-11e3-bd33-a0b3cceb0c67
0 0 0 0 0.0 0 0 0.0 0.0| gptid/5224aada-fe09-11e3-bd33-a0b3cceb0c67
0 455 105 5207 4.0 351 18529 1.3 58.4| gptid/52be3684-fe09-11e3-bd33-a0b3cceb0c67
0 0 0 0 0.0 0 0 0.0 0.0| gptid/89c20a0a-471c-11e6-aee6-a0b3cceb0c67
0 0 0 0 0.0 0 0 0.0 0.0| gptid/89ce24b5-471c-11e6-aee6-a0b3cceb0c67
0 0 0 0 0.0 0 0 0.0 0.0| gptid/32a70474-e545-11e4-965c-a0b3cceb0c67
0 0 0 0 0.0 0 0 0.0 0.0| ada1p1.eli
0 0 0 0 0.0 0 0 0.0 0.0| ada2p1.eli
0 0 0 0 0.0 0 0 0.0 0.0| ada3p1.eli
0 0 0 0 0.0 0 0 0.0 0.0| ada5p1.eli
0 0 0 0 0.0 0 0 0.0 0.0| ada0p1.eli



The last action is to replace the disk.
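For completeness, a hedged sketch of that replacement step (on FreeNAS this would normally be done from the GUI via Volume Status -> Replace, which also handles partitioning; the shell sequence below is only the rough shape). The numeric guid to replace, 12312091975756753222, is the one shown for the offlined vdev in the zpool status output above, and can be scripted out of that output:

```shell
# Rough shell-side shape of the replacement (assumes a prepared
# replacement partition; the GUI does this for you):
#
#   zpool replace datastore 12312091975756753222 gptid/<new-partition-gptid>
#   zpool status datastore    # watch the resilver progress
#
# Pulling the offlined vdev's guid out of `zpool status` output,
# demonstrated on the captured status text:
find_offline_guid() { awk '$2 == "OFFLINE" {print $1}'; }

guid=$(printf '%s\n%s\n' \
  'gptid/51964e48-fe09-11e3-bd33-a0b3cceb0c67 ONLINE 0 0 0' \
  '12312091975756753222 OFFLINE 0 0 0' | find_offline_guid)
echo "$guid"
```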
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
LESSON 3: there's a reason NAS drives are recommended for any kind of software or hardware RAID. With TLER or the equivalent, a flaky drive won't make the whole system unresponsive.
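You can check whether a given drive offers TLER/ERC with smartctl, which queries (and on drives that allow it, sets) the SCT Error Recovery Control timers. Desktop drives like the WD Greens in this thread typically report the timers as disabled or reject the set entirely. A sketch, with the live commands in comments and the detection step demonstrated on sample query output:

```shell
# On the live system:
#
#   smartctl -l scterc /dev/ada0          # query read/write ERC timers
#   smartctl -l scterc,70,70 /dev/ada0    # attempt to set 7.0 s timeouts
#
# Detecting a disabled read timer in the query output (sample text):
erc_disabled() { grep -c 'Read: *Disabled'; }

n=$(printf 'SCT Error Recovery Control:\n           Read: Disabled\n' | erc_disabled)
echo "$n"   # count of disabled read-timer lines found
```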
 

Morbo

Cadet
Joined
Jun 1, 2014
Messages
8
A fair point indeed. For my situation I elected to save the money and go with the cheapest drives; however, if you are running a system where performance degradation is not tolerable, then better-quality drives are a must. I got a good ~2-3 years out of those drives, which was more than I expected, but I wouldn't be surprised if a few more start to fall away soon.
 

darkwarrior

Patron
Joined
Mar 29, 2015
Messages
336
Hi there,

Also, don't forget that this kind of thing would have happened on any operating system (Windows/Linux/etc.); ZFS is not at fault here ... ;)
I once had a flaky SSD, and another time a flaky HDD, and the system would simply lock up while waiting for file I/O from those drives, just waiting until the timeout was reached ... :rolleyes:
 
Joined
Dec 2, 2015
Messages
730
LESSON 3: there's a reason NAS drives are recommended for any kind of software or hardware RAID. With TLER or the equivalent, a flaky drive won't make the whole system unresponsive.
If the system had drives with TLER, e.g. WD Red, what would the likely symptoms have been if a drive was throwing errors when reading? I assume the user would get some sort of warning that a drive was sick, but I wonder what that error would be.
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
One possibility is that the read error count would have incremented, which would result in an email from FreeNAS and an entry in the drive's SMART error log.
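The places such errors surface are the SMART attribute table (`smartctl -A`) and the ATA error log (`smartctl -l error`). A small filter of my own for the attributes most worth alerting on, where the raw value is column 10 of the attribute table; demonstrated on a hypothetical sample row (the raw value 12 is made up for illustration):

```shell
# Flag SMART attributes that indicate media trouble when their raw
# value (field 10) is non-zero.
worrying() { awk '$2 ~ /Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable/ && $10+0 > 0 {print $2 "=" $10}'; }

w=$(printf '  5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 12\n' | worrying)
echo "$w"
```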
 