FreeNAS 9.10 - Slow performance that ended up being a faulty disk even though no errors reported

Status
Not open for further replies.

Morbo

Cadet
Joined
Jun 1, 2014
Messages
8
Edit: I should really have titled this thread "ZFS slow performance" rather than "FreeNAS 9.10 slow performance", as it's more of a ZFS thing. Oh well :)

This is not a question, just a log of a recent event with my server that I thought I would share, for two reasons:

  1. As a log/reference for others (myself included) if you are ever in the same boat.
  2. I learned some lessons that you might want to proactively employ before you are in that boat. In hindsight they are obvious, but that's hindsight for you.
TL;DR:

  1. Be sure you have a replica (or even better, a replica AND a backup) somewhere before you have a disk issue or a degraded pool. I do, and thankfully I didn't need to use it, but even with the assurance that provided, this experience was still a bit disconcerting.
  2. Record the mapping of FreeBSD disk IDs to physical serial numbers.
Yes, these seem obvious, but I bet you haven't done number 2 ;-)

---
My FreeNAS 9.10 U2 server (HP N40L) started performing uncharacteristically slowly. This wasn't a slow degradation; it just happened in the middle of the day. SMB shares took forever to mount and access in Windows, and streaming movies buffered horrendously both in Plex and over SMB into VLC. Copies were sub 10MB/sec (normally 90-110MB/sec). This behaviour seemed to be isolated to a particular pool.

No FreeNAS performance graphs indicated any issue whatsoever; the Load Average, CPU and Disk I/O graphs were all effectively idle. Importantly, "zpool status" showed absolutely no errors on any disk and the pool was fine. Top showed nothing exciting either.

Zpool output:


[root@bender] ~# zpool status
pool: datastore
state: ONLINE
scan: scrub repaired 828K in 24h27m with 0 errors on Mon Feb 20 00:29:01 2017
config:

NAME STATE READ WRITE CKSUM
datastore ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
gptid/50b11673-fe09-11e3-bd33-a0b3cceb0c67 ONLINE 0 0 0
gptid/51964e48-fe09-11e3-bd33-a0b3cceb0c67 ONLINE 0 0 0
gptid/5224aada-fe09-11e3-bd33-a0b3cceb0c67 ONLINE 0 0 0
gptid/52be3684-fe09-11e3-bd33-a0b3cceb0c67 ONLINE 0 0 0



I had recently patched, so I thought it could be that (although I would have expected the behaviour to appear immediately). I rebooted onto the old version, which seemed OK, but performance degraded again within minutes. I then tried a cold power-off to ensure the disk and SATA bus buffers, queues etc. were totally reset, as a reboot may not do that (very unlikely to be the cause, but an easy check). Again it seemed OK initially, but performance degraded to the same levels within minutes.

Watching gstat, though, I could see what I felt was strange behaviour. Usually when streaming a large data file you would see all disks in the RAIDZ1 being equally utilized (which makes sense); on this system that should be ada0, ada1, ada2 and ada3. In this instance, when trying to pull some decent I/O (playing a large mkv file), I would see all disks in the pool fire briefly, but then ada2 would show stats like those below in bursts (not consistently busy) while no other disks were being touched.


dT: 1.003s w: 1.000s
L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
0 0 0 0 0.0 0 0 0.0 0.0| da0
0 0 0 0 0.0 0 0 0.0 0.0| da1
0 0 0 0 0.0 0 0 0.0 0.0| ada0
0 0 0 0 0.0 0 0 0.0 0.0| ada1
4 8 0 0 0.0 8 211 1617 175.0| ada2
0 0 0 0 0.0 0 0 0.0 0.0| ada3
0 0 0 0 0.0 0 0 0.0 0.0| ada4
0 0 0 0 0.0 0 0 0.0 0.0| ada5
0 0 0 0 0.0 0 0 0.0 0.0| da0p1
0 0 0 0 0.0 0 0 0.0 0.0| da0p2
0 0 0 0 0.0 0 0 0.0 0.0| gptid/d43c3d98-8d7e-11e4-a3a8-a0b3cceb0c67
0 0 0 0 0.0 0 0 0.0 0.0| gptid/d4454cb9-8d7e-11e4-a3a8-a0b3cceb0c67
0 0 0 0 0.0 0 0 0.0 0.0| da1p1
0 0 0 0 0.0 0 0 0.0 0.0| da1p2
0 0 0 0 0.0 0 0 0.0 0.0| ada0p1
0 0 0 0 0.0 0 0 0.0 0.0| ada0p2
0 0 0 0 0.0 0 0 0.0 0.0| ada1p1
0 0 0 0 0.0 0 0 0.0 0.0| ada1p2
0 0 0 0 0.0 0 0 0.0 0.0| ada2p1
4 8 0 0 0.0 8 211 1617 175.0| ada2p2
0 0 0 0 0.0 0 0 0.0 0.0| ada3p1
0 0 0 0 0.0 0 0 0.0 0.0| ada3p2
0 0 0 0 0.0 0 0 0.0 0.0| ada4p1
0 0 0 0 0.0 0 0 0.0 0.0| ada4p2
0 0 0 0 0.0 0 0 0.0 0.0| ada5p1
0 0 0 0 0.0 0 0 0.0 0.0| ada5p2
0 0 0 0 0.0 0 0 0.0 0.0| gptid/9a1159a4-b6b2-11e5-82a6-a0b3cceb0c67
0 0 0 0 0.0 0 0 0.0 0.0| gptid/9a2ce9cf-b6b2-11e5-82a6-a0b3cceb0c67
0 0 0 0 0.0 0 0 0.0 0.0| gptid/50b11673-fe09-11e3-bd33-a0b3cceb0c67
0 0 0 0 0.0 0 0 0.0 0.0| gptid/51964e48-fe09-11e3-bd33-a0b3cceb0c67
4 8 0 0 0.0 8 211 1617 175.0| gptid/5224aada-fe09-11e3-bd33-a0b3cceb0c67
0 0 0 0 0.0 0 0 0.0 0.0| gptid/52be3684-fe09-11e3-bd33-a0b3cceb0c67
0 0 0 0 0.0 0 0 0.0 0.0| gptid/89c20a0a-471c-11e6-aee6-a0b3cceb0c67
0 0 0 0 0.0 0 0 0.0 0.0| gptid/89ce24b5-471c-11e6-aee6-a0b3cceb0c67
0 0 0 0 0.0 0 0 0.0 0.0| gptid/32a70474-e545-11e4-965c-a0b3cceb0c67
0 0 0 0 0.0 0 0 0.0 0.0| ada1p1.eli
0 0 0 0 0.0 0 0 0.0 0.0| ada2p1.eli
0 0 0 0 0.0 0 0 0.0 0.0| ada3p1.eli
0 0 0 0 0.0 0 0 0.0 0.0| ada5p1.eli
0 0 0 0 0.0 0 0 0.0 0.0| ada0p1.eli
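Reading that gstat batch, the smoking gun is the ms/w column (column 8): 1617 ms per write on ada2 versus effectively zero everywhere else. A small filter of my own (not from the original troubleshooting, just a sketch) can flag rows like that; on a live box you could feed it real batches with `gstat -b -I 1s | awk '$8+0 > 100 {print "SLOW:", $NF}'`. Here the filter is demonstrated on two captured rows from the output above:

```shell
# Flag any gstat provider whose per-write latency (ms/w, column 8)
# exceeds 100 ms. The trailing "|" sticks to the %busy field, so the
# provider name is the last field ($NF).
flag_slow() { awk '$8+0 > 100 {print "SLOW:", $NF}'; }

out=$(printf '%s\n%s\n' \
  '4 8 0 0 0.0 8 211 1617 175.0| ada2' \
  '0 455 105 5207 4.0 351 18529 1.3 58.0| ada3' | flag_slow)
echo "$out"   # only ada2 trips the threshold
```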



My first thought was that this was the only disk working properly, but the more I observed it, the more obvious it became (in my mind at least) that there must be something wrong with this disk. The likelihood of 3 disks having issues vs 1 didn't add up, although I did consider a SATA controller issue (since no disks were reporting errors). Besides, if the majority of disks had significant issues, surely my pool would not be healthy at all.

LESSON 1: It occurred to me at this point that I was very grateful I had a replicated copy of the important datasets from this pool. Sure, the pool was still available, but copying large volumes off at this stage would be extremely slow, and many tools would likely time out or error due to the erratic performance, making a last-minute copy/replication very painful.

I still wasn't convinced about ada2 (not a lot of hard evidence), but at that point the system stopped responding on the network, so I rebooted it. As it came up I saw a lot of disk-related messages on the console, but I couldn't catch them fast enough. FreeNAS then paused on importing the pool for a LONG time... which was expected given the performance characteristics I'd observed, but still a little disconcerting.

LESSON 2: At this point I really wished I knew the mapping of device numbers to disks, specifically FreeBSD IDs to serial numbers. If the pool import was going to take days (12TB pool), or not complete at all, then I would likely be replacing a disk based on the information in the boot log, which is based on bus IDs etc. and prone to human error. (I wish I had the output of all the commands that I ran from here!)
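This mapping is trivial to capture while everything is healthy. A minimal sketch (my own, with the live loop shown in comments since it needs real /dev/ada* devices; the serial extraction itself is demonstrated on captured smartctl output):

```shell
# On the live system, something like this records device -> serial:
#
#   for d in /dev/ada?; do
#     printf '%s\t' "$d"
#     smartctl -i "$d" | awk -F': +' '/^Serial Number/{print $2}'
#   done > /root/disk-serials.txt
#
# The extraction step, demonstrated on a captured smartctl line:
extract_serial() { awk -F': +' '/^Serial Number/{print $2}'; }

s=$(printf 'Serial Number:    WD-WMC4N1733188\n' | extract_serial)
echo "$s"
```

Pairing that file with `glabel status` (gptid to partition) gives the full chain from ZFS label to physical disk.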

I let the system boot overnight and thankfully it came up. Still no errors were being reported anywhere, but the behaviour was the same: gstat showed the same pattern on ada2, so I figured it was time to investigate that disk.

First I listed the disks to get their device names. The ones in this pool are ada0p2 through ada3p2 (grouped together below).


[root@bender] ~# glabel status
Name Status Components
gptid/d43c3d98-8d7e-11e4-a3a8-a0b3cceb0c67 N/A da0p1
gptid/d4454cb9-8d7e-11e4-a3a8-a0b3cceb0c67 N/A da0p2
gptid/9a1159a4-b6b2-11e5-82a6-a0b3cceb0c67 N/A da1p1
gptid/9a2ce9cf-b6b2-11e5-82a6-a0b3cceb0c67 N/A da1p2
gptid/50b11673-fe09-11e3-bd33-a0b3cceb0c67 N/A ada0p2
gptid/51964e48-fe09-11e3-bd33-a0b3cceb0c67 N/A ada1p2
gptid/5224aada-fe09-11e3-bd33-a0b3cceb0c67 N/A ada2p2
gptid/52be3684-fe09-11e3-bd33-a0b3cceb0c67 N/A ada3p2

gptid/89c20a0a-471c-11e6-aee6-a0b3cceb0c67 N/A ada4p1
gptid/89ce24b5-471c-11e6-aee6-a0b3cceb0c67 N/A ada4p2
gptid/32a70474-e545-11e4-965c-a0b3cceb0c67 N/A ada5p2




Then I ran smartctl to get more info on each one. Given my earlier lesson, I ran it against all of them and recorded the output. Note that you can also use "camcontrol identify <device>", but I like this output better.


[root@bender] ~# smartctl -i /dev/ada0
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Green
Device Model: WDC WD30EZRX-00D8PB0
Serial Number: WD-WMC4N1733188
LU WWN Device Id: 5 0014ee 003caa2b7
Firmware Version: 80.00A80
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Fri Mar 31 06:46:08 2017 AEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

[root@bender] ~# smartctl -i /dev/ada1
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Green
Device Model: WDC WD30EZRX-00D8PB0
Serial Number: WD-WMC4N2408094
LU WWN Device Id: 5 0014ee 0591fb324
Firmware Version: 80.00A80
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Fri Mar 31 06:46:38 2017 AEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

[root@bender] ~# smartctl -i /dev/ada2
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Green
Device Model: WDC WD30EZRX-00DC0B0
Serial Number: WD-WMC1T0706688
LU WWN Device Id: 5 0014ee 003704421
Firmware Version: 80.00A80
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Fri Mar 31 06:46:57 2017 AEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

[root@bender] ~# smartctl -i /dev/ada3
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Green
Device Model: WDC WD30EZRX-00DC0B0
Serial Number: WD-WMC1T1854862
LU WWN Device Id: 5 0014ee 0ae2d7a01
Firmware Version: 80.00A80
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Fri Mar 31 06:47:01 2017 AEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled




The big revelation here was that commands against ada2 took significantly longer than against the other disks. Using the "time" command I could quantify it. Output for smartctl:



[root@bender] ~# time smartctl -i /dev/ada1
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Green
Device Model: WDC WD30EZRX-00D8PB0
Serial Number: WD-WMC4N2408094
LU WWN Device Id: 5 0014ee 0591fb324
Firmware Version: 80.00A80
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Fri Mar 31 07:34:21 2017 AEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

0.028u 0.000s 0:00.02 100.0% 976+416k 0+0io 0pf+0w

[root@bender] ~# time smartctl -i /dev/ada2
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Green
Device Model: WDC WD30EZRX-00DC0B0
Serial Number: WD-WMC1T0706688
LU WWN Device Id: 5 0014ee 003704421
Firmware Version: 80.00A80
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Fri Mar 31 07:34:25 2017 AEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

0.028u 0.000s 0:01.95 1.0% 732+312k 0+0io 0pf+0w





Output for camcontrol:



[root@bender] ~# time camcontrol identify ada1
pass1: <WDC WD30EZRX-00D8PB0 80.00A80> ACS-2 ATA SATA 3.x device
pass1: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)

protocol ATA/ATAPI-9 SATA 3.x
device model WDC WD30EZRX-00D8PB0
firmware revision 80.00A80
serial number WD-WMC4N2408094
WWN 50014ee0591fb324
cylinders 16383
heads 16
sectors/track 63
sector size logical 512, physical 4096, offset 0
LBA supported 268435455 sectors
LBA48 supported 5860533168 sectors
PIO supported PIO4
DMA supported WDMA2 UDMA6
media RPM 5400

Feature Support Enabled Value Vendor
read ahead yes yes
write cache yes yes
flush cache yes yes
overlap no
Tagged Command Queuing (TCQ) no no
Native Command Queuing (NCQ) yes 32 tags
NCQ Queue Management no
NCQ Streaming no
Receive & Send FPDMA Queued no
SMART yes yes
microcode download yes yes
security yes no
power management yes yes
advanced power management no no
automatic acoustic management no no
media status notification no no
power-up in Standby yes no
write-read-verify no no
unload no no
general purpose logging yes yes
free-fall no no
Data Set Management (DSM/TRIM) no
Host Protected Area (HPA) yes no 5860533168/5860533168
HPA - Security no
0.000u 0.002s 0:00.00 0.0% 0+0k 0+0io 0pf+0w


[root@bender] ~# time camcontrol identify ada2
pass2: <WDC WD30EZRX-00DC0B0 80.00A80> ACS-2 ATA SATA 3.x device
pass2: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)

protocol ATA/ATAPI-9 SATA 3.x
device model WDC WD30EZRX-00DC0B0
firmware revision 80.00A80
serial number WD-WMC1T0706688
WWN 50014ee003704421
cylinders 16383
heads 16
sectors/track 63
sector size logical 512, physical 4096, offset 0
LBA supported 268435455 sectors
LBA48 supported 5860533168 sectors
PIO supported PIO4
DMA supported WDMA2 UDMA6

Feature Support Enabled Value Vendor
read ahead yes yes
write cache yes yes
flush cache yes yes
overlap no
Tagged Command Queuing (TCQ) no no
Native Command Queuing (NCQ) yes 32 tags
NCQ Queue Management no
NCQ Streaming no
Receive & Send FPDMA Queued no
SMART yes yes
microcode download yes yes
security yes no
power management yes yes
advanced power management no no
automatic acoustic management no no
media status notification no no
power-up in Standby yes no
write-read-verify no no
unload no no
general purpose logging yes yes
free-fall no no
Data Set Management (DSM/TRIM) no
Host Protected Area (HPA) yes no 5860533168/5860533168
HPA - Security no
0.000u 0.002s 0:03.98 0.0% 0+0k 0+0io 0pf+0w



Commands against every disk except ada2 returned almost instantly; ada2 took a few seconds to come back. Now I felt a lot more confident that this disk was the root cause (or at least a contributor), so I decided to offline it, which would deliberately degrade the pool.

IMPORTANT: You need to be sure about this and committed to the consequences if you get it wrong. The behaviour I was expecting was that the pool would become degraded but perform normally, because ada2 would no longer be present. If that didn't turn out to be the case, the result would be a degraded pool that still performed badly and would likely take a LONG time to recover when I onlined or replaced ada2.

I offlined the disk using its label (this took a few minutes):


[root@bender] ~# zpool offline datastore gptid/5224aada-fe09-11e3-bd33-a0b3cceb0c67


Confirmed the status of the pool:



[root@bender] ~# zpool status
pool: datastore
state: DEGRADED
status: One or more devices has been taken offline by the administrator.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Online the device using 'zpool online' or replace the device with
'zpool replace'.
scan: scrub repaired 828K in 24h27m with 0 errors on Mon Feb 20 00:29:01 2017
config:

NAME STATE READ WRITE CKSUM
datastore DEGRADED 0 0 0
raidz1-0 DEGRADED 0 0 0
gptid/50b11673-fe09-11e3-bd33-a0b3cceb0c67 ONLINE 0 0 0
gptid/51964e48-fe09-11e3-bd33-a0b3cceb0c67 ONLINE 0 0 0
12312091975756753222 OFFLINE 0 0 0 was /dev/gptid/5224aada-fe09-11e3-bd33-a0b3cceb0c67
gptid/52be3684-fe09-11e3-bd33-a0b3cceb0c67 ONLINE 0 0 0

errors: No known data errors



All was as expected: the system immediately started performing correctly, and gstat showed I/O being distributed across the 3 remaining disks (ada0, ada1, ada3). WINNING!


dT: 1.004s w: 1.000s
L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
0 0 0 0 0.0 0 0 0.0 0.0| da0
0 0 0 0 0.0 0 0 0.0 0.0| da1
5 458 96 5199 14.3 363 18956 2.1 89.1| ada0
0 455 102 5538 11.8 354 18617 2.0 83.2| ada1
0 0 0 0 0.0 0 0 0.0 0.0| ada2
0 455 105 5207 4.0 351 18529 1.3 58.0| ada3
0 0 0 0 0.0 0 0 0.0 0.0| ada4
0 0 0 0 0.0 0 0 0.0 0.0| ada5
0 0 0 0 0.0 0 0 0.0 0.0| da0p1
0 0 0 0 0.0 0 0 0.0 0.0| da0p2
0 0 0 0 0.0 0 0 0.0 0.0| gptid/d43c3d98-8d7e-11e4-a3a8-a0b3cceb0c67
0 0 0 0 0.0 0 0 0.0 0.0| gptid/d4454cb9-8d7e-11e4-a3a8-a0b3cceb0c67
0 0 0 0 0.0 0 0 0.0 0.0| da1p1
0 0 0 0 0.0 0 0 0.0 0.0| da1p2
0 0 0 0 0.0 0 0 0.0 0.0| ada0p1
5 458 96 5199 14.3 363 18956 2.1 89.3| ada0p2
0 0 0 0 0.0 0 0 0.0 0.0| ada1p1
0 455 102 5538 11.8 354 18617 2.0 83.4| ada1p2
0 0 0 0 0.0 0 0 0.0 0.0| ada2p1
0 0 0 0 0.0 0 0 0.0 0.0| ada2p2
0 0 0 0 0.0 0 0 0.0 0.0| ada3p1
0 455 105 5207 4.0 351 18529 1.3 58.3| ada3p2
0 0 0 0 0.0 0 0 0.0 0.0| ada4p1
0 0 0 0 0.0 0 0 0.0 0.0| ada4p2
0 0 0 0 0.0 0 0 0.0 0.0| ada5p1
0 0 0 0 0.0 0 0 0.0 0.0| ada5p2
0 0 0 0 0.0 0 0 0.0 0.0| gptid/9a1159a4-b6b2-11e5-82a6-a0b3cceb0c67
0 0 0 0 0.0 0 0 0.0 0.0| gptid/9a2ce9cf-b6b2-11e5-82a6-a0b3cceb0c67
5 458 96 5199 14.3 363 18956 2.1 89.4| gptid/50b11673-fe09-11e3-bd33-a0b3cceb0c67
0 455 102 5538 11.8 354 18617 2.0 83.6| gptid/51964e48-fe09-11e3-bd33-a0b3cceb0c67
0 0 0 0 0.0 0 0 0.0 0.0| gptid/5224aada-fe09-11e3-bd33-a0b3cceb0c67
0 455 105 5207 4.0 351 18529 1.3 58.4| gptid/52be3684-fe09-11e3-bd33-a0b3cceb0c67
0 0 0 0 0.0 0 0 0.0 0.0| gptid/89c20a0a-471c-11e6-aee6-a0b3cceb0c67
0 0 0 0 0.0 0 0 0.0 0.0| gptid/89ce24b5-471c-11e6-aee6-a0b3cceb0c67
0 0 0 0 0.0 0 0 0.0 0.0| gptid/32a70474-e545-11e4-965c-a0b3cceb0c67
0 0 0 0 0.0 0 0 0.0 0.0| ada1p1.eli
0 0 0 0 0.0 0 0 0.0 0.0| ada2p1.eli
0 0 0 0 0.0 0 0 0.0 0.0| ada3p1.eli
0 0 0 0 0.0 0 0 0.0 0.0| ada5p1.eli
0 0 0 0 0.0 0 0 0.0 0.0| ada0p1.eli



The last action is to replace the disk.
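For completeness, a hedged sketch of that replacement step (on FreeNAS this would normally be done from the GUI via Volume Status -> Replace, which also handles partitioning; the shell sequence below is only the rough shape). The numeric guid to replace, 12312091975756753222, is the one shown for the offlined vdev in the zpool status output above, and can be scripted out of that output:

```shell
# Rough shell-side shape of the replacement (assumes a prepared
# replacement partition; the GUI does this for you):
#
#   zpool replace datastore 12312091975756753222 gptid/<new-partition-gptid>
#   zpool status datastore    # watch the resilver progress
#
# Pulling the offlined vdev's guid out of `zpool status` output,
# demonstrated on the captured status text:
find_offline_guid() { awk '$2 == "OFFLINE" {print $1}'; }

guid=$(printf '%s\n%s\n' \
  'gptid/51964e48-fe09-11e3-bd33-a0b3cceb0c67 ONLINE 0 0 0' \
  '12312091975756753222 OFFLINE 0 0 0' | find_offline_guid)
echo "$guid"
```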
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
LESSON 3: there's a reason NAS drives are recommended for any kind of software or hardware RAID. With TLER or the equivalent, a flaky drive won't make the whole system unresponsive.
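You can check whether a given drive offers TLER/ERC with smartctl, which queries (and on drives that allow it, sets) the SCT Error Recovery Control timers. Desktop drives like the WD Greens in this thread typically report the timers as disabled or reject the set entirely. A sketch, with the live commands in comments and the detection step demonstrated on sample query output:

```shell
# On the live system:
#
#   smartctl -l scterc /dev/ada0          # query read/write ERC timers
#   smartctl -l scterc,70,70 /dev/ada0    # attempt to set 7.0 s timeouts
#
# Detecting a disabled read timer in the query output (sample text):
erc_disabled() { grep -c 'Read: *Disabled'; }

n=$(printf 'SCT Error Recovery Control:\n           Read: Disabled\n' | erc_disabled)
echo "$n"   # count of disabled read-timer lines found
```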
 

Morbo

Cadet
Joined
Jun 1, 2014
Messages
8
A fair point indeed. For my situation I elected to save the money and go with the cheapest drives; however, if you are running a system where performance degradation is not tolerable, then better-quality drives are a must. I got a good ~2-3 years out of those drives, which was more than I expected, but I wouldn't be surprised if a few more start to fall away soon.
 

darkwarrior

Patron
Joined
Mar 29, 2015
Messages
336
Hi there,

Also, don't forget that this kind of thing would have happened on any operating system (Windows/Linux/etc.); ZFS is not at fault here ... ;)
I once had a flaky SSD, and another time a flaky HDD, and the system would simply lock up while waiting for file I/O from those drives, just waiting until the timeout was reached ... :rolleyes:
 
Joined
Dec 2, 2015
Messages
730
LESSON 3: there's a reason NAS drives are recommended for any kind of software or hardware RAID. With TLER or the equivalent, a flaky drive won't make the whole system unresponsive.
If the system had drives with TLER, e.g. WD Red, what would the likely symptoms have been if a drive was throwing errors when reading? I assume the user would get some sort of warning that a drive was sick, but I wonder what that error would be.
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
One possibility is that the read error count would have incremented, which would result in an email from FreeNAS and an entry in the drive's SMART error log.
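The places such errors surface are the SMART attribute table (`smartctl -A`) and the ATA error log (`smartctl -l error`). A small filter of my own for the attributes most worth alerting on, where the raw value is column 10 of the attribute table; demonstrated on a hypothetical sample row (the raw value 12 is made up for illustration):

```shell
# Flag SMART attributes that indicate media trouble when their raw
# value (field 10) is non-zero.
worrying() { awk '$2 ~ /Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable/ && $10+0 > 0 {print $2 "=" $10}'; }

w=$(printf '  5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 12\n' | worrying)
echo "$w"
```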
 