Errors on sectors of a drive, now unreadable and S.M.A.R.T. seems to say the drive is OK-ish.

Vrakfall · Jun 9, 2017

Hello !

So, a few days ago (let's not talk about the many months I shamefully spent ignoring the errors) I saw that one of my 3 main drives had stopped working completely. By that I mean there hasn't been a single read or write on it for a long time and the volume says it's `unavailable`. Fortunately, it's part of a ZFS RAID5 equivalent (RAIDz1), so its loss isn't a huge problem yet. By the way, my server is on `FreeNAS-9.10.2-U4 (27ae72978)`.

The first messages I encountered were these 2:

Code:

Device: /dev/ada1, 3 Currently unreadable (pending) sectors
The volume mainVolume (ZFS) state is DEGRADED: One or more devices could not be opened. Sufficient replicas exist for the pool to continue functioning in a degraded state.

Indeed, when I check the pool status, that specific drive is unavailable.

I then read a few things and tried to follow the instructions contained in this first link and also helped myself with what is contained in this second link.

I checked the quick SMART results but didn't see anything that looked interesting to me, so I then ran a long SMART test, which gave me the following output after a few hours:

Code:

~# smartctl -a /dev/ada1
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Western Digital Green
Device Model:	 WDC WD30EZRX-00D8PB0
Serial Number:	WD-WMC4N0694041
LU WWN Device Id: 5 0014ee 6ae8c1170
Firmware Version: 80.00A80
User Capacity:	3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	5400 rpm
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:	Fri Jun  9 16:54:07 2017 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:	  (   0)	The previous self-test routine completed
					without error or no self-test has ever
					been run.
Total time to complete Offline
data collection:		 (40020) seconds.
Offline data collection
capabilities:			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:			(0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:		(0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine
recommended polling time:	 (   2) minutes.
Extended self-test routine
recommended polling time:	 ( 401) minutes.
Conveyance self-test routine
recommended polling time:	 (   5) minutes.
SCT capabilities:			(0x7035)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x002f   200   200   051	Pre-fail  Always	   -	   18
  3 Spin_Up_Time			0x0027   212   174   021	Pre-fail  Always	   -	   4366
  4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   250
  5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x002e   200   200   000	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0032   064   064   000	Old_age   Always	   -	   26864
10 Spin_Retry_Count		0x0032   100   100   000	Old_age   Always	   -	   0
11 Calibration_Retry_Count 0x0032   100   100   000	Old_age   Always	   -	   0
12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   249
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   69
193 Load_Cycle_Count		0x0032   146   146   000	Old_age   Always	   -	   163776
194 Temperature_Celsius	 0x0022   117   099   000	Old_age   Always	   -	   33
196 Reallocated_Event_Count 0x0032   200   200   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   3
198 Offline_Uncorrectable   0x0030   200   200   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x0032   200   200   000	Old_age   Always	   -	   0
200 Multi_Zone_Error_Rate   0x0008   200   200   000	Old_age   Offline	  -	   0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed without error	   00%	 26825		 -
# 2  Extended offline	Completed without error	   00%	 25466		 -
# 3  Extended offline	Completed without error	   00%	 25458		 -
# 4  Extended offline	Completed without error	   00%	 25450		 -
# 5  Extended offline	Completed without error	   00%	 25442		 -
# 6  Extended offline	Completed without error	   00%	 25434		 -
# 7  Extended offline	Completed without error	   00%	 25426		 -
# 8  Extended offline	Completed without error	   00%	 25418		 -
# 9  Extended offline	Completed without error	   00%	 25410		 -
#10  Extended offline	Completed without error	   00%	 25402		 -
#11  Extended offline	Completed without error	   00%	 25395		 -
#12  Extended offline	Completed without error	   00%	 25378		 -
#13  Extended offline	Completed without error	   00%	 25370		 -
#14  Extended offline	Interrupted (host reset)	  70%	 25360		 -
#15  Extended offline	Completed without error	   00%	 25358		 -
#16  Extended offline	Completed without error	   00%	 25350		 -
#17  Extended offline	Completed without error	   00%	 25340		 -
#18  Extended offline	Completed without error	   00%	 25314		 -
#19  Extended offline	Completed without error	   00%	 25306		 -
#20  Extended offline	Completed without error	   00%	 25299		 -
#21  Extended offline	Completed without error	   00%	 25291		 -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Don't believe S.M.A.R.T. when it says it's on FreeNAS 10.3. I guess this build was aimed at v10, or maybe it's because I tested it at some point.

I might be misreading the output, but from what I understand there's no error in there, and therefore no sector number available to force-write. I don't know what this means. I know my drives are old (one is a bit newer because it showed sector errors a few months after I bought it, so I used the warranty to claim a new one) and they should fail soon enough, especially this one, even if it turns out to be fixable. I guess I'll buy a new one soon enough. But, from these outputs, do you think my drive is "full kaput" and I should replace it ASAP, or can I do something to condemn the faulty sectors and keep using it? Or else, is it just a ZFS error that could have been caused by a power failure at my house, so that I just need to fully wipe the drive and replace it with itself in the pool (and let it all be recreated)? I haven't tried that last solution yet as I wanted some advice first.

Also, I ran a zpool scrub just to try but nothing really new came out of it. Here's the output:

Code:

~# zpool status -v mainVolume
  pool: mainVolume
state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
	the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: scrub repaired 0 in 23h32m with 0 errors on Fri Jun  9 14:05:02 2017
config:

	NAME												STATE	 READ WRITE CKSUM
	mainVolume										  DEGRADED	 0	 0	 0
	  raidz1-0										  DEGRADED	 0	 0	 0
		4384224381238940287							 UNAVAIL	  0	 0	 0  was /dev/gptid/########.eli
		gptid/########.eli  ONLINE	   0	 0	 0
		gptid/########.eli  ONLINE	   0	 0	 0

errors: No known data errors

(I hid the gpt Ids over there.)

So, what do you think about this?
Thank you in advance!

P.S.: Don't hesitate to be technical with me.
P.S.2: I hid the gpt Ids because I wasn't sure if it's safe to let them leak publicly. Is it? And is it needed for debugging? I don't think so but I can always be wrong!

Edit:
Here's the full spec of that machine, as asked. It's an old workstation I recycled by only putting new drives in. It's a very cheap setup as I only had a small budget (which hasn't changed over time...) but I'm very happy with the results so far, as it only cost me 3 disks and a bit of RAM.

- FreeNAS 9.10.2-U4
- Re-purposed Dell Precision 380 Workstation (PWS380):
- Motherboard: CN-0CJ774-70821-65SI0XE
- Socket: LGA775
- Chipset: Intel 955X
- CPU: Intel Pentium 4 HT 650 @3.4GHz (1 core) (Found by semi-guess, see the following cpuid)

Code:

~# cpuid
eax in	eax	  ebx	  ecx	  edx
00000000 00000005 756e6547 6c65746e 49656e69
00000001 00000f4a 01020800 0000649d bfebfbff
00000002 605b5001 00000000 00000000 007d7040
00000003 00000000 00000000 00000000 00000000
00000004 00004121 01c0003f 0000001f 00000000
00000005 00000040 00000040 00000000 00000000
80000000 80000008 00000000 00000000 00000000
80000001 00000000 00000000 00000001 20100800
80000002 20202020 20202020 20202020 6e492020
80000003 286c6574 50202952 69746e65 52286d75
80000004 20342029 20555043 30342e33 007a4847
80000005 00000000 00000000 00000000 00000000
80000006 00000000 00000000 08006040 00000000
80000007 00000000 00000000 00000000 00000000
80000008 00003024 00000000 00000000 00000000

Vendor ID: "GenuineIntel"; CPUID level 5

Intel-specific functions:
Version 00000f4a:
Type 0 - Original OEM
Family 15 - Pentium 4
Model 4 - Intel Pentium 4 processor (generic) or newer
Stepping 10
Reserved 0

Extended brand string: "			  Intel(R) Pentium(R) 4 CPU 3.40GHz"
CLFLUSH instruction cache line size: 8
Initial APIC ID: 1
Hyper threading siblings: 2

Feature flags set 1 (CPUID.01H:EDX): bfebfbff:
FPU	Floating Point Unit
VME	Virtual 8086 Mode Enhancements
DE	 Debugging Extensions
PSE	Page Size Extensions
TSC	Time Stamp Counter
MSR	Model Specific Registers
PAE	Physical Address Extension
MCE	Machine Check Exception
CX8	COMPXCHG8B Instruction
APIC   On-chip Advanced Programmable Interrupt Controller present and enabled
SEP	Fast System Call
MTRR   Memory Type Range Registers
PGE	PTE Global Flag
MCA	Machine Check Architecture
CMOV   Conditional Move and Compare Instructions
FGPAT  Page Attribute Table
PSE-36 36-bit Page Size Extension
CLFSH  CFLUSH instruction
DS	 Debug store
ACPI   Thermal Monitor and Clock Ctrl
MMX	MMX instruction set
FXSR   Fast FP/MMX Streaming SIMD Extensions save/restore
SSE	Streaming SIMD Extensions instruction set
SSE2   SSE2 extensions
SS	 Self Snoop
HT	 Hyper Threading
TM	 Thermal monitor
31	 Pending Break Enable

Feature flags set 2 (CPUID.01H:ECX): 0000649d:
SSE3	 SSE3 extensions
DTES64   64-bit debug store
MONITOR  MONITOR/MWAIT instructions
DS-CPL   CPL Qualified Debug Store
EST	  Enhanced Intel SpeedStep Technology
CNXT-ID  L1 Context ID
CX16	 CMPXCHG16B
xTPR	 Send Task Priority messages

Extended feature flags set 1 (CPUID.80000001H:EDX): 20100800
SYSCALL   SYSCALL/SYSRET instructions
XD-bit	Execution Disable bit
EM64T	 Intel Extended Memory 64 Technology

Extended feature flags set 2 (CPUID.80000001H:ECX): 00000001
LAHF	  LAHF/SAHF available in IA-32e mode

Old-styled TLB and cache info:
50: Instruction TLB: 4KB, 2MB or 4MB pages, fully assoc., 64 entries
5b: Data TLB: 4KB or 4MB pages, fully assoc., 64 entries
60: 1st-level data cache: 16-KB, 8-way set associative, sectored cache, 64-byte line size
40: No 2nd-level cache, or if 2nd-level cache exists, no 3rd-level cache
70: Trace cache: 12K-micro-op, 8-way set assoc
7d: 2nd-level cache: 2-MB, 8-way set associative, 64-byte line size

Processor serial: 0000-0F4A-0000-0000-0000-0000

Deterministic Cache Parameters:
index=0: eax=00004121 ebx=01c0003f ecx=0000001f edx=00000000
> Data cache, level 1, self initializing
> 32 sets, 8 ways, 1 partitions, line size 64
> full size 16384 bytes
> shared between up to 2 threads
index=1: eax=00004143 ebx=01c0103f ecx=000007ff edx=00000000
> Unified cache, level 2, self initializing
> 2048 sets, 8 ways, 2 partitions, line size 64
> full size 2097152 bytes
> shared between up to 2 threads

- RAM: 2x512MB + 2x2048MB (Both DDR2 533MHz, Dual-channeled)

Code:

~# dmidecode --type memory
# dmidecode 3.0
Scanning /dev/mem for entry point.
SMBIOS 2.3 present.

Handle 0x1000, DMI type 16, 15 bytes
Physical Memory Array
	Location: System Board Or Motherboard
	Use: System Memory
	Error Correction Type: Single-bit ECC
	Maximum Capacity: 4 GB
	Error Information Handle: No Error
	Number Of Devices: 4

Handle 0x1100, DMI type 17, 27 bytes
Memory Device
	Array Handle: 0x1000
	Error Information Handle: No Error
	Total Width: 64 bits
	Data Width: 64 bits
	Size: 512 MB
	Form Factor: DIMM
	Set: None
	Locator: DIMM_1
	Bank Locator: Not Specified
	Type: DDR
	Type Detail: Synchronous
	Speed: 533 MHz
	Manufacturer: CE00000000000000
	Serial Number: F8165BCF
	Asset Tag: Not Specified
	Part Number: M3 78T6553CZ3-CD5

Handle 0x1101, DMI type 17, 27 bytes
Memory Device
	Array Handle: 0x1000
	Error Information Handle: No Error
	Total Width: 64 bits
	Data Width: 64 bits
	Size: 2048 MB
	Form Factor: DIMM
	Set: None
	Locator: DIMM_3
	Bank Locator: Not Specified
	Type: DDR
	Type Detail: Synchronous
	Speed: 533 MHz
	Manufacturer: 0000000000000000
	Serial Number: 00000006
	Asset Tag: Not Specified
	Part Number: V01D2LF2GB18818867

Handle 0x1102, DMI type 17, 27 bytes
Memory Device
	Array Handle: 0x1000
	Error Information Handle: No Error
	Total Width: 64 bits
	Data Width: 64 bits
	Size: 512 MB
	Form Factor: DIMM
	Set: None
	Locator: DIMM_2
	Bank Locator: Not Specified
	Type: DDR
	Type Detail: Synchronous
	Speed: 533 MHz
	Manufacturer: CE00000000000000
	Serial Number: F8165BD6
	Asset Tag: Not Specified
	Part Number: M3 78T6553CZ3-CD5

Handle 0x1103, DMI type 17, 27 bytes
Memory Device
	Array Handle: 0x1000
	Error Information Handle: No Error
	Total Width: 64 bits
	Data Width: 64 bits
	Size: 2048 MB
	Form Factor: DIMM
	Set: None
	Locator: DIMM_4
	Bank Locator: Not Specified
	Type: DDR
	Type Detail: Synchronous
	Speed: 533 MHz
	Manufacturer: 0000000000000000
	Serial Number: 00000006
	Asset Tag: Not Specified
	Part Number: V01D2LF2GB18818867

- Ethernet controller: Qualcomm Atheros Killer E220x Gigabit Ethernet Controller (rev 13)
- SATA controller: Intel Corporation 8 Series/C220 Series Chipset Family 6-port SATA Controller 1 [AHCI mode] (rev 05)
- Old disks got removed and I put these instead:
- mainVolume (RAIDz1): 3x3TB Western Digital WDC WD30EZRX-00D8PB0 (ATA/ATAPI-9 SATA 3.x)
- downloadVolume (Temporary volume I don't care losing): 1x1TB Seagate ST31000340AS (ATA/ATAPI-8 SATA 2.x)
- UPS: Eaton Protection Station 500 (2.2V 250W/500VA), no USB feature. I'll change it soon enough, since it sometimes doesn't even hold during simple power failures and it has no USB feature...

danb35 · Jun 9, 2017

Vrakfall said:
Don't believe S.M.A.R.T. when it says it's on FreeNAS 10.3.

It doesn't say that; it says it's on FreeBSD 10.3, which it is--that's the underlying operating system for FreeNAS 9.10.

Given that you have an encrypted pool consisting of 3 TB disks in RAIDZ1, you should be concerned, as there are a lot of ways for things to go badly for you. Hopefully you have a good backup of your encryption key and your recovery key (these aren't saved in the config file).

You're running long SMART tests way too often; looks like you're running them every day or even more often. Once a week is plenty, and there are some who would suggest that even that often is too much (though that's what I use).
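
To put a rough number on that, the LifeTime(hours) column of the self-test log can be differenced. Below is a minimal Python sketch; the hours are copied from the log in the first post, and the helper name `gaps_between_tests` is made up for illustration.

```python
from statistics import median

# LifeTime(hours) column of the self-test log posted above, newest test first.
lifetime_hours = [
    26825, 25466, 25458, 25450, 25442, 25434, 25426, 25418, 25410, 25402,
    25395, 25378, 25370, 25360, 25358, 25350, 25340, 25314, 25306, 25299,
    25291,
]


def gaps_between_tests(hours_newest_first):
    """Return the gap, in power-on hours, between each test and the one before it."""
    return [
        newer - older
        for newer, older in zip(hours_newest_first, hours_newest_first[1:])
    ]


gaps = gaps_between_tests(lifetime_hours)
typical_gap = median(gaps)

print(f"gaps between tests (power-on hours): {gaps}")
print(f"typical gap: {typical_gap} power-on hours")
```

Most of the gaps come out at about 8 power-on hours, with one long pause before the most recent test. Since the drive reports a recommended polling time of 401 minutes for an extended test, that schedule would have kept it testing for most of the time it was powered on.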

But with all of that, I'm not seeing anything that sounds like it should have kicked your disk offline. There are a few pending sectors, which isn't a good thing, but it also isn't a "your disk has one foot in the grave" bad thing.
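
For anyone who wants to pick those counters out of a long `smartctl -a` dump, here is a small Python sketch. It only parses text that was already captured; the rows are copied from the output in the first post, and the choice of attributes 5, 197 and 198 as the ones to watch is my assumption of a commonly checked subset, not an official list. The function name is hypothetical.

```python
# Rows copied from the smartctl -a output above (whitespace normalised).
SMART_ROWS = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140   Pre-fail  Always   -   0
197 Current_Pending_Sector  0x0032   200   200   000   Old_age   Always   -   3
198 Offline_Uncorrectable   0x0030   200   200   000   Old_age   Offline  -   0
"""

# Assumption: the sector-related attribute IDs worth flagging when non-zero.
WATCHED_IDS = {5, 197, 198}


def nonzero_sector_counters(table_text):
    """Return {attribute_name: raw_value} for watched attributes with a raw value > 0."""
    found = {}
    for line in table_text.splitlines():
        fields = line.split()
        # A data row has 10 columns: ID, name, flag, value, worst, thresh,
        # type, updated, when_failed, raw_value.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[9]
        if attr_id in WATCHED_IDS and raw.isdigit() and int(raw) > 0:
            found[name] = int(raw)
    return found


flagged = nonzero_sector_counters(SMART_ROWS)
print(flagged)
```

On the posted output this flags only `Current_Pending_Sector` with a raw value of 3, which matches the alert message quoted at the top of the thread.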

Honestly, at this point, I'd try to just online the disk again. Try these:

zpool online mainVolume 4384224381238940287
If that doesn't work, zpool online mainVolume gptid/####.eli (where the number is the disk that's offline--and no, there's no reason to mask those)
If neither of those work, try rebooting the server and see if FreeNAS will pick up the disk at that time.

Vrakfall · Jun 9, 2017

danb35 said:
It doesn't say that; it says it's on FreeBSD 10.3, which it is--that's the underlying operating system for FreeNAS 9.10.

Well, you caught me reading too quickly over here. ^^

danb35 said:
You're running long SMART tests way too often

Well, I think I'm only using default FreeNAS values and didn't change a thing about that. I'll look into it, thank you. ;)

Yes, I do have both my keys and recovery keys stored in multiple "other" locations.

Both `zpool online` commands output this error:

Code:

~# zpool online mainVolume 4384224381238940287
warning: device '4384224381238940287' onlined, but remains in faulted state
use 'zpool replace' to replace devices that are no longer present

I forgot to mention I already tried that but I did it again to make sure.

I also already restarted a few days ago and I went back to the same state. I won't do it right now as I'm running out of time but I can do it again later (though I don't think it'll help).

danb35 said:
and no, there's no reason to mask those

Thank you for the info. :)

danb35 · Jun 9, 2017

After doing the online, does the output of zpool status change?
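
If it helps, a quick way to check is to pull the STATE column out of the `config:` section before and after the attempt and compare. A minimal Python sketch, assuming the section has been saved as text (the sample below is the mainVolume layout as posted in this thread, and the helper names are made up):

```python
# The "config:" table of mainVolume as posted in this thread.
STATUS_CONFIG = """\
    NAME                                              STATE     READ WRITE CKSUM
    mainVolume                                        DEGRADED     0     0     0
      raidz1-0                                        DEGRADED     0     0     0
        4384224381238940287                           UNAVAIL      0     0     0  was /dev/gptid/e2f20c74-4f13-11e3-be76-001372239d6b.eli
        gptid/e3e03a3f-4f13-11e3-be76-001372239d6b.eli  ONLINE       0     0     0
        gptid/4314884f-baae-11e3-b761-001372239d6b.eli  ONLINE       0     0     0
"""


def device_states(config_text):
    """Map each pool, vdev or disk name in a config table to its STATE column."""
    states = {}
    for line in config_text.splitlines():
        fields = line.split()
        # Data rows have at least NAME, STATE, READ, WRITE, CKSUM.
        if len(fields) >= 5 and fields[0] != "NAME":
            states[fields[0]] = fields[1]
    return states


def not_online(config_text):
    """Return only the entries whose state is something other than ONLINE."""
    return {
        name: state
        for name, state in device_states(config_text).items()
        if state != "ONLINE"
    }


unhealthy = not_online(STATUS_CONFIG)
print(unhealthy)
```

Running `device_states` on the text captured before and after `zpool online` and comparing the two dictionaries shows at a glance whether anything moved.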

Vrakfall · Jun 9, 2017

Not even a zilch

Code:

~# zpool online mainVolume 4384224381238940287
warning: device '4384224381238940287' onlined, but remains in faulted state
use 'zpool replace' to replace devices that are no longer present
~# zpool status
  pool: downloadVolume
 state: ONLINE
  scan: scrub repaired 0 in 8h52m with 0 errors on Sun Apr 30 08:52:55 2017
config:

	NAME											  STATE	 READ WRITE CKSUM
	downloadVolume									ONLINE	   0	 0	 0
	  gptid/3981d312-a3be-11e3-a9bc-001372239d6b.eli  ONLINE	   0	 0	 0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0h45m with 0 errors on Sat May 20 04:30:12 2017
config:

	NAME										  STATE	 READ WRITE CKSUM
	freenas-boot								  ONLINE	   0	 0	 0
	  gptid/27fd386c-2db5-11e6-aec8-001372239d6b  ONLINE	   0	 0	 0

errors: No known data errors

  pool: mainVolume
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
	the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: scrub repaired 0 in 23h32m with 0 errors on Fri Jun  9 14:05:02 2017
config:

	NAME												STATE	 READ WRITE CKSUM
	mainVolume										  DEGRADED	 0	 0	 0
	  raidz1-0										  DEGRADED	 0	 0	 0
		4384224381238940287							 UNAVAIL	  0	 0	 0  was /dev/gptid/e2f20c74-4f13-11e3-be76-001372239d6b.eli
		gptid/e3e03a3f-4f13-11e3-be76-001372239d6b.eli  ONLINE	   0	 0	 0
		gptid/4314884f-baae-11e3-b761-001372239d6b.eli  ONLINE	   0	 0	 0

errors: No known data errors

danb35 · Jun 9, 2017

A reboot shouldn't harm anything (other than being inconvenient), but I wouldn't be too hopeful. It's possible that something with the encryption code has caused it to drop offline, rather than the disk itself, but I've not worked with the encryption system before.

Vrakfall · Jun 9, 2017

OK, thank you. I'll try it on Sunday; I need to go now and my Plex is currently in use.

If the reboot doesn't help (and I bet it won't), would it be time for a full replace with the same disk (which would trigger a full rewrite on it)?

danb35 · Jun 9, 2017

A full replace should do the trick. But hopefully before then, someone will chime in with some better troubleshooting for the encryption piece.

scrappy · Jun 9, 2017

Somebody correct me if I'm wrong here, but wouldn't you have to issue zpool clear <pool name> before ZFS will attempt to online a disk it previously deemed faulted?

SweetAndLow · Jun 9, 2017

Just replace the disk and be done with it. It's had some read errors and probably not stable.

You also need to include your hardware specs in your post, as it's required for this forum.

SMART tests are not set by default with FreeNAS; you had to configure them.


danb35 · Jun 9, 2017

SweetAndLow said:
default

I don't know what (if anything) is the default for 9.10, but 11-RC3 does have a default SMART test schedule. It's a silly one--short tests every hour, all day on Sunday, but nothing else.

Vrakfall · Jun 13, 2017

SweetAndLow said:
Just replace the disk and be done with it. It's had some read errors and probably not stable.

I was about to go for something like that, or just buy another 3TB disk and go from RAIDz1 to RAIDz2. I don't think it'd hurt even if one of them is about to miserably fail.
Also, could the read errors have been caused by a (long) power failure? My UPS doesn't have any USB feature and I'm often not near the server when one happens. In that case, I guess the errors might not be related to the drive itself?

SweetAndLow said:
You also need to include your hardware specs in your post, as it's required for this forum.

I changed my first post accordingly, but I found no rule/example anywhere so I did what I thought was good.

SweetAndLow said:
SMART tests are not set by default with FreeNAS; you had to configure them.

Well, it's not "activated" by default, but after activating the service I don't think I changed anything, as my past and less experienced self thought the default settings were OK (and weren't generating hours-long tests every time). Look at this nasty 30-minute one in the tooltip:

[Image: screenshot of the tooltip]

So now, I restarted after installing the new update and, as expected, the drive didn't come back to normal.

scrappy said:
Somebody correct me if I'm wrong here, but wouldn't you have to issue zpool clear <pool name> before ZFS will attempt to online a disk it previously deemed faulted?

Can someone confirm this? Is this the normal way of managing a disk: just remove the errors and pretend they didn't happen? I guess I can also pay more attention in the future to other signs of failure from it.
Still nothing has to be done with the encryption of the disk?

SweetAndLow · Jun 13, 2017

Vrakfall said:
Can someone confirm this? Is this the normal way in managing my disk, just remove the errors and pretend they didn't happen? I guess I can also pay more attention in the future in other signs of failure from it.
Still nothing has to be done with the encryption of the disk?

You do not just clear your errors and move on, but I think that is what you want us to tell you to do. So you can do it, but the errors will just come back and you will ask again and get told to replace it.

You have got to be joking, your system only has 1GB of memory? Dude, the minimum requirement is 8GB. Your system doesn't even work, no matter how much you tell me it does. The memory thing probably isn't related to your disks failing, but it will cause you to lose all your data one day.

It's also funny how people who avoid the rule of posting hardware specs are always the ones with unusable hardware.


danb35 · Jun 13, 2017

SweetAndLow said:
You have got to be joking, your system only has 1GB of memory?

No, 5 GB. 2 x 512MB + 2 x 2048MB. Still not close to the minimum requirements, but not as bad as you thought.
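
The total can be checked directly against the `Size:` lines of the `dmidecode --type memory` output in the first post. A small Python sketch (the function name is made up, and it assumes every populated slot reports its size in MB, as in that output):

```python
import re

# "Size:" lines copied from the dmidecode --type memory output in the first post.
DMIDECODE_SIZES = """\
Size: 512 MB
Size: 2048 MB
Size: 512 MB
Size: 2048 MB
"""


def total_memory_mb(dmidecode_text):
    """Sum every 'Size: <n> MB' line found in dmidecode memory output."""
    sizes = re.findall(
        r"^\s*Size:\s*(\d+)\s*MB\s*$", dmidecode_text, flags=re.MULTILINE
    )
    return sum(int(mb) for mb in sizes)


total_mb = total_memory_mb(DMIDECODE_SIZES)
print(f"{total_mb} MB = {total_mb / 1024:g} GB")
```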

Vrakfall · Jun 13, 2017

SweetAndLow said:
You do not just clear your errors and move on, but I think that is what you want us to tell you to do. So you can do it, but the errors will just come back and you will ask again and get told to replace it.

I don't want to be told anything in particular. I'm here to ask for advice and act accordingly, especially since I said I'm considering buying a new drive! I'm just considering making it usable again in a RAIDz2 instead of the current RAIDz1.
I'm here to ask experts' advice, isn't that already admitting I don't know?

SweetAndLow said:
You have got to be joking, your system only has 1GB of memory? Dude, the minimum requirement is 8GB. Your system doesn't even work, no matter how much you tell me it does. The memory thing probably isn't related to your disks failing, but it will cause you to lose all your data one day.

5GB! I added 2x2048MB to the original 2x512MB.
The following lines are under a spoiler as they're not about solving my issue but more about trying to explain why my system is so "bad".

Though I could afford another set of 2x2048MB nowadays, I wasn't able to afford more at the time I built this system. Also, I built it at the time of FreeNAS 8, which required 6GB (if my memory is right). It was close enough and I thought I'd expand it later, which I'm considering right now. This NAS hasn't been a priority for a few years as I had bigger issues to deal with.
I think I'll look for some more RAM soon (even ECC) but the drive comes first.
Also, this system works. I'm able to reach 27~30 MB/s read OR write (not simultaneously) over Gb/s LAN while using RAID + encryption + compression. There's a Plex server running on it, which works on my TV and other devices as long as I don't use transcoding and don't copy at full speed at the same time. I've also got a git server running, a mail server, Transmission and other trivial things. Until now, I didn't really feel I "needed" more.
I built this system because I thought having all my data stored on the hard drive of my desktop PC without RAID wasn't a good idea, and to have a better backup solution. I thought this was better than nothing.
It was also quite a challenge for me to build something "this good" with so little.

I don't think I should/need to justify that much but the following made me do it:

SweetAndLow said:
It's also funny how people who avoid the rule of posting hardware specs are always the ones with unusable hardware.

I sense a hint of judgement here. Sorry, I'm not a rich guy.
I'm sorry about the rules; I did not "try to avoid them". I might have read them years ago (I'm not even sure), then forgot about them and didn't think of reading them again before posting. It looks like my common sense wasn't enough this time. I then corrected my mistake, and even then I didn't find the rules. It's only today that I found them, right under my eyes since the beginning, silly me.
Even so, now that I've read the rules again, I feel you're at the limit of what they call "derogatory comments". I feel this contributes to people being afraid to ask for help. I will move on and act as if I didn't read that.

Also, I updated the main post and added my network and SATA controllers, as asked in the rules.

scrappy · Jun 13, 2017

Vrakfall said:
scrappy said:
Somebody correct me if I'm wrong here, but wouldn't you have to issue zpool clear <pool name> before ZFS will attempt to online a disk it previously deemed faulted?

Can someone confirm this? Is this the normal way of managing a disk: just remove the errors and pretend they didn't happen? I guess I can also pay more attention in the future to other signs of failure from it.
Still nothing has to be done with the encryption of the disk?

Bringing a faulted disk back into your pool is definitely NOT the normal way. Using the zpool clear command is only going to reset the disk's error history in ZFS, but if your faulted disk is throwing out errors left and right, ZFS will soon offline your drive again.

danb35 · Jun 13, 2017

Very old, very un-suitable hardware, and an encrypted pool are a very dangerous combination.

Vrakfall · Jun 13, 2017

scrappy said:
Bringing a faulted disk back into your pool is definitely NOT the normal way. Using the zpool clear command is only going to reset the disk's error history in ZFS, but if your faulted disk is throwing out errors left and right, ZFS will soon offline your drive again.

Thank you for the enlightenment! Now, I guess clearing it AND adding a disk to the pool to make it a RAIDz2 is possible, couldn't harm and is ok "SPOF-wise"?
Am I mistaken, or is RAIDz2 2 drives + 2 parity drives?

danb35 said:
Very old, very un-suitable hardware, and an encrypted pool are a very dangerous combination.

I understand that. What do you suggest, unencrypt everything? There are things I don't like to leave unencrypted, if you know what I mean. I guess I could use the 2 types of datasets.

danb35 · Jun 13, 2017

Vrakfall said:
I guess clearing it AND adding a disk to the pool to make it a RAIDz2 is possible, couldn't harm and is ok "SPOF-wise"?

Not really. You can't turn a RAIDZ1 vdev into a RAIDZ2 vdev; you'd have to back up your data, destroy the pool, re-create it as RAIDZ2, then restore your data.
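
For a rough feel of what each layout gives, here is a small Python sketch. It assumes the usual rule of thumb that a RAIDZ vdev with p levels of parity keeps roughly (disks - p) disks' worth of data and survives p disk failures; it ignores metadata, padding and TB/TiB differences, and `raidz_layout` is just a name made up for this example. Note that parity is spread across all the disks, so there are no dedicated "parity drives".

```python
def raidz_layout(disks, parity, disk_tb):
    """Rule-of-thumb figures for a single RAIDZ vdev (an approximation, not exact)."""
    if parity not in (1, 2, 3):
        raise ValueError("RAIDZ parity level must be 1, 2 or 3")
    if disks <= parity:
        raise ValueError("a RAIDZ vdev needs more disks than parity levels")
    data_disks = disks - parity
    return {
        "data_disks_equivalent": data_disks,
        "tolerated_failures": parity,
        "approx_usable_tb": data_disks * disk_tb,
    }


# The pool in this thread: 3 x 3 TB in RAIDZ1.
current = raidz_layout(disks=3, parity=1, disk_tb=3.0)
# The layout being considered: 4 x 3 TB in RAIDZ2 (needs the pool to be re-created).
proposed = raidz_layout(disks=4, parity=2, disk_tb=3.0)

print("RAIDZ1, 3 disks:", current)
print("RAIDZ2, 4 disks:", proposed)
```

Under those assumptions both layouts hold about 6 TB, so the fourth disk would buy a second disk of fault tolerance rather than extra space.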

Vrakfall said:
I guess I could use the 2 types of datasets.

FreeNAS doesn't currently support encryption per-dataset (though I recall seeing that this is on the roadmap for OpenZFS, and if it's implemented there, it should find its way into FreeNAS). A better (i.e., safer) option would probably be some form of client-side encryption software like TrueCrypt/VeraCrypt.

Honestly, even if your hardware were up to scratch, I'd probably recommend the same thing. FreeNAS's encryption system seems to be pretty fragile (though I base that only on what I've seen here, not in my personal experience), such that it's really only recommended if you have a legal or regulatory requirement to have full-disk encryption.

As to that hardware... In the US, you can get a Proliant ML10 for $200. Add at least 4 GB of RAM (to the existing 4 GB, to make 8 GB, though I'd recommend adding at least 8 GB) and your drives, and you have suitable, modern, server-grade hardware for a very low price. I don't think that has anything to do with what's going on here, but something to consider.

SweetAndLow · Jun 13, 2017

Vrakfall said:
Thank you for the enlightenment! Now, I guess clearing it AND adding a disk to the pool to make it a RAIDz2 is possible, couldn't harm and is ok "SPOF-wise"?
Am I mistaken, or is RAIDz2 2 drives + 2 parity drives?

I understand that. What do you suggest, unencrypt everything? There are things I don't like to leave unencrypted, if you know what I mean. I guess I could use the 2 types of datasets.

You can't change to a RAIDZ2 pool without destroying everything and rebuilding. The same thing goes for encryption. Encryption is fine; it's just that most people lose the key and can't unlock their data. So don't lose the key!

