One or more devices has experienced an unrecoverable error-Vault

TiklMyN1ps · Aug 2, 2017

Hello,

I am new to FreeNAS and have done some homework on hardware, volume management, etc. However, I do not know the FreeNAS-specific errors or how to troubleshoot them.

I APOLOGIZE IF THIS IS A REPEAT TOPIC, BUT I WAS JUST HOPING SOMEONE CAN ASSIST ME. I AM NOT IGNORANT, JUST INEXPERIENCED.

Ok, so a little bit of back story.

I currently run 2 servers.

1. Atlas - main server running Windows Server 2012, dual X5650 Xeon processors, 48GB DDR3 ECC RAM, 5 x 4TB HDDs in RAID 6 using an LSI 9211-8i, dual 750W PSUs supported by an APC UPS 1000VA.
2. Vault - backup server running FreeNAS 11, installed on a SanDisk 8GB flash drive in a Lenovo ThinkServer TS430. Assembled 7/29/2017.
Specs:
Intel Xeon E3-1220 v2 4-Core 3.1GHz Processor
8GB 2Rx8 PC3-10600E ECC Memory (8GB additional ordered and delivered 8/3, planning on install)
2 x Lenovo 450W Power Supplies (hot swappable)
1 x SAS 9240-8i RAID card flashed to IT mode
4 x 3TB WL 7200 rpm SATA HDDs

Goal of "Vault"

The goal of this server is to back up all the data from my main server, "Atlas". The data on both contains media (replaceable), some VMs (semi-crucial), and other non-critical data. Currently Atlas runs a nightly "sync" of all data that I've mapped to Vault. It has run 3 syncs, with 7/31 being the only one where it could not perform the task (I assume due to the same error that FreeNAS is reporting). Other than that, 10GB of files have been written to Vault since the error, all of which were tested and seem to be functioning normally.

I bought the TS430 used; however, it is still under a 90-day warranty per the seller.

After I set up FreeNAS, created a ZFS volume, and connected Atlas to its shared volume, I was off to the races.

7/29 - I transferred 5TB of data onto Vault with zero errors.
7/31- I received the following error:

Now I've googled this, and from my initial investigation it seems that when others receive this error it is typically a hardware failure.

At this moment the server is still on with the error being present. I have not reset the server yet, and wanted to keep the error up for troubleshooting purposes.

I've seen other threads where people are warning that all the data is corrupted, and that the system will have to be backed up and wiped, possibly with a memtest. To me this won't be a big issue (more time-consuming, to be honest), since this is only a backup of my main server.
Now, I say this lightly; however, as with all major servers, a corrupted or down backup server is essentially useless.

I am very new to FreeNAS, and I am sorry if I have missed some fundamental information. However, I am not really looking for condescending comments such as "learn the OS before posting issues", just someone helpful enough to provide some troubleshooting for a first-time newbie :)

Thanks in advance!

gpsguy · Aug 2, 2017

Please run a zpool status from the command line and post the results in code tags.

Did you burn in your disks first? Have you run SMART tests on the drives? I'd be a little concerned about them being "white label" drives. We've had some users get burned by using them. I can't remember the details right now.

BTW, you shouldn't receive any replies like this. We sincerely want to help you.

TiklMyN1ps said:
I am not really looking for condescending comments such as "learn the OS before posting issues"

gpsguy · Aug 2, 2017

@danb35 is the user I was thinking about, though there have been others. Perhaps he'll chime in on the thread.

gpsguy said:
I can't remember the details right now.

danb35 · Aug 2, 2017

Take a look here: https://forums.freenas.org/index.php?threads/smart-results-are-these-drives-bad.43457/#post-286353

Though other users (@Sakuru, I think) have had better luck.

danb35 · Aug 2, 2017

TiklMyN1ps said:
I've seen other threads where people are warning that all the data is corrupted,

This should not be the case; the error message itself says that applications are unaffected (i.e., your data's fine). But the output of zpool status would help address further.

TiklMyN1ps · Aug 3, 2017

Thanks for the help with the initial steps, guys! I really appreciate the feedback so far!

I will work on running zpool status now. And I will reply with the status shortly.

@gpsguy No, I did not burn in my disks. I am not really sure of that process, and I have not set up SMART yet either.

If these are necessary steps I will be more than willing to wipe the server and start fresh :)

More to come.

danb35 · Aug 3, 2017

TiklMyN1ps said:
If these are necessary steps I will be more than willing to wipe the server and start fresh

I don't know that they'd warrant wiping the server, but the burn-in testing is intended to catch "infant mortality" before you've committed data to the system. SMART tests are part of routine system maintenance and should be run on a regular schedule (I like short tests daily and long tests weekly, but opinions vary on that subject), but can be set up at any time.

TiklMyN1ps · Aug 3, 2017

Hi all,
Here is what I got when running zpool status:

  pool: Vault
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: none requested
config:

        NAME                                            STATE     READ WRITE CKSUM
        Vault                                           ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/54a3541e-74b7-11e7-a6af-94de808e85cb  ONLINE       1     0     0
            gptid/55459f80-74b7-11e7-a6af-94de808e85cb  ONLINE       0     0     0
            gptid/55e6540b-74b7-11e7-a6af-94de808e85cb  ONLINE       0     0     0
            gptid/5679eb01-74b7-11e7-a6af-94de808e85cb  ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          da4p2       ONLINE       0     0     0

errors: No known data errors

danb35 · Aug 3, 2017

TiklMyN1ps said:
Here is what I got when running zpool status:

A single read error is suspicious, but not critical--and as the message says, your data is unharmed. Try running a scrub on your pool-- zpool scrub Vault. It will take some time to complete. Once it's done, try zpool status again.

Can you also post the output of camcontrol devlist?
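
For anyone following along, the check described above (look at the READ/WRITE/CKSUM columns, scrub, then look again) can be automated. What follows is only a minimal sketch in Python, assuming you have saved the text of zpool status to a string. The function name devices_with_errors and the SAMPLE_STATUS constant are made up for illustration and are not part of FreeNAS or ZFS.

```python
import re

# Matches a zpool error counter such as "0", "26" or an abbreviated "1.2K".
COUNTER = re.compile(r"^\d+(\.\d+)?[KMGTP]?$")


def devices_with_errors(zpool_status_text):
    """Return (name, read, write, cksum) for each config row with a non-zero counter."""
    flagged = []
    in_config = False
    for line in zpool_status_text.splitlines():
        fields = line.split()
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_config = True   # column header: device rows follow
            continue
        if line.startswith("errors:"):
            in_config = False  # end of this pool's config table
            continue
        if not in_config or len(fields) < 5:
            continue
        name, _state, read, write, cksum = fields[:5]
        counters = (read, write, cksum)
        if all(COUNTER.match(c) for c in counters) and any(c != "0" for c in counters):
            flagged.append((name, read, write, cksum))
    return flagged


# Shortened copy of the status output posted earlier in this thread.
SAMPLE_STATUS = """\
  pool: Vault
 state: ONLINE
config:

        NAME                                            STATE     READ WRITE CKSUM
        Vault                                           ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/54a3541e-74b7-11e7-a6af-94de808e85cb  ONLINE       1     0     0
            gptid/55459f80-74b7-11e7-a6af-94de808e85cb  ONLINE       0     0     0

errors: No known data errors
"""

for row in devices_with_errors(SAMPLE_STATUS):
    print(row)
```

Counters are kept as strings because zpool status may abbreviate large values. Comparing the flagged rows before and after a scrub shows whether the error counts are growing.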

TiklMyN1ps · Aug 3, 2017

Hi @danb35
Here is what I got from the camcontrol devlist command:

<ATA MD3000GBDS 1KA2> at scbus0 target 0 lun 0 (pass0,da0)
<ATA MD3000GBDS 1KA2> at scbus0 target 1 lun 0 (pass1,da1)
<ATA MD3000GBDS 1KA2> at scbus0 target 2 lun 0 (pass2,da2)
<ATA MD3000GBDS 1KA2> at scbus0 target 3 lun 0 (pass3,da3)
<PLDS DVDROM DH16D7SH WL3A> at scbus4 target 0 lun 0 (pass4,cd0)
<SanDisk Cruzer Blade 1.00> at scbus6 target 0 lun 0 (pass5,da4)

I will be running the scrub soon.

danb35 · Aug 3, 2017

OK, I'm going to guess that da0 is the first disk in your pool. What's the output (in code tags, please) of smartctl -x /dev/da0?
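
Rather than guessing which daN device sits behind a gptid label, the mapping can usually be read from glabel status on FreeBSD-based systems. Below is a rough Python sketch that parses that command's three-column output (Name, Status, Components). The helper name labels_to_disks and the sample text are hypothetical: the labels are placeholders, not the real labels from this pool.

```python
import re


def labels_to_disks(glabel_status_text):
    """Map each label (e.g. gptid/...) to the disk that holds its partition."""
    mapping = {}
    for line in glabel_status_text.splitlines():
        fields = line.split()
        # Data rows have exactly three columns: Name, Status, Components.
        if len(fields) != 3 or fields[0] == "Name":
            continue
        label, _status, component = fields
        # Strip a trailing partition suffix such as "p2" to get the disk name.
        mapping[label] = re.sub(r"p\d+$", "", component)
    return mapping


# Hypothetical sample with placeholder labels, NOT output from the server in this thread.
SAMPLE_GLABEL = """\
                                      Name  Status  Components
gptid/00000000-0000-0000-0000-000000000001     N/A  da0p2
gptid/00000000-0000-0000-0000-000000000002     N/A  da1p2
"""

for label, disk in labels_to_disks(SAMPLE_GLABEL).items():
    print(label, "->", disk)
```

With the real output, look up the gptid shown next to the read error in zpool status and run smartctl against the disk it maps to.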

TiklMyN1ps · Aug 3, 2017

@danb35

Is this correct?

Code:

SCT Version (vendor specific):	   258 (0x0102)																				   
SCT Support Level:				   1																							 
Device State:						Active (0)																					 
Current Temperature:					35 Celsius																				 
Power Cycle Min/Max Temperature:	 32/38 Celsius																				 
Lifetime	Min/Max Temperature:	 23/39 Celsius																				 
Under/Over Temperature Limit Count:   0/0																						   
Vendor specific:																													
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00																					 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00																					 
																																	
SCT Temperature History Version:	 2																							 
Temperature Sampling Period:		 1 minute																					   
Temperature Logging Interval:		1 minute																					   
Min/Max recommended Temperature:	  0/60 Celsius																				 
Min/Max Temperature Limit:		   -41/85 Celsius																				 
Temperature History Size (Index):	478 (152)																					 
																																	
Index	Estimated Time   Temperature Celsius																					   
153	2017-08-02 22:55	34  ***************																					 
...	..(308 skipped).	..  ***************																					 
462	2017-08-03 04:04	34  ***************																					 
463	2017-08-03 04:05	35  ****************																					
...	..(166 skipped).	..  ****************																					
152	2017-08-03 06:52	35  ****************																					
																																	
SCT Error Recovery Control:																										 
		   Read:	 70 (7.0 seconds)																							   
		  Write:	 70 (7.0 seconds)																							   
																																	
Device Statistics (GP/SMART Log 0x04) not supported																				 
																																	
SATA Phy Event Counters (GP Log 0x11)																							   
ID	  Size	 Value  Description																								 
0x0001  2		   38  Command failed due to ICRC error																			
0x0002  2		   38  R_ERR response for data FIS																				 
0x0003  2		   38  R_ERR response for device-to-host data FIS																 
0x0004  2			0  R_ERR response for host-to-device data FIS																 
0x0005  2			0  R_ERR response for non-data FIS																			 
0x0006  2			0  R_ERR response for device-to-host non-data FIS															 
0x0007  2			0  R_ERR response for host-to-device non-data FIS															 
0x0008  2			0  Device-to-host non-data FIS retries																		 
0x0009  2			3  Transition from drive PhyRdy to drive PhyNRdy															   
0x000a  2			4  Device-to-host register FISes sent due to a COMRESET														
0x000b  2			0  CRC errors within host-to-device FIS																		
0x000f  2			0  R_ERR response for host-to-device data FIS, CRC															 
0x0012  2			0  R_ERR response for host-to-device non-data FIS, CRC														 
0x8000  4	   303528  Vendor specific

danb35 · Aug 3, 2017

TiklMyN1ps said:
Is this correct?

It's correct as far as it goes, but incomplete. The full output would have scrolled off your screen, but if you're using an SSH client you can scroll back and get it. If you're using the shell through the web GUI, don't--turn on SSH in the server config and use an SSH client.

TiklMyN1ps · Aug 3, 2017

I was just looking into SSH and setting it up now; the shell isn't that user friendly. Let me work on getting SSH set up, and I'll run the command and send you an update.

danb35 · Aug 3, 2017

TiklMyN1ps said:
the shell isn't that user friendly

Agreed. It's OK for running a command or two, and it's nice that it's just a click away in the web GUI, but its handling of control characters is iffy at best, and the lack of a scrollback buffer is a big problem.

TiklMyN1ps · Aug 3, 2017

Ok, now I think I got it. How's this?

Code:

=== START OF INFORMATION SECTION ===
Device Model:	 MD3000GBDS
Serial Number:	MD302316623
LU WWN Device Id: 5 0014ee 0598469bc
Firmware Version: 01.01KA2
User Capacity:	3,000,592,982,016 bytes [3.00 TB]
Sector Size:	  512 bytes logical/physical
Rotation Rate:	7200 rpm
Device is:		Not in smartctl database [for details use: -P showall]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Thu Aug  3 07:26:32 2017 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Disabled
Rd look-ahead is: Enabled
Write cache is:   Enabled
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80) Offline data collection activity
										was never started.
										Auto Offline Data Collection: Enabled.
Self-test execution status:	  (   0) The previous self-test routine completed
										without error or no self-test has ever
										been run.
Total time to complete Offline
data collection:				(37620) seconds.
Offline data collection
capabilities:					(0x7b) SMART execute Offline immediate.
										Auto Offline data collection on/off support.
										Suspend Offline collection upon new
										command.
										Offline surface scan supported.
										Self-test supported.
										Conveyance Self-test supported.
										Selective Self-test supported.
SMART capabilities:			(0x0003) Saves SMART data before entering
										power-saving mode.
										Supports SMART auto save timer.
Error logging capability:		(0x01) Error logging supported.
										General Purpose Logging supported.
Short self-test routine
recommended polling time:		(   2) minutes.
Extended self-test routine
recommended polling time:		( 408) minutes.
Conveyance self-test routine
recommended polling time:		(   5) minutes.
SCT capabilities:			  (0x70bd) SCT Status supported.
										SCT Error Recovery Control supported.
										SCT Feature Control supported.
										SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAGS	VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate	 POSR-K   200   200   051	-	0
  3 Spin_Up_Time			POS--K   155   154   021	-	11233
  4 Start_Stop_Count		-O--CK   100   100   000	-	27
  5 Reallocated_Sector_Ct   PO--CK   200   200   140	-	0
  7 Seek_Error_Rate		 -OSR-K   200   200   000	-	0
  9 Power_On_Hours		  -O--CK   100   100   000	-	139
 10 Spin_Retry_Count		-O--CK   100   253   000	-	0
 11 Calibration_Retry_Count -O--CK   100   253   000	-	0
 12 Power_Cycle_Count	   -O--CK   100   100   000	-	27
183 Runtime_Bad_Block	   -O--CK   100   100   000	-	0
192 Power-Off_Retract_Count -O--CK   200   200   000	-	26
193 Load_Cycle_Count		-O--CK   200   200   000	-	0
194 Temperature_Celsius	 -O---K   117   113   000	-	35
196 Reallocated_Event_Count -O--CK   200   200   000	-	0
197 Current_Pending_Sector  -O--CK   200   200   000	-	0
198 Offline_Uncorrectable   ----CK   100   253   000	-	0
199 UDMA_CRC_Error_Count	-O--CK   200   192   000	-	15615
200 Multi_Zone_Error_Rate   ---R--   100   253   000	-	0
							||||||_ K auto-keep
							|||||__ C event count
							||||___ R error rate
							|||____ S speed/performance
							||_____ O updated online
							|______ P prefailure warning

General Purpose Log Directory Version 1
SMART		   Log Directory Version 1 [multi-sector log support]
Address	Access  R/W   Size  Description
0x00	   GPL,SL  R/O	  1  Log Directory
0x01		   SL  R/O	  1  Summary SMART error log
0x02		   SL  R/O	  5  Comprehensive SMART error log
0x03	   GPL	 R/O	  6  Ext. Comprehensive SMART error log
0x06		   SL  R/O	  1  SMART self-test log
0x07	   GPL	 R/O	  1  Extended self-test log
0x08	   GPL	 R/O	  2  Power Conditions log
0x09		   SL  R/W	  1  Selective self-test log
0x10	   GPL	 R/O	  1  SATA NCQ Queued Error log
0x11	   GPL	 R/O	  1  SATA Phy Event Counters log
0x24	   GPL	 R/O	  1  Current Device Internal Status Data log
0x80-0x9f  GPL,SL  R/W	 16  Host vendor specific log
0xa0-0xa7  GPL,SL  VS	  16  Device vendor specific log
0xa8-0xb7  GPL,SL  VS	   1  Device vendor specific log
0xbd	   GPL,SL  VS	   1  Device vendor specific log
0xc0	   GPL,SL  VS	   1  Device vendor specific log
0xc1	   GPL	 VS	  24  Device vendor specific log
0xe0	   GPL,SL  R/W	  1  SCT Command/Status
0xe1	   GPL,SL  R/W	  1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:				  3
SCT Version (vendor specific):	   258 (0x0102)
SCT Support Level:				   1
Device State:						Active (0)
Current Temperature:					35 Celsius
Power Cycle Min/Max Temperature:	 32/38 Celsius
Lifetime	Min/Max Temperature:	 23/39 Celsius
Under/Over Temperature Limit Count:   0/0
Vendor specific:
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

SCT Temperature History Version:	 2
Temperature Sampling Period:		 1 minute
Temperature Logging Interval:		1 minute
Min/Max recommended Temperature:	  0/60 Celsius
Min/Max Temperature Limit:		   -41/85 Celsius
Temperature History Size (Index):	478 (186)

Index	Estimated Time   Temperature Celsius
 187	2017-08-02 23:29	34  ***************
 ...	..(274 skipped).	..  ***************
 462	2017-08-03 04:04	34  ***************
 463	2017-08-03 04:05	35  ****************
 ...	..(200 skipped).	..  ****************
 186	2017-08-03 07:26	35  ****************

SCT Error Recovery Control:
		   Read:	 70 (7.0 seconds)
		  Write:	 70 (7.0 seconds)

Device Statistics (GP/SMART Log 0x04) not supported

SATA Phy Event Counters (GP Log 0x11)
ID	  Size	 Value  Description
0x0001  2		   38  Command failed due to ICRC error
0x0002  2		   38  R_ERR response for data FIS
0x0003  2		   38  R_ERR response for device-to-host data FIS
0x0004  2			0  R_ERR response for host-to-device data FIS
0x0005  2			0  R_ERR response for non-data FIS
0x0006  2			0  R_ERR response for device-to-host non-data FIS
0x0007  2			0  R_ERR response for host-to-device non-data FIS
0x0008  2			0  Device-to-host non-data FIS retries
0x0009  2			3  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2			4  Device-to-host register FISes sent due to a COMRESET
0x000b  2			0  CRC errors within host-to-device FIS
0x000f  2			0  R_ERR response for host-to-device data FIS, CRC
0x0012  2			0  R_ERR response for host-to-device non-data FIS, CRC
0x8000  4	   305574  Vendor specific

Z300M · Aug 3, 2017

TiklMyN1ps said:
Hi @danb35
Here is what I got from the camcontrol devlist command:

<ATA MD3000GBDS 1KA2> at scbus0 target 0 lun 0 (pass0,da0)
<ATA MD3000GBDS 1KA2> at scbus0 target 1 lun 0 (pass1,da1)
<ATA MD3000GBDS 1KA2> at scbus0 target 2 lun 0 (pass2,da2)
<ATA MD3000GBDS 1KA2> at scbus0 target 3 lun 0 (pass3,da3)
<PLDS DVDROM DH16D7SH WL3A> at scbus4 target 0 lun 0 (pass4,cd0)
<SanDisk Cruzer Blade 1.00> at scbus6 target 0 lun 0 (pass5,da4)

I will be running the scrub soon.

Do those model numbers tell us what make these drives are really? Or was it possible for the reseller of the "white label" drives to replace the original manufacturer's model numbers by their own? Anyone?

That they are apparently 3TB drives makes me wonder whether they are really some of the notoriously unreliable 3TB Seagates --except that now I see drives with the "MD3000" model number (but different letters following) and with the name "MaxDigital" both on eBay and on NewEgg and claiming to be Enterprise-grade drives.

TiklMyN1ps · Aug 3, 2017

Ok so after the scrub, this is what I am seeing when I do a zpool status

Code:


root@freenas:~ # zpool status
  pool: Vault
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
		continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Aug  3 09:32:16 2017
		183G scanned out of 3.84T at 125M/s, 8h31m to go
		17.1M resilvered, 4.67% done
config:

		NAME                                            STATE     READ WRITE CKSUM
		Vault                                           DEGRADED   176    14     2
		  raidz1-0                                      DEGRADED   547     4     5
		    gptid/54a3541e-74b7-11e7-a6af-94de808e85cb  ONLINE      26     0     1  (resilvering)
		    gptid/55459f80-74b7-11e7-a6af-94de808e85cb  ONLINE      39   165     1  (resilvering)
		    gptid/55e6540b-74b7-11e7-a6af-94de808e85cb  FAULTED     45   103     0  too many errors  (resilvering)
		    gptid/5679eb01-74b7-11e7-a6af-94de808e85cb  ONLINE       9     0    15  (resilvering)

errors: 135 data errors, use '-v' for a list

  pool: freenas-boot
 state: ONLINE
  scan: none requested
config:

		NAME		STATE	 READ WRITE CKSUM
		freenas-boot  ONLINE	   0	 0	 0
		  da4p2	 ONLINE	   0	 0	 0

errors: No known data errors

danb35 · Aug 3, 2017

TiklMyN1ps said:
Ok so after the scrub, this is what I am seeing when I do a zpool status

That is very bad. All your drives are showing read errors, half are showing write errors as well, and there is some data loss already noted. The SMART data you posted for da0 appears to indicate some errors as well.

Once the scrub completes, run long SMART tests on each drive. smartctl -t long /dev/da0 (and da1-3 as well).
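
To make the SMART remark above concrete: in the attribute table posted for da0, the line that stands out is UDMA_CRC_Error_Count with a raw value of 15615, a counter that is commonly read as a sign of link or cabling trouble. Here is a small Python sketch that pulls such values out of the attribute table printed by smartctl -x. The set of watched IDs (5, 197, 198, 199) is an assumption based on common practice, and the function name nonzero_watched_attributes is made up for this example.

```python
# Attribute IDs often checked first (an assumption, adjust to taste):
# 5 Reallocated_Sector_Ct, 197 Current_Pending_Sector,
# 198 Offline_Uncorrectable, 199 UDMA_CRC_Error_Count.
WATCHED_IDS = {5, 197, 198, 199}


def nonzero_watched_attributes(attribute_table_text):
    """Return {attribute_name: raw_value} for watched attributes with raw value > 0.

    Expects the column layout printed by smartctl -x:
    ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
    """
    flagged = {}
    for line in attribute_table_text.splitlines():
        fields = line.split()
        if len(fields) < 8 or not fields[0].isdigit():
            continue  # not an attribute row
        attr_id, name, raw = int(fields[0]), fields[1], fields[7]
        if attr_id in WATCHED_IDS and raw.isdigit() and int(raw) > 0:
            flagged[name] = int(raw)
    return flagged


# Rows copied from the da0 output posted earlier in this thread.
SAMPLE_ATTRIBUTES = """\
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
197 Current_Pending_Sector  -O--CK   200   200   000    -    0
198 Offline_Uncorrectable   ----CK   100   253   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   192   000    -    15615
"""

print(nonzero_watched_attributes(SAMPLE_ATTRIBUTES))
```

The parser only understands the brief table layout from smartctl -x. The table printed by smartctl -A has different columns and would need its own field index for the raw value.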

TiklMyN1ps · Aug 3, 2017

I will attempt to run the SMART tests when I get home from work, but unfortunately my server just went offline, although it still shows as active on my network. I remoted into another client on my network and attempted to access the server via the web UI with no luck, just request timeouts. I also just tried to SSH into it, and it is just locking up my PuTTY.

I think the server just crashed and is in a locked state. I will attempt to send a shutdown/power-on via SSH if I am able to log in.
