One or more devices has experienced an unrecoverable error-Vault

TiklMyN1ps · Aug 3, 2017

So I managed to get the server back, and it is still continuing to scrub.

Code:


root@freenas:~ # zpool status
  pool: Vault
 state: DEGRADED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://illumos.org/msg/ZFS-8000-JQ
  scan: scrub in progress since Thu Aug  3 12:34:17 2017
        61.7G scanned out of 3.84T at 68.1M/s, 16h8m to go
        14.6M repaired, 1.57% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        Vault                                           DEGRADED     7    12     3
          raidz1-0                                      DEGRADED    23     1    11
            gptid/54a3541e-74b7-11e7-a6af-94de808e85cb  ONLINE      27    16    21  (repairing)
            gptid/55459f80-74b7-11e7-a6af-94de808e85cb  ONLINE       7     0    34  (repairing)
            gptid/55e6540b-74b7-11e7-a6af-94de808e85cb  DEGRADED    40    21   754  too many errors  (repairing)
            gptid/5679eb01-74b7-11e7-a6af-94de808e85cb  ONLINE      16   113    25  (repairing)

errors: 7 data errors, use '-v' for a list

  pool: freenas-boot
 state: ONLINE
  scan: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          da4p2       ONLINE       0     0     0

errors: No known data errors

So at this point I am assuming that I have massive data loss; however, the scrub is repairing it.

It says the scrub will need an additional 16 hours, so I'm just letting it run for now.
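As a sanity check on that estimate, the figures in the scan lines can be reproduced with simple arithmetic. The sketch below assumes the G, T and M suffixes are binary units (GiB, TiB, MiB) and that the scan rate stays constant, which may not hold on a pool that is throwing read errors. The function name is made up for illustration.

```python
# Figures copied from the 'scan:' lines of the zpool status output above.
# Assumption: G, T and M are binary units (GiB, TiB, MiB).
GIB_PER_TIB = 1024
MIB_PER_GIB = 1024

scanned_gib = 61.7
total_gib = 3.84 * GIB_PER_TIB
rate_mib_s = 68.1


def scrub_progress(scanned_gib, total_gib, rate_mib_s):
    """Return (percent done, seconds remaining) assuming a constant scan rate."""
    percent = 100.0 * scanned_gib / total_gib
    remaining_mib = (total_gib - scanned_gib) * MIB_PER_GIB
    return percent, remaining_mib / rate_mib_s


percent, seconds = scrub_progress(scanned_gib, total_gib, rate_mib_s)
hours, rem = divmod(int(seconds), 3600)
print(f"{percent:.2f}% done, about {hours}h{rem // 60}m to go")
```

The result should land within a couple of minutes of the 16h8m that zpool reports; the small difference comes from the rounded inputs. The estimate only covers how long the scrub takes, not whether the repaired data will stay intact.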

However, this does not address the underlying issue. Any suggestions at this point?

danb35 · Aug 3, 2017

TiklMyN1ps said:
Any suggestions at this point?

Broadly speaking, I'd say the possibilities are the drives themselves, the data channel to the drives, or power to the drives. The long SMART tests should pretty well address the first possibility. As to the second, check the cables, make sure they're securely plugged in, etc. If you have spares, try replacing them. Power is probably the hardest thing to test in this context.
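One rough way to weigh these possibilities is to check whether the errors are concentrated on a single device or spread across all of them. The sketch below totals the READ, WRITE and CKSUM counters for the four Vault members, using rows copied from the zpool status output earlier in the thread. Errors on every member are at least consistent with a shared cause such as cabling, controller or power, but that reading is an assumption, not a diagnosis. The regex and function name are illustrative only, and the pattern does not handle abbreviated counters (for example values with a K suffix).

```python
import re

# Device rows copied from the Vault pool in the zpool status output earlier
# in the thread (whitespace normalised).
STATUS_ROWS = """\
gptid/54a3541e-74b7-11e7-a6af-94de808e85cb  ONLINE    27  16  21  (repairing)
gptid/55459f80-74b7-11e7-a6af-94de808e85cb  ONLINE     7   0  34  (repairing)
gptid/55e6540b-74b7-11e7-a6af-94de808e85cb  DEGRADED  40  21  754  too many errors  (repairing)
gptid/5679eb01-74b7-11e7-a6af-94de808e85cb  ONLINE    16  113  25  (repairing)
"""

# name, state, then three plain integer counters; anything after is ignored.
ROW = re.compile(r"^\s*(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)")


def parse_rows(text):
    """Parse device rows into dicts with name, state and error counters."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if not m:
            continue
        read, write, cksum = (int(g) for g in m.groups()[2:])
        rows.append({
            "name": m.group(1),
            "state": m.group(2),
            "read": read,
            "write": write,
            "cksum": cksum,
            "total": read + write + cksum,
        })
    return rows


devices = parse_rows(STATUS_ROWS)
for dev in sorted(devices, key=lambda d: d["total"], reverse=True):
    print(f'{dev["name"]}  {dev["state"]:<8}  total errors: {dev["total"]}')

with_errors = [d for d in devices if d["total"] > 0]
print(f"{len(with_errors)} of {len(devices)} devices report errors")
```

Here every member reports errors, with one device far ahead on checksum errors, so the counters alone do not single out one cause.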

Chris Moore · Aug 4, 2017

TiklMyN1ps said:
So I managed to get the server back, and it is still continuing to scrub.

So at this point I am assuming that I have massive data loss; however, the scrub is repairing it.

It says the scrub will need an additional 16 hours, so I'm just letting it run for now.

However, this does not address the underlying issue. Any suggestions at this point?

I don't know that I would take the time to try to repair this. From the SMART data you posted for da0, the drive was getting a high number of communication errors, which usually indicates a cabling problem, but in this case (because they are white label drives) it could be defective drives. I hate the position this puts you in.

You have to test the connectivity between the drive and the system board, and the sooner you start, the sooner it will be done. If I were in your spot, I would scrap the existing pool because it is a backup anyhow. If you have some known good drives, pull the ones that are giving errors, put in different drives and do some testing to see if the server can properly access those drives. If it works with different drives, that is one nail in the coffin for the drives.

Alternatively, or in addition to the above, you can put these drives in another system connected to a plain SATA controller and boot from a CD with something like DBAN (Darik's Boot and Nuke). Make sure you only select these drives and do a wipe with verify on every pass. That will rigorously test the drive by writing data to the drive and then checking that the data read back matches the data written. DBAN will tell you if a drive fails, and this has the advantage of doing a full surface test of the disk and doing it on different hardware, which rules out a problem with the cables or controller card in the server.

When you have a hardware fault (which this appears to be), you have to do some testing to validate where the fault is so the bad component can be replaced.

My money (if I were to bet) would be on the drives being defective. But that is just because of the 'brand' of drive. I try to stick with name brand drives. It is generally better (still not best) to buy a used name brand drive than to buy a 'new' white label drive. I have no evidence, but I think that some of the white label drives are actually used drives that have been flashed with new firmware to make them appear new when they are really already old, failed or otherwise in poor health.
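To illustrate what a "wipe with verify" pass does in principle, here is a minimal sketch that writes a known pattern and then reads it back, counting blocks that do not match. It is not how DBAN is implemented. It runs against a scratch file, never a raw disk, and the function names and the 0xAA pattern are made up for illustration. Because the operating system may serve the read-back from cache, this shows the technique only and is not a substitute for a real surface test on the device.

```python
import os
import tempfile

BLOCK_SIZE = 1024 * 1024  # 1 MiB per block


def write_pattern(path, size_bytes, pattern_byte=0xAA):
    """Fill the file at `path` with `size_bytes` of a repeating byte."""
    block = bytes([pattern_byte]) * BLOCK_SIZE
    with open(path, "wb") as f:
        remaining = size_bytes
        while remaining > 0:
            chunk = block[:min(BLOCK_SIZE, remaining)]
            f.write(chunk)
            remaining -= len(chunk)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to storage


def verify_pattern(path, size_bytes, pattern_byte=0xAA):
    """Read the file back and return the number of blocks that differ."""
    block = bytes([pattern_byte]) * BLOCK_SIZE
    bad_blocks = 0
    with open(path, "rb") as f:
        remaining = size_bytes
        while remaining > 0:
            expected = block[:min(BLOCK_SIZE, remaining)]
            if f.read(len(expected)) != expected:
                bad_blocks += 1
            remaining -= len(expected)
    return bad_blocks


# Demo on a temporary scratch file that is removed afterwards.
fd, scratch = tempfile.mkstemp()
os.close(fd)
try:
    size = 4 * BLOCK_SIZE
    write_pattern(scratch, size)
    print("mismatching blocks:", verify_pattern(scratch, size))
finally:
    os.remove(scratch)
```

A healthy write path should report zero mismatching blocks. A drive, cable or controller that corrupts data in flight would show up as a non-zero count, which is the signal the verify pass is looking for.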

Sakuru · Aug 7, 2017

danb35 said:
Take a look here: https://forums.freenas.org/index.php?threads/smart-results-are-these-drives-bad.43457/#post-286353

Though other users (@Sakuru, I think) have had better luck.

I would call my luck "meh". Seven of the eight drives are fine, but I had to RMA one, and then the replacement died. I gave up and bought a WD Red instead. If any of the other seven die, I will replace them with Reds too.

JackShine · Aug 9, 2017

I've never seen such a dire zpool status.

Remove all the HDDs and check them out on another PC.

Also, I'd rename the pool "Titanic".

Glad it's just a backup.

Chris Moore · Aug 9, 2017

TiklMyN1ps said:
Hello,
I am very new to FreeNAS, and I am sorry if I have missed some fundamental information. However, I am not really looking for condescending comments such as "learn the OS before posting issues", just someone helpful enough to provide some troubleshooting for a first-time newbie :)

Thanks in advance!

Did you get any results?
