Two drives spontaneously dropped from pool?

Phobos · May 30, 2017

Hi all,

tl;dr: Two disks spontaneously disappeared from my encrypted pool. I backed up the data, detached the pool, and re-imported it, and the two disks reappeared with CKSUM errors. What just happened, and why?

System: 6x4 TB WD Red encrypted pool running RAIDZ2; X10SLH-F; Xeon E3-1231v3; 32 GB Samsung ECC RAM; Seasonic PSU; FreeNAS-9.10.2-U4 (27ae72978). Boot pool is mirrored, I have a UPS, and I regularly run scrubs and SMART tests.

--------------------------------------------------------

Longer version:

Sunday afternoon, two disks spontaneously dropped out of my pool:

The volume tank (ZFS) state is DEGRADED: One or more devices has been removed by the administrator. Sufficient replicas exist for the pool to continue functioning in a degraded state.

Running `zpool status` showed the drives as UNAVAIL. Rebooting and attempting to unlock the pool resulted in:

The volume tank (ZFS) state is DEGRADED: One or more devices could not be opened. Sufficient replicas exist for the pool to continue functioning in a degraded state.

The drives still were listed as UNAVAIL. However, in both cases the affected drives were recognized by SMART and showed no errors. (Curiously, I noticed that the drives had become unselected in the dialog box that is used to schedule SMART tests.)

In a panic, I spun up a backup server running 3x6 TB WD Reds in a RAIDZ1 as a guest under VMware Fusion on an old Mac Pro. (I know, very much not ideal. These three drives were supposed to be for a backup pool attached directly to my server, but I haven’t yet gotten an HBA.) The data replicated over in about 2 days without too much fuss.

Today, I detached the pool completely, imported it again, provided the GELI key and passphrase, and found that all six drives imported, but two had CKSUM errors:

Code:

  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 373M in 0h0m with 0 errors on Tue May 30 16:30:16 2017
config:

	NAME												STATE	 READ WRITE CKSUM
	tank												ONLINE	   0	 0	 0
	  raidz2-0										  ONLINE	   0	 0	 0
		gptid/aaaaaaaa-aaaa-11e4-aaaa-xxxxxxxxxxxx.eli  ONLINE	   0	 0	 0
		gptid/bbbbbbbb-bbbb-11e4-bbbb-xxxxxxxxxxxx.eli  ONLINE	   0	 0	 4
		gptid/cccccccc-cccc-11e4-cccc-xxxxxxxxxxxx.eli  ONLINE	   0	 0	 4
		gptid/dddddddd-dddd-11e4-dddd-xxxxxxxxxxxx.eli  ONLINE	   0	 0	 0
		gptid/eeeeeeee-eeee-11e5-eeee-xxxxxxxxxxxx.eli  ONLINE	   0	 0	 0
		gptid/ffffffff-ffff-11e7-ffff-xxxxxxxxxxxx.eli  ONLINE	   0	 0	 0

errors: No known data errors
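
A minimal sketch, not part of the original post, for picking the per-device counters out of a `zpool status` table like the one above. The function name `devices_with_errors` and the regular expression are illustrative assumptions, the sample text is an abbreviated copy of the output above, and the pattern assumes plain integer counters (large counts may be printed with a suffix such as K, which this sketch does not handle).

```python
import re

# Abbreviated copy of the config table from the post above.
SAMPLE = """\
    NAME                                                STATE   READ WRITE CKSUM
    tank                                                ONLINE     0     0     0
      raidz2-0                                          ONLINE     0     0     0
        gptid/aaaaaaaa-aaaa-11e4-aaaa-xxxxxxxxxxxx.eli  ONLINE     0     0     0
        gptid/bbbbbbbb-bbbb-11e4-bbbb-xxxxxxxxxxxx.eli  ONLINE     0     0     4
        gptid/cccccccc-cccc-11e4-cccc-xxxxxxxxxxxx.eli  ONLINE     0     0     4
"""

# name, STATE, then three integer counters (READ, WRITE, CKSUM).
ROW = re.compile(r"^\s*(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)")


def devices_with_errors(status_text):
    """Return {name: (read, write, cksum)} for rows with any nonzero counter."""
    flagged = {}
    for line in status_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header line or non-table text
        name, _state, read, write, cksum = match.groups()
        counts = (int(read), int(write), int(cksum))
        if any(counts):
            flagged[name] = counts
    return flagged


if __name__ == "__main__":
    for name, (read, write, cksum) in devices_with_errors(SAMPLE).items():
        print(f"{name}: READ={read} WRITE={write} CKSUM={cksum}")
```

On a live system the text would come from running `zpool status tank` rather than from a pasted sample.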

Right now I am scrubbing the pool, and will follow with SMART long tests on all drives. smartctl reports no errors for any of the drives. The drives are running below 35°C (although it has been getting warmer the past few weeks).

Did two drives really spontaneously fail? Is the issue perhaps the cables, SATA ports, or SATA controller? Should I `zpool clear` the errors and wait and see what happens, or do I have to replace the drives, or even the motherboard? As always, thank you for any insight.

SweetAndLow · May 30, 2017

How are your drives attached? What does the SMART data show, and have any tests failed?


Phobos · May 30, 2017

SweetAndLow said:
How are your drives attached? What does the SMART data show, and have any tests failed?


Disks are attached via short SATA cables directly to the motherboard; all ports on the motherboard use the onboard Intel C226 controller.

No tests failed prior to the disks dropping out of the pool; I’m waiting for my scrub to finish before I kick off a new set of long tests.

Here’s SMART output for the disks in question from this morning:

Code:

# smartctl -i -H -A -n standby -l error -l selftest /dev/ada1
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Western Digital Red
Device Model:	 WDC WD40EFRX-68WT0N0
Serial Number:	WD-WCC4XXXXXXXX
LU WWN Device Id: 5 0014ee xxxxxxxxx
Firmware Version: 80.00A80
User Capacity:	4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	5400 rpm
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Tue May 30 01:30:00 2017 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Power mode is:	ACTIVE or IDLE

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
 1 Raw_Read_Error_Rate	 0x002f   200   200   051	Pre-fail  Always	   -	   0
 3 Spin_Up_Time			0x0027   253   175   021	Pre-fail  Always	   -	   3791
 4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   76
 5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   0
 7 Seek_Error_Rate		 0x002e   200   200   000	Old_age   Always	   -	   0
 9 Power_On_Hours		  0x0032   069   069   000	Old_age   Always	   -	   23274
10 Spin_Retry_Count		0x0032   100   253   000	Old_age   Always	   -	   0
11 Calibration_Retry_Count 0x0032   100   253   000	Old_age   Always	   -	   0
12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   76
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   19
193 Load_Cycle_Count		0x0032   200   200   000	Old_age   Always	   -	   261
194 Temperature_Celsius	 0x0022   120   115   000	Old_age   Always	   -	   32
196 Reallocated_Event_Count 0x0032   200   200   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0030   100   253   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x0032   200   200   000	Old_age   Always	   -	   0
200 Multi_Zone_Error_Rate   0x0008   200   200   000	Old_age   Offline	  -	   0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline	   Completed without error	   00%	 23226		 -
# 2  Short offline	   Completed without error	   00%	 23202		 -
# 3  Short offline	   Completed without error	   00%	 23178		 -
# 4  Short offline	   Completed without error	   00%	 23154		 -
# 5  Short offline	   Completed without error	   00%	 23130		 -
# 6  Short offline	   Completed without error	   00%	 23106		 -
# 7  Extended offline	Completed without error	   00%	 23095		 -
# 8  Short offline	   Completed without error	   00%	 23082		 -
# 9  Short offline	   Completed without error	   00%	 23058		 -
#10  Short offline	   Completed without error	   00%	 23034		 -
#11  Short offline	   Completed without error	   00%	 23010		 -
#12  Short offline	   Completed without error	   00%	 22986		 -
#13  Short offline	   Completed without error	   00%	 22962		 -
#14  Short offline	   Completed without error	   00%	 22938		 -
#15  Short offline	   Completed without error	   00%	 22914		 -
#16  Short offline	   Completed without error	   00%	 22890		 -
#17  Short offline	   Completed without error	   00%	 22866		 -
#18  Short offline	   Completed without error	   00%	 22842		 -
#19  Short offline	   Completed without error	   00%	 22818		 -
#20  Short offline	   Completed without error	   00%	 22794		 -
#21  Short offline	   Completed without error	   00%	 22651		 -

Code:

# smartctl -i -H -A -n standby -l error -l selftest /dev/ada2
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Western Digital Red
Device Model:	 WDC WD40EFRX-68WT0N0
Serial Number:	WD-WCC4XXXXXXXX
LU WWN Device Id: 5 0014ee xxxxxxxxx
Firmware Version: 80.00A80
User Capacity:	4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	5400 rpm
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Tue May 30 01:30:00 2017 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Power mode is:	ACTIVE or IDLE

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
 1 Raw_Read_Error_Rate	 0x002f   200   200   051	Pre-fail  Always	   -	   0
 3 Spin_Up_Time			0x0027   246   175   021	Pre-fail  Always	   -	   4658
 4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   75
 5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   0
 7 Seek_Error_Rate		 0x002e   200   200   000	Old_age   Always	   -	   0
 9 Power_On_Hours		  0x0032   069   069   000	Old_age   Always	   -	   23274
10 Spin_Retry_Count		0x0032   100   253   000	Old_age   Always	   -	   0
11 Calibration_Retry_Count 0x0032   100   253   000	Old_age   Always	   -	   0
12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   75
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   18
193 Load_Cycle_Count		0x0032   200   200   000	Old_age   Always	   -	   263
194 Temperature_Celsius	 0x0022   120   115   000	Old_age   Always	   -	   32
196 Reallocated_Event_Count 0x0032   200   200   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0030   100   253   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x0032   200   200   000	Old_age   Always	   -	   0
200 Multi_Zone_Error_Rate   0x0008   200   200   000	Old_age   Offline	  -	   0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline	   Completed without error	   00%	 23226		 -
# 2  Short offline	   Completed without error	   00%	 23202		 -
# 3  Short offline	   Completed without error	   00%	 23178		 -
# 4  Short offline	   Completed without error	   00%	 23154		 -
# 5  Short offline	   Completed without error	   00%	 23130		 -
# 6  Short offline	   Completed without error	   00%	 23106		 -
# 7  Extended offline	Completed without error	   00%	 23094		 -
# 8  Short offline	   Completed without error	   00%	 23082		 -
# 9  Short offline	   Completed without error	   00%	 23058		 -
#10  Short offline	   Completed without error	   00%	 23034		 -
#11  Short offline	   Completed without error	   00%	 23010		 -
#12  Short offline	   Completed without error	   00%	 22986		 -
#13  Short offline	   Completed without error	   00%	 22962		 -
#14  Short offline	   Completed without error	   00%	 22938		 -
#15  Short offline	   Completed without error	   00%	 22914		 -
#16  Short offline	   Completed without error	   00%	 22890		 -
#17  Short offline	   Completed without error	   00%	 22866		 -
#18  Short offline	   Completed without error	   00%	 22842		 -
#19  Short offline	   Completed without error	   00%	 22818		 -
#20  Short offline	   Completed without error	   00%	 22794		 -
#21  Short offline	   Completed without error	   00%	 22651		 -
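
A rough sketch, added for illustration, of how the attribute tables above could be checked automatically. The choice of watched attributes (5, 197, 198 for media problems and 199 for interface CRC errors, which are often associated with data cabling) is a common rule of thumb rather than anything stated in this thread, and `watched_raw_values` is a hypothetical helper name. The sample rows are copied from the ada1 output above.

```python
import re

# Attributes commonly watched when a drive misbehaves (an assumption,
# not a list taken from this thread).
WATCHED = {
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
    199: "UDMA_CRC_Error_Count",
}

# A few rows copied from the ada1 attribute table above.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   069   069   000    Old_age   Always       -       23274
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

# ID, name, flag, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, raw value.
ATTR_ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}"
    r"\s+\d+\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def watched_raw_values(smart_text):
    """Return {attribute_id: raw_value} for the watched attributes only."""
    values = {}
    for line in smart_text.splitlines():
        match = ATTR_ROW.match(line)
        if match and int(match.group(1)) in WATCHED:
            values[int(match.group(1))] = int(match.group(3))
    return values


if __name__ == "__main__":
    for attr_id, raw in watched_raw_values(SAMPLE).items():
        flag = "check" if raw else "ok"
        print(f"{attr_id:>3} {WATCHED[attr_id]:<24} raw={raw} ({flag})")
```

All four watched values are zero here, which is consistent with the drives themselves and the data cables not being the obvious culprit, though it does not rule them out.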

SweetAndLow · May 30, 2017

Look at the SMART data from all disks; after a reboot the device names could have changed, unless you matched them up with the gptids.

Phobos · May 31, 2017

SweetAndLow said:
Look at the SMART data from all disks; after a reboot the device names could have changed, unless you matched them up with the gptids.

I matched them up with the gptid using `glabel status`. Long tests passed for all drives this morning, and all of the drives are error free. The scrub I was running also found no errors (“scrub repaired 0 in 7h29m with 0 errors on Wed May 31 00:02:58 2017”).
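
For anyone wanting to script that matching step, here is a small sketch under stated assumptions. It parses the three-column `Name  Status  Components` table that `glabel status` prints and maps a pool member such as `gptid/….eli` back to its disk. The sample table is illustrative only: the ada1 row agrees with the dmesg output later in this thread, while the ada2 row is an assumption. The helper names `label_map` and `disk_for_vdev` are hypothetical.

```python
import re

# Illustrative `glabel status` output (ada2 row is assumed, not from the thread).
SAMPLE = """\
                                      Name  Status  Components
gptid/bbbbbbbb-bbbb-11e4-bbbb-xxxxxxxxxxxx     N/A  ada1p1
gptid/cccccccc-cccc-11e4-cccc-xxxxxxxxxxxx     N/A  ada2p1
"""


def label_map(glabel_text):
    """Return {label_name: component} from `glabel status` style text."""
    mapping = {}
    for line in glabel_text.splitlines():
        parts = line.split()
        if len(parts) != 3 or parts[0] == "Name":
            continue  # skip the header and anything that is not a table row
        name, _status, component = parts
        mapping[name] = component
    return mapping


def disk_for_vdev(vdev_name, mapping):
    """Map a pool member (optionally ending in .eli) to its disk, e.g. ada1."""
    label = vdev_name[:-4] if vdev_name.endswith(".eli") else vdev_name
    component = mapping.get(label)
    if component is None:
        return None
    # Strip the partition suffix: ada1p1 -> ada1.
    return re.sub(r"p\d+$", "", component)


if __name__ == "__main__":
    labels = label_map(SAMPLE)
    print(disk_for_vdev("gptid/bbbbbbbb-bbbb-11e4-bbbb-xxxxxxxxxxxx.eli", labels))
```

The resulting disk name is what would then be passed to `smartctl`, which avoids relying on device names that may change across reboots.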

cyberjock · May 31, 2017

I would simply monitor. More than likely there was some fluke that made the SATA controller angry, and you lost 2 ports. The CKSUM errors are what I'd expect if you just dropped disks randomly while in use.

If it happens again, I'd definitely look at replacing cables and such and see if you can precisely pinpoint the exact problem.

Phobos · May 31, 2017

Thanks cyberjock and SweetAndLow, I'll `zpool clear` the errors and keep an eye on it.

Stux · May 31, 2017

Power glitch could do it too. Have you got a UPS?

Phobos · May 31, 2017

Stux said:
Power glitch could do it too. Have you got a UPS?

Yup, CyberPower CP1350PFCLCD.

Phobos · Jun 5, 2017

Hmm, one of the same drives dropped again today. I reattached the pool and it had no CKSUM errors this time. I'm going to swap the ports the cables are plugged into and continue to keep an eye on the drives. Once again, no SMART errors.

Phobos · Jun 5, 2017

Here’s what showed up in dmesg:

Code:

ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
ada1: <WDC WD40EFRX-68WT0N0 80.00A80> s/n WD-WCC4XXXXXXXX detached
GEOM_ELI: Device ada1p1.eli destroyed.
GEOM_ELI: Detached ada1p1.eli on last close.
GEOM_ELI: Device gptid/bbbbbbbb-bbbb-11e4-bbbb-xxxxxxxxxxxx.eli destroyed.
GEOM_ELI: Detached gptid/bbbbbbbb-bbbb-11e4-bbbb-xxxxxxxxxxxx.eli on last close.
(ada1:ahcich1:0:0:0): Periph destroyed
ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
ada1: <WDC WD40EFRX-68WT0N0 80.00A80> ACS-2 ATA SATA 3.x device
ada1: Serial Number WD-WCC4XXXXXXXX
ada1: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada1: Command Queueing enabled
ada1: 3815447MB (7814037168 512 byte sectors)
ada1: quirks=0x1<4K>
ada1: Previously was known as ad6
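
A small sketch, added for illustration, of counting detach events per device in a log like the one above, which can help show whether the same drive or port keeps dropping. The pattern is based only on the `detached` line format shown in this post; other drivers or FreeBSD versions may word it differently. `detach_counts` is a hypothetical helper name.

```python
import re
from collections import Counter

# Excerpt of the dmesg output from the post above.
SAMPLE = """\
ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
ada1: <WDC WD40EFRX-68WT0N0 80.00A80> s/n WD-WCC4XXXXXXXX detached
(ada1:ahcich1:0:0:0): Periph destroyed
ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
ada1: <WDC WD40EFRX-68WT0N0 80.00A80> ACS-2 ATA SATA 3.x device
"""

# device: <model firmware> s/n SERIAL detached
DETACHED = re.compile(r"^(\w+): <.+?> s/n \S+ detached\s*$")


def detach_counts(dmesg_text):
    """Return {device: number_of_detach_events} found in the log text."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = DETACHED.match(line)
        if match:
            counts[match.group(1)] += 1
    return dict(counts)


if __name__ == "__main__":
    for device, count in detach_counts(SAMPLE).items():
        print(f"{device}: detached {count} time(s)")
```

Because device names can change across reboots, the counts are most meaningful within a single boot, or when cross-checked against serial numbers.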

Robert Trevellyan · Jun 5, 2017

Phobos said:
ada1: <WDC WD40EFRX-68WT0N0 80.00A80> s/n WD-WCC4XXXXXXXX detached

Classic symptom of a power glitch, but certainly far from conclusive evidence.

Phobos · Jun 7, 2017

Seems to be one of my power cables. Each power cable is connected to two drives. I swapped the cable powering the suspect pair of drives with the one powering a functioning pair, and upon rebooting, one of the previously functioning drives almost immediately dropped out and reconnected.

Stux · Jun 7, 2017

With SATA power splitters?

Phobos · Jun 7, 2017

Stux said:
With SATA power splitters?

They were cables with two SATA power connectors and a single Molex power connector. The Molex connector was plugged into another cable that had several Molex connectors and the PSU connector on the end (it’s a modular Seasonic PSU).

I pulled out all the Molex<->SATA cables and am now just using the SATA cables that came with the PSU (which have 2-4 SATA connectors and a single modular connector).

SweetAndLow · Jun 7, 2017

Molex power is sketchy when it comes to powering drives.


Stux · Jun 7, 2017

Well, a single Molex connector can power 4 drives, whereas a SATA power connector is good for more than 1, but less than 2.

The problem with large Molex peripheral connectors is that the pins can be sketchy. So you should verify that they look good and are correctly seated, etc., but when it comes to powering large numbers of drives, it's really the only way to go.

Phobos · Jun 7, 2017

Stux said:
Well, a single Molex connector can power 4 drives, whereas a SATA power connector is good for more than 1, but less than 2.

The problem with large Molex peripheral connectors is that the pins can be sketchy. So you should verify that they look good and are correctly seated, etc., but when it comes to powering large numbers of drives, it's really the only way to go.

Yeah, just to be clear, the cables I’m using now have 2 to 4 SATA power connectors on one modular connector for the PSU; I don't have multiple drives hanging off a single SATA power connector.

On a side note, would it actually be safe to use all 4 SATA power connectors on one of these cables?

rs225 · Jun 8, 2017

Some of those adapters are so poorly made that they short out internally and melt the drive's connectors, sometimes with smoke. So the quality of those parts should be checked carefully. If they came with a server, they are probably good. If you bought them single and weren't looking for the most expensive ...

It could also be you are on the edge of overloading either on the PSU as a whole, or on certain of the rails.

garym · Jun 14, 2017

rs225 said:
Some of those adapters are so poorly made that they short out internally and melt the drive's connectors, sometimes with smoke. So the quality of those parts should be checked carefully. If they came with a server, they are probably good. If you bought them single and weren't looking for the most expensive ...

It could also be you are on the edge of overloading either on the PSU as a whole, or on certain of the rails.

Please explain to me how a Molex can short out? You do know what a short is, right?
I have seen poorly crimped connectors, connectors or pins that were not properly engaged, and poorly sized pins and/or sockets. All of these can lead to arcing or high-resistance connections, and in turn to a hot joint. When you make a connection, check that the pins in a Molex are indeed all the way in; it's easy to see.
Please stop spreading FUD.
