bad drives, all 12 of them?

Status
Not open for further replies.

orddie

Contributor
Joined
Jun 11, 2016
Messages
104
Hi all!

I purchased 12 of the drives linked below and put them in a ZFS pool. IOPS is what I was after, and the SSDs gave me what I needed. Within 5 days of the system being up and the data finally moved over to it, I got alerts that drives had started to fail.

https://www.amazon.com/gp/product/B0784SLQM6/ref=oh_aui_detailpage_o00_s00?ie=UTF8&psc=1

Code:
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x002f   100   100   000	Pre-fail  Always	   -	   0
  5 Reallocated_Sector_Ct   0x0032   100   100   010	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0032   100   100   000	Old_age   Always	   -	   93
 12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   6
171 Unknown_Attribute	   0x0032   100   100   000	Old_age   Always	   -	   0
172 Unknown_Attribute	   0x0032   100   100   000	Old_age   Always	   -	   0
173 Unknown_Attribute	   0x0032   100   100   000	Old_age   Always	   -	   3
174 Unknown_Attribute	   0x0032   100   100   000	Old_age   Always	   -	   0
180 Unused_Rsvd_Blk_Cnt_Tot 0x0033   000   000   000	Pre-fail  Always	   -	   44
183 Runtime_Bad_Block	   0x0032   100   100   000	Old_age   Always	   -	   0
184 End-to-End_Error		0x0032   100   100   000	Old_age   Always	   -	   0
187 Reported_Uncorrect	  0x0032   100   100   000	Old_age   Always	   -	   0
194 Temperature_Celsius	 0x0022   069   059   000	Old_age   Always	   -	   31 (Min/Max 0/41)
196 Reallocated_Event_Count 0x0032   100   100   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0032   100   100   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0030   100   100   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x0032   100   100   000	Old_age   Always	   -	   0
202 Unknown_SSD_Attribute   0x0030   100   100   001	Old_age   Offline	  -	   0
206 Unknown_SSD_Attribute   0x000e   100   100   000	Old_age   Always	   -	   0
210 Unknown_Attribute	   0x0032   100   100   000	Old_age   Always	   -	   0
246 Unknown_Attribute	   0x0032   100   100   000	Old_age   Always	   -	   755882900
247 Unknown_Attribute	   0x0032   100   100   000	Old_age   Always	   -	   12402796
248 Unknown_Attribute	   0x0032   100   100   000	Old_age   Always	   -	   26866849



What started off with 6 drives has turned into 9. I can only think it's soon going to be all 12!

I question how valid the reporting is when it comes to these drives.

I have 18 10k SAS drives in the same host, all running as expected with no errors. It's still possible I have a backplane issue, but the slots that now hold the SSDs were previously populated with smaller 10k SAS drives, which never reported an issue.
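One way to sanity-check the reporting is to parse the attribute table and see whether any row actually trips its threshold. The following is only a rough sketch: the helper names parse_attributes and failing are made up for illustration, the sample rows are copied from the output above, and it assumes the usual convention that a THRESH of 0 means the attribute can never fail.

```python
import re

# One row of the "Vendor Specific SMART Attributes" table printed by smartctl.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(0x[0-9a-fA-F]+)\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(\S+)\s+(\S+)\s+(\S+)\s+(.*)$"
)

# Rows copied from the smartctl output in this thread.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   010    Old_age   Always       -       0
180 Unused_Rsvd_Blk_Cnt_Tot 0x0033   000   000   000    Pre-fail  Always       -       44
194 Temperature_Celsius     0x0022   069   059   000    Old_age   Always       -       31 (Min/Max 0/41)
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
"""


def parse_attributes(text):
    """Return one dict per attribute row; header and other lines are skipped."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if not m:
            continue
        ident, name, _flag, value, worst, thresh, typ, _upd, when_failed, raw = m.groups()
        rows.append({
            "id": int(ident),
            "name": name,
            "value": int(value),
            "worst": int(worst),
            "thresh": int(thresh),
            "type": typ,
            "when_failed": when_failed,
            "raw": raw.strip(),
        })
    return rows


def failing(rows):
    """Rows that report a failure.

    Assumption: a threshold of 0 means "never fails", so only rows with a
    non-zero threshold are compared against it.
    """
    return [
        r for r in rows
        if r["when_failed"] != "-" or (r["thresh"] > 0 and r["value"] <= r["thresh"])
    ]


if __name__ == "__main__":
    for row in failing(parse_attributes(SAMPLE)):
        print(row["id"], row["name"], row["raw"])
```

Run against the rows above, nothing is flagged: attribute 180 shows a normalized value of 000, but its threshold is also 000, and WHEN_FAILED is "-" on every row.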
 
Joined
Dec 29, 2014
Messages
1,135
If the issue is across all the drives, I would try replacing the cables. It could also be the disk controller. I might suspect power if you hadn't said there are 18 other drives in the system. Think about what is shared by the problem components when you have a mass error like that.
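To make the "what is shared" reasoning concrete, here is a small sketch. Everything in it is hypothetical: the device names, the controller, backplane and PSU labels, and the suspects helper are made up for illustration and do not describe the actual system in this thread.

```python
# Hypothetical inventory: which components each drive depends on.
# None of these labels come from the real system; they only illustrate the idea.
INVENTORY = {
    "da0":  {"controller": "hba0", "backplane": "bp0", "psu": "psu0"},  # SAS, healthy
    "da1":  {"controller": "hba0", "backplane": "bp0", "psu": "psu0"},  # SAS, healthy
    "da18": {"controller": "hba1", "backplane": "bp1", "psu": "psu0"},  # SSD, alerting
    "da19": {"controller": "hba1", "backplane": "bp1", "psu": "psu0"},  # SSD, alerting
}


def suspects(inventory, failing):
    """Components shared by every failing drive and by no healthy drive."""
    failing = set(failing)

    # Intersect the component sets of all failing drives.
    shared = None
    for dev in failing:
        parts = set(inventory[dev].items())
        shared = parts if shared is None else shared & parts
    shared = shared or set()

    # Anything a healthy drive also uses is less likely to be the cause.
    healthy_parts = set()
    for dev, parts in inventory.items():
        if dev not in failing:
            healthy_parts |= set(parts.items())

    return sorted(shared - healthy_parts)


if __name__ == "__main__":
    print(suspects(INVENTORY, ["da18", "da19"]))
```

With this made-up layout the PSU drops out because the healthy SAS drives share it, which leaves the controller and backplane (and whatever cabling goes with them) as the things to swap first.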
 

CraigD

Patron
Joined
Mar 8, 2016
Messages
343
@orddie

I have the same drives, and the same errors....

My guess is the Unknown Attributes are not errors. I have stopped SMART testing on both drives

My Crucial SSDs only hold Jails that can be recreated with a little effort, so I'm chancing it

If a drive dies prematurely I'll let you know

Have Fun
 

orddie

Contributor
Joined
Jun 11, 2016
Messages
104
This is very odd... When you read the output below, what are your thoughts?




Code:
root@freenas:~ # smartctl -a /dev/da18
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:	 CT500MX500SSD1
Serial Number:	1822E13FC735
LU WWN Device Id: 5 00a075 1e13fc735
Firmware Version: M3CR020
User Capacity:	500,107,862,016 bytes [500 GB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	Solid State Device
Form Factor:	  2.5 inches
Device is:		Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Sat Jul  7 11:47:07 2018 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
										was completed without error.
										Auto Offline Data Collection: Enabled.
Self-test execution status:	  (   0) The previous self-test routine completed
										without error or no self-test has ever
										been run.
Total time to complete Offline
data collection:				(	0) seconds.
Offline data collection
capabilities:					(0x7b) SMART execute Offline immediate.
										Auto Offline data collection on/off support.
										Suspend Offline collection upon new
										command.
										Offline surface scan supported.
										Self-test supported.
										Conveyance Self-test supported.
										Selective Self-test supported.
SMART capabilities:			(0x0003) Saves SMART data before entering
										power-saving mode.
										Supports SMART auto save timer.
Error logging capability:		(0x01) Error logging supported.
										General Purpose Logging supported.
Short self-test routine
recommended polling time:		(   2) minutes.
Extended self-test routine
recommended polling time:		(  30) minutes.
Conveyance self-test routine
recommended polling time:		(   2) minutes.
SCT capabilities:			  (0x0031) SCT Status supported.
										SCT Feature Control supported.
										SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x002f   100   100   000	Pre-fail  Always	   -	   0
  5 Reallocated_Sector_Ct   0x0032   100   100   010	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0032   100   100   000	Old_age   Always	   -	   98
 12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   6
171 Unknown_Attribute	   0x0032   100   100   000	Old_age   Always	   -	   0
172 Unknown_Attribute	   0x0032   100   100   000	Old_age   Always	   -	   0
173 Unknown_Attribute	   0x0032   100   100   000	Old_age   Always	   -	   3
174 Unknown_Attribute	   0x0032   100   100   000	Old_age   Always	   -	   0
180 Unused_Rsvd_Blk_Cnt_Tot 0x0033   000   000   000	Pre-fail  Always	   -	   43
183 Runtime_Bad_Block	   0x0032   100   100   000	Old_age   Always	   -	   0
184 End-to-End_Error		0x0032   100   100   000	Old_age   Always	   -	   0
187 Reported_Uncorrect	  0x0032   100   100   000	Old_age   Always	   -	   0
194 Temperature_Celsius	 0x0022   067   057   000	Old_age   Always	   -	   33 (Min/Max 0/43)
196 Reallocated_Event_Count 0x0032   100   100   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0032   100   100   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0030   100   100   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x0032   100   100   000	Old_age   Always	   -	   0
202 Unknown_SSD_Attribute   0x0030   100   100   001	Old_age   Offline	  -	   0
206 Unknown_SSD_Attribute   0x000e   100   100   000	Old_age   Always	   -	   0
210 Unknown_Attribute	   0x0032   100   100   000	Old_age   Always	   -	   0
246 Unknown_Attribute	   0x0032   100   100   000	Old_age   Always	   -	   760500684
247 Unknown_Attribute	   0x0032   100   100   000	Old_age   Always	   -	   12479385
248 Unknown_Attribute	   0x0032   100   100   000	Old_age   Always	   -	   25506835

SMART Error Log Version: 1
Warning: ATA error count 0 inconsistent with error log pointer 1

ATA Error Count: 0
		CR = Command Register [HEX]
		FR = Features Register [HEX]
		SC = Sector Count Register [HEX]
		SN = Sector Number Register [HEX]
		CL = Cylinder Low Register [HEX]
		CH = Cylinder High Register [HEX]
		DH = Device/Head Register [HEX]
		DC = Device Command Register [HEX]
		ER = Error register [HEX]
		ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 0 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  00 ec 00 00 00 00 00

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ec 00 00 00 00 00 00 00	  00:00:00.000  IDENTIFY DEVICE
  ec 00 00 00 00 00 00 00	  00:00:00.000  IDENTIFY DEVICE
  ec 00 00 00 00 00 00 00	  00:00:00.000  IDENTIFY DEVICE
  ec 00 00 00 00 00 00 00	  00:00:00.000  IDENTIFY DEVICE
  c8 00 00 00 00 00 00 00	  00:00:00.000  READ DMA

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline	   Completed without error	   00%		98		 -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
You should check for firmware updates on the drives and the controller.
 

orddie

Contributor
Joined
Jun 11, 2016
Messages
104
You should check for firmware updates on the drives and the controller.
Yes, I have a firmware update for the SSDs. What's the best way to apply it here, since we cannot update the firmware on an OS other than Windows?
 
Joined
Dec 29, 2014
Messages
1,135
Yes, I have a firmware update for the SSDs. What's the best way to apply it here, since we cannot update the firmware on an OS other than Windows?

I put FreeDOS on a USB stick and boot to that. I have a few systems that are cranky about that. For those, I had to use the EFI shell in the BIOS to do the update.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
What’s your PSU?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
I see that those are 3D NAND drives... there is a known issue with those. https://redmine.ixsystems.com/issues/35065

Seems it's related to NCQ and TRIM and the workaround may be to disable TRIM... read the bug report... work still ongoing on it.
 
Joined
May 10, 2017
Messages
838
If you're getting a warning that the drives have a pending sector (that then disappears), it's a firmware bug on all MX500 SSDs; ignore it until there's a firmware update.
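If you want to tell that kind of blip apart from a count that stays up, something like the sketch below could help. It assumes you have already collected the raw value of attribute 197 (Current_Pending_Sector) at intervals by some means of your own; the classify_pending helper and the sample readings are made up for illustration.

```python
def classify_pending(samples):
    """Classify a chronological list of raw values for attribute 197.

    "clean"      - never non-zero
    "transient"  - went non-zero at some point but is back to zero now
    "persistent" - still non-zero at the latest reading
    """
    if not samples:
        return "no data"
    if all(v == 0 for v in samples):
        return "clean"
    if samples[-1] == 0:
        return "transient"
    return "persistent"


if __name__ == "__main__":
    # Made-up readings, oldest first.
    print(classify_pending([0, 0, 1, 0, 0]))
    print(classify_pending([0, 1, 1, 2]))
```

A "transient" result matches the appear-then-disappear pattern described above; a "persistent" one would be worth a closer look regardless of the firmware bug.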
 