Pool degraded - failed to read SMART Attribute Data

Touche · Mar 28, 2018

smartctl -x /dev/da1

Code:

smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.1-STABLE amd64] (local build)															 
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org														 
																																   
=== START OF INFORMATION SECTION ===																							   
Model Family:	 Toshiba 3.5" DT01ACA... Desktop HDD																			   
Device Model:	 TOSHIBA DT01ACA300																							   
Serial Number:	66UAJN1AS																										 
LU WWN Device Id: 5 000039 fe3d2e1b4																							   
Firmware Version: MX6OABB0																										 
User Capacity:	3,000,592,982,016 bytes [3.00 TB]																				 
Sector Sizes:	 512 bytes logical, 4096 bytes physical																		   
Rotation Rate:	7200 rpm																										 
Form Factor:	  3.5 inches																									   
Device is:		In smartctl database [for details use: -P show]																   
ATA Version is:   ATA8-ACS T13/1699-D revision 4																				   
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)																		   
Local Time is:	Wed Mar 28 16:19:28 2018 CEST																					 
SMART support is: Available - device has SMART capability.																		 
SMART support is: Enabled																										   
AAM feature is:   Unavailable																									   
APM feature is:   Disabled																										 
Rd look-ahead is: Enabled																										   
Write cache is:   Enabled																										   
DSN feature is:   Unavailable  
ATA Security is:  Disabled, NOT FROZEN [SEC1]																					   
Wt Cache Reorder: Enabled																										   
																																   
=== START OF READ SMART DATA SECTION ===																						   
SMART overall-health self-assessment test result: PASSED																		   
																																   
General SMART Values:																											   
Offline data collection status:  (0x84) Offline data collection activity														   
										was suspended by an interrupting command from host.										 
										Auto Offline Data Collection: Enabled.													 
Self-test execution status:	  (   0) The previous self-test routine completed												   
										without error or no self-test has ever													 
										been run.																				   
Total time to complete Offline																									 
data collection:				(22508) seconds.																				   
Offline data collection																											 
capabilities:					(0x5b) SMART execute Offline immediate.														   
										Auto Offline data collection on/off support.											   
										Suspend Offline collection upon new														 
										command.																				   
										Offline surface scan supported.															 
										Self-test supported.																	   
										No Conveyance Self-test supported.														 
										Selective Self-test supported.  
SMART capabilities:			(0x0003) Saves SMART data before entering														   
										power-saving mode.																		 
										Supports SMART auto save timer.															 
Error logging capability:		(0x01) Error logging supported.																   
										General Purpose Logging supported.														 
Short self-test routine																											 
recommended polling time:		(   1) minutes.																				   
Extended self-test routine																										 
recommended polling time:		( 376) minutes.																				   
SCT capabilities:			  (0x003d) SCT Status supported.																	   
										SCT Error Recovery Control supported.													   
										SCT Feature Control supported.															 
										SCT Data Table supported.																   
																																   
SMART Attributes Data Structure revision number: 16																				 
Vendor Specific SMART Attributes with Thresholds:																				   
ID# ATTRIBUTE_NAME		  FLAGS	VALUE WORST THRESH FAIL RAW_VALUE															 
  1 Raw_Read_Error_Rate	 PO-R--   100   100   016	-	0																	 
  2 Throughput_Performance  P-S---   140   140   054	-	68																	 
  3 Spin_Up_Time			POS---   140   140   024	-	391 (Average 425)													 
  4 Start_Stop_Count		-O--C-   100   100   000	-	88																	 
  5 Reallocated_Sector_Ct   PO--CK   100   100   005	-	0																	 
  7 Seek_Error_Rate		 PO-R--   100   100   067	-	0																	 
  8 Seek_Time_Performance   P-S---   124   124   020	-	33   
  9 Power_On_Hours		  -O--C-   100   100   000	-	587																   
 10 Spin_Retry_Count		PO--C-   100   100   060	-	0																	 
 12 Power_Cycle_Count	   -O--CK   100   100   000	-	88																	 
192 Power-Off_Retract_Count -O--CK   100   100   000	-	138																   
193 Load_Cycle_Count		-O--C-   100   100   000	-	138																   
194 Temperature_Celsius	 -O----   176   176   000	-	34 (Min/Max 17/47)													 
196 Reallocated_Event_Count -O--CK   100   100   000	-	0																	 
197 Current_Pending_Sector  -O---K   100   100   000	-	0																	 
198 Offline_Uncorrectable   ---R--   100   100   000	-	0																	 
199 UDMA_CRC_Error_Count	-O-R--   200   200   000	-	0																	 
							||||||_ K auto-keep																					 
							|||||__ C event count																				   
							||||___ R error rate																				   
							|||____ S speed/performance																			 
							||_____ O updated online																			   
							|______ P prefailure warning																		   
																																   
General Purpose Log Directory Version 1																							 
SMART		   Log Directory Version 1 [multi-sector log support]																 
Address	Access  R/W   Size  Description																						 
0x00	   GPL,SL  R/O	  1  Log Directory																					   
0x01		   SL  R/O	  1  Summary SMART error log																			 
0x03	   GPL	 R/O	  1  Ext. Comprehensive SMART error log																   
0x04	   GPL	 R/O	  7  Device Statistics log		 
0x06		   SL  R/O	  1  SMART self-test log																				 
0x07	   GPL	 R/O	  1  Extended self-test log																			   
0x08	   GPL	 R/O	  2  Power Conditions log																				 
0x09		   SL  R/W	  1  Selective self-test log																			 
0x10	   GPL	 R/O	  1  NCQ Command Error log																			   
0x11	   GPL	 R/O	  1  SATA Phy Event Counters log																		 
0x20	   GPL	 R/O	  1  Streaming performance log [OBS-8]																   
0x21	   GPL	 R/O	  1  Write stream error log																			   
0x22	   GPL	 R/O	  1  Read stream error log																			   
0x80-0x9f  GPL,SL  R/W	 16  Host vendor specific log																			 
0xe0	   GPL,SL  R/W	  1  SCT Command/Status																				   
0xe1	   GPL,SL  R/W	  1  SCT Data Transfer																				   
																																   
SMART Extended Comprehensive Error Log Version: 1 (1 sectors)																	   
No Errors Logged																												   
																																   
SMART Extended Self-test Log Version: 1 (1 sectors)																				 
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error									 
# 1  Extended offline	Interrupted (host reset)	  10%	   560		 -													 
# 2  Short offline	   Completed without error	   00%	   554		 -													 
# 3  Extended offline	Completed without error	   00%	   538		 -													 
# 4  Short offline	   Completed without error	   00%	   530		 -													 
# 5  Extended offline	Completed without error	   00%	   512		 -													 
# 6  Short offline	   Completed without error	   00%	   506		 -	 
# 7  Extended offline	Completed without error	   00%	   239		 -													 
# 8  Short offline	   Completed without error	   00%	   233		 -													 
# 9  Extended offline	Completed without error	   00%		92		 -													 
#10  Short offline	   Completed without error	   00%		86		 -													 
#11  Short offline	   Completed without error	   00%		39		 -													 
																																   
SMART Selective self-test log data structure revision number 1																	 
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS																					   
	1		0		0  Not_testing																							   
	2		0		0  Not_testing																							   
	3		0		0  Not_testing																							   
	4		0		0  Not_testing																							   
	5		0		0  Not_testing																							   
Selective self-test flags (0x0):																								   
  After scanning selected spans, do NOT read-scan remainder of disk.															   
If Selective self-test is pending on power-up, resume after 0 minute delay.														 
																																   
SCT Status Version:				  3																							 
SCT Version (vendor specific):	   256 (0x0100)																				   
SCT Support Level:				   1																							 
Device State:						SMART Off-line Data Collection executing in background (4)									 
Current Temperature:					34 Celsius																				 
Power Cycle Min/Max Temperature:	 32/36 Celsius																				 
Lifetime	Min/Max Temperature:	 17/47 Celsius		 
Under/Over Temperature Limit Count:   0/0																						   
																																   
SCT Temperature History Version:	 2																							 
Temperature Sampling Period:		 1 minute																					   
Temperature Logging Interval:		1 minute																					   
Min/Max recommended Temperature:	  0/60 Celsius																				 
Min/Max Temperature Limit:		   -40/70 Celsius																				 
Temperature History Size (Index):	128 (55)																					   
																																   
Index	Estimated Time   Temperature Celsius																					   
  56	2018-03-28 14:14	34  ***************																					 
 ...	..( 67 skipped).	..  ***************																					 
 124	2018-03-28 15:22	34  ***************																					 
 125	2018-03-28 15:23	35  ****************																				   
 ...	..( 29 skipped).	..  ****************																				   
  27	2018-03-28 15:53	35  ****************																				   
  28	2018-03-28 15:54	34  ***************																					 
 ...	..(  4 skipped).	..  ***************																					 
  33	2018-03-28 15:59	34  ***************																					 
  34	2018-03-28 16:00	35  ****************																				   
 ...	..( 15 skipped).	..  ****************																				   
  50	2018-03-28 16:16	35  ****************																				   
  51	2018-03-28 16:17	34  ***************																					 
  52	2018-03-28 16:18	34  ***************	 
  53	2018-03-28 16:19	34  ***************																					 
  54	2018-03-28 16:20	35  ****************																				   
  55	2018-03-28 16:21	34  ***************																					 
																																   
SCT Error Recovery Control:																										 
		   Read: Disabled																										   
		  Write: Disabled																										   
																																   
Device Statistics (GP Log 0x04)																									 
Page  Offset Size		Value Flags Description																				   
0x01  =====  =			   =  ===  == General Statistics (rev 1) ==															   
0x01  0x008  4			  88  ---  Lifetime Power-On Resets																	   
0x01  0x010  4			 587  ---  Power-on Hours																				 
0x01  0x018  6	 77164668449  ---  Logical Sectors Written																	   
0x01  0x020  6	   255843169  ---  Number of Write Commands																	   
0x01  0x028  6	 66867981131  ---  Logical Sectors Read																		   
0x01  0x030  6	   200791094  ---  Number of Read Commands																	   
0x03  =====  =			   =  ===  == Rotating Media Statistics (rev 1) ==													   
0x03  0x008  4			 587  ---  Spindle Motor Power-on Hours																   
0x03  0x010  4			 586  ---  Head Flying Hours																			 
0x03  0x018  4			 138  ---  Head Load Events																			   
0x03  0x020  4			   0  ---  Number of Reallocated Logical Sectors														 
0x03  0x028  4			   9  ---  Read Recovery Attempts																		 
0x03  0x030  4			   6  ---  Number of Mechanical Start Failures														   
0x04  =====  =			   =  ===  == General Errors Statistics (rev 1) ==													   
0x04  0x008  4			   0  ---  Number of Reported Uncorrectable Errors													   
0x04  0x010  4			   0  ---  Resets Between Cmd Acceptance and Completion												   
0x05  =====  =			   =  ===  == Temperature Statistics (rev 1) ==														   
0x05  0x008  1			  34  ---  Current Temperature																		   
0x05  0x010  1			  33  N--  Average Short Term Temperature																 
0x05  0x018  1			  37  N--  Average Long Term Temperature																 
0x05  0x020  1			  47  ---  Highest Temperature																		   
0x05  0x028  1			  17  ---  Lowest Temperature																			 
0x05  0x030  1			  43  N--  Highest Average Short Term Temperature														 
0x05  0x038  1			  25  N--  Lowest Average Short Term Temperature														 
0x05  0x040  1			  42  N--  Highest Average Long Term Temperature														 
0x05  0x048  1			  25  N--  Lowest Average Long Term Temperature														   
0x05  0x050  4			   0  ---  Time in Over-Temperature																	   
0x05  0x058  1			  60  ---  Specified Maximum Operating Temperature													   
0x05  0x060  4			   0  ---  Time in Under-Temperature																	 
0x05  0x068  1			   0  ---  Specified Minimum Operating Temperature													   
0x06  =====  =			   =  ===  == Transport Statistics (rev 1) ==															 
0x06  0x008  4			 334  ---  Number of Hardware Resets																	 
0x06  0x010  4			 169  ---  Number of ASR Events																		   
0x06  0x018  4			   0  ---  Number of Interface CRC Errors																 
								|||_ C monitored condition met																	 
								||__ D supports DSN																				 
								|___ N normalized value																			 
																																   
Pending Defects log (GP Log 0x0c) not supported																					 
																																   
SATA Phy Event Counters (GP Log 0x11)																							   
ID	  Size	 Value  Description																								 
0x0001  2			0  Command failed due to ICRC error																		   
0x0002  2			0  R_ERR response for data FIS																				 
0x0003  2			0  R_ERR response for device-to-host data FIS																 
0x0004  2			0  R_ERR response for host-to-device data FIS																 
0x0005  2			0  R_ERR response for non-data FIS																			 
0x0006  2			0  R_ERR response for device-to-host non-data FIS															 
0x0007  2			0  R_ERR response for host-to-device non-data FIS															 
0x0009  2			8  Transition from drive PhyRdy to drive PhyNRdy															   
0x000a  2			9  Device-to-host register FISes sent due to a COMRESET													   
0x000b  2			0  CRC errors within host-to-device FIS																	   
0x000d  2			0  Non-CRC errors within host-to-device FIS

Touche · Apr 3, 2018

After several days of everything running fine, the issue returned, and this time it's /dev/da2, so I think we can rule out the drives themselves as the cause.

Code:

Apr  3 09:11:53 FreeNAS smartd[3016]: Device: /dev/da2 [SAT], failed to read SMART Attribute Data
Apr  3 09:11:53 FreeNAS	(da2:mpr0:0:4:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 1131 Aborting command 0xfffffe0000f289d0
Apr  3 09:11:53 FreeNAS mpr0: Sending reset from mprsas_send_abort for target ID 4
Apr  3 09:11:53 FreeNAS	(da2:mpr0:0:4:0): WRITE(10). CDB: 2a 00 12 41 11 a8 00 00 08 00 length 4096 SMID 1077 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
Apr  3 09:11:53 FreeNAS	(pass2:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d0 00 01 00 00 00 4f 00 c2 00 b0 00 length 512 SMID 894 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
Apr  3 09:11:53 FreeNAS mpr0: Unfreezing devq for target ID 4
Apr  3 09:11:53 FreeNAS (da2:mpr0:0:4:0): WRITE(10). CDB: 2a 00 12 41 11 a8 00 00 08 00
Apr  3 09:11:53 FreeNAS (da2:mpr0:0:4:0): CAM status: CCB request completed with an error
Apr  3 09:11:53 FreeNAS (da2:mpr0:0:4:0): Retrying command
Apr  3 09:11:53 FreeNAS (da2:mpr0:0:4:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Apr  3 09:11:53 FreeNAS (da2:mpr0:0:4:0): CAM status: Command timeout
Apr  3 09:11:53 FreeNAS (da2:mpr0:0:4:0): Retrying command
Apr  3 09:11:54 FreeNAS (da2:mpr0:0:4:0): WRITE(10). CDB: 2a 00 12 41 11 a8 00 00 08 00
Apr  3 09:11:54 FreeNAS (da2:mpr0:0:4:0): CAM status: SCSI Status Error
Apr  3 09:11:54 FreeNAS (da2:mpr0:0:4:0): SCSI status: Check Condition
Apr  3 09:11:54 FreeNAS (da2:mpr0:0:4:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Apr  3 09:11:54 FreeNAS (da2:mpr0:0:4:0): Retrying command (per sense data)

At least this time the pool didn't degrade.

Any ideas how to proceed?
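One way to check whether the errors follow a single disk or move between disks is to tally the relevant log lines per device. The following is only a minimal sketch: the function name `tally` and the sample text are illustrative, and the regular expressions assume the log format shown above (smartd messages plus CAM device tags such as `(da2:mpr0:0:4:0)`).

```python
import re
from collections import Counter

# smartd complaint, e.g. "Device: /dev/da2 [SAT], failed to read SMART Attribute Data"
SMARTD_FAIL = re.compile(r"Device: /dev/(\w+).*failed to read SMART Attribute Data")
# CAM device tag as printed by the mpr/mps drivers, e.g. "(da2:mpr0:0:4:0)"
CAM_TAG = re.compile(r"\((da\d+):(?:mpr|mps)\d+:\d+:\d+:\d+\)")


def tally(lines):
    """Count SMART read failures and CAM command timeouts per device."""
    smart_failures = Counter()
    timeouts = Counter()
    for line in lines:
        m = SMARTD_FAIL.search(line)
        if m:
            smart_failures[m.group(1)] += 1
            continue
        m = CAM_TAG.search(line)
        if m and "CAM status: Command timeout" in line:
            timeouts[m.group(1)] += 1
    return smart_failures, timeouts


# A few lines copied from the log above, used as sample input.
SAMPLE = """\
Apr  3 09:11:53 FreeNAS smartd[3016]: Device: /dev/da2 [SAT], failed to read SMART Attribute Data
Apr  3 09:11:53 FreeNAS (da2:mpr0:0:4:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Apr  3 09:11:53 FreeNAS (da2:mpr0:0:4:0): CAM status: Command timeout
Apr  3 09:11:54 FreeNAS (da2:mpr0:0:4:0): CAM status: SCSI Status Error
"""

if __name__ == "__main__":
    failures, timeouts = tally(SAMPLE.splitlines())
    print("SMART read failures:", dict(failures))
    print("Command timeouts:   ", dict(timeouts))
```

Fed with a longer stretch of /var/log/messages, counts spread across several daX devices would point away from any single disk.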

pro lamer · Apr 4, 2018

Ericloewe said:
If you're comfortable with the CLI, you can just offline the partition (da3p2 IIRC) and then use it as the replacement. This avoids having to generate new keys for all the disks.

Excuse the off-topic remark. For any newbie like me reading the quoted text: skipping the generation of new keys is an exception to the general rule, and this is an example of such an exception. More details about the general rule can be found in this thread, I guess.

Ericloewe · Apr 4, 2018

It's a very particular case, since the underlying disk is presumed to be good, so it doesn't need to be replaced and thus the replacement doesn't need to be encrypted, so the whole pool doesn't need to be re-keyed. It's unlikely to be the case (maybe if a cable crapped out...) and it requires a lot of care.

It's not something I recommend, unless it's as a learning experience with data that is not important.

Touche · Apr 5, 2018

Identical configuration except with WD Reds 4TB shows no sign of the error. It's not being actively used, just idling. Not sure if that makes a difference.

DGenerateKane · Apr 7, 2018

I've been dealing with this very issue on my Primary machine for months now, with no end in sight. I had 2 or 3 drives show read, write, and/or checksum errors at roughly the same time. After weeks of them reappearing even after multiple short and long tests with no errors, I finally decided to replace the HBA. My pool stayed online less than a week before I started having errors again, with different drives. At this point, I'd say almost every single drive has shown errors, if not faulted completely. So last weekend I finally decided to RMA the one drive that kept faulting. Two other drives started showing errors after I replaced the drive. Now, one drive has faulted with 110 write errors. I don't know what to do. I'm already worried Seagate is going to charge me $450 for this re-certified drive when I send in the drive it replaced and their tests show it's fine. Which they will, since apparently they only do SMART tests.

Sometimes I could go weeks without my pool degrading, other times mere hours. I'm at a complete loss. I have no idea what I need to replace at this point. I'm running a long test on the drive now but I guarantee it will show no errors when it finishes.

Edit: That didn't last long. The re-certified drive just faulted, and another drive is showing a lot of errors. So I have two faulted drives, which makes my server useless right now.

Bidule0hm · Apr 8, 2018

What is your hardware? (especially MB, CPU, RAM, PSU and HBA)

joeschmuck · Apr 8, 2018

Why are you using desktop hard drives in a server? Based on the way I read your previous postings you are placing the server in a true production environment. A desktop drive may not be able to cut the mustard as they say.

DGenerateKane · Apr 8, 2018

I have two machines per my sig. My previous postings were about my old machine, where the desktop drives worked just fine once I took care of the firmware bug they had. My current machine has only NAS drives. Frankly, the desktop drives were more reliable.

FreeNAS-11.1-RELEASE
1x SATADom Flash 64GB (Boot)
SupermicroX9DRH-iF
2x Intel Xeon E5-2660 V1 Octo (8) Core 2.2GHz
6 x 32GB - DDR3 - REG (192GB total)
2x 1280Watt Power Supply PWS-1K28P-SQ
8x Seagate IronWolf 10TB NAS Hard Drive ST10000VN0004 (RAIDZ2)
Supermicro SuperChassis 847E16-R1K28LPB

Oh, since I had to force a reboot on my machine, just one drive shows 1 checksum error. That won't stay that way forever though.

joeschmuck · Apr 8, 2018

So the machine in question is not a company machine but rather your personal machine. Sorry, I took a leap to a company machine when you said you put it into production.

DGenerateKane · Apr 8, 2018

Correct. As of a half hour ago, two drives that showed zero errors earlier today now each have six read errors, and hundreds of write errors, and are both listed as faulted.

DGenerateKane · Apr 13, 2018

Anybody have any ideas? I've got one disk faulted right now along with another showing errors. SMART tests I've checked always show as passed without error.

Breit · Apr 14, 2018

I'm experiencing exactly the same here. Good to know I'm not alone on this.

System is based on a Supermicro X10SDV-2C-7TP4F with an onboard LSI 2116 16-port SAS2 controller and FreeNAS 11.1 U4.
I recently added an LSI SAS3008 to the system to get more ports, and that's where things get ugly.
I keep getting disk dropouts on random disks connected to the SAS3 controller due to failed SMART data reads like the OP has. The log shows something like this:

Code:

		(da12:mpr0:0:0:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 798 Aborting command 0xfffffe00010dab20
mpr0: Sending reset from mprsas_send_abort for target ID 0
		(pass12:mpr0:0:0:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d0 00 01 00 00 00 4f 00 c2 00 b0 00 length 512 SMID 570 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
mpr0: Unfreezing devq for target ID 0
(da12:mpr0:0:0:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da12:mpr0:0:0:0): CAM status: Command timeout
(da12:mpr0:0:0:0): Retrying command
(da12:mpr0:0:0:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da12:mpr0:0:0:0): CAM status: SCSI Status Error
(da12:mpr0:0:0:0): SCSI status: Check Condition
(da12:mpr0:0:0:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da12:mpr0:0:0:0): Error 6, Retries exhausted
(da12:mpr0:0:0:0): Invalidating pack

I tested the dropped disks, they came out good, no errors whatsoever. Short and long SMART came out good. Resilvering the pool with the exact same disk also goes without errors. A few days later another disk on a different SAS3 port gets dropped. I'm suspecting the HBA for now. Can this be a firmware/driver thing?

Code:

mpr0: <Avago Technologies (LSI) SAS3008> port 0xd000-0xd0ff mem 0xfba40000-0xfba4ffff,0xfba00000-0xfba3ffff irq 40 at device 0.0 on pci7
mpr0: Firmware: 15.00.02.00, Driver: 18.03.00.00-fbsd
mpr0: IOCCapabilities: 7a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc,FastPath,RDPQArray>
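The terminated `ATA COMMAND PASS THROUGH(16)` line in the log earlier in this post carries the raw CDB, and decoding it shows which ATA command was in flight when the reset hit. Below is a minimal sketch, assuming the standard SAT layout of the 16-byte pass-through CDB; the function name and the small lookup tables are illustrative and cover only a few values.

```python
# ATA protocol field values used by SAT (subset).
PROTOCOLS = {3: "Non-data", 4: "PIO Data-In", 5: "PIO Data-Out", 6: "DMA"}
# Feature register values for the ATA SMART command 0xB0 (subset).
SMART_FEATURES = {
    0xD0: "SMART READ DATA",
    0xD5: "SMART READ LOG",
    0xD8: "SMART ENABLE OPERATIONS",
    0xDA: "SMART RETURN STATUS",
}


def decode_ata_pass_through_16(cdb_hex):
    """Decode the fields of an ATA PASS-THROUGH (16) CDB given as hex text."""
    cdb = bytes.fromhex(cdb_hex)  # whitespace between bytes is accepted
    if len(cdb) != 16 or cdb[0] != 0x85:
        raise ValueError("not an ATA PASS-THROUGH (16) CDB")
    protocol = (cdb[1] >> 1) & 0x0F      # byte 1, bits 4:1
    from_device = bool(cdb[2] & 0x08)    # byte 2, T_DIR bit
    feature = cdb[4]                     # FEATURES (7:0)
    lba_mid, lba_high = cdb[10], cdb[12]  # LBA (15:8) and LBA (23:16)
    command = cdb[14]
    return {
        "protocol": PROTOCOLS.get(protocol, f"unknown ({protocol})"),
        "from_device": from_device,
        "feature": feature,
        "sector_count": cdb[6],
        "command": command,
        # SMART commands carry the signature 0x4F / 0xC2 in LBA mid / high.
        "smart_signature": (lba_mid, lba_high) == (0x4F, 0xC2),
        "feature_name": SMART_FEATURES.get(feature, "unknown")
        if command == 0xB0 else "n/a",
    }


if __name__ == "__main__":
    info = decode_ata_pass_through_16(
        "85 08 0e 00 d0 00 01 00 00 00 4f 00 c2 00 b0 00")
    for key, value in info.items():
        print(f"{key:16} {value}")
```

For the CDB from the log, the command byte is 0xB0 (SMART) with feature 0xD0, which is SMART READ DATA. That matches the smartd message "failed to read SMART Attribute Data" in the thread title: the command that gets terminated is the SMART poll itself, read from the device in PIO mode.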

Breit · Apr 14, 2018

I investigated a bit further and noticed that while I was probing a disk on the SAS3 controller with smartctl -a, I got a reading with 65535 Raw_Read_Error_Rate. After a shock-second, I quickly typed that same command again and got the usual 0 on Raw_Read_Error_Rate for that exact same disk. So again a possible false reading.
The disks in question are old HGST 5K3000 3TB disks, but as said their SMART readings show no sign of degradation.

Additionally, while probing all my disks with smartctl -a, I noticed that the HGST 5K3000 drives (no matter which controller they are connected to) are a lot slower to return their smartctl output than my other drives. For instance, my WD RED 8TB drives produce the smartctl report almost instantly, while the HGST drives may take a few seconds. Maybe the LSI SAS3008 controller is just a bit impatient when waiting for the result of a SMART check? I don't know, just a guess...
I'll redo cabling and hook all HGST drives to the LSI SAS2116 and move the WD drives to the SAS3008 and see what happens.
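The "slow smartctl" observation can be put into numbers by timing `smartctl -a` per device. This is a rough sketch only: `time_command` and `time_smartctl` are made-up helper names, the 60-second timeout is an arbitrary assumption, and smartctl generally needs to be run as root.

```python
import subprocess
import sys
import time


def time_command(argv, timeout_s=60.0):
    """Run a command and return (elapsed seconds, exit code or None on timeout)."""
    start = time.monotonic()
    try:
        proc = subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
        code = proc.returncode
    except subprocess.TimeoutExpired:
        code = None
    return time.monotonic() - start, code


def time_smartctl(devices, runs=3):
    """Time `smartctl -a` several times per device; returns {device: [seconds, ...]}."""
    results = {}
    for dev in devices:
        results[dev] = [time_command(["smartctl", "-a", dev])[0] for _ in range(runs)]
    return results


if __name__ == "__main__":
    # Example: python3 smart_timing.py /dev/da0 /dev/da12
    for dev, samples in time_smartctl(sys.argv[1:]).items():
        print(f"{dev}: min {min(samples):.2f}s  max {max(samples):.2f}s")
```

Comparing the numbers for the same disk on the SAS2116 and on the SAS3008 would show whether the delay comes from the drive or from the controller path.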

Touche · Apr 20, 2018

DGenerateKane said:
I've been dealing with this very issue on my Primary machine for months now, with no end in sight. I had 2 or 3 drives show either read, write, and/or checksum errors at roughly the same time. After weeks of them reappearing even after multiple short and long tests with no errors, I finally decided to replace the HBA. My pool stayed online less than a week before I started having errors again, with different drives.

What HBAs were they?

Breit said:
I investigated a bit further and noticed that while I was probing a disk on the SAS3 controller with smartctl -a, I got a reading with 65535 Raw_Read_Error_Rate. After a shock-second, I quickly typed that same command again and got the usual 0 on Raw_Read_Error_Rate for that exact same disk. So again a possible false reading.
The disks in question are old HGST 5K3000 3TB disks, but as said their SMART readings show no sign of degradation.

Additionally, while probing all my disks with smartctl -a, I noticed that the HGST 5K3000 drives (no matter which controller they are connected to) are a lot slower to return their smartctl output than my other drives. For instance, my WD RED 8TB drives produce the smartctl report almost instantly, while the HGST drives may take a few seconds. Maybe the LSI SAS3008 controller is just a bit impatient when waiting for the result of a SMART check? I don't know, just a guess...
I'll redo cabling and hook all HGST drives to the LSI SAS2116 and move the WD drives to the SAS3008 and see what happens.

Hmm, the Toshibas that are giving me problems are rebranded Hitachi drives, while WD Reds in an identical build are working without problems.

I've reported the issue https://redmine.ixsystems.com/issues/31398 but, unfortunately, it doesn't seem like we'll get a fix from the FreeNAS side. This seems to be a long running issue which makes me sceptical it will be resolved any time soon, if ever. From my searching, it seems to be affecting multiple drives and LSI controllers. I was thinking of downgrading my 3008 FW to 14.x or even older as a desperate attempt of finding a solution. For now I've moved 5 of my 6 drives to my remaining onboard Intel SATA. This way I hope to have only the one drive left on the LSI fail. If it turns out that everything works on Intel controller, I'll try downgrading the LSI FW and going on from there, but I'm afraid this is something that LSI and FreeBSD need to resolve with either FW or driver updates.

Breit · Apr 23, 2018

I was hammering the controller with data transfers on different kinds of disks for the last few days, and it seems the LSI SAS3008 just doesn't like the HGST 5K3000. All other drives were fine and gave no errors whatsoever. I guess the controller is just a bit picky about the drives connected to it. It's probably the same for the Toshibas mentioned above.

DGenerateKane · Apr 26, 2018

Touche said:
What HBAs were they?

Hmm, the Toshibas that are giving me problems are rebranded Hitachi drives, while WD Reds in an identical build are working without problems.

I've reported the issue https://redmine.ixsystems.com/issues/31398 but, unfortunately, it doesn't seem like we'll get a fix from the FreeNAS side. This seems to be a long running issue which makes me sceptical it will be resolved any time soon, if ever. From my searching, it seems to be affecting multiple drives and LSI controllers. I was thinking of downgrading my 3008 FW to 14.x or even older as a desperate attempt of finding a solution. For now I've moved 5 of my 6 drives to my remaining onboard Intel SATA. This way I hope to have only the one drive left on the LSI fail. If it turns out that everything works on Intel controller, I'll try downgrading the LSI FW and going on from there, but I'm afraid this is something that LSI and FreeBSD need to resolve with either FW or driver updates.

Sorry, I don't know why I don't have the HBA listed with the rest of my hardware in my sig. It came with an LSI 9211-8i, I replaced it with an LSI 9240-8i, which was a bitch and a half to remove the IBM firmware it came with so I could flash the correct one. So do you think my issue is the firmware for the HBA? I flashed the latest, P20. I wish I had the option to move them to onboard ports, but that isn't possible in this chassis, since the drives are all connected to a backplane. At this point, I guess I should cancel the RMA on the drive. I don't know what to do to fix this problem though. I just keep rebooting my server every few days when my server isn't responding properly. It's actually causing more problems. Today I started getting spammed every 5 minutes about my UPS not having a connection. While trying to diagnose that problem I saw two drives had faulted. After reboot, my server now has a connection to the UPS again. It's all very frustrating.

Chris Moore · Apr 26, 2018

Touche said:
Toshiba DT01ACA3 drives

Here is the thing. I used 6 of the very similar Toshiba DT01ACA200 drives to replace the drives in one of the vdevs in one of my storage pools and after less than 6 months 3 of them had failed. These drives are desktop computer drives and they are NOT suitable to use in a server.
They are not rated for the service and the drives that are showing as failed are really failed. Buy actual NAS drives.

Chris Moore · Apr 26, 2018

DGenerateKane said:
I have two machines per my sig. My previous postings were about my old machine, where the desktop drives worked just fine once I took care of the firmware bug they had. My current machine has only NAS drives. Frankly, the desktop drives were more reliable.

FreeNAS-11.1-RELEASE
1x SATADom Flash 64GB (Boot)
SupermicroX9DRH-iF
2x Intel Xeon E5-2660 V1 Octo (8) Core 2.2GHz
6 x 32GB - DDR3 - REG (192GB total)
2x 1280Watt Power Supply PWS-1K28P-SQ
8x Seagate IronWolf 10TB NAS Hard Drive ST10000VN0004 (RAIDZ2)
Supermicro SuperChassis 847E16-R1K28LPB

Oh, since I had to force a reboot on my machine, just one drive shows 1 checksum error. That won't stay that way forever though.

You are hijacking the thread. Why don't you post your own?

Chris Moore · Apr 26, 2018

Touche said:
Hmm, the Toshibas that are giving me problems are rebranded Hitachi drives

No. Totally different company and even when it was the same company it was a different design at a different plant.
