Hard Drive Troubleshooting - Massive Failures - Need Help Isolating the Problem(s)

rs225 · Oct 11, 2017

OK, let the resilver of
gptid/8275e396-a83c-11e7-9cee-002590f5b804 and da0p2 finish. Don't worry about doing any SMART tests right now. You need to identify exactly which drive is da2. da2 needs to be replaced next. You can use a 4TB or larger drive to replace da2, which is 8TB. When you are ready, shut down the system, remove da2, replace with a newer disk, and start up. Use the GUI to command the replace and then wait until the resilver is complete.

Are all the primary pool drives removed?

Inxsible · Oct 11, 2017

Stux said:
Generally agree, except I think you're running smart/longs too often :)

Agreed, but pretty useful -- at least for this thread.

I might increase the interval to 5 or 7 days for longs then.

arameen · Oct 11, 2017

Ok, not sure where and what to look for.
I remember reading here on the forums that WD Green drives have some kind of issue that makes them unsuitable as NAS drives, but that it could be fixed with an update. I did that on one drive I had in the pool before, and it worked. That was a while ago.
All of my drives now are NAS drives; some are older NAS models and some are the newer generation called IronWolf. It's hard to think that they are going into sleep mode. If some drives did, wouldn't that be a malfunction? And shouldn't that show up in a more dramatic way than interrupted long tests?!
FreeNAS wouldn't put drives to sleep, would it? I don't have my own scripts or advanced commands; I mostly use the GUI and use FreeNAS as a storage container.

Question is, what else could interrupt a long test on a specific drive?

Stux · Oct 11, 2017

So, HBA FW version seems fine.

Onlining an offline drive seems to have helped. If the unavail drives still exist, reinstalling them could help too.

You seem to have hw issues.

If you shut down (resilvers resume) and disconnect your OTHER pool (power and SATA), then connect this pool to your motherboard ports, you might find everything starts to work.

arameen · Oct 11, 2017

rs225 said:
OK, let the resilver of
gptid/8275e396-a83c-11e7-9cee-002590f5b804 and da0p2 finish. Don't worry about doing any SMART tests right now. You need to identify exactly which drive is da2. da2 needs to be replaced next. You can use a 4TB or larger drive to replace da2, which is 8TB. When you are ready, shut down the system, remove da2, replace with a newer disk, and start up. Use the GUI to command the replace and then wait until the resilver is complete.

Are all the primary pool drives removed?

Actually it has been resilvering several times since last week, the data is already backed up.
So for now I don't care about the pool itself so much; if it gets healthy again then great, if not then OK. But what is important is to figure out what is wrong, so I can replace the faulty device once I start with a new healthy pool. I don't want to have these problems again and need to figure out what is causing all this :)

rs225 · Oct 11, 2017

Controller resets can interrupt long tests.

The resilvers should finish and not repeat this time. This time, you are not getting checksum errors everywhere.

Stux · Oct 11, 2017

Basically, when an HD 'fails' you need to work out if the hd failed, or something else.

And a good part of that is to burn in replacements.

arameen · Oct 11, 2017

Stux said:
So, HBA FW version seems fine.

Onlining an offline drive seems to have helped. If the unavail drives still exist, reinstalling them could help too.

You seem to have hw issues.

If you shut down (resilvers resume) and disconnect your OTHER pool (power and SATA), then connect this pool to your motherboard ports, you might find everything starts to work.

That is what I have been doing since last week. I disconnected the other pool to free some ports for this troubled pool, and the pool came online. Even though it was in a pretty bad state, it came online.
I don't know if we can conclude it's the HBA. It could be that the dying drive came alive for a few more days; some drives come back to life for a short time before dying for good. Or it was the HBA.
One test might be to create a new pool connected to the motherboard, fill it with some test data, then move it and connect it to the HBA again, and then:
do long tests on the drives?
or test something else to see if a newly created pool develops issues once it goes through the HBA?

Stux · Oct 11, 2017

It could be the HBA overheating. Some people point fans at the heatsinks to combat this.

arameen · Oct 11, 2017

Stux said:
Basically, when an HD 'fails' you need to work out if the hd failed, or something else.

And a good part of that is to burn in replacements.

Yes, I will do that too, but I think it's better to first figure out what is wrong. No need to burn-in test drives now?

rs225 · Oct 11, 2017

Stux said:
It could be the HBA overheating. Some people point fans at the heatsinks to combat this.

It could, but I see interrupted long tests on both ada and da drives. Back to power supply, or is it something like system sleep?

edit: I guess that isn't valid anymore, since drives have migrated ports recently.

arameen · Oct 11, 2017

rs225 said:
Controller resets can interrupt long tests.

The resilvers should finish and not repeat this time. This time, you are not getting checksum errors everywhere.

Well the difference this time is that the pool is connected directly to the motherboard, not through the HBA at all.

Why would a controller reset? The connection now goes to the motherboard, so it should work.

arameen · Oct 11, 2017

Stux said:
It could be the HBA overheating. Some people point fans at the heatsinks to combat this.

I don't think that is likely in this case. The ventilation inside is very good, the case is big, and I have many Noctua fans installed :)
I recently added 2 more side fans, and they happen to blow directly on the HBA and motherboard.

arameen · Oct 11, 2017

rs225 said:
It could, but I see interrupted long tests on both ada and da drives. Back to power supply, or is it something like system sleep?

edit: I guess that isn't valid anymore, since drives have migrated ports recently.

No, those last SMART results I posted, some of which were interrupted, are from the new SATA ports on the motherboard. So they got interrupted despite the HBA not being used.

rs225 · Oct 11, 2017

My suspicion has been the power supply or an HBA issue. What is the power supply? Are the drives all daisy-chained off a small number of rails into the power supply, or are there plenty of rails coming out of the power supply?

rs225 · Oct 11, 2017

See if you can run a short SMART test on da2 (after the resilvers finish). I am interested that it claims it is FAILING_NOW due to spin retry count. Power under voltage at spinup?
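As a reference for reading that flag: as far as I understand smartctl's WHEN_FAILED column, an attribute shows FAILING_NOW when its normalized VALUE is at or below THRESH, and In_the_past when only WORST has ever been at or below THRESH. A minimal sketch of that rule, using the Spin_Retry_Count row (090 / 090 / 097) from the da2 output posted in this thread; the function and class names are hypothetical, not part of smartmontools:

```python
from typing import NamedTuple


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    value: int   # normalized current value
    worst: int   # lowest normalized value seen so far
    thresh: int  # vendor failure threshold


def parse_attr_line(line: str) -> SmartAttr:
    """Parse one row of the smartctl attribute table (whitespace separated)."""
    fields = line.split()
    # Columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    return SmartAttr(int(fields[0]), fields[1],
                     int(fields[3]), int(fields[4]), int(fields[5]))


def failure_state(attr: SmartAttr) -> str:
    """Assumed rule: compare normalized values against the threshold."""
    if attr.thresh == 0:
        return "-"            # a zero threshold is treated as "never fails"
    if attr.value <= attr.thresh:
        return "FAILING_NOW"
    if attr.worst <= attr.thresh:
        return "In_the_past"
    return "-"


row = "10 Spin_Retry_Count 0x0013 090 090 097 Pre-fail Always FAILING_NOW 0"
print(failure_state(parse_attr_line(row)))
```

Note that the raw value in that row is 0, so the flag comes purely from the normalized value (90) sitting below the threshold (97), not from a count of actual retries.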

arameen · Oct 11, 2017

rs225 said:
My suspicion has been the power supply or an HBA issue. What is the power supply? Are the drives all daisy-chained off a small number of rails into the power supply, or are there plenty of rails coming out of the power supply?

BE Quiet! Dark Power PRO BQT P10-750W
So it's not a cheap PSU, and the power should be more than enough for all drives even during spin-up.
Sure, the PSU may be starting to malfunction, and to be sure about that some electrical measurements would be needed. That takes skills and equipment that neither I nor anyone I know has.
I also connected some power extenders to the PSU power cables; that's needed when there are 15 drives that need power :)
https://www.google.se/search?q=sata...enWAhUCKcAKHeYuBNUQ_AUICigB&biw=2144&bih=1023

I can come back tomorrow with a detailed map of the power connections to the drives if that helps?
SMART short test on da2

Code:

=== START OF INFORMATION SECTION ===																								
Device Model:	 ST8000VN0022-2EL112																							  
Serial Number:	ZA161YG1																										
LU WWN Device Id: 5 000c50 0a1ca94f6																								
Firmware Version: SC61																											
User Capacity:	8,001,563,222,016 bytes [8.00 TB]																				
Sector Sizes:	 512 bytes logical, 4096 bytes physical																			
Rotation Rate:	7200 rpm																										
Form Factor:	  3.5 inches																										
Device is:		Not in smartctl database [for details use: -P showall]															
ATA Version is:   ACS-3 T13/2161-D revision 5																					  
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)																			
Local Time is:	Thu Oct 12 00:27:51 2017 CEST																					
SMART support is: Available - device has SMART capability.																		
SMART support is: Enabled																										  
																																	
=== START OF READ SMART DATA SECTION ===																							
SMART overall-health self-assessment test result: FAILED!																		  
Drive failure expected in less than 24 hours. SAVE ALL DATA.																		
See vendor-specific Attribute list for failed Attributes.																		  
																																	
General SMART Values:																											  
Offline data collection status:  (0x82) Offline data collection activity															
										was completed without error.																
										Auto Offline Data Collection: Enabled.													
Self-test execution status:	  (  73) The previous self-test completed having													
										a test element that failed and the test													
										element that failed is not known.														  
Total time to complete Offline																									
data collection:				(  567) seconds.																					
Offline data collection																											
capabilities:					(0x7b) SMART execute Offline immediate.															
										Auto Offline data collection on/off support.												
										Suspend Offline collection upon new														
										command.																					
										Offline surface scan supported.															
										Self-test supported.																		
										Conveyance Self-test supported.															
										Selective Self-test supported.															
SMART capabilities:			(0x0003) Saves SMART data before entering															
										power-saving mode.																		
										Supports SMART auto save timer.															
Error logging capability:		(0x01) Error logging supported.																	
										General Purpose Logging supported.														
Short self-test routine																											
recommended polling time:		(   1) minutes.									
Extended self-test routine																										
recommended polling time:		( 718) minutes.																					
Conveyance self-test routine																										
recommended polling time:		(   2) minutes.																					
SCT capabilities:			  (0x50bd) SCT Status supported.																	  
										SCT Error Recovery Control supported.													  
										SCT Feature Control supported.															
										SCT Data Table supported.																  
																																	
SMART Attributes Data Structure revision number: 10																				
Vendor Specific SMART Attributes with Thresholds:																				  
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE									
  1 Raw_Read_Error_Rate	 0x000f   083   064   044	Pre-fail  Always	   -	   199744256									
  3 Spin_Up_Time			0x0003   085   085   000	Pre-fail  Always	   -	   0											
  4 Start_Stop_Count		0x0032   100   100   020	Old_age   Always	   -	   30										  
  5 Reallocated_Sector_Ct   0x0033   100   100   010	Pre-fail  Always	   -	   0											
  7 Seek_Error_Rate		 0x000f   082   060   045	Pre-fail  Always	   -	   159200373									
  9 Power_On_Hours		  0x0032   100   100   000	Old_age   Always	   -	   823 (193 111 0)							
10 Spin_Retry_Count		0x0013   090   090   097	Pre-fail  Always   FAILING_NOW 0											
12 Power_Cycle_Count	   0x0032   100   100   020	Old_age   Always	   -	   31										  
184 End-to-End_Error		0x0032   100   100   099	Old_age   Always	   -	   0											
187 Reported_Uncorrect	  0x0032   100   100   000	Old_age   Always	   -	   0											
188 Command_Timeout		 0x0032   100   100   000	Old_age   Always	   -	   0											
189 High_Fly_Writes		 0x003a   100   100   000	Old_age   Always	   -	   0											
190 Airflow_Temperature_Cel 0x0022   068   057   040	Old_age   Always	   -	   32 (Min/Max 26/38)						  
191 G-Sense_Error_Rate	  0x0032   100   100   000	Old_age   Always	   -	   212										
192 Power-Off_Retract_Count 0x0032   100   100   000	Old_age   Always	   -	   23										  
193 Load_Cycle_Count		0x0032   100   100   000	Old_age   Always	   -	   1187										
194 Temperature_Celsius	 0x0022   032   043   000	Old_age   Always	   -	   32 (0 26 0 0 0)							
195 Hardware_ECC_Recovered  0x001a   005   001   000	Old_age   Always	   -	   199744256									
197 Current_Pending_Sector  0x0012   100   100   000	Old_age   Always	   -	   0											
198 Offline_Uncorrectable   0x0010   100   100   000	Old_age   Offline	  -	   0											
199 UDMA_CRC_Error_Count	0x003e   200   200   000	Old_age   Always	   -	   1											
240 Head_Flying_Hours	   0x0000   100   253   000	Old_age   Offline	  -	   806 (0 2 0)								
241 Total_LBAs_Written	  0x0000   100   253   000	Old_age   Offline	  -	   7294625740								  
242 Total_LBAs_Read		 0x0000   100   253   000	Old_age   Offline	  -	   110345043888								
																																	
SMART Error Log Version: 1																										
No Errors Logged																													
																																	
SMART Self-test log structure revision number 1																					
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error									
# 1  Short offline	   Completed: unknown failure	90%	   823		 0													
# 2  Extended offline	Completed: unknown failure	90%	   799		 0													
# 3  Extended offline	Completed without error	   00%	   385		 -													
# 4  Short offline	   Completed without error	   00%	   372		 -													
																																	
SMART Selective self-test log data structure revision number 1																	
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS																						
	1		0		0  Not_testing																								
	2		0		0  Not_testing																								
	3		0		0  Not_testing																								
	4		0		0  Not_testing																								
	5		0		0  Not_testing																								
Selective self-test flags (0x0):																									
  After scanning selected spans, do NOT read-scan remainder of disk.																
If Selective self-test is pending on power-up, resume after 0 minute delay.

I will also run long tests again on the drives whose tests were interrupted, and post the results later or report if they were interrupted again.

arameen · Oct 11, 2017

I just started long tests again on the drives whose tests were somehow interrupted:
ada0
ada1
ada5
da1
Will post results tomorrow :)

rs225 · Oct 11, 2017

Since da2 can't complete a SMART test, definitely plan to replace it next, as I outlined in #61.

Check your power lines, and if you can balance the load any better, do so. I do think the power supply is undersized for the system plus two pools. Although the wattage may seem fine mathematically, each individual rail coming out of the PSU can only provide a certain portion of that power, not the total. So there could be overload on the rails. The 8TB which complains about spin-up could be an indicator, as the larger drives usually have high spin-up power requirements. But if the drive were to start working in a different power situation, it might actually be okay.
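To make the per-rail point concrete, here is a rough sketch of the arithmetic. Every number in it is a placeholder assumption (2.0 A per drive on 12 V at spin-up, a 20 A limit per rail, and the way 15 drives are split across two rails); check the drive datasheets and the label on the PSU for real figures.

```python
from typing import Dict, List


def rail_loads(drives_per_rail: Dict[str, int],
               spinup_amps_per_drive: float) -> Dict[str, float]:
    """Worst-case 12 V current per rail if every drive spins up at once."""
    return {rail: count * spinup_amps_per_drive
            for rail, count in drives_per_rail.items()}


def overloaded_rails(drives_per_rail: Dict[str, int],
                     spinup_amps_per_drive: float,
                     rail_limit_amps: float) -> List[str]:
    """Names of rails whose worst-case load exceeds the per-rail limit."""
    loads = rail_loads(drives_per_rail, spinup_amps_per_drive)
    return sorted(rail for rail, amps in loads.items()
                  if amps > rail_limit_amps)


# Hypothetical layouts for 15 drives: daisy-chained vs. balanced.
chained = {"rail_1": 11, "rail_2": 4}
balanced = {"rail_1": 8, "rail_2": 7}

print(overloaded_rails(chained, 2.0, 20.0))
print(overloaded_rails(balanced, 2.0, 20.0))
```

With these made-up numbers the total draw is the same in both layouts (30 A), but only the daisy-chained layout pushes one rail past its limit, which is why balancing the load can matter even when the total wattage looks fine.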

Long tests will probably not complete tomorrow, you've got resilvers going that take two days. :)

joeschmuck · Oct 11, 2017

Whoa, this is one long thread for a very basic problem. I'm glad to see a lot of people offering assistance.

@arameen Please take the advice of the group who have posted that you should be doing some more reading up on how FreeNAS works and how to maintain it. Unfortunately there is no single guide covering each and every step you must take to configure your FreeNAS system properly and make it a virtually hands-free system. FreeNAS was never originally intended to be used by the typical novice, and it requires the user to do a lot of investigating to make the system work for them. Thankfully our forum members are here to help us all out when we are in need. Also, I know that some of the postings here read as if the posters are getting a little frustrated, both them and you. I can relate, to be honest; I think we have all been on both sides of a problem. I am glad to see it looks like you are getting your problematic pool recovered.

If you do not know how to set up routine SMART tests then read the User Guide; it has a section on it. Once you have your pool issues solved, I'd establish those settings as follows: SMART Short Test = once a day, Sunday through Friday; SMART Long Test = once on Saturday. Do not run a long test on Sunday, because the automatic scrub runs on Sunday and the hard drive would then be fighting for time between the two. And as previously stated, ensure your email notification is set up properly, because you will get an email when a drive experiences a problem that you should take care of.
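The schedule above can be written out as a small lookup, just to make the "no long test on scrub day" rule explicit. This is only an illustration with a hypothetical function name; in FreeNAS the schedule is configured in the GUI, not with code like this.

```python
import datetime
from typing import List


def maintenance_for(day: datetime.date) -> List[str]:
    """Tasks for a given date: long test Saturday, short test otherwise,
    with the automatic scrub on Sunday."""
    weekday = day.weekday()  # Monday = 0 ... Saturday = 5, Sunday = 6
    if weekday == 5:
        return ["SMART long test"]
    tasks = ["SMART short test"]
    if weekday == 6:
        tasks.append("scrub")
    return tasks


# Saturday, Sunday, Monday in the week of this thread
for d in (datetime.date(2017, 10, 14),
          datetime.date(2017, 10, 15),
          datetime.date(2017, 10, 16)):
    print(d.isoformat(), maintenance_for(d))
```

The long test and the scrub never land on the same day, so the drives are not asked to do two heavy jobs at once.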

Other things to look at while you are troubleshooting: ensure all the fans in your computer are spinning, including the one in your power supply.

You have several drives which have UDMA errors (ID 199). These are caused by communications errors, typically a bad cable or controller. Unfortunately these values never reset back to zero so my advice is to just print out the SMART data and keep it on hand. If you notice the UDMA CRC errors increase then you have a problem with a cable or controller. While the problem "could" be in the hard drive itself, it's very unlikely.
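If you would rather keep the baseline in a file than on paper, a comparison along these lines would do. It is a sketch only: the function name is hypothetical, you would fill the dictionaries by hand from the RAW_VALUE column of attribute 199 for each drive, and apart from da2's current value of 1 (from the output posted above) the numbers are invented for illustration.

```python
from typing import Dict


def crc_increases(baseline: Dict[str, int],
                  current: Dict[str, int]) -> Dict[str, int]:
    """Return {drive: increase} for drives whose UDMA_CRC_Error_Count grew."""
    return {drive: current[drive] - baseline[drive]
            for drive in current
            if drive in baseline and current[drive] > baseline[drive]}


# Attribute 199 raw values: an earlier snapshot vs. a later one.
baseline = {"da2": 1, "ada0": 0, "ada1": 3}   # ada0/ada1 values are made up
later = {"da2": 1, "ada0": 0, "ada1": 7}      # hypothetical later readings

print(crc_increases(baseline, later))
```

A drive that does not show up in the result has a stable count, which is the normal case; a count that keeps climbing points at the cable or controller for that drive.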

So let's see if I understand your present hardware configuration...
1) The "Main" pool has been disconnected from the motherboard SATA connectors.
Q1) Are the drives still plugged in to power or are they disconnected?
2) The pool having issues is now disconnected from the HBA card and plugged directly into the motherboard SATA connectors.
Q2) Is the HBA still plugged into the motherboard?

If you are not using the HBA and your pool is fine once you have completed all the testing, then you can check whether the HBA is the cause: power down your system, move the drives back to the HBA SATA ports, power back up, test out your system, run a scrub, and then check the SMART data for any changes. Obviously you can't rely on just a single data point; you need to give the system some time to fail or pass, but this is a good quick check. If you have new failures, then the HBA is highly suspect. You can rule out the power supply too by connecting the other (main) pool's hard drives to power again, assuming they were disconnected. If they have been connected this entire time, then I doubt the power supply is at fault.

My last piece of advice is to take your steps slowly; do not rush into anything. While you can check SMART data at any time, you shouldn't be resilvering one drive while running a SMART long test on another drive at the same time. It will slow down both the resilver and the SMART test, and the drive under test will hate you. I'm not telling you to stop any test you have running, but it will take a lot longer during a resilver or scrub operation.

Stick with it and best of luck to you,
-Joe
