SMART test keeps going

Status
Not open for further replies.

MikeyG

Patron
Joined
Dec 8, 2017
Messages
442
One of my drives has been doing an extended test for at least 30 hours or so. It seems to be stuck at 10%. No other errors on any other drives. Pool activity is low overall, especially overnight. Should I be worried? Let it go? Do something about it?

Code:
root@nas:~ # smartctl -a /dev/da4
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:	 ST8000NM0055-1RM112
Serial Number:	ZA18QEYT
LU WWN Device Id: 5 000c50 0a43feff0
Firmware Version: SN02
User Capacity:	8,001,563,222,016 bytes [8.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	7200 rpm
Form Factor:	  3.5 inches
Device is:		Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Fri May 18 00:07:29 2018 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
										was completed without error.
										Auto Offline Data Collection: Enabled.
Self-test execution status:	  ( 249) Self-test routine in progress...
										90% of test remaining.
Total time to complete Offline
data collection:				(  575) seconds.
Offline data collection
capabilities:					(0x7b) SMART execute Offline immediate.
										Auto Offline data collection on/off support.
										Suspend Offline collection upon new
										command.
										Offline surface scan supported.
										Self-test supported.
										Conveyance Self-test supported.
										Selective Self-test supported.
SMART capabilities:			(0x0003) Saves SMART data before entering
										power-saving mode.
										Supports SMART auto save timer.
Error logging capability:		(0x01) Error logging supported.
										General Purpose Logging supported.
Short self-test routine
recommended polling time:		(   1) minutes.
Extended self-test routine
recommended polling time:		( 753) minutes.
Conveyance self-test routine
recommended polling time:		(   2) minutes.
SCT capabilities:			  (0x70bd) SCT Status supported.
										SCT Error Recovery Control supported.
										SCT Feature Control supported.
										SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x000f   077   064   044	Pre-fail  Always	   -	   53788952
  3 Spin_Up_Time			0x0003   089   087   000	Pre-fail  Always	   -	   0
  4 Start_Stop_Count		0x0032   100   100   020	Old_age   Always	   -	   56
  5 Reallocated_Sector_Ct   0x0033   100   100   010	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x000f   076   060   045	Pre-fail  Always	   -	   38154757
  9 Power_On_Hours		  0x0032   100   100   000	Old_age   Always	   -	   341 (24 199 0)
 10 Spin_Retry_Count		0x0013   100   100   097	Pre-fail  Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   020	Old_age   Always	   -	   56
184 End-to-End_Error		0x0032   100   100   099	Old_age   Always	   -	   0
187 Reported_Uncorrect	  0x0032   100   100   000	Old_age   Always	   -	   0
188 Command_Timeout		 0x0032   100   099   000	Old_age   Always	   -	   4295098370
189 High_Fly_Writes		 0x003a   100   100   000	Old_age   Always	   -	   0
190 Airflow_Temperature_Cel 0x0022   050   043   040	Old_age   Always	   -	   50 (Min/Max 47/57)
191 G-Sense_Error_Rate	  0x0032   100   100   000	Old_age   Always	   -	   621
192 Power-Off_Retract_Count 0x0032   100   100   000	Old_age   Always	   -	   43
193 Load_Cycle_Count		0x0032   100   100   000	Old_age   Always	   -	   117
194 Temperature_Celsius	 0x0022   050   057   000	Old_age   Always	   -	   50 (0 14 0 0 0)
195 Hardware_ECC_Recovered  0x001a   077   064   000	Old_age   Always	   -	   53788952
197 Current_Pending_Sector  0x0012   100   100   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0010   100   100   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x003e   200   200   000	Old_age   Always	   -	   0
240 Head_Flying_Hours	   0x0000   100   253   000	Old_age   Offline	  -	   333 (97 137 0)
241 Total_LBAs_Written	  0x0000   100   253   000	Old_age   Offline	  -	   14308148320
242 Total_LBAs_Read		 0x0000   100   253   000	Old_age   Offline	  -	   35428419184

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Self-test routine in progress 90%	   341		 -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 

CraigD

Patron
Joined
Mar 8, 2016
Messages
343
You put a drive into service without doing a Hard Drive Burn-in? That is asking for trouble

If possible take the drive out of service and burn it in to find out if the drive is good (this may take a long time)

190 Airflow_Temperature_Cel 0x0022 050 043 040 Old_age Always - 50 (Min/Max 47/57)

Your drive needs better cooling...

Without knowing your system specs, I cannot help you further.

Have Fun
 

MikeyG

Patron
Joined
Dec 8, 2017
Messages
442
Hey Craig, thanks for pointing out the burn in guide. I definitely made some mistakes putting this together - one being that I didn't test the drives before putting them into use. They were all used and ironically all but this one do have extended SMART test history on them that was recently done, so I made the assumption they were ok. The test still hadn't completed this morning after nearly two days, so I took that one offline and am running through the SMART tests and badblocks on it now.

As for the cooling - the Seagate drives I got seem to run fairly hot. The max temps were only reached during a scrub, otherwise they do stay around 50 which I know is quite warm. Trying to see what I can do about it at this point. Perhaps I should have gone with some less demanding drives. I guess sometimes this is how we learn.

My other specs are:
Supermicro X11SSH-CTF
i3-7320 CPU
2X Crucial CR16G4WFD824 16GB ECC
Seasonic Focus 650W 80 Plus Gold
8X ST8000NM0055
iStarUSA D406-DEBK Black 4U Rackmount Case
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194

MikeyG

Patron
Joined
Dec 8, 2017
Messages
442
Just to follow up, the SMART test did eventually complete after 30 hours with no errors. I ran badblocks which seemed fine with writes, but the read cycles were really messed up. They would start fast (200MBps) then go in waves of 100MBps down to 4MBps or so for a few seconds. It would take a few days to complete a read cycle. After 100 hours of this, but no errors, it still wasn't done.

I also noticed that on speed tests using dd when reading, some of my tests were hanging on the performance of that one drive. It would be near 100% utilization while the rest would be at like 10% or lower. This is a RAIDZ2 setup, so all drives should be close to equal in utilization when reading back a 100GB file.

I eventually decided to get the drive replaced by Seagate, and all is well. SMART completed in about 18 hours, badblocks looks good, and performance in the pool is as expected.

Working pool bandwidth test:
Code:
root@nas:/mnt/MainPool/test # dd if=/dev/zero of=test.dat bs=2048k count=50000
50000+0 records in
50000+0 records out
104857600000 bytes transferred in 96.318273 secs (1088657397 bytes/sec)

root@nas:/mnt/MainPool/test # dd of=/dev/null if=test.dat bs=2048k count=50000
50000+0 records in
50000+0 records out
104857600000 bytes transferred in 97.303216 secs (1077637556 bytes/sec)
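
As a rough cross-check of the figures above, the byte counts and elapsed times reported by dd can be converted into more readable units. This is only a sketch: `throughput` is a hypothetical helper name, not part of dd, and it assumes a POSIX awk is available.

```shell
# Hypothetical helper: convert a dd result into MB/s (decimal) and MiB/s (binary).
# Usage: throughput <bytes transferred> <seconds>
throughput() {
    awk -v b="$1" -v s="$2" 'BEGIN {
        rate = b / s                       # bytes per second
        printf "%.0f MB/s (%.0f MiB/s)\n", rate / 1000000, rate / 1048576
    }'
}

# Write test from the output above
throughput 104857600000 96.318273
```

For the write run this prints `1089 MB/s (1038 MiB/s)`, and the read run (97.303216 seconds) works out to about 1078 MB/s, so both land just over 1 GB/s in decimal units.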


Thanks for your help!
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
MikeyG said: "One of my drives has been doing an extended test for at least 30 hours or so."
MikeyG said: "so I made the assumption they were ok. The test still hadn't completed this morning after nearly two days, so I took that one offline and am running through the SMART tests and badblocks on it now."
If it won't finish a SMART test, why would it finish the burn-in process, which includes running the SMART long test twice?
If it is in warranty, the best thing is to get it replaced. If it isn't in warranty, you better plan on buying a drive. While you are at it, you should buy a spare. Take the time to test them when you get them and then wrap the spare up for when you need it. I keep two spares of each size drive that I use, already tested, ready to go because it is not a question of if a drive will fail, just when.
MikeyG said: "As for the cooling - the Seagate drives I got seem to run fairly hot. The max temps were only reached during a scrub, otherwise they do stay around 50 which I know is quite warm. Trying to see what I can do about it at this point. Perhaps I should have gone with some less demanding drives. I guess sometimes this is how we learn."
The Seagate drives are not demanding. I run all Seagate drives in my server because the Western Digital and HGST drives I had before ran 5°C to 10°C hotter. The problem you have is undoubtedly the airflow over the drives. What kind of case are you using? How many fans do you have? Drives generate heat. That heat must be moved away from the drives or the drives will cook themselves. It is the biggest cause of early drive failure.
Your drives should be kept in the 30s, and even under heavy load, they should never get above 50°C. That is certainly not where they should be all the time.
I see now:
MikeyG said: "iStarUSA D406-DEBK Black 4U Rackmount Case"
The fans on those drive cages are totally inadequate. You need a lot more airflow.
MikeyG said: "I also noticed that on speed tests using dd when reading, some of my tests were hanging on the performance of that one drive."
That is a defective drive. It is slow because it is retrying repeatedly to read a location. It shouldn't need to do that.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
PS. You need to get the temp under control or you will have more drive failures.
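
One way to keep an eye on the 50°C ceiling discussed in this thread is to read attribute 194 (Temperature_Celsius) from the SMART attribute table. The following is a minimal sketch, not a monitoring solution: `drive_temp_status` is a hypothetical helper name, the default limit of 50 is taken from the advice above, and it assumes the raw temperature is in the tenth column, as it is in the output posted at the top of the thread.

```shell
# Hypothetical helper: read `smartctl -A` (or `smartctl -a`) output on stdin and
# report the drive temperature against a limit in degrees C (default 50).
# Prints "<temp> ok", "<temp> HOT", or "unknown" if attribute 194 is absent.
drive_temp_status() {
    awk -v limit="${1:-50}" '
        $1 == 194 {
            temp = $10 + 0                      # raw value, e.g. "50" from "50 (0 14 0 0 0)"
            status = (temp > limit) ? "HOT" : "ok"
            print temp, status
            found = 1
            exit
        }
        END { if (!found) print "unknown" }
    '
}
```

Assuming the device name from the first post, `smartctl -A /dev/da4 | drive_temp_status` would report the current reading, and passing a lower limit such as `drive_temp_status 45` flags drives earlier.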
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
So the drive you are testing should take about 12.5 hours to run an Extended test, assuming that you are not actively using the drive (reading/writing). If the drive is active then it will take longer. 30 hours to get only 10% of the work done is excessive, so I suspect you have a lot of activity occurring, or a genuinely faulty drive.
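
The 12.5 hour figure lines up with the drive's own "Extended self-test routine recommended polling time" of 753 minutes in the smartctl output from the first post. A small sketch of pulling that estimate out and converting it to hours follows; `extended_test_hours` is a hypothetical helper name, and it assumes the two-line layout that smartctl printed above.

```shell
# Hypothetical helper: read `smartctl -a` (or `smartctl -c`) output on stdin and
# print the drive's own estimate for an extended self-test, in hours.
extended_test_hours() {
    awk '
        /^Extended self-test routine/ { want = 1; next }
        want && /recommended polling time:/ {
            # Line looks like: "recommended polling time:  ( 753) minutes."
            gsub(/[^0-9]/, "", $0)          # keep only the digits
            printf "%.2f\n", $0 / 60        # minutes to hours
            exit
        }
    '
}
```

With the device from the first post, `smartctl -a /dev/da4 | extended_test_hours` would print 12.55, so a test that is still at 90% remaining after 30 hours is well outside what the drive itself expects.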

First of all, I'd start with a SMART Short test, check out the basics.
Second of all, I'd ensure all services are stopped and then kick off an Extended test. This is a read only test.
These drive tests are internal to the drive once you have started them, but they will slow down if you are using the drives because this will interrupt the testing in progress in order to fulfill the data request, after which the test resumes.
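
The short-then-extended sequence described above maps onto smartctl roughly as sketched here. Treat it as an illustration rather than a definitive procedure: /dev/da4 is the device name from the first post, and `selftest_remaining` is a hypothetical helper for watching progress, not something smartctl provides.

```shell
# Commands are shown as comments so nothing is started by accident.
# /dev/da4 is the device name from the original post; adjust for your system.
#
#   smartctl -t short /dev/da4      # start the short self-test
#   smartctl -t long /dev/da4       # start the extended (read-only) self-test
#   smartctl -l selftest /dev/da4   # show the self-test log
#   smartctl -X /dev/da4            # abort a running self-test

# Hypothetical helper: read `smartctl -c` or `smartctl -a` output on stdin and
# print the percentage of the running self-test that is still remaining.
# Prints nothing if no test is in progress.
selftest_remaining() {
    awk '/% of test remaining/ { sub(/%.*/, "", $1); print $1; exit }'
}
```

Running `smartctl -c /dev/da4 | selftest_remaining` every hour or so shows whether the number is actually moving. For the output in the first post it would print 90.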

Badblocks is the absolute test that must pass; if it passes then your drive is good. If any of the tests above fail then RMA the drive.

And just to join in with the others, your drive temps are a bit high. If you plan to have long lasting drives then you need to work on controlling the temperature better. That may mean putting a fan on it to force more air into the case, case modifications to increase airflow or to direct airflow better, or just a new high flow case.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
MikeyG said: "Just to follow up, the SMART test did eventually complete after 30 hours ~ SNIP ~ I eventually decided to get the drive replaced by Seagate, and all is well. SMART completed in about 18 hours, badblocks looks good, and performance in the pool is as expected."
So, the drive has already been replaced. Good.
 

MikeyG

Patron
Joined
Dec 8, 2017
Messages
442
Chris, thank you for your advice and input. Knowing your knowledge around all this is likely vastly superior to mine I’m hesitant to argue, but feel I must say a couple of things in my defense here.

First, the answer to why I would think a burn in process would finish when it won’t even complete a SMART test is “I really don’t know.” The SMART test did eventually complete without errors, and bad blocks wasn’t showing any errors – it was just going very slow. I have little to no experience running these tests so I’m not sure how exactly they should behave. My big mistake was in not testing all the drives first. I did eventually come to the conclusion that even though none of the tests were returning errors for that drive, its behavior was wrong. I replaced it which as I mentioned seems to have fixed the problem. In the future I know to test all drives first before putting them into a pool. That drive was used, but new to my possession, so it doesn’t seem likely that heat in my setup damaged the drive since I noticed the poor behavior almost as soon as I put it into service.

My assertion that the Seagate drives run hot is specifically for THESE drives, the ST8000NM0055. I’m not trying to make sweeping generalizations about Seagate drives in general. The reason I’m saying this is that I’ve got 6 4TB Seagate drives running in a Node 304 case (which seems to have terrible airflow with no fans near the drives) that sits a shelf higher than the iStarUSA case (where the ambient temps are a bit higher). Those are 5900rpm drives and their temperatures range between about 40-44C which is a solid 10c below the ST8000NM0055s. I previously had those exact drives (the 5900rpm ones) in a desktop Synology model in the same spot and their temps were about 38C.

The iStarUSA case I’m using has two 4 bay hot swap cages, each with an 80mm fan that sucks air in over the drives. I don’t know the specs of the fans, but the airflow coming out of them does feel substantial. There are also two more 80mm fans out the back, and the top of the case has holes in it so hot air can rise out of the top. I really think if I put my 4TB drives in the iStarUSA case, they would easily be running a few degrees cooler, which does put them in the upper 30s. The airflow should certainly be better than the Node 304 and I would think the compact little Synology I used to have.

This is all to flesh out my point of comparison, which is to say that I have other Seagate drives in the same space, that run far cooler, with poorer cooling conditions.

At one point I had one of the ST8000NM0055 drives in a usb drive toaster on my desk, just idling, and it quickly hit 50C when the ambient temperature was about 22C. It felt painfully hot to the touch which I’ve never experienced before with a drive running out in the open. That drive is now one of the drives in the iStarUSA case, and its temperatures are in line with the others, so that experience of it hitting 50C on my desk doesn’t seem like an outlier.

This review does mention the heat of these as well: https://www.bjorn3d.com/2015/12/seagate/ “The operating temps were a bit of a shock and we ran some lengthy tests (5 hours and longer) and the drive will get a bit warm but it operated well within the given specifications but we would recommend a chassis with a fan in front of the drive.”

As an experiment, I took a pretty powerful desk fan I have (enough that it will blow my hair back) and put it about 2 feet in front of the iStarUSA case. It was blowing air that was about 24C right at the case. The drives were idle. I managed to get their temps down to about 44C under those conditions, but I really can’t leave that fan sitting there like that all the time.

I don’t know what else I can realistically do here as I’m constrained by multiple factors. I don’t have the space to run my servers out in the open, so it’s in a place on a shelf where the airflow is limited. I spent a few months trying to find the right case, and settled on the iStarUSA as something that met my demands for physical size, drive bay numbers, and financial limits. I made the assumption that the hot swap drive bays with built in fans were sufficient to cool hard drives, since that’s exactly what they are designed for. I might have been incorrect, but it doesn’t seem like an unreasonable assumption. For reference, those bays are model BPN-DE340SS. If I happen to have sufficiently proven my point that THESE drives run very hot, you could tell me that I should use different or less demanding drives as I’m unable to provide them with a robust enough cooling solution, but I’ve got these already, and the time and expense to replace them doesn’t feel worth it to me.

I’ve been unhappy with the performance of 5900rpm drives in the past (my performance goal for this pool was 1 GB/s of bandwidth, which I’ve just barely achieved), so didn’t want to make that mistake again, and the ST8000NM0055 drives seemed to be available for pretty good prices, had a couple of stellar reviews out there for performance, have a good warranty, and are “enterprise.”

I am running a Netgear 10gb switch near my servers which throws out a ton of heat, so I’m considering trading that in for something that runs much cooler. The ambient temperature where the server sits is probably about 27C. I know it’s not ideal.

At this point, while I’m open to suggestions, even if there was a better case option that presented itself, I’m not convinced the money and time to transfer everything is worth it. It may be heresy to say, but unless the drives die very frequently, like multiple times a year, I’m content with reducing their lifespan. I am running RAIDZ2, and all the data is backed up to another FreeNAS box also running RAIDZ2, and also backed up to a cloud storage account for a worst case scenario. In my opinion, I’ve done a pretty decent job with the rest of the hardware – I’ve got a Supermicro motherboard, a good power supply, ECC memory, a SATA DOM for boot, etc. I’m a home user, so this stuff isn’t business critical. I’m also fairly new at this, and I’m sure I’ve made some mistakes here, but I’ve spent a decent amount of time researching, and have some constraints that I just can’t really get around. If this was for a business, aside from not even trying to build it on my own and purchasing a TrueNAS box instead, it would be in a rack with AC where I wouldn’t have to worry about ambient temperatures, size, noise factors, and money spent. Additionally, as I type this, the temps for the drives are all 47 and 48C, so they are definitely not running up to 57C all the time like those SMART results I shared in my original post.

All that being said, I do greatly appreciate the expertise you are offering, and if there are some concrete suggestions, instead of simply “you need better cooling” I would love to explore that.

What I’m seeing for possible options for this set up at this point is:

1. Look into higher performing fans for the drive bays.
2. Look into another case that fulfills the same constraints as this one (size, price, drive bay #s)
3. Get rid of the Netgear switch to reduce ambient heat.
4. Be willing to run the server someplace really inconvenient like at my desk or the dining room table.
5. Let the whole thing run for a while and see if the drives really do die. While I can replace them under warranty, I’m not willing to live with multiple drive failures per year. If however the failure rate is a drive death every year or two, and I can likely attribute this to poor cooling, it may be an acceptable risk for me given what I’ve laid out above and what I feel are adequate backups.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
MikeyG said: "I did eventually come to the conclusion that even though none of the tests were returning errors for that drive, its behavior was wrong."
Good. I am not trying to bash you.
MikeyG said: "That drive was used, but new to my possession, so it doesn’t seem likely that heat in my setup damaged the drive since I noticed the poor behavior almost as soon as I put it into service."
Sometimes people sell drives knowing that they are lemons. Sorry. Were you able to get it replaced under warranty?
MikeyG said: "My assertion that the Seagate drives run hot is specifically for THESE drives, the ST8000NM0055."
Sounds reasonable.
MikeyG said: "BPN-DE340SS"
[Image: bpn-de340ss_03.jpg]

I had a similar drive enclosure (different brand) but it had the 80mm fan like this. Even with the low RPM drives like the 4TB units you have, it would get too hot. The 80mm fan just couldn't move enough air. The thing I did was to build a duct from cardboard and tape and add a 120mm fan on the back of the unit that I plugged direct to the 12 volt from the power supply. It looked ugly, really ugly, but it worked to keep the drives cool. If you are handy / crafty, you might be able to make something like that and have it look decent. I rationalized my ugly solution by saying that I was the only one that had to see it.
MikeyG said: "If I happen to have sufficiently proven my point that THESE drives run very hot"
I trust you. That is a different model than I was thinking. You have the 7200 RPM heating element in there.
MikeyG said: "enterprise"
That means hot... ;)
MikeyG said: "I am running a Netgear 10gb switch near my servers which throws out a ton of heat, so I’m considering trading that in for something that runs much cooler. The ambient temperature where the server sits is probably about 27C. I know it’s not ideal."
No, but we do what we must. At work I have the computer room at 18°C but at home, I have less money to work with...
MikeyG said: "It may be heresy to say, but unless the drives die very frequently, like multiple times a year, I’m content with reducing their lifespan."
I hate to have them die prematurely, maybe you can try the fan idea?
MikeyG said: "Additionally, as I type this, the temps for the drives are all 47 and 48C, so they are definitely not running up to 57C all the time like those SMART results I shared in my original post."
I know it isn't ideal, but those temperatures will not kill the drives instantly. I have a big storage server at work where the hottest drive sits at 37°C, but it is in that 18°C room. My server at home is in my somewhat warmer house and the hottest drive right now is at 46°C but that is because it is running the burn-in test. Out of the drives being used for storage, the hottest is at 39°C, but that is on the back row of drives in a chassis like this: http://www.chenbro.com/en-global/products/RackmountChassis/4U_Chassis/NR40700
Drives at the front run cooler.
MikeyG said: "4. Be willing to run the server someplace really inconvenient like at my desk or the dining room table."
Dining room table... Absolutely... LOL. No. What about some fans as a first step?
MikeyG said: "If however the failure rate is a drive death every year or two"
The problem is, and again, I am not trying to bash you, it could be multiple drives failing at once. Just today there was another thread on here where someone had hot drives and ended up with three failures before they could deal with the first replacement. The drives were just throwing bad sectors and ATA errors, but it is probably better to take a proactive approach.
MikeyG said: "it may be an acceptable risk for me given what I’ve laid out above and what I feel are adequate backups."
I know how much fun it is to restore from a backup...
 

MikeyG

Patron
Joined
Dec 8, 2017
Messages
442
Chris, thank you for being understanding that I’m trying. ;)

Yes, the drive was replaced under warranty, so no big deal.

That is a good point about having the multiple drives die at once. Obviously RAIDZ2 only reduces the chances of losing the pool in the event of drive failure, but it sounds like multiple concurrent drive problems could be more of a possibility than I realized, especially if exacerbated by poor cooling.

The suggestion of getting a bit creative with the drive enclosure seems good. I don’t know that I can physically get a 120mm in there due to the size of the case, but maybe there are some more powerful 80mm or slightly larger fans I can try out, and then do what I need to do to get them to fit.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
MikeyG said: "The suggestion of getting a bit creative with the drive enclosure seems good. I don’t know that I can physically get a 120mm in there due to the size of the case, but maybe there are some more powerful 80mm or slightly larger fans I can try out, and then do what I need to do to get them to fit."
They are not very popular, but 92mm fans are a thing. For some reason that I never understood, the sizes go from 80 to 92 to 120 to 140... So, 92 has always been kind of odd, but you can get them:
https://www.newegg.com/Product/Product.aspx?Item=9SIAADY4B02739
As for fitting them, you might be able to remove the stock fan and put the larger fan on with an adapter of some sort.
[Image: 67212-0200-024-60-SG.jpg]
 
Joined
Jan 18, 2017
Messages
525
MikeyG said: "My assertion that the Seagate drives run hot is specifically for THESE drives, the ST8000NM0055."

Interesting. I have six of these drives in the back of my Supermicro SC847 case and I have not seen them exceed 43 degrees C. The ambient in the room at the time was 31 (which is why I had to go check on them), which really, really didn't help. The fan wall in the SC847 is fairly powerful and not something you want to try and sleep in the same room with, lol.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Interesting. I have six of these drives in the back of my Supermicro SC847 case and I have not seen them exceed 43 degrees C. The ambient in the room at the time was 31 (which is why I had to go check on them), which really, really didn't help. The fan wall in the SC847 is fairly powerful and not something you want to try and sleep in the same room with, lol.
Yes, if you have the standard fans in that chassis, it will rock you to sleep.
 

MikeyG

Patron
Joined
Dec 8, 2017
Messages
442
I like your idea of an "adapter" Chris. :)

I'm not too surprised to hear that with the "proper" cooling, the drive temps are way better. I'm most definitely running into the limits of a small case, small space, and needing tolerable noise levels in my tiny little apartment.
 

MikeyG

Patron
Joined
Dec 8, 2017
Messages
442
As a final followup, I got some 80mm to 92mm adapters, and Noctua fans. Along with some electrical tape, and tweaking of the fans with the IPMI tool, the drives are now sitting at 42C while idle. The maximum temps now seem to stop at about 49C under load. It's still not ideal, but compared to idling at 48C and maxing out around 58C I'll take it. Hopefully the lifetime of the drives will be reasonably preserved now. Short of some pretty drastic changes and sacrifices to how I have things set up, it's the best it's going to get.

Thank you again for your help, and also pushing me to stretch the cooling performance a bit more than I thought was possible.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
MikeyG said: "Thank you again for your help, and also pushing me to stretch the cooling performance a bit more than I thought was possible."
Thanks for the update. I hope you enjoy the system for many years to come. Keep an eye on the tape from time to time. The adhesive tends to come loose on some brands after a while.
 