Critical SMART alerts on both USB boot drives after upgrade to 11.1

Status
Not open for further replies.

NetSoerfer

Explorer
Joined
May 8, 2016
Messages
57
I upgraded to 11.1 today, and immediately after the upgrade I got critical alerts shown for both SanDisk Extreme 32GB USB Flash Drives I use as boot devices:
Code:
CRITICAL: Dec. 27, 2017, 12:09 p.m. - Device: /dev/da1 [SAT], FAILED SMART self-check. BACK UP DATA NOW!
CRITICAL: Dec. 27, 2017, 12:09 p.m. - Device: /dev/da1 [SAT], Failed SMART usage Attribute: 232 Perc_Avail_Resrvd_Space.
CRITICAL: Dec. 27, 2017, 12:09 p.m. - Device: /dev/da0 [SAT], FAILED SMART self-check. BACK UP DATA NOW!
CRITICAL: Dec. 27, 2017, 12:09 p.m. - Device: /dev/da0 [SAT], Failed SMART usage Attribute: 232 Perc_Avail_Resrvd_Space.

Running smartctl -a /dev/da0 outputs the following:
Code:
smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 SanDisk based SSDs
Device Model:	 SanDisk pSSD
Serial Number:	0074aab31
LU WWN Device Id: 5 001b44 04a0731ab
Firmware Version: 6EB 1030
User Capacity:	31,406,948,352 bytes [31.4 GB]
Sector Size:	  512 bytes logical/physical
Rotation Rate:	Solid State Device
Form Factor:	  1.8 inches
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 T13/2015-D revision 3
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:	Wed Dec 27 16:49:22 2017 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
See vendor-specific Attribute list for failed Attributes.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
										was never started.
										Auto Offline Data Collection: Disabled.
Self-test execution status:	  (   0) The previous self-test routine completed
										without error or no self-test has ever
										been run.
Total time to complete Offline
data collection:				(  120) seconds.
Offline data collection
capabilities:					(0x51) SMART execute Offline immediate.
										No Auto Offline data collection support.
										Suspend Offline collection upon new
										command.
										No Offline surface scan supported.
										Self-test supported.
										No Conveyance Self-test supported.
										Selective Self-test supported.
SMART capabilities:			(0x0003) Saves SMART data before entering
										power-saving mode.
										Supports SMART auto save timer.
Error logging capability:		(0x01) Error logging supported.
										General Purpose Logging supported.
Short self-test routine
recommended polling time:		(   2) minutes.
Extended self-test routine
recommended polling time:		(   7) minutes.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0002   100   100   000	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0002   100   100   000	Old_age   Always	   -	   4
 12 Power_Cycle_Count	   0x0002   100   100   000	Old_age   Always	   -	   0
165 Total_Write/Erase_Count 0x0002   100   100   000	Old_age   Always	   -	   2054
171 Program_Fail_Count	  0x0002   100   100   000	Old_age   Always	   -	   0
172 Erase_Fail_Count		0x0002   100   100   000	Old_age   Always	   -	   0
173 Avg_Write/Erase_Count   0x0002   100   100   000	Old_age   Always	   -	   0
174 Unexpect_Power_Loss_Ct  0x0002   100   100   000	Old_age   Always	   -	   0
187 Reported_Uncorrect	  0x0002   100   100   000	Old_age   Always	   -	   0
194 Temperature_Celsius	 0x0022   100   000   000	Old_age   Always	   -	   0 (0 0 0 192 0)
230 Perc_Write/Erase_Count  0x0002   100   100   000	Old_age   Always	   -	   0
232 Perc_Avail_Resrvd_Space 0x0003   000   100   005	Pre-fail  Always   FAILING_NOW 0
234 Perc_Write/Erase_Ct_BC  0x0002   100   100   000	Old_age   Always	   -	   10000
241 Total_LBAs_Written	  0x0002   100   100   000	Old_age   Always	   -	   0
242 Total_LBAs_Read		 0x0002   100   100   000	Old_age   Always	   -	   0

SMART Error Log not supported

SMART Self-test Log not supported

SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Google says that 232 Perc_Avail_Resrvd_Space means that the drive's projected maximum number of erase cycles has been exceeded, but it seems odd:
VALUE = 0, WORST = 100, THRESH = 5 looks as though there was an error that has since gone away, but the number of erase cycles can't very well have decreased from 100 back to 0, can it?
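For anyone wanting to check the row by hand, here is a minimal Python sketch of how the three normalized columns are usually read. It assumes the common smartmontools rule that an attribute counts as failing now when VALUE is at or below a nonzero THRESH, and that WORST should be the lowest VALUE ever recorded. The names `SmartAttr`, `parse_attr_row` and `assess` are made up for this illustration and are not part of any tool.

```python
from dataclasses import dataclass


@dataclass
class SmartAttr:
    attr_id: int
    name: str
    value: int   # normalized current value
    worst: int   # lowest normalized value the drive claims to have seen
    thresh: int  # failure threshold (0 means "never fails")
    raw: str     # vendor-specific raw field, kept as text


def parse_attr_row(line: str) -> SmartAttr:
    """Parse one row of the 'Vendor Specific SMART Attributes' table."""
    parts = line.split()
    # Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW...
    return SmartAttr(
        attr_id=int(parts[0]),
        name=parts[1],
        value=int(parts[3]),
        worst=int(parts[4]),
        thresh=int(parts[5]),
        raw=" ".join(parts[9:]),
    )


def assess(attr: SmartAttr) -> list:
    """Return notes about one attribute (assumed rules, see lead-in)."""
    notes = []
    if attr.thresh > 0 and attr.value <= attr.thresh:
        notes.append("failing_now")
    if attr.worst > attr.value:
        # WORST can never legitimately be better than the current VALUE.
        notes.append("worst_above_value")
    return notes


row = "232 Perc_Avail_Resrvd_Space 0x0003   000   100   005 Pre-fail  Always   FAILING_NOW 0"
attr = parse_attr_row(row)
print(attr.name, assess(attr))
```

Under those assumptions, VALUE = 0 against THRESH = 5 means the attribute is failing right now, not in the past, while WORST = 100 contradicts VALUE = 0. That inconsistency may point to the drive reporting fields that smartctl is not decoding as intended, rather than to a counter that went backwards.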

What do I do with this?
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
I'm a bit perplexed in what you are saying...

I upgraded to 11.1 today, and immediately after the upgrade I got critical alerts shown for both Sandisk USB Keys I use as boot devices
So you say it's a USB device here and then...

Model Family: SanDisk based SSDs Device Model: SanDisk pSSD
Which is an SSD, not a USB flash drive. So which is it? Could you provide the specific model?

165 Total_Write/Erase_Count 0x0002 100 100 000 Old_age Always - 2054
This is a key piece of information: many consumer SSDs are rated for only about 2,000 erase cycles, and you have 2,054.

My advice is to roll back to the previous boot environment and cross your fingers it works without any error messages. Then check your smart data again.

If after you roll back and all works again, and you still want to upgrade, ensure you have a backup of your config file and then I'd perform a clean install of 11.1 and then restore your config file. But my advice is to not upgrade to 11.1 if rolling it back fixes your issue and then wait for 11.2 if you really want to upgrade.
 

NetSoerfer

Explorer
Joined
May 8, 2016
Messages
57
I'm a bit perplexed in what you are saying...
[...Sandisk USB Keys...]
So you say it's a USB device here and then...
[...SanDisk based SSDs Device Model: SanDisk pSSD...]
Which is an SSD, not a USB flash drive. So which is it? Could you provide the specific model?
Well yeah, I could've added that bit of info. They're both identical SanDisk Extreme 32GB USB Flash Drives, I have no idea why they come up as SSDs. I've also added that info to the original post.

[165 Total_Write/Erase_Count 0x0002 100 100 000 Old_age Always - 2054]
This is a key piece of information: many consumer SSDs are rated for only about 2,000 erase cycles, and you have 2,054.
Wow, so after roughly a year of boot device duty these drives might actually be worn out already? That's surprisingly fast (not that it's much of a problem, since they're not that expensive). Is there anything I can do to prevent excessive writes?
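As a rough plausibility check, the arithmetic can be sketched in Python. This assumes, purely for illustration, that the raw value of attribute 165 counts full-drive write/erase cycles, which nothing in this thread establishes, and it ignores write amplification. The capacity and cycle count come from the smartctl output above; the 365 days stands in for "roughly a year".

```python
# Figures taken from the smartctl output in the first post.
CAPACITY_BYTES = 31_406_948_352  # "User Capacity"
REPORTED_CYCLES = 2054           # raw value of attribute 165
DAYS_IN_SERVICE = 365            # "roughly a year", an approximation


def implied_writes(capacity_bytes: int, cycles: int, days: int):
    """Return (total TB written, GB written per day) if every cycle
    rewrote the whole drive once. This is an assumption, not a fact."""
    total_bytes = capacity_bytes * cycles
    return total_bytes / 1e12, total_bytes / days / 1e9


total_tb, gb_per_day = implied_writes(CAPACITY_BYTES, REPORTED_CYCLES, DAYS_IN_SERVICE)
print(f"{total_tb:.1f} TB in total, about {gb_per_day:.0f} GB per day")
```

Tens of terabytes in a year would be a lot for a boot device. The same report also shows Power_On_Hours at 4 and Avg_Write/Erase_Count at 0, so the counters do not obviously agree with each other, and the 2054 figure may not mean what its name suggests.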

My advice is to roll back to the previous boot environment and cross your fingers it works without any error messages. Then check your smart data again.
I did a rollback as per your suggestion. I'm not getting any critical alerts anymore, but smartctl -a /dev/da0 still gives me the same output as before:
Code:
...
165 Total_Write/Erase_Count 0x0002   100   100   000	Old_age   Always	   -	   2054
...
232 Perc_Avail_Resrvd_Space 0x0003   000   100   005	Pre-fail  Always   FAILING_NOW 0
...

If after you roll back and all works again, and you still want to upgrade, ensure you have a backup of your config file and then I'd perform a clean install of 11.1 and then restore your config file. But my advice is to not upgrade to 11.1 if rolling it back fixes your issue and then wait for 11.2 if you really want to upgrade.
If I do an install from scratch and import my settings, will everything just work again, like jails, VMs, etc?
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
If I do an install from scratch and import my settings, will everything just work again, like jails, VMs, etc?
They should. I would recommend doing the clean install on a different USB flash drive; that way, if something goes wrong, you can recover easily by just plugging your original USB flash drives back in.

Also, you do not need a pair of boot devices. Just keep a copy of your configuration file so you can restore it if needed; it's a much better way to go.

If you do not need to upgrade to 11.1 then I'd wait it out for 11.2. Also make a backup of your configuration file and run your boot devices until they die, or you could replace them if you like.

Wow, so after roughly a year of boot device duty these drives might actually be worn down already?
Some people don't even make it this long on a USB Flash drive. What I find odd here is I've never seen a SMART report for a USB Flash Drive before that was like yours. Thanks for updating the original post with the model of the USB drive. When I look up the model number provided by smartctl it shows up as a small SSD card but apparently it was installed into a USB device.

My advice is to replace your boot device with a traditional SSD, just a single SSD, if you can. It doesn't need to be anything fancy, just whatever is on sale; 120GB or less is a good place to start. SSDs last longer, and updates are faster and more reliable.
 

NetSoerfer

Explorer
Joined
May 8, 2016
Messages
57
I just noticed that in FN11.0, the boot USB flash drives don't show up in View Devices, and I think they did in FN11.1.

Is it possible that they were excluded from SMART monitoring in 11.0, and have been added to monitoring in 11.1? With identical smartctl output between 11.0 and 11.1, it seems like the only noticeable difference is the error showing up in the alert system...
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Prior to 11.1 the boot device typically didn't show up in View Devices. Maybe this difference is the cause, I don't know, but it is new to me; that's about all I can say.
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
Yes, @joeschmuck, now they do, and I got a temperature warning on my SATA DOM - see here
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Clearly those USB drives are SATA SSDs in disguise, given their SMART output.

Now, some of the fields may not be interpreted correctly by smartmontools, since they vary a lot on SSDs. Check the manufacturer's utility, too.
 

cswboys

Cadet
Joined
Jan 17, 2015
Messages
2
Hey all...

I wanted to add that I have the exact same smart error after updating to 11.1. I also have the exact same SanDisk Extreme USB drive at 32 gigs. Weird that it’s the same usb thumb drive and same error (232). My usb drive however, is only a month or two old and had a clean version of FN 11 installed on it. Let me know if I can share some more info to help get to the bottom of this!
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
@cswboys, the answer's in the post immediately before yours. Smartmontools is likely not interpreting the SSD information fields correctly. 11.1 reporting SMART results from boot drives is a change from previous versions. @Ericloewe suggests you try whatever SanDisk offers as a utility for these drives and see what it tells you if you have a concern.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Hey all...

I wanted to add that I have the exact same smart error after updating to 11.1. I also have the exact same SanDisk Extreme USB drive at 32 gigs. Weird that it’s the same usb thumb drive and same error (232). My usb drive however, is only a month or two old and had a clean version of FN 11 installed on it. Let me know if I can share some more info to help get to the bottom of this!
Someone beat me to the punch by mere seconds.
 

cswboys

Cadet
Joined
Jan 17, 2015
Messages
2
@Redcoat Thank you for the reply! I wanted to comment originally because I thought it was interesting that it was with the exact same drive and capacity. I think I'll be swapping out that SanDisk drive for a redundant setup with two Samsung units to see how that works. @joeschmuck @Ericloewe Thanks again for the great info and knowledge share! @NetSoerfer Great post with the details of the problem! It saved me having to make it myself!
 

MiG

Dabbler
Joined
Jan 6, 2017
Messages
21
Hey all...

I wanted to add that I have the exact same smart error after updating to 11.1. I also have the exact same SanDisk Extreme USB drive at 32 gigs. Weird that it’s the same usb thumb drive and same error (232). My usb drive however, is only a month or two old and had a clean version of FN 11 installed on it. Let me know if I can share some more info to help get to the bottom of this!
Same problem, nearly identical situation.

Cloned my Sandisk Extreme USB 3.0 64GB drive containing 9.10.2-U6 to an identical, barely used second one (which I then plugged into the server), changed trains on that one and upgraded to 11.1-U5. After logging back in I found a glowing red critical warning "Device: /dev/da0 [SAT], FAILED SMART self-check. BACK UP DATA NOW!".

Smartctl output:
Code:
										Self-test supported.
										No Conveyance Self-test supported.
										Selective Self-test supported.
SMART capabilities:			(0x0003) Saves SMART data before entering
										power-saving mode.
										Supports SMART auto save timer.
Error logging capability:		(0x01) Error logging supported.
										General Purpose Logging supported.
Short self-test routine
recommended polling time:		(   2) minutes.
Extended self-test routine
recommended polling time:		(  11) minutes.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0002   100   100   000	Old_age   Always	   -	   2
  9 Power_On_Hours		  0x0002   100   100   000	Old_age   Always	   -	   0
 12 Power_Cycle_Count	   0x0002   100   100   000	Old_age   Always	   -	   0
165 Total_Write/Erase_Count 0x0002   100   100   000	Old_age   Always	   -	   2054
171 Program_Fail_Count	  0x0002   100   100   000	Old_age   Always	   -	   0
172 Erase_Fail_Count		0x0002   100   100   000	Old_age   Always	   -	   0
173 Avg_Write/Erase_Count   0x0002   100   100   000	Old_age   Always	   -	   1
174 Unexpect_Power_Loss_Ct  0x0002   100   100   000	Old_age   Always	   -	   0
187 Reported_Uncorrect	  0x0002   100   100   000	Old_age   Always	   -	   0
194 Temperature_Celsius	 0x0022   120   236   000	Old_age   Always	   -	   236 (0 209 0 220 0)
230 Perc_Write/Erase_Count  0x0002   100   100   000	Old_age   Always	   -	   3
232 Perc_Avail_Resrvd_Space 0x0003   000   100   005	Pre-fail  Always   FAILING_NOW 0
234 Perc_Write/Erase_Ct_BC  0x0002   100   100   000	Old_age   Always	   -	   10000
241 Total_LBAs_Written	  0x0002   100   100   000	Old_age   Always	   -	   0
242 Total_LBAs_Read		 0x0002   100   100   000	Old_age   Always	   -	   0

SMART Error Log not supported

SMART Self-test Log not supported

SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

In my case I'm not worried as I still have the old 9.10.2 USB drive (no smart warnings from Freenas 9 for that one, at least until just now, and I didn't upgrade the ZFS pool), as well as multiple cfg backups both in 9 and 11.

However, is this a genuinely critical warning (and should I therefore get another USB drive?) or is the SMART data of this series of drives perhaps interpreted differently/wrongly in 11.1?
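One way to look at the second possibility is to compare the raw values from this drive with those of the 32GB drive in the first post. The Python sketch below does that with the numbers copied by hand from the two tables in this thread. The helper name `shared_nonzero` is invented for the example, and attribute 194 is left out because its raw field is not a plain counter.

```python
# Raw values copied from the two attribute tables in this thread.
# Attribute 194 (temperature) is omitted: its raw field is not a plain counter.
DRIVE_32GB = {5: 0, 9: 4, 12: 0, 165: 2054, 171: 0, 172: 0, 173: 0,
              174: 0, 187: 0, 230: 0, 232: 0, 234: 10000, 241: 0, 242: 0}
DRIVE_64GB = {5: 2, 9: 0, 12: 0, 165: 2054, 171: 0, 172: 0, 173: 1,
              174: 0, 187: 0, 230: 3, 232: 0, 234: 10000, 241: 0, 242: 0}


def shared_nonzero(a: dict, b: dict) -> dict:
    """Return attributes whose raw value is nonzero and identical on both drives."""
    return {attr_id: raw for attr_id, raw in a.items()
            if raw != 0 and b.get(attr_id) == raw}


print(shared_nonzero(DRIVE_32GB, DRIVE_64GB))
```

Two drives of different capacity and age both showing exactly 2054 for attribute 165 and 10000 for attribute 234 would be a surprising coincidence for real wear counters. It may suggest these fields hold fixed or differently encoded values on this model, although that is only an inference from two samples and not something the thread confirms.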

@cswboys, the answer's in the post immediately before yours. Smartmontools is likely not interpreting the SSD information fields correctly. 11.1 reporting SMART results from boot drives is a change from previous versions. @Ericloewe suggests you try whatever SanDisk offers as a utility for these drives and see what it tells you if you have a concern.
EDIT: missed this one, 'critical' warnings get the adrenaline flowing even when not warranted. The SanDisk tools page (https://www.sandisk.com/tools) however is pretty much blank at the time of posting.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
You can plug that USB drive into a Windows machine and check the SMART data; maybe it is failing. What catches my eye is that ID 165, Total Write/Erase Count, is over 2,000, and the write/erase endurance typically quoted for an SSD cell is around 2,000 cycles. Could it be interpreted wrong? Of course it can. So you have two options: replace the USB flash drive or just keep on rolling with it. Just make sure you maintain a good copy of your configuration file, just in case.
 

MiG

Dabbler
Joined
Jan 6, 2017
Messages
21
My advice is to replace your boot device with a traditional SSD, just a single SSD, if you can. It doesn't need to be anything fancy, just whatever is on sale; 120GB or less is a good place to start. SSDs last longer, and updates are faster and more reliable.

Before FreeNAS I had never used a USB drive as a permanent boot device. If these things break that fast, I wouldn't mind going back to a full hard drive setup.

I have a few old SSDs lying around. Do I clone the partitions on the USB drive to one of these, or is there a more elegant method?
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
USB 3.0 devices typically run hot and fail prematurely, but I don't think that is your case. I've never seen a USB drive fail due to actual normal use, but FreeNAS 9.x had an issue where it was writing to the boot device a lot, thus wearing them out fast. I think FreeNAS 10.x fixed that; I don't recall exactly which subversion.

I prefer an SSD of course, but if you are fine with a USB device then you can continue to use it. You might try to update to FreeNAS 10 and use a new flash drive. The call is up to you.
 

MiG

Dabbler
Joined
Jan 6, 2017
Messages
21
I'll give it some consideration, I guess I have some breathing space. I also realised I currently don't have any free SATA ports available, so this wouldn't happen until I've installed a new controller (in the works, increased chipset support is the reason I upgraded to 11).

As far as cfg backup frequency is concerned, I assume that without me making any changes to settings or pool(s), roughly once per FreeNAS update is sufficient?
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
As far as cfg backup frequency is concerned, I assume that without me making any changes to settings or pool(s), roughly once per FreeNAS update is sufficient?
That is all I do as well: one backup whenever I make changes to my system, which isn't often at all. My last change was probably over 4 months ago. Plus, since my system runs on ESXi, I have an automated weekly backup of my FreeNAS VM for super easy restoration.
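If you would rather script the copy than remember to do it, something along these lines could work. It is only a sketch: the function name `backup_config` is invented here, and the path `/data/freenas-v1.db` for the configuration database is an assumption that should be checked on your own system. Saving the config from the web interface remains the straightforward route.

```python
import shutil
import time
from pathlib import Path


def backup_config(src, dest_dir) -> Path:
    """Copy the config file into dest_dir under a timestamped name.

    Nothing here is FreeNAS-specific; src is whatever file you point it at.
    """
    src = Path(src)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = dest_dir / f"{src.stem}-{stamp}{src.suffix}"
    shutil.copy2(src, target)  # copy2 also preserves the modification time
    return target


# Example call (both paths are assumptions, adjust before use):
# backup_config("/data/freenas-v1.db", "/mnt/tank/config-backups")
```

Keeping the copies on the pool, or better on another machine, means a dead boot stick costs you only a reinstall and a config restore.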
 

MiG

Dabbler
Joined
Jan 6, 2017
Messages
21
Slight update: I suspect it was a faulty USB drive issue, at least in my case.
I just let it run for a while, and last night it started producing partial garbage output in the web interface. Pool's fine fortunately. Took a third, also barely used identical (Sandisk Extreme USB 3.0 64GB) USB drive, restored my last config and 11.1-U6 now reports zero smart warnings (whereas the previous drive immediately raised literal red flags).

I'm still a bit iffy on the concept of permanently using USB storage as a system drive though, so I'll have to free up an SSD.
 