FN11 reports unrecoverable error & Currently unreadable (pending) sectors, smartctl shows nothing...

HeloJunkie · Aug 23, 2017

SYSTEM:

Supermicro Superserver 5028R-E1CR12L
Supermicro X10SRH-CLN4F Motherboard
1 x Intel Xeon E5-2640 V3 8 Core 2.66GHz
4 x 16GB PC4-17000 DDR4 2133Mhz Registered ECC
12 x 4TB HGST HDN724040AL 7200RPM NAS SATA Hard Drives
2 x 6 Drive RAIDZ2 VDEVs
LSI3008 SAS Controller - Flashed to IT Mode (Firmware Version 12.00.02.00)
LSI SAS3x28 SAS Expander
LSI9211-8i SAS Controller - Flashed to IT Mode (Firmware Version 20.00.02.00)
(connects to external JBOD enclosure)
Dual 920 Watt Platinum Power Supplies
16GB USB Thumb Drive for booting
Chelsio T580-SO-CR Dual 40Gbe NIC (Replication Connection to backup FreeNAS server)
Chelsio T520-SO-CR Dual 10Gbe NIC (Data connection to Plex server & media management server)
FreeNAS-11.0-U2 (e417d8aa5)

So this morning at 0120 hours, I received an email:

Code:

The volume vol1 state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.

followed by another email:

Code:

Device: /dev/da14 [SAT], 7 Currently unreadable (pending) sectors

Since I am running RAIDZ2, I made a note of it and went back to sleep. This morning when I came into work, I took a look at the pool:

Code:

root@plexnas:~ # zpool status vol1
  pool: vol1
 state: ONLINE
  scan: scrub repaired 0 in 7h40m with 0 errors on Tue Aug 15 09:40:27 2017
config:

        NAME                                            STATE     READ WRITE CKSUM
        vol1                                            ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/f46fb4ec-ed62-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/f69f4e21-ed62-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/f8cde372-ed62-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/faeb3d6d-ed62-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/fd087ff0-ed62-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/ff28300a-ed62-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
          raidz2-1                                      ONLINE       0     0     0
            gptid/013d5491-ed63-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/0357b342-ed63-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/05811f51-ed63-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/079f5f22-ed63-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/09b81318-ed63-11e4-a956-0cc47a31abcc  ONLINE       0     0     0
            gptid/a82dda8c-ef5f-11e4-bb0a-0cc47a31abcc  ONLINE       0     0     0
          raidz2-2                                      ONLINE       0     0     0
            gptid/120bb7bc-89a2-11e6-9e64-0cc47a31abcc  ONLINE       0     0     0
            gptid/12ce80a0-89a2-11e6-9e64-0cc47a31abcc  ONLINE       0     0     0
            gptid/13867ead-89a2-11e6-9e64-0cc47a31abcc  ONLINE       0     0     0
            gptid/14413602-89a2-11e6-9e64-0cc47a31abcc  ONLINE       0     0     0
            gptid/14f95eb4-89a2-11e6-9e64-0cc47a31abcc  ONLINE       0     0     0
            gptid/15af6956-89a2-11e6-9e64-0cc47a31abcc  ONLINE       0     0     0
          raidz2-3                                      ONLINE       0     0     0
            gptid/d69a5dad-0ab4-11e7-9f3c-0cc47a31abcc  ONLINE       0     0     0
            gptid/d7b9b84e-0ab4-11e7-9f3c-0cc47a31abcc  ONLINE       0     0     0
            gptid/d8d769fb-0ab4-11e7-9f3c-0cc47a31abcc  ONLINE       0     0     0
            gptid/d9f58a88-0ab4-11e7-9f3c-0cc47a31abcc  ONLINE       0     0     0
            gptid/db11810d-0ab4-11e7-9f3c-0cc47a31abcc  ONLINE       0     0     0
            gptid/dc3427cd-0ab4-11e7-9f3c-0cc47a31abcc  ONLINE       0     0     0

errors: No known data errors

And I took a look at /dev/da14:

Code:

root@plexnas:~ # smartctl -A /dev/da14
smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.0-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       613
  3 Spin_Up_Time            0x0027   183   181   021    Pre-fail  Always       -       7808
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       18
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       5
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       37
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       7762
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       18
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       15
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       98
194 Temperature_Celsius     0x0022   129   119   000    Old_age   Always       -       23
196 Reallocated_Event_Count 0x0032   196   196   000    Old_age   Always       -       4
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       31

So I am kind of at a loss as to where to look next for the unrecoverable error and unreadable pending sectors. Everything seems to be running well, no alerts on the system itself.

One weird thing was that I had no webui when I came in this morning. The system was working just fine (just NFS mounts), all the mounts were working, but I could not access the web interface itself. I am using the old interface. It looked like django was the cause of the problem, as nothing I could do would get it to stop and restart, and eventually I had to roll to my backup NAS and reboot the primary. As soon as I did, I got the web interface back again.

I am not sure if it is related, but thought I would throw it in as well, just in case.

Mlovelace · Aug 23, 2017

It looks like da14 had some reallocations. I'd run a long test on it to see if there are any other bad sectors. Should be fine just keep an eye on it.
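
As a rough illustration of where those reallocations show up, the sketch below pulls the raw values of the sector-health attributes (IDs 5, 196, 197 and 198) out of `smartctl -A` text. The function name `sector_attrs` is made up for this example, and the field positions assume the ten-column attribute table that smartctl 6.5 prints in the first post.

```shell
#!/bin/sh
# sector_attrs: hypothetical helper, not part of smartmontools.
# Reads `smartctl -A` text on stdin and prints NAME=RAW_VALUE for the
# attributes that track reallocated, pending and uncorrectable sectors.
sector_attrs() {
    awk '$1 == 5 || $1 == 196 || $1 == 197 || $1 == 198 { print $2 "=" $10 }'
}

# Sample rows copied from the da14 output in the first post.
sector_attrs <<'EOF'
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       5
196 Reallocated_Event_Count 0x0032   196   196   000    Old_age   Always       -       4
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
EOF
```

On a live system the same filter could be fed with `smartctl -A /dev/da14 | sector_attrs`, and the long test itself is started with `smartctl -t long /dev/da14`.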

HeloJunkie · Aug 23, 2017

Thank you @Mlovelace, long test in operation now, I'll see what it says!!

HeloJunkie · Aug 23, 2017

OK, so I run a long SMART test twice a month. Here is the output from Aug 10th:

Code:

########## SMART status report summary for all drives on server PLEXNAS ##########

+------+------------------+----+-----+-----+-----+-------+-------+--------+------+----------+------+-------+----+
|Device|Serial            |Temp|Power|Start|Spin |ReAlloc|Current|Offline |Seek  |Total     |High  |Command|Last|
|      |Number            |    |On   |Stop |Retry|Sectors|Pending|Uncorrec|Errors|Seeks     |Fly   |Timeout|Test|
|      |                  |    |Hours|Count|Count|       |Sectors|Sectors |      |          |Writes|Count  |Age |
+------+------------------+----+-----+-----+-----+-------+-------+--------+------+----------+------+-------+----+
|da0   |xxxxxxxxxxxxxx    | 26 |20355|   65|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   1|
|da1   |xxxxxxxxxxxxxx    | 26 |20407|   80|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   1|
|da2   |xxxxxxxxxxxxxx    | 25 |20406|   79|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   1|
|da3   |xxxxxxxxxxxxxx    | 26 |20405|   78|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   1|
|da4   |xxxxxxxxxxxxxx    | 27 |20406|   78|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   1|
|da5   |xxxxxxxxxxxxxx    | 26 |20406|   78|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   1|
|da6   |xxxxxxxxxxxxxx    | 27 |20406|   78|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   1|
|da7   |xxxxxxxxxxxxxx    | 27 |20405|   78|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   1|
|da8   |xxxxxxxxxxxxxx    | 26 |20406|   78|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   1|
|da9   |xxxxxxxxxxxxxx    | 27 |20406|   79|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   1|
|da10  |xxxxxxxxxxxxxx    | 27 |20406|   78|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   1|
|da11  |xxxxxxxxxxxxxx    | 26 |20406|   78|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   1|
|da12  |xxxxxxxxxxxxxx    | 22 | 7443|   17|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   1|
|da13  |xxxxxxxxxxxxxx    | 22 | 7442|   18|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   1|
|da14  |xxxxxxxxxxxxxx    | 22 | 7442|   18|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   1|
|da15  |xxxxxxxxxxxxxx    | 23 | 7442|   17|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   1|
|da16  |xxxxxxxxxxxxxx    | 22 | 7442|   17|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   1|
|da17  |xxxxxxxxxxxxxx    | 21 | 7442|   17|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   1|
|da18  |xxxxxxxxxxxxxx    | 23 | 3563|   12|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   1|
|da19  |xxxxxxxxxxxxxx    | 23 | 3563|   13|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   1|
|da20  |xxxxxxxxxxxxxx    | 25 | 3563|   13|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   1|
|da21  |xxxxxxxxxxxxxx    | 25 | 3563|   13|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   1|
|da22  |xxxxxxxxxxxxxx    | 25 | 3563|   13|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   1|
|da23  |xxxxxxxxxxxxxx    | 25 | 3563|   13|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   1|
+------+------------------+----+-----+-----+-----+-------+-------+--------+------+----------+------+-------+----+

Here is the one that was just run on the 22nd (which I assume is why I received the email alerts). It looks like TWO drives are having issues:

Code:

########## SMART status report summary for all drives on server PLEXNAS ##########

+------+------------------+----+-----+-----+-----+-------+-------+--------+------+----------+------+-------+----+
|Device|Serial            |Temp|Power|Start|Spin |ReAlloc|Current|Offline |Seek  |Total     |High  |Command|Last|
|      |Number            |    |On   |Stop |Retry|Sectors|Pending|Uncorrec|Errors|Seeks     |Fly   |Timeout|Test|
|      |                  |    |Hours|Count|Count|       |Sectors|Sectors |      |          |Writes|Count  |Age |
+------+------------------+----+-----+-----+-----+-------+-------+--------+------+----------+------+-------+----+
|da0   |xxxxxxxxxxxxxx    | 27 |20676|   66|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   0|
|da1   |xxxxxxxxxxxxxx    | 28 |20728|   81|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   0|
|da2   |xxxxxxxxxxxxxx    | 26 |20728|   80|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   0|
|da3   |xxxxxxxxxxxxxx    | 27 |20727|   79|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   0|
|da4   |xxxxxxxxxxxxxx    | 28 |20727|   79|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   0|
|da5   |xxxxxxxxxxxxxx    | 27 |20727|   79|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   0|
|da6   |xxxxxxxxxxxxxx    | 28 |20727|   79|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   0|
|da7   |xxxxxxxxxxxxxx    | 27 |20727|   79|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   0|
|da8   |xxxxxxxxxxxxxx    | 27 |20727|   79|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   0|
|da9   |xxxxxxxxxxxxxx    | 28 |20727|   80|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   0|
|da10  |xxxxxxxxxxxxxx    | 28 |20728|   79|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   0|
|da11  |xxxxxxxxxxxxxx    | 27 |20728|   79|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   0|
|da12  |xxxxxxxxxxxxxx    | 24 | 7764|   17|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   0|
|da13  |xxxxxxxxxxxxxx    | 23 | 7763|   18|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   0|
|da14 ?|xxxxxxxxxxxxxx    | 23 | 7763|   18|    0|      5|      0|       0|     0|        37|   N/A|    N/A|   0|
|da15 ?|xxxxxxxxxxxxxx    | 24 | 7763|   17|    0|      7|      0|       0|   N/A|       N/A|   N/A|    N/A|   0|
|da16  |xxxxxxxxxxxxxx    | 23 | 7763|   17|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   0|
|da17  |xxxxxxxxxxxxxx    | 23 | 7763|   17|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   0|
|da18  |xxxxxxxxxxxxxx    | 25 | 3885|   12|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   0|
|da19  |xxxxxxxxxxxxxx    | 25 | 3885|   13|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   0|
|da20  |xxxxxxxxxxxxxx    | 26 | 3885|   13|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   0|
|da21  |xxxxxxxxxxxxxx    | 26 | 3884|   13|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   0|
|da22  |xxxxxxxxxxxxxx    | 26 | 3884|   13|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   0|
|da23  |xxxxxxxxxxxxxx    | 26 | 3884|   13|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   0|
+------+------------------+----+-----+-----+-----+-------+-------+--------+------+----------+------+-------+----+

It seems weird that two drives right next to each other start showing errors at the exact same moment. I guess I will kick off another long test on da15 as well.

Now as I understand it, these are errors INTERNAL to the drives correct? These errors would not be caused by a faulty backplane, SAS card, cable, etc...? Is that correct?

Thanks!
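
Spotting the changed drives by eye in a 24-row report is easy to get wrong, so here is a minimal sketch that lists only the drives with a non-zero ReAlloc Sectors count. The helper name `realloc_drives` is hypothetical, and it assumes the "|"-separated layout of the summary report above, where ReAlloc Sectors is the eighth "|"-delimited field.

```shell
#!/bin/sh
# realloc_drives: hypothetical helper for the summary report shown above.
# Reads the "|"-separated table on stdin and prints "device count" for
# every drive whose ReAlloc Sectors column (8th "|" field) is non-zero.
realloc_drives() {
    awk -F'|' '$2 ~ /^da[0-9]+/ && $8 + 0 > 0 {
        split($2, name, " ")   # drop the "?" marker the report appends
        printf "%s %d\n", name[1], $8
    }'
}

# Three rows taken from the Aug 22 report.
realloc_drives <<'EOF'
|da13  |xxxxxxxxxxxxxx    | 23 | 7763|   18|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   0|
|da14 ?|xxxxxxxxxxxxxx    | 23 | 7763|   18|    0|      5|      0|       0|     0|        37|   N/A|    N/A|   0|
|da15 ?|xxxxxxxxxxxxxx    | 24 | 7763|   17|    0|      7|      0|       0|   N/A|       N/A|   N/A|    N/A|   0|
EOF
```

Run against the Aug 10 report, the same filter would print nothing, since every ReAlloc Sectors value there is 0.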

wblock · Aug 23, 2017

HeloJunkie said:
Now as I understand it, these are errors INTERNAL to the drives correct? These errors would not be caused by a faulty backplane, SAS card, cable, etc...? Is that correct?

Yes. Sectors are reallocated when the drive writes a block, then tries to read it and gets a bad value. So it uses one of the nearby spare blocks instead.

As to why these two drives are both getting reallocated blocks now, could be due to age or shared environment. Google had a report years back that said exposure to vibration was a significant problem, so maybe they both are near a vibrating fan or something. How old are the drives? Might this just be early failures on the near side of the bathtub curve? Are the drives still in warranty?

HeloJunkie · Aug 23, 2017

well....something does not look good. I am running a long test on da14 and just received this email:

Code:

Device: /dev/da14 [SAT], Self-Test Log error count increased from 0 to 1

HeloJunkie · Aug 23, 2017

wblock said:
Yes. Sectors are reallocated when the drive writes a block, then tries to read it and gets a bad value. So it uses one of the nearby spare blocks instead.

As to why these two drives are both getting reallocated blocks now, could be due to age or shared environment. Google had a report years back that said exposure to vibration was a significant problem, so maybe they both are near a vibrating fan or something. How old are the drives? Might this just be early failures on the near side of the bathtub curve? Are the drives still in warranty?

These drives are less than a year old and are (along with all the rest of the drives) installed in a Supermicro 5028R server chassis mounted in a computer rack in our computer room.

Mlovelace · Aug 23, 2017

HeloJunkie said:
These drives are less than a year old

Well, on the bright side they are still under warranty. :D

You can dd a write to the bad sector and force the drive to reallocate, then run another long test to make sure there aren't more.

Code:

dd if=/dev/zero of=/dev/ada2 bs=512 count=1 seek=(LBA_of_first_error from smart results) conv=noerror,sync
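
The command above uses `/dev/ada2` and a placeholder for the seek value. As a hedged sketch of how it could be filled in for this thread, the helper below only prints the resulting command and never runs it. `zero_sector_cmd` is a made-up name, the LBA is the one that shows up in the self-test log later in the thread, and the sketch assumes 512-byte logical sectors to match `bs=512`.

```shell
#!/bin/sh
# zero_sector_cmd: hypothetical helper. Prints (does NOT execute) a dd
# command that overwrites one 512-byte logical sector on /dev/<dev>.
zero_sector_cmd() {
    dev=$1
    lba=$2
    # Refuse anything that is not a plain decimal LBA.
    case $lba in
        ''|*[!0-9]*) echo "invalid LBA: $lba" >&2; return 1 ;;
    esac
    echo "dd if=/dev/zero of=/dev/$dev bs=512 count=1 seek=$lba conv=noerror,sync"
}

# da14 is the affected drive in this thread.
zero_sector_cmd da14 100549960
```

Printing the command first makes it easy to double-check the device name before anything is written to the disk.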

HeloJunkie · Aug 23, 2017

hahah..that is true.....guess it is best to just replace them or is it normal to have a few errors like this...?

Mlovelace · Aug 23, 2017

HeloJunkie said:
hahah..that is true.....guess it is best to just replace them or is it normal to have a few errors like this...?

I have had a couple drives that had pending bad sector errors. After dd'ing them and running long tests to make sure there weren't additional bad sectors they were fine. The last one was about a year ago and the drive is still in service.

HeloJunkie · Aug 23, 2017

Thanks, I will give that a try. Do I need to offline them or anything before doing the dd?

Mlovelace · Aug 23, 2017

HeloJunkie said:
Thanks, I will give that a try. Do I need to offline them or anything before doing the dd?

Nope, no need. Once the long test is done, following the dd operation, kick off a scrub to make sure the sector didn't have data on it... If it did, it should self-heal from parity.
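
For the scrub step, a small sketch of how the result might be checked afterwards. `scrub_summary` is a hypothetical helper, and it assumes the `scan:` line format that `zpool status` prints in the first post of this thread.

```shell
#!/bin/sh
# scrub_summary: hypothetical helper. Reads `zpool status` text on stdin
# and reports the repaired amount and error count from the "scan:" line.
# Assumes the line format shown in the first post of this thread.
scrub_summary() {
    awk '$1 == "scan:" && $2 == "scrub" && $3 == "repaired" {
        print "repaired=" $4, "errors=" $8
    }'
}

scrub_summary <<'EOF'
  pool: vol1
 state: ONLINE
  scan: scrub repaired 0 in 7h40m with 0 errors on Tue Aug 15 09:40:27 2017
EOF
```

The scrub itself is started with `zpool scrub vol1`, and `zpool status vol1 | scrub_summary` would be the live equivalent. While a scrub is still running the scan line reads differently, so the filter prints nothing until it has finished.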

HeloJunkie · Aug 23, 2017

Perfect, did the dd, running the long tests now again....

HeloJunkie · Aug 23, 2017

OK, after the dd and the long smart, I see this:

Code:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      7770         100549961
# 2  Extended offline    Completed: read failure       90%      7763         100549960

Do I just keep doing the dd for each sector?

Thanks

Mlovelace · Aug 23, 2017

HeloJunkie said:
OK, after the dd and the long smart, I see this:

Code:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      7770         100549961
# 2  Extended offline    Completed: read failure       90%      7763         100549960

Do I just keep doing the dd for each sector?

Thanks

Yep, the other option being to just RMA the drive.
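
To keep track of which sectors have turned up so far across repeated long tests, something like the following could pull the distinct LBAs out of the self-test log. The name `failed_lbas` is hypothetical, and the sketch assumes the log layout posted above, where the LBA is the last field of each "read failure" row.

```shell
#!/bin/sh
# failed_lbas: hypothetical helper. Reads a SMART self-test log on stdin
# and prints each distinct LBA_of_first_error, one per line, ascending.
failed_lbas() {
    awk '/read failure/ && $NF ~ /^[0-9]+$/ { print $NF }' | sort -n -u
}

failed_lbas <<'EOF'
# 1  Extended offline    Completed: read failure       90%      7770         100549961
# 2  Extended offline    Completed: read failure       90%      7763         100549960
EOF
```

On the live system the input would come from `smartctl -l selftest /dev/da14`. Each long test stops at the first sector it cannot read, so the list only grows by one entry per run.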

HeloJunkie · Aug 23, 2017

Can I assume as a result of this:

Code:

5 Reallocated_Sector_Ct 0x0033 200 200 140	Pre-fail Always	 -	 5

That I could do a dd and hit them all at once? I guess that would assume that those sectors were all right next to each other...

Mlovelace · Aug 23, 2017

HeloJunkie said:
Can I assume as a result of this:

Code:
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 5

That I could do a dd and hit them all at once? I guess that would assume that those sectors were all right next to each other...

Unfortunately you can't assume the sectors are contiguous, so it can be a time consuming process on large drives.
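
Contiguity can at least be checked for the LBAs already known from the self-test log. The sketch below groups a list of LBAs into runs of adjacent sectors and prints each run as "start count". `lba_runs` is a made-up helper name, and it says nothing about bad sectors that no test has reported yet.

```shell
#!/bin/sh
# lba_runs: hypothetical helper. Reads one LBA per line on stdin and
# prints "start count" for each run of consecutive LBAs.
lba_runs() {
    sort -n -u | awk '
        NR == 1        { start = $1; prev = $1; next }
        $1 == prev + 1 { prev = $1; next }
                       { print start, prev - start + 1; start = $1; prev = $1 }
        END            { if (NR > 0) print start, prev - start + 1 }'
}

# The two LBAs from the self-test log in this thread are adjacent.
printf '100549961\n100549960\n' | lba_runs
```

A run of adjacent sectors could in principle be covered by a single dd with `count` set to the run length, but each new long test may still uncover sectors elsewhere on the disk.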

HeloJunkie · Aug 23, 2017

Understood, time to RMA the drives...

Thanks for all your help.

Stux · Aug 23, 2017

You should RMA these drives if they are in warranty. If not, then you would perhaps put up with the faulty sectors and errors....

HeloJunkie · Aug 24, 2017

Thanks Guys - RMA it is....
