SMART error (OfflineUncorrectableSector) Offline uncorrectable sectors

TXAG26 · Jul 11, 2014

I receive the following e-mail every time my FreeNAS 9.2.1.5 box is restarted, and the messages that scroll on the boot screen keep showing the same error and indicating that da2 is "offline".

However, when I open a shell and run "zpool status", it shows everything is "online" and no errors. The same "online" and no errors status is showing in the GUI in Storage>Volumes>View Disks.

I have a cold spare (brand new, unopened Seagate ST4000VN000), but I am hesitant to swap it in since there are indications the entire array is online and OK, plus it is a 6-disk ZFS Raidz2.

Any ideas on what the true issue is? Anything to worry about? The single drive throwing these errors has been in service for approximately 9 months.

FreeNAS Box & Array Details: 6-Drive (6x 4TB Seagate NAS drives), ZFS Raidz2,
FreeNAS 9.2.1.5, LSI 2308 HBA mode, 16GB, Xeon E3-1240.

E-mail Error:

Code:

The following warning/error was logged by the smartd daemon:

Device: /dev/da2 [SAT], 16 Offline uncorrectable sectors

Device info:
ST4000VN000-xxxxxx, S/N:xxxxxxx, WWN:x-xxxxxx-xxxxxxxx, FW:SC43, 4.00 TB

Error on log-in screen when machine is rebooted:

Code:

20:13:29 FreeNAS smartd[3037]: Device: /dev/da2 [SAT], 16 Currently unreadable (pending) sectors
20:13:29 FreeNAS smartd[3037]: Device: /dev/da2 [SAT], 16 Offline uncorrectable sectors

Shell command "zpool status":

Code:

[root@FreeNAS ~]# zpool status
  pool: zpool1
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.

config:

        NAME                          STATE     READ WRITE CKSUM
        zpool1                        ONLINE       0     0     0
          raidz2                      ONLINE       0     0     0
            gptid/xxxxxxxxxxxxxxxxxx  ONLINE       0     0     0
            gptid/xxxxxxxxxxxxxxxxxx  ONLINE       0     0     0
            gptid/xxxxxxxxxxxxxxxxxx  ONLINE       0     0     0
            gptid/xxxxxxxxxxxxxxxxxx  ONLINE       0     0     0
            gptid/xxxxxxxxxxxxxxxxxx  ONLINE       0     0     0
            gptid/xxxxxxxxxxxxxxxxxx  ONLINE       0     0     0

errors: No known data errors

DrKK · Jul 11, 2014

Sir,

I think you might be confusing things?

You have "offline uncorrectable sectors". It's not "offline: uncorrectable sectors". I don't see any indication in what you have included above that the drive is "offline".

Let me tell you what's going on. Your S.M.A.R.T. daemon has detected that the disk has some bad sectors it could not read, and those sectors are in the queue to be remapped to healthy spare sectors, which typically happens the next time they are written. Normally, on a healthy drive, you would *NOT* see this, so this is bad.

The disk itself, however, is coming up fine, overall, in the pool.

You should keep a very, very close eye on this. 90% of the time, if the number of uncorrectable sectors is anything but zero, your disk *is* going to start dying soon. If you see *ANY* increase in pending sectors, or other indications of failure, replace that drive. If this were me, I'd be replacing the drive RIGHT NOW.

Now, you do have a 6-disk RAID-Z2 vdev, and that's pretty safe. You can try to ride it out, if you want. But if you do have the cold spare, I'd replace it, now.

Let me ask you this. Your zpool status does not show any scrub results. Normally, it will show the result of the last scrub. Are you not doing ZFS scrubs?!!??!?!?!!?!??!?!!? You should do one *IMMEDIATELY* and see how much (if any) damage you have, and how correctable it is. You should be automatically doing 2 ZFS scrubs per month on this pool, and probably should be scheduling a monthly "long" SMART test on top of it. I bet you're not doing any of that.

danb35 · Jul 12, 2014

As DrKK said, you should be running scrubs regularly, though I don't expect one will show data corruption. You should also be running SMART self-tests on your drives--are you?

What is the output of smartctl -a on the affected drive?

joeschmuck · Jul 12, 2014

In short, replace the drive, identifying it by the serial number for da2; it is becoming defective. It's a good thing you have both a RAIDZ2 and a spare drive available. Use the User Manual to replace it and do not hot swap the drive; that is bound to cause more problems than you could shake a stick at. If your drive is still under warranty then that is a good thing: get it replaced and you will have another emergency spare at hand.

The long answer: when you run

Code:

smartctl -a /dev/da2

look at IDs 5, 197 and 198; their raw values should all be zero. I'm certain you will find that 197 and 198 are at 16, which is not good.
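For anyone who would rather script that check than eyeball the table, here is a minimal sketch. It assumes the attribute table has the ten-column layout that smartctl -a prints for this drive, with the raw value in the last column. The names WATCHED_IDS and nonzero_watched_attributes are made up for illustration, and the sample rows use the values from the output posted later in this thread.

```python
# Sketch only: flag non-zero raw values for the attribute IDs named above.
# WATCHED_IDS and nonzero_watched_attributes are illustrative names,
# not part of smartctl or FreeNAS.
WATCHED_IDS = {5, 197, 198}


def nonzero_watched_attributes(smartctl_output):
    """Return {attribute_id: raw_value} for watched IDs whose raw value is not zero."""
    flagged = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID; the raw value is the 10th column.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id = int(fields[0])
        raw = fields[9]
        if attr_id in WATCHED_IDS and raw.isdigit() and int(raw) != 0:
            flagged[attr_id] = int(raw)
    return flagged


# A few rows in the layout smartctl -a prints (values as posted in this thread).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       264
  9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       5417
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       16
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       16
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""

flagged = nonzero_watched_attributes(SAMPLE)
print(flagged)
```

An empty result means all three attributes are at zero. Feeding it real output (for example, captured from smartctl -a /dev/da2) is left out here on purpose.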

The pool showing as Online means the drive is still working as it should and trying to map out failing sectors. Typically, though, once you start seeing a few failures, many more follow, and within a few hours you can't access your hard drive.

If you are not doing this now, I recommend you do so... Daily short SMART tests on all drives and weekly Long test on all drives. You should know what your other drives are doing with respect to failures.

TXAG26 · Jul 12, 2014

Thank you for the replies. The zpool was in the middle of a scrub last night when I originally posted. The zpool has been conducting scrubs twice a month and short smart tests were being done once a week I think. I've changed the short smart tests to "daily" and have added a "weekly" long smart test.

Scrub Results from this morning:
scan: scrub repaired 0 in 8h26m with 0 errors on Sat Jul 12 04:25:03 2014

Results from command: "smartctl -a /dev/da2"

Code:

=== START OF INFORMATION SECTION ===
Device Model:     ST4000VN000-xxxxx
Serial Number:    Zxxxxxx
LU WWN Device Id: xxxxxxxxxx
Firmware Version: SC43
User Capacity:    xxxxxxxxxxx bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5900 rpm
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sat Jul 12 13:18:37 2014 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
       
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 248) Self-test routine in progress...
                                        80% of test remaining.
Total time to complete Offline
data collection:                (  107) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 535) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x10bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   114   099   006    Pre-fail  Always       -       71509192
  3 Spin_Up_Time            0x0003   091   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       11
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       264
  7 Seek_Error_Rate         0x000f   072   060   030    Pre-fail  Always       -       16869763
  9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       5417
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       11
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   081   081   000    Old_age   Always       -       19
190 Airflow_Temperature_Cel 0x0022   066   051   045    Old_age   Always       -       34 (Min/Max 28/35)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       11
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       11
194 Temperature_Celsius     0x0022   034   049   000    Old_age   Always       -       34 (0 22 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       16
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       16
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       70%      5417         3180340968
# 2  Short offline       Completed without error       00%      5367         -
# 3  Short offline       Completed without error       00%      5355         -
# 4  Short offline       Completed without error       00%      5248         -
# 5  Short offline       Completed without error       00%      5236         -
# 6  Short offline       Completed without error       00%      5128         -
# 7  Short offline       Completed without error       00%      5116         -
# 8  Short offline       Completed without error       00%      5008         -
# 9  Short offline       Completed without error       00%      4996         -
#10  Short offline       Completed without error       00%      4888         -
#11  Short offline       Completed without error       00%      4876         -
#12  Short offline       Completed without error       00%      4768         -
#13  Short offline       Completed without error       00%      4756         -
#14  Short offline       Completed without error       00%      4648         -
#15  Short offline       Completed without error       00%      4636         -
#16  Short offline       Completed without error       00%      4528         -
#17  Short offline       Completed without error       00%      4516         -
#18  Short offline       Completed without error       00%      4385         -
#19  Short offline       Completed without error       00%      4373         -
#20  Short offline       Completed without error       00%      4265         -
#21  Short offline       Completed without error       00%      4253         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

diedrichg · Jul 12, 2014

I'm sorry but I disagree with how the GUI notifies us that things are awry. This is not the first time I have heard about a disk dying but the volume status, disk status and the big green button say that everything is a-okay. NO! Everything is not alright! There is a disk throwing errors! All of those reporting modules should be showing errors. I consider this a bug! I strongly urge anybody with this issue to report it as a bug. We need the devs to fix this.

joeschmuck · Jul 12, 2014

You can disregard IDs 1 and 7; on many drives these raw values are inflated and do not mean that you have a failed drive. However, ID 5 says you have a lot of sector issues and, as I said before, IDs 197 and 198 are at a count of 16.

My advice, replace your drive and RMA it since you have very few hours on it.

TXAG26 · Jul 12, 2014

ID 5 Reallocated_Sector_Ct = 264 looks troubling. My other 5 disks have zero (0) for this value.

This drive just failed a long smart test only 30% into the test. Yet the FreeNAS GUI is still showing a GREEN LIGHT for this pool. WTH!

Code:

Self-test execution status:      ( 119) The previous self-test completed having
                                        the read element of the test failed.

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       70%      5417         3180340968
# 2  Short offline       Completed without error       00%      5367         -
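The "30% into the test" figure comes from the Remaining column: 70% remaining means the test stopped 30% of the way through. A small sketch of pulling failed tests out of the self-test log follows. It assumes the column layout shown above and treats any status containing "failure" as a failed test, which is my own simplification. ROW and failed_self_tests are illustrative names.

```python
import re

# Sketch only: the pattern assumes the self-test log layout shown above,
# where columns are separated by runs of two or more spaces.
ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+(?P<test>.+?)\s{2,}(?P<status>.+?)\s+"
    r"(?P<remaining>\d+)%\s+(?P<hours>\d+)\s+(?P<lba>\S+)\s*$"
)


def failed_self_tests(log_text):
    """Return one dict per log entry whose status mentions a failure."""
    failures = []
    for line in log_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue
        status = match.group("status")
        # Assumption: a failed test has "failure" in its status text.
        if "failure" not in status.lower():
            continue
        lba = match.group("lba")
        failures.append({
            "test": match.group("test"),
            "status": status,
            # Percent of the test completed before it stopped.
            "percent_done": 100 - int(match.group("remaining")),
            "power_on_hours": int(match.group("hours")),
            "first_error_lba": int(lba) if lba.isdigit() else None,
        })
    return failures


SAMPLE_LOG = """\
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       70%      5417         3180340968
# 2  Short offline       Completed without error       00%      5367         -
"""

failures = failed_self_tests(SAMPLE_LOG)
for entry in failures:
    print(entry)
```

For the two log lines above, this reports a single failed extended test that stopped 30% of the way through, at LBA 3180340968.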

joeschmuck · Jul 12, 2014

You need to treat the health of the pool and the health of the hard drive as separate things. I too think the Alert light should flash yellow if any SMART check fails, but it doesn't. As long as the pool is healthy, it will remain green, so you will only get a Yellow alert once your hard drive completely fails to work. It's a good thing the e-mails exist. You should submit a feature request or bug in the tracker if you want the alert to report SMART failures. I think you will find the problem is that, with so many hard drives out there and results that are not always standardized across drive manufacturers, it would be difficult to implement what we would all like. It could be done, but I think it would be a lot of work.

cyberjock · Jul 12, 2014

diedrichg said:
I'm sorry but I disagree with how the GUI notifies us that things are awry. This is not the first time I have heard about a disk dying but the volume status, disk status and the big green button say that everything is a-okay. NO! Everything is not alright! There is a disk throwing errors! All of those reporting modules should be showing errors. I consider this a bug! I strongly urge anybody with this issue to report it as a bug. We need the devs to fix this.

ZFS won't kick a disk from a pool for some bad sectors or some errors. If you get a boatload of bad sectors and a boatload of errors it WILL eventually kick it. But the threshold is pretty high.

SMART on the other hand reports to you if any error occurs.... any.

So the thresholds are different and SHOULD be different. You don't want ZFS kicking a drive out of a pool because of a single error, but you *want* SMART telling you if there is one error as that may be an indicator of a possible pre-fail condition (which is one of the main reasons that SMART was invented).

Use the right tool for the job and know what your tools are for and what they do and you'll be happy. Yes, the developers are going to be doing some changes to SMART and ZFS reporting. I don't know what their plan is, but nobody has discussed cross-pollinating SMART reporting with ZFS. You *don't* want these to be crossed and you *don't* want to see disks kicked from pools until they are failed very hard.

Skilty · Jul 17, 2014

Just want to mention that cyberjock wrote an excellent post on scheduling scrubs and short/long SMART tests. I shamelessly used it and have so far replaced two failed disks (older 3TB Seagate Barracuda) with WD Red 3TB NAS drives as I got early warnings they were failing.

TXAG26 · Jul 19, 2014

Skilty said:
Just want to mention that cyberjock wrote an excellent post on scheduling scrubs and short/long SMART tests. I shamelessly used it and have so far replaced two failed disks (older 3TB Seagate Barracuda) with WD Red 3TB NAS drives as I got early warnings they were failing.

@Skilty, I noticed that we have nearly identical systems. What kind of resilver speeds did you see when you replaced those two failed disks?

I'm in the middle of switching out the failed disk I posted about last week and am over halfway through the resilver process. It has been running at about 305M/s and should finish in a total of about 7-8 hours for 8TB of actual data.
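As a rough sanity check on that estimate, here is the arithmetic as a small sketch. It assumes the T and M figures are binary units (TiB and MiB, which I believe is what zpool reports) and that the rate stays roughly constant for the whole resilver. The function name resilver_hours is made up for illustration.

```python
def resilver_hours(data_tib, rate_mib_per_s):
    """Estimate hours to resilver data_tib TiB at a steady rate in MiB/s."""
    total_mib = data_tib * 1024 * 1024  # TiB -> MiB
    seconds = total_mib / rate_mib_per_s
    return seconds / 3600


# 8T of data at roughly 305M/s, as described above.
estimate = resilver_hours(8, 305)
print(f"about {estimate:.1f} hours")
```

With those inputs it comes out a little over seven and a half hours, which lines up with the 7-8 hour figure.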

ZFS RaidZ2 (6x Seagate 4TB NAS, ST4000VN000)
Supermicro X10SL7 w/LSI 2308 (IT/HBA mode) 6Gb/s
Xeon E3-1240v3 w/16GB DDR3 ECC
FreeNAS 9.2.1.6

Skilty · Jul 19, 2014

Just doing a switch of disks at the moment and I am getting:

status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Sat Jul 19 15:00:12 2014
7.85T scanned out of 8.16T at 334M/s, 0h15m to go
1.31T resilvered, 96.27% done

So 7-8 hours sounds about right, considering you are using 4TB drives that spin at 5900rpm, whereas mine are 5400rpm.

I am also using the Intel ports in addition to the LSI ports. They are only SATA2, but the drives don't get anywhere near SATA3 speeds anyway.

TXAG26 · Jul 19, 2014

Thanks for the reply and the info about your switch.

FWIW, I'm running FreeNAS on ESXi 5.5 with the LSI2308 passed through (DirectPath I/O), which probably accounts for the slightly lower performance.

These Supermicro X10SL7 boards are outstanding. All I've added to them are Intel i350 dual-port NICs, as mine serves multiple roles. I do wish this board wasn't maxed at 32GB RAM, but that's an Intel gripe!

TXAG26 · Jul 26, 2014

Wow, talk about a rough week for hardware failures! After swapping out the HDD that was throwing all the errors and resilvering the RaidZ2 pool, I shut the SM X10SL7-F down and did a cold reboot. Wouldn't you know, the darn thing wouldn't come back up! The screen would stay blank and a steady 4 bios beeps would sound. One each second, for four seconds. Found out this was a fatal memory error as not even the bios screen would come up.

Luckily, I had a spare 16GB kit that I was about to install in my Supermicro X10SAE workstation. Got it installed and the server booted right back up! I tried the defective ram on a different X10 board and had the same 4 beeps, so I'm pretty sure the ram has failed. I've never had ram fail before, and it was just dumb luck on the timing to have a spare set handy!

Once the server was back online, I fired up IPMI View and noticed some Correctable ECC events in the IPMI System Event Log! Scarily, these dates/times correlate exactly to when the monthly zpool scrub fires off! Yikes! I have since completed a new scrub, so would it be safe to say I likely dodged any possible data corruption? Thank goodness for ECC ram. Anyone who runs ZFS without it is asking for serious trouble!

Code:

202,System Event,06/24/2014 06:45:22 Tue,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
203,System Event,06/24/2014 06:48:19 Tue,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
204,System Event,06/24/2014 07:16:10 Tue,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
205,System Event,06/24/2014 07:16:45 Tue,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
206,System Event,06/24/2014 07:24:50 Tue,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
207,System Event,06/24/2014 07:25:39 Tue,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
208,System Event,06/24/2014 07:31:36 Tue,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
209,System Event,06/24/2014 07:33:58 Tue,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
210,System Event,06/24/2014 07:33:58 Tue,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
211,System Event,06/24/2014 07:34:43 Tue,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
212,System Event,06/24/2014 07:34:58 Tue,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
213,System Event,06/24/2014 07:49:10 Tue,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
214,System Event,06/24/2014 08:19:54 Tue,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
215,System Event,06/24/2014 08:22:32 Tue,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
216,System Event,06/24/2014 09:32:00 Tue,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
217,System Event,06/24/2014 09:50:07 Tue,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
218,System Event,06/24/2014 10:09:11 Tue,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
219,System Event,06/24/2014 18:19:01 Tue,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
220,System Event,06/24/2014 18:19:02 Tue,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
221,System Event,06/25/2014 00:25:37 Wed,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
222,System Event,06/25/2014 00:25:37 Wed,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
223,System Event,06/25/2014 01:14:29 Wed,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
224,System Event,06/25/2014 05:15:03 Wed,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
225,System Event,06/25/2014 05:15:04 Wed,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
226,System Event,06/25/2014 07:01:47 Wed,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
227,System Event,06/25/2014 08:06:15 Wed,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
228,System Event,06/25/2014 18:46:34 Wed,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
229,System Event,06/26/2014 14:42:42 Thu,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
230,System Event,06/26/2014 14:42:43 Thu,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
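To make it easier to line events like these up against a scrub schedule, here is a minimal sketch that tallies correctable ECC events per day and per DIMM. It assumes the comma-separated export format shown above (six fields, timestamp in the third, description in the sixth). ECC_EVENT and correctable_ecc_counts are illustrative names, and the scrub start times would still have to come from somewhere else.

```python
import csv
import re
from collections import Counter
from datetime import datetime

# Sketch only: matches the description text seen in the log above.
ECC_EVENT = re.compile(r"Event = Correctable ECC@(\w+)\((\w+)\)")


def correctable_ecc_counts(log_text):
    """Count correctable ECC events keyed by (ISO date, DIMM label)."""
    counts = Counter()
    for row in csv.reader(log_text.splitlines()):
        if len(row) < 6:
            continue
        match = ECC_EVENT.search(row[5])
        if not match:
            continue
        # Timestamps look like "06/24/2014 06:45:22 Tue"; drop the weekday suffix.
        when = datetime.strptime(row[2][:19], "%m/%d/%Y %H:%M:%S")
        counts[(when.date().isoformat(), match.group(1))] += 1
    return counts


# Four of the entries from the log above.
SAMPLE_SEL = """\
202,System Event,06/24/2014 06:45:22 Tue,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
203,System Event,06/24/2014 06:48:19 Tue,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
221,System Event,06/25/2014 00:25:37 Wed,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
229,System Event,06/26/2014 14:42:42 Thu,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB2(CPU1)
"""

counts = correctable_ecc_counts(SAMPLE_SEL)
for (day, dimm), n in sorted(counts.items()):
    print(day, dimm, n)
```

Every event in the log points at the same slot (DIMMB2), so a per-DIMM tally like this is a quick way to see whether one stick is responsible.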

cyberjock · Jul 26, 2014

Yeah, correctable means it fixed it. Uncorrectable would have been a "bad thing".

You basically got your butt saved by ECC. ;)

TXAG26 · Jul 26, 2014

cyberjock said:
Yeah, correctable means it fixed it. Uncorrectable would have been "bad thing".

You basically got your butt saved by ECC. ;)

Indeed I did! And BTW, thank you for writing "The Guide"! I went through it when I was first making the move to FreeNAS and the recommendations, including ECC ram definitely got the build headed down the right path!

cyberjock · Jul 26, 2014

You're welcome! Glad you took the info to heart. Looks like it saved you. :D

joeschmuck · Jul 27, 2014

This is one of those postings folks should reference as it has an actual ECC error message. I've always heard of them but never seen one.

TXAG26 · Jul 27, 2014

I too have never seen a correctable ECC error in the logs, much less have a stick of ram fail in this manner. The way ECC worked in my instance is similar to SMART Errors on a HDD - both give you a little heads up that something is in the process of failing (if you're paying attention)!

I will definitely be checking the main system event logs through IPMI much more frequently!
