refreshing the value of LBA_of_first_error

neveragain · Oct 20, 2013

Hi,

Yesterday my FreeNAS server started reporting the following:

SMART error (CurrentPendingSector) detected
Device: /dev/ada0, 8 Currently unreadable (pending) sectors
SMART error (OfflineUncorrectableSector) detected
Device: /dev/ada0, 8 Offline uncorrectable sectors

I've researched the error and found the following guide referenced by one of the forum threads here: http://daemon-notes.com/articles/system/smartmontools/current-pending

I've followed the procedure and was able to refresh 1 sector by using the following:

Code:

dd if=/dev/ada0 of=/dev/ada0 bs=512 count=1 iseek=2643817536 oseek=2643817536 conv=noerror,sync

After that, I re-ran a long test on the disk to try to get the next sector ID. However, once the test completed, I got the following:

Code:

smartctl -l selftest /dev/ada0
 
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error      00%      8550        -
# 2  Extended offline    Completed: read failure      70%      8541        2643817536
# 3  Short offline      Completed without error      00%      8539        -
# 4  Extended offline    Completed without error      00%      8434        -
# 5  Short offline      Completed without error      00%      8261        -
# 6  Short offline      Completed without error      00%      7997        -
# 7  Short offline      Completed without error      00%      7877        -
# 8  Extended offline    Completed without error      00%      7714        -
# 9  Short offline      Completed without error      00%      7541        -
#10  Short offline      Completed without error      00%      7277        -
#11  Short offline      Completed without error      00%      7157        -
#12  Short offline      Completed without error      00%      6882        -
#13  Short offline      Completed without error      00%      6762        -
#14  Short offline      Completed without error      00%      6724        -
1 of 1 failed self-tests are outdated by newer successful extended offline self-test # 1
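For anyone scripting this, the failing LBA can be pulled out of a self-test log like the one above with a small filter. This is only a sketch: `failed_lbas` is a made-up helper name, and it assumes the column layout shown in this thread, where the last column is `-` for a passing test. Note that each row records only the first error that particular test ran into, so an older failed row keeps its LBA as a historical record.

```shell
# Hypothetical helper: read `smartctl -l selftest` output on stdin and
# print the LBA_of_first_error of every row that recorded one.
# Assumes the layout shown in this thread (last column is "-" or an LBA).
failed_lbas() {
    # Match only self-test rows ("# 1 ...", "#10 ...") whose last field is numeric.
    awk '/^# *[0-9]+ / && $NF ~ /^[0-9]+$/ { print $NF }'
}
```

Usage would be along the lines of `smartctl -l selftest /dev/ada0 | failed_lbas`.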

The LBA_of_first_error is still showing the original sector ID. However, /var/log/messages is now reporting the following, which tells me that the sector refresh was successful:

Code:

Oct 20 12:09:59 nas01 smartd[2485]: Device: /dev/ada0, 7 Currently unreadable (pending) sectors
Oct 20 12:09:59 nas01 smartd[2485]: Device: /dev/ada0, 7 Offline uncorrectable sectors
Oct 20 12:09:59 nas01 smartd[2485]: Device: /dev/ada0, 7 Currently unreadable (pending) sectors
Oct 20 12:09:59 nas01 smartd[2485]: Device: /dev/ada0, 7 Offline uncorrectable sectors

I would like to get the remaining sector IDs. What am I missing here?

Thanks in advance.
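One possible reading of the numbers above, and it is only an assumption because the thread never names the drive model: 8 pending sectors dropping to 7 after a single 512-byte rewrite would be consistent with one 4 KiB physical sector exposed as eight 512-byte logical sectors. Under that assumption, the remaining sectors would sit next to the reported LBA, and the range can be worked out with plain arithmetic. `aligned_lba_range` is a hypothetical helper, not part of smartmontools.

```shell
# Hypothetical helper: print the first and last logical LBA of the
# physical sector that contains LBA $1.
# $2 = logical sectors per physical sector; ASSUMED to be 8 (4096 / 512),
# which the thread does not confirm for this drive.
aligned_lba_range() {
    lba=$1
    per_phys=${2:-8}
    first=$(( lba - lba % per_phys ))   # round down to the group boundary
    last=$(( first + per_phys - 1 ))    # last LBA in the same group
    echo "$first $last"
}
```

For the LBA reported here, `aligned_lba_range 2643817536` prints `2643817536 2643817543`. On drives that report it, `smartctl -i` lists the logical and physical sector sizes, which would confirm or rule out the assumption.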

survive · Oct 20, 2013

Hi neveragain,

I think you need to scrub the volume, then run another long SMART test.

-Will

cyberjock · Oct 20, 2013

Long tests don't always compare the sector data with its ECC. When a sector's data and its ECC don't match (or the firmware determines that an error occurred on read), you get a nonzero Current Pending Sector count. But if your manufacturer's firmware doesn't do the comparison, the drive might still pass the long test. Unfortunately, which manufacturers do and don't do the comparison is a secret. :(

Do a scrub. If that doesn't fix it, then the only other option you really have is to run something like a badblocks write test on the disk. That'll write to the whole drive, and you should then see zero again.
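A rough sketch of that scrub-then-retest sequence follows. The pool name and device are taken from later in this thread; the `run` wrapper, the `DRY_RUN` variable and the function names are made up for illustration so the commands can be previewed without touching the system. The scrub runs in the background, so the long test should only be started once `zpool status` shows the scrub has finished.

```shell
#!/bin/sh
# Pool and disk as they appear in this thread; override via the environment.
POOL="${POOL:-nas01_pool}"
DISK="${DISK:-/dev/ada0}"

# Hypothetical wrapper: with DRY_RUN set, print the command instead of running it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Step 1: start a scrub (returns immediately; the scrub continues in the background).
start_scrub()     { run zpool scrub "$POOL"; }

# Step 2: poll this until the "scan:" line reports the scrub as finished.
scrub_status()    { run zpool status "$POOL"; }

# Step 3: start an extended (long) SMART self-test on the disk.
start_long_test() { run smartctl -t long "$DISK"; }

# Step 4: after the test completes, read the self-test log and the attribute table.
show_results() {
    run smartctl -l selftest "$DISK"
    run smartctl -A "$DISK"
}
```

Running `DRY_RUN=1 start_scrub` just prints `zpool scrub nas01_pool`, which is a cheap way to check the names before doing it for real.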

neveragain · Oct 21, 2013

Thank you very much for your responses. I've done as suggested: I manually initiated a scrub on the pool and, once it completed, ran a long SMART test on /dev/ada0. Unfortunately, there is no change in the selftest output, and the unreadable/uncorrectable sector messages persist in /var/log/messages.

Code:

smartctl 5.43 2012-06-30 r3573 [FreeBSD 8.3-RELEASE-p5 amd64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
 
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error      00%      8569        -
# 2  Extended offline    Completed without error      00%      8550        -
# 3  Extended offline    Completed: read failure      70%      8541        2643817536
# 4  Short offline      Completed without error      00%      8539        -
# 5  Extended offline    Completed without error      00%      8434        -
# 6  Short offline      Completed without error      00%      8261        -
# 7  Short offline      Completed without error      00%      7997        -
# 8  Short offline      Completed without error      00%      7877        -
# 9  Extended offline    Completed without error      00%      7714        -
#10  Short offline      Completed without error      00%      7541        -
#11  Short offline      Completed without error      00%      7277        -
#12  Short offline      Completed without error      00%      7157        -
#13  Short offline      Completed without error      00%      6882        -
#14  Short offline      Completed without error      00%      6762        -
#15  Short offline      Completed without error      00%      6724        -
1 of 1 failed self-tests are outdated by newer successful extended offline self-test # 1
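Since the self-test log no longer changes, the smartd messages are the easier place to watch the pending count. A minimal sketch, assuming the message wording shown earlier in this thread; `latest_pending_count` is a hypothetical helper name.

```shell
# Hypothetical helper: read syslog lines on stdin and print the pending-sector
# count from the most recent smartd "Currently unreadable (pending) sectors" line.
# Assumes the smartd message format quoted earlier in this thread.
latest_pending_count() {
    # In a basic regular expression the parentheses around "pending" are literal.
    sed -n 's/.*Device: [^,]*, \([0-9][0-9]*\) Currently unreadable (pending) sectors.*/\1/p' \
        | tail -n 1
}
```

Something like `latest_pending_count < /var/log/messages` would then show whether the number is still falling after each rewrite.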

A new negative development is checksum errors on another device. I had scrubs scheduled to run every 35 days, and I'm not sure when the last one ran.

Code:

pool: nas01_pool
state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
  see: http://www.sun.com/msg/ZFS-8000-9P
  scan: scrub repaired 132G in 3h43m with 0 errors on Mon Oct 21 12:19:23 2013
config:
 
        NAME                                            STATE    READ WRITE CKSUM
        nas01_pool                                      ONLINE      0    0    0
          raidz2-0                                      ONLINE      0    0    0
            gptid/927df9ce-0b5f-11e2-80e1-902b3498f036  ONLINE      0    0    0
            gptid/932211a8-0b5f-11e2-80e1-902b3498f036  ONLINE      0    0    0
            gptid/93bf9dc6-0b5f-11e2-80e1-902b3498f036  ONLINE      0    0    0
            gptid/9464dae0-0b5f-11e2-80e1-902b3498f036  ONLINE      0    0 4.19M
            gptid/9506d3ce-0b5f-11e2-80e1-902b3498f036  ONLINE      0    0    0
            gptid/95aa6903-0b5f-11e2-80e1-902b3498f036  ONLINE      0    0    0
 
errors: No known data errors
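To pick the affected member out of a longer `zpool status` listing, a filter along these lines could help. It is a sketch that assumes the five-column NAME/STATE/READ/WRITE/CKSUM layout shown above; `devices_with_cksum_errors` is a made-up name.

```shell
# Hypothetical helper: read `zpool status` output on stdin and print the name
# and CKSUM count of every row whose CKSUM column is not 0.
# Assumes the NAME STATE READ WRITE CKSUM layout shown in this thread.
devices_with_cksum_errors() {
    # Require three count-like columns so prose lines with five words are skipped.
    awk 'NF == 5 && $3 ~ /^[0-9]/ && $4 ~ /^[0-9]/ && $5 ~ /^[0-9]/ && $5 != "0" { print $1, $5 }'
}
```

With the output above, `zpool status nas01_pool | devices_with_cksum_errors` would print `gptid/9464dae0-0b5f-11e2-80e1-902b3498f036 4.19M`. On FreeBSD, `glabel status` should map that gptid back to an adaX partition.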

I've bought a new set of drives and am going to set up a new file server and migrate the data. After that, I think I'll run more tests on the disks of the original pool.
