Is there any way of dealing with pending sectors?

NASbox

Guru
Joined
May 8, 2012
Messages
650
I have a drive that has developed a few "pending sectors". Is there a way to force ZFS to test/remap these sectors if they have hard errors? Since ZFS identified the errors and resilvered the drive in question, I don't think a scrub will even access the affected sectors.
 

NASbox

Thanks for passing this along. I have done a bit of reading since I originally posted, and I'm not sure whether I should attempt to fix it. I originally had 60 "Pending Sectors"; after a scrub I was down to 49, with NO reallocations. Does that mean 11 were corrected (and the other 49 weren't touched)?
 

Alecmascot

Guru
Joined
Mar 18, 2014
Messages
1,177
Does that mean 11 were corrected (and the other 49 weren't touched)?
A Scrub does not usually resolve Pending Sectors as there is no writing going on.
I would say that you have a large number there and I would think about changing the drive.
You could resilver the drive and see what happens.
 

NASbox

A Scrub does not usually resolve Pending Sectors as there is no writing going on.
I would say that you have a large number there and I would think about changing the drive.
You could resilver the drive and see what happens.
When I started a scrub I had 60 pending sectors; as part of the scrub it did a partial resilver and the number dropped to 49.
How do you resilver a disk on demand?
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
as part of the scrub it did a partial resilver
This is not normal. You need to check the SMART data from the drive in question to see whether it is on its way out.
 

NASbox

I've got a replacement drive doing its burn-in with badblocks as I write this, but I find this behaviour interesting. Why does the number of pending sectors drop without any reallocation events or reallocated sectors? That would seem to imply that bad sectors are correcting themselves.

Once I get the drive safely out of the pool, I think I'll give it a good going-over with badblocks to see whether it falls apart, fixes itself completely, or ends up somewhere in between.

Does anyone have any comments about this weird behaviour?

Code:
20200714_050000~da2:ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
20200714_050000~da2:  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
20200714_050000~da2:196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
20200714_050000~da2:197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
20200714_050000~da2:198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
20200715_050000~da2:ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
20200715_050000~da2:  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
20200715_050000~da2:196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
20200715_050000~da2:197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       60
20200715_050000~da2:198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
20200716_050000~da2:ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
20200716_050000~da2:  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
20200716_050000~da2:196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
20200716_050000~da2:197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       60
20200716_050000~da2:198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
20200717_050000~da2:ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
20200717_050000~da2:  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
20200717_050000~da2:196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
20200717_050000~da2:197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       49
20200717_050000~da2:198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
20200718_050000~da2:ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
20200718_050000~da2:  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
20200718_050000~da2:196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
20200718_050000~da2:197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       49
20200718_050000~da2:198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
20200719_050000~da2:ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
20200719_050000~da2:  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
20200719_050000~da2:196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
20200719_050000~da2:197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       130
20200719_050000~da2:198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
20200720_050000~da2:ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
20200720_050000~da2:  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
20200720_050000~da2:196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
20200720_050000~da2:197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       81
20200720_050000~da2:198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
20200721_050000~da2:ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
20200721_050000~da2:  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
20200721_050000~da2:196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
20200721_050000~da2:197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       81
20200721_050000~da2:198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
20200722_050000~da2:ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
20200722_050000~da2:  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
20200722_050000~da2:196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
20200722_050000~da2:197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       81
20200722_050000~da2:198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
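The daily log above is easier to follow as a time series of one attribute. A minimal Python sketch, assuming each line has the form `<timestamp>~<device>:<smartctl attribute row>` exactly as in this post and that RAW_VALUE is the last field on the row; the function names are hypothetical, not part of any tool:

```python
import re

# Assumed log format, as in this post:
# "20200715_050000~da2:197 Current_Pending_Sector  0x0032 ... -       60"
# Header rows ("...~da2:ID# ATTRIBUTE_NAME ...") do not match because ID is not numeric.
ROW = re.compile(
    r"^(?P<ts>\d{8}_\d{6})~(?P<dev>[^:]+):\s*(?P<id>\d+)\s+(?P<name>\S+)\s.*\s(?P<raw>\d+)\s*$"
)

def attribute_history(lines, attr_name="Current_Pending_Sector"):
    """Return a list of (timestamp, device, raw_value) for one SMART attribute."""
    history = []
    for line in lines:
        m = ROW.match(line)
        if m and m.group("name") == attr_name:
            history.append((m.group("ts"), m.group("dev"), int(m.group("raw"))))
    return history

def changes(history):
    """Yield (timestamp, old, new) whenever the raw value differs from the previous sample."""
    for (_, _, old), (ts, _, new) in zip(history, history[1:]):
        if new != old:
            yield ts, old, new
```

Fed the full log above, `changes()` would report 0 → 60 (20200715), 60 → 49 (20200717), 49 → 130 (20200719) and 130 → 81 (20200720), while `attribute_history(lines, "Reallocated_Sector_Ct")` stays at 0 on every day, which is exactly the pattern being asked about.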
 

Fredda

Guru
Joined
Jul 9, 2019
Messages
608
I've got a replacement drive doing its burn-in with badblocks as I write this, but I find this behaviour interesting. Why does the number of pending sectors drop without any reallocation events or reallocated sectors? That would seem to imply that bad sectors are correcting themselves.
Pending sectors can occur when the power is cut during a write. In that case the block is not necessarily bad.
badblocks will write to the sector: if the write works, the sector becomes good again; if it does not, the sector gets reallocated.
In either case the pending sector count decreases, but the reallocated sector count does not necessarily increase.
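Fredda's explanation amounts to a small state machine. The Python sketch below is a hypothetical toy model of that idea, not any vendor's firmware: "weak" sectors fail reads until they are rewritten, while "dead" sectors also fail writes and get remapped to a spare. Both paths empty the pending list, but only the second adds to the reallocated count.

```python
from dataclasses import dataclass, field

@dataclass
class ToyDrive:
    """Hypothetical model of pending-sector bookkeeping (not real firmware).

    weak: sectors that fail reads but recover once rewritten
    dead: sectors that also fail writes and must be remapped to a spare
    """
    weak: set = field(default_factory=set)
    dead: set = field(default_factory=set)
    pending: set = field(default_factory=set)
    reallocated: set = field(default_factory=set)

    def read(self, lba: int) -> bool:
        if lba in self.reallocated:
            return True                  # served from the spare sector
        if lba in self.weak or lba in self.dead:
            self.pending.add(lba)        # unreadable: mark as pending
            return False
        return True

    def write(self, lba: int) -> None:
        if lba in self.dead:
            self.reallocated.add(lba)    # write failed: remap to a spare
            self.dead.discard(lba)
        else:
            self.weak.discard(lba)       # a fresh write makes the sector readable again
        self.pending.discard(lba)        # either way it is no longer pending
```

For example, with `weak={10, 11}` and `dead={12}`, reading LBAs 10 to 12 makes all three pending; writing them clears the pending list, yet only LBA 12 ends up reallocated. The model ignores drives that fail to notice a bad write at all.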
 

NASbox

Pending sectors can occur when the power is cut during a write. In that case the block is not necessarily bad.
badblocks will write to the sector: if the write works, the sector becomes good again; if it does not, the sector gets reallocated.
In either case the pending sector count decreases, but the reallocated sector count does not necessarily increase.
It is possible that this is power-related damage, as there was a point when I had a bad UPS battery.

My current plan is to replace the drive with a new one after I finish burning it in.

Once I no longer need the old drive, I am going to run the burn-in on it. If the damage is temporary, that should clear things up, and if the drive is flaky, it will make things worse. Either way I should know what to do.
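For reference, a destructive `badblocks -w` run writes a pattern over every block, reads every block back and compares, then repeats with the next pattern (its man page lists 0xaa, 0x55, 0xff, 0x00). The Python sketch below imitates that loop on an ordinary file so it can be tried safely; the function name and block size are illustrative assumptions. It is not a substitute for badblocks on a real disk: it goes through the page cache, so the read-back may never touch the platter.

```python
import os

PATTERNS = (0xAA, 0x55, 0xFF, 0x00)  # the patterns badblocks -w documents

def write_verify(path: str, size: int, block_size: int = 4096) -> list:
    """Write each pattern over `size` bytes of `path`, read it back, and
    return (pattern, block_index) pairs that did not match.

    DESTRUCTIVE: overwrites the target. Illustrative only; a real disk test
    would need uncached (direct) I/O for reads to reach the media.
    """
    bad = []
    nblocks = (size + block_size - 1) // block_size
    fd = os.open(path, os.O_RDWR | os.O_CREAT)
    try:
        for pat in PATTERNS:
            for i in range(nblocks):                      # write pass
                n = min(block_size, size - i * block_size)
                os.pwrite(fd, bytes([pat]) * n, i * block_size)
            os.fsync(fd)
            for i in range(nblocks):                      # read-back and compare pass
                n = min(block_size, size - i * block_size)
                try:
                    data = os.pread(fd, n, i * block_size)
                except OSError:
                    data = b""                            # a read error counts as bad
                if data != bytes([pat]) * n:
                    bad.append((pat, i))
    finally:
        os.close(fd)
    return bad
```

On the actual drive, badblocks itself (e.g. `badblocks -wsv` against the raw device, which destroys all data on it) remains the tool to use; the sketch only shows the write, read-back, compare cycle.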
 

NASbox

I replaced the disk, and after removing it from the pool I did a full zero wipe from the FreeNAS Storage / Disks menu.

After the wipe, the smartctl output suggests that the errors have been cleared:

Code:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   167   051    Pre-fail  Always       -       38
  3 Spin_Up_Time            0x0027   211   182   021    Pre-fail  Always       -       8425
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       313
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   046   046   000    Old_age   Always       -       39912
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       96
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       66
193 Load_Cycle_Count        0x0032   188   188   000    Old_age   Always       -       38807
194 Temperature_Celsius     0x0022   112   096   000    Old_age   Always       -       40
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       2



Given the nature of the data on the drive, it is quite likely that the surface has not been written since the pool was created about 5 years ago: the content is largely library material that rarely, if ever, gets updated. Does recorded data fade with time?

Am I correct in assuming that a scrub is read-only unless an error is found (no errors = no writes during the scrub)?
If an error is found, is the data reconstructed and written to another part of the disk?
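The scrub question can be illustrated with a toy two-way mirror. This is a conceptual Python sketch, not ZFS code, and the data layout is a hypothetical assumption: each block has a stored checksum, a scrub reads and verifies every copy, and only a copy that is unreadable or fails verification is rewritten from a good copy, at the same block number on the same disk. A clean pass performs no writes at all. As far as I know, ZFS's self-healing repair likewise writes the good data back to the original location rather than to a new one, which would give the drive a chance to clear or remap a pending sector; treat that detail as an assumption to check against the OpenZFS documentation.

```python
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def scrub(mirror, expected):
    """Toy scrub of a two-way mirror.

    mirror:   list of dicts {block_no: bytes or None}; None models an
              unreadable (e.g. pending) sector.
    expected: {block_no: checksum}, standing in for checksums kept in metadata.
    Returns the (disk_index, block_no) pairs that were rewritten.
    """
    repairs = []
    for blk, want in expected.items():
        copies = [disk.get(blk) for disk in mirror]
        good = next((c for c in copies if c is not None and checksum(c) == want), None)
        if good is None:
            continue                        # no valid copy left: cannot repair
        for idx, c in enumerate(copies):
            if c is None or checksum(c) != want:
                mirror[idx][blk] = good     # rewrite in place, same block number
                repairs.append((idx, blk))
    return repairs
```

If the second disk has one unreadable block and one corrupted block, the first pass rewrites exactly those two, and a second pass finds nothing to do and writes nothing.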

Any thoughts about this behaviour?

Now that the drive is out of the pool, I am going to run one pass of badblocks followed by a SMART self-test to see what happens.

If the drive is clean after that would it be safe to assume that I can be as confident of the drive as I can be of any drive of the same type with a similar number of hours?
 

NASbox

I decided to retire the disk when I found an area in the upper TB that started throwing a ton of errors and pushed 'Raw_Read_Error_Rate' to huge numbers.

As a matter of intellectual curiosity (since the drive is being scrapped), I am confused by how SMART has been behaving and was hoping someone might be able to explain what I have been observing.

I was able to use dd to write the full drive without errors, and writing cleared 'Current_Pending_Sector' to zero. When I read the bad area back with dd I got an error, and issuing multiple reads on the bad spot didn't cause the sector to be reallocated. I could write the sector that caused the read error repeatedly without any errors, yet reading it always failed. There didn't seem to be any way to force the bad sector to be replaced, and the drive electronics didn't seem to catch the write error. I thought modern drives read back after writing to make sure the write completed accurately.
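To repeat the read test on just the suspect sector rather than a whole region, the LBA reported by a self-test (the log in this post lists LBA_of_first_error 3072421288) can be turned into a byte offset and read with a single positioned read. A minimal Python sketch, assuming 512-byte logical sectors (check the sector size `smartctl -i` reports for the drive) and a hypothetical helper name; on FreeNAS the target would be the raw device such as /dev/da2, reading it needs root, and this has not been tried on that drive.

```python
import errno
import os

def read_lba(path: str, lba: int, sector_size: int = 512):
    """Read one logical sector; return its bytes, or None on an I/O error.

    sector_size=512 is an assumption: use the drive's reported logical sector
    size. Offsets stay sector-aligned, which raw disk devices require.
    """
    offset = lba * sector_size
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, sector_size, offset)
    except OSError as exc:
        if exc.errno == errno.EIO:   # the usual result of an unreadable sector
            return None
        raise
    finally:
        os.close(fd)
```

For LBA 3072421288 the offset is 3072421288 × 512 = 1,573,079,699,456 bytes, about 1.57 TB into the disk; the equivalent dd read would be `dd if=/dev/da2 of=/dev/null bs=512 skip=3072421288 count=1`.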

Can someone explain this weird behaviour and what might force reallocation?

Code:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   001   001   051    Pre-fail  Always   FAILING_NOW 14644
  3 Spin_Up_Time            0x0027   211   182   021    Pre-fail  Always       -       8450
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       314
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   045   045   000    Old_age   Always       -       40259
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       97
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       67
193 Load_Cycle_Count        0x0032   188   188   000    Old_age   Always       -       38811
194 Temperature_Celsius     0x0022   115   096   000    Old_age   Always       -       37
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   198   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       2

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: unknown failure    90%     40220         -
# 2  Extended offline    Completed: unknown failure    90%     40220         -
# 3  Extended offline    Completed: read failure       10%     39625         3072421288
# 4  Extended offline    Completed without error       00%     17380         -
# 5  Extended offline    Completed without error       00%        92         -
# 6  Extended offline    Completed without error       00%        57         -
# 7  Short offline       Completed without error       00%        13         -
# 8  Extended offline    Completed without error       00%        12         -
# 9  Short offline       Completed without error       00%         0         -
#10  Conveyance offline  Completed without error       00%         0         -
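The attribute table shows why this drive is flagged even though Reallocated_Sector_Ct is still 0: for Raw_Read_Error_Rate the normalized VALUE (001) has dropped below THRESH (051). A minimal Python sketch of that check, assuming the rule smartctl applies for WHEN_FAILED is as described in the docstring (a threshold of 0 never trips); the function name is hypothetical:

```python
def attribute_states(table_text: str) -> dict:
    """Classify each row of a smartctl attribute table by its normalized values.

    Assumed rule: THRESH == 0 never fails; otherwise VALUE <= THRESH is
    'FAILING_NOW' and WORST <= THRESH is 'In_the_past'; anything else is '-'.
    """
    states = {}
    for line in table_text.splitlines():
        fields = line.split()
        if len(fields) < 10 or not fields[0].isdigit():
            continue                                # skip headers, logs, other text
        name = fields[1]
        value, worst, thresh = (int(f) for f in fields[3:6])
        if thresh == 0:
            state = "-"
        elif value <= thresh:
            state = "FAILING_NOW"
        elif worst <= thresh:
            state = "In_the_past"
        else:
            state = "-"
        states[name] = state
    return states
```

Applied to the table in this post, only Raw_Read_Error_Rate comes out as FAILING_NOW; Spin_Up_Time and Reallocated_Sector_Ct are still above their thresholds, and the remaining attributes have a threshold of 000.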
 