Fix bad blocks

budmannxx · Nov 3, 2011

I have some bad blocks on 2 of 6 drives in my ZFS raidz2. Is there a way to fix them without formatting the drives? If not, will formatting fix them? Some details:

I'm running FreeNAS-8.0.2-RELEASE-amd64 (8288)

The drives are Samsung HD204UI (the drives were manufactured after the firmware issue that is well documented on the forum, and I applied the firmware patch anyway, just to be sure)

Here is the smartctl output for 1 of the drives (output for the other drive is pretty much the same, just different LBA_of_first_error):

Code:

smartctl -l selftest /dev/ada0
smartctl 5.41 2011-06-09 r3365 [FreeBSD 8.2-RELEASE-p3 amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%      3741         1192616706
# 2  Short offline       Completed: read failure       90%      3732         1192616706
# 3  Short offline       Completed: read failure       90%      3407         1192616706
# 4  Extended offline    Completed: read failure       70%       299         1192616706
# 5  Short offline       Completed: read failure       90%       297         1192616706
# 6  Short offline       Completed without error       00%        33         -

I've read through this tutorial but it's geared towards Linux, and the sg3_utils mentioned are not available in FreeNAS. Any help here would be greatly appreciated.
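Since the same failing LBA repeats across several self-tests, it can help to pull the distinct values out of the log before deciding what to repair. A minimal sketch, assuming the log format shown above; `first_error_lbas` is a hypothetical helper name, not part of smartmontools:

```shell
#!/bin/sh
# Sketch: list the distinct LBA_of_first_error values from a
# `smartctl -l selftest` log read on stdin.
# first_error_lbas is a hypothetical helper name.
first_error_lbas() {
    # Log rows start with "#". Rows with a read failure end in a numeric
    # LBA, while clean rows end in "-".
    awk '/^#/ && /read failure/ && $NF ~ /^[0-9]+$/ { print $NF }' | sort -u
}

# Usage on the NAS (not run here):
#   smartctl -l selftest /dev/ada0 | first_error_lbas
```

It only reads text on stdin, so it can be tried against a saved copy of the log without touching the drive.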

Dmitry Nosachev · Nov 4, 2011

Please show the output from smartctl -A /dev/ada0

budmannxx · Nov 4, 2011

Here you go:

Code:

smartctl -A /dev/ada0
smartctl 5.41 2011-06-09 r3365 [FreeBSD 8.2-RELEASE-p3 amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   099   051    Pre-fail  Always       -       936
  2 Throughput_Performance  0x0026   252   252   000    Old_age   Always       -       0
  3 Spin_Up_Time            0x0023   067   067   025    Pre-fail  Always       -       10185
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       77
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       4078
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       1
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       46
181 Program_Fail_Cnt_Total  0x0022   100   100   000    Old_age   Always       -       9672078
191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always       -       69
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0002   064   063   000    Old_age   Always       -       33 (Min/Max 22/37)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       73
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       1
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       3486

budmannxx · Nov 4, 2011

I guess the only concern there is the 2 in Current_Pending_Sector--is this something I can fix myself or do I have to RMA the drive?

Dmitry Nosachev · Nov 4, 2011

You need to make the HDD remap these sectors.
The "sg3_utils" package runs on FreeBSD and contains useful tools such as sg_format, sg_verify and sg_reassign, but they are designed for SCSI/SAS disks. Some SATA disks accept a limited number of SCSI commands (e.g. sg_verify works on Hitachi A7K2000), but in your case you need to fill the HDD with zeroes: dd if=/dev/zero of=/dev/ada0 bs=1M (remove the HDD from the zpool first).
Then look at the attributes "Reallocated_Sector_Ct" (it should increase if sectors were actually remapped) and "Current_Pending_Sector" (it should be 0) again. Finally, run the SMART self-test: smartctl --test=long /dev/ada0
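The attribute check described above can be scripted. This is only a sketch, assuming the column layout of the `smartctl -A` output posted earlier (RAW_VALUE in the 10th column); `smart_raw` and `remap_check` are hypothetical helper names:

```shell
#!/bin/sh
# smart_raw and remap_check are hypothetical helper names.

# Print the RAW_VALUE (10th column) of one attribute, reading
# `smartctl -A` output on stdin.
smart_raw() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# Read `smartctl -A` output on stdin, print both counters, and return
# success only if Current_Pending_Sector is 0.
remap_check() {
    attrs=$(cat)
    pending=$(printf '%s\n' "$attrs" | smart_raw Current_Pending_Sector)
    realloc=$(printf '%s\n' "$attrs" | smart_raw Reallocated_Sector_Ct)
    echo "reallocated=$realloc pending=$pending"
    [ "$pending" = "0" ]
}

# Usage on the NAS (not run here):
#   smartctl -A /dev/ada0 | remap_check && echo "no pending sectors"
```

Run against the output posted above, it would report 2 pending sectors and return a failure status.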

budmannxx · Nov 5, 2011

I tried this, but couldn't get the dd command to work (logged in as root).

Tried to remove the drive from the pool (can't, I think because it's a raidz2, not a mirror):

Code:

/mnt/# zpool remove freenas gpt/ada0
cannot remove gpt/ada0: only inactive hot spares or cache devices can be removed

Offlined the drive (success) and checked pool status (degraded, as expected):

Code:

/mnt/# zpool offline freenas gpt/ada0
/mnt/# zpool status
  pool: freenas
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        freenas       DEGRADED     0     0     0
          raidz2      DEGRADED     0     0     0
            gpt/ada0  OFFLINE      0     0     0
            gpt/ada1  ONLINE       0     0     0
            gpt/ada2  ONLINE       0     0     0
            gpt/ada3  ONLINE       0     0     0
            gpt/ada4  ONLINE       0     0     0
            gpt/ada5  ONLINE       0     0     0

errors: No known data errors

Attempted the dd command:

Code:

/mnt/# dd if=/dev/zero of=/dev/ada0 bs=1M
dd: /dev/ada0: Operation not permitted

I think the dd didn't work because the drive is only OFFLINE, and somehow still "locked" by the pool. Should I be removing the drive in a different way? I'd prefer not to have to physically pull the drive.

Dmitry Nosachev · Nov 9, 2011

Try zeroing the problem block with hdparm: hdparm --write-sector 1192616706 /dev/ada0
Then check the "Current_Pending_Sector" value and start a long self-test. Finally, you will need to scrub your zpool.

budmannxx · Nov 9, 2011

I must be missing something obvious, but I'm getting a "command not found" when trying to run hdparm:

Code:

/mnt/# hdparm --write-sector 1192616706 /dev/ada0
hdparm: Command not found.
/mnt/# hdparm
hdparm: Command not found.

bsalinux · Dec 17, 2011

It would be easier if you have a spare drive. Replace the bad drive with another drive, move the defective drive to another system, and run SeaTools or another manufacturer's low-level tool to format the drive. If you RMA the drive, they will send you a re-certified drive. SMART won't trip until you have Reallocated_Sector_Ct <= 36.

sunflashx · Jan 6, 2012

For the record, I think it's asinine you can't fix something stupid like this easily on a dedicated storage appliance.

I gave up and RMA'd my drive that had bad blocks. Samsung's advance RMA system is great. They'll ship you a new drive, you can yank one drive and immediately start your rebuild. Cost me $12 or so to ship the old drive back.

budmannxx · Jan 6, 2012

sunflashx said:
Samsung's advance RMA system is great. They'll ship you a new drive, you can yank one drive and immediately start your rebuild. Cost me $12 or so to ship the old drive back.

And now that they're Seagate, it appears that the advance RMA option isn't available, at least for the Samsung HD204UI. Anyone know of a way to do this post-merger?

deajan · Sep 10, 2012

Hello,

Sorry to dig up this old topic, but here's what I tried in order to deal with bad blocks:

The sg3_utils package isn't included in FreeNAS by default (which is strange, as it's quite useful for a storage server), so I installed it manually:

Code:

# mount -uw /
# mkdir /root/sg3_utils
# cd /root/sg3_utils
# wget http://ftp2.freebsd.org/pub/FreeBSD/ports/amd64/packages-8.2-release/sysutils/sg3_utils-1.28.tbz
# pkg_add sg3_utils-1.28.tbz
# mount -ur /

Then I tried to verify the bad block listed in the self-test log:

Code:

# smartctl -l selftest /dev/ada1

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       30%     10685         2903215976

# /usr/local/bin/sg_verify --lba=2903215976 /dev/ada1

verify (10): transport: (pass2:ahcich2:0:0:0): VERIFY(10). CDB: 2f 0 ad b 8f 68 0 0 1 0
(pass2:ahcich2:0:0:0): CAM status: CCB request was invalid

Verify(10) failed near lba=2903215976 [0xad0b8f68]

Now I'm stuck here... It seems that either my RAID controller (IBM ServeRAID C100) can't talk to the CAM framework, or my drive (WD2003FYYS-02W0B0) can't speak SCSI.
Does anyone have any clues?

Thanks.

deajan · Sep 20, 2012

In the end, it looks like Western Digital RE3/RE4 drives simply do not speak SCSI.

I tried on my home NAS, which has an RE3 drive, and I got the same result:

Code:

[root@freenas] ~# sg_verify --lba=10340032 /dev/ada0
verify (10): transport: (pass0:ahcich0:0:0:0): VERIFY(10). CDB: 2f 0 0 9d c6 c0
(pass0:ahcich0:0:0:0): CAM status: CCB request was invalid

Verify(10) failed near lba=10340032 [0x9dc6c0]

So what is the only solution? Remove the disk from the zpool, fill it with zeros until the HDD firmware finds that the sector is not writable and remaps it, then attach the disk to the zpool again and resilver?

deajan · Sep 20, 2012

Okay... new round.

I've played around with dd and I think I've had success:

These are some lines of my smartctl -a /dev/ada1 output before:

Code:

...
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
...
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       1

...
# 2  Extended offline    Completed: read failure       30%     10899         2903215976
...

I actually disabled disk geometry protection and then zerofilled the sector with dd:

Code:

# sysctl kern.geom.debugflags=0x10
# dd bs=512 seek=2903215976 if=/dev/zero of=/dev/ada1 count=1
# sysctl kern.geom.debugflags=0x0

Now my smartctl output says

Code:

...
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
...
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
...

I'm running a long self-test right now to be sure.
Hopefully this helps someone.

If you have SATA drives, use the dd technique.
If you have SAS drives, install the sg3_utils package I mentioned in this topic and follow this guide: http://smartmontools.sourceforge.net/badblockhowto.html#bb

Cheers.
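For anyone repeating the dd technique, a dry-run helper can reduce the risk of a typo in a destructive command. The sketch below only prints the three commands from the post above for review; it does not run them. It assumes 512-byte logical sectors, as in the bs=512 call, and `zero_sector_plan` is a hypothetical name:

```shell
#!/bin/sh
# zero_sector_plan is a hypothetical helper. It only PRINTS the commands
# so they can be reviewed and then run by hand as root.
# Assumption: 512-byte logical sectors, matching bs=512 in the thread.
zero_sector_plan() {
    dev=$1
    lba=$2
    # Reject anything that is not a plain decimal LBA.
    case $lba in
        ''|*[!0-9]*) echo "invalid LBA: $lba" >&2; return 1 ;;
    esac
    echo "sysctl kern.geom.debugflags=0x10"
    echo "dd if=/dev/zero of=$dev bs=512 seek=$lba count=1"
    echo "sysctl kern.geom.debugflags=0x0"
}

# Usage (prints the plan, touches nothing):
#   zero_sector_plan /dev/ada1 2903215976
```

Printing instead of executing is a deliberate choice here: overwriting a sector destroys whatever was stored there, so the device name and LBA deserve a second look before anything runs.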

paleoN · Sep 20, 2012

deajan said:
I actually disabled disk geometry protection and then zerofilled the sector with dd:

Correct me if I'm wrong, but I don't believe it's necessary to disable geom unless the bad block was in one of the GPT labels. It's not clear what your steps were, but you would want to offline the disk before dd'ing it and online it afterwards.

deajan · Sep 20, 2012

I actually tried taking the disk offline; even exporting the zpool didn't do the trick.
As long as I did not change the sysctl parameter I suggested, every time I tried dd I ended up with:

Code:

dd: /dev/ada1: Operation not permitted

I might be wrong too (I'm not a BSD expert at all), but I think kern.geom.debugflags provides protection against "raw" writing to disk with tools like fdisk / gdisk or, in my case, dd.

paleoN · Sep 20, 2012

deajan said:
I actually tried taking the disk offline; even exporting the zpool didn't do the trick.

You would offline the disk when you were working on it. Then when you online the disk it would resilver if needed. Unless you destroyed the partitions, geom would still protect the disk.

deajan said:
I might be wrong too (I'm not a BSD expert at all), but I think kern.geom.debugflags provides protection against "raw" writing to disk with tools like fdisk / gdisk or, in my case, dd.

Ah, I see now. Thanks.
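Putting the pool-side advice from this thread together (offline the member, repair it, online it so it resilvers if needed, then scrub), the order of operations might look like the sketch below. It prints the commands rather than running them, and `pool_repair_plan` is a hypothetical name:

```shell
#!/bin/sh
# pool_repair_plan is a hypothetical helper. It only PRINTS the zpool
# commands in the order discussed in the thread.
pool_repair_plan() {
    pool=$1
    member=$2
    echo "zpool offline $pool $member"
    echo "# ...repair the disk here (e.g. zero the bad sector)..."
    echo "zpool online $pool $member"
    echo "zpool scrub $pool"
    echo "zpool status $pool"
}

# Usage with the names from the first post (prints the plan only):
#   pool_repair_plan freenas gpt/ada0
```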

Visseroth · Oct 19, 2012

This was a good thread and very helpful. I have a total of 12 of these drives: one suddenly got the click of death one day, and 11 more are showing pending sector counts. I doubt they are all bad, so I'm trying to repair them one at a time by taking them offline. I'm currently running the long test while they are still in the server; since the server is always on, I don't have to keep another machine running to repair them.
So far I've run "smartctl --test=long /dev/ada0" and it reported "Please wait 347 minutes for test to complete".
Again, thank you for this post and the information contained within. Newbs like me appreciate it.
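When working through many drives, it can be handy to turn the wait time smartctl reports into hours and minutes. A small sketch, assuming the message wording quoted above; `selftest_eta` is a hypothetical name:

```shell
#!/bin/sh
# selftest_eta is a hypothetical helper. It reads smartctl's output on
# stdin, finds the "Please wait N minutes" line, and prints N as hours
# and minutes. Returns failure if no such line is found.
selftest_eta() {
    mins=$(sed -n 's/.*Please wait \([0-9][0-9]*\) minutes.*/\1/p' | head -n 1)
    [ -n "$mins" ] || return 1
    echo "$((mins / 60))h $((mins % 60))m"
}

# Usage on the NAS (not run here):
#   smartctl --test=long /dev/ada0 | selftest_eta
```

For the 347 minutes quoted above, this works out to 5h 47m.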
