TrueNAS Core - Replacing a Drive With Bad Sectors

Impact

Dabbler
Joined
Feb 17, 2023
Messages
11
Hi,

I'm after advice on the correct procedure to replace a disk that has started showing bad sectors as per the below message.

Device: /dev/da8 [SAT], 40 Offline uncorrectable sectors.

The smartctl -a /dev/da8 output is at the end of the post.

Notes
  • The Seagate IronWolf 4TB disk will be replaced with an identical one in the Dell R510 12-bay server.
  • The server is running TrueNAS Core 13.0-U5.1.
  • The Pool contains ten disks with two-disk redundancy (RAIDZ2) and one Samba share.
  • The Pool is reporting a Status of 'Online' and Disks w/Errors of '0'.
  • I'm copying around 20TB of critical data onto the server, into the Pool containing this disk, so I'm keen to replace the drive before it fails.

Question

Should I be following the process in the documentation found here, under the section Replacing a Disk?

This would see me:
  1. offline the disk until it's reported as Offline;
  2. shut down the server;
  3. physically remove the disk from the server and replace it with the new drive;
  4. reboot the server, log in to the GUI and, under Pool Status, open the options for the offline disk and click Replace; and
  5. select the new disk, which will automatically begin the resilver process.
The answer to the above may just be 'Yes'; I'm asking because the last time I ran through this process I ended up with TrueNAS thinking there were 11 drives in the Pool, with the offlined drive still present alongside the new one.
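For reference, before shutting down I'd also confirm the state from the shell with something like the below (just a sketch; 'tank' is a placeholder for the actual pool name):

Code:
zpool status tank   # the offlined disk should be listed as OFFLINE before powering off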

smartctl Output

Code:
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate IronWolf
Device Model:     ST4000VN008-2DR166
Serial Number:    ZDH1LEDH
LU WWN Device Id: 5 000c50 0a2f0540c
Firmware Version: SC60
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5980 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Jul 12 15:23:33 2023 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  591) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 657) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x50bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   084   064   044    Pre-fail  Always       -       230196528
  3 Spin_Up_Time            0x0003   093   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   097   097   020    Old_age   Always       -       3509
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       232
  7 Seek_Error_Rate         0x000f   094   060   045    Pre-fail  Always       -       2749659395
  9 Power_On_Hours          0x0032   046   046   000    Old_age   Always       -       48069 (207 194 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       84
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   044   044   000    Old_age   Always       -       56
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       8590065666
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   070   055   040    Old_age   Always       -       30 (Min/Max 24/33)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       27
193 Load_Cycle_Count        0x0032   051   051   000    Old_age   Always       -       98712
194 Temperature_Celsius     0x0022   030   045   000    Old_age   Always       -       30 (0 8 0 0 0)
197 Current_Pending_Sector  0x0012   100   099   000    Old_age   Always       -       40
198 Offline_Uncorrectable   0x0010   100   099   000    Old_age   Offline      -       40
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       44580h+58m+04.653s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       49074968699
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1922414607351

SMART Error Log Version: 1
ATA Error Count: 96 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 96 occurred at disk power-on lifetime: 36927 hours (1538 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  42 00 00 ff ff ff 4f 00  46d+05:07:00.600  READ VERIFY SECTOR(S) EXT
  42 00 01 ff ff ff 4f 00  46d+05:07:00.369  READ VERIFY SECTOR(S) EXT
  61 00 01 ff ff ff 4f 00  46d+05:07:00.368  WRITE FPDMA QUEUED
  42 00 00 ff ff ff 4f 00  46d+05:07:00.248  READ VERIFY SECTOR(S) EXT
  42 00 01 ff ff ff 4f 00  46d+05:07:00.021  READ VERIFY SECTOR(S) EXT

Error 95 occurred at disk power-on lifetime: 36927 hours (1538 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  42 00 00 ff ff ff 4f 00  46d+05:07:00.600  READ VERIFY SECTOR(S) EXT
  42 00 01 ff ff ff 4f 00  46d+05:07:00.369  READ VERIFY SECTOR(S) EXT
  61 00 01 ff ff ff 4f 00  46d+05:07:00.368  WRITE FPDMA QUEUED
  42 00 00 ff ff ff 4f 00  46d+05:07:00.248  READ VERIFY SECTOR(S) EXT
  42 00 01 ff ff ff 4f 00  46d+05:07:00.021  READ VERIFY SECTOR(S) EXT

Error 94 occurred at disk power-on lifetime: 36927 hours (1538 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  42 00 00 ff ff ff 4f 00  46d+05:07:00.248  READ VERIFY SECTOR(S) EXT
  42 00 01 ff ff ff 4f 00  46d+05:07:00.021  READ VERIFY SECTOR(S) EXT
  61 00 01 ff ff ff 4f 00  46d+05:07:00.020  WRITE FPDMA QUEUED
  42 00 00 ff ff ff 4f 00  46d+05:06:59.897  READ VERIFY SECTOR(S) EXT
  42 00 01 ff ff ff 4f 00  46d+05:06:59.661  READ VERIFY SECTOR(S) EXT

Error 93 occurred at disk power-on lifetime: 36927 hours (1538 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  42 00 00 ff ff ff 4f 00  46d+05:07:00.248  READ VERIFY SECTOR(S) EXT
  42 00 01 ff ff ff 4f 00  46d+05:07:00.021  READ VERIFY SECTOR(S) EXT
  61 00 01 ff ff ff 4f 00  46d+05:07:00.020  WRITE FPDMA QUEUED
  42 00 00 ff ff ff 4f 00  46d+05:06:59.897  READ VERIFY SECTOR(S) EXT
  42 00 01 ff ff ff 4f 00  46d+05:06:59.661  READ VERIFY SECTOR(S) EXT

Error 92 occurred at disk power-on lifetime: 36927 hours (1538 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  42 00 00 ff ff ff 4f 00  46d+05:06:59.897  READ VERIFY SECTOR(S) EXT
  42 00 01 ff ff ff 4f 00  46d+05:06:59.661  READ VERIFY SECTOR(S) EXT
  61 00 01 ff ff ff 4f 00  46d+05:06:59.660  WRITE FPDMA QUEUED
  42 00 00 ff ff ff 4f 00  46d+05:06:59.536  READ VERIFY SECTOR(S) EXT
  42 00 01 ff ff ff 4f 00  46d+05:06:59.300  READ VERIFY SECTOR(S) EXT

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     48053         -
# 2  Short offline       Completed without error       00%     48029         -
# 3  Short offline       Completed without error       00%     48005         -
# 4  Short offline       Completed without error       00%     47981         -
# 5  Short offline       Completed without error       00%     47957         -
# 6  Short offline       Completed without error       00%     47933         -
# 7  Short offline       Completed without error       00%     47909         -
# 8  Short offline       Completed without error       00%     47885         -
# 9  Short offline       Completed without error       00%     47861         -
#10  Short offline       Completed without error       00%     47837         -
#11  Short offline       Completed without error       00%     47813         -
#12  Short offline       Completed without error       00%     47789         -
#13  Short offline       Completed without error       00%     47765         -
#14  Short offline       Completed without error       00%     47741         -
#15  Short offline       Completed without error       00%     47717         -
#16  Short offline       Completed without error       00%     47693         -
#17  Short offline       Completed without error       00%     47675         -
#18  Short offline       Completed without error       00%     47651         -
#19  Short offline       Completed without error       00%     47627         -
#20  Short offline       Completed without error       00%     47603         -
#21  Short offline       Completed without error       00%     47579         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing


Any advice would be greatly appreciated, even if that's just a 'Yes'.

Thank you
 

samarium

Contributor
Joined
Apr 8, 2023
Messages
192
Personally I would try overwriting the sectors and see if that makes attribute 197 decrease. I've done that a few times myself; my pool disks are at 73000 hrs, but they don't get hammered.
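If you want to watch whether those counts move over time, something like this works (just a sketch; substitute your actual device):

Code:
# Pull attributes 197 and 198 from the suspect drive
smartctl -A /dev/da8 | grep -E 'Current_Pending_Sector|Offline_Uncorrectable'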

I had a similar SMART error logged recently, but it was a warning that suggested it might go away by itself, and it did go away. Maybe the sector got overwritten by ZFS, or maybe it got overwritten during a SMART test; I don't know exactly what happens in a SMART test.

None of the errors shown in the logs happened recently at all? Roughly 37000 hrs in the error log vs disk power-on hrs of 48000? Are you sure the 40-count in attribute 197 (Current_Pending_Sector) is new? It matches the 198 (Offline_Uncorrectable) count. Have you checked your system kernel logs? Are there any events logged that might match the notification? Or is it so long ago that there aren't any indicators in the system logs? You might be focusing on something that has been around for a while.
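On Core (FreeBSD) a rough sketch of that check would be:

Code:
# Any CAM errors against da8 in the kernel buffer or system log?
dmesg | grep da8
grep da8 /var/log/messages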

Attributes 5, 197 and 198 are the interesting ones to me, and attributes 4 and 193 seem high for a server. Disk sectors can go bad, and reallocation is the mechanism designed to handle them; there is a pool of spare sectors in the disk to be used for this. Lots of reallocations is a warning, but what matters is the rate of change. Like all of these attributes, a stable value isn't usually a problem; it's when they start indicating issues and begin accelerating that I would worry.

As for replacing a disk, I prefer to leave the existing disk in place if you have the capacity, especially as you have no errors showing. Add the replacement disk in a spare slot, or on a spare SATA connector. As soon as you remove a disk you have less redundancy and are more exposed, but if you leave the disk in place you maintain pool redundancy levels.

When was the last time you did a scrub? 'No errors' doesn't mean much if you don't have a recent scrub, but if you are in the middle of a big transfer I wouldn't want to run one. Replacing a disk is effectively the same thing, only more so: it will cause the entire pool to thrash, with the disks seeking back and forth, in contention with your data transfer.

I would keep an eye on the SMART stats, the zpool status, and the system kernel logs, at least while you are doing a transfer. Then evaluate again.
 

Impact

Dabbler
Joined
Feb 17, 2023
Messages
11
Hi samarium,

Thank you for responding. I've broken down your reply to make it easier to respond to.

samarium said:
None of the errors shown in the logs happened recently at all? Roughly 37000 hrs in the error log vs disk power-on hrs of 48000? Are you sure the 40-count in attribute 197 (Current_Pending_Sector) is new? It matches the 198 (Offline_Uncorrectable) count. Have you checked your system kernel logs? Are there any events logged that might match the notification? Or is it so long ago that there aren't any indicators in the system logs? You might be focusing on something that has been around for a while.

The following error has been logged since my initial post:
TrueNAS Core Output said:
Device: /dev/da8 [SAT], Self-Test Log error count increased from 0 to 1.

I'll have to look through the kernel logs to see if I can find anything relating to when the issues you're referring to occurred. After transferring the current batch of data, I'll run an extended SMART test followed by a scrub of the Pool and see if the problem persists.
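For reference, the commands I'm planning to use are roughly the following (a sketch; 'tank' is a placeholder for the actual pool name):

Code:
smartctl -t long /dev/da8       # extended self-test; the drive estimates ~657 minutes above
smartctl -l selftest /dev/da8   # check the result once it finishes
zpool scrub tank                # then scrub; progress shows under zpool status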

samarium said:
As for replacing a disk, I prefer to leave the existing disk in place if you have the capacity, especially as you have no errors showing. Add the replacement disk in a spare slot, or on a spare SATA connector. As soon as you remove a disk you have less redundancy and are more exposed, but if you leave the disk in place you maintain pool redundancy levels.

The server is a 12-bay Dell R510 with 12 x 3.5" hard drives in the front bays (0-11), two 2.5" SSDs in the internal bays (12 & 13) and an M.2 SATA SSD plugged in via USB, so I'd need to replace the drive rather than put another in alongside it.

If we got to the point where the disk needed replacing, can you confirm this is the correct procedure to follow?

Impact said:
Should I be following the process in the documentation found here, under the section Replacing a Disk?

This would see me:

1. offline the disk until it's reported as Offline;
2. shut down the server;
3. physically remove the disk from the server and replace it with the new drive;
4. reboot the server, log in to the GUI and, under Pool Status, open the options for the offline disk and click Replace; and
5. select the new disk, which will automatically begin the resilver process.

If the above process is correct, at least I know the correct way to replace the drive should the need arise.

samarium said:
When was the last time you did a scrub? 'No errors' doesn't mean much if you don't have a recent scrub, but if you are in the middle of a big transfer I wouldn't want to run one. Replacing a disk is effectively the same thing, only more so: it will cause the entire pool to thrash, with the disks seeking back and forth, in contention with your data transfer.

The last scrub completed on Sunday, 9 July, with the '40 Offline uncorrectable sectors' notification appearing on Wednesday, 12 July. At the time of the scrub, the share had no files.

Any further advice would be greatly appreciated.

Thank you
 
Johnny Fartpants

Joined
Jul 3, 2015
Messages
926
This would see me:
  1. offline the disk until it's reported as Offline;
  2. shut down the server;
  3. physically remove the disk from the server and replace it with the new drive;
  4. reboot the server, log in to the GUI and, under Pool Status, open the options for the offline disk and click Replace; and
  5. select the new disk, which will automatically begin the resilver process.
Yes

shutdown the server;
Assuming you don't have hot-swap capabilities
 

samarium

Contributor
Joined
Apr 8, 2023
Messages
192
Where the self-test log error count increased, I would be interested to see what is reflected in the smartctl output; as it is, there isn't much data in the message. Which attributes changed? If there is a new error log entry, what does it say? Has the health status or anything else in the smartctl output changed? This is where saving the smartctl output on the machine and using the diff command makes things easier.
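The sort of thing I mean, as a rough sketch (the file names are just examples):

Code:
# Keep dated snapshots of the SMART output and diff them after an alert
smartctl -a /dev/da8 > /root/smart_da8_`date +%Y%m%d`.txt
diff /root/smart_da8_20230712.txt /root/smart_da8_20230719.txt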

If you don't have a spare slot, then you have no choice but to remove a drive for replacement.

Replacement procedure looks fine, although if you have hot-swap bays you can just hot swap the drives, verify status and proceed, rather than go thru a reboot cycle. Certainly that is what I've been doing for years. Hopefully you have identified the serial numbers and corresponding slots, have an external reference, and will be checking those before, during and after the swap operation. The last thing you want to do is take one disk offline, then pull another, incorrect, disk, although at least you can plug them back in, bring them back online and wait for the resilver to complete before trying again. Shutting down, I suppose, allows you to cross-check the serial numbers if you don't already have a good reference of serial number to slot; that is always a good idea, printed and stored accessibly near the server, although updating printed doco is always a pain.
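A minimal sketch of checking the mapping before pulling anything, assuming the backplane supports SES:

Code:
sesutil map                             # enclosure slot to device name mapping
smartctl -i /dev/da8 | grep -i serial   # confirm the serial of the disk you intend to pull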

If there was no data in the pool, then there was nothing to scrub, so it doesn't say anything about the disk errors if they were in a position on the disk that wasn't part of the pool; they could have already been there for 11000 hours? If you haven't got errors logged on the pool, then that usually means any disk errors were not related to pool IO, so maybe SMART, or identified by the disk itself? Have you cross-referenced the smartctl output time, the power-on hrs and the error log event power-on hrs to try and track down when the errors happened, and then what might have been happening on the pool?
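As a rough worked example from the output you already posted: the error log entries are at 36927 power-on hours and the drive is now at 48069, so 48069 - 36927 = 11142 hours, i.e. those logged errors are well over a year old.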
 

Impact

Dabbler
Joined
Feb 17, 2023
Messages
11
Hi Johnny Fartpants & samarium,

Thank you for replying with further support and apologies for the delay in responding, I've been focussing on making a backup of the TrueNAS server we've just started using (see the reason for this under Question Two at the end of this reply).

Johnny Fartpants, thank you for confirming.

Johnny Fartpants said:
Impact said:
shutdown the server;
Assuming you don't have hot-swap capabilities

I wasn't even thinking about the hot-swappable bays in the Dell R510; we don't have OMSA, so I've never tried hot-swapping them. Do you know if they will just work straight out of the box with TrueNAS?

samarium,

samarium said:
Where the self-test log error count increased, I would be interested to see what is reflected in the smartctl output; as it is, there isn't much data in the message. Which attributes changed? If there is a new error log entry, what does it say? Has the health status or anything else in the smartctl output changed? This is where saving the smartctl output on the machine and using the diff command makes things easier.
The error count increase was reflected in the smartctl output: the offline uncorrectable sectors now sit at 704 and the self-test error count at 114. The Pool is now showing as degraded and the disk is faulted.

samarium said:
Replacement procedure looks fine, although if you have hot swap bays you can just hot swap the drives, verify status and proceed, rather than go thru a reboot cycle. Certainly that is what I've been doing for years.

Please see the query at the top of this reply relating to hot-swappable bays in the Dell R510.

samarium said:
Hopefully you have identified the serial numbers and corresponding slots, have an external reference, and will be checking those before, during and after the swap operation. The last thing you want to do is take one disk offline, then pull another, incorrect, disk, although at least you can plug them back in, bring them back online and wait for the resilver to complete before trying again. Shutting down, I suppose, allows you to cross-check the serial numbers if you don't already have a good reference of serial number to slot; that is always a good idea, printed and stored accessibly near the server, although updating printed doco is always a pain.

I use the Disk description in TrueNAS to list the drive type and drive bay for this exact reason, thank you for the suggestion though!

samarium said:
If there was no data in the pool, then there was nothing to scrub, so it doesn't say anything about the disk errors if they were in a position on the disk that wasn't part of the pool; they could have already been there for 11000 hours? If you haven't got errors logged on the pool, then that usually means any disk errors were not related to pool IO, so maybe SMART, or identified by the disk itself? Have you cross-referenced the smartctl output time, the power-on hrs and the error log event power-on hrs to try and track down when the errors happened, and then what might have been happening on the pool?

As there has been an increase in disk errors since the original smartctl output, I think it's fair to say this disk has failed. That said, I now have a better understanding of what to test in future to see whether it is a true failure or not - thank you for the advice here.

I'm due to replace the drive tomorrow; however, an ugly situation has reared its head.

Question Two

The Seagate IronWolf 4TB drive noted above will be replaced; however, a second drive has now started showing the same errors as those discussed here. This drive is in the same Pool as the first, and if it fails all redundancy in the system would be lost (the Pool is RAIDZ2). The second drive is a pretty old one that I'd earmarked for replacement regardless of its failure.

Would it be best to replace the first drive and perform the resilver process, followed by the second drive once complete or is it safe to replace two drives and complete the resilver process for both?

Any advice would be greatly appreciated.

Thank you
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Would it be best to replace the first drive and perform the resilver process, followed by the second drive once complete or is it safe to replace two drives and complete the resilver process for both?
If the old drive is still alive, it's best to complete the resilver for the faulty one first.
 

Impact

Dabbler
Joined
Feb 17, 2023
Messages
11
Davvo said:
Would it be best to replace the first drive and perform the resilver process, followed by the second drive once complete or is it safe to replace two drives and complete the resilver process for both?
If the old drive is still alive, it's best to complete the resilver for the faulty one first.

Thank you for the quick reply, Davvo. I assumed that would be the case, as it's how we would do it with our Ubuntu servers, but I just thought I'd check for clarity.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
I prefer to power off my server to replace a hard drive, or any component. It's safer. Think about this: even high-quality drive bays can be inserted slightly crooked and bend pins or short out power, and yes, I've seen it happen on a Supermicro system. Scared the crap out of me.

Replace the drive ASAP! You had 232 reallocated sectors; this is a bad sign that the platters are developing bad spots. You have 40 pending reallocation sectors and, as you noted, 40 uncorrectable sector errors. This is very bad for a hard drive. Your system is telling you that you have a serious problem; thankfully, probably with no data loss.

As @Davvo said, replace it now before you have a second drive start throwing errors. And if you have drives with that many hours on them, consider replacing them as well (I'm just saying consider, not saying you must) when you can afford it. Remember: you replace a drive, it resilvers, and once that is complete you can replace another drive.

Good luck.
 

Impact

Dabbler
Joined
Feb 17, 2023
Messages
11
Hi joeschmuck,

Thank you for the feedback.

joeschmuck said:
I prefer to power off my server to replace a hard drive, or any component. It's safer. Think about this: even high-quality drive bays can be inserted slightly crooked and bend pins or short out power, and yes, I've seen it happen on a Supermicro system. Scared the crap out of me.

Historically we've always powered down our servers to replace drives, and whilst it's good to remember they offer the ability to hot-swap drives, speed is not the highest priority in this situation.

joeschmuck said:
Replace the drive ASAP! You had 232 reallocated sectors; this is a bad sign that the platters are developing bad spots. You have 40 pending reallocation sectors and, as you noted, 40 uncorrectable sector errors. This is very bad for a hard drive. Your system is telling you that you have a serious problem; thankfully, probably with no data loss.

We're in the middle of an unlucky situation which has seen us speed up the transition to TrueNAS from an Ubuntu server. As a precaution, I'm in the process of syncing our TrueNAS data back to the Ubuntu server before replacing the drive, which should be complete in the next few hours.

joeschmuck said:
As @Davvo said, replace it now before you have a second drive start throwing errors. And if you have drives with that many hours on them, consider replacing (I'm just saying consider, not saying you must) them as well when you can afford it and remember, you replace a drive, it resilvers, once that is complete you can replace another drive.

Good luck.

We'll be replacing a number of the drives in the TrueNAS over the coming weeks and months as this was previously our second fileserver and we know some of the drives have a high number of hours on them.

Thank you for the advice and well wishes.
 

Impact

Dabbler
Joined
Feb 17, 2023
Messages
11
Hi samarium, Johnny Fartpants, Davvo & joeschmuck,

Thank you for the support and guidance in this thread. I replaced the drive and resilvered the 10x4TB Pool in circa 13 hours.

The one issue I ran into was taking the failed drive Offline; no error message was displayed, and the status never changed from Faulted to Offline. Given the Faulted drive status and the likelihood that a scrub of the Pool would've taken hours or days, I opted to shut down the server and replace the drive - this worked fine.

Question Three

My reason for noting this is that I'm keen to replace two more drives in the Pool (with improved drives, for consistency), neither of which is showing as Faulted. Is it likely the failed drive wouldn't go Offline because of its Faulted status, and that the two Online drives I'd like to replace will Offline without issue?
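For reference, the shell equivalent I'd expect to use for taking a healthy member offline is roughly this (a sketch; 'tank' and the gptid are placeholders, and I'd normally do this through the GUI):

Code:
zpool status tank                   # note the gptid of the member to be replaced
zpool offline tank gptid/xxxxxxxx   # placeholder gptid; the disk should then show as OFFLINE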

Any advice would be greatly appreciated.
 
Johnny Fartpants

Joined
Jul 3, 2015
Messages
926
Faulted and Offline are practically the same thing. Faulted means the system has removed the drive from the pool, and Offline normally refers to the sysadmin removing the drive from the pool.

I look at disk replacements in two ways: 1. pre-emptive and 2. reactive.

Some people like to monitor their drive health via SMART stats and, if they see something they don't like, they carry out a pre-emptive replacement (I do this). Others wait till TrueNAS throws the drive out of the system (FAULTED) and then replace.

Drives don't always fail in a gentlemanly fashion and blow their brains out. Sometimes they can scream a lot before death and cause system-wide issues hence why some prefer to use the pre-emptive strategy.
 