Disk replacement procedure for RaidZ1?

Sokonomi · Jul 10, 2022

I'm sure this has been asked a dozen times before, but unfortunately all the info I can find either pertains to older builds of TrueNAS/FreeNAS, or involves some kind of issue that makes their situation more complicated than mine. So I'd like to run this checklist by some seasoned users before I start pulling things in my NAS.

So first some context:
The version I'm running is TrueNAS-12.0-U5 (I want to update, but I'm told it's best to do that after resilvering).
My current pool consists of 6 x WD Red 3 TB drives, one of which has been flaking out.
My current pool status reads as 'unhealthy', with 0 disks w/errors.
All disks are online and the NAS still seems to function as usual.
ada3p2 has a 'checksum 1', which I assume is where the unhealthy status is coming from.
I have copied said drive's serial number to make the physical disk easier to identify.
My system has a spare SATA port that I could use; apparently this helps.

So where do I go from here?
I have found this tutorial though it leaves some questions unanswered; Do I need to do any prepwork before doing this? I have jails running on this pool, should I offline those first? Any settings I need to back up/note down beforehand? The manual states a failing disk can be left online when replacing, though only when you know exactly what the failing disk condition is. How do I know which route is best to take? The pool isn't degraded, just unhealthy and seemingly still functioning. I could just plug the replacement disk into a spare SATA port and go at it that way, or I can offline and yank the broken disk out and plop the replacement in its spot.

The data on my pool isn't super crucial, though recovering it would save me a few hours of time.

Any tips on how I should proceed?

danb35 · Jul 10, 2022

Sokonomi said:
Do I need to do any prepwork before doing this?

Any disk you're using in your system should be burned-in and tested first. There are a number of guides on doing this; Uncle Fester's (link in my sig) is one.

Sokonomi said:
I have jails running on this pool, should I offline those first?

No.

Sokonomi said:
Any settings I need to back up/note down beforehand?

No.

Sokonomi said:
The manual states a failing disk can be left online when replacing, though only when you know exactly what the failing disk condition is.

If the failing disk is still generally OK (and with only one checksum error, I'd expect that to be the case), I prefer to leave it online during the replacement process. Taking it offline degrades your redundancy, and with RAIDZ1 means you have no redundancy at all. Leaving it online means that the resilvering will likely take longer than it would if the disk were offline, but I'd consider the preservation of redundancy worth it.

Etorix · Jul 10, 2022

"Checksum" could be a cable rather than the drive.

Do you have a backup?
What does SMART report? (smartctl -a /dev/ada3 in a SSH shell—avoid the GUI shell if possible)
You're welcome to post smartctl output, for all drives, as well as zpool status -v. Within CODE tags for readability, please.

If it's "just" a single checksum error, a scrub may be enough.
If you proceed to replace the drive, plug the new drive (preferably after running some burn-in tests on it) to the free port, and initiate replacement from the GUI (Storage > Pool > (gear) > Status > (old drive) > (…) > Replace). There's no need to offline or remove the old drive or stop the jails—though performance may suffer while resilver is in progress.

Sokonomi · Jul 10, 2022

danb35 said:
Any disk you're using in your system should be burned-in and tested first. There are a number of guides on doing this; Uncle Fester's (link in my sig) is one.

I've taken a look at the Uncle Fester manual, but I can't seem to find any mention of disk burn-in procedures. Could you point me to where this is explained?

danb35 said:
If the failing disk is still generally OK (and with only one checksum error, I'd expect that to be the case), I prefer to leave it online during the replacement process. Taking it offline degrades your redundancy, and with RAIDZ1 means you have no redundancy at all. Leaving it online means that the resilvering will likely take longer than it would if the disk were offline, but I'd consider the preservation of redundancy worth it.

The NAS still seems to be performing 'alright'. I think it recovered and resilvered itself once, but the disk has had a sector reallocated or something. I think it should still have enough left to aid its replacement, but that's a hard call for me to make since I've got no experience resilvering arrays.

Etorix said:
"Checksum" could be a cable rather than the drive.

Do you have a backup?
What does SMART report? (smartctl -a /dev/ada3 in a SSH shell—avoid the GUI shell if possible)
You're welcome to post smartctl output, for all drives, as well as zpool status -v. Within CODE tags for readability, please.

If it's "just" a single checksum error, a scrub may be enough.
If you proceed to replace the drive, plug the new drive (preferably after running some burn-in tests on it) to the free port, and initiate replacement from the GUI (Storage > Pool > (gear) > Status > (old drive) > (…) > Replace). There's no need to offline or remove the old drive or stop the jails—though performance may suffer while resilver is in progress.

The cable being faulty would be a surprise; I've never had a cable give out after 4 years of unimpeded service before. But I'll keep that in mind once I pull the allegedly faulty drive for testing.

I have fully backed up all important data, and most of the less important stuff as well, though if I could preserve the pool it would save me a bit of headache.

Here's what SMART spat out:

smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Green
Device Model:     WDC WD30EZRX-00MMMB0
Serial Number:    WD-WCAWZ0908447
LU WWN Device Id: 5 0014ee 25b5f72ee
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Jul 10 10:51:34 2022 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 121) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                (50160) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 482) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   167   144   021    Pre-fail  Always       -       8616
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1762
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   008   008   000    Old_age   Always       -       67738
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       1306
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       177
193 Load_Cycle_Count        0x0032   145   145   000    Old_age   Always       -       167827
194 Temperature_Celsius     0x0022   117   106   000    Old_age   Always       -       35
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       2
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       2071
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       2

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%      2191         1044225103
# 2  Short offline       Completed without error       00%      2024         -
# 3  Extended offline    Completed without error       00%      1992         -
# 4  Short offline       Completed: read failure       10%      1856         979058200
# 5  Short offline       Completed: read failure       60%      1689         979058200
# 6  Short offline       Completed: read failure       10%      1528         978963121
# 7  Short offline       Completed: read failure       90%      1353         1001713848
# 8  Extended offline    Completed: read failure       90%      1257         1001713848
# 9  Short offline       Completed: read failure       90%      1185         1001713848
#10  Short offline       Completed: read failure       90%      1017         1001713848
#11  Short offline       Completed: read failure       80%       850         1001713848
#12  Short offline       Completed without error       00%       683         -
#13  Extended offline    Completed: read failure       80%       516         982200736
#14  Short offline       Completed without error       00%       347         -
#15  Short offline       Completed without error       00%       179         -
#16  Short offline       Completed without error       00%        11         -
#17  Short offline       Completed without error       00%     65380         -
#18  Extended offline    Completed without error       00%     65342         -
#19  Short offline       Completed without error       00%     65213         -
#20  Short offline       Completed without error       00%     65045         -
#21  Short offline       Completed without error       00%     64878         -
9 of 10 failed self-tests are outdated by newer successful extended offline self-test # 3

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0) :
 After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

To my surprise, this disk seems to identify itself as a WD GREEN (??) while I am 100% certain the physical disk label states otherwise. I even have pictures of the drives from when I was constructing the NAS, and it definitely appears to be all reds. I've SMART checked the others and they all identify normally. I bought them all new and sealed from a reputable vendor. I am as baffled at this as you might be. I guess this is why the drive is faulting while the rest of them are fine? Could some weird fault cause a drive to misidentify somehow? All the more reason for me to remove this bizarre drive I guess..

For what it's worth, the failures seem to be all read failures, which might be typical for a slow-to-wake WD Green?

So all I really have to do here is run a burn-in test on the new drive, plop it into the spare SATA port and do as you instructed to replace? I presume doing that will prompt me for which disk to use as the replacement? And once it's done resilvering, I can just power down, yank the red/green bastard, and plug the new disk into its rack mount/port? The new disk would have to sit unmounted during this procedure as I've only got a SATA port to spare, not a disk mount (it's a 6-disk Fractal Design Node case).

danb35 · Jul 10, 2022

Sokonomi said:
Could you point me to where this is explained?

fester112:hvalid_hdd [danb35's Wiki]

www.familybrown.org

In the SMART attributes of your disk, nothing's too concerning other than the load cycle count, which is consistent with it being a Green rather than a Red (and just the age of the disk). But it's pretty consistently failing SMART self-tests; that's definitely a reason to replace it. You should have been getting email alerts about this--make sure you've entered your email address in the SMART service configuration.
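To put a number on "pretty consistently failing", here is a minimal Python sketch that tallies the self-test log posted above. The rows are copied from that smartctl output; the regular expression and the function name `parse_selftests` are this sketch's own assumptions, not part of smartmontools, and the pattern is only known to fit the log layout shown in this thread.

```python
import re

# Rows copied from the "SMART Self-test log" section posted in this thread.
SELFTEST_LOG = """\
# 1  Short offline       Completed: read failure       90%      2191         1044225103
# 2  Short offline       Completed without error       00%      2024         -
# 3  Extended offline    Completed without error       00%      1992         -
# 4  Short offline       Completed: read failure       10%      1856         979058200
# 5  Short offline       Completed: read failure       60%      1689         979058200
# 6  Short offline       Completed: read failure       10%      1528         978963121
# 7  Short offline       Completed: read failure       90%      1353         1001713848
# 8  Extended offline    Completed: read failure       90%      1257         1001713848
# 9  Short offline       Completed: read failure       90%      1185         1001713848
#10  Short offline       Completed: read failure       90%      1017         1001713848
#11  Short offline       Completed: read failure       80%       850         1001713848
#12  Short offline       Completed without error       00%       683         -
#13  Extended offline    Completed: read failure       80%       516         982200736
#14  Short offline       Completed without error       00%       347         -
#15  Short offline       Completed without error       00%       179         -
#16  Short offline       Completed without error       00%        11         -
#17  Short offline       Completed without error       00%     65380         -
#18  Extended offline    Completed without error       00%     65342         -
#19  Short offline       Completed without error       00%     65213         -
#20  Short offline       Completed without error       00%     65045         -
#21  Short offline       Completed without error       00%     64878         -
"""

# Assumed row layout: "#N  <test type>  <status>  <remaining>%  <hours>  <LBA or ->"
ROW = re.compile(
    r"^#\s*(\d+)\s+(Short offline|Extended offline)\s+(.*?)\s+(\d+)%\s+(\d+)\s+(\S+)\s*$"
)


def parse_selftests(text):
    """Return one dict per self-test row that matches the assumed layout."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if not match:
            continue
        num, kind, status, _remaining, _hours, lba = match.groups()
        rows.append({
            "num": int(num),
            "kind": kind,
            "failed": "failure" in status,
            "first_error_lba": None if lba == "-" else int(lba),
        })
    return rows


tests = parse_selftests(SELFTEST_LOG)
failed = [t for t in tests if t["failed"]]
failing_lbas = sorted({t["first_error_lba"] for t in failed})

print(f"{len(failed)} of {len(tests)} logged self-tests failed")
print(f"{len(failing_lbas)} distinct first-error LBAs: {failing_lbas}")
```

On the posted log this counts 10 failures out of 21 entries, spread over 5 distinct starting LBAs, which fits the reading that more than one spot on the disk is unreadable.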

Etorix · Jul 10, 2022

This is not too bad (yet… these counters should go up in the not-too-distant future):

Code:

  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0

but I'd say that this is worrying:

Code:

197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       2
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       2071
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       2

as well as the SMART errors on various sectors. Replace the drive with a CMR drive of the same or higher capacity.
These appear to be physical errors, not just "slow to wake up".
As for the Red label on a Green, only WD could tell what might have happened.
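The split above (zero reallocations, but non-zero pending, uncorrectable and CRC counters) can be checked mechanically. Below is a small Python sketch that does so on the attribute rows copied from the smartctl output in this thread. The set of watched IDs is an assumption taken from the two groups quoted above, not an official threshold, and `parse_attributes` is a hypothetical helper name. Per the posts in this thread, a climbing UDMA_CRC_Error_Count may point at the cable rather than the platters, so treat the flags as prompts for a closer look.

```python
# Attribute rows copied from the smartctl output posted in this thread.
ATTRS = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       2
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       2071
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       2
"""

# Assumption of this sketch: these are the IDs singled out in the posts above.
WATCHED = {5, 196, 197, 198, 199, 200}


def parse_attributes(text):
    """Return {attribute_id: (name, raw_value)} for each 10-column row."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attrs[int(fields[0])] = (fields[1], int(fields[9]))
    return attrs


def nonzero_watched(attrs):
    """Keep only watched attributes whose raw value is above zero."""
    return {i: v for i, v in attrs.items() if i in WATCHED and v[1] > 0}


flagged = nonzero_watched(parse_attributes(ATTRS))
for attr_id, (name, raw) in sorted(flagged.items()):
    print(f"{attr_id:>3} {name}: raw value {raw}")
```

The parser assumes the raw value is a plain integer in the tenth column, which holds for the rows shown here but not for every drive model.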

Sokonomi said:
So all I really have to do here is run a burn-in test on the new drive, plop it into the spare SATA port and do as you instructed to replace? I presume doing that will prompt me for which disk to use as the replacement? And once it's done resilvering, I can just power down, yank the red/green bastard, and plug the new disk into its rack mount/port? The new disk would have to sit unmounted during this procedure as I've only got a SATA port to spare, not a disk mount (it's a 6-disk Fractal Design Node case).

Yes. You may also mount the new drive in the bay right away and leave the old drive unmounted, but still plugged into a SATA port. ZFS tracks drives by gptid, so it doesn't matter if drives/ports are shuffled across reboots.
But always carefully check by serial # which drive is to be replaced! It may or may not still be 'ada3' after a reboot.

Sokonomi · Jul 10, 2022

danb35 said:
fester112:hvalid_hdd [danb35's Wiki]

www.familybrown.org

Oh, so burn-in and validation mean the same thing? Sorry, I didn't know. Is there an easy way to perform these tests using a USB external and Windows? I don't really have any lab computers on hand to do it with at the moment. I've looked around for a Windows/DOS SMART test tool, but most results felt kinda sketchy, so I was hoping someone here knows of a tried and true bit of software that I could use.

I have two new WD Red Plus 4 TB drives on hand that I intend to resilver in one by one (and then 2 more next week and 2 more the week after, to spread the cost). So if I could get through the preliminary validation stuff using my desktop machine, that would make things a little easier.

danb35 said:
In the SMART attributes of your disk, nothing's too concerning other than the load cycle count, which is consistent with it being a Green rather than a Red (and just the age of the disk). But it's pretty consistently failing SMART self-tests; that's definitely a reason to replace it. You should have been getting email alerts about this--make sure you've entered your email address in the SMART service configuration.

I did enter my email, but for some reason I did not receive a notice until I logged in and saw a red bell icon. By some strange coincidence I logged on to check up on a jail only hours after the NAS resilvered itself, so fortunately it hasn't been sitting on it for that long yet.

Etorix said:
This is not too bad (yet… these counters should go up in the not-too-distant future):

Code:

  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0

but I'd say that this is worrying:

Code:

197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       2
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       2071
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       2

as well as the SMART errors on various sectors. Replace the drive with a CMR drive of the same or higher capacity.
These appear to be physical errors, not just "slow to wake up".
As for the Red label on a Green, only WD could tell what might have happened.

Yes. You may also mount the new drive in the bay right away and leave the old drive unmounted, but still plugged into a SATA port. ZFS tracks drives by gptid, so it doesn't matter if drives/ports are shuffled across reboots.
But always carefully check by serial # which drive is to be replaced! It may or may not still be 'ada3' after a reboot.

Some sectors are eating dirt; that's definitely a death rattle. What bothers me the most, though, is that the label says Red but SMART comes up with Green. No idea why I hadn't noticed that before, but it really bothers me. I vaguely remember flashing some firmware changes on two of my old 1 TB WD Greens (terrible purchase, never touch Green!) in another computer just for funzies, but they have never been in the same system as my Reds, so I don't know. All I can think of is that either I or my vendor got scammed. It's 4 years too late to do anything about that now, though. I'm glad I didn't buy my new drives from the same place.

Another thing I'm pondering now is whether the firmware-reported serial # will even match the darn physical sticker at all. But I guess I can find the bugger by process of elimination: any sticker # that doesn't match what TrueNAS is giving me is the culprit.

ZFS is a wonderfully resilient system, it seems. I remember screwing up a classic RAID0 by accidentally mixing up SATA plugs. Not possible with ZFS, apparently.

So here is where I'm at:
- First I need to validate my new disk(s) one way or another.
- Then I need to power down my TrueNAS and introduce said disk to it.
- Power back on, then go to 'Storage > Pool > (gear) > Status > (old drive) > (…) > Replace'
- It will probably prompt me for what to use as the replacement here.
- Let it carry out a resilver.
- Power down and pull the broken disk out.
- Move the new disk into its place.
- Power back on, and run a SMART test to see if all went well.
- Fester's your uncle.

Did I miss anything?

I remember reading somewhere that I should turn on automatic expansion so the pool will grow to a bigger size once all drives are replaced, but I'm not sure if and where I should do that.

Etorix · Jul 10, 2022

Sokonomi said:
I remember reading somewhere that I should turn on automatic expansion so the pool will grow to a bigger size once all drives are replaced, but I'm not sure if and where I should do that.

The autoexpand flag should have been set by default. You can check with zpool get autoexpand.
Your replace list is complete.

The burn-in can be done in any system which can run the commands you're using (there's no definitive standard on that). But it's best to plug the drive(s) through SATA rather than by USB, as the procedure may run for a long time (badblocks takes days to complete its passes) and USB connections are not that reliable.
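On the autoexpand point, a quick back-of-the-envelope calculation shows why the extra space only appears after the last of the six disks has been swapped. This is a rough Python sketch under a simplifying assumption: a RAIDZ1 vdev treats every member as if it were the size of its smallest disk and spends one disk's worth of space on parity. It ignores ZFS metadata, padding and the TB/TiB difference, so the numbers are estimates, and the function name is made up for illustration.

```python
def raidz1_data_capacity_tb(disk_sizes_tb):
    """Rough RAIDZ1 data capacity in TB.

    Simplifying assumption: every member counts as the smallest disk and
    one disk's worth of space holds parity. ZFS overhead is ignored.
    """
    if len(disk_sizes_tb) < 2:
        raise ValueError("this estimate needs at least two disks")
    return (len(disk_sizes_tb) - 1) * min(disk_sizes_tb)


# Walk through swapping the 3 TB disks for 4 TB disks one at a time.
for replaced in range(7):
    sizes = [4] * replaced + [3] * (6 - replaced)
    capacity = raidz1_data_capacity_tb(sizes)
    print(f"{replaced} of 6 replaced: about {capacity} TB of data capacity")
```

Under these assumptions the estimate stays at about 15 TB for zero through five replaced disks and only moves to about 20 TB once all six are 4 TB drives.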

danb35 · Jul 10, 2022

Sokonomi said:
Is there an easy way to perform these tests using a USB external and windows?

I'm sure it could be done, but I have no idea what tools you'd use. You said you had a spare SATA port in the server, right? Why not just connect the disk there and run the tests on your server?

Sokonomi said:
Did I miss anything?

Doesn't look like it.

Sokonomi · Jul 11, 2022

danb35 said:
Why not just connect the disk there and run the tests on your server?

It's a bit of a mess when it comes to IT in my building. I was halfway through running all the cables for a nice server rack cabinet with all the space I'd want, but the contractor's wife got ill, so now I'm still stuck keeping my NAS in a cramped little Fractal Design Node box up on a high shelf in an awkward, hard-to-access corner of my workshop. And of course one of the disks decided to die now, a month or two before I have the chance to move it all over to a nice, roomy, accessible 14-bay SilverStone rack. Talk about bad timing.

So I have a bit of an issue with where I'm going to keep the server while all this validation and resilvering is happening. It can't sit on my workbench with its disk hanging out for longer than a 3-day weekend, so I was looking to minimise that by doing the preliminaries from my desktop or something. But that machine is a heavy watercooled rig with all the cables managed a little too well, so it's not easy to just hang a disk on it internally. Hence the question about a USB external.

But there is a plan C... maybe.
I have a little Dell computer I was preparing to be my new router (pfSense is neat), but I can hold off on deploying that since the rack isn't ready yet, and use it to run the tests instead. I believe Uncle Fester's manual lists installing TrueNAS on some temporary box, solely to run the apparently built-in disk checks, as one of the options. But I'll have to see if that tiny SFF computer even has a second SATA port to begin with. :')

Tech is fun, when it's working!

Sokonomi · Jul 19, 2022

Welp, I've done all that I summarized, but after pulling the broken disk out, the NAS complains:

CRITICAL
Pool Tank state is DEGRADED: One or more devices could not be opened. Sufficient replicas exist for the pool to continue functioning in a degraded state.
The following devices are not healthy:

Disk 9911428738427028150 is UNAVAIL

2022-07-19 09:23:32 (Europe/Amsterdam)

So apparently I did miss something..
Anyone know what I should be doing?

Patrick M. Hausen · Jul 19, 2022

zpool status is the first thing we need to start ...

Sokonomi · Jul 19, 2022

Patrick M. Hausen said:
zpool status is the first thing we need to start ...

Of course, here you go:

Code:

  pool: Tank
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-2Q
  scan: resilvered 25.8M in 00:00:06 with 0 errors on Tue Jul 19 10:09:50 2022
config:

        NAME                                            STATE     READ WRITE CKSUM
        Tank                                            DEGRADED     0     0     0
          raidz1-0                                      DEGRADED     0     0     0
            gptid/73ea972a-d762-11eb-97f5-d05099c19171  ONLINE       0     0     0
            gptid/74b038f8-d762-11eb-97f5-d05099c19171  ONLINE       0     0     0
            gptid/74baea02-d762-11eb-97f5-d05099c19171  ONLINE       0     0     0
            9911428738427028150                         UNAVAIL      0     0     0  was /dev/gptid/74a6b398-d762-11eb-97f5-d05099c19171
            gptid/74cf2566-d762-11eb-97f5-d05099c19171  ONLINE       0     0     0
            gptid/93915049-05ad-11ed-9080-d05099c19171  ONLINE       0     0     0

errors: No known data errors

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:10 with 0 errors on Sun Jul 17 03:45:10 2022
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          ada0p2    ONLINE       0     0     0

errors: No known data errors
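For readers following along, the config table above can be summarised with a few lines of Python. This is only a sketch: the rows are copied from the `zpool status` output in this post, `parse_members` is a hypothetical helper, and the rule "a member row starts with `gptid/` or is a bare numeric GUID" is an assumption that fits this output, not a general parser for every pool layout.

```python
# Config rows copied from the `zpool status` output for pool Tank.
ZPOOL_CONFIG = """\
        NAME                                            STATE     READ WRITE CKSUM
        Tank                                            DEGRADED     0     0     0
          raidz1-0                                      DEGRADED     0     0     0
            gptid/73ea972a-d762-11eb-97f5-d05099c19171  ONLINE       0     0     0
            gptid/74b038f8-d762-11eb-97f5-d05099c19171  ONLINE       0     0     0
            gptid/74baea02-d762-11eb-97f5-d05099c19171  ONLINE       0     0     0
            9911428738427028150                         UNAVAIL      0     0     0  was /dev/gptid/74a6b398-d762-11eb-97f5-d05099c19171
            gptid/74cf2566-d762-11eb-97f5-d05099c19171  ONLINE       0     0     0
            gptid/93915049-05ad-11ed-9080-d05099c19171  ONLINE       0     0     0
"""


def parse_members(text):
    """Return (name, state, previous_path) for each leaf device row.

    Assumption: leaf rows start with "gptid/" or are a bare numeric GUID,
    which is how this particular output names them.
    """
    members = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue
        name, state = fields[0], fields[1]
        if name.startswith("gptid/") or name.isdigit():
            previous = fields[-1] if "was" in fields else None
            members.append((name, state, previous))
    return members


members = parse_members(ZPOOL_CONFIG)
unhealthy = [m for m in members if m[1] != "ONLINE"]

print(f"{len(members)} member slots, {len(unhealthy)} not ONLINE")
for name, state, previous in unhealthy:
    print(f"  {name} is {state} (was {previous})")
```

It reports six member slots with one UNAVAIL entry that used to be `/dev/gptid/74a6b398-d762-11eb-97f5-d05099c19171`, so at the time of this output the pool still counted a missing disk as one of its six members.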

Patrick M. Hausen · Jul 19, 2022

So there's one disk missing, which must be the one you pulled out. Did you replace that with the new disk before you pulled it? If not you need to do that now. Can be done in the UI ...

Is your new disk connected? camcontrol devlist shows all the disks the system knows about. Also, the UI shows the pool status in a way where we can see which GPTID is which disk ...

joeschmuck · Jul 19, 2022

Yeah, I see five drives of the six you originally had. I don't think you have actually added the new drive, so you missed some instructions. I'd probably power down and reconnect the drive you pulled out, power back on and scrub the pool again, get it back to normal, then you can start all over. My assessment of your drives is that you have one that might have a SATA data cable issue (UDMA_CRC_Errors), but the fact that it cannot always pass a self-test (which is a completely internal drive test) is evidence that the drive is heading toward total failure.

Be very descriptive in what you did to get to this point, do not assume we know what you have done, we are not there and you could be making a simple mistake. Treat us as if we are clueless on what you are doing, because we are. This is not to say that we think less of you, it's the fact that a lot is lost by making assumptions. We all want to help you fix your problem with the least amount of trouble and no data loss.

Jailer · Jul 19, 2022

Are you sure you pulled the correct disk?

joeschmuck · Jul 19, 2022

Jailer said:
Are you sure you pulled the correct disk?

Very true, you must use the serial number every time.

Sokonomi · Jul 19, 2022

Patrick M. Hausen said:
So there's one disk missing, which must be the one you pulled out. Did you replace that with the new disk before you pulled it? If not you need to do that now. Can be done in the UI ...

Is your new disk connected? camcontrol devlist shows all the disks the system knows about. Also, the UI shows the pool status in a way where we can see which GPTID is which disk ...

So here's a rundown of what I did so far:
01. Ran the disk through its paces in another computer, according to Uncle Fester's guide; it came up clean and proper.
02. Powered down and plugged the new disk into the last available SATA port, booted back up and it recognised the disk correctly.
03. I clicked 'Storage > Pools > [cog] > Status' to determine which ada# had the checksum 1.
04. I clicked '⋮ > Replace' on said disk and selected the new disk as the member disk.
05. Waited for it to prattle through the long resilvering process.
06. At this point the pool said it was healthy again.
07. I checked which ada# the broken disk had once more, then clicked 'Storage > Disks' to find the corresponding S/N.
08. Powered down, pulled the disk that matched it, moved the new disk from the spare SATA port to the old disk's SATA port.
09. Powered up, only to find that the pool is degraded once again.
10. Added the old disk back in using the spare SATA port; this restored the pool back to healthy again.

So, long story short: I can't remove the broken disk despite having replaced it with the new one.

EDIT:
This is what camcontrol devlist gives me with all disks plugged in and running:

Code:

<TEAM T253LE120G SBFM11.1>         at scbus0 target 0 lun 0 (pass0,ada0)
<WDC WD30EFRX-68EUZN0 82.00A82>    at scbus1 target 0 lun 0 (pass1,ada1)
<WDC WD30EFRX-68EUZN0 82.00A82>    at scbus2 target 0 lun 0 (pass2,ada2)
<WDC WD30EZRX-00MMMB0 80.00A80>    at scbus3 target 0 lun 0 (pass3,ada3)
<WDC WD30EZRX-00AZ6B0 80.00A80>    at scbus4 target 0 lun 0 (pass4,ada4)
<WDC WD30EFRX-68N32N0 82.00A82>    at scbus5 target 0 lun 0 (pass5,ada5)
<WDC WD30EFRX-68EUZN0 82.00A82>    at scbus6 target 0 lun 0 (pass6,ada6)
<WDC WD40EFZX-68AWUN0 81.00B81>    at scbus7 target 0 lun 0 (pass7,ada7)
<AHCI SGPIO Enclosure 2.00 0001>   at scbus8 target 0 lun 0 (pass8,ses0)

The pool status also shows all but ada3 present, which is to be expected as ada3 is the 'bad' one that got replaced with the new disk currently residing on ada7.
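Since device names such as ada3 can change between boots, it can help to list model strings per device from the `camcontrol devlist` output above. The Python sketch below does that on the lines copied from this post. The regular expression and the helper name `parse_devlist` are assumptions that fit this output only. The filter string `WD30EZRX` is the model prefix that smartctl's database labelled "Western Digital Green" earlier in the thread.

```python
import re

# Lines copied from the `camcontrol devlist` output in this post.
DEVLIST = """\
<TEAM T253LE120G SBFM11.1>         at scbus0 target 0 lun 0 (pass0,ada0)
<WDC WD30EFRX-68EUZN0 82.00A82>    at scbus1 target 0 lun 0 (pass1,ada1)
<WDC WD30EFRX-68EUZN0 82.00A82>    at scbus2 target 0 lun 0 (pass2,ada2)
<WDC WD30EZRX-00MMMB0 80.00A80>    at scbus3 target 0 lun 0 (pass3,ada3)
<WDC WD30EZRX-00AZ6B0 80.00A80>    at scbus4 target 0 lun 0 (pass4,ada4)
<WDC WD30EFRX-68N32N0 82.00A82>    at scbus5 target 0 lun 0 (pass5,ada5)
<WDC WD30EFRX-68EUZN0 82.00A82>    at scbus6 target 0 lun 0 (pass6,ada6)
<WDC WD40EFZX-68AWUN0 81.00B81>    at scbus7 target 0 lun 0 (pass7,ada7)
<AHCI SGPIO Enclosure 2.00 0001>   at scbus8 target 0 lun 0 (pass8,ses0)
"""

# Assumed line layout: "<identity> at scbusN target N lun N (passN,devN)"
LINE = re.compile(r"^<(.+?)>\s+at scbus\d+ target \d+ lun \d+ \(([^)]*)\)")


def parse_devlist(text):
    """Return {device_name: identity_string} for each matching line."""
    devices = {}
    for line in text.splitlines():
        match = LINE.match(line)
        if not match:
            continue
        identity, names = match.groups()
        # Skip the passN pass-through node and keep the other peripheral name.
        dev = next((n for n in names.split(",") if not n.startswith("pass")), None)
        if dev:
            devices[dev] = identity
    return devices


devices = parse_devlist(DEVLIST)
green_like = sorted(d for d, ident in devices.items() if "WD30EZRX" in ident)

for dev in sorted(devices):
    print(f"{dev}: {devices[dev]}")
print("devices reporting a WD30EZRX model:", green_like)
```

On this listing the filter returns both ada3 and ada4, which suggests a second drive in the system reports the same model family as the failing one. Matching by serial number in the UI or with smartctl remains the reliable way to tell the physical disks apart.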

To verify further, here's the list of disks in the UI:

The highlighted one, on ada3, is the broken drive. Its pool is marked as N/A as well, which I assume means it's no longer part of the pool and can be safely removed from the system. But when I do, it complains, for some bizarre reason.

joeschmuck · Jul 19, 2022

Go to the drive listed as ada3 and 'offline' the drive. Once done the pool should still be healthy. Then power down and disconnect the drive, lastly power on again. That "should" work.

Sokonomi · Jul 19, 2022

joeschmuck said:
Go to the drive listed as ada3 and 'offline' the drive. Once done the pool should still be healthy. Then power down and disconnect the drive, lastly power on again. That "should" work.

Strangely enough.. it did! Thank you!
