Does order of drive replacement matter?

David Simpson

Dabbler
Joined
Nov 24, 2015
Messages
26
Looks like it will be a long time, yikes
Looks like it ended up finishing some time in the night and I survived replacing the first drive. I'm going to try the second drive today.

It reported two files that were damaged, so I just deleted them.

Code:

root@freenas:~ # zpool status -v
  pool: Drive
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: resilvered 1.92T in 07:48:42 with 30 errors on Thu Mar  9 20:51:16 2023
config:

        NAME                                            STATE     READ WRITE CKSUM
        Drive                                           DEGRADED     0     0     0
          raidz1-0                                      DEGRADED     0     0     0
            gptid/abb78c4a-88b8-11e5-beea-d05099c043da  ONLINE       0     0   115
            gptid/ac7c0f67-88b8-11e5-beea-d05099c043da  DEGRADED    72     0    45  too many errors
            gptid/ad42f84f-88b8-11e5-beea-d05099c043da  ONLINE       0     0   115
            gptid/91776ae5-bea4-11ed-a7fe-0007431447b0  ONLINE       0     0   115
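
For anyone hitting the same situation: deleting the corrupted files alone doesn't clean up the status output. The usual sequence is a scrub (so ZFS can re-verify the pool; the file errors may take a scrub, sometimes two, to drop off the list) followed by zpool clear to reset the per-device counters. A sketch, using the pool name from above:

Code:
# Re-verify the whole pool after removing the damaged files.
zpool scrub Drive

# Check progress; wait for the scrub to complete.
zpool status -v Drive

# Once clean, reset the READ/WRITE/CKSUM counters.
zpool clear Drive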
 

David Simpson

Dabbler
Joined
Nov 24, 2015
Messages
26
Survived new drive #2. Things appear healthy, but I'm probably going to rotate the other two before I end up in a degraded state again. Any recommendations on a test I should run at this point?

Code:
root@freenas:~ # zpool status -v
  pool: Drive
 state: ONLINE
  scan: resilvered 1.91T in 07:49:58 with 0 errors on Fri Mar 10 17:17:09 2023
config:

        NAME                                            STATE     READ WRITE CKSUM
        Drive                                           ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/abb78c4a-88b8-11e5-beea-d05099c043da  ONLINE       0     0     0
            gptid/a4bc2458-bf4f-11ed-9967-0007431447b0  ONLINE       0     0     0
            gptid/ad42f84f-88b8-11e5-beea-d05099c043da  ONLINE       0     0     0
            gptid/91776ae5-bea4-11ed-a7fe-0007431447b0  ONLINE       0     0     0

errors: No known data errors
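
(For later readers: the standard sanity check at this point is a full scrub plus SMART extended self-tests on every disk. A sketch; the device name is a placeholder for each of your actual disks:)

Code:
# Scrub the pool and re-check status when it finishes.
zpool scrub Drive
zpool status -v Drive

# SMART extended self-test on each member disk; review the
# results after the test's estimated runtime has passed.
smartctl -t long /dev/ada0
smartctl -a /dev/ada0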
 
Joined
Jun 15, 2022
Messages
674
Survived new drive #2. Things appear healthy, but I'm probably going to rotate the other two before I end up in a degraded state again. Any recommendations on a test I should run at this point?
First and foremost: Don't hurry. And as already mentioned ...

[1] Make some backup(s). (A minimal sketch follows this list.)
[3] Get _good_ replacement drives, i.e. CMR (maybe even larger ones than before). Get four of them ... maybe five. (If one of your drives is showing the defects of old age, it's likely that others will follow.)
[5] Do some stress tests / disk burn-in.
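For point [1], a minimal ZFS-native sketch, where backuppool and the snapshot name pre-swap are placeholders for whatever you have available (the target could just as well be a pool on another machine reached over ssh):

Code:
# Snapshot everything, then replicate the snapshot tree
# to a second pool.
zfs snapshot -r Drive@pre-swap
zfs send -R Drive@pre-swap | zfs receive -F backuppool/Drive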
Since you asked a question @awasb answered, here's my take:
  1. On a RAID Server of your size I would buy a spare drive (S1) for swapping in immediately should one drive start to look itchy.
  2. I'd test the (S1) drive in another system, running smartctl -x, then badblocks -w, then smartctl -x again (see the sketch below). Mind you, @jgreco would say to burn it in with more passes (like badblocks -w -p 10) and wrote his own burn-in test suite, and for a setup like yours I would consider that to be great advice.
  3. I'd replace one of the older drives (O3) with the new drive (S1) from above, then run the above tests on the drive that was just pulled (O3) to make sure it's still good.
  4. Assuming (O3) is good I'd replace the other old drive (O4) with (O3), then run the above tests on (O4) to make sure it's still good.
  5. Assuming (O4) is good I'd replace one of the new drives (N1) with the other old drive (O4), then run the above tests on (N1) to make sure it's actually good.
  6. Assuming (N1) is good I'd replace the other new drive (N2) with (N1), run the tests on (N2) to make sure it's actually good.
This leaves the pool with two old (now tested) drives and two new drives, plus one tested spare on the shelf. We know all spinners die eventually, some in short order, so it's wise to have at least one tested spare on hand.
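
A sketch of the burn-in from step 2, assuming badblocks is available (on FreeBSD it ships with the e2fsprogs port) and /dev/ada5 stands in for the drive under test. Note that badblocks -w destroys everything on the disk:

Code:
# SMART snapshot before the burn-in.
smartctl -x /dev/ada5 > before.txt

# Destructive write test over the whole disk; -b 4096 is needed
# on large drives, -s shows progress. Add -p 10 for ten passes.
badblocks -b 4096 -ws /dev/ada5

# SMART snapshot after. Timestamps will differ; what matters is
# any growth in reallocated or pending sector counts.
smartctl -x /dev/ada5 > after.txt
diff before.txt after.txt

Each swap in steps 3 through 6 is then an ordinary replacement: pull the old disk, insert the tested one, run zpool replace Drive <old-gptid> <new-device> (or use the GUI's disk replace action), and let the resilver finish before touching the next drive.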

7. I'd then try to find a second spare drive (S2) on sale and burn it in. That's just me, but I don't really trust backups: they often go untested until you need them, and that's exactly when you discover part of the backup is corrupt and that it should have been tested long before. If your backup solution is as old as your NAS, that's a real possibility.
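
Exercising the backup before you need it is the boring fix. A minimal sketch, with backuppool and the file paths as placeholders:

Code:
# Scrub the backup pool so its own checksums get verified.
zpool scrub backuppool
zpool status -v backuppool    # expect: errors: No known data errors

# And do an actual test restore: compare a sample file's
# checksum against the live copy.
sha256 /mnt/Drive/somefile /mnt/backuppool/Drive/somefile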

---
As an aside, joining the TrueNAS community is one of the best things I've done. I had no idea how many errors happen inside a computer every day. I used to think that's what all the error-handling sub-systems were for: catching the rare exception and correcting it. That turns out to be quite incorrect. Systems designers have pushed speeds so high that errors are frequent, and they rely on the error-handling sub-systems to correct them. The ECC sub-system is being used as a crutch, promoted from a safety net to part of the actual system, and that's poor system design. It's like saying "1 + 1 = 47" and counting on a friend to correct you.

When I joined I thought the TrueNAS Community members were a bit (or more than a bit) over-zealous about protecting "their" data. Now I understand they're simply educated and making sensible decisions.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Now I understand they're simply educated and making sensible decisions.

Or paranoid and acting the part. From my perspective, I want to tell people things that are as right as I possibly can. If I tell someone something like "it's all right to use SATA port multipliers", the truth is that they may actually get away with it, but it will wreak havoc with scrubs/resilvers in many cases; and if they're unfortunate enough to find a SATA PM that works fine on FreeBSD and then switch from CORE to SCALE, I'll feel like a real heel. The reason Synology and QNAP send you a prebuilt NAS unit with a consistent manifest of parts is that it's a good way to get a consistent result. TrueNAS gets a good result when you build systems that at least somewhat resemble the hardware iXsystems sells.
 