3M+ errors during replace

Purraxxus

Cadet
Joined
Oct 19, 2023
Messages
2
One of my hard drives started displaying a DEGRADED status a while ago, so I decided to replace the disk on Monday. The replace procedure has been running for a long while now and all seemed relatively OK. This night, however, I got this status email while I was sleeping:
Device /dev/gptid/ba62d176-a1b2-11ea-b1d4-4ccc6af67a89 is causing slow I/O on pool Pool.

About 2 hours later:
Disk 2355410887658246026 is FAULTED

And again an hour later:
The following alert has been cleared:
* Device /dev/gptid/ba62d176-a1b2-11ea-b1d4-4ccc6af67a89 is causing slow I/O on pool Pool


When I now look at zpool status, I get this:
  pool: Pool
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Oct 14 09:47:31 2023
        24.2T scanned at 59.1M/s, 24.2T issued at 59.1M/s, 26.2T total
        3.26T resilvered, 92.55% done, 09:36:52 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        Pool                                            DEGRADED     0     0     0
          gptid/b01261f9-42c8-11ea-8678-4ccc6af67a89    ONLINE       0     0     0
          gptid/29b37c8b-4a88-11ea-ab15-4ccc6af67a89    ONLINE       0     0     0
          replacing-2                                   DEGRADED  1.58M     0     0
            gptid/ba62d176-a1b2-11ea-b1d4-4ccc6af67a89  FAULTED    220     0   221  too many errors  (resilvering)
            gptid/ed5f51b1-6a65-11ee-af43-4ccc6af67a89  ONLINE       0     0 3.16M  (resilvering)
          gptid/bbce31f3-25c1-11eb-87f5-4ccc6af67a89    ONLINE       0     0     0
          gptid/ec31cfe5-7b65-11eb-b1ef-4ccc6af67a89    ONLINE       0     0     0
          gptid/b4aabe3f-7c34-11eb-b1ef-4ccc6af67a89    ONLINE       0     0     0  (resilvering)
          gptid/d5a63b64-7c34-11eb-b1ef-4ccc6af67a89    ONLINE       0     0     0  (resilvering)
          gptid/ea3bda10-7c34-11eb-b1ef-4ccc6af67a89    ONLINE       0     0     0

errors: 109205 data errors, use '-v' for a list

It seems something has gone massively wrong and I'm not sure how to fix it.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
From what I can see, you have a STRIPED pool, WITHOUT redundancy. So yes, data loss would be expected. Having more than one disk in a striped pool increases the risk of data loss, because most data is striped across multiple disks. Statistically, when you lose a disk (or a series of blocks), it will likely affect more than one file; the larger a file is, the more likely it is that some of its blocks landed on the lost disk.

After the replacement is done, run the requested command and restore the affected files:
zpool status -v Pool
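
Once those files have been restored from backup, the usual cleanup is roughly the following; a sketch of the standard ZFS workflow, not TrueNAS-specific guidance:

zpool clear Pool        # reset the per-device READ/WRITE/CKSUM counters
zpool scrub Pool        # re-read everything to verify the restored data
zpool status -v Pool    # the list of damaged files should clear once they have
                        # been replaced (or deleted) and a scrub has completed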


That said, my own media pool is striped, without redundancy. However, I have multiple cold backups and knew that on block loss, I would need to restore file(s). And on complete loss of 1 disk, it would be full restore time.


Now, if your pool is not striped, the output you listed above doesn't show what type of redundancy it has, RAID-Zx or mirroring.
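
For reference, a pool with redundancy shows a grouping vDev between the pool and the disks; in your paste the disks sit directly under "Pool", which is what marks it as a plain stripe. The names below are made up, just to show the shape:

        NAME              STATE     READ WRITE CKSUM
        tank              ONLINE       0     0     0
          raidz2-0        ONLINE       0     0     0
            gptid/disk1   ONLINE       0     0     0
            gptid/disk2   ONLINE       0     0     0
            gptid/disk3   ONLINE       0     0     0
            gptid/disk4   ONLINE       0     0     0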
 

Purraxxus

Cadet
Joined
Oct 19, 2023
Messages
2
Full restore time it is, then. It is in fact just a striped pool, as the data is not super important.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Glad you knew it was striped and that you have a backup.

Some very new users to ZFS & TrueNAS accidentally make a striped pool. Or they have a redundant vDev, but later add a single-disk vDev thinking it would "expand" the RAID-Zx.
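
For what it's worth, the command line does push back on that second mistake: adding a bare disk to a RAID-Zx pool makes zpool add refuse unless it is forced. Roughly like this, with pool and disk names made up:

zpool add tank da8
# invalid vdev specification
# use '-f' to override the following errors:
# mismatched replication level: pool uses raidz and new vdev is disk

Forcing it with 'zpool add -f' (or ticking "force" in a GUI) is what quietly turns a redundant pool into a partly striped one.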
 
Joined
Oct 22, 2019
Messages
3,641
Some ~~very new users to ZFS & TrueNAS~~ users from Unraid accidentally make a striped pool. Or they have a redundant vDev, but later add a single-disk vDev thinking it would "expand" the RAID-Zx.

:wink:

Fine, fine, it's not true for everyone, but Unraid has conditioned its users to "keep adding drives to expand your storage" and "use a couple drives to hold parity data of your files".
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222