I'm running a burn-in period on a new set of disks. The layout is three RAIDZ1 vdevs of four 4TB disks each; I was experimenting with throughput for 10GbE use, and this runs at about 900 MB/s read/write. I also have a spare drive allocated to the pool.
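As a rough sanity check on that layout, here is a minimal Python sketch of the nominal numbers. It assumes the ~900 figure quoted above is in megabytes per second, uses decimal terabytes, and ignores ZFS metadata and padding overhead. The constant and function names are made up for illustration.

```python
# Nominal figures for the pool described above: three RAIDZ1 vdevs of four 4 TB disks.
# Assumptions (not from the post): decimal units, no ZFS overhead,
# and the quoted throughput read as 900 MB/s.
VDEVS = 3
DISKS_PER_VDEV = 4
DISK_TB = 4
PARITY_DISKS = 1  # RAIDZ1 keeps one disk's worth of parity per vdev


def pool_summary(observed_mb_s=900, link_gbit_s=10):
    """Return nominal capacity and link-utilisation figures as a dict."""
    raw_tb = VDEVS * DISKS_PER_VDEV * DISK_TB
    data_tb = VDEVS * (DISKS_PER_VDEV - PARITY_DISKS) * DISK_TB
    # 10 Gbit/s is 1250 MB/s before any protocol overhead.
    line_rate_mb_s = link_gbit_s * 1000 / 8
    return {
        "raw_tb": raw_tb,
        "data_tb": data_tb,
        "failures_tolerated_per_vdev": PARITY_DISKS,
        "line_rate_mb_s": line_rate_mb_s,
        "link_utilisation": observed_mb_s / line_rate_mb_s,
    }


if __name__ == "__main__":
    for key, value in pool_summary().items():
        print(f"{key}: {value}")
```

The per-vdev tolerance is the figure to keep in mind for the rest of this post: a RAIDZ1 vdev is only expected to survive the loss of one member.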
Yesterday one of the disks popped, recovered, and resilvered to the spare. Once that completed, the disk popped again and died, and the spare stepped in again. The pool was resilvering when I came to look this morning, but some worrying additional read/write errors are showing on other disks in the same RAIDZ1:
Code:
root@Sisyphus:~ # zpool status
  pool: DataPool
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Mar 28 10:37:08 2020
        25.0T scanned at 3.82G/s, 19.2T issued at 2.93G/s, 25.0T total
        2.62G resilvered, 76.73% done, 0 days 00:33:48 to go
config:
NAME                                              STATE     READ WRITE CKSUM
DataPool                                          DEGRADED     2     0 5.48M
  raidz1-0                                        ONLINE       0     0     0
    gptid/7abf8085-6a02-11ea-a922-a0369f4e18bc    ONLINE       0     0     0
    gptid/8d7ccf17-6a02-11ea-a922-a0369f4e18bc    ONLINE       0     0     0
    gptid/94977190-6a02-11ea-a922-a0369f4e18bc    ONLINE       0     0     0
    gptid/9348eedb-6a02-11ea-a922-a0369f4e18bc    ONLINE       0     0     0
  raidz1-1                                        DEGRADED     2     0 11.0M
    gptid/82f438bf-6a02-11ea-a922-a0369f4e18bc    DEGRADED     0     0     0  too many errors
    spare-1                                       DEGRADED     0     0 4.03K
      12077469904772203790                        REMOVED      0     0     0  was /dev/gptid/8d6b27be-6a02-11ea-a922-a0369f4e18bc
      gptid/014fbddf-6bcc-11ea-9cdb-a0369f4e18bc  ONLINE       0     0     0
    gptid/8f66452c-6a02-11ea-a922-a0369f4e18bc    FAULTED    114   693     0  too many errors
    gptid/8f771d80-6a02-11ea-a922-a0369f4e18bc    ONLINE     104   112     0
  raidz1-2                                        ONLINE       0     0     0
    gptid/8ad40550-6a02-11ea-a922-a0369f4e18bc    ONLINE       0     0     0
    gptid/8f548563-6a02-11ea-a922-a0369f4e18bc    ONLINE       0     0     0
    gptid/44ad6394-6aec-11ea-a922-a0369f4e18bc    ONLINE       0     0     0
    gptid/8ac32412-6a02-11ea-a922-a0369f4e18bc    ONLINE       0     0     0
spares
  4973883019692991099                             INUSE     was /dev/gptid/014fbddf-6bcc-11ea-9cdb-a0369f4e18bc
errors: 5744771 data errors, use '-v' for a list
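To see at a glance which rows in that output carry non-zero counters, the config table can be parsed with a few lines of Python. This is a minimal sketch, not an official tool: the sample rows are copied from the raidz1-1 section above, the function names are hypothetical, and the K/M suffixes are assumed to be 1024-based abbreviations, so the converted values are approximate.

```python
# Sample rows copied from the raidz1-1 section of the status output above.
SAMPLE = """\
raidz1-1 DEGRADED 2 0 11.0M
gptid/82f438bf-6a02-11ea-a922-a0369f4e18bc DEGRADED 0 0 0 too many errors
spare-1 DEGRADED 0 0 4.03K
12077469904772203790 REMOVED 0 0 0 was /dev/gptid/8d6b27be-6a02-11ea-a922-a0369f4e18bc
gptid/014fbddf-6bcc-11ea-9cdb-a0369f4e18bc ONLINE 0 0 0
gptid/8f66452c-6a02-11ea-a922-a0369f4e18bc FAULTED 114 693 0 too many errors
gptid/8f771d80-6a02-11ea-a922-a0369f4e18bc ONLINE 104 112 0
"""

# Assumption: abbreviated counters use binary (1024-based) multiples.
SUFFIXES = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_count(text):
    """Convert a counter such as '693' or '4.03K' to an (approximate) integer."""
    if text[-1] in SUFFIXES:
        return round(float(text[:-1]) * SUFFIXES[text[-1]])
    return int(text)


def parse_row(line):
    """Parse one 'NAME STATE READ WRITE CKSUM [note]' row; return None if it does not fit."""
    fields = line.split()
    if len(fields) < 5:
        return None
    try:
        read, write, cksum = (parse_count(f) for f in fields[2:5])
    except ValueError:
        return None  # e.g. the header row, whose counter columns are words
    return {
        "name": fields[0],
        "state": fields[1],
        "read": read,
        "write": write,
        "cksum": cksum,
        "note": " ".join(fields[5:]),
    }


def devices_with_errors(text):
    """Return the parsed rows that have any non-zero READ, WRITE or CKSUM counter."""
    rows = (parse_row(line) for line in text.splitlines())
    return [r for r in rows if r and (r["read"] or r["write"] or r["cksum"])]


if __name__ == "__main__":
    for row in devices_with_errors(SAMPLE):
        print(row["name"], row["state"], row["read"], row["write"], row["cksum"], row["note"])
```

The parser splits on whitespace only, so it does not recover the vdev hierarchy; it just flags rows whose counters are non-zero.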
I didn't want to swap out the spare with those other errors also in play, so I'm leaving that to run to completion. I also didn't want to pull a second disk from a RAIDZ1; that would be suicidal, of course.
Instead I added another spare to take on that FAULTED drive:
Code:
root@Sisyphus:~ # zpool status
  pool: DataPool
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Mar 28 12:30:51 2020
        14.6T scanned at 471M/s, 9.82T issued at 1.20G/s, 25.0T total
        0 resilvered, 39.34% done, 0 days 03:36:12 to go
config:
NAME                                              STATE     READ WRITE CKSUM
DataPool                                          DEGRADED     2     0 5.48M
  raidz1-0                                        ONLINE       0     0     0
    gptid/7abf8085-6a02-11ea-a922-a0369f4e18bc    ONLINE       0     0     0
    gptid/8d7ccf17-6a02-11ea-a922-a0369f4e18bc    ONLINE       0     0     0
    gptid/94977190-6a02-11ea-a922-a0369f4e18bc    ONLINE       0     0     0
    gptid/9348eedb-6a02-11ea-a922-a0369f4e18bc    ONLINE       0     0     0
  raidz1-1                                        DEGRADED     2     0 11.0M
    spare-0                                       DEGRADED     0     0     0
      gptid/82f438bf-6a02-11ea-a922-a0369f4e18bc  DEGRADED     0     0     0  too many errors
      gptid/cfe11c7a-70ef-11ea-895f-a0369f4e18bc  ONLINE       0     0     0
    spare-1                                       DEGRADED     0     0 4.03K
      12077469904772203790                        REMOVED      0     0     0  was /dev/gptid/8d6b27be-6a02-11ea-a922-a0369f4e18bc
      gptid/014fbddf-6bcc-11ea-9cdb-a0369f4e18bc  ONLINE       0     0     0
    gptid/8f66452c-6a02-11ea-a922-a0369f4e18bc    FAULTED    114   693     0  too many errors
    gptid/8f771d80-6a02-11ea-a922-a0369f4e18bc    ONLINE     104   112     0
  raidz1-2                                        ONLINE       0     0     0
    gptid/8ad40550-6a02-11ea-a922-a0369f4e18bc    ONLINE       0     0     0
    gptid/8f548563-6a02-11ea-a922-a0369f4e18bc    ONLINE       0     0     0
    gptid/44ad6394-6aec-11ea-a922-a0369f4e18bc    ONLINE       0     0     0
    gptid/8ac32412-6a02-11ea-a922-a0369f4e18bc    ONLINE       0     0     0
spares
  4973883019692991099                             INUSE     was /dev/gptid/014fbddf-6bcc-11ea-9cdb-a0369f4e18bc
  8076838506038568445                             INUSE     was /dev/gptid/cfe11c7a-70ef-11ea-895f-a0369f4e18bc
errors: No known data errors
Now it's "No known data errors", magically? And the new spare-0 has allocated itself to a different drive? Well, ok - it did say "too many errors" on that one too. Well, let's bang in another spare and try again:
Code:
root@Sisyphus:~ # zpool status
  pool: DataPool
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Mar 28 12:35:20 2020
        24.0T scanned at 1.43G/s, 16.1T issued at 228M/s, 25.0T total
        1.21G resilvered, 64.57% done, 0 days 11:17:39 to go
config:
NAME                                              STATE     READ WRITE CKSUM
DataPool                                          DEGRADED     2     0 5.48M
  raidz1-0                                        ONLINE       0     0     0
    gptid/7abf8085-6a02-11ea-a922-a0369f4e18bc    ONLINE       0     0     0
    gptid/8d7ccf17-6a02-11ea-a922-a0369f4e18bc    ONLINE       0     0     0
    gptid/94977190-6a02-11ea-a922-a0369f4e18bc    ONLINE       0     0     0
    gptid/9348eedb-6a02-11ea-a922-a0369f4e18bc    ONLINE       0     0     0
  raidz1-1                                        DEGRADED     2     0 11.0M
    spare-0                                       DEGRADED     0     0     0
      gptid/82f438bf-6a02-11ea-a922-a0369f4e18bc  DEGRADED     0     0     0  too many errors
      gptid/cfe11c7a-70ef-11ea-895f-a0369f4e18bc  ONLINE       0     0     0
    spare-1                                       DEGRADED     0     0 4.03K
      12077469904772203790                        REMOVED      0     0     0  was /dev/gptid/8d6b27be-6a02-11ea-a922-a0369f4e18bc
      gptid/014fbddf-6bcc-11ea-9cdb-a0369f4e18bc  ONLINE       0     0     0
    spare-2                                       DEGRADED     0     0 14.9K
      gptid/8f66452c-6a02-11ea-a922-a0369f4e18bc  FAULTED    114   693     0  too many errors
      gptid/4b125e82-70f0-11ea-895f-a0369f4e18bc  ONLINE       0     0     0
    gptid/8f771d80-6a02-11ea-a922-a0369f4e18bc    ONLINE     104   112     0
  raidz1-2                                        ONLINE       0     0     0
    gptid/8ad40550-6a02-11ea-a922-a0369f4e18bc    ONLINE       0     0     0
    gptid/8f548563-6a02-11ea-a922-a0369f4e18bc    ONLINE       0     0     0
    gptid/44ad6394-6aec-11ea-a922-a0369f4e18bc    ONLINE       0     0     0
    gptid/8ac32412-6a02-11ea-a922-a0369f4e18bc    ONLINE       0     0     0
spares
  4973883019692991099                             INUSE     was /dev/gptid/014fbddf-6bcc-11ea-9cdb-a0369f4e18bc
  8076838506038568445                             INUSE     was /dev/gptid/cfe11c7a-70ef-11ea-895f-a0369f4e18bc
  14815739748345029129                            INUSE     was /dev/gptid/4b125e82-70f0-11ea-895f-a0369f4e18bc
errors: No known data errors
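The scan lines in the three outputs above look consistent with a simple relationship: percent done is roughly issued/total, and time to go is roughly (total - issued) divided by the issue rate. That relationship is inferred here from the posted numbers rather than taken from documentation, and the units are assumed to be 1024-based. The helper names in this Python sketch are hypothetical.

```python
# (issued, total, issue rate, reported percent, reported time to go),
# copied from the three "scan:" sections above.
SCANS = [
    ("19.2T", "25.0T", "2.93G/s", 76.73, "00:33:48"),
    ("9.82T", "25.0T", "1.20G/s", 39.34, "03:36:12"),
    ("16.1T", "25.0T", "228M/s", 64.57, "11:17:39"),
]

# Assumption: abbreviated sizes use binary (1024-based) multiples.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def to_bytes(text):
    """Convert '19.2T' or '2.93G/s' to a byte count (or bytes per second)."""
    text = text.split("/")[0]  # drop a trailing '/s' if present
    return float(text[:-1]) * UNITS[text[-1]]


def estimate(issued, total, rate):
    """Return (percent done, seconds remaining) under the assumed relationship."""
    issued_b, total_b, rate_b = to_bytes(issued), to_bytes(total), to_bytes(rate)
    percent = 100 * issued_b / total_b
    seconds = (total_b - issued_b) / rate_b
    return percent, seconds


def hms_to_seconds(text):
    """Convert 'HH:MM:SS' to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


if __name__ == "__main__":
    for issued, total, rate, reported_pct, reported_eta in SCANS:
        percent, seconds = estimate(issued, total, rate)
        print(
            f"{issued} of {total} at {rate}: "
            f"estimated {percent:.1f}% / {seconds:.0f}s, "
            f"reported {reported_pct}% / {hms_to_seconds(reported_eta)}s"
        )
```

The estimates land within about one percent of the reported times, with the small gaps plausibly down to rounding in the displayed figures. Note that each added spare restarted the resilver, which is why the start time and progress change between outputs.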
Can someone explain to me how this pool is working at all? Three disks in a RAIDZ1 that are in a bad enough way that the system has all by itself subbed in three spares, and yet "No known data errors" (and there were originally!) and the pool is still up and running?
Spares are clearly doing some sort of deep magic here. Much more than was revealed in my recent thread "What are Spare drives in pools useful for?"
Also I probably ought to add another spare for that last member of the RAIDZ1 since that's having read/write errors too! But I've run out of slots unless I plumb in the other disk shelf. And I don't understand how this is still working! With zero data errors!
Help?