Swapped disk without offlining old one... what now?

Joined
Jan 20, 2022
Messages
6
Hi all,

I had a misbehaving disk in my array (3-disk RAIDz1) so I set about replacing it. However, I forgot to offline the old disk. So I just shut down the server, pulled it, put in the new disk, then used the GUI replace.

Now I'm getting hundreds of thousands of checksum errors, but the data seems to be okay? At least the files I tested.

What could be the solution?

Thanks!
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
Checksum errors usually indicate cabling issues.

Could you have bumped or somehow unseated a cable on one or more of your disks in replacing the disk?
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,996
Did you finish resilvering? Please post the output of zpool status within code brackets.
A cabling issue can also be indicated by a drive accumulating UDMA_CRC_Errors in the SMART data.
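
For anyone following along, here is a minimal Python sketch of the cabling check mentioned above. It assumes you have saved the text printed by `smartctl -A` for one drive, and that the drive reports the ATA attribute `UDMA_CRC_Error_Count` in the usual ten-column table. The function name `udma_crc_errors` is made up for this illustration.

```python
import re

# Matches the row of the `smartctl -A` attribute table for
# UDMA_CRC_Error_Count and captures the raw value in the last column.
CRC_ROW = re.compile(
    r"^\s*\d+\s+UDMA_CRC_Error_Count\b.*?(\d+)[ \t]*$",
    re.MULTILINE,
)


def udma_crc_errors(smartctl_output):
    """Return the raw UDMA CRC error count, or None if the row is absent."""
    match = CRC_ROW.search(smartctl_output)
    return int(match.group(1)) if match else None
```

A raw value that keeps climbing between checks points toward the cable or port rather than the disk itself. The counter is normally cumulative over the drive's life, so a nonzero number that never changes may be left over from an old problem.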
 
Joined
Jan 20, 2022
Messages
6
sretalla said:
Checksum errors usually indicate cabling issues.

Could you have bumped or somehow unseated a cable on one or more of your disks in replacing the disk?
Because of the way my disks are installed, I had to unplug all of them to replace the one disk, and then I plugged them back in. I guess it's possible all the cables died? I'll replace them regardless, since I've already had a SATA cable die in that machine.
 
Joined
Jan 20, 2022
Messages
6
joeschmuck said:
Did you finish resilvering? Please post the output of zpool status within code brackets.
A cabling issue can also be indicated by a drive accumulating UDMA_CRC_Errors in the SMART data.
I finished a resilver and subsequently a scrub. I've rebooted the machine, so the per-device error counts don't show anymore (there used to be about 104k), but here is the output of zpool status:
Code:
  pool: delorenzstore
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 352K in 13:15:27 with 36652 errors on Tue Jul 25 22:28:12 2023
config:

        NAME                                      STATE     READ WRITE CKSUM
        delorenzstore                             ONLINE       0     0     0
          raidz1-0                                ONLINE       0     0     0
            73b60159-9659-4d3d-8c06-fe77e2c22246  ONLINE       0     0     0
            54eef677-8607-4e69-a74e-e8dd71e391c8  ONLINE       0     0     0
            1db1b278-dd43-4dd3-b692-4d73dfd498c6  ONLINE       0     0     0

errors: 36652 data errors, use '-v' for a list


I'll run SMART tests right now.
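
The output above can be hard to read at a glance, so here is a rough Python sketch that pulls out the numbers that matter: the scrub error count, the data error count, and the per-device READ/WRITE/CKSUM counters. The `SAMPLE` text is copied from the post above. The function names and the list of device states are assumptions made for this illustration, not part of any ZFS tooling.

```python
import re

# States that can appear in the STATE column of the config table.
STATES = {"ONLINE", "DEGRADED", "FAULTED", "OFFLINE", "UNAVAIL", "REMOVED"}

SAMPLE = """\
  scan: scrub repaired 352K in 13:15:27 with 36652 errors on Tue Jul 25 22:28:12 2023
config:

        NAME                                      STATE     READ WRITE CKSUM
        delorenzstore                             ONLINE       0     0     0
          raidz1-0                                ONLINE       0     0     0
            73b60159-9659-4d3d-8c06-fe77e2c22246  ONLINE       0     0     0
            54eef677-8607-4e69-a74e-e8dd71e391c8  ONLINE       0     0     0
            1db1b278-dd43-4dd3-b692-4d73dfd498c6  ONLINE       0     0     0

errors: 36652 data errors, use '-v' for a list
"""


def summarize_zpool_status(text):
    """Extract scrub errors, data errors, and per-device counters."""
    summary = {"scrub_errors": None, "data_errors": None, "devices": {}}
    scan = re.search(r"scrub repaired \S+ in \S+ with (\d+) errors", text)
    if scan:
        summary["scrub_errors"] = int(scan.group(1))
    data = re.search(r"errors: (\d+) data errors", text)
    if data:
        summary["data_errors"] = int(data.group(1))
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 5 and fields[1] in STATES:
            name, _state, read, write, cksum = fields
            # Counters stay as strings because zpool may abbreviate them (e.g. "104K").
            summary["devices"][name] = {"read": read, "write": write, "cksum": cksum}
    return summary


def devices_with_errors(summary):
    """Names of devices whose READ, WRITE, or CKSUM counter is not zero."""
    return [
        name
        for name, counters in summary["devices"].items()
        if any(value != "0" for value in counters.values())
    ]


summary = summarize_zpool_status(SAMPLE)
print(summary["scrub_errors"], summary["data_errors"], devices_with_errors(summary))
```

For this output the device counters are all zero after the reboot, while the scrub line and the `errors:` line both still report 36652.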
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
You still have a big problem there...

run zpool status -v to find out which files are corrupt
 
Joined
Jan 20, 2022
Messages
6
sretalla said:
You still have a big problem there...

run zpool status -v to find out which files are corrupt
A bunch of the files are just listed as hex codes... the only ones with a recognizable data path are system files I don't care about. I don't have a backup (I know that's bad), but I do have snapshots turned on and running frequently. Could I restore from a snapshot?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
A bunch of the files are just listed as hex codes
Those are metadata and will represent a whole bunch of your files in the pool.

I don't have a backup (I know that's bad) but I do have snapshots turned on and running frequently. Could I restore from a snapshot?
Snapshots aren't a backup... a pool checkpoint might be something between those things, but isn't a backup either.

It's unlikely your snapshots will be accessible (certainly not in full) as they live in your pool also, which is damaged.
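
To get a feel for how much of the damage is in named files versus the hex-code entries discussed above, a small Python sketch like the following can split the list printed by `zpool status -v`. It assumes the list follows a line beginning with `errors:` that mentions "following files", with one entry per indented line. The function name and the sample entries in the usage note are hypothetical.

```python
def split_error_entries(status_v_text):
    """Split the `zpool status -v` error list into paths and hex-style entries."""
    paths, unresolved = [], []
    in_list = False
    for line in status_v_text.splitlines():
        if line.startswith("errors:"):
            # The list of entries starts after this header line.
            in_list = "following files" in line
            continue
        if not in_list:
            continue
        entry = line.strip()
        if not entry:
            continue
        # Entries such as <0x1a2>:<0x3b4> or <metadata>:<0x0> have no file path.
        if "<0x" in entry or entry.startswith("<metadata>"):
            unresolved.append(entry)
        else:
            paths.append(entry)
    return paths, unresolved
```

Called on text containing the made-up entries `/mnt/example/file.bin` and `<0x1a2>:<0x3b4>`, it returns the first in `paths` and the second in `unresolved`. Comparing the two counts gives a rough idea of how much is recoverable by name.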
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,996
My advice is to make a backup of your files before doing anything more if you haven't already done so. With some luck, the corrupt files are not important (cross your fingers).
 
Joined
Jan 20, 2022
Messages
6
joeschmuck said:
My advice is to make a backup of your files before doing anything more if you haven't already done so. With some luck, the corrupt files are not important (cross your fingers).
Crossing them big time... I've got my hands on a 14TB external drive and hope to start offloading everything tonight.

I'd still like to understand what caused all these issues. Was it the simple fact that I didn't offline the drive before pulling it? That seems crazy when the docs just say "if you don't offline, resilvering will take longer". Maybe it was running a scrub with faulty (?) cables?
 
Joined
Oct 22, 2019
Messages
3,641
Those errors could be left over from a previous scrub or resilver, during which cabling, ports, and/or the HBA could have triggered a massive number of "checksum" errors.

Usually the procedure is to "zpool clear" followed by a full scrub. Then you can safely rule it out.
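
As a sketch of that procedure, the Python below builds the three commands in order (`zpool clear`, `zpool scrub`, then `zpool status -v`) and prints them by default instead of running them. The helper names are invented for this example, and the pool name is whatever your pool is called (`delorenzstore` in this thread).

```python
import subprocess


def clear_and_scrub_commands(pool):
    """Commands for: reset error counters, start a full scrub, then inspect."""
    return [
        ["zpool", "clear", pool],
        ["zpool", "scrub", pool],
        ["zpool", "status", "-v", pool],
    ]


def run(commands, dry_run=True):
    """Print each command, or execute it when dry_run is False (needs root)."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


run(clear_and_scrub_commands("delorenzstore"))
```

Note that `zpool scrub` returns as soon as the scrub starts, so the status command only shows progress at that point. The earlier scrub in this thread took over 13 hours, so the status needs to be checked again once the scrub has finished.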
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,996
Maybe it was running a scrub with faulty (?) cables?
Generally, a faulty SATA cable will generate UDMA_CRC_Errors on your hard drives, so I would have to say that the CKSUM errors are due to your drive(s) failing. Some people will try to stretch the life of a hard drive by ignoring early warnings; I see that too often. They tell themselves, "I have a RAIDZ1/Z2 and I'm safe," but in reality, if another drive is having issues, data loss is a real possibility. If your drives are old, consider replacing them before they completely fail. Also examine the SMART data on each drive for failure indications. If you would like us to analyze the data, post it here and we will let you know if all is good or if you have a problem.
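
One rough way to apply the cable-versus-drive reasoning above is to sort nonzero SMART raw values into two buckets. The Python sketch below does that for the text printed by `smartctl -A`. The choice of which attributes count as drive-side indicators is an assumption made for this illustration, not an official list, and vendors differ in which attributes they report.

```python
# Assumed grouping: CRC errors suggest the link (cable/port); the others
# suggest problems on the disk surface itself.
LINK_ATTRIBUTES = {"UDMA_CRC_Error_Count"}
DRIVE_ATTRIBUTES = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}


def classify_smart_warnings(smartctl_output):
    """Return nonzero raw values grouped as link-side or drive-side."""
    result = {"link": {}, "drive": {}}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have ten columns: ID, name, ..., raw value.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if not raw.isdigit() or int(raw) == 0:
            continue
        if name in LINK_ATTRIBUTES:
            result["link"][name] = int(raw)
        elif name in DRIVE_ATTRIBUTES:
            result["drive"][name] = int(raw)
    return result
```

An empty result does not prove a drive is healthy. It only means none of the watched counters were nonzero, so posting the full SMART output for review is still the safer route.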
 