Swapped disk without offlining old one... what now?

rudydelorenzo · Jul 26, 2023

Hi all,

I had a misbehaving disk in my array (3-disk RAIDz1) so I set about replacing it. However, I forgot to offline the old disk. So I just shut down the server, pulled it, put in the new disk, then used the GUI replace.

Now I'm getting hundreds of thousands of checksum errors, but the data seems to be okay? At least the files I tested.

What could be the solution?

Thanks!

sretalla · Jul 27, 2023

Checksum errors usually indicate cabling issues.

Could you have bumped or somehow unseated a cable on one or more of your disks in replacing the disk?

joeschmuck · Jul 27, 2023

Did you finish resilvering? Please post the output of zpool status within code brackets.
A cabling issue can also be indicated by a drive accumulating UDMA_CRC_Errors in the SMART data.
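A minimal sketch of that SMART check, assuming smartmontools is installed and the drive reports the attribute under smartctl's usual name, UDMA_CRC_Error_Count (some vendors label it differently). The function name `smart_key_attrs` and the device `/dev/sda` are placeholders, not something from this thread:

```shell
#!/bin/sh
# Reduce `smartctl -A` output (read on stdin) to a few attributes.
# UDMA_CRC_Error_Count counts link/cable transfer errors; the other three
# are commonly watched as signs that the disk itself is degrading.
smart_key_attrs() {
    grep -E 'UDMA_CRC_Error_Count|Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'
}

# Usage (as root, once per disk; /dev/sda is a placeholder):
#   smartctl -A /dev/sda | smart_key_attrs
```

A raw CRC count that keeps climbing between checks points at the cable or port; a non-zero count that stays put may just be history from an earlier bad cable.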

rudydelorenzo · Jul 27, 2023

sretalla said:
Checksum errors usually indicate cabling issues.

Could you have bumped or somehow unseated a cable on one or more of your disks in replacing the disk?

The way my disks are installed I had to unplug all of them when I replaced the one disk. Plugged them back in. I guess it's possible all the cables died? I'll replace them regardless since I've already had a SATA cable die in that machine.

rudydelorenzo · Jul 27, 2023

joeschmuck said:
Did you finish resilvering? Please post the output of zpool status within code brackets.
A cabling issue can also be indicated by a drive accumulating UDMA_CRC_Errors in the SMART data.

I finished a resilver and then a scrub. I've since rebooted the machine, so the checksum errors don't show anymore (there used to be about 104k), but here is the output of zpool status:

Code:

  pool: delorenzstore
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 352K in 13:15:27 with 36652 errors on Tue Jul 25 22:28:12 2023
config:

        NAME                                      STATE     READ WRITE CKSUM
        delorenzstore                             ONLINE       0     0     0
          raidz1-0                                ONLINE       0     0     0
            73b60159-9659-4d3d-8c06-fe77e2c22246  ONLINE       0     0     0
            54eef677-8607-4e69-a74e-e8dd71e391c8  ONLINE       0     0     0
            1db1b278-dd43-4dd3-b692-4d73dfd498c6  ONLINE       0     0     0

errors: 36652 data errors, use '-v' for a list

I'll run smart tests right now

sretalla · Jul 27, 2023

rudydelorenzo said:
errors: 36652 data errors, use '-v' for a list

You still have a big problem there...

run zpool status -v to find out which files are corrupt
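With tens of thousands of entries the list is hard to read by eye. A rough sketch for summarizing it, assuming the usual OpenZFS layout where the entries follow an "errors: Permanent errors have been detected in the following files:" line. The helper name `summarize_pool_errors` is made up for illustration:

```shell
#!/bin/sh
# Read `zpool status -v` output on stdin and count the permanent-error
# entries, split into ones shown as a path and ones shown only as
# <0x...> object IDs.
summarize_pool_errors() {
    awk '
        /^errors: Permanent errors/ { in_list = 1; next }
        in_list && NF > 0 {
            if ($0 ~ /<0x[0-9a-fA-F]+>/) unresolved++; else named++
        }
        END { printf "named=%d unresolved=%d\n", named, unresolved }
    '
}

# Usage (pool name taken from the output posted above):
#   zpool status -v delorenzstore | summarize_pool_errors
```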

rudydelorenzo · Jul 27, 2023

sretalla said:
You still have a big problem there...

run zpool status -v to find out which files are corrupt

A bunch of the files are just listed as hex codes... the only ones with a recognizable data path are system files I don't care about. I don't have a backup (I know that's bad), but I do have snapshots turned on and running frequently. Could I restore from a snapshot?

sretalla · Jul 27, 2023

rudydelorenzo said:
A bunch of the files are just listed as hex codes

Those are metadata and will represent a whole bunch of your files in the pool.

rudydelorenzo said:
I don't have a backup (I know that's bad), but I do have snapshots turned on and running frequently. Could I restore from a snapshot?

Snapshots aren't a backup... a pool checkpoint might be something between those things, but isn't a backup either.

It's unlikely your snapshots will be accessible (certainly not in full) as they live in your pool also, which is damaged.

joeschmuck · Jul 27, 2023

My advice is to make a backup of your files before doing anything more if you haven't already done so. With some luck, the corrupt files are not important (cross your fingers).

rudydelorenzo · Jul 28, 2023

joeschmuck said:
My advice is to make a backup of your files before doing anything more if you haven't already done so. With some luck, the corrupt files are not important (cross your fingers).

Crossing them big time... I've got my hands on a 14TB external drive and hope to start offloading everything tonight.

I'd still like to understand what caused all these issues. Was it the simple fact that I didn't offline the drive before pulling it? Seems crazy when the docs just say "if you don't offline, resilvering will take longer". Maybe it was running a scrub with faulty (?) cables?

winnielinnie · Jul 28, 2023

Those errors could be left over from a previous scrub / resilver, during which cabling, ports, and/or the HBA could have triggered a massive number of "checksum" errors.

Usually the procedure is to "zpool clear" followed by a full scrub. Then you can safely rule it out.
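As a sketch, that sequence could look like the following. It assumes OpenZFS 2.0 or newer (for `zpool wait`) and root privileges; the wrapper name `recheck_pool` is only for illustration:

```shell
#!/bin/sh
# Clear the old error counters, run a full scrub, wait for it to finish,
# then print the fresh status. Stops at the first command that fails.
recheck_pool() {
    pool="$1"
    zpool clear "$pool" &&            # reset the READ/WRITE/CKSUM counters
    zpool scrub "$pool" &&            # start the scrub (returns immediately)
    zpool wait -t scrub "$pool" &&    # block until the scrub completes
    zpool status -v "$pool"           # errors shown now are from this scrub
}

# Usage (pool name from this thread; the last scrub took over 13 hours):
#   recheck_pool delorenzstore
```

If the counters stay at zero after the new scrub, the earlier checksum errors were most likely transient. If they come back, the cause is still present.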

joeschmuck · Jul 28, 2023

rudydelorenzo said:
Maybe it was running a scrub with faulty (?) cables?

Generally, a faulty SATA cable will generate UDMA_CRC_Errors on your hard drives, so I would have to say that the CKSUM errors are due to your drive(s) failing. Some people try to stretch the life of a hard drive by ignoring early warnings; I see that too often. They tell themselves, "I have a RAIDZ1/Z2, so I'm safe," but in reality, if another drive is also having issues, data loss is a real possibility. If your drives are old, consider replacing them before they fail completely. Also examine the SMART data on each drive for signs of failure. If you would like us to analyze the data, post it here and we will let you know whether all is good or you have a problem.
