Do I need to rebuild my pool?

Itay1778

Patron
Joined
Jan 29, 2018
Messages
269
Hi, I need the help/advice of this great community again.
After the crashes I had with TrueNAS (https://www.truenas.com/community/threads/sos-kernel-panic-help.104524),
I decided (as was recommended to me) to upgrade the hardware and reinstall TrueNAS on the new hardware. Everything works great and is very stable!
Except for one problem that I already knew about...
During all the crashes on my previous TrueNAS system, I also started getting file corruption, mainly when I ran a scrub.
It was scary to see, but at the time I was sure it was caused by the crashes. Luckily I had a backup, so I just restored the affected files from it and that was that. Mostly it was a single file, or ZFS managed to repair the problem by itself without any file corruption, in which case I would only get a CKSUM error and the message "Applications are unaffected".

When I rebuilt my TrueNAS system and reinstalled it, I simply moved the same drives, and the existing pool on them, over to the new hardware.

Every time I run a scrub I get these errors. Sometimes the message says "Applications are unaffected", and sometimes there really is file corruption and I have to restore the affected files from a backup.

Now, what should I do? In my previous thread about the crashes, I was told that I would have to rebuild the pool (see jgreco's reply).

I'm asking to make sure this is really what I need to do to solve it and that there is no other way, before I buy a drive to hold the data while I rebuild the pool.

And if I really do need to rebuild the pool, what is the best way to move everything to another pool in TrueNAS? Should I simply take a snapshot of the entire existing pool and then use a replication task to perform the transfer, or is there a better and safer way?
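
For reference, the snapshot-and-replicate idea above comes down to a recursive snapshot followed by a replication stream into the new pool. The Python sketch below only builds and prints the commands (a dry run) and never touches a pool. The destination pool name NEWPOOL and the snapshot name "migrate" are made-up placeholders, and the helper name is hypothetical. `zfs snapshot -r`, `zfs send -R` and `zfs receive -F -u` are standard OpenZFS options, but they should be checked against the installed version before anything is run.

```python
import shlex


def migration_commands(src_pool, dst_pool, snap_name="migrate"):
    """Build (but do not run) the commands for a full-pool copy.

    src_pool, dst_pool and snap_name are placeholders chosen by the caller.
    """
    snapshot = f"{src_pool}@{snap_name}"
    # -r: snapshot the pool's root dataset and every child dataset
    snap_cmd = ["zfs", "snapshot", "-r", snapshot]
    # -R: send the snapshot with all descendants and their properties
    send_cmd = ["zfs", "send", "-R", snapshot]
    # -F: roll the target back if needed, -u: do not mount what is received
    recv_cmd = ["zfs", "receive", "-F", "-u", dst_pool]
    return [
        shlex.join(snap_cmd),
        f"{shlex.join(send_cmd)} | {shlex.join(recv_cmd)}",
    ]


if __name__ == "__main__":
    for cmd in migration_commands("NAS", "NEWPOOL"):
        print(cmd)
```

As far as I know, a send can abort when it reaches a block with an unrecoverable error, so any file listed by zpool status -v would probably need to be restored or deleted before the transfer.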

I hope I was clear and didn't confuse you with what is going on here.

See replies 5 and 6

Thanks in advance, waiting for your reply!
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
I can't advise on whether or not to rebuild your pool.


However, ZFS was specifically designed to have zero data loss on unexpected power-offs (aka crashes). The only data you can lose is data in flight (just like with any other file system). When a crash occurs during writes, either the full set of data was written and is available afterwards, or none of the in-flight data is available.

There are exceptions to this, all caused by bad hardware:
  • A storage device lies about flushing its write cache
  • A drive re-orders writes
  • A hardware RAID controller with a write cache is in use
  • Or, potentially, non-ECC RAM with errors
When Sun Microsystems designed and tested ZFS, they did not anticipate the massive number of users on home and consumer hardware. Thus, those exceptions generally don't apply to actual server-grade hardware designed for NAS use.
 

Itay1778

Does anyone have any advice for me on what to do? Yes, No?
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
We don't actually have enough information to say.

You shouldn't really be seeing file corruption with ZFS. It could be detecting periodic errors with your disks, but if you have sufficient redundancy that shouldn't result in corruption.
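
To illustrate why redundancy normally prevents corruption, here is a toy Python model, loosely inspired by single-parity RAIDZ but not how ZFS is actually implemented. It assumes a stripe of two data chunks plus one XOR parity chunk, and a checksum stored separately from the data. All function names are made up for this sketch.

```python
import hashlib


def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length chunks."""
    return bytes(x ^ y for x, y in zip(a, b))


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def read_block(d0: bytes, d1: bytes, parity: bytes, expected: str):
    """Return (data, status) for a toy 2-data + 1-parity stripe.

    `expected` is the checksum of the original d0 + d1, kept apart
    from the chunks themselves.
    """
    candidates = [
        (d0, d1, "ok"),                      # trust both data chunks
        (xor(d1, parity), d1, "repaired"),   # assume d0 is bad, rebuild it
        (d0, xor(d0, parity), "repaired"),   # assume d1 is bad, rebuild it
    ]
    for a, b, status in candidates:
        if checksum(a + b) == expected:
            return a + b, status
    # More chunks are damaged than one parity chunk can cover.
    return None, "unrecoverable"


if __name__ == "__main__":
    d0, d1 = b"AAAA", b"BBBB"
    parity = xor(d0, d1)
    expected = checksum(d0 + d1)
    print(read_block(d0, d1, parity, expected)[1])
    print(read_block(b"AXAA", d1, parity, expected)[1])
    print(read_block(b"AXAA", b"BXBB", parity, expected)[1])
```

In this model a single bad chunk is always rebuilt, which is the "Applications are unaffected" case. Data is only lost when two chunks of the same stripe are bad at once, or when the data was already wrong before the checksum was computed.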

So the question is, what is your pool layout? And how do you know you have file corruption?

I.e., post the output of "zpool status <pool name>"
 

Itay1778

Stux said: "what is your pool layout?"
RAIDZ1, 3 x 3TB WD Red (CMR)
Stux said: "how do you know you have file corruption?"
When I have file corruption that ZFS was unable to fix by itself, the first thing I get is a notification: "Pool NAS state is ONLINE: One or more devices has experienced an error resulting in data corruption. Applications may be affected." When ZFS detects an error, or a scrub finds one, and it is fixed automatically, I get the same kind of notification, but it ends with "Applications are unaffected". That is how I know whether I have data corruption or not.

Then I log in over SSH and run zpool status -v. In the errors section I see the name of the affected file and its path (only if the notification ends with "Applications may be affected").



Stux said: "post a copy of your "zpool status <pool name>""
Code:
  pool: NAS
 state: ONLINE
  scan: scrub repaired 24K in 07:54:48 with 0 errors on Sun Nov 27 10:56:27 2022
config:

    NAME                                            STATE     READ WRITE CKSUM
    NAS                                             ONLINE       0     0     0
      raidz1-0                                      ONLINE       0     0     0
        gptid/a79fe0a8-3c43-11e9-b8f1-e069952bbf5c  ONLINE       0     0     0
        gptid/a87765fc-3c43-11e9-b8f1-e069952bbf5c  ONLINE       0     0     0
        gptid/a9463a8e-3c43-11e9-b8f1-e069952bbf5c  ONLINE       0     0     0

errors: No known data errors


As you can see, at the moment everything is fine, but I'm almost sure that if I run a scrub now it will find something. The question is whether ZFS will be able to fix it by itself or whether I will have to restore the files from a backup...
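
The check I do by eye on the output above (the READ/WRITE/CKSUM counters, the scrub line and the errors line) could also be scripted. Here is a small Python sketch that parses text in exactly the format pasted above. The function names are my own, and the regular expressions assume this layout, which may differ between OpenZFS versions.

```python
import re

# Sample text copied from the zpool status output above.
STATUS = """\
  pool: NAS
 state: ONLINE
  scan: scrub repaired 24K in 07:54:48 with 0 errors on Sun Nov 27 10:56:27 2022
config:

    NAME                                            STATE     READ WRITE CKSUM
    NAS                                             ONLINE       0     0     0
      raidz1-0                                      ONLINE       0     0     0
        gptid/a79fe0a8-3c43-11e9-b8f1-e069952bbf5c  ONLINE       0     0     0
        gptid/a87765fc-3c43-11e9-b8f1-e069952bbf5c  ONLINE       0     0     0
        gptid/a9463a8e-3c43-11e9-b8f1-e069952bbf5c  ONLINE       0     0     0

errors: No known data errors
"""

# A config row: name, state, then three plain integer counters.
DEVICE_ROW = re.compile(r"^\s*(\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")
SCRUB_LINE = re.compile(r"scrub repaired (\S+) in \S+ with (\d+) errors")


def summarize(status_text):
    """Extract scrub result, errors line and per-device counters."""
    summary = {"devices": {}, "repaired": None,
               "scrub_errors": None, "errors": None}
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("errors:"):
            summary["errors"] = stripped.split(":", 1)[1].strip()
            continue
        scrub = SCRUB_LINE.search(stripped)
        if scrub:
            summary["repaired"] = scrub.group(1)
            summary["scrub_errors"] = int(scrub.group(2))
            continue
        row = DEVICE_ROW.match(line)
        if row:
            name, state, read, write, cksum = row.groups()
            summary["devices"][name] = {
                "state": state, "read": int(read),
                "write": int(write), "cksum": int(cksum),
            }
    return summary


def devices_with_errors(summary):
    """Names of all rows with a non-zero READ, WRITE or CKSUM counter."""
    return [name for name, dev in summary["devices"].items()
            if dev["read"] or dev["write"] or dev["cksum"]]


if __name__ == "__main__":
    result = summarize(STATUS)
    print(result["repaired"], result["scrub_errors"], result["errors"])
    print(devices_with_errors(result))
```

Note that the scrub line is worth reading even when all counters are zero: in the output above the scrub still repaired 24K.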


And it's not a problem with the disks either. All their SMART data is completely OK.
 

Itay1778

Update: I did a manual scrub today to check what's going on. As you can see in the attached picture there is a file with errors.
[Attached image: File Corruption.png]

And as I described earlier in this thread, it wasn't the first and I'm pretty sure it won't be the last.
What should I do? How do I solve this?
 

Itay1778

Another scrub and a new file with errors. The file itself has not been on the pool for a long time and is relatively new.
Asking again: what should I do? How do I solve these problems?
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Stux said: "You shouldn't really be seeing file corruption with ZFS. It could be detecting periodic errors with your disks, but if you have sufficient redundancy that shouldn't result in corruption."
This still holds true and it is kinda baffling. There is something that keeps corrupting your pool, and usually it's hardware.

You could destroy the pool and recreate it.
You should check the power and data cables of your devices.
Are you using any tunables or deduplication?
 

Itay1778

Davvo said: "This still holds true and it is kinda baffling. There is something that keeps corrupting your pool, and usually it's hardware.

You could destroy the pool and recreate it.
You should check the power and data cables of your devices.
Are you using any tunables or deduplication?"
The chance that it's hardware is very low. Because of all these problems and others, I upgraded my TrueNAS system.
I upgraded the following:
  • Motherboard
  • Processor
  • RAM (with ECC)
I also bought a new power supply, checked the SATA cables and replaced them with new ones, and reinstalled TrueNAS as part of this upgrade.
It is not a problem with the drives: their SMART data is excellent and shows no problems, and all their SMART tests pass.
Dedup is off; that is the default and I didn't turn it on.
And yes, I have tunables, but they should not cause this; see the picture.
[Attached image: Screenshot 2022-12-26 at 23-18-03 TrueNAS - 192.168.2.1.png]

That's why I'm asking if recreating the pool is the only way to fix it or if there is another way to solve these problems.
 

Itay1778

"Which kind of files get corrupted?"
There is nothing specific; it could be anything. I have already had different types of files corrupted.
 

Davvo

Itay1778 said: "There is nothing specific, it could be anything. I had different types of files already corrupted."
I can't think of anything other than nuking the pool and rebuilding it, sorry.

Edit: setting copies to more than 1 would be treating the symptoms and not the root cause, and even that wouldn't be a sure fix.
 