Unhealthy pool, no errors at all

Migsi

Dabbler
Joined
Mar 3, 2021
Messages
40
Hello Community,

today I'm reaching out to ask for help with some strange behavior I recently found on one of my TrueNAS systems. It consists of consumer-grade hardware: an Athlon 3000G, 16GB RAM (non-ECC), and an ASRock A320M-HDV R4.0 motherboard. Its single pool consists of two 500GB NVMe SSDs (one in the native slot on the board, the other in a generic PCIe-to-M.2 adapter card in the x16 slot, which runs at x4 with this CPU) plus three 2TB Seagate IronWolf Red HDDs. The OS (TrueNAS-12.0-U4) runs on its own single 128GB SATA M.2 drive (more details can be found in the screenshots below), and there is also an HP NC523SFP card in the x1 slot via a riser cable. The drives were bought new this February. All in all it's nothing too special; its sole purpose is to serve as a home NAS at my parents' place and "back up" most of their data (which stays in sync with their PCs).

Long story short, two weeks ago I received a critical warning about data corruption in the pool. I checked the system immediately and looked at the pool status, checksums, and SMART data. I checked literally everything that could help me figure out what was going on, but found nothing. Apart from that alert, the system ran fine (and is still running fine), except for an additional warning that popped up yesterday about a core file that was dumped. I still have no clue what happened and I'm unsure what to do. The only possibility I see is that a recent unscheduled reboot broke something, but that had already happened a week before the alert, so I doubt it was the cause. Should I wait for tomorrow's scheduled scrub of the pool and, if it goes fine, clear the pool's error state and carry on as if nothing happened, or are there any other checks I could or should run?

1626358535413.png

1626358816388.png

1626358843054.png


If you need any more information to help me out, I'll gladly provide it ASAP.

Best regards

EDIT: I just ran zpool status -v <pool> for the first time and that cleared things up a little. Apparently some files got corrupted, which might indeed have been caused by the unscheduled reboots. I'd appreciate any confirmation though.
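
For anyone who wants to pull the affected paths out of that output, here is a minimal Python sketch. It assumes the file list follows the line "errors: Permanent errors have been detected in the following files:", which is how zpool status -v prints it as far as I know. The pool name "tank", the sample paths, and the helper names parse_permanent_errors and damaged_files are made up for illustration and are not taken from the attached zpool_status.txt.

```python
import subprocess

# Marker line that zpool status -v prints before the list of damaged files.
MARKER = "Permanent errors have been detected in the following files:"


def parse_permanent_errors(status_text):
    """Return the paths listed after the 'Permanent errors' marker."""
    paths = []
    in_list = False
    for line in status_text.splitlines():
        if MARKER in line:
            in_list = True
            continue
        if in_list:
            stripped = line.strip()
            if stripped:
                paths.append(stripped)
            elif paths:
                # A blank line after at least one path ends the list.
                break
    return paths


def damaged_files(pool):
    """Run zpool status -v for the given pool and parse its output."""
    result = subprocess.run(
        ["zpool", "status", "-v", pool],
        capture_output=True, text=True, check=True,
    )
    return parse_permanent_errors(result.stdout)


# Illustrative sample only (hypothetical pool and paths).
SAMPLE = """\
  pool: tank
 state: ONLINE
errors: Permanent errors have been detected in the following files:

        /mnt/tank/photos/img_0001.jpg
        /mnt/tank/docs/tax 2020.pdf
"""

if __name__ == "__main__":
    for path in parse_permanent_errors(SAMPLE):
        print(path)
```

On the NAS itself, damaged_files("<pool>") would return the same kind of list from the live output, which makes it easier to see which files need to be restored from the PCs.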
 

Attachments

  • smartctl.txt (31.2 KB)
  • zpool_status.txt (1.4 KB)

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
ZFS was specifically designed from the very beginning to survive a sudden power loss (or reboot) WITHOUT DATA LOSS.

That said, any data in flight could be lost, but it simply would not show up, and it would not be listed as an error. Further, while ZFS itself can survive, data loss can still occur if the hardware lies to ZFS (for example, a hardware RAID controller re-ordering writes and using its write cache).

It would be helpful for you to provide the following (per forum rules / suggestions):
- The complete hardware list (for example, you have only one M.2 slot, so how are the other 2 wired up?)
- The version of TrueNAS
 

QonoS

Explorer
Joined
Apr 1, 2021
Messages
87
Well, a lot could have happened to cause that error. You should provide more data to narrow it down.

System logs from around the time the error happened would be valuable.
 

Migsi

Arwen said:
ZFS was specifically designed from the very beginning to survive a sudden power loss (or reboot) WITHOUT DATA LOSS.

That said, any data in flight could be lost, but it simply would not show up, and it would not be listed as an error. Further, while ZFS itself can survive, data loss can still occur if the hardware lies to ZFS (for example, a hardware RAID controller re-ordering writes and using its write cache).

It would be helpful for you to provide the following (per forum rules / suggestions):
- The complete hardware list (for example, you have only one M.2 slot, so how are the other 2 wired up?)
- The version of TrueNAS

Good to have it confirmed that ZFS itself is designed to survive power losses; that makes it unlikely that an outage caused the issue. I never touched the native RAID functions of this board, knowing full well that they would interfere with ZFS. Still, I can't say for sure that the controller isn't doing something shady regardless, though I very much hope that's not the case.

Pardon me for leaving out those two important pieces of information; I must have overlooked that they were missing before submitting the post. I've edited the post and added them.

QonoS said:
Well, a lot could have happened to cause that error. You should provide more data to narrow it down.

System logs from around the time the error happened would be valuable.

I'll add logs as soon as I have time to access the system again.
 