Checksum errors, need help diagnosing.

dairou

Cadet
Joined
Aug 22, 2022
Messages
5
Hi guys, I've run out of ideas; maybe both my drives are bad (they are new), but I'd like your opinion. I'm using this as a home NAS and media server with no real critical data, hence the use of consumer hardware that I mostly got from upgrading my main PCs.

I'm getting a lot of checksum errors on my actual NAS drives. I first tried ZFS on Proxmox and got lots of errors (sorry, I didn't save logs of that), so then I tried TrueNAS as a VM, where I'm still getting them.

I tried swapping the data and power cables between the pool that has no issues and the one that does, and the behavior persists.

Ran memtest for 11 hours with 0 errors.

[Attachment: memtest.jpg]



What else can I try?
Specs:
  • Proxmox Virtual Environment 7.2-7 (bare-metal)
  • TrueNAS-13.0-U1.1 (VM under proxmox)
  • CPU: Ryzen 5 2600X
  • GPU: Gigabyte GeForce 210 1 GB (because there are no integrated graphics on the CPU)
  • Memory: Corsair Vengeance LPX 32GB DDR4 3600MHz C18 at 2933 MT/s (CMK32GX4M2D3600C18)
  • Motherboard: ASRock B450M Steel Legend BIOS version p3.60
  • Boot drive: Kingston A2000 250 GB M.2-2280 NVME

Disks:

ZFS on Proxmox (cheapo 2.5" drives):
Code:
NAME                                    STATE     READ WRITE CKSUM
local-madmen                            ONLINE       0     0     0
  mirror-0                              ONLINE       0     0     0
    ata-ST1000LM035-1RK172_(redacted)   ONLINE       0     0     0
    ata-TOSHIBA_MQ04ABF100_(redacted)   ONLINE       0     0     0


ZFS on TrueNAS using Seagate IronWolf 4 TB drives (ST4000VN008-2DR166), passed through from Proxmox:
Code:
NAME                                            STATE     READ WRITE CKSUM
sidekick                                        ONLINE       0     0     0
  mirror-0                                      ONLINE       0     0     0
    gptid/685769b9-1e62-11ed-b64f-9f55033a5f72  ONLINE       0     0 1.75K
    gptid/6865c552-1e62-11ed-b64f-9f55033a5f72  ONLINE       0     0 1.75K

errors: Permanent errors have been detected in the following files:

        /var/db/system/rrd-81963fc7279b4cf49c43e5a8cbe36cdb/localhost/df-mnt-sidekick-iocage/df_complex-reserved.rrd
        /var/db/system/rrd-81963fc7279b4cf49c43e5a8cbe36cdb/localhost/load/load.rrd
        /var/db/system/rrd-81963fc7279b4cf49c43e5a8cbe36cdb/localhost/memory/memory-active.rrd
        /var/db/system/rrd-81963fc7279b4cf49c43e5a8cbe36cdb/localhost/memory/memory-cache.rrd
        /var/db/system/rrd-81963fc7279b4cf49c43e5a8cbe36cdb/localhost/memory/memory-free.rrd
        /var/db/system/rrd-81963fc7279b4cf49c43e5a8cbe36cdb/localhost/memory/memory-inactive.rrd
        /var/db/system/rrd-81963fc7279b4cf49c43e5a8cbe36cdb/localhost/memory/memory-laundry.rrd
        /var/db/system/rrd-81963fc7279b4cf49c43e5a8cbe36cdb/localhost/memory/memory-wired.rrd
        /var/db/system/rrd-81963fc7279b4cf49c43e5a8cbe36cdb/localhost/df-mnt-sidekick-iocage/df_complex-used.rrd
        /var/db/system/rrd-81963fc7279b4cf49c43e5a8cbe36cdb/localhost/df-mnt-sidekick-iocage-images/df_complex-reserved.rrd
        /var/db/system/rrd-81963fc7279b4cf49c43e5a8cbe36cdb/localhost/df-mnt-sidekick-iocage-log/df_complex-free.rrd
        /var/db/system/rrd-81963fc7279b4cf49c43e5a8cbe36cdb/localhost/df-mnt-sidekick-vault/df_complex-used.rrd
        /var/db/system/rrd-81963fc7279b4cf49c43e5a8cbe36cdb/localhost/df-mnt-sidekick-vault/df_complex-reserved.rrd
        /var/db/system/rrd-81963fc7279b4cf49c43e5a8cbe36cdb/localhost/df-root/df_complex-free.rrd
        /var/db/system/rrd-81963fc7279b4cf49c43e5a8cbe36cdb/localhost/df-root/df_complex-reserved.rrd
        /var/db/system/rrd-81963fc7279b4cf49c43e5a8cbe36cdb/localhost/df-root/df_complex-used.rrd
        /var/db/system/rrd-81963fc7279b4cf49c43e5a8cbe36cdb/localhost/df-mnt-sidekick/df_complex-free.rrd
        /var/db/system/rrd-81963fc7279b4cf49c43e5a8cbe36cdb/localhost/df-mnt-sidekick/df_complex-reserved.rrd
        /var/db/system/rrd-81963fc7279b4cf49c43e5a8cbe36cdb/localhost/df-mnt-sidekick/df_complex-used.rrd 
        /var/db/system/rrd-81963fc7279b4cf49c43e5a8cbe36cdb/localhost/df-mnt-sidekick-iocage-log/df_complex-reserved.rrd
 

Nick2253

Wizard
Joined
Apr 21, 2014
Messages
1,633
Are you passing through the controller, or the drives individually? If you're passing through the drives, then I'd wager that this is your issue. ZFS generally wants direct hardware control of drives, and not just logical control of the data.

For memory testing, I noticed that you're using Memtest86+. I would strongly recommend using Memtest86 (no +). The plus version was much better for a while, but the "original" has largely surpassed it over the last 4-5 years, and I've found that it's much more reliable at finding memory issues.

Assuming that you have good backups for your data, I would recommend destructive testing using badblocks. This will help you confirm what ZFS is seeing.
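
To illustrate, a destructive write-mode run could look something like this (the device name is just an example; check it against your disk list before running, because this wipes the entire drive):
Code:
# DESTRUCTIVE: overwrites the whole disk with test patterns.
# -w = write-mode test, -s = show progress, -v = verbose.
# -b 4096 uses a 4 KiB block size, which also helps avoid badblocks'
# block-count limit on large (4 TB) drives.
badblocks -b 4096 -wsv /dev/sdX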
 

dairou

Cadet
Joined
Aug 22, 2022
Messages
5
Hi Nick2253, thanks for the insight.

Are you passing through the controller, or the drives individually?
I'm not sure; I followed this procedure: https://pve.proxmox.com/wiki/Passthrough_Physical_Disk_to_Virtual_Machine_(VM)
If that's the issue, how would I go about doing it correctly? Can I still give Proxmox 2 disks and the TrueNAS VM 2 disks? Also... I was getting errors even when I created the ZFS pool directly on Proxmox (not in the TrueNAS VM).

For memory testing, I noticed that you're using Memtest86+
Huh... Thanks, I wasn't aware of the existence of different versions.

I would recommend destructive testing using badblocks
Will this work? https://calomel.org/badblocks_wipe.html
 

somethingweird

Contributor
Joined
Jan 27, 2022
Messages
183
Also, check whether those drives are CMR or SMR. SMR drives and ZFS don't like each other.
 

somethingweird

Contributor
Joined
Jan 27, 2022
Messages
183
Ah... I wasn't thinking of the IronWolf, I was thinking about the cheapo drives. But if the offending drives are the IronWolfs, then it's something else.
 

Nick2253

Wizard
Joined
Apr 21, 2014
Messages
1,633
I'm not sure; I followed this procedure: https://pve.proxmox.com/wiki/Passthrough_Physical_Disk_to_Virtual_Machine_(VM)
If that's the issue, how would I go about doing it correctly? Can I still give Proxmox 2 disks and the TrueNAS VM 2 disks? Also... I was getting errors even when I created the ZFS pool directly on Proxmox (not in the TrueNAS VM).
So, let me dive into some more detail so that you can understand what's happening here.

First off, ZFS is pretty demanding when it comes to managing drives. Many configurations "work", but only a few configurations "work well", in the sense that they provide the full data security elements of ZFS.

A common misconfiguration when virtualizing is only giving ZFS the drives. This comes in one of two forms: just access to the logical data, or access to the disk as a device attached to a virtual controller. What you really want to have is the physical controller itself passed through to the ZFS-using VM. This has the downside of preventing access to other disks on the controller from the hypervisor perspective, but is usually a relatively low-cost problem considering the cost of controllers.

As such, I would not recommend the configuration you are using. That isn't to say it won't "work", just that it's not recommended.
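
For reference, passing the whole controller through is a PCI(e) passthrough rather than a per-disk passthrough. A rough sketch from the Proxmox shell (the PCI address and VM ID are just examples, and on consumer boards the onboard SATA controller often doesn't sit in its own IOMMU group, which can make this impractical):
Code:
# Find the PCI address of the SATA controller
lspci -nn | grep -i sata

# Attach that PCI device to the TrueNAS VM (VM ID 100 is an example)
qm set 100 -hostpci0 0000:01:00.1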

However, as you point out, you were getting these same problems when using ZFS directly on Proxmox. This makes me suspicious that the HDD controller on your motherboard is actually doing something funky with the data, which is leading to the ZFS errors. Going through the badblocks test should help you confirm whether the controller itself is the issue.

All-in-all, though, I think we're actually asking the wrong question. Instead, I'm curious: why are you using TrueNAS or ZFS in the first place? With the kind of hardware you're using, and the less-than-recommended disk configuration, it seems like data integrity is not a mission-critical problem for you.
 

dairou

Cadet
Joined
Aug 22, 2022
Messages
5

Thank you so much for taking the time to write up the details. I will go ahead and test with badblocks.

All-in-all, though, I think we're actually asking the wrong question. Instead, I'm curious: why are you using TrueNAS or ZFS in the first place? With the kind of hardware you're using, and the less-than-recommended disk configuration, it seems like data integrity is not a mission-critical problem for you.

Well yeah, right now it's not critical. I'm basically just experimenting, trying to learn by doing. I want to know just how far I can get with this hardware and how little I have to buy to get a "good enough" system, where I will eventually store relevant data. In any case, I don't intend to rely 100% on any one system.
 

Nick2253

Wizard
Joined
Apr 21, 2014
Messages
1,633
Well yeah, right now it's not critical. I'm basically just experimenting, trying to learn by doing. I want to know just how far I can get with this hardware and how little I have to buy to get a "good enough" system, where I will eventually store relevant data. In any case, I don't intend to rely 100% on any one system.
Gotcha. I definitely don't mind helping with "bad" hardware; I just want to make sure everyone understands the limitations of their setup :wink:.

At this point, I'd recommend:
  • Do the badblocks test. I'd expect the test to take a couple days.
  • If you get no errors, then that indicates a ZFS issue, not a controller/cable/disk issue.
  • From Proxmox, create the ZFS mirror.
  • Use a tool like dd to write a bunch of (preferably random) data (at least 1 TB) to the pool, and then read it back. If there are issues, you should see checksum errors (see the sketch below).
All together, this should help us narrow in on where the problem is.
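
For those last two steps, a rough sketch of what that could look like from the Proxmox shell (pool name, device names, and test size are placeholders; adjust them to your setup):
Code:
# Create the mirror from Proxmox (example device names)
zpool create testpool mirror /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2

# Write ~1 TB of random data, then read it back so ZFS re-verifies the checksums
dd if=/dev/urandom of=/testpool/testfile bs=1M count=1000000 status=progress
dd if=/testpool/testfile of=/dev/null bs=1M status=progress

# Check the error counters (a scrub forces a full re-read as well)
zpool scrub testpool
zpool status -v testpool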
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
I'm getting a lot of checksum errors on my actual NAS drives. I first tried ZFS on Proxmox and got lots of errors (sorry, I didn't save logs of that), so then I tried TrueNAS as a VM, where I'm still getting them.
It appears that most of the discussion ignored this part from the original post.

Generally speaking, if ZFS on Proxmox (bare metal) does not work properly, there is no way it will work with TrueNAS running virtualized on Proxmox. Also, the additional layer will only make it more difficult to diagnose things, but never provide additional information. So in terms of finding the root cause, you will need to go back to ZFS with Proxmox.

To have a clearly defined starting point, the cleanest approach would be to erase the disks and start from scratch. That would rule out the situation where corrupted data are still being reported even though the cause of the corruption is already gone; a faulty data cable would be a typical case here.
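
If it helps, one way to do that from the Proxmox shell might be the following (destructive, and the pool/device names are just examples):
Code:
# After exporting/destroying the pool (from the TrueNAS UI or with zpool destroy):
zpool labelclear -f /dev/sdb    # or the ZFS data partition, e.g. /dev/sdb1
wipefs -a /dev/sdb              # clears remaining partition/filesystem signatures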
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Can you show us the output of smartctl -a /dev/ada0 for the two disks with checksum errors? (Swap 0 for the correct number and run the command once per drive.)
 

dairou

Cadet
Joined
Aug 22, 2022
Messages
5
I'm going to update y'all.

I've been error-free for a couple of weeks now :grin:. One of two things fixed my problem (I did both at the same time, so I'm not sure which):
  • Updated my BIOS
  • Replaced my power supply (the original one was an off-brand unit that required SATA adapters; it felt like my most egregious hardware part)
Other than that, all my disks passed the destructive badblocks tests, but some still produced new R/W or CKSUM errors, so I feel like this is not really a great test.

Thanks for all your input.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Generally, improperly seated power connections are the root of those errors: check those.
 