ZFS Checksum Errors on New Drive During Resilvering

SeaWolfX · Sep 23, 2022

I recently replaced a disk in my ZFS mirrored vdev as one of the drives was showing signs of being unhealthy (my first time attempting a resilvering).

The four drives in my ZFS vdev are connected via a Chieftec hotswap backplane to a X11SCH-F motherboard.

I removed the bad drive, replaced it with a new identical one (Seagate Exos 78E Enterprise drive), run S.M.A.R.T short-, conveyance-, and long tests on it, updated the alias to point to the new drive and started the resilvering using zpool replace.

During the resilvering some checksum errors showed up in zpool status and they were also present after the resilvering was done.

I checked the SMART metrics which showed no concerning values. I then decided to do a badblocks test (I know, I should have done that before resilvering), but it finished without any errors.

Now I am not sure how to proceed. Could the checksum error I saw just be false errors due to the resilvering? Is this something I need to be concerned about and investigate further (if so what should I look for) or should I just resilver the drive again and ignore any checksum errors?

(Edit: Please let me know if this is not the correct sub for this issue and I'll move it)

Before Resilvering

During Resilvering

After Resilvering

Comparing SMART Metrics of Both Drives in Mirror

Badblock Test

K41z6Ef5eHXPI0TEHzvCdGN6DBETB2ekKswFc3zoy0szjTOCSs6jZPQtvA56cS9DPXueHsWpQnZ8dFG-EnrL6oWNd7ufr3eQBh5_RZAkgEBpjLNTsfqJdT732IuNnLAHbDrc_TVtRA7sz87dZtHXZ8cAa5zdRgMzoGS9meykSzelOzFS1i0dr63EplhHt97_N-wNfrm2NkoELFRH7Jt4BR8SHPqkGaORyFhaAEdc9nBkxd-YVLHSgl5fD3GWnamuZKOY_3whqAk919tMNzQx2qoTjLog0-e-kabwzyLCoDO4WYaX-GUG9w5dSoKC_aGzve_3TEGnC9rLat0Y5PiUZXHKaje544g2IbvcZhmoBbJE_9Qvby290k8XAemkx-DesD4WjflGRjcbspLmWk6Zu_PxWxfraltIsy1td_Oz14gw0IP6xlgWKz9yF9whSXy4yBlgl94auxnCFSm87CQ5rZmzoyv-pz48uzBm6arcmEITiuWYeAuCRQFDLFl5ZzKXMTifie9Dsk45FfW2TPe676vvoiengFBARtHgw-6LtM3WepwwXdRa0kXoc6Qt78KmYz_i88xD0YNGZyPTUQWBqjKclL4dFTMBN0ncGTTAXMoe-QIfFJcDgbA81gQS-pme33mKIbO29CbY8pf0vcJudMzmAK6P72GXmT_zxH0Tw_iwawk521kRy8rRe2QNJXpWCF6k7FulyzbQNcIClkXQbV-qCvapUp17E3Pwwlo7lMQWoDVCo_1ev_gfzQ22Bbs=w1920-h612-no

FnKAHt8X0Nog_DHrk8SqjAwmR9b0wzmh0gSOJwSUruF7UpYBYrDvkrSbeP3Be6Hn_EyePMT-op7xqaz7cUtnNUq2nAmBR48QRQJ3KlcX_CfFZTr3ADNo2PRpajS6FX_n7InW0hPpDn3ejdlgVCv7_440SMplfGOyH2yfZYg1J_11DdNsO23uT2JkcP1f3VQ_1LXNVBUM-IyvecDNEeEvPSaiIGvEd5IwK071nfRtxDkTRPRgNfLgXGvvO3zNqTWHOoY8ZmbhOqho4Chgzu4q4QCUywt8MMK6wWw8sx39Uaq1gw-MP6D7aIGHm1qkYyS4_n9txBi4AfSM7o__a7SuWzEhJ0_b8U3hw7RqkYwzdrYyMT2Lfgmp19T7phezaIJavDwTfwBvGVuC9gZBBnifxrJmVosbTfRVwiKMj6qmwh9AUca-nnMkJlaqr8oGiPOsZYQYDKGNRf8A5pTKEDuoMk5pHeBpwSXXRfHY81AhFI_19aFShZy3w-OY_-BedUo55EUhmDWXd_kbBtLOcxbyPACdCklJzPGaRf1zs5aekOikC_7zEHlNkw9QFBF8XqYzYVjTQ1w17GCNwsZzFEQZE5zsoWYxi-umd7pkbKLmuJLacCIFb_FXwpOaZ0ESBA-vUqpXx86eYX3vX9GHWDCJKbyRgxmf3-NuaZxJkmb3e4frgziZa-H9VQ0Fvis3IChS3NKlpuK_vLgIIv-rCEnkAf7h02Ud85_E3UkUwdH7eG7a8XlHHDl6oDZqW_3FTFU=w529-h938-no

Arwen · Sep 25, 2022

The "Load_Cycle_Count" for both old and new drives seems high. It's possible that the drives have some of the power saving options enabled that NAS specific drives don't have. Perhaps the re-silver paused at times, which allowed the drive to park it's heads. Then when ZFS resumed, it had a time out failure because of the time it took to re-load the heads on to the platters.

For hard disk drives that don't use head parking, the "Power_Cycle_Count" would be roughly the same as the "Load_Cycle_Count".

Other than that, I don't have a useful suggestion.

SeaWolfX · Sep 26, 2022

Arwen said:
The "Load_Cycle_Count" for both old and new drives seems high. It's possible that the drives have some of the power saving options enabled that NAS specific drives don't have.

I am not sure to be honest. Would that be enabled by default and how can I check this?

elvisimprsntr · Sep 26, 2022

I noticed your mobo has a RAID controller.

Re: https://www.supermicro.com/en/products/motherboard/X11SCH-F

Intel® C246 controller for 8 SATA3 (6 Gbps) ports; RAID 0,1,5,10

My understanding is RAID controllers are to be avoided. Read the following links.

What's all the noise about HBA's, and why can't I use a RAID controller?

This is relevant to FreeNAS and TrueNAS CORE. Some parts of it might also be relevant to Scale, but I don't really know how reliable the Linux drivers are. 1) An HBA is a Host Bus Adapter. This is a controller that allows SAS and SATA devices to be attached to, and communicate directly with...

www.truenas.com

ZFS - Wikipedia

en.wikipedia.org

Some forum members have been successful in getting Supermicro to send them firmware that will re-flash the on-board controller to IT mode.
Your other option might be to install a separate non-RAID HBA.

SeaWolfX · Sep 26, 2022

elvisimprsntr said:
I noticed your mobo has a RAID controller.

Re: https://www.supermicro.com/en/products/motherboard/X11SCH-F

My understanding is RAID controllers are to be avoided. Read the following links.

What's all the noise about HBA's, and why can't I use a RAID controller?

This is relevant to FreeNAS and TrueNAS CORE. Some parts of it might also be relevant to Scale, but I don't really know how reliable the Linux drivers are. 1) An HBA is a Host Bus Adapter. This is a controller that allows SAS and SATA devices to be attached to, and communicate directly with...

www.truenas.com

ZFS - Wikipedia

en.wikipedia.org

Some forum members have been successful in getting Supermicro to send them firmware that will re-flash the on-board controller to IT mode.
Your other option might be to install a separate non-RAID HBA.

Maybe I have misunderstood this, but I am not actually using any HW RAID feature on the card. I did a lot of research on this forum before selecting a motherboard and this one was very often recommended. Why would that be if there might be an issue with onboard RAID controller on this MB when using ZFS.

Arwen · Sep 26, 2022

SeaWolfX said:
I am not sure to be honest. Would that be enabled by default and how can I check this?

You don't specify which TrueNAS release, Core, (based on FreeBSD) or SCALE, (based on Linux). But I am guessing SCALE based on your disk naming convention.

Linux has a tool called hdparm which can show the current settings;

Code:

# hdparm -B -M /dev/sda

/dev/sda:
 APM_level      = 254
 acoustic      = 254 (128=quiet ... 254=fast)

This shows my 2.5" laptop drive's settings, which used to have quite increasing "Load_Cycle_Count". Now I have disabled both "features" and no more increasing "Load_Cycle_Count".

These can be changed with;

Code:

# hdparm -B 254 -M 254 /dev/sda

I don't know the TrueNAS Core / FreeBSD tool name or how to use it.

SeaWolfX · Sep 26, 2022

Arwen said:
You don't specify which TrueNAS release, Core, (based on FreeBSD) or SCALE, (based on Linux). But I am guessing SCALE based on your disk naming convention.

Linux has a tool called hdparm which can show the current settings;

Code:
# hdparm -B -M /dev/sda /dev/sda: APM_level = 254 acoustic = 254 (128=quiet ... 254=fast)

This shows my 2.5" laptop drive's settings, which used to have quite increasing "Load_Cycle_Count". Now I have disabled both "features" and no more increasing "Load_Cycle_Count".

These can be changed with;

Code:
# hdparm -B 254 -M 254 /dev/sda

I don't know the TrueNAS Core / FreeBSD tool name or how to use it.

You are correct, I'm on Linux.

Seems like APM is not supported on my drives though :/

Doing a bit of searching it seems I might be able to do something similar using the SeaChest tool. Not sure how to install it on Linux or what parameters to change though.

In any case; is 45K LLC such a high number? The drives have a rating of 600K and they have been up and running for about 1.5 years. At this rate they should be good at least for 20 years before they reach their limit.

Arwen · Sep 26, 2022

The point is that you have ZFS checksum errors. It's a guess on my part that these so called features are interfering with ZFS' reads, causing the checksum errors. So I suggested eliminating something and seeing if that improved the situation.

In my case, I was getting 1,000s of LCC, in far excess of the expected life of the disk. It was parking, but ZFS would wake it up. However, for me, before the change, it did not create any ZFS errors, checksum or otherwise. Thus, as I said, just a guess.

Desc	Model	OS	Size/Speed	Boot/Pool	Other
NAS-1	QNAP TS-453A	TrueNAS CORE	2x8GB	16GB SLC eUSB DOM	APACHE, LACP, RSYNC, SMB, TFTP, TM
- Hourly	Seagate ST2000VN00[04]		4x2TB	RZ2	SATA
- Daily	Google Drive		15GB		Offsite
NAS-2	QNAP TS-253A	TrueNAS CORE	2x8GB	16GB SLC eUSB DOM	LACP, RSYNC, SMB, TM, VM
- Daily	Seagate ST4000VN008		2x4TB	RZ1	SATA
- Weekly	Crucial X8		500GB	RZ0	USB
NAS-3	QNAP TS-453A	TrueNAS SCALE	2x8GB	16GB SLC eUSB DOM	LACP, RSYNC, SMB, TM
- Weekly	Seagate ST2000[DM,VN]00[46]		4x2TB	RZ2	SATA
- Monthly	Crucial X8		500GB	RZ0	USB
NAS-4	QNAP TS-253A	TrueNAS SCALE	2x8GB	16GB SLC eUSB DOM	LACP, RSYNC, SMB, TM, VM
- Testing	WD40EFRX		2x4TB	RZ1	SATA
VM-1	QNAP TS-253A		2x8GB	16GB SLC eUSB DOM	KVM, LXC
- PVE	Segate ST4000VN000	Proxmox VE	2x4TB	RZ1	SATA
FW	Protectli FW4C	pfSense CE	8GB	256GB TLC mSATA	DHCP/DNS, IDS/IPS, GPS/PPS, NTP, VPN
- WAN-1	Arris NVG599		375Mbps		ATT Fiber
~~- WAN-2~~	~~Netgear LB1120~~		~~150Mbps~~		~~SpeedTalk LTE~~
- GPS	Garmin 18X LVC				RS232, PPS
- LCD	Crystalfontz XES635	LCDproc			USB, NTP, UPS
- UPS	APC BX1500M		1500VA		USB, Master
SW	Linksys LGS326		52Gbps		LACP, VLAN
- AP-[123]	EnGenius EWS377APv3		3.6Gbps		WPA2/3
- SK-1	EnGenius SkyKeyIv1.1		1GB	4GB MLC eMMC	400GB mSD
NVR
- DB-1	Lorex B862AJD		8MP	32GB mSDXC	RTSP
- DB-[23]	Lorex B451AJD		4MP	32GB mSDXC	RTSP
- CAM-[12]	Lorex W461ASC		4MP	32GB mSDXC	RTSP

Important Announcement for the TrueNAS Community.

ZFS Checksum Errors on New Drive During Resilvering

SeaWolfX

Explorer

Arwen

MVP

SeaWolfX

Explorer

elvisimprsntr

Guru

What's all the noise about HBA's, and why can't I use a RAID controller?

ZFS - Wikipedia

SeaWolfX

Explorer

What's all the noise about HBA's, and why can't I use a RAID controller?

ZFS - Wikipedia

Arwen

MVP

SeaWolfX

Explorer

Arwen

MVP

Similar threads

Important Announcement for the TrueNAS Community.

ZFS Checksum Errors on New Drive During Resilvering

Explorer

MVP

Explorer

Guru

Explorer

MVP

Explorer

MVP

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "ZFS Checksum Errors on New Drive During Resilvering"

Similar threads