ZFS Checksum Errors on New Drive During Resilvering

SeaWolfX

Explorer
Joined
Mar 14, 2018
Messages
65
I recently replaced a disk in my ZFS mirrored vdev because one of the drives was showing signs of being unhealthy (this was my first time attempting a resilver).

The four drives in my ZFS vdev are connected via a Chieftec hot-swap backplane to a Supermicro X11SCH-F motherboard.

I removed the bad drive, replaced it with a new identical one (a Seagate Exos 7E8 enterprise drive), ran SMART short, conveyance, and long tests on it, updated the alias to point to the new drive, and started the resilver using zpool replace.
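
For reference, the rough sequence of commands was along these lines (the pool name and device paths below are placeholders, not my actual ones):
Code:
# SMART self-tests on the new drive
smartctl -t short /dev/sdX
smartctl -t conveyance /dev/sdX
smartctl -t long /dev/sdX
smartctl -a /dev/sdX        # review the results once the tests have finished

# swap in the new disk and start the resilver
zpool replace tank <old-disk> <new-disk>
zpool status -v tank        # watch the resilver progress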

During the resilver, some checksum errors showed up in zpool status, and they were still there after the resilver completed.

I checked the SMART metrics, which showed no concerning values. I then ran a badblocks test (I know, I should have done that before resilvering), and it finished without any errors.
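
For anyone curious, the badblocks pass was a plain read-only one, roughly like this (device path is a placeholder; I avoided the destructive write mode since the disk is already part of the pool):
Code:
badblocks -b 4096 -sv /dev/sdX    # -b block size, -s show progress, -v verbose; read-only by default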

Now I am not sure how to proceed. Could the checksum errors I saw just be spurious errors caused by the resilvering? Is this something I need to be concerned about and investigate further (if so, what should I look for), or should I just resilver the drive again and ignore any checksum errors?

(Edit: Please let me know if this is not the correct sub for this issue and I'll move it)

Before Resilvering
[screenshot: zpool status output before the resilver]

During Resilvering
[screenshot: zpool status output during the resilver]

After Resilvering
[screenshot: zpool status output after the resilver]

Comparing SMART Metrics of Both Drives in Mirror
[screenshot: SMART attribute comparison of both mirror drives]

Badblock Test
[screenshots: badblocks output]
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
The "Load_Cycle_Count" for both old and new drives seems high. It's possible that the drives have some of the power saving options enabled that NAS specific drives don't have. Perhaps the re-silver paused at times, which allowed the drive to park it's heads. Then when ZFS resumed, it had a time out failure because of the time it took to re-load the heads on to the platters.

For hard disk drives that don't use head parking, the "Power_Cycle_Count" would be roughly the same as the "Load_Cycle_Count".
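
A quick way to compare the two counters on a drive (device path is a placeholder):
Code:
smartctl -A /dev/sdX | grep -E 'Power_Cycle_Count|Load_Cycle_Count'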

Other than that, I don't have a useful suggestion.
 

SeaWolfX

Explorer
Joined
Mar 14, 2018
Messages
65
The "Load_Cycle_Count" for both old and new drives seems high. It's possible that the drives have some of the power saving options enabled that NAS specific drives don't have.
I am not sure to be honest. Would that be enabled by default and how can I check this?
 
Joined
Jun 2, 2019
Messages
591
I noticed your mobo has a RAID controller.

Re: https://www.supermicro.com/en/products/motherboard/X11SCH-F

Intel® C246 controller for 8 SATA3 (6 Gbps) ports; RAID 0,1,5,10

My understanding is RAID controllers are to be avoided. Read the following links.


Some forum members have been successful in getting Supermicro to send them firmware that will re-flash the on-board controller to IT mode.
Your other option might be to install a separate non-RAID HBA.
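
One way to check how the onboard controller presents itself to the OS, i.e. whether it enumerates as AHCI or as a RAID device (exact output varies by system):
Code:
lspci | grep -i -E 'sata|raid'
dmesg | grep -i ahci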
 

SeaWolfX

Explorer
Joined
Mar 14, 2018
Messages
65
I noticed your mobo has a RAID controller.

Re: https://www.supermicro.com/en/products/motherboard/X11SCH-F



My understanding is RAID controllers are to be avoided. Read the following links.


Some forum members have been successful in getting Supermicro to send them firmware that will re-flash the on-board controller to IT mode.
Your other option might be to install a separate non-RAID HBA.

Maybe I have misunderstood this, but I am not actually using any HW RAID features on the board. I did a lot of research on this forum before selecting a motherboard, and this one was very often recommended. Why would it be recommended if there might be an issue with the onboard RAID controller when using ZFS?
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
I am not sure to be honest. Would that be enabled by default and how can I check this?
You don't specify which TrueNAS release you are running: Core (based on FreeBSD) or SCALE (based on Linux). I am guessing SCALE based on your disk naming convention.

Linux has a tool called hdparm which can show the current settings:
Code:
# hdparm -B -M /dev/sda

/dev/sda:
 APM_level      = 254
 acoustic      = 254 (128=quiet ... 254=fast)

This shows my 2.5" laptop drive's settings, which used to have quite increasing "Load_Cycle_Count". Now I have disabled both "features" and no more increasing "Load_Cycle_Count".

These can be changed with;
Code:
# hdparm -B 254 -M 254 /dev/sda


I don't know the TrueNAS Core / FreeBSD tool name or how to use it.
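
On Linux, to query the current settings of all drives at once (read-only, nothing is changed), something like this works:
Code:
for d in /dev/sd?; do echo "== $d"; hdparm -B -M "$d"; done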
 

SeaWolfX

Explorer
Joined
Mar 14, 2018
Messages
65
You don't specify which TrueNAS release you are running: Core (based on FreeBSD) or SCALE (based on Linux). I am guessing SCALE based on your disk naming convention.

Linux has a tool called hdparm which can show the current settings:
Code:
# hdparm -B -M /dev/sda

/dev/sda:
 APM_level      = 254
 acoustic      = 254 (128=quiet ... 254=fast)

This shows my 2.5" laptop drive's settings, which used to have quite increasing "Load_Cycle_Count". Now I have disabled both "features" and no more increasing "Load_Cycle_Count".

These can be changed with:
Code:
# hdparm -B 254 -M 254 /dev/sda


I don't know the TrueNAS Core / FreeBSD tool name or how to use it.
You are correct, I'm on Linux.

Seems like APM is not supported on my drives though :/

[screenshot: hdparm output reporting APM as not supported]


After a bit of searching, it seems I might be able to do something similar using Seagate's SeaChest tools. I'm not sure how to install them on Linux or which parameters to change, though.
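
From what I can tell, the open-source openSeaChest tools on GitHub (github.com/Seagate/openSeaChest) are the way to get them on Linux; the commands below are taken from their documentation and I have not verified them myself:
Code:
openSeaChest_PowerControl --scan                          # list attached drives
openSeaChest_PowerControl -d /dev/sg0 --showEPCSettings   # show the drive's power-management (EPC) settings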

In any case, is 45K load cycles really such a high number? The drives are rated for 600K and have been up and running for about 1.5 years. At that rate they should be good for at least 20 years before they reach the limit.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
The point is that you have ZFS checksum errors. It's a guess on my part that these so-called features are interfering with ZFS's reads, causing the checksum errors. So I suggested eliminating one variable and seeing if that improves the situation.

In my case, I was accumulating thousands of load cycles, at a rate far in excess of what the disk is rated for over its expected life. It was parking its heads, and ZFS would wake it up. However, for me this did not create any ZFS errors, checksum or otherwise, even before the change. Thus, as I said, it's just a guess.
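
If it were my pool, the next step would be to clear the error counters and run a scrub, then see whether any new checksum errors appear (pool name is a placeholder):
Code:
zpool clear tank
zpool scrub tank
zpool status -v tank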
 