Scrub Read Errors and increased UDMA_CRC_Error_Count (on Proxmox).

Joined: Sep 13, 2014 · Messages: 149
I was hoping someone could double-check my thought process / conclusion regarding an issue I'm having on my Proxmox server. As it's ZFS/hardware related and not specific to Proxmox, I hope it's okay to post here (if not, in the Storage forum).

I've got a disk that showed 35 Read Errors after a scrub. Seeing as SMART also reported 3 UDMA_CRC_Error_Count errors (which I've read can indicate cable problems), I suspected a cabling issue rather than a failing disk.

Here's the course of events:

0. 35 Read errors reported on my pool (all on the same disk, /dev/sda, marked as "FAULTED"). I'm not sure whether this was after a scrub I manually triggered or not. Write and CKSUM errors are at 0.
1. I check the SMART status and see the 3 UDMA_CRC_Error_Count errors (commands sketched below this list). Cue research.
2. Having read that UDMA_CRC_Error_Count can indicate cable errors, I "zpool clear tank" the pool and shut the server down (not noticing that a resilver was triggered by the clear command).
3. I disconnect and reconnect the SAS to SATA cables from the drive caddies, and remove and reinsert the disks within the caddies, making sure that the disks are properly seated.
4. I boot up the server; the resilver starts and completes successfully.
5. The next day, I'm getting more read errors and the UDMA_CRC_Error_Count has increased from 3 to 4, and once again the disk is marked as "FAULTED". Still thinking it's a cable issue, I "zpool clear" the pool, shut the server down, and swap the cables around.
6. I reboot and the read errors are gone, but now the same disk has 60 or so CKSUM errors. At this point, I can't remember whether they disappeared by themselves after the resilver and/or scrub or whether I cleared them, but they're gone and they have yet to come back.
7. A few hours later, another disk has read errors, is marked as "FAULTED", and its UDMA_CRC_Error_Count has gone from 0 to 2. The original disk still only has 4 UDMA_CRC_Error_Count errors and is reporting 0 0 0 from "zpool status tank".
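For anyone following along, these are roughly the commands involved at each step - a minimal sketch assuming the pool is called tank and the suspect disk is /dev/sda (adjust the device name for your system):

Code:
# Pool health plus per-disk READ / WRITE / CKSUM counters and faulted devices
zpool status -v tank

# SMART attributes for the suspect disk; ID 199 is usually UDMA_CRC_Error_Count on SATA drives
smartctl -A /dev/sda

# Clear the pool's error counters once you think the cause has been addressed
# (in my case this also kicked off a resilver because the disk was FAULTED)
zpool clear tank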

[NOTE] Please see my second post for more details regarding the order of the cables before and after swapping them around.

As the problems moved to a different disk after just swapping the cables and nothing else, am I right in concluding that the most likely culprit is the cable that came with my server-pulled M1015?
 

Nick2253
Wizard · Joined: Apr 21, 2014 · Messages: 1,633
I'm surprised that you're getting a resilver. Maybe I'm misunderstanding, but a resilver should only be happening when you add a new drive to a vdev.

I'm assuming that the "another disk" in event 7 is in fact the same disk that received the cable that was originally attached to "original disk". If this is the case, then I'd say it's probably the cable. Swapping the cable is a relatively low-effort fix: at best it fixes the problem, and at worst it makes no difference.
 
Joined: Sep 13, 2014 · Messages: 149
Nick2253 said:
I'm surprised that you're getting a resilver. Maybe I'm misunderstanding, but a resilver should only be happening when you add a new drive to a vdev.

I'm assuming that the "another disk" in event 7 is in fact the same disk that received the cable that was originally attached to "original disk". If this is the case, then I'd say it's probably the cable. Swapping the cable is a relatively low-effort fix: at best it fixes the problem, and at worst it makes no difference.

Regarding the resilvering behaviour: the disk was marked as FAULTED, if that makes a difference... I'll edit my original post to reflect that.

Just to clarify point 7: I can't say with 100% certainty that the first disk to show errors and the second disk to show errors were connected with the same cable pre- and post-cable-swap, but having swapped the cables around, the issues have only affected the disk at the /dev/sda device path. The original disk was at /dev/sda, but having swapped cables, it is now at /dev/sdd.

The more I think about it, the more I think it's the cable. The disks themselves are in two 2x 5.25" to 3x 3.5" Icy Dock drive cages.

  • Each HDD cage has 3 SATA connectors.
  • There are two Mini SAS 8087 to SATA cables connected to my M1015. Each cable splits into 4x SATA connectors
  • One cable goes to each drive cage.
  • The SATA connectors are connected top to bottom on each cage, using the SATA connectors labelled 1, 2, and 3. For the sake of clarity, let's label them connectors A1 to A3, and B1 to B3.
I swapped the cables between the drive cages so basically the cable order went from-

A1 - disk 1
A2 - disk 2 [note to self - A is the bottom cage]
A3 - disk 3
B1 - disk 4
B2 - disk 5 [note to self - B is the top cage]
B3 - disk 6

to

B1 - disk 1
B2 - disk 2
B3 - disk 3
A1 - disk 4
A2 - disk 5
A3 - disk 6

Or to put it another way, assuming the disks are assigned /dev/ paths in the same physical order (there's a quick way to verify this, sketched after the second mapping below)-

A1 - disk 1 = /dev/sda (first disk to report problems)
A2 - disk 2
A3 - disk 3
B1 - disk 4 = if disk 1 is sda, then this might/should be /dev/sdd (no problems reported originally)
B2 - disk 5
B3 - disk 6

to

B1 - disk 1 = now at /dev/sda (second disk to report problems; it was effectively disk 4 in the previous cable layout)
B2 - disk 2
B3 - disk 3
A1 - disk 4 = originally at /dev/sda but now at /dev/sdd (no longer reporting ZFS errors and isn't showing any further increases to its UDMA_CRC_Error_Count)
A2 - disk 5
A3 - disk 6
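To double-check the "same physical order" assumption above, something like this should show which physical disk (by serial number) sits behind each /dev/sdX, since the sdX names can move around between boots while serials stay with the disk:

Code:
# Persistent names: each by-id symlink embeds the model and serial number
ls -l /dev/disk/by-id/ | grep -v part

# Or pull the serial straight from SMART for a given device node
smartctl -i /dev/sda | grep -i serial

# Or list everything in one go
lsblk -o NAME,SERIAL,MODEL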


So it looks like the drive cages, which I've never had an issue with, aren't the problem, and everything seems to point toward connector A1 or something directly related to it.

I had run the usual burn-in process on the disks, as found in my sig (SMART conveyance, short and then long tests, 4x badblocks passes, then another long test), and never had an issue, but perhaps I damaged the cable when I moved the disks / drive cages / HBA between my servers.
 
Joined: Sep 13, 2014 · Messages: 149
Just a minor update in the hopes that other users might be helped by this thread.

Having postulated that the SATA connector labelled "P1" on my second Mini SAS 8087 to SATA cable was bad, I disconnected P1 entirely and used one of the spare connectors on the cable. I rebooted my server, the pool resilvered, and I ran a scrub. There are no errors or increases in the UDMA_CRC_Error_Count after roughly the same period of time that it took the previous problems to rear their head.
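For the record, the monitoring I'm doing now is nothing fancy - roughly the following, with the device name being whatever the affected disk enumerates as on your system:

Code:
# Force a full read of the pool, then check the per-disk error counters
zpool scrub tank
zpool status tank

# Re-check the CRC counter on the disk behind the replaced connector.
# On most drives the raw value of attribute 199 is cumulative and never
# resets, so watch for whether it keeps climbing, not whether it's zero.
smartctl -A /dev/sdX | grep -i UDMA_CRC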

Fingers crossed that my interpretation of the early signs is correct, but if it isn't, I will update this post.

In conclusion, it does indeed look like "UDMA_CRC_Error_Count" can indicate issues with your cabling, which can also result in ZFS errors. So if you're in my situation, your first port of call should be to check your cabl-.... NO! Your first port of call should always be to make sure your data is backed up!... then check your cables.
 

Nick2253
Wizard · Joined: Apr 21, 2014 · Messages: 1,633
So if you're in my situation, your first port of call should be to check your cabl-.... NO! Your first port of call should always be to make sure your data is backed up!... then check your cables.

This is excellent advice in all situations! :cool:
 