CompuGlobalHyperMegaNet
Contributor
- Joined
- Sep 13, 2014
- Messages
- 149
I was hoping someone could double-check my thought process and conclusion regarding an issue I'm having on my Proxmox server. As it's ZFS hardware related and not specific to Proxmox, I hope it's okay to post here (if not in the Storage forum).
I've got a disk that showed 35 read errors after a scrub. SMART also reported 3 UDMA_CRC_Error_Count errors, which I've read can indicate cable problems.
Here's the course of events:
0. 35 read errors reported on my pool (all on the same disk, at /dev/sda, which is marked as "FAULTED"). I'm not sure whether this was after a scrub I manually triggered or not. Write and CHKSUM are at 0.
1. I check the SMART status and see the 3 UDMA_CRC_Error_Count errors. Cue research.
2. Having read that UDMA_CRC_Error_Count can indicate cable errors, I "zpool clear tank" the pool and shut the server down (not noticing that a resilver was triggered by the clear command).
3. I disconnect and reconnect the SAS to SATA cables from the drive caddies, and remove and reinsert the disks within the caddies, making sure that the disks are properly seated.
4. I boot up the server; the resilver starts and completes successfully.
5. The next day, I'm getting more read errors, the UDMA_CRC_Error_Count has increased from 3 to 4 and, once again, the disk is marked as "FAULTED". Still thinking that it's a cable issue, I "zpool clear" the pool again, shut the server down, and swap the cables around.
6. I reboot. The read errors are gone, but now the same disk has 60 or so CHKSUM errors. At this point, I can't remember whether they disappeared by themselves after the resilver and/or scrub or whether I cleared them, but they're gone and they have yet to come back.
7. A few hours later, another disk has read errors, is marked as "FAULTED", and its UDMA_CRC_Error_Count has gone from 0 to 2. The original disk still only has 4 UDMA_CRC_Error_Count errors and is reporting 0 0 0 from "zpool status tank".
[NOTE] Please see my second post for more details in regards to the order of the cables before and after swapping them around.
As the problems moved to a different disk after just swapping the cables and nothing else, am I right in concluding that the most likely culprit is the cable that came with my server-pulled M1015?
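For anyone wanting to track this the same way, here is a minimal Python sketch of the comparison I'm making, not a definitive tool. It assumes the usual ten-column attribute table that `smartctl -A /dev/sdX` prints for SATA disks, with the raw value in the last column. The sample row and the names `disk_a` and `disk_b` are placeholders for illustration, not real output from my server. The counter values are the ones from the timeline above.

```python
def crc_error_count(smartctl_output):
    """Return the raw UDMA_CRC_Error_Count from `smartctl -A` text, or None."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Assumed row layout: ID, NAME, FLAG, VALUE, WORST, THRESH,
        # TYPE, UPDATED, WHEN_FAILED, RAW_VALUE
        if len(fields) >= 10 and fields[1] == "UDMA_CRC_Error_Count":
            return int(fields[9])
    return None


def crc_deltas(before, after):
    """Map each disk to how much its CRC counter grew between two snapshots."""
    return {disk: after[disk] - before.get(disk, 0) for disk in after}


# Illustrative sample row (placeholder, not actual output from this server).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       4
"""

# Counter values from the timeline; "disk_a" is the original disk and
# "disk_b" the second one (both names are placeholders).
before_swap = {"disk_a": 4, "disk_b": 0}
after_swap = {"disk_a": 4, "disk_b": 2}

print(crc_error_count(SAMPLE))
print(crc_deltas(before_swap, after_swap))
```

As far as I understand it, the raw counter is cumulative and does not reset, so the change between two checks is what matters, not the absolute number. A counter that stops growing on the original disk and starts growing on a different one after a cable swap is what I'd expect if the fault follows the cable.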