Checksum errors for multiple vdevs

kcrawford

Cadet
Joined
Oct 20, 2020
Messages
2
Hello forum,

First, I want to start off by saying that I am running ZFS in Proxmox (Debian). I was previously using FreeNAS for storage and Proxmox for virtualization and combined the two servers into a new one in January-March of this year.

My problems began about a month ago, so about 5 months after the new build was completed. The server is as follows:

  • Supermicro MBD-X11SCL-F-O (New in March 2020)
  • Intel Xeon E2246G (New in March 2020)
  • 128 GB of DDR4-2933 ECC UDIMM memory (New in October 2020)
  • LSI 9211-8i flashed in IT Mode (Circa 2012 from my old FreeNAS build)
  • SAS to right angle SATA with metal retention tabs (New in October 2020)
  • 4x 8TB WD Red Drives WD80EFAX. Configured in RAIDZ2 (New in August 2020)
  • 4x 3TB WD Red Drives, configured in RAIDZ2 (Circa 2012 from my old FreeNAS build) Currently disconnected
  • 512GB Samsung 970 NVMe boot and primary storage drive (New in March 2020)
  • Corsair CX450 power supply (New in March 2020)
  • 4-in-1 SATA power adapter with Capacitors (https://www.amazon.com/gp/product/B00ENKYJB4/)
  • CyberPower 1500VA UPS

Each pool of 4 drives is connected to a 4-in-1 adapter. One pool is connected to the 12 volt SATA rail and the other pool was connected to the 12 volt 4-pin Molex rail with a 4-pin Molex to SATA adapter.

I replaced the non-ECC RAM with ECC RAM and replaced the SAS breakout cables. I ran a scrub after replacing these two items this past week and am still seeing checksum errors, as shown below.

Code:
root@bigbear:/mnt/helium/staging# zpool status
  pool: helium
 state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 0 days 02:44:15 with 2 errors on Tue Oct 20 11:19:45 2020
config:

    NAME                                   STATE     READ WRITE CKSUM
    helium                                 ONLINE       0     0     0
      raidz2-0                             ONLINE       0     0     0
        ata-WDC_WD80EFAX-68KNBN0_VAHDPLEL  ONLINE       0     0     4
        ata-WDC_WD80EFAX-68KNBN0_VAKSH6BL  ONLINE       0     0     4
        ata-WDC_WD80EFAX-68KNBN0_VAG9DJXL  ONLINE       0     0     4
        ata-WDC_WD80EFAX-68KNBN0_VAH152JL  ONLINE       0     0     4

errors: 2 data errors, use '-v' for a list     
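
For anyone who wants to check this kind of output in a script rather than by eye, here is a minimal sketch that pulls the per-device error counters out of the config table above. The function name `devices_with_errors` and the regular expression are illustrative assumptions, not part of any ZFS tooling, and the sample is just the table pasted from this post.

```python
import re

# Config table copied from the `zpool status` output in this post.
SAMPLE = """\
    NAME                                   STATE     READ WRITE CKSUM
    helium                                 ONLINE       0     0     0
      raidz2-0                             ONLINE       0     0     0
        ata-WDC_WD80EFAX-68KNBN0_VAHDPLEL  ONLINE       0     0     4
        ata-WDC_WD80EFAX-68KNBN0_VAKSH6BL  ONLINE       0     0     4
        ata-WDC_WD80EFAX-68KNBN0_VAG9DJXL  ONLINE       0     0     4
        ata-WDC_WD80EFAX-68KNBN0_VAH152JL  ONLINE       0     0     4
"""

# name, state, then three purely numeric counters.
# Assumption: counters are plain integers; abbreviated counters
# (for example with a K suffix) would not match this pattern.
ROW = re.compile(r"^\s*(\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def devices_with_errors(status_text):
    """Return {name: (read, write, cksum)} for rows with a nonzero counter."""
    flagged = {}
    for line in status_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header row or a line outside the table
        name, _state, read, write, cksum = match.groups()
        counts = (int(read), int(write), int(cksum))
        if any(counts):
            flagged[name] = counts
    return flagged


flagged = devices_with_errors(SAMPLE)
for name, (read, write, cksum) in flagged.items():
    print(f"{name}: read={read} write={write} cksum={cksum}")

# True when every flagged device reports exactly the same counters.
same_counts = len(set(flagged.values())) == 1
print("identical counters on all flagged devices:", same_counts)
```

Here all four disks report the same CKSUM count while the pool and raidz2 rows stay at zero, so the sketch flags only the four disks.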


Here is the SMART data for the 4 drives that make up this pool.


/dev/sda
Code:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0
2 Throughput_Performance 0x0004 129 129 054 Old_age Offline - 104
3 Spin_Up_Time 0x0007 195 195 024 Pre-fail Always - 476 (Average 357)
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 20
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
7 Seek_Error_Rate 0x000a 100 100 067 Old_age Always - 0
8 Seek_Time_Performance 0x0004 128 128 020 Old_age Offline - 18
9 Power_On_Hours 0x0012 100 100 000 Old_age Always - 1783
10 Spin_Retry_Count 0x0012 100 100 060 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 20
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 94
193 Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 94
194 Temperature_Celsius 0x0002 151 151 000 Old_age Always - 43 (Min/Max 20/51)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0


/dev/sdb
Code:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0
2 Throughput_Performance 0x0004 128 128 054 Old_age Offline - 108
3 Spin_Up_Time 0x0007 193 193 024 Pre-fail Always - 481 (Average 361)
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 20
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
7 Seek_Error_Rate 0x000a 100 100 067 Old_age Always - 0
8 Seek_Time_Performance 0x0004 128 128 020 Old_age Offline - 18
9 Power_On_Hours 0x0012 100 100 000 Old_age Always - 1783
10 Spin_Retry_Count 0x0012 100 100 060 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 20
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 94
193 Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 94
194 Temperature_Celsius 0x0002 166 166 000 Old_age Always - 39 (Min/Max 20/43)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0


/dev/sdc
Code:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0
2 Throughput_Performance 0x0004 128 128 054 Old_age Offline - 108
3 Spin_Up_Time 0x0007 190 190 024 Pre-fail Always - 491 (Average 364)
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 20
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
7 Seek_Error_Rate 0x000a 100 100 067 Old_age Always - 0
8 Seek_Time_Performance 0x0004 128 128 020 Old_age Offline - 18
9 Power_On_Hours 0x0012 100 100 000 Old_age Always - 1783
10 Spin_Retry_Count 0x0012 100 100 060 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 20
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 94
193 Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 94
194 Temperature_Celsius 0x0002 151 151 000 Old_age Always - 43 (Min/Max 20/51)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0


/dev/sdd
Code:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0
2 Throughput_Performance 0x0004 128 128 054 Old_age Offline - 108
3 Spin_Up_Time 0x0007 192 192 024 Pre-fail Always - 484 (Average 361)
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 20
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
7 Seek_Error_Rate 0x000a 100 100 067 Old_age Always - 0
8 Seek_Time_Performance 0x0004 128 128 020 Old_age Offline - 18
9 Power_On_Hours 0x0012 100 100 000 Old_age Always - 1783
10 Spin_Retry_Count 0x0012 100 100 060 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 20
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 94
193 Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 94
194 Temperature_Celsius 0x0002 151 151 000 Old_age Always - 43 (Min/Max 20/52)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0
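
In all four tables the raw values for attributes 5, 197, 198 and 199 are 0. Below is a rough sketch of how those rows could be extracted from the attribute table for a quick comparison across drives. The choice of which IDs to watch and the helper name `parse_attributes` are my own assumptions for illustration, and the sample is a handful of rows taken from the /dev/sda table.

```python
# IDs worth a glance when chasing corruption (an assumption, not a rule).
WATCHED = (5, 197, 198, 199)

# A few rows taken from the /dev/sda attribute table in this post.
SAMPLE = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
194 Temperature_Celsius 0x0002 151 151 000 Old_age Always - 43 (Min/Max 20/51)
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
"""


def parse_attributes(text):
    """Return {id: (name, raw_value)} for each attribute row in text."""
    attrs = {}
    for line in text.splitlines():
        # Split on any run of whitespace, so uneven spacing is tolerated.
        fields = line.split(None, 9)
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # header row or unrelated line
        # RAW_VALUE may carry extra text such as "(Min/Max 20/51)";
        # keep only the leading number. Assumes it is a plain integer.
        raw = int(fields[9].split()[0])
        attrs[int(fields[0])] = (fields[1], raw)
    return attrs


attrs = parse_attributes(SAMPLE)
nonzero = {i: attrs[i] for i in WATCHED if i in attrs and attrs[i][1] != 0}
print(nonzero or "no watched attribute has a nonzero raw value")
```

Running the same function over each drive's table makes it easy to spot a single drive whose counters differ from the rest.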


Possibly relevant backstory on the 8TB drives: they are refurbs from WD. When I found out that the 6TB drives I had purchased were SMR, I complained to WD and they replaced them for me at no charge. I am running long tests on the drives, but it will take a few days to get the results. The first one came back with no errors.

I am also in contact with Corsair; they are sending out a replacement power supply, since my research sometimes pointed to the power supply as the culprit.

Where else should I look for the cause of the checksum and file corruption errors? Note that this was affecting both pools. I disconnected the 4x 3TB drive pool as I thought maybe I was overloading the power supply. I'm a bit skeptical of this, as my UPS indicates that the server is only placing a 100-140 watt load on it. I was planning on using the newer drives as my main storage pool and using the older drives as a backup to the main pool.
 

kcrawford

Cadet
Joined
Oct 20, 2020
Messages
2
Update:

The smartctl long tests have completed without error on all four of the 8TB drives.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
Checksum errors can be caused by cabling... check the SATA cables or the SAS breakout cable (since the error count is the same on all 4 drives, maybe it's a 4-way breakout cable, so check the card end).
 