Hello forum,
First, I want to start off by saying that I am running ZFS in Proxmox (Debian). I was previously using FreeNAS for storage and Proxmox for virtualization and combined the two servers into a new one in January-March of this year.
My problems began about a month ago, so about 5 months after the new build was completed. The server is as follows:
- Supermicro MBD-X11SCL-F-O (New in March 2020)
- Intel Xeon E2246G (New in March 2020)
- 128 GB of DDR4-2933 ECC UDIMM memory (New in October 2020)
- LSI 9211-8i flashed in IT Mode (Circa 2012 from my old FreeNAS build)
- SAS to right angle SATA with metal retention tabs (New in October 2020)
- 4x 8TB WD Red Drives WD80EFAX. Configured in RAIDZ2 (New in August 2020)
- 4x 3TB WD Red Drives, configured in RAIDZ2 (Circa 2012 from my old FreeNAS build). Currently disconnected.
- 512GB Samsung 970 NVMe boot and primary storage drive (New in March 2020)
- Corsair CX450 power supply (New in March 2020)
- 4-in-1 SATA power adapter with Capacitors (https://www.amazon.com/gp/product/B00ENKYJB4/)
- CyberPower 1500VA UPS
Each pool of 4 drives is connected to the 4-in-1 adapter. One pool is connected to the 12 volt SATA rail and the other pool was connected to the 12 volt 4-pin Molex rail with a 4-pin Molex-to-SATA adapter.
I replaced the non-ECC RAM with ECC RAM and replaced the SAS breakout cables. I ran a scrub after replacing these two items this past week and am still seeing checksum errors, as shown below.
Code:
root@bigbear:/mnt/helium/staging# zpool status
  pool: helium
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 0 days 02:44:15 with 2 errors on Tue Oct 20 11:19:45 2020
config:

        NAME                                   STATE     READ WRITE CKSUM
        helium                                 ONLINE       0     0     0
          raidz2-0                             ONLINE       0     0     0
            ata-WDC_WD80EFAX-68KNBN0_VAHDPLEL  ONLINE       0     0     4
            ata-WDC_WD80EFAX-68KNBN0_VAKSH6BL  ONLINE       0     0     4
            ata-WDC_WD80EFAX-68KNBN0_VAG9DJXL  ONLINE       0     0     4
            ata-WDC_WD80EFAX-68KNBN0_VAH152JL  ONLINE       0     0     4

errors: 2 data errors, use '-v' for a list
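For comparing the per-device counters across scrubs, here is a minimal Python sketch that pulls READ/WRITE/CKSUM out of the config table of the `zpool status` output above. The names `ZPOOL_CONFIG` and `parse_error_counters` are made up for illustration and are not part of ZFS, and the sketch assumes plain integer counters (it would skip rows where a counter is abbreviated or where extra text follows the counters).

```python
import re

# Config rows copied from the `zpool status` output above.
ZPOOL_CONFIG = """\
NAME                                   STATE     READ WRITE CKSUM
helium                                 ONLINE       0     0     0
  raidz2-0                             ONLINE       0     0     0
    ata-WDC_WD80EFAX-68KNBN0_VAHDPLEL  ONLINE       0     0     4
    ata-WDC_WD80EFAX-68KNBN0_VAKSH6BL  ONLINE       0     0     4
    ata-WDC_WD80EFAX-68KNBN0_VAG9DJXL  ONLINE       0     0     4
    ata-WDC_WD80EFAX-68KNBN0_VAH152JL  ONLINE       0     0     4
"""

# One row: device name, state, then three integer counters.
ROW = re.compile(r"^\s*(\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def parse_error_counters(config_text):
    """Return {name: (read, write, cksum)} for every row with integer counters."""
    counters = {}
    for line in config_text.splitlines():
        match = ROW.match(line)
        if match:  # the header row has no integers, so it does not match
            name, _state, read, write, cksum = match.groups()
            counters[name] = (int(read), int(write), int(cksum))
    return counters


if __name__ == "__main__":
    counters = parse_error_counters(ZPOOL_CONFIG)
    for name, (read, write, cksum) in counters.items():
        if name.startswith("ata-"):  # leaf disks only (assumes by-id ATA names)
            print(f"{name}: READ={read} WRITE={write} CKSUM={cksum}")
```

Run against the output above, every one of the four disks reports the same CKSUM count of 4, while the pool and raidz2-0 rows report 0.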
Here is the SMART data for the 4 drives that make up this pool.
/dev/sda
Code:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0004   129   129   054    Old_age   Offline      -       104
  3 Spin_Up_Time            0x0007   195   195   024    Pre-fail  Always       -       476 (Average 357)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       20
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000a   100   100   067    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0004   128   128   020    Old_age   Offline      -       18
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       1783
 10 Spin_Retry_Count        0x0012   100   100   060    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       20
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       94
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       94
194 Temperature_Celsius     0x0002   151   151   000    Old_age   Always       -       43 (Min/Max 20/51)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
/dev/sdb
Code:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0004   128   128   054    Old_age   Offline      -       108
  3 Spin_Up_Time            0x0007   193   193   024    Pre-fail  Always       -       481 (Average 361)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       20
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000a   100   100   067    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0004   128   128   020    Old_age   Offline      -       18
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       1783
 10 Spin_Retry_Count        0x0012   100   100   060    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       20
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       94
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       94
194 Temperature_Celsius     0x0002   166   166   000    Old_age   Always       -       39 (Min/Max 20/43)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
/dev/sdc
Code:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0004   128   128   054    Old_age   Offline      -       108
  3 Spin_Up_Time            0x0007   190   190   024    Pre-fail  Always       -       491 (Average 364)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       20
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000a   100   100   067    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0004   128   128   020    Old_age   Offline      -       18
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       1783
 10 Spin_Retry_Count        0x0012   100   100   060    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       20
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       94
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       94
194 Temperature_Celsius     0x0002   151   151   000    Old_age   Always       -       43 (Min/Max 20/51)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
/dev/sdd
Code:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0004   128   128   054    Old_age   Offline      -       108
  3 Spin_Up_Time            0x0007   192   192   024    Pre-fail  Always       -       484 (Average 361)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       20
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000a   100   100   067    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0004   128   128   020    Old_age   Offline      -       18
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       1783
 10 Spin_Retry_Count        0x0012   100   100   060    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       20
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       94
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       94
194 Temperature_Celsius     0x0002   151   151   000    Old_age   Always       -       43 (Min/Max 20/52)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
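To make the four tables above easier to compare, here is a minimal Python sketch that reads a SMART attribute table and reports any non-zero raw value among a short watch list. The watch list (IDs 5, 197, 198 and 199, i.e. reallocated, pending and uncorrectable sectors plus interface CRC errors) is my own assumption about which attributes are worth checking first, and the names `SMART_SAMPLE`, `parse_raw_values` and `nonzero_watched` are hypothetical helpers, not part of smartmontools.

```python
import re

# A few rows copied from the /dev/sda table above.
SMART_SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
194 Temperature_Celsius     0x0002   151   151   000    Old_age   Always       -       43 (Min/Max 20/51)
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
"""

# Assumed watch list: attribute IDs to flag when their raw value is non-zero.
WATCHED = {5, 197, 198, 199}

# ID, name, flag, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, then the
# leading integer of RAW_VALUE (trailing text such as "(Min/Max ...)" is ignored).
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+\d+\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def parse_raw_values(table_text):
    """Return {attribute_id: (attribute_name, raw_value)} for each table row."""
    raw = {}
    for line in table_text.splitlines():
        match = ROW.match(line)
        if match:  # the header line starts with "ID#", so it does not match
            raw[int(match.group(1))] = (match.group(2), int(match.group(3)))
    return raw


def nonzero_watched(raw, watched=WATCHED):
    """Return only the watched attributes whose raw value is not zero."""
    return {i: v for i, v in raw.items() if i in watched and v[1] != 0}


if __name__ == "__main__":
    flagged = nonzero_watched(parse_raw_values(SMART_SAMPLE))
    print(flagged if flagged else "no watched attribute has a non-zero raw value")
```

For the sample rows, none of the watched attributes is non-zero, which matches the tables above: all four drives show raw values of 0 for these IDs.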
Possibly relevant backstory on the 8TB drives: they are refurbs from WD. When I found out that the 6TB drives I had purchased were SMR, I complained to WD and they replaced them for me at no charge. I am running long tests on the drives, but it will take a few days to get the results. The first one came back with no errors.
I am also in contact with Corsair; they are sending out a replacement power supply, as my research sometimes pointed to the power supply as the culprit.
Where else should I look for the cause of the checksum and file corruption errors? Note that this was affecting both pools. I disconnected the 4x 3TB drive pool as I thought maybe I was over-utilizing the power supply. I am a bit skeptical of this, as my UPS indicates that the server is only placing a 100-140 watt load on it. I was planning on using the newer drives as my main storage pool and the older drives as a backup to the main pool.