Virtual Machines permanent errors detected in files

csjjpm

Contributor
Joined
Feb 16, 2015
Messages
126
Hi,
My pool reports that it is fine apart from two errors in the virtual disks of two virtual machines. I've checked SMART and all four of my RAID drives PASS.
Any ideas how I can check and fix the two virtual machine virtual disks?
The virtual machines both boot and appear to be working fine.
thanks
Paul
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
csjjpm said:
Any ideas how I can check and fix the two virtual machine virtual disks?
The virtual machines both boot and appear to be working fine.

Option 1:
Perform a backup of the VM(s) from their respective operating systems.

Re-create the VM(s) with new ZVOLs and restore the backup in the VM's OS.

Option 2:
Check for the existence of a snapshot of the ZVOL(s) which doesn't have corruption and roll back to that snapshot.


In either case... consider your pool/hardware design in the context of how such a corruption could exist and re-design to eliminate the possibility.
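
Before choosing between the two options, it helps to know exactly which objects ZFS has flagged. The following is a minimal sketch, not an official tool: it pulls the entries listed under "Permanent errors have been detected in the following files:" out of `zpool status -v` output. The sample text, the pool name `tank`, the zvol names, and the function name `permanent_error_paths` are all hypothetical and only there for illustration.

```python
def permanent_error_paths(status_text: str) -> list[str]:
    """Return the entries listed after the 'Permanent errors' line of `zpool status -v`."""
    marker = "Permanent errors have been detected in the following files:"
    paths = []
    in_list = False
    for line in status_text.splitlines():
        if marker in line:
            in_list = True
            continue
        if in_list:
            entry = line.strip()
            if entry:
                paths.append(entry)
            elif paths:
                # A blank line after at least one entry ends the list.
                break
    return paths


# Hypothetical sample output; the real text comes from `zpool status -v`.
SAMPLE = """\
  pool: tank
 state: ONLINE
errors: Permanent errors have been detected in the following files:

        tank/vms/win11-disk0:<0x1>
        tank/vms/ubuntu-disk0:<0x1>
"""

if __name__ == "__main__":
    for path in permanent_error_paths(SAMPLE):
        print(path)
```

On a live system the input would be the output of `zpool status -v` for the pool, and for Option 2 something like `zfs list -t snapshot -r <zvol>` should show whether any snapshots of the affected ZVOL exist to roll back to.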
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
@csjjpm, please check the forum rules on what information about hardware and system configuration others need in order to provide the best possible support. In addition, your wording may or may not indicate that 1) a hypervisor is in play, and 2) a RAID controller is being used. Both facts would have a substantial influence on what should be done.
 

csjjpm

Contributor
Joined
Feb 16, 2015
Messages
126
sretalla said:
Option 1:
Perform a backup of the VM(s) from their respective operating systems.

Re-create the VM(s) with new ZVOLs and restore the backup in the VM's OS.

Option 2:
Check for the existence of a snapshot of the ZVOL(s) which doesn't have corruption and roll back to that snapshot.

In either case... consider your pool/hardware design in the context of how such a corruption could exist and re-design to eliminate the possibility.

Hi,
Thank you, I will attempt this with one or both. One is Windows 11 and the other is Ubuntu. I can probably do the first; I'm not sure about the second.
thanks again
Paul
 

csjjpm

Contributor
Joined
Feb 16, 2015
Messages
126
ChrisRJ said:
@csjjpm, please check the forum rules on what information about hardware and system configuration others need in order to provide the best possible support. In addition, your wording may or may not indicate that 1) a hypervisor is in play, and 2) a RAID controller is being used. Both facts would have a substantial influence on what should be done.

Apologies, it was a rushed post as I'd otherwise have forgotten to do it, so I'll try to find more details and send them on.

1st: I'm using the 'Virtual Machines' option in TrueNAS. Is that still bhyve?

2nd: I think my storage adapter is a <Marvell 88SE9215 AHCI SATA controller>; however, I've done software RAID in TrueNAS. It is 4 disks, a stripe across two mirrors. I can't remember what RAID level that is (10?). Something like this:

Code:
        WD20EARS                                        ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/d0606b2d-447a-11e8-9209-00224d7cab86  ONLINE       0     0     0
            gptid/3b89997c-bd4b-11eb-82a1-00224d7cab86  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            gptid/080525f0-457c-11e8-afa5-00224d7cab86  ONLINE       0     0     0
            gptid/bed65e14-458c-11e8-afa5-00224d7cab86  ONLINE       0     0     0


Thanks Chris let me know if there is anything else I can provide to help.
BW
Paul
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
While these drives don't have a specific contraindication like the SMR WD Red drives, the "Green" line does/did have a very aggressive sleep timer that would often cause it to spin up and down far too frequently. Normally these would manifest as errors visible in your status list though, or a drive getting kicked out of the pool.

Can you post your full SMART data for the drives? An excessively high Load Cycle Count may also be indicative of a failing device.
 

csjjpm

Contributor
Joined
Feb 16, 2015
Messages
126
Hi, this is the output for the SMART for each disk. What should I be looking for? (edit: I see you say Load Cycle Count)

ada0
Code:
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p2 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   119   099   006    Pre-fail  Always       -       223447936
  3 Spin_Up_Time            0x0003   092   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   091   091   020    Old_age   Always       -       9548
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   075   060   030    Pre-fail  Always       -       65083094853
  9 Power_On_Hours          0x0032   027   027   000    Old_age   Always       -       64637
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       779
183 Runtime_Bad_Block       0x0032   097   097   000    Old_age   Always       -       3
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   093   093   000    Old_age   Always       -       7
190 Airflow_Temperature_Cel 0x0022   057   039   045    Old_age   Always   In_the_past 43 (Min/Max 16/44 #15753)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       699
193 Load_Cycle_Count        0x0032   094   094   000    Old_age   Always       -       12450
194 Temperature_Celsius     0x0022   043   061   000    Old_age   Always       -       43 (0 12 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       61668h+23m+48.195s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       204003892534
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       389816512146


ada1
Code:
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p2 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   083   064   044    Pre-fail  Always       -       194501756
  3 Spin_Up_Time            0x0003   093   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       72
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   089   060   045    Pre-fail  Always       -       787403380
  9 Power_On_Hours          0x0032   086   086   000    Old_age   Always       -       12992 (242 243 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       72
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   057   039   040    Old_age   Always   In_the_past 43 (Min/Max 16/44 #39)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       30
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       104
194 Temperature_Celsius     0x0022   043   061   000    Old_age   Always       -       43 (0 16 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       12991h+10m+33.043s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       44387902530
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       73943415455


ada2
Code:
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p2 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   115   099   006    Pre-fail  Always       -       99744272
  3 Spin_Up_Time            0x0003   092   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   091   091   020    Old_age   Always       -       9545
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   087   060   030    Pre-fail  Always       -       642310787
  9 Power_On_Hours          0x0032   027   027   000    Old_age   Always       -       64606
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       777
183 Runtime_Bad_Block       0x0032   098   098   000    Old_age   Always       -       2
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   081   081   000    Old_age   Always       -       19
190 Airflow_Temperature_Cel 0x0022   058   039   045    Old_age   Always   In_the_past 42 (155 147 43 16 0)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       699
193 Load_Cycle_Count        0x0032   094   094   000    Old_age   Always       -       12446
194 Temperature_Celsius     0x0022   042   061   000    Old_age   Always       -       42 (0 13 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       61644h+30m+56.143s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       203993256433
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       379992405565


ada3
Code:
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p2 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always       -       162551712
  3 Spin_Up_Time            0x0003   093   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   091   091   020    Old_age   Always       -       9543
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   087   060   030    Pre-fail  Always       -       4943442881
  9 Power_On_Hours          0x0032   027   027   000    Old_age   Always       -       64606
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       777
183 Runtime_Bad_Block       0x0032   097   097   000    Old_age   Always       -       3
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   063   042   045    Old_age   Always   In_the_past 37 (5 116 38 15 0)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       698
193 Load_Cycle_Count        0x0032   094   094   000    Old_age   Always       -       12444
194 Temperature_Celsius     0x0022   037   058   000    Old_age   Always       -       37 (0 12 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       61644h+54m+41.196s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       202806842089
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       384182610605
 

csjjpm

Contributor
Joined
Feb 16, 2015
Messages
126
So I've just worked out that 3 of the disks have been used for at least 7.5 years! Maybe I need to replace them :smile:
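
For reference, the arithmetic behind that estimate can be sketched as below. Power_On_Hours only counts time the drive was actually powered, so it gives a lower bound on calendar age. The function name `power_on_years` and the 365.25-day year are my own assumptions; the hour counts are the ones posted above.

```python
HOURS_PER_YEAR = 24 * 365.25  # average year length, including leap years


def power_on_years(hours: int) -> float:
    """Convert a SMART Power_On_Hours raw value into years of powered-on time."""
    return hours / HOURS_PER_YEAR


# Power_On_Hours raw values as posted for each drive in this thread.
POWER_ON_HOURS = {"ada0": 64637, "ada1": 12992, "ada2": 64606, "ada3": 64606}

if __name__ == "__main__":
    for drive, hours in POWER_ON_HOURS.items():
        print(f"{drive}: {power_on_years(hours):.1f} years powered on")
```

For the three older drives this works out to a little under 7.5 years of powered-on time; any time spent switched off is not counted.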
 

csjjpm

Contributor
Joined
Feb 16, 2015
Messages
126
Would I be better off replacing this RAID10 (4 x 4TB) with a RAID1 (2 x 8TB)?
The cost is about the same.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
csjjpm said:
Would I be better off replacing this RAID10 (4 x 4TB) with a RAID1 (2 x 8TB)?
The cost is about the same.

You might notice a slight decrease in random I/O performance from reducing to two drives.

Reviewing the drives, ada1 has a much lower LCC value relative to its power-on hours, which might mean it's had its idle timer changed at some point. None of the other "red flag" values on SMART are present, although temperatures are a little warm for my liking at 43C.
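
To make the "LCC relative to power-on hours" comparison concrete, here is a rough sketch that reads the raw values from the SMART rows posted above and expresses them as load cycles per 1,000 power-on hours. The helper names `raw_value` and `cycles_per_1000_hours` are hypothetical, the rows have had their whitespace condensed, and the ratio is only a convenient way to compare drives, not an official SMART metric or threshold.

```python
def raw_value(row: str) -> int:
    """Return the leading integer of the RAW_VALUE column of a `smartctl -A` row."""
    # RAW_VALUE is the 10th whitespace-separated field. It can be followed by
    # extra text such as "(242 243 0)", so only the first token is used.
    return int(row.split()[9])


def cycles_per_1000_hours(lcc_row: str, poh_row: str) -> float:
    """Load cycles per 1,000 power-on hours, from the two attribute rows."""
    return 1000 * raw_value(lcc_row) / raw_value(poh_row)


# (Load_Cycle_Count row, Power_On_Hours row) copied from the output in this thread.
ROWS = {
    "ada0": ("193 Load_Cycle_Count 0x0032 094 094 000 Old_age Always - 12450",
             "  9 Power_On_Hours   0x0032 027 027 000 Old_age Always - 64637"),
    "ada1": ("193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 104",
             "  9 Power_On_Hours   0x0032 086 086 000 Old_age Always - 12992 (242 243 0)"),
    "ada2": ("193 Load_Cycle_Count 0x0032 094 094 000 Old_age Always - 12446",
             "  9 Power_On_Hours   0x0032 027 027 000 Old_age Always - 64606"),
    "ada3": ("193 Load_Cycle_Count 0x0032 094 094 000 Old_age Always - 12444",
             "  9 Power_On_Hours   0x0032 027 027 000 Old_age Always - 64606"),
}

RATES = {drive: cycles_per_1000_hours(*rows) for drive, rows in ROWS.items()}

if __name__ == "__main__":
    for drive, rate in RATES.items():
        print(f"{drive}: {rate:.1f} load cycles per 1000 power-on hours")
```

Run against these rows, ada1 comes out more than an order of magnitude below the other three drives, which is the difference being pointed out here.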

Your Marvell card (88SE9215) is less than ideal for some situations due to the PCIe 2.0 x1 limitation, but unless it's overheating I wouldn't expect it to return bad data to the degree that ZFS would spot it. Normally the concerns there are slow performance or degraded response under scrub.
 

csjjpm

Contributor
Joined
Feb 16, 2015
Messages
126
ada1 is newer. I think I had a failing drive a couple of years ago and swapped it out.

thanks for the advice on drives.
 

csjjpm

Contributor
Joined
Feb 16, 2015
Messages
126
I've just swapped out an old drive in the mirror paired with the existing 'newer' (2020ish) drive. It is currently resilvering.

That mirror will be fine soon.

Is it safe for me to replace both of the other 7-year-old drives in the 2nd mirror of the RAID 10 at once, or should I replace each one in turn and let them resilver separately?

Thanks
Paul
 