zpool degraded because of corrupt snapshots? how to fix?

Dopamin3

Dabbler
Joined
Aug 18, 2017
Messages
46
On my new TrueNAS Core 13.0-U5.3 server (disregard my signature system) I have a mirror of two Silicon Power P34A80 2TB NVMe drives. I use this pool for iocage jails and a few VMs.

At first the zpool showed degraded and a zpool status -v on it showed this (screenshot from SSH on a phone):
[screenshot: zpool status -v output showing the degraded pool]


I thought I could maybe fix this by deleting that snapshot, so I did. Now, after a scrub and another zpool status -v, I'm getting this:
[screenshot: zpool status -v output after the scrub]
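
For reference, this is roughly what I ran (the names below are placeholders; the actual snapshot is the one zpool status -v listed in the first screenshot):

Code:
# placeholder names -- substitute the pool/dataset@snapshot from the status output
zfs destroy nvme/iocage@corrupt-snapshot
zpool scrub nvme
zpool status -v nvme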


The drives are only a few months old, show nothing wrong in SMART, and show no read errors when issuing "nvmecontrol logpage -p 1 *drive*".
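
Specifically, the checks looked like this on each drive (device names as they appear on my system; nvme1 shown):

Code:
nvmecontrol devlist              # confirm both drives enumerate
nvmecontrol logpage -p 1 nvme1   # error information log -- came back clean
nvmecontrol logpage -p 2 nvme1   # SMART / health information page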

What is my path forward to correct this and take it out of a degraded state? Do I actually have to go as far as destroying the whole pool?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
ZFS is trying to tell you that you have 2 problems...

CRC errors: possibly your disks are not well connected to the system. (difficult to troubleshoot in an M.2 type connection... how are they attached? PCIe converter in a slot?)

Read Errors: possibly related to the CRC errors in terms of root cause, but may be an independent issue. Can you get reporting out of SMART?

I would start by trying to see what's going on with the drives first... maybe temperature behind it all, frying the controllers?
 
Joined
Jun 15, 2022
Messages
674
What's the hardware of the box in question? Are you using ECC RAM?

My experience with snapshots in VirtualBox is that if your system isn't rock-solid, it's going to need a rebuild in 6+ months of heavy use because there's no saving it at that point. TrueNAS is even harder on the hardware, so by extension I imagine if the hardware is at all flaky you're burning that biscuit.
 

Dopamin3

Dabbler
Joined
Aug 18, 2017
Messages
46
What's the hardware of the box in question? Are you using ECC RAM?

My experience with snapshots in VirtualBox is that if your system isn't rock-solid, it's going to need a rebuild in 6+ months of heavy use because there's no saving it at that point. TrueNAS is even harder on the hardware, so by extension I imagine if the hardware is at all flaky you're burning that biscuit.
Ryzen 5 Pro 5650GE with 4 x 32 GB Unbuffered ECC RAM.
 

Dopamin3

Dabbler
Joined
Aug 18, 2017
Messages
46
ZFS is trying to tell you that you have 2 problems...

CRC errors: possibly your disks are not well connected to the system. (difficult to troubleshoot in an M.2 type connection... how are they attached? PCIe converter in a slot?)

Read Errors: possibly related to the CRC errors in terms of root cause, but may be an independent issue. Can you get reporting out of SMART?

I would start by trying to see what's going on with the drives first... maybe temperature behind it all, frying the controllers?
The disks are connected directly to the M.2 slots on a Gigabyte X570S Aero G motherboard. I ran long self-tests with nvmecontrol as well as checking the log page for errors, and nothing looked suspicious. If you or someone lets me know what command to run, I can show the output here.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
Can you get smartctl -a /dev/nvme0 to give output? (changing nvme0 to whatever is right for you)
 

Dopamin3

Dabbler
Joined
Aug 18, 2017
Messages
46
Can you get smartctl -a /dev/nvme0 to give output? (changing nvme0 to whatever is right for you)
Here is the output for both drives in the mirror:

Code:
root@truenas[~]# smartctl -a /dev/nvme1
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       SPCC M.2 PCIe SSD
Serial Number:                      230155785181003
Firmware Version:                   VB421D65
PCI Vendor/Subsystem ID:            0x10ec
IEEE OUI Identifier:                0x00e04c
Controller ID:                      1
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          2,048,408,248,320 [2.04 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            500e04 c81e070c18
Local Time is:                      Fri Aug 11 10:01:05 2023 EDT
Firmware Updates (0x02):            1 Slot
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0014):     DS_Mngmt Sav/Sel_Feat
Log Page Attributes (0x02):         Cmd_Eff_Lg
Maximum Data Transfer Size:         32 Pages
Warning  Comp. Temp. Threshold:     100 Celsius
Critical Comp. Temp. Threshold:     110 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     8.00W       -        -    0  0  0  0   230000   50000
 1 +     4.00W       -        -    1  1  1  1     4000   50000
 2 +     3.00W       -        -    2  2  2  2     4000  250000
 3 -     0.50W       -        -    3  3  3  3     4000    8000
 4 -   0.0090W       -        -    4  4  4  4     8000   30000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        46 Celsius
Available Spare:                    100%
Available Spare Threshold:          32%
Percentage Used:                    0%
Data Units Read:                    5,204,792 [2.66 TB]
Data Units Written:                 11,204,560 [5.73 TB]
Host Read Commands:                 29,829,295
Host Write Commands:                154,242,137
Controller Busy Time:               0
Power Cycles:                       10
Power On Hours:                     1,155
Unsafe Shutdowns:                   6
Media and Data Integrity Errors:    28
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 8 of 8 entries)
No Errors Logged

root@truenas[~]# smartctl -a /dev/nvme2
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       SPCC M.2 PCIe SSD
Serial Number:                      230155785150062
Firmware Version:                   VB421D65
PCI Vendor/Subsystem ID:            0x10ec
IEEE OUI Identifier:                0x00e04c
Controller ID:                      1
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          2,048,408,248,320 [2.04 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            500e04 c91e010019
Local Time is:                      Fri Aug 11 10:01:17 2023 EDT
Firmware Updates (0x02):            1 Slot
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0014):     DS_Mngmt Sav/Sel_Feat
Log Page Attributes (0x02):         Cmd_Eff_Lg
Maximum Data Transfer Size:         32 Pages
Warning  Comp. Temp. Threshold:     100 Celsius
Critical Comp. Temp. Threshold:     110 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     8.00W       -        -    0  0  0  0   230000   50000
 1 +     4.00W       -        -    1  1  1  1     4000   50000
 2 +     3.00W       -        -    2  2  2  2     4000  250000
 3 -     0.50W       -        -    3  3  3  3     4000    8000
 4 -   0.0090W       -        -    4  4  4  4     8000   30000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        42 Celsius
Available Spare:                    100%
Available Spare Threshold:          32%
Percentage Used:                    0%
Data Units Read:                    2,136,538 [1.09 TB]
Data Units Written:                 10,812,347 [5.53 TB]
Host Read Commands:                 14,599,941
Host Write Commands:                145,454,224
Controller Busy Time:               0
Power Cycles:                       10
Power On Hours:                     1,152
Unsafe Shutdowns:                   7
Media and Data Integrity Errors:    229
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 8 of 8 entries)
No Errors Logged
 
Joined
Jun 15, 2022
Messages
674
Can you get smartctl -a /dev/nvme0 to give output? (changing nvme0 to whatever is right for you)
I'd use -ax instead of -a...
 

Dopamin3

Dabbler
Joined
Aug 18, 2017
Messages
46
I'd use -ax instead of -a...
Here is the -ax output from both drives:

Code:
root@truenas[~]# smartctl -ax /dev/nvme1
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       SPCC M.2 PCIe SSD
Serial Number:                      230155785181003
Firmware Version:                   VB421D65
PCI Vendor/Subsystem ID:            0x10ec
IEEE OUI Identifier:                0x00e04c
Controller ID:                      1
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          2,048,408,248,320 [2.04 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            500e04 c81e070c18
Local Time is:                      Fri Aug 11 10:05:49 2023 EDT
Firmware Updates (0x02):            1 Slot
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0014):     DS_Mngmt Sav/Sel_Feat
Log Page Attributes (0x02):         Cmd_Eff_Lg
Maximum Data Transfer Size:         32 Pages
Warning  Comp. Temp. Threshold:     100 Celsius
Critical Comp. Temp. Threshold:     110 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     8.00W       -        -    0  0  0  0   230000   50000
 1 +     4.00W       -        -    1  1  1  1     4000   50000
 2 +     3.00W       -        -    2  2  2  2     4000  250000
 3 -     0.50W       -        -    3  3  3  3     4000    8000
 4 -   0.0090W       -        -    4  4  4  4     8000   30000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        47 Celsius
Available Spare:                    100%
Available Spare Threshold:          32%
Percentage Used:                    0%
Data Units Read:                    5,411,479 [2.77 TB]
Data Units Written:                 11,206,534 [5.73 TB]
Host Read Commands:                 30,937,136
Host Write Commands:                154,261,826
Controller Busy Time:               0
Power Cycles:                       10
Power On Hours:                     1,155
Unsafe Shutdowns:                   6
Media and Data Integrity Errors:    28
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 8 of 8 entries)
No Errors Logged

root@truenas[~]# smartctl -ax /dev/nvme2
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       SPCC M.2 PCIe SSD
Serial Number:                      230155785150062
Firmware Version:                   VB421D65
PCI Vendor/Subsystem ID:            0x10ec
IEEE OUI Identifier:                0x00e04c
Controller ID:                      1
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          2,048,408,248,320 [2.04 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            500e04 c91e010019
Local Time is:                      Fri Aug 11 10:05:53 2023 EDT
Firmware Updates (0x02):            1 Slot
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0014):     DS_Mngmt Sav/Sel_Feat
Log Page Attributes (0x02):         Cmd_Eff_Lg
Maximum Data Transfer Size:         32 Pages
Warning  Comp. Temp. Threshold:     100 Celsius
Critical Comp. Temp. Threshold:     110 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     8.00W       -        -    0  0  0  0   230000   50000
 1 +     4.00W       -        -    1  1  1  1     4000   50000
 2 +     3.00W       -        -    2  2  2  2     4000  250000
 3 -     0.50W       -        -    3  3  3  3     4000    8000
 4 -   0.0090W       -        -    4  4  4  4     8000   30000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        42 Celsius
Available Spare:                    100%
Available Spare Threshold:          32%
Percentage Used:                    0%
Data Units Read:                    2,136,538 [1.09 TB]
Data Units Written:                 10,812,347 [5.53 TB]
Host Read Commands:                 14,599,941
Host Write Commands:                145,454,224
Controller Busy Time:               0
Power Cycles:                       10
Power On Hours:                     1,152
Unsafe Shutdowns:                   7
Media and Data Integrity Errors:    229
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 8 of 8 entries)
No Errors Logged
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
I wasn't expecting to get a lot more than -a from an NVMe drive, as it seems you just confirmed.

Both drives look healthy and not too hot, so we should turn attention to other factors.

Is TrueNAS a VM?
 


Dopamin3

Dabbler
Joined
Aug 18, 2017
Messages
46
I wasn't expecting to get a lot more than -a from an NVMe drive, as it seems you just confirmed.

Both drives look healthy and not too hot, so we should turn attention to other factors.

Is TrueNAS a VM?
TrueNAS is installed directly on another drive, not as a VM.
Although it seems I'm blind...
I assume from the unsafe shutdown count.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
I assume from the unsafe shutdown count.
Doesn't seem related to me... I would expect the numbers to match (or at least be closer together) if that were the case.

Also, I know full well from experience with my NVMe drives that every system shutdown (even a clean one) is seen by the NVMe as an unsafe power cut.
 
Joined
Jun 15, 2022
Messages
674
You're using gaming NVMe, so that's the first sketchy part:
PCI Vendor/Subsystem ID: 0x10ec

Silicon Power has an SSD toolbox available for download on their website. You can use it to monitor your device's health and performance, and even secure-erase it. (review) I'm guessing the SSD itself will test fine.

NVMe #1:
Unsafe Shutdowns: 6
Media and Data Integrity Errors: 28

NVMe #2:
Unsafe Shutdowns: 7
Media and Data Integrity Errors: 229

I read what @sretalla posted and am not sure that's entirely accurate, as NVMe manufacturers generally do power off the systems their drives are installed in during testing and would adjust the reporting accordingly.

I would guess the major issue is powering off the NVMe before the data has been written, which is known to corrupt "random" memory locations different from what you're writing to, due to how NVMe memory cycling works; if that is accurate, all the data on the drives is "suspect." If I understand correctly (and I may not), a flush and a time delay are needed to give the NVMe enough time to write the data and do any internal cleanup before it's powered down. In theory, a simple way to test this is to back up whatever data is relevant (if possible), safely remove the drives from the TrueNAS configuration (but leave them in the computer), boot off USB (such as with SystemRescue), write some data and then do a quick shutdown -p, confirming the Unsafe Shutdown count increases; then compare against manually mounting, writing data, and manually unmounting before shutdown -h, waiting 10 seconds, and hitting the power switch if the system isn't already shut down (which hopefully it's not). (This isn't scientific, but it is a starting point.)
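
As a very rough sketch of that test (all device names are placeholders, and since SystemRescue is Linux you'd use nvme-cli rather than FreeBSD's nvmecontrol):

Code:
# placeholders throughout -- adjust /dev/nvme0 and /dev/nvme0n1p1 to your devices
# baseline: note the current unsafe-shutdown count
nvme smart-log /dev/nvme0 | grep -i unsafe_shutdowns

# "dirty" leg: write data, then cut power with no flush or unmount
mount /dev/nvme0n1p1 /mnt
dd if=/dev/urandom of=/mnt/testfile bs=1M count=512
poweroff -f                      # immediate power-off, skips the unmount

# (boot SystemRescue again, re-check the counter, then do the "clean" leg)
mount /dev/nvme0n1p1 /mnt
dd if=/dev/urandom of=/mnt/testfile bs=1M count=512
sync && umount /mnt
shutdown -P now                  # orderly shutdown; wait ~10 s before the switch

# after each leg, see whether Unsafe Shutdowns moved:
nvme smart-log /dev/nvme0 | grep -i unsafe_shutdowns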

I think the second possible contributing issue is AMD, who plays pretty fast and loose with specs, figuring if it works, it works and there you go, whereas Intel is more along the lines of, "No, you can't do that." So there may be an additional Ryzen risk involved when using gaming NVMe with AMD playing fast and loose. Note this is probably solvable with the delayed shutdown mentioned above, and if not then some other logical solution, as both companies make good products. (AMD: Specs? Eh, we can beat those... LOL. Yeah, doesn't always work in all cases, but that's on you.)

Note I'm not knocking any brand or vendor, only looking at what might be contributing to the issues you're facing and attempting to isolate the problems so they can be resolved.
 
Last edited:

Dopamin3

Dabbler
Joined
Aug 18, 2017
Messages
46
Doesn't seem related to me... I would expect the numbers to match (or at least be closer together) if that were the case.

Also, I know full well from experience with my NVMe drives that every system shutdown (even a clean one) is seen by the NVMe as an unsafe power cut.
At one point a few weeks ago, I got an alert that the pool was degraded because one drive just disappeared. I powered off the server, reseated the NVMe in its slot, and when I powered it back on it was detected again. It hasn't disappeared since. I think that might explain the large difference? Maybe even the whole issue?

You're using gaming NVMe, so that's the first sketchy part:
PCI Vendor/Subsystem ID: 0x10ec

Silicon Power has an SSD toolbox available for download on their website. You can use it to monitor your device's health and performance, and even secure-erase it. (review) I'm guessing the SSD itself will test fine.

NVMe #1:
Unsafe Shutdowns: 6
Media and Data Integrity Errors: 28

NVMe #2:
Unsafe Shutdowns: 7
Media and Data Integrity Errors: 229

I read what @sretalla posted and am not sure that's entirely accurate, as NVMe manufacturers generally do power off the systems their drives are installed in during testing and would adjust the reporting accordingly.

I would guess the major issue is powering off the NVMe before the data has been written, which is known to corrupt "random" memory locations different from what you're writing to, due to how NVMe memory cycling works; if that is accurate, all the data on the drives is "suspect." If I understand correctly (and I may not), a flush and a time delay are needed to give the NVMe enough time to write the data and do any internal cleanup before it's powered down. In theory, a simple way to test this is to back up whatever data is relevant (if possible), safely remove the drives from the TrueNAS configuration, boot off USB (such as with SystemRescue) and try shutdown -p, confirming the Unsafe Shutdown count increases; then compare against manually mounting, writing data, and manually unmounting before shutdown -h, waiting 5 seconds, and hitting the power switch if the system isn't already shut down (which hopefully it's not). (This isn't scientific, but it is a starting point.)

I think the second possible contributing issue is AMD, who plays pretty fast and loose with specs, figuring if it works, it works and there you go, whereas Intel is more along the lines of, "No, you can't do that." So there may be an additional Ryzen risk involved when using gaming NVMe with AMD playing fast and loose. (AMD: Specs? Eh, we can beat those... LOL. Yeah, doesn't always work in all cases, but that's on you.)

Note I'm not knocking any brand or vendor, only looking at what might be contributing to the issues you're facing and attempting to isolate the problems so they can be resolved.

I purchased them because they were inexpensive, had 5-year warranties, and carried a 1600 TBW rating. Ultimately I think they were probably a bad choice. I was struggling to find any enterprise-ish NVMe SSDs in an actual physical M.2 2280 form factor that could be used on the motherboard. Guess I'll assume everything on that pool is suspect, destroy it, and reconsider even using them. I appreciate all the advice and info, and I don't take it as knocking a brand or vendor. It is true about the whole fast-and-loose thing; like how the CPU and motherboard basically officially-unofficially support ECC, unlike an Intel Xeon and server board.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
At one point a few weeks ago, I got an alert that the pool was degraded because one drive just disappeared. I powered off the server, reseated the NVMe in its slot, and when I powered it back on it was detected again. It hasn't disappeared since. I think that might explain the large difference? Maybe even the whole issue?
I think we're onto something here...

I suspect it can't have been only one drive, though, as both show issues.
 
Joined
Jun 15, 2022
Messages
674
At one point a few weeks ago, I got an alert that the pool was degraded because one drive just disappeared. I powered off the server, reseated the NVMe in its slot, and when I powered it back on it was detected again. It hasn't disappeared since. I think that might explain the large difference? Maybe even the whole issue?

I purchased them because they were inexpensive, had 5-year warranties, and carried a 1600 TBW rating. Ultimately I think they were probably a bad choice. I was struggling to find any enterprise-ish NVMe SSDs in an actual physical M.2 2280 form factor that could be used on the motherboard. Guess I'll assume everything on that pool is suspect, destroy it, and reconsider even using them. I appreciate all the advice and info, and I don't take it as knocking a brand or vendor. It is true about the whole fast-and-loose thing; like how the CPU and motherboard basically officially-unofficially support ECC, unlike an Intel Xeon and server board.
I would first power down the system, unplug it, wait a few minutes, clean the contacts of the NVMe drives and the slots they plug into with a really high-quality contact cleaner, make sure all dust/lint is off the pins and don't touch them (skin oil is horrible), and plug them back in; then use the Silicon Power tools to fast-wipe them and start testing the shutdown situation.

AMD supports installing ECC memory, it just might not really do "the ECC thing." Or maybe just not do it properly (like they might implement some "stuff," kinda maybe--bro, don't worry, it's all good). LOL

Don't despair, you may well have a good setup that just requires some tweaking.

[attached meme image]

(just joking, hopefully it lightens the mood)
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
What is your system's PSU? How is power in your place (i.e., stable, frequent blackouts, etc.)? Are you using a UPS?

I'd zpool clear nvme and zpool scrub nvme.
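
That is (substitute your actual pool name if it's not nvme):

Code:
zpool clear nvme        # reset the error counters
zpool scrub nvme        # re-read and verify every block in the pool
zpool status -v nvme    # see whether new errors appear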
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
AMD supports installing ECC memory, it just might not really do "the ECC thing." Or maybe just not do it properly (like they might implement some "stuff," kinda maybe--bro, don't worry, it's all good). LOL
Their PRO lineup should officially support it, IIRC.
 