SOLVED HELP: Pool nas state is DEGRADED: One or more devices are faulted in response to persistent errors...

itsjustfrank · Aug 30, 2023

Hi there, I received a CRITICAL alert:

Pool nas state is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.
The following devices are not healthy:

Disk WDC WD60EFZX-68B3FN0 [...] is FAULTED

This is my first error of this kind, so I would greatly appreciate a bit of guidance so I can proceed with confidence. The drive in question is a WD Red Plus 6TB NAS internal hard drive (5640 RPM, SATA 6 Gb/s, CMR, 128 MB cache, 3.5", model WD60EFZX) and is only about a year old.

Very much a noob question but is the above error indicative that this drive is failing or failed and must be replaced?

Here are the results of smartctl which I have seen requested in other similar posts:

Code:

root@truenas[~]# smartctl -a /dev/ada4
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p14 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD60EFZX-68B3FN0
Serial Number:    [removed this]
LU WWN Device Id: 5 0014ee 2bfac7669
Firmware Version: 81.00A81
User Capacity:    6,001,175,126,016 bytes [6.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5640 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Wed Aug 30 16:39:02 2023 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (62760) seconds.
Offline data collection
capabilities:                    (0x11) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 664) minutes.
SCT capabilities:              (0x303d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   191   188   021    Pre-fail  Always       -       7408
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       16
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4122
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       16
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       1
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       14
194 Temperature_Celsius     0x0022   111   106   000    Old_age   Always       -       41
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      4122         -

Selective Self-tests/Logging not supported

And here are the zpool status results:

Code:

root@truenas[~]# zpool status
  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:21 with 0 errors on Wed Aug 30 03:45:21 2023
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          ada0p2    ONLINE       0     0     0

errors: No known data errors

  pool: nas
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: resilvered 24.7G in 01:01:22 with 0 errors on Tue Aug 29 21:26:53 2023
config:

        NAME                                            STATE     READ WRITE CKSUM
        nas                                             DEGRADED     0     0     0
          raidz1-0                                      DEGRADED     0     0     0
            gptid/28c8eadf-1f82-11ed-b273-4487fccacf7e  ONLINE       0     0     0
            gptid/28e0418c-1f82-11ed-b273-4487fccacf7e  ONLINE       0     0     0
            gptid/291f2a42-1f82-11ed-b273-4487fccacf7e  FAULTED      3 3.88K     2  too many errors
            gptid/2914c24d-1f82-11ed-b273-4487fccacf7e  ONLINE       0     0     0
            gptid/28ebeabc-1f82-11ed-b273-4487fccacf7e  ONLINE       0     0     0

errors: No known data errors

I am under the impression that this drive needs to be replaced, but as I am relatively new to all of this I would like confirmation on how to proceed, or to hear whether there are other troubleshooting steps I should try first.

Thank you all so much for your help!

WI_Hedgehog · Aug 30, 2023

As a starting point, have you looked at the Resources at the very top of the page?

itsjustfrank · Aug 30, 2023

WI_Hedgehog said:
As a starting point, have you looked at the Resources at the very top of the page?

Hi, thanks for your reply. I've looked through a few forum posts that discuss a problem similar to mine.

WI_Hedgehog · Aug 30, 2023

Several posts on this have come up in the last 3 days. Basically, it's wise to run a S.M.A.R.T. report to see if it's a cable issue.

If it is the drive and you have an open port, you can likely add a replacement drive to the system and have TrueNAS replace the failed drive, then detach and physically remove the failed drive. If you don't have an open port, then you detach the bad drive, attach the replacement, and resilver, I believe.

itsjustfrank · Aug 30, 2023

WI_Hedgehog said:
Several posts on this have come up in the last 3 days. Basically, it's wise to run a S.M.A.R.T. report to see if it's a cable issue.

If it is the drive and you have an open port, you can likely add a replacement drive to the system and have TrueNAS replace the failed drive, then detach and physically remove the failed drive. If you don't have an open port, then you detach the bad drive, attach the replacement, and resilver, I believe.

Hi there, I just ran a short SMART test and am in the process of running a long test. In my first post I provided the smartctl -a info for the drive in question. Does any of that info point to a cable issue? Thanks

NickF · Aug 30, 2023

It's either a bad drive or a bad cable. Start by replacing the cable, then consider buying a new drive.
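
One quick check when weighing cable against drive is SMART attribute 199 (UDMA_CRC_Error_Count), since a non-zero or rising raw value there is commonly read as a sign of link or cable trouble. The snippet below is only a sketch: `crc_errors` is a made-up helper name, and `/dev/ada4` is assumed from the first post. A raw value of 0, as in the output posted above, does not rule a cable out.

```shell
# Hypothetical helper: read `smartctl -A` output on stdin and print the
# raw value (last column) of attribute 199, UDMA_CRC_Error_Count.
crc_errors() {
    awk '$1 == "199" { print $NF }'
}

# On the NAS this would be used as (device name assumed from the first post):
#   smartctl -A /dev/ada4 | crc_errors
```

Comparing the number before and after a cable swap is more telling than a single reading.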

itsjustfrank · Aug 30, 2023

NickF said:
It's either a bad drive or a bad cable. Start by replacing the cable, then consider buying a new drive.

Thank you so much for the reply! I'll try swapping the cable asap.

sretalla · Aug 31, 2023

I would point out that your SMART test is showing that you've only run one short test (and never a long test).

I don't know if the lack of errors in that test is a valid check.

I would be running smartctl -t long on that disk and waiting for the results before worrying about the 2 checksum errors (compared to the almost 4K write errors reported by ZFS).
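
For anyone following along, a rough sketch of that step is below. It assumes the disk is `/dev/ada4` as in the first post, and `selftest_entries` is just an illustrative helper name. The long test runs in the background on the drive itself; the smartctl output above estimates about 664 minutes for this model.

```shell
# Hypothetical helper: read `smartctl -l selftest` (or `smartctl -a`) output
# on stdin and print only the self-test log entries (lines like "# 1  ...").
selftest_entries() {
    grep -E '^# *[0-9]+ '
}

# On the NAS (device name assumed from the first post):
#   smartctl -t long /dev/ada4                          # start the extended test
#   smartctl -l selftest /dev/ada4 | selftest_entries   # read the results afterwards
```

An entry reading "Extended offline ... Completed without error" is what you are hoping to see once the test finishes.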

itsjustfrank · Aug 31, 2023

sretalla said:
I would point out that your SMART test is showing that you've only run one short test (and never a long test).

I don't know if the lack of errors in that test is a valid check.

I would be running smartctl -t long on that disk and waiting for the results before worrying about the 2 checksum errors (compared to the almost 4K write errors reported by ZFS).

Thank you for your reply! I started a long smart test last night which just recently finished up. This one also came back without any errors. Does this potentially mean that the drive isn't the problem? I also ran a scrub of the pool overnight which came back with no errors. Thanks for your time and assistance!

LONG TEST RESULTS
smartctl -a /dev/ada4

Code:

smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p14 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD60EFZX-68B3FN0
Serial Number:    WD-C82LZAJK
LU WWN Device Id: 5 0014ee 2bfac7669
Firmware Version: 81.00A81
User Capacity:    6,001,175,126,016 bytes [6.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5640 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Thu Aug 31 05:59:11 2023 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (62760) seconds.
Offline data collection
capabilities:                    (0x11) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 664) minutes.
SCT capabilities:              (0x303d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   191   188   021    Pre-fail  Always       -       7408
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       16
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4135
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       16
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       1
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       14
194 Temperature_Celsius     0x0022   111   103   000    Old_age   Always       -       41
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      4133         -
# 2  Short offline       Completed without error       00%      4122         -

Selective Self-tests/Logging not supported

Note:
I made a really dumb error when I was first setting up my system and didn't add this drive to my regularly scheduled SMART tests, hence the lack of prior SMART data.

sretalla · Aug 31, 2023

Indeed the drive looks fine and I would trust that result much more now there's a long test behind it.

Back to the cabling/port story then.

itsjustfrank · Aug 31, 2023

sretalla said:
Indeed the drive looks fine and I would trust that result much more now there's a long test behind it.

Back to the cabling/port story then.

Okay perfect, thank you! I will look into the cable and port.

As far as troubleshooting this is concerned, I first intend to swap the data cable running to this drive and later attach the drive to another port.
Is there a way I can confirm or check that one of these has solved the problem?

sretalla · Aug 31, 2023

itsjustfrank said:
Is there a way I can confirm or check that one of these has solved the problem?

A scrub on your pool will pretty quickly show the problem up, so I would say that's the way to test.
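
As a sketch of what that check could look like from the shell (pool name `nas` taken from the zpool status output earlier in the thread; `unhealthy_vdevs` is a hypothetical helper, not a ZFS command): clear the old counters, scrub, then look for any device that is not ONLINE or has non-zero READ/WRITE/CKSUM counts.

```shell
# Hypothetical helper: read `zpool status` output on stdin and print every
# pool/vdev/device row that is not ONLINE or has a non-zero error counter.
unhealthy_vdevs() {
    awk '$2 ~ /^(ONLINE|DEGRADED|FAULTED|UNAVAIL|OFFLINE|REMOVED)$/ && NF >= 5 &&
         ($2 != "ONLINE" || $3 != "0" || $4 != "0" || $5 != "0") {
             print $1, $2, $3, $4, $5
         }'
}

# On the NAS (pool name assumed to be "nas"):
#   zpool clear nas                      # reset the old error counters
#   zpool scrub nas                      # re-read and verify all data
#   zpool status nas | unhealthy_vdevs   # no output means nothing is flagged
```

If the counters stay at zero after a full scrub, the swap most likely fixed it; if they climb again, move on to the port or the drive.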

Davvo · Aug 31, 2023

It's good practice to give your system's specs when asking for help; more bluntly: please post your system's specs.

WI_Hedgehog · Aug 31, 2023

@itsjustfrank: I agree with @sretalla. The drive is reporting that it is fine, so it could be cabling, the controller, or maybe even RAM issues, which is why @Davvo is asking for system specs (we want to help you find the problem).

Technically, it could even be a cheap fan going bad and creating a lot of noise that's being picked up by the drive (and I'm not suggesting that's your issue). We simply love this stuff and are willing to help if you want some assistance.

itsjustfrank · Sep 1, 2023

@WI_Hedgehog @Davvo Apologies, I was in a bit of a panic when I made my initial post and thus didn't think to attach my system specs.
The machine itself is quite old and by no means an optimal setup; I just reconfigured an old computer I had on hand and aim to upgrade to a better system later this year. Having said that, below are the specs:

Specs:
TrueNAS-12.0-U8.1 Core
Motherboard make and model: Acer Aspire M3910
Intel H57 Chipset
CPU make and model: Intel i7 870 2.93GHz
RAM quantity: 12GB @ 1333MT/s 2x4GB 2x2GB
Hard drives, quantity, model numbers, and RAID configuration, Including boot drives:
6 drives total
3x 6TB WD Red Plus 5640RPM 6Gb/s CMR 128MB Cache WD60EFZX
2x 6TB Toshiba N300 7200RPM 6Gb/s 256MB Cache HDWG160XZSTA
1x 128GB Silicon Power SSD Boot drive SU128GBSS3A58A25CA
RAIDZ-1 config
Network controller: Realtek RTL8111E

@sretalla I believe you were correct in your suggestion that it was a cabling issue. I swapped the SATA data cable running to the drive with a new one and have since not received any errors. While I can't say with complete certainty, there was a fair bit of strain on the previous cable due to how I had it set up in my case. I am following your advice and now running a scrub of the pool to confirm that everything is in working order.

Thank you all for your help and guidance. I'm incredibly grateful for everyone's willingness to assist me with this issue as well as provide suggestions for best practices moving forward.
All the best,

Davvo · Sep 1, 2023

As a side note, Realtek NICs are known to be problematic. Also, it's not a RAID config but a RAIDZ config... it's important because RAID and ZFS make a mess together.
