Unhealthy pool, read errors, device reset.

dewhite04

Cadet
Joined
Jan 24, 2018
Messages
7
Hello all,

I'm new to TrueNAS (SCALE), coming from ten-plus years of keeping my files on a Gentoo Linux mdadm RAID5 with ext4 over LUKS. I'm still feeling my way around, but I hit a little bump last night...

Before I unload the information I have so far, my basic system specs are in the signature below.

I received an email yesterday evening just after 8pm. For reference, I haven't physically touched the machine or changed any configuration or settings in several days.
Code:
New alerts:
•    Pool data state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.


I looked at the pool status:
Code:
root@TrueNAS[~]# zpool status -v
  pool: boot-pool
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 00:00:05 with 0 errors on Thu May 12 05:45:06 2022
config:

        NAME         STATE     READ WRITE CKSUM
        boot-pool    ONLINE       0     0     0
          nvme0n1p3  ONLINE       0     0     0

errors: No known data errors

  pool: data
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 06:36:11 with 0 errors on Thu May 12 00:06:57 2022
config:

        NAME                                      STATE     READ WRITE CKSUM
        data                                      ONLINE       0     0     0
          raidz2-0                                ONLINE       0     0     0
            634627f6-3f8c-4030-aae9-df54d9a7e079  ONLINE       0     0     0
            53276fbb-7c18-41b4-a0cb-fec9ecc6d6ff  ONLINE       0     0     0
            44141a7c-2183-4b54-80d6-5f03a4fd73fc  ONLINE       0     0     0
            f5e73d43-423c-4c61-926a-ef831547a4e3  ONLINE       0     0     0
            00b03b46-fb24-4cd5-bb1d-75f1d411a781  ONLINE       0     0     0
            7639837d-962a-469f-b170-4ce4a146ee81  ONLINE       0     0     0
            b76dc0ce-9f34-491c-8118-b3fbd55080bf  ONLINE      11     0     0
            74ee8d93-5fd7-43ab-97dc-24ef7b092635  ONLINE       0     0     0

errors: No known data errors


I see that there are 11 read errors on the disk whose partition gpt-id ends in 80bf, so I looked at the list of disks and determined that it was /dev/sdd. Checking the output of smartctl -a /dev/sdd | less showed no Reallocated, Current_Pending, or Offline_Uncorrectable sectors, etc., so I ran a smartctl -t long test overnight and found the same results this morning.
Code:
root@TrueNAS[~]# smartctl -a /dev/sdd
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.93+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     TOSHIBA MD04ACA500
Serial Number:    35DGK54JFS9A
LU WWN Device Id: 5 000039 62be8146c
Firmware Version: FP2A
User Capacity:    5,000,981,078,016 bytes [5.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed May 18 10:15:09 2022 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  120) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 578) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0027   100   100   001    Pre-fail  Always       -       505
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       19509
  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   094   050    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   050    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   001   001   000    Old_age   Always       -       58776
 10 Spin_Retry_Count        0x0033   253   100   030    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       185
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       78
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       146
193 Load_Cycle_Count        0x0032   096   096   000    Old_age   Always       -       40038
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       51 (Min/Max 6/57)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   253   000    Old_age   Always       -       0
220 Disk_Shift              0x0002   100   100   000    Old_age   Always       -       0
222 Loaded_Hours            0x0032   012   012   000    Old_age   Always       -       35348
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
224 Load_Friction           0x0022   100   100   000    Old_age   Always       -       0
226 Load-in_Time            0x0026   100   100   000    Old_age   Always       -       209
240 Head_Flying_Hours       0x0001   100   100   001    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     58774         -
# 2  Short offline       Completed without error       00%     58762         -
# 3  Short offline       Completed without error       00%     58584         -
# 4  Extended offline    Completed without error       00%     58572         -
# 5  Extended offline    Aborted by host               70%     27109         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
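
As an aside, for anyone mapping a partition GUID from zpool status back to a device node, the by-partuuid symlinks are one way to do it (a sketch; the GUID below is the one from my pool):

```shell
# Resolve a ZFS member's partuuid to its /dev/sdX node (sketch)
ls -l /dev/disk/by-partuuid/ | grep b76dc0ce
# Equivalent view with lsblk:
lsblk -o NAME,PARTUUID | grep -i b76dc0ce
```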


I went back to the system logs and looked for events around the time of the email Alert and found a pair of closely spaced resets, just on this disk, around the time of the event.
Code:
May 14 00:00:17 TrueNAS syslog-ng[20741]: Configuration reload finished;
May 17 20:13:21 TrueNAS kernel: mpt2sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
May 17 20:13:21 TrueNAS kernel: mpt2sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
May 17 20:13:21 TrueNAS kernel: mpt2sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
May 17 20:13:21 TrueNAS kernel: sd 0:0:13:0: [sdd] tag#2407 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=2s
May 17 20:13:21 TrueNAS kernel: sd 0:0:13:0: [sdd] tag#2407 CDB: Read(16) 88 00 00 00 00 00 5c 38 8c d0 00 00 00 28 00 00
May 17 20:13:21 TrueNAS kernel: zio pool=data vdev=/dev/disk/by-partuuid/b76dc0ce-9f34-491c-8118-b3fbd55080bf error=5 type=1 offset=790023938048 size=20480 flags=180980
May 17 20:13:21 TrueNAS kernel: sd 0:0:13:0: [sdd] tag#2411 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=2s
May 17 20:13:21 TrueNAS kernel: sd 0:0:13:0: [sdd] tag#2411 CDB: Read(16) 88 00 00 00 00 00 5c 38 9d 38 00 00 00 88 00 00
May 17 20:13:21 TrueNAS kernel: zio pool=data vdev=/dev/disk/by-partuuid/b76dc0ce-9f34-491c-8118-b3fbd55080bf error=5 type=1 offset=790026088448 size=69632 flags=40080c80
May 17 20:13:21 TrueNAS kernel: sd 0:0:13:0: [sdd] tag#2404 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=2s
May 17 20:13:21 TrueNAS kernel: sd 0:0:13:0: [sdd] tag#2404 CDB: Read(16) 88 00 00 00 00 00 5c 38 9c b0 00 00 00 58 00 00
May 17 20:13:21 TrueNAS kernel: zio pool=data vdev=/dev/disk/by-partuuid/b76dc0ce-9f34-491c-8118-b3fbd55080bf error=5 type=1 offset=790026018816 size=45056 flags=40080c80
May 17 20:13:21 TrueNAS kernel: sd 0:0:13:0: Power-on or device reset occurred
May 17 20:13:22 TrueNAS kernel: sd 0:0:13:0: Power-on or device reset occurred
May 17 20:23:58 TrueNAS kernel: sd 0:0:13:0: Power-on or device reset occurred
May 17 20:23:58 TrueNAS kernel: sd 0:0:13:0: Power-on or device reset occurred
May 17 20:24:02 TrueNAS kernel: mpt2sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
May 17 20:24:02 TrueNAS kernel: mpt2sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
May 17 20:24:02 TrueNAS kernel: mpt2sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
May 17 20:24:02 TrueNAS kernel: mpt2sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
May 17 20:24:02 TrueNAS kernel: sd 0:0:13:0: [sdd] tag#3043 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=2s
May 17 20:24:02 TrueNAS kernel: sd 0:0:13:0: [sdd] tag#3043 CDB: Read(16) 88 00 00 00 00 02 46 30 cd a0 00 00 00 08 00 00
May 17 20:24:02 TrueNAS kernel: sd 0:0:13:0: [sdd] tag#3008 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=2s
May 17 20:24:02 TrueNAS kernel: sd 0:0:13:0: [sdd] tag#3008 CDB: Read(16) 88 00 00 00 00 00 5d f9 47 d8 00 00 00 28 00 00
May 17 20:24:02 TrueNAS kernel: zio pool=data vdev=/dev/disk/by-partuuid/b76dc0ce-9f34-491c-8118-b3fbd55080bf error=5 type=1 offset=805080838144 size=20480 flags=180980
May 17 20:24:02 TrueNAS kernel: sd 0:0:13:0: [sdd] tag#3017 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=3s
May 17 20:24:02 TrueNAS kernel: sd 0:0:13:0: [sdd] tag#3017 CDB: Read(16) 88 00 00 00 00 00 5d f9 56 18 00 00 00 28 00 00
May 17 20:24:02 TrueNAS kernel: zio pool=data vdev=/dev/disk/by-partuuid/b76dc0ce-9f34-491c-8118-b3fbd55080bf error=5 type=1 offset=805082705920 size=20480 flags=180980
May 17 20:24:02 TrueNAS kernel: sd 0:0:13:0: [sdd] tag#3011 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=3s
May 17 20:24:02 TrueNAS kernel: sd 0:0:13:0: [sdd] tag#3011 CDB: Read(16) 88 00 00 00 00 00 5d f9 48 08 00 00 00 28 00 00
May 17 20:24:02 TrueNAS kernel: zio pool=data vdev=/dev/disk/by-partuuid/b76dc0ce-9f34-491c-8118-b3fbd55080bf error=5 type=1 offset=805080862720 size=20480 flags=180980
May 17 20:24:02 TrueNAS kernel: sd 0:0:13:0: Power-on or device reset occurred
May 17 20:24:02 TrueNAS kernel: sd 0:0:13:0: Power-on or device reset occurred
May 18 06:26:40 TrueNAS kernel: loop: module loaded


The first and last lines quoted above are hours or days removed from the event, so this does not appear to be an ongoing problem.
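For reference, a filter along these lines is how I pulled the entries above (a sketch; journalctl is available on SCALE's Debian base, and dmesg | grep works too):

```shell
# Pull reset/error lines for this disk around the alert window (sketch)
journalctl -k --since "2022-05-17 20:00" --until "2022-05-17 20:30" \
    | grep -E 'sdd|mpt2sas|reset'
```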

I can't make any sense of these messages from a hardware perspective... The drive in question is in one of two hot-swap style drive cages holding the 8x SATA drives of this RaidZ2 pool (a mix of 3T, 5T, and 6T; I'm planning to migrate to all 5T and 6T in the next few weeks), and they are all connected to an IBM ServeRAID M1015 (LSI 9220-8i flashed to IT mode) through a pair of SFF-8087-to-SATA cables. The drive obviously has some hours on it, but I beat the hell out of it before using it to replace a 2T (a full 4-pass run of badblocks, followed by a long SMART test).

I don't really believe it's a genuine power delivery problem, because the other two drives in that cage should have been affected as well (all drives in the cage share a common supply). It's hard to believe it's a faulty cable, because I've been using these cables for a few years without problems. Also, I stood this server up on April 28th and haven't had an issue like this since the initial boot; in fact, I haven't yet restarted the NAS since then.

All the data I care about is backed up in a couple of other places, so I don't have a gun to my head. My thinking right now is that I should issue a zpool clear and see what happens next. Conceptually, is there any value in running a scrub before doing that? Is there any additional troubleshooting, or interpretation of the information already available, that I should be considering?
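Concretely, if I understand the man pages correctly, the plan would look something like this (a sketch, not a recommendation):

```shell
# Re-read and re-verify every allocated block first, then reset the counters
zpool scrub data
zpool status data      # wait until it reports "scrub repaired ... 0 errors"
zpool clear data       # zero the READ/WRITE/CKSUM counters
zpool status -v data   # confirm the counters are back to 0
```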

Any thoughts, questions, advice, references, rants, etc will be deeply appreciated.
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
I'd give it a "zpool clear" just out of curiosity, to see how long it stays in the pool. I have had cable issues with Toshiba drives, so that is definitely worth looking into. But I want to note two items in your SMART data:

1. 58776 power-on hours. That's almost seven years of power-on time; it has to have seen its first spin-up in 2015, if not earlier. Yes, a chunk of that appears to be spun down/idle, but the load cycle count is a factor too. Each cycle wears the actuator, etc...
2. 51°C... Too hot. If you have good airflow, that temperature would be indicative of bearings nearing end of life / failure. If you have marginal or bad airflow, it's a temperature that will shorten the drive's life.

You may get it to rejoin the pool, and it might even stay there for weeks or months... But it's time to have a plan to replace it. Acquire a spare, do a burn-in test on it, etc...
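For the record, the kind of burn-in I mean looks something like this (a sketch; /dev/sdX is a placeholder for the spare, and badblocks -w is DESTRUCTIVE):

```shell
# Destructive burn-in for a new or used spare -- wipes the disk!
smartctl -t short /dev/sdX       # quick sanity check first
badblocks -b 4096 -ws /dev/sdX   # 4 write patterns, each verified by a read pass
smartctl -t long /dev/sdX        # finish with a long SMART self-test
smartctl -a /dev/sdX             # review attributes and the self-test log
# Rough time budget: at ~150 MB/s, one sweep of a 6 TB disk is about
# 6e12 / 150e6 = 40000 s (~11 h); 4 write + 4 read sweeps means days.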
 

dewhite04

Cadet
Joined
Jan 24, 2018
Messages
7
Thank you for the reply and your thoughts!

1. 58776 power-on hours. That's almost seven years of power-on time; it has to have seen its first spin-up in 2015, if not earlier. Yes, a chunk of that appears to be spun down/idle, but the load cycle count is a factor too. Each cycle wears the actuator, etc...
This is a SOHO environment, and I keep daily incremental backups on two other separate appliances, one local, one remote. So, a little downtime here or there isn't too serious for me, and I weigh that against the cost of brand-new first-class equipment. All the same, I've ordered a couple of 6T EXOS drives with 10-15k hours on them from a reliable eBay seller I've recently bought the same type of disks from. I'll have them tested and ready to swap in as needed.

2. 51°C... Too hot. If you have good airflow, that temperature would be indicative of bearings nearing end of life / failure. If you have marginal or bad airflow, it's a temperature that will shorten the drive's life.
Agreed - that was one of the first things I noticed. This drive is in a 3-bay cage with one identical 5T and a 3T Toshiba. I popped the lid on the chassis and noticed that the fan speed switch on this cage had been bumped to the "low" setting. After flipping that switch, this drive has stabilized at 46°C; however, its adjacent twin is at 42°C, which lends some weight to your concern about a pre-failure hardware condition. The old fan in that cage is looking a little tired too. Maybe I'll have time this weekend to remove the back cover and graft a new 80mm fan in there to improve airflow a little bit...
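To keep an eye on temps going forward, I may just spot-check from the shell (a sketch; the awk field number assumes the attribute layout shown in my smartctl output above):

```shell
# Print the current temperature of every SATA disk (sketch; adjust the glob)
for d in /dev/sd?; do
    printf '%s: ' "$d"
    smartctl -A "$d" | awk '/Temperature_Celsius/ {print $10}'
done
```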

You may get it to rejoin the pool, and it might even stay there for weeks or months... But it's time to have a plan to replace it. Acquire a spare, do a burn-in test on it, etc...
Interestingly, the drive never left the pool: it continues to show ONLINE status, and the pool describes itself as "unhealthy" but not as "degraded." I've been keeping an eye on the syslog and the pool status. No additional disconnect/reconnect cycles for this or any other drive, and no other errors related to the mpt2sas driver. Also, the read error count stands at 11, and only for this disk.

I guess I'll go ahead and issue the zpool clear and watch for the next hiccup. I don't mind discarding the aging 5T, but I hate to throw it in the trash over what amounts, in my mind, to superstition. I'd rather be able to tell myself, "That drive failed, and this is how I know it to be true." Otherwise, I fear I could end up ignoring a legitimate problem with the power supply, power delivery, connectivity to the HBA, or a software or hardware problem with the HBA itself.

Thanks again for your thoughts!
 

ok1718

Cadet
Joined
May 25, 2021
Messages
6
I had the very same error a few days ago and just issued a "zpool clear":
Code:
  pool: jupiter
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 04:02:09 with 0 errors on Sun May 15 04:02:11 2022
config:

    NAME                                      STATE     READ WRITE CKSUM
    jupiter                                   ONLINE       0     0     0
      raidz1-0                                ONLINE       0     0     0
        b9e47752-12e2-4408-84d7-e3f0d359ac89  ONLINE       0     0     0
        0960a0de-21a7-4fab-aaa7-2645c908817d  ONLINE       0     0     0
        19d3f43b-47cb-4aa2-b2e2-026c9b4178f0  ONLINE       3     3     0
        053fd2dd-54fb-4df4-a5e1-d78468290238  ONLINE       0     0     0

errors: No known data errors

But my drive doesn't seem to be anywhere near EOL:
Code:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   222   222   021    Pre-fail  Always       -       3891
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       16
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       6956
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       15
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       5
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       23
194 Temperature_Celsius     0x0022   116   105   000    Old_age   Always       -       34
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      6954         -
# 2  Short offline       Completed without error       00%      6947         -
# 3  Short offline       Completed without error       00%      6939         -
# 4  Short offline       Completed without error       00%      6915         -
# 5  Short offline       Completed without error       00%      6891         -
# 6  Short offline       Completed without error       00%      6867         -
# 7  Short offline       Completed without error       00%      6843         -
# 8  Short offline       Completed without error       00%      6819         -
# 9  Short offline       Completed without error       00%      6795         -
#10  Short offline       Completed without error       00%      6771         -
#11  Short offline       Completed without error       00%      6747         -
#12  Short offline       Completed without error       00%      6723         -
#13  Short offline       Completed without error       00%      6700         -
#14  Short offline       Completed without error       00%      6675         -
#15  Short offline       Completed without error       00%      6651         -
#16  Short offline       Completed without error       00%      6627         -
#17  Short offline       Completed without error       00%      6603         -
#18  Short offline       Completed without error       00%      6579         -
#19  Short offline       Completed without error       00%      6555         -
#20  Short offline       Completed without error       00%      6531         -
#21  Short offline       Completed without error       00%      6507         -


If you issue a "zpool clear", could you report back here whether the error comes back, what you did, etc.? @dewhite04
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
I see that there are 11 read errors on the disk whose partition gpt-id ends in 80bf, so I looked at the list of disks and determined that it was /dev/sdd.
You should have been able to do this directly in the GUI using the pool status page--at least you can in CORE; it's hard to believe they'd have removed that feature in SCALE. But a few other points that might be relevant:
  • As already mentioned, this is a very old drive, and regardless of the outcome of this particular issue, it's probably time to give serious thought to replacing it, or at least having a known-good, burned-in and tested spare on hand.
  • I'll also double-tap the issue of temps; 40°C or less is recommended for best service life.
  • You aren't running regular SMART self-tests. It's unlikely they would have contributed much to this situation, but it's generally considered good practice to schedule them regularly--short tests somewhere between daily and weekly, long tests between weekly and monthly.
  • Just as a general matter--people seem to assume that there's going to be a direct correlation between SMART errors and ZFS pool errors, but that really isn't the case. For probably the simplest example, if you have a bad block on the disk, but there's no data there, ZFS doesn't know anything about it. In your case, something caused the disk to reset itself, several times, over the course of about 10 minutes. Whatever the cause is (and whether it's internal to the disk itself or not), that isn't likely to show up in SMART testing, but it obviously has the potential to cause data errors.
As to the way ahead, I'd probably agree with running zpool clear and seeing what happens. If the problem recurs, you can address it further at that time. But get a spare disk ready.
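
On the self-test point: the recurring schedule is best set up in the GUI, but the one-off equivalents from the shell look like this (a sketch; the drive's own estimates above say roughly 2 minutes for a short test and 578 minutes for a long one):

```shell
# Kick off self-tests manually; they run in the background on the drive
smartctl -t short /dev/sdd
smartctl -t long /dev/sdd
# Later, check the results log:
smartctl -l selftest /dev/sdd | grep -c 'Completed without error'
```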
 

dewhite04

Cadet
Joined
Jan 24, 2018
Messages
7
Hello all,

I was just posting another question in the forum earlier today and thought I should stop back by here to say that, after issuing the zpool clear, I have not had any further issues with this pool or the disk described above. It continues to chug right along. I did eventually phase out all the older 3TB disks for newer 6TB EXOS drives, so my overall capacity, with a mix of 5T and 6T drives, is now 25.72TiB.

Code:
root@TrueNAS[~]# zpool status
  pool: data
 state: ONLINE
  scan: scrub repaired 0B in 12:35:30 with 0 errors on Sun Aug  7 12:35:33 2022
config:

        NAME                                      STATE     READ WRITE CKSUM
        data                                      ONLINE       0     0     0
          raidz2-0                                ONLINE       0     0     0
            634627f6-3f8c-4030-aae9-df54d9a7e079  ONLINE       0     0     0
            8dfccc34-e7dd-4faa-a396-ab7427e00fe9  ONLINE       0     0     0
            44141a7c-2183-4b54-80d6-5f03a4fd73fc  ONLINE       0     0     0
            7619622b-7fe0-4b24-ae0f-bcc5bca8bee9  ONLINE       0     0     0
            00b03b46-fb24-4cd5-bb1d-75f1d411a781  ONLINE       0     0     0
            7639837d-962a-469f-b170-4ce4a146ee81  ONLINE       0     0     0
            b76dc0ce-9f34-491c-8118-b3fbd55080bf  ONLINE       0     0     0
            fb4df3a3-b4ac-4d1c-a6bc-8e17dab59c98  ONLINE       0     0     0

errors: No known data errors


I also acquired and burned in an extra 6T EXOS disk, just in case this or any other drive starts throwing errors in the future. Thanks, all, for your thoughts!
 