zfs scrub stalls at ~50%

Brian Sturk

Dabbler
Joined
Apr 12, 2015
Messages
15
A scrub was kicked off early this morning (I assume by a cron job or similar, since I did not initiate it). It gets about 51% of the way through and then stalls. I rebooted to see whether having less I/O going on would help, and I get the same behavior. Here is the output of zpool status:

root@freenas:~ # zpool status -v zfs_root
  pool: zfs_root
 state: ONLINE
  scan: scrub in progress since Sun May 24 00:00:14 2020
        4.24T scanned at 410M/s, 3.63T issued at 252M/s, 7.02T total
        0 repaired, 51.72% done, 0 days 03:54:42 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        zfs_root                                        ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/ec7455d8-6dd8-11e4-9b22-d050992934fa  ONLINE       0     0     0
            gptid/ed2f8e60-6dd8-11e4-9b22-d050992934fa  ONLINE       0     0     0
            gptid/ed8369f5-6dd8-11e4-9b22-d050992934fa  ONLINE       0     0     0
            gptid/ede0a9cb-6dd8-11e4-9b22-d050992934fa  ONLINE       0     0     0

errors: No known data errors


I've let it sit for a while and nothing changes other than the estimated time. Read I/O seems to work somewhat while in this condition, but I experienced lockups over SMB and NFS when doing writes/deletes. I also don't see anything in dmesg. I'm running 11.1-STABLE:

FreeBSD 11.1-STABLE (FreeNAS.amd64) #0 r321665+d4625dcee3e(freenas/11.1-stable): Wed Dec 13 16:33:42 UTC 2017

Just wondering what else I can check or do.
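One way to tell a true stall from a slow scrub is to sample zpool status twice and compare the "issued" figure, which the percentage appears to track (3.63T of 7.02T is roughly 51.7%). The following is a minimal sketch, assuming the status line format shown above; the function names scrub_issued, scrub_pct and check_stall are made up for illustration.

```shell
#!/bin/sh
# Print the "issued" amount (e.g. 3.63T) from zpool status text on stdin.
scrub_issued() {
    sed -n 's/.* scanned at .*, \([0-9.]*[KMGTP]\) issued at .*/\1/p'
}

# Print the percentage done (e.g. 51.72) from zpool status text on stdin.
scrub_pct() {
    sed -n 's/.* repaired, \([0-9.]*\)% done.*/\1/p'
}

# Sample the pool twice, INTERVAL seconds apart (default 300),
# and report whether the issued figure moved.
check_stall() {
    pool=$1
    interval=${2:-300}
    first=$(zpool status "$pool" | scrub_issued)
    sleep "$interval"
    second=$(zpool status "$pool" | scrub_issued)
    if [ "$first" = "$second" ]; then
        echo "stalled at $second"
    else
        echo "progressing: $first -> $second"
    fi
}
```

Usage would be something like check_stall zfs_root 300. The displayed figure is rounded to two decimals, so a very short interval can report a stall on a scrub that is merely slow.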
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Have a look at dmesg; maybe there's something going on with the hardware.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
How big is your pool? Scrubs can take days depending on how big the pool is.

It's also possible you have a failing disk or corrupt pool.
 

Brian Sturk

Dabbler
Joined
Apr 12, 2015
Messages
15
Thanks for the replies!

dmesg shows absolutely nothing. It is very repeatable and stops scrubbing at about the same point every time. All I/O locks up once the scrub hits about 50%, not just writes.

If I pause the scrub in progress I can do I/O, but this concerns me: I don't want to continue making modifications to a file system that may have issues.

It's possible one of the disks is failing, but would that cause a scrub to lock up? I would think ZFS would be able to tell that a disk is failing. I'm running raidz2. How does raidz2 usually handle a failing disk?

Any commands I can run on the disks to check their health, should I do SMART tests?

I should also mention that the day before the scrub started I managed to fill the volume by letting a torrent run overnight. It suddenly got a lot more peers, so it downloaded more quickly than I expected. I was planning to stop it but forgot. Is it possible that having filled the volume could cause these issues? I have since removed a lot of the content, so the pool is no longer full (it's now below 80% capacity again), but it seems like too much of a coincidence that things were fine, the pool got full, and now I'm having issues.
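Whether the full pool is related is not settled here, but keeping an eye on capacity is cheap. As a sketch, the filter below reads the output of zpool list -H -o name,capacity and prints any pool at or above a threshold; the function name pool_capacity_warn and the 80% default are assumptions chosen to match the figure mentioned above.

```shell
#!/bin/sh
# Read "name<TAB>capacity%" lines (as printed by
# `zpool list -H -o name,capacity`) and print pools at or above
# the given percentage threshold (default 80).
pool_capacity_warn() {
    threshold=${1:-80}
    awk -v t="$threshold" '{
        cap = $2
        sub(/%/, "", cap)          # strip the trailing percent sign
        if (cap + 0 >= t + 0)
            print $1 ": " cap "% used, at or above " t "%"
    }'
}
```

It would be run as zpool list -H -o name,capacity | pool_capacity_warn 80 and prints nothing when every pool is below the threshold.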
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
It's possible one of the disks is failing, but would that cause a scrub to lock up?
It's possible, but scrubs are designed to find disk errors, so should not be stopped by a disk error (certainly not one that dmesg isn't seeing either).

Any commands I can run on the disks to check their health, should I do SMART tests?
Absolutely... if you haven't set up the SMART tests in the GUI, do so.

smartctl -a /dev/daX will get you the results of tests already run (replace X with the actual disk number).
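To collect that output for several disks in one go, a small loop is enough. This is only a sketch: the function name smart_report is made up, and the device names you pass are assumptions to be replaced with whatever your system lists (SATA disks on FreeBSD usually show up as adaX rather than daX).

```shell
#!/bin/sh
# Print a banner and the full `smartctl -a` report for each disk name given.
smart_report() {
    for disk in "$@"; do
        echo "################ $disk ################"
        smartctl -a "/dev/$disk"
    done
}
```

For example, smart_report ada0 ada1 ada2 ada3 > smart.txt saves everything to one file that can be pasted into a post.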
 

Brian Sturk

Dabbler
Joined
Apr 12, 2015
Messages
15
Thanks for that command. The output for all 4 of my drives shows no errors or issues. The overall-health self-assessment on each was PASSED.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Don't necessarily be fooled by that result... make sure you review the details also and check for read/write/checksum errors or pending sectors.
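To pull just those counters out of a long smartctl -a report, a filter along these lines can help. It is a sketch that assumes the standard ten-column attribute table printed by smartctl; the function name smart_key_attrs and the choice of attribute IDs (1, 5, 197, 198, 199) are illustrative, not an official list.

```shell
#!/bin/sh
# From `smartctl -a` text on stdin, print the read-error, reallocated,
# pending, offline-uncorrectable and CRC attributes with their raw values.
smart_key_attrs() {
    awk '$3 ~ /^0x/ && $1 ~ /^(1|5|197|198|199)$/ && NF >= 10 {
        printf "%-3s %-24s value=%s worst=%s thresh=%s raw=%s\n",
               $1, $2, $4, $5, $6, $10
    }'
}
```

It would be used as smartctl -a /dev/ada0 | smart_key_attrs. Raw values are vendor-specific, so compare the normalized value against the threshold as well as looking at the raw number.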
 

Brian Sturk

Dabbler
Joined
Apr 12, 2015
Messages
15
I went over all the counters and output, and the only thing that stands out is this on 2 drives:

1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 65536
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 3


I'm not sure what timespan these counters cover, but the disks have been in service for over 6 years now.

If it would be more useful I can post the full output somewhere (or here if that is allowed).
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
Hi,

You can post the output here using the code tags. To insert code, click on the triple dot in the editor toolbar and select the Code tag.

Also, this is a great moment for you to think about backups. Do you have backups? If you do not, making one would be important. When you start doubting your pool, you should reassure yourself with that backup...
 

Brian Sturk

Dabbler
Joined
Apr 12, 2015
Messages
15
OK, here is the smart data for my 4 disks below. Is there a way to get the current call stack of the zfs process when there is a scrub that is hung like mine?

Also, yes, I have multiple backups in different places/formats. I'm just surprised that there could be an issue and other than the hang (and possibly the SMART data) there's not much indication of an issue as far as system reporting.

Thanks again everyone for the help. I'm less stressed out now knowing there are knowledgeable folks answering my questions. :)

Code:
################################
    ada0
################################

smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi Ultrastar A7K2000
Device Model:     Hitachi HUA722020ALA330
Serial Number:    JK1131YAHXLWSV
LU WWN Device Id: 5 000cca 221db18d3
Firmware Version: JKAOA3EA
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Tue May 26 10:20:24 2020 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)    Offline data collection activity
                    was suspended by an interrupting command from host.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (21742) seconds.
Offline data collection
capabilities:              (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   1) minutes.
Extended self-test routine
recommended polling time:      ( 362) minutes.
SCT capabilities:            (0x003d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       65536
  2 Throughput_Performance  0x0005   134   134   054    Pre-fail  Offline      -       96
  3 Spin_Up_Time            0x0007   160   160   024    Pre-fail  Always       -       386 (Average 517)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       269
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   121   121   020    Pre-fail  Offline      -       35
  9 Power_On_Hours          0x0012   092   092   000    Old_age   Always       -       58973
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       112
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       465
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       465
194 Temperature_Celsius     0x0002   171   171   000    Old_age   Always       -       35 (Min/Max 19/48)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 0
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

################################
    ada1
################################

smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi Ultrastar A7K2000
Device Model:     Hitachi HUA722020ALA330
Serial Number:    JK1130YAHW1SYT
LU WWN Device Id: 5 000cca 221da642d
Firmware Version: JKAOA3EA
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Tue May 26 10:20:38 2020 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)    Offline data collection activity
                    was suspended by an interrupting command from host.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (22330) seconds.
Offline data collection
capabilities:              (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   1) minutes.
Extended self-test routine
recommended polling time:      ( 372) minutes.
SCT capabilities:            (0x003d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   133   133   054    Pre-fail  Offline      -       102
  3 Spin_Up_Time            0x0007   150   150   024    Pre-fail  Always       -       440 (Average 521)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       137
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   119   119   020    Pre-fail  Offline      -       36
  9 Power_On_Hours          0x0012   093   093   000    Old_age   Always       -       55770
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       127
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       305
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       305
194 Temperature_Celsius     0x0002   166   166   000    Old_age   Always       -       36 (Min/Max 21/48)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 0
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

################################
    ada2
################################

smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD20EFRX-68EUZN0
Serial Number:    WD-WCC4MHT0U2PC
LU WWN Device Id: 5 0014ee 20a8cd646
Firmware Version: 80.00A80
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue May 26 10:20:44 2020 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (25980) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 263) minutes.
Conveyance self-test routine
recommended polling time:      (   5) minutes.
SCT capabilities:            (0x703d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       3
  3 Spin_Up_Time            0x0027   178   173   021    Pre-fail  Always       -       4083
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       72
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   034   034   000    Old_age   Always       -       48311
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       72
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       63
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       1337
194 Temperature_Celsius     0x0022   117   108   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

################################
    ada3
################################

smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD20EFRX-68EUZN0
Serial Number:    WD-WCC4MKE5A6NR
LU WWN Device Id: 5 0014ee 20a6dcb60
Firmware Version: 80.00A80
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue May 26 10:20:48 2020 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (26940) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 272) minutes.
Conveyance self-test routine
recommended polling time:      (   5) minutes.
SCT capabilities:            (0x703d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   176   171   021    Pre-fail  Always       -       4183
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       72
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   034   034   000    Old_age   Always       -       48309
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       72
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       63
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       1158
194 Temperature_Celsius     0x0022   118   111   000    Old_age   Always       -       29
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
No self-tests have been logged. [To run self-tests, use: smartctl -t]

You are not running SMART tests on your drives, so you have no way of knowing when or if they are dying. Start by running a SMART long test on all drives and then a short test. Set these up under the SMART section of the GUI so they run 2x a month or so. To run them manually, run smartctl -t long /dev/adaX and smartctl -t short /dev/adaX.
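As a sketch of the manual route, the two helpers below start a long test on each disk and later print the self-test log. The function names are made up and the disk names are whatever you pass in. The tests run inside the drive in the background. The reports posted earlier in this thread give a recommended polling time of roughly 260 to 370 minutes for the extended test, so check the log after that.

```shell
#!/bin/sh
# Start an extended (long) SMART self-test on each disk name given.
start_long_tests() {
    for disk in "$@"; do
        smartctl -t long "/dev/$disk"
    done
}

# Print the SMART self-test log for each disk name given.
show_selftest_logs() {
    for disk in "$@"; do
        echo "== $disk =="
        smartctl -l selftest "/dev/$disk"
    done
}
```

For example: start_long_tests ada0 ada1 ada2 ada3, then some hours later show_selftest_logs ada0 ada1 ada2 ada3.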

At a quick glance, ada0 has significant read errors and ada2 is starting to have read errors. You will probably have 2 drives in this system dying soon. If the SMART tests do not pass, replace the drives fast. If the SMART tests pass, just monitor the drives closely and make sure you have email notifications set up.
 

Brian Sturk

Dabbler
Joined
Apr 12, 2015
Messages
15
Thank you for that, I will set those up today. Should I keep the ZFS scrub paused? Assuming these tests come back OK (or even if they don't), I'm not sure how to address the scrub hang. I could replace the drive(s) if they don't pass, but if they do pass, what is my course of action?

I'd like to understand the issue ZFS is having, and I'm hoping that a dump of the call stack while it is hung would provide some insight.
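On FreeBSD, procstat -kk prints kernel stack traces for threads, and -a covers all processes, so one possible way to see where ZFS threads are sitting during the hang is to filter that output. This is a sketch only: the function name zfs_stacks is made up, and the patterns are assumptions about which thread and symbol names are of interest.

```shell
#!/bin/sh
# Keep only the lines of `procstat -kk -a` output (on stdin) that mention
# ZFS-related thread names or kernel functions.
zfs_stacks() {
    grep -E 'zfs|zio|txg|dsl_scan|spa_'
}
```

It would be run as root with procstat -kk -a | zfs_stacks while the scrub is stuck. Capturing it twice a minute apart shows whether the same threads stay blocked in the same place.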
 

Brian Sturk

Dabbler
Joined
Apr 12, 2015
Messages
15
The long SMART tests all finished without errors. The Raw_Read_Error_Rate now reads 0 for all 4 drives.
 