Should I be worried? ZFS Degraded -> Faulted -> Online

gar

Dabbler
Joined
Jan 1, 2014
Messages
14
Alright, long day but things are ok... for now.

It started when I rebooted my server while troubleshooting Plex and port-forwarding. After fixing that, I got a critical alarm email several minutes later. Upon logging in, I saw the output below and suspected that my drive with SN Z1E5674Y had failed. No problem, I thought, that's why I'm using ZFS; I'll just replace the drive, but let's reboot first to see if this is just a fluke.

Code:
--------------------------------After reboot-----------------
root@NAS:~ # zpool status -v
  pool: NAS
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: scrub repaired 0 in 0 days 03:44:25 with 0 errors on Sun Mar 10 04:44:26 2019
config:

        NAME                                            STATE     READ WRITE CKSUM
        NAS                                             DEGRADED     0     0     0
          raidz1-0                                      DEGRADED     0     0     0
            16998016746606625039                        UNAVAIL      0     0     0  was /dev/gptid/5d9c2204-7588-11e3-b46b-000c29403d9a
            gptid/5e351433-7588-11e3-b46b-000c29403d9a  ONLINE       0     0     0
            gptid/5ec15c36-7588-11e3-b46b-000c29403d9a  ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:00:28 with 0 errors on Tue Mar 19 03:45:28 2019
config:

        NAME        STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          da0p2     ONLINE       0     0     0

errors: No known data errors
root@NAS:~ # smartctl -a /dev/ada0 | grep ^Serial
Serial Number:    Z1E5676K
root@NAS:~ # smartctl -a /dev/ada1 | grep ^Serial
Serial Number:    Z1E57PJS
root@NAS:~ # smartctl -a /dev/ada2 | grep ^Serial
root@NAS:~ # glabel status
                                      Name  Status  Components
gptid/796bbfc9-b651-11e5-a3e6-000c29403d9a     N/A  da0p1
gptid/5e351433-7588-11e3-b46b-000c29403d9a     N/A  ada0p2
gptid/5ec15c36-7588-11e3-b46b-000c29403d9a     N/A  ada1p2


Then I hit the output below. I was very confused about how the volume could have failed with two active drives; however, it now looked like Z1E5676K had failed. So I panicked and rebooted several times, trying different cables, SATA ports, etc., until...

Code:
---------------AFTER Reboot Troubleshooting Drives-----------------------------------
                                      Name  Status  Components
gptid/796bbfc9-b651-11e5-a3e6-000c29403d9a     N/A  da0p1
gptid/5d9c2204-7588-11e3-b46b-000c29403d9a     N/A  ada0p2
gptid/5ec15c36-7588-11e3-b46b-000c29403d9a     N/A  ada1p2
root@NAS:~ # zpool import
   pool: NAS
     id: 393758721156783558
  state: FAULTED
 status: One or more devices are missing from the system.
 action: The pool cannot be imported. Attach the missing
        devices and try again.
        The pool may be active on another system, but can be imported using
        the '-f' flag.
   see: http://illumos.org/msg/ZFS-8000-3C
 config:

        NAS                                             FAULTED  corrupted data
          raidz1-0                                      FAULTED  corrupted data
            gptid/5d9c2204-7588-11e3-b46b-000c29403d9a  ONLINE
            17347583837171412057                        UNAVAIL  cannot open
            gptid/5ec15c36-7588-11e3-b46b-000c29403d9a  ONLINE
root@NAS:~ # zpool status -v
  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:00:28 with 0 errors on Tue Mar 19 03:45:28 2019
config:

        NAME        STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          da0p2     ONLINE       0     0     0

errors: No known data errors
root@NAS:~ #
root@NAS:~ # zpool status -v
  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:00:28 with 0 errors on Tue Mar 19 03:45:28 2019
config:

        NAME        STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          da0p2     ONLINE       0     0     0

errors: No known data errors


It looks like it is working. But the question remains: what happened, and should I start to take some sort of action?
I'm pretty worried about what's going to happen when I power-cycle the box again.
Do you think I should get a 4TB drive and either do a full backup or just add it to the pool before any power cycle?

Code:
---------------NOW-------------------------------------------
root@NAS:~ # zpool status -v
  pool: NAS
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: resilvered 879M in 0 days 00:01:34 with 0 errors on Fri Mar 22 13:49:59 2019
config:

        NAME                                            STATE     READ WRITE CKSUM
        NAS                                             ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/5d9c2204-7588-11e3-b46b-000c29403d9a  ONLINE       0     0     0
            gptid/5e351433-7588-11e3-b46b-000c29403d9a  ONLINE       0     0     0
            gptid/5ec15c36-7588-11e3-b46b-000c29403d9a  ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:00:28 with 0 errors on Tue Mar 19 03:45:28 2019
config:

        NAME        STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          da0p2     ONLINE       0     0     0

errors: No known data errors
root@NAS:~ # zpool import
root@NAS:~ # glabel status
                                      Name  Status  Components
gptid/796bbfc9-b651-11e5-a3e6-000c29403d9a     N/A  da0p1
gptid/5ec15c36-7588-11e3-b46b-000c29403d9a     N/A  ada0p2
gptid/5e351433-7588-11e3-b46b-000c29403d9a     N/A  ada1p2
gptid/5d9c2204-7588-11e3-b46b-000c29403d9a     N/A  ada2p2
gptid/5eaaea3c-7588-11e3-b46b-000c29403d9a     N/A  ada0p1
root@NAS:~ #
root@NAS:~ # smartctl -a /dev/ada0 | grep ^Serial
Serial Number:    Z1E57PJS
root@NAS:~ # smartctl -a /dev/ada1 | grep ^Serial
Serial Number:    Z1E5676K
root@NAS:~ # smartctl -a /dev/ada2 | grep ^Serial
Serial Number:    Z1E5674Y



Other relevant data. Thanks in advance for the help.
Code:
root@NAS:~ # smartctl -a /dev/ada0 | grep self-assessment
SMART overall-health self-assessment test result: PASSED
root@NAS:~ # smartctl -a /dev/ada1 | grep self-assessment
SMART overall-health self-assessment test result: PASSED
root@NAS:~ # smartctl -a /dev/ada2 | grep self-assessment
SMART overall-health self-assessment test result: PASSED
root@NAS:~ # camcontrol devlist
<NECVMWar VMware IDE CDR10 1.00>   at scbus1 target 0 lun 0 (cd0,pass0)
<VMware Virtual disk 1.0>          at scbus2 target 0 lun 0 (pass1,da0)
<ST2000DM001-1CH164 CC27>          at scbus4 target 0 lun 0 (pass2,ada0)
<ST2000DM001-1CH164 CC27>          at scbus5 target 0 lun 0 (pass3,ada1)
<ST2000DM001-1CH164 CC27>          at scbus6 target 0 lun 0 (pass4,ada2)


 
Joined
Dec 2, 2015
Messages
730
It looks like it is working. But the question remains: what happened, and should I start to take some sort of action?
I'm pretty worried about what's going to happen when I power-cycle the box again.
Do you think I should get a 4TB drive and either do a full backup or just add it to the pool before any power cycle?
I have no idea what happened, nor can I suggest what troubleshooting steps to apply. But there are others much more knowledgeable than I who will hopefully provide advice.

Do not simply add another disk to the pool, as that disk would be added as a stripe (a new single-disk vdev). The new disk will not increase the redundancy, and if that new disk fails, you will lose all the data on the pool. See Slideshow explaining VDev, zpool, ZIL and L2ARC for more info.
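
To make the distinction concrete, here is a rough sketch (ada3 is a made-up device, not anything on your system, and this is not a recommendation):

Code:
# Adding a bare disk creates a new single-disk top-level vdev striped next to
# raidz1-0; ZFS even warns about the mismatched redundancy and requires -f:
#     zpool add -f NAS ada3
# Attaching (mirroring) only applies to single-disk or mirror vdevs, so it cannot
# be used to add redundancy to the existing raidz1 vdev either:
#     zpool attach NAS <existing-disk> <new-disk>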

Will it cause you any concern if this data is lost? If so, please ensure you have one (or preferably two) good backups. RAID is not a backup.

The PASSED output from SMART can be a bit misleading. Please provide full smartctl -x /dev/----- output for all disks.
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
"INSUFFICIENT DATA FOR MEANINGFUL ANSWER." (https://en.wikipedia.org/wiki/The_Last_Question)

Forum rules require posting your full hardware, because questions like this are often unanswerable without knowing what you have (many posters here will outright ignore posts that don't follow at least the basic rules). From your device names it looks like you are using SATA; some of these controllers have poor quality or poor compatibility with BSD, but since you haven't told us anything, there is no way to know. Also, if you are using some kind of RAID card, or virtualized hardware, those can mangle ZFS. Again, no info provided means no help can be provided.

Initially, this sounds like it could be:
controller problems;
cable problems (loose, failing); or
power supply problems, wherein drives are getting power intermittently (if so, adding another drive would make it worse).
1. System specs?
2. Do you have a backup?
3. Why are you using raidz1? Are your drives larger than ~1TB? If so, raidz1 is risky and absolutely should not be done without a backup (long resilver times can be very bad).
4. Scrubbing your drives will tell you if anything got corrupted, but it will also put stress on the system and could trigger drives to go MIA again (a quick command sketch follows this list).
5. You cannot just add a 4TB drive to the pool; ZFS doesn't work that way. Without more information, it sounds like you might want to do some serious reading of the (usually stickied) newbie guides to ZFS.
6. Panicking and rebooting a ton of times with a non-importing pool is often a REALLY BAD IDEA (TM). If your pool looks dead after a reboot or two and you don't know why, seriously consider turning it off and asking for advice.
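
For point 4, a minimal sketch of the commands involved (the GUI exposes the same thing from the pool status page):

Code:
zpool scrub NAS       # start a scrub of the pool
zpool status -v NAS   # check progress and any READ/WRITE/CKSUM errors it turns up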
 
Last edited:

gar

Dabbler
Joined
Jan 1, 2014
Messages
14
You should change your link :) http://www.multivax.com/last_question.html

Let me know if you need anything else. This is a home setup I built in 2013 and haven't really done anything with since, besides normal updates and setting up Azure backups for certain folders last year. I've pretty much forgotten the little I did know about FreeNAS. Guess it's time to get back to reading.

1. System specs?
FreeNAS-11.1-U7
AMD FX-8120, 8GB RAM. Built-in SATA controller on an ASRock 970 Extreme3 R2.0.
https://www.newegg.com/Product/Product.aspx?Item=N82E16813157394
Not doing anything fancy on the SATA controller; it just presents the individual disks to FreeNAS.

2. Do you have a backup?
Yeah, but only critical data to Azure. For movies, games, etc., I rely on the redundancy of the setup to protect me from having to re-download them all. :) I know, not the best idea.

3. Why are you using raidz1? Are your drives larger than ~1TB? If so, raidz1 is risky and absolutely should not be done without a backup (long resilver times can be very bad).
Didn't know that it was a bad idea. What should I be doing that's optimal for a 3-drive setup? I can't really afford to do much more than replace the disks I have if they go bad.

4. Scrubbing your drives will tell you if anything got corrupted, but it will also put stress on the system and could trigger drives to go MIA again.
I did this. It seemed fine:
Code:
Scrub
Status: Completed

Errors: 0     Repaired: 0     Date: Sun Mar 24 07:57:11 2019


5. You cannot just add a 4TB drive to the pool; ZFS doesn't work that way. Without more information, it sounds like you might want to do some serious reading of the (usually stickied) newbie guides to ZFS.
I was hoping for maybe an option to put in a hot spare. I think this is not an option.

6. Panicking and rebooting a ton of times with a non-importing pool is often a REALLY BAD IDEA (TM). If your pool looks dead after a reboot or two and you don't know why, seriously consider turning it off and asking for advice.
Crap... Pretty much did exactly this.
 

gar

Dabbler
Joined
Jan 1, 2014
Messages
14
The PASSED output from SMART can be a bit misleading. Please provide full smartctl -x /dev/----- output for all disks.

Code:
root@NAS:~ # smartctl -x /dev/ada0
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.14 (AF)
Device Model:     ST2000DM001-1CH164
Serial Number:    Z1E57PJS
LU WWN Device Id: 5 000c50 0647acbb4
Firmware Version: CC27
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Mar 24 21:57:35 2019 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Disabled
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Unavailable

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (  33) The self-test routine was interrupted
                                        by the host with a hard or soft reset.
Total time to complete Offline
data collection:                (  592) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 226) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x3085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   112   099   006    -    47849568
  3 Spin_Up_Time            PO----   096   095   000    -    0
  4 Start_Stop_Count        -O--CK   100   100   020    -    277
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
  7 Seek_Error_Rate         POSR--   087   060   030    -    543660448
  9 Power_On_Hours          -O--CK   073   073   000    -    24405
 10 Spin_Retry_Count        PO--C-   100   100   097    -    0
 12 Power_Cycle_Count       -O--CK   100   100   020    -    270
183 Runtime_Bad_Block       -O--CK   099   099   000    -    1
184 End-to-End_Error        -O--CK   100   100   099    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
188 Command_Timeout         -O--CK   100   099   000    -    0 0 184
189 High_Fly_Writes         -O-RCK   100   100   000    -    0
190 Airflow_Temperature_Cel -O---K   065   049   045    -    35 (Min/Max 32/36)
191 G-Sense_Error_Rate      -O--CK   100   100   000    -    0
192 Power-Off_Retract_Count -O--CK   100   100   000    -    148
193 Load_Cycle_Count        -O--CK   100   100   000    -    1176
194 Temperature_Celsius     -O---K   035   051   000    -    35 (0 10 0 0 0)
197 Current_Pending_Sector  -O--C-   100   100   000    -    0
198 Offline_Uncorrectable   ----C-   100   100   000    -    0
199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    9124
240 Head_Flying_Hours       ------   100   253   000    -    24369h+59m+04.770s
241 Total_LBAs_Written      ------   100   253   000    -    71229471196
242 Total_LBAs_Read         ------   100   253   000    -    152601127735
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      5  Comprehensive SMART error log
0x03       GPL     R/O      5  Ext. Comprehensive SMART error log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa1       GPL,SL  VS      20  Device vendor specific log
0xa2       GPL     VS    4496  Device vendor specific log
0xa8       GPL,SL  VS     129  Device vendor specific log
0xa9       GPL,SL  VS       1  Device vendor specific log
0xab       GPL     VS       1  Device vendor specific log
0xb0       GPL     VS    5176  Device vendor specific log
0xbe-0xbf  GPL     VS   65535  Device vendor specific log
0xc0       GPL,SL  VS       1  Device vendor specific log
0xc1       GPL,SL  VS      10  Device vendor specific log
0xc4       GPL,SL  VS       5  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

ATA_READ_LOG_EXT (addr=0x03:0x00, page=0, n=1) failed: Input/output error
Read SMART Extended Comprehensive Error Log failed

SMART Error Log Version: 1
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Interrupted (host reset)      00%     24371         -
# 2  Short offline       Interrupted (host reset)      00%     24359         -
# 3  Extended offline    Interrupted (host reset)      00%     24358         -
# 4  Short offline       Completed without error       00%     24324         -
# 5  Short offline       Completed without error       00%     24312         -
# 6  Short offline       Completed without error       00%     24276         -
# 7  Short offline       Completed without error       00%     24264         -
# 8  Short offline       Completed without error       00%     24228         -
# 9  Short offline       Completed without error       00%     24216         -
#10  Short offline       Completed without error       00%     24180         -
#11  Short offline       Completed without error       00%     24168         -
#12  Short offline       Completed without error       00%     24132         -
#13  Short offline       Completed without error       00%     24120         -
#14  Short offline       Completed without error       00%     24084         -
#15  Short offline       Completed without error       00%     24072         -
#16  Short offline       Completed without error       00%     24037         -
#17  Short offline       Completed without error       00%     24025         -
#18  Short offline       Completed without error       00%     23989         -
#19  Short offline       Completed without error       00%     23977         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       522 (0x020a)
SCT Support Level:                   1
Device State:                        Active (0)
Current Temperature:                    35 Celsius
Power Cycle Min/Max Temperature:     32/35 Celsius
Lifetime    Min/Max Temperature:     10/51 Celsius
Under/Over Temperature Limit Count:   0/0

SCT Data Table command not supported

SCT Error Recovery Control command not supported

Device Statistics (GP/SMART Log 0x04) not supported

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x000a  2         7746  Device-to-host register FISes sent due to a COMRESET
0x0001  2         9125  Command failed due to ICRC error
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2         9128  R_ERR response for host-to-device data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2           33  R_ERR response for host-to-device non-data FIS

root@NAS:~ # smartctl -x /dev/ada1
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.14 (AF)
Device Model:     ST2000DM001-1CH164
Serial Number:    Z1E5676K
LU WWN Device Id: 5 000c50 06478d75f
Firmware Version: CC27
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Mar 24 21:57:38 2019 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Disabled
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Unavailable

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  584) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 217) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x3085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   119   099   006    -    222492016
  3 Spin_Up_Time            PO----   096   095   000    -    0
  4 Start_Stop_Count        -O--CK   100   100   020    -    278
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
  7 Seek_Error_Rate         POSR--   087   060   030    -    568656609
  9 Power_On_Hours          -O--CK   072   072   000    -    24563
 10 Spin_Retry_Count        PO--C-   100   100   097    -    0
 12 Power_Cycle_Count       -O--CK   100   100   020    -    272
183 Runtime_Bad_Block       -O--CK   093   093   000    -    7
184 End-to-End_Error        -O--CK   100   100   099    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
188 Command_Timeout         -O--CK   100   100   000    -    0 0 4
189 High_Fly_Writes         -O-RCK   100   100   000    -    0
190 Airflow_Temperature_Cel -O---K   067   049   045    -    33 (Min/Max 30/33)
191 G-Sense_Error_Rate      -O--CK   100   100   000    -    0
192 Power-Off_Retract_Count -O--CK   100   100   000    -    151
193 Load_Cycle_Count        -O--CK   100   100   000    -    1184
194 Temperature_Celsius     -O---K   033   051   000    -    33 (0 11 0 0 0)
197 Current_Pending_Sector  -O--C-   100   100   000    -    0
198 Offline_Uncorrectable   ----C-   100   100   000    -    0
199 UDMA_CRC_Error_Count    -OSRCK   200   160   000    -    151
240 Head_Flying_Hours       ------   100   253   000    -    24522h+23m+19.131s
241 Total_LBAs_Written      ------   100   253   000    -    71234689699
242 Total_LBAs_Read         ------   100   253   000    -    152898938333
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      5  Comprehensive SMART error log
0x03       GPL     R/O      5  Ext. Comprehensive SMART error log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa1       GPL,SL  VS      20  Device vendor specific log
0xa2       GPL     VS    4496  Device vendor specific log
0xa8       GPL,SL  VS     129  Device vendor specific log
0xa9       GPL,SL  VS       1  Device vendor specific log
0xab       GPL     VS       1  Device vendor specific log
0xb0       GPL     VS    5176  Device vendor specific log
0xbe-0xbf  GPL     VS   65535  Device vendor specific log
0xc0       GPL,SL  VS       1  Device vendor specific log
0xc1       GPL,SL  VS      10  Device vendor specific log
0xc4       GPL,SL  VS       5  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (5 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     24557         -
# 2  Short offline       Completed without error       00%     24481         -
# 3  Short offline       Completed without error       00%     24469         -
# 4  Short offline       Completed without error       00%     24433         -
# 5  Short offline       Completed without error       00%     24421         -
# 6  Short offline       Completed without error       00%     24385         -
# 7  Short offline       Completed without error       00%     24373         -
# 8  Short offline       Completed without error       00%     24337         -
# 9  Short offline       Completed without error       00%     24325         -
#10  Short offline       Completed without error       00%     24289         -
#11  Short offline       Completed without error       00%     24277         -
#12  Short offline       Completed without error       00%     24241         -
#13  Short offline       Completed without error       00%     24229         -
#14  Short offline       Completed without error       00%     24194         -
#15  Short offline       Completed without error       00%     24182         -
#16  Short offline       Completed without error       00%     24146         -
#17  Short offline       Completed without error       00%     24134         -
#18  Short offline       Completed without error       00%     24098         -
#19  Short offline       Completed without error       00%     24086         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       522 (0x020a)
SCT Support Level:                   1
Device State:                        Active (0)
Current Temperature:                    33 Celsius
Power Cycle Min/Max Temperature:     30/33 Celsius
Lifetime    Min/Max Temperature:     11/50 Celsius
Under/Over Temperature Limit Count:   0/0

SCT Data Table command not supported

SCT Error Recovery Control command not supported

Device Statistics (GP/SMART Log 0x04) not supported

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x000a  2           13  Device-to-host register FISes sent due to a COMRESET
0x0001  2            0  Command failed due to ICRC error
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS

root@NAS:~ # smartctl -x /dev/ada2
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.14 (AF)
Device Model:     ST2000DM001-1CH164
Serial Number:    Z1E5674Y
LU WWN Device Id: 5 000c50 06478d3d3
Firmware Version: CC27
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Mar 24 21:57:40 2019 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Disabled
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Unavailable

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  584) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 213) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x3085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   116   099   006    -    105850560
  3 Spin_Up_Time            PO----   096   095   000    -    0
  4 Start_Stop_Count        -O--CK   100   100   020    -    275
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
  7 Seek_Error_Rate         POSR--   078   060   030    -    39192365095
  9 Power_On_Hours          -O--CK   072   072   000    -    24559
 10 Spin_Retry_Count        PO--C-   100   100   097    -    0
 12 Power_Cycle_Count       -O--CK   100   100   020    -    271
183 Runtime_Bad_Block       -O--CK   098   098   000    -    2
184 End-to-End_Error        -O--CK   100   100   099    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
188 Command_Timeout         -O--CK   100   100   000    -    0 0 0
189 High_Fly_Writes         -O-RCK   099   099   000    -    1
190 Airflow_Temperature_Cel -O---K   066   049   045    -    34 (Min/Max 30/35)
191 G-Sense_Error_Rate      -O--CK   100   100   000    -    0
192 Power-Off_Retract_Count -O--CK   100   100   000    -    150
193 Load_Cycle_Count        -O--CK   100   100   000    -    1185
194 Temperature_Celsius     -O---K   034   051   000    -    34 (0 11 0 0 0)
197 Current_Pending_Sector  -O--C-   100   100   000    -    0
198 Offline_Uncorrectable   ----C-   100   100   000    -    0
199 UDMA_CRC_Error_Count    -OSRCK   200   160   000    -    81
240 Head_Flying_Hours       ------   100   253   000    -    24522h+09m+51.539s
241 Total_LBAs_Written      ------   100   253   000    -    71202057703
242 Total_LBAs_Read         ------   100   253   000    -    152490686152
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      5  Comprehensive SMART error log
0x03       GPL     R/O      5  Ext. Comprehensive SMART error log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa1       GPL,SL  VS      20  Device vendor specific log
0xa2       GPL     VS    4496  Device vendor specific log
0xa8       GPL,SL  VS     129  Device vendor specific log
0xa9       GPL,SL  VS       1  Device vendor specific log
0xab       GPL     VS       1  Device vendor specific log
0xb0       GPL     VS    5176  Device vendor specific log
0xbe-0xbf  GPL     VS   65535  Device vendor specific log
0xc0       GPL,SL  VS       1  Device vendor specific log
0xc1       GPL,SL  VS      10  Device vendor specific log
0xc4       GPL,SL  VS       5  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (5 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     24553         -
# 2  Short offline       Completed without error       00%     24478         -
# 3  Short offline       Completed without error       00%     24466         -
# 4  Short offline       Completed without error       00%     24430         -
# 5  Short offline       Completed without error       00%     24418         -
# 6  Short offline       Completed without error       00%     24382         -
# 7  Short offline       Completed without error       00%     24370         -
# 8  Short offline       Completed without error       00%     24334         -
# 9  Short offline       Completed without error       00%     24322         -
#10  Short offline       Completed without error       00%     24286         -
#11  Short offline       Completed without error       00%     24274         -
#12  Short offline       Completed without error       00%     24238         -
#13  Short offline       Completed without error       00%     24226         -
#14  Short offline       Completed without error       00%     24191         -
#15  Short offline       Completed without error       00%     24179         -
#16  Short offline       Completed without error       00%     24143         -
#17  Short offline       Completed without error       00%     24131         -
#18  Short offline       Completed without error       00%     24095         -
#19  Short offline       Completed without error       00%     24083         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):


SCT Status Version:                  3
SCT Version (vendor specific):       522 (0x020a)
SCT Support Level:                   1
Device State:                        Active (0)
Current Temperature:                    34 Celsius
Power Cycle Min/Max Temperature:     30/34 Celsius
Lifetime    Min/Max Temperature:     11/50 Celsius
Under/Over Temperature Limit Count:   0/0

SCT Data Table command not supported

SCT Error Recovery Control command not supported

Device Statistics (GP/SMART Log 0x04) not supported

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x000a  2           26  Device-to-host register FISes sent due to a COMRESET
0x0001  2            0  Command failed due to ICRC error
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
You should be using either mirrors or raidz2 at minimum, because the chance of a second disk failing while resilvering terabytes of raidz data goes up dramatically, enough that you should consider it a guarantee.

With only a handful of drives raidz2 isn't going to do much for you; while a mirror is only one drive redundant, resilvers are faster and far less taxing on the remaining drive, while being MUCH easier to work with administratively (you can add more drives to a mirror vdev (x3, x4), expand pools with smaller disk batches, detach mirror drives, see more space when replacing with larger disks, etc.).

Your board doesn't appear to support ECC, and you have the bare minimum RAM (8GB), which is not recommended; server-grade hardware is best for ZFS. I can't figure out what SATA controller the board uses, but since the LAN is Realtek and the board is AMD, it's probably not the vastly more reliable Intel, and only having 5 ports is critically limiting.


Assuming you don't have a large enough budget currently available to properly replace the core system, my general recommendation:
1. Replace the power supply. A lot of people neglect to consider power (older units supply less power over time; no-name brands have poor quality; you don't give us yours, so again, less meaningful answers). Get a quality one (Seasonic, Corsair HX/TX/RM) with the most wattage you can reasonably afford.
2. Get two 4TB drives and create a new mirrored pool (burn in the drives if able/inclined).
3. Replicate your old pool to the new pool; this will give you a backup on new disks (plus snaps if you have them configured, which you should). A minimal send/receive sketch follows below.
4. Export/remove the backup pool/disks (do NOT mark them as new), and diagnose your setup in relative safety.
5. Reinstall/import the backup disks/pool, and replicate one to the other:
tank (4TB usable)
  mirror: 4TB + 4TB
btank (4TB usable)
  raidz1: 2TB + 2TB + 2TB
6. Or: reinstall/import the backup disks/pool, destroy the old pool, and extend your new mirror with two of the 2TB drives; you will have 6TB. Add the third 2TB as a spare for the other two:
tank (6TB usable)
  mirror1: 4TB + 4TB
  mirror2: 2TB + 2TB
  spare: 2TB
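
For step 3, a minimal send/receive sketch (pool and snapshot names like tank and @migrate are illustrative only; the FreeNAS GUI's replication tasks can do the same job):

Code:
zfs snapshot -r NAS@migrate                          # recursive snapshot of the old pool
zfs send -R NAS@migrate | zfs recv -F tank/NAS-copy  # replicate everything into the new pool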
 
Joined
Dec 2, 2015
Messages
730
/dev/ada0 shows a very high UDMA_CRC_Error_Count value (9124). The other two drives also show UDMA_CRC_Error_Counts, but quite a bit lower. This may be caused by bad SATA cables, or loose cable connections, which could possibly explain drives disappearing. As a first step, try reseating both ends of all SATA cables. Pay attention to the UDMA_CRC_Error_Counts, and buy new cables if those values keep increasing.

It doesn't look like the system is set up to do scheduled long SMART tests, or they are at a very long interval. I recommend you have a long SMART test done every week or two.
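
As a rough sketch of the manual equivalents (scheduled tests are normally configured in the FreeNAS GUI rather than from the shell):

Code:
smartctl -t long /dev/ada0                          # kick off an extended self-test by hand
smartctl -l selftest /dev/ada0                      # review the result once it finishes
smartctl -a /dev/ada0 | grep UDMA_CRC_Error_Count   # re-check the CRC counter after reseating cables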
 

gar

Dabbler
Joined
Jan 1, 2014
Messages
14

You should be using either mirrors or raidz2 at minimum, because the chance of a second disk failing while resilvering terabytes of raidz data goes up dramatically, enough that you should consider it a guarantee.

With only a handful of drives raidz2 isn't going to do much for you; while a mirror is only one drive redundant, resilvers are faster and far less taxing on the remaining drive, while being MUCH easier to work with administratively (you can add more drives to a mirror vdev (x3, x4), expand pools with smaller disk batches, detach mirror drives, see more space when replacing with larger disks, etc.).

Alright. So I think the best course of action based on both of your recommendations consists of a two-phase approach. Let me know your thoughts.

Phase 1: Near term. Fix and protect the current setup. Budget is the key limiter here.
1. I found a friend who will lend me his 4TB drive. I will use this to back up all current data.
2. I will purchase another 2TB drive, a new PSU, and SATA cables.
3. I will blow away the raidz1, install the new drive (giving me 4x 2TB), and set up a new pool as raidz2 (a rough command sketch follows this list).
4. Restore data to the new pool.
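
Roughly what I have in mind for step 3 from the command line (drive names are placeholders; I assume the GUI volume manager would handle the partitioning details):

Code:
# only after the backup on the borrowed 4TB drive is verified:
zpool destroy NAS
zpool create NAS raidz2 ada0 ada1 ada2 ada3   # 4 x 2TB raidz2 -> ~4TB usable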

Phase 2: In the next few years
1. Get server-grade hardware.
2. Migrate the setup to that.
3. Add some drives in order to have multiple pools:

tank (8TB usable, raidz2)
  4TB + 4TB + 4TB + 4TB

btank (8TB usable, raidz1; mostly to reuse my 2TB drives, might need to think about this some more)
  2TB + 2TB + 2TB + 2TB + 2TB
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
Note that with a single drive ZFS can detect bad data but cannot correct it, so if you put everything on one 4TB drive you will have a single point of failure, and you will be stressing that drive by copying everything onto it and then everything off of it again, increasing the chances of failure.
A raidz2 of 4 drives will have the same usable space as two mirrors, along with the reduced performance from the parity calculation and the increased resilver times, mostly making the benefits of raidz2 pointless, while also being more complicated to expand (you would need to add another raidz2 of 4 drives).
That said, it's your setup, so you have to decide if the risks are worth it.
 

pro lamer

Guru
Joined
Feb 16, 2018
Messages
626
Disclaimer: just a sanity-check chime-in. In general I recommend raidz2 over raidz1 for big drives. One can argue whether 2TB is already too big!

resilvers are faster
Not really unless you have a 3-way mirror

It's just a matter of reading data; one needs to read the same amount whether it's mirrors or raidz1. If one needs to read 2TB, that may mean reading 2TB from one HDD (in the case of a 2-way mirror) or 1TB from each of the two other disks (3-way mirrors or raidz1 three disks wide).
Edited because it's not relevant - see Chris's posts below.

It used to be a concern when CPUs were slower; resilvering a raidz is more CPU-intensive than resilvering mirrors.

Inspired by @Chris Moore

Sent from my phone
 
Last edited:

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
Mirrors have no parity calculation, so while yes, CPUs are fast, there is still more overhead for raidz; you have to read from every single drive in the raidz vdev to rebuild from parity, which is limited by a few things, not just CPU speed. Since you have to read from every drive, you are stressing every single drive for every resilver, while with mirrors you only have to read from the specific vdev that contains the failed mirrored drive.

A raidz is recalculating parity every time, while a mirror is just copying data to make the drives the same.
Exactly how much this overhead amounts to, though, I am not sure.

I prefer mirrors because they are simpler, have some of the best performance possible, make it easier to attach new vdevs to the pool and to attach drives, and expand faster (a smaller vdev means expansion by replacement happens after only a few disks instead of a whole raidz stripe), so any recommendation I make is going to be biased in that direction.

I try, however, not to forget the raidz alternatives, because raidz does have advantages: primarily storage efficiency and being able to lose any 1/2/3 disks in the vdev, whereas with mirrors you cannot lose a whole mirror vdev (in order to be able to lose any 2 disks with mirrors you need a 3-way mirror).

(some of the attachment and management complications will go away if things like vdev detachment and raidz expansion/contraction ever make it out of development)
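
To illustrate the administrative flexibility (pool and disk names are made up):

Code:
zpool attach tank ada1p2 ada4p2       # grow a 2-way mirror into a 3-way mirror
zpool detach tank ada4p2              # shrink it back; not possible with raidz members
zpool add tank mirror ada5p2 ada6p2   # expand the pool by one more mirror vdev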
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
because the chance of a second disk failing while resilvering terabytes of raidz data goes up dramatically, enough that you should consider it a guarantee
It isn't really guaranteed, just statistically more likely to happen. I have a backup pool running 6TB drives in RAIDz1, and in more than two years it has been problem-free. I would never suggest RAIDz1 unless there was a duplicate copy of the data as a precaution. The data in my backup also exists in two other locations, so it would be only an inconvenience if the pool did fail.
while being MUCH easier to work with administratively (you can add more drives to a mirror vdev (x3, x4), expand pools with smaller disk batches, detach mirror drives, see more space when replacing with larger disks, etc.)
Mirrors are easier to manage for all the reasons you stated, but they do not resilver faster with less stress on the other drives in the vdev. That is a misunderstanding that has been perpetuated to the point of becoming an urban legend. Rebuild speed is controlled by how fast the new drive is able to write the data to disk and the amount of data that needs to be written. I have done a long explanation of this a couple of times before. If I can find it, I will provide a link.
Mirrors have no parity calculation, so while yes, CPUs are fast, there is still more overhead for raidz;
CPU and RAM are not a limitation to rebuild speed unless the system is underpowered to begin with. Those calculations are done "on the fly" and don't significantly affect rebuild time. The rebuild time is controlled by the write speed to the destination drive.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
It looks like it is working. But the question remains: what happened, and should I start to take some sort of action?
I looked through your post and I think you did some things that you didn't tell us about, because the sequence of events doesn't read in order and doesn't fully make sense to me.
Here is a very well-written guide from one of the other forum moderators that I think you should read:

Hard Drive Troubleshooting Guide (All Versions of FreeNAS)
https://forums.freenas.org/index.ph...bleshooting-guide-all-versions-of-freenas.17/
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
I realize it's not a guaranteed failure, which is why I chose the words I used ("consider it a guarantee"). The point was to emphasize that, like you say, you should have other copies if you care about the data, and if you assume it will fail you probably won't lose anything important when/if it does.

I would be interested in the anti-urban-legend write-up, because I would prefer to be giving out correct info.

I get that rebuild speed is going to be limited by destination write speed, but would it not also be limited by source read speed, at least in theory? And wouldn't having a disk controller read from multiple drives and then run a parity calculation to reconstruct the data at least have the potential to be, if only marginally, slower than just copying data from one disk to another? Does ZFS just function in a way that makes those possibilities insignificant?

Also... how do you quote parts and split things that way? I can't seem to figure out anything but a full quote.

 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Other relevant data. Thanks in advance for the help.
See where you just grep for self-assessment.
Code:
root@NAS:~ # smartctl -a /dev/ada0 | grep self-assessment
SMART overall-health self-assessment test result: PASSED
root@NAS:~ # smartctl -a /dev/ada1 | grep self-assessment
SMART overall-health self-assessment test result: PASSED
root@NAS:~ # smartctl -a /dev/ada2 | grep self-assessment
SMART overall-health self-assessment test result: PASSED
root@NAS:~ # camcontrol devlist
<NECVMWar VMware IDE CDR10 1.00>   at scbus1 target 0 lun 0 (cd0,pass0)
<VMware Virtual disk 1.0>          at scbus2 target 0 lun 0 (pass1,da0)
<ST2000DM001-1CH164 CC27>          at scbus4 target 0 lun 0 (pass2,ada0)
<ST2000DM001-1CH164 CC27>          at scbus5 target 0 lun 0 (pass3,ada1)
<ST2000DM001-1CH164 CC27>          at scbus6 target 0 lun 0 (pass4,ada2)
What you need to be looking at is the full output of the SMART data with smartctl -x /dev/ada0, because that
SMART overall-health self-assessment test result: PASSED
is a big lie and doesn't tell you anything useful. I always completely ignore it. It has no value at all. Don't even look at it. A drive can have thousands of bad sectors and still say it is passing the self-assessment. Let's see the full output on those drives.
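
For example, something along these lines pulls out the attributes actually worth watching (an illustrative grep, not an exhaustive list):

Code:
smartctl -x /dev/ada0 | grep -E "Reallocated_Sector|Current_Pending|Offline_Uncorrectable|UDMA_CRC|Command_Timeout|Reported_Uncorrect"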
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
I get that rebuild speed is going to be limited by destination write speed, but would it not also be limited by source read speed, at least in theory?
That is the thing that is interesting about it. I will make a quick and simple example that I hope will illustrate the situation.
If you are recovering from a mirror vdev, you are limited by the read speed of the single donor (data donor) drive that is providing data to the target drive, as well as the potential limitation of the write speed of the target. In any case, the two disks will be working in lockstep with each other, so the rebuild runs at the speed of the slower of the two. The donor drive must read every bit of data it contains to transfer that data to the target drive. Let's say there is 1TB of data in that vdev; that means a full TB must be read from the donor and that full TB of data must also be written to the target.

How long does it take to write 1 TB of data? Depends on the drive.

If you are recovering in a RAIDz(some number) vdev, and you also have 1TB of data in the vdev: let's say it is a 5-drive vdev, just so we have some numbers to work with. If you have 5 drives in RAIDz1, roughly speaking you have a quarter of that 1TB of data on each drive. Actually it is a little more, because you have checksums and overhead, but it is an example. If you need to recover from a disk failure, you need to read about 256GB of data from each of the 4 surviving disks and write about 256GB of data to the target disk. It takes less time to read 256GB of data from each of four donor disks, because the system is reading in parallel from all four donor disks at once; it is still reading the full 1TB of data, but it is only getting a quarter of the data from each donor disk, and the system calculates what needs to be written to the target disk on the fly, as it is doing the read. So, it reads a full TB very quickly because it is doing a parallel read across four drives and only needs to read 256GB of data from each. Thus the limiting factor on rebuild speed is the write speed of the target disk, and it only needs to write around a quarter of the data to the target disk to restore parity. It takes less time to write 256GB of data to a drive than it takes to write 1TB. Also, each of the donor disks in the RAIDz pool only needs to do the work of accessing 256GB of data instead of a full 1TB, so it is less stressful for the RAIDz pool.
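
Back-of-envelope numbers for that example, assuming roughly 150 MB/s of sustained write speed (the exact figure doesn't matter, only the ratio):

Code:
echo "mirror rebuild, 1 TB written   : $(( 1000000 / 150 / 60 )) minutes"   # ~111 min
echo "raidz1 rebuild, ~256 GB written: $((  256000 / 150 / 60 )) minutes"   # ~28 min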
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Also... how do you quote parts and split things that way? I can't seem to figure out anything but a full quote.
I suppose it depends on the browser you are using, but if you highlight the portion of the text you want to reply to, the forum should give you a prompt to reply. Like this:

[screenshot of the quote/reply prompt that appears when text is highlighted]


Just click that reply tag and it will quote only that portion of the message for you.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Does ZFS just function in a way that makes those possibilities insignificant?
One significant difference between ZFS and a hardware RAID controller, which is the source of the legend and which I forgot to mention earlier: ZFS only copies the data, not the entire drive. ZFS ignores empty space, whereas hardware RAID is going to exercise the entire drive, not just the data space.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Code:
<NECVMWar VMware IDE CDR10 1.00>   at scbus1 target 0 lun 0 (cd0,pass0)
<VMware Virtual disk 1.0>          at scbus2 target 0 lun 0 (pass1,da0)
I just noticed this. Are you passing disks into a virtual machine?
Is this FreeNAS running inside virtualization? You might have mentioned that.

I see where you posted the output of the SMART reports, and you have CRC errors on all the disks. That usually points to a cabling issue. I also see some "Runtime_Bad_Block" counts, which are not a good indicator of disk health.

I would like to know the answer to that VMware question, but it looks like you have some hardware problems.
 