Critical error - What should I do?

BlazeStar · Nov 1, 2015

Using FreeNAS-9.3-STABLE-201509282017

I just got these alerts:

CRITICAL: Device: /dev/ada3, 1 Currently unreadable (pending) sectors
CRITICAL: The volume NAS (ZFS) state is ONLINE: One or more devices has experienced an error resulting in data corruption. Applications may be affected.

What should I do?

DrKK · Nov 1, 2015

It looks like you have a failing disk. You should plan on replacing it.

Do this:

Code:

smartctl -x /dev/ada3

from a shell prompt, and upload the results here inside of "code" tags---or, if you don't know how to do that, put it on pastebin.com and paste the link here.

Also, let us see:

Code:

zpool status -v

DrKK · Nov 1, 2015

BlazeStar said:
Using FreeNAS-9.3-STABLE-201509282017

I just got these alerts:

CRITICAL: Device: /dev/ada3, 1 Currently unreadable (pending) sectors

CRITICAL: The volume NAS (ZFS) state is ONLINE: One or more devices has experienced an error resulting in data corruption. Applications may be affected.

What should I do?

I'm going to bed in the next 15 minutes, so if you can get to it quickly, you'll get immediate service :)

BlazeStar · Nov 2, 2015

DrKK said:
It looks like you have a failing disk. You should plan on replacing it.

Do this:
Code:
smartctl -x /dev/ada3
from a shell prompt, and upload the results here inside of "code" tags---or, if you don't know how to do that, put it on pastebin.com and paste the link here.

Code:

smartctl -x /dev/ada3
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p26 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Se
Device Model:     WDC WD4000F9YZ-09N20L0
Serial Number:    WD-WCC132056754
LU WWN Device Id: 5 0014ee 20a11da36
Firmware Version: 01.01A01
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Nov  2 10:52:42 2015 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM level is:     128 (minimum power consumption without standby)
Rd look-ahead is: Enabled
Write cache is:   Enabled
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)    Offline data collection activity
                    was suspended by an interrupting command from host.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (41760) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 451) minutes.
Conveyance self-test routine
recommended polling time:      (   5) minutes.
SCT capabilities:            (0x70bd)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    0
  3 Spin_Up_Time            POS--K   142   142   021    -    11875
  4 Start_Stop_Count        -O--CK   100   100   000    -    30
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
  7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
  9 Power_On_Hours          -O--CK   088   088   000    -    8760
 10 Spin_Retry_Count        -O--CK   100   253   000    -    0
 11 Calibration_Retry_Count -O--CK   100   253   000    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    30
183 Runtime_Bad_Block       -O--CK   100   100   000    -    0
192 Power-Off_Retract_Count -O--CK   200   200   000    -    19
193 Load_Cycle_Count        -O--CK   200   200   000    -    10
194 Temperature_Celsius     -O---K   103   094   000    -    49
196 Reallocated_Event_Count -O--CK   200   200   000    -    0
197 Current_Pending_Sector  -O--CK   200   200   000    -    1
198 Offline_Uncorrectable   ----CK   200   200   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
200 Multi_Zone_Error_Rate   ---R--   200   200   000    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      5  Comprehensive SMART error log
0x03       GPL     R/O      6  Ext. Comprehensive SMART error log
0x04       GPL,SL  R/O      8  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x08       GPL     R/O      2  Power Conditions log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters
0x24       GPL     R/O      1  Current Device Internal Status Data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa0-0xa7  GPL,SL  VS      16  Device vendor specific log
0xa8-0xb7  GPL,SL  VS       1  Device vendor specific log
0xbd       GPL,SL  VS       1  Device vendor specific log
0xc0       GPL,SL  VS       1  Device vendor specific log
0xc1       GPL     VS      93  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
Device Error Count: 10
    CR     = Command Register
    FEATR  = Features Register
    COUNT  = Count (was: Sector Count) Register
    LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
    LH     = LBA High (was: Cylinder High) Register    ]   LBA
    LM     = LBA Mid (was: Cylinder Low) Register      ] Register
    LL     = LBA Low (was: Sector Number) Register     ]
    DV     = Device (was: Device/Head) Register
    DC     = Device Control Register
    ER     = Error register
    ST     = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 10 [9] occurred at disk power-on lifetime: 8559 hours (356 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 f9 bd 3b e8 40 00  Error: UNC at LBA = 0xf9bd3be8 = 4189928424

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 01 00 00 c8 00 00 f9 bd 3b 80 40 08 19d+04:17:33.265  READ FPDMA QUEUED
  2f 00 00 00 01 00 00 00 00 00 10 40 08 19d+04:17:33.262  READ LOG EXT
  60 01 00 00 c8 00 00 f9 bd 75 50 40 08 19d+04:17:31.267  READ FPDMA QUEUED
  60 01 00 00 c0 00 00 f9 bd 3b 80 40 08 19d+04:17:31.267  READ FPDMA QUEUED
  2f 00 00 00 01 00 00 00 00 00 10 40 08 19d+04:17:31.264  READ LOG EXT

Error 9 [8] occurred at disk power-on lifetime: 8559 hours (356 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 f9 bd 3b e8 40 00  Error: UNC at LBA = 0xf9bd3be8 = 4189928424

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 01 00 00 c8 00 00 f9 bd 75 50 40 08 19d+04:17:31.267  READ FPDMA QUEUED
  60 01 00 00 c0 00 00 f9 bd 3b 80 40 08 19d+04:17:31.267  READ FPDMA QUEUED
  2f 00 00 00 01 00 00 00 00 00 10 40 08 19d+04:17:31.264  READ LOG EXT
  60 01 00 00 c0 00 00 f9 bd 75 50 40 08 19d+04:17:29.296  READ FPDMA QUEUED
  60 01 00 00 c0 00 00 f9 bd 74 50 40 08 19d+04:17:29.295  READ FPDMA QUEUED

Error 8 [7] occurred at disk power-on lifetime: 8559 hours (356 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 f9 bd 3b e8 40 00  Error: UNC at LBA = 0xf9bd3be8 = 4189928424

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 01 00 00 c0 00 00 f9 bd 75 50 40 08 19d+04:17:29.296  READ FPDMA QUEUED
  60 01 00 00 c0 00 00 f9 bd 74 50 40 08 19d+04:17:29.295  READ FPDMA QUEUED
  60 01 00 00 c0 00 00 f9 bd 73 50 40 08 19d+04:17:29.294  READ FPDMA QUEUED
  60 01 00 00 c0 00 00 f9 bd 72 50 40 08 19d+04:17:29.293  READ FPDMA QUEUED
  60 01 00 00 c0 00 00 f9 bd 71 50 40 08 19d+04:17:29.293  READ FPDMA QUEUED

Error 7 [6] occurred at disk power-on lifetime: 8559 hours (356 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 f9 bd 3b e8 40 00  Error: UNC at LBA = 0xf9bd3be8 = 4189928424

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 01 00 00 b8 00 00 f9 bd 6c 50 40 08 19d+04:17:27.299  READ FPDMA QUEUED
  60 01 00 00 b8 00 00 f9 bd 6b 50 40 08 19d+04:17:27.298  READ FPDMA QUEUED
  60 00 d0 00 b8 00 00 f9 bd 6a 80 40 08 19d+04:17:27.297  READ FPDMA QUEUED
  60 01 00 00 b8 00 00 f9 bd 69 80 40 08 19d+04:17:27.296  READ FPDMA QUEUED
  60 01 00 00 b8 00 00 f9 bd 68 80 40 08 19d+04:17:27.296  READ FPDMA QUEUED

Error 6 [5] occurred at disk power-on lifetime: 8559 hours (356 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 f9 bd 3b e8 40 00  Error: UNC at LBA = 0xf9bd3be8 = 4189928424

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 01 00 00 b0 00 00 f9 bd 62 60 40 08 19d+04:17:25.293  READ FPDMA QUEUED
  60 01 00 00 b0 00 00 f9 bd 61 48 40 08 19d+04:17:25.291  READ FPDMA QUEUED
  60 00 08 00 b0 00 00 f9 bd 5f f8 40 08 19d+04:17:25.290  READ FPDMA QUEUED
  60 01 00 00 a8 00 00 f9 bd 3b 80 40 08 19d+04:17:25.281  READ FPDMA QUEUED
  61 00 10 00 a8 00 00 00 40 02 90 40 08 19d+04:17:25.257  WRITE FPDMA QUEUED

Error 5 [4] occurred at disk power-on lifetime: 8559 hours (356 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 f9 bd 3b e8 40 00  Error: UNC at LBA = 0xf9bd3be8 = 4189928424

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 01 00 00 48 00 00 f9 bd 3c 80 40 08 19d+04:17:23.200  READ FPDMA QUEUED
  60 01 00 00 40 00 00 f9 bd 3b 80 40 08 19d+04:17:23.200  READ FPDMA QUEUED
  2f 00 00 00 01 00 00 00 00 00 10 40 08 19d+04:17:23.198  READ LOG EXT
  60 01 00 00 40 00 00 f9 bd 3c 80 40 08 19d+04:17:21.219  READ FPDMA QUEUED
  60 01 00 00 38 00 00 f9 bd 3b 80 40 08 19d+04:17:21.219  READ FPDMA QUEUED

Error 4 [3] occurred at disk power-on lifetime: 8559 hours (356 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 f9 bd 3b e8 40 00  Error: UNC at LBA = 0xf9bd3be8 = 4189928424

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 01 00 00 40 00 00 f9 bd 3c 80 40 08 19d+04:17:21.219  READ FPDMA QUEUED
  60 01 00 00 38 00 00 f9 bd 3b 80 40 08 19d+04:17:21.219  READ FPDMA QUEUED
  2f 00 00 00 01 00 00 00 00 00 10 40 08 19d+04:17:21.216  READ LOG EXT
  60 01 00 00 38 00 00 f9 bd 3c 80 40 08 19d+04:17:19.229  READ FPDMA QUEUED
  60 01 00 00 30 00 00 f9 bd 3b 80 40 08 19d+04:17:19.229  READ FPDMA QUEUED

Error 3 [2] occurred at disk power-on lifetime: 8559 hours (356 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 f9 bd 3b e8 40 00  Error: UNC at LBA = 0xf9bd3be8 = 4189928424

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 01 00 00 38 00 00 f9 bd 3c 80 40 08 19d+04:17:19.229  READ FPDMA QUEUED
  60 01 00 00 30 00 00 f9 bd 3b 80 40 08 19d+04:17:19.229  READ FPDMA QUEUED
  2f 00 00 00 01 00 00 00 00 00 10 40 08 19d+04:17:19.227  READ LOG EXT
  60 00 08 00 30 00 00 f9 bd d9 d8 40 08 19d+04:17:17.239  READ FPDMA QUEUED
  60 01 00 00 28 00 00 f9 bd 3c 80 40 08 19d+04:17:17.239  READ FPDMA QUEUED

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      8029         -
# 2  Extended offline    Completed without error       00%      7373         -
# 3  Extended offline    Completed without error       00%      6702         -
# 4  Extended offline    Completed without error       00%      5862         -
# 5  Extended offline    Completed without error       00%      5201         -
# 6  Extended offline    Completed without error       00%      4375         -
# 7  Extended offline    Completed without error       00%      3703         -
# 8  Extended offline    Completed without error       00%      3033         -
# 9  Extended offline    Completed without error       00%      2366         -
#10  Extended offline    Completed without error       00%      1526         -
#11  Extended offline    Completed without error       00%       857         -
#12  Extended offline    Completed without error       00%       642         -
#13  Extended offline    Interrupted (host reset)      50%       469         -
#14  Extended offline    Completed without error       00%       305         -
#15  Extended offline    Completed without error       00%        27         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       258 (0x0102)
SCT Support Level:                   1
Device State:                        Active (0)
Current Temperature:                    49 Celsius
Power Cycle Min/Max Temperature:     22/54 Celsius
Lifetime    Min/Max Temperature:     18/58 Celsius
Under/Over Temperature Limit Count:   0/0

SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/60 Celsius
Min/Max Temperature Limit:           -41/85 Celsius
Temperature History Size (Index):    478 (386)

Index    Estimated Time   Temperature Celsius
387    2015-11-02 02:55    50  *******************************
...    ..(  9 skipped).    ..  *******************************
397    2015-11-02 03:05    50  *******************************
398    2015-11-02 03:06    49  ******************************
...    ..(387 skipped).    ..  ******************************
308    2015-11-02 09:34    49  ******************************
309    2015-11-02 09:35    50  *******************************
...    ..( 18 skipped).    ..  *******************************
328    2015-11-02 09:54    50  *******************************
329    2015-11-02 09:55    51  ********************************
...    ..( 10 skipped).    ..  ********************************
340    2015-11-02 10:06    51  ********************************
341    2015-11-02 10:07    50  *******************************
...    ..( 44 skipped).    ..  *******************************
386    2015-11-02 10:52    50  *******************************

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

Device Statistics (GP Log 0x04)
Page Offset Size         Value  Description
  1  =====  =                =  == General Statistics (rev 2) ==
  1  0x008  4               30  Lifetime Power-On Resets
  1  0x010  4             8760  Power-on Hours
  1  0x018  6      58548614156  Logical Sectors Written
  1  0x020  6        213734372  Number of Write Commands
  1  0x028  6      44589828420  Logical Sectors Read
  1  0x030  6        200099955  Number of Read Commands
  3  =====  =                =  == Rotating Media Statistics (rev 1) ==
  3  0x008  4             8049  Spindle Motor Power-on Hours
  3  0x010  4             8049  Head Flying Hours
  3  0x018  4               30  Head Load Events
  3  0x020  4              200~ Number of Reallocated Logical Sectors
  3  0x028  4               14  Read Recovery Attempts
  3  0x030  4                0  Number of Mechanical Start Failures
  4  =====  =                =  == General Errors Statistics (rev 1) ==
  4  0x008  4               10  Number of Reported Uncorrectable Errors
  4  0x010  4               10  Resets Between Cmd Acceptance and Completion
  5  =====  =                =  == Temperature Statistics (rev 1) ==
  5  0x008  1               49  Current Temperature
  5  0x010  1               49  Average Short Term Temperature
  5  0x018  1               50  Average Long Term Temperature
  5  0x020  1               58  Highest Temperature
  5  0x028  1               25  Lowest Temperature
  5  0x030  1               57  Highest Average Short Term Temperature
  5  0x038  1               42  Lowest Average Short Term Temperature
  5  0x040  1               53  Highest Average Long Term Temperature
  5  0x048  1               43  Lowest Average Long Term Temperature
  5  0x050  4                0  Time in Over-Temperature
  5  0x058  1               60  Specified Maximum Operating Temperature
  5  0x060  4                0  Time in Under-Temperature
  5  0x068  1                0  Specified Minimum Operating Temperature
  6  =====  =                =  == Transport Statistics (rev 1) ==
  6  0x008  4              164  Number of Hardware Resets
  6  0x010  4              437  Number of ASR Events
  6  0x018  4                0  Number of Interface CRC Errors
                              |_ ~ normalized value

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2            8  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            9  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x8000  4      2380133  Vendor specific

DrKK said:
Also, let us see:
Code:
zpool status -v

Code:

zpool status -v
  pool: Business
 state: ONLINE
  scan: scrub repaired 0 in 6h52m with 0 errors on Sun Oct 11 06:52:54 2015
config:

    NAME                                              STATE     READ WRITE CKSUM
    Business                                          ONLINE       0     0     0
      gptid/e932a723-7672-11e4-87aa-bcee7b74e9a5.eli  ONLINE       0     0     0

errors: No known data errors

  pool: Personal
 state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 5h58m with 1 errors on Sun Oct 25 05:58:37 2015
config:

    NAME                                              STATE     READ WRITE CKSUM
    Personal                                          ONLINE       1     0     0
      gptid/cdee59bd-7672-11e4-87aa-bcee7b74e9a5.eli  ONLINE       0     0     0
      gptid/ce3653b4-7672-11e4-87aa-bcee7b74e9a5.eli  ONLINE       0     0     0
      gptid/ce7d6eda-7672-11e4-87aa-bcee7b74e9a5.eli  ONLINE       1     0     0
    logs
      gptid/09b87714-7673-11e4-87aa-bcee7b74e9a5.eli  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /mnt/Personal/TimeMachine/MacBookPro/MacBook.sparsebundle/bands/1b2b6

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0h3m with 0 errors on Wed Sep 30 03:48:18 2015
config:

    NAME          STATE     READ WRITE CKSUM
    freenas-boot  ONLINE       0     0     0
      da0p2       ONLINE       0     0     0

errors: No known data errors

DrKK said:
I'm going to bed in the next 15 minutes, so if you can get to it quickly, you'll get immediate service :)

I wasn't expecting such a quick response, sorry for my lateness, and thank you for your reply !!!

danb35 · Nov 2, 2015

Well, you've got trouble, right here in River City. The disk problem noted with ada3 wouldn't ordinarily be a terrible thing, though it would bear watchful waiting and getting ready to replace the disk--a single bad sector isn't usually an emergency. However, that disk is way too hot--it shouldn't see over 40 deg C (I think @DrKK is a bit too conservative in insisting on <= 35 deg C), while it's currently at 49, and it's seen 58. That's bad for the life of your disk.
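If you want to keep an eye on both numbers, something like the sketch below pulls them out of the attribute table. It assumes the brief attribute format (the same layout as the table posted above, raw value in column 8), the 40 deg C limit is only the rule of thumb I mentioned, and `check_smart` is a made-up helper name, not part of smartmontools.

```shell
#!/bin/sh
# check_smart: hypothetical helper. Reads brief-format SMART attribute rows
# on stdin and reports attribute 194 (temperature) and 197 (pending sectors).
# Optional first argument: temperature limit in deg C (assumed default: 40).
check_smart() {
    awk -v max_temp="${1:-40}" '
        $1 == 194 { temp = $8; printf "temperature: %d C\n", temp }
        $1 == 197 { pend = $8; printf "pending sectors: %d\n", pend }
        END {
            if (temp + 0 > max_temp + 0) print "WARNING: drive is hotter than " max_temp " C"
            if (pend + 0 > 0) print "WARNING: pending sectors present"
        }
    '
}

# On the NAS itself (as root): smartctl -A -f brief /dev/ada3 | check_smart
# Demo using two rows copied from the output posted above:
check_smart <<'EOF'
194 Temperature_Celsius     -O---K   103   094   000    -    49
197 Current_Pending_Sector  -O--CK   200   200   000    -    1
EOF
```

With those two rows it should report 49 C and 1 pending sector, and print both warnings.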

The bigger problem, though, is that neither of your pools has any redundancy. That's why a single bad block has led to data corruption. That also means that when (not if) a drive fails, all your data will be gone. And combining that with an SLOG device seems highly unusual also.

Bidule0hm · Nov 2, 2015

58 °C, wow... I guess now we know why this drive is throwing errors.

And danb35 is right about your pools: you should back up any data that isn't already backed up (unless you don't care about that data, of course).

BlazeStar · Nov 2, 2015

First:

danb35 said:
combining that with an SLOG device seems highly unusual also.

I had an SSD, so I just thought I'd use it for the log... I thought it could speed up access in some cases... is that a bad thing?

Then, a few notes:

1) The pool BUSINESS is just one HD which is used for replication purposes... it's a distant volume used to replicate the FreeNAS in my small office

I don't think I need redundancy for this one... there is already redundancy at my office (striped-mirrored zvols), this one is an offsite backup... if the drive fails, I just replace it and I don't lose anything... what do you think?

2) The pool PERSONAL is my personal stuff... like movies and stuff... but of course I don't want to lose them

3) I'm shocked at the temperature issue... all the hardware is less than 1 year old, including a case that's designed for NAS boxes :(
The box itself is in a room that never goes above 25 °C

So I guess now I have to:

1) Try to back up important data ASAP

2) Understand and fix the overheating issue (change cases? add fans? any advice would be appreciated)

3) Once that is done, I will buy an IT-mode card, or a RAID card and flash the firmware to IT mode, so I can add more HDs
(any recommendation on the card to buy?)

4) Add 3 X 4TB drives to get redundancy (mirror + stripes)

Any comments?

I will for sure need help when I get to the swapping part because I've never done this.

I was thinking of adding the new drives first, configuring redundancy, and then swapping the failed disk.

What do you think?

Bidule0hm · Nov 2, 2015

BlazeStar said:
all the hardware is less than 1 year old, including a case that's designed for NAS boxes

Hardware age doesn't matter much here. I guess marketing people say "designed" but engineers say otherwise...

1) & 2) --> yes, exactly ;)

3) M1015. But how many SATA ports do you have on your MB?

4) Not sure. How much space do you need, and what size drives do you have now?

BlazeStar · Nov 2, 2015

Bidule0hm said:
3) M1015. But how many SATA ports do you have on your MB?

I have 6 but 5 are already taken:

1 X SSD
4 X 4TB

Where 1 X 4TB is used for my Business volume

And 3 X 4TB is for my Personal volume.

Bidule0hm said:
4) Not sure. How much space do you need, and what size drives do you have now?

Right now my Personal volume is striped over 3 X 4TB.

So I would like to add 3 X 4TB to get a striped-mirrored volume.

What do you think?

Bidule0hm · Nov 2, 2015

Ok, you can't mirror stripes (mirror of 2 stripes of 3 drives each), but you can stripe mirrors (stripe of 3 mirrors of 2 drives each).

I recommend a RAID-Z2 of 6 drives, I think it's the best solution here.

danb35 · Nov 2, 2015

How full is your personal volume, and how much data do you envision storing on it? And why are you using a SLOG device? If you don't have a specific need for the SLOG (which most people don't), you could remove it, put the business stuff on a dataset in your main pool, and recreate your pool in a 6-disk RAIDZ2 configuration. That would probably give you the best balance between redundancy and use of storage capacity, and it would only require buying two more disks (and no HBA). The downside is that you'd need to back up all your data, destroy both pools, and create a new pool.

OTOH, it's (probably) possible to add disks as mirrors of your existing personal pool disks. It involves a fair bit of CLI-fu, though. Upside is that you wouldn't need to bother with backup/restore, because your data would be mirrored in place.
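For what it's worth, the underlying ZFS operation is `zpool attach`, which turns a single-disk vdev into a two-way mirror. The sketch below only prints the commands; it doesn't run anything. The gptids are the three data disks from the `zpool status` output above, NEW1 to NEW3 are placeholders for the new disks, and it leaves out the partitioning and encryption (.eli) setup the new disks would need first, so treat it as an illustration of the idea, not a procedure.

```shell
#!/bin/sh
# Dry run only: prints one `zpool attach` per existing disk, executes nothing.
# NEW1..NEW3 are placeholder names (assumptions), not real devices.
pool="Personal"

print_attach_cmds() {
    # Arguments come in pairs: existing-device new-device
    while [ "$#" -ge 2 ]; do
        echo "zpool attach $pool $1 $2"
        shift 2
    done
}

print_attach_cmds \
    gptid/cdee59bd-7672-11e4-87aa-bcee7b74e9a5.eli NEW1 \
    gptid/ce3653b4-7672-11e4-87aa-bcee7b74e9a5.eli NEW2 \
    gptid/ce7d6eda-7672-11e4-87aa-bcee7b74e9a5.eli NEW3
```

Each attach kicks off a resilver onto the new disk, so `zpool status` would be the thing to watch afterwards.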

@Bidule0hm recommended the IBM M1015 as an HBA, and it's the go-to choice around here, along with other LSI 9211-type cards. It will give you two mini-SAS ports which will handle a total of up to 8 drives directly with breakout cables, or lots more using SAS expanders.

DrKK · Nov 2, 2015

Looks like you guys have this under control.

That drive (and presumably the others) is ***MUCH*** too hot. Its longevity will already have suffered badly.

BlazeStar · Nov 3, 2015

One question though

All my drives are 4 TB WD "Se"

I was thinking about buying 4 TB WD "Red" or "Black" for the mirrors...

Can that be a problem?

Thank you!

Bidule0hm · Nov 3, 2015

No, I don't see why it would be a problem ;)

BlazeStar · Nov 3, 2015

Ok, I wasn't sure. I thought you might have to use identical hard drives... for striping but also for mirroring.
So could I also replace the failing WD Se 4 TB with a WD Black 4 TB?
Or even a completely different brand, but still 4 TB?

Next: in order to resolve my issue, would it be a good idea to proceed like this?

1) Install an IBM M1015 card with 3 X 4TB drives

2) Configure these 3 new drives as mirrors of the first 3 drives, therefore having a 6-drive RAID-Z2 pool.

Once that is operational

3) Replace the failing drive among the first 3 drives and let it rebuild the data from the mirrors

Also, where do you recommend buying the M1015 card? I saw it on Amazon for $325
http://www.amazon.ca/IBM-Serveraid-M1015-Controller-46M0831/dp/B0034DMSO6

danb35 · Nov 3, 2015

BlazeStar said:
Install an IBM M1015 card with 3 X 4TB drives

Be sure to flash the firmware to IT mode first: the P16 version if running 9.3, or the P20 version if running 9.3.1.

This:

BlazeStar said:
Configure these 3 new drives as mirrors of the first 3 drives

will not accomplish this:

BlazeStar said:
therefore having a 6-drive RAID-Z2 pool.

eBay is the place to go for the M1015. If you can instead find an LSI 9211-8i, the flashing process is a little easier.

BlazeStar · Nov 3, 2015

danb35 said:
This:

will not accomplish this:

How come? Can you explain?

danb35 · Nov 3, 2015

Three pairs of striped mirrors are not the same as a six-disk RAIDZ2. If you can add mirrors to each of your disks (which I'm not at all sure you can, and @Bidule0hm thinks you can't), the result would be similar to a RAID 10 with the net capacity of three of your disks. A six-disk RAIDZ2 would be comparable to a RAID 6 array, with the net capacity of four of your disks.
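To put rough numbers on that, here's a back-of-the-envelope sketch. It ignores ZFS overhead and the TB vs TiB difference, and `usable_tb` is just a made-up helper name for the arithmetic.

```shell
#!/bin/sh
# usable_tb LAYOUT DISKS SIZE_TB: rough net capacity, ignoring ZFS overhead.
#   mirror = striped 2-way mirrors (half the disks hold copies)
#   raidz2 = one RAIDZ2 vdev (two disks' worth of parity)
#   stripe = no redundancy at all
usable_tb() {
    case "$1" in
        mirror) echo $(( $2 / 2 * $3 )) ;;
        raidz2) echo $(( ($2 - 2) * $3 )) ;;
        stripe) echo $(( $2 * $3 )) ;;
        *) echo "unknown layout: $1" >&2; return 1 ;;
    esac
}

echo "3 x 2-way mirrors (6 disks): $(usable_tb mirror 6 4) TB"
echo "6-disk RAIDZ2:               $(usable_tb raidz2 6 4) TB"
```

With six 4 TB disks that works out to 12 TB for the striped mirrors and 16 TB for the RAIDZ2.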

Bidule0hm · Nov 3, 2015

I think you need to read some documentation on the ZFS RAID types; look at the link to Cyberjock's ZFS Guide in my signature ;)

Edit: ah, danb35 was faster :)

BlazeStar · Nov 3, 2015

danb35 said:
Three pairs of striped mirrors are not the same as a six-disk RAIDZ2. If you can add mirrors to each of your disks (which I'm not at all sure you can, and @Bidule0hm thinks you can't), the result would be similar to a RAID 10 with the net capacity of three of your disks. A six-disk RAIDZ2 would be comparable to a RAID 6 array, with the net capacity of four of your disks.

Ok yeah, I'm very confused about all of this...
I did try to read the manual... not sure I understand everything...

When I was setting up my FreeNAS server for my small office, and after reading CyberJock's manual, I decided to go for an array of 8 drives... just like RAID 10, as you would say.

To do that, I just used the GUI: 4 striped drives, all of which are mirrored.

I thought it was the best option in terms of reliability, even if you lose half of the capacity

So that's what I was going to do for this second (personal) FreeNAS box...

But would you say RAIDZ2 is superior to this "RAID 10"-ish setup?
