Read Errors on System, FreeNAS Mini XL

stoth

Cadet
Joined
Jan 7, 2019
Messages
7
So I've got a hand-built FreeNAS box with 8 HDDs (and 2 SSDs for ZIL/caching), and when it works it's great, but I think my mainboard may be failing due to age/abuse (I recently had the main CPU heatsink fall off and had to reattach with new thermal paste). Now I'm getting timeouts to the drives after an extended copy session, which seems to have left my drive array degraded. Overall I really like FreeNAS as an OS and I would like to fix my current system, but I've been through a lot recently with this machine and am almost at the end of my rope.

So if I upgrade to a FreeNAS Mini XL, what hardware can I take to it?
Will my current RAIDZ2 pool (8 x 4TB 3.5" drives) 'just work' if I put it into a new system?
Can I also put the 2.5" SSDs in for cache/ZIL?
What's upgradeable on a FreeNAS Mini XL?

Is my current system worth saving if I might have fried it?
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
(I recently had the main CPU heatsink fall off and had to reattach with new thermal paste).
You mean to say that there is not something holding it other than the thermal compound?
Can you tell us what hardware you are using?
Overall I really like FreeNAS as an OS and I would like to fix my current system, but I've been through a lot recently with this machine and am almost at the end of my rope.
It would be a good idea (I think) to find out what condition your drives are in, since you already say that the pool is degraded. Would you please share with us the output of zpool status, which should look something like this:
Code:
  pool: Backup
 state: ONLINE
  scan: scrub repaired 0 in 0 days 09:00:22 with 0 errors on Mon Dec 17 09:00:25 2018
config:

    NAME                                            STATE     READ WRITE CKSUM
    Backup                                          ONLINE       0     0     0
      raidz1-0                                      ONLINE       0     0     0
        gptid/181101e2-a35b-11e8-aefa-0cc47a9cd5a4  ONLINE       0     0     0
        gptid/18e924eb-a35b-11e8-aefa-0cc47a9cd5a4  ONLINE       0     0     0
        gptid/19a7111b-a35b-11e8-aefa-0cc47a9cd5a4  ONLINE       0     0     0
        gptid/1a9e1915-a35b-11e8-aefa-0cc47a9cd5a4  ONLINE       0     0     0

errors: No known data errors
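
For anyone who wants to check such output automatically, here is a minimal Python sketch that scans zpool status text for devices that are not ONLINE or that have a nonzero READ, WRITE or CKSUM counter. The sample text is trimmed from the example above, and the function names are made up for illustration. It assumes the counters are plain integers; larger counts may be printed in abbreviated form, which this sketch does not handle.

```python
import re

# Sample text trimmed from the example output above; on a live system the
# text would come from running `zpool status` instead.
SAMPLE = """\
  pool: Backup
 state: ONLINE
config:

    NAME                                            STATE     READ WRITE CKSUM
    Backup                                          ONLINE       0     0     0
      raidz1-0                                      ONLINE       0     0     0
        gptid/181101e2-a35b-11e8-aefa-0cc47a9cd5a4  ONLINE       0     0     0
        gptid/18e924eb-a35b-11e8-aefa-0cc47a9cd5a4  ONLINE       0     0     0

errors: No known data errors
"""

# A config row is: indented name, state, then three integer counters.
ROW = re.compile(r"^\s+(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def parse_rows(text):
    """Return a list of (name, state, read, write, cksum) tuples."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            name, state, rd, wr, ck = match.groups()
            rows.append((name, state, int(rd), int(wr), int(ck)))
    return rows


def rows_with_errors(text):
    """Keep only rows that are not ONLINE or have a nonzero counter."""
    return [r for r in parse_rows(text) if r[1] != "ONLINE" or any(r[2:])]


if __name__ == "__main__":
    bad = rows_with_errors(SAMPLE)
    print(bad if bad else "all devices ONLINE with zero error counters")
```

The header row and the "state:" line are skipped because they do not end in three integers.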
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
So if I upgrade to a FreeNAS Mini XL, what hardware can I take to it?
The drives, probably, but we need to know more about what you are using.
Will my current RAIDZ2 pool (8 x 4TB 3.5" drives) 'just work' if I put it into a new system?
Maybe, but we need to see what the health of the drives is.
Can I also put the 2.5" SSDs in for cache/ZIL?
It is possible, but you may not even need those. How are you using the system? What purpose is it serving?
Here is a video that shows the process for installing SSDs and memory:
https://www.youtube.com/watch?v=sIY1HZHa5NE
What's upgradeable on a FreeNAS Mini XL?
Just the SSDs and RAM, and adding data drives.
 

stoth

Cadet
Joined
Jan 7, 2019
Messages
7
Thanks, I'll get the hw data tonight when I'm back at home. If this sounds like a cry for help, it is, because I've been fighting this system for a long time now and every little thing has been a battle.

When I said that the CPU cooler fell off, I mean it was attached properly when I put together the system, then probably through being moved around it fell off. Note that this system had been running for a few years before this happened. I noticed because the system became unstable after extended use and eventually I think the cooler became so loose that it started to overheat very quickly and throw temp warnings after only a few minutes of use.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
I'll get the hw data tonight when I'm back at home. If this sounds like a cry for help, it is, because I've been fighting this system for a long time now and every little thing has been a battle.

Sorry to hear that. With good hardware, FreeNAS should be pretty problem free.
Do you know about how old the drives are?
The worry I have is that the drives might be failing due to age or prolonged exposure to heat.
I noticed because the system became unstable after extended use and eventually I think the cooler became so loose that it started to overheat very quickly and throw temp warnings after only a few minutes of use.
Often, once a component overheats, it never works properly again. If it is a socketed CPU, it might be possible to replace it. If you can share more details, someone can help you get answers. If you decide that you just want to fully replace the old system, it is possible to build a system from parts for less than the cost of a Mini XL.
 

stoth

Cadet
Joined
Jan 7, 2019
Messages
7
Thanks so much for helping me debug this, it's really appreciated.

So I restarted it to get data for the post and it resilvered the 2 disks which had errors. They both currently show CKSUM errors, which is probably a cabling issue. It seems to be OK now, though I haven't restarted any large transfers. When it declared the 2 disks bad I was in the middle of a multi-TB copy into the array. I pulled the smartctl output for the bad drives.

Drives ada0 and ada3 have checksum errors but their smartctl output doesn't show anything. Drive ada4 has previous SMART errors, included below as well.

Hardware
Code:
6 x WD Red 4TB
1 x TOSHIBA N300 4TB (errors logged in smart)
1 x HGST Deskstar NAS HDN724040ALE640 4TB
1 x SATA m2 128GB Flash drive for L2ARC Cache  
(unused) 2 more flash drives (32 GB and 128 GB) for possible ZIL and L2ARC Cache

32GB Ram
Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz
Intel(R) PRO/1000 Gigabit On Motherboard
Marvell 88SE912x AHCI SATA controller
Marvell 88SE9172 AHCI SATA controller
Intel Panther Point AHCI SATA controller


pool status
Code:
  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:07:00 with 0 errors on Mon Jan  7 03:52:00 2019
config:

    NAME        STATE     READ WRITE CKSUM
    freenas-boot  ONLINE       0     0     0
      da0p2     ONLINE       0     0     0

errors: No known data errors

  pool: smdata
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 394G in 0 days 00:51:07 with 0 errors on Mon Jan  7 23:30:21 2019
config:

    NAME                                            STATE     READ WRITE CKSUM
    smdata                                          ONLINE       0     0     0
      raidz2-0                                      ONLINE       0     0     0
        gptid/88d5c1b4-ef75-11e8-b7e8-902b3439aa37  ONLINE       0     0     0
        gptid/2a182a0d-bd2b-11e7-be0d-902b3439aa37  ONLINE       0     0    17
        gptid/2b577a33-bd2b-11e7-be0d-902b3439aa37  ONLINE       0     0     9
        gptid/2c8bc5f0-bd2b-11e7-be0d-902b3439aa37  ONLINE       0     0     0
        gptid/a01bac70-ef3b-11e8-8f20-902b3439aa37  ONLINE       0     0     0
        gptid/2ef861ad-bd2b-11e7-be0d-902b3439aa37  ONLINE       0     0     0
        gptid/30264650-bd2b-11e7-be0d-902b3439aa37  ONLINE       0     0     0
        gptid/31789c08-bd2b-11e7-be0d-902b3439aa37  ONLINE       0     0     0
    cache
      nvd0p1                                        ONLINE       0     0     0

errors: No known data errors

glabel output
Code:
Name  Status  Components
gptid/2b577a33-bd2b-11e7-be0d-902b3439aa37     N/A  ada0p2
                                  gpt/logs     N/A  ada1p1
gptid/c3d2b566-0be1-11e9-a4eb-902b3439aa37     N/A  ada1p1
gptid/c9e9171d-0be1-11e9-a4eb-902b3439aa37     N/A  ada1p2
gptid/6f0958f7-0a23-11e9-bee9-902b3439aa37     N/A  ada2p1
gptid/797aca39-0a23-11e9-bee9-902b3439aa37     N/A  ada2p2
gptid/2a182a0d-bd2b-11e7-be0d-902b3439aa37     N/A  ada3p2
gptid/88d5c1b4-ef75-11e8-b7e8-902b3439aa37     N/A  ada4p2
gptid/2c8bc5f0-bd2b-11e7-be0d-902b3439aa37     N/A  ada5p2
gptid/a01bac70-ef3b-11e8-8f20-902b3439aa37     N/A  ada6p2
gptid/2ef861ad-bd2b-11e7-be0d-902b3439aa37     N/A  ada7p2
gptid/30264650-bd2b-11e7-be0d-902b3439aa37     N/A  ada8p2
gptid/31789c08-bd2b-11e7-be0d-902b3439aa37     N/A  ada9p2
gptid/44a7cade-f049-11e8-98d9-902b3439aa37     N/A  da0p1


ada4 (has smart errors)
Code:
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     TOSHIBA HDWQ140
Serial Number:    28J3K22WFPBE
LU WWN Device Id: 5 000039 85b80216c
Firmware Version: FJ1M
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Jan  8 07:17:57 2019 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (  24)    The self-test routine was aborted by
                    the host.
Total time to complete Offline 
data collection:         (  120) seconds.
Offline data collection
capabilities:              (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 450) minutes.
SCT capabilities:            (0x003d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0027   100   100   001    Pre-fail  Always       -       7110
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       35
  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   050    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   050    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       937
 10 Spin_Retry_Count        0x0033   100   100   030    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       35
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       16
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       20
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       35
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       30 (Min/Max 19/41)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   253   000    Old_age   Always       -       19
220 Disk_Shift              0x0002   100   100   000    Old_age   Always       -       0
222 Loaded_Hours            0x0032   098   098   000    Old_age   Always       -       937
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
224 Load_Friction           0x0022   100   100   000    Old_age   Always       -       0
226 Load-in_Time            0x0026   100   100   000    Old_age   Always       -       552
240 Head_Flying_Hours       0x0001   100   100   001    Pre-fail  Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 19 (device log contains only the most recent five errors)
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 19 occurred at disk power-on lifetime: 907 hours (37 days + 19 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 41 90 27 70 59 40  Error: ICRC, ABRT at LBA = 0x00597027 = 5861415

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 90 b0 28 73 59 40 00      10:17:14.868  WRITE FPDMA QUEUED
  61 00 a8 28 72 59 40 00      10:17:14.868  WRITE FPDMA QUEUED
  61 00 a0 28 71 59 40 00      10:17:14.868  WRITE FPDMA QUEUED
  61 00 98 28 70 59 40 00      10:17:14.868  WRITE FPDMA QUEUED
  61 00 90 28 6f 59 40 00      10:17:14.868  WRITE FPDMA QUEUED

Error 18 occurred at disk power-on lifetime: 906 hours (37 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 41 e0 cf f5 09 40  Error: ICRC, ABRT at LBA = 0x0009f5cf = 652751

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 a0 e0 30 f5 09 40 00      09:32:57.325  WRITE FPDMA QUEUED
  61 00 d8 30 f4 09 40 00      09:32:57.325  WRITE FPDMA QUEUED
  61 00 d0 30 f3 09 40 00      09:32:57.325  WRITE FPDMA QUEUED
  61 00 c8 30 f2 09 40 00      09:32:57.325  WRITE FPDMA QUEUED
  61 00 c0 30 f1 09 40 00      09:32:57.325  WRITE FPDMA QUEUED

Error 17 occurred at disk power-on lifetime: 906 hours (37 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 41 60 e7 32 00 40  Error: ICRC, ABRT at LBA = 0x000032e7 = 13031

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 30 60 b8 32 00 40 00      09:31:03.941  WRITE FPDMA QUEUED
  61 58 58 60 32 00 40 00      09:31:03.938  WRITE FPDMA QUEUED
  61 e8 50 78 31 00 40 00      09:31:03.936  WRITE FPDMA QUEUED
  61 00 48 78 30 00 40 00      09:31:03.936  WRITE FPDMA QUEUED
  61 00 40 78 2f 00 40 00      09:31:03.936  WRITE FPDMA QUEUED

Error 16 occurred at disk power-on lifetime: 905 hours (37 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 41 20 d7 a6 49 40  Error: ICRC, ABRT at LBA = 0x0049a6d7 = 4826839

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 40 28 d8 a6 49 40 00      09:10:44.143  WRITE FPDMA QUEUED
  61 00 20 d8 a5 49 40 00      09:10:44.143  WRITE FPDMA QUEUED
  61 08 18 d0 a5 49 40 00      09:10:44.143  WRITE FPDMA QUEUED
  61 00 10 d0 a4 49 40 00      09:10:44.143  WRITE FPDMA QUEUED
  61 30 08 a0 a4 49 40 00      09:10:44.142  WRITE FPDMA QUEUED

Error 15 occurred at disk power-on lifetime: 905 hours (37 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 41 58 ff d3 4e 40  Error: ICRC, ABRT at LBA = 0x004ed3ff = 5166079

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 10 60 00 d4 4e 40 00      08:44:23.193  WRITE FPDMA QUEUED
  61 00 58 00 d3 4e 40 00      08:44:23.193  WRITE FPDMA QUEUED
  61 30 50 d0 d2 4e 40 00      08:44:23.192  WRITE FPDMA QUEUED
  61 28 48 c0 ca 4e 40 00      08:44:23.188  WRITE FPDMA QUEUED
  61 98 40 10 d2 4e 40 00      08:44:23.187  WRITE FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Aborted by host               80%       908         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


The drives which have checksum errors according to the zpool status command are ada0 and ada3.
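
That mapping can be reproduced with a short Python sketch that joins the gptid labels flagged in zpool status with the glabel output. The rows and CKSUM counts are copied from the output earlier in this post; the function names are made up for illustration, and the partition suffix is assumed to have the form pN.

```python
import re

# A few rows copied from the glabel output (columns: name, status, component).
GLABEL = """\
Name  Status  Components
gptid/2b577a33-bd2b-11e7-be0d-902b3439aa37     N/A  ada0p2
gptid/2a182a0d-bd2b-11e7-be0d-902b3439aa37     N/A  ada3p2
gptid/88d5c1b4-ef75-11e8-b7e8-902b3439aa37     N/A  ada4p2
"""

# CKSUM counts of the two flagged pool members, from the zpool status output.
CKSUM = {
    "gptid/2a182a0d-bd2b-11e7-be0d-902b3439aa37": 17,
    "gptid/2b577a33-bd2b-11e7-be0d-902b3439aa37": 9,
}


def label_to_partition(text):
    """Map each label (first column) to its partition (third column)."""
    mapping = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] != "Name":  # skip the header row
            mapping[parts[0]] = parts[2]
    return mapping


def disk_of(partition):
    """Strip the partition suffix, e.g. ada3p2 -> ada3."""
    return re.sub(r"p\d+$", "", partition)


def cksum_by_disk(glabel_text, cksum):
    """Return {disk: cksum_count} for every flagged label found in glabel."""
    parts = label_to_partition(glabel_text)
    return {disk_of(parts[label]): n for label, n in cksum.items() if label in parts}


if __name__ == "__main__":
    print(cksum_by_disk(GLABEL, CKSUM))
```

With the rows above this reports 17 checksum errors for ada3 and 9 for ada0.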

smartctl for ada0
Code:
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD40EFRX-68WT0N0
Serial Number:    WD-WCC4EFFXURX0
LU WWN Device Id: 5 0014ee 20b0307d7
Firmware Version: 82.00A82
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Jan  8 07:17:07 2019 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:         (52680) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 527) minutes.
Conveyance self-test routine
recommended polling time:      (   5) minutes.
SCT capabilities:            (0x703d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   185   180   021    Pre-fail  Always       -       7750
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       62
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   086   086   000    Old_age   Always       -       10701
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       61
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       36
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2691
194 Temperature_Celsius     0x0022   125   115   000    Old_age   Always       -       27
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


smartctl for ada3
Code:
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD40EFRX-68WT0N0
Serial Number:    WD-WCC4ERVS3TNV
LU WWN Device Id: 5 0014ee 2b5ada23d
Firmware Version: 82.00A82
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Jan  8 07:17:54 2019 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:         (50100) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 501) minutes.
Conveyance self-test routine
recommended polling time:      (   5) minutes.
SCT capabilities:            (0x703d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   183   179   021    Pre-fail  Always       -       7816
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       94
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   068   068   000    Old_age   Always       -       23423
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       93
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       62
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2980
194 Temperature_Celsius     0x0022   124   093   000    Old_age   Always       -       28
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
I agree that the CRC errors on ada4 are likely a cable. You might try reseating the connector on each end or replacing the cable entirely but that is just an annoyance and doesn't appear to be causing serious issues.
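
For reference, the ICRC reading can be checked by hand against the registers logged for ada4. The Python sketch below decodes the error register and rebuilds the LBA from the SN, CL and CH bytes the way the log prints it. The bit names follow the usual ATA error-register layout and only the four bits I am sure of are listed; the function names are made up for illustration.

```python
# Error-register bits as commonly documented for ATA devices.
# Only four bits are listed here; the remaining bits are ignored.
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}


def decode_error(er):
    """Return the names of the known bits that are set in the ER value."""
    return [name for bit, name in ERROR_BITS.items() if er & bit]


def lba_from_registers(sn, cl, ch, dh):
    """Rebuild the 28-bit LBA: low nibble of DH, then CH, CL, SN."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


if __name__ == "__main__":
    # Register line of Error 19: ER ST SC SN CL CH DH = 84 41 90 27 70 59 40
    er, st, sc, sn, cl, ch, dh = (int(x, 16) for x in "84 41 90 27 70 59 40".split())
    print(decode_error(er))                     # ['ICRC', 'ABRT']
    print(lba_from_registers(sn, cl, ch, dh))   # 5861415
```

ICRC is an interface CRC failure on the link between drive and controller, which is why the cable or connector is the usual suspect rather than the platters.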
For drives ada0 and ada3, could you post the output of smartctl using the -x option instead of -a? I think the additional details might answer some questions.

Do you know when the system ran a scrub of the pool last?
 

stoth

Cadet
Joined
Jan 7, 2019
Messages
7
I ran a scrub about 2 weeks ago I think, before the CPU fan fell off. Both these drives have over 1 year of run time.
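
For the record, the run time works out as follows from the Power_On_Hours values in the smartctl output I posted earlier. This is just a quick Python sketch of the arithmetic, assuming a year of 365.25 days; the helper name is made up.

```python
HOURS_PER_YEAR = 24 * 365.25  # 8766 hours, averaging in leap years


def hours_to_years(hours):
    """Convert a Power_On_Hours raw value to years of continuous run time."""
    return hours / HOURS_PER_YEAR


if __name__ == "__main__":
    # Raw Power_On_Hours values taken from the smartctl -a output.
    for name, hours in [("ada0", 10701), ("ada3", 23423), ("ada4", 937)]:
        days = hours / 24
        print(f"{name}: {hours} h = {days:.0f} days = {hours_to_years(hours):.2f} years")
```

That puts ada0 at roughly 1.2 years and ada3 at roughly 2.7 years, while the Toshiba (ada4) has only been running for about 39 days.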


ada3 with -x
Code:
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD40EFRX-68WT0N0
Serial Number:    WD-WCC4ERVS3TNV
LU WWN Device Id: 5 0014ee 2b5ada23d
Firmware Version: 82.00A82
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Jan  8 09:19:52 2019 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:         (50100) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 501) minutes.
Conveyance self-test routine
recommended polling time:      (   5) minutes.
SCT capabilities:            (0x703d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    0
  3 Spin_Up_Time            POS--K   183   179   021    -    7816
  4 Start_Stop_Count        -O--CK   100   100   000    -    94
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
  7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
  9 Power_On_Hours          -O--CK   068   068   000    -    23425
 10 Spin_Retry_Count        -O--CK   100   253   000    -    0
 11 Calibration_Retry_Count -O--CK   100   253   000    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    93
192 Power-Off_Retract_Count -O--CK   200   200   000    -    62
193 Load_Cycle_Count        -O--CK   200   200   000    -    2980
194 Temperature_Celsius     -O---K   124   093   000    -    28
196 Reallocated_Event_Count -O--CK   200   200   000    -    0
197 Current_Pending_Sector  -O--CK   200   200   000    -    0
198 Offline_Uncorrectable   ----CK   100   253   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
200 Multi_Zone_Error_Rate   ---R--   100   253   000    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      5  Comprehensive SMART error log
0x03       GPL     R/O      6  Ext. Comprehensive SMART error log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa0-0xa7  GPL,SL  VS      16  Device vendor specific log
0xa8-0xb7  GPL,SL  VS       1  Device vendor specific log
0xbd       GPL,SL  VS       1  Device vendor specific log
0xc0       GPL,SL  VS       1  Device vendor specific log
0xc1       GPL     VS      93  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
Device Error Count: 2
    CR     = Command Register
    FEATR  = Features Register
    COUNT  = Count (was: Sector Count) Register
    LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
    LH     = LBA High (was: Cylinder High) Register    ]   LBA
    LM     = LBA Mid (was: Cylinder Low) Register      ] Register
    LL     = LBA Low (was: Sector Number) Register     ]
    DV     = Device (was: Device/Head) Register
    DC     = Device Control Register
    ER     = Error register
    ST     = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2 [1] occurred at disk power-on lifetime: 22491 hours (937 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  10 -- 51 00 00 00 00 00 40 01 f8 40 00  Error: IDNF at LBA = 0x004001f8 = 4194808

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 00 08 00 f0 00 00 00 40 01 f8 40 08 17d+09:44:29.889  WRITE FPDMA QUEUED
  ea 00 00 00 00 00 00 00 00 00 00 40 08 17d+09:44:26.999  FLUSH CACHE EXT
  61 00 08 00 e0 00 00 54 9d 5d 40 40 08 17d+09:44:26.876  WRITE FPDMA QUEUED
  61 00 08 00 d8 00 00 54 9d 5c f8 40 08 17d+09:44:26.807  WRITE FPDMA QUEUED
  61 00 10 00 d0 00 01 04 5f 39 e8 40 08 17d+09:44:26.030  WRITE FPDMA QUEUED

Error 1 [0] occurred at disk power-on lifetime: 22491 hours (937 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  10 -- 51 00 00 00 00 54 9b b9 00 40 00  Error: IDNF at LBA = 0x549bb900 = 1419491584

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 00 08 00 98 00 00 54 9b b9 00 40 08 17d+09:09:45.584  WRITE FPDMA QUEUED
  61 00 08 00 90 00 01 04 5e b2 d8 40 08 17d+09:09:45.328  WRITE FPDMA QUEUED
  61 00 08 00 88 00 01 04 5e b2 d0 40 08 17d+09:09:45.312  WRITE FPDMA QUEUED
  61 00 10 00 80 00 00 ac 78 46 90 40 08 17d+09:09:45.299  WRITE FPDMA QUEUED
  61 00 68 00 78 00 00 54 9b b9 00 40 08 17d+09:09:45.261  WRITE FPDMA QUEUED

SMART Extended Self-test Log Version: 1 (1 sectors)
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       258 (0x0102)
SCT Support Level:                   1
Device State:                        Active (0)
Current Temperature:                    28 Celsius
Power Cycle Min/Max Temperature:     23/29 Celsius
Lifetime    Min/Max Temperature:      3/59 Celsius
Under/Over Temperature Limit Count:   0/0
Vendor specific:
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/60 Celsius
Min/Max Temperature Limit:           -41/85 Celsius
Temperature History Size (Index):    478 (176)

Index    Estimated Time   Temperature Celsius
 177    2019-01-08 01:22    28  *********
 ...    ..(476 skipped).    ..  *********
 176    2019-01-08 09:19    28  *********

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

Device Statistics (GP/SMART Log 0x04) not supported

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2            3  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            3  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x8000  4        38520  Vendor specific



ada0 with -x
Code:
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD40EFRX-68WT0N0
Serial Number:    WD-WCC4EFFXURX0
LU WWN Device Id: 5 0014ee 20b0307d7
Firmware Version: 82.00A82
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Jan  8 09:19:56 2019 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:         (52680) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 527) minutes.
Conveyance self-test routine
recommended polling time:      (   5) minutes.
SCT capabilities:            (0x703d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    0
  3 Spin_Up_Time            POS--K   185   180   021    -    7750
  4 Start_Stop_Count        -O--CK   100   100   000    -    62
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
  7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
  9 Power_On_Hours          -O--CK   086   086   000    -    10703
 10 Spin_Retry_Count        -O--CK   100   253   000    -    0
 11 Calibration_Retry_Count -O--CK   100   253   000    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    61
192 Power-Off_Retract_Count -O--CK   200   200   000    -    36
193 Load_Cycle_Count        -O--CK   200   200   000    -    2691
194 Temperature_Celsius     -O---K   125   115   000    -    27
196 Reallocated_Event_Count -O--CK   200   200   000    -    0
197 Current_Pending_Sector  -O--CK   200   200   000    -    0
198 Offline_Uncorrectable   ----CK   100   253   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
200 Multi_Zone_Error_Rate   ---R--   100   253   000    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      5  Comprehensive SMART error log
0x03       GPL     R/O      6  Ext. Comprehensive SMART error log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa0-0xa7  GPL,SL  VS      16  Device vendor specific log
0xa8-0xb7  GPL,SL  VS       1  Device vendor specific log
0xbd       GPL,SL  VS       1  Device vendor specific log
0xc0       GPL,SL  VS       1  Device vendor specific log
0xc1       GPL     VS      93  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
Device Error Count: 2
    CR     = Command Register
    FEATR  = Features Register
    COUNT  = Count (was: Sector Count) Register
    LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
    LH     = LBA High (was: Cylinder High) Register    ]   LBA
    LM     = LBA Mid (was: Cylinder Low) Register      ] Register
    LL     = LBA Low (was: Sector Number) Register     ]
    DV     = Device (was: Device/Head) Register
    DC     = Device Control Register
    ER     = Error register
    ST     = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2 [1] occurred at disk power-on lifetime: 10686 hours (445 days + 6 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  10 -- 51 00 00 00 01 0e a8 7e b0 40 00  Error: IDNF at LBA = 0x10ea87eb0 = 4540890800

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 00 08 00 30 00 01 0e a8 7e b0 40 08     11:00:48.944  WRITE FPDMA QUEUED
  ea 00 00 00 00 00 00 00 00 00 00 40 08     11:00:40.154  FLUSH CACHE EXT
  61 00 08 00 20 00 01 d1 c0 be 00 40 08     11:00:40.153  WRITE FPDMA QUEUED
  61 00 08 00 18 00 01 d1 c0 bc 00 40 08     11:00:40.153  WRITE FPDMA QUEUED
  61 00 08 00 10 00 00 00 40 04 00 40 08     11:00:40.153  WRITE FPDMA QUEUED

Error 1 [0] occurred at disk power-on lifetime: 10686 hours (445 days + 6 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  10 -- 51 00 00 00 00 b2 4e 3b 70 40 00  Error: IDNF at LBA = 0xb24e3b70 = 2991471472

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 08 00 f8 00 00 54 4f 76 18 40 08     10:15:14.339  READ FPDMA QUEUED
  60 00 08 00 f0 00 00 79 b2 5f 10 40 08     10:15:14.338  READ FPDMA QUEUED
  60 00 08 00 e8 00 00 80 49 f2 08 40 08     10:15:14.308  READ FPDMA QUEUED
  60 00 08 00 e0 00 00 80 49 f1 78 40 08     10:15:14.303  READ FPDMA QUEUED
  61 00 30 00 d8 00 00 b2 4e 3b 70 40 08     10:15:14.297  WRITE FPDMA QUEUED

SMART Extended Self-test Log Version: 1 (1 sectors)
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       258 (0x0102)
SCT Support Level:                   1
Device State:                        Active (0)
Current Temperature:                    27 Celsius
Power Cycle Min/Max Temperature:     23/28 Celsius
Lifetime    Min/Max Temperature:     20/37 Celsius
Under/Over Temperature Limit Count:   0/0
Vendor specific:
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/60 Celsius
Min/Max Temperature Limit:           -41/85 Celsius
Temperature History Size (Index):    478 (329)

Index    Estimated Time   Temperature Celsius
 330    2019-01-08 01:22    27  ********
 ...    ..(476 skipped).    ..  ********
 329    2019-01-08 09:19    27  ********

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

Device Statistics (GP/SMART Log 0x04) not supported

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2            2  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            2  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x8000  4        38524  Vendor specific
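For anyone reading the error log entries above: IDNF means "ID not found", i.e. the drive could not locate the sector address it was asked to write. The LBA that smartctl prints on the "Error: IDNF at LBA" line can be reproduced from the register dump. This is only a rough sketch, assuming the six bytes under LBA_48 and LH LM LL are read as one big-endian number (that assumption matches the values printed in the output above). The function name is made up for illustration.

```python
def lba_from_registers(row):
    """Decode the LBA from an 'After command completion' register row.

    Column layout: ER -- ST COUNT(2 bytes) LBA_48(3 bytes) LH LM LL DV DC
    """
    fields = row.split()
    # fields[0]=ER, [1]='--', [2]=ST, [3:5]=COUNT,
    # [5:8]=LBA_48 (upper bytes), [8:11]=LH LM LL (lower bytes)
    lba_bytes = fields[5:11]
    # Assumption: the six bytes form one big-endian hexadecimal number.
    return int("".join(lba_bytes), 16)


# Register rows copied from the ada0 error log above.
print(lba_from_registers("10 -- 51 00 00 00 01 0e a8 7e b0 40 00"))
print(lba_from_registers("10 -- 51 00 00 00 00 b2 4e 3b 70 40 00"))
```

Both results agree with the decimal LBAs smartctl printed for ada0 (4540890800 and 2991471472).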
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Code:
SMART Extended Self-test Log Version: 1 (1 sectors)
No self-tests have been logged.  [To run self-tests, use: smartctl -t]
You have not set up scheduled self-tests on your drives, and no tests have been run.
Here is how you set up the schedule:
https://www.ixsystems.com/documentation/freenas/11/tasks.html#s-m-a-r-t-tests

I do a short test on my drives every day. It takes less than 6 minutes even on my 6TB drives, so just pick a time and do it. I have mine set for 6 AM daily. I also run a long test once a week, scheduled to start at 7 AM, so it begins after the short test is done and runs while I am at work.

For the moment, because you have some issues and we are troubleshooting, I would say it is a good idea to do a long test on all your drives. The tests can all run at the same time because each one happens inside the drive itself. The command to kick that off from the command prompt is this:
smartctl -t long /dev/adaX -- just replace X with the number of the drive you want to test.
It should give you an estimate of how long the test will take. This will check the drive for mechanical defects.
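
If you would rather start the long test on every drive in one go, something like the following sketch could work. It assumes the eight pool drives show up as ada0 through ada7 (only ada0 is confirmed in this thread, so adjust the list to match your system), and the helper names are made up for illustration. It only prints the commands unless you pass run=True.

```python
import subprocess


def long_test_commands(drives):
    """Build one 'smartctl -t long' command per drive name (e.g. 'ada0')."""
    return [["smartctl", "-t", "long", f"/dev/{name}"] for name in drives]


def start_long_tests(drives, run=False):
    """Print each command; only execute it when run=True (needs root)."""
    commands = long_test_commands(drives)
    for cmd in commands:
        print(" ".join(cmd))
        if run:
            # The test runs inside the drive, so smartctl returns right away.
            subprocess.run(cmd, check=False)
    return commands


if __name__ == "__main__":
    # Assumption: drives are ada0..ada7; check your own device names first.
    start_long_tests([f"ada{i}" for i in range(8)])
```

Once the tests have finished, the results show up in the self-test log section of smartctl -a or smartctl -x for each drive.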

Do you have spare drives? You might want to get a couple. Usually, when this kind of thing happens, it is because there is a problem with the drive.

Has the system been running mostly stable?
 

stoth

Cadet
Joined
Jan 7, 2019
Messages
7
Thanks for the tip. I thought I had the SMART long test scheduled, so I tried to reschedule it in the GUI. I also started the long test manually, but it's going to take until tomorrow morning for most of the drives. I will update then.

I do have spare drives, so I will swap one in if any issues are found.

This system has always had issues under heavy load. As long as I kept the load lighter it seemed more stable but under long copies it seems to have issues. This is just my home server so I don't have the best logs. I should really invest in a logging system with a durable store :)
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
This system has always had issues under heavy load. As long as I kept the load lighter it seemed more stable but under long copies it seems to have issues.
It may have temperature issues. If you can give us some details on exactly what the hardware is, it might help us figure out what the problem might be.
Here is a link to a configuration guide that might help you get your system adjusted to work better:

Uncle Fester's Basic FreeNAS Configuration Guide
https://www.familybrown.org/dokuwiki/doku.php?id=fester:intro

Here is a link to some scripts that will help you keep track of your system health:

Github repository for FreeNAS scripts, including disk burnin
https://forums.freenas.org/index.ph...for-freenas-scripts-including-disk-burnin.28/
 

stoth

Cadet
Joined
Jan 7, 2019
Messages
7
I finished the long SMART test on all my drives, and there were no errors on any of them. They all have a section that looks like this:

Code:
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       959         -


To me that seems to mean there are no issues with my drives, which is good! I'll try doing a long copy again and see if I can replicate the errors.
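
To check that result across all the drives without reading each log by eye, a small parser along these lines might do. It is only a sketch: it assumes the log layout shown above (the self-test section printed by smartctl -a or smartctl -l selftest), and the function names are made up for illustration.

```python
import re

# Matches rows such as:
# "# 1  Extended offline    Completed without error       00%       959         -"
ROW = re.compile(
    r"^#\s*(\d+)\s+(.+?)\s{2,}(.+?)\s+(\d+)%\s+(\d+)\s+(\S+)\s*$"
)


def parse_selftest_log(text):
    """Return a list of dicts, one per self-test row found in the text."""
    results = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            num, desc, status, remaining, hours, lba = m.groups()
            results.append({
                "num": int(num),
                "test": desc,
                "status": status,
                "remaining_pct": int(remaining),
                "lifetime_hours": int(hours),
                # smartctl prints "-" when no failing LBA was recorded.
                "first_error_lba": None if lba == "-" else lba,
            })
    return results


def all_passed(text):
    """True only if at least one test is logged and every one completed cleanly."""
    rows = parse_selftest_log(text)
    return bool(rows) and all(
        r["status"] == "Completed without error" for r in rows
    )
```

Feeding it the saved smartctl output for each drive returns True only when every logged test finished without error.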
 

stoth

Cadet
Joined
Jan 7, 2019
Messages
7
So after all this debugging, it's been stable for the last 3 days with no issues. The resilvering went fine, and I even managed to use it like normal, making a backup of some files just in case.

Thanks for all your help with the debugging!
 