Working disk keeps getting detected as faulted (r/w errors)

Speeedymauss

Cadet
Joined
Jun 19, 2020
Messages
7
Hi everyone,
I have a problem with my new FreeNAS box: one disk of my encrypted RAIDZ2 array keeps getting detected as "faulted" because of too many read/write errors, but I know that the disk is fine (completely new, long SMART check was fine). I also replaced the new disk with another new disk (completely new, long SMART check was fine). Every time the NAS starts resilvering the array, I get read and write errors and the disk is marked as faulted. I even tested different slots in the chassis and got the same results, and I also reversed the order of the disks in the chassis and still get the errors with this single disk. I have no idea what else I can do.

EDIT: Disks I'm using: Seagate ST4000LM024

Hardware:
i3 6100
Gigabyte Z170 G1 Gaming Mobo
2x8 GB DDR4 RAM
LSI 9217-4i4e (IT-Mode)
Intel R2224GZ4GC4 as chassis, incl. SAS expander backplane.

Software:
FreeNAS-11.3-U3.2


Any ideas what I can do about it?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
I have no idea what else i can do
How does the cabling look? Or the port on the backplane (assuming you're using the same one for both (good?) disks)?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Sorry, i edited my post and added the HW information

EDIT: disks i'm using: Seagate ST4000LM024

How did I know SMR would be involved here somehow? I'm not necessarily saying that's 100% the root cause, but it's almost certainly an aggravating factor. "Fault following disk" seems to indicate that your cabling/HBA/backplane are fine.

Is this the first resilver operation that was done on this pool?
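
The "fault following disk" check can be written down as a small rule: if the errors stay with one disk across several slots, suspect the disk; if they stay with one slot across several disks, suspect the cabling, backplane, or HBA port. Below is a minimal sketch in Python. The serial numbers and slot numbers are made up purely for illustration and are not taken from this system.

```python
def localize_fault(observations):
    """Guess whether faults track a disk or a slot.

    observations: list of (disk_serial, slot, faulted) tuples, one per test run.
    """
    faulted = [(disk, slot) for disk, slot, bad in observations if bad]
    if not faulted:
        return "no fault observed"
    disks = {disk for disk, _ in faulted}
    slots = {slot for _, slot in faulted}
    if len(disks) == 1 and len(slots) > 1:
        return "fault follows the disk"
    if len(slots) == 1 and len(disks) > 1:
        return "fault follows the slot"
    return "inconclusive"


# Hypothetical test runs: the same disk faults in two different slots,
# while a different disk in one of those slots stays healthy.
runs = [("DISK-A", 6, True), ("DISK-A", 3, True), ("DISK-B", 6, False)]
print(localize_fault(runs))
```

With only one faulted observation, or with faults spread over several disks and several slots, the function deliberately returns "inconclusive" rather than guessing.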
 

Speeedymauss

Cadet
Joined
Jun 19, 2020
Messages
7
How does the cabling look? Or the port on the backplane (assuming you're using the same one for both (good?) disks)?

Cabling should be OK. There are 11 other disks in this server and all of them are fine. The last time I checked the connector it seemed to be good, but it was a quick check. I can double-check that when the resilvering process is done. But I get the same error when I plug the disk into another port.

How did I know SMR would be involved here somehow? I'm not necessarily saying that's 100% the root cause, but it's almost certainly an aggravating factor. "Fault following disk" seems to indicate that your cabling/HBA/backplane are fine.

Is this the first resilver operation that was done on this pool?

I would bet that this is an SMR disk; I don't know how you would get 4 TB into a 2.5" disk without SMR. But so far I haven't had any problems with these drives. I have been using them in my NAS servers for some years.
Because of my testing, this is the 4th (or maybe a later) resilver process, and it's a completely new pool. I built this server in parallel to my old one and just copied the data to it for testing.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I would bet that this is an SMR disk; I don't know how you would get 4 TB into a 2.5" disk without SMR.

"With NAND" - although you only get 3.84T. ;)

Joking aside, the only options are SAS, and even those tend to cap out at around the 2T mark. 2.5" spinning disks have for the most part been relegated to the footnotes of history.

But so far I haven't had any problems with these drives. I have been using them in my NAS servers for some years.
Because of my testing, this is the 4th (or maybe a later) resilver process, and it's a completely new pool. I built this server in parallel to my old one and just copied the data to it for testing.
Check the SMART data and see if there's a firmware mismatch where this misbehaving drive is either newer or older than the rest of them. The workload that ZFS generates is especially bad for SMR drives, and a resilver could be enough to cause the drive to stall for long enough periods to make the controller think it's dropped from the array entirely.
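
One way to do the firmware comparison suggested above is to pull the `Firmware Version:` line out of each drive's `smartctl` report and group the drives by version. The sketch below is only an illustration. It works on report text that has already been captured, and the dictionary keys (`da0`, `da5`) are hypothetical device names, not ones taken from this system.

```python
import re
from collections import defaultdict

# Matches the "Firmware Version:" line of a smartctl information section.
FIRMWARE_RE = re.compile(r"^Firmware Version:\s*(\S+)", re.MULTILINE)


def firmware_version(report):
    """Return the firmware string from one smartctl report, or None."""
    match = FIRMWARE_RE.search(report)
    return match.group(1) if match else None


def group_by_firmware(reports):
    """Map each firmware version to the sorted list of drives reporting it."""
    groups = defaultdict(list)
    for name, text in reports.items():
        groups[firmware_version(text)].append(name)
    return {version: sorted(names) for version, names in groups.items()}


# Hypothetical device names; the report text is abbreviated.
reports = {
    "da0": "Device Model:     ST4000LM024-2AN17V\nFirmware Version: 0001\n",
    "da5": "Device Model:     ST4000LM024-2AN17V\nFirmware Version: 0001\n",
}
groups = group_by_firmware(reports)
print(groups)
if len(groups) > 1:
    print("Firmware mismatch between drives")
```

If every drive lands in a single group, a firmware mismatch can be ruled out as the difference between the misbehaving drive and the rest.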
 

subhuman

Contributor
Joined
Nov 21, 2019
Messages
121
2x8 GB DDR4 RAM
...
There are 11 other disks in this server ...
You're also way under the recommended RAM for that amount of storage space. SMR + low RAM might be compounding the timeout errors.
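
As rough arithmetic behind that remark: a guideline often quoted for FreeNAS is about 1 GB of RAM per TB of raw storage. That guideline is an assumption here (it is not stated in this thread) and it is a rule of thumb rather than a hard limit. The disk count of 12 comes from the "11 other disks" remark plus the faulted one, and all of them are assumed to be 4 TB drives.

```python
def rule_of_thumb_ram_gb(disk_count, disk_size_tb, gb_per_tb=1.0):
    """RAM suggested by an assumed '1 GB per TB of raw storage' guideline."""
    return disk_count * disk_size_tb * gb_per_tb


installed_gb = 2 * 8  # 2x8 GB DDR4 from the hardware list
suggested_gb = rule_of_thumb_ram_gb(disk_count=12, disk_size_tb=4)
print(f"installed {installed_gb} GB, guideline suggests about {suggested_gb:.0f} GB")
```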
 

Speeedymauss

Cadet
Joined
Jun 19, 2020
Messages
7
"With NAND" - although you only get 3.84T. ;)
But that comes at a much higher price.


Check the SMART data and see if there's a firmware mismatch where this misbehaving drive is either newer or older than the rest of them. The workload that ZFS generates is especially bad for SMR drives, and a resilver could be enough to cause the drive to stall for long enough periods to make the controller think it's dropped from the array entirely.

SMART output from the new disk (the newest one):
Code:
smartctl 7.0 2018-12-30 r4883 [FreeBSD 11.3-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 2.5 5400
Device Model:     ST4000LM024-2AN17V
Serial Number:    XXXXX
LU WWN Device Id: 5 000c50 0befcc6b6
Firmware Version: 0001
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5526 rpm
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sat Jun 20 15:16:51 2020 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 669) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x30a5) SCT Status supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   078   071   006    Pre-fail  Always       -       56487232
  3 Spin_Up_Time            0x0003   098   098   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       6
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   100   253   045    Pre-fail  Always       -       295123
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       22 (174 168 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       6
183 SATA_Downshift_Count    0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       1
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   067   067   040    Old_age   Always       -       33 (Min/Max 27/33)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       111
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       127
194 Temperature_Celsius     0x0022   033   040   000    Old_age   Always       -       33 (0 20 0 0 0)
195 Hardware_ECC_Recovered  0x001a   078   071   000    Old_age   Always       -       56487232
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       2 (182 79 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       305607921
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1280138
254 Free_Fall_Sensor        0x0032   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


And here is one of the other disks:
Code:
smartctl 7.0 2018-12-30 r4883 [FreeBSD 11.3-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 2.5 5400
Device Model:     ST4000LM024-2AN17V
Serial Number:    XXXXXX
LU WWN Device Id: 5 000c50 0befd8b24
Firmware Version: 0001
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5526 rpm
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sat Jun 20 15:21:44 2020 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 644) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x30a5) SCT Status supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   081   064   006    Pre-fail  Always       -       127574616
  3 Spin_Up_Time            0x0003   097   097   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       63
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   078   060   045    Pre-fail  Always       -       57611998
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       955 (12 106 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       35
183 SATA_Downshift_Count    0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       18
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   069   055   040    Old_age   Always       -       31 (Min/Max 27/31)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       80
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       712
194 Temperature_Celsius     0x0022   031   045   000    Old_age   Always       -       31 (0 18 0 0 0)
195 Hardware_ECC_Recovered  0x001a   081   064   000    Old_age   Always       -       127574616
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       853 (108 123 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       14167156212
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       19438231640
254 Free_Fall_Sensor        0x0032   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


The firmware is the same on all disks.
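
To compare the two reports attribute by attribute, the table can be parsed into a dictionary keyed by attribute name. This is a minimal sketch that assumes the ten-column layout shown in the output above. The sample rows are copied from the two reports, and the labels `new` and `other` are just placeholders for illustration.

```python
def parse_attributes(report):
    """Parse smartctl attribute rows into {name: raw_value}.

    Only the first token of RAW_VALUE is kept, so "22 (174 168 0)" becomes 22.
    Rows whose raw value is not a plain integer are skipped.
    """
    attributes = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                attributes[fields[1]] = int(fields[9])
            except ValueError:
                continue
    return attributes


new_disk = """\
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       22 (174 168 0)
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       1
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""
other_disk = """\
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       955 (12 106 0)
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       18
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""

for label, report in (("new", new_disk), ("other", other_disk)):
    attrs = parse_attributes(report)
    print(label, attrs["Power_On_Hours"], attrs["Command_Timeout"],
          attrs["UDMA_CRC_Error_Count"])
```

Both drives report a `UDMA_CRC_Error_Count` of 0, and both have logged command timeouts (1 on the new disk, 18 on the other one).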
What I did today: I completely deleted the pool and wiped the disks. Then I swapped disk 1 with disk 6 (the erroring disk) and recreated the pool. Now I'm syncing 12 TB of data as a test.


You're also way under the recommended RAM for that amount of storage space. SMR + low RAM might be compounding the timeout errors.

I have already ordered another 16 GB of RAM to add to the server, but I'm still waiting for it.
 

subhuman

Contributor
Joined
Nov 21, 2019
Messages
121
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
Run both a short and a long test. They probably won't turn up anything, but the thing that "can't possibly be at fault" often ends up being so.
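
For reference, the tests are started with `smartctl -t short` and `smartctl -t long`, and the results show up afterwards in the self-test log (`smartctl -l selftest`). The sketch below only builds the command lines. The device paths are hypothetical and need to be replaced with the real ones, and actually running the commands (for example with `subprocess.run`) requires root.

```python
def selftest_commands(devices, test_type):
    """Build one 'smartctl -t <type> <device>' command per device."""
    if test_type not in ("short", "long"):
        raise ValueError("test_type must be 'short' or 'long'")
    return [["smartctl", "-t", test_type, device] for device in devices]


# Hypothetical device paths; the real ones can be listed with 'smartctl --scan'.
devices = ["/dev/da0", "/dev/da1"]
for command in selftest_commands(devices, "short"):
    print(" ".join(command))
```

Let the short test finish before starting the long one. The recommended polling times in the reports above (1 minute for the short test, roughly 11 hours for the extended one) give an idea of how long to wait.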
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
In addition to the SMART tests suggested by @subhuman - if you can handle a downtime, I would consider doing a few memtest86 runs for CPU and RAM stress-testing - you aren't using what most users here would consider "server-grade" (a "gaming" motherboard with no ECC support) so you may have a component that isn't failing loudly enough to break anything but intermittently enough to cause trouble.
 

Speeedymauss

Cadet
Joined
Jun 19, 2020
Messages
7
Things are getting weirder: after swapping the disks around, the error occurs on a disk that was fine until now (and that I didn't touch), and the previously failing disks are fine.
Maybe it is a fault in the backplane, the SAS expander, or the controller. Is there a good way to test that? I moved the disk to another backplane (the disk cage is split into 3 backplanes) and am running the tests.


Run both a short and a long test. They probably won't turn up anything, but the thing that "can't possibly be at fault" often ends up being so.

The last test was interrupted (host reset), so I restarted it just now. I will provide the output tomorrow (if it has finished by then).

EDIT & Update: All SMART tests are getting interrupted (host reset). The SMART test does not complete on any of the six drives.

In addition to the SMART tests suggested by @subhuman - if you can handle a downtime, I would consider doing a few memtest86 runs for CPU and RAM stress-testing - you aren't using what most users here would consider "server-grade" (a "gaming" motherboard with no ECC support) so you may have a component that isn't failing loudly enough to break anything but intermittently enough to cause trouble.

I can do that when the SMART tests are done. The server is still in testing mode, so I can do anything I want with it.

I know that the hardware is not server-grade, but it's also not a datacenter in my basement; it's personal storage ;)
I had not planned to build the server like this, but the server I had planned on was damaged during transport and was unusable. Unfortunately, the transport company denies any responsibility and I am left sitting on a pile of scrap. I had to use what I still had, and that's how this build came about. I also spent some time looking for an affordable but good combination of mainboard and CPU, but somehow I didn't find anything that I liked and that is reasonably energy-efficient and affordable. If you have a tip, I'd be glad to hear it.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Failure to complete a SMART test would indicate that you might have hardware issues elsewhere. Might need to replace the HBA (or update the firmware?) and/or cables. Hopefully it's not the backplane.

I know that the hardware is not server-grade, but it's also not a datacenter in my basement; it's personal storage ;)
I had not planned to build the server like this, but the server I had planned on was damaged during transport and was unusable. Unfortunately, the transport company denies any responsibility and I am left sitting on a pile of scrap. I had to use what I still had, and that's how this build came about. I also spent some time looking for an affordable but good combination of mainboard and CPU, but somehow I didn't find anything that I liked and that is reasonably energy-efficient and affordable. If you have a tip, I'd be glad to hear it.

Was the backplane part of the server that took a hit during transport? I'd be a bit more suspicious of it in that case.

As far as equipment, a Supermicro X10 board and chip shouldn't be terribly expensive. Here's a filtered link for the Supermicro X10 offerings (single-processor, Socket R3/LGA2011, RDIMMs)

https://www.supermicro.com/en/produ...(Socket R3)&formfactor=ATX,Micro-ATX,microATX

One of these and a Xeon E5-v3 will be relatively cheap, support dirt-cheap DDR3 RDIMMs, and have fairly low idle power consumption as long as you select an appropriate processor (ie: not a high-clocked 12-core)

The X9 is also available for lower cost - but the caution on moving to the "cheaper" Socket H/H2 solutions (eg: X9SCM) is that the RAM will be much more expensive, and limited to 32G total. DDR3 RDIMMs are incredibly cheap (8G sticks for $10 all day long, 16G for $20-25 but check your compatibility and DIMM ranks) but DDR3 UDIMMs are not. (Although if someone can prove me wrong on the latter, feel free to DM me a link to 8G UDIMM sticks in quantity.)
 

Speeedymauss

Cadet
Joined
Jun 19, 2020
Messages
7
Failure to complete a SMART test would indicate that you might have hardware issues elsewhere. Might need to replace the HBA (or update the firmware?) and/or cables. Hopefully it's not the backplane.

I will check whether I have a replacement HBA in stock to test with. I will report back on that later.


The SMART tests are now done. It seems the resilvering process had been stopping the tests.

Results:
One disk, so far without any errors:
Code:
smartctl 7.0 2018-12-30 r4883 [FreeBSD 11.3-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 2.5 5400
Device Model:     ST4000LM024-2AN17V
Serial Number:    XXXXX
LU WWN Device Id: 5 000c50 0befd8b24
Firmware Version: 0001
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5526 rpm
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Jun 23 13:20:41 2020 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 644) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x30a5) SCT Status supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   083   064   006    Pre-fail  Always       -       207549136
  3 Spin_Up_Time            0x0003   097   097   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       64
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   078   060   045    Pre-fail  Always       -       66294638
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1024 (145 29 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       36
183 SATA_Downshift_Count    0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       21
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   069   055   040    Old_age   Always       -       31 (Min/Max 30/35)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       81
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       713
194 Temperature_Celsius     0x0022   031   045   000    Old_age   Always       -       31 (0 18 0 0 0)
195 Hardware_ECC_Recovered  0x001a   083   064   000    Old_age   Always       -       207549136
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       923 (104 161 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       17226890516
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       22772872583
254 Free_Fall_Sensor        0x0032   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      1023         -
# 2  Short offline       Completed without error       00%      1000         -
# 3  Extended offline    Interrupted (host reset)      00%      1000         -
# 4  Extended offline    Interrupted (host reset)      00%       999         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
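
The self-test log in the report above can be summarised with a few lines of Python. This is a sketch that assumes the column layout smartctl printed here (two or more spaces between the description, status, and remaining-percent columns). The sample lines are copied from the log above.

```python
import re
from collections import Counter

# One log row: "# N  <description>  <status>  NN%  <hours>  <LBA>"
LOG_ROW = re.compile(
    r"^#\s*(\d+)\s+(.+?)\s{2,}(.+?)\s{2,}(\d+)%\s+(\d+)\s+(\S+)"
)


def parse_selftest_log(text):
    """Return (number, description, status, lifetime_hours) per log row."""
    rows = []
    for line in text.splitlines():
        match = LOG_ROW.match(line.strip())
        if match:
            num, description, status, _remaining, hours, _lba = match.groups()
            rows.append((int(num), description, status, int(hours)))
    return rows


log = """\
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      1023         -
# 2  Short offline       Completed without error       00%      1000         -
# 3  Extended offline    Interrupted (host reset)      00%      1000         -
# 4  Extended offline    Interrupted (host reset)      00%       999         -
"""
rows = parse_selftest_log(log)
status_counts = Counter(status for _, _, status, _ in rows)
print(status_counts)
```

For this drive the summary shows two extended tests interrupted by a host reset at around 999 to 1000 power-on hours, followed by a short and an extended test that completed without error.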


Another OK disk:
Code:
smartctl 7.0 2018-12-30 r4883 [FreeBSD 11.3-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 2.5 5400
Device Model:     ST4000LM024-2AN17V
Serial Number:    XXXXX
LU WWN Device Id: 5 000c50 0befee05d
Firmware Version: 0001
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5526 rpm
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Jun 23 13:23:46 2020 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 659) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x30a5) SCT Status supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   064   006    Pre-fail  Always       -       650072
  3 Spin_Up_Time            0x0003   097   097   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       15
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   075   060   045    Pre-fail  Always       -       31703327
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       339 (138 190 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       14
183 SATA_Downshift_Count    0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       12
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   069   065   040    Old_age   Always       -       31 (Min/Max 30/35)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       42
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       98
194 Temperature_Celsius     0x0022   031   040   000    Old_age   Always       -       31 (0 18 0 0 0)
195 Hardware_ECC_Recovered  0x001a   100   064   000    Old_age   Always       -       650072
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       338 (153 51 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       11589786134
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       20571135157
254 Free_Fall_Sensor        0x0032   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       338         -
# 2  Short offline       Completed without error       00%       315         -
# 3  Extended offline    Interrupted (host reset)      00%       314         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


Current faulted disk:
Code:
smartctl 7.0 2018-12-30 r4883 [FreeBSD 11.3-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 2.5 5400
Device Model:     ST4000LM024-2AN17V
Serial Number:    XXXXX
LU WWN Device Id: 5 000c50 0cfa14a62
Firmware Version: 0001
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5526 rpm
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Jun 23 13:24:33 2020 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 646) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x30a5) SCT Status supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   080   064   006    Pre-fail  Always       -       91677056
  3 Spin_Up_Time            0x0003   097   097   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       13
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   075   060   045    Pre-fail  Always       -       28699115
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       339 (210 54 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       12
183 SATA_Downshift_Count    0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       17
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   070   065   040    Old_age   Always       -       30 (Min/Max 30/34)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       33
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       92
194 Temperature_Celsius     0x0022   030   040   000    Old_age   Always       -       30 (0 21 0 0 0)
195 Hardware_ECC_Recovered  0x001a   080   064   000    Old_age   Always       -       91677056
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       338 (215 224 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       10622655902
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       17075822391
254 Free_Fall_Sensor        0x0032   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       338         -
# 2  Extended offline    Completed without error       00%       325         -
# 3  Extended offline    Interrupted (host reset)      00%       314         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


The old faulted disk (currently fine):
Code:
smartctl 7.0 2018-12-30 r4883 [FreeBSD 11.3-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 2.5 5400
Device Model:     ST4000LM024-2AN17V
Serial Number:    XXXX
LU WWN Device Id: 5 000c50 0befcc6b6
Firmware Version: 0001
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5526 rpm
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Jun 23 13:25:09 2020 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 669) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x30a5) SCT Status supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   083   065   006    Pre-fail  Always       -       217870136
  3 Spin_Up_Time            0x0003   098   098   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       7
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   070   060   045    Pre-fail  Always       -       10971966
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       92 (49 136 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       7
183 SATA_Downshift_Count    0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       1
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   070   065   040    Old_age   Always       -       30 (Min/Max 28/34)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       112
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       128
194 Temperature_Celsius     0x0022   030   040   000    Old_age   Always       -       30 (0 20 0 0 0)
195 Hardware_ECC_Recovered  0x001a   083   065   000    Old_age   Always       -       217870136
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       72 (33 45 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       3379827345
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       3405880821
254 Free_Fall_Sensor        0x0032   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%        91         -
# 2  Extended offline    Interrupted (host reset)      00%        68         -
# 3  Extended offline    Interrupted (host reset)      00%        68         -
# 4  Extended offline    Interrupted (host reset)      00%        60         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
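
Comparing three long smartctl dumps by eye is error-prone, so here is a minimal Python sketch that pulls the raw values out of the attribute table and lists the counters that are not zero. The watch list (`WATCH`) and the function names are my own choices for illustration, not anything smartctl or FreeNAS defines, and a non-zero value is only a hint about where to look, not a verdict.

```python
import re

# One row of the "Vendor Specific SMART Attributes with Thresholds" table:
# ID, name, flag, value, worst, thresh, type, updated, when_failed, raw value.
# The raw value may be followed by extra text in parentheses, which is ignored.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"\S+\s+\S+\s+\S+\s+(\d+)"
)

# Hypothetical watch list: counters one would normally hope to see at zero.
WATCH = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
    "Command_Timeout",
)


def parse_attributes(text):
    """Return {attribute_name: raw_value} for every attribute row in text."""
    attrs = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            attrs[m.group(2)] = int(m.group(6))
    return attrs


def nonzero_watch(text):
    """Return the watched attributes whose raw value is not zero."""
    attrs = parse_attributes(text)
    return {name: attrs[name] for name in WATCH if attrs.get(name, 0) != 0}


# Three rows copied from the "Current faulted disk" output above.
sample = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       17
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""
print(nonzero_watch(sample))  # {'Command_Timeout': 17}
```

Run against the three dumps above, the only watched counter that differs is Command_Timeout (12, 17 and 1), while the reallocated, pending and CRC counters are all 0.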




Was the backplane part of the server that took a hit during transport? I'd be a bit more suspicious of it in that case.

Not noticeably damaged, at least. Visually it looks okay and, apart from the problems described above, everything works. I also have 6 SAS drives in it, which run without any problems.


Many thanks for the HW recommendations! I'll see if I can find some at a reasonable price.
 

Speeedymauss

Cadet
Joined
Jun 19, 2020
Messages
7
Update:
I replaced the controller and the internal SAS cable and added 16 GB of memory.

After that I ran the memtest86 tests:
Memtest86.png


The system itself seems to be good.

After that I reconfigured my zpool and started the copy process. So far I have copied 6 TB of data without any errors.

I don't know whether the controller or the cable was faulty, or whether the extra RAM helped, but currently it seems to work fine.
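
To keep an eye on the pool during a long copy, a small Python sketch like the following could report any device whose READ/WRITE/CKSUM counters in `zpool status` are not zero. It assumes the usual five-column device table (NAME STATE READ WRITE CKSUM) followed by a blank line; the function names are hypothetical, and the counters are compared as strings because `zpool status` may abbreviate large numbers.

```python
import subprocess

HEADER = ["NAME", "STATE", "READ", "WRITE", "CKSUM"]


def read_zpool_status(pool):
    """Run `zpool status <pool>` and return its text output."""
    result = subprocess.run(
        ["zpool", "status", pool],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def devices_with_errors(status_text):
    """Return {device: (read, write, cksum)} for rows with a non-zero counter."""
    bad = {}
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:5] == HEADER:
            in_table = True          # device rows start after the header
            continue
        if not in_table:
            continue
        if not fields:
            in_table = False         # assumed: a blank line ends the table
            continue
        if len(fields) < 5:
            continue                 # section labels such as "logs" or "spares"
        name, _state, read, write, cksum = fields[:5]
        if (read, write, cksum) != ("0", "0", "0"):
            bad[name] = (read, write, cksum)
    return bad
```

Calling `devices_with_errors(read_zpool_status("yourpool"))` every few minutes during the copy (with your own pool name) should return an empty dict as long as no device has logged errors.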
 