Hi,
Long-time FreeNAS user, first-time hard drive issue.
Basically, I've got 6 Seagate Exos X16 (ST16000NM002G) drives in a RAIDZ2 configuration inside a Supermicro 2U machine with a BPN-SAS3-826EL1 backplane. This setup has been working just fine for 6 months.
I received an email alert containing:
Code:
Device: /dev/da2, SMART Failure: DATA CHANNEL IMPENDING FAILURE DATA ERROR RATE TOO HIGH.
So I logged into the machine to check. The syslog is full of CAM error messages originating from one specific drive (/dev/da2):
Code:
Sep 2 00:00:00 nas newsyslog[49055]: logfile turned over due to size>200K
Sep 2 00:00:00 nas.redacted.com syslog-ng[2194]: Configuration reload request received, reloading configuration;
Sep 2 00:00:00 nas.redacted.com syslog-ng[2194]: Configuration reload finished;
Sep 2 01:05:24 nas.redacted.com (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Sep 2 01:05:24 nas.redacted.com (da2:mpr0:0:10:0): CAM status: SCSI Status Error
Sep 2 01:05:24 nas.redacted.com (da2:mpr0:0:10:0): SCSI status: Check Condition
Sep 2 01:05:24 nas.redacted.com (da2:mpr0:0:10:0): SCSI sense: HARDWARE FAILURE asc:42,0 (Power-on or self-test failure)
Sep 2 01:05:24 nas.redacted.com (da2:mpr0:0:10:0): Info: 0
Sep 2 01:05:24 nas.redacted.com (da2:mpr0:0:10:0): Field Replaceable Unit: 20
Sep 2 01:05:24 nas.redacted.com (da2:mpr0:0:10:0): Descriptor 0x80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Sep 2 01:05:24 nas.redacted.com (da2:mpr0:0:10:0): Retrying command (per sense data)
Sep 2 02:22:46 nas.redacted.com (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Sep 2 02:22:46 nas.redacted.com (da2:mpr0:0:10:0): CAM status: SCSI Status Error
Sep 2 02:22:46 nas.redacted.com (da2:mpr0:0:10:0): SCSI status: Check Condition
Sep 2 02:22:46 nas.redacted.com (da2:mpr0:0:10:0): SCSI sense: HARDWARE FAILURE asc:42,0 (Power-on or self-test failure)
Sep 2 02:22:46 nas.redacted.com (da2:mpr0:0:10:0): Info: 0
Sep 2 02:22:46 nas.redacted.com (da2:mpr0:0:10:0): Field Replaceable Unit: 20
Sep 2 02:22:46 nas.redacted.com (da2:mpr0:0:10:0): Descriptor 0x80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Sep 2 02:22:46 nas.redacted.com (da2:mpr0:0:10:0): Retrying command (per sense data)
Sep 2 02:33:49 nas.redacted.com (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Sep 2 02:33:49 nas.redacted.com (da2:mpr0:0:10:0): CAM status: SCSI Status Error
Sep 2 02:33:49 nas.redacted.com (da2:mpr0:0:10:0): SCSI status: Check Condition
Sep 2 02:33:49 nas.redacted.com (da2:mpr0:0:10:0): SCSI sense: HARDWARE FAILURE asc:42,0 (Power-on or self-test failure)
Sep 2 02:33:49 nas.redacted.com (da2:mpr0:0:10:0): Info: 0
...
Sep 3 11:15:48 nas.redacted.com (da2:mpr0:0:10:0): CAM status: SCSI Status Error
Sep 3 11:15:48 nas.redacted.com (da2:mpr0:0:10:0): SCSI status: Check Condition
Sep 3 11:15:48 nas.redacted.com (da2:mpr0:0:10:0): SCSI sense: Deferred error: MEDIUM ERROR asc:c,2 (Write error - auto reallocation failed)
Sep 3 11:15:48 nas.redacted.com (da2:mpr0:0:10:0): Info: 0x38595b9a7
Sep 3 11:15:48 nas.redacted.com (da2:mpr0:0:10:0): Field Replaceable Unit: 6
Sep 3 11:15:48 nas.redacted.com (da2:mpr0:0:10:0): Actual Retry Count: 255
Sep 3 11:15:48 nas.redacted.com (da2:mpr0:0:10:0): Descriptor 0x80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Sep 3 11:15:48 nas.redacted.com (da2:mpr0:0:10:0): Retrying command (per sense data)
Sep 3 11:16:10 nas.redacted.com collectd[1560]: Traceback (most recent call last):
File "/usr/local/lib/collectd_pyplugins/disktemp.py", line 61, in read
temperatures = c.call('disk.temperatures', self.disks, self.powermode, self.smartctl_args)
File "/usr/local/lib/collectd_pyplugins/disktemp.py", line 61, in read
temperatures = c.call('disk.temperatures', self.disks, self.powermode, self.smartctl_args)
File "/usr/local/lib/python3.7/site-packages/middlewared/client/client.py", line 386, in call
raise CallTimeout("Call timeout")
middlewared.client.client.CallTimeout: Call timeout
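To get a quick tally of which sense errors the drive is reporting, a log like the one above can be summarized with a few lines of Python. This is only a rough sketch: the regular expression is written against the CAM line format shown in this post, and the function name `count_sense_errors` is made up for illustration.

```python
import re
from collections import Counter

# Matches CAM lines in the format shown in the log above, e.g.
# "(da2:mpr0:0:10:0): SCSI sense: HARDWARE FAILURE asc:42,0 (Power-on or self-test failure)"
SENSE_RE = re.compile(
    r"\((?P<dev>[a-z]+\d+):[^)]*\): SCSI sense: "
    r"(?P<sense>.+?) asc:(?P<asc>[0-9a-f]+),(?P<ascq>[0-9a-f]+)"
)


def count_sense_errors(lines):
    """Count SCSI sense reports per (device, sense text, asc, ascq).

    `lines` is any iterable of syslog lines; lines without a
    "SCSI sense:" entry are ignored.
    """
    counts = Counter()
    for line in lines:
        match = SENSE_RE.search(line)
        if match:
            key = (match["dev"], match["sense"], match["asc"], match["ascq"])
            counts[key] += 1
    return counts
```

It could be fed an open log file, for example `count_sense_errors(open("/var/log/messages"))`, where the path is an assumption about where syslog writes on this system. The result shows at a glance whether the errors all come from one device and how the HARDWARE FAILURE and MEDIUM ERROR reports compare in number.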
Checking the specific disk reveals the following:
Code:
# smartctl -a /dev/da2
smartctl 7.0 2018-12-30 r4883 [FreeBSD 11.3-RELEASE-p11 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: SEAGATE
Product: ST16000NM002G
Revision: E003
Compliance: SPC-5
User Capacity: 16,000,900,661,248 bytes [16.0 TB]
Logical block size: 512 bytes
Physical block size: 4096 bytes
LU is fully provisioned
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Logical Unit id: 0x5000c500af1fa897
Serial number: REDACTED
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Thu Sep 3 11:32:01 2020 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled
=== START OF READ SMART DATA SECTION ===
SMART Health Status: DATA CHANNEL IMPENDING FAILURE DATA ERROR RATE TOO HIGH [asc=5d, ascq=32]
Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned = 295
Power on minutes since format <not available>
Current Drive Temperature: 37 C
Drive Trip Temperature: 60 C
Manufactured in week 04 of year 2020
Specified cycle count over device lifetime: 50000
Accumulated start-stop cycles: 109
Specified load-unload count over device lifetime: 600000
Accumulated load-unload cycles: 284
Elements in grown defect list: 309
Vendor (Seagate Cache) information
Blocks sent to initiator = 2066229080
Blocks received from initiator = 1150438328
Blocks read from cache and sent to initiator = 1046461852
Number of read and write commands whose size <= segment size = 57891915
Number of read and write commands whose size > segment size = 1865688
Vendor (Seagate/Hitachi) factory information
number of hours powered up = 3678.77
number of minutes until next internal SMART test = 50
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0      36208.974           0
write:         0        0         0         0          0       5017.512           3
verify:        0       30         0        30        197         33.307           0
Non-medium error count: 8
[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
SMART Self-test log
Num Test Status segment LifeTime LBA_first_err [SK ASC ASQ]
Description number (hours)
# 1 Background short Completed - 3677 - [- - -]
# 2 Background short Completed - 3667 - [- - -]
# 3 Background short Completed - 3595 - [- - -]
# 4 Background short Completed - 3547 - [- - -]
# 5 Background short Completed - 3499 - [- - -]
# 6 Background long Completed - 3477 - [- - -]
# 7 Background short Completed - 3451 - [- - -]
# 8 Background short Completed - 3403 - [- - -]
# 9 Background short Completed - 3355 - [- - -]
#10 Background short Completed - 3307 - [- - -]
#11 Background short Completed - 3259 - [- - -]
#12 Background short Completed - 3211 - [- - -]
#13 Background short Completed - 3163 - [- - -]
#14 Background long Completed - 3141 - [- - -]
#15 Background short Completed - 3115 - [- - -]
#16 Background short Completed - 3067 - [- - -]
#17 Background short Completed - 3019 - [- - -]
#18 Background short Completed - 2971 - [- - -]
#19 Background short Completed - 2923 - [- - -]
#20 Background short Completed - 2851 - [- - -]
Long (extended) Self-test duration: 65535 seconds [1092.2 minutes]
ZFS pool seems healthy and has ONLINE status.
Is the disk broken, so that I should request a replacement, or is there anything I should try first?
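For anyone who wants to pull the relevant numbers out of `smartctl -a` output like the one above, here is a minimal Python sketch. It assumes the SAS/SCSI text layout shown in this post. The function names `summarize_smart` and `warning_signs` are hypothetical, and treating any nonzero grown defect count as a warning sign is an illustrative choice of mine, not a vendor threshold.

```python
import re

# Field labels as they appear in the smartctl output in this post.
FIELD_PATTERNS = {
    "health": r"^[ \t]*SMART Health Status:[ \t]*(.+?)[ \t]*$",
    "grown_defects": r"^[ \t]*Elements in grown defect list:[ \t]*(\d+)",
    "reassigned_blocks": r"^[ \t]*Total new blocks reassigned[ \t]*=[ \t]*(\d+)",
    "non_medium_errors": r"^[ \t]*Non-medium error count:[ \t]*(\d+)",
}

# Rows of the "Error counter log" table; the last column is
# "Total uncorrected errors".
ERROR_ROW = r"^[ \t]*(read|write|verify):[ \t]+(\S.*)$"


def summarize_smart(text):
    """Extract a few failure indicators from `smartctl -a` text."""
    summary = {"uncorrected": {}}
    for key, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, text, re.MULTILINE)
        if match:
            value = match.group(1)
            summary[key] = int(value) if value.isdigit() else value
    for match in re.finditer(ERROR_ROW, text, re.MULTILINE):
        columns = match.group(2).split()
        summary["uncorrected"][match.group(1)] = int(columns[-1])
    return summary


def warning_signs(summary):
    """List reasons for concern (illustrative thresholds, see lead-in)."""
    signs = []
    health = summary.get("health")
    if health is not None and health != "OK":
        signs.append(f"health status: {health}")
    if summary.get("grown_defects", 0) > 0:
        signs.append(f"grown defect list has {summary['grown_defects']} entries")
    for operation, count in summary["uncorrected"].items():
        if count > 0:
            signs.append(f"{count} uncorrected {operation} errors")
    return signs
```

Passing the captured output through `warning_signs(summarize_smart(output))` gives a short list of the indicators that stand out. A sketch like this only restates what the drive reports, so it does not replace the drive's own health verdict.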