Hi,
Today one of my pools switched to a degraded state because one of the SSDs is causing read/write errors. I ran a scrub and the data appears to be fine.
I haven't opened the server in weeks, and the pool has been working fine for nearly a year. All 3 SSDs in the pool are Intel DC S3710 400GB drives connected to a Dell PERC H310 flashed to IT mode, on a Supermicro X10SSL-F motherboard with 32GB ECC RAM and a Xeon E3-1230v3.
Here is the output of smartctl:
Code:
root@MainNAS[~]# smartctl -a /dev/da1
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p10 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Intel 730 and DC S35x0/3610/3700 Series SSDs
Device Model: INTEL SSDSC2BA400G4
Serial Number: XXXXXXX
LU WWN Device Id: 5 5cd2e4 14db3ebfa
Firmware Version: G2010170
User Capacity: 400,088,457,216 bytes [400 GB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
TRIM Command: Available, deterministic, zeroed
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s
Local Time is: Thu Oct 14 08:51:03 2021 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x00) Offline data collection not supported.
SMART capabilities: (0x0000) Automatic saving of SMART data is not implemented.
Error logging capability: (0x00) Error logging supported.
General Purpose Logging supported.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
Read SMART Error Log failed: scsi error aborted command
Read SMART Self-test Log failed: scsi error aborted command
Selective Self-tests/Logging not supported
And the output of zpool status:
Code:
pool: SSDpool2
state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
repaired.
scan: scrub repaired 0B in 00:11:11 with 0 errors on Thu Oct 14 08:50:54 2021
config:
        NAME                                                STATE     READ WRITE CKSUM
        SSDpool2                                            DEGRADED     0     0     0
          raidz1-0                                          DEGRADED     0     0     0
            gptid/96651e4b-f6ce-11ea-8b6d-6805ca1f5bda.eli  ONLINE       0     0     0
            gptid/967d3c9d-f6ce-11ea-8b6d-6805ca1f5bda.eli  FAULTED      6   188     0  too many errors
            gptid/9678c860-f6ce-11ea-8b6d-6805ca1f5bda.eli  ONLINE       0     0     0
errors: No known data errors
And this is all over /var/log/messages since today at 00:03 AM (at 00:07 zabbix logged that the pool switched to degraded):
Code:
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): WRITE(10). CDB: 2a 00 28 3e 58 a8 00 00 08 00
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): CAM status: SCSI Status Error
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): SCSI status: Check Condition
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): Retrying command (per sense data)
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): WRITE(10). CDB: 2a 00 28 3e 58 a8 00 00 08 00
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): CAM status: SCSI Status Error
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): SCSI status: Check Condition
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): Info: 0
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): Error 22, Unretryable error
Oct 14 00:03:07 MainNAS GEOM_ELI: g_eli_write_done() failed (error=22) gptid/967d3c9d-f6ce-11ea-8b6d-6805ca1f5bda.eli[WRITE(offset=343541829632, length=4096)]
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): READ(10). CDB: 28 00 00 40 02 90 00 00 10 00
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): CAM status: SCSI Status Error
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): SCSI status: Check Condition
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): Info: 0
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): Error 22, Unretryable error
Oct 14 00:03:07 MainNAS GEOM_ELI: g_eli_read_done() failed (error=22) gptid/967d3c9d-f6ce-11ea-8b6d-6805ca1f5bda.eli[READ(offset=270336, length=8192)]
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): READ(10). CDB: 28 00 2e 93 8c 90 00 00 10 00
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): CAM status: SCSI Status Error
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): SCSI status: Check Condition
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): Info: 0
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): Error 22, Unretryable error
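To sanity-check the "Logical block address out of range" messages, the CDBs in the log can be decoded by hand. Below is a minimal Python sketch, not a diagnostic tool. It assumes the standard 10-byte READ(10)/WRITE(10) layout (opcode in byte 0, big-endian LBA in bytes 2-5, transfer length in bytes 7-8) and assumes the LBAs count 512-byte logical sectors, as smartctl reports for this drive. The function and constant names are made up for illustration.

```python
# Capacity and logical sector size are copied from the smartctl output above.
CAPACITY_BYTES = 400_088_457_216
LOGICAL_SECTOR = 512  # assumption: CDB LBAs are in 512-byte logical sectors
TOTAL_SECTORS = CAPACITY_BYTES // LOGICAL_SECTOR

OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}


def decode_rw10(cdb_hex):
    """Return (name, lba, blocks) for a 10-byte READ/WRITE CDB given as hex."""
    cdb = bytes.fromhex(cdb_hex)  # spaces between the byte pairs are accepted
    if len(cdb) != 10 or cdb[0] not in OPCODES:
        raise ValueError("not a READ(10)/WRITE(10) CDB: %r" % cdb_hex)
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return OPCODES[cdb[0]], lba, blocks


def in_range(lba, blocks, total_sectors=TOTAL_SECTORS):
    """True if the whole transfer ends at or before the last advertised sector."""
    return lba + blocks <= total_sectors


if __name__ == "__main__":
    # The three distinct CDBs that appear in /var/log/messages above.
    for cdb in (
        "2a 00 28 3e 58 a8 00 00 08 00",
        "28 00 00 40 02 90 00 00 10 00",
        "28 00 2e 93 8c 90 00 00 10 00",
    ):
        name, lba, blocks = decode_rw10(cdb)
        print(name, lba, blocks, in_range(lba, blocks))
```

Decoded this way, all three requests end below the 781,422,768 sectors the drive advertises, and the block counts (8 and 16) match the 4096 and 8192 byte lengths in the GEOM_ELI lines. That would suggest the requests themselves were valid and the drive, or the path to it, answered oddly after the reset reported in the UNIT ATTENTION line. This is an inference from the numbers, not a confirmed diagnosis.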
Is the drive simply dying, is this a problem with the HBA, or could it be a software problem caused by the TrueNAS 12.0-U6 update?
I'm not sure how to interpret that error.