Hi,
Today one of my pools switched to a degraded state because one of the SSDs is causing read/write errors. I ran a scrub and it looks like the data is fine.
I haven't opened the server in weeks, and the pool has been working fine for nearly a year. All 3 SSDs in the pool are Intel DC S3710 400GB drives connected to a Dell PERC H310 flashed to IT mode, on a Supermicro X10SSL-F motherboard with 32GB ECC RAM and a Xeon E3-1230v3.
Here is the output of smartctl:
Code:
root@MainNAS[~]# smartctl -a /dev/da1
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p10 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Intel 730 and DC S35x0/3610/3700 Series SSDs
Device Model:     INTEL SSDSC2BA400G4
Serial Number:    XXXXXXX
LU WWN Device Id: 5 5cd2e4 14db3ebfa
Firmware Version: G2010170
User Capacity:    400,088,457,216 bytes [400 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s
Local Time is:    Thu Oct 14 08:51:03 2021 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Total time to complete Offline
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x00) Offline data collection not supported.
SMART capabilities:            (0x0000) Automatic saving of SMART data is not implemented.
Error logging capability:        (0x00) Error logging supported.
                                        General Purpose Logging supported.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

Read SMART Error Log failed: scsi error aborted command

Read SMART Self-test Log failed: scsi error aborted command

Selective Self-tests/Logging not supported
And the output of zpool status:
Code:
  pool: SSDpool2
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0B in 00:11:11 with 0 errors on Thu Oct 14 08:50:54 2021
config:

        NAME                                                STATE     READ WRITE CKSUM
        SSDpool2                                            DEGRADED     0     0     0
          raidz1-0                                          DEGRADED     0     0     0
            gptid/96651e4b-f6ce-11ea-8b6d-6805ca1f5bda.eli  ONLINE       0     0     0
            gptid/967d3c9d-f6ce-11ea-8b6d-6805ca1f5bda.eli  FAULTED      6   188     0  too many errors
            gptid/9678c860-f6ce-11ea-8b6d-6805ca1f5bda.eli  ONLINE       0     0     0

errors: No known data errors
And this has been all over /var/log/messages since 00:03 today (at 00:07 Zabbix logged that the pool switched to degraded):
Code:
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): WRITE(10). CDB: 2a 00 28 3e 58 a8 00 00 08 00
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): CAM status: SCSI Status Error
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): SCSI status: Check Condition
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): Retrying command (per sense data)
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): WRITE(10). CDB: 2a 00 28 3e 58 a8 00 00 08 00
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): CAM status: SCSI Status Error
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): SCSI status: Check Condition
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): Info: 0
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): Error 22, Unretryable error
Oct 14 00:03:07 MainNAS GEOM_ELI: g_eli_write_done() failed (error=22) gptid/967d3c9d-f6ce-11ea-8b6d-6805ca1f5bda.eli[WRITE(offset=343541829632, length=4096)]
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): READ(10). CDB: 28 00 00 40 02 90 00 00 10 00
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): CAM status: SCSI Status Error
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): SCSI status: Check Condition
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): Info: 0
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): Error 22, Unretryable error
Oct 14 00:03:07 MainNAS GEOM_ELI: g_eli_read_done() failed (error=22) gptid/967d3c9d-f6ce-11ea-8b6d-6805ca1f5bda.eli[READ(offset=270336, length=8192)]
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): READ(10). CDB: 28 00 2e 93 8c 90 00 00 10 00
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): CAM status: SCSI Status Error
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): SCSI status: Check Condition
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): Info: 0
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): Error 22, Unretryable error
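To sanity-check the "Logical block address out of range" sense data, here is a minimal Python sketch that decodes the three failing CDBs from the log and compares them against the capacity smartctl reports. It assumes the standard 10-byte READ(10)/WRITE(10) layout (opcode in byte 0, big-endian LBA in bytes 2-5, transfer length in bytes 7-8) and the 512-byte logical sector size shown by smartctl. The names `decode_rw10` and `in_range` are just illustrative helpers, not part of any TrueNAS or FreeBSD tool.

```python
# User Capacity and logical sector size as reported by smartctl for da1.
CAPACITY_BYTES = 400_088_457_216
SECTOR_BYTES = 512
TOTAL_SECTORS = CAPACITY_BYTES // SECTOR_BYTES

# SCSI opcodes for the two 10-byte commands that appear in the log.
OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}


def decode_rw10(cdb_hex):
    """Decode a READ(10)/WRITE(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] not in OPCODES:
        raise ValueError(f"not a READ(10)/WRITE(10) CDB: {cdb_hex!r}")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: starting LBA
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in blocks
    return OPCODES[cdb[0]], lba, blocks


def in_range(lba, blocks, total=TOTAL_SECTORS):
    """True if the whole transfer ends at or before the last reported sector."""
    return lba + blocks <= total


# CDBs copied from the /var/log/messages excerpt.
FAILED_CDBS = [
    "2a 00 28 3e 58 a8 00 00 08 00",
    "28 00 00 40 02 90 00 00 10 00",
    "28 00 2e 93 8c 90 00 00 10 00",
]

for cdb in FAILED_CDBS:
    op, lba, blocks = decode_rw10(cdb)
    status = "within" if in_range(lba, blocks) else "beyond"
    print(f"{op}: LBA {lba}, {blocks} blocks, {status} the reported capacity")
```

If the arithmetic holds, all three requests fall inside the 781,422,768 sectors that smartctl reports, so the commands themselves do not seem to address blocks past the end of the disk. My tentative reading is that the drive (or the path to it) started answering differently after the reset shown in the UNIT ATTENTION line, but that is only a guess.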
Is the drive just dying, is this a problem with the HBA, or might it be a software problem caused by the TrueNAS 12.0-U6 update?
I'm not sure how to interpret that error.