SSD dying?

Dunuin

Contributor
Joined
Mar 7, 2013
Messages
110
Hi,

Today one of my pools switched to a degraded state because one of the SSDs is causing read/write errors. I ran a scrub and it looks like the data is fine.
I haven't opened the server for weeks, and the pool has been working fine for nearly a year now. All 3 SSDs in the pool are Intel DC S3710 400GB drives connected to a Dell PERC H310 flashed to IT mode, on a Supermicro X10SSL-F motherboard with 32GB ECC RAM and a Xeon E3-1230v3.

Here is the output of smartctl:

Code:
root@MainNAS[~]# smartctl -a /dev/da1
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p10 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Intel 730 and DC S35x0/3610/3700 Series SSDs
Device Model:     INTEL SSDSC2BA400G4
Serial Number:    XXXXXXX
LU WWN Device Id: 5 5cd2e4 14db3ebfa
Firmware Version: G2010170
User Capacity:    400,088,457,216 bytes [400 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s
Local Time is:    Thu Oct 14 08:51:03 2021 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Total time to complete Offline
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x00) Offline data collection not supported.
SMART capabilities:            (0x0000) Automatic saving of SMART data
                                        is not implemented.
Error logging capability:        (0x00) Error logging supported.
                                        General Purpose Logging supported.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

Read SMART Error Log failed: scsi error aborted command

Read SMART Self-test Log failed: scsi error aborted command

Selective Self-tests/Logging not supported


And the output of zpool status:
Code:
  pool: SSDpool2
state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0B in 00:11:11 with 0 errors on Thu Oct 14 08:50:54 2021
config:

        NAME                                                STATE     READ WRITE CKSUM
        SSDpool2                                            DEGRADED     0     0     0
          raidz1-0                                          DEGRADED     0     0     0
            gptid/96651e4b-f6ce-11ea-8b6d-6805ca1f5bda.eli  ONLINE       0     0     0
            gptid/967d3c9d-f6ce-11ea-8b6d-6805ca1f5bda.eli  FAULTED      6   188     0  too many errors
            gptid/9678c860-f6ce-11ea-8b6d-6805ca1f5bda.eli  ONLINE       0     0     0

errors: No known data errors


And this has been all over /var/log/messages since 00:03 today (at 00:07 Zabbix logged that the pool switched to degraded):
Code:
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): WRITE(10). CDB: 2a 00 28 3e 58 a8 00 00 08 00
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): CAM status: SCSI Status Error
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): SCSI status: Check Condition
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): Retrying command (per sense data)
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): WRITE(10). CDB: 2a 00 28 3e 58 a8 00 00 08 00
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): CAM status: SCSI Status Error
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): SCSI status: Check Condition
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): Info: 0
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): Error 22, Unretryable error
Oct 14 00:03:07 MainNAS GEOM_ELI: g_eli_write_done() failed (error=22) gptid/967d3c9d-f6ce-11ea-8b6d-6805ca1f5bda.eli[WRITE(offset=343541829632, length=4096)]
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): READ(10). CDB: 28 00 00 40 02 90 00 00 10 00
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): CAM status: SCSI Status Error
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): SCSI status: Check Condition
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): Info: 0
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): Error 22, Unretryable error
Oct 14 00:03:07 MainNAS GEOM_ELI: g_eli_read_done() failed (error=22) gptid/967d3c9d-f6ce-11ea-8b6d-6805ca1f5bda.eli[READ(offset=270336, length=8192)]
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): READ(10). CDB: 28 00 2e 93 8c 90 00 00 10 00
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): CAM status: SCSI Status Error
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): SCSI status: Check Condition
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): Info: 0
Oct 14 00:03:07 MainNAS (da1:mps0:0:6:0): Error 22, Unretryable error
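
Decoding the first WRITE(10) CDB from that log as a sanity check (in a WRITE(10) CDB, bytes 2-5 are the LBA and bytes 7-8 the transfer length; Bourne-style shell):
Code:
# 2a 00 | 28 3e 58 a8 | 00 | 00 08 | 00   (opcode | LBA | group | length | control)
printf '%d\n' 0x283e58a8          # LBA of the rejected write -> 675174568
echo $((400088457216 / 512))      # drive capacity in 512-byte sectors -> 781422768

So the rejected LBA sits well inside the drive's addressable range, which makes the "Logical block address out of range" response look like it comes from a confused drive rather than from a bad request.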


Is the drive just dying, is it a problem with the HBA, or might it be a software problem caused by the TrueNAS 12.0-U6 update?

I'm not sure how to interpret that error.
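To help narrow it down, comparing the suspect drive against one of the healthy pool members, and then trying it on a chipset SATA port, seems like a reasonable next step (da0/da2 are assumed device names for the healthy members):
Code:
# Assumed device names - adjust to your system.
smartctl -a /dev/da0      # a healthy pool member, for comparison
smartctl -x /dev/da1      # extended info from the suspect drive, if it still answers
# If the errors follow the drive to a chipset SATA port, the HBA and cabling are off the hook.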
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
The most likely scenario is the drive is failing, although a problem with the HBA, the cabling, or the backplane can't be ruled out at this point. Try replacing the drive first to see if the errors go away. (Note, since you have a GELI-encrypted pool, remember to regenerate your encryption and recovery keys after the resilver completes.)
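On TrueNAS the replacement itself is normally done through the web UI, which also handles the GELI attach for the new disk. Purely for illustration, the underlying ZFS step looks roughly like this (the new partition's gptid is a placeholder):
Code:
# Illustration only - on a GELI-encrypted TrueNAS pool use the web UI instead,
# otherwise the replacement disk won't be encrypted and attached correctly.
zpool offline SSDpool2 gptid/967d3c9d-f6ce-11ea-8b6d-6805ca1f5bda.eli
zpool replace SSDpool2 gptid/967d3c9d-f6ce-11ea-8b6d-6805ca1f5bda.eli gptid/<new-gptid>.eli
zpool status SSDpool2     # watch the resilver progress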
 

Dunuin

Contributor
Joined
Mar 7, 2013
Messages
110
Thanks. A failing drive would be very annoying because all the SMART attributes were at 98% and above until yesterday.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
It should be hard to kill an S3710 through use, given they are high-endurance drives with a DWPD of, I think, 1.
Take the drive out (and test the cabling with another drive), but also put the drive into a suitable box, run the Intel DC tests against it, and see what they say.
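From memory, the Intel SSD Data Center Tool (isdct) run from a live Linux or Windows box will give you the drive's own health report; the flags are roughly as below, but verify them with "isdct help":
Code:
# Flags from memory - check "isdct help show" for the exact syntax.
isdct show -intelssd           # list detected Intel SSDs and their indexes
isdct show -a -intelssd 0      # full report (health, SMART, firmware) for index 0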
 

Dunuin

Contributor
Joined
Mar 7, 2013
Messages
110
Jep, they are rated even higher for 10 DWPD over 5 years. I bought them specifically because of this, after seeing how bad the ZFS write amplification is (I measured a write amplification from guest OS to NAND of factor 3 to 82, depending on the workload).
I didn't know there was a special tool for testing Intel SSDs; I've only used the Intel DC tool once, to update the firmware. So I will swap the cables and boot a live Linux to run the Intel tool. A simple reboot didn't help, and the other two SSDs in the pool (identical model, attached to the same HBA) are still working fine. The HBA also recognizes all 3 SSDs.
[Attachment: hba.png]
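As an aside on the write-amplification figure above: the drive-side part of that estimate comes from the SMART write counters, roughly like this (attribute names vary between models and smartmontools drivedb versions, so the pattern below is only a guess):
Code:
# Attribute names differ per model/drivedb - adjust the pattern as needed.
smartctl -A /dev/da1 | egrep -i 'host_writes|nand_writes'
# drive-level write amplification ~= delta(NAND writes) / delta(host writes),
# with both counters converted to the same unit first.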
 

Dunuin

Contributor
Joined
Mar 7, 2013
Messages
110
I changed the power and SATA cables, but the problem is still the same.
While booting TrueNAS I now also see this; I think it's new:
[Attachment: boot.png]

Is that something I should worry about? Can it be caused by my failed SSD? gptzfsboot sounds more like a problem with my freenas-boot pool, but I just scrubbed that pool and the scrub didn't find any errors. The freenas-boot pool is also connected directly to the mainboard's chipset SATA ports, not to the HBA.
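For the record, the boot-pool check amounts to something like this (ada0 is an assumed device name for one of the boot devices on the chipset ports):
Code:
zpool status -v freenas-boot     # any read/write/checksum errors on the boot devices?
zpool scrub freenas-boot         # re-run the scrub if in doubt
smartctl -a /dev/ada0            # assumed device name - a boot device on the chipset SATA ports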
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Hmm, another possibility is your system power supply going south, and not providing sufficient power to both the boot drive and this supposedly failing SSD.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Jep, they are rated even higher for 10 DWPD over 5 years.......
I was sure I wrote 10 in the first place - must have deleted the 0 by mistake.

The software I was referring to was the tool you can update the firmware with. It's got some test functions.
 