system reboot during volume creation of SAS SSDs (SSD then disappears)

jpi

Dabbler
Joined
Apr 21, 2019
Messages
14
Hello,

I bought two 800GB SAS SSDs and connected them to ports 12 and 13 on my R510 (yes, the internal 2.5" bays). I ran a short SMART test on both of them and they look fine: they have been powered on for ~4 years, but both have less than 1TB written to or read from them. The first time I tried to create a striped volume from them (da2 and da3), the volume creation failed. When this occurred, da3 disappeared from FreeNAS View Disks and was no longer reported at boot as connected to the H200 (flashed to IT mode). I suspected that the SAS SSD that showed up as da3 was bad and RMA'd it via the eBay seller. The seller shipped another SAS SSD. I hooked it up, it showed up as da3, and again when I tried to create a striped volume the creation failed (this time causing FreeNAS to unexpectedly reboot). Upon rebooting da3 is again gone. The SSD does not show up. Is FreeNAS or my H200 frying my SSD (now twice!)?

Thanks for the help!

Additional info:
Logs: cat /var/log/messages | grep da3 ---> (https://pastebin.com/raw/CS3vqmFb)
R510: 8GB RAM, 1x E5620
FreeNAS: 11.2-U2
SAS SSDs: They are HGST Ultrastar HUSMM8080ASS201. Purchased off eBay for ~$130/pc.
HBA info via sas2flash -listall: Ctlr= SAS2008(B2), Fw Ver = 20.00.07.00

Update #1: I took the SSD that "disappeared" and moved it from port 13 to port 12 while leaving the working SSD disconnected (i.e. I did not plug anything into port 13). Now the "disappeared" SSD shows up. However, "geom disk list da2" shows "Mode: r0w0e0" and "smartctl -t short /dev/da2" gives me an immediate "Short offline self test failed [medium or hardware error (serious)]". This is a test that previously worked. Did I just get another bad SSD, or did something else cause this?
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Let's see the full output of smartctl -x on each of these drives.
 

Chris Moore

PS. Please put the listing inside code tags like this [CODE] your text here [/CODE]
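
For anyone gathering the same listings, here is a minimal Python sketch of the request above: it runs `smartctl -x` once per device and wraps each listing in the forum's [CODE] tags. The function names `collect_smart_reports` and `format_for_forum` and the device list are made up for illustration; only the `smartctl -x` invocation comes from this thread.

```python
import subprocess


def collect_smart_reports(devices, run=subprocess.run):
    """Return {device path: text printed by `smartctl -x <device>`}.

    `run` is injectable (hypothetical convenience) so the function can be
    exercised without real hardware.
    """
    reports = {}
    for dev in devices:
        result = run(
            ["smartctl", "-x", dev],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,   # keep error text together with the listing
            universal_newlines=True,    # decode output to str
        )
        # smartctl signals problems through a non-zero exit status;
        # the output is kept regardless, because a failing drive is the
        # interesting case.
        reports[dev] = result.stdout
    return reports


def format_for_forum(reports):
    """Wrap each listing in [CODE] tags, one block per device."""
    blocks = []
    for dev, text in reports.items():
        blocks.append("{}\n[CODE]\n{}\n[/CODE]".format(dev, text.rstrip("\n")))
    return "\n\n".join(blocks)
```

On the system in this thread it could be called, as root, with `print(format_for_forum(collect_smart_reports(["/dev/da2", "/dev/da3"])))`.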
 

jpi

@Chris Moore I have connected both drives back up, returning the working SSD to port 12 and the non-working SSD to port 13.

Working drive on port 12:
Code:
root@freenas[~]# smartctl -x /dev/da2
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HITACHI
Product:              HUSMM818 CLAR800
Revision:             C260
Compliance:           SPC-4
User Capacity:        800,176,914,432 bytes [800 GB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate:        Solid State Device
Form Factor:          2.5 inches
Logical Unit id:      0x5000cca02b0dae4c
Serial number:        2MV7J7KA
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Mon Apr 22 09:15:51 2019 MDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled
Read Cache is:        Enabled
Writeback Cache is:   Disabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Percentage used endurance indicator: 1%
Current Drive Temperature:     28 C
Drive Trip Temperature:        70 C

Manufactured in week 07 of year 2014
Specified cycle count over device lifetime:  0
Accumulated start-stop cycles:  0
Specified load-unload count over device lifetime:  0
Accumulated load-unload cycles:  0
defect list format 6 unknown
Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 130167674503168

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0       1035.504           0
write:         0        0         0         0          0        357.003           0

Non-medium error count:        0

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -   37288                 - [-   -    -]
# 2  Background short  Completed                   -   37288                 - [-   -    -]
# 3  Background short  Completed                   -   37196                 - [-   -    -]

Long (extended) Self Test duration: 6 seconds [0.1 minutes]

Background scan results log
  Status: scan is active
    Accumulated power on time, hours:minutes 37290:56 [2237456 minutes]
    Number of background scans performed: 46,  scan progress: 37.86%
    Number of background medium scans performed: 46

Protocol Specific port log page for SAS SSP
relative target port id = 1
  generation code = 3
  number of phys = 1
  phy identifier = 0
    attached device type: expander device
    attached reason: SMP phy control function
    reason: unknown
    negotiated logical link rate: phy enabled; 6 Gbps
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=1
    SAS address = 0x5000cca02b0dae4d
    attached SAS address = 0x500065b36789abff
    attached phy identifier = 1
    Invalid DWORD count = 5
    Running disparity error count = 5
    Loss of DWORD synchronization = 5
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 5
     Running disparity error count: 5
     Loss of dword synchronization count: 5
     Phy reset problem count: 0
relative target port id = 2
  generation code = 3
  number of phys = 1
  phy identifier = 1
    attached device type: no device attached
    attached reason: unknown
    reason: power on
    negotiated logical link rate: phy enabled; unknown
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x5000cca02b0dae4e
    attached SAS address = 0x0
    attached phy identifier = 0
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 0
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 0
     Running disparity error count: 0
     Loss of dword synchronization count: 0
     Phy reset problem count: 0




Non-working drive on port 13:
Code:

root@freenas[~]# smartctl -x /dev/da3
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HUSMM818 CLAR800
Revision:             C118
Compliance:           SPC-4
LU is resource provisioned, LBPRZ=1
Rotation Rate:        Solid State Device
Form Factor:          2.5 inches
Logical Unit id:      0x5000cca02b0d9570
Serial number:        2MV7GL7A
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Mon Apr 22 09:16:03 2019 MDT
device Test Unit Ready  [medium or hardware error (serious)]
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
root@freenas[~]# smartctl -t short /dev/da3
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

Short offline self test failed [medium or hardware error (serious)]


Below is a screen grab from when I was able to run a short SMART test successfully. This is the only "proof" I have that at one time it did work. Yes, I also thought it was interesting that the drive has zero writes.

2019-04-21.png
 

Chris Moore

jpi said:
Upon rebooting da3 is again gone. The SSD does not show up. Is FreeNAS or my H200 frying my SSD (now twice!)?

It certainly looks broken. I would say that it is not FreeNAS. I have used HGST SAS SSDs (the 100GB model) in my FreeNAS system, connected by way of a Dell H310 (LSI 9211-8i P20 IT Mode), and I have also used those same SSDs connected to FreeNAS by way of an HP H220 (also flashed to IT mode).

I am certain that those drives can work with FreeNAS.

I would suspect some problem in the Dell backplane or possibly (but not likely) some problem with the H200.
 

jpi

Well, if it is indeed the backplane, that really is awful. A bad system and now two bad SSDs. Homelab fail :(.

Just to clarify, are you suggesting that a faulty backplane does have the ability to physically break a drive?

Thanks again for the help.
 

jpi

I guess what I don't get is that twice now an SSD connected to port 13 initially worked. The HBA recognized it at boot and FreeNAS showed it in the UI. SMART tests passed, albeit short ones. It's not until I try to create a volume in FreeNAS that the disk finally fails. Perhaps the volume creation process taxes the disk in a way that surfaces the backplane issue. I have even taken the working SSD and connected it to port 13, and everything works.

I guess if I really wanted to prove things, I'd create another volume with the working SSD on port 13.
 

Chris Moore

jpi said:
Just to clarify, are you suggesting that a faulty backplane does have the ability to physically break a drive?

I can see how pushing bad data into a drive could cause the internal drive electronics to freak out. It shouldn't happen under normal circumstances, but if there is something wrong with that port on the backplane, it might even be pushing out bad voltage instead of just bad data. If the same port makes two different drives go crazy, I would be looking at the port, not the drive. Still, see if the vendor will replace the drive, because I am still a little suspicious of the reported data volume versus the power-on hours.

You could get one of these cables:
https://www.ebay.com/itm/Mini-SAS-S...SATA-Power-Splitter-Cable-1m-3ft/201805073616
This would let you go direct from the SAS controller to the SAS drives to test without going through the backplane. I would test it that way.
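
To put the data-versus-power-on-hours question in numbers, the sketch below pulls the power-on time and the "Gigabytes processed" column out of a `smartctl -x` listing and works out the averages. The sample lines are copied from the da2 listing earlier in the thread; the function name `usage_summary` and the regular expressions are assumptions about this particular output layout, not a general smartctl parser.

```python
import re

# Lines copied from the da2 listing in this thread.
SAMPLE = """\
read:          0        0         0         0          0       1035.504           0
write:         0        0         0         0          0        357.003           0
    Accumulated power on time, hours:minutes 37290:56 [2237456 minutes]
"""

HOURS_PER_YEAR = 24 * 365.25


def usage_summary(text):
    """Summarise power-on time and data volume from `smartctl -x` text."""
    power = re.search(
        r"Accumulated power on time, hours:minutes (\d+):(\d+)", text
    )
    if power is None:
        raise ValueError("no power-on time found in listing")
    hours = int(power.group(1)) + int(power.group(2)) / 60

    # The sixth numeric column of the error counter log is
    # "Gigabytes processed [10^9 bytes]".
    gigabytes = {
        op: float(value)
        for op, value in re.findall(
            r"^(read|write):\s+(?:\d+\s+){5}([\d.]+)", text, re.MULTILINE
        )
    }
    total_gb = sum(gigabytes.values())
    return {
        "power_on_hours": hours,
        "power_on_years": hours / HOURS_PER_YEAR,
        "read_gb": gigabytes.get("read"),
        "write_gb": gigabytes.get("write"),
        "gb_per_hour": total_gb / hours,
    }
```

For the da2 listing this comes to a bit over four years of power-on time and roughly 1.4 TB moved in total, i.e. a few tens of megabytes per hour on average. The arithmetic only quantifies the mismatch; it does not explain it.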
 

jpi

Do you think a loose cable could have caused this? I reseated the SAS port 13 cable that runs from the backplane to the internal 2.5" bay. TBH, it didn't seem to be seated very well. I've never seen a loose cable 1) work just enough for S.M.A.R.T. and 2) be loose in a way to ruin a drive. Thanks again and I'll look into that cable. As of now, I will attempt to RMA the drive.
 

Chris Moore

jpi said:
I've never seen a loose cable 1) work just enough for S.M.A.R.T. and 2) be loose in a way to ruin a drive.

I have seen bad cables cause CRC errors while the drives still pass SMART tests, because the test is all internal to the drive.
I also have a server chassis where I work that I suspect of having a bad group of ports, because three drives that were connected on those ports all went bad in the first month of operation. Two of the three had massive numbers of data errors but were still working, while the one in the center of the group was completely dead. It is not something I have attempted to test, because I needed to get the server into production for work, but I did swap some parts to eliminate the suspect components. I have been doing this kind of work since 1992, and in all that time I have only had two instances where I suspected the cabling of causing a drive to fail, and one of those was back in the day of IDE (PATA) 40 pin ribbon cables.
So, if this is a bad port killing drives, you just got really lucky, but it is only a theory at this point and I am a little reluctant to suggest that you continue testing with the possibility of rendering more drives into scrap.
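
Link-level problems of the kind described above are counted by a SAS drive per phy, separately from its internal self-tests, and `smartctl -x` prints them under "Protocol Specific port log page for SAS SSP". As a rough sketch, the Python below extracts those counters from a listing so they can be compared between two runs; the sample text is trimmed from the da2 output earlier in the thread, and the function name `phy_error_counters` is made up here.

```python
# Trimmed from the da2 listing in this thread.
SAMPLE = """\
relative target port id = 1
  phy identifier = 0
    attached phy identifier = 1
    Invalid DWORD count = 5
    Running disparity error count = 5
    Loss of DWORD synchronization = 5
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 5
relative target port id = 2
  phy identifier = 1
    attached phy identifier = 0
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 0
    Phy reset problem = 0
"""

# Counter names exactly as they appear in the listing.
COUNTERS = (
    "Invalid DWORD count",
    "Running disparity error count",
    "Loss of DWORD synchronization",
    "Phy reset problem",
)


def phy_error_counters(text):
    """Return {target port id: {counter name: value}} from smartctl -x text."""
    ports = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("relative target port id ="):
            current = int(line.split("=")[1])
            ports[current] = {}
        elif current is not None:
            # Only "name = value" lines are counters; the
            # "Phy event descriptors" lines use "name: value" and are skipped.
            name, sep, value = line.partition(" = ")
            if sep and name in COUNTERS:
                ports[current][name] = int(value)
    return ports
```

A single snapshot says little; the point of parsing the counters is to see whether they grow between two listings taken before and after putting load on the link.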
 