HDD gets detached during Solnet test

troudee

Explorer
Joined
Mar 26, 2020
Messages
69
Hello!

I am still building my first NAS (which is actually not for me):
  • Mainboard: Supermicro X10SDV-4C+-TLN4F bulk (MBD-X10SDV-4C+-TLN4F-B)
  • RAM: Crucial DIMM 16GB, DDR4-2666, CL19, ECC (CT16G4WFD8266)
  • PSU: Corsair SF750 Platinum (SFX/80+)
  • Case: SilverStone Case Storage DS380, Mini-ITX (SST-DS380B)
  • HBA: InLine 76617E, 4x SATA, PCIe 2.0 x1 (the Mainboard has only 6 SATA ports)
  • SATA-Cables: Supermicro Spare 2FT Amphenol SATA CB
  • HDDs: WD Red 4TB WD40EFRX
  • SSD: Intenso High Performance internal SSD, 120 GB (2.5"/6.3 cm, SATA III, up to 520 MB/s), black

What I did so far:

  1. Assembled everything (without the SSD)
  2. Smoke test
  3. Updated the IPMI firmware
  4. Burn-in tests of the CPU (OCCT and stress and FIRESTARTER and prime95... wanted to do everything I could, right? ^^)
  5. Two weeks of memtest86
  6. SMART tests of the WD HDDs (fine)
  7. badblocks test of the WD HDDs (fine)
  8. First solnet test of the WD HDDs (fine, but the HDD activity LED broke)
  9. Disassembled everything and, after lots of swearing, simply replaced the HDD activity LED with the one from my test kit
  10. Assembled everything again, this time with the SSD and improved cable management
    IMG_20200730_221958_forum.jpg

    On the image, you can (not) see the SSD in the right cage.
  11. At the first smoke test afterwards, the SSD threw CRC errors into my mfsbsd kernel log -- cross-testing showed that the cable was bad.
  12. Replaced the cable.

I now wanted to do another solnet test, but during the test, one of the HDDs suddenly detached!
I placed the 4 HDDs into the slots "I0" to "I3" ("I" = internal / Mainboard, "E" = extended / HBA), launched mfsbsd and started the test (for the HDDs only, I ignored the SSD).

This is the log of solnet (I enhanced it with a time-printing wrapper):
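For reference, the wrapper is nothing fancy; roughly something like this (the log path is just an example):

```shell
#!/bin/sh
# Rough sketch of the time-printing wrapper: print a timestamp before
# and after solnet and tee everything to the USB stick for the logs.
timed() {
    echo "$(date) Starte Solnet..."
    "$@"
    echo "$(date) Solnet ended"
}

timed sh ./solnet.sh 2>&1 | tee /mnt/usb/solnet.log   # log path is an example
```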

Code:
Tue Aug 18 19:29:20 CEST 2020 Starte Solnet...
sol.net disk array test v2

1) Use all disks (from camcontrol)
2) Use selected disks (from camcontrol|grep)
3) Specify disks
4) Show camcontrol list

Option: <JAJS600M128C S0222A0>             at scbus3 target 0 lun 0 (ada0,pass0)  // That's the SSD on the HBA
<WDC WD40EFRX-68N32N0 82.00A82>    at scbus4 target 0 lun 0 (ada1,pass1)  // HDD in I0
<WDC WD40EFRX-68N32N0 82.00A82>    at scbus5 target 0 lun 0 (ada2,pass2)  // HDD in I1
<WDC WD40EFRX-68N32N0 82.00A82>    at scbus6 target 0 lun 0 (ada3,pass3)  // HDD in I2
<WDC WD40EFRX-68N32N0 82.00A82>    at scbus7 target 0 lun 0 (ada4,pass4)  // HDD in I3
<HL-DT-ST DVDRAM GSA-E10N JE05>    at scbus10 target 0 lun 0 (pass5,cd0)  // The CD drive I booted mfsbsd from
<Intenso Rainbow Line 6.90>        at scbus11 target 0 lun 0 (pass6,da0)  // The USB stick for storing the logs
Press Return:
1) Use all disks (from camcontrol)
2) Use selected disks (from camcontrol|grep)
3) Specify disks
4) Show camcontrol list

Option:
Enter disk devices separated by spaces (e.g. da1 da2):
Selected disks: ada1 ada2 ada3 ada4
<WDC WD40EFRX-68N32N0 82.00A82>    at scbus4 target 0 lun 0 (ada1,pass1)
<WDC WD40EFRX-68N32N0 82.00A82>    at scbus5 target 0 lun 0 (ada2,pass2)
<WDC WD40EFRX-68N32N0 82.00A82>    at scbus6 target 0 lun 0 (ada3,pass3)
<WDC WD40EFRX-68N32N0 82.00A82>    at scbus7 target 0 lun 0 (ada4,pass4)
Is this correct? (y/N): Performing initial serial array read (baseline speeds)
Tue Aug 18 19:29:55 CEST 2020
Tue Aug 18 19:38:56 CEST 2020
Completed: initial serial array read (baseline speeds)

Array's average speed is 175.505 MB/sec per disk

Disk    Disk Size  MB/sec %ofAvg
------- ---------- ------ ------
ada1     3815447MB    173     98
ada2     3815447MB    169     96
ada3     3815447MB    182    104
ada4     3815447MB    179    102

Performing initial parallel array read
Tue Aug 18 19:38:56 CEST 2020
The disk ada1 appears to be 3815447 MB.      
Disk is reading at about 174 MB/sec      
This suggests that this pass may take around 366 minutes

                   Serial Parall % of
Disk    Disk Size  MB/sec MB/sec Serial
------- ---------- ------ ------ ------
ada1     3815447MB    173    173    101
ada2     3815447MB    169    170    101
ada3     3815447MB    182    182    100
ada4     3815447MB    179    179    100

Awaiting completion: initial parallel array read
Wed Aug 19 03:34:22 CEST 2020
Completed: initial parallel array read

Disk's average time is 28015 seconds per disk

Disk    Bytes Transferred Seconds %ofAvg
------- ----------------- ------- ------
ada1        4000787030016   28471    102
ada2        4000787030016   28525    102
ada3        4000787030016   27402     98
ada4        4000787030016   27660     99

Performing initial parallel seek-stress array read
Wed Aug 19 03:34:22 CEST 2020
The disk ada1 appears to be 3815447 MB.      
Disk is reading at about 174 MB/sec      
This suggests that this pass may take around 365 minutes

                   Serial Parall % of
Disk    Disk Size  MB/sec MB/sec Serial
------- ---------- ------ ------ ------
ada1     3815447MB    173    170     98
ada2     3815447MB    169    164     97
ada3     3815447MB    182    181     99
ada4     3815447MB    179    175     98

Awaiting completion: initial parallel seek-stress array read
Fri Aug 21 04:00:07 CEST 2020
Completed: initial parallel seek-stress array read
!!ERROR!! dd: /dev/ada3: Device not configured

Disk's average time is 90762 seconds per disk

Disk    Bytes Transferred Seconds %ofAvg
------- ----------------- ------- ------
ada1        4000787030016   72090     79 ++FAST++
ada2        4000787030016   82851     91 ++FAST++
ada3        2884906254336   54532     60 ++FAST++
ada4        4000787030016  153575    169 --SLOW--
Fri Aug 21 04:00:07 CEST 2020 Solnet ended


As you can see, ada3 falls out: !!ERROR!! dd: /dev/ada3: Device not configured.

The kernel says this:

Code:
Aug 19 16:43:14 mfsbsd kernel: ada3 at ahcich6 bus 0 scbus6 target 0 lun 0
Aug 19 16:43:14 mfsbsd kernel: ada3: <WDC WD40EFRX-68N32N0 82.00A82> s/n WD-WCC7K0ER3TKT detached
Aug 19 16:43:14 mfsbsd kernel: (ada3:ahcich6:0:0:0): Periph destroyed
Aug 19 16:43:17 mfsbsd kernel: (aprobe0:ahcich6:0:0:0): NOP FLUSHQUEUE. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
Aug 19 16:43:17 mfsbsd kernel: (aprobe0:ahcich6:0:0:0): CAM status: ATA Status Error
Aug 19 16:43:17 mfsbsd kernel: (aprobe0:ahcich6:0:0:0): ATA status: d1 (BSY DRDY SERV ERR), error: 04 (ABRT )
Aug 19 16:43:17 mfsbsd kernel: (aprobe0:ahcich6:0:0:0): RES: d1 04 ff ff ff ff ff ff ff ff ff
Aug 19 16:43:17 mfsbsd kernel: (aprobe0:ahcich6:0:0:0): Error 5, Retries exhausted


What does that mean, what would be the most reasonable first steps, and what should I suspect first? The drive? The cable (maybe I coiled it too tightly? See picture)? The mainboard? Myself?
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
NOP FLUSHQUEUE means the kernel thinks the drive doesn't implement the FLUSHQUEUE command. I'd try the cable first; cables are the likeliest components to go bad. The next likeliest would probably be the SATA port, so shuffle the disks around and see whether the error moves with the drive or stays with the port.
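To tell which way it goes, note each drive's serial number before and after shuffling. Something along these lines works from the mfsbsd shell (a sketch; the awk pattern assumes camcontrol's usual "serial number" output line):

```shell
#!/bin/sh
# Sketch: record which serial number sits on which device/channel, so
# that after shuffling cables or drives you can tell whether the error
# followed the drive or stayed with the port.
camcontrol devlist -v    # verbose list also shows the scbus/ahcich mapping
for d in ada1 ada2 ada3 ada4; do
    sn=$(camcontrol identify "$d" | awk '/serial number/ {print $3}')
    echo "$d -> serial $sn"
done
```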
 

troudee

Explorer
Joined
Mar 26, 2020
Messages
69
NOP FLUSHQUEUE means the kernel thinks the drive doesn't implement the FLUSHQUEUE command. I'd try the cable first; cables are the likeliest components to go bad. The next likeliest would probably be the SATA port, so shuffle the disks around and see whether the error moves with the drive or stays with the port.

Hey Samuel Tai, thank you very much for your reply! I will try it a few more times.

When looking at the picture I posted under "What I did so far", would you say I've been too rough with these cables, or is it just a bad cable? I am asking because it would already be the second bad cable out of 9.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
When looking at the picture I posted under "What I did so far", would you say I've been too violent against these cables, or is it just a bad cable? I am asking since it would already be the second bad cable out of 9.

You should try to get shorter cables. Those loops are a lot of weight to be hanging off the SATA ports on both the drive backplane and motherboard sides. Barring that, some kind of weight relief would be warranted, like a short vertical stick to which you could velcro all the loops.
 
Last edited:

troudee

Explorer
Joined
Mar 26, 2020
Messages
69
Just as an update: another solnet test has been running for ~43 hours now, without my having touched anything beforehand (cables, HDDs, ... I didn't even open the chassis). No errors so far. :mad:
 

troudee

Explorer
Joined
Mar 26, 2020
Messages
69
Oh dear. The second solnet test completed tonight, without any errors... I hate nonreproducible errors!

What do you think, @Samuel Tai, should I replace all the loops, just to be safe?
 

troudee

Explorer
Joined
Mar 26, 2020
Messages
69
I am really not sure what I should do now. There have now been another 60 hours (120 in total) of solnet testing in the untouched configuration, without any error. Should I ignore it? Should I replace the cables or the drives or the mainboard or myself or the HBA, just for the sake of it?

How is anyone supposed to fix errors that do not occur when looking for them? :( :( :(
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
I got new ones! That looks better, right?
Yes, much cleaner and better for airflow.

Waiting on your test results but if they are the same, read below. I hope it was just the one SATA cable.
<WDC WD40EFRX-68N32N0 82.00A82> at scbus4 target 0 lun 0 (ada1,pass1) // HDD in I0
<WDC WD40EFRX-68N32N0 82.00A82> at scbus5 target 0 lun 0 (ada2,pass2) // HDD in I1
<WDC WD40EFRX-68N32N0 82.00A82> at scbus6 target 0 lun 0 (ada3,pass3) // HDD in I2
<WDC WD40EFRX-68N32N0 82.00A82> at scbus7 target 0 lun 0 (ada4,pass4) // HDD in I3
How do you know this is accurate? I'm asking whether you have traced out each SATA port to each hard drive (by serial number). Ensure that the suspect drive is actually plugged into the SATA3 port. I see no serial numbers here, and to prove it you must track the hard drives by serial number. I bring this up because if you still have the problem, you need to move the SATA cables around from port to port: for example, swap SATA1 and SATA2 and retest. Does the fault stay with the same drive (by serial number, not adaX number)? If it does, next swap SATA2 and SATA3 on the motherboard and retest: is the fault the same, or does it move to a different drive? This is all basic isolation troubleshooting. Just remember to track the drive serial numbers.

To correlate ada0 to a serial number, you just need to enter the command smartctl -a /dev/ada0, and it will show you the serial number of the drive. You can also get it from the FreeNAS dashboard if you go through the Pool menus down to the hard drive detail data.
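If you want all four at once, a quick loop does it (a sketch; the adaX names are whatever camcontrol shows on your box):

```shell
#!/bin/sh
# Sketch: map each adaX device to its drive serial number in one go.
for d in ada1 ada2 ada3 ada4; do
    sn=$(smartctl -i /dev/$d | awk '/Serial Number/ {print $3}')
    echo "$d = $sn"
done
```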

I'm bringing this up because on your motherboard the SATA0 port is slightly different electrically from the other SATA ports; it's not just a different color. If you find that the drive connected to SATA0 is your issue, just move it to a different SATA port. In fact, I would skip the SATA0 port entirely, at least for your pool if you don't need it, simply because it is different. I'm grasping at straws here in case your testing fails, to give you other possible options.

I hope the SATA cables fixed your issue and good luck with the build.
 

troudee

Explorer
Joined
Mar 26, 2020
Messages
69
Yes, much cleaner and better for airflow.
I hope the SATA cables fixed your issue and good luck with the build.
Thank you very much! :smile::smile::smile: (I rearranged your post a bit, hope that's okay with you)

How do you know this is accurate? I'm asking if you have traced out each SATA port to each hard drive (by serial number).
Yeeeeeeeees, of course! :smile: I've written the mainboard (and HBA) port names next to the drive slots, written the serial number of each drive onto its "forehead", and before starting solnet, my solnet-launcher script loops over all smartctl --scan results, asks me where each drive is plugged in, and writes all this info to the reports directory. :cool:
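The mapping step is essentially this (a simplified sketch; the real script records a bit more metadata, and the reports path is shortened here):

```shell
#!/bin/sh
# Simplified sketch of the launcher's mapping step: loop over every
# device smartctl finds, ask where it is physically plugged in, and
# log device, serial number, and slot to the reports directory.
REPORTS=reports                  # shortened; the real path is on the USB stick
mkdir -p "$REPORTS"
smartctl --scan | awk '{print $1}' | while read -r dev; do
    sn=$(smartctl -i "$dev" | awk '/Serial Number/ {print $3}')
    printf 'Which slot is %s (serial %s) in? ' "$dev" "$sn"
    read -r slot < /dev/tty      # stdin is the pipe, so ask the terminal
    echo "$dev $sn $slot" >> "$REPORTS/drive-map.txt"
done
```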

I bring this up because if you still have the problem then you need to move the SATA cables from each port around to another port, for example swap SATA1 and SATA2. and retest. Does the fault stay at the same drive (by serial number not adax number). If it's the same, next swap SATA2 and SATA3 on the motherboard. Retest, is the fault the same or does it move to a different drive. This is all basic isolation troubleshooting. Just remember to look at the drive serial numbers.
That is about what I wanted to do. However, the fault did not reappear in 2 solnet tests, even without my touching any drive/cable/etc., so I got really frustrated (each test runs about 60 h), but I exchanged the cables nonetheless.

In order to correlate ada0 to a serial number you just need to enter the command smartctl -a /dev/ada0 and it will show you the serial number of the drive.
Thank you. :smile: That's what I am doing with my script.

You can also get it from the FreeNAS Dashboard if you go through the Pool menus down to the hard drive detail data.
I do not have any FreeNAS yet -- these tests are all pre-install tests; the solnet tests run from a mfsbsd CD.

I'm bringin this up because on your motherboard the SATA0 port is slightly different electrically than the other SATA ports, it's not that it's just a different color. If you find out that the drive connected to SATA0 is your issue, just move it to a different SATA port. In fact I would skip using SATA0 port just becasue it is different, at least for your pool if you don't need it. I'm grasping at straws here just in case your testing fails, giving you other possible options.
That is very nice. :smile: After all, I have learned just enough about NAS installation to know how much there is still to be learned, so I am thankful for any advice!


Waiting on your test results but if they are the same, read below. I hope it was just the one SATA cable.
The first full run of the solnet script (about 60 h) completed without errors in /var/log/messages and with these values. What do you think about the values? Are they that critical, especially the seek-stress ones?

Code:
Selected disks: ada1 ada2 ada3 ada4
<WDC WD40EFRX-68N32N0 82.00A82> at scbus4 target 0 lun 0 (ada1,pass1)
<WDC WD40EFRX-68N32N0 82.00A82> at scbus5 target 0 lun 0 (ada2,pass2)
<WDC WD40EFRX-68N32N0 82.00A82> at scbus6 target 0 lun 0 (ada3,pass3)
<WDC WD40EFRX-68N32N0 82.00A82> at scbus7 target 0 lun 0 (ada4,pass4)
Is this correct? (y/N): Performing initial serial array read (baseline speeds)
Tue Sep 22 07:27:21 CEST 2020
Tue Sep 22 07:36:22 CEST 2020
Completed: initial serial array read (baseline speeds)

Array's average speed is 175.363 MB/sec per disk

Disk    Disk Size  MB/sec %ofAvg
------- ---------- ------ ------
ada1     3815447MB    173     99
ada2     3815447MB    169     96
ada3     3815447MB    181    103
ada4     3815447MB    178    102

Performing initial parallel array read
Tue Sep 22 07:36:22 CEST 2020
The disk ada1 appears to be 3815447 MB.
Disk is reading at about 174 MB/sec
This suggests that this pass may take around 366 minutes

                   Serial Parall % of
Disk    Disk Size  MB/sec MB/sec Serial
------- ---------- ------ ------ ------
ada1     3815447MB    173    174    101
ada2     3815447MB    169    170    101
ada3     3815447MB    181    182    100
ada4     3815447MB    178    179    100

Awaiting completion: initial parallel array read
Tue Sep 22 15:31:46 CEST 2020
Completed: initial parallel array read

Disk's average time is 28013 seconds per disk

Disk    Bytes Transferred Seconds %ofAvg
------- ----------------- ------- ------
ada1        4000787030016   28474    102
ada2        4000787030016   28524    102
ada3        4000787030016   27392     98
ada4        4000787030016   27660     99

Performing initial parallel seek-stress array read
Tue Sep 22 15:31:46 CEST 2020
The disk ada1 appears to be 3815447 MB.
Disk is reading at about 167 MB/sec
This suggests that this pass may take around 381 minutes

                   Serial Parall % of
Disk    Disk Size  MB/sec MB/sec Serial
------- ---------- ------ ------ ------
ada1     3815447MB    173    167     97
ada2     3815447MB    169    166     99
ada3     3815447MB    181    177     98
ada4     3815447MB    178    188    106

Awaiting completion: initial parallel seek-stress array read
Thu Sep 24 15:48:42 CEST 2020
Completed: initial parallel seek-stress array read

Disk's average time is 100201 seconds per disk

Disk    Bytes Transferred Seconds %ofAvg
------- ----------------- ------- ------
ada1        4000787030016   72781     73 ++FAST++
ada2        4000787030016  134384    134 --SLOW--
ada3        4000787030016  122450    122 --SLOW--
ada4        4000787030016   71189     71 ++FAST++
Thu Sep 24 15:48:42 CEST 2020 Solnet ended

The second test run is currently in progress. If both are okay, I plan to repeat them with the drives in the lower 4 bays, and then repeat the CPU burn-in because I've changed the airflow. And then ...
 

troudee

Explorer
Joined
Mar 26, 2020
Messages
69
The second test completed without (visible) errors:

Code:
Enter disk devices separated by spaces (e.g. da1 da2):
Selected disks: ada1 ada2 ada3 ada4
<WDC WD40EFRX-68N32N0 82.00A82> at scbus4 target 0 lun 0 (ada1,pass1)
<WDC WD40EFRX-68N32N0 82.00A82> at scbus5 target 0 lun 0 (ada2,pass2)
<WDC WD40EFRX-68N32N0 82.00A82> at scbus6 target 0 lun 0 (ada3,pass3)
<WDC WD40EFRX-68N32N0 82.00A82> at scbus7 target 0 lun 0 (ada4,pass4)
Is this correct? (y/N): Performing initial serial array read (baseline speeds)
Thu Sep 24 20:32:08 CEST 2020
Thu Sep 24 20:41:08 CEST 2020
Completed: initial serial array read (baseline speeds)

Array's average speed is 175.33 MB/sec per disk

Disk    Disk Size  MB/sec %ofAvg
------- ---------- ------ ------
ada1     3815447MB    173     98
ada2     3815447MB    169     96
ada3     3815447MB    181    103
ada4     3815447MB    178    102

Performing initial parallel array read
Thu Sep 24 20:41:08 CEST 2020
The disk ada1 appears to be 3815447 MB.
Disk is reading at about 174 MB/sec
This suggests that this pass may take around 366 minutes

                   Serial Parall % of
Disk    Disk Size  MB/sec MB/sec Serial
------- ---------- ------ ------ ------
ada1     3815447MB    173    173    100
ada2     3815447MB    169    170    101
ada3     3815447MB    181    182    100
ada4     3815447MB    178    179    100

Awaiting completion: initial parallel array read
Fri Sep 25 04:36:33 CEST 2020
Completed: initial parallel array read

Disk's average time is 28014 seconds per disk

Disk    Bytes Transferred Seconds %ofAvg
------- ----------------- ------- ------
ada1        4000787030016   28470    102
ada2        4000787030016   28524    102
ada3        4000787030016   27403     98
ada4        4000787030016   27659     99

Performing initial parallel seek-stress array read
Fri Sep 25 04:36:33 CEST 2020
The disk ada1 appears to be 3815447 MB.
Disk is reading at about 167 MB/sec
This suggests that this pass may take around 380 minutes

                   Serial Parall % of
Disk    Disk Size  MB/sec MB/sec Serial
------- ---------- ------ ------ ------
ada1     3815447MB    173    167     97
ada2     3815447MB    169    166     99
ada3     3815447MB    181    184    101
ada4     3815447MB    178    212    119

Awaiting completion: initial parallel seek-stress array read
Sun Sep 27 04:41:15 CEST 2020
Completed: initial parallel seek-stress array read

Disk's average time is 138144 seconds per disk

Disk    Bytes Transferred Seconds %ofAvg
------- ----------------- ------- ------
ada1        4000787030016   85178     62 ++FAST++
ada2        4000787030016  140520    102
ada3        4000787030016  167170    121 --SLOW--
ada4        4000787030016  159709    116 --SLOW--
Sun Sep 27 04:41:15 CEST 2020 Solnet ended

Are those values good or bad or irrelevant?
 