Are 2 of My Drives Failed? (See Edit: Moving Data Onto To New Vdev, To Remove Old)

isopropyl · Aug 3, 2023

Davvo said:
You have wrong firmware version. See the following resource.

LSI 9300-xx Firmware Update

Hey Community, If you are using an LSI 9300 HBA with FreeNAS or the soon-to-be TrueNAS CORE, you may experience some performance issues causing the controller to reset when using SATA HDDs. After working with Broadcom, we’ve come up with a...

www.truenas.com

I got them directly from Supermicro's site under Firmware.
Also, that states it's for an LSI 9300-xx, not my 3008..?
I have been running this firmware on the HBA since release, and have had 3 of the same HBA running it (1 in my old machine, plus these 2; one is currently inactive but was previously in this system) without running into any issues. Maybe there is a performance bottleneck that I am not aware of? But as far as disconnects go, I have yet to notice any issues.
That being said, he does state in the post that it only affects SATA drives. The only SATA drive currently attached is my boot SSD for TrueNAS, so I assume I'd run into major issues if the HBA was causing problems with that drive. All the pool drives are SAS HDDs.
I'll throw a post confirming this in the discussion there for the hell of it though.

Redcoat said:
You have listed 13 out of your 16 drives.

Edited my post. Thanks for pointing it out. They all seem to be there. For some reason the shell cut off the bottom of both outputs I pasted instead of copying everything I highlighted.
Anyway, both code blocks above now include the info that was missing from my post.

Are 2 of My Drives Failed? (See Edit: Moving Data Onto To New Vdev, To Remove Old)

EDIT: SEE FROM POST #106 FOR UPDATES (https://www.truenas.com/community/threads/are-2-of-my-drives-failed.111640/post-777996) I find the way the GUI and CLI list the pool very confusing when looking at spare drives, and failed/failing drives. From what it looks like, 2 spare drives are in...

www.truenas.com

joeschmuck said:
It appears that da12 and da13 are the spares and are being used to replace da9 and da14 respectively, if I'm reading the output correctly (I added the device idents and I've never paid attention to an output that had spares active). This would be drives K4K0BBJB and K4J3EXNB (I have the full multi_report output).

This is what I had gathered, but I also was unsure if I was reading it correctly as I find it a little confusing how it is all displayed.

joeschmuck said:
But what I do not understand is it looks like the Spares were manually assigned, the drive it replaced is not "UNAVAIL" and the mirrors are not "DEGRADED".

That is very strange. I did not even know you can manually assign spares like that. It was not something I have done (at least to my knowledge haha)

And exactly, that's what I did not understand. Nothing showed as 'UNAVAIL', 'DEGRADED', or 'FAILED' anywhere. Normally when a drive is failing, I get an alert and an e-mail about it (or about SMART concerns), the pool shows as degraded, and one of the spares seems to get swapped in automatically and resilvered. Then I would replace the drive, it would resilver, I would run a scrub, and all would show good again.

joeschmuck said:
As far as I can tell, you can manually detach the spares HOWEVER, WAIT until someone tells you that it's safe. Don't do it because I mentioned it, I don't consider myself the expert on this topic.

That is very reassuring to hear if true, however yes I would really appreciate further input from someone who is very knowledgeable here before I go ahead and just detach it.

Also, what is the proper way of doing it?
I assume Web GUI > Pools > Pool Status.
Under the respective mirrors, I would click da12 and da13 and click Detach?
Please correct me if I am wrong.

(Also confirmation that it is da12 and da13 would be highly appreciated just for a sanity check to ensure it's the correct drives.)

Davvo · Aug 4, 2023

isopropyl said:
I got them directly from Supermicro's site under Firmware.
Also, that states it's for an LSI 9300-xx, not my 3008..?
I have been running this firmware on the HBA since release, and have had 3 of the same HBA running it (1 in my old machine, plus these 2; one is currently inactive but was previously in this system) without running into any issues. Maybe there is a performance bottleneck that I am not aware of? But as far as disconnects go, I have yet to notice any issues.
That being said, he does state in the post that it only affects SATA drives. The only SATA drive currently attached is my boot SSD for TrueNAS, so I assume I'd run into major issues if the HBA was causing problems with that drive. All the pool drives are SAS HDDs.
I'll throw a post confirming this in the discussion there for the hell of it though.

The firmware you have flashed is 16.00.01.00, dated 2018; the latest firmware on Supermicro's page is 16.00.10.00, dated 2021.
The firmware I linked you to, which is compatible with your controller as other users have pointed out in that thread, is 16.00.12.00. I see no harm in flashing the latest known working firmware, especially since you are having some kind of issue with the drives and their SMART tests.
If you don't feel confident flashing 16.00.12.00, please flash 16.00.10.00 from the Supermicro page you linked. This might solve your issue.
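For anyone comparing the three version strings mentioned here, the following is a minimal Python sketch (not part of any Broadcom or Supermicro tool; the helper name `parse_version` is made up for illustration) that orders dotted firmware versions numerically instead of as plain text:

```python
def parse_version(text):
    """Split a dotted version such as '16.00.12.00' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))

installed = "16.00.01.00"                     # version reported as flashed in this thread
candidates = ["16.00.10.00", "16.00.12.00"]   # Supermicro page / linked resource

# Versions that are numerically newer than the installed one.
newer = [v for v in candidates if parse_version(v) > parse_version(installed)]
# Highest version among everything mentioned.
latest = max(candidates + [installed], key=parse_version)

print("newer than installed:", newer)
print("latest mentioned:", latest)
```

Comparing tuples of integers avoids surprises if a field ever changes width, which a plain string comparison would not handle.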

joeschmuck · Aug 4, 2023

So if you could generate another SMART report for da0 (smartctl -a /dev/da0) and post it, we would know if SMART tests are being reported correctly on the drive.

Johnny Fartpants · Aug 4, 2023

Up until recently, hot spares that were triggered into a pool would not automatically go back to being spares after a successful replacement, and you needed to enter the commands below to make them do so.

zpool detach PrimaryPool gptid/0d48d4ab-1e91-11ed-a6aa-ac1f6be66d76

zpool detach PrimaryPool gptid/0d56b97d-1e91-11ed-a6aa-ac1f6be66d76

Although I haven't tested it I believe this was supposed to be fixed in the U5 release.

PS: My guess is that two drives dropped for some reason and the spares kicked-in. After that the two drives must have come back online and ZFS sorted itself out leaving the spares stuck (as mentioned above). Question is WHY did the two drives drop.
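Before detaching anything, it can help to confirm which spares ZFS itself lists as in use. The sketch below is an illustration only: it scans text shaped like `zpool status` output for entries marked INUSE under the `spares` heading. The sample text is a made-up layout with placeholder device names, not the poster's real pool, and `spares_in_use` is a hypothetical helper name.

```python
def spares_in_use(status_text):
    """Return device names listed as INUSE under the 'spares' heading."""
    in_spares = False
    found = []
    for line in status_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "spares":
            in_spares = True
            continue
        if in_spares and len(fields) >= 2 and fields[1] == "INUSE":
            found.append(fields[0])
    return found

# Illustrative sample only (placeholder names, assumed layout).
SAMPLE = """\
  pool: PrimaryPool
 state: ONLINE
config:

    NAME                STATE     READ WRITE CKSUM
    PrimaryPool         ONLINE       0     0     0
      mirror-0          ONLINE       0     0     0
        spare-0         ONLINE       0     0     0
          gptid/aaaa    ONLINE       0     0     0
          gptid/spare1  ONLINE       0     0     0
        gptid/bbbb      ONLINE       0     0     0
    spares
      gptid/spare1      INUSE     currently in use
      gptid/spare2      AVAIL

errors: No known data errors
"""

print(spares_in_use(SAMPLE))
```

On a real system the input would be the saved output of `zpool status PrimaryPool`; the names it reports should match the gptid values you intend to detach.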

joeschmuck · Aug 4, 2023

Johnny Fartpants said:
PS: My guess is that two drives dropped for some reason and the spares kicked-in. After that the two drives must have come back online and ZFS sorted itself out leaving the spares stuck (as mentioned above). Question is WHY did the two drives drop.

I was thinking that as well. We may never know why they dropped, but SMART reported the data below 671 times for da9, and because the log was truncated it very likely happened more than 671 times. I did not find any errors with da14; it actually looks good all around.

Code:

 200 31306:36  00000001cf2af404  [1,18,2]   Successfully reassigned
 201 31306:33  00000001cf2af22b  [1,18,2]   Successfully reassigned
 202 31292:50  00000001cf31d999  [1,18,2]   Successfully reassigned
 203 31292:50  00000001cf31d2d6  [3,11,0]   Recovered via rewrite in-place
 204 31292:50  00000001cf31d2d5  [3,11,0]   Recovered via rewrite in-place
 205 31292:49  00000001cf31d2d4  [3,11,0]   Recovered via rewrite in-place
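To get a quick tally from a log like this, here is a small Python sketch that counts entries by outcome. The column meanings (entry number, power-on hours:minutes, LBA in hex, sense codes, status) are an assumption based on how the lines look; `parse_events` is a hypothetical helper, and only the six lines quoted above are used.

```python
from collections import Counter

LOG = """\
 200 31306:36  00000001cf2af404  [1,18,2]   Successfully reassigned
 201 31306:33  00000001cf2af22b  [1,18,2]   Successfully reassigned
 202 31292:50  00000001cf31d999  [1,18,2]   Successfully reassigned
 203 31292:50  00000001cf31d2d6  [3,11,0]   Recovered via rewrite in-place
 204 31292:50  00000001cf31d2d5  [3,11,0]   Recovered via rewrite in-place
 205 31292:49  00000001cf31d2d4  [3,11,0]   Recovered via rewrite in-place
"""

def parse_events(text):
    """Turn each log line into a dict; lines with an unexpected shape are skipped."""
    events = []
    for line in text.splitlines():
        fields = line.split(None, 4)
        if len(fields) != 5:
            continue
        index, when, lba_hex, sense, status = fields
        events.append({
            "index": int(index),
            "when": when,
            "lba": int(lba_hex, 16),   # LBA column assumed to be hexadecimal
            "sense": sense,
            "status": status.strip(),
        })
    return events

events = parse_events(LOG)
counts = Counter(e["status"] for e in events)
rewritten = sorted(e["lba"] for e in events if e["status"].startswith("Recovered"))

print(counts)
print([hex(lba) for lba in rewritten])
```

In this excerpt the three rewritten LBAs are adjacent to one another, which is the kind of pattern the tally makes easier to spot across all 671 entries.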

Johnny Fartpants · Aug 4, 2023

It would be interesting to know the drives' physical location in the system to see if there is any correlation there.

isopropyl · Aug 4, 2023

joeschmuck said:
Also the drive is telling you that a Long test is in progress.

When the drive reaches 49127 hours (9.5 hours) then check the SMART data again. You should have a test completed in the 48,xxx or 49,xxx range. You should not have this line:
Code:
# 2 Background short Completed - 38827 - [- - -]
It should have moved into spot #3. If it didn't, then the failure is in your drive firmware.

I just checked the status of the SMART test on da0 with smartctl -a /dev/da0, and it still shows that a Long test is running. That 38827 entry is still in slot #2.
It has been far more than 9.5 hours since yesterday.

Code:

# smartctl -a /dev/da0
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HUS726040AL4210
Revision:             A980
Compliance:           SPC-4
User Capacity:        4,000,787,030,016 bytes [4.00 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000cca244194888
Serial number:        N8GEX1NY
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Fri Aug  4 07:21:51 2023 EDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     41 C
Drive Trip Temperature:        85 C

Accumulated power on time, hours:minutes 48596:35
Manufactured in week 29 of year 2016
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  154
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  2178
Elements in grown defect list: 1

Vendor (Seagate Cache) information
  Blocks sent to initiator = 5867418979139584

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0       58         0        58   13376933     454325.925  0
write:         0     2858         0      2858    3392811      36649.271  0
verify:        0        0         0         0     410966          0.000  0

Non-medium error count:        0

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Self test in progress ...   -     NOW                 - [-   -    -]
# 2  Background short  Completed                   -   38827                 - [-   -    -]
# 3  Background short  Completed                   -   38803                 - [-   -    -]
# 4  Background short  Completed                   -   38779                 - [-   -    -]
# 5  Background short  Completed                   -   38755                 - [-   -    -]
# 6  Background short  Completed                   -   38731                 - [-   -    -]
# 7  Background short  Completed                   -   38707                 - [-   -    -]
# 8  Background short  Completed                   -   38683                 - [-   -    -]
# 9  Background long   Completed                   -   38671                 - [-   -    -]
#10  Background short  Completed                   -   38659                 - [-   -    -]
#11  Background short  Completed                   -   38635                 - [-   -    -]
#12  Background short  Completed                   -   38611                 - [-   -    -]
#13  Background short  Completed                   -   38587                 - [-   -    -]
#14  Background short  Completed                   -   38563                 - [-   -    -]
#15  Background short  Completed                   -   38539                 - [-   -    -]
#16  Background short  Completed                   -   38515                 - [-   -    -]
#17  Background long   Completed                   -   38510                 - [-   -    -]
#18  Background short  Completed                   -   38491                 - [-   -    -]

isopropyl · Aug 4, 2023

Davvo said:
The firmware you have flashed is 16.00.01.00, dated 2018; the latest firmware on Supermicro's page is 16.00.10.00, dated 2021.
The firmware I linked you to, which is compatible with your controller as other users have pointed out in that thread, is 16.00.12.00. I see no harm in flashing the latest known working firmware, especially since you are having some kind of issue with the drives and their SMART tests.
If you don't feel confident flashing 16.00.12.00, please flash 16.00.10.00 from the Supermicro page you linked. This might solve your issue.

Yeah, I noticed that it is not the latest on the Supermicro site either. I will go ahead and flash the .12 update; I'm just waiting for confirmation in that thread as to the safe process to follow. However, again, all my drives are SAS and that post states the issue only applies to SATA.
Whether or not something between .01 and .10 fixed an issue, I don't know. But I have been running this rig for about 1.5 years now and have not noticed an issue like this before, whether or not that means anything. I don't think it's related, but I could be wrong.

Johnny Fartpants said:
It would be interesting to know the drives' physical location in the system to see if there is any correlation there.

What do you mean by physical location? They are all situated on the back plane, in the bays, connected to 1 HBA..?
On my server rack?

Johnny Fartpants said:
Up until recently, hot spares that were triggered into a pool would not automatically go back to being spares after a successful replacement, and you needed to enter the commands below to make them do so.

zpool detach PrimaryPool gptid/0d48d4ab-1e91-11ed-a6aa-ac1f6be66d76

zpool detach PrimaryPool gptid/0d56b97d-1e91-11ed-a6aa-ac1f6be66d76

Although I haven't tested it I believe this was supposed to be fixed in the U5 release.

Well I am on TrueNAS-13.0-U5.3
But how can I confirm that the original drives are good and okay before detaching these? Are you confident in your assessment that I should just go ahead and detach them?

Johnny Fartpants said:
PS: My guess is that two drives dropped for some reason and the spares kicked-in. After that the two drives must have come back online and ZFS sorted itself out leaving the spares stuck (as mentioned above). Question is WHY did the two drives drop.

Well, I replaced two drives not that long ago. I replaced one about 2 months ago; I forget when the other one was, probably a little longer ago. However, if I remember correctly, I did not see the spares in the pool after either replacement. I could be wrong, but I'm almost confident that they were not active after replacing the drives. So all had seemed good.

joeschmuck · Aug 4, 2023

As for the SMART tests that never stop running, my advice is to power off the drives, not sleep but remove power. Wait a few minutes and then power back on. Once all is up and running, check SMART again. Powering off the drives will kick them out of a SMART test. If they are still reporting that a long test is running, report that back. I do not think this is related to the LSI firmware but I guess anything is possible. You could also boot up Ubuntu and see what the drives report if things are not going well.

Best of luck.

isopropyl · Aug 4, 2023

joeschmuck said:
As for the SMART tests that never stop running, my advice is to power off the drives, not sleep but remove power. Wait a few minutes and then power back on. Once all is up and running, check SMART again. Powering off the drives will kick them out of a SMART test. If they are still reporting that a long test is running, report that back.

Would a simple machine shutdown and flick of the power switch of the machine achieve this?
I have rebooted the machine not that long ago. But not since making this thread. Guess a shutdown couldn't hurt.

joeschmuck · Aug 4, 2023

Yes, so long as you powered off the hard drives. If they are in a separate enclosure, ensure that gets powered off as well. The goal is to make sure the SMART test is terminated. Removing power accomplishes this.

Johnny Fartpants · Aug 4, 2023

isopropyl said:
What do you mean by physical location? They are all situated on the back plane, in the bays, connected to 1 HBA..?
On my server rack?

I mean are they next to each other, both in the front or both in the back etc? Try to see if there is any cable link etc.

isopropyl said:
Well I am on TrueNAS-13.0-U5.3
But how can I confirm that the original drives are good and okay before detaching these? Are you confident in your assessment that I should just go ahead and detach them?

This is your machine so you need to make these decisions not me. However if it were mine I'd first make sure I had a backup and then try one of the above commands. My 'GUESS' is that all is ok as drives are reporting online and that after issuing the command the drive would return as a spare.

PS: naturally try one first and if all is ok the other.

isopropyl · Aug 4, 2023

joeschmuck said:
As for the SMART tests that never stop running, my advice is to power off the drives, not sleep but remove power. Wait a few minutes and then power back on. Once all is up and running, check SMART again. Powering off the drives will kick them out of a SMART test. If they are still reporting that a long test is running, report that back.

Ok, just booted the machine back up. I had it off for significantly longer than necessary anyways. It was down for about 2 hours, power unplugged.
So I just ran smartctl -a /dev/da0 and I am still seeing that background long test 'running'.
I went ahead and sent you one more email dump if you wanted to take a peek at the report results to see if you notice anything different that stands out to you since the last one. Seemed the same though. Mirrors still in effect too (I haven't detached them yet).

Code:

# smartctl -a /dev/da0
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HUS726040AL4210
Revision:             A980
Compliance:           SPC-4
User Capacity:        4,000,787,030,016 bytes [4.00 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000cca244194888
Serial number:        N8GEX1NY
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Fri Aug  4 18:43:32 2023 EDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     28 C
Drive Trip Temperature:        85 C

Accumulated power on time, hours:minutes 48604:46
Manufactured in week 29 of year 2016
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  155
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  2180
Elements in grown defect list: 1

Vendor (Seagate Cache) information
  Blocks sent to initiator = 5888237742587904

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0       58         0        58   13378166     454330.167  0
write:         0     2872         0      2872    3395057      36800.589  0
verify:        0        0         0         0     410966          0.000  0

Non-medium error count:        0

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Self test in progress ...   -     NOW                 - [-   -    -]
# 2  Background short  Completed                   -   38827                 - [-   -    -]
# 3  Background short  Completed                   -   38803                 - [-   -    -]
# 4  Background short  Completed                   -   38779                 - [-   -    -]
# 5  Background short  Completed                   -   38755                 - [-   -    -]
# 6  Background short  Completed                   -   38731                 - [-   -    -]
# 7  Background short  Completed                   -   38707                 - [-   -    -]
# 8  Background short  Completed                   -   38683                 - [-   -    -]
# 9  Background long   Completed                   -   38671                 - [-   -    -]
#10  Background short  Completed                   -   38659                 - [-   -    -]
#11  Background short  Completed                   -   38635                 - [-   -    -]
#12  Background short  Completed                   -   38611                 - [-   -    -]
#13  Background short  Completed                   -   38587                 - [-   -    -]
#14  Background short  Completed                   -   38563                 - [-   -    -]
#15  Background short  Completed                   -   38539                 - [-   -    -]
#16  Background short  Completed                   -   38515                 - [-   -    -]
#17  Background long   Completed                   -   38510                 - [-   -    -]
#18  Background short  Completed                   -   38491                 - [-   -    -]
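Since the question was whether anything changed between the two reports, here is a short Python sketch that diffs a few counters copied by hand from the morning and evening outputs above. The dictionary keys are made-up labels for illustration; the numbers are taken directly from the two pastes.

```python
def to_minutes(hhmm):
    """Convert an 'hours:minutes' power-on reading into total minutes."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)

# Values copied from the 07:21 report.
before = {"power_on": "48596:35", "start_stop": 154, "load_unload": 2178,
          "read_corrected": 58, "write_corrected": 2858, "grown_defects": 1}
# Values copied from the 18:43 report (after the power-off).
after = {"power_on": "48604:46", "start_stop": 155, "load_unload": 2180,
         "read_corrected": 58, "write_corrected": 2872, "grown_defects": 1}

delta = {key: after[key] - before[key] for key in before if key != "power_on"}
delta["power_on_minutes"] = to_minutes(after["power_on"]) - to_minutes(before["power_on"])

print(delta)
```

The start-stop count going up by one is consistent with the power cycle having actually reached the drive, while the self-test log itself is unchanged between the two reports.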

joeschmuck · Aug 4, 2023

isopropyl said:
I went ahead and sent you one more email dump if you wanted to take a peek at the report results to see if you notice anything different that stands out to you since the last one.

Got the email. Yes, you still have a problem. Let's try to do some troubleshooting.

0: Try to force the Long Test to abort (smartctl -X /dev/da0), check the SMART results, and then try to run a SMART Short test (smartctl -t short /dev/da0) and see if it runs.

1: Power down, swap drive da0 (N8GEX1NY) and da1 (N8GEBA2Y), power on, and check the SMART data for both drives. Make sure you are examining the correct drive by checking the serial number. Does N8GEX1NY still have the Long test running, or has it moved to N8GEBA2Y? Swapping the drives means electrically exchanging the power and SAS/SATA cables between them. If you are using removable drive bays, just relocate the drives.

2: If the problem moves to the other drive, we need to figure out why. Could it be a backplane issue? You could also just look at the drives' locations to see if there is something in common.

3: If the problem stayed with N8GEX1NY then that is a data point.

4: Update the LSI controller firmware. I do not think this is the problem but stranger things have happened.

5: Boot up something like an Ubuntu Live image and check the SMART data for a few of the suspect drives. Does it still report that a long test is running? It most certainly should not. If it shows the drives are running a test, then I have to think the drives have a firmware glitch. But I honestly don't believe that, since the same firmware appears to be on several of the good and troubled drives.

6: If Ubuntu says the drives are normal again, you may need to submit a bug report and/or change to TrueNAS Scale. You can back up your configuration data, but DO NOT UPDATE the ZFS features or you will likely not be able to roll back to TrueNAS Core if you wanted to. I go between both Core and Scale all the time, swapping several times a day when I'm testing a script.
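The reasoning behind the swap in steps 1 to 3 can be written down as a tiny decision function. This Python sketch assumes that before the swap only N8GEX1NY showed the stuck Long test and N8GEBA2Y did not (an assumption for illustration, not something confirmed in this thread); `interpret_swap` and its return strings are made up.

```python
def interpret_swap(stuck_after_swap, suspect="N8GEX1NY", control="N8GEBA2Y"):
    """Classify where the stuck test lives once the two drives have traded places.

    stuck_after_swap: set of serial numbers that still report a running Long test.
    """
    suspect_stuck = suspect in stuck_after_swap
    control_stuck = control in stuck_after_swap
    if suspect_stuck and not control_stuck:
        return "follows the drive"           # step 3: points at the drive itself
    if control_stuck and not suspect_stuck:
        return "follows the bay or cabling"  # step 2: look at backplane or cables
    return "inconclusive"                    # both or neither: swap tells us nothing

print(interpret_swap({"N8GEX1NY"}))
print(interpret_swap({"N8GEBA2Y"}))
print(interpret_swap(set()))
```

Keying the result on serial numbers rather than da0/da1 matters because the device names can change after the drives are moved.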

What I do not want to tell you is that you should replace your problematic drives. Keep in mind that a Scrub will detect problems but it's a different type of test compared to SMART, but it's better than nothing. If you ran a SMART Long test once a week, you can run a Scrub once a week as an alternative. Again, they are not exactly the same but it's better than no testing at all.

I am out of options until you report back what you have accomplished.

As for the spares being INUSE, that is a separate topic. As mentioned before, you should ensure you have a backup of at least your important data just in case something goes wrong if you detach the spare drives. But this is your decision. You have 20TB of data, that is quite a bit.

isopropyl · Aug 5, 2023

joeschmuck said:
Got the email. Yes, you still have a problem. Let's try to do some troubleshooting.

0: Try to force the Long Test to abort (smartctl -X /dev/da0), check the SMART results

So upon aborting, I get the output: Abort self test failed [unsupported field in scsi command]
Do you want me to continue on with the other steps still?
It looks like it still shows background long test "NOW" under smartctl -a /dev/da0 so it definitely didn't abort as it stated.

Davvo · Aug 5, 2023

Seems to be an HGST issue.

Long SMART self test failed [unsupported field in scsi command]

I've had this server for a couple of months and its working fine but from the beginning I've gotten the above error on all smartctl long tests. I've tried everything I could think of, and tried some of the different options on the smartmontools wiki. All drives have this error, short tests...

www.truenas.com

Redcoat · Aug 5, 2023

No help I think, but your misery with this is being reported as a Scale experience, too https://www.truenas.com/community/t...mart-long-test-even-after-power-cycle.107151/

EDITED TO FIX BAD LINK -- THX @Davvo

Davvo · Aug 5, 2023

Redcoat said:
No help I think, but your misery with this is being reported as a Scale experience, too https://www.truenas.com/community/threads/new-scale-build-sas-drives-seem-to-be-stuck-in-smart-long-test-even-after-power-cycle.107151/

Watch out, that link goes to the HBA resource.
Anyway, worth creating a bug report to see what the engineering team says imho.

Davvo · Aug 5, 2023

Likely HGST firmware bug.

#608 (Long test hanged on HGST drives) – smartmontools

www.smartmontools.org

Entries from previously running self-tests may not be cleaned up properly if the self-test was aborted by power loss or similar.

This can happen with SCSI-related issues: the OS may try to reset the controller in this case, which typically aborts the test.

No idea how to solve it.

joeschmuck · Aug 5, 2023

isopropyl said:
Do you want me to continue on with the other steps still?

It appears to be a known issue without a fix. You could try the other steps if you desire but I do not think the problem will go away. I know there are some postings on the internet about HGST drives and fixing this issue with a firmware load, but I have not located the binary file for your drive. If you find it, I'd give it a try. This is going to require some work on your part to figure it all out. If you plan to take this action, I'd recommend you replace one of the drives with a new drive and then mess around with the single drive alone so you do not mess up any of the other drives by accident.

As I said in an earlier post, I'd recommend running weekly Scrubs, or whatever you are comfortable with. This would at least tell you whether you are having data corruption. Since all your drives are over 2000 days (about 5.5 years) of runtime, I'm pretty sure they have no more warranty, so I'd start replacing them as you can afford it. This also affords you the opportunity to restructure your pool design to utilize larger and fewer drives. You are also at 73% capacity, which is fairly high. You have some time to figure out what you desire to do, but I would not wait too long.
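As a quick check on the runtime figure, the conversion from power-on hours to days and years can be sketched in Python. The input is da0's 48604 hours from the report earlier in the thread; using 365.25 days per year is a simplifying assumption.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365.25  # simplifying assumption that averages in leap years

def runtime(power_on_hours):
    """Return (days, years) of runtime for a power-on hour count."""
    days = power_on_hours / HOURS_PER_DAY
    return days, days / DAYS_PER_YEAR

days, years = runtime(48604)  # da0, from the smartctl output in this thread
print(f"{days:.0f} days, {years:.1f} years")
```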

I sent you via email an updated multi_report_config.txt file that will ignore the Test Age alarm for those 9 drives. This could also be done using the -config switch and the Advanced Configuration.
