How to detect failed drive when all disks are marked "online"?

Scampicfx · May 30, 2021

Dear Community,

I have been observing some strange behaviour in FreeNAS for a couple of days now. It looks like one drive keeps failing. My GUI says:

Pool backup state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.

First question: How can I identify which disk failed?

The output of zpool status shows:

Code:

root@backupunit:~ # zpool status
  pool: backup
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun May 30 07:36:00 2021
        1.01T scanned at 62.4M/s, 978G issued at 59.2M/s, 44.0T total
        341M resilvered, 2.17% done, 8 days 19:29:17 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        backup                                          ONLINE       0     0 0
          raidz3-0                                      ONLINE       0     0 0
            gptid/11b5e11d-e339-11e7-8aec-0cc47adbe5ec  ONLINE       0     0 0
            gptid/1253fcf3-e339-11e7-8aec-0cc47adbe5ec  ONLINE       0     0 0
            da5p2                                       ONLINE       8  322K 0
            gptid/1392f3f7-e339-11e7-8aec-0cc47adbe5ec  ONLINE       0     0 0
            gptid/142b94f2-e339-11e7-8aec-0cc47adbe5ec  ONLINE       0     0 0
            gptid/14c33495-e339-11e7-8aec-0cc47adbe5ec  ONLINE       0     0 0
            gptid/15681cb4-e339-11e7-8aec-0cc47adbe5ec  ONLINE       0     0 0
            gptid/162998cf-e339-11e7-8aec-0cc47adbe5ec  ONLINE       0     0 0
            gptid/16bd6a6a-e339-11e7-8aec-0cc47adbe5ec  ONLINE       0     0 0

errors: No known data errors

The problem is: it looks like one of the drives goes offline and comes back online several times per day. FreeNAS tries to resilver this disk again and again, but it's pointless. Of course, when looking at the output of the command above, I guess that da5p2 is the faulty one. But I would like to be 100% sure which drive failed. I'm also wondering why FreeNAS keeps rebuilding this zpool when da5p2 keeps failing. The resilver has never got past 3%; then it starts from the beginning again.

Thank you!

sretalla · May 31, 2021

From your posted status, it certainly looks like da5 is the issue.

You should be able to confirm it more clearly with SMART data:
smartctl -a /dev/da5

Perhaps consider running that for the rest of your disks too if you're concerned it may be others that are going bad.

If you're asking about how to physically identify da5 in the chassis, you might use the Disks section in the GUI to find the Serial number or dmesg | grep Serial in the shell.

Scampicfx · Jun 2, 2021

Hey there,
thanks for your answer. I still couldn't solve this problem because FreeNAS tries to mount this device again and again. Therefore, I would like to tell FreeNAS not to use this device any more.
When I click "Offline" next to da5, it takes a while and then the following messages appear:

Code:

Jun  2 20:59:09 backupunit savecore: /dev/da5p1: Operation not permitted
Jun  2 20:59:09 backupunit savecore: /dev/da5p1: Operation not permitted

Is there a way to force this drive offline via the command line?

csax · Jun 2, 2021

sretalla said:
From your posted status, it certainly looks like da5 is the issue.

Are you sure? Because when I remove a disk from a pool it looks like this:
gptid/11b5e11d-e339-11e7-8aec-0cc47adbe5ec

So my guess would have been to check all the others except da5 :)

ChrisRJ · Jun 3, 2021

From your zpool status it is indeed the da5 disk that has errors (8 read errors and 322K write errors, to be precise).
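
For anyone who wants to script this check, here is a minimal Python sketch that scans the config table of zpool status output and reports every row whose READ, WRITE or CKSUM counter is not zero. The function name devices_with_errors and the trimmed SAMPLE text are illustrative assumptions, not part of FreeNAS; on a live system the text would come from running zpool status.

```python
# Trimmed copy of the config table posted earlier in this thread.
SAMPLE = """\
config:

        NAME                                            STATE     READ WRITE CKSUM
        backup                                          ONLINE       0     0     0
          raidz3-0                                      ONLINE       0     0     0
            gptid/1253fcf3-e339-11e7-8aec-0cc47adbe5ec  ONLINE       0     0     0
            da5p2                                       ONLINE       8  322K     0
            gptid/1392f3f7-e339-11e7-8aec-0cc47adbe5ec  ONLINE       0     0     0

errors: No known data errors
"""


def devices_with_errors(status_text):
    """Return {name: (read, write, cksum)} for rows with a non-zero counter.

    Counters are kept as the strings zpool printed (for example '322K'),
    so no assumption is made about how the suffix should be expanded.
    """
    flagged = {}
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if not in_table:
            # The table starts right after the NAME/STATE header row.
            if fields[:2] == ["NAME", "STATE"]:
                in_table = True
            continue
        if not fields:
            break  # a blank line ends the config table
        if len(fields) < 5:
            continue  # row without all three counters
        name, _state, read, write, cksum = fields[:5]
        if (read, write, cksum) != ("0", "0", "0"):
            flagged[name] = (read, write, cksum)
    return flagged


print(devices_with_errors(SAMPLE))  # {'da5p2': ('8', '322K', '0')}
```

A device can be ONLINE and still carry error counters, which is why the STATE column alone is not enough here.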

joeschmuck · Jun 3, 2021

Just to be clear here, the error message you have does not indicate that the drive failed; it indicates that you have some corrupt data. This is a fairly large distinction. I would assume that you also installed this drive via some manual means, hence the da5p2 name. Not that it's a major deal, but I would think you didn't follow the instructions on how to replace a drive properly, or maybe you did and you have some odd hardware issue.

If it were me (and I'm just speaking for myself), I would first check the SMART data for the drive da5 to see if something is wrong. If all looks good, I would remove the drive from the pool using the User Guide procedure to replace a failed drive, wipe all the data from the drive using badblocks (a single pass is fine and will also find drive issues), and then complete the Replace Drive procedure with the same drive. Once done, I would expect it to have a gptid and your zpool status to be clean.

The other option you have is to just wait for the resilver to finish and check your zpool status again. It should be all cleared up, but you would still have a drive ID of da5p2, which is not a deal breaker. If the error remains afterwards, try the command zpool clear backup, run a scrub, and see what happens.

As previously stated, if you are not periodically checking your hard drives with the built-in SMART tests, then you should set up a routine. My personal preference is to run a Short SMART Test every morning at 1 AM and a Long SMART Test once a week at 1:15 AM. Depending on your system requirements, you may want to stagger the Long test across a few drives per day. That is entirely up to you. And you should go look at the SMART data on your drives anytime you feel something might be wrong.
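
To make the staggering idea concrete, here is a small Python sketch that spreads the weekly Long test over the days of the week, round-robin. It is only one possible way to do it. The device names da0 to da8 are hypothetical placeholders for a nine-disk pool, and stagger_long_tests is an illustrative helper, not a FreeNAS function; the actual schedule would still be entered in the GUI.

```python
from itertools import cycle

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def stagger_long_tests(drives, days=DAYS):
    """Assign each drive one weekday for its Long test, round-robin."""
    schedule = {day: [] for day in days}
    # zip stops when the drive list is exhausted; cycle repeats the weekdays.
    for drive, day in zip(drives, cycle(days)):
        schedule[day].append(drive)
    return schedule


# Hypothetical device names for a nine-disk pool; adjust to the real ones.
drives = [f"da{i}" for i in range(9)]
schedule = stagger_long_tests(drives)

for day in DAYS:
    names = ", ".join(schedule[day]) or "(none)"
    print(f"{day} 01:15  long test: {names}")
```

With nine drives and seven days, two days get two drives each and the rest get one, so no day runs a Long test on more than two disks at once.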

Finally, report your system hardware configuration. It may make a difference in how we move forward if the resilver continues to repeat itself. Since you have a drive named "da5" and not "ada5", I suspect you are using an add-on controller card, or maybe a built-in high-capacity drive controller. Some of these have cables that could go bad and have gone bad. So if we know your hardware, it could help us provide better troubleshooting advice.

In the meantime, please post the output of smartctl -a /dev/da5 and we can determine if the drive has some obvious failure. Also, what version of FreeNAS are you running?

Good luck on this.

Scampicfx · Jun 4, 2021

Thanks for all your answers so far!

joeschmuck said:
Just to be clear here, the error message you have does not indicate that the drive failed; it indicates that you have some corrupt data. This is a fairly large distinction. I would assume that you also installed this drive via some manual means, hence the da5p2 name. Not that it's a major deal, but I would think you didn't follow the instructions on how to replace a drive properly, or maybe you did and you have some odd hardware issue.

The disk is installed in Supermicro Chassis 847BE1C-R1K28LPB (with SAS3 Backplane)
The motherboard used is Supermicro X10SRH-CLN4F with onboard SAS controller.
The SAS cable used is Supermicro CBL-SAST-0593

The disks are still the ones from day 1, so there has been no disk swap so far. The name "da5p2" appeared all of a sudden by itself. Previously, all disks were shown with labels like gptid/11b5e11d-e339-11e7-8aec-0cc47adbe5ec.

joeschmuck said:
If it were me (and I'm just speaking for myself), I would first check the SMART data for the drive da5 to see if something is wrong. If all looks good, I would remove the drive from the pool using the User Guide procedure to replace a failed drive, wipe all the data from the drive using badblocks (a single pass is fine and will also find drive issues), and then complete the Replace Drive procedure with the same drive. Once done, I would expect it to have a gptid and your zpool status to be clean.

This is the output of smartctl -a /dev/da5 :

Code:

Please specify device type with the -d option.

Use smartctl -h to get a usage summary

root@backupunit:~ # smartctl -a /dev/da5
smartctl 7.0 2018-12-30 r4883 [FreeBSD 11.3-RELEASE-p14 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HUH721010AL4200
Revision:             A21D
Compliance:           SPC-4
User Capacity:        10,000,831,348,736 bytes [10.0 TB]
Logical block size:   4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000cca2666f2980
Serial number:        7JHZ45ZG
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Fri Jun  4 16:23:29 2021 CEST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     37 C
Drive Trip Temperature:        85 C

Manufactured in week 27 of year 2017
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  7
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  5462
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 12300551370833920

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0   413995         0    413995    7205446     209976.009  0
write:         0        0         0         0     165038      29778.840  0
verify:        0        0         0         0   104514138          0.000   0

Non-medium error count:        0

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Failed in segment -->       3   29952                 - [0x1 0x5d 0xfd]
# 2  Background short  Failed in segment -->       3   29784                 - [0x1 0x5d 0xfd]
# 3  Background long   Failed in segment -->       3   29688                 - [0x1 0x5d 0xfd]
# 4  Background short  Failed in segment -->       3   29616                 - [0x1 0x5d 0xfd]
# 5  Background short  Failed in segment -->       3   29400                 - [0x1 0x5d 0xfd]
# 6  Background long   Failed in segment -->       3   29304                 - [0x1 0x5d 0xfd]
# 7  Background short  Failed in segment -->       3   29232                 - [0x1 0x5d 0xfd]
# 8  Background short  Failed in segment -->       3   29064                 - [0x1 0x5d 0xfd]
# 9  Background long   Failed in segment -->       3   28968                 - [0x1 0x5d 0xfd]
#10  Background short  Failed in segment -->       3   28896                 - [0x1 0x5d 0xfd]
#11  Background short  Failed in segment -->       3   28657                 - [0x1 0x5d 0xfd]
#12  Background long   Failed in segment -->       3   28561                 - [0x1 0x5d 0xfd]
#13  Background short  Failed in segment -->       3   28489                 - [0x1 0x5d 0xfd]
#14  Background short  Failed in segment -->       3   28321                 - [0x1 0x5d 0xfd]
#15  Background long   Failed in segment -->       3   28225                 - [0x1 0x5d 0xfd]
#16  Background short  Failed in segment -->       3   28153                 - [0x1 0x5d 0xfd]
#17  Background short  Failed in segment -->       3   27985                 - [0x1 0x5d 0xfd]
#18  Background long   Failed in segment -->       3   27889                 - [0x1 0x5d 0xfd]

For comparison, this is the output of e.g. smartctl -a /dev/da4 :

Code:

Long (extended) Self-test duration: 65535 seconds [1092.2 minutes]

root@backupunit:~ # smartctl -a /dev/da4
smartctl 7.0 2018-12-30 r4883 [FreeBSD 11.3-RELEASE-p14 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HUH721010AL4200
Revision:             A21D
Compliance:           SPC-4
User Capacity:        10,000,831,348,736 bytes [10.0 TB]
Logical block size:   4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000cca2666e1e34
Serial number:        7JHYKD4G
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Fri Jun  4 16:29:36 2021 CEST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     34 C
Drive Trip Temperature:        85 C

Manufactured in week 27 of year 2017
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  8
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  5548
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 14780047265103872

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0    36935         0     36935    1400333     209383.280  0
write:         0        0         0         0     268862      29623.964  0
verify:        0        0         0         0    1755781          0.000  0

Non-medium error count:        0

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Self test in progress ...   -     NOW                 - [-   -    -]
# 2  Background short  Completed                   -   25998                 - [-   -    -]
# 3  Background short  Completed                   -   25782                 - [-   -    -]
# 4  Background long   Completed                   -   25706                 - [-   -    -]
# 5  Background short  Completed                   -   25614                 - [-   -    -]
# 6  Background short  Completed                   -   25446                 - [-   -    -]
# 7  Background long   Completed                   -   25370                 - [-   -    -]
# 8  Background short  Completed                   -   25278                 - [-   -    -]
# 9  Background short  Completed                   -   25038                 - [-   -    -]
#10  Background long   Completed                   -   24961                 - [-   -    -]
#11  Background short  Completed                   -   24869                 - [-   -    -]
#12  Background short  Completed                   -   24701                 - [-   -    -]
#13  Background long   Completed                   -   24625                 - [-   -    -]
#14  Background short  Completed                   -   24533                 - [-   -    -]
#15  Background short  Completed                   -   24317                 - [-   -    -]
#16  Background long   Completed                   -   24241                 - [-   -    -]
#17  Background short  Completed                   -   24149                 - [-   -    -]
#18  Background short  Completed                   -   23981                 - [-   -    -]
#19  Background long   Completed                   -   23903                 - [-   -    -]
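
One way to put the two error counter logs side by side is to normalise the corrected read errors by the amount of data read. The Python sketch below does that for the two "read:" rows posted above. The helper name read_error_rate is made up for illustration, and the column positions are assumed from the smartctl output in this thread. Corrected errors are not unusual on SAS drives, so the ratio is only a hint, not a verdict on its own.

```python
# The "read:" rows copied from the two smartctl outputs in this thread.
DA5_READ = "read:          0   413995         0    413995    7205446     209976.009  0"
DA4_READ = "read:          0    36935         0     36935    1400333     209383.280  0"


def read_error_rate(smartctl_text):
    """Corrected read errors per 10^9 bytes processed, from the 'read:' row."""
    for line in smartctl_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "read:":
            total_corrected = int(fields[4])   # "Total errors corrected" column
            gigabytes = float(fields[6])       # "Gigabytes processed" column
            return total_corrected / gigabytes
    raise ValueError("no 'read:' row found")


rate_da5 = read_error_rate(DA5_READ)
rate_da4 = read_error_rate(DA4_READ)
ratio = rate_da5 / rate_da4

print(f"da5: {rate_da5:.2f} corrected read errors per GB")
print(f"da4: {rate_da4:.2f} corrected read errors per GB")
print(f"da5 needs correction about {ratio:.1f} times as often as da4")
```

Both drives have read nearly the same amount of data, so the roughly elevenfold difference comes almost entirely from the corrected-error count itself.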

joeschmuck said:
The other option you have is to just wait for the resilver to finish and check your zpool status again. It should be all cleared up, but you would still have a drive ID of da5p2, which is not a deal breaker. If the error remains afterwards, try the command zpool clear backup, run a scrub, and see what happens.

What does zpool clear backup do? What tasks get triggered when running this command?

joeschmuck said:
As previously stated, if you are not periodically checking your hard drives with the built-in SMART tests, then you should set up a routine. My personal preference is to run a Short SMART Test every morning at 1 AM and a Long SMART Test once a week at 1:15 AM. Depending on your system requirements, you may want to stagger the Long test across a few drives per day. That is entirely up to you. And you should go look at the SMART data on your drives anytime you feel something might be wrong.

Short and long SMART tests run regularly, multiple times a month. If I remember correctly, I spotted multiple errors on da5 in the past...

joeschmuck said:
Finally, report your system hardware configuration. It may make a difference in how we move forward if the resilver continues to repeat itself. Since you have a drive named "da5" and not "ada5", I suspect you are using an add-on controller card, or maybe a built-in high-capacity drive controller. Some of these have cables that could go bad and have gone bad. So if we know your hardware, it could help us provide better troubleshooting advice.

See above :)

Thank you guys so far for your help! But I think according to SMART, da5 is really broken, isn't it?
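
The self-test logs are the clearest difference between the two drives, and they can be tallied mechanically. Below is a minimal Python sketch that counts failed and completed rows in a SAS-style "SMART Self-test log" table. The function name and the shortened samples (the first three rows from each drive above) are illustrative assumptions. As far as the SCSI code tables go, the sense data [0x1 0x5d 0xfd] on the failed rows reads as sense key 0x1 (recovered error) with additional sense code 0x5d, which belongs to the failure-prediction-threshold family.

```python
import re
from collections import Counter

# One row of the self-test table: "#NN  Background short|long  <status>  ..."
ROW = re.compile(r"^#\s*\d+\s+Background\s+(?:short|long)\s+(.+?)\s{2,}")

DA5_LOG = """\
# 1  Background short  Failed in segment -->       3   29952                 - [0x1 0x5d 0xfd]
# 2  Background short  Failed in segment -->       3   29784                 - [0x1 0x5d 0xfd]
# 3  Background long   Failed in segment -->       3   29688                 - [0x1 0x5d 0xfd]
"""

DA4_LOG = """\
# 1  Background long   Self test in progress ...   -     NOW                 - [-   -    -]
# 2  Background short  Completed                   -   25998                 - [-   -    -]
# 3  Background short  Completed                   -   25782                 - [-   -    -]
"""


def summarise_self_tests(log_text):
    """Count self-test rows by outcome: failed, completed, or other."""
    counts = Counter()
    for line in log_text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # header or unrelated line
        status = match.group(1)
        if status.startswith("Failed"):
            counts["failed"] += 1
        elif status == "Completed":
            counts["completed"] += 1
        else:
            counts["other"] += 1  # for example a test still in progress
    return counts


print("da5:", dict(summarise_self_tests(DA5_LOG)))
print("da4:", dict(summarise_self_tests(DA4_LOG)))
```

Note that both drives report "SMART Health Status: OK", so the overall health line alone would not have caught this; the self-test history does.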

Scampicfx · Jun 4, 2021

Gents, I pulled da5 and replaced it with a new one. Now the FreeNAS GUI is working correctly again. With the old one it was stalling and very laggy! Sometimes I received the message "Error when getting pool data" when I tried to access Storage -> Pool.

The output of zpool status now is:

Code:

root@backupunit:~ # zpool status
  pool: backup
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Jun  4 17:43:16 2021
        1.72T scanned at 1.08G/s, 43.6G issued at 433M/s, 44.6T total
        3.71G resilvered, 0.10% done, 1 days 05:58:40 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        backup                                          ONLINE       0     0 0
          raidz3-0                                      ONLINE       0     0 0
            gptid/11b5e11d-e339-11e7-8aec-0cc47adbe5ec  ONLINE       0     0 0
            gptid/1253fcf3-e339-11e7-8aec-0cc47adbe5ec  ONLINE       0     0 0
            gptid/6e48c055-c54b-11eb-bcdb-0007435be350  ONLINE       0     0 0
            gptid/1392f3f7-e339-11e7-8aec-0cc47adbe5ec  ONLINE       0     0 0
            gptid/142b94f2-e339-11e7-8aec-0cc47adbe5ec  ONLINE       0     0 0
            gptid/14c33495-e339-11e7-8aec-0cc47adbe5ec  ONLINE       0     0 0
            gptid/15681cb4-e339-11e7-8aec-0cc47adbe5ec  ONLINE       0     0 0
            gptid/162998cf-e339-11e7-8aec-0cc47adbe5ec  ONLINE       0     0 0
            gptid/16bd6a6a-e339-11e7-8aec-0cc47adbe5ec  ONLINE       0     0 0

errors: No known data errors

Are there any additional commands I should use to check integrity right now?

The resilver is now also blazing fast compared to the previous attempts, when da5 kept trying to recover itself. Is there a 5-year warranty on HGST Ultrastar drives? If so, this would be a warranty case!

EDIT: My FreeNAS version: FreeNAS-11.3-U5
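
To put a number on "blazing fast", the time-to-go figures in the two zpool status outputs can be roughly reproduced from the scan lines: remaining data divided by the issue rate. The Python sketch below does this under two assumptions: that zpool prints sizes in binary units (powers of 1024), and that the issue rate stays constant. The helper names are illustrative only.

```python
import re

# Binary multipliers, assuming zpool prints sizes in powers of 1024.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text):
    """Convert a string such as '978G' or '59.2M' to a number of bytes."""
    match = re.fullmatch(r"([\d.]+)([KMGT])", text)
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    return float(match.group(1)) * UNITS[match.group(2)]


def resilver_eta_seconds(total, issued, rate_per_second):
    """Remaining time in seconds if the issue rate stayed constant."""
    remaining = parse_size(total) - parse_size(issued)
    return remaining / parse_size(rate_per_second)


# Values taken from the two "scan:" sections posted in this thread.
old = resilver_eta_seconds("44.0T", "978G", "59.2M")   # failing da5 in place
new = resilver_eta_seconds("44.6T", "43.6G", "433M")   # after the replacement

print(f"failing da5 in place: {old / 86400:.1f} days")
print(f"after replacement:    {new / 86400:.1f} days")
```

This gives about 8.8 days versus about 1.2 days, in line with the "8 days 19:29:17" and "1 days 05:58:40" that zpool itself reported.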

joeschmuck · Jun 4, 2021

Scampicfx said:
If I remember correctly, I spotted multiple errors on da5 in the past...

This was your first clue. I'm not sure why you waited so long to replace the drive. Any errors are cause for immediate investigation, regardless of the redundancy you have. As you can see, a resilver takes forever when it's trying to deal with a failing hard drive.

As for the warranty, I checked the HGST website and it stated that your drive was not covered by any warranty. I did enter United States for the location; if I got that wrong, you can check the website yourself and hopefully get a better answer. Did you by chance shuck the drive? If you did, check the enclosure it was shucked from and see if that is under warranty. If so, put it back together and send it in.

Glad you got it working, keep looking at those SMART results.

Scampicfx · Jun 4, 2021

Lesson learned. Thank you gents :) Resilver is at 28% at the moment!
The pool performance was so low in the past because this drive stalled the whole zpool... Now it's nearly back to normal even though the resilver is still running! So thanks again!

I checked the warranty and there is indeed still warranty on it, so the drive will be sent back to HGST within the next few days :)

joeschmuck · Jun 4, 2021

Good to hear.
