Disk failing - unclear which

FlyingPersian

Patron
Joined
Jan 27, 2014
Messages
237
Hello,

While I was on holiday, I left my server running. I received a few emails about a degraded state of my pool:

05.12.2021:
TrueNAS @ truenas.local

New alerts:
* Pool Data state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.

07.12.2021 + 19.12.2021 + 20.12.2021:
TrueNAS @ truenas.local

New alerts:
* Pool Data state is DEGRADED: One or more devices has been removed by the administrator. Sufficient replicas exist for the pool to continue functioning in a degraded state.
The following devices are not healthy:
  • Disk 4823604944551511973 is REMOVED

20.12.2021:
TrueNAS @ truenas.local

New alerts:
* Pool Data state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.

AND

TrueNAS @ truenas.local

New alerts:
* Pool Data state is DEGRADED: One or more devices has been removed by the administrator. Sufficient replicas exist for the pool to continue functioning in a degraded state.
The following devices are not healthy:
  • Disk 4823604944551511973 is REMOVED

I could not identify which disk "4823604944551511973" is.

zpool status (the resilvering took place 10 minutes after the last email):
Code:
root@truenas[~]# zpool status -v
  pool: Data
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 1.61M in 00:00:03 with 0 errors on Mon Dec 20 08:39:52 2021
config:

        NAME                                            STATE     READ WRITE CKSUM
        Data                                            ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/111dd78e-103b-11e9-b74e-0cc47a406253  ONLINE       0     0     1
            gptid/f862c9ea-00b4-11ea-9daf-0cc47a406253  ONLINE       0     0     1
            gptid/e79d9e0e-f731-11e9-9ef4-0cc47a406253  ONLINE       0     0     2
            gptid/ed4cf2d0-e4eb-11e8-bded-0cc47a406253  ONLINE       0     0     0
            ada3p2                                      ONLINE       0     0     2
            gptid/b9c40f77-68f3-11e8-a08d-0cc47a406253  ONLINE       0     0     1

errors: No known data errors

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:02:03 with 0 errors on Thu Dec 23 03:47:03 2021
config:

        NAME                                          STATE     READ WRITE CKSUM
        boot-pool                                     ONLINE       0     0     0
          gptid/a39dbd2e-63d6-11eb-b3a3-0cc47a406253  ONLINE       0     0     0

errors: No known data errors


I ran short SMART tests on each drive; all completed without error.

/var/log/messages says the following:

It seems to point to ada2. I have started long SMART tests on all drives for now, which will finish tomorrow at 8 PM (UTC+1). I'll report back once the tests are done. Maybe somebody has an idea of what's going on until then.

I have also been reading through this thread. Funnily enough, in my "zpool status", only ada2 does not have a checksum error.
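
To keep an eye on which pool members report checksum errors, the CKSUM column can be filtered with a small helper. This is only a sketch: `cksum_errors` is a made-up name, and it assumes the five-column device layout (NAME, STATE, READ, WRITE, CKSUM) shown in the `zpool status` output above.

```shell
# Print "<device> <cksum-count>" for every vdev line whose CKSUM column is not 0.
# Reads `zpool status`-style text on stdin (hypothetical helper, not a ZFS tool).
cksum_errors() {
    awk 'NF >= 5 && $2 ~ /^(ONLINE|DEGRADED|FAULTED|REMOVED|UNAVAIL|OFFLINE)$/ && $5 != "0" {
        print $1, $5
    }'
}

# On the live system:
#   zpool status Data | cksum_errors
```

The pool and raidz2 lines are skipped as long as their own counters are 0, so only the affected members are listed.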

OS: TrueNAS-12.0-U6
Drives: 6x WD Red 8TB
MB: Supermicro X10SLM-F
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Checksum errors are often, but not always, related to cabling or HBA problems. The fact that you have them on 5 disks points to something further upstream, unless the drives are all really old and you have been unlucky.
Make a backup and then look at cabling or host adapter overheating issues.
Maybe
 

FlyingPersian

Patron
Joined
Jan 27, 2014
Messages
237
Thanks for your reply! How do I check for host adapter overheating issues? My motherboard is a Supermicro X10SLM-F. Most of the drives are between 2 and 2.5 years old. One drive is still under warranty till Feb 22.
 

ClassicGOD

Contributor
Joined
Jul 28, 2011
Messages
145
I had random checksum errors on many drives in my old NAS when I was using the 'cool thin SATA cables'. I replaced them all with normal cables and all the errors went away. If you are using a dedicated HBA, point a fan at it and check whether the errors go away.

If you run 'blkid' you should be able to find which drive has the UUID '4823604944551511973'.
 

FlyingPersian

Patron
Joined
Jan 27, 2014
Messages
237
I'm running cables like the one attached.

blkid doesn't return anything. Maybe because I'm running long SMART tests on all drives?
 

Attachments

  • 619XIAiUh6L._SL1300_.jpg (71.8 KB)

ClassicGOD

Contributor
Joined
Jul 28, 2011
Messages
145
Those cables should be fine, but you never know. If you have a replacement on hand, I would swap them just for testing, at least on the failing drive.

Is the blkid result empty, or are you getting a message that the command does not exist? I just realized that you are on Core, and blkid is usually available on Linux systems (like Scale), although there is a FreeBSD version. I don't have a Core system to check anymore, so sorry for the confusion. You can try 'gpart list', but I'm not sure whether it will return the value you got from the alert.
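
Another angle, as a rough sketch only: the number in the alert looks like a ZFS vdev GUID rather than a partition UUID, so partition tools may not show it. On Core, `zpool status -g` should print vdev GUIDs in place of the device names, and `glabel status` maps gptid labels to adaXpY partitions. The helper name `gptid_to_dev` is made up for illustration, and the gptid in the usage comment is just one of the labels from the pool listing.

```shell
# Look up which partition backs a given gptid label.
# Reads `glabel status`-style output (Name, Status, Components) on stdin;
# $1 is the label exactly as it appears in `zpool status`.
gptid_to_dev() {
    awk -v id="$1" '$1 == id { print $3 }'
}

# On the live system (FreeBSD / TrueNAS Core):
#   zpool status -g Data     # vdev GUIDs, same order as plain `zpool status`
#   glabel status | gptid_to_dev gptid/ed4cf2d0-e4eb-11e8-bded-0cc47a406253
```

Comparing the row order of `zpool status` and `zpool status -g` gives the label for a GUID, and the helper then turns that label into a device node.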
 

FlyingPersian

Patron
Joined
Jan 27, 2014
Messages
237
It's empty, so nothing is reported back. "gpart list" didn't help, unfortunately.
 

ClassicGOD

Contributor
Joined
Jul 28, 2011
Messages
145
One more idea: did you run 'blkid' as root? It shows empty output when run as a different user. If you are running it as a different user, you need to run 'sudo blkid'.
 

FlyingPersian

Patron
Joined
Jan 27, 2014
Messages
237
I'm running it as root.

Another funny thing: I started all 6 long SMART tests yesterday. 5 of them are still running, even though 2 of them should be done. All 5 say "10% of test remaining".

The one that has finished should still have been running for another 2.5 h. It doesn't look as though it ran at all:

smartctl -l selftest /dev/ada2:
Code:
root@truenas[~]# smartctl -l selftest /dev/ada2
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p10 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      9771         -
# 2  Short offline       Completed without error       00%      8898         -


smartctl -a /dev/ada2:
 

FlyingPersian

Patron
Joined
Jan 27, 2014
Messages
237
So I changed all the cables (shame, bought extra short ones :P ) and all the errors are gone for now:

Code:
root@truenas[~]# zpool status -v
  pool: Data
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(5) for details.
  scan: resilvered 588K in 00:00:01 with 0 errors on Mon Dec 27 01:42:04 2021
config:

        NAME                                            STATE     READ WRITE CKSUM
        Data                                            ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/111dd78e-103b-11e9-b74e-0cc47a406253  ONLINE       0     0     0
            gptid/f862c9ea-00b4-11ea-9daf-0cc47a406253  ONLINE       0     0     0
            gptid/e79d9e0e-f731-11e9-9ef4-0cc47a406253  ONLINE       0     0     0
            gptid/ed4cf2d0-e4eb-11e8-bded-0cc47a406253  ONLINE       0     0     0
            ada3p2                                      ONLINE       0     0     0
            gptid/b9c40f77-68f3-11e8-a08d-0cc47a406253  ONLINE       0     0     0

errors: No known data errors

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:02:03 with 0 errors on Thu Dec 23 03:47:03 2021
config:

        NAME                                          STATE     READ WRITE CKSUM
        boot-pool                                     ONLINE       0     0     0
          gptid/a39dbd2e-63d6-11eb-b3a3-0cc47a406253  ONLINE       0     0     0

errors: No known data errors
 