Pool offline Troubleshooting

Morikaen

Dabbler
Joined
Nov 21, 2021
Messages
23
Hi everyone, and thanks for reading yet another "Pool offline" thread and trying to help.
I've been using TrueNAS for almost a year, and this is the first situation I don't know how to handle.

One of my pools went offline without any warning and I don't know how to fix it.

I'll share my system information

Platform: Generic
Version: TrueNAS-12.0-U6.1
Installed over a Generic AMD PC

I have 3 pools and this one went offline.

[Screenshot: pool list in the web UI showing the pool offline]


This pool uses this Western Digital disk, whose label strangely doesn't exist any more:

[Screenshot: disk details for the Western Digital drive]



A couple of CLI outputs

Code:
root@truenas[~]# zpool status -v
  pool: SSDPool
 state: ONLINE
  scan: scrub repaired 0B in 00:01:22 with 0 errors on Sun Nov 21 00:01:22 2021
config:

        NAME                                          STATE     READ WRITE CKSUM
        SSDPool                                       ONLINE       0     0     0
          gptid/4fb2d2ec-447c-11eb-86ab-f46d04a297df  ONLINE       0     0     0

errors: No known data errors

  pool: StripePool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:03 with 0 errors on Sun Nov 21 00:00:03 2021
config:

        NAME                                          STATE     READ WRITE CKSUM
        StripePool                                    ONLINE       0     0     0
          gptid/a88992bb-49d4-11eb-bfc3-f46d04a297df  ONLINE       0     0     0
          gptid/ac2729d1-49d4-11eb-bfc3-f46d04a297df  ONLINE       0     0     0
          gptid/ae780bfd-49d4-11eb-bfc3-f46d04a297df  ONLINE       0     0     0

errors: No known data errors

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00


As you can see, there is no trace of the missing TrasientPool.


Code:
root@truenas[~]# camcontrol devlist
<ST3500630AS 3.AAK>                at scbus0 target 0 lun 0 (pass0,ada0)
<ST9250827AS 3.AAA>                at scbus3 target 2 lun 0 (pass1,ada1)
<HTS541080G9SA00 MB4IC60R>         at scbus3 target 3 lun 0 (pass2,ada2)
<Port Multiplier 5755197b 000e>    at scbus3 target 15 lun 0 (pass3,pmp0)
<KINGSTON SV300S37A120G 521ABBF0>  at scbus7 target 0 lun 0 (pass4,ada3)
<SanDisk SDSSDA120G Z32080RL>      at scbus8 target 0 lun 0 (pass5,ada4)
<WDC WD40EFAX-68JH4N0 82.00A82>     at scbus7 target 0 lun 0 (pass4,ada3)


Any help is greatly appreciated.
Thanks a lot.
 

Morikaen

Dabbler
Joined
Nov 21, 2021
Messages
23
Additional info

Code:
root@truenas[~]# zpool import
   pool: TrasientPool
     id: 4131727059966993776
  state: FAULTED
status: One or more devices contains corrupted data.
 action: The pool cannot be imported due to damaged devices or data.
    The pool may be active on another system, but can be imported using
    the '-f' flag.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-5E
 config:

    TrasientPool                                  FAULTED  corrupted data
      gptid/40d07770-45fd-11eb-b71e-001e8c565388  FAULTED  corrupted data



With the -f option, same output :(

Code:
root@truenas[~]# zpool import -f
   pool: TrasientPool
     id: 4131727059966993776
  state: FAULTED
status: One or more devices contains corrupted data.
 action: The pool cannot be imported due to damaged devices or data.
    The pool may be active on another system, but can be imported using
    the '-f' flag.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-5E
 config:

    TrasientPool                                  FAULTED  corrupted data
      gptid/40d07770-45fd-11eb-b71e-001e8c565388  FAULTED  corrupted data
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,702

Morikaen

Dabbler
Joined
Nov 21, 2021
Messages
23
Thanks for your answer, @sretalla!

Here is the output of smartctl.
I'm currently running a long test, expected to finish in an hour.
It seems that the disk is OK... but the pool isn't.
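
For reference, the long test can be started and monitored from the CLI roughly like this (a sketch, using the same /dev/ada5 device as in the output below):

Code:
# start an extended (long) SMART self-test on the suspect disk
smartctl -t long /dev/ada5

# check progress and results later; completed tests show up in the self-test log
smartctl -l selftest /dev/ada5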

Code:
root@truenas[~]# smartctl  -a /dev/ada5
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p10 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red (SMR)
Device Model:     WDC WD40EFAX-68JH4N0
Serial Number:    WD-WX32D70DNLAH
LU WWN Device Id: 5 0014ee 2bdb73e0f
Firmware Version: 82.00A82
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Mon Nov 22 14:38:12 2021 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (22184) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 200) minutes.
Conveyance self-test routine
recommended polling time:      (   2) minutes.
SCT capabilities:            (0x3039)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   222   201   021    Pre-fail  Always       -       1883
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       220
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       761
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       219
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       150
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       72
194 Temperature_Celsius     0x0022   118   103   000    Old_age   Always       -       29
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       751         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,702
OK, your disk is looking OK (pending the long test results).

If that comes up clean too, you can try importing the pool with some options to handle the corruption.

You could start with zpool import -F -n TrasientPool (which should not do anything by itself, but will tell you what would happen if you ran the command again with only -F... expect to lose some transactions in order to handle the corruption).
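
Spelled out, the sequence would look roughly like this (a sketch, using the pool name as shown by zpool import):

Code:
# dry run: report whether a recovery (rewind) import would succeed, without importing anything
zpool import -F -n TrasientPool

# if the dry run looks acceptable, do the recovery import for real
# (this may discard the most recent transactions to get back to a consistent state)
zpool import -F TrasientPool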

 

Morikaen

Dabbler
Joined
Nov 21, 2021
Messages
23
Thanks again @sretalla

The command zpool import -F -n TrasientPool exited with no output, but the pool didn't change status.

Code:
root@truenas[~]# zpool import -F -n TrasientPool
root@truenas[~]#
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,702
OK, with that lack of feedback, it may have been the case that a "normal import" was possible...

What does zpool status -v show?
 

Morikaen

Dabbler
Joined
Nov 21, 2021
Messages
23
OK, with that lack of feedback, it may have been the case that a "normal import" was possible...

What does zpool status -v show?

No trace of TrasientPool

Code:
root@truenas[~]# zpool status -v
  pool: SSDPool
 state: ONLINE
  scan: scrub repaired 0B in 00:01:22 with 0 errors on Sun Nov 21 00:01:22 2021
config:

    NAME                                          STATE     READ WRITE CKSUM
    SSDPool                                       ONLINE       0     0     0
      gptid/4fb2d2ec-447c-11eb-86ab-f46d04a297df  ONLINE       0     0     0

errors: No known data errors

  pool: StripePool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:03 with 0 errors on Sun Nov 21 00:00:03 2021
config:

    NAME                                          STATE     READ WRITE CKSUM
    StripePool                                    ONLINE       0     0     0
      gptid/a88992bb-49d4-11eb-bfc3-f46d04a297df  ONLINE       0     0     0
      gptid/ac2729d1-49d4-11eb-bfc3-f46d04a297df  ONLINE       0     0     0
      gptid/ae780bfd-49d4-11eb-bfc3-f46d04a297df  ONLINE       0     0     0

errors: No known data errors

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:01:35 with 0 errors on Sat Nov 20 03:46:35 2021
config:

    NAME        STATE     READ WRITE CKSUM
    boot-pool   ONLINE       0     0     0
      ada2p2    ONLINE       0     0     0

errors: No known data errors
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,702
OK, so it should still be in zpool import then.

Let's try -f first and see if that works... if not, then -F

zpool import -f -n TrasientPool
 

Morikaen

Dabbler
Joined
Nov 21, 2021
Messages
23
Thanks a lot again, @sretalla.
Here is the result:

Code:
root@truenas[~]# zpool import -f  TrasientPool
internal error: cannot import 'TrasientPool': Integrity check failed
zsh: abort (core dumped)  zpool import -f TrasientPool

root@truenas[~]# zpool import -F -n TrasientPool
root@truenas[~]#      (no output)

 


These are the last lines of the messages log:

Code:
Nov 22 22:47:23 truenas kernel: pid 6263 (zpool), jid 0, uid 0: exited on signal 6 (core dumped)
Nov 22 22:48:01 truenas (aprobe0:ata5:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Nov 22 22:48:01 truenas (aprobe0:ata5:0:0:0): CAM status: Unconditionally Re-queue Request
Nov 22 22:48:01 truenas (aprobe0:ata5:0:0:0): Error 5, Retry was blocked
Nov 22 22:48:09 truenas (aprobe0:ata5:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Nov 22 22:48:09 truenas (aprobe0:ata5:0:0:0): CAM status: Unconditionally Re-queue Request
Nov 22 22:48:09 truenas (aprobe0:ata5:0:0:0): Error 5, Retry was blocked
Nov 22 22:48:15 truenas (noperiph:ata5:0:-1:ffffffff): rescan already queued
Nov 22 22:48:21 truenas (aprobe0:ata5:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Nov 22 22:48:21 truenas (aprobe0:ata5:0:0:0): CAM status: Unconditionally Re-queue Request
Nov 22 22:48:21 truenas (aprobe0:ata5:0:0:0): Error 5, Retry was blocked
Nov 22 22:48:22 truenas kernel: pid 6281 (zpool), jid 0, uid 0: exited on signal 6 (core dumped)
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,702
OK, so if your long test came back clean, we're looking at either a SATA cable or port problem, or a failing controller.

Check in that order... swap cables and ports and see if anything changes.
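
A rough way to watch for those symptoms while you swap things around (a sketch, using the same ada5 device/channel that shows up in your logs):

Code:
# look for CAM/ATA retry errors on the suspect channel - typical of cable, port or controller trouble
grep -E 'ata5|CAM status' /var/log/messages | tail -n 20

# a climbing UDMA_CRC_Error_Count usually points at a bad SATA cable
smartctl -A /dev/ada5 | grep -i udma_crc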
 

Morikaen

Dabbler
Joined
Nov 21, 2021
Messages
23
Thanks @sretalla,
The long test doesn't seem to finish correctly. I'll launch it again this afternoon.
I'll try to swap some cables and connect this device directly to a motherboard port to see if it makes any difference.
 

Morikaen

Dabbler
Joined
Nov 21, 2021
Messages
23
Changing cables didn't work.
The long test never finishes.

The disk is probably not as healthy as it seems. It has started to make sounds...
I'm thinking about testing it thoroughly with another tool.
Do you recommend any testing tool?

I was looking at Challenger Os...
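
One way I could run a full surface test from a Linux box would be a destructive badblocks pass (just a sketch, and it wipes the disk, so only for a drive whose data is already written off):

Code:
# DESTRUCTIVE write-mode surface test: writes patterns across the whole disk and reads them back
# /dev/sdX is a placeholder for whatever name the disk gets on the Linux box
badblocks -wsv /dev/sdX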
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,702

Morikaen

Dabbler
Joined
Nov 21, 2021
Messages
23
Thanks again for your support, @sretalla.

Just to update the post:
Yesterday I restarted the server, plugged the disk back in and launched the long test again.
It came out clean. No errors.
zpool import still shows the same error.

Today I plugged the disk into a Linux server and launched a complete surface test with ChallegerRocket (27 hours expected).

Thanks again.
 

Morikaen

Dabbler
Joined
Nov 21, 2021
Messages
23
Hi, sorry for the delay, I've been dealing with a popular virus. :rolleyes:

So, a quick summary:
- Pool offline
- Disk test results: success
Code:
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       874         -
# 2  Conveyance offline  Completed without error       00%       786         -
# 3  Extended offline    Completed without error       00%       784         -
# 4  Short offline       Completed without error       00%       751         -

- External disk test: OK
- Changing cables made no difference

zpool import
Code:
root@truenas[~]# zpool import
   pool: TrasientPool
     id: 4131727059966993776
  state: FAULTED
status: One or more devices contains corrupted data.
 action: The pool cannot be imported due to damaged devices or data.
    The pool may be active on another system, but can be imported using  the '-f' flag.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-5E
 config:

    TrasientPool                                  FAULTED  corrupted data
      gptid/40d07770-45fd-11eb-b71e-001e8c565388  FAULTED  corrupted data


zpool import -FX TrasientPool
Code:
cannot import 'TrasientPool': one or more devices is currently unavailable


The console.log excerpt covering the pool import attempt at startup:

Code:
Dec  5 17:14:01 truenas spa.c:6032:spa_import(): spa_import: importing TrasientPool
Dec  5 17:14:01 truenas spa_misc.c:411:spa_load_note(): spa_load(TrasientPool, config trusted): LOADING
Dec  5 17:14:01 truenas vdev.c:129:vdev_dbgmsg(): disk vdev '/dev/gptid/40d07770-45fd-11eb-b71e-001e8c565388': best uberblock found for spa TrasientPool. txg 527084
Dec  5 17:14:01 truenas spa_misc.c:411:spa_load_note(): spa_load(TrasientPool, config untrusted): using uberblock with txg=527084
Dec  5 17:14:01 truenas vdev.c:129:vdev_dbgmsg(): disk vdev '/dev/gptid/40d07770-45fd-11eb-b71e-001e8c565388': vdev_load: zap_lookup(top_zap=130) failed [error=97]
Dec  5 17:14:01 truenas spa_misc.c:396:spa_load_failed(): spa_load(TrasientPool, config trusted): FAILED: vdev_load failed [error=97]
Dec  5 17:14:01 truenas spa_misc.c:411:spa_load_note(): spa_load(TrasientPool, config trusted): UNLOADING
Dec  5 17:14:01 truenas spa_misc.c:411:spa_load_note(): spa_load(TrasientPool, config trusted): spa_load_retry: rewind, max txg: 527083
Dec  5 17:14:01 truenas spa_misc.c:411:spa_load_note(): spa_load(TrasientPool, config trusted): LOADING
Dec  5 17:14:01 truenas vdev.c:129:vdev_dbgmsg(): disk vdev '/dev/gptid/40d07770-45fd-11eb-b71e-001e8c565388': best uberblock found for spa TrasientPool. txg 527083
Dec  5 17:14:01 truenas spa_misc.c:411:spa_load_note(): spa_load(TrasientPool, config untrusted): using uberblock with txg=527083
Dec  5 17:14:01 truenas vdev.c:129:vdev_dbgmsg(): disk vdev '/dev/gptid/40d07770-45fd-11eb-b71e-001e8c565388': vdev_load: zap_lookup(top_zap=130) failed [error=97]
Dec  5 17:14:01 truenas spa_misc.c:396:spa_load_failed(): spa_load(TrasientPool, config trusted): FAILED: vdev_load failed [error=97]
Dec  5 17:14:01 truenas spa_misc.c:411:spa_load_note(): spa_load(TrasientPool, config trusted): UNLOADING
Dec  5 17:14:01 truenas spa_misc.c:411:spa_load_note(): spa_load(TrasientPool, config trusted): spa_load_retry: rewind, max txg: 527082
Dec  5 17:14:01 truenas spa_misc.c:411:spa_load_note(): spa_load(TrasientPool, config trusted): LOADING
Dec  5 17:14:01 truenas vdev.c:129:vdev_dbgmsg(): disk vdev '/dev/gptid/40d07770-45fd-11eb-b71e-001e8c565388': best uberblock found for spa TrasientPool. txg 527082
Dec  5 17:14:01 truenas spa_misc.c:411:spa_load_note(): spa_load(TrasientPool, config untrusted): using uberblock with txg=527082
Dec  5 17:14:01 truenas vdev.c:129:vdev_dbgmsg(): disk vdev '/dev/gptid/40d07770-45fd-11eb-b71e-001e8c565388': vdev_load: zap_lookup(top_zap=130) failed [error=97]
Dec  5 17:14:01 truenas spa_misc.c:396:spa_load_failed(): spa_load(TrasientPool, config trusted): FAILED: vdev_load failed [error=97]
Dec  5 17:14:01 truenas spa_misc.c:411:spa_load_note(): spa_load(TrasientPool, config trusted): UNLOADING
Dec  5 17:14:01 truenas spa_misc.c:411:spa_load_note(): spa_load(TrasientPool, config trusted): spa_load_retry: rewind, max txg: 527081
Dec  5 17:14:01 truenas spa_misc.c:411:spa_load_note(): spa_load(TrasientPool, config trusted): LOADING
Dec  5 17:14:01 truenas vdev.c:129:vdev_dbgmsg(): disk vdev '/dev/gptid/40d07770-45fd-11eb-b71e-001e8c565388': best uberblock found for spa TrasientPool. txg 527081
Dec  5 17:14:01 truenas spa_misc.c:411:spa_load_note(): spa_load(TrasientPool, config untrusted): using uberblock with txg=527081
Dec  5 17:14:01 truenas vdev.c:129:vdev_dbgmsg(): disk vdev '/dev/gptid/40d07770-45fd-11eb-b71e-001e8c565388': vdev_load: zap_lookup(top_zap=130) failed [error=97]
Dec  5 17:14:01 truenas spa_misc.c:396:spa_load_failed(): spa_load(TrasientPool, config trusted): FAILED: vdev_load failed [error=97]
Dec  5 17:14:01 truenas spa_misc.c:411:spa_load_note(): spa_load(TrasientPool, config trusted): UNLOADING
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
Hey @Morikaen,

So that is a pool with a single vdev, itself containing a single drive. As such, it has no redundancy. No matter whether it came from a cable, non-ECC memory or anything else, if corruption did hit critical ZFS structures, that would mean your pool is toast.

And unfortunately, from what I see here, this is the most probable explanation...
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
So, data is lost forever?

It may very well be...

Any advice on next steps?

The next step would be to restore your backups, but I guess you do not have any...

The error message points to either a hardware problem or data corruption. @sretalla guided you through the typical hardware problems and the hardware came back clean. As such, it points to data corruption, which here is unfixable. If it were fixable, ZFS would have fixed it already. You had a pool without any redundancy, and you see the end result here...

Next time, ensure that:
1 - You have redundancy (see the sketch below)
2 - You have backups, because no single server, TrueNAS or otherwise, can ever be more than a single point of failure
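
As an illustration only (disk names da1/da2 are placeholders, and on TrueNAS you would normally build the pool through the web UI rather than the CLI), redundancy at the vdev level looks roughly like this:

Code:
# a two-way mirror: either disk can fail without taking the pool offline
zpool create NewPool mirror da1 da2

# confirm the layout
zpool status NewPool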
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,455
Any advice on next steps?
See what it tells you with zpool import -F -f -n TrasientPool. Depending on where the bad data is, it may be possible to import the pool at the cost of whatever data was most recently written there, and this command will tell you if the system can do that. It won't actually import the pool (you'd run the same command without the -n to do that), but it should let you know whether it's possible.
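
As a two-step sketch (pool name taken from the zpool import output above):

Code:
# step 1: dry run - forced import with recovery mode, but only report what would happen
zpool import -F -f -n TrasientPool

# step 2: if the dry run says a rewind is possible, run it for real (recent transactions may be lost)
zpool import -F -f TrasientPool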
 