Pool Offline, cannot import, but shows up Online via zpool import

kooplaah

Dabbler
Joined
Oct 9, 2023
Messages
13
Hi all, I'm having an issue I would greatly appreciate help with. First things first:

HARDWARE SETUP:
- Motherboard make and model
Asrock Rack ROMED8-2T
- CPU make and model
AMD EPYC 7282
- RAM quantity
64GB RAM
- Hard drives, quantity, model numbers, and RAID configuration, including boot drives
7x Seagate_IronWolf_ZA2000NM10002-2ZG103 - 2TB... we'll come back to this; I'm not 100% sure what the RAID config was...
- Hard disk controllers
LSI 9300-8i
- Network cards
Intel Corporation Ethernet Controller 10G X550T

I've been running this config for ~10 months. For the last few months my pool has been degraded, but I've never lost any data: I have a couple of spares in there, there haven't been any SMART failures, and when the pool gets scrubbed it comes back to full health, so I've been ignoring it until I could spin up another pool for some testing. I'm pretty sure it's related to write errors from a sketchy Portainer instance I'm running, but I can't confirm... Of course, I didn't get the chance before today, when I rebooted the machine and found that my pool was not remounted on boot. This is what I'm seeing:

[attached screenshots: 1696905542843.png, 1696905587569.png]


The system knows the disks belong to this pool, but I cannot import the pool (it's not available in the dropdown in the GUI), and via zpool import I'm getting:

Code:
# zpool import
   pool: iliad
     id: 7288961483482980890
  state: ONLINE
 action: The pool can be imported using its name or numeric identifier.
 config:

        iliad                                       ONLINE
          raidz1-0                                  ONLINE
            aede0578-5130-4402-8629-70d8a5452253    ONLINE
            508f62dc-9ce6-4016-a2de-33db254537f4    ONLINE
            spare-2                                 ONLINE
              58c92f01-fef9-4b0e-8d21-2eb5677cf696  ONLINE
              d368cb6b-b3a6-4fe9-bbec-9cab8d6a9661  ONLINE
            80dfaa92-bf60-466a-8fd0-5377b96102e5    ONLINE
            spare-4                                 ONLINE
              913800a2-da16-4c57-9c47-73b5a2f94754  ONLINE
              9aa0a10d-bdc4-41f4-9b21-edbdf4df2cb7  ONLINE
        spares
          d368cb6b-b3a6-4fe9-bbec-9cab8d6a9661
          9aa0a10d-bdc4-41f4-9b21-edbdf4df2cb7

# zpool import iliad
cannot import 'iliad': I/O error
        Destroy and re-create the pool from
        a backup source.


I've seen other posts where one or several of the drives show up as FAULTED, or the pool does, but that's not the case here. I've also tried some of the more forceful import commands from other posts (zpool import -f -F -R /mnt <pool>) using both the name and the ID, but I get the same response. Any help would be much appreciated. My #1 hope at this point is just to get the data off these drives... and to anyone out there suggesting I don't know what I'm doing: you're definitely mostly right. I'm still fairly new to TrueNAS and ZFS in general.
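Another variant I've seen suggested for I/O-error imports (untested here; noting it for completeness, and the /mnt altroot is just carried over from the commands above) is a read-only import, which avoids log replay and any writes to the pool:

```shell
# Read-only import: ZFS makes no writes and skips ZIL replay, which
# sometimes lets a damaged pool import far enough to copy data off.
# -f forces the import; -F asks ZFS to discard the last few transactions
# if that is what it takes to find an importable state.
zpool import -o readonly=on -f -F -R /mnt iliad
```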
 

kooplaah

Dabbler
Joined
Oct 9, 2023
Messages
13
I hope I'm not breaking the rules by "bumping" this, but I am trying things and posting the results (I am impatient... it's a character flaw, I know). I had the option to export the pool via the GUI, so I did that. I am now given the option to import the pool via the dropdown, with no luck, BUT(!) I have new info being reported to me this way!


Code:
[EZFS_IO] Failed to import 'iliad' pool: cannot import 'iliad' as 'iliad': I/O error

 Error: concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/plugins/zfs.py", line 438, in import_pool
    zfs.import_pool(found, pool_name, properties, missing_log=missing_log, any_host=any_host)
  File "libzfs.pyx", line 1265, in libzfs.ZFS.import_pool
  File "libzfs.pyx", line 1293, in libzfs.ZFS.__import_pool
libzfs.ZFSException: cannot import 'iliad' as 'iliad': I/O error

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.9/concurrent/futures/process.py", line 243, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/worker.py", line 115, in main_worker
    res = MIDDLEWARE._run(*call_args)
  File "/usr/lib/python3/dist-packages/middlewared/worker.py", line 46, in _run
    return self._call(name, serviceobj, methodobj, args, job=job)
  File "/usr/lib/python3/dist-packages/middlewared/worker.py", line 40, in _call
    return methodobj(*params)
  File "/usr/lib/python3/dist-packages/middlewared/worker.py", line 40, in _call
    return methodobj(*params)
  File "/usr/lib/python3/dist-packages/middlewared/schema.py", line 1382, in nf
    return func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/plugins/zfs.py", line 444, in import_pool
    self.logger.error(
  File "libzfs.pyx", line 465, in libzfs.ZFS.__exit__
  File "/usr/lib/python3/dist-packages/middlewared/plugins/zfs.py", line 442, in import_pool
    raise CallError(f'Failed to import {pool_name!r} pool: {e}', e.code)
middlewared.service_exception.CallError: [EZFS_IO] Failed to import 'iliad' pool: cannot import 'iliad' as 'iliad': I/O error
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 428, in run
    await self.future
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 463, in __run_body
    rv = await self.method(*([self] + args))
  File "/usr/lib/python3/dist-packages/middlewared/schema.py", line 1378, in nf
    return await func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/schema.py", line 1246, in nf
    res = await f(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/plugins/pool.py", line 1459, in import_pool
    await self.middleware.call('zfs.pool.import_pool', guid, opts, any_host, use_cachefile, new_name)
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1395, in call
    return await self._call(
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1352, in _call
    return await self._call_worker(name, *prepared_call.args)
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1358, in _call_worker
    return await self.run_in_proc(main_worker, name, args, job)
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1273, in run_in_proc
    return await self.run_in_executor(self.__procpool, method, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1258, in run_in_executor
    return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
middlewared.service_exception.CallError: [EZFS_IO] Failed to import 'iliad' pool: cannot import 'iliad' as 'iliad': I/O error
 
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Well, your pool has issues. It appears to have 2 faulted disks, and the spares have kicked in. How you fix that given that the pool isn't imported, I have no idea, but it's not looking good. I would guess you probably have another failed/failing disk and no more spares.

My current diagnosis is a seriously neglected pool. Do you have a backup?
 

kooplaah

Dabbler
Joined
Oct 9, 2023
Messages
13
Out of curiosity, how can you tell there are faulted drives? I would have figured they'd show up as FAULTED when I run zpool import.
And of course my backup is "in the mail"... ugh. I was waiting on a final drive before adding my backup pool. I do have 1 identical "cold spare" right now, but I'm not sure if that helps me at all.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Because the 2 spares are both listed under spares AND are listed as replacing 2 other disks. I am guessing one or more of the other disks has also gone faulty.

You had a RAIDZ1, 5 wide, with 2 spares, so you could lose 3 disks in total (one at a time, with a resilver completing between failures). You have definitely lost 2, and I assume at least 2 more are faulty. A better configuration would have been RAIDZ3, 8 wide.
 

kooplaah

Dabbler
Joined
Oct 9, 2023
Messages
13
Because the 2 spares are both listed under spares AND are listed as replacing 2 other disks. I am guessing one or more of the other disks has also gone faulty.

You had a RAIDZ1, 5 wide, with 2 spares, so you could lose 3 disks in total (one at a time, with a resilver completing between failures). You have definitely lost 2, and I assume at least 2 more are faulty. A better configuration would have been RAIDZ3, 8 wide.
Yeah, unfortunately those spares are only in there because I got them afterwards, and obviously I couldn't change the RAID layout without another pool to back the data up to.
I'm still not understanding how these drives are faulted, though. I don't have any SMART errors on any of those drives, and I don't understand why the reboot caused this. I know how to do things better in the future, lol, but I'm still hoping I can get through this with data intact first.
 

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
Having no SMART errors doesn't mean there are no errors. ZFS detects errors without using SMART. It replaced those 2 drives for a reason, SMART errors or not. It's going to be difficult NOW to figure out why they were failing, as the evidence is likely long gone. It appears all your drives are online now; perhaps a resilver "fixed" them. If you ever get it to import, you can detach the spares and the pool would go back to normal.

I suspect ignoring the degraded pool in the past was not a good idea. It is possible metadata got messed up, or some other corruption occurred, since you are getting an I/O error when trying to import or force-import, if I understand correctly. It's possible it's a hardware error of some sort, but anything I would say would be a random guess at this point. The problems needed to be handled when they first occurred. I am not aware of a way to make it import; it's possible there is, but if you have tried the force variants, that's all I've got. Without a backup, you may be out of luck.
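For reference, detaching the spares later would look something like this. A sketch only, to be run after a clean import and a finished resilver, using the spare GUIDs from the zpool import listing above:

```shell
# zpool detach on the spare half of a spare-N vdev returns the spare
# to the AVAIL list and keeps the (resilvered) original disk in place.
zpool detach iliad d368cb6b-b3a6-4fe9-bbec-9cab8d6a9661   # spare in spare-2
zpool detach iliad 9aa0a10d-bdc4-41f4-9b21-edbdf4df2cb7   # spare in spare-4
zpool status iliad   # spare-2/spare-4 should collapse back to single disks
```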
 

kooplaah

Dabbler
Joined
Oct 9, 2023
Messages
13
Alright, so I ran an experiment. At this point I've accepted I may be SOL, but I found something interesting. TL;DR: I'm pretty sure I only have 3 faulted drives, which I should be able to survive.

I noticed that when I ran zpool import with all of the drives in the pool, I wasn't getting helpful or accurate information (see above). As you both have said, there are clearly some messed-up drives... so I started pulling them out one at a time and seeing what happened. Strangely, when I did this and ran zpool import again, I got different information. It told me which drive I pulled (duh), but it also gave me the new status of the pool with that drive missing. Here's the long version:
Code:
drive 0 out = SN: 7TD005TV            80dfaa92-bf60-466a-8fd0-5377b96102e5
state: DEGRADED

        iliad                                       DEGRADED
          raidz1-0                                  DEGRADED
            aede0578-5130-4402-8629-70d8a5452253    ONLINE
            508f62dc-9ce6-4016-a2de-33db254537f4    ONLINE
            spare-2                                 ONLINE
              58c92f01-fef9-4b0e-8d21-2eb5677cf696  ONLINE
              d368cb6b-b3a6-4fe9-bbec-9cab8d6a9661  ONLINE
            80dfaa92-bf60-466a-8fd0-5377b96102e5    UNAVAIL       
            spare-4                                 ONLINE
              913800a2-da16-4c57-9c47-73b5a2f94754  ONLINE
              9aa0a10d-bdc4-41f4-9b21-edbdf4df2cb7  ONLINE


drive 1 out = SN: 7TD005GG            d368cb6b-b3a6-4fe9-bbec-9cab8d6a9661
state: FAULTED
        iliad                                       FAULTED  corrupted data
          raidz1-0                                  DEGRADED
            aede0578-5130-4402-8629-70d8a5452253    ONLINE
            508f62dc-9ce6-4016-a2de-33db254537f4    ONLINE
            spare-2                                 DEGRADED
              58c92f01-fef9-4b0e-8d21-2eb5677cf696  ONLINE
              d368cb6b-b3a6-4fe9-bbec-9cab8d6a9661  UNAVAIL
            80dfaa92-bf60-466a-8fd0-5377b96102e5    ONLINE
            spare-4                                 ONLINE
              913800a2-da16-4c57-9c47-73b5a2f94754  ONLINE
              9aa0a10d-bdc4-41f4-9b21-edbdf4df2cb7  ONLINE

drive 2 out = SN: 7TD0052M            58c92f01-fef9-4b0e-8d21-2eb5677cf696
state: DEGRADED
        iliad                                       DEGRADED
          raidz1-0                                  DEGRADED
            aede0578-5130-4402-8629-70d8a5452253    ONLINE
            508f62dc-9ce6-4016-a2de-33db254537f4    ONLINE
            spare-2                                 DEGRADED
              58c92f01-fef9-4b0e-8d21-2eb5677cf696  UNAVAIL
              d368cb6b-b3a6-4fe9-bbec-9cab8d6a9661  ONLINE
            80dfaa92-bf60-466a-8fd0-5377b96102e5    ONLINE
            spare-4                                 ONLINE
              913800a2-da16-4c57-9c47-73b5a2f94754  ONLINE
              9aa0a10d-bdc4-41f4-9b21-edbdf4df2cb7  ONLINE
        spares
          d368cb6b-b3a6-4fe9-bbec-9cab8d6a9661
          9aa0a10d-bdc4-41f4-9b21-edbdf4df2cb7

drive 3 out = SN: 7TD0067V            913800a2-da16-4c57-9c47-73b5a2f94754
state: DEGRADED
        iliad                                       DEGRADED
          raidz1-0                                  DEGRADED
            aede0578-5130-4402-8629-70d8a5452253    ONLINE
            508f62dc-9ce6-4016-a2de-33db254537f4    ONLINE
            spare-2                                 ONLINE
              58c92f01-fef9-4b0e-8d21-2eb5677cf696  ONLINE
              d368cb6b-b3a6-4fe9-bbec-9cab8d6a9661  ONLINE
            80dfaa92-bf60-466a-8fd0-5377b96102e5    ONLINE
            spare-4                                 DEGRADED
              913800a2-da16-4c57-9c47-73b5a2f94754  UNAVAIL
              9aa0a10d-bdc4-41f4-9b21-edbdf4df2cb7  ONLINE
        spares
          d368cb6b-b3a6-4fe9-bbec-9cab8d6a9661
          9aa0a10d-bdc4-41f4-9b21-edbdf4df2cb7

drive 4 out = SN: 7TD005ZB            508f62dc-9ce6-4016-a2de-33db254537f4
state: FAULTED
        iliad                                       FAULTED  corrupted data
          raidz1-0                                  DEGRADED
            aede0578-5130-4402-8629-70d8a5452253    ONLINE
            508f62dc-9ce6-4016-a2de-33db254537f4    UNAVAIL
            spare-2                                 ONLINE
              58c92f01-fef9-4b0e-8d21-2eb5677cf696  ONLINE
              d368cb6b-b3a6-4fe9-bbec-9cab8d6a9661  ONLINE
            80dfaa92-bf60-466a-8fd0-5377b96102e5    ONLINE
            spare-4                                 ONLINE
              913800a2-da16-4c57-9c47-73b5a2f94754  ONLINE
              9aa0a10d-bdc4-41f4-9b21-edbdf4df2cb7  ONLINE

drive 5 out = SN: 7TD007NP            aede0578-5130-4402-8629-70d8a5452253
state: FAULTED
        iliad                                       FAULTED  corrupted data
          raidz1-0                                  DEGRADED
            aede0578-5130-4402-8629-70d8a5452253    UNAVAIL
            508f62dc-9ce6-4016-a2de-33db254537f4    ONLINE
            spare-2                                 ONLINE
              58c92f01-fef9-4b0e-8d21-2eb5677cf696  ONLINE
              d368cb6b-b3a6-4fe9-bbec-9cab8d6a9661  ONLINE
            80dfaa92-bf60-466a-8fd0-5377b96102e5    ONLINE
            spare-4                                 ONLINE
              913800a2-da16-4c57-9c47-73b5a2f94754  ONLINE
              9aa0a10d-bdc4-41f4-9b21-edbdf4df2cb7  ONLINE

drive 6 out = SN: 7TD00BXB            9aa0a10d-bdc4-41f4-9b21-edbdf4df2cb7
state: FAULTED
        iliad                                       FAULTED  corrupted data
          raidz1-0                                  DEGRADED
            aede0578-5130-4402-8629-70d8a5452253    ONLINE
            508f62dc-9ce6-4016-a2de-33db254537f4    ONLINE
            spare-2                                 ONLINE
              58c92f01-fef9-4b0e-8d21-2eb5677cf696  ONLINE
              d368cb6b-b3a6-4fe9-bbec-9cab8d6a9661  ONLINE
            80dfaa92-bf60-466a-8fd0-5377b96102e5    ONLINE
            spare-4                                 DEGRADED
              913800a2-da16-4c57-9c47-73b5a2f94754  ONLINE
              9aa0a10d-bdc4-41f4-9b21-edbdf4df2cb7  UNAVAIL


A summary of this per drive is here:
Code:
        iliad
          raidz1-0
            a: aede0578-5130-4402-8629-70d8a5452253    -when out- POOL FAULTED
            b: 508f62dc-9ce6-4016-a2de-33db254537f4    -when out- POOL FAULTED
            c:   58c92f01-fef9-4b0e-8d21-2eb5677cf696  -when out- POOL DEGRADED
            d:   d368cb6b-b3a6-4fe9-bbec-9cab8d6a9661  -when out- POOL FAULTED
            e: 80dfaa92-bf60-466a-8fd0-5377b96102e5    -when out- POOL DEGRADED
            f:   913800a2-da16-4c57-9c47-73b5a2f94754  -when out- POOL DEGRADED
            g:   9aa0a10d-bdc4-41f4-9b21-edbdf4df2cb7  -when out- POOL FAULTED


So I'm making some assumptions here. I either have 3 or 4 faulted drives, but I don't know which ones... so I set up a system of equations to figure it out. I know the pool can survive 3 faulted drives, but with 4 I'm facing total data loss. In math terms: call a GOOD drive 1 and a FAULTED drive 0. Pulling one drive and seeing the pool report FAULTED means the remaining six drives sum to less than 4; under my assumption that only 3 or 4 of the 7 drives are good, that sum is exactly 3. Seeing DEGRADED means the sum is 4.

I can write the summary above as a system of equations, labeling the drives a through g in the order listed:
Code:
b + c + d + e + f + g = 3    (a out: FAULTED)
a + c + d + e + f + g = 3    (b out: FAULTED)
a + b + d + e + f + g = 4    (c out: DEGRADED)
a + b + c + e + f + g = 3    (d out: FAULTED)
a + b + c + d + f + g = 4    (e out: DEGRADED)
a + b + c + d + e + g = 4    (f out: DEGRADED)
a + b + c + d + e + f = 3    (g out: FAULTED)


If you solve this you get:
a = 1, b = 1, c = 0, d = 1, e = 0, f = 0, g = 1

So... I only have 3 faulted drives: c, e, and f. (Notably, c and f are the two drives the spares were already replacing.)
I'd be super open to someone checking my assumptions (and math) here. Again, I'm definitely not understanding something... otherwise things would be working, which they're not. I just don't know what I don't know...
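The elimination trick above can be checked mechanically. A quick sketch (drive letters a-g in the order they appear in the pool listing, 1 = good, 0 = faulted):

```python
# Each zpool import run omits one drive; the right-hand side is how many
# good drives must remain among the other six to explain the pool state
# (4 -> DEGRADED, 3 -> FAULTED), per the assumptions in the post above.
rhs = {'a': 3, 'b': 3, 'c': 4, 'd': 3, 'e': 4, 'f': 4, 'g': 3}

# Summing all seven equations counts every drive six times, so the
# total number of good drives is sum(rhs) / 6.
total_good = sum(rhs.values()) // (len(rhs) - 1)   # 24 // 6 = 4

# Each drive's health is the total minus the one equation that omits it.
health = {drive: total_good - s for drive, s in rhs.items()}
faulted = sorted(d for d, ok in health.items() if ok == 0)
print(total_good, health, faulted)
```

That reproduces the hand solution: 4 good drives, with c, e, and f faulted.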
 

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
I have not done the math you asked about; however, it seems super unlikely to have 4 faulted drives. I have to ask: where did these drives come from? Did you own them before TrueNAS? Did you buy them used? Were they burn-in tested? How old are they?

Your LSI 9300-8i, what is the output of:

sas2flash -list
 

kooplaah

Dabbler
Joined
Oct 9, 2023
Messages
13
I have not done the math you asked about; however, it seems super unlikely to have 4 faulted drives. I have to ask: where did these drives come from? Did you own them before TrueNAS? Did you buy them used? Were they burn-in tested? How old are they?

Your LSI 9300-8i, what is the output of:

sas2flash -list

Yeah, I tend to agree; something isn't adding up here. I bought all the drives new, though a few were in an Unraid server for about a year before I moved them all to TrueNAS. The longest lifetime of any of the 7 is ~18,000 hours according to SMART.

As for that command, here ya go!
Code:
# sas2flash -list
LSI Corporation SAS2 Flash Utility
Version 20.00.00.00 (2014.09.18)
Copyright (c) 2008-2014 LSI Corporation. All rights reserved

        No LSI SAS adapters found! Limited Command Set Available!
        ERROR: Command Not allowed without an adapter!
        ERROR: Couldn't Create Command -list
        Exiting Program.


I have also been using this HBA for a while; it also migrated from the Unraid server. FYI, it was all the same hardware: same box, new OS with TrueNAS. I did that back at the beginning of 2023, in the Feb-March range, and it's been running pretty much 24/7 since. I mainly use this pool for Docker volume storage for a few VMs running Docker Swarm nodes between this server and 2 more.
 

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
Your adapter is still in the machine? Do you have sas3flash? If so, sas3flash -list

sas2flash is wrong for that adapter, sorry.
 

kooplaah

Dabbler
Joined
Oct 9, 2023
Messages
13
Your adapter is still in the machine? Do you have sas3flash? If so, sas3flash -list

sas2flash is wrong for that adapter, sorry.

That one has much more interesting results:
Code:
# sas3flash -list
Avago Technologies SAS3 Flash Utility
Version 16.00.00.00 (2017.05.02)
Copyright 2008-2017 Avago Technologies. All rights reserved.

        Adapter Selected is a Avago SAS: SAS3008(C0)

        Controller Number              : 0
        Controller                     : SAS3008(C0)
        PCI Address                    : 00:c1:00:00
        SAS Address                    : 500605b-0-0a00-da60
        NVDATA Version (Default)       : 05.00.00.05
        NVDATA Version (Persistent)    : 05.00.00.05
        Firmware Product ID            : 0x2221 (IT)
        Firmware Version               : 05.00.00.00
        NVDATA Vendor                  : LSI
        NVDATA Product ID              : SAS9300-8i
        BIOS Version                   : 08.11.00.00
        UEFI BSD Version               : 06.00.00.00
        FCODE Version                  : N/A
        Board Name                     : SAS9300-8i
        Board Assembly                 : H3-25573-00H
        Board Tracer Number            : SV50340436

        Finished Processing Commands Successfully.
        Exiting SAS3Flash.


Not sure if this is helpful, but I also grabbed this during a recent reboot:
[attached screenshot: 1697130164857.png]
 

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
That firmware is really far behind! I believe 16 is current. I don't know where they keep changelogs, but that's old, and I would be concerned about it. At least you have IT mode. If no one else chimes in here about how you might recover the pool, I'd post in reddit's r/zfs to see if anyone there can find a way.

I wonder what you'd see with c, e, and f out?
 

kooplaah

Dabbler
Joined
Oct 9, 2023
Messages
13
That firmware is really far behind! I believe 16 is current. I don't know where they keep changelogs, but that's old, and I would be concerned about it. At least you have IT mode. If no one else chimes in here about how you might recover the pool, I'd post in reddit's r/zfs to see if anyone there can find a way.
I would never have even thought to check HBA firmware, but it makes sense! I've also found a v16 here. I'll give that a go today so at least that's taken care of.
 

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
I don't see any harm in trying an import with c, e, and f out? Doubtful, but worth a shot. Especially if you can flash the firmware first; just make sure to keep IT mode, not IR.
 

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
If you ever do get it to import, I'd be sure to immediately export until you can add at least 1 spare drive.
 

kooplaah

Dabbler
Joined
Oct 9, 2023
Messages
13
If you ever do get it to import, I'd be sure to immediately export until you can add at least 1 spare drive.
Thanks for the heads up! I do have 1 extra drive that I haven't put in yet. Hopefully that can hold me over until I get the last of my other drives for the "backup pool" I need to make...
 

kooplaah

Dabbler
Joined
Oct 9, 2023
Messages
13
That firmware is really far behind! I believe 16 is current. I don't know where they keep changelogs, but that's old, and I would be concerned about it. At least you have IT mode. If no one else chimes in here about how you might recover the pool, I'd post in reddit's r/zfs to see if anyone there can find a way.

I wonder what you'd see with c, e, and f out?
I didn't realize how right you were until I did a bit more searching. I found a great post about what may be my exact issue, right here on this forum! Still working on getting the pool to import, without much luck, but I flashed the new firmware and I'm waiting on the reboot to finish. It may not help my current situation, but hopefully it will prevent future issues.
 

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
It sounded familiar, actually; glad you found some details in the forums! Really, it's the first thing I do when I get any card: check for firmware updates. Not sure there's much chance it will help with the current issue, as the damage is likely already done.
 