System crashes on pool import

Billabong

Dabbler
Joined
Feb 9, 2022
Messages
10
My TrueNAS system started crashing recently. During the boot process, when importing the single pool poolio that I have defined, the system panics, with error messages that look like this (plus a traceback):

Code:
panic: Solaris(panic): poolio: blkptr at 0xfffffe00a0f40040 has invalid CHECKSUM 72


I tried entering single-user mode and importing the pool in a few different ways: with readonly=on; with the -F flag to find a recent working transaction; and even specifying a particular transaction ID (the oldest one I could find through zdb -ul <vdev>). However, it panics every time, with an error message like:

Code:
panic: VERIFY3(size <= SPA_MAXBLOCKSIZE) failed (20855296 > 16777216)
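
For reference, the import attempts looked roughly like this (run from single-user mode; the disk path and txg value are placeholders, and the exact rewind flags, -T in particular, can vary between ZFS versions):

Code:
zpool import -o readonly=on -R /mnt poolio     # read-only import
zpool import -F -n poolio                      # dry-run rewind to a recent good txg
zpool import -F poolio                         # actual rewind attempt
zdb -ul /dev/ada1p2                            # list labels/uberblocks to find older txg IDs (device path illustrative)
zpool import -T <txg> poolio                   # target a specific txg from the zdb output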


I ran hard disk (SMART long test) and RAM (MemTest86) checks, which surfaced zero problems.
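(The disk checks were along these lines; the device name is just an example from my setup, and MemTest86 runs from its own boot USB:)

Code:
smartctl -t long /dev/ada1    # start a long self-test on each of the five data disks
smartctl -a /dev/ada1         # afterwards, read the self-test log and error counters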

I also tried a fresh reinstall of TrueNAS, after backing up freenas-v1.db, and then importing the pool through the GUI, but the same bootloop resulted.

My diagnosis based on this info is that the pool metadata is corrupted. Please let me know if you think it's something else!

Here are the potential next steps I see based on that diagnosis:
  • Shell out 400 USD for the Klennet ZFS Recovery tool, and also figure out how to temporarily store my ~30 TB of data while later rebuilding a pool
  • Replace my non-ECC RAM with ECC RAM
  • Try to recover the pool on Ubuntu or some non-TrueNAS system
Please share your thoughts on how likely these are to succeed! I'd also like to know how to avoid something like this happening in the future, since avoiding sudden data loss is one of the main reasons I'm using this system (along with data availability). But I didn't care quite enough about the data to figure out a backup solution for all 40 TB.

Finally, here are some system specs, in case they're relevant; it's a repurposed old PC, so I imagine it's not up-to-snuff with recommendations for a purpose-built NAS :)
  • MOBO: Gigabyte GA-H87N-WIFI
  • CPU: Intel Core i5-4670
  • RAM: Kingston HyperX 8 GB (2 x 4 GB) DDR3-1600 CL9 (non-ECC)
  • Boot disk: ADATA XPG SX900 256 GB 2.5" Solid State Drive
  • HDDs: Seagate Ironwolf Pro 10TB -- 5x in RAIDZ1
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
First, it should be harmless to use a non-TrueNAS OS to examine your pool. Just don't "upgrade" your pool features or make pool / dataset changes.
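For example, from an Ubuntu live USB, something along these lines keeps everything read-only (package name is Ubuntu's; pool name taken from your post):

Code:
sudo apt install zfsutils-linux                      # ZFS tools on Ubuntu
sudo zpool import                                    # scan for importable pools
sudo zpool import -o readonly=on -N -R /mnt poolio   # read-only, no mounts, alternate root
sudo zpool status -v poolio                          # after import, see what ZFS thinks of the pool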

Sometimes these problems are caused by power supply issues, so putting your 5 ZFS data disks in another computer may allow them to work properly.

Another possible cause is your boot media. Run a ZFS scrub against it. (You state you re-installed it, so it should be good; just run the scrub anyway.)
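From a shell that would be something like this (the boot pool is named boot-pool on recent versions, freenas-boot on older ones):

Code:
zpool scrub boot-pool       # or: zpool scrub freenas-boot
zpool status -v boot-pool   # check scrub progress and any errors it finds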

As for why it happened, there can be multiple reasons:
  • Non-ECC memory, as you point out. (Though this is extremely rare, ZFS is so heavily used that it is bound to cause the rare problem for someone out there.)
  • Not performing regular ZFS scrubs of your pool, so that you accumulated multiple disk issues without ZFS knowing about them, and ZFS could no longer fix simple block errors. (Linus Tech Tips had this one SPECIFIC cause take out a major video archive; see a recent YouTube video of theirs if interested.)
  • Using RAID-Z1 on larger disks (>1-2 TB) is considered less than ideal because of the statistical probability that you will run into an unrecoverable read error during a disk-failure rebuild.
  • You don't mention it, but there are ZFS options no one should be changing on a NAS. For example, if you set "redundant_metadata" to "most" instead of the default of "all", that would reduce your redundancy; see the quick check after this list.
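As a quick check on that last point, once you have an importable pool again, something like this shows whether any such properties were changed from their defaults:

Code:
zfs get -r redundant_metadata,sync,checksum poolio   # the SOURCE column shows "default" vs "local"
zfs set redundant_metadata=all poolio                # restore the default if it was changed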

On the subject of ECC versus non-ECC: in a true server, and TrueNAS is a SERVER, the machine may stay running for weeks if not months. During that time a bit in memory can flip, and without ECC you won't know it. Then, if that bit is used and written to disk, it could theoretically corrupt a pool. This is quite unlikely. To counter this issue, simply reboot often, like once or twice a month. That will either surface a bad pool sooner, or prevent it from happening at all.

Now, can I help you further?

Maybe, but with no certainty.

This is why people here in the TrueNAS forums are a bit conservative and back up their NAS(es). (Plus, they try to use server-grade hardware, even if it's used.)
 

Billabong

Dabbler
Joined
Feb 9, 2022
Messages
10
Thanks so much for the detailed answer! Sorry I wasn't able to get back to you right away.

I tried running Ubuntu from a USB drive and importing the pool from there, but all the zpool commands I executed just hung. Even just sudo zpool import hangs. Unsure if this indicates a problem with the pool itself, or just with the exact thing I'm doing.
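
Concretely, the kind of thing that hangs looks like this (I'm not certain I tried every variant, so take the exact flags as illustrative):

Code:
sudo zpool import                           # bare scan for importable pools
sudo zpool import -d /dev/disk/by-id        # scan an explicit device directory
sudo zpool import -o readonly=on -N poolio  # read-only, no-mount attempt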

I'll try ZFS scrubbing next, but it seems likely that that won't rectify the issue, so I'll probably end up just reformatting the drives and maybe upgrading my machine. Most of what I'm losing is media for which I've kept the "receipts", so it's just a matter of redownloading. And for that reason I'll probably stick with RAID-Z1, because I care more about having an extra 10TB of space than guarding against a disk read error; no idea how likely that is though.

I never knew about having to "[perform] regular ZFS scrubs". Should I try to set up a cron job to do this?
 

Billabong

Dabbler
Joined
Feb 9, 2022
Messages
10
Couldn't run the ZFS scrub on account of not being able to import the pool (unless I'm misunderstanding). I'll take the last-ditch action of replacing the PSU (it's 12 years old anyway, and non-modular, so bad airflow in the little ITX case), and if that doesn't solve the problem, I'll write off the data as lost, and just reformat and start from scratch.

Probably won't upgrade to ECC mobo/RAM until I want to expand my machine with more disks, which would be an opportune moment to upgrade everything.
 
Joined
Oct 22, 2019
Messages
3,641
@Arwen was referring to scrubbing your boot-pool, not the data pool in question.

System -> Boot -> Actions -> Scrub Boot Pool
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Billabong said:
Thanks so much for the detailed answer! Sorry I wasn't able to get back to you right away.
You're welcome.
...
Billabong said:
I never knew about having to "[perform] regular ZFS scrubs". Should I try to set up a cron job to do this?
Yes, I was referring to the boot pool in part of my answer.

However, you would still want regular data pool scrubs, which you can configure in the TrueNAS GUI. So, whether you get your existing pool back or recreate it, set up regular SMART short tests and data pool scrubs.
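
In the GUI those live under the Tasks section (Scrub Tasks and S.M.A.R.T. Tests). On a plain ZFS box without the middleware, the equivalent would just be cron entries, roughly like this (schedule and device names are only an example):

Code:
# illustrative /etc/crontab entries -- on TrueNAS itself, use the GUI tasks instead
0 3 1,15 * * root zpool scrub poolio             # scrub the data pool twice a month
0 2 * * 0    root smartctl -t short /dev/ada1    # weekly short SMART test per data disk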
 

Billabong

Dabbler
Joined
Feb 9, 2022
Messages
10
I tried setting the vfs.zfs.spa_load_verify_metadata sysctl tunable to 0. Running zpool import poolname crashed as usual (didn't catch the traceback); running that same command but with a bunch of other flags like I have before (targeting a particular txg id, readonly mode) is taking a long time to execute, just like before.
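
For the record, setting it looked roughly like this (FreeBSD sysctl names as on TrueNAS CORE; there is a matching data-verification tunable I left alone):

Code:
sysctl vfs.zfs.spa_load_verify_metadata=0   # skip metadata verification during pool load
sysctl vfs.zfs.spa_load_verify_metadata     # confirm the new value
# vfs.zfs.spa_load_verify_data is the corresponding knob for data blocks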

However, importing the pool through the TrueNAS GUI yielded a different result than previously -- specifically, the following traceback:
Code:
Error: concurrent.futures.process._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/concurrent/futures/process.py", line 243, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/local/lib/python3.9/site-packages/middlewared/worker.py", line 111, in main_worker
    res = MIDDLEWARE._run(*call_args)
  File "/usr/local/lib/python3.9/site-packages/middlewared/worker.py", line 45, in _run
    return self._call(name, serviceobj, methodobj, args, job=job)
  File "/usr/local/lib/python3.9/site-packages/middlewared/worker.py", line 39, in _call
    return methodobj(*params)
  File "/usr/local/lib/python3.9/site-packages/middlewared/worker.py", line 39, in _call
    return methodobj(*params)
  File "/usr/local/lib/python3.9/site-packages/middlewared/schema.py", line 979, in nf
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/zfs.py", line 371, in import_pool
    self.logger.error(
  File "libzfs.pyx", line 391, in libzfs.ZFS.__exit__
  File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/zfs.py", line 365, in import_pool
    zfs.import_pool(found, new_name or found.name, options, any_host=any_host)
  File "libzfs.pyx", line 1095, in libzfs.ZFS.import_pool
  File "libzfs.pyx", line 1123, in libzfs.ZFS.__import_pool
libzfs.ZFSException: I/O error
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/middlewared/job.py", line 367, in run
    await self.future
  File "/usr/local/lib/python3.9/site-packages/middlewared/job.py", line 403, in __run_body
    rv = await self.method(*([self] + args))
  File "/usr/local/lib/python3.9/site-packages/middlewared/schema.py", line 975, in nf
    return await f(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/pool.py", line 1421, in import_pool
    await self.middleware.call('zfs.pool.import_pool', pool['guid'], {
  File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1256, in call
    return await self._call(
  File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1221, in _call
    return await self._call_worker(name, *prepared_call.args)
  File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1227, in _call_worker
    return await self.run_in_proc(main_worker, name, args, job)
  File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1154, in run_in_proc
    return await self.run_in_executor(self.__procpool, method, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1128, in run_in_executor
    return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
libzfs.ZFSException: ('I/O error',)

Is this useful?
 