System crashes on pool import

Billabong

Dabbler
Joined
Feb 9, 2022
Messages
10
My TrueNAS system started crashing recently. During the boot process, when importing the single pool poolio that I have defined, the system panics, with error messages that look like this (plus a traceback):

Code:
panic: Solaris(panic): poolio: blkptr at 0xfffffe00a0f40040 has invalid CHECKSUM 72


I tried entering single-user mode and importing the pool in a few different ways: with readonly=on; with the -F flag to find a recent working transaction; and even specifying a particular transaction ID (the oldest one I could find through zdb -ul <vdev>). However, it panics every time, with an error message like:

Code:
panic: VERIFY3(size <= SPA_MAXBLOCKSIZE) failed (20855296 > 16777216)
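
For reference, the import attempts looked roughly like this (run from single-user mode; the disk path and txg value are placeholders, and the exact rewind flags, -T in particular, can vary between ZFS versions):

Code:
zpool import -o readonly=on -R /mnt poolio     # read-only import
zpool import -F -n poolio                      # dry-run rewind to a recent good txg
zpool import -F poolio                         # actual rewind attempt
zdb -ul /dev/ada1p2                            # list labels/uberblocks to find older txg IDs (device path illustrative)
zpool import -T <txg> poolio                   # target a specific txg from the zdb output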


I ran hard disk (SMART long test) and RAM (MemTest86) checks, which surfaced zero problems.
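(The disk checks were along these lines; the device name is just an example from my setup, and MemTest86 runs from its own boot USB:)

Code:
smartctl -t long /dev/ada1    # start a long self-test on each of the five data disks
smartctl -a /dev/ada1         # afterwards, read the self-test log and error counters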

I also tried a fresh reinstall of TrueNAS, after backing up freenas-v1.db, and then importing the pool through the GUI, but the same bootloop resulted.

My diagnosis based on this info is that the pool metadata is corrupted. Please let me know if you think it's something else!

Here are the potential next steps I see based on that diagnosis:
  • Shell out 400 USD for the Klennet ZFS Recovery tool, and also figure out how to temporarily store my ~30 TB of data while later rebuilding a pool
  • Replace my non-ECC RAM with ECC RAM
  • Try to recover the pool on Ubuntu or some non-TrueNAS system
Please share your thoughts on how likely these are to succeed! I'd also like to know how to avoid something like this happening in the future, since avoiding sudden data loss is one of the main reasons I'm using this system (along with data availability). But I didn't care quite enough about the data to figure out a backup solution for all 40 TB.

Finally, here are some system specs, in case they're relevant; it's a repurposed old PC, so I imagine it's not up-to-snuff with recommendations for a purpose-built NAS :)
  • MOBO: Gigabyte GA-H87N-WIFI
  • CPU: Intel Core i5-4670
  • RAM: Kingston HyperX 8 GB (2 x 4 GB) DDR3-1600 CL9 (non-ECC)
  • Boot disk: ADATA XPG SX900 256 GB 2.5" Solid State Drive
  • HDDs: Seagate Ironwolf Pro 10TB -- 5x in RAIDZ1
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
First, it should be harmless to use a non-TrueNAS OS to examine your pool. Just don't "upgrade" your pool features or make pool / dataset changes.
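For example, from an Ubuntu live USB, something along these lines keeps everything read-only (package name is Ubuntu's; pool name taken from your post):

Code:
sudo apt install zfsutils-linux                      # ZFS tools on Ubuntu
sudo zpool import                                    # scan for importable pools
sudo zpool import -o readonly=on -N -R /mnt poolio   # read-only, no mounts, alternate root
sudo zpool status -v poolio                          # after import, see what ZFS thinks of the pool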

Sometimes these problems are caused by power supply issues, so putting your 5 ZFS data disks in another computer may allow them to work properly.

Another possible cause is your boot media. Run a ZFS scrub against it. (You state you re-installed it, so it should be good; just run the scrub anyway.)
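From a shell that would be something like this (the boot pool is named boot-pool on recent versions, freenas-boot on older ones):

Code:
zpool scrub boot-pool       # or: zpool scrub freenas-boot
zpool status -v boot-pool   # check scrub progress and any errors it finds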

As for why it happened, there can be multiple reasons:
  • Non-ECC memory, as you point out. (Though this is extremely rare, ZFS is so heavily used that it is bound to cause the rare problem for someone out there.)
  • Not performing regular ZFS scrubs of your pool, so that you accumulated multiple disk issues without ZFS knowing about them, and ZFS could no longer fix simple block errors. (Linus Tech Tips had this one SPECIFIC cause take out a major video archive; see a recent YouTube video of theirs if interested.)
  • Using RAID-Z1 on larger disks (>1-2 TB) is considered less than ideal because of the statistical probability that you will run into an unrecoverable read error during a disk-failure rebuild.
  • You don't mention it, but there are ZFS options no one should be changing on a NAS. For example, if you set "redundant_metadata" to "most" instead of the default of "all", that would reduce your redundancy; see the quick check after this list.
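As a quick check on that last point, once you have an importable pool again, something like this shows whether any such properties were changed from their defaults:

Code:
zfs get -r redundant_metadata,sync,checksum poolio   # the SOURCE column shows "default" vs "local"
zfs set redundant_metadata=all poolio                # restore the default if it was changed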

On the subject of ECC versus non-ECC: in a true server, and TrueNAS is a SERVER, the machine may stay running for weeks if not months. During that time a bit in memory can flip, and without ECC you won't know it. Then, if that bit is used and written to disk, it could theoretically corrupt a pool. This is quite unlikely. To counter this issue, simply reboot often, like once or twice a month. That will either surface a bad pool sooner, or prevent it from happening at all.

Now, can I help you further?

Maybe, but with no certainty.

This is why people here in the TrueNAS forums are a bit conservative and back up their NAS(es). (Plus, they try to use server-grade hardware, even if it's used.)
 

Billabong

Dabbler
Joined
Feb 9, 2022
Messages
10
Thanks so much for the detailed answer! Sorry I wasn't able to get back to you right away.

I tried running Ubuntu from a USB drive and importing the pool from there, but all the zpool commands I executed just hung. Even just sudo zpool import hangs. Unsure if this indicates a problem with the pool itself, or just with the exact thing I'm doing.
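
Concretely, the kind of thing that hangs looks like this (I'm not certain I tried every variant, so take the exact flags as illustrative):

Code:
sudo zpool import                           # bare scan for importable pools
sudo zpool import -d /dev/disk/by-id        # scan an explicit device directory
sudo zpool import -o readonly=on -N poolio  # read-only, no-mount attempt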

I'll try ZFS scrubbing next, but it seems likely that that won't rectify the issue, so I'll probably end up just reformatting the drives and maybe upgrading my machine. Most of what I'm losing is media for which I've kept the "receipts", so it's just a matter of redownloading. And for that reason I'll probably stick with RAID-Z1, because I care more about having an extra 10TB of space than guarding against a disk read error; no idea how likely that is though.

I never knew about having to "[perform] regular ZFS scrubs". Should I try to set up a cron job to do this?
 

Billabong

Dabbler
Joined
Feb 9, 2022
Messages
10
Couldn't run the ZFS scrub on account of not being able to import the pool (unless I'm misunderstanding). I'll take the last-ditch action of replacing the PSU (it's 12 years old anyway, and non-modular, so bad airflow in the little ITX case), and if that doesn't solve the problem, I'll write off the data as lost, and just reformat and start from scratch.

Probably won't upgrade to ECC mobo/RAM until I want to expand my machine with more disks, which would be an opportune moment to upgrade everything.
 
Joined
Oct 22, 2019
Messages
3,641
@Arwen was referring to scrubbing your boot-pool, not the data pool in question.

System -> Boot -> Actions -> Scrub Boot Pool
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Billabong said:
Thanks so much for the detailed answer! Sorry I wasn't able to get back to you right away.
You're welcome.
...
Billabong said:
I never knew about having to "[perform] regular ZFS scrubs". Should I try to set up a cron job to do this?
Yes, I was referring to the boot pool in part of my answer.

However, you would still want regular data pool scrubs, which you can configure in the TrueNAS GUI. So, whether you get your existing pool back or recreate it, set up regular SMART short tests and data pool scrubs.
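
In the GUI those live under the Tasks section (Scrub Tasks and S.M.A.R.T. Tests). On a plain ZFS box without the middleware, the equivalent would just be cron entries, roughly like this (schedule and device names are only an example):

Code:
# illustrative /etc/crontab entries -- on TrueNAS itself, use the GUI tasks instead
0 3 1,15 * * root zpool scrub poolio             # scrub the data pool twice a month
0 2 * * 0    root smartctl -t short /dev/ada1    # weekly short SMART test per data disk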
 

Billabong

Dabbler
Joined
Feb 9, 2022
Messages
10
I tried setting the vfs.zfs.spa_load_verify_metadata sysctl tunable to 0. Running zpool import poolname crashed as usual (didn't catch the traceback); running that same command but with a bunch of other flags like I have before (targeting a particular txg id, readonly mode) is taking a long time to execute, just like before.
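
For the record, setting it looked roughly like this (FreeBSD sysctl names as on TrueNAS CORE; there is a matching data-verification tunable I left alone):

Code:
sysctl vfs.zfs.spa_load_verify_metadata=0   # skip metadata verification during pool load
sysctl vfs.zfs.spa_load_verify_metadata     # confirm the new value
# vfs.zfs.spa_load_verify_data is the corresponding knob for data blocks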

However, importing the pool through the TrueNAS GUI yielded a different result than previously -- specifically, the following traceback:
Code:
Error: concurrent.futures.process._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/concurrent/futures/process.py", line 243, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/local/lib/python3.9/site-packages/middlewared/worker.py", line 111, in main_worker
    res = MIDDLEWARE._run(*call_args)
  File "/usr/local/lib/python3.9/site-packages/middlewared/worker.py", line 45, in _run
    return self._call(name, serviceobj, methodobj, args, job=job)
  File "/usr/local/lib/python3.9/site-packages/middlewared/worker.py", line 39, in _call
    return methodobj(*params)
  File "/usr/local/lib/python3.9/site-packages/middlewared/worker.py", line 39, in _call
    return methodobj(*params)
  File "/usr/local/lib/python3.9/site-packages/middlewared/schema.py", line 979, in nf
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/zfs.py", line 371, in import_pool
    self.logger.error(
  File "libzfs.pyx", line 391, in libzfs.ZFS.__exit__
  File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/zfs.py", line 365, in import_pool
    zfs.import_pool(found, new_name or found.name, options, any_host=any_host)
  File "libzfs.pyx", line 1095, in libzfs.ZFS.import_pool
  File "libzfs.pyx", line 1123, in libzfs.ZFS.__import_pool
libzfs.ZFSException: I/O error
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/middlewared/job.py", line 367, in run
    await self.future
  File "/usr/local/lib/python3.9/site-packages/middlewared/job.py", line 403, in __run_body
    rv = await self.method(*([self] + args))
  File "/usr/local/lib/python3.9/site-packages/middlewared/schema.py", line 975, in nf
    return await f(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/pool.py", line 1421, in import_pool
    await self.middleware.call('zfs.pool.import_pool', pool['guid'], {
  File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1256, in call
    return await self._call(
  File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1221, in _call
    return await self._call_worker(name, *prepared_call.args)
  File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1227, in _call_worker
    return await self.run_in_proc(main_worker, name, args, job)
  File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1154, in run_in_proc
    return await self.run_in_executor(self.__procpool, method, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1128, in run_in_executor
    return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
libzfs.ZFSException: ('I/O error',)

Is this useful?
 