Disk Replacement Issue

Wozza J
Dabbler
Joined: Aug 2, 2016
Messages: 20
Hi guys.

I'm having a storage issue following a Degraded State warning. I took the offending disk offline, replaced it, formatted it, put it online and let it resilver. I have done this three times now with different disks, and each time the resilvering finishes the status still shows Degraded, and the pool looks as per the attached images.

I also can't detach the disks as I get the error:

Code:
Error: concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/middlewared/plugins/zfs.py", line 247, in __zfs_vdev_operation
    op(target, *args)
  File "libzfs.pyx", line 369, in libzfs.ZFS.__exit__
  File "/usr/local/lib/python3.7/site-packages/middlewared/plugins/zfs.py", line 247, in __zfs_vdev_operation
    op(target, *args)
  File "/usr/local/lib/python3.7/site-packages/middlewared/plugins/zfs.py", line 256, in <lambda>
    self.__zfs_vdev_operation(name, label, lambda target: target.detach())
  File "libzfs.pyx", line 1764, in libzfs.ZFSVdev.detach
libzfs.ZFSException: Can detach disks from mirrors and spares only

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/concurrent/futures/process.py", line 239, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/local/lib/python3.7/site-packages/middlewared/worker.py", line 97, in main_worker
    res = loop.run_until_complete(coro)
  File "/usr/local/lib/python3.7/asyncio/base_events.py", line 579, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.7/site-packages/middlewared/worker.py", line 53, in _run
    return await self._call(name, serviceobj, methodobj, params=args, job=job)
  File "/usr/local/lib/python3.7/site-packages/middlewared/worker.py", line 45, in _call
    return methodobj(*params)
  File "/usr/local/lib/python3.7/site-packages/middlewared/worker.py", line 45, in _call
    return methodobj(*params)
  File "/usr/local/lib/python3.7/site-packages/middlewared/schema.py", line 965, in nf
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/middlewared/plugins/zfs.py", line 256, in detach
    self.__zfs_vdev_operation(name, label, lambda target: target.detach())
  File "/usr/local/lib/python3.7/site-packages/middlewared/plugins/zfs.py", line 249, in __zfs_vdev_operation
    raise CallError(str(e), e.code)
middlewared.service_exception.CallError: [EZFS_NOTSUP] Can detach disks from mirrors and spares only
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/middlewared/main.py", line 130, in call_method
    io_thread=False)
  File "/usr/local/lib/python3.7/site-packages/middlewared/main.py", line 1084, in _call
    return await methodobj(*args)
  File "/usr/local/lib/python3.7/site-packages/middlewared/schema.py", line 961, in nf
    return await f(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/middlewared/plugins/pool.py", line 1195, in detach
    await self.middleware.call('zfs.pool.detach', pool['name'], found[1]['guid'])
  File "/usr/local/lib/python3.7/site-packages/middlewared/main.py", line 1141, in call
    app=app, pipes=pipes, job_on_progress_cb=job_on_progress_cb, io_thread=True,
  File "/usr/local/lib/python3.7/site-packages/middlewared/main.py", line 1081, in _call
    return await self._call_worker(name, *args)
  File "/usr/local/lib/python3.7/site-packages/middlewared/main.py", line 1101, in _call_worker
    return await self.run_in_proc(main_worker, name, args, job)
  File "/usr/local/lib/python3.7/site-packages/middlewared/main.py", line 1036, in run_in_proc
    return await self.run_in_executor(self.__procpool, method, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/middlewared/main.py", line 1010, in run_in_executor
    return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
middlewared.service_exception.CallError: [EZFS_NOTSUP] Can detach disks from mirrors and spares only


I'm by no means an expert with FreeNAS, and I'm sure the disks I'm using as replacements are fine, but I can't see why I can't just replace this disk and add it back into the pool.

Appreciate any assistance.

Cheers,
Woz.
 

Attachments

  • Capture1.JPG (83.7 KB)

Wozza J
I guess this one must be a tough one then. Can anyone suggest the best way forward so I don't lose the data, using just the NAS I have? I just can't get rid of that REPLACING in the pool. I can't add another disk, as that gives another error:

Code:
Error: Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/middlewared/job.py", line 349, in run
    await self.future
  File "/usr/local/lib/python3.7/site-packages/middlewared/job.py", line 386, in __run_body
    rv = await self.method(*([self] + args))
  File "/usr/local/lib/python3.7/site-packages/middlewared/schema.py", line 961, in nf
    return await f(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/middlewared/plugins/pool.py", line 2021, in import_disk
    async with MountFsContextManager(self.middleware, device, src, fs_type, fs_options, ["ro"]):
  File "/usr/local/lib/python3.7/site-packages/middlewared/plugins/pool.py", line 283, in __aenter__
    await mount(self.device, self.path, *self.args, **self.kwargs)
  File "/usr/local/lib/python3.7/site-packages/middlewared/plugins/pool.py", line 142, in mount
    output[1].decode("utf-8"),
ValueError: Mount failed (exit code 1):
mount: /dev/ada5p2: Invalid argument


Feeling like I am stuck :/

Cheers.
Woz
 

Alecmascot
Guru
Joined: Mar 18, 2014
Messages: 1,177
By the terminology you're using, you have not followed the correct procedure.
Please supply details of your hardware and the output of
Code:
zpool status

in code tags.
 

Wozza J
I followed the guide and marked the disk as offline. I then shut the NAS down, replaced the drive, booted it up, selected the Replace option, selected the drive, etc., and it started resilvering. When I checked it the following day it had failed, so I tried the same procedure again with two different disks, and am now in this predicament.

The NAS is one I put together. It's an i5 Intel board with 16GB RAM and 6 x 2TB drives in a raidz1 configuration, and it has been running fine for years. I have attached the zpool status output as requested.

Thanks.
Woz.
 

Attachments

  • zpool status.txt (1.9 KB)

Alecmascot
Do
Code:
zpool status -v


to list the corrupted files.
Can you restore them?
 

Alecmascot
Is the pool accessible?
If so, how much of the data is actually irreplaceable?
Depending on the size of what you want to back up and the upload speed of your internet connection:

1. Buy/acquire a big enough hard drive, add it as a new pool, and replicate to it.
2. Do a Cloud Sync to Backblaze B2.

Once the data you really need is backed up, try a few things, like a reboot after pulling the 6th drive that is giving you grief, and/or exporting and re-importing the pool.
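If you go with option 1, a minimal sketch of the replication from the shell (assuming the source pool is data, as in your config, and the new single-disk pool has already been created as backup; the snapshot name is illustrative):

```shell
# One-off recursive snapshot of everything in the source pool
zfs snapshot -r data@migrate

# Send the whole pool into the new pool. -R replicates all child
# datasets and snapshots with their properties; -F on the receive
# side forces the target to roll back so the stream applies cleanly.
zfs send -R data@migrate | zfs recv -F backup/data

# Sanity-check that the datasets arrived
zfs list -r backup
```

Be aware that with metadata errors on the pool, the send may abort if it hits corrupt blocks; if so, fall back to copying the datasets file-by-file (e.g. with rsync) and accept the loss of the errored files.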
 

Wozza J
Hi Alecmascot.

There is a lot of data that is irreplaceable: many, many months of security camera images and videos, and you never know if any of it will be required (probably not). That is one dataset. The rest of the datasets are staff files etc.

I ran the 'zpool status -v' command as you suggested and the output is below. It does mention "Permanent errors have been detected in the following files", but then further down says "errors: No known data errors" so I am a bit confused :/

Code:
  pool: data
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 4K in 0 days 11:00:20 with 37 errors on Mon Aug 10 02:02:29 2020
config:

        NAME                                            STATE     READ WRITE CKSUM
        data                                            DEGRADED     0     0     0
          raidz1-0                                      DEGRADED     0     0     0
            gptid/591fb26f-b700-11e9-aa86-50e54945320d  ONLINE       0     0     0
            gptid/59e53d53-b700-11e9-aa86-50e54945320d  ONLINE       0     0     0
            gptid/5aeec8b3-b700-11e9-aa86-50e54945320d  ONLINE       0     0     0
            gptid/5bb9ed2d-b700-11e9-aa86-50e54945320d  ONLINE       0     0     0
            gptid/5c766878-b700-11e9-aa86-50e54945320d  ONLINE       0     0     0
            replacing-5                                 UNAVAIL      0     0     0
              18229271153412976963                      UNAVAIL      0     0     0  was /dev/gptid/5d30837d-b700-11e9-aa86-50e54945320d
              932613553316633864                        UNAVAIL      0     0     0  was /dev/gptid/ee2afc3b-d487-11ea-9908-50e54945320d
              784056624987433835                        UNAVAIL      0     0     0  was /dev/gptid/c058d053-d91d-11ea-9908-50e54945320d
              8761826467377111656                       OFFLINE      0     0     0  was /dev/gptid/85370ca7-d951-11ea-979c-50e54945320d

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x128>
        /mnt/data/Cameras/RearExit/20200408/record/A200408_095956_100010.264
        /mnt/data/Cameras/RearExit/20200802/record/A200802_154141_154155.264
        /mnt/data/Cameras/FrontEntrance/20200415/record/A200415_084227_084241.264
        /mnt/data/Cameras/CommonArea/20200411/record/A200411_203720_203735.264
        /mnt/data/Cameras/RearExit/20200427/record/A200427_115332_115346.264
        /mnt/data/Cameras/CommonArea/20200418/record/A200418_184913_184928.264
        /mnt/data/Cameras/CommonArea/20200420/record/A200420_231812_231826.264
        /mnt/data/Cameras/CommonArea/20200412/record/A200412_072411_072426.264
        /mnt/data/Cameras/CommonArea/20200804/record/A200804_183802_183816.264

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:04:50 with 0 errors on Mon Aug 10 03:49:50 2020
config:

        NAME          STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          da0p2       ONLINE       0     0     0

errors: No known data errors


I have thought about exporting the pool to another drive, but am now wondering if it is just the errors in those files causing the 'replacing-5' problem. If I export the pool containing those files, wipe the drives, and import the pool back in again, will it have the same problem?

Regards,
Woz.
 

Alecmascot
Can you back up the camera dataset and then delete it from the pool?
If not, try deleting the files in question and then try a resilver.

I have thought about exporting the pool to another drive, but am now wondering if it is just the errors in those files causing the 'replacing-5' problem and if I export the pool containing those files, when I wipe the drives and import the pool back in again, will it have the same problem?
You don't really mean "export", do you? Backup, perhaps?
If you export the pool and wipe the drives, then there is no pool to import: all data is lost and the pool must be re-created.
What version of FreeNAS is this?
 

Wozza J
It is version 11.3.
I thought to export the pool, delete the 'degraded' pool, create a new one and then import back into it?
 

Alecmascot
It is version 11.3.
I thought to export the pool, delete the 'degraded' pool, create a new one and then import back into it?
Are there two pools ????
Do you mean: back up the pool, destroy it, recreate it, and restore?
 

Yorick
Wizard
Joined: Nov 4, 2018
Messages: 1,912
<metadata>:<0x128>

As far as I know, you are cooked. The only way to come back from a pool with corrupt metadata is to back up everything you can, destroy the pool, recreate it, and copy files in from backup. As I understand it. I haven't had these issues myself and I recommend getting a more expert opinion to back mine up, or refute it.

Edit: You can try deleting the permanently errored files, run a scrub, and hope that the metadata corruption was in those files and is gone after the delete.
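Sketched as shell steps, that attempt might look like this (the pool name data and the file paths are taken from the zpool status -v output earlier in the thread; note the <metadata>:<0x128> entry is pool metadata, not a file, so it cannot be deleted this way):

```shell
# Delete each file listed under "Permanent errors" in `zpool status -v`
rm /mnt/data/Cameras/RearExit/20200408/record/A200408_095956_100010.264
rm /mnt/data/Cameras/RearExit/20200802/record/A200802_154141_154155.264
# ...and so on for the remaining listed files

# Re-scrub so ZFS revisits every block and re-evaluates the error list
zpool scrub data

# After the scrub completes, see whether the error list has cleared
zpool status -v data
```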

A good question is how your pool got into this state. You want to know that so you don't just run into the same problem again. You have an i5, which means no ECC, so RAM could be at fault. It could also be that more than one drive has issues; smartctl -x /dev/DRIVEDEVICE will tell you. It could be a faulty cable on one drive, or a faulty port on the HBA, whether on-board or add-in.
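A quick way to run that check across all six data drives (ada0 through ada5 are assumed device names; list yours first with camcontrol devlist on FreeBSD/FreeNAS):

```shell
# Extended SMART report for each drive; the grep keeps the attributes
# that most often reveal a failing disk or a bad cable
for d in ada0 ada1 ada2 ada3 ada4 ada5; do
  echo "=== /dev/$d ==="
  smartctl -x "/dev/$d" | grep -Ei 'reallocat|pending|uncorrect|crc'
done
```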

Do use CODE tags, please, for CLI output; it makes everything so much more readable. To the right of the smiley: the down arrow, then Code.
 

Yorick
further down says "errors: No known data errors" so I am a bit confused

That's your boot pool. Which has no errors.

Reading through the thread, I see you trying a bunch of things, such as "detaching" a disk, which happily ZFS doesn't let you do. Now that you are in error state, you'll be learning a lot about what these ZFS terms such as "detach" and "import/export" actually mean. And you'll want to look into backup options for critical files, both for this recovery and on-going. In system administration terms, building a 6-wide raidz1 on non-ECC memory says "I have a backup and I'm not afraid to use it, so what if this pool dies". Which is a perfectly reasonable risk stance to take. I advocate for taking it very deliberately, that is all.
 