Pool suddenly at 100%: used space jumped from 1.5TB to 14TB

exwhywhyzee

Dabbler
Joined
Sep 13, 2021
Messages
30
I honestly don't know what has happened. I rebooted my TrueNAS VM after migrating it from one datastore to another.

I noticed that my Win2k22 file server wasn't connecting to the iSCSI share, so when I checked to see what was going on, I found the console flooded with these messages:

[Attachment: 1636024728626.png]


I also noticed something really peculiar: the pool is showing 100% used space, which makes absolutely no sense to me. The pool is 14TB, and only about 1.5TB-2TB ought to be used. I have no idea where the rest of that 14TB came from.

I see two CRITICAL error messages:

CRITICAL
Failed to check for alert HasUpdate:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/alert.py", line 740, in __run_source
    alerts = (await alert_source.check()) or []
  File "/usr/local/lib/python3.9/site-packages/middlewared/alert/base.py", line 211, in check
    return await self.middleware.run_in_thread(self.check_sync)
  File "/usr/local/lib/python3.9/site-packages/middlewared/utils/run_in_thread.py", line 10, in run_in_thread
    return await self.loop.run_in_executor(self.run_in_thread_executor, functools.partial(method, *args, **kwargs))
  File "/usr/local/lib/python3.9/concurrent/futures/thread.py", line 52, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.9/site-packages/middlewared/alert/source/update.py", line 67, in check_sync
    path = self.middleware.call_sync("update.get_update_location")
  File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1272, in call_sync
    return self.run_coroutine(methodobj(*prepared_call.args))
  File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1312, in run_coroutine
    return fut.result()
  File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 438, in result
    return self.__get_result()
  File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 390, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/update.py", line 412, in get_update_location
    os.chmod(path, 0o755)
OSError: [Errno 28] No space left on device: '/var/db/system/update'
2021-11-04 03:35:02 (America/Los_Angeles)

CRITICAL
Space usage for pool "HGST-raid2z_16" is 96%. Optimal pool performance requires used space remain below 80%.
2021-11-02 00:45:57 (America/Los_Angeles)
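
(Side note: both alerts point at the same condition. The HasUpdate traceback bottoms out in ENOSPC on /var/db/system/update, which is where the system dataset of this same full pool is mounted. A hedged way to confirm which pool hosts it and how much it uses, using the pool name from the second alert:)

Code:
zfs list -o name,used,avail -r HGST-raid2z_16/.system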

[Attachment: 1636025377831.png]
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Your copy process may have "blown up" the size of the VM. If the VM was not copied with a sparse setting, it may fill every block it thinks is available. I can't walk you through it, but perhaps someone else can.

Next, you hint that a Win2k22 server is using iSCSI to TrueNAS. If it's on the same RAID-Z2 pool, then you should keep that pool below 80% used. The suggested level is 50% full.
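
A quick, hedged way to check both of those points from the TrueNAS shell (substitute your own pool and zvol names; these are standard OpenZFS properties):

Code:
zfs get volsize,refreservation,used,logicalused <pool>/<zvol>
# refreservation set (not "none")  -> the zvol is thick-provisioned
# used far larger than logicalused -> space is being consumed beyond what the client actually wrote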
 

exwhywhyzee

Dabbler
Joined
Sep 13, 2021
Messages
30
Your copy process may have "blown up" the size of the VM. If the VM was not copied with a sparse setting, it may fill every block it thinks is available. I can't walk you through it, but perhaps someone else can.

Next, you hint that a Win2k22 server is using iSCSI to TrueNAS. If it's on the same RAID-Z2 pool, then you should keep that pool below 80% used. The suggested level is 50% full.
Thank you for your response. Just a point of clarification: the TrueNAS VM is on an SSD separate from the ZFS pool, in a basic datastore. I migrated the TrueNAS VM from one SSD to another.

Yes, the Win2k22 server is the initiator connecting to the TrueNAS target. Of the 14TB, only about 1.5-2TB ought to be filled. I am not sure what started this process of "blowing up". I'm getting quite worried that I can't access that pool, and as a result, any data on it.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Let's start with zfs list -t all ...
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Please copy and paste the text after logging in to your system with SSH. Screenshots are useless.
 

exwhywhyzee

Dabbler
Joined
Sep 13, 2021
Messages
30
Please copy and paste the text after logging in to your system with SSH. Screenshots are useless.
Here you go:

Code:
NAME                                                                    USED  AVAIL   REFER  MOUNTPOINT
HGST-raid2z_16                                                         14.0T  1.10M    176K  /mnt/HGST-raid2z_16
HGST-raid2z_16@manual-2021-09-02_20-56                                  128K      -    176K  -
HGST-raid2z_16/.system                                                  205M  1.10M    256K  legacy
HGST-raid2z_16/.system/configs-17a88b624aa64e01b68939563b5fcb8c        51.2M  1.10M   51.2M  legacy
HGST-raid2z_16/.system/cores                                            192K  1.10M    192K  legacy
HGST-raid2z_16/.system/rrd-17a88b624aa64e01b68939563b5fcb8c             140M  1.10M    140M  legacy
HGST-raid2z_16/.system/samba4                                          8.78M  1.10M   6.13M  legacy
HGST-raid2z_16/.system/samba4@update--2021-09-16-08-56--12.0-RELEASE    416K      -   5.35M  -
HGST-raid2z_16/.system/samba4@update--2021-10-30-23-26--12.0-U5.1       320K      -   5.56M  -
HGST-raid2z_16/.system/samba4@wbc-1635636630                            464K      -   5.56M  -
HGST-raid2z_16/.system/samba4@wbc-1635731029                            496K      -   5.56M  -
HGST-raid2z_16/.system/samba4@wbc-1635770799                            512K      -   5.56M  -
HGST-raid2z_16/.system/services                                         192K  1.10M    192K  legacy
HGST-raid2z_16/.system/syslog-17a88b624aa64e01b68939563b5fcb8c         4.39M  1.10M   4.39M  legacy
HGST-raid2z_16/.system/webui                                            192K  1.10M    192K  legacy
HGST-raid2z_16/HGST-raid2z_16                                          14.0T  1.10M   14.0T  -
WDC-mirror_1.5                                                         1.31T  7.53G    128K  /mnt/WDC-mirror_1.5
WDC-mirror_1.5@WDC-mirror_1.5:20210213                                   88K      -    144K  -
WDC-mirror_1.5/WDC-1_5T                                                1.31T   604G    744G  -
boot-pool                                                              3.42G  20.3G     24K  none
boot-pool/ROOT                                                         3.42G  20.3G     24K  none
boot-pool/ROOT/12.0-U5.1                                                228K  20.3G   1.18G  /
boot-pool/ROOT/12.0-U6                                                 3.42G  20.3G   1.18G  /
boot-pool/ROOT/12.0-U6@2021-09-15-09:21:23                             1.69M      -   1.06G  -
boot-pool/ROOT/12.0-U6@2021-09-16-01:53:25                             1.76M      -   1.06G  -
boot-pool/ROOT/12.0-U6@2021-10-30-16:22:56                             1.18G      -   1.18G  -
boot-pool/ROOT/Initial-Install                                            1K  20.3G   1.06G  legacy
boot-pool/ROOT/default                                                  210K  20.3G   1.06G  legacy
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Thanks. Now zfs get all HGST-raid2z_16/HGST-raid2z_16 ...
 

exwhywhyzee

Dabbler
Joined
Sep 13, 2021
Messages
30
Thanks. Now zfs get all HGST-raid2z_16/HGST-raid2z_16 ...
Here you go:

Code:
NAME                           PROPERTY              VALUE                  SOURCE
HGST-raid2z_16/HGST-raid2z_16  type                  volume                 -
HGST-raid2z_16/HGST-raid2z_16  creation              Tue Jun  1  0:12 2021  -
HGST-raid2z_16/HGST-raid2z_16  used                  14.0T                  -
HGST-raid2z_16/HGST-raid2z_16  available             1007K                  -
HGST-raid2z_16/HGST-raid2z_16  referenced            14.0T                  -
HGST-raid2z_16/HGST-raid2z_16  compressratio         1.02x                  -
HGST-raid2z_16/HGST-raid2z_16  reservation           none                   default
HGST-raid2z_16/HGST-raid2z_16  volsize               14T                    local
HGST-raid2z_16/HGST-raid2z_16  volblocksize          512                    -
HGST-raid2z_16/HGST-raid2z_16  checksum              on                     local
HGST-raid2z_16/HGST-raid2z_16  compression           on                     local
HGST-raid2z_16/HGST-raid2z_16  readonly              off                    default
HGST-raid2z_16/HGST-raid2z_16  createtxg             72                     -
HGST-raid2z_16/HGST-raid2z_16  copies                1                      default
HGST-raid2z_16/HGST-raid2z_16  refreservation        none                   default
HGST-raid2z_16/HGST-raid2z_16  guid                  6299821513454525574    -
HGST-raid2z_16/HGST-raid2z_16  primarycache          all                    local
HGST-raid2z_16/HGST-raid2z_16  secondarycache        all                    local
HGST-raid2z_16/HGST-raid2z_16  usedbysnapshots       0B                     -
HGST-raid2z_16/HGST-raid2z_16  usedbydataset         14.0T                  -
HGST-raid2z_16/HGST-raid2z_16  usedbychildren        0B                     -
HGST-raid2z_16/HGST-raid2z_16  usedbyrefreservation  0B                     -
HGST-raid2z_16/HGST-raid2z_16  logbias               latency                local
HGST-raid2z_16/HGST-raid2z_16  objsetid              85                     -
HGST-raid2z_16/HGST-raid2z_16  dedup                 off                    local
HGST-raid2z_16/HGST-raid2z_16  mlslabel              none                   default
HGST-raid2z_16/HGST-raid2z_16  sync                  standard               local
HGST-raid2z_16/HGST-raid2z_16  refcompressratio      1.02x                  -
HGST-raid2z_16/HGST-raid2z_16  written               14.0T                  -
HGST-raid2z_16/HGST-raid2z_16  logicalused           980G                   -
HGST-raid2z_16/HGST-raid2z_16  logicalreferenced     980G                   -
HGST-raid2z_16/HGST-raid2z_16  volmode               default                local
HGST-raid2z_16/HGST-raid2z_16  snapshot_limit        none                   default
HGST-raid2z_16/HGST-raid2z_16  snapshot_count        none                   default
HGST-raid2z_16/HGST-raid2z_16  snapdev               hidden                 default
HGST-raid2z_16/HGST-raid2z_16  context               none                   default
HGST-raid2z_16/HGST-raid2z_16  fscontext             none                   default
HGST-raid2z_16/HGST-raid2z_16  defcontext            none                   default
HGST-raid2z_16/HGST-raid2z_16  rootcontext           none                   default
HGST-raid2z_16/HGST-raid2z_16  redundant_metadata    all                    default
HGST-raid2z_16/HGST-raid2z_16  encryption            off                    default
HGST-raid2z_16/HGST-raid2z_16  keylocation           none                   default
HGST-raid2z_16/HGST-raid2z_16  keyformat             none                   default
HGST-raid2z_16/HGST-raid2z_16  pbkdf2iters           0                      default
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
I suspected refreservation, but it looks like "something" actually filled that volume from within. Sorry, out of ideas.
 

exwhywhyzee

Dabbler
Joined
Sep 13, 2021
Messages
30
I suspected refreservation, but it looks like "something" actually filled that volume from within. Sorry, out of ideas.
Hmmm, this isn't good. Is there any way to check if there are duplicated files? I can't imagine that the pool would be filled with ~12T of new content.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Possibly, but you need to check from the Windows client. There is no way to look into the volume from TrueNAS.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
I lack experience running iSCSI in production. A road to recovery I can think of would be:
  • extend the pool, e.g. by adding two more disks in a mirror
  • extend the iSCSI target by some reasonable amount of space
  • mount in Windows and investigate
You can create a pool checkpoint before adding the two extra drives so that you can roll back to that point after you have recovered the data.

I'd wait for others with more iSCSI knowledge to confirm this is a feasible approach; a rough command sketch follows below.
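
A rough shell sketch of that sequence, assuming two spare disks named da8 and da9 (placeholder device names) and the pool/zvol names from this thread:

Code:
zpool checkpoint HGST-raid2z_16                      # safety point before touching the pool
zpool add HGST-raid2z_16 mirror da8 da9              # add the two spare disks as a mirror vdev
zfs set volsize=16T HGST-raid2z_16/HGST-raid2z_16    # grow the zvol so the iSCSI target has headroom
# ...then rescan/extend the disk on the Windows initiator and copy the data off.
# To later roll the pool back to the state before these changes:
#   zpool export HGST-raid2z_16
#   zpool import --rewind-to-checkpoint HGST-raid2z_16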
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
One thing that can screw up zVol space is write amplification due to a mismatched volblocksize and RAID-Zx stripe size. Meaning a 4-data-disk RAID-Z2 has to waste space to write a single 512-byte block. If the pool's ashift corresponds to 4 KiB sectors, a single 512-byte write will waste:

4 data disks @ 4 KiB = 16 KiB
16 KiB - 512-byte volblocksize = 15,872 bytes of waste, PER BLOCK IN THE VOLUME

This is one reason for using a 4K volblocksize on mirrored vdevs. You can get the pool's ashift value from zpool get all

Please don't quote me on the numbers, I am no expert in this aspect of ZFS.
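
The relevant values can be pulled directly; a hedged check using the pool and zvol names from this thread:

Code:
zpool get ashift HGST-raid2z_16                                      # 12 would mean 4 KiB sectors
zfs get volblocksize,used,logicalused HGST-raid2z_16/HGST-raid2z_16
# A 512-byte volblocksize on a 4 KiB-sector RAID-Z2 forces heavy padding and parity
# overhead per block, which would be consistent with used (14.0T) dwarfing logicalused (980G).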
 

exwhywhyzee

Dabbler
Joined
Sep 13, 2021
Messages
30
I lack experience running iSCSI in production. A road to recovery I can think of would be:
  • extend the pool, e.g. by adding two more disks in a mirror
  • extend the iSCSI target by some reasonable amount of space
  • mount in Windows and investigate
You can create a pool checkpoint before adding the two extra drives so that you can roll back to that point after you have recovered the data.

I'd wait for others with more iSCSI knowledge to confirm this is a feasible approach.

Adding a mirrored pair is a great solution if I can then remove them from the pool after the issue's been fixed. Is this a possibility, or will I not be able to remove the mirrored drives once they're added to the pool?

Windows shows some craziness. The volume is full, but the drive information shows only about 1.2TB used.

[Attachment: Screenshot 2021-11-11 005035.png]


[Attachment: Screenshot 2021-11-11 004907.png]


I tried to use DISKPART to solve the issue, because it shows the volume can be "shrunk" by about 12TB. The problem is that every time I remove the "read-only" attribute to run the shrink operation, the volume goes offline.

[Attachment: Screenshot 2021-11-11 005400.png]
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Adding a mirrored pair is a great solution if I can then remove them from the pool after the issue's been fixed. Is this a possibility, or will I not be able to remove the mirrored drives once they're added to the pool?
I am not quite sure - removal of single-disk or mirror vdevs is possible, but I don't know whether that works only when the entire pool contains mirrors and no RAIDZn vdevs. Anyway, I would not recommend it - my idea was more about being able to access the iSCSI target at all, recover the data, find out what went wrong, and then rebuild from scratch.
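
For reference, the OpenZFS zpool-remove documentation only permits removing a top-level data vdev when the pool contains no RAIDZ vdevs, so the added mirror would most likely be permanent here; the layout is easy to check first:

Code:
zpool status HGST-raid2z_16    # the raidz2 vdev shown here is what would block 'zpool remove'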
 

exwhywhyzee

Dabbler
Joined
Sep 13, 2021
Messages
30
A small bump to see if anyone has any more ideas. I tried to run a backup so that I could destroy and re-create the pool, but the read-only volume cannot be backed up for some reason.

I'm still stuck, and I could really use some help resolving this issue.
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
Hi,

Here, I do use iSCSI to present a datastore to my ESXi server.

What I can tell you is:
--You made an error when you created your zVol.
By allowing it to grow up to 100% of your storage, you allowed it to create this situation. I would recommend never allowing datasets / zVols to grow beyond a certain level, to ensure your pool will never reach 100%. Here, that could have meant limiting the zVol to, say, 12 TB. If you need the full 14TB, then you must use a pool bigger than that, 16 TB or more.
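
A hedged sketch of what that headroom could look like from the shell (the dataset and zvol names here are examples only, not from this thread; TrueNAS exposes the same settings in the GUI):

Code:
# Cap a dataset so it can never consume the whole pool:
zfs set quota=12T HGST-raid2z_16/somedataset
# Or create a thick-provisioned zvol that is deliberately smaller than the pool:
zfs create -V 12T -o volblocksize=16K HGST-raid2z_16/iscsi-zvol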

--That error will be fixable only by destroying that zVol and creating a new one.
That will destroy everything that it contains.

--Destroying the zVol itself may not even be possible, because of the copy-on-write approach ZFS uses.
ZFS needs free space for basically everything, including deleting stuff.

--As @Patrick M. Hausen said, that zVol was filled from its client.
Here, I reserved all of the space my zVol may use, so from the storage's point of view it is using 2.5 TB. When checking its details with zfs get all, I see that only 1.26T has actually been written into it. In your stats, the entire 14T is shown as written.

--Windows itself does not let you empty it, because it has taken the drive offline.
If you can force Windows to bring the drive online and delete stuff from it, you may have a chance to recover.

So you had an unstable config that just failed. It may be possible to recover from it, but that is not certain.
--Find a way to mount that iSCSI volume from a system that will actually mount it
--See what is inside and what took all that space
--Extract whatever valuable data you have and save it to other storage
--Try to delete stuff to give TrueNAS some breathing room
--Try to delete the zVol
--If you cannot, move your system dataset off that pool; you will then have to destroy the pool

Know that TrueNAS itself is unable to delete stuff from within the zVol, available space or not. It will either destroy it all or not touch it.
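
If it comes to that, a hedged sketch of the last-resort commands (move the system dataset off the pool first, normally via System > System Dataset in the GUI):

Code:
zfs destroy HGST-raid2z_16/HGST-raid2z_16    # destroys the zvol and everything the initiator stored on it
# Only if even that fails for lack of space, destroy the whole pool:
#   zpool destroy HGST-raid2z_16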

Good luck recovering from that bad config,
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
Oops, I just noticed that you have a snapshot of that... So deleting stuff will not help as long as that snapshot exists. You need to delete that snapshot before trying to regain any space. If the snapshot itself cannot be deleted, then there will be no way to recover any space.
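
The snapshot in question appears in the zfs list output earlier in the thread; a hedged sketch of confirming and removing it:

Code:
zfs list -t snapshot -r HGST-raid2z_16                 # confirm which snapshots still exist
zfs destroy HGST-raid2z_16@manual-2021-09-02_20-56     # the manual snapshot from that listing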
 