Pool suddenly at 100%: sudden increase from 1.5TB to 14TB

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,737
That error will be fixable only by destroying that zVol and creating a new one.
What about extending the pool? Add two disks in a mirror vdev and the pool is no longer at 100%.
As a temporary measure, of course. Recover data if possible, then destroy and start fresh.
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
What about extending the pool? Add two disks in a mirror vdev and the pool is no longer at 100%.
As a temporary measure, of course. Recover data if possible, then destroy and start fresh.

I would not say that it is not worth a try; when facing such a data loss, it is always better to try more rather than less. But I doubt it will fix the problem.

The data must be recovered from the iSCSI client; as you mentioned yourself, TrueNAS itself cannot handle the content inside a zVol. As such, even if the pool space backing the zVol is increased, and even if the zVol is then enlarged, the "partition" created by Windows will remain the same size, will remain full, and Windows will keep it offline for that reason. As long as that partition is offline, its data cannot be recovered.

So yes, it is possible, and because that pool contains only the zVol and the system dataset, destroying it after the attempt will still be required and will not cause any additional damage. But I do not see a lot of hope down that path, and destroying / re-creating the pool afterwards will be all the more necessary...
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,737
the "partition" created by Windows will remain the same size, will remain full, and Windows will keep it offline for that reason.
That wasn't clear to me. I thought Windows offlined the target because the zpool was full.
 

exwhywhyzee

Dabbler
Joined
Sep 13, 2021
Messages
30
Heracles,

Thank you for your response. I'm trying to get my head 'round all the information you gave me; my knowledge of ZFS is quite limited.

Let me start with the last point and then move upwards:

With respect to the snapshot - I have a snapshot? Where? How do I delete it? I don't recall creating a snapshot.

Accessing the iSCSI volume - I can mount it in Windows, but it presents as read-only. When I use DISKPART to clear the read-only attribute, the iSCSI volume goes offline/dismounts. Bringing it back online just starts this cycle. At most, all I can do at the moment is browse the contents, which only take up, as the screenshot shows, 1.75TB. I tried to run Windows Server Backup to make a backup of the contents and then destroy the volume, but Windows Server Backup will not do anything with a read-only volume.

Potential suggested workaround - As Patrick suggested, what if I add a mirrored pair of drives (I have pairs of 500GB and 1TB drives), increasing the space, run shrink through DISKPART, and then remove the added pair of drives? And if that doesn't work, I am hoping that the added pair of drives will at least allow me to get past this read-only problem.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,737
With respect to the snapshot - I have a snapshot? Where? How do I delete it? I don't recall creating a snapshot.
zfs list -t all -r poolname/path/to/your/zvol
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
With respect to the snapshot - I have a snapshot? Where?

Here it is :

HGST-raid2z_16@manual-2021-09-02_20-56 128K - 176K -

How do I delete it?

In the WebUI: Storage > Snapshots. All your snapshots are listed there and that is the place to delete it. With the pool being full, the process may fail, but give it a try.
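
If the WebUI refuses because the pool is full, the CLI equivalent is short. This is only a rough sketch: confirm the exact snapshot name first, since the listing above may have truncated the full dataset path.

Code:
zfs list -t snapshot -r HGST-raid2z_16                # confirm the exact snapshot name
zfs destroy HGST-raid2z_16@manual-2021-09-02_20-56    # then remove it (adjust the path if it differs)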

I don't recall creating a snapshot.

Per its name, you took it manually at 20:56 on the 2nd of September 2021...

Accessing the iSCSI volume - I can mount it in Windows, but it presents as read-only

Good. That way, you can extract whatever valuable data you have in it. You may also search for whatever filled that partition, but recovering your data is by far the most important thing.

I tried to run Windows Server Backup to make a backup of the contents

You may have to do that backup manually then...

Potential suggested work around - As Patrick suggested, what about if I add a mirrored pair of drives (I have pairs of 500GB and 1TB drives), increasing the space, running shrink through DISKPART, and then removing the added pair of drives?

The last step is impossible. Once a vdev is added to a pool, the only way to take it out is to destroy the pool.

So yes, you can try it:
--Power off the Windows iSCSI client (to be sure it does not keep filling the zvol until a hard limit protects the pool; when dealing with a critical situation, better safe than sorry)
--Add the vdev to your pool to extend it (I hope ZFS will be able to do that despite the pool being at 100%, but it may very well work)
--Configure the zvol to cap its max size below the new maximum (cap it to 15T if your new max is 16T), to protect the pool against being loaded to 100% again
--Boot up the Windows iSCSI client
--Try to extend that partition, but remain below the new zvol's limit (extend it to 14.5T if the zvol is extended to 15T) (again, I hope you learned the lesson about pushing things to their limits...)
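
For reference, a rough command-line sketch of the ZFS side of those steps. The disk names da20/da21 and the 15T figure are placeholders; in practice you would add the vdev through the TrueNAS WebUI, which handles partitioning for you.

Code:
zpool add HGST-raid2z_16 mirror da20 da21             # temporary mirror vdev to get some free space back
zfs set volsize=15T HGST-raid2z_16/HGST-raid2z_16     # grow the zvol, but keep it below the new pool capacity
zpool list HGST-raid2z_16                             # verify the new capacity before booting the client
zfs get volsize,refreservation HGST-raid2z_16/HGST-raid2z_16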

If all of that works, recover your data and know that you MUST still destroy that pool and re-create it. In that new pool, create a zvol sized appropriately for your actual need (you said that you have about 1.5 TB, so 5 TB is surely more than enough).

After that, be careful never to fill a pool up to its limit again, and start designing your backup plan. It is for situations like this (and many others) that you need one, and it is important to keep it on separate storage.
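
One common safeguard, shown here only as a sketch with a hypothetical dataset name: keep a small reservation on an otherwise empty dataset of whatever pool you end up re-creating, so the pool can never be filled to the very last byte and you always have slack to release in an emergency.

Code:
zfs create -o refreservation=100G HGST-raid2z_16/headroom   # emergency slack; release it later with "zfs set refreservation=none"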

Good luck,
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,737
The last step is impossible. Once a vdev is added to a pool, the only way to take it out is to destroy the pool.
Well ... not quite. You can create a pool checkpoint before the extension takes place and roll back after having recovered the data. Then destroy the zvol and create a new one.
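
For the record, a rough sketch of that checkpoint approach with standard zpool commands and this thread's pool name. Whether a checkpoint can even be created on a completely full pool is another question, and anything written after the checkpoint is lost on rewind.

Code:
zpool checkpoint HGST-raid2z_16                       # take the checkpoint before adding the vdev
# ... add the temporary mirror vdev, recover the data ...
zpool export HGST-raid2z_16
zpool import --rewind-to-checkpoint HGST-raid2z_16    # roll back, discarding the added vdev
# or, if you decide to keep the new layout instead:
zpool checkpoint -d HGST-raid2z_16                    # discard the checkpoint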
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
I highly doubt that will work with a pool loaded to its max. By definition, such a checkpoint needs to save additional data in order to exist...

Again, give it a try, but because the zvol must be destroyed and it is the main thing in that pool, I do not see any reason not to destroy that pool. Moving the system dataset is a no-brainer...
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,600
Ignoring the snapshot for the moment, (which does not seem to be using much space), I still think part of the problem is using a zVol on RAID-Z2, without matching block sizes.

Your zVol block size is 512 bytes;
HGST-raid2z_16/HGST-raid2z_16 volblocksize 512
Yet your RAID-Z2 stripes across 4 data disks, (6 disks total, 2 are parity).

Some indication of that is this;
HGST-raid2z_16/HGST-raid2z_16 logicalused 980G -

If the ashift size uses 4096 byte sectors, that blows up space usage by a factor of 8. Add in RAID-Z striping across 4 data disks, that just makes it 4 times worse.

But again, I am not certain of the numbers, as I am not as familiar with zVols on RAID-Zx. And the effects of 4096 byte sectors, when using a 512 byte volblocksize, just complicate the matter.
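
For anyone wanting to confirm these numbers from the TrueNAS shell, the standard ZFS commands below should do it; on TrueNAS CORE the zdb call may need -U /data/zfs/zpool.cache to point it at the system's pool cache file.

Code:
zfs get volblocksize HGST-raid2z_16/HGST-raid2z_16        # block size of the zvol (512 here)
zfs get used,logicalused HGST-raid2z_16/HGST-raid2z_16    # physical vs. logical space consumed
zdb -C HGST-raid2z_16 | grep ashift                       # vdev ashift: 9 = 512-byte, 12 = 4096-byte sectors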
 

exwhywhyzee

Dabbler
Joined
Sep 13, 2021
Messages
30
Thank you all for your excellent advice and interest in my issue. I will begin working through some of the ideas presented here, and keep everyone posted on the progress.

In the meanwhile, while I doubt that this will help me solve the problem, for my edification and ultimately to learn something, please explain this to me one more time, because I don't quite get what's happening:

First of all, I have to say that I have pretty bad reading abilities. I thought that Windows was showing the volume as 92% full. It's showing it as 92% FREE, which explains the ~1.75T/14T.

I have two pools that I am using: the problematic 14T one (which itself confuses me, given that it should be around 12T free, because it's made of 6x4T drives set up as RAID-Z2) and a 1.5T one made of a mirrored pair of 1.5T drives.

TrueNAS shows the mirrored pool as 99% full, but I can still access it.

So why is it that, with both pools breaching the 80% capacity rule, one is accessible and the other is not?

Furthermore, why is it that things were working perfectly fine until the ROBOCOPY run (assuming this is what did indeed break it), even though the pool was at 100% capacity - why was it working then and suddenly not now?
 

exwhywhyzee

Dabbler
Joined
Sep 13, 2021
Messages
30
Ignoring the snapshot for the moment, (which does not seem to be using much space), I still think part of the problem is using a zVol on RAID-Z2, without matching block sizes.

Your zVol block size is 512 bytes;
HGST-raid2z_16/HGST-raid2z_16 volblocksize 512
Yet your RAID-Z2 stripes across 4 data disks, (6 disks total, 2 are parity).

Some indication of that is this;
HGST-raid2z_16/HGST-raid2z_16 logicalused 980G -

If the ashift size uses 4096 byte sectors, that blows up space usage by a factor of 8. Add in RAID-Z striping across 4 data disks, that just makes it 4 times worse.

But again, I am not certain of the numbers, as I am not as familiar with zVols on RAID-Zx. And the effects of 4096 byte sectors, when using a 512 byte volblocksize, just complicate the matter.

When I end up re-creating this volume, what should I do to correct for this problem and how?

My lack of knowledge seems to become amplified with each post.
 

exwhywhyzee

Dabbler
Joined
Sep 13, 2021
Messages
30
In the WebUI: Storage > Snapshots. All your snapshots are listed there and that is the place to delete it. With the pool being full, the process may fail, but give it a try.

I was able to use the WebGUI to delete the snapshot, so one small(ish) victory.

Good. That way, you can extract whatever valuable data you have in it. You may also search for whatever filled that partition, but recovering your data is by far the most important thing.

You may have to do that backup manually then...

This is a serious problem. I tried running ROBOCOPY again, but all it was able to do was re-create the directories on the backup storage. Content copying is not really happening.

I also tried a simple copy using FILE EXPLORER, but again, another no go.

So yes, you can try it:
--Power off the Windows iSCSI client (to be sure it does not keep filling the zvol until a hard limit protects the pool; when dealing with a critical situation, better safe than sorry)
--Add the vdev to your pool to extend it (I hope ZFS will be able to do that despite the pool being at 100%, but it may very well work)
--Configure the zvol to cap its max size below the new maximum (cap it to 15T if your new max is 16T), to protect the pool against being loaded to 100% again
--Boot up the Windows iSCSI client
--Try to extend that partition, but remain below the new zvol's limit (extend it to 14.5T if the zvol is extended to 15T) (again, I hope you learned the lesson about pushing things to their limits...)

I really hope that this works.

If all of that works, recover your data and know that you MUST still destroy that pool and re-create it. In that new pool, create a zvol sized appropriately for your actual need (you said that you have about 1.5 TB, so 5 TB is surely more than enough).

After that, be careful never to fill a pool up to its limit again, and start designing your backup plan. It is for situations like this (and many others) that you need one, and it is important to keep it on separate storage.

Absolutely, a lesson that I don't see how I'll be able to forget.

Here's another question that may or may not be related to this issue (a potential cause?)

I was tinkering with the write protection options for the problematic zpool, changing it to OFF (although zfs/zpool get readonly *pool* already shows OFF), but when I click Save, I get the following error (which I seem to get whenever I try to do anything).

Code:
Error: Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 138, in call_method
    result = await self.middleware._call(message['method'], serviceobj, methodobj, params, app=self,
  File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1213, in _call
    return await methodobj(*prepared_call.args)
  File "/usr/local/lib/python3.9/site-packages/middlewared/service.py", line 495, in update
    rv = await self.middleware._call(
  File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1213, in _call
    return await methodobj(*prepared_call.args)
  File "/usr/local/lib/python3.9/site-packages/middlewared/schema.py", line 974, in nf
    args, kwargs = clean_and_validate_args(args, kwargs)
  File "/usr/local/lib/python3.9/site-packages/middlewared/schema.py", line 932, in clean_and_validate_args
    value = attr.clean(args[args_index + i])
  File "/usr/local/lib/python3.9/site-packages/middlewared/schema.py", line 607, in clean
    data[key] = attr.clean(value)
  File "/usr/local/lib/python3.9/site-packages/middlewared/schema.py", line 167, in clean
    value = super(Str, self).clean(value)
  File "/usr/local/lib/python3.9/site-packages/middlewared/schema.py", line 53, in clean
    raise Error(self.name, f'Invalid choice: {value}')
middlewared.schema.Error: [compression] Invalid choice: ON


I also get this error while trying to check for updates:

Code:
[Errno 28] No space left on device: '/var/db/system/tmpf67thovp': Automatic update check failed. Please check system network settings.
 

exwhywhyzee

Dabbler
Joined
Sep 13, 2021
Messages
30
Okay, so get this: before I committed to beginning the process to add additional storage to try and resolve this issue, I decided to export/disconnect the pool and then reconnect it.

I did that, recreated the iSCSI instances/etc, and re-linked the target and initiator. Well guess what? The volume is no longer read-only.

I'm trying to run a backup with Acronis as I type this. I'll update on what happens when Acronis lets me know.
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
I was able to use the WebGUI to delete the snapshot, so one small(ish) victory.

Good! We will take them all :smile:

This is a serious problem. I tried running ROBOCOPY again, but all it was able to do was re-create the directories on the backup storage. Content copying is not really happening.

I also tried a simple copy using FILE EXPLORER, but again, another no go.

Good that you tried.

Absolutely, a lesson that I don't see how I'll be able to forget.

We always prefer to learn without consequences first, but rest assured that every senior here, like @Arwen, @Patrick M. Hausen or myself, has also learned things the hard way in the past, and we are not immune to it either.

Okay, so get this: before I committed to beginning the process to add additional storage to try and resolve this issue, I decided to export/disconnect the pool and then reconnect it.

I did that, recreated the iSCSI instances/etc, and re-linked the target and initiator. Well guess what? The volume is no longer read-only.

Great! Out of immediate trouble now.

I'm trying to run a backup with Acronis as I type this. I'll update on what happens when Acronis lets me know.

Good reflex: now that you can access your stuff, doing a full backup is the first thing to do.

So once it is done, you will be able to disconnect the volume from Windows, disconnect that iSCSI storage and destroy the oversized zvol.

Once that is done, I would move the system dataset to another pool and re-create that pool to address the block size mismatch @Arwen mentioned in a previous post. Once your pool reflects the recommended good practices and matching block sizes, you will be able to move the system dataset back to this pool and create a new, much smaller zvol that provides Windows with the required storage without the risk of filling the pool completely.
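
For the CLI-inclined, the destructive part looks roughly like this once the backup has been verified and the system dataset has been moved off this pool (the system dataset move itself is done in the WebUI, under System > System Dataset):

Code:
zfs destroy HGST-raid2z_16/HGST-raid2z_16   # remove the oversized zvol
zpool destroy HGST-raid2z_16                # then destroy the pool before re-creating it with the new layout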

Glad that your case is now going much better,
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,600
@exwhywhyzee - The way I would approach this, is this:
  • Find out if your 6 disks in the RAID-Z2 pool are natively 512 byte sectors or 4096 byte sectors.
    If one or more is 4096 byte sectors, then you should use 4096 for your ashift & volblocksize
  • Re-create the pool as 3 two-way disk mirrors, using what the first step found, 512 or 4096
  • In the iSCSI, if at all possible, export what the first step has, 512 or 4096
  • On the MS-Windows side, if at all possible, use what the first step has, 512 or 4096
I really don't know all the details, as I don't use zVols, iSCSI or MS-Windows over iSCSI. But, as I said before, there can be an issue with mismatched block sizes using excessive space.

In the case of disks with 4096 byte native sector sizes, if you send them writes smaller than 4096 bytes, like 512, the disk has to do a read-modify-write, a time-costly operation. However, sending a disk with 512 byte native sectors a stream of 4096 bytes, sector aligned, is nothing special to the disk. It simply writes to multiple 512 byte sectors in a row.

All that said, zVols on RAID-Zx should work (and did for you initially); it is just that the block size appears to be less than ideal.
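
For reference, a minimal sketch of that layout from the command line, with hypothetical device names (da0..da5) and hypothetical pool/zvol names; in practice you would create the pool and zvol through the TrueNAS WebUI, which handles partitioning and alignment for you.

Code:
sysctl vfs.zfs.min_auto_ashift=12                                # force 4096-byte alignment on TrueNAS CORE / FreeBSD
zpool create tank mirror da0 da1 mirror da2 da3 mirror da4 da5   # 3 x 2-way mirrors
zfs create -s -V 5T -o volblocksize=4K tank/windows-iscsi        # sparse zvol with a 4096-byte block size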
 

exwhywhyzee

Dabbler
Joined
Sep 13, 2021
Messages
30
@exwhywhyzee - The way I would approach this, is this:
  • Find out if your 6 disks in the RAID-Z2 pool are natively 512 byte sectors or 4096 byte sectors.
    If one or more is 4096 byte sectors, then you should use 4096 for your ashift & volblocksize
Is there a command that will allow me to determine this, so that I don't have to pull the drives from the array and look at the label?

My poor google skills have failed to find an answer to this.

EDIT

Found it:

Code:
egrep 'da[0-9]|cd[0-9]' /var/run/dmesg.boot

...
da4 at mps0 bus 0 scbus32 target 13 lun 0
da4: <ATA HGST HMS5C4040BL A5D0> Fixed Direct Access SPC-4 SCSI device
da4: Serial Number PL2331LAGVKZNJ
da4: 300.000MB/s transfers
da4: Command Queueing enabled
da4: 3815447MB (7814037168 512 byte sectors)
da10 at mps0 bus 0 scbus32 target 20 lun 0
da10: <ATA HGST HMS5C4040BL A5D0> Fixed Direct Access SPC-4 SCSI device
da10: Serial Number PL2331LAHAKUKJ
da10: 300.000MB/s transfers
da10: Command Queueing enabled
da10: 3815447MB (7814037168 512 byte sectors)
da2 at mps0 bus 0 scbus32 target 10 lun 0
da2: <ATA HGST HMS5C4040BL A5D0> Fixed Direct Access SPC-4 SCSI device
da2: Serial Number PL2331LAH7EU4J
da2: 300.000MB/s transfers
da2: Command Queueing enabled
da2: 3815447MB (7814037168 512 byte sectors)
...


So it looks like these drives are natively 512-byte?

There's also this:

Code:
root@truenas[~]# geom disk list da5
Geom name: da5
Providers:
1. Name: da5
   Mediasize: 4000787030016 (3.6T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e1
   descr: ATA HGST HMS5C4040BL
   lunid: 5000cca22ecc1553
   ident: PL2331LAGVL0WJ
   rotationrate: 5700
   fwsectors: 63
   fwheads: 255


So, what's the story with Sectorsize vs Stripesize? Is it native vs assigned, respectively?
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,108
More like "I'm 4k-native but I'll pretend to be 512 bytes for old OSes". If you run smartctl -a /dev/da5 you'll likely see:
Sector Size: 512 bytes logical, 4096 bytes physical
which amounts to being 4096 for the purposes of what @Arwen described.
 

exwhywhyzee

Dabbler
Joined
Sep 13, 2021
Messages
30
More like "I'm 4k-native but I'll pretend to be 512 bytes for old OSes". If you run smartctl -a /dev/da5 you'll likely see:
Sector Size: 512 bytes logical, 4096 bytes physical
which amounts to being 4096 for the purposes of what @Arwen described.
Great info, thank you. It's exactly what you said:

Code:
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p10 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     HGST MegaScale 4000
Device Model:     HGST HMS5C4040BLE640
Serial Number:    PL2331LAGVL0WJ
LU WWN Device Id: 5 000cca 22ecc1553
Firmware Version: MPAOA5D0
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5700 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sat Dec  4 23:51:08 2021 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
 