Gut feeling something is wrong

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
in my case truenas.hdd is the zvol itself

It's PoolName/Dataset/Zvolname - you will have to use "Colit Disk Pool/dataset if there is one/zvol"
Incidentally - in my opinion, having spaces in Pool Names, Dataset Names and ZVol names is a mistake. It causes complications.
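For example (the dataset and zvol names below are placeholders for whatever the real layout is), something like this should list the zvols under the pool despite the spaces, as long as the name is quoted:
Code:
# list zvols under the pool (name quoted because of the spaces)
zfs list -r -t volume -o name,volsize,refreservation "Colit Disk Pool"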
 
Last edited:

Colit

Dabbler
Joined
Oct 2, 2023
Messages
37
Run zfs get refreservation - if the value is none then it is likely a sparse zvol.



You can just run zfs get volblocksize and it will dump a table of the results, but I would assume that they will mostly be 16K.
Yep, it's "none" (sparse).

Block size is 32K.
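For anyone following along, both properties can be checked in one go against the full zvol path - the path here is a placeholder for the actual pool/dataset/zvol:
Code:
# refreservation = none means sparse; volblocksize is fixed at creation time
zfs get refreservation,volblocksize "Colit Disk Pool/dataset/zvol"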
 

Colit

Dabbler
Joined
Oct 2, 2023
Messages
37
in my case truenas.hdd is the zvol itself

It's PoolName/Dataset/Zvolname - you will have to use "Colit Disk Pool/dataset if there is one/zvol"
Incidentally - in my opinion, having spaces in Pool Names, Dataset Names and ZVol names is a mistake. It causes complications.
Yeah, I agree about the spaces - unfortunately the GUI should really deny that, as I wasn't sure whether it was just a label or something important when setting it up initially.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Hmmm - I think a sparse zvol is also a mistake, especially given that this box exists to serve iSCSI only.

What sort of guests are in your pool? What are they doing?
databases, file servers, mail servers, virtual desktop, etc

It may serve you to have more than one zpool, with different block sizes. OR potentially use both storage servers, splitting the load across both so that if one fails, you only have to recover 50% of the guests and can maintain some services. It also doubles the available disk bandwidth.

How is the zpool on the alternative server set up? Sparse? Block size, etc.?
 

Colit

Dabbler
Joined
Oct 2, 2023
Messages
37
Hmmm - I think a sparse zvol is also a mistake, especially given that this box exists to serve iSCSI only.

What sort of guests are in your pool? What are they doing?
databases, file servers, mail servers, virtual desktop, etc

It may serve you to have more than one zpool, with different block sizes. OR potentially use both storage servers, splitting the load across both so that if one fails, you only have to recover 50% of the guests and can maintain some services. It also doubles the available disk bandwidth.

How is the zpool on the alternative server set up? Sparse? Block size, etc.?

I only recently inherited this box from a customer we'd specced it for who wanted to move to the cloud. Hence it's about 9 months newer but almost identical, except with 12TB disks. It has about 23% fragmentation currently.

My ultimate plan was to spread the load, as I have Synologies both onsite and offsite to take the block-based backups anyway, so I do have some sort of disaster recovery.

I could move all the VMs to the larger server in the short term, rebuild the zvol(s) on this box and move it all back, but that'd take a while, and if I need to buy any hardware for this box (and ultimately the other box too) I'd like to do it before I start that exercise in order to limit the downtime. This is why I was postulating as to whether I was one of the few who would benefit from an L2ARC... but given the cost of Optane drives of the size that would be useful, it might be more cost-effective to have a couple of non-Optane NVMe M.2 SSDs on an add-in PCIe card and run those in a mirror for an L2ARC? (Or is the mirror going to do me more harm than good?)
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
What's your cache hit ratio?
Code:
arc_summary | grep ratio
and post the result, please.

Hmmm, Reading - just down the road from me (ish) - the road being the M4.

Mirrors are unlikely to hurt (if you can use mirrors with L2ARC - I seem to recall a thread where L2ARC and mirrors couldn't be configured from the GUI, but I may be mistaken) - I think the choice of M.2 / U.2 is more important. Consumer drive specs are basically a pack of lies as far as I can tell and are best avoided. P4800X 375GB = £360.00 (new), so not so bad. However, given that L2ARC is not populated quickly, Optane may be overkill.
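As a rough sketch of the command-line route (the device path is a placeholder, and it's worth checking against your TrueNAS version first), adding a cache device generally looks something like:
Code:
# attach an L2ARC (cache) device to the pool; /dev/nvme0n1 is a placeholder device
zpool add "Colit Disk Pool" cache /dev/nvme0n1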
 

Colit

Dabbler
Joined
Oct 2, 2023
Messages
37
What's your cache hit ratio?
Code:
arc_summary | grep ratio
and post the result, please.

Hmmm, Reading - just down the road from me (ish) - the road being the M4.

Mirrors are unlikely to hurt (if you can use mirrors with L2ARC - I seem to recall a thread where L2ARC and mirrors couldn't be configured from the GUI, but I may be mistaken) - I think the choice of M.2 / U.2 is more important. Consumer drive specs are basically a pack of lies as far as I can tell and are best avoided. P4800X 375GB = £360.00 (new), so not so bad. However, given that L2ARC is not populated quickly, Optane may be overkill.

Yep, the boxes are in Maidenhead.

Code:
root@truenas[~]# arc_summary | grep ratio
        Cache hit ratio:                               94.6 %     163.7G
        Cache miss ratio:                               5.4 %       9.4G
        Actual hit ratio (MFU + MRU hits):             94.5 %     163.5G
        Hit ratio:                                      6.7 %       1.0G
        Miss ratio:                                    93.3 %      14.3G
 

Colit

Dabbler
Joined
Oct 2, 2023
Messages
37
What's your cache hit ratio?
Code:
arc_summary | grep ratio
and post the result, please.

Hmmm, Reading - just down the road from me (ish) - the road being the M4.

Mirrors are unlikely to hurt (if you can use mirrors with L2ARC - I seem to recall a thread where L2ARC and mirrors couldn't be configured from the GUI, but I may be mistaken) - I think the choice of M.2 / U.2 is more important. Consumer drive specs are basically a pack of lies as far as I can tell and are best avoided. P4800X 375GB = £360.00 (new), so not so bad. However, given that L2ARC is not populated quickly, Optane may be overkill.

Two more questions: 1) Can you run more than one separate L2ARC device, if not a mirror? I.e. so that you can treat the L2ARC as disposable and just keep a stock of (much cheaper) consumer M.2 SSDs, with maybe a bifurcated PCIe M.2 daughterboard, and swap them out when they die?
2) Where did you see a 375GB P4800X for £360 new?
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Well, with an ARC hit ratio of > 90%, I don't think L2ARC will be much help.

1. No idea - not tried it. I have an ARC hit ratio of about 98%, so I don't go down the L2ARC route. However, L2ARC is disposable in that it isn't pool critical. It's just a cache - if it fails, ZFS shrugs and goes back to just ARC. Don't think you can have 2 L2ARCs, but you can almost certainly configure an L2ARC mirror using the cmdline if it doesn't work in the GUI.

2. Where else: eBay. I use 2 900p U.2s on a card like U.2 Carrier, which seems to work well. Single point of failure though - a risk I have considered and accepted.
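A minimal sketch of pulling a cache device back out again (the device name is a placeholder) - because L2ARC is just a cache, removing it doesn't touch pool data:
Code:
# detach the L2ARC device from the pool; use the device name as shown in zpool status
zpool remove "Colit Disk Pool" /dev/nvme0n1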
 
Last edited:

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Based on what you have written, I think I know what the issue is likely to be (maybe). It's a combination of sparse zvol, spinning disks and thin provisioning causing the disks, over time, to be forever seeking. Given that the free space is badly fragmented (48%), it implies (to me) that the VHDs are likely badly fragmented on the zvol as well, and possibly internally too.

[Warning: This is an educated guess rather than a firm diagnosis]

My suggestion is to ensure that the 2nd server has a non-sparse zvol and then transfer the guests to that, one at a time, checking each one for internal fragmentation and fixing that as you go. Then rebuild the first server's zvol and transfer back, again one at a time. Those servers that get a lot of changes should possibly be converted to thick provisioning as well. This may play havoc with your available disk space. Remember that the zvol should be about 50% of the available pool space (which is painful).
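While shuffling guests around, it's worth keeping an eye on free-space fragmentation and how full the pool is getting; something like this (pool name quoted because of the spaces) shows both:
Code:
# FRAG here is free-space fragmentation, CAP is how full the pool is
zpool list -o name,size,allocated,free,fragmentation,capacity "Colit Disk Pool"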

Another possible solution, but not cheap, is to use SSDs instead. But that much space, in decent high-endurance enterprise SSDs, is gonna cost.

I don't get the feeling that anything is fundamentally wrong with the hardware (barring the SLOG size, which would affect writes, not reads, IF it's an actual issue).

Anyone else care to weigh in? It's a lot of work to go through based on my educated guess.
 

Colit

Dabbler
Joined
Oct 2, 2023
Messages
37
To be honest I think you're right; it certainly makes sense and would tie in with the 6ms busy times on reads from the disks. I'm going to have to take some time over this, but I want to get it right. As I see it, since I have considerably more free space on the other TrueNAS (32TB) and its storage is split into two zvols (both of which are sparse, as is the one on this box, as I have confirmed, and one of which is dedicated to Veeam backups for that customer), the process should be:

1) Migrate (one-by-one) the VMs to the other TrueNAS box's working dataset
2) Run a disk defrag inside the guest VM
3) Blow away the zvol on the original TrueNAS and recreate it as 'thick'
4) Migrate the VMs back to the original TrueNAS, being mindful of changing the disk layout to thick provisioned during that migration
5) Move the 3TB of working VMs on the other TrueNAS to other storage (local SATA or SSD arrays on the servers themselves where possible)
6) Blow away the working data zvol on that TrueNAS and recreate it as 'thick'
7) Move the working VMs back
8) Take a view on the second zvol on that second TrueNAS since it's only for Veeam backups and should be virtually unfragmented anyway

Does that sound right?
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
On the second TN server - I thought that was unused. But apparently you are using it for Veeam (aka backups), in which case it would be very bad practice to also use it for VM storage. If you lose that server, you lose servers AND backups - which is bad.

Honestly, I think I would like a 3-phase process.

1. Migrate guest to working storage (maybe local) - on its own, and preferably SSD for response time
Defrag the guest
2. Migrate to long-term storage - aka Swing Storage (from below)

Rinse and repeat till the current permanent zvol is empty

3. Recreate the existing zvol as thick (see sketch below)
Migrate guests to the permanent zvol, thick, one at a time - that way you can monitor space usage. *

* Warning: Transitioning guest storage from thin to thick may, depending on how generous the disk sizes defined in each guest have been, very rapidly use up available space - and may not be practical.
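For step 3, recreating the zvol as thick from the command line is essentially just zfs create without the sparse flag - a rough sketch only, with the size, block size and name as placeholders for whatever you settle on:
Code:
# no -s flag = not sparse, so the refreservation is set automatically (thick)
zfs create -V 20T -o volblocksize=32K "Colit Disk Pool/zvol"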

As a general concept you ought to have at least three sets of storage:
1. Main server storage
2. Swing storage, to be used when work is needed on the main server storage **
3. Backup storage - which obviously needs its own 3-2-1 strategy

** Main & swing storage can be both at the same time. It just means having two sets of main storage - workloads can be split between the two - but both must be big enough to contain their own loads AND all the other set's loads. This way maintenance can be done on a storage array without affecting service levels. The best way of doing this is with a storage cluster, which Core cannot do and (I believe) Scale WILL be able to do at some point in the future - but not yet. I think.

And I am adding, in this case, a 4th - a smallish SSD storage array used for defragging the guests, where the IO won't disturb other guests. Making it local to a VM host is perfect, as that way you don't clog up the storage network either and can, if necessary, dedicate that host to the single defragging guest (although DRS may well take care of that). Having HDDs as your storage devices means you don't have that much IO in the first place (you ARE using mirrors, which is good) and a thorough defrag is going to use a lot of IO. I cannot emphasise just how much having this defrag storage on SSD*** will make things better, faster and less intrusive.

***This will burn up a lot of SSD endurance and require a lot of reliable, consistent writes - so don't use cheap consumer drives.
 
Last edited:

Colit

Dabbler
Joined
Oct 2, 2023
Messages
37
On the second TN server - I thought that was unused. But apparently you are using it for Veeam (aka backups), in which case it would be very bad practice to also use it for VM storage. If you lose that server, you lose servers AND backups - which is bad.

Honestly, I think I would like a 3-phase process.

1. Migrate guest to working storage (maybe local) - on its own, and preferably SSD for response time
Defrag the guest
2. Migrate to long-term storage - aka Swing Storage (from below)

Rinse and repeat till the current permanent zvol is empty

3. Recreate the existing zvol as thick
Migrate guests to the permanent zvol, thick, one at a time - that way you can monitor space usage. *

* Warning: Transitioning guest storage from thin to thick may, depending on how generous the disk sizes defined in each guest have been, very rapidly use up available space - and may not be practical.

As a general concept you ought to have at least three sets of storage:
1. Main server storage
2. Swing storage, to be used when work is needed on the main server storage **
3. Backup storage - which obviously needs its own 3-2-1 strategy

** Main & swing storage can be both at the same time. It just means having two sets of main storage - workloads can be split between the two - but both must be big enough to contain their own loads AND all the other set's loads. This way maintenance can be done on a storage array without affecting service levels. The best way of doing this is with a storage cluster, which Core cannot do and (I believe) Scale WILL be able to do at some point in the future - but not yet. I think.

And I am adding, in this case, a 4th - a smallish SSD storage array used for defragging the guests, where the IO won't disturb other guests. Making it local to a VM host is perfect, as that way you don't clog up the storage network either and can, if necessary, dedicate that host to the single defragging guest (although DRS may well take care of that). Having HDDs as your storage devices means you don't have that much IO in the first place (you ARE using mirrors, which is good) and a thorough defrag is going to use a lot of IO. I cannot emphasise just how much having this defrag storage on SSD*** will make things better, faster and less intrusive.

***This will burn up a lot of SSD endurance and require a lot of reliable, consistent writes - so don't use cheap consumer drives.

Ah, a little bit of simplification - there are several Veeam instances running at my DC. They're not backing up onto the same box; it's just legacy backups for other devices sat there on that box.

I have swing storage, which is mainly local disk (all SSD), and the Synologies at a pinch (although I pity whoever's on the other end of their poor IOPS).
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
So:
Migrate to local storage (SSD), defrag internally, migrate to secondary server (which will defrag the VHD itself).
Rinse and repeat, one at a time.
Trash the zvol on the server.
Recreate the zvol on the server at 50% of available space.
Migrate back one at a time; for high-use / high-change servers, consider converting to thick provisioning at the same time. I'm not concerned about something like an AD server - it's the high-use, high-rate-of-change servers that matter. The ones that are probably horribly fragmented.
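To keep an eye on space while migrating back one at a time, something like this (pool name quoted because of the spaces) shows where the space is going, including the refreservation overhead of thick zvols:
Code:
# AVAIL/USED breakdown per dataset, including space held by refreservations
zfs list -r -o space "Colit Disk Pool"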
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222

Colit

Dabbler
Joined
Oct 2, 2023
Messages
37
Just had a quick win - found that one of the servers I decommissioned from the DC this week has 128GB of ECC RAM of the same type the TrueNAS motherboard supports... so you know where that's going :)
 

Colit

Dabbler
Joined
Oct 2, 2023
Messages
37
I don't know if it's applicable in this specific situation, but the following script allows for in-place rebalancing (which also defrags).
I would wholly go for that script IF I hadn't started moving VMs from TrueNAS this morning and IF the TrueNAS wasn't a production box, but thank you anyway!
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
That will rebalance and defrag files - not so sure about zvols, though.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
@Colit Please keep us up to date with progress - win on the ECC RAM - but please test it first [or am I simply paranoid?]
 

Colit

Dabbler
Joined
Oct 2, 2023
Messages
37
@Colit Please keep us up to date with progress - win on the ECC RAM - but please test it first [or am I simply paranoid?]
Will do.

Incidentally, this is why I thought there was something wrong - this is the same VM in question migrated to local storage:
 

Attachments

  • replicastats.jpg (7 KB)