
SLOG benchmarking and finding the best SLOG

Stilez

Guru
Joined
Apr 8, 2016
Messages
529
Wanted to benchmark the disk, but for some reason I'm unable to remove it from dedup duties...

If dedup data is not volatile and is necessary, why did it assign the disks as a stripe and not a mirror by default?

SAS 3.74T 52.1T 56 112 268K 2.74M
  raidz2 3.73T 50.8T 40 51 204K 532K
    gptid/50cad0f1-d0d7-11ea-8717-2cf05d07d39b - - 4 5 20.3K 53.2K
    gptid/511b4e3f-d0d7-11ea-8717-2cf05d07d39b - - 4 5 20.4K 53.2K
    gptid/51294ecd-d0d7-11ea-8717-2cf05d07d39b - - 4 5 20.2K 53.2K
    gptid/517b5371-d0d7-11ea-8717-2cf05d07d39b - - 4 5 20.3K 53.2K
    gptid/51907b32-d0d7-11ea-8717-2cf05d07d39b - - 4 5 20.3K 53.2K
    gptid/51924d04-d0d7-11ea-8717-2cf05d07d39b - - 4 5 20.3K 53.3K
    gptid/51a8084b-d0d7-11ea-8717-2cf05d07d39b - - 4 5 20.4K 53.2K
    gptid/5189a1b0-d0d7-11ea-8717-2cf05d07d39b - - 4 5 20.4K 53.2K
    gptid/51df64e8-d0d7-11ea-8717-2cf05d07d39b - - 4 5 20.3K 53.2K
    gptid/51e871f0-d0d7-11ea-8717-2cf05d07d39b - - 4 5 20.5K 53.2K
dedup - - - - - -
  gptid/85c84ce9-e23b-11ea-ba69-2cf05d07d39b 8.37G 436G 11 10 44.4K 520K
  gptid/863904ee-e23b-11ea-ba69-2cf05d07d39b 3.18G 441G 4 3 17.7K 187K
special - - - - - -
  mirror 2.42G 442G 0 10 2.54K 254K
    gptid/ded6ac5d-e224-11ea-ba69-2cf05d07d39b - - 0 5 1.31K 127K
    gptid/deea7ff4-e224-11ea-ba69-2cf05d07d39b - - 0 5 1.23K 127K
logs - - - - - -
  mirror 123M 6.38G 0 39 13 1.35M
    gptid/d73be473-e939-11ea-bd27-2cf05d07d39b - - 0 19 6 693K
    gptid/d9886192-e939-11ea-bd27-2cf05d07d39b - - 0 19 6 693K
---------------------------------------------- ----- ----- ----- ----- ----- -----
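For reference, the table above looks like per-vdev zpool iostat output; a sketch of reproducing it (pool name taken from this thread, 5-second interval assumed):

Code:
# Per-vdev capacity, IOPS and bandwidth, refreshed every 5 seconds.
# Columns: capacity (alloc, free), operations (read, write), bandwidth (read, write).
zpool iostat -v SAS 5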
root@freenas[~]# gpart list da13 | egrep 'Mediasize|rawuuid'
Mediasize: 480113504256 (447G)
rawuuid: 863904ee-e23b-11ea-ba69-2cf05d07d39b
Mediasize: 480113590272 (447G)
root@freenas[~]#

I'm not a dev, but I know that first error, and it's not what it seems.

This is an error generated by the "zpool remove" command. It isn't mentioned in the man page, and while not strictly a "bug", it's certainly an undocumented and quite problematic behaviour of the underlying removal process.

The zpool remove man page says the command can be used to clear out and remove a top-level vdev. When you actually run it, it checks the sector sizes of the vdevs across the entire pool and requires them all to match, or it won't execute the removal. The top-level vdevs must also be mirrors, not raidz.

I don't know why it needs those restrictions to move the data, but they aren't in the man page or warned about anywhere. More worryingly, even when all top-level vdevs are mirrors with the same sector size, I've had it still complain the same way (on a pool made 100% of 4096-logical/512-physical mirrored disks).

As for "no valid replicas", I'm less sure about that. But this is the meaning of your "all top-level vdevs must have the same sector size" message.
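If you want to check this on your own pool before attempting a removal, the ashift (sector-size exponent) recorded for each top-level vdev can be read out of the pool config with zdb; a sketch, assuming the FreeNAS cache-file location and the pool name from this thread:

Code:
# Show the ashift recorded for each top-level vdev.
# (FreeNAS keeps its cache file at /data/zfs/zpool.cache rather than the default path.)
zdb -U /data/zfs/zpool.cache -C SAS | grep ashift

# ashift=9 means 512-byte sectors, ashift=12 means 4K sectors; zpool remove
# will only evacuate a vdev if these all match (and no top-level vdev is raidz).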


IMPROVEMENT/FIX REQUEST FOR THIS ISSUE: https://jira.ixsystems.com/browse/NAS-107397
(Feel free to comment as well)
 

CrimsonMars

Dabbler
Joined
Aug 7, 2020
Messages
24
I'm not a dev, but I know that first error, and it's not what it seems.

This is an error generated by the "zpool remove" command. It isn't mentioned in the man page, and while not strictly a "bug", it's certainly an undocumented and quite problematic behaviour of the underlying removal process.

The zpool remove man page says the command can be used to clear out and remove a top-level vdev. When you actually run it, it checks the sector sizes of the vdevs across the entire pool and requires them all to match, or it won't execute the removal. The top-level vdevs must also be mirrors, not raidz.

I don't know why it needs those restrictions to move the data, but they aren't in the man page or warned about anywhere. More worryingly, even when all top-level vdevs are mirrors with the same sector size, I've had it still complain the same way (on a pool made 100% of 4096-logical/512-physical mirrored disks).

As for "no valid replicas", I'm less sure about that. But this is the meaning of your "all top-level vdevs must have the same sector size" message.


IMPROVEMENT/FIX REQUEST FOR THIS ISSUE: https://jira.ixsystems.com/browse/NAS-107397
(Feel free to comment as well)


Well, that figures: 512-byte sectors for dedup and 4Kn on the data disks... it's impossible to remove them, I guess...
Will try to do it manually later to see the outcome.
 

CrimsonMars

Dabbler
Joined
Aug 7, 2020
Messages
24
Well, no luck... unable to offline it either.
From my knowledge, dedup tables are generally stored in RAM, and as such are volatile; in theory, if I remove a disk the pool should not fail...
Is my assumption correct, or is there some other type of data here as well which is non-volatile and necessary for pool consistency?

root@freenas[~]# zpool remove SAS gptid/863904ee-e23b-11ea-ba69-2cf05d07d39b
cannot remove gptid/863904ee-e23b-11ea-ba69-2cf05d07d39b: invalid config; all top-level vdevs must have the same sector size and not be raidz.
 

Stilez

Guru
Joined
Apr 8, 2016
Messages
529
Well, no luck... unable to offline it either.
From my knowledge, dedup tables are generally stored in RAM, and as such are volatile; in theory, if I remove a disk the pool should not fail...
Is my assumption correct, or is there some other type of data here as well which is non-volatile and necessary for pool consistency?

root@freenas[~]# zpool remove SAS gptid/863904ee-e23b-11ea-ba69-2cf05d07d39b
cannot remove gptid/863904ee-e23b-11ea-ba69-2cf05d07d39b: invalid config; all top-level vdevs must have the same sector size and not be raidz.

You often see dedup quoted in the context of RAM, and maybe that's what's misleading you. So there's a key ZFS knowledge update heading your way :)

When you dedup, what you're asking ZFS to do is run a check on every block it saves to disk. If an identical block already exists in the pool (which could be some identical metadata block, a chunk of a file or a whole file in the "live" pool or in some old snapshot, a block in a ZVOL, *anything* really in the pool...), then ZFS won't save that block to disk. Instead it saves a tiny entry that says "this block here? If you wanna read it, use that block there, it's got identical contents". So ZFS writes a small pointer that's a few hundred bytes in size, rather than, say, 4k to 128k of actual data. Which is how dedup works. Those small entries are the dedup table, or DDT.

Or, put a different way, when you dedup, you're telling the pool to store its data in deduped format with extra DDT pointers - saving considerable disk space at the cost of considerably higher CPU usage (hashing/lookups) and great demand for consistent and fast 4k IO speed/latency.

The DDT itself is treated exactly the same as any other pool metadata. It gets stored in RAM as it's used, pushed out to L2ARC if you have it, dropped from RAM at reboot or if it's not used for a while, and reloaded from disk if it's needed and not in RAM.

When you read or scrub that file, or resilver that disk - anything that needs that block's contents - in simple terms ZFS will look at the pool metadata and realise it needs to read that few-hundred-byte entry back, to learn where a block with the correct contents is. So to read a sizeable file, it has to track down and read the relevant DDT entries for all of its blocks to find them on disk. Bear in mind that ZFS needs to load not just the file but its metadata, and a ton else, when it accesses a file. So it does a lot of small DDT reads, which can quickly overpower HDD pools, since small reads aren't very efficient on HDDs.
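If you want to see how big that on-disk table actually is for a given pool, ZFS will report it directly; a quick sketch using the pool name from this thread:

Code:
# Summary of the dedup table: entry count, size per entry on disk and in core.
zpool status -D SAS

# More detail (DDT histogram by reference count); on FreeNAS you may need to
# point zdb at the cache file with -U /data/zfs/zpool.cache.
zdb -DD SAS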

With that background, I'll come back to your question/comment.

  • Yes, DDT entries are stored in RAM as they are loaded, but they are an integral, crucial, permanent part of the pool, not volatile data. (RAM holds the ARC, which is just a cache and therefore volatile; the actual dedup tables are integral to the pool and aren't.) If the pool didn't have its DDT entries it would simply be unreadable. If you lose the disk(s) holding the DDT, like any other metadata, and there are no other copies, your pool is lost. Period.
  • All metadata is crucial for pool integrity. ZFS can try to recover from some kinds of loss by rewinding the pool, or by being told to ignore some kinds of errors. But data is spread across all vdevs for speed (except metadata and small files, to the extent they are confined to special vdevs). If you lose a vdev, and that vdev held some important part of your metadata without copies elsewhere, your pool is dead. Period. Again.
  • In summary: no, ZFS does not have data that can trivially be lost while the pool is guaranteed to keep its integrity, beyond the redundancy you yourself design in when you create vdevs. It stores copies of some data across multiple devices by default, to make it harder to lose every copy of any given block. It can wind back, or be forced to try to ignore, some kinds of loss and corruption. It can fix the pool if even one good copy of the needed data exists anywhere. But assume as a starting point that all data is crucial, don't try to game it, and don't assume you'll be okay with a lost vdev. Especially metadata of *any* kind...
Lastly, the error you're commenting on is the same one I described. zpool remove simply won't remove a top-level vdev in some circumstances. It's an incomplete function and lacks the ability to remove all kinds of top-level vdevs, or even many of them. You're stuck with it until you destroy the pool or swap a mirrored disk for a larger one.
 

CrimsonMars

Dabbler
Joined
Aug 7, 2020
Messages
24
You often see dedup quoted in the context of RAM, and maybe that's what's misleading you. So there's a key ZFS knowledge update heading your way :)

When you dedup, what you're asking ZFS to do is run a check on every block it saves to disk. If an identical block already exists in the pool (which could be some identical metadata block, a chunk of a file or a whole file in the "live" pool or in some old snapshot, a block in a ZVOL, *anything* really in the pool...), then ZFS won't save that block to disk. Instead it saves a tiny entry that says "this block here? If you wanna read it, use that block there, it's got identical contents". So ZFS writes a small pointer that's a few hundred bytes in size, rather than, say, 4k to 128k of actual data. Which is how dedup works. Those small entries are the dedup table, or DDT.

Or, put a different way, when you dedup, you're telling the pool to store its data in deduped format with extra DDT pointers - saving considerable disk space at the cost of considerably higher CPU usage (hashing/lookups) and great demand for consistent and fast 4k IO speed/latency.

The DDT itself is treated exactly the same as any other pool metadata. It gets stored in RAM as it's used, pushed out to L2ARC if you have it, dropped from RAM at reboot or if it's not used for a while, and reloaded from disk if it's needed and not in RAM.

When you read or scrub that file, or resilver that disk - anything that needs that block's contents - in simple terms ZFS will look at the pool metadata and realise it needs to read that few-hundred-byte entry back, to learn where a block with the correct contents is. So to read a sizeable file, it has to track down and read the relevant DDT entries for all of its blocks to find them on disk. Bear in mind that ZFS needs to load not just the file but its metadata, and a ton else, when it accesses a file. So it does a lot of small DDT reads, which can quickly overpower HDD pools, since small reads aren't very efficient on HDDs.

With that background, I'll come back to your question/comment.

  • Yes, DDT entries are stored in RAM as they are loaded, but they are an integral, crucial, permanent part of the pool, not volatile data. (RAM holds the ARC, which is just a cache and therefore volatile; the actual dedup tables are integral to the pool and aren't.) If the pool didn't have its DDT entries it would simply be unreadable. If you lose the disk(s) holding the DDT, like any other metadata, and there are no other copies, your pool is lost. Period.
  • All metadata is crucial for pool integrity. ZFS can try to recover from some kinds of loss by rewinding the pool, or by being told to ignore some kinds of errors. But data is spread across all vdevs for speed (except metadata and small files, to the extent they are confined to special vdevs). If you lose a vdev, and that vdev held some important part of your metadata without copies elsewhere, your pool is dead. Period. Again.
  • In summary: no, ZFS does not have data that can trivially be lost while the pool is guaranteed to keep its integrity, beyond the redundancy you yourself design in when you create vdevs. It stores copies of some data across multiple devices by default, to make it harder to lose every copy of any given block. It can wind back, or be forced to try to ignore, some kinds of loss and corruption. It can fix the pool if even one good copy of the needed data exists anywhere. But assume as a starting point that all data is crucial, don't try to game it, and don't assume you'll be okay with a lost vdev. Especially metadata of *any* kind...
Lastly, the error you're commenting on is the same one I described. zpool remove simply won't remove a top-level vdev in some circumstances. It's an incomplete function and lacks the ability to remove all kinds of top-level vdevs, or even many of them. You're stuck with it until you destroy the pool or swap a mirrored disk for a larger one.

Thanks for your explanation, a little overkill for me, but I guess it's good for other people.
I only made the assumption that all the data is on disk and these disks are used for some sort of cache to speed things up.
The same way it happens with L2ARC.
I did not know they're used to store the actual de-duplicated blocks for speed... so actually the blocks are on these special disks and only pointers are on the data disks.

I made this assumption seeing that it recommends striping the disks instead of mirroring them, and it also describes the amount needed similarly to a RAM requirement, so I thought the data stays consistent on disk and only gets cached on the actual SSDs, so no harm would come from removal.

I think this should be corrected in the release and documented, as this could cause a lot of headaches (a warning while creating, and mirroring by default, would go a long way) for a lot of people in the near future.


I think under normal circumstances you can flush the data to disk and remove these special disks, but in my case the spinning disks have 4Kn blocks and the flash is 512e, so it cannot flush due to the different sector sizes on disk.
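A quick way to check what each device reports before mixing them is FreeBSD's diskinfo; a small sketch, device names assumed:

Code:
# "sectorsize" is the logical sector size; "stripesize" usually reflects the
# physical sector size (0 means nothing larger than the logical size is reported).
# 512 logical / 4096 physical = 512e; 4096 / 4096 = 4Kn.
diskinfo -v da0  | egrep 'sectorsize|stripesize'
diskinfo -v da13 | egrep 'sectorsize|stripesize'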
 

Stilez

Guru
Joined
Apr 8, 2016
Messages
529
Thanks for your explanation, a little overkill for me, but I guess it's good for other people.
I only made the assumption that all the data is on disk and these disks are used for some sort of cache to speed things up.
The same way it happens with L2ARC.
I did not know they're used to store the actual de-duplicated blocks for speed... so actually the blocks are on these special disks and only pointers are on the data disks.

I made this assumption seeing that it recommends striping the disks instead of mirroring them, and it also describes the amount needed similarly to a RAM requirement, so I thought the data stays consistent on disk and only gets cached on the actual SSDs, so no harm would come from removal.

I think this should be corrected in the release and documented, as this could cause a lot of headaches (a warning while creating, and mirroring by default, would go a long way) for a lot of people in the near future.


I think under normal circumstances you can flush the data to disk and remove these special disks, but in my case the spinning disks have 4Kn blocks and the flash is 512e, so it cannot flush due to the different sector sizes on disk.
Yeah, I didn't know exactly what level to describe it at, but better to explain it more fully rather than less.

You're slightly mistaken on one thing. It's the small DDT pointers that are kept on the special vdevs, not the full-size data blocks they point to - the other way round. That makes sense, because they are small and can't be retrieved fast (other than from SSD) due to the inefficiency of their tiny size. So, for example, my deduped pool has 13.5 TB of actual file data blocks (on HDD) and about 500 GB of metadata (DDT, spacemaps and the rest) on SSDs.

tl;dr - when a pool has dedup enabled, the DDT isn't a cache; it's part of the fundamental structure of how data is held in the pool, so like any pool metadata and pointers, if you lose it your pool is immediately reduced to randomised garbage.
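If you want to see that split on a live pool, zpool can show per-vdev allocation; a quick sketch with the pool name from this thread:

Code:
# Per-vdev capacity and allocation: the HDD raidz2 carries the file data,
# while the special and dedup vdevs carry the metadata and DDT.
zpool list -v SAS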
 

CrimsonMars

Dabbler
Joined
Aug 7, 2020
Messages
24
So far, doing a scrub has a negative impact on CPU usage (the CPU is actually the elusive 1600AF, same as a Ryzen 2600). Below are some screenshots taken during a manual scrub; my guess is that it's from dedup:

[CPU usage screenshots attached]


Just updated to the latest build, destroyed the pool and double-checked...
The default dedup vdev layout is actually Stripe, and you cannot change it without adding more disks, which seems like an omission or a bug in the GUI (for the metadata vdev it clearly says it should match the pool, but there is nothing like that at the dedup level):

[GUI screenshots attached]



You can manually select the layout after adding disks, but there is no warning at all. I think a warning, and a limitation to avoid stupidity, would be welcome, as it does not complain in any way and by default it adds the disks as a stripe, which can be catastrophic:
[GUI screenshot attached]

Actually, playing around, I just realized the SLOG also defaults to striping o_O - I'd never actually added disks from the GUI except data disks:
[GUI screenshot attached]


Another thing I have noticed is that overprovisioning does not actually work:

[Screenshots attached]

root@freenas[~]# gpart list da13
Geom name: da13
modified: false
state: OK
fwheads: 255
fwsectors: 63
last: 937721815
first: 40
entries: 128
scheme: GPT
Providers:
1. Name: da13p1
Mediasize: 480113504256 (447G)
Sectorsize: 512
Stripesize: 0
Stripeoffset: 65536
Mode: r1w1e2
efimedia: HD(1,GPT,687fcbeb-eb1d-11ea-a21f-2cf05d07d39b,0x80,0x37e47f58)
rawuuid: 687fcbeb-eb1d-11ea-a21f-2cf05d07d39b
rawtype: 516e7cba-6ecf-11d6-8ff8-00022d09712b
label: (null)
length: 480113504256
offset: 65536
type: freebsd-zfs
index: 1
end: 937721815
start: 128
Consumers:
1. Name: da13
Mediasize: 480113590272 (447G)
Sectorsize: 512
Mode: r1w1e3

Dedup was just for kicks, so I'm left with a couple of SSDs that I have no idea how to use; suggestions? (ARC has more than 90% hits, and reads always saturate the 21G total link throughput - 2x10Gb + 1x1Gb.)


Now this looks better:
[Screenshot attached]


Funny thing encountered... how is this even possible?:

root@freenas[~]# gpart destroy -F da13
da13 destroyed
root@freenas[~]# gpart destroy -F da14
da14 destroyed
root@freenas[~]# gpart create -s gpt da13
da13 created
root@freenas[~]# gpart create -s gpt da14
da14 created
root@freenas[~]# gpart add -a 4k -b 128 -t freebsd-zfs -s 400G da13
da13p1 added
root@freenas[~]# gpart add -a 4k -b 128 -t freebsd-zfs -s 400G da14
da14p1 added
root@freenas[~]#
root@freenas[~]#
root@freenas[~]#
root@freenas[~]# gpart list da13 | egrep 'Mediasize|rawuuid'
Mediasize: 429496729600 (400G)
rawuuid: 2db5dc97-eb23-11ea-a21f-2cf05d07d39b
Mediasize: 480113590272 (447G)
root@freenas[~]# gpart list da14 | egrep 'Mediasize|rawuuid'
Mediasize: 429496729600 (400G)
rawuuid: 2fcf43d9-eb23-11ea-a21f-2cf05d07d39b
Mediasize: 480103981056 (447G)
root@freenas[~]# zpool add -f SAS log mirror gptid/d73be473-e939-11ea-bd27-2cf05d07d39b gptid/d9886192-e939-11ea-bd27-2cf05d07d39b
root@freenas[~]# zpool add SAS special mirror gptid/2db5dc97-eb23-11ea-a21f-2cf05d07d39b gptid/2fcf43d9-eb23-11ea-a21f-2cf05d07d39b
invalid vdev specification
use '-f' to override the following errors:
/dev/gptid/2db5dc97-eb23-11ea-a21f-2cf05d07d39b is part of exported pool 'SAS'
/dev/gptid/2fcf43d9-eb23-11ea-a21f-2cf05d07d39b is part of exported pool 'SAS'

root@freenas[~]#
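That last error is most likely stale ZFS labels: the new partitions start at the same offset as the old ones, so the old pool labels are still readable through the new gptid devices. A sketch of clearing them before re-adding, using the gptids from the output above (zpool add -f, as the error suggests, would also work):

Code:
# Wipe the leftover ZFS labels on the freshly created partitions so that
# zpool add no longer thinks they belong to the exported pool 'SAS'.
zpool labelclear -f /dev/gptid/2db5dc97-eb23-11ea-a21f-2cf05d07d39b
zpool labelclear -f /dev/gptid/2fcf43d9-eb23-11ea-a21f-2cf05d07d39b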
 


Stilez

Guru
Joined
Apr 8, 2016
Messages
529
You've raised two (or more?) issues, so I'll comment on them one at a time.

So far, doing a scrub has a negative impact on CPU usage (the CPU is actually the elusive 1600AF, same as a Ryzen 2600). Below are some screenshots taken during a manual scrub; my guess is that it's from dedup:
Just updated to the latest build, destroyed the pool and double-checked...
This is very clearly bug/issue NAS-107364, where a scrub can drive a CPU into starvation. They've reproduced it, so it'll almost certainly be fixed by RC1, but you might want to add those screenshots and comments on the JIRA ticket too, to help? (You'll need to create an account to do so, but that's quick.)

What's happening is, there are plenty of throttles within ZFS, but they're historically oriented more towards not demanding too much RAM or disk IO, so ZFS doesn't block the system that way.

It doesn't seem to have been on anyone's horizon as a possible issue that a pool could be fast and responsive enough, and have enough RAM, that the mere act of keeping up with the hashing/checksumming needed to scrub a dedup pool could by itself overwhelm a CPU and make it the first resource to starve, even while disk and RAM usage remain at sub-throttle levels. In other words, there's no check on hash generation as a percentage of CPU usage for dedup (or other) *reads*. So if disk IO is running at full speed, unthrottled, there's no check on whether the CPU is keeping up, or is demanding all system resources to do so...
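If anyone wants to reproduce the observation, it's easy to watch while a scrub runs; a rough sketch (FreeBSD userland, pool name from this thread):

Code:
# Start a scrub and check its progress.
zpool scrub SAS
zpool status -v SAS

# Per-CPU, per-thread view: during the scrub the ZFS kernel threads doing the
# checksum/hash work dominate, even though disk and RAM stay under their throttles.
top -SHP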
 

Stilez

Guru
Joined
Apr 8, 2016
Messages
529
The default dedup vdev layout is actually Stripe, and you cannot change it without adding more disks, which seems like an omission or a bug in the GUI (for the metadata vdev it clearly says it should match the pool, but there is nothing like that at the dedup level):

You can manually select the layout after adding disks, but there is no warning at all. I think a warning, and a limitation to avoid stupidity, would be welcome, as it does not complain in any way and by default it adds the disks as a stripe, which can be catastrophic:

Actually, playing around, I just realized the SLOG also defaults to striping o_O - I'd never actually added disks from the GUI except data disks:

Another thing I have noticed is that overprovisioning does not actually work:
root@freenas[~]# gpart list da13
Geom name: da13
modified: false
state: OK
fwheads: 255
fwsectors: 63
last: 937721815
first: 40
entries: 128
scheme: GPT
Providers:
1. Name: da13p1
Mediasize: 480113504256 (447G)
Sectorsize: 512
Stripesize: 0
Stripeoffset: 65536
Mode: r1w1e2
efimedia: HD(1,GPT,687fcbeb-eb1d-11ea-a21f-2cf05d07d39b,0x80,0x37e47f58)
rawuuid: 687fcbeb-eb1d-11ea-a21f-2cf05d07d39b
rawtype: 516e7cba-6ecf-11d6-8ff8-00022d09712b
label: (null)
length: 480113504256
offset: 65536
type: freebsd-zfs
index: 1
end: 937721815
start: 128
Consumers:
1. Name: da13
Mediasize: 480113590272 (447G)
Sectorsize: 512
Mode: r1w1e3

Dedup was just for kicks, so I'm left with a couple of SSDs that I have no idea how to use; suggestions? (ARC has more than 90% hits, and reads always saturate the 21G total link throughput - 2x10Gb + 1x1Gb.)


Now this looks better:

Funny thing encountered... how is this even possible?:

root@freenas[~]# gpart destroy -F da13
da13 destroyed
root@freenas[~]# gpart destroy -F da14
da14 destroyed
root@freenas[~]# gpart create -s gpt da13
da13 created
root@freenas[~]# gpart create -s gpt da14
da14 created
root@freenas[~]# gpart add -a 4k -b 128 -t freebsd-zfs -s 400G da13
da13p1 added
root@freenas[~]# gpart add -a 4k -b 128 -t freebsd-zfs -s 400G da14
da14p1 added
root@freenas[~]#
root@freenas[~]#
root@freenas[~]#
root@freenas[~]# gpart list da13 | egrep 'Mediasize|rawuuid'
Mediasize: 429496729600 (400G)
rawuuid: 2db5dc97-eb23-11ea-a21f-2cf05d07d39b
Mediasize: 480113590272 (447G)
root@freenas[~]# gpart list da14 | egrep 'Mediasize|rawuuid'
Mediasize: 429496729600 (400G)
rawuuid: 2fcf43d9-eb23-11ea-a21f-2cf05d07d39b
Mediasize: 480103981056 (447G)
root@freenas[~]# zpool add -f SAS log mirror gptid/d73be473-e939-11ea-bd27-2cf05d07d39b gptid/d9886192-e939-11ea-bd27-2cf05d07d39b
root@freenas[~]# zpool add SAS special mirror gptid/2db5dc97-eb23-11ea-a21f-2cf05d07d39b gptid/2fcf43d9-eb23-11ea-a21f-2cf05d07d39b
invalid vdev specification
use '-f' to override the following errors:
/dev/gptid/2db5dc97-eb23-11ea-a21f-2cf05d07d39b is part of exported pool 'SAS'
/dev/gptid/2fcf43d9-eb23-11ea-a21f-2cf05d07d39b is part of exported pool 'SAS'

root@freenas[~]#

Your other comments remind me of 2 other issues, both currently open:
Do either of those help, or sound relevant? If you need to add comments, those could be relevant reports.
 

Asteroza

Dabbler
Joined
Feb 12, 2018
Messages
14
It doesn't seem to have been on anyone's horizon as a possible issue that a pool could be fast and responsive enough, and have enough RAM, that the mere act of keeping up with the hashing/checksumming needed to scrub a dedup pool could by itself overwhelm a CPU and make it the first resource to starve, even while disk and RAM usage remain at sub-throttle levels. In other words, there's no check on hash generation as a percentage of CPU usage for dedup (or other) *reads*. So if disk IO is running at full speed, unthrottled, there's no check on whether the CPU is keeping up, or is demanding all system resources to do so...

TL;DR: So his storage is too sexy for his CPU...
 


CrimsonMars

Dabbler
Joined
Aug 7, 2020
Messages
24
TL;DR: So his storage is too sexy for his CPU...
Well, actually it did not reach 100%, but it got close from time to time.
I guess if I put in a SAS12 card it would have. The SAS disks have onboard flash cache for reads and writes (they are SAS12 nearline); that, coupled with metadata and dedup metadata on flash, almost managed to overwhelm an entry-level CPU that would otherwise have been fine.


Guessing I'll have to throw in the Threadripper to tame it :D. If it manages to overload that, well, hell, I'm throwing in the towel... as the disks drop the mic...
 

Stilez

Guru
Joined
Apr 8, 2016
Messages
529
Well, actually it did not reach 100%, but it got close from time to time.
I guess if I put in a SAS12 card it would have. The SAS disks have onboard flash cache for reads and writes (they are SAS12 nearline); that, coupled with metadata and dedup metadata on flash, almost managed to overwhelm an entry-level CPU that would otherwise have been fine.


Guessing I'll have to throw in the Threadripper to tame it :D. If it manages to overload that, well, hell, I'm throwing in the towel... as the disks drop the mic...
Don't waste your time for a while. Developer Alexander Motin (@mav@), on a rather nice system, found it made a 40-core CPU "unhappy", and for me it massively overwhelms an 8-core Xeon E5 v4: something like 26 seconds from password + Enter to the logon "#".

(Technically it only looked like 98-100% usage at times, but it was "hard" usage and acted more like 100%. He said his test system was unhappy at around 90%. Anyway, it looks like a strict 100.000% isn't needed for this to happen? So your 96% fits right in.)

But as I said, he's working on it at the moment, and as it's an important issue, with luck it'll be fixed by RC1 or at worst by 12-RELEASE, at which point we can all go back to our 2-4 core CPUs for scrubbing dedup pools without issue :)

So save your money, and don't upgrade to TR taming until we see whether you'll really need that, once 12 is final.

Also, if dedup was really "just for kicks", consider just disabling it. It's a resource swamp: it demands good SSDs, heavy RAM and a decent CPU, all to save HDD space.
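For what it's worth, turning it off is just a dataset property, but it only affects newly written blocks; anything already deduped stays deduped (and keeps needing the DDT) until it's rewritten or the pool is rebuilt. A sketch, dataset name assumed:

Code:
# Stop deduplicating new writes on a dataset (existing DDT entries remain
# until the old blocks are rewritten or the pool is recreated).
zfs set dedup=off SAS/mydataset

# See what's still tracked in the dedup table afterwards:
zpool status -D SAS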
 

CrimsonMars

Dabbler
Joined
Aug 7, 2020
Messages
24
Don't waste your time for a while. Developer Alexander Motin (@mav@), on a rather nice system, found it made a 40-core CPU "unhappy", and for me it massively overwhelms an 8-core Xeon E5 v4: something like 26 seconds from password + Enter to the logon "#".

(Technically it only looked like 98-100% usage at times, but it was "hard" usage and acted more like 100%. He said his test system was unhappy at around 90%. Anyway, it looks like a strict 100.000% isn't needed for this to happen? So your 96% fits right in.)

But as I said, he's working on it at the moment, and as it's an important issue, with luck it'll be fixed by RC1 or at worst by 12-RELEASE, at which point we can all go back to our 2-4 core CPUs for scrubbing dedup pools without issue :)

So save your money, and don't upgrade to TR taming until we see whether you'll really need that, once 12 is final.

Also, if dedup was really "just for kicks", consider just disabling it. It's a resource swamp: it demands good SSDs, heavy RAM and a decent CPU, all to save HDD space.
Well, in my case it never reached that, and I never had lock-ups; it was constantly hammering the CPU, but all was fine, the system did not hang, and no other issues were noted other than a little heat. (I have a Threadripper cooler on it and the CPU was above 70°C; considering the box has ten 140 mm industrial fans plus some 80 mm ones, and was putting out more noise than my DL380p G8, I'd say it was pretty toasty.)

Guessing it's because 2nd-gen Ryzen is a lot beefier for the task. From my experience in virtualization scenarios, a 12-core Threadripper is roughly more powerful than 60 Sandy Bridge cores rated at 2.5-3 GHz; it has a higher all-core turbo and maintains it over time (due to the power and cooling budget). On the power side of things, let's just say the 1600AF is a lot more efficient, and I have no entire server as hungry as my Threadripper (200 W idle total system draw, and if I push it, it loads up my 1200 W PSU; the MB holds it at 3.9-4.5 GHz all-core by default).
 

kspare

Guru
Joined
Feb 19, 2015
Messages
507
RMS-200

smartctl 7.1 2019-12-30 r5022 [FreeBSD 12.1-STABLE amd64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number: RMS-200
Serial Number: 0059250
Firmware Version: 5fb7565c
PCI Vendor/Subsystem ID: 0x1cc7
IEEE OUI Identifier: 0x00e0cf
Controller ID: 0
Number of Namespaces: 1
Namespace 1 Size/Capacity: 8,581,545,984 [8.58 GB]
Namespace 1 Utilization: 0
Namespace 1 Formatted LBA Size: 512
Local Time is: Thu Sep 3 06:07:50 2020 CST
Firmware Updates (0x09): 4 Slots, Slot 1 R/O
Optional Admin Commands (0x0006): Format Frmw_DL
Optional NVM Commands (0x0006): Wr_Unc DS_Mngmt

Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 25.00W - - 0 0 0 0 100 100

Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 - 512 0 3
1 - 512 8 3
2 - 4096 0 0
3 - 4096 8 0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 50 Celsius
Available Spare: 0%
Available Spare Threshold: 0%
Percentage Used: 0%
Data Units Read: 96,174,803,373 [49.2 PB]
Data Units Written: 101,658,702,309 [52.0 PB]
Host Read Commands: 420,427,360
Host Write Commands: 509,837,972
Controller Busy Time: 3,156
Power Cycles: 22
Power On Hours: 5,448
Unsafe Shutdowns: 13
Media and Data Integrity Errors: 0
Error Information Log Entries: 0

Error Information (NVMe Log 0x01, max 63 entries)
No Errors Logged

512 # sectorsize
8581545984 # mediasize in bytes (8.0G)
16760832 # mediasize in sectors
0 # stripesize
0 # stripeoffset
RMS-200 # Disk descr.
0059250 # Disk ident.
Yes # TRIM/UNMAP support
0 # Rotation rate in RPM

Synchronous random writes:
0.5 kbytes: 26.6 usec/IO = 18.3 Mbytes/s
1 kbytes: 27.8 usec/IO = 35.1 Mbytes/s
2 kbytes: 28.4 usec/IO = 68.8 Mbytes/s
4 kbytes: 30.4 usec/IO = 128.6 Mbytes/s
8 kbytes: 33.3 usec/IO = 234.6 Mbytes/s
16 kbytes: 40.0 usec/IO = 390.9 Mbytes/s
32 kbytes: 45.5 usec/IO = 686.6 Mbytes/s
64 kbytes: 65.3 usec/IO = 956.9 Mbytes/s
128 kbytes: 145.8 usec/IO = 857.4 Mbytes/s
256 kbytes: 158.3 usec/IO = 1579.2 Mbytes/s
512 kbytes: 181.3 usec/IO = 2758.1 Mbytes/s
1024 kbytes: 315.4 usec/IO = 3170.1 Mbytes/s
2048 kbytes: 511.8 usec/IO = 3908.1 Mbytes/s
4096 kbytes: 950.9 usec/IO = 4206.5 Mbytes/s
8192 kbytes: 1752.2 usec/IO = 4565.7 Mbytes/s
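For anyone wanting to run the same test, the latency table above looks like the output of FreeBSD's diskinfo synchronous-write benchmark (the full command line appears in a later post); a sketch, assuming the card shows up as /dev/nvd0:

Code:
# Synchronous random-write latency sweep, as used for SLOG benchmarking.
# WARNING: this writes to the device; only run it on a disk with no data you need.
diskinfo -wS /dev/nvd0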
 

kspare

Guru
Joined
Feb 19, 2015
Messages
507
P3700, formatted with 4K sectors, overprovisioned from 2 TB down to 1.5 TB.

=== START OF INFORMATION SECTION ===
Model Number: INTEL SSDPEDMD020T4
Serial Number: CVFT629000DP2P0EGN
Firmware Version: 8DV101H0
PCI Vendor/Subsystem ID: 0x8086
IEEE OUI Identifier: 0x5cd2e4
Controller ID: 0
Number of Namespaces: 1
Namespace 1 Size/Capacity: 1,500,299,198,464 [1.50 TB]
Namespace 1 Formatted LBA Size: 4096
Local Time is: Thu Sep 3 06:11:11 2020 CST
Firmware Updates (0x02): 1 Slot
Optional Admin Commands (0x0006): Format Frmw_DL
Optional NVM Commands (0x0006): Wr_Unc DS_Mngmt
Maximum Data Transfer Size: 32 Pages

Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 25.00W - - 0 0 0 0 0 0

Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 - 512 0 2
1 - 512 8 2
2 - 512 16 2
3 + 4096 0 0
4 - 4096 8 0
5 - 4096 64 0
6 - 4096 128 0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 37 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 9%
Data Units Read: 32,172,282,332 [16.4 PB]
Data Units Written: 13,358,738,730 [6.83 PB]
Host Read Commands: 723,925,260,991
Host Write Commands: 68,796,034,438
Controller Busy Time: 0
Power Cycles: 33
Power On Hours: 19,530
Unsafe Shutdowns: 1
Media and Data Integrity Errors: 0
Error Information Log Entries: 0

Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged


4096 # sectorsize
1500299198464 # mediasize in bytes (1.4T)
366283984 # mediasize in sectors
131072 # stripesize
0 # stripeoffset
INTEL SSDPEDMD020T4 # Disk descr.
CVFT629000DP2P0EGN # Disk ident.
Yes # TRIM/UNMAP support
0 # Rotation rate in RPM

Synchronous random writes:
4 kbytes: 15.0 usec/IO = 260.1 Mbytes/s
8 kbytes: 16.7 usec/IO = 468.5 Mbytes/s
16 kbytes: 23.0 usec/IO = 680.5 Mbytes/s
32 kbytes: 29.6 usec/IO = 1055.4 Mbytes/s
64 kbytes: 53.1 usec/IO = 1176.9 Mbytes/s
128 kbytes: 118.9 usec/IO = 1051.2 Mbytes/s
256 kbytes: 158.7 usec/IO = 1575.0 Mbytes/s
512 kbytes: 266.7 usec/IO = 1874.4 Mbytes/s
1024 kbytes: 536.9 usec/IO = 1862.6 Mbytes/s
2048 kbytes: 1074.1 usec/IO = 1862.0 Mbytes/s
4096 kbytes: 2119.0 usec/IO = 1887.7 Mbytes/s
8192 kbytes: 4215.9 usec/IO = 1897.6 Mbytes/s
 

ehsab

Dabbler
Joined
Aug 2, 2020
Messages
45
One more fun toy... This is the RMS-200 from Radian Memory Systems.

Code:
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
Critical Warning:                   0x00
Temperature:                        46 Celsius
Available Spare:                    0%
Available Spare Threshold:          0%
Percentage Used:                    0%
Data Units Read:                    95,513,598,780 [48.9 PB]
Data Units Written:                 96,577,229,991 [49.4 PB]
Host Read Commands:                 1,148,532,179
Host Write Commands:                1,220,615,011
Controller Busy Time:               430
Power Cycles:                       19
Power On Hours:                     3,293
Unsafe Shutdowns:                   14
Media and Data Integrity Errors:    0
Error Information Log Entries:      1

Error Information (NVMe Log 0x01, max 63 entries)
No Errors Logged


[root@freenas ~]# diskinfo -wS /dev/nvd0
/dev/nvd0
        512             # sectorsize
        8581545984      # mediasize in bytes (8.0G)
        16760832        # mediasize in sectors
        0               # stripesize
        0               # stripeoffset
        RMS-200         # Disk descr.
        0085217         # Disk ident.
        Yes             # TRIM/UNMAP support
        0               # Rotation rate in RPM

Synchronous random writes:
         0.5 kbytes:     25.9 usec/IO =     18.9 Mbytes/s
           1 kbytes:     26.6 usec/IO =     36.7 Mbytes/s
           2 kbytes:     26.8 usec/IO =     72.9 Mbytes/s
           4 kbytes:     30.3 usec/IO =    129.0 Mbytes/s
           8 kbytes:     33.5 usec/IO =    233.4 Mbytes/s
          16 kbytes:     45.9 usec/IO =    340.4 Mbytes/s
          32 kbytes:     56.3 usec/IO =    555.2 Mbytes/s
          64 kbytes:     66.5 usec/IO =    940.3 Mbytes/s
         128 kbytes:    118.0 usec/IO =   1059.6 Mbytes/s
         256 kbytes:    137.3 usec/IO =   1821.4 Mbytes/s
         512 kbytes:    176.6 usec/IO =   2830.5 Mbytes/s
        1024 kbytes:    315.6 usec/IO =   3169.0 Mbytes/s
        2048 kbytes:    540.0 usec/IO =   3703.8 Mbytes/s
        4096 kbytes:   1020.7 usec/IO =   3918.9 Mbytes/s
        8192 kbytes:   1835.7 usec/IO =   4358.1 Mbytes/s
[root@freenas ~]#

Here is the RMS-300

Code:
root@truenas[~]# diskinfo -wS /dev/nvd0
/dev/nvd0
        512             # sectorsize
        8549695488      # mediasize in bytes (8.0G)
        16698624        # mediasize in sectors
        0               # stripesize
        0               # stripeoffset
        RMS-300         # Disk descr.
        1008088         # Disk ident.
        Yes             # TRIM/UNMAP support
        0               # Rotation rate in RPM

Synchronous random writes:
         0.5 kbytes:     19.4 usec/IO =     25.1 Mbytes/s
           1 kbytes:     20.1 usec/IO =     48.5 Mbytes/s
           2 kbytes:     21.3 usec/IO =     91.7 Mbytes/s
           4 kbytes:     23.9 usec/IO =    163.4 Mbytes/s
           8 kbytes:     24.6 usec/IO =    317.9 Mbytes/s
          16 kbytes:     29.7 usec/IO =    525.9 Mbytes/s
          32 kbytes:     34.9 usec/IO =    894.5 Mbytes/s
          64 kbytes:     45.8 usec/IO =   1363.3 Mbytes/s
         128 kbytes:     80.6 usec/IO =   1550.3 Mbytes/s
         256 kbytes:     99.0 usec/IO =   2525.0 Mbytes/s
         512 kbytes:    147.6 usec/IO =   3388.3 Mbytes/s
        1024 kbytes:    256.2 usec/IO =   3902.8 Mbytes/s
        2048 kbytes:    513.1 usec/IO =   3897.9 Mbytes/s
        4096 kbytes:   1019.4 usec/IO =   3924.0 Mbytes/s
        8192 kbytes:   1923.5 usec/IO =   4159.1 Mbytes/s
 

kspare

Guru
Joined
Feb 19, 2015
Messages
507
Kinda disappointing there isn’t much difference!
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,828
Hah. That RMS system wipes the floor with my p4801x!!!
 

CrimsonMars

Dabbler
Joined
Aug 7, 2020
Messages
24
Hah. That RMS system wipes the floor with my p4801x!!!
Yup, but that is to be expected; RAM wipes the floor with any disk tech. My only question is how you are using all that speed... you would need at least 100G links to take advantage of it...

I think that 100G is still prohibitive for home use... DC mainstream is barely 25G
 