CTL_Datamove aborts under heavy read workloads

Status
Not open for further replies.

klingon55

Dabbler
Joined
Mar 2, 2017
Messages
12
Specs: TrueNAS Z-35
CPU: 2x Intel(R) Xeon(R) E5-2650 v2 @ 2.60GHz
RAM: 262073 MB
Network
GUI/Web/SSH: 1Gbit
iSCSI: 10Gbit -> Cisco 4500 fiber switch -> Cisco UCS

Pool
Encrypted
Default LZ4 compression
Deduplication disabled
42 x 2-drive vdevs of 4TB 7200 RPM drives
5 hot spares
3 x 8GB SSDs for ZIL, striped
3 x 400GB SSDs for L2ARC

Environment
10 VMware hosts and 85 Windows/Linux VMs attached via iSCSI datastores on the VMware hosts.
15.5TB used of 151TB

We are receiving a large number of data move aborts when doing a Storage vMotion between datastores, as we try to split our servers from one LUN per host to multiple LUNs to gain more headroom queue-depth-wise.
I ran ctladm dumpooa while this was happening and was greeted with quite a few BLOCKED: READ(16) entries with high latencies coming from the LUN we are migrating off of.

Disk usage has not gone over 40% and CPU does not exceed 50% during this time.
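For anyone who wants to capture the same snapshot, this is roughly what I ran while the vMotion was in flight (the gstat call is just an extra sanity check on the disks themselves and is not part of the dump below):

# dump CTL's Outstanding Operations Array to see queued/blocked iSCSI commands
ctladm dumpooa

# optionally, watch per-disk latency/busy% at the same time to rule out the spindles
gstat -p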

LUN 0 tag 0x135e87 BLOCKED: READ(16). CDB: (410 ms)
LUN 0 tag 0x135e88 BLOCKED: READ(16). CDB: (410 ms)
LUN 0 tag 0x135e89 BLOCKED: READ(16). CDB: (410 ms)
LUN 0 tag 0x135e8a BLOCKED: READ(16). CDB: (390 ms)
LUN 0 tag 0x135e8b BLOCKED: READ(16). CDB: (390 ms)
LUN 0 tag 0x135e8c BLOCKED: READ(16). CDB: (390 ms)
LUN 0 tag 0x135e8d BLOCKED: READ(16). CDB: (390 ms)
LUN 0 tag 0x135e8e BLOCKED: READ(16). CDB: (390 ms)
LUN 0 tag 0x135e8f BLOCKED: READ(16). CDB: (390 ms)
LUN 0 tag 0x135e90 BLOCKED: READ(16). CDB: (390 ms)
LUN 0 tag 0x135e91 BLOCKED: READ(16). CDB: (390 ms)
LUN 0 tag 0x135e93 BLOCKED: READ(16). CDB: (390 ms)
LUN 0 tag 0x135e94 BLOCKED: READ(16). CDB: (390 ms)
LUN 0 tag 0x135e95 BLOCKED: READ(16). CDB: (390 ms)
LUN 0 tag 0x135e96 BLOCKED: READ(16). CDB: (390 ms)
LUN 0 tag 0x135e97 BLOCKED: READ(16). CDB: (390 ms)
LUN 0 tag 0x135e98 BLOCKED: READ(16). CDB: (390 ms)
LUN 0 tag 0x135e99 BLOCKED: READ(16). CDB: (390 ms)
LUN 0 tag 0x135e9a BLOCKED: READ(16). CDB: (390 ms)
LUN 0 tag 0x135e9c BLOCKED: READ(16). CDB: (370 ms)
LUN 0 tag 0x135e9d BLOCKED: READ(16). CDB: (370 ms)
LUN 0 tag 0x135e9e BLOCKED: READ(16). CDB: (370 ms)
LUN 0 tag 0x135e9f BLOCKED: READ(16). CDB: (370 ms)
LUN 0 tag 0x135ea0 BLOCKED: READ(16). CDB: (370 ms)
LUN 0 tag 0x135ea1 BLOCKED: READ(16). CDB: (370 ms)
LUN 0 tag 0x135ea2 BLOCKED: READ(16). CDB: (370 ms)
LUN 0 tag 0x135ea3 BLOCKED: READ(16). CDB: (370 ms)
LUN 0 tag 0x135ea4 BLOCKED: READ(16). CDB: (360 ms)
LUN 0 tag 0x135ea5 BLOCKED: READ(16). CDB: (360 ms)
LUN 0 tag 0x135ea6 BLOCKED: READ(16). CDB: (360 ms)
LUN 0 tag 0x135ea7 BLOCKED: READ(16). CDB: (360 ms)
LUN 0 tag 0x135ea9 BLOCKED: READ(16). CDB: (360 ms)
LUN 0 tag 0x135eaa BLOCKED: READ(16). CDB: (360 ms)
LUN 0 tag 0x135eab BLOCKED: READ(16). CDB: (360 ms)
LUN 0 tag 0x135eac BLOCKED: READ(16). CDB: (360 ms)
LUN 0 tag 0x135ead BLOCKED: READ(16). CDB: (350 ms)
LUN 0 tag 0x135eae BLOCKED: READ(16). CDB: (350 ms)
LUN 0 tag 0x135eaf BLOCKED: READ(16). CDB: (350 ms)
LUN 0 tag 0x135eb0 BLOCKED: READ(16). CDB: (350 ms)
LUN 0 tag 0x135eb2 BLOCKED: READ(16). CDB: (350 ms)
LUN 0 tag 0x135eb3 BLOCKED: READ(16). CDB: (350 ms)
LUN 0 tag 0x135eb4 BLOCKED: READ(16). CDB: (350 ms)
LUN 0 tag 0x135eb5 BLOCKED: READ(16). CDB: (350 ms)
LUN 0 tag 0x135eb6 BLOCKED: READ(16). CDB: (340 ms)
LUN 0 tag 0x135eb7 BLOCKED: READ(16). CDB: (340 ms)
LUN 0 tag 0x135eb8 BLOCKED: READ(16). CDB: (340 ms)
LUN 0 tag 0x135eb9 BLOCKED: READ(16). CDB: (340 ms)


Any ideas would be greatly appreciated.
 

klingon55

Dabbler
Joined
Mar 2, 2017
Messages
12
Have you opened a support ticket with iX?
I have, and we are working on it, but I wanted to see if anyone has experienced the same. I'm on 11.0-U6.
From what I can see, it looks like it might be an L2ARC issue.
Somehow I have 1.7 TiB in L2ARC with only 1.09 TiB of L2ARC devices.
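The numbers below come from arc_summary.py; the same L2ARC counters can also be pulled straight from sysctl if you want to watch them change (these are the standard FreeBSD arcstats names, so double-check on your build):

# logical L2ARC size and header overhead, straight from the kernel counters
sysctl kstat.zfs.misc.arcstats.l2_size
sysctl kstat.zfs.misc.arcstats.l2_hdr_size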

root@TrueNas02:~ # arc_summary.py
System Memory:

0.11% 269.70 MiB Active, 0.46% 1.15 GiB Inact
75.22% 187.68 GiB Wired, 0.00% 0 Bytes Cache
24.22% 60.43 GiB Free, 0.00% 0 Bytes Gap

Real Installed: 256.00 GiB
Real Available: 99.97% 255.93 GiB
Real Managed: 97.49% 249.52 GiB

Logical Total: 256.00 GiB
Logical Used: 75.95% 194.42 GiB
Logical Free: 24.05% 61.58 GiB

Kernel Memory: 7.21 GiB
Data: 99.60% 7.18 GiB
Text: 0.40% 29.36 MiB

Kernel Memory Map: 319.91 GiB
Size: 43.94% 140.57 GiB
Free: 56.06% 179.34 GiB
Page: 1
------------------------------------------------------------------------

ARC Summary: (HEALTHY)
Storage pool Version: 5000
Filesystem Version: 5
Memory Throttle Count: 0

ARC Misc:
Deleted: 6.52b
Mutex Misses: 123.79m
Evict Skips: 123.79m

ARC Size: 69.42% 163.45 GiB
Target Size: (Adaptive) 69.51% 163.66 GiB
Min Size (Hard Limit): 12.50% 29.43 GiB
Max Size (High Water): 8:1 235.46 GiB

ARC Size Breakdown:
Recently Used Cache Size: 75.14% 122.98 GiB
Frequently Used Cache Size: 24.86% 40.69 GiB

ARC Hash Breakdown:
Elements Max: 133.72m
Elements Current: 93.96% 125.65m
Collisions: 13.35b
Chain Max: 21
Chains: 29.79m
Page: 2
------------------------------------------------------------------------

ARC Total accesses: 11.04b
Cache Hit Ratio: 56.40% 6.23b
Cache Miss Ratio: 43.60% 4.81b
Actual Hit Ratio: 52.58% 5.80b

Data Demand Efficiency: 55.05% 7.50b
Data Prefetch Efficiency: 32.13% 1.95b

CACHE HITS BY CACHE LIST:
Anonymously Used: 1.87% 116.45m
Most Recently Used: 39.88% 2.48b
Most Frequently Used: 53.35% 3.32b
Most Recently Used Ghost: 1.42% 88.42m
Most Frequently Used Ghost: 3.48% 216.79m

CACHE HITS BY DATA TYPE:
Demand Data: 66.32% 4.13b
Prefetch Data: 10.08% 627.79m
Demand Metadata: 23.04% 1.43b
Prefetch Metadata: 0.56% 34.82m

CACHE MISSES BY DATA TYPE:
Demand Data: 70.03% 3.37b
Prefetch Data: 27.55% 1.33b
Demand Metadata: 2.27% 109.43m
Prefetch Metadata: 0.15% 7.26m
Page: 3
------------------------------------------------------------------------

L2 ARC Summary: (HEALTHY)
Passed Headroom: 24.19m
Tried Lock Failures: 80.00m
IO In Progress: 643
Low Memory Aborts: 17.49k
Free on Write: 24.35m
Writes While Full: 1.91m
R/W Clashes: 101
Bad Checksums: 0
IO Errors: 0
SPA Mismatch: 248.75m

L2 ARC Size: (Adaptive) 1.70 TiB
Header Size: 0.52% 9.04 GiB
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
I don't even know if illumos and FreeBSD share the same ZFS code, but it sounds like the L2ARC is compressed if the originating dataset is. Link
This would explain the "over unity" usage of L2ARC. In fact, I believe top will show this compression information.
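If the FreeBSD counters are like the ones I'm used to, comparing the logical vs. allocated L2ARC size, plus checking compression on the backing zvol, should confirm it. The dataset name below is just a placeholder, use your actual iSCSI zvol:

# logical size of what's cached vs. the space it actually occupies on the cache SSDs
sysctl kstat.zfs.misc.arcstats.l2_size kstat.zfs.misc.arcstats.l2_asize

# is the source dataset/zvol compressed, and by how much?
zfs get compression,compressratio tank/iscsi-zvol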
 

klingon55

Dabbler
Joined
Mar 2, 2017
Messages
12
I don't even know if illumos and FreeBSD share the same ZFS code, but it sounds like the L2ARC is compressed if the originating dataset is. Link
This would explain the "over unity" usage of L2ARC. In fact, I believe top will show this compression information.
That definitely makes sense. I checked one of my other FreeNAS boxes running 11.0-U4 and saw that my 1TB cache drive shows 1.37 TiB in L2ARC. Should I be worried about the SPA Mismatch count, Low Memory Aborts, or Tried Lock Failures?

I noticed that when the data move errors occur (they can be reproduced at any time by trying to vMotion something), the L2ARC hit rate graph line is missing for the period during which the data move errors happened.
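To rule out a pure reporting gap, I can also poll the raw L2ARC hit/miss counters once a second while reproducing the vMotion (standard arcstats sysctl names assumed):

# crude one-second poll of L2ARC hits/misses while the storage vMotion runs
while true; do
    sysctl -n kstat.zfs.misc.arcstats.l2_hits kstat.zfs.misc.arcstats.l2_misses
    sleep 1
done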

[Attached screenshot: upload_2018-5-15_15-18-7.png]
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
Should I be worried about the SPA Mismatch count, Low Memory Aborts, or Tried Lock Failures?
I don't know enough about ZFS to comment on these metrics.

If you can afford to reset your L2ARC cache, you could just remove it for a test run. If that has no effect, add the devices back into the pool. I will admit that's a lot of cache to warm back up!
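Roughly like this; the pool and gptid names are placeholders, so pull the real ones from zpool status:

# cache devices show up under the "cache" section
zpool status tank

# detach a cache device for the test run
zpool remove tank gptid/xxxxxxxx

# and add it back afterwards
zpool add tank cache gptid/xxxxxxxx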
 

klingon55

Dabbler
Joined
Mar 2, 2017
Messages
12
I don't know enough about ZFS to comment on these metrics.

If you can afford to reset your L2ARC cache, you could just remove it for a test run. If that has no effect, add the devices back into the pool. I will admit that's a lot of cache to warm back up!
First, note that this pool is encrypted.
Good news to report: it looks like the unit was, for some reason, shipped with Hyper-Threading off. We turned that back on, reduced the number of GELI threads substantially, and updated to 11.1-U4. Because of all these changes, I do not know exactly which one fixed the issue.

With that being said, performance has doubled compared to when we first installed the unit and tested it with nothing on it. This is impressive because it currently has 86 VMs running on it, which is our entire local infrastructure with 200+ local users and 300 remote. Some of these tests may be bottlenecked by the 10Gbit connection.
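For anyone curious about the GELI part: the thread count is the kern.geom.eli.threads loader tunable (default 0 creates one crypto thread per CPU, and as I understand it that is per provider, which adds up fast with this many encrypted disks). It is set as a loader tunable rather than a live sysctl; the value shown is only an example, not necessarily what we settled on:

# check the current setting
sysctl kern.geom.eli.threads

# persist a lower value (System -> Tunables in the GUI, type "loader"), e.g.:
#   kern.geom.eli.threads=2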

[Attached screenshot: upload_2018-5-24_10-31-54.png]
 