Carl Thompson
Dabbler
- Joined
- May 22, 2017
- Messages
- 15
Hello, this is my first post on the forum. Please be gentle. I have several questions and need some advice on performance tuning. Thank you in advance. While this is my first post I've read many, many posts here as well as all over the internet so I think I have a reasonable grasp of the fundamentals of FreeNAS and ZFS. Thanks @jgreco, @cyberjock and all others who have posted here as I have found your posts absolutely invaluable. You guys rock.
I have several identical FreeNAS servers which are used almost exclusively as backend storage for VMs in a test lab with 200+ users (VM creators). The servers were built from older enterprise hardware we've acquired extremely cheaply because cost is an issue for the lab. Here's what the hardware in each of the servers looks like:
- Tyan S7012 motherboard (LGA1366)
- 2x Intel Xeon L5630 CPUs (2.13GHz, 4 cores each)
- 144GB ECC DDR3 RAM (Mixed dual-rank from a variety of sources)
- 2x LSI 9211 HBAs in IT mode (or 2x Dell H200s HBAs cross-flashed to LSI 9211 IT firmware)
- 14x HGST Ultrastar 7K4000 4TB enterprise SATA 7200RPM mechanical drives
- 2x Intel S3710 400GB enterprise SSDs
- Intel X540-T2 10 gigabit ethernet card on dedicated 10 gigabit storage network
- 16 bay 3U chassis
- 2x Redundant 700W power supplies
- FreeNAS 9.10
- 6x 4TB drive mirror VDEVs (12 drives total in 6 mirror VDEVs)
- 2x 4TB drive hot spares
- 2x 20GB partitions from SSDs mirrored for SLOG (20GB total)
- 2x 280GB partitions from SSDs striped for L2ARC (560GB total)
- Compression is ON (~1.5 ratio)
- Dedup is ON (~2.0 ratio)
- Autotune is ON (no other tuning done yet)
- Currently using NFS to serve clients
- Using sync=always on VM datasets
- Most use is from a VMware 6.5 cluster (~10 hosts)
- Some use is from an oVirt 4.1 cluster (5 hosts)
- Data is snapshotted every 4 hours and replicated to a bigger, slower FreeNAS server
- All servers are under 40% of total capacity (at most 8TB allocated)
- ARC hit rate is 93% to 99% currently (depending on server)
- L2ARC hit rate is 25% to 40% currently (depending on server)
- Fragmentation is between 45% and 75% currently (depending on server)
So does all this work? Mostly.
The big problem is that once in a while (every couple of weeks or so) a server will decide to take a vacation for a few minutes. The server always comes back, but in the meantime VMware freaks out and vSphere HA reboots all the VMs using the datastore after a few minutes. Even if the VMs don't get rebooted, most of them are Linux based and Linux acts pretty stupid when its disk goes away. (I can't write to my disk for a few minutes? I'll just remount it read-only forever!) I believe this is the same problem that @jgreco has posted about several times. Aside from these episodes the VMs generally "feel" OK, but my latency graphs are spikier and have more outliers than I'd like.
So what can I do about it? I've obviously made some mistakes in the initial design and this may be a good time to rectify those as I put new servers in production and rebuild older ones.
I wonder if 144GB of RAM just isn't enough with dedup on and 560GB of L2ARC. From what I understand, only 1/4 of the ARC is allocated to metadata caching, and I've read that both the dedup tables and the L2ARC indices count as metadata. My total ARC size on these servers is about 120GB, but if I'm calculating correctly the dedup table on one of the servers is about 40GB by itself. So I'd guess I either need to bump up the memory significantly or change the percentage of the ARC that can be used for metadata. What's the best way to do that? And is changing this ratio a good idea?
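For anyone who wants to check my math, here's the back-of-the-envelope calculation I'm using. The per-entry sizes (~320 bytes per in-core DDT entry, ~70 bytes of ARC header per L2ARC record) and the 64K average block size are ballpark assumptions I've seen quoted around here, not exact figures for any particular ZFS release:

```python
# Rough estimate of ARC metadata pressure from dedup tables and L2ARC
# headers. Per-entry sizes are ballpark figures, not exact values.

DDT_ENTRY_BYTES = 320        # assumed in-core size of one dedup table entry
L2ARC_HEADER_BYTES = 70      # assumed ARC header per L2ARC record

allocated_bytes = 8 * 2**40  # ~8 TB allocated (worst case on my servers)
avg_block_bytes = 64 * 2**10 # assumed 64K average block size

# Dedup table: one entry per unique block.
ddt_entries = allocated_bytes // avg_block_bytes
ddt_bytes = ddt_entries * DDT_ENTRY_BYTES

# L2ARC index: one header in ARC per record cached on the SSDs.
l2arc_bytes = 560 * 2**30
l2arc_headers = (l2arc_bytes // avg_block_bytes) * L2ARC_HEADER_BYTES

# Default metadata limit is 1/4 of the ARC.
arc_bytes = 120 * 2**30
arc_meta_limit = arc_bytes // 4

print(f"DDT estimate:       {ddt_bytes / 2**30:.1f} GiB")
print(f"L2ARC headers:      {l2arc_headers / 2**30:.2f} GiB")
print(f"ARC metadata limit: {arc_meta_limit / 2**30:.1f} GiB")
```

Under those assumptions the dedup table alone (~40 GiB) already overflows the default metadata limit (~30 GiB), which would fit what I'm seeing. If I'm reading things right the relevant tunable is `vfs.zfs.arc_meta_limit` (in bytes), but I'd appreciate confirmation from someone who's actually adjusted it.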
Another thought is that I may be beyond what NFS is capable of with FreeNAS and VMware. I've heard (here) that VMware's NFS implementation is a second-class citizen and that iSCSI is the way to go for VMware. I'm ready to switch things to iSCSI if the consensus is that its performance and consistency are better for VMware (though this will involve some pain). If so, then I'm also looking for best-practices advice for iSCSI (I'm an iSCSI n00b). I'm also considering the iSCSI switch for the VAAI integration benefits.
Would it be advisable to combat the server "vacations" by tuning the maximum transaction group size or timeout lower? Wouldn't that increase fragmentation? Is fragmentation an issue I really need to worry about?
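Concretely, these are the knobs I think are involved (FreeBSD sysctl names as I understand them for FreeNAS 9.10; the old write_limit tunables were apparently replaced by the dirty-data throttle, so please correct me if I have the names wrong):

```shell
# Check the current values first (and their descriptions with sysctl -d):
sysctl vfs.zfs.txg.timeout vfs.zfs.dirty_data_max

# Force a txg to close at least every N seconds (5 is the default I see;
# lowering it should mean smaller, more frequent flushes).
sysctl vfs.zfs.txg.timeout=5

# Cap the dirty data ZFS will buffer before throttling writers (bytes);
# a smaller cap trades peak throughput for steadier latency. 4GiB here
# is just an illustrative value, not a recommendation.
sysctl vfs.zfs.dirty_data_max=4294967296
```

My worry is exactly the tradeoff in the question above: smaller, more frequent txgs might smooth out the latency spikes but write data in smaller chunks, which sounds like more fragmentation over time.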
This post is already too long so I'll end it here. I'll post more information from the servers and more questions in later posts.
Thank you,
Carl Thompson