Discussion for learning

jlw52761

Explorer
Joined
Jan 6, 2020
Messages
87
Up until about 4 months ago, I had never used ZFS, having come from the "old guard" of SAN configurations using RAID and such. So when I got my new rig, which I will happily call a dumpster dive treasure, I decided that I should probably look into this for my home lab. The reason is that I wanted volume-level snapshots and good consistency in the event of a power loss or hardware failure. I've always known ZFS fits both those bills, but I also knew that ZFS does not really translate to what one might call "traditional storage methods". So my journey begins.

Just for context, I was able to get my hands on a sweet rig, with the following specs:

Dual Xeon(R) CPU E5-2603 0 @ 1.80GHz 4-core (No Hyperthreading)
128GB DDR-3 RAM
Dual Intel I350 Gigabit Network Connection Quad Port
Intel RMS25CB080 RAID Module
(x2) 100GB Micron P400m100-MTFDDAK SATA 6Gbps SSD [d0,d1]
(x12) 2TB Seagate Constellation ES ST2000NM0011 7.2k SATA 6Gbps HDD [d2 ~ d13]

I know, it's interesting to say the least, and it was an old storage appliance that was going into the recycle bin that I was able to intercept into my trunk.

Anyhew, my requirements for a home lab storage solution are that it supports multi-protocol (NFS, CIFS, iSCSI) at a minimum, has a simple interface (because I'm getting too old and just want something simple), can do snapshots at the volume level, has a decent amount of storage (> 6TB), and has reasonable performance, as I will only run 1Gbps networking at home, not 10Gbps.

So before going further, I do want to point out that the RMS25CB080 does not have a JBOD mode, so I just made each disk its own virtual disk, as close to JBOD as I could get.

So, yeah, this dumpster dive was a treasure, but how to use it? Well, I went down the road of FreeNAS, knowing that FreeBSD has a good reputation not only for its network stack but also for its ZFS integration as a first-class citizen. I love me some Linux, Ubuntu mainly, so FreeBSD isn't scary, just different enough to be both amusing and annoying.

So, I installed FreeNAS onto a bootable USB drive, and determined that I will use that for my system drive and dedicate all of the internal storage to, well, storage, and not running the system.

So once FreeNAS was installed, it was time to configure the storage pool. This is where I got lost and had to do some research. I knew I wanted the capacity, but I also wanted resiliency in the event of a power outage or hardware failure. I also wanted the best bang for the buck in terms of performance. So what I decided to do was create a single pool using all twelve 2TB drives in a RAID-Z3 configuration. I then added one of the 100GB SSDs as a SLOG and the other as L2ARC.
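For anyone following along, a rough CLI sketch of that layout would look something like the following (the FreeNAS GUI does all of this for you; da0-da13 are just placeholder device names, with the SSDs as da0/da1 and the 2TB drives as da2-da13):

zpool create LocalPool raidz3 da2 da3 da4 da5 da6 da7 da8 da9 da10 da11 da12 da13
zpool add LocalPool log da0      # one SSD as the SLOG
zpool add LocalPool cache da1    # the other SSD as L2ARC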

My use case is two zvols shared via iSCSI to my ESXi hosts as datastores, plus a couple of NFS datastores, one for Docker container persistent data and one for general use. I have lz4 compression on and dedup off.
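Roughly what those datasets look like from the CLI (dataset names here are made up for illustration; the GUI handles the actual creation and sharing):

zfs create -s -V 2T LocalPool/esxi-ds1    # sparse zvol, exported to ESXi over iSCSI
zfs create LocalPool/docker-data          # NFS dataset for container persistent data
zfs set compression=lz4 LocalPool         # inherited by the child datasets
zfs set dedup=off LocalPool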

Things seem to work well; I have a good balance of performance, capacity, and peace of mind. I do have several spare drives sitting in a box, so I'm not worried too much about a single drive failure, but my reason for choosing RAID-Z3 is that, with disks this size, if I had a failure and replaced the drive, a second drive could fail while the resilver onto the first replacement was still running.

My normal running conditions are as follows:
ARC Size: 77GB
L2ARC Size: 47GB
ARC Hit Ratio: 98%
L2ARC Hit Ratio: 0%
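(If anyone wants to pull the same numbers from the shell rather than the Reporting graphs, something like this should work on FreeBSD/FreeNAS - the hit ratio is just hits / (hits + misses):)

sysctl kstat.zfs.misc.arcstats.size                                  # current ARC size in bytes
sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses   # ARC hit ratio inputs
sysctl kstat.zfs.misc.arcstats.l2_size kstat.zfs.misc.arcstats.l2_hits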

fnas1# zpool iostat
              capacity     operations    bandwidth
pool         alloc   free   read  write   read  write
------------ -----  -----  -----  -----  -----  -----
LocalPool     543G  21.2T     22    250   821K  2.09M
freenas-boot 1.75G  26.7G      0      0  4.76K    727
------------ -----  -----  -----  -----  -----  -----

So after all of that, I'm interested to hear what folks who eat and breathe ZFS and FreeNAS think about this and what they would do differently. I think that would be good learning not only for myself but for others who are just getting into ZFS and all of its wonderful oddities.
 


JaimieV

Guru
Joined
Oct 12, 2012
Messages
742
You have a very fragile system here. Read the stickies at the top of the "New to FreeNAS?" and "Storage" forums in particular before doing the rebuild. You've fallen down a bunch of complete newbie holes.

First and foremost, with flashing lights and klaxons: running virtual disks through the RAID controller is a huge nope with ZFS. It obscures SMART data, bad-block and read-error reporting, and other status and error reports from the HDDs, so ZFS has an incorrect view of the state of the disks. This will definitely lead to issues when a drive starts going bad and ZFS doesn't know about it because the RAID controller is lying to it. You may have effectively no redundancy protection despite having RAIDZ3; it's that serious an issue.

Second, the non-redundant SLOG is a mild risk. If it pops you'll lose any data that is in it that hasn't been written out yet - which could be a lot, but depending on your usage might not be terrible.

The L2ARC popping just removes the cache, shouldn't cause any exciting issues. However, it's almost certainly pointless as you have 128gig RAM for a home lab system.

Your RAIDZ3 will be pretty slow on both write throughput and IOPS; more vdevs == better for both.

So, actions! You MUST get a new controller and rebuild the whole thing onto it - those virtualised disks won't be usable natively when you move them. There are recommendations up in the stickies; LSI IT-mode SAS cards (or rebadged ones) are very much your friends.

As you rebuild, consider experimenting with setting the pool up as a bunch of mirror pairs, or at the very least two or more vdevs of RAIDZ2 or RAIDZ1, instead of your single RAIDZ3. It may well mean your pool will be fast and IOPS-y enough that you no longer get any benefit from the SSD SLOG or L2ARC and can ditch them, reducing your complexity. Both my devices on the buttons below push data around faster than a SATA3 SSD SLOG/L2ARC would help with.
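Very roughly, the two layouts I'm suggesting look something like this (placeholder device names; in practice you'd build them through the GUI):

# six 2-way mirrors - most vdevs, best IOPS:
zpool create tank mirror da2 da3 mirror da4 da5 mirror da6 da7 \
                  mirror da8 da9 mirror da10 da11 mirror da12 da13
# or two 6-disk RAIDZ2 vdevs - more usable capacity, still two vdevs of parallelism:
zpool create tank raidz2 da2 da3 da4 da5 da6 da7 raidz2 da8 da9 da10 da11 da12 da13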

Also, do you have a backup solution?
 

jlw52761

Explorer
Joined
Jan 6, 2020
Messages
87
Your RAIDZ3 will be pretty slow on both write throughput and IOPS; more vdevs == better for both

Can you elaborate on this? My background is in SAN administration, where more spindles in a RAID storage pool = more IOPS. By breaking the disks up into smaller vdevs, that presents fewer spindles per vdev, so wouldn't that mean fewer IOPS available to the pool? I'm not sure I am grasping why more vdevs of fewer disks is better than one vdev with a lot of disks, from a spindle-to-IOPS calculation perspective.
You also mention "a bunch of mirrored pairs", which is intriguing to say the least. I don't have 10GbE or InfiniBand for the storage networking, so I'm not sure whether the possibility of a theoretical ton of IOPS is worth the loss of capacity in my case.

Second, the non-redundant SLOG is a mild risk

With the amount of RAM in this box, I don't actually see my L2ARC being used at all, so I've thought about using that 100GB SSD as a mirror for the SLOG, but I don't see how I can add a mirror to an existing SLOG; I think I will have to destroy the SLOG and rebuild it as a mirror.

First and foremost with flashing lights and klaxons, running virtual disks through the RAID controller - this is a huge nope with ZFS

So I did read the stickies, and have done a ton of independent research outside of iXsystems documentation, and even spoke with some old Solaris admins on the subject. The conclusion I came to from all of this is that while ZFS benefits from having the SMART information, it's not 100% critical to the stability or reliability of ZFS. I do understand that ZFS can use this information to make informed decisions, but by moving that into hardware, ZFS doesn't have to use that information.

From what I'm seeing, the major concern is that the RAID controller will report that data has been written when it's actually in the cache, which is battery-backed. So what's the concern with this? The data is still safe thanks to the battery backing, so why would one think that this is not reliable? That said, I did turn off all caching in the RAID controller, because why would I need it in this case...

Now I do agree that when the controller sees a disk as degraded due to a SMART error, it may not report this to the OS, or the OS may not be able to consume this information. I've seen controllers that will actually offline a disk when it gets into a situation with SMART errors, so my thought is that when this happens, the disk will no longer be presented to the OS and the RAIDZ3 should protect my data, because the disk went away. Are you saying that the RAIDZ3 won't protect my data in the event I lose a disk, either due to a hard failure or the controller no longer presenting it? That's huge if that's the case, and it really changes the way some decision making can be done.

This will definitely lead to issues when a drive starts going bad and ZFS doesn't know about it because the RAID controller is lying to it

I'm leery of this statement, because the RAID controller isn't lying, it's doing its job. Since the intent of the RAID controller is to handle drive failures within its own hardware, it has no need to present the SMART information to the underlying OS. The RAID controller, if the issue is serious enough, will no longer present the disk to the OS via the logical drive, so ZFS will see the disk as gone.

Also, do you have a backup solution?

Not in a traditional sense, no. I'm using the ZFS snapshots at the volume level as my recovery for file-level items. I do have a second one of these boxes that, once I am in a good configuration, I plan to build up and run in my second house, then replicate the first NAS to the second one, basically providing off-site backups. The critical files, such as DNS config, MongoDB files, etc., are already housed on an NFS share that I have rclone running against to replicate to a folder on my OneDrive. This is actually outside of FreeNAS completely, using my "master controller box", which is an Ubuntu server running on a laptop (with battery) connected to my UPS. Oh, and the FreeNAS box is also cabled to my UPS for power protection.

Now, the motherboard of this system is an Intel S2600GZ with an onboard SAS/SATA controller, and the system already has an Intel RES2SV240 SAS expander in it, so on my second system I am going to try connecting the expander to the motherboard and see how it goes. I suspect it will run at 3Gbps rather than 6Gbps, but in this case I think that will be just fine. If the onboard controller doesn't work, then I will get a cheap LSI SAS 9211-8i based card that can connect to the expander and run all of my drives, the 12 data drives and the 2 SSDs.

To be honest, I am still really on the fence about using ZFS at all, or FreeNAS, just because of all these discussion points and years of being told that ZFS is overkill and only crazy-large Unix implementations need it because they can't do hardware RAID. Plus, as this and many other discussions show, ZFS can be a delicate snowflake that breaks if you sneeze too hard on it.

A part of me wants to go back to using pure hardware RAID, where I don't have to worry about blind swapping and I know that the RAID controller will do the work, freeing my CPUs to do actual work. I am a Linux guy by trade (and history), so I'm looking at something like EXT4 on RAID6, or something along those lines. Granted, I lose the lipstick and bells that FreeNAS provides; if any developers from iX Systems read this, the GUI and interface are second to none in my opinion. Running MD on Ubuntu is something I have done, and there's a ton of case studies that show doing this with good hardware RAID is just fine and not risky at all.

Overall, great discussion points though, I really appreciate the feedback and hope others can use this as well.
 

JaimieV

Guru
Joined
Oct 12, 2012
Messages
742
Coming to ZFS from running trad hardware RAID is like moving from skiing to snowboarding - about half your existing skills are useful and appropriate, the other half will have you fall flat on your face because they're exactly backwards now. Voice of experience here, at least with the snow! The fun is that going in you don't know which half of your skills, habits and reflexes are inappropriate for the new tool.

where the more spindles in a RAID storage pool = more IOPS
For ZFS pools, to a first approximation more vdevs = more IOPS. A vdev may be one disk, of course (if you were mad). ZFS distributes reads and writes across vdevs in parallel, not across the individual disks within a vdev.
https://www.ixsystems.com/blog/zfs-pool-performance-1/
https://www.ixsystems.com/blog/zfs-pool-performance-2/
https://calomel.org/zfs_raid_speed_capacity.html - has example builds and throughputs.
https://www.ixsystems.com/community...and-why-we-use-mirrors-for-block-storage.112/
If you're using the NAS over 1GigE you don't need to worry too hard about read/write speeds, but if you're expecting decent IOPS (which you'd want for iSCSI), the above will help you get them. An 8-HDD pool can rival an SSD for IOPS when built appropriately, or can rival a single HDD if not. Also, if you end up LAGGing your quad-port card, you'll probably want more parallelisation available.

I don't see how I can add a mirror to an existing SLOG
https://www.ixsystems.com/community/threads/create-zfs-mirror-by-adding-a-drive.14880/#post-81348 - it's the same as adding a mirror to any other vdev. Doing my own further reading, thanks to advances in ZFS a couple of years ago it's apparently basically non-damaging to lose a SLOG now, but there are still advantages to a mirrored SLOG for e.g. DB-type work - you get the sum of your SSDs' IOPS.
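In practice that works out to something like this from the shell (placeholder device names, with da0 the existing SLOG and da1 the SSD freed up from L2ARC):

zpool attach LocalPool da0 da1   # attach da1 to the existing log device, making a mirrored log vdev
zpool status LocalPool           # the logs section should now show a mirror containing both SSDs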

I plan to build up and run in my second house then replicate the first NAS to the second one
Excellent. ZFS replication is built for exactly this - it takes the diff snapshots you make on the primary and only sends the changed blocks to the secondary pool. It's restartable and dodgy-network-tolerant now, as of 11.3.
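Under the hood a replication task boils down to something like the following (dataset, snapshot, and host names here are made up for illustration):

zfs snapshot LocalPool/docker-data@2020-03-01
zfs send -i LocalPool/docker-data@2020-02-01 LocalPool/docker-data@2020-03-01 | \
    ssh backup-nas zfs recv -F BackupPool/docker-data   # only the blocks changed since the last snapshot travel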

freeing my CPU's to do actual work
ZFS uses basically no CPU these days, to a first approximation. This isn't the 90s, where running RAID5 parity calculation would use up half your CPU and be worth moving to a RAID card with a PPC on it.

I'm leary of this statement because the RAID controller isn't lying, it's doing it's job.
The job is one that ZFS expects to be doing.
ZFS's massive USP is that it is about data integrity at all costs. With a RAID controller you're cutting it off at the knees.
With a RAID card in place, ZFS gets "here's a stream of blocks that come from a perfectly working physical hard drive" when actually it's a purified stream coming from a virtual drive. ZFS then can't trust its block reading and can't do its own block-healing magic, losing one of the major benefits of ZFS. It can't use its own heuristics on disk behaviour to decide when to eject a misbehaving drive from the pool, as the RAID card will present it as either here-and-perfect or gone. All hardware-level management is hidden from ZFS. ZFS will still cope, of course, but you'll lose some of the benefits of ZFS just for being stubborn about swapping in $60 of HBA hardware.

https://www.ixsystems.com/community...-and-why-can't-i-use-a-raid-controller.81931/ points 5 and 6 are a summary. You can find details on why this is the case here, in the wiki by the designers and maintainers of OpenZFS (the form that FreeNAS uses):
http://open-zfs.org/wiki/Hardware#Hardware_RAID_controllers. The whole page is a fantastic mine of information, well worth reading.
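And to be concrete about what you gain: with an IT-mode HBA the OS talks to the raw disks, so checks like these work directly against the physical drives (da4 is a placeholder device name):

smartctl -a /dev/da4        # full SMART attributes and error log straight from the drive
zpool status -v LocalPool   # per-disk read/write/checksum error counters as ZFS itself sees them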

RAID cards are the biggest "I went to turn, dug the snowboard front edge in, and broke my nose on the snow" item. But your history with a different technology means you don't really, deep down, understand the perils of front edges on a snowboard. Edges on skis are your friends!

Your choice - you can stick with your comfort zone, or learn how ZFS is different and work with it. If you're doing this for fun, it doesn't really matter - it's only your own lab that's at risk and I can tell you're not the type to commit unique important data to a device without a backup! But please don't take this to a workplace spec/build thinking that a caching RAID controller is a good idea.
 

jlw52761

Explorer
Joined
Jan 6, 2020
Messages
87
So one thing that I'm learning by comparison, from a performance standpoint, is that in traditional RAID each disk gets whole blocks written to it, and all disks in the RAID set can receive blocks in parallel. The comparison in ZFS would be the vdev, not each disk, so the performance calculations happen at the vdev level, since that is where blocks are written, not to each individual disk in the vdev, which only gets a piece of each block. Once I got my head wrapped around that, things started to click on why a 12-way RAIDZ3 is much slower than four 3-way RAIDZ vdevs or six mirrors: you have more vdevs to write blocks to in parallel.

In my case, I want capacity more than IOPS, but I also want decent IOPS, so my second box is being built up with four 3-way RAIDZ vdevs instead of six mirrors, because the RAIDZ only eats 1/3 of my raw capacity, whereas the mirrors eat 1/2 - a difference of 4TB overall in my case. If I were running 10GbE and needed the IOPS, I would have opted for the mirrors, or even gone with a tiered approach.
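For reference, that layout would look roughly like this at the CLI (placeholder device names; the GUI builds the same thing):

zpool create tank raidz1 da2 da3 da4    raidz1 da5 da6 da7 \
                  raidz1 da8 da9 da10   raidz1 da11 da12 da13
# four vdevs to stripe across, one disk of parity per vdev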

As for the hardware, I did end up procuring two LSI 9211-8i cards and flashed them to IT firmware. So far, all 14 drives are visible, and I have the cage and cabling to add my two new mirrored boot drives, making 16 drives total. I can't figure out how to mirror the SLOG, so that will take a little time, but with the RAM and IOPS I have now I'm not totally dead without the SLOG, as I do have the system on a UPS.

So for now I'm sticking with FreeNAS 11.2, as I've found 11.3 to be very unstable in my case, and using the HBA to present a whole bunch of drives seems to be the right path. Now it's time to play, experiment, and learn further.

This discussion was very good in that I learned something, and hopefully others can learn from it as well. Thank you all for participating and offering your opinions.
 

JaimieV

Guru
Joined
Oct 12, 2012
Messages
742
Glad to help! Keep asking as you investigate. I think you've probably got through the big conceptual hurdle now; anything else is details. You might want to ask about whether a SLOG is useful for your use case - it often isn't if you have an IOPS-y and fast pool, but that's outside my own expertise :)
 

Scharbag

Guru
Joined
Feb 1, 2012
Messages
620
Good job on getting the HBAs. As stated above, ZFS via hardware RAID is just a recipe for heartache.

Another item to consider is that RaidZ1 is risky with larger drives. This is simply due to the fact that a resilver can take a day and it HAMMERS all of the spindles in the vDev. If another spindle fails, the whole pool is lost.

For a 12-disk system, I would consider running 2@6-disk RaidZ2 vDevs. Given that you can have 14 data drives, I would ditch the SLOG, as it is not needed for typical home use, and go with 2@7-disk RaidZ2 vDevs. This gives you adequate redundancy and minimizes ZFS overhead.

I run 2 pools, both with 2 RaidZ2 vDevs in each - one has 6 disk vDevs and one has 9 disk vDevs. Well, and a mirrored pool. :) I run ESXi and virtualize a schwack of stuff and I have never had any speed issues. The ZIL does help on my spinning rust pool to increase random IOPS but it is not required. But I have 44 hot-swap drive bays available... :D

Happy FreeNASing!!
 

jlw52761

Explorer
Joined
Jan 6, 2020
Messages
87
With two 6-disk RAIDZ2 vdevs I would have what, 16TB total capacity, but since that's two vdevs instead of four, wouldn't I cut my IOPS in half? That would mean I should definitely have a SLOG, as I regularly push between 200 and 400 IOPS under normal load. My current config of four 3-disk RAIDZ vdevs gives me about 14TB of capacity, but I get four vdevs' worth of IOPS (roughly one spindle's worth per vdev), estimated at about 180 IOPS per spindle on the low end.

So with the HBA I also got a SAS to 4x SATA cable and the appropriate power connectors, so the plan is to use the two SSDs as a mirrored SLOG, then add two spinning drives - I have some 5k 500GB ones lying around - as a mirrored boot pair. The question is about the 12 data disks. I have built the second system with four 3-disk RAIDZ vdevs, without the SLOG, so I can benchmark things that way, then rebuild the first box differently, experiment with different configs, and record the results. Plus, I plan to replicate between the two boxes, so if a resilver did blow up a vdev and my entire pool, I do have an out: rebuild and replicate back in a pinch.

With that, what is a good method one would recommend to benchmark and test the disk IO? I personally have always used IOMeter in the past, but I'm willing to "get with the times" if there's something better. What I don't like about using dd alone is that it's great for sequential read/write but doesn't help for benchmarking random IO; since the primary use is providing NFS shares for my Docker environment to store database files and presenting iSCSI to my ESXi hosts, random IO is the bigger need.

Oh, as for the instability, I started seeing it on my 11.2 system as well once I shifted load, and I believe it has to do with autotune adding the kern.ipc.nmbclusters setting. I removed it and so far things have been solid. I'm also not having much faith in the Intel I350 cards with FreeBSD; I can't do VLANs, and under load they seem to cause a complete kernel panic.
 

jlw52761

Explorer
Joined
Jan 6, 2020
Messages
87
BTW, I'm benchmarking with fio, using the following command inside a jail:

fio --randrepeat=1 --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75
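That's a 4k random read/write mix at 75% reads with a queue depth of 64. For comparing sequential throughput, a companion run along these lines might look like (the filename is arbitrary):

fio --name=seqwrite --filename=seq_write.fio --rw=write --bs=1M --size=4G --iodepth=16 --direct=1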
 