[RFC] Planning install on LSI MegaRAID SAS 9271-8i (no HBA mode)

Status
Not open for further replies.

Andrii Stesin

Dabbler
Joined
Aug 18, 2016
Messages
43
Here at my new job, we inherited some 4 or 5 SuperMicro "storages" built in 2014, which are in fact just 19" servers with LSI MegaRAID SAS 9271-8i controllers inside, each running a different version of Windows Server and a different kind of iSCSI host software. They are slow and ineffective and mostly misconfigured. Naturally, there is an obvious idea - just convert them to FreeNAS and make them serve NFS instead of iSCSI. Now the first unit is free and empty (we migrated all its data to the other storages), so it's time to plan the future setup.

What we've got is a 4U SuperMicro box with a fast 4-core Xeon and 128 GB RAM, which itself hosts 3 shelves of 3.5" drives, plus 1 external shelf of 12 3.5" drives connected to the main unit with a SAS cable:
  • shelf 1 has 9 x 1 TB SATA3 HDDs and 3 x 200 GB SSDs,
  • shelf 2 has 12 x 1 TB SATA3 HDDs,
  • shelf 3 has 12 x 1 TB SATA3 HDDs,
  • the external shelf has 12 x 4 TB SATA3 HDDs.
They are all connected to the LSI MegaRAID SAS 9271-8i, which has an internal battery installed and is generally a very fast and intelligent device in its own right. It has neither an HBA mode nor "IT" firmware, and I don't want to take the risk of damaging a perfectly good device trying to change that. Turning the drives into a bunch of pseudo-"volumes", one per drive, doesn't seem to be a clever idea either (it has been discussed a number of times around here).

My idea is: let the hardware RAID do what it is good at - redundancy. Namely, let it give me RAID volumes, hidden behind the MegaRAID, which look like just very large disks from the kernel's point of view.

And let ZFS do what it does well - namely, pool management and LZ4 compression.

So I plan to set up (a rough sketch in zpool terms follows the list):
  • one "small" pool from the 9 HDDs of the 1st shelf, as a single RAID6 volume presented to ZFS as one "disk",
  • one "fast" pool - take the 2 shelves with 12 x 1 TB drives each, build 2 RAID6 volumes, and create a striped pool on top of them (or maybe a mirror would even be better, because the drives are somewhat age-worn),
  • one "fat" pool - the 12 x 4 TB drives as a single RAID6 volume; 9 of them are brand new, the other 3 are almost new,
  • and use a striped volume of the 3 SSDs for cache (will it serve all three pools simultaneously?)
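In plain zpool terms, I imagine something roughly like this (just a sketch; the mfidN device names are my assumption of how the mfi(4) driver would expose the MegaRAID volumes on FreeBSD, and in FreeNAS itself this would of course go through the GUI):

Code:
  # "small" pool: one RAID6 volume from shelf 1, seen by ZFS as a single big disk
  zpool create small mfid1
  # "fast" pool: two RAID6 volumes (shelves 2 and 3) striped at the ZFS level
  zpool create fast mfid2 mfid3
  # "fat" pool: one RAID6 volume built from the external shelf
  zpool create fat mfid4
  # LZ4 everywhere - that is the whole point of the exercise
  zfs set compression=lz4 small
  zfs set compression=lz4 fast
  zfs set compression=lz4 fat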
Here is why I think this design is viable and the risks are not critical.

I don't want hot-spare drives because they are effectively useless. We already had a precedent: after 1 drive went south, MegaRAID activated the hot spare and started rebuilding onto it. The rebuild should have taken about 24 hours, but the next 2 drives failed during the following 9 hours. So now we rely on our duty engineer, who should replace a failed drive immediately, without relying on hot spares or the like.

This brings up another question: will the MegaRAID management daemon, which allows the card to be managed remotely with LSI's graphical control utility, work on FreeBSD / FreeNAS?

Any suggestions on what could be done better with regard to performance and/or reliability of this setup?

Thanks in advance!
WBR, Andrii Stesin
 

Mirfster

Doesn't know what he's talking about
Joined
Oct 2, 2015
Messages
3,215

gpsguy

Active Member
Joined
Jan 22, 2012
Messages
4,472
Ditto what Mirfster said. Search the forum for more information.
 

Andrii Stesin

Dabbler
Joined
Aug 18, 2016
Messages
43
What is wrong with a config where the controller exports its volumes to ZFS, to be used for compressed zpools without extra redundancy at the filesystem level? I recall Joe Greco wrote some time ago that while this is indeed not the best setup, it is a viable option.
 

Mirfster

Doesn't know what he's talking about
Joined
Oct 2, 2015
Messages
3,215
ZFS then does not have direct access to the disks. What happens when the hardware RAID controller and ZFS are competing for control of the drives?

Also, ZFS will not have access to SMART capabilities.

There are tons more reasons, but I am on my cell and not willing to type too much. Search the forums and you will find tons of postings on why this is not the way to do it.

I highly doubt jgreco would *bless* running like this. More than likely you are thinking about the "unorthodox" method for a SLOG (leveraging the RAID card's BBU), but that is an edge case for a particular scenario, not something for housing production data.
 

Andrii Stesin

Dabbler
Joined
Aug 18, 2016
Messages
43
They won't compete for control of the physical drives, because the MegaRAID will completely abstract the hardware level from ZFS and present "virtual drives" to it, and you can (supposedly) even access the S.M.A.R.T. data of individual drives behind the LSI card with some magical smartctl arguments.
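If I read the smartmontools docs correctly, on FreeBSD it should be something along these lines (assuming the card attaches via the mfi(4) driver and the first volume shows up as /dev/mfid0 - some setups use the controller node /dev/mfi0 instead, and the per-slot index N will differ):

Code:
  # query SMART of the physical disk at controller device index 0 / 1 / ...
  smartctl -a -d megaraid,0 /dev/mfid0
  smartctl -a -d megaraid,1 /dev/mfid0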

Also, jgreco said in the same thread: "Adding a crappy little RAID controller (LSI, etc) to your massive big RAID controller (ZFS) creates a very bad layering. There are some potential benefits but also significant downsides. If you can sufficiently address all the downsides, and cyberjock has discussed at least some of them, then you can do as you please, but with the caveat that "you've been warned."" - I can't add anything to that, he is perfectly correct, but!
  1. The 9271-8i is neither crappy nor little; it's a very decent and fast hardware RAID controller, and it seems unreasonable to just throw away all its power for religious reasons.
  2. I'm NOT going to build "RAID over RAID", so there is no "bad layering" - I just use the RAID layer from the LSI and the layer of pools and LZ4 from ZFS (omitting the RAID capabilities of the latter).
Also, there is the MegaCli tool, which allows controlling the LSI card from the FreeBSD CLI, even if the LSI management agent daemon is somehow missing from LSI's FreeBSD software bundle.
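From what I've seen, the usual incantations would be something like this (the binary name and capitalization vary between LSI's bundle and the FreeBSD port - MegaCli, MegaCli64, megacli):

Code:
  MegaCli -AdpAllInfo -aALL                 # controller and firmware overview
  MegaCli -LDInfo -Lall -aALL               # state of every RAID volume (virtual drive)
  MegaCli -PDList -aALL                     # state of every physical drive
  MegaCli -AdpBbuCmd -GetBbuStatus -aALL    # battery backup unit status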
 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
You can do this... but just because you can, does not mean that you should! :)

iXsystems is the developer of FreeNAS. The very first item in iXsystems' 'Worst Practices Guide' is 'Using Hardware RAID with ZFS'.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Hi Andrii,

You're unlikely to get an official "blessing" of sorts from the forum users, because even though you may be planning to implement ZFS-on-HW-RAID in a way that works, and may well even be safe, the other 99% of people who google "ZFS with hardware RAID" and quickly skim the thread will think that they can use their $25 cheapo card without incident.

(And if you're reading this, random Googler - please read the whole thread. Or register. Or click Spearfoot's link above. It explains why what you're about to do is silly.)

But I'll dive in here.

  1. The 9271-8i is neither crappy nor little; it's a very decent and fast hardware RAID controller, and it seems unreasonable to just throw away all its power for religious reasons.
  2. I'm NOT going to build "RAID over RAID", so there is no "bad layering" - I just use the RAID layer from the LSI and the layer of pools and LZ4 from ZFS (omitting the RAID capabilities of the latter).

1. That card might be big and fast, yes - a dual-core SoC, 1GB of DDR3 - but consider that using ZFS for "software RAID" means that your monster quad-core Xeon is now your "RAID processor" and it's packing 128GB of RAM. I think it's safe to say that power balance is the equivalent of "bringing a nuclear ICBM to a knife fight."

2. In this scenario, ZFS will not be able to protect you from corrupted data. It will be able to identify it, but not do anything about it. You'll be relying 100% on the MegaRAID card for that, so you may (likely "will") have to write your own scripts to monitor the array status via megacli and submit emails if/when something goes sideways, as well as handle replacement of failed devices and monitor rebuilds. Obviously we can't give you any support on that front, so if something's wrong, we won't be able to say what - or even worse, you may not even know something's wrong until it's too late.

In addition, you could also hurt performance this way. ZFS likes to have control over drives and their associated cache - in this case, I'm referring to the 32/64/128MB of RAM on each individual disk's PCB. With them all grouped into a volume on the MegaRAID, ZFS will only see one drive - and if it says "flush cache" to the MegaRAID, it might respond by dumping that entire 1GB of battery-or-flash-backed RAM to spindle. Until that dump is finished, the controller may well (again, likely "will") block any further incoming writes to cache, across all of your pools. Whoops. There goes performance, now you have a problem.

And if that drags on long enough, ZFS may actually consider the device as not responding in time, and offline it. But uh oh ... that was the only vdev in the pool. Now you have a big problem.

Bear in mind that ZFS is, by design, a very "risk-averse" filesystem, and folks who use it and offer support tend to have a similar mindset. We're not suggesting that you go software RAID out of malice or zealotry - it's out of concern for the safety of your data.

And to shuttle back up to your original post for a second - you can't use the same cache devices for multiple pools. ARC ("RAM cache") is shared, but L2ARC ("SSD read cache") isn't. You could attach one to each, though.
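If you did want to go that route, the per-pool attachment is a one-liner each - roughly like this, with the device names purely made up as placeholders for however your three SSDs end up being exposed:

Code:
  zpool add small cache da20
  zpool add fast  cache da21
  zpool add fat   cache da22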

Cheers, and hope we haven't scared you off.
 

depasseg

FreeNAS Replicant
Joined
Sep 16, 2014
Messages
2,874
You asked for comment. You can choose to consider it or not.
Any suggestions on what could be done better with regard to performance and/or reliability of this setup?
Yes, use an HBA with ZFS, not a HW RAID Controller.
I don't want hot-spare drives because they are effectively useless. We already had a precedent: after 1 drive went south, MegaRAID activated the hot spare and started rebuilding onto it. The rebuild should have taken about 24 hours, but the next 2 drives failed during the following 9 hours. So now we rely on our duty engineer, who should replace a failed drive immediately, without relying on hot spares or the like.
This doesn't make sense to me. Can a duty engineer really swap a disk faster than an automatic hot spare can be activated? The 2 failures in 9 hours versus a 24-hour rebuild time have nothing to do with the swap-over time. It's a very well-known issue, and a driving reason to move to 3-disk parity.
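For reference, triple parity is just another vdev type in ZFS - a rough sketch, assuming the twelve 4 TB disks were exposed directly through an HBA as da0 through da11:

Code:
  zpool create fat raidz3 da0 da1 da2 da3 da4 da5 da6 da7 da8 da9 da10 da11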
This brings up another question: will the MegaRAID management daemon, which allows the card to be managed remotely with LSI's graphical control utility, work on FreeBSD / FreeNAS?
I highly doubt it.
My idea is: let the hardware RAID do what it is good at - redundancy. Namely, let it give me RAID volumes, hidden behind the MegaRAID, which look like just very large disks from the kernel's point of view.
And let ZFS do what it does well - namely, pool management and LZ4 compression.
ZFS does redundancy very well, and furthermore ZFS provides data integrity with self-healing.
 
MatthewSteinhoff

Joined
Feb 2, 2016
Messages
574
They are slow and ineffective and mostly misconfigured.

And you're trying to fix that by creating an ineffective and misconfigured FreeNAS system?

It sounds as though you have some nice hardware. For just the cost of a nice HBA (or two) - under $400 total - you can make magic happen: a fast, effective and correctly configured FreeNAS system. Do the right thing. Spend the cash.

Cheers,
Matt
 

Andrii Stesin

Dabbler
Joined
Aug 18, 2016
Messages
43
Hi men, thank you a million times for your comments. Let's go step by step; I'll start with Spearfoot's comment, so let's read the provided link carefully. Yes, there are drawbacks. But how terrible and dangerous are they?
ZFS includes a sophisticated yet efficient strategy for providing various levels of data redundancy, including the mirroring of disk and the “ZFS” equivalents of hardware RAID 5 and higher with the ability of losing up to three disks in an array.

Ok, great. But who said that using ZFS's own ability to handle data redundancy is mandatory, and that offloading this part of the job to the controller is (supposedly) strictly forbidden? I don't see any mention of that.

ZFS will not be able to efficiently balance its reads and writes between them or rebuild only the data used by any given disk

Ok, I trust the RAID controller to perform these tasks; I offload this part of the job to hardware. Is that wrong? And why?

Hardware RAID cards typically rebuild disks in a linear manner from beginning to end without any regard for their actual contents.

Does "typically" mean "certainly"? Or maybe programmers of Avago/LSI are this dumb not to implement any better scenarios during who-knows-how-many-years? They are still in the market, aren't they? So are they selling crap to unaware people? I sincerely doubt this.

ZFS works carefully to guarantee that every write it receives from the operating systems is on disk and checksummed before reporting success. This strategy relies on each disk reporting that data has been successfully written, but if the data is written to a hardware cache on the RAID card, ZFS is constantly misinformed of write success. This can work fine for some time but in the case of a power outage, catastrophic damage can be done to the ZFS “pool” if key metadata was lost in transit.

Everything here is correct, I won't argue with it. But I should mention that the LSI has its own onboard battery which, hmm, just works; and a catastrophic power outage in the datacenter is not that big a risk compared to plain hardware failure (which happens probably 100 times more often). For key metadata, I can (and plan to) use the striped SSD volume, it's damn fast. What's wrong with this approach?

most hardware RAID cards will mask the S.M.A.R.T. disk health status information that each disk provides

Yes, thank you, I'm aware of this, and I already have another way to get S.M.A.R.T. info from behind the butt of the RAID card.

Is this all? Any more drawbacks and risks? If this list of risks and dangers is complete, I'm perfectly ready to live with these.
 
Last edited:

depasseg

FreeNAS Replicant
Joined
Sep 16, 2014
Messages
2,874
If this list of risks and dangers is complete
I would add that the internet has a lot of recommendations against using ZFS with HW RAID. And it's not that the combination is any more risky than using HW RAID with any other Filesystem, it's that there are unnecessary risks that could be avoided by not using HW RAID. But feel free to do what you want. It's your data.

One big item for me, and a reason I switched from a HW RAID controller, was silent data corruption (which ZFS can detect and correct, but HW RAID cannot).
(From http://open-zfs.org/wiki/Hardware):
  • Hardware RAID will limit opportunities for ZFS to perform self healing on checksum failures. When ZFS does RAID-Z or mirroring, a checksum failure on one disk can be corrected by treating the disk containing the sector as bad for the purpose of reconstructing the original information. This cannot be done when a RAID controller handles the redundancy, unless a duplicate copy is stored by ZFS (which is the case for metadata, when the copies flag is set, or when the RAID array is part of a mirror/raid-z vdev within ZFS).
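The "copies" flag it mentions is a per-dataset property - a sketch, with the dataset name made up; note it only applies to data written after the property is set, and it doubles the space used:

Code:
  zfs set copies=2 tank/important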
The other issue, for some, is that you are tied to that controller chipset for the life of that data. If that controller fails, you will need to find a compatible replacement (which sometimes isn't easy).

Also, the FreeNAS developers' #1 worst practice is using HW RAID:
http://www.freenas.org/blog/freenas-worst-practices/
 

Andrii Stesin

Dabbler
Joined
Aug 18, 2016
Messages
43
Dear HoneyBadger, you are perfectly right that I'm "unlikely to get an official blessing", and I swear - a "blessing" is not even close to what I'm looking for :) I'm seeking technical advice backed by practical experience. My guess is that probably nobody has ever used this kind of setup in real production, it's that simple. Or am I wrong?

You say, "consider that using ZFS for "software RAID" means that your monster quad-core Xeon is now your "RAID processor" and it's packing 128GB of RAM" - so why on the Earth people are using NVidia GPUs, for example? main CPUs are so powerful and fast now! Intel's onboard graphics is so much fast er than any videocard we were using while playing Doom-2! (just kidding). Offloading some tasks to "hardware" (which in fact is just yet another specialized computer) is not that bad an idea, yes I can emulate i.e. ethernet adapter with main CPU, or even the whole switch (like VMWare does), of I can use a general purpose computer to perform as a BGP border router which holds some 3-4 full views. It's possible and sometimes even brings a better price/performance ratio. (BTW do you recall 80287 floating point coprocessors?) But I already got MegaRAIDs as given, so why should I reject getting the most out of them? Especially when risks are negligible and managed?

You say "In this scenario, ZFS will not be able to protect you from corrupted data." - Ok, let controller alone take this responsibility. BTW in case I use a RAID-given "disk" for zpool, doesn't ZFS do it's own checksumming of data? Yes, I'll be relying 100% on the MegaRAID card for that. (And free pretty much CPU cycles for other purposes, for LZ4 first of all).

"have to write your own scripts to monitor the array status via megacli and submit emails if/when something goes sideways" - I don't see any problem in writing some scripts and put these into crontab, and even call REST endpoints of my ticketing system to open tickets immediately as soon a problem symptoms are observed (I hate email for this, it's 20th century stuff). I remember the whole SQL interpreter written in plain /bin/sh 20+ years ago, so this is obvious.

And great thanks for the note about the probable performance hit. This is a really valuable remark; I had already noticed some mentions of this scenario while reading around, and I will certainly run some tests for this "massive write case" (and tickle some tunables) before I put the whole beast under load. Probably limiting the write transaction size to maybe 1/3 of the controller's cache size will do - something like the tunable below.
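Something in this direction, I suppose (the tunable name assumes a FreeBSD/FreeNAS version with the reworked ZFS write throttle; in FreeNAS it would be added as a tunable through the GUI rather than set by hand, and the number - a third of the card's 1 GB cache - is a guess to be verified under real load):

Code:
  # cap the amount of dirty (not yet flushed) ZFS data at ~1/3 of the
  # MegaRAID's 1 GB write-back cache
  sysctl vfs.zfs.dirty_data_max=357913941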
 

Andrii Stesin

Dabbler
Joined
Aug 18, 2016
Messages
43
"The 2 failures in 9 hours vs a 24 hour rebuild time has nothing to do with the swapover time. It's a very well known issue, and a driving reason to move to 3 disk parity."

In fact, this is not a technical issue but a 100% managerial one. The first f*up of my predecessors is that they put 12 identical drives, all from the same manufacturer's batch (even the serial numbers are sequential), into a single shelf and a single RAID volume. No wonder that, after some period of operation under identical load, they started to fail one by one, i.e. three drives in 9 hours. The second f*up is that nobody cared to replace 1 old drive with a new one every month after the first year of the array's operation.
 

Andrii Stesin

Dabbler
Joined
Aug 18, 2016
Messages
43
Dear MatthewSteinhoff, you are perfectly correct. For my own private systems, I use exactly this approach. But here the whole setup is the property of my employer, and I have fallen into the abyss between two options:
  • either I continue to use the MegaRAID under Windows Server (thank you, I'm tired enough of that),
  • or I use it with FreeNAS/ZFS in a, hmm (who said "perverted"?), incorrect (but working) way.
I know the risks and drawbacks of the first (Windows) approach in detail. Nobody will give me any budget to downgrade good and perfectly operational hardware. So, to be or not to be - that is the question...

And for my purposes I need good on-the-fly compression at the filesystem level. That is the goal.
 
Last edited:

Mirfster

Doesn't know what he's talking about
Joined
Oct 2, 2015
Messages
3,215
The 9271-8i is neither crappy nor little; it's a very decent and fast hardware RAID controller, and it seems unreasonable to just throw away all its power for religious reasons.
I know the risks and drawbacks of the first (Windows) approach in detail. Nobody will give me any budget to downgrade good and perfectly operational hardware.
Um, okay. So regardless of the data points provided, it would appear to me that the crux of the matter is that you are totally hung up on somehow justifying the use of your 9271-8i card(s). So be it, but don't forget to factor in the cost of actually losing data, as well as production downtime. How much will that impact "budgetary concerns"?

No matter how you spin things or try to validate your direction, it is ultimately wrong. You are not the first, and more than likely not the last, to think their hardware is the exception. Knowing this, I will detach myself from this conversation and wish you nothing but the best of luck.
 

Andrii Stesin

Dabbler
Joined
Aug 18, 2016
Messages
43
As for "issue for some, is that you are tied to that controller chipset for the life of that data" - hmm, yes, somewhat. But from my own experience, LSI MegaRAID stores a separate copy of array configuration on each drive of an array. So when you disconnect a shelf from one MegaRAID and connect it to another (similar or identical) one, you can point it to the selected stack of "unconfigured good" drives and tell it to sniff out the config from these. It works Ok, we did this once.

Also, I don't get how ZFS can fail in the case where 2 (two) separate hardware RAID arrays are mirrored at the ZFS level. OK, let one of the two arrays get corrupted; ZFS will detect this and either repair the mirror or take the broken array offline. Where is the problem? Or are they speaking about a "mirror over mirror" setup, which does not make sense to me?
 

Andrii Stesin

Dabbler
Joined
Aug 18, 2016
Messages
43
"wish you nothing but the best of luck" - thank you, sincerely. So, I'll try this and will share my experience.

Either it will work (and I'll tell you the details) - or it will not, and I'll revert to some other option, i.e. a non-ZFS setup over hardware RAID (and tell you the details).

But let's consider a non-ZFS setup over hardware RAID (be it NTFS or say EXT4) - how will it be any better than ZFS over the same hardware?
 

Andrii Stesin

Dabbler
Joined
Aug 18, 2016
Messages
43
BTW, the very notion of a situation where an "HW RAID controller suffers silent data corruption (which ZFS can detect and correct, but HW RAID cannot)" makes me wonder: are the programmers who write RAID firmware for the vendors complete idiots? Or what else can explain such a mess?
 

depasseg

FreeNAS Replicant
Joined
Sep 16, 2014
Messages
2,874
BTW, in case I use a RAID-given "disk" for a zpool, doesn't ZFS do its own checksumming of data?
Yes, it does, and it will alert you to the corruption either on a scrub or when the file is individually accessed; but if there isn't any ZFS-level redundancy, ZFS can't correct that file and you will need to restore it from backup.
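For example (pool name assumed):

Code:
  zpool scrub tank        # re-read every block and verify its checksum
  zpool status -v tank    # CKSUM column shows errors; -v lists any damaged files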
makes me wonder: are the programmers who write RAID firmware for the vendors complete idiots? Or what else can explain such a mess?
The same designers who design hard disks to fail. IOW, no one. Stuff happens: bits get flipped, bits randomly go bye-bye, Mr. Murphy comes to visit the data on your disks. It happens, regardless of how.
 