ZFS on all-SSD storage


p.remek

Cadet
Joined
Oct 23, 2014
Messages
8
Hello,

I would like to ask whether it is a good idea, in terms of SSD endurance, to run ZFS on pure SSD storage (12 SSDs directly attached to the server). The storage will be used to run virtual machines.

Does ZFS have a worse impact on SSD wear-out than other filesystems like EXT4? Does the fact that ZFS is a COW filesystem mean that it does more writes per written block than a non-COW filesystem?

Regards,
 

pjc

Contributor
Joined
Aug 26, 2014
Messages
187
Why would you use all SSD? What kind of network are you using? Regular gigabit will easily be the bottleneck with 5+ spindles.
 

p.remek

Cadet
Joined
Oct 23, 2014
Messages
8
It's all SAS3 (12 Gb/s): SSDs, expanders, HBA. To clarify, I may have made a mistake asking on this forum, as this won't particularly be a FreeNAS deployment, and I apologize for that. It's a generic ZFS question.
 

pjc

Contributor
Joined
Aug 26, 2014
Messages
187
My question wasn't really specific to FreeNAS either.

So what kind of network are you connecting to?
 

p.remek

Cadet
Joined
Oct 23, 2014
Messages
8
Not sure I understand. The server is currently connected via 2x 10 Gb SFP+, with two more 10 Gb SFP+ ports available. But really, it's a hypervisor, so the I/O will be flying between the disks and the server over SAS3 even if I disconnect the network completely, as long as the virtual machines are powered on and doing something.
 

pjc

Contributor
Joined
Aug 26, 2014
Messages
187
Got it: you've got a 20 Gbit network with up to 40 Gbit available, which is a lot faster than what most people around here run. If most of your load were network-based, the network could still be the bottleneck (assuming a peak of ~600 MB/s per drive, which is about the fastest SSD I've seen).

If you split the drives into two pools of six each in RAIDZ2, that leaves 8 data drives, for a maximum of roughly 38 Gbit/s coming off the SSDs.
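Rough math behind that figure (the 600 MB/s per-drive rate is my assumption from above, and a real pool won't hit the theoretical ceiling):

# Back-of-envelope sequential throughput for 2x RAIDZ2 pools of SSDs.
# The 600 MB/s per-drive figure is an assumption, not a measurement.
PER_DRIVE_MBPS = 600
POOLS = 2
DRIVES_PER_POOL = 6
PARITY_PER_POOL = 2                                          # RAIDZ2

data_drives = POOLS * (DRIVES_PER_POOL - PARITY_PER_POOL)    # 8 data drives
total_mbps = data_drives * PER_DRIVE_MBPS                    # 4800 MB/s
print(f"{total_mbps} MB/s = {total_mbps * 8 / 1000:.1f} Gbit/s")  # ~38.4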

As to wear-out, I'll defer to others with more expertise, but I think COW would actually be better for SSDs, since you don't overwrite the old data; you just write the changed data to a new location. You do have to update the metadata that says which block is current, but that's all handled transactionally as well.

Given that a write-optimized SSD can be used for L2ARC (cache) or ZIL, I suspect the write performance for the pools wouldn't pose a problem.

But really, ZFS has a totally different goal: reliable data storage. It's hard for me to imagine that a slightly higher SSD replacement rate would even be a blip compared to running an unreliable pool on EXT4.

The more likely issue is going to be RAM. ZFS likes lots and lots of RAM. If you're planning on running lots of local VMs, they're presumably going to eat most of that.
 
L

Guest
Yes, yes, do it. iXsystems has an all-flash array as well. ZFS should be better than most filesystems, because it doesn't write anything to disk until there is something to write. Most SSDs do some wear leveling. From the people I know running ZFS on SSDs, it runs wonderfully. But the one thing that is still unknown is longevity.
 
L

Guest
I will also say that the big arrays I have been seeing go into M&E (media and entertainment) and large-scale CAD systems are front-ended by a bunch of SSDs. And if you look at the Oracle ZFS appliance, it has tiered storage with SSDs at the front end.
 
L

Guest
One more answer: ZFS, with COW, will not overwrite live blocks, so there is less chance the same block will be written to over and over again.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
As to wear-out, I'll defer to others with more expertise, but I think COW would actually be better for SSDs, since you don't overwrite the old data; you just write the changed data to a new location. You do have to update the metadata that says which block is current, but that's all handled transactionally as well.
One more answer: ZFS, with COW, will not overwrite live blocks, so there is less chance the same block will be written to over and over again.

No, COW has no significant effect at all. Why do I know this? Because any SSD, when writing a new version of a file, won't actually write to the same blocks. The old SSDs that did random writes at 5 IO/s did that, but the latest ones will *always* write to empty/erased blocks. You can't write to blocks that aren't erased, and the erasure process is an I/O killer; hence the need for all the wear leveling, spare storage, etc.

The only "true" difference between something like NTFS and ZFS to a flash drive is that on NTFS you logically think you are writing to a given block, but physically you aren't. With ZFS you are not only logically writting to a different block, but you are physically writing elsewhere too. Note that the OS and its applications do NOT have access to the physical to logical translation layer (that's kept on the SSD itself and is technically called the "Flash Translation Layer") so the block that ZFS is writing to doesn't match the physical location except by random chance.
 

p.remek

Cadet
Joined
Oct 23, 2014
Messages
8
Thanks for the reply. My main concern was whether ZFS, when doing a write, has to do some extra write on top of each useful write. I imagined it as follows: because ZFS is COW, it writes to a free location on disk, and because of that it probably has to update some pointer to this new location, which is an extra write and thus bad for SSDs. On a typical filesystem, by contrast, I would just write to the target location with no additional write.
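To make the concern concrete, this is roughly the picture in my head (a sketch only; the tree depth is made up, and I gather ZFS batches these updates into transaction groups, which would amortize the cost):

# Sketch of COW write amplification: rewriting a data block also
# rewrites the pointer blocks on the path up to the root.
# Depth values are invented for illustration.
def cow_blocks_written(data_blocks, tree_depth):
    return data_blocks * (1 + tree_depth)   # data + one new copy per ancestor

def in_place_blocks_written(data_blocks):
    return data_blocks                      # non-COW: just rewrite the block

for depth in (1, 2, 3):
    cow = cow_blocks_written(1000, depth)
    flat = in_place_blocks_written(1000)
    print(f"depth {depth}: {cow} vs {flat} blocks "
          f"({cow / flat:.1f}x, before txg batching)")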

 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Thanks for the reply. My main concern was whether ZFS, when doing a write, has to do some extra write on top of each useful write. I imagined it as follows: because ZFS is COW, it writes to a free location on disk, and because of that it probably has to update some pointer to this new location, which is an extra write and thus bad for SSDs. On a typical filesystem, by contrast, I would just write to the target location with no additional write.

Technically, your assessment is correct: ZFS will wear out an SSD sooner than, say, NTFS.

The question is whether we should care. If it were 1%, most of us would say "who cares" and do it without hesitation. But what if it decreases the lifespan by 50%? Many of us might start caring, especially home users. There's no solid data on how much it matters; I've heard some people claim 10-20% is typical (bad, but not terrible). Most people these days throw out SSDs because they are too small, not because they are worn out. If that's the case, who cares how fast the lifespan gets burned? You'll replace the drive before then anyway. ;)
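If you want to put rough numbers on "who cares", something like this (the TBW rating, the workload, and the 10-20% overhead are all assumed figures, not measurements):

# Rough years-to-rated-endurance under an assumed write load, with and
# without an assumed ZFS overhead. All inputs are placeholders.
TBW = 150                  # drive endurance rating, terabytes written
daily_writes_tb = 0.05     # assumed workload: 50 GB/day

for overhead in (0.00, 0.10, 0.20):
    effective_tb_per_day = daily_writes_tb * (1 + overhead)
    years = TBW / effective_tb_per_day / 365
    print(f"{overhead:.0%} extra writes -> ~{years:.1f} years to rated TBW")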
 

p.remek

Cadet
Joined
Oct 23, 2014
Messages
8
Technically, your assessment is correct: ZFS will wear out an SSD sooner than, say, NTFS.

The question is whether we should care. If it were 1%, most of us would say "who cares" and do it without hesitation. But what if it decreases the lifespan by 50%? Many of us might start caring, especially home users. There's no solid data on how much it matters; I've heard some people claim 10-20% is typical (bad, but not terrible). Most people these days throw out SSDs because they are too small, not because they are worn out. If that's the case, who cares how fast the lifespan gets burned? You'll replace the drive before then anyway. ;)

In our case, the problem is that the hardware is part of the business-case expenses, and the business case is calculated over 5 years. So if there is a good chance we will have to replace most of the disks within 5 years, we need to put twice as many disks into the business case. :)
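For the business case, I can at least invert the arithmetic: given the 5-year horizon, a drive's endurance rating implies a daily write budget (the TBW figure below is a placeholder for whatever the vendor specifies):

# Daily write budget implied by a 5-year business case.
# TBW is a placeholder; use the vendor's rating for the actual drives.
TBW_TB = 150
YEARS = 5

budget_tb_per_day = TBW_TB / (YEARS * 365)
print(f"~{budget_tb_per_day * 1000:.0f} GB/day stays within "
      f"{TBW_TB} TBW over {YEARS} years")
# If the measured VM write rate (times any ZFS overhead) stays under
# this budget, no mid-life replacements need to be priced in.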
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Sorry, but you probably can't calculate that risk, even on NTFS.

If you do a crapload of writing, you could potentially wear a drive out in a few months or a year, even on NTFS. If you don't do a lot of writing and the data is static, it could be good for decades.

To be honest, I wouldn't worry so much about the cost of the drives if you have to replace them in a few years. They're constantly getting bigger and cheaper. If you really need to replace them in 3 years, it's not going to be the end of the world. Just think: a 256GB SSD can be purchased for about $100 today, and 3 years ago the same drive was $400+. To boot, they're faster than they were 3 years ago.

It's quite possible that by the time you need to worry about buying replacement drives for your pool, you'll be able to buy a single drive that holds half your pool's data for $100.

Don't fret it. Buy the SSDs and be happy. Tell your boss you did the analysis and all is well. Just don't buy those TLC drives; they seem very scary for ZFS, IMO.
 

p.remek

Cadet
Joined
Oct 23, 2014
Messages
8
Also, is it possible to "isolate" that extra write to a separate SSD? I mean, if the extra write is actually a write to some metadata, maybe it is possible to keep all the metadata on a separate SSD rather than spread across all of them? Then it would be possible to buy one extra-heavy-duty SSD for that and normal SSDs for the rest.

One more question: is there some extra write besides the metadata write required for a particular write of useful data?

 

p.remek

Cadet
Joined
Oct 23, 2014
Messages
8
Thanks. Well, I do realize that it depends on the workload we will be running there, but given that the workload is the same no matter if I use SSD or HDD, it is a constant parameter in my equation. But I do have an assumption that with a normal filesystem (EXT4) it should endure for 3-5 or more years. So now I needed to understand whether ZFS makes it relatively worse given the same workload.

 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Thanks. Well, I do realize that it depends on the workload we will be running there, but given that the workload is the same no matter if I use SSD or HDD, it is a constant parameter in my equation.

Technically, it's not a constant parameter. HDDs don't suffer "wear-out" from nonstop writing; SSDs do. If you were going to be doing nonstop sequential writing, you'd be FAR, FAR better off with HDDs.

But I do have an assumption that with a normal filesystem (EXT4) it should endure for 3-5 or more years.

There's no evidence to show how much worse any one filesystem is compared to another; each is unique, and nobody has those answers. And to be perfectly honest, your assumption is probably already so watered down that trying to work out whether ZFS is better or worse than EXT4 is like arguing that you pissed in the Pacific Ocean and the water level rose. Who cares? You got to piss in the freakin' ocean!

No, you can't redistribute writes or anything like that.

Like I said: just tell your boss it will be fine, do ZFS on SSD, and be happy. I have no doubt that the next time I hear from you, it will be because you're replacing the drives for being too small.
 

p.remek

Cadet
Joined
Oct 23, 2014
Messages
8
Technically, it's not a constant parameter. HDDs don't suffer "wear-out" from nonstop writing; SSDs do. If you were going to be doing nonstop sequential writing, you'd be FAR, FAR better off with HDDs.

No, you can't redistribute writes or anything like that.

Sorry, I made a mistake in my sentence. I wanted to say "no matter which filesystem I use" instead of "SSD or HDD".


No, you can't redistribute writes or anything like that.

To my question: is it possible to "isolate" that extra write to a separate SSD? I mean, if the extra write is actually a write to some metadata, maybe it is possible to keep all the metadata on a separate SSD rather than spread across all of them? Then it would be possible to buy one extra-heavy-duty SSD for that and normal SSDs for the rest.

I understood from the comments above that it is possible to have metadata on a separate SSD; doesn't that address my question (isolating the extra write)?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
No, you don't put metadata on a separate disk. The drives act as one single entity; you have no say in where the data goes. ZFS makes that decision with an algorithm that prefers empty drives/vdevs.
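For what it's worth, the behavior looks roughly like this (a simplified illustration, not the actual metaslab allocator):

import random

# Simplified picture of ZFS allocation: each write picks a top-level
# vdev weighted by free space, so emptier vdevs fill faster until the
# pool evens out. Illustration only, not real ZFS code.
def pick_vdev(free_space):
    r = random.uniform(0, sum(free_space))
    for i, free in enumerate(free_space):
        r -= free
        if r <= 0:
            return i
    return len(free_space) - 1

free = [200, 200, 1000]     # third vdev added recently, mostly empty
counts = [0, 0, 0]
for _ in range(10_000):
    counts[pick_vdev(free)] += 1
print(counts)               # the emptier vdev receives the bulk of new writes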
 