2nd FreeNAS box: 100T, hardware & setup questions


voodoo

Cadet
Joined
Oct 6, 2011
Messages
9
Hi all,

I'm very happy with my 1st FreeNAS box: 8*3T drives in RAID-Z2, about 16T usable, all in 1 zpool with 1 vdev.

The photos & videos generated by my DSLR (Nikon D800) are so huge that I'd like to build a 2nd box with about 100T of capacity to cover the next 2 years. I have read some posts, the best practices, and the FAQ in the forum, but I'm still not clear on a few points. Your feedback and comments are very welcome.

Here's some basic info:
a) A home server, mainly for multimedia, two or three concurrent connections at most.
b) As small+quiet+green+cheap as possible.
c) Reliability is the 1st priority, capacity 2nd, availability 3rd, throughput 4th.
d) Estimated data volume: 100T, split into 1) photos (30~40T), 2) videos (40~50T), 3) everything else, e.g. music/documents (15~20T). That's about 35~40 4T HDDs.
e) I can wait for the FreeNAS 9.1 beta release.
f) FreeNAS is using CIFS sharing now; clients are Windows and Android devices only.

Question 1. ZFS setup:

1.1: To reduce HDD running time and maximize power saving (sometimes I only need to access 1 category and don't touch the other categories for 2 weeks), I'd like to shut down the unused HDDs. Should I put the categories in different zpools or vdevs, or are 3 folders for the 3 categories fine?

1.2: zpool setup: From some threads I got the impression that only 1 zpool can be accessed at a time, and that to access another zpool I would need to export the previous one and import the new one. Am I right? If so, 3 zpools for 3 categories doesn't work, because sometimes I still need to access all 3 categories at the same time, which leaves only 1 option: 1 zpool for all 3 categories?

1.3: I'm happy with my 1st box on RAID-Z2, so the 2nd box would be RAID-Z2 as well. Any problem with that?

1.4: Per the best practice, RAID-Z2 wants 2^n+2 HDDs, e.g. 6/10/18/34 HDDs. 10*4T HDDs in RAID-Z2 provide about 28T of capacity, 18*4T about 50T, and 34*4T about 100T. So either 34*4T HDDs are enough for the 100T, or it should be: the photo category gets a set of 18 HDDs, the video category gets a set of 18 HDDs, and the rest gets a set of 6 HDDs, for 42*4T HDDs in total. Which is better?

1.5: vdev setup: What about vdevs? Some threads said more than 11 HDDs in 1 vdev is too many. Is that true? If so, with 10 4T HDDs per vdev, should I set up 4 vdevs? Or 1 vdev per category? Or is a single vdev for the whole 100T fine?

1.6: Should I consider ESXi?


Question 2. Hardware setup:

2.1: CPU: maybe an Intel i3, maybe a Haswell i5? One CPU is enough.

2.2: RAM: is 32G enough? With plenty of 8G sticks on the market, a cheap mainboard with 4 RAM slots gets to 32G easily. If 64G is needed, only the super expensive server mainboards provide 8 RAM slots. :( And 8G DDR3 ECC RAM is still too expensive, and only the expensive mainboards support ECC RAM. So if possible, I'd prefer 32G of normal (non-ECC) RAM. Fine?

2.3: Mainboard: some ASRock mainboards have 8 SATA ports and 4 RAM slots for only $70, I love them! But if it needs 8 RAM slots with ECC support, there aren't many choices... and they're all expensive...

2.4: Controller card: 4 of the cheapest 8-port SATA cards I can find plus the motherboard's SATA ports? Or one Dell H810 6Gb/s SAS card (with 1G of cache), which can connect around 190 4T HDDs? Are the "auto cache on SSD" and "1G cache" features useful?

2.5: Case + power: there will be 34~42 HDDs in total. As discussed in 1.1, if it's possible to shut down part of the HDDs, should I use 1 case for all HDDs or 2 cases?
Option 1) I found that a second-hand Rackable SE-3016 3U 16-disk server case is small and not too expensive. Any better suggestion? The reason I mentioned the Dell H810 in 2.4 is that it can connect up to 12 of these Rackable 3U 16-disk cases; there are 2 SAS SFF-8088 ports on the case's panel, and internally it has SATA cabling for 16 disks.
Option 2) Buy 4~5 normal cases with room for 10 HDDs each. But how do I connect 5 cases (power and SATA cables) to one motherboard and solve the cable-length problem? Given the requirement (as small+quiet+green+cheap as possible), which option is better? Any case recommendations?

2.6: Should I go for two gigabit NICs to double the throughput? I'm using a gigabit router/switch and gigabit NICs right now. Do I need two gigabit NICs on the new box to double the throughput? If yes, should I also upgrade the gigabit router/switch and WiFi gear? Or should I use some other port or connection? Environment: CIFS + Windows/Android.

2.7: To get as small+quiet+green+cheap as possible, maybe this time I should build 3 small FreeNAS boxes (1 for photos, 1 for videos, 1 for the rest) instead of 1 giant box, so that I can use mainstream components and don't need to buy the expensive server ones. Right?


Question 3. Anything missing?


I know I've mixed up several concepts and some of the questions are dumb enough. :) Your comments are very welcome, on any of the questions. :)
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
b) As small+quiet+green+cheap as possible.

forget that.

1.1: To reduce HDD running time and maximize power saving (sometimes I only need to access 1 category and don't touch the other categories for 2 weeks), I'd like to shut down the unused HDDs. Should I put the categories in different zpools or vdevs, or are 3 folders for the 3 categories fine?

1.2: zpool setup: From some threads I got the impression that only 1 zpool can be accessed at a time, and that to access another zpool I would need to export the previous one and import the new one. Am I right? If so, 3 zpools for 3 categories doesn't work, because sometimes I still need to access all 3 categories at the same time, which leaves only 1 option: 1 zpool for all 3 categories?

incorrect. you can certainly have multiple pools, most people just don't do it as it isn't the normal thing. maybe you're thinking of the guy who came in here some time ago wanting access to a hundred terabytes on an 8GB system? that was laughable, and basically I told him he could have multiple pools if he only accessed one at a time.
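
For what it's worth, a rough CLI sketch of two pools living side by side (pool and device names here are made up for illustration; on FreeNAS you'd normally build pools from the GUI, which uses gptid labels):

Code:
# create two independent pools; both can be imported and mounted at the same time
zpool create photos raidz2 da0 da1 da2 da3 da4 da5 da6 da7 da8 da9
zpool create videos raidz2 da10 da11 da12 da13 da14 da15 da16 da17 da18 da19

# export/import is for moving a pool between systems (or detaching it),
# not something you have to do to "switch" between pools on the same box
zpool export videos
zpool import videos

# both pools show up online at once
zpool list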

1.3: I'm happy with my 1st box on RAID-Z2, so the 2nd box would be RAID-Z2 as well. Any problem with that?

as long as you feel it suits your reliability requirements. if you jump immediately on any sign of impending failure, it is probably okay. but for 30TB of reliable storage, I'm doing eleven 4TB drives in RAIDZ3 with one hot spare and a cold spare as well.

1.4: Per the best practice, RAID-Z2 wants 2^n+2 HDDs, e.g. 6/10/18/34 HDDs. 10*4T HDDs in RAID-Z2 provide about 28T of capacity, 18*4T about 50T, and 34*4T about 100T. So either 34*4T HDDs are enough for the 100T, or it should be: the photo category gets a set of 18 HDDs, the video category gets a set of 18 HDDs, and the rest gets a set of 6 HDDs, for 42*4T HDDs in total. Which is better?

1.5: vdev setup: What about vdevs? Some threads said more than 11 HDDs in 1 vdev is too many. Is that true? If so, with 10 4T HDDs per vdev, should I set up 4 vdevs? Or 1 vdev per category? Or is a single vdev for the whole 100T fine?

you're best off following the 2^n+2 rule AND obeying the ~12 drives per vdev rule.
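
To make that concrete, a pool obeying both rules might be laid out something like this (a sketch only, with made-up device names; the FreeNAS volume manager would normally do this for you):

Code:
# one pool built from 10-disk RAID-Z2 vdevs (8 data + 2 parity each),
# staying under the ~12-drives-per-vdev guideline
zpool create tank \
    raidz2 da0  da1  da2  da3  da4  da5  da6  da7  da8  da9 \
    raidz2 da10 da11 da12 da13 da14 da15 da16 da17 da18 da19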

1.6: Should I consider ESXi?

no

Question 2. Hardware setup:

2.1: CPU: maybe an Intel i3, maybe a Haswell i5? One CPU is enough.

2.2: RAM: is 32G enough? With plenty of 8G sticks on the market, a cheap mainboard with 4 RAM slots gets to 32G easily. If 64G is needed, only the super expensive server mainboards provide 8 RAM slots. :( And 8G DDR3 ECC RAM is still too expensive, and only the expensive mainboards support ECC RAM. So if possible, I'd prefer 32G of normal (non-ECC) RAM. Fine?

no.

2.3: Mainboard: some ASRock mainboards have 8 SATA ports and 4 RAM slots for only $70, I love them! But if it needs 8 RAM slots with ECC support, there aren't many choices... and they're all expensive...

2.4: Controller card: 4 of the cheapest 8-port SATA cards I can find plus the motherboard's SATA ports? Or one Dell H810 6Gb/s SAS card (with 1G of cache), which can connect around 190 4T HDDs? Are the "auto cache on SSD" and "1G cache" features useful?

2.5: Case + power: there will be 34~42 HDDs in total. As discussed in 1.1, if it's possible to shut down part of the HDDs, should I use 1 case for all HDDs or 2 cases?
Option 1) I found that a second-hand Rackable SE-3016 3U 16-disk server case is small and not too expensive. Any better suggestion? The reason I mentioned the Dell H810 in 2.4 is that it can connect up to 12 of these Rackable 3U 16-disk cases; there are 2 SAS SFF-8088 ports on the case's panel, and internally it has SATA cabling for 16 disks.
Option 2) Buy 4~5 normal cases with room for 10 HDDs each. But how do I connect 5 cases (power and SATA cables) to one motherboard and solve the cable-length problem? Given the requirement (as small+quiet+green+cheap as possible), which option is better? Any case recommendations?

basically you have to play with the big boys. consumer grade crap is going to get you crap results and tears in the end when your data mysteriously all goes missing one day. scrap all that crap if you want a single box, and buy a true server.

to support 48 drives: buy two Supermicro SC846BE16-R920B chassis, outfit one with the power supply slave board, and then find a nice E5-1600 mainboard to cram in the other one, maybe the X9SRL-F. An IBM ServeRAID M1015 and various IPASS cabling to hook it all up.


2.6: Should I go for two gigabit NICs to double the throughput? I'm using a gigabit router/switch and gigabit NICs right now. Do I need two gigabit NICs on the new box to double the throughput? If yes, should I also upgrade the gigabit router/switch and WiFi gear? Or should I use some other port or connection? Environment: CIFS + Windows/Android.

2.7: To get as small+quiet+green+cheap as possible, maybe this time I should build 3 small FreeNAS boxes (1 for photos, 1 for videos, 1 for the rest) instead of 1 giant box, so that I can use mainstream components and don't need to buy the expensive server ones. Right?

still wrong, you still need the server-grade stuff, and typically it actually ends up being about the same price if you do it right. "the cheapest fscking ASRock board you can find" is not what you want in a device that is intended to handle and protect large amounts of data.

if you want cheap AND reliable, follow the guidance in the sticky in this forum. it isn't a joke. it's there to help.
 

voodoo

Cadet
Joined
Oct 6, 2011
Messages
9
Thanks jgreco, very detailed and very clear, much appreciated.

More questions: ;)

1. Where do I set up the hot spare and cold spare disks? I haven't found the menu for these settings.

2.
you're best off following the 2^n+2 rule AND obeying the ~12 drives per vdev rule.
48 drives don't fit the 2^n+2 rule, while 34 or 66 drives do. So how should I split the 48 drives? With 12 drives per vdev, there would be 1 zpool with 4 vdevs. Or should it be one 34-drive zpool and a second zpool with the remaining 14 drives?

3. With 4 vdevs of 12 drives each, I think RAID-Z2 can tolerate:
a) 1 disk failing in the 1st vdev and 1 disk failing in the 2nd vdev, but if one more disk fails, no matter in which vdev, all data in the zpool is gone.
b) 2 disks failing in the same vdev, but if any one more disk fails, no matter in which vdev, all data in the zpool is gone.
Right?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
1) hot spare = "spare" in FreeNAS terminology; you add it when you create the pool, or really you can just let it sit in the case...

Code:
[root@freenas] ~# zpool status
  pool: test
state: ONLINE
  scan: none requested
config:
 
        NAME                                            STATE    READ WRITE CKSUM
        test                                            ONLINE      0    0    0
          raidz3-0                                      ONLINE      0    0    0
            gptid/4ffa6167-e54b-11e2-b13f-000c2920acf7  ONLINE      0    0    0
            gptid/50a1c192-e54b-11e2-b13f-000c2920acf7  ONLINE      0    0    0
            gptid/514f13cd-e54b-11e2-b13f-000c2920acf7  ONLINE      0    0    0
            gptid/52053862-e54b-11e2-b13f-000c2920acf7  ONLINE      0    0    0
            gptid/52bec1b6-e54b-11e2-b13f-000c2920acf7  ONLINE      0    0    0
            gptid/539e0368-e54b-11e2-b13f-000c2920acf7  ONLINE      0    0    0
            gptid/544d1ca4-e54b-11e2-b13f-000c2920acf7  ONLINE      0    0    0
            gptid/54f068d0-e54b-11e2-b13f-000c2920acf7  ONLINE      0    0    0
            gptid/55ab7c63-e54b-11e2-b13f-000c2920acf7  ONLINE      0    0    0
            gptid/566c958d-e54b-11e2-b13f-000c2920acf7  ONLINE      0    0    0
            gptid/57324e04-e54b-11e2-b13f-000c2920acf7  ONLINE      0    0    0
        logs
          gptid/58d68274-e54b-11e2-b13f-000c2920acf7    ONLINE      0    0    0
        spares
          da12p2                                        AVAIL
 
errors: No known data errors


Ends up looking like that from the CLI
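
For reference, a pool shaped like the one above could be created along these lines (just a sketch with placeholder device names; on FreeNAS the GUI handles this and uses gptid labels instead of raw devices):

Code:
# 11-wide RAID-Z3 vdev, a separate log device, and a hot spare
zpool create test \
    raidz3 da0 da1 da2 da3 da4 da5 da6 da7 da8 da9 da10 \
    log da11 \
    spare da12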

cold spare = a disk that's sitting on a shelf. you should actually break in the disk for about a month, running all the same SMART tests and other burn-in testing, but then you pull it (we buy extra trays and leave it in the tray) and put it in a static bag, in a foam drive shipping sleeve, on the spares shelf, and let it sit. then when a disk fails, you replace it immediately with the hot spare. then the bad drive is removed, cold spare is inserted, you RMA the failed drive, and order a new drive (because you won't want an RMA drive to be your cold spare when it comes back in a month). you break in the new cold spare too.
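
The break-in testing mentioned above is typically the standard smartctl self-tests run against the new drive before it goes on the shelf (the device name below is just a placeholder):

Code:
# quick sanity check, then an extended surface scan
smartctl -t short /dev/da24
smartctl -t long  /dev/da24

# afterwards, review the test results and error counters
smartctl -a /dev/da24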

2) what you want are vdevs that are 12 drives or less wide. for optimum performance with raidz2, 8+2=10. you can see a raidz3 of 12 devices above: 11 active and a spare, providing 8 disks of usable space. that was picked in part because it was neat and tidy. it is a 24 bay chassis so it's 50% full. adding another 12 disks in the same format is easy and fun. but there's no absolute rule that you have to be neat and tidy. also no rule that you have to fill every bay with a disk... hm.
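
Adding that second set of 12 disks later would be a single command along these lines (again, placeholder device names; the GUI "extend volume" path does the same thing on FreeNAS):

Code:
# add a second 11-wide RAID-Z3 vdev to the existing pool
# (the 12th new bay could hold another spare);
# ZFS then stripes new writes across both vdevs
zpool add test raidz3 da13 da14 da15 da16 da17 da18 da19 da20 da21 da22 da23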

so if you had 40 4TB disks, 4 vdevs of 10 disks each in RAIDZ2, each vdev would have about 28TB usable (remember "4TB" really means about 3.6TB), a bit over 110TB in total, which is about as close to 100TB as can be easily had while maintaining same-sized vdevs.

3) a) incorrect. you can theoretically lose two disks in *each* vdev; the vdevs are independent from each other for purposes of data protection. loss of a vdev is fatal to the pool though.

b) a two disk failure to a vdev is not necessarily fatal. however, ANY trouble rebuilding the vdev could result in some data loss, or in worst case pool catastrophe. again, each vdev is independent from the others from a data protection point of view.

the real problem is that drives are large enough these days that we are no longer as confident as we once were that we'll be able to safely read all the data off the remaining drives to rebuild a drive. if it is data that you truly care about, raidz3 is a really good idea for vdevs of this size.

because you know if you get raidz3, and have a warm spare sitting there, no drives are going to fail, because fate has a sense of humor. but if you get raidz2, with no spares, two drives will die a day apart, and critical metadata on one of the other disks will suddenly develop a read error.... again, because fate has a sense of humor.
 

voodoo

Cadet
Joined
Oct 6, 2011
Messages
9
Thank you again, jgreco!

I found it's very easy to add a hot spare to a zpool with this command: "zpool add pool_name spare drive_name". But it's a bit different from what I expected: ZFS (on FreeNAS) doesn't detect a drive failure and automatically start a rebuild/resilver. A hot spare just means you can start the rebuild right away and don't have to crack the case open to swap a failed drive for a good one. So if RAID-Z2 isn't safe enough, upgrading to RAID-Z3 is better than using RAID-Z2 + a spare.
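
And when a drive does fail, the rebuild onto the spare is indeed a manual step, roughly like this (pool and device names below are placeholders; the FreeNAS GUI's "replace" button is the supported way to do it):

Code:
# attach a hot spare that is already sitting in the case
zpool add tank spare da12

# after da5 fails, manually start the resilver onto the spare
zpool replace tank da5 da12

# once the resilver completes, detach the failed disk;
# the spare then takes its place in the vdev permanently
zpool detach tank da5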

What I don't understand is: if "the vdevs are independent from each other for purposes of data protection", then why is "loss of a vdev fatal to the pool"? If I have 4 vdevs and they're independent, then even if I lost 3 of them I should still have 1 vdev alive and the pool should be OK. Well, since one dead vdev kills the zpool, and since we always have bad luck ;) , a RAID-Z2 pool can really only tolerate 2 disk failures and a RAID-Z3 pool 3 disk failures. Now I understand why jgreco strongly suggests RAID-Z3 + a hot spare. :)
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
Data is striped across all available vdevs. Because of this, if you lose any vdev, the entire pool is lost.

So for example, let's say you have 4 vdevs, each consisting of 10 drives in raidz2. You can lose 2 drives from each set of 10 drives (for a total of 8 failed drives), and 'probably' still be ok, barring any read errors from the remaining disks.

However, if you lose 3 drives from ANY vdev, ALL DATA is gone. There's no recovering any data from a zpool with a failed vdev.

Vdevs are independent from a data redundancy point of view; however, all vdevs are required for the zpool to function. The vdevs being independent simply means that in a raidz2 you have 2-drive redundancy in each vdev. In other words, redundancy is 'local' to the vdev, but all vdevs are required for the zpool. Vdevs with redundancy can function in degraded mode and still be available to the pool. So in the above example, you could have 2 vdevs each with a 'bad' disk: there are 4 vdevs, 2 are 'healthy', 2 are degraded due to a missing disk, but all 4 vdevs are still 'available', so the pool continues to function.

Consider the following situation, which is why raidz3 is preferred for larger vdevs. With the 4 vdevs of 10 drives in z2 above: you have a disk that's reporting SMART read errors, so you offline it and order a replacement. While waiting for the replacement, another drive (in the same vdev as the first) starts head-resetting (click of death). You order another disk. You've now 'used up' all your redundancy. You're relying on the 8 remaining disks to operate at 100% without any read errors while you wait for the replacement drives, AND during the subsequent rebuild once the drives arrive. So the first replacement disk arrives. You install it and start a resilver. However, during resilvering one of the remaining 8 disks returns corrupt data. You have no more redundancy for zfs to fix the bad data. Hopefully the bad data is a file you can recover from a backup; if it's metadata, you might be in more trouble. That bad data now affects the integrity of the entire zpool.

Having an extra disk of parity gives you that much more protection. 11 disk vdevs of raidz3 are much better.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
what he said.

As far as detecting failures: FreeBSD is a conventional UNIX system and it was never envisioned that the kernel would be responsible for managing relatively high level device events. ZFS is a poor fit in this regard.

Basically, FreeBSD has chosen not to try to bludgeon the Solaris style support for that stuff in, because it wouldn't fit well and would probably be a lot of work. Square peg in round hole or however that goes.

So FreeBSD is developing a userland daemon to do that sort of management, but development seems to be kind of laggy and draggy (been hearing about it for several years now). There is also something to be said for not automatically panicking and replacing drives without some human intervention. Ever seen a hardware RAID controller that got too aggressive about that and rendered a device unusable?
 