Can drives on encrypted zpools ever be "replaced"?

cyberjock · May 19, 2013

So I'm having this problem on one of my encrypted zpools and I'm trying to determine how to deal with it. I'll give step-by-step instructions to reproduce it in a VM so it's easily replicable (I used VirtualBox, but I doubt it matters since I'm seeing this on a production system). Some of these steps are unnecessary, but I performed them for troubleshooting purposes. I don't have much experience with geli(none before FreeNAS introduced it, but I've been experimenting with it for a few weeks) and I'm trying to determine how to recover from what appears to be an unrecoverable situation for FreeNAS encrypted zpools.

Here's how to reproduce the issue in a VM:
1. Install FreeNAS 8.3.1-p2(I installed x64 version) to a virtual hard drive(I made my disk 2.5GB).
2. After installation remove the CD from the virtual disk list and create 3 more virtual drives. I used 10GB drives. I put my drives on a SATA controller but again this shouldn't matter.
3. Created an encrypted RAIDZ1 array with the 3 drives created in step 2. I also enabled the footer in the GUI so I can see if any output errors occur.
4. In FreeNAS, I setup my keys in this order: Created pass phrase, then Downloaded Key and Recovery Key. I created a 10GB temp file just to prove the files would be there. Not really necessary, but I go overboard when troubleshooting just to rule out other things.
5. Rebooted FreeNAS and used the key+passphrase to restore the zpool, which worked. I get "An error occurred!" message in the GUI header but the zpool mounts. Both a zpool status and zpool scrub show nothing wrong so I'll ignore the error since it seems to be in error itself. The footer shows 3 drives mounted with AES-XTS 128 encryption.
6. Rebooted FreeNAS again and used the recovery key(no password required) to restore the zpool. Again I get "An error occurred!" message in the GUI header but the zpool mounts. Both a zpool status and zpool scrub show nothing wrong so I'll ignore the error since it seems to be in error itself. The footer shows 3 drives mounted with AES-XTS 128 encryption.
7. Now to simulate a "failed drive". All of the steps I will be performing will be based off of the manual, section 6.3.11 - Replacing a Failed Drive or ZIL Device. After all, us senior guys are constantly telling newbies to use this section for replacing a drive because it works. At this point ada0 is my boot drive and ada1 through ada3 are my RAIDZ1 member disks.
8. Next I mapped the /dev/adaX devices to their gptids using gpart list. I did this in case the info comes in handy later. I took the last disk's partition in the zpool (ada3p2) and set it to "OFFLINE" status. The disk goes offline. GELI makes some comments about the drive being detached and a zpool status shows that the gptid starting with 76f87c66 is now OFFLINE and my zpool is now in DEGRADED status. The ada3 device goes offline and a Replace button appears just as expected.
9. Now I do a VM shutdown, remove the disk I offlined from the VM and then create a new VM disk(in that order). It is important that you make sure that the disk you remove and the new disk you create are on the same SATA port in VirtualBox. If I had removed ada1 and then booted up and tried to mount my zpool I'd get the error in the GUI that "Error: Volume could not be imported: 1 devices failed to decrypt." because FreeNAS doesn't seem to be smart enough to recognize a drive that moves device IDs from ada3 to ada2. Additionally, if you do fail to make sure they're on the same SATA port in VirtualBox and you attempt to mount them, something goes horribly wrong and you'll have an even harder time trying to mount the zpool, because for some reason only 1 disk will mount even after you fix the devices(which a RAIDZ1 can't recover from). Personally, I've been unable to determine how to survive a situation where the disks change IDs, but that's not the problem I'm trying to identify. For now, just stick with the last disk and make sure you use the same SATA port. I took a snapshot here just to have one. (I'm really trying to nail down this issue, it's been bothering me for a week)
10. After the new disk is created(and obviously is at least the same size as the old virtual disks) boot up FreeNAS. You should be able to mount the zpool with the key+pass and the recovery key but in a degraded state. This seems to be pretty normal and expected.
11. To continue with the disk replacement per 6.3.11 of the manual I mount the encrypted zpool in its degraded state using the key+pass method (I get the GUI "An error occurred!" message but I'm ignoring it because I don't know if it's due to the same issue I saw in steps 5 and 6 or because a disk is missing)
12. Next I click the "REPLACE" button. It asks for a passphrase and I enter the same passphrase I used for the rest of the zpool. After a few seconds I get "Disk replacement has been initiated," and resilvering completes seconds later.
13. In accordance with the manual I then click the "Detach" button in the GUI to remove the old drive from the zpool permanently. Now a zpool status and zpool scrub give typical "no problems" output, so all is well. At this point I've completed all of the steps in section 6.3.11 of the manual and everything should be okay. Time to verify all is actually well. After all, you'd hate to find out later that things aren't well.
14. I reboot the FreeNAS machine and attempt to mount the zpool with key+passphrase. It works as expected. Again I get the "An error occurred!" but everything seems fine. My zpool is healthy.
15. I reboot the FreeNAS machine again and attempt to mount the zpool with the recovery key. Uh-oh. It's not working quite right. My zpool is "DEGRADED". Looking at the footer I have the errors:

Code:

[middleware.notifier:1200] Failed to geli attach gptid/76f87c66-c0cc-11e2-8195-080027d1d3cd: geli: Cannot open gptid/76f87c66-c0cc-11e2-8195-080027d1d3cd: No such file or directory.
[middleware.notifier:1200] Failed to geli attach gptid/610edb60-c0d0-11e2-b27e-080027d1d3cd: geli: Wrong key for gptid/610edb60-c0d0-11e2-b27e-080027d1d3cd.

So not only is it trying to attach the old gptid(76f87c66) but it can't mount the new gptid(610edb60). Now things aren't looking so well. So big picture I appear to have full redundancy with the key+passphrase but not with the recovery key. I need to fix this obviously. If a second disk were to fail my recovery key would be useless since it wouldn't be capable of remounting my zpool.

16. Rebooted the machine again, mounted the zpool with the key+passphrase method and I have new error now. I get the following in the footer:

Code:

freenas kernel: GEOM_ELI: Device gptid/7679a9bb-c0cc-11e2-8195-080027d1d3cd.eli created.
freenas kernel: GEOM_ELI: Encryption: AES-XTS 128
freenas kernel: GEOM_ELI:     Crypto: software
freenas kernel: GEOM_ELI: Device gptid/76b64f5c-c0cc-11e2-8195-080027d1d3cd.eli created.
freenas kernel: GEOM_ELI: Encryption: AES-XTS 128
freenas kernel: GEOM_ELI:     Crypto: software
freenas manage.py: [middleware.notifier:1200] Failed to geli attach gptid/76f87c66-c0cc-11e2-8195-080027d1d3cd: geli: Cannot open gptid/76f87c66-c0cc-11e2-8195-080027d1d3cd: No such file or directory.
freenas kernel: GEOM_ELI: Device gptid/610edb60-c0d0-11e2-b27e-080027d1d3cd.eli created.
freenas kernel: GEOM_ELI: Encryption: AES-XTS 128
freenas kernel: GEOM_ELI:     Crypto: software

So now FreeNAS is trying to remount my old failed drive every time. Not exactly the end of the world(definitely something that needs to be fixed though). The zpool does mount and is healthy. Since last time I mounted the zpool one disk failed to mount I chose to do a zpool scrub and wait for it to finish.

17. So time to fix that recovery key. I click the "Add Recovery Key" button. I get the warning that it will invalidate any previous recovery key("so what" for this situation) and click "Continue". I get a GUI error in the header that says

Code:

Error: Unable to set recovery key: geli: Cannot open gptid/76f87c66-c0cc-11e2-8195-080027d1d3cd: No such file or directory.

In the footer I get:

Code:

freenas notifier: 1+0 records in
freenas notifier: 1+0 records out
freenas notifier: 64 bytes transferred in 0.000048 secs (1335500 bytes/sec)
freenas manage.py: [middleware.exceptions:38] [MiddlewareError: Unable to set recovery key: geli: Cannot open gptid/76f87c66-c0cc-11e2-8195-080027d1d3cd: No such file or directory. ]

Hmm. Not the most ideal situation. It looks like FreeNAS may have tried copying the key from somewhere to somewhere. Maybe it fixed my replaced disk and my recovery key will work again... But I wasn't given the option to download a recovery key(remember you get that warning that all previous recovery keys will be invalid). This may not be good news at all.

18. Rebooted FreeNAS and attempted to mount the zpool using the recovery key I do have. Well, my recovery key doesn't work at all.

I get this error in the header:

Code:

Error: Volume could not be imported: 4 devices failed to decrypt.

And in the footer I get these entries:

Code:

freenas manage.py: [middleware.notifier:1200] Failed to geli attach gptid/7679a9bb-c0cc-11e2-8195-080027d1d3cd: geli: Wrong key for gptid/7679a9bb-c0cc-11e2-8195-080027d1d3cd.
freenas manage.py: [middleware.notifier:1200] Failed to geli attach gptid/76b64f5c-c0cc-11e2-8195-080027d1d3cd: geli: Wrong key for gptid/76b64f5c-c0cc-11e2-8195-080027d1d3cd.
freenas manage.py: [middleware.notifier:1200] Failed to geli attach gptid/76f87c66-c0cc-11e2-8195-080027d1d3cd: geli: Cannot open gptid/76f87c66-c0cc-11e2-8195-080027d1d3cd: No such file or directory.
freenas manage.py: [middleware.notifier:1200] Failed to geli attach gptid/610edb60-c0d0-11e2-b27e-080027d1d3cd: geli: Wrong key for gptid/610edb60-c0d0-11e2-b27e-080027d1d3cd.
freenas manage.py: [middleware.exceptions:38] [MiddlewareError: Volume could not be imported: 4 devices failed to decrypt]

So to recap, after a disk replacement and an attempt to recreate the recovery key I'm left with a system that can't use the recovery key at all, while the key+pass works with the 3 installed member drives of the zpool but throws an error because it's trying to mount the disk that was replaced.
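For anyone trying to sort through footer output like the above, here is a minimal Python sketch (not anything FreeNAS ships) that groups the "Failed to geli attach" lines from step 18 by reason. The regular expression is written against the exact message format shown in this post; treating that format as stable across FreeNAS versions is an assumption.

```python
import re
from collections import defaultdict

# Footer lines copied from step 18 of this post.
LOG = """\
freenas manage.py: [middleware.notifier:1200] Failed to geli attach gptid/7679a9bb-c0cc-11e2-8195-080027d1d3cd: geli: Wrong key for gptid/7679a9bb-c0cc-11e2-8195-080027d1d3cd.
freenas manage.py: [middleware.notifier:1200] Failed to geli attach gptid/76b64f5c-c0cc-11e2-8195-080027d1d3cd: geli: Wrong key for gptid/76b64f5c-c0cc-11e2-8195-080027d1d3cd.
freenas manage.py: [middleware.notifier:1200] Failed to geli attach gptid/76f87c66-c0cc-11e2-8195-080027d1d3cd: geli: Cannot open gptid/76f87c66-c0cc-11e2-8195-080027d1d3cd: No such file or directory.
freenas manage.py: [middleware.notifier:1200] Failed to geli attach gptid/610edb60-c0d0-11e2-b27e-080027d1d3cd: geli: Wrong key for gptid/610edb60-c0d0-11e2-b27e-080027d1d3cd.
freenas manage.py: [middleware.exceptions:38] [MiddlewareError: Volume could not be imported: 4 devices failed to decrypt]
"""

# Assumed message shape: "Failed to geli attach <provider>: geli: <reason> ..."
PATTERN = re.compile(
    r"Failed to geli attach (gptid/[0-9a-f-]+): geli: (Wrong key|Cannot open)"
)


def classify(log_text):
    """Return {reason: [provider, ...]} for every geli attach failure line."""
    grouped = defaultdict(list)
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match:
            provider, reason = match.groups()
            grouped[reason].append(provider)
    return dict(grouped)


failures = classify(LOG)
for reason, providers in failures.items():
    print(reason, len(providers))
    for provider in providers:
        print("   ", provider)
```

A provider listed under "Cannot open" no longer exists on the system (here, the replaced disk), while the "Wrong key" providers exist but were not unlocked by the key that was supplied.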

So does anyone have a clue what I did wrong, or if I even did something wrong? Should the manual be updated to reflect how to deal with encryption?

Is this a bug?

Is there a way to recover from this using the CLI and/or editing the config file?

Any other ideas how to achieve full redundancy with both the key+passphrase and recovery key method without recreating the zpool and recovering from backups?

Note: I documented this issue in support ticket 2178 and have had no response, which is why I'm asking the forum. If you take the time to read 2178(I don't think there's any reason to, since this post discusses step-by-step what I did on that server, just in a VM here), the only thing that seems to be much different is that I always used the recovery key in 2178(laziness). But on that server the problem found me, because after the resilvering completed and I rebooted and used the recovery key again I just ended up with a DEGRADED zpool. I rebooted and resilvered twice before stopping and questioning what was wrong. And naturally(thanks to Murphy's law) I replaced 1 failed disk last weekend and another disk in the zpool is racking up SMART errors like nobody's business, so I'm about to have a recovery key with no redundancy. I'd destroy and recreate the zpool from backup but now I'm questioning how "trustworthy" the encryption is with regards to disk replacement/recovery. This is a pretty big deal for people that haven't realized that their recovery key isn't 100% after a failed disk and I don't see any easy way to recover from this with the knowledge I have(and a lot of people using FreeNAS have even less than myself...)

jgreco · May 19, 2013

Aren't you missing some steps around 14? I'm totally not seeing where you initialized your replacement drive with a recovery key. How is an old recovery key supposed to decrypt a new disk? So you try to decrypt things with the old recovery key and things go all to heck. Mmm. Moving forward from that point will just bring pain, I think.

My suspicion is that something gets hosed in storage_encrypteddisk. Try looking:

Code:

# sqlite3 /data/freenas-v1.db
sqlite> select * from storage_disk;
bla bla
sqlite> select * from storage_encrypteddisk;
bla bla

Old disk still hanging around? Want to live dangerously? You can update the table manually, like to delete a disk by ID (first column), "delete from storage_encrypteddisk where id=<yourid>;"

It does seem to screw this up if you putz around, I don't know the exact bug, but I did duplicate it, and fixed it by just blasting the superfluous encrypteddisk entry. Then it was able to regenerate the recovery key, survive a reboot, and mount the zpool using the recovery key, so I think that might have been sufficient for repair.

So I suggest you try following a less meandering path when trying to replace a drive, maybe skip having the system unnecessarily in degraded states with half-configured encryption.

cyberjock · May 19, 2013

jgreco said:
Aren't you missing some steps around 14? I'm totally not seeing where you initialized your replacement drive with a recovery key. How is an old recovery key supposed to decrypt a new disk? So you try to decrypt things with the old recovery key and things go all to heck. Mmm. Moving forward from that point will just bring pain, I think.

I kind of thought the same thing, but on my production server where I am having this issue I tried to regenerate the recovery key and key+passphrase when I first mounted the new drive and the resilvering had finished and all I got were errors there. IMO the manual is a little lacking in direction with how to proceed with disk replacements for encrypted zpools. I kind of expected that FreeNAS would be smart enough to duplicate the recovery key from another disk on disk replacement, but that doesn't seem to happen(or it does but doesn't work quite right, or I screwed it up with something I am doing manually). I'd expect the manual would have a section for encrypted disk replacement that included things like mounting the zpool with key+passphrase or recovery key, etc.

jgreco said:
My suspicion is that something gets hosed in storage_encrypteddisk. Try looking:

Code:
# sqlite3 /data/freenas-v1.db
sqlite> select * from storage_disk;
bla bla
sqlite> select * from storage_encrypteddisk;
bla bla

Old disk still hanging around? Want to live dangerously? You can update the table manually, like to delete a disk by ID (first column), "delete from storage_encrypteddisk where id=<yourid>;"

It does seem to screw this up if you putz around, I don't know the exact bug, but I did duplicate it, and fixed it by just blasting the superfluous encrypteddisk entry. Then it was able to regenerate the recovery key, survive a reboot, and mount the zpool using the recovery key, so I think that might have been sufficient for repair.

I was thinking if I could clean up the SQL entries I could probably recover from this situation, but I'd really like to see the bug fixed too. I know there are other people that won't be as open to the idea of playing with SQL as I am. :P

So I suggest you try following a less meandering path when trying to replace a drive, maybe skip having the system unnecessarily in degraded states with half-configured encryption.

I'll give this a test in a bit. Server was turned off, it was getting too warm in the location it was in and I don't like to take chances with stuff.

cyberjock · May 20, 2013

Ok, I ran those commands. I actually don't know exactly what those commands do(but I'll be googling them in a few minutes so I can learn!) but the storage_disk table includes all of the disks that are currently installed in the system.

As for the encrypteddisk I got this output:

Code:

sqlite> select * from storage_encrypteddisk;
1|2|23|gptid/bf270de6-b86a-11e2-9cfb-0015171496ae
2|2|24|gptid/c1560e68-b86a-11e2-9cfb-0015171496ae
3|2|30|gptid/c1b125e6-b86a-11e2-9cfb-0015171496ae
4|2|31|gptid/c20a0232-b86a-11e2-9cfb-0015171496ae
5|2|33|gptid/c25ef2da-b86a-11e2-9cfb-0015171496ae
6|2|32|gptid/c2b79c49-b86a-11e2-9cfb-0015171496ae
7|2|24|gptid/050bac43-b9f0-11e2-bcaa-0015171496ae
8|2|24|gptid/991f4916-ba6f-11e2-9ed2-0015171496ae
sqlite>

There are only 6 disks for the only encrypted zpool in that system. So clearly the "old" disks are there. The encrypted zpool isn't mounted and I forgot the key at a friend's house. I'll know which 6 are real and which two are "stale" later today for sure. I just have to figure out how to delete them. At least, that's what I think you're saying.

jgreco · May 20, 2013

cyberjock said:
I kind of expected that FreeNAS would be smart enough to duplicate the recovery key from another disk on disk replacement

Without having looked too closely at the actual encryption in use and what is actually going on, I'll note that the usual point of encryption is to protect your data. For key based encryption, once a key is generated and in your possession, it should not be possible for the system to come up with the same key again on its own. Even if you provided the key to the system to initialize the new drive, this would leave the data on the replaced drive keyed with that key. It would be best to have the proper scope: only the drives that are actively part of the pool should be keyed with the in-use key.

jgreco · May 20, 2013

cyberjock said:
Ok, I ran those commands. I actually don't know exactly what those commands do

Note that you can actually do damage to the FreeNAS configuration. FreeNAS uses sqlite for its conf db storage. You can learn dangerous stuff in the CLI guide for sqlite. Be very careful if you go in there.

(but I'll be googling them in a few minutes so I can learn!) but the storage_disk table includes all of the disks that are currently installed in the system.

As for the encrypteddisk I got this output:

Code:
sqlite> select * from storage_encrypteddisk;
1|2|23|gptid/bf270de6-b86a-11e2-9cfb-0015171496ae
2|2|24|gptid/c1560e68-b86a-11e2-9cfb-0015171496ae
3|2|30|gptid/c1b125e6-b86a-11e2-9cfb-0015171496ae
4|2|31|gptid/c20a0232-b86a-11e2-9cfb-0015171496ae
5|2|33|gptid/c25ef2da-b86a-11e2-9cfb-0015171496ae
6|2|32|gptid/c2b79c49-b86a-11e2-9cfb-0015171496ae
7|2|24|gptid/050bac43-b9f0-11e2-bcaa-0015171496ae
8|2|24|gptid/991f4916-ba6f-11e2-9ed2-0015171496ae
sqlite>

There are only 6 disks for the only encrypted zpool in that system. So clearly the "old" disks are there. The encrypted zpool isn't mounted and I forgot the key at a friend's house. I'll know which 6 are real and which two are "stale" later today for sure. I just have to figure out how to delete them. At least, that's what I think you're saying.

Right. Back up your /data/freenas-v1.db first. Then you see the first column, let's say disk 4 is stale, you could do

delete from storage_encrypteddisk where id=4;

and that'd remove it from the database. You probably need to remove two stale disks. Then when your gptid's match between _disk and _encrypteddisk I would suggest trying to regenerate the recovery key and see what happens.

I cannot say for certain that there won't be side effects. I have not looked at the FreeNAS middleware to see if anything might break if you mess around and the values in the id column become noncontiguous.
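The backup-then-delete advice above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: `remove_stale_entry` is a hypothetical helper, not part of FreeNAS, and the demo runs against a throwaway database in a temp directory with placeholder provider names rather than the real /data/freenas-v1.db. The table layout is the one shown by the `.dump` output quoted elsewhere in this thread.

```python
import os
import shutil
import sqlite3
import tempfile


def remove_stale_entry(db_path, stale_id):
    """Copy the config DB aside, then delete one storage_encrypteddisk row by id.

    Hypothetical helper for illustration; returns the path of the backup copy.
    """
    backup_path = db_path + ".bak"
    shutil.copy2(db_path, backup_path)  # untouched copy before any change
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            cur = conn.execute(
                "DELETE FROM storage_encrypteddisk WHERE id = ?", (stale_id,)
            )
            if cur.rowcount != 1:
                raise ValueError("expected exactly one row with id=%r" % stale_id)
    finally:
        conn.close()
    return backup_path


# Demo on a throwaway database; the provider names are placeholders.
demo_db = os.path.join(tempfile.mkdtemp(), "freenas-v1.db")
conn = sqlite3.connect(demo_db)
conn.execute(
    'CREATE TABLE "storage_encrypteddisk" ('
    '"id" integer NOT NULL PRIMARY KEY, '
    '"encrypted_volume_id" integer NOT NULL, '
    '"encrypted_disk_id" integer NOT NULL, '
    '"encrypted_provider" varchar(120) NOT NULL UNIQUE)'
)
conn.executemany(
    "INSERT INTO storage_encrypteddisk VALUES (?, ?, ?, ?)",
    [(i, 2, 20 + i, "gptid/demo-%d" % i) for i in range(1, 5)],
)
conn.commit()
conn.close()

backup = remove_stale_entry(demo_db, 4)  # "let's say disk 4 is stale"

conn = sqlite3.connect(demo_db)
remaining = [row[0] for row in conn.execute(
    "SELECT id FROM storage_encrypteddisk ORDER BY id")]
conn.close()
print(remaining, backup)
```

Deleting inside a transaction and checking the row count means a mistyped id changes nothing, and the `.bak` copy is still there if the middleware turns out to dislike the result.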

cyberjock · May 20, 2013

I'll give that a shot tonight. I was thinking about this earlier and I realized I haven't replaced 2 disks; I have resilvered the same disk twice. After I resilvered, rebooted and used my recovery key, the zpool was degraded, so I thought I had to resilver again(which I did). But yeah, 2 entries are clearly bogus because there are only 6 drives in that zpool. I forgot to mention in my original post that I can't rekey the key+pass on my production machine either, because of the errors caused by the old drive having been removed. In any case, I think(and hope) that the problems will be resolved once I remove the bad entries.

As a side question, do you know what the 2s and the 23+ values from the SQL query represent? Curiosity is killing me....

Code:

sqlite> select * from storage_encrypteddisk;
1|2|23|gptid/bf270de6-b86a-11e2-9cfb-0015171496ae
2|2|24|gptid/c1560e68-b86a-11e2-9cfb-0015171496ae
3|2|30|gptid/c1b125e6-b86a-11e2-9cfb-0015171496ae
4|2|31|gptid/c20a0232-b86a-11e2-9cfb-0015171496ae
5|2|33|gptid/c25ef2da-b86a-11e2-9cfb-0015171496ae
6|2|32|gptid/c2b79c49-b86a-11e2-9cfb-0015171496ae
7|2|24|gptid/050bac43-b9f0-11e2-bcaa-0015171496ae
8|2|24|gptid/991f4916-ba6f-11e2-9ed2-0015171496ae
sqlite>

titan_rw · May 20, 2013

The output of "sqlite3 /data/freenas-v1.db '.dump storage_encrypteddisk'" shows:

Code:

CREATE TABLE "storage_encrypteddisk" ("id" integer NOT NULL PRIMARY KEY, "encrypted_volume_id" integer NOT NULL, "encrypted_disk_id" integer NOT NULL, "encrypted_provider" varchar(120) NOT NULL UNIQUE);

I don't have any encrypted disks, so no actual data shows on my system. But we can see that the first column is the primary ID. The second column would be the volume id, so I assume it's which volume the disk belongs to. In your case, they're all members of volume 2. I'm not sure about the encrypted disk id. There are duplicate values there, so I'm not sure where it gets that from.
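One way to poke at those duplicate values without touching a live config is to load the schema above and the eight rows cyberjock posted into an in-memory SQLite database and group by the third column. This is a rough Python sketch; that `encrypted_disk_id` refers to a row in storage_disk is an assumption based on the column name, not something confirmed in this thread.

```python
import sqlite3

# Schema as shown in the .dump output above.
SCHEMA = (
    'CREATE TABLE "storage_encrypteddisk" ('
    '"id" integer NOT NULL PRIMARY KEY, '
    '"encrypted_volume_id" integer NOT NULL, '
    '"encrypted_disk_id" integer NOT NULL, '
    '"encrypted_provider" varchar(120) NOT NULL UNIQUE);'
)

# Rows as posted from the production system earlier in the thread.
ROWS = [
    (1, 2, 23, "gptid/bf270de6-b86a-11e2-9cfb-0015171496ae"),
    (2, 2, 24, "gptid/c1560e68-b86a-11e2-9cfb-0015171496ae"),
    (3, 2, 30, "gptid/c1b125e6-b86a-11e2-9cfb-0015171496ae"),
    (4, 2, 31, "gptid/c20a0232-b86a-11e2-9cfb-0015171496ae"),
    (5, 2, 33, "gptid/c25ef2da-b86a-11e2-9cfb-0015171496ae"),
    (6, 2, 32, "gptid/c2b79c49-b86a-11e2-9cfb-0015171496ae"),
    (7, 2, 24, "gptid/050bac43-b9f0-11e2-bcaa-0015171496ae"),
    (8, 2, 24, "gptid/991f4916-ba6f-11e2-9ed2-0015171496ae"),
]

conn = sqlite3.connect(":memory:")  # nothing on disk is read or modified
conn.execute(SCHEMA)
conn.executemany("INSERT INTO storage_encrypteddisk VALUES (?, ?, ?, ?)", ROWS)

# Disk ids that have more than one provider row attached to them.
duplicates = conn.execute(
    """
    SELECT encrypted_disk_id, COUNT(*) AS n, GROUP_CONCAT(id) AS row_ids
    FROM storage_encrypteddisk
    GROUP BY encrypted_disk_id
    HAVING COUNT(*) > 1
    """
).fetchall()
conn.close()

for disk_id, count, row_ids in duplicates:
    print("disk id", disk_id, "has", count, "provider rows:", row_ids)
```

Only disk id 24 comes back, with three provider rows, which fits the same disk having been resilvered twice. The query only narrows down the candidates; which of those rows is the live one still has to be checked against the gptids actually present on the system.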

cyberjock · May 22, 2013

So I tried testing out your ideas jgreco, in my VM.

In my VM I had 4 drives listed for storage_encrypteddisk, but the zpool was a 3 disk RAIDZ1. Entry #3 was the entry that needed to be deleted, so I did delete from storage_encrypteddisk where id=3; and then verified that the 3 remaining disks were the 3 disks that should be there.

Then I rebooted.

After putting in my key+pass all 3 disks decrypted properly but I got the "An Error Occurred" message in the GUI(which should be ignored based on my first post).

Then I tried to do various things to see how FreeNAS would react.

1. I tried "Remove Recovery Key" and I got an error...

Code:

freenas manage.py: [middleware.exceptions:38] [MiddlewareError: Unable to remove key: geli: Master Key 1 is not set. ]

Somewhat expected since 1 of the disks had no key.

2. Then I did "Add Recovery Key" and I got no error and a key downloaded. This looks very promising!

So I rebooted.

With the new recovery key the zpool mounted in HEALTHY status(and I got another "An Error Occurred!" message in the GUI to be ignored).

So it appears that all someone needs to do is delete the GPT-ID that has been removed from the system and all will be "fixed". Not an ideal situation, but it does seem to work.
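Finding which entry is the removed GPT-ID boils down to a set difference between what the database lists and what actually exists. Here is a small Python sketch of that idea. `find_stale` is a hypothetical helper; the four providers are the ones from the VM logs in my first post, and the row ids 1 to 4 are assumed to follow that same order (which matches entry #3 being the stale one).

```python
import os


def _device_exists(provider):
    # geli providers such as "gptid/<uuid>" show up under /dev on FreeBSD.
    return os.path.exists("/dev/" + provider)


def find_stale(rows, exists=_device_exists):
    """Return the (id, provider) rows whose provider is not present.

    `rows` are (id, encrypted_provider) pairs as read from
    storage_encrypteddisk; `exists` decides whether a provider is present.
    """
    return [(row_id, provider) for row_id, provider in rows if not exists(provider)]


# Providers from the VM logs; the ids are an assumption (see above).
rows = [
    (1, "gptid/7679a9bb-c0cc-11e2-8195-080027d1d3cd"),
    (2, "gptid/76b64f5c-c0cc-11e2-8195-080027d1d3cd"),
    (3, "gptid/76f87c66-c0cc-11e2-8195-080027d1d3cd"),
    (4, "gptid/610edb60-c0d0-11e2-b27e-080027d1d3cd"),
]

# The three providers that geli could still open after the replacement.
present = {
    "gptid/7679a9bb-c0cc-11e2-8195-080027d1d3cd",
    "gptid/76b64f5c-c0cc-11e2-8195-080027d1d3cd",
    "gptid/610edb60-c0d0-11e2-b27e-080027d1d3cd",
}

# Off the FreeNAS box, pass the known-present set instead of probing /dev.
stale = find_stale(rows, exists=present.__contains__)
print(stale)
```

On the FreeNAS box itself the default `exists` check would look under /dev, but only while all of the pool's disks are attached; a disk that is merely unplugged at that moment would look stale too.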

I'm a little surprised that an issue like this exists with FreeNAS. This is a big deal for someone that doesn't go to the forum or doesn't figure out that their recovery key doesn't work properly until it's too late. I'd hate to go to use the recovery key someday out of necessity only to find out I can't mount the pool and there's no fix! I really wish someone would have commented on the ticket I made(especially if I was doing something wrong) but so far there's been nothing still.

Thanks for the help! My zpool is grateful. :P

jgreco · May 22, 2013

I really wish someone would have commented on the ticket I made(especially if I was doing something wrong) but so far there's been nothing still.

I would prefer that you update your own trouble tickets. I don't really see value in inserting myself in the process unnecessarily, since you're better situated to follow up and test.

People writing code to implement features like this often suffer from blinders to the simple and obvious blunders, errors, and problems that might occur along the way, etc., so it is not too shocking that some rough edges exist. I'm a particularly nasty problem magnet, partly because I'll gleefully do the things that you might not be able to/want to. Having worked both ends of the bug fixing process, I'll say that it is very helpful to get sufficient detail to be able to duplicate a problem. You did a nice job of documenting it here in the forum, but I'm not sure you did as well in the ticket (and I'm too lazy to go look). So I suggest a link to this thread be included in the ticket.

cyberjock · May 22, 2013

No, don't take it that way. What I meant was that it would have been nice if one of the developers, the manual writer, etc could have told me if I was doing the disk replacement process badly. I wasn't expecting you to comment on my trouble ticket.

jgreco · May 23, 2013

Right, but my guess is that the feature author has probably tested this several times, but is locked into a particular set of preconceived notions about how things are supposed to work/be done/etc, and that it works correctly for the set of steps used to test during development. I'm also going to guess two further things: 1) that a bug report with possibly insufficient detail to reproduce the problem (and I'm too lazy to go analyzing this in detail so this is a generalization, not an accusation or anything) might simply land the problem back in the "inbox" for later exploration, and 2) there's probably a lot of pressure to get 9.1 moving along. Software development often works in what might appear to be strange ways, but we can kind of see that a lot of dev effort is going into 9.1.0R. So anything you can do to help make your bug appear valid and reproducible makes it easier for someone to decide what milestone will be targeted for fixing your bug.

cyberjock · May 23, 2013

I kind of figured with the stuff you said. That's why I provided step by step directions in the ticket and explained that this could backfire. The 2 things that should take precedence regarding projects should always be security vulnerabilities and the potential for data loss while still following included directions. In this case I'd say it falls into #2. While I wouldn't expect them to drop everything, even a comment like "the manual has it wrong regarding encrypted zpools and here's the right steps" or a post that they have verified the issue in their lab or something would give me some comfort that the issue is at least found. Right now I'm getting the feeling that 9.1 is more important than even verifying a potential issue is valid. I was told by another senior poster(who I'll let remain anonymous) that encryption in FreeNAS was very buggy, not reliable and shouldn't be trusted with anyone's data without the bugs being worked out. I was going to prove him wrong and now he has sent me the well deserved "I told you so" IM.

Back in early April someone posted a thread about some potential vulnerabilities that are several years old but might still exist in FreeNAS for some reason. No response from anyone of value in the forum and I created ticket https://support.freenas.org/ticket/2121 and got no acknowledgment from anyone important. To me that's kind of scary that a potential security risk is still at "new defect" status 6 weeks later with no reply.

It's really starting to worry me that the FreeNAS project is going somewhere that I don't want it to go. I like FreeNAS but this complete disregard for some things that could be a big deal is a little scary to me(yes, I acknowledge that anyone's own issue is the most important thing in their mind). It's not like these tickets are even being acknowledged. Ticket 2121 is 6 weeks old, still at "new defect" with no comments. My ticket has a CC email address added but nothing else. So what will happen when 9.1 hits production and this issue is still around?

Then you look at things like the forums. Twice there have been threads with gripes from top posters in the forum. Comments have been made by various individuals with the power to make changes to the forums in my 12+ months here and still nothing is changed. Questions via PMs are going unanswered, phone calls not returned, etc. One of the most senior posters I know has recently taken a break from the forum with no long term plans of ever returning because of stuff that's going on. I'm really getting burned out with the same BS day in and day out in the forums and I'm really questioning if I want to continue to devote time and energy into this project.

I'm beginning to have that gut feeling that the FreeNAS project is about to take a nosedive because of the continuing lack of concern with things. Even acknowledging that an issue has been reproduced in the lab or that a potential security issue is an issue(even if no ETA can be provided right now to fix it) is better than an unanswered, unacknowledged, "new defect" ticket. It makes it sound like whoever is setting the goals/deadlines of the project really isn't interested in even looking at the issue and has dismissed it "just because".

jgreco · May 24, 2013

So my guess is you've never been involved with software development. I see it as not too unusual.

Looking at your ticket, two of the three issues are trite. Really, there is no call to panic over bzip2; your server should never be doing anything where this would come into play. The ntpd thing appears to be version-whining, because the configuration of FreeNAS ntpd appears to restrict external packets, and the OpenSSL thing, yeah, that's externally visible and ought to be fixed. Mitigation is trivial though, simply don't expose https port to the Internet.

So, part two is that I don't really know what the design strategy for the underlying OS is. It is very possible that a baseline image was created for FreeNAS 8 and that changes are being imported manually, which is really the main way I can think to explain that these items are still at FreeBSD 8.0 revisions. There are definitely some fantastic reasons for going that route, and also some reasons not to do it.

But there's a bigger picture here, and one that's easy to miss out on. iXsystems has a motive for developing FreeNAS, and that's TrueNAS. It is easy to not comprehend the implications. But here's something that's maybe a rough guess. iX has tried to make a business out of providing professional FreeBSD support. They've done a nice job with PC-BSD, for example, but one only has to look at the history of similar Linux companies to see the trail of corpses. FreeBSD is arguably weaker in the desktop arena. It makes sense to focus on servers if you can, but even technical superiority is not enough to guarantee success. So trends in the market made it kind of obvious that Sun's "the network is the computer" strategy was finally really coming true, massive storage with massive access requirements from large networks made NAS a must-have for environments where SAN was too expensive and/or couldn't be reasonably scaled. And NetApp sucks.

iXsystems apparently decided to jump into the fray of largely Linux-based NAS devices. And when I say that, I'm not talking the software you load onto an everyday PC, I'm talking the vendor software like EMC Lifeline, and other more specialized offerings. Sun had been offering Solaris and ZFS as an option for several years, but had been doing it in the typical Sun manner; they had a tough time being seen as an appliance storage vendor. Nexenta is reasonably successful, but a commercial NAS on FreeBSD was largely missing from the scene.

iXsystems apparently jumped at the chance to acquire the FreeNAS brand. But here's the thing. It's gotta be tough to make money on. If you look at Netgear with the ReadyDATA, you can see Nexenta loaded onto Supermicro at what I'm guessing is a 300% markup. It is going to be harder to convince people to buy a TrueNAS system if they can buy a Supermicro and stick a bunch of disks in and load up FreeNAS themselves. Unless they need some of the differentiating features, the promise of support may not be that compelling. They may not be finding it quite as profitable - or easy - as they were hoping it would be when they bought it two years ago.

So my suspicion is that they're throwing the available resources at development to keep things moving as quickly as possible. Since the major version bump from 8 to 9 probably mandated a re-import of the FreeBSD/NanoBSD files, this would resolve your bug report too. I can understand your feeling jilted at the lack of a personalized response, but it is quite possible that they're keeping the ticket open as an unresolved issue because no released version fixes it, and that it'll be quickly marked as "resolved" shortly after a FreeNAS 9 reaches release status.

cyberjock · May 25, 2013

jgreco said:
So my guess is you've never been involved with software development. I see it as not too unusual.

I have, but only with in-house software that we used ourselves. If there was a security risk, it was our own systems that were going to be compromised and not someone else's. If there was a fatal flaw in the software and we found it after we had rolled it out to production, then we told the 5 people on the planet who were responsible for those systems and life went on. Later, when we fixed the bug, we'd tell those 5 people that the issue had been resolved in the latest software. There was a very active feedback system in place and there were always responses within 48 hours (by order of our commanding officer).

jgreco said:
Looking at your ticket, two of the three issues are trite. Really, there is no call to panic over bzip2; your server should never be doing anything where this would come into play. The ntpd thing appears to be version-whining, because the configuration of FreeNAS ntpd appears to restrict external packets, and the OpenSSL thing, yeah, that's externally visible and ought to be fixed. Mitigation is trivial though, simply don't expose https port to the Internet.

I kind of thought the same as you did on this. My concern wasn't that the exact issues he brought up should be cause for concern. What concerned me was that we might have a vulnerability elsewhere because of whatever process has resulted in our software being out of date. I'm not actively forwarding any ports to any of the FreeNAS servers I manage, so I consider those risks to be rather low. I just have to wonder why the version numbers don't match and whether there are any potential consequences of using the old versions (assuming we actually are somehow). I take the fact that nobody has responded to mean that nobody has investigated whether this is actually an issue at all. That's what bothers me. If it is an issue, then log it as such and let the project manager(s) decide whether we need an emergency patch release or can wait for the next full build (9.x right now). If it's not, then let's close the ticket. What I've seen too often is that companies (and projects, and the military) deliberately choose not to investigate a ticket because they don't want to find out they have a security risk and have to make an announcement. If 9.x comes with all new versions of the programs mentioned in my ticket, then I expect they will close the ticket as "fixed in 9.x". But that doesn't really help the average Joe who doesn't upgrade, can't, etc. If I assume that we are using the old versions for some reason and they close my ticket when 9.x hits release status, what is my guarantee that the same faulty process that caused problems with 8.x isn't already set up to fail with 9.x and all future versions? I view that as an issue that should be looked at both forwards and backwards.

I know in the nuclear industry, if you contract out to them you MUST provide this level of responsibility, forever (or until the software is no longer used), once a site has made the choice to use it. Many other industries are the same. Failure to do so can result in prison sentences (and it has for many who tried to back out of those promises later because of cost). I believe I had read that 3 people last year went to jail for more than 10 years each for not maintaining their responsibilities.

jgreco said:
So, part two is that I don't really know what the design strategy for the underlying OS is. It is very possible that a baseline image was created for FreeNAS 8 and that changes are being imported manually, which is really the main way I can think to explain that these items are still at FreeBSD 8.0 revisions. There are definitely some fantastic reasons for going that route, and also some reasons not to do it.

I'd accept that as an answer (and I want to think that is the answer), but they're afraid of making the announcement that "sorry, all of our outdated software has security risks and we're not gonna fix it until 9.x". Well, I wouldn't expect 9.x to do much better long term, because I'd wager that 9.x will eventually be filled with its own known but unpatched vulnerabilities. I doubt iXsystems would want to see a lawsuit because they knew of the potential risks of not updating their base image and chose not to anyway.

jgreco said:
But there's a bigger picture here, and one that's easy to miss out on. iXsystems has a motive for developing FreeNAS, and that's TrueNAS. It is easy to not comprehend the implications. But here's something that's maybe a rough guess. iX has tried to make a business out of providing professional FreeBSD support. They've done a nice job with PC-BSD, for example, but one only has to look at the history of similar Linux companies to see the trail of corpses. FreeBSD is arguably weaker in the desktop arena. It makes sense to focus on servers if you can, but even technical superiority is not enough to guarantee success. So trends in the market made it kind of obvious that Sun's "the network is the computer" strategy was finally really coming true: massive storage with massive access requirements from large networks made NAS a must-have for environments where SAN was too expensive and/or couldn't be reasonably scaled. And NetApp sucks.

iXsystems apparently decided to jump into the fray of largely Linux-based NAS devices. And when I say that, I'm not talking the software you load onto an everyday PC, I'm talking the vendor software like EMC Lifeline, and other more specialized offerings. Sun had been offering Solaris and ZFS as an option for several years, but had been doing it in the typical Sun manner; they had a tough time being seen as an appliance storage vendor. Nexenta is reasonably successful, but a commercial NAS on FreeBSD was largely missing from the scene.

Totally agree. I'm actually wondering if ZFS on Linux becoming reliable will somewhat hurt FreeNAS. Far more people know Linux than FreeBSD, and if ZFS on Linux is just as stable as ZFS on FreeBSD/FreeNAS, will that hurt FreeNAS? A lot of the issues people have with software not being supported on FreeBSD, drivers, etc. would almost go away just by switching to Linux over FreeNAS/FreeBSD.

jgreco said:
iXsystems apparently jumped at the chance to acquire the FreeNAS brand. But here's the thing. It's gotta be tough to make money on. If you look at Netgear with the ReadyDATA, you can see Nexenta loaded onto Supermicro at what I'm guessing is a 300% markup. It is going to be harder to convince people to buy a TrueNAS system if they can buy a Supermicro and stick a bunch of disks in and load up FreeNAS themselves. Unless they need some of the differentiating features, the promise of support may not be that compelling. They may not be finding it quite as profitable - or easy - as they were hoping it would be when they bought it two years ago.

I wasn't particularly interested in FreeNAS, FreeBSD, or ZFS before about Jan 2012. So I don't have much knowledge on the whole FreeNAS split between the "new" FreeNAS and the NAS4Free project. I know one or two people had posted that the contract prices were rather high for what they were getting over doing the work themselves. I can't vouch for it because I'll never do a contract for my home server. I'm not sure how many big businesses would do it either because every large business I've been involved with had dedicated Linux and FreeBSD admins in-house to do the dirty work.

jgreco said:
So my suspicion is that they're throwing the available resources at development to keep things moving as quickly as possible. Since the major version bump from 8 to 9 probably mandated a re-import of the FreeBSD/NanoBSD files, this would resolve your bug report too. I can understand your feeling jilted at the lack of a personalized response, but it is quite possible that they're keeping the ticket open as an unresolved issue because no released version fixes it, and that it'll be quickly marked as "resolved" shortly after a FreeNAS 9 reaches release status.

I kind of discussed this above. But that bothers me because if they aren't updating their base image does that mean that the FreeBSD 9 base image that they are using now won't be updated either? Saying nothing can sometimes be worse than admitting the truth. That's why there are so many conspiracy theorists out there believing all sorts of crazy stories because they weren't given the truth to begin with.

Now I feel like pulling out my VM image and doing comparisons of my own to see if the software is in fact out of date. I know it would be some serious egg-on-the-face if these are actually super old and we have abundant vulnerabilities we aren't even aware of.
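
For anyone wanting to try that kind of comparison, here is a minimal sketch in Python. It is not anything FreeNAS ships, and it assumes you have already collected the version strings by hand from the image and from upstream. The helper names (parse_version, find_outdated) and every version number below are made-up placeholders for illustration, not readings from a real FreeNAS build.

```python
import re


def parse_version(text):
    """Split a version such as '1.0.6' or '0.9.8y' into comparable pieces."""
    pieces = []
    for token in re.findall(r"\d+|[a-z]+", text.lower()):
        if token.isdigit():
            # Numeric pieces compare as integers, so 10 sorts above 9.
            pieces.append((0, int(token), ""))
        else:
            # Letter suffixes (patch letters) sort after any numeric piece.
            pieces.append((1, 0, token))
    return pieces


def find_outdated(installed, reference):
    """Return the names whose installed version sorts below the reference."""
    return sorted(
        name
        for name, version in installed.items()
        if name in reference
        and parse_version(version) < parse_version(reference[name])
    )


# Placeholder values only: fill these in from the image and from upstream.
installed = {"bzip2": "1.0.0", "ntpd": "4.2.0p1", "openssl": "0.9.0a"}
reference = {"bzip2": "1.0.1", "ntpd": "4.2.0p1", "openssl": "0.9.0b"}

print(find_outdated(installed, reference))
```

The ordering is deliberately rough: it only says one string sorts below another, which is enough to flag candidates for a closer look but says nothing about whether a given fix was backported.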

jgreco · May 25, 2013

cyberjock said:
I'd accept that as an answer (and I want to think that is the answer), but they're afraid of making the announcement that "sorry, all of our outdated software has security risks and we're not gonna fix it until 9.x". Well, I wouldn't expect 9.x to do much better long term, because I'd wager that 9.x will eventually be filled with its own known but unpatched vulnerabilities. I doubt iXsystems would want to see a lawsuit because they knew of the potential risks of not updating their base image and chose not to anyway.

Don't take it the wrong way when I say that's a geek's way of wishing the world to be how it probably ought to be (so I cleverly paint myself into the same corner by agreeing). But the realistic aspect is that software licenses typically don't allow you much recourse. I've got a 2010 Samsung Smart TV hanging on the wall that I expect to never see another firmware update for, despite known issues. The ~dozen Linux BusyBox devices the average person has in their house, often trite things like DSL modems and TiVos, do not see the sort of constant revision of firmware that we're discussing here. Pragmatic choices by device manufacturers are to only address issues that are likely to actually be a problem, or are actually being exploited, or (in the most cynical cases) are causing them so many headaches under their warranty program that the firmware update is worth hiring back a member of the software team that developed the thing in the first place. Typically these guys have been let go if they're contract developers, or have moved on to another project. There's minimal value to the company (but lots of trouble and complexity) in revisiting the firmware of an already-sold device, which is why there's so much angst about things like Android firmware for older phones, but so little action.

cyberjock said:
I kind of discussed this above. But that bothers me because if they aren't updating their base image does that mean that the FreeBSD 9 base image that they are using now won't be updated either? Saying nothing can sometimes be worse than admitting the truth. That's why there are so many conspiracy theorists out there believing all sorts of crazy stories because they weren't given the truth to begin with.

Now I feel like pulling out my VM image and doing comparisons of my own to see if the software is in fact out of date. I know it would be some serious egg-on-the-face if these are actually super old and we have abundant vulnerabilities we aren't even aware of.

This has happened before and it will happen again. Fortunately, with open source software, you have the choice to look, and even to build your own version of the software.

cyberjock · May 25, 2013

jgreco said:
There's minimal value to the company (but lots of trouble and complexity) in revisiting the firmware of an already-sold device, which is why there's so much angst about things like Android firmware for older phones, but so little action.

I had that exact problem with my Motorola Droid Bionic. I was promised ICS when I bought my phone. It took almost a full year, and a lot of other phones were given priority, some older, some newer, and some that weren't even promised ICS. Now I have JB and I know that's the end of the road for me.

jgreco said:
This has happened before and it will happen again. Fortunately, with open source software, you have the choice to look, and even to build your own version of the software.

Of that I have no doubt. But a project that's as actively developed as FreeNAS? A little sad (and scary) to me.

jgreco · May 25, 2013

cyberjock said:
Of that I have no doubt. But a project that's as actively developed as FreeNAS? A little sad (and scary) to me.

It's simple economics. If you have a hundred man-hours available to pay a developer in the next month, do you have him authoring and committing new features in the current code tree, or do you have him porting in the latest source code in the old (FreeBSD 8) tree which is going to be deprecated in the next six to twelve months anyways?

jgreco · May 26, 2013

I should also point out that there appear to be some deviations from the FreeBSD 8 source tree ... for example I had noticed that the output of "top" had been augmented in the latest FreeNAS:

26 processes: 1 running, 25 sleeping
CPU: 0.0% user, 0.0% nice, 49.9% system, 0.4% interrupt, 49.7% idle
Mem: 152M Active, 7336K Inact, 4465M Wired, 8K Cache, 142M Buf, 3280M Free
ARC: 4181M Total, 282M MFU, 3879M MRU, 18K Anon, 13M Header, 7422K Other
Swap: 16G Total, 340K Used, 16G Free

titan_rw · May 26, 2013

I think I remember reading about that change in the release notes.

I didn't know this was a deviation from official FreeBSD.
