norbs
Explorer
Joined: Mar 26, 2013
Messages: 91
I've had the same pool for about 4 years. A few weeks ago it started crashing FreeNAS into a "db>" prompt.
My hardware:
My hardware was an i7-3770 with 32 GB of non-ECC RAM, running FreeNAS as an ESXi VM with VT-d passthrough to an LSI 9211-8i card flashed to IT mode. The VM had 2 vCPUs and about 20 GB of reserved memory assigned. I have since switched to a Xeon E3-1231 v3 with 32 GB of ECC RAM on one of the recommended Supermicro boards. I'm still running FreeNAS as a VM.
I tried quite a few troubleshooting suggestions I found all over this site, with no luck.
I have also tried importing the pool with one disk from the array unplugged (tried all 4 pool members, without any luck).
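For reference, these are the less invasive import variants I've seen recommended around here and can run from the console (flags are straight from zpool(8); RAIDZ is the pool name, and the /mnt altroot is just the usual FreeNAS convention):
Code:
# Read-only import: nothing gets written to the pool while poking around
zpool import -o readonly=on -f -R /mnt RAIDZ

# Recovery-mode dry run: reports whether discarding the last few
# transactions would make the pool importable, without changing anything
zpool import -F -n RAIDZ

# Import without mounting any datasets
zpool import -N -f -o readonly=on RAIDZ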
I ended up running zdb -e -bcsvL RAIDZ, with the following results:
Code:
Assertion failed: 0 == dmu_bonus_hold(os, object, dl, &dl->dl_dbuf) (0x0 == 0x2), file /fusion/jkh/9.2.1/freenas/FreeBSD/src/cddl/lib/libzpool/../../../sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_deadlist.c, line 101.
Abort (core dumped)
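For anyone reading along: -e works on an exported pool, -b traverses and tallies every block, -c verifies metadata checksums, -s reports stats, and -L skips the leak-tracking pass. If more output would help anyone diagnose this, gentler zdb reads that don't traverse the whole pool would be something like:
Code:
# Print the cached pool config and uberblocks only (no dataset traversal)
zdb -e -C -u RAIDZ

# List datasets and per-object summaries without checksumming everything
zdb -e -d RAIDZ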
zpool import:
Code:
   pool: RAIDZ
     id: 16802863673492970021
  state: ONLINE
 action: The pool can be imported using its name or numeric identifier.
 config:

        RAIDZ                                           ONLINE
          raidz1-0                                      ONLINE
            gptid/c96a7313-705a-11e3-9480-000c29717331  ONLINE
            gptid/ca4e8476-705a-11e3-9480-000c29717331  ONLINE
            gptid/cb226ccf-705a-11e3-9480-000c29717331  ONLINE
            gptid/cbf24925-705a-11e3-9480-000c29717331  ONLINE
gpart status:
Code:
Name    Status  Components
da0s1       OK  da0
da0s2       OK  da0
da0s3       OK  da0
da0s4       OK  da0
da0s1a      OK  da0s1
da0s2a      OK  da0s2
da1p1       OK  da1
da1p2       OK  da1
da2p1       OK  da2
da2p2       OK  da2
da3p1       OK  da3
da3p2       OK  da3
da4p1       OK  da4
da4p2       OK  da4
da5p1       OK  da5
da5p2       OK  da5
da6p1       OK  da6
da6p2       OK  da6
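In case it helps to match the gptid labels from the zpool config back to physical disks, standard FreeBSD tooling can do that (da1 below is just an example member):
Code:
# Map each gptid/... label to its partition and disk
glabel status

# Show partition details (including rawuuid) for a single pool member
gpart list da1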
I was not running ECC RAM; I am now. I'm just posting here as a last attempt at rescuing my data before I recreate a fresh pool.
EDIT: I was pretty close to 90% capacity when all this happened, and I believe I do have snapshots configured. I'm wondering if this had more to do with it than anything. Can anyone chime in on this?
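If I can ever get the pool imported read-only, this is roughly how I'd check the capacity/snapshot theory (plain zpool/zfs commands):
Code:
# Pool-level occupancy; running near 90%+ full is known trouble for ZFS
zpool list -o name,size,allocated,free,capacity RAIDZ

# Space held by each snapshot, sorted smallest to largest
zfs list -t snapshot -o name,used,referenced -s used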
EDIT2:
Some emails I received before the pool stopped mounting entirely:
nas.local kernel log messages:
> panic: solaris assert: 0 == dmu_bonus_hold(os, object, dl, &dl->dl_dbuf) (0x0 == 0x2), file: /fusion/jkh/9.2.1/freenas/FreeBSD/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_deadlist.c, line: 101
> cpuid = 1
> KDB: stack backtrace:
> db_trace_self_wrapper() at db_trace_self_wrapper+0x2a/frame 0xffffff85976ed170
> kdb_backtrace() at kdb_backtrace+0x37/frame 0xffffff85976ed230
> panic() at panic+0x1ce/frame 0xffffff85976ed330
> assfail3() at assfail3+0x29/frame 0xffffff85976ed350
> dsl_deadlist_open() at dsl_deadlist_open+0xd7/frame 0xffffff85976ed3c0
> dsl_dataset_hold_obj() at dsl_dataset_hold_obj+0x20b/frame 0xffffff85976ed480
> dsl_dataset_stats() at dsl_dataset_stats+0x23e/frame 0xffffff85976ed770
> dmu_objset_stats() at dmu_objset_stats+0x1a/frame 0xffffff85976ed790
> zfs_ioc_objset_stats_impl() at zfs_ioc_objset_stats_impl+0x63/frame 0xffffff85976ed7d0
> zfs_ioc_snapshot_list_next() at zfs_ioc_snapshot_list_next+0x156/frame 0xffffff85976ed810
> zfsdev_ioctl() at zfsdev_ioctl+0x58d/frame 0xffffff85976ed8b0
> devfs_ioctl_f() at devfs_ioctl_f+0x7b/frame 0xffffff85976ed920
> kern_ioctl() at kern_ioctl+0x106/frame 0xffffff85976ed970
> sys_ioctl() at sys_ioctl+0xfd/frame 0xffffff85976ed9d0
> amd64_syscall() at amd64_syscall+0x5ea/frame 0xffffff85976edaf0
> Xfast_syscall() at Xfast_syscall+0xf7/frame 0xffffff85976edaf0
> --- syscall (54, FreeBSD ELF64, sys_ioctl), rip = 0x8019b48fc, rsp = 0x7fffffffad38, rbp = 0x7fffffffae50 ---
> KDB: enter: panic
> Textdump complete.
> cpu_reset: Restarting BSP
> cpu_reset_proxy: Stopped CPU 1
> ugen0.2: <VMware> at usbus0
> uhid0: <VMware> on usbus0
> ums0: <VMware> on usbus0
> ums0: 16 buttons and [XYZT] coordinates ID=0
> Root mount waiting for: usbus1 usbus0
> uhub1: 6 ports with 6 removable, self powered
> ugen0.3: <vendor 0x0e0f> at usbus0
> uhub2: <VMware Virtual USB Hub> on usbus0
> Root mount waiting for: usbus0
> uhub2: 7 ports with 7 removable, self powered
> Root mount waiting for: usbus0
> ugen0.4: <CP1000PFCLCD> at usbus0
> db_trace_self_wrapper() at db_trace_self_wrapper+0x2a/frame 0xffffff85976f2170
> kdb_backtrace() at kdb_backtrace+0x37/frame 0xffffff85976f2230
> panic() at panic+0x1ce/frame 0xffffff85976f2330
> assfail3() at assfail3+0x29/frame 0xffffff85976f2350
> dsl_deadlist_open() at dsl_deadlist_open+0xd7/frame 0xffffff85976f23c0
> dsl_dataset_hold_obj() at dsl_dataset_hold_obj+0x20b/frame 0xffffff85976f2480
> dsl_dataset_stats() at dsl_dataset_stats+0x23e/frame 0xffffff85976f2770
> dmu_objset_stats() at dmu_objset_stats+0x1a/frame 0xffffff85976f2790
> zfs_ioc_objset_stats_impl() at zfs_ioc_objset_stats_impl+0x63/frame 0xffffff85976f27d0
> zfs_ioc_snapshot_list_next() at zfs_ioc_snapshot_list_next+0x156/frame 0xffffff85976f2810
> zfsdev_ioctl() at zfsdev_ioctl+0x58d/frame 0xffffff85976f28b0
> devfs_ioctl_f() at devfs_ioctl_f+0x7b/frame 0xffffff85976f2920
> kern_ioctl() at kern_ioctl+0x106/frame 0xffffff85976f2970
> sys_ioctl() at sys_ioctl+0xfd/frame 0xffffff85976f29d0
> amd64_syscall() at amd64_syscall+0x5ea/frame 0xffffff85976f2af0
> Xfast_syscall() at Xfast_syscall+0xf7/frame 0xffffff85976f2af0
> --- syscall (54, FreeBSD ELF64, sys_ioctl), rip = 0x8019b48fc, rsp = 0x7fffffffad38, rbp = 0x7fffffffae50 ---
> panic: solaris assert: 0 == dmu_bonus_hold(os, object, dl, &dl->dl_dbuf) (0x0 == 0x2), file: /fusion/jkh/9.2.1/freenas/FreeBSD/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_deadlist.c, line: 101
> cpuid = 1
> KDB: stack backtrace:
> db_trace_self_wrapper() at db_trace_self_wrapper+0x2a/frame 0xffffff859762a170
> kdb_backtrace() at kdb_backtrace+0x37/frame 0xffffff859762a230
> panic() at panic+0x1ce/frame 0xffffff859762a330
> assfail3() at assfail3+0x29/frame 0xffffff859762a350
> dsl_deadlist_open() at dsl_deadlist_open+0xd7/frame 0xffffff859762a3c0
> dsl_dataset_hold_obj() at dsl_dataset_hold_obj+0x20b/frame 0xffffff859762a480
> dsl_dataset_stats() at dsl_dataset_stats+0x23e/frame 0xffffff859762a770
> dmu_objset_stats() at dmu_objset_stats+0x1a/frame 0xffffff859762a790
> zfs_ioc_objset_stats_impl() at zfs_ioc_objset_stats_impl+0x63/frame 0xffffff859762a7d0
> zfs_ioc_snapshot_list_next() at zfs_ioc_snapshot_list_next+0x156/frame 0xffffff859762a810
> zfsdev_ioctl() at zfsdev_ioctl+0x58d/frame 0xffffff859762a8b0
> devfs_ioctl_f() at devfs_ioctl_f+0x7b/frame 0xffffff859762a920
> kern_ioctl() at kern_ioctl+0x106/frame 0xffffff859762a970
> sys_ioctl() at sys_ioctl+0xfd/frame 0xffffff859762a9d0
> amd64_syscall() at amd64_syscall+0x5ea/frame 0xffffff859762aaf0
> Xfast_syscall() at Xfast_syscall+0xf7/frame 0xffffff859762aaf0
> --- syscall (54, FreeBSD ELF64, sys_ioctl), rip = 0x8019b48fc, rsp = 0x7fffffffad38, rbp = 0x7fffffffae50 ---
> KDB: enter: panic
> Textdump complete.
> cpu_reset: Restarting BSP
> cpu_reset_proxy: Stopped CPU 1
> da5 at mps0 bus 0 scbus4 target 3 lun 0
> da5: <ATA ST2000DL003-9VT1 CC32> Fixed Direct Access SCSI-6 device
> da5: Serial Number 5YD3KXS1
> da5: 600.000MB/s transfers
> da5: Command Queueing enabled
> da5: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
> da5: quirks=0x8<4K>
> WARNING: /data was not properly dismounted
-- End of security output --
> pid 25259 (mv), uid 0 inumber 8 on /data: filesystem full
> pid 27624 (mv), uid 0 inumber 9 on /data: filesystem full
> pid 29950 (mv), uid 0 inumber 9 on /data: filesystem full
> pid 32275 (mv), uid 0 inumber 9 on /data: filesystem full
> pid 35783 (mv), uid 0 inumber 8 on /data: filesystem full
> pid 39280 (mv), uid 0 inumber 8 on /data: filesystem full
> pid 41698 (mv), uid 0 inumber 9 on /data: filesystem full
> pid 45221 (mv), uid 0 inumber 8 on /data: filesystem full
> pid 51193 (mv), uid 0 inumber 8 on /data: filesystem full
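Reading the panic: the assert fires in dsl_deadlist_open() because dmu_bonus_hold() returns 2 (ENOENT), i.e. a deadlist object that some dataset still references no longer exists, and the box panics every time anything enumerates snapshots (zfs_ioc_snapshot_list_next in the backtrace). If I understand zdb correctly, the same walk can be done from user space, where it aborts with the assertion from my earlier output instead of panicking the kernel:
Code:
# Dump dataset metadata; the last dataset printed before the abort
# should be the one with the damaged deadlist
zdb -e -dd RAIDZ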
EDIT 3:
Sample of what a crash looks like: