FreeNAS rebooted on disk failure

Status: Not open for further replies.

armsby

Dabbler
Joined
Apr 20, 2014
Messages
14
I have configured an encrypted volume on 3 disks for test purposes. One of the disks is known to be bad, because I want to see how FreeNAS behaves when a disk fails in an encrypted volume. What I did not expect was that FreeNAS would reboot. This is the last message in the log:

Code:
Jun  3 19:49:01 freenas autosnap.py: [tools.autosnap:58] Popen()ing: /sbin/zfs snapshot -r ssd@auto-20140603.1949-1m
Jun  3 19:53:18 freenas kernel: (da2:mps0:0:19:0): READ(10). CDB: 28 00 05 d1 8b f0 00 01 00 00 length 131072 SMID 337 terminated ioc 804b scsi 0 state 0 xfer 0
Jun  3 19:53:18 freenas kernel: (da2:mps0:0:19:0): READ(10). CDB: 28 00 05 d1 8d 70 00 01 00 00 length 131072 SMID 228 terminated ioc 804b scsi 0 state 0 xfer 0
Jun  3 19:53:18 freenas kernel: (da2:mps0:0:19:0): WRITE(10). CDB: 2a 00 06 06 5a c0 00 00 08 00 length 4096 SMID 1000 terminated ioc 804b scsi 0 state 0 xfer 0
Jun  3 19:53:18 freenas kernel: (da2:mps0:0:19:0): READ(10). CDB: 28 00 05 d1 8b 68 00 00 80 00
Jun  3 19:53:18 freenas kernel: (da2:mps0:0:19:0): CAM status: SCSI Status Error
Jun  3 19:53:18 freenas kernel: (da2:mps0:0:19:0): SCSI status: Check Condition
Jun  3 19:53:18 freenas kernel: (da2:mps0:0:19:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Jun  3 19:53:18 freenas kernel: (da2:mps0:0:19:0): Info: 0x5d18bb5
Jun  3 19:53:18 freenas kernel: (da2:mps0:0:19:0): Error 5, Unretryable error
Jun  3 19:53:18 freenas kernel: GEOM_ELI: g_eli_read_done() failed gptid/a6ec6834-dd19-11e3-adad-08606e450d1c.eli[READ(offset=48907014144, length=65536)]
Jun  3 19:59:00 freenas kernel: (da2:mps0:0:19:0): READ(10). CDB: 28 00 05 d4 7d e0 00 00 80 00 length 65536 SMID 698 terminated ioc 804b scsi 0 state 0 xfer 0
Jun  3 19:59:00 freenas kernel: (da2:mps0:0:19:0): READ(10). CDB: 28 00 02 35 76 50 00 00 80 00 length 65536 SMID 807 terminated ioc 804b scsi 0 state 0 xfer 0
Jun  3 19:59:00 freenas kernel: (da2:mps0:0:19:0): READ(10). CDB: 28 00 05 d4 7a e0 00 00 80 00
Jun  3 19:59:00 freenas kernel: (da2:mps0:0:19:0): CAM status: SCSI Status Error
Jun  3 19:59:00 freenas kernel: (da2:mps0:0:19:0): SCSI status: Check Condition
Jun  3 19:59:00 freenas kernel: (da2:mps0:0:19:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Jun  3 19:59:00 freenas kernel: (da2:mps0:0:19:0): Info: 0x5d47b1d
Jun  3 19:59:00 freenas kernel: (da2:mps0:0:19:0): Error 5, Unretryable error
Jun  3 19:59:00 freenas kernel: GEOM_ELI: g_eli_read_done() failed gptid/a6ec6834-dd19-11e3-adad-08606e450d1c.eli[READ(offset=49005510656, length=65536)]
Jun  3 20:03:38 freenas kernel: (da2:mps0:0:19:0): READ(10). CDB: 28 00 02 5b c4 90 00 00 80 00 length 65536 SMID 553 command timeout cm 0xffffff8000cd0488 ccb 0xfffffe011e5f3000
Jun  3 20:03:38 freenas kernel: (noperiph:mps0:0:4294967295:0): SMID 1 Aborting command 0xffffff8000cd0488
Jun  3 20:03:38 freenas kernel: (da2:mps0:0:19:0): READ(10). CDB: 28 00 02 5b c5 10 00 00 80 00 length 65536 SMID 281 command timeout cm 0xffffff8000cba808 ccb 0xfffffe011e5d1000
Jun  3 20:03:38 freenas kernel: (da2:mps0:0:19:0): READ(10). CDB: 28 00 02 5b c5 90 00 00 80 00 length 65536 SMID 949 command timeout cm 0xffffff8000ceffe8 ccb 0xfffffe011e58e000
Jun  3 20:03:41 freenas kernel: (da2:mps0:0:19:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 701 command timeout cm 0xffffff8000cdc228 ccb 0xfffffe011e5e8800
Jun  3 20:08:11 freenas syslogd: kernel boot file is /boot/kernel/kernel


I am running 9.2.1.6-BETA.

Has anybody seen this before?
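
A rough sketch of the follow-up diagnostics for errors like these, assuming the FreeNAS shell, the device name da2 and pool name ssd taken from the log above, and the smartmontools bundled with FreeNAS:

Code:
# What ZFS thinks of the pool and its vdevs
zpool status -v ssd

# SMART health of the disk throwing the medium errors
smartctl -a /dev/da2

# State of the GELI providers backing the encrypted pool
geli status

# Devices still visible on the HBA
camcontrol devlist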
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Ok, clarify a few things for me:

1. Are you saying you deliberately installed a known-failing disk in a RAIDZ1?
2. How did you "simulate" the failure with that disk? (This may be N/A based on the above question.)
 

armsby

Dabbler
Joined
Apr 20, 2014
Messages
14
1. Yes. This server is not in use yet, so this seemed like the perfect time to test everything you would see during a disk failure: the SMART email alerts, how long before the disk is identified as bad and kicked out, and so on.

2. I just used the FreeNAS box with non-critical data, nothing out of the ordinary; I ran a BTSync job against another NAS until the disk failed.

 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Ok.

So, you were doing RAIDZ1 with a failed disk. As everyone here should know, that means you have one disk's worth of redundancy. If you pull a good disk and any of the remaining disks has any error whatsoever, you can have problems up to and including complete failure of the pool. This is why my sig has the "RAID5/RAIDZ1 is dead" link. RAIDZ1 should not be used in production environments except for temp files. In your case, since you want to see how a failed disk is handled, you should be doing RAIDZ2: without another disk's worth of redundancy, the whole house of cards falls down as soon as the redundancy runs out and disks keep failing.
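
To repeat the experiment with redundancy left over after one disk dies, a minimal RAIDZ2 test pool could be sketched roughly as below; the pool and disk names are placeholders, and on FreeNAS you would normally build the pool through the GUI volume manager rather than the shell.

Code:
# Hypothetical 4-disk RAIDZ2 test pool (disk names are placeholders)
zpool create testpool raidz2 da1 da2 da3 da4

# Take one disk out of service on purpose: the pool goes DEGRADED but
# stays online, because one more disk's worth of redundancy remains
zpool offline testpool da3
zpool status testpool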

Good on you for testing stuff out. It's time-consuming, but it gives you a few important things:

1. You see how the server really responds to failures.
2. You get confidence in the design you are about to implement.
3. You learn more about how FreeNAS works internally.
 

armsby

Dabbler
Joined
Apr 20, 2014
Messages
14
Thank you for the warning, but in this case pool failure is exactly what I am trying to get. The production setup will be 4 vdevs of 6 disks each, running RAIDZ2.
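
For reference, a layout of 4 RAIDZ2 vdevs of 6 disks each would look roughly like the sketch below; the pool and disk names are placeholders, and in practice the FreeNAS volume manager would build this (and the GELI encryption layer) rather than a hand-typed command.

Code:
# Rough shape of the planned production pool: 4 x 6-disk RAIDZ2 vdevs
zpool create tank \
    raidz2 da0  da1  da2  da3  da4  da5  \
    raidz2 da6  da7  da8  da9  da10 da11 \
    raidz2 da12 da13 da14 da15 da16 da17 \
    raidz2 da18 da19 da20 da21 da22 da23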

My problem is that I did not expect the server to reboot when it kicked out the disk, with nothing more in the log.

 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Yeah, it likely panicked immediately because it no longer had enough redundancy to continue functioning. Once you run out of redundancy, whatever data you receive, trashed or not, must be processed by ZFS. If that trashed data causes a crash, game over. This is no different from hardware RAID.
 

armsby

Dabbler
Joined
Apr 20, 2014
Messages
14
Ok, good to know, but the pool seems fine after a reboot and resilver.
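
A quick way to double-check that, assuming the pool name ssd from the log earlier, is to look at the post-resilver status and then re-read everything with a scrub:

Code:
# Confirm the resilver completed and list any files with unrecoverable errors
zpool status -v ssd

# Re-read all data and verify checksums now that the bad disk is gone
zpool scrub ssd
zpool status ssd    # shows scrub progress and any new errors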

 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Right. Everything will be normal until the ZFS code gets some garbage that makes no sense. Instant panic.

You reboot and things continue right along until you get more garbage. Instant panic (again).

This is a common problem with ZFS when you start dealing with unrepairable corruption. It is also why I have this on slide 11 of my FreeNAS for noobs presentation....

If any VDev in a zpool is failed, you will lose the entire zpool with no chance of partial recovery. (Read this again so it sinks in)

It goes from working just fine to worthless in an instant. There's no room for error. This absolute is supposed to be maintained by the server admin. Of course, too many admins don't recognize this basic philosophy and lose their data as a result. ;)
 

armsby

Dabbler
Joined
Apr 20, 2014
Messages
14
That was the warning that made me do this test. I want to see what happens in the worst case before I put data on the pool that I care about.

 