FreeNAS rebooted on disk failure

Status: Not open for further replies.

armsby

Dabbler
Joined
Apr 20, 2014
Messages
14
I have configured an encrypted volume on 3 disks for test purposes. One of the disks is known to be bad, because I want to see how FreeNAS behaves when a disk fails in an encrypted volume. What I did not expect was that FreeNAS would reboot. This is the last message in the log:

Code:
Jun  3 19:49:01 freenas autosnap.py: [tools.autosnap:58] Popen()ing: /sbin/zfs snapshot -r ssd@auto-20140603.1949-1m
Jun  3 19:53:18 freenas kernel: (da2:mps0:0:19:0): READ(10). CDB: 28 00 05 d1 8b f0 00 01 00 00 length 131072 SMID 337 terminated ioc 804b scsi 0 state 0 xfer 0
Jun  3 19:53:18 freenas kernel: (da2:mps0:0:19:0): READ(10). CDB: 28 00 05 d1 8d 70 00 01 00 00 length 131072 SMID 228 terminated ioc 804b scsi 0 state 0 xfer 0
Jun  3 19:53:18 freenas kernel: (da2:mps0:0:19:0): WRITE(10). CDB: 2a 00 06 06 5a c0 00 00 08 00 length 4096 SMID 1000 terminated ioc 804b scsi 0 state 0 xfer 0
Jun  3 19:53:18 freenas kernel: (da2:mps0:0:19:0): READ(10). CDB: 28 00 05 d1 8b 68 00 00 80 00
Jun  3 19:53:18 freenas kernel: (da2:mps0:0:19:0): CAM status: SCSI Status Error
Jun  3 19:53:18 freenas kernel: (da2:mps0:0:19:0): SCSI status: Check Condition
Jun  3 19:53:18 freenas kernel: (da2:mps0:0:19:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Jun  3 19:53:18 freenas kernel: (da2:mps0:0:19:0): Info: 0x5d18bb5
Jun  3 19:53:18 freenas kernel: (da2:mps0:0:19:0): Error 5, Unretryable error
Jun  3 19:53:18 freenas kernel: GEOM_ELI: g_eli_read_done() failed gptid/a6ec6834-dd19-11e3-adad-08606e450d1c.eli[READ(offset=48907014144, length=65536)]
Jun  3 19:59:00 freenas kernel: (da2:mps0:0:19:0): READ(10). CDB: 28 00 05 d4 7d e0 00 00 80 00 length 65536 SMID 698 terminated ioc 804b scsi 0 state 0 xfer 0
Jun  3 19:59:00 freenas kernel: (da2:mps0:0:19:0): READ(10). CDB: 28 00 02 35 76 50 00 00 80 00 length 65536 SMID 807 terminated ioc 804b scsi 0 state 0 xfer 0
Jun  3 19:59:00 freenas kernel: (da2:mps0:0:19:0): READ(10). CDB: 28 00 05 d4 7a e0 00 00 80 00
Jun  3 19:59:00 freenas kernel: (da2:mps0:0:19:0): CAM status: SCSI Status Error
Jun  3 19:59:00 freenas kernel: (da2:mps0:0:19:0): SCSI status: Check Condition
Jun  3 19:59:00 freenas kernel: (da2:mps0:0:19:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Jun  3 19:59:00 freenas kernel: (da2:mps0:0:19:0): Info: 0x5d47b1d
Jun  3 19:59:00 freenas kernel: (da2:mps0:0:19:0): Error 5, Unretryable error
Jun  3 19:59:00 freenas kernel: GEOM_ELI: g_eli_read_done() failed gptid/a6ec6834-dd19-11e3-adad-08606e450d1c.eli[READ(offset=49005510656, length=65536)]
Jun  3 20:03:38 freenas kernel: (da2:mps0:0:19:0): READ(10). CDB: 28 00 02 5b c4 90 00 00 80 00 length 65536 SMID 553 command timeout cm 0xffffff8000cd0488 ccb 0xfffffe011e5f3000
Jun  3 20:03:38 freenas kernel: (noperiph:mps0:0:4294967295:0): SMID 1 Aborting command 0xffffff8000cd0488
Jun  3 20:03:38 freenas kernel: (da2:mps0:0:19:0): READ(10). CDB: 28 00 02 5b c5 10 00 00 80 00 length 65536 SMID 281 command timeout cm 0xffffff8000cba808 ccb 0xfffffe011e5d1000
Jun  3 20:03:38 freenas kernel: (da2:mps0:0:19:0): READ(10). CDB: 28 00 02 5b c5 90 00 00 80 00 length 65536 SMID 949 command timeout cm 0xffffff8000ceffe8 ccb 0xfffffe011e58e000
Jun  3 20:03:41 freenas kernel: (da2:mps0:0:19:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 701 command timeout cm 0xffffff8000cdc228 ccb 0xfffffe011e5e8800
Jun  3 20:08:11 freenas syslogd: kernel boot file is /boot/kernel/kernel


I am running 9.2.1.6-BETA.

Has anybody seen this before?
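
A rough sketch of the follow-up diagnostics for errors like these, assuming the FreeNAS shell, the device name da2 and pool name ssd taken from the log above, and the smartmontools bundled with FreeNAS:

Code:
# What ZFS thinks of the pool and its vdevs
zpool status -v ssd

# SMART health of the disk throwing the medium errors
smartctl -a /dev/da2

# State of the GELI providers backing the encrypted pool
geli status

# Devices still visible on the HBA
camcontrol devlist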
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Ok, clarify a few things for me:

1. Are you saying you deliberately installed a known-failing disk in a RAIDZ1?
2. How did you "simulate" the failure with that disk? (This may be N/A based on the above question.)
 

armsby

Dabbler
Joined
Apr 20, 2014
Messages
14
1. Yes. This server is not in use yet, so this seemed like the perfect time to test everything you would see during a disk failure: the SMART email alerts, how long before the disk is identified as bad and kicked out, and so on.

2. I just used the FreeNAS box with non-critical data, nothing out of the ordinary; I ran a BTSync job against another NAS until the disk failed.

 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Ok.

So, you were doing RAIDZ1 with a failed disk. As everyone here should know, that means you have one disk's worth of redundancy. If you pull a good disk and any of the remaining disks has any error whatsoever, you can have problems up to and including complete failure of the pool. This is why my sig has the "RAID5/RAIDZ1 is dead" link. RAIDZ1 should not be used in production environments except for temp files. In your case, since you want to see how a failed disk is handled, you should be doing RAIDZ2: without another disk's worth of redundancy, the whole house of cards falls down as soon as the redundancy runs out and disks keep failing.
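
To repeat the experiment with redundancy left over after one disk dies, a minimal RAIDZ2 test pool could be sketched roughly as below; the pool and disk names are placeholders, and on FreeNAS you would normally build the pool through the GUI volume manager rather than the shell.

Code:
# Hypothetical 4-disk RAIDZ2 test pool (disk names are placeholders)
zpool create testpool raidz2 da1 da2 da3 da4

# Take one disk out of service on purpose: the pool goes DEGRADED but
# stays online, because one more disk's worth of redundancy remains
zpool offline testpool da3
zpool status testpool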

Good on you for testing stuff out. It's time-consuming, but it gives you a few important things:

1. You see how the server really responds to failures.
2. You get confidence in the design you are about to implement.
3. You learn more about how FreeNAS works internally.
 

armsby

Dabbler
Joined
Apr 20, 2014
Messages
14
Thank you for the warning, but in this case pool failure is exactly what I am trying to get. The production setup will be 4 vdevs of 6 disks each, running RAIDZ2.
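
For reference, a layout of 4 RAIDZ2 vdevs of 6 disks each would look roughly like the sketch below; the pool and disk names are placeholders, and in practice the FreeNAS volume manager would build this (and the GELI encryption layer) rather than a hand-typed command.

Code:
# Rough shape of the planned production pool: 4 x 6-disk RAIDZ2 vdevs
zpool create tank \
    raidz2 da0  da1  da2  da3  da4  da5  \
    raidz2 da6  da7  da8  da9  da10 da11 \
    raidz2 da12 da13 da14 da15 da16 da17 \
    raidz2 da18 da19 da20 da21 da22 da23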

My problem is that I did not expect the server to reboot when it kicked out the disk, with nothing more in the log.

 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Yeah, it likely panicked immediately because it no longer had enough redundancy to continue functioning. Once you run out of redundancy, whatever data you receive, trashed or not, must be processed by ZFS. If that trashed data causes a crash, game over. This is no different from hardware RAID.
 

armsby

Dabbler
Joined
Apr 20, 2014
Messages
14
Ok, good to know, but the pool seems fine after a reboot and resilver.
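
A quick way to double-check that, assuming the pool name ssd from the log earlier, is to look at the post-resilver status and then re-read everything with a scrub:

Code:
# Confirm the resilver completed and list any files with unrecoverable errors
zpool status -v ssd

# Re-read all data and verify checksums now that the bad disk is gone
zpool scrub ssd
zpool status ssd    # shows scrub progress and any new errors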

 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Right. Everything will be normal until the ZFS code gets some garbage that makes no sense. Instant panic.

You reboot and things continue right along until you get more garbage. Instant panic (again).

This is a common problem with ZFS when you start dealing with unrepairable corruption. It is also why I have this on slide 11 of my FreeNAS for noobs presentation....

If any VDev in a zpool is failed, you will lose the entire zpool with no chance of partial recovery. (Read this again so it sinks in)

It goes from working just fine to worthless in an instant. There's no room for error. This absolute is supposed to be maintained by the server admin. Of course, too many admins don't recognize this basic philosophy and lose their data as a result. ;)
 

armsby

Dabbler
Joined
Apr 20, 2014
Messages
14
That was the warning that made me do this test. I want to see what happens in the worst case before I put data on the pool that I care about.

 