Hi,
Recently I've had some not-so-pleasant issues with one of our FreeNAS servers during disk failures. The system becomes almost completely inaccessible when one of the 6 disks in a RAID-Z2 pool fails. I've described our issues and hardware below. I would expect the pool to continue working in a degraded state; we previously ran NexentaStor on the same hardware without problems when disks failed. Any ideas what is causing FreeNAS to stall like this? Yes, I know we have crappy disks; I'll make a separate post under Hardware for suggestions.
During the first failure I got an alert email stating that "The volume pool1 (ZFS) state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected." After that, the NFS mounts were inaccessible, and so were the web UI and SSH login. I managed to use the console over IPMI, but it was slow and flooded with error messages. After a reboot everything came up and the system worked properly with a degraded pool. After running "zpool clear" the pool was healthy again.
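For reference, the whole check-and-clear sequence is just standard ZFS commands (pool1 is our pool; the per-device form is optional):
Code:
# Show pool state and per-device error counters
zpool status -v pool1

# Clear the logged errors for the whole pool
zpool clear pool1

# Or clear just the one device (use the name shown by zpool status)
zpool clear pool1 da2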
Logs from the disk failure are flooded with:
Code:
Sep 26 13:34:58 storage3 (da2:mps0:0:2:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Sep 26 13:34:58 storage3 (da2:mps0:0:2:0): CAM status: SCSI Status Error
Sep 26 13:34:58 storage3 (da2:mps0:0:2:0): SCSI status: Check Condition
Sep 26 13:34:58 storage3 (da2:mps0:0:2:0): SCSI sense: Deferred error: HARDWARE FAILURE asc:3,0 (Peripheral device write fault)
Sep 26 13:34:58 storage3 (da2:mps0:0:2:0): Info: 0x31599031
Sep 26 13:34:58 storage3 (da2:mps0:0:2:0): Field Replaceable Unit: 8
Sep 26 13:34:58 storage3 (da2:mps0:0:2:0): Actual Retry Count: 24
Sep 26 13:34:58 storage3 (da2:mps0:0:2:0): Retrying command (per sense data)
Sep 26 13:34:59 storage3 (da2:mps0:0:2:0): READ(10). CDB: 28 00 30 3d 73 20 00 00 40 00
Sep 26 13:34:59 storage3 (da2:mps0:0:2:0): CAM status: SCSI Status Error
Sep 26 13:34:59 storage3 (da2:mps0:0:2:0): SCSI status: Check Condition
Sep 26 13:34:59 storage3 (da2:mps0:0:2:0): SCSI sense: Deferred error: HARDWARE FAILURE asc:3,0 (Peripheral device write fault)
Sep 26 13:34:59 storage3 (da2:mps0:0:2:0): Info: 0x31599032
Sep 26 13:34:59 storage3 (da2:mps0:0:2:0): Field Replaceable Unit: 8
Sep 26 13:34:59 storage3 (da2:mps0:0:2:0): Actual Retry Count: 24
Sep 26 13:34:59 storage3 (da2:mps0:0:2:0): Retrying command (per sense data)
Sep 26 13:35:00 storage3 (da2:mps0:0:2:0): READ(10). CDB: 28 00 3e b2 c3 58 00 00 40 00
Sep 26 13:35:00 storage3 (da2:mps0:0:2:0): CAM status: SCSI Status Error
Sep 26 13:35:00 storage3 (da2:mps0:0:2:0): SCSI status: Check Condition
Sep 26 13:35:00 storage3 (da2:mps0:0:2:0): SCSI sense: Deferred error: HARDWARE FAILURE asc:3,0 (Peripheral device write fault)
Sep 26 13:35:00 storage3 (da2:mps0:0:2:0): Info: 0x31599033
Sep 26 13:35:00 storage3 (da2:mps0:0:2:0): Field Replaceable Unit: 8
Sep 26 13:35:00 storage3 (da2:mps0:0:2:0): Actual Retry Count: 24
Sep 26 13:35:00 storage3 (da2:mps0:0:2:0): Retrying command (per sense data)
After a while, command timeouts started to show up:
Code:
Sep 26 13:39:15 storage3 (da2:mps0:0:2:0): Retrying command (per sense data)
Sep 26 13:40:04 storage3 (da2:mps0:0:2:0): WRITE(10). CDB: 2a 00 00 40 04 50 00 00 08 00 length 4096 SMID 86 command timeout cm 0xffffff800110ee30 ccb 0xfffffe0078fa4000
Sep 26 13:40:04 storage3 (noperiph:mps0:0:4294967295:0): SMID 2 Aborting command 0xffffff800110ee30
Sep 26 13:40:04 storage3 (da2:mps0:0:2:0): WRITE(10). CDB: 2a 00 00 40 04 50 00 00 08 00
Sep 26 13:40:05 storage3 (da2:mps0:0:2:0): CAM status: Command timeout
Sep 26 13:40:05 storage3 (da2:mps0:0:2:0): Retrying command
Sep 26 13:40:05 storage3 (da2:mps0:0:2:0): WRITE(10). CDB: 2a 00 68 cb 9a 50 00 00 08 00 length 4096 SMID 628 command timeout cm 0xffffff800113a4a0 ccb 0xfffffe000f5fa800
Sep 26 13:40:05 storage3 (noperiph:mps0:0:4294967295:0): SMID 3 Aborting command 0xffffff800113a4a0
Sep 26 13:40:06 storage3 (da2:mps0:0:2:0): WRITE(10). CDB: 2a 00 68 cb 9a 50 00 00 08 00
Sep 26 13:40:06 storage3 (da2:mps0:0:2:0): CAM status: Command timeout
Sep 26 13:40:06 storage3 (da2:mps0:0:2:0): Retrying command
Sep 26 13:40:06 storage3 (da2:mps0:0:2:0): WRITE(10). CDB: 2a 00 68 cb 9c 50 00 00 08 00 length 4096 SMID 702 command timeout cm 0xffffff8001140370 ccb 0xfffffe007903b000
Sep 26 13:40:06 storage3 (noperiph:mps0:0:4294967295:0): SMID 4 Aborting command 0xffffff8001140370
Sep 26 13:40:06 storage3 (da2:mps0:0:2:0): WRITE(10). CDB: 2a 00 68 cb 9c 50 00 00 08 00
Sep 26 13:40:06 storage3 (da2:mps0:0:2:0): CAM status: Command timeout
Sep 26 13:40:06 storage3 (da2:mps0:0:2:0): Retrying command
Sep 26 13:40:07 storage3 (da2:mps0:0:2:0): READ(10). CDB: 28 00 28 5c 08 20 00 00 40 00 length 32768 SMID 904 command timeout cm 0xffffff8001150640 ccb 0xfffffe000f5f5800
Sep 26 13:40:07 storage3 (noperiph:mps0:0:4294967295:0): SMID 5 Aborting command 0xffffff8001150640
Sep 26 13:40:07 storage3 (da2:mps0:0:2:0): READ(10). CDB: 28 00 28 5c 08 20 00 00 40 00
Sep 26 13:40:07 storage3 (da2:mps0:0:2:0): CAM status: Command timeout
Sep 26 13:40:07 storage3 (da2:mps0:0:2:0): Retrying command
Sep 26 13:40:08 storage3 (da2:mps0:0:2:0): READ(10). CDB: 28 00 28 5c 0c 60 00 00 40 00 length 32768 SMID 480 command timeout cm 0xffffff800112e700 ccb 0xfffffe074c86a000
Sep 26 13:40:08 storage3 (noperiph:mps0:0:4294967295:0): SMID 6 Aborting command 0xffffff800112e700
Sep 26 13:40:08 storage3 (da2:mps0:0:2:0): READ(10). CDB: 28 00 28 5c 0c 60 00 00 40 00
Sep 26 13:40:08 storage3 (da2:mps0:0:2:0): CAM status: Command timeout
Sep 26 13:40:08 storage3 (da2:mps0:0:2:0): Retrying command
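My working guess is that these endless retries are what makes the whole box unresponsive. FreeBSD exposes the da(4) driver's retry and timeout defaults as sysctls; I have not tried changing them, so this is just a note of where the knobs live, not a recommendation:
Code:
# Inspect the CAM da(4) driver's retry/timeout defaults
sysctl kern.cam.da.retry_count
sysctl kern.cam.da.default_timeout

# These are also loader tunables, so values could be set in /boot/loader.conf
# (untested on our system), e.g.:
#   kern.cam.da.retry_count="2"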
Yesterday morning the same disk failed again. The alert mail stated that "The volume pool1 (ZFS) state is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state." Everything worked properly with a degraded pool.
The same messages appeared in the logs, but only for 1 minute and 30 seconds; after that the system carried on quietly:
Code:
Sep 30 03:02:03 storage3 (da2:mps0:0:2:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Sep 30 03:02:03 storage3 (da2:mps0:0:2:0): CAM status: SCSI Status Error
Sep 30 03:02:03 storage3 (da2:mps0:0:2:0): SCSI status: Check Condition
Sep 30 03:02:03 storage3 (da2:mps0:0:2:0): SCSI sense: Deferred error: HARDWARE FAILURE asc:3,0 (Peripheral device write fault)
Sep 30 03:02:03 storage3 (da2:mps0:0:2:0): Info: 0x31bee4fd
Sep 30 03:02:03 storage3 (da2:mps0:0:2:0): Field Replaceable Unit: 8
Sep 30 03:02:03 storage3 (da2:mps0:0:2:0): Actual Retry Count: 24
Sep 30 03:02:03 storage3 (da2:mps0:0:2:0): Retrying command (per sense data)
Sep 30 03:02:04 storage3 (da2:mps0:0:2:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Sep 30 03:02:04 storage3 (da2:mps0:0:2:0): CAM status: SCSI Status Error
Sep 30 03:02:04 storage3 (da2:mps0:0:2:0): SCSI status: Check Condition
Sep 30 03:02:04 storage3 (da2:mps0:0:2:0): SCSI sense: Deferred error: HARDWARE FAILURE asc:3,0 (Peripheral device write fault)
Sep 30 03:02:04 storage3 (da2:mps0:0:2:0): Info: 0x31bee4fe
Sep 30 03:02:04 storage3 (da2:mps0:0:2:0): Field Replaceable Unit: 8
Sep 30 03:02:04 storage3 (da2:mps0:0:2:0): Actual Retry Count: 24
Sep 30 03:02:05 storage3 (da2:mps0:0:2:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Sep 30 03:02:05 storage3 (da2:mps0:0:2:0): CAM status: SCSI Status Error
Sep 30 03:02:05 storage3 (da2:mps0:0:2:0): SCSI status: Check Condition
Sep 30 03:02:05 storage3 (da2:mps0:0:2:0): SCSI sense: Deferred error: HARDWARE FAILURE asc:3,0 (Peripheral device write fault)
Sep 30 03:02:05 storage3 (da2:mps0:0:2:0): Info: 0x31bee4ff
Sep 30 03:02:05 storage3 (da2:mps0:0:2:0): Field Replaceable Unit: 8
Sep 30 03:02:05 storage3 (da2:mps0:0:2:0): Actual Retry Count: 24
Sep 30 03:02:05 storage3 (da2:mps0:0:2:0): Retrying command (per sense data)
Sep 30 03:02:06 storage3 (da2:mps0:0:2:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
When I got to work I decided to switch to a spare disk. I accidentally replaced the failed disk with a disk that had had issues previously, so after the resilver finished things went bad again: FreeNAS was inaccessible just like the first time. This time we got no email, and after a reboot the system did not come up again; I guess it was busy retrying reads on the bad disk. After we pulled the bad spare the machine started normally, and we could redo the replacement with the correct spare.
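For completeness, here is a command-line sketch of the replacement procedure (device names are illustrative; on FreeNAS the pool members show up as gptid labels, so check zpool status for the real names):
Code:
# Identify the faulted member
zpool status -v pool1

# Take the failing disk offline before pulling it (da2 is illustrative)
zpool offline pool1 da2

# Replace it with the spare; this starts the resilver (da8 is illustrative)
zpool replace pool1 da2 da8

# Watch resilver progress
zpool status pool1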
Logs were filled with these messages, which continued on the console after the reboot:
Code:
Sep 30 15:43:24 storage3 (da10:mps2:0:0:0): READ(10). CDB: 28 00 00 40 02 a0 00 00 e0 00
Sep 30 15:43:24 storage3 (da10:mps2:0:0:0): CAM status: SCSI Status Error
Sep 30 15:43:24 storage3 (da10:mps2:0:0:0): SCSI status: Check Condition
Sep 30 15:43:24 storage3 (da10:mps2:0:0:0): SCSI sense: HARDWARE FAILURE asc:32,0 (No defect spare location available)
Sep 30 15:43:24 storage3 (da10:mps2:0:0:0): Info: 0x4002d3
Sep 30 15:43:24 storage3 (da10:mps2:0:0:0): Field Replaceable Unit: 157
Sep 30 15:43:24 storage3 (da10:mps2:0:0:0): Command Specific Info: 0xa1615169
Sep 30 15:43:24 storage3 (da10:mps2:0:0:0): Actual Retry Count: 255
Sep 30 15:43:24 storage3 (da10:mps2:0:0:0): Retrying command (per sense data)
Sep 30 15:43:24 storage3 (da10:mps2:0:0:0): READ(10). CDB: 28 00 00 40 02 a0 00 00 e0 00
Sep 30 15:43:57 storage3 (da10:mps2:0:0:0): CAM status: SCSI Status Error
Sep 30 15:43:57 storage3 (da10:mps2:0:0:0): SCSI status: Check Condition
Sep 30 15:43:57 storage3 (da10:mps2:0:0:0): SCSI sense: HARDWARE FAILURE asc:32,0 (No defect spare location available)
Sep 30 15:43:57 storage3 (da10:mps2:0:0:0): Info: 0x4002d3
Sep 30 15:43:57 storage3 (da10:mps2:0:0:0): Field Replaceable Unit: 157
Sep 30 15:43:57 storage3 (da10:mps2:0:0:0): Command Specific Info: 0xa1615169
Sep 30 15:43:57 storage3 (da10:mps2:0:0:0): Actual Retry Count: 255
Sep 30 15:43:57 storage3 (da10:mps2:0:0:0): Retrying command (per sense data)
Sep 30 15:43:57 storage3 (da10:mps2:0:0:0): READ(10). CDB: 28 00 00 40 02 a0 00 00 e0 00
Sep 30 15:43:57 storage3 (da10:mps2:0:0:0): CAM status: SCSI Status Error
Sep 30 15:43:57 storage3 (da10:mps2:0:0:0): SCSI status: Check Condition
Sep 30 15:43:57 storage3 (da10:mps2:0:0:0): SCSI sense: HARDWARE FAILURE asc:32,0 (No defect spare location available)
Sep 30 15:43:57 storage3 (da10:mps2:0:0:0): Info: 0x4002c1
Sep 30 15:43:57 storage3 (da10:mps2:0:0:0): Field Replaceable Unit: 157
Sep 30 15:43:57 storage3 (da10:mps2:0:0:0): Command Specific Info: 0xa1615169
Sep 30 15:43:57 storage3 (da10:mps2:0:0:0): Actual Retry Count: 255
Sep 30 15:43:57 storage3 (da10:mps2:0:0:0): Retrying command (per sense data)
Sep 30 15:43:57 storage3 (da10:mps2:0:0:0): READ(10). CDB: 28 00 00 40 02 a0 00 00 e0 00
Sep 30 15:43:57 storage3 (da10:mps2:0:0:0): CAM status: SCSI Status Error
Sep 30 15:43:57 storage3 (da10:mps2:0:0:0): SCSI status: Check Condition
Sep 30 15:43:57 storage3 (da10:mps2:0:0:0): SCSI sense: HARDWARE FAILURE asc:32,0 (No defect spare location available)
Sep 30 15:43:57 storage3 (da10:mps2:0:0:0): Info: 0x4002c1
Sep 30 15:43:57 storage3 (da10:mps2:0:0:0): Field Replaceable Unit: 157
Sep 30 15:43:57 storage3 (da10:mps2:0:0:0): Command Specific Info: 0xa1615169
Sep 30 15:43:57 storage3 (da10:mps2:0:0:0): Actual Retry Count: 255
Sep 30 15:43:57 storage3 (da10:mps2:0:0:0): Retrying command (per sense data)
Sep 30 15:43:57 storage3 (da10:mps2:0:0:0): READ(10). CDB: 28 00 00 40 02 a0 00 00 e0 00
Sep 30 15:43:57 storage3 (da10:mps2:0:0:0): CAM status: SCSI Status Error
Sep 30 15:43:57 storage3 (da10:mps2:0:0:0): SCSI status: Check Condition
Sep 30 15:43:57 storage3 (da10:mps2:0:0:0): SCSI sense: HARDWARE FAILURE asc:32,0 (No defect spare location available)
Sep 30 15:43:57 storage3 (da10:mps2:0:0:0): Info: 0x4002d0
Sep 30 15:43:57 storage3 (da10:mps2:0:0:0): Field Replaceable Unit: 157
Sep 30 15:43:57 storage3 (da10:mps2:0:0:0): Command Specific Info: 0xa1615169
Sep 30 15:43:57 storage3 (da10:mps2:0:0:0): Actual Retry Count: 255
Sep 30 15:43:57 storage3 (da10:mps2:0:0:0): Error 5, Retries exhausted
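If anyone else hits this: before trusting a disk again it is worth reading its own error log with smartctl, which ships with FreeNAS (da10 is the device from the log above):
Code:
# Health summary, defect lists and error counters
smartctl -a /dev/da10

# Run a short self-test, then read the result a couple of minutes later
smartctl -t short /dev/da10
smartctl -l selftest /dev/da10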
Hardware
- Supermicro X8DTH
- 2x E5620 @ 2.40GHz
- 48GB DDR3 RAM
- 2x LSI SAS9211-8i (IT) + built-in SMC2008IT
- 4x Intel 311 20GB SSD (syspool + ZIL)
- 8x Seagate Savvio 10K.5 64MB 900GB SAS (6-disk RAID-Z2 + 2 spares)
- 2x Intel 510 Series 2.5" SSD 250GB (cache)