Disk in the same slot keeps failing...

Status
Not open for further replies.

vonion

Cadet
Joined
Feb 24, 2018
Messages
4
This is a setup with 9 x SM863a in RAIDZ2 in a SYS-1028U-TNRT+ chassis; the drives and system were converted from a working VSAN. da2 has failed twice since I finished the setup today. I have replaced the disk in da2 twice, and each time, once resilvering is done, it starts to show checksum errors again...

I'm new to FreeNAS, so please let me know if you need additional info; otherwise, the logs are below.

Code:
zpool status
  pool: FNAS03-Z2
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 299G in 0 days 02:04:20 with 0 errors on Sun Feb 25 02:58:28 2018
config:

  NAME                                            STATE     READ WRITE CKSUM
  FNAS03-Z2                                       ONLINE       0     0     0
    raidz2-0                                      ONLINE       0     0     0
      gptid/79a29617-19d0-11e8-957a-002590fa888d  ONLINE       0     0     0
      gptid/7a040ad1-19d0-11e8-957a-002590fa888d  ONLINE       0     0     0
      gptid/7a663ebd-19d0-11e8-957a-002590fa888d  ONLINE       0     0     0
      gptid/69598304-1a09-11e8-957a-002590fa888d  ONLINE       0     0    13
      gptid/7b4dd453-19d0-11e8-957a-002590fa888d  ONLINE       0     0     0
      gptid/7bb14e7d-19d0-11e8-957a-002590fa888d  ONLINE       0     0     0
      gptid/7c14049a-19d0-11e8-957a-002590fa888d  ONLINE       0     0     0
      gptid/7c77b5b5-19d0-11e8-957a-002590fa888d  ONLINE       0     0     0
      gptid/7cdb5e0f-19d0-11e8-957a-002590fa888d  ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: none requested
config:

  NAME          STATE     READ WRITE CKSUM
  freenas-boot  ONLINE       0     0     0
    ada0p2      ONLINE       0     0     0

errors: No known data errors
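
For reference, the action line in the status output above maps to roughly the commands below (the pool name and the failing member's gptid are taken from the output; the new member's gptid is a placeholder, and on FreeNAS the GUI's disk-replace wizard is the usual route since it also sets up the partitions and swap mirror):

Code:
# Option 1: clear the error counters, then scrub to see if CKSUM errors recur
zpool clear FNAS03-Z2
zpool scrub FNAS03-Z2

# Option 2: replace the flaky member outright (new gptid is hypothetical)
zpool replace FNAS03-Z2 gptid/69598304-1a09-11e8-957a-002590fa888d gptid/<new-partition>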


Code:
Feb 25 00:49:45 FNAS03 mpr0: (da2:mpr0:0:4:0): CAM status: CCB request completed with an error
Feb 25 00:49:45 FNAS03 (da2:mpr0:0:4:0): Error 5, Periph was invalidated
Feb 25 00:49:45 FNAS03 (da2:mpr0:0:4:0): READ(10). CDB: 28 00 95 f7 6c 38 00 00 08 00
Feb 25 00:49:45 FNAS03 clearing target 4 handle 0x000b
Feb 25 00:49:45 FNAS03 mpr0: At enclosure level 0, slot 2, connector name (	)
Feb 25 00:49:45 FNAS03 mpr0: Unfreezing devq for target ID 4
Feb 25 00:49:45 FNAS03 (da2:mpr0:0:4:0): CAM status: CCB request completed with an error
Feb 25 00:49:45 FNAS03 ZFS: vdev state changed, pool_guid=2025594203911010728 vdev_guid=14089127837580570434
Feb 25 00:49:45 FNAS03 ZFS: vdev state changed, pool_guid=2025594203911010728 vdev_guid=14089127837580570434
Feb 25 00:49:45 FNAS03 ZFS: vdev state changed, pool_guid=2025594203911010728 vdev_guid=14089127837580570434
Feb 25 00:49:45 FNAS03 (da2:mpr0:0:4:0): Error 5, Periph was invalidated
Feb 25 00:49:45 FNAS03 (da2:mpr0:0:4:0): READ(10). CDB: 28 00 95 f7 6c 10 00 00 08 00
Feb 25 00:49:45 FNAS03 (da2:mpr0:0:4:0): CAM status: CCB request completed with an error
Feb 25 00:49:45 FNAS03 (da2:mpr0:0:4:0): Error 5, Periph was invalidated
Feb 25 00:49:45 FNAS03 (da2:mpr0:0:4:0): READ(10). CDB: 28 00 95 f7 6b f8 00 00 08 00
Feb 25 00:49:45 FNAS03 (da2:mpr0:0:4:0): CAM status: CCB request completed with an error
Feb 25 00:49:45 FNAS03 (da2:mpr0:0:4:0): Error 5, Periph was invalidated
Feb 25 00:49:45 FNAS03 (da2:mpr0:0:4:0): READ(10). CDB: 28 00 95 f7 6b b8 00 00 08 00
Feb 25 00:49:45 FNAS03 (da2:mpr0:0:4:0): CAM status: CCB request completed with an error
Feb 25 00:49:45 FNAS03 (da2:mpr0:0:4:0): Error 5, Periph was invalidated
Feb 25 00:49:45 FNAS03 (da2:mpr0:0:4:0): READ(10). CDB: 28 00 95 f7 6b 98 00 00 08 00
Feb 25 00:49:45 FNAS03 (da2:mpr0:0:4:0): CAM status: CCB request completed with an error
Feb 25 00:49:45 FNAS03 (da2:mpr0:0:4:0): Error 5, Periph was invalidated
Feb 25 00:49:45 FNAS03 GEOM_MIRROR: Device swap2: provider da2p1 disconnected.
Feb 25 00:49:45 FNAS03 (da2:mpr0:0:4:0): Periph destroyed
Feb 25 00:49:59 FNAS03 daemon[6144]:	 2018/02/25 00:49:59 [WARN] agent: Check 'freenas_health' is now warning
Feb 25 00:52:00 FNAS03 daemon[6144]:	 2018/02/25 00:52:00 [WARN] agent: Check 'freenas_health' is now warning
Feb 25 00:52:05 FNAS03 mpr0: SAS Address for SATA device = 947e9634e1c1a8d8
Feb 25 00:52:05 FNAS03 mpr0: SAS Address from SAS device page0 = 4433221102000000
Feb 25 00:52:05 FNAS03 mpr0: SAS Address from SATA device = 947e9634e1c1a8d8
Feb 25 00:52:05 FNAS03 mpr0: Found device <881<SataDev,Direct>,End Device> <6.0Gbps> handle<0x000b> enclosureHandle<0x0001> slot 2
Feb 25 00:52:05 FNAS03 mpr0: At enclosure level 0 and connector name (	)
Feb 25 00:52:05 FNAS03 da2 at mpr0 bus 0 scbus0 target 4 lun 0
Feb 25 00:52:05 FNAS03 da2: <ATA SAMSUNG MZ7KM1T9 104Q> Fixed Direct Access SPC-4 SCSI device
Feb 25 00:52:05 FNAS03 da2: Serial Number S3HUNX0J601476Z
Feb 25 00:52:05 FNAS03 da2: 600.000MB/s transfers
Feb 25 00:52:05 FNAS03 da2: Command Queueing enabled
Feb 25 00:52:05 FNAS03 da2: 1831420MB (3750748848 512 byte sectors)
Feb 25 00:52:05 FNAS03 da2: quirks=0x8<4K>
Feb 25 00:53:55 FNAS03 notifier: 32+0 records in
Feb 25 00:53:55 FNAS03 notifier: 32+0 records out
Feb 25 00:53:55 FNAS03 notifier: 33554432 bytes transferred in 0.098817 secs (339562990 bytes/sec)
Feb 25 00:53:55 FNAS03 GEOM: da2: the primary GPT table is corrupt or invalid.
Feb 25 00:53:55 FNAS03 GEOM: da2: using the secondary instead -- recovery strongly advised.
Feb 25 00:53:55 FNAS03 notifier: dd: /dev/da2: short write on character device
Feb 25 00:53:55 FNAS03 notifier: dd: /dev/da2: end of device
Feb 25 00:53:55 FNAS03 notifier: 33+0 records in
Feb 25 00:53:55 FNAS03 notifier: 32+1 records out
Feb 25 00:53:55 FNAS03 notifier: 33906688 bytes transferred in 0.083948 secs (403901163 bytes/sec)
Feb 25 00:53:56 FNAS03 notifier: gpart: arg0 'gptid/69598304-1a09-11e8-957a-002590fa888d': Invalid argument
Feb 25 00:53:56 FNAS03 notifier: gpart: provider: Device not configured
Feb 25 00:53:56 FNAS03 notifier: gpart: arg0 'gptid/69598304-1a09-11e8-957a-002590fa888d': Invalid argument
Feb 25 00:53:56 FNAS03 ZFS: vdev state changed, pool_guid=2025594203911010728 vdev_guid=10055579806217775314
Feb 25 00:54:00 FNAS03 daemon[6144]:	 2018/02/25 00:54:00 [WARN] agent: Check 'freenas_health' is now warning
Feb 25 00:54:05 FNAS03 ZFS: vdev state changed, pool_guid=2025594203911010728 vdev_guid=13070446044546478261
Feb 25 00:54:05 FNAS03 ZFS: vdev state changed, pool_guid=2025594203911010728 vdev_guid=15261218018251548598
Feb 25 00:54:05 FNAS03 ZFS: vdev state changed, pool_guid=2025594203911010728 vdev_guid=13671545178972892527
Feb 25 00:54:05 FNAS03 ZFS: vdev state changed, pool_guid=2025594203911010728 vdev_guid=10055579806217775314
Feb 25 00:54:05 FNAS03 ZFS: vdev state changed, pool_guid=2025594203911010728 vdev_guid=14932043723365803802
Feb 25 00:54:05 FNAS03 ZFS: vdev state changed, pool_guid=2025594203911010728 vdev_guid=12564594350981481757
Feb 25 00:54:05 FNAS03 ZFS: vdev state changed, pool_guid=2025594203911010728 vdev_guid=5704886040987976214
Feb 25 00:54:05 FNAS03 ZFS: vdev state changed, pool_guid=2025594203911010728 vdev_guid=7707503818947279095
Feb 25 00:54:05 FNAS03 ZFS: vdev state changed, pool_guid=2025594203911010728 vdev_guid=11776749107165575220
Feb 25 00:54:09 FNAS03 GEOM_ELI: Device mirror/swap2.eli destroyed.
Feb 25 00:54:09 FNAS03 GEOM_MIRROR: Device swap2: provider destroyed.
Feb 25 00:54:09 FNAS03 GEOM_MIRROR: Device swap2 destroyed.
Feb 25 00:54:10 FNAS03 GEOM_MIRROR: Device mirror/swap2 launched (2/2).
Feb 25 00:54:10 FNAS03 GEOM_ELI: Device mirror/swap2.eli created.
Feb 25 00:54:10 FNAS03 GEOM_ELI: Encryption: AES-XTS 128
Feb 25 00:54:10 FNAS03 GEOM_ELI:	 Crypto: hardware
Feb 25 00:56:01 FNAS03 daemon[6144]:	 2018/02/25 00:56:01 [WARN] agent: Check 'freenas_health' is now warning
Feb 25 00:58:01 FNAS03 daemon[6144]:	 2018/02/25 00:58:01 [WARN] agent: Check 'freenas_health' is now warning
Feb 25 01:00:01 FNAS03 daemon[6144]:	 2018/02/25 01:00:01 [WARN] agent: Check 'freenas_health' is now warning
 

hescominsoon

Patron
Joined
Jul 27, 2016
Messages
456
If the same slot is failing, then you have a hardware problem. Checksum errors are usually indicative of a cable problem, a backplane problem, or an HBA issue. If you have another bay, move the affected drive to that bay and see if you can successfully add it back into the pool. If it works, the drive is OK and you know something is up with that bay.
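
A minimal sketch of that test, assuming the drive keeps its gptid after the move (the pool name and gptid are from your status output above; the rest is standard FreeBSD/ZFS):

Code:
# After moving the drive to the spare bay, confirm where CAM attached it
camcontrol devlist

# Bring the member back online so ZFS resilvers any missed writes
zpool online FNAS03-Z2 gptid/69598304-1a09-11e8-957a-002590fa888d

# Scrub and watch the CKSUM column; if it stays at 0, suspect the old bay
zpool scrub FNAS03-Z2
zpool status -v FNAS03-Z2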
 

vonion

Cadet
Joined
Feb 24, 2018
Messages
4
If the same slot is failing, then you have a hardware problem. Checksum errors are usually indicative of a cable problem, a backplane problem, or an HBA issue. If you have another bay, move the affected drive to that bay and see if you can successfully add it back into the pool. If it works, the drive is OK and you know something is up with that bay.

Thanks! I'll try that.

I have another node with an identical hardware configuration. If I shut down the system, swap all the drives over, and import the configuration on the new node, would it be sane to expect it to boot up and work?

My plan is to migrate the data to that node so I can experiment with the current one: check the cables, or update firmware.
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
I have another node with an identical hardware configuration. If I shut down the system, swap all the drives over, and import the configuration on the new node, would it be sane to expect it to boot up and work?
Yes, but it seems "bootup" IS working. What you appear to have is a hardware error associated with da2. @hescominsoon pointed you to the hardware issues to address and how to do so.

If you don't have a spare slot to which to move the current da2 drive for a test, try moving the drive around and see if the problem follows the drive or stays with the port/cable combination. Deal with what you find as a process of elimination. Or, you can try moving all the drives to the other box and importing the pool there, as you suggest.
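
If you go the whole-box route, export cleanly before pulling the drives; the raw ZFS side looks roughly like this (pool name is from your output; on FreeNAS you would normally do the same thing via Detach Volume and then Import Volume in the GUI, optionally restoring the saved config database so shares and users come across):

Code:
# On the old node, before shutting down and pulling the drives
zpool export FNAS03-Z2

# On the new node, after the drives are installed
zpool import              # lists pools visible to this host
zpool import FNAS03-Z2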
 

vonion

Cadet
Joined
Feb 24, 2018
Messages
4
Yes, but it seems "bootup" IS working. What you appear to have is a hardware error associated with da2. @hescominsoon pointed you to the hardware issues to address and how to do so.

If you don't have a spare slot to which to move the current da2 drive for a test, try moving the drive around and see if the problem follows the drive or stays with the port/cable combination. Deal with what you find as a process of elimination. Or, you can try moving all the drives to the other box and importing the pool there, as you suggest.
Thanks. I rebuilt the volume on the spare node with the same drives; it's now running a stress test and I don't see any errors.

I'm going to get some more drives and test the old node to see if the checksum errors continue.
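
The stress test is nothing fancy; it's roughly the below, where the target path is just an example:

Code:
# Hammer the pool with sequential writes (~100 GB; path is hypothetical)
dd if=/dev/urandom of=/mnt/FNAS03-Z2/stress-testfile bs=1m count=100000

# Then verify every block and check the error counters
zpool scrub FNAS03-Z2
zpool status -v FNAS03-Z2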
 