SteezeMcQueen
Basic details:
FreeNAS 11.3 (both U2 and U3) running in a VM on ESXi 6.7.
Physically, there are two Bulldozer CPUs (Opteron 6276) and 54GB of ECC RAM. The FreeNAS VM has 8 vCPU and 32GB RAM assigned. The LSI 9211-8i HBA, connected to four WD Red 3TB drives, is passed through from ESXi.
FreeNAS has been running happily in this VM for several months, and as a bare-metal install on the same hardware for years prior.
Situation:
Over the past two weeks, I have seen intermittent degradation of my pool. It started on May 16th, when my pool threw this alert: "Pool tycho-pool state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error." Logging in and running `zpool status` showed that the pool was Online and that the drive corresponding to /dev/da3 had a nonzero READ or WRITE error count. I unfortunately don't have a record of the exact numbers.
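For anyone reproducing the check: the per-device error counters come from `zpool status`, and the gptid labels it shows can be mapped back to a daX device with `glabel status`. Commands only here, since I don't have the original output saved:
Code:
# Per-device READ / WRITE / CKSUM counters, plus any files with known errors
zpool status -v tycho-pool

# Map the gptid labels in the pool output back to daX devices
glabel status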
I looked at `smartctl -a /dev/da3` and didn't find any failures. After poking around on the FreeNAS forums and Reddit, I chalked it up to gremlins, ran a `zpool clear`, and went about my business to see if it would reappear.
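For completeness, that amounted to:
Code:
# Review SMART attributes and the overall health assessment for the suspect disk
smartctl -a /dev/da3

# Reset the pool's error counters and wait to see if they come back
zpool clear tycho-pool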
On May 19th, I upgraded from 11.3U2 to U3 and rebooted. Later that night, another alert was triggered: "Pool tycho-pool state is ONLINE: One or more devices has experienced an unrecoverable error." Fourteen minutes later, the pool entered the degraded state: "Pool tycho-pool state is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state."
Again, I don't have historical data for the specific values, but once again /dev/da3 was the drive at fault. However, this time /dev/da3 was no longer showing up as attached to the system; my devices ended at da2. I took this as a sign that this was a legitimate drive failure (as I've experienced in the past), ordered some new drives, and powered down the FreeNAS VM.
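For anyone hitting the same thing, confirming from the shell whether the disk is visible to FreeBSD at all can be done with something like:
Code:
# List every disk CAM currently sees on the passed-through HBA
camcontrol devlist

# Look for detach or timeout messages from the mps driver for that disk
dmesg | grep -iE 'da3|mps'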
Last night, I powered the VM back up. The pool came back in the Online state, all drives present. Looking at `zpool status`, a different drive (da2) showed a nonzero CKSUM value (20-something). I rebooted FreeNAS a second time, and the pool came back Online and fully healthy, with `zpool status` showing all zeros.
I scheduled a battery of Long SMART tests to run across all the pool's drives that night (they all passed as of this morning), and I'm now running a scrub.
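For reference, the shell equivalent of that battery would be something along these lines:
Code:
# Kick off a long (extended) SMART self-test on each pool member
for d in da0 da1 da2 da3; do smartctl -t long /dev/$d; done

# Start the scrub and watch its progress
zpool scrub tycho-pool
zpool status tycho-pool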
`smartctl -a` shows RAW_VALUE 0 for IDs 5, 196, and 197:
Code:
DRIVE    ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
/dev/da0 5   Reallocated_Sector_Ct   0x0033 200   200   140    Pre-fail Always  -           0
         196 Reallocated_Event_Count 0x0032 200   200   000    Old_age  Always  -           0
         197 Current_Pending_Sector  0x0032 200   200   000    Old_age  Always  -           0
/dev/da1 5   Reallocated_Sector_Ct   0x0033 200   200   140    Pre-fail Always  -           0
         196 Reallocated_Event_Count 0x0032 200   200   000    Old_age  Always  -           0
         197 Current_Pending_Sector  0x0032 200   200   000    Old_age  Always  -           0
/dev/da2 5   Reallocated_Sector_Ct   0x0033 200   200   140    Pre-fail Always  -           0
         196 Reallocated_Event_Count 0x0032 200   200   000    Old_age  Always  -           0
         197 Current_Pending_Sector  0x0032 200   200   000    Old_age  Always  -           0
/dev/da3 5   Reallocated_Sector_Ct   0x0033 200   200   140    Pre-fail Always  -           0
         196 Reallocated_Event_Count 0x0032 200   200   000    Old_age  Always  -           0
         197 Current_Pending_Sector  0x0032 200   200   000    Old_age  Always  -           0
All of these attributes are flagged as TYPE Pre-fail or Old_age, which I don't find too shocking considering these drives have roughly 36,000 power-on hours (da2 is the baby at 18,000). Everything I've read indicates that the RAW_VALUE is what matters.
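In case it's useful, something along these lines is a quick way to pull just those three attributes for all four drives:
Code:
# Print attributes 5, 196, and 197 for each pool member
for d in da0 da1 da2 da3; do
    echo "/dev/$d"
    smartctl -A /dev/$d | egrep '^ *(5|196|197) '
done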
For each drive, the last 21 SMART self-tests completed without error. I follow the recommended schedule of Short tests on days 5, 12, 19, and 26 and Long tests on days 8 and 22, so that history extends back well before the initial alert on May 16.
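That per-drive test history is what `smartctl -l selftest` reports, e.g.:
Code:
# Self-test history (result, lifetime hours, LBA of first error) for one drive; repeat for da1-da3
smartctl -l selftest /dev/da0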
So, my conundrum: What's actually failing here? The drives are certainly old enough that I won't be shocked if they're bad, but the SMART data doesn't seem to back that up. It could also be the HBA or the cables; is there a good way to test those that doesn't involve simply replacing them? (I'm not opposed, but I don't have any spares on hand.)
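The only non-destructive checks I can think of in the meantime are watching for controller-level noise in the logs while the scrub runs and double-checking the HBA firmware, something like:
Code:
# CAM retries, timeouts, or bus resets would point at cabling or the HBA rather than the platters
grep -iE 'cam|mps|timeout|retry' /var/log/messages

# Confirm the 9211-8i is on IT-mode firmware (P20 is the usual recommendation), if sas2flash is available
sas2flash -listall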
I have new drives, but if it's not actually the drives that are failing, I'm loath to replace them when I don't really need to.
As I said, I'm running a scrub now. When that completes, my next step is to run `badblocks`, unless there's a better one that I'm unaware of.
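(For clarity, the `badblocks` run I have in mind is the destructive write-mode test, so it would only go against a disk after it has been taken out of the pool. Roughly:)
Code:
# Destructive read-write surface test: writes four patterns and verifies each
# -b 4096 matches the 4K physical sectors, -s shows progress, -o logs any bad blocks found
# WARNING: this wipes everything on the target disk
badblocks -wsv -b 4096 -o da3-badblocks.txt /dev/da3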