SYNCHRONIZE CACHE command timeout error

Pheran · Jun 3, 2017

Hi folks,

So I'm in the process of doing a major storage upgrade on my FreeNAS 9.10.2-U4 box (see sig link for full specs), slowly transitioning from 8x4TB to 8x10TB (RAIDZ2). I've swapped out 2 disks so far, da0 and da1, with no issues and successful resilvers. I was planning to swap out a 3rd disk this evening but when I pulled up my IPMI console getting ready to shut down I spotted this error on da0, which is one of the new 10TB drives (Seagate Ironwolf ST10000VN0004). I've grepped my messages logs and this is the only time this error has occurred. This disk controller is the builtin SAS2308 controller running IT firmware 20.00.04.

Code:

Jun  3 05:35:00 vault  (da0:mps0:0:0:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 227 command timeout cm 0xfffffe00009ca9f0 ccb 0xfffff80094e94800
Jun  3 05:35:00 vault  (noperiph:mps0:0:4294967295:0): SMID 1 Aborting command 0xfffffe00009ca9f0
Jun  3 05:35:00 vault mps0: Sending reset from mpssas_send_abort for target ID 0
Jun  3 05:35:00 vault  (da0:mps0:0:0:0): WRITE(16). CDB: 8a 00 00 00 00 01 69 90 65 b0 00 00 00 10 00 00 length 8192 SMID 161 terminated ioc 804b scsi 0 state c xfer 0
Jun  3 05:35:00 vault  (da0:mps0:0:0:0): WRITE(16). CDB: 8a 00 00 00 00 01 69 92 19 d0 00 00 00 08 00 00 length 4096 SMID 781 terminated ioc 804b sc(da0:mps0:0:0:0): WRITE(16). CDB: 8a 00 00 00 00 01 69 90 65 b0 00 00 00 1
0 00 00
Jun  3 05:35:00 vault si 0 state c xfer 0
Jun  3 05:35:00 vault (da0:mps0:0:0:0): CAM status: CCB request completed with an error
Jun  3 05:35:00 vault mps0: (da0:Unfreezing devq for target ID 0
Jun  3 05:35:00 vault mps0:0:0:0): Retrying command
Jun  3 05:35:00 vault (da0:mps0:0:0:0): WRITE(16). CDB: 8a 00 00 00 00 01 69 92 19 d0 00 00 00 08 00 00
Jun  3 05:35:00 vault (da0:mps0:0:0:0): CAM status: CCB request completed with an error
Jun  3 05:35:00 vault (da0:mps0:0:0:0): Retrying command
Jun  3 05:35:00 vault (da0:mps0:0:0:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Jun  3 05:35:00 vault (da0:mps0:0:0:0): CAM status: Command timeout
Jun  3 05:35:00 vault (da0:mps0:0:0:0): Retrying command
Jun  3 05:35:00 vault (da0:mps0:0:0:0): WRITE(16). CDB: 8a 00 00 00 00 01 69 92 19 d0 00 00 00 08 00 00
Jun  3 05:35:00 vault (da0:mps0:0:0:0): CAM status: SCSI Status Error
Jun  3 05:35:00 vault (da0:mps0:0:0:0): SCSI status: Check Condition
Jun  3 05:35:00 vault (da0:mps0:0:0:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun  3 05:35:00 vault (da0:mps0:0:0:0): Retrying command (per sense data)
Jun  3 05:35:15 vault (da0:mps0:0:0:0): WRITE(16). CDB: 8a 00 00 00 00 01 69 92 20 e0 00 00 00 08 00 00
Jun  3 05:35:15 vault (da0:mps0:0:0:0): CAM status: SCSI Status Error
Jun  3 05:35:15 vault (da0:mps0:0:0:0): SCSI status: Check Condition
Jun  3 05:35:15 vault (da0:mps0:0:0:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun  3 05:35:15 vault (da0:mps0:0:0:0): Retrying command (per sense data)

Here's the SMART data for da0, which has only been in service for a short time.

Code:

[root@vault] /var/log# smartctl -a /dev/da0
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:  ST10000VN0004-1ZD101
Serial Number:  ZA215PKR
LU WWN Device Id: 5 000c50 0a25fd00d
Firmware Version: SC60
User Capacity:  10,000,831,348,736 bytes [10.0 TB]
Sector Sizes:  512 bytes logical, 4096 bytes physical
Rotation Rate:  7200 rpm
Form Factor:  3.5 inches
Device is:  Not in smartctl database [for details use: -P showall]
ATA Version is:  ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:  Sat Jun  3 22:51:29 2017 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
  was completed without error.
  Auto Offline Data Collection: Enabled.
Self-test execution status:  (  0) The previous self-test routine completed
  without error or no self-test has ever
  been run.
Total time to complete Offline
data collection:  (  575) seconds.
Offline data collection
capabilities:  (0x7b) SMART execute Offline immediate.
  Auto Offline data collection on/off support.
  Suspend Offline collection upon new
  command.
  Offline surface scan supported.
  Self-test supported.
  Conveyance Self-test supported.
  Selective Self-test supported.
SMART capabilities:  (0x0003) Saves SMART data before entering
  power-saving mode.
  Supports SMART auto save timer.
Error logging capability:  (0x01) Error logging supported.
  General Purpose Logging supported.
Short self-test routine
recommended polling time:  (  1) minutes.
Extended self-test routine
recommended polling time:  ( 913) minutes.
Conveyance self-test routine
recommended polling time:  (  2) minutes.
SCT capabilities:  (0x50bd) SCT Status supported.
  SCT Error Recovery Control supported.
  SCT Feature Control supported.
  SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME  FLAG  VALUE WORST THRESH TYPE  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate  0x000f  081  066  044  Pre-fail  Always  -  135416632
  3 Spin_Up_Time  0x0003  097  097  000  Pre-fail  Always  -  0
  4 Start_Stop_Count  0x0032  100  100  020  Old_age  Always  -  2
  5 Reallocated_Sector_Ct  0x0033  100  100  010  Pre-fail  Always  -  0
  7 Seek_Error_Rate  0x000f  078  061  045  Pre-fail  Always  -  64962889
  9 Power_On_Hours  0x0032  100  100  000  Old_age  Always  -  240 (177 151 0)
10 Spin_Retry_Count  0x0013  100  100  097  Pre-fail  Always  -  0
12 Power_Cycle_Count  0x0032  100  100  020  Old_age  Always  -  2
184 End-to-End_Error  0x0032  100  100  099  Old_age  Always  -  0
187 Reported_Uncorrect  0x0032  100  100  000  Old_age  Always  -  0
188 Command_Timeout  0x0032  100  100  000  Old_age  Always  -  0
189 High_Fly_Writes  0x003a  099  099  000  Old_age  Always  -  1
190 Airflow_Temperature_Cel 0x0022  060  056  040  Old_age  Always  -  40 (Min/Max 35/44)
191 G-Sense_Error_Rate  0x0032  100  100  000  Old_age  Always  -  201
192 Power-Off_Retract_Count 0x0032  100  100  000  Old_age  Always  -  1
193 Load_Cycle_Count  0x0032  100  100  000  Old_age  Always  -  112
194 Temperature_Celsius  0x0022  040  044  000  Old_age  Always  -  40 (0 26 0 0 0)
195 Hardware_ECC_Recovered  0x001a  008  006  000  Old_age  Always  -  135416632
197 Current_Pending_Sector  0x0012  100  100  000  Old_age  Always  -  0
198 Offline_Uncorrectable  0x0010  100  100  000  Old_age  Offline  -  0
199 UDMA_CRC_Error_Count  0x003e  200  200  000  Old_age  Always  -  0
200 Multi_Zone_Error_Rate  0x0023  100  100  001  Pre-fail  Always  -  0
240 Head_Flying_Hours  0x0000  100  253  000  Old_age  Offline  -  238 (182 72 0)
241 Total_LBAs_Written  0x0000  100  253  000  Old_age  Offline  -  5590811588
242 Total_LBAs_Read  0x0000  100  253  000  Old_age  Offline  -  10854800400

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description  Status  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline  Completed without error  00%  85  -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
  1  0  0  Not_testing
  2  0  0  Not_testing
  3  0  0  Not_testing
  4  0  0  Not_testing
  5  0  0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Does anyone have any idea what this means, or what kind of problem it really indicates? I've launched a long SMART test against da0 to see if that turns up anything. Normally those run once a month but it hasn't been in service long enough to have one yet.

Dice · Jun 4, 2017

Have you burned in the drive?

Pheran · Jun 4, 2017

Dice said:
Have you burned in the drive?

No, I was being lazy. The long SMART test completed without error. So far the timeout error has not repeated, so I'm not sure what to think.

Dice · Jun 5, 2017

Pheran said:
No, I was being lazy.

It might not pay off.
Hope the drive is actually holding up down the line (once it sees simliar stress to the burnin test).

Pheran · Jun 5, 2017

Dice said:
It might not pay off.
Hope the drive is actually holding up down the line (once it sees simliar stress to the burnin test).

It's RAIDZ2, I'm not terribly concerned. And this doesn't really feel like a drive problem, though I certainly could be wrong. I'm sort of wondering if I should upgrade my 2308 from 20.00.04.00 to 20.00.07.00, but without the error recurring, it's impossible to know if I fixed anything no matter what I do.

nifoc · Jun 21, 2017

I also have two of the IronWolf 10TB drives and have seen the same error(s). As of right now, the error(s) have only happened once per drive.

One drive is connected using an LSI 2308 and the other one is connected via an Intel C224 (both integrated on the ASRock Rack E3C224D4I-14S).

I did burn in both drives and the SMART data shows nothing out of the ordinary.

Pheran · Jun 21, 2017

I've got 6 of them running now and I've only ever seen that one error. It seems like some crazy fluke. The drives have been working perfectly fine through resilvers, scrubs, and SMART tests.

nifoc · Jun 25, 2017

For me, it's sadly happening on a daily basis now. I tried different cables, different HBA's and even different power cables, but nothing seems to help.

Since I burned in both drives and SMART still shows nothing out of the ordinary, I feel like this has to be some kind of bug either in FreeBSD or in the Seagate firmware.

Any idea how I could meaningfully narrow this down?

Pheran · Jun 25, 2017

nifoc said:
For me, it's sadly happening on a daily basis now. I tried different cables, different HBA's and even different power cables, but nothing seems to help.

Since I burned in both drives and SMART still shows nothing out of the ordinary, I feel like this has to be some kind of bug either in FreeBSD or in the Seagate firmware.

Any idea how I could meaningfully narrow this down?

What firmware is on your LSI2308? I was considering upgrading from 20.00.04 to 20.00.07 (shot in the dark), but since it hasn't happened again I haven't bothered.

nifoc · Jun 25, 2017

Pheran said:
What firmware is on your LSI2308?

20.00.04.

But I also tried the on-board SATA2 connectors on my motherboard and it happened with these, too. I don't think this is specifically related to the LSI2308.

I feel like this isn't necessarily load-related either, because I can run a full scrub (12TB/~14 hours) without running into this issue.

nifoc · Jun 26, 2017

I now had a drive completely drop out because of these errors:

Code:

	(da3:mps1:0:11:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 453 command timeout cm 0xfffffe00010a4290 ccb 0xfffff8077062a000
	(noperiph:mps1:0:4294967295:0): SMID 1 Aborting command 0xfffffe00010a4290
mps1: Sending reset from mpssas_send_abort for target ID 11
	(da3:mps1:0:11:0): READ(16). CDB: 88 00 00 00 00 03 ae 31 65 30 00 00 00 08 00 00 length 4096 SMID 693 terminated ioc 804b scsi 0 state c xfer 0
	(da3:mps1:0:11:0): READ(16). CDB: 88 00 00 00 00 03 ae 31 65 38 00 00 00 08 00 00 length 4096 SMID 747 terminated ioc 804b sc(da3:mps1:0:11:0): READ(16). CDB: 88 00 00 00 00 03 ae 31 65 30 00 00 00 08 00 00
si 0 state c xfer 0
	(da3:mps1:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 04 71 db 1f 70 00 00 00 08 00 00 length 4096 SMID 336 terminated ioc 804b scsi 0 state c xfer 0
	(da3:mps1:0:11:0): READ(16). CDB: 88 00 00 00 00 03 ad fa f4 38 00 00 00 08 00 00 length 4096 SMID 499 terminated ioc 804b sc(da3:mps1:0:11:0): CAM status: CCB request completed with an error
(da3:mps1:0:11:0): Retrying command
(da3:mps1:0:11:0): READ(16). CDB: 88 00 00 00 00 03 ae 31 65 38 00 00 00 08 00 00
si 0 state c xfer 0
	(da3:mps1:0:11:0): READ(16). CDB: 88 00 00 00 00 03 ad bf 49 b0 00 00 00 08 00 00 length 4096 SMID 806 terminated ioc 804b sc(da3:mps1:0:11:0): CAM status: CCB request completed with an error
(da3:mps1:0:11:0): Retrying command
(da3:mps1:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 04 71 db 1f 70 00 00 00 08 00 00
si 0 state c xfer 0
	(da3:mps1:0:11:0): READ(16). CDB: 88 00 00 00 00 03 ad bf 49 a8 00 00 00 08 00 00 length 4096 SMID 930 terminated ioc 804b sc(da3:mps1:0:11:0): CAM status: CCB request completed with an error
(da3:mps1:0:11:0): Retrying command
(da3:mps1:0:11:0): READ(16). CDB: 88 00 00 00 00 03 ad fa f4 38 00 00 00 08 00 00
si 0 state c xfer 0
	(da3:mps1:0:11:0): READ(16). CDB: 88 00 00 00 00 03 ad bf 49 b8 00 00 00 08 00 00 length 4096 SMID 507 terminated ioc 804b sc(da3:mps1:0:11:0): CAM status: CCB request completed with an error
(da3:mps1:0:11:0): Retrying command
(da3:mps1:0:11:0): READ(16). CDB: 88 00 00 00 00 03 ad bf 49 b0 00 00 00 08 00 00
si 0 state c xfer 0
	(da3:mps1:0:11:0): READ(16). CDB: 88 00 00 00 00 03 ad 85 32 70 00 00 00 08 00 00 length 4096 SMID 654 terminated ioc 804b sc(da3:mps1:0:11:0): CAM status: CCB request completed with an error
(da3:mps1:0:11:0): Retrying command
si 0 state c xfer 0
	(da3:mps1:0:11:0): READ(16). CDB: 88 00 00 00 00 03 ad 85 30 60 00 00 00 08 00 00 length 4096 SMID 466 terminated ioc 804b sc(da3:mps1:0:11:0): READ(16). CDB: 88 00 00 00 00 03 ad bf 49 a8 00 00 00 08 00 00
si 0 state c xfer 0
(da3:mps1:0:11:0): CAM status: CCB request completed with an error
(da3:mps1:0:11:0): Retrying command
(da3:mps1:0:11:0): READ(16). CDB: 88 00 00 00 00 03 ad bf 49 b8 00 00 00 08 00 00
	(da3:mps1:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 04 71 db 1f 28 00 00 00 10 00 00 length 8192 SMID 568 terminated ioc 804b s(da3:mps1:0:11:0): CAM status: CCB request completed with an error
(da3:mps1:0:11:0): Retrying command
(da3:mps1:0:11:0): READ(16). CDB: 88 00 00 00 00 03 ad 85 32 70 00 00 00 08 00 00
csi 0 state c xfer 0
	(da3:mps1:0:11:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 763 terminated ioc 804b scsi 0 sta(da3:mps1:0:11:0): CAM status: CCB request completed with an error
(da3:mps1:0:11:0): Retrying command
(da3:mps1:0:11:0): READ(16). CDB: 88 00 00 00 00 03 ad 85 30 60 00 00 00 08 00 00
(da3:mps1:0:11:0): CAM status: CCB request completed with an error
(da3:mps1:0:11:0): Retrying command
(da3:mps1:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 04 71 db 1f 28 00 00 00 10 00 00
te c xfer 0
mps1: (da3:mps1:0:11:0): CAM status: CCB request completed with an error
(da3:mps1:0:11:0): Retrying command
(da3:mps1:0:11:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Unfreezing devq for target ID 11
(da3:mps1:0:11:0): CAM status: CCB request completed with an error
(da3:mps1:0:11:0): Retrying command
(da3:mps1:0:11:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da3:mps1:0:11:0): CAM status: Command timeout
(da3:mps1:0:11:0): Retrying command
(da3:mps1:0:11:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da3:mps1:0:11:0): CAM status: SCSI Status Error
(da3:mps1:0:11:0): SCSI status: Check Condition
(da3:mps1:0:11:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da3:mps1:0:11:0): Error 6, Retries exhausted
(da3:mps1:0:11:0): Invalidating pack

Looking into replacing the drive now, because that's the only thing I haven't changed yet.

SMART data is still saying that everything is okay.

Pheran · Jul 7, 2017

This problem has taken a nasty turn for the worse for me. These errors have now taken 2 drives out of my 8-drive RAIDZ2 array. One more and it goes boom. Going to try a resilver and cross my fingers. So far I still have no evidence of any actual problems with the drives, they pass tests. Nifoc, anything new on your end?

Pheran · Jul 7, 2017

Logs for the record. I miss my 100% reliable HGST drives.

Code:

Jul  5 03:04:10 vault  (da7:mps0:0:7:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 301 command tim
eout cm 0xfffffe00009d0b10 ccb 0xfffff803af05f000
Jul  5 03:04:10 vault  (noperiph:mps0:0:4294967295:0): SMID 1 Aborting command 0xfffffe00009d0b10
Jul  5 03:04:10 vault mps0: Sending reset from mpssas_send_abort for target ID 7
Jul  5 03:04:10 vault  (da7:mps0:0:7:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 672 command tim
eout cm 0xfffffe00009ef200 ccb 0xfffff8018f6a7800
Jul  5 03:04:11 vault  (da7:mps0:0:7:0): WRITE(16). CDB: 8a 00 00 00 00 02 50 f3 0d b0 00 00 00 08 00 00 length 4096 SMID 503 te
rminated ioc 804b scsi 0 state c xfer 0
Jul  5 03:04:11 vault  (da7:mps0:0:7:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 672 terminated
ioc 804b scsi 0 stat(da7:mps0:0:7:0): WRITE(16). CDB: 8a 00 00 00 00 02 50 f3 0d b0 00 00 00 08 00 00
Jul  5 03:04:11 vault e c xfer 0
Jul  5 03:04:11 vault (da7:mps0:0:7:0): CAM status: CCB request completed with an error
Jul  5 03:04:11 vault mps0: (da7:Unfreezing devq for target ID 7
Jul  5 03:04:11 vault mps0:0:7:0): Retrying command
Jul  5 03:04:11 vault (da7:mps0:0:7:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Jul  5 03:04:11 vault (da7:mps0:0:7:0): CAM status: CCB request completed with an error
Jul  5 03:04:11 vault (da7:mps0:0:7:0): Retrying command
Jul  5 03:04:11 vault (da7:mps0:0:7:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Jul  5 03:04:11 vault (da7:mps0:0:7:0): CAM status: Command timeout
Jul  5 03:04:11 vault (da7:mps0:0:7:0): Retrying command
Jul  5 03:04:11 vault (da7:mps0:0:7:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Jul  5 03:04:11 vault (da7:mps0:0:7:0): CAM status: SCSI Status Error
Jul  5 03:04:11 vault (da7:mps0:0:7:0): SCSI status: Check Condition
Jul  5 03:04:11 vault (da7:mps0:0:7:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jul  5 03:04:11 vault (da7:mps0:0:7:0): Error 6, Retries exhausted
Jul  5 03:04:11 vault (da7:mps0:0:7:0): Invalidating pack

....

Jul  7 03:02:28 vault  (da5:mps0:0:5:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 599 command timeout cm 0xfffffe00009e9230 ccb 0xfffff802ee521000
Jul  7 03:02:28 vault  (noperiph:mps0:0:4294967295:0): SMID 2 Aborting command 0xfffffe00009e9230
Jul  7 03:02:28 vault mps0: Sending reset from mpssas_send_abort for target ID 5
Jul  7 03:02:28 vault  (da5:mps0:0:5:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 604 command timeout cm 0xfffffe00009e98c0 ccb 0xfffff806a2424800
Jul  7 03:02:28 vault  (da5:mps0:0:5:0): WRITE(16). CDB: 8a 00 00 00 00 01 74 6e b7 50 00 00 00 08 00 00 length 4096 SMID 374 terminated ioc 804b scsi 0 state c xfer 0
Jul  7 03:02:28 vault  (da5:mps0:0:5:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 604 terminated ioc 804b scsi 0 stat(da5:mps0:0:5:0): WRITE(16). CDB: 8a 00 00 00 00 01 74 6e b7 50 00 00 00 08 00 00
Jul  7 03:02:28 vault e c xfer 0
Jul  7 03:02:28 vault (da5:mps0:0:5:0): CAM status: CCB request completed with an error
Jul  7 03:02:28 vault mps0: (da5:Unfreezing devq for target ID 5
Jul  7 03:02:28 vault mps0:0:5:0): Retrying command
Jul  7 03:02:28 vault (da5:mps0:0:5:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Jul  7 03:02:28 vault (da5:mps0:0:5:0): CAM status: Command timeout
Jul  7 03:02:28 vault (da5:mps0:0:5:0): Retrying command
Jul  7 03:02:28 vault (da5:mps0:0:5:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Jul  7 03:02:28 vault (da5:mps0:0:5:0): CAM status: CCB request completed with an error
Jul  7 03:02:28 vault (da5:mps0:0:5:0): Retrying command
Jul  7 03:02:28 vault (da5:mps0:0:5:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Jul  7 03:02:28 vault (da5:mps0:0:5:0): CAM status: SCSI Status Error
Jul  7 03:02:28 vault (da5:mps0:0:5:0): SCSI status: Check Condition
Jul  7 03:02:28 vault (da5:mps0:0:5:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jul  7 03:02:28 vault (da5:mps0:0:5:0): Error 6, Retries exhausted
Jul  7 03:02:28 vault (da5:mps0:0:5:0): Invalidating pack

nifoc · Jul 7, 2017

Pheran said:
Nifoc, anything new on your end?

After I had one drive drop out of the pool, I re-examined the error messages. I basically had two different "issues".

Drive A: SYNCHRONIZE CACHE(10)/READ(16)/WRITE(16) timeouts
Drive B: FLUSHCACHE48 timeouts

The errors on drive A happened much more frequently and eventually caused the drive to drop out of the pool. The error on drive B happened very infrequently.

I decided to replace drive A with a 10TB WD Red and haven't seen any SYNCHRONIZE CACHE(10)/READ(16)/WRITE(16) timeouts for 1 1/2 weeks now. My best guess is that at least drive A was really broken - even though SMART didn't indicate anything. Maybe faulty firmware?

I was able to fix the FLUSHCACHE48 issue on drive B by simply disabling EPC (on the drive). Apparently this one is also an issue under Linux (0xea is FLUSHCACHE48).

Pheran · Jul 7, 2017

I haven't seen any FLUSHCACHE48, mine are all SYNCHRONIZE CACHE. I'm sorry, but I'm not familiar with EPC? That link talks about disabling NCQ.

nifoc · Jul 7, 2017

Looks like the content in the blog post changed since I read it. Sorry about that!

EPC stands for "Extended Power Conditions" and it handles - among other things - automatically parking the head after X minutes of inactivity.

In any case, disabling EPC doesn't help with SYNCHRONIZE CACHE timeouts. Replacing the drive(s) is probably your best option.

Pheran · Jul 7, 2017

nifoc said:
Looks like the content in the blog post changed since I read it. Sorry about that!

EPC stands for "Extended Power Conditions" and it handles - among other things - automatically parking the head after X minutes of inactivity.

In any case, disabling EPC doesn't help with SYNCHRONIZE CACHE timeouts. Replacing the drive(s) is probably your best option.

It's difficult for me to believe that 2 of 8 (or 3 of 8 if you count the transient error) brand new drives are all bad, especially when they won't fail any other tests. I wonder if Seagate would even accept an RMA without any SMART errors?

nifoc · Jul 7, 2017

Pheran said:
I wonder if Seagate would even accept an RMA without any SMART errors?

I didn't try that. I simply returned the drive to the seller and got my money back - which probably rather specific to Germany/EU.

Pheran · Jul 12, 2017

The saga continues. I got both drives back into the pool after putting them offline, re-adding them, and scrubbing. I also upgraded the controller firmware to 20.00.07, even though I knew it was grasping at straws. This was stable for a little while and then da7 dropped out again this morning. That drive might actually have a legitimate problem because I think it's noisy. I'm going to leave it offline and run a full badblocks scan on it, then another SMART long test. I'd love to RMA it if I could get SMART to say there's something wrong. For the record, the host reset that happened during a previous long test was one of the SYNC errors.

Code:

Jul 12 03:02:31 vault  (da7:mps0:0:7:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 575 command timeout cm 0xfffffe00009e72b0 ccb 0xfffff80334147800
Jul 12 03:02:31 vault  (noperiph:mps0:0:4294967295:0): SMID 1 Aborting command 0xfffffe00009e72b0
Jul 12 03:02:31 vault mps0: Sending reset from mpssas_send_abort for target ID 7
Jul 12 03:02:31 vault  (da7:mps0:0:7:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 882 command timeout cm 0xfffffe0000a005a0 ccb 0xfffff805adbb4800
Jul 12 03:02:31 vault  (da7:mps0:0:7:0): WRITE(16). CDB: 8a 00 00 00 00 02 50 c0 e2 a8 00 00 00 08 00 00 length 4096 SMID 551 terminated ioc 804b scsi 0 state c xfer 0
Jul 12 03:02:31 vault  (da7:mps0:0:7:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 882 terminated ioc 804b scsi 0 stat(da7:mps0:0:7:0): WRITE(16). CDB: 8a 00 00 00 00 02 50 c0 e2 a8 00 00 00 08 00 00
Jul 12 03:02:31 vault e c xfer 0
Jul 12 03:02:31 vault (da7:mps0:0:7:0): CAM status: CCB request completed with an error
Jul 12 03:02:31 vault mps0: (da7:Unfreezing devq for target ID 7
Jul 12 03:02:31 vault mps0:0:7:0): Retrying command
Jul 12 03:02:31 vault (da7:mps0:0:7:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Jul 12 03:02:31 vault (da7:mps0:0:7:0): CAM status: Command timeout
Jul 12 03:02:31 vault (da7:mps0:0:7:0): Retrying command
Jul 12 03:02:31 vault (da7:mps0:0:7:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Jul 12 03:02:31 vault (da7:mps0:0:7:0): CAM status: CCB request completed with an error
Jul 12 03:02:31 vault (da7:mps0:0:7:0): Retrying command
Jul 12 03:02:31 vault (da7:mps0:0:7:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Jul 12 03:02:31 vault (da7:mps0:0:7:0): CAM status: SCSI Status Error
Jul 12 03:02:31 vault (da7:mps0:0:7:0): SCSI status: Check Condition
Jul 12 03:02:31 vault (da7:mps0:0:7:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jul 12 03:02:31 vault (da7:mps0:0:7:0): Error 6, Retries exhausted
Jul 12 03:02:31 vault (da7:mps0:0:7:0): Invalidating pack

[root@vault] /var/log# smartctl -a /dev/da7
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:  ST10000VN0004-1ZD101
Serial Number:  ZA2153F0
LU WWN Device Id: 5 000c50 0a230591e
Firmware Version: SC60
User Capacity:  10,000,831,348,736 bytes [10.0 TB]
Sector Sizes:  512 bytes logical, 4096 bytes physical
Rotation Rate:  7200 rpm
Form Factor:  3.5 inches
Device is:  Not in smartctl database [for details use: -P showall]
ATA Version is:  ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:  Wed Jul 12 10:05:08 2017 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
  was completed without error.
  Auto Offline Data Collection: Enabled.
Self-test execution status:  (  0) The previous self-test routine completed
  without error or no self-test has ever
  been run.
Total time to complete Offline
data collection:  (  575) seconds.
Offline data collection
capabilities:  (0x7b) SMART execute Offline immediate.
  Auto Offline data collection on/off support.
  Suspend Offline collection upon new
  command.
  Offline surface scan supported.
  Self-test supported.
  Conveyance Self-test supported.
  Selective Self-test supported.
SMART capabilities:  (0x0003) Saves SMART data before entering
  power-saving mode.
  Supports SMART auto save timer.
Error logging capability:  (0x01) Error logging supported.
  General Purpose Logging supported.
Short self-test routine
recommended polling time:  (  1) minutes.
Extended self-test routine
recommended polling time:  ( 861) minutes.
Conveyance self-test routine
recommended polling time:  (  2) minutes.
SCT capabilities:  (0x50bd) SCT Status supported.
  SCT Error Recovery Control supported.
  SCT Feature Control supported.
  SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME  FLAG  VALUE WORST THRESH TYPE  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate  0x000f  079  064  044  Pre-fail  Always  -  86799815
  3 Spin_Up_Time  0x0003  099  099  000  Pre-fail  Always  -  0
  4 Start_Stop_Count  0x0032  100  100  020  Old_age  Always  -  1
  5 Reallocated_Sector_Ct  0x0033  100  100  010  Pre-fail  Always  -  0
  7 Seek_Error_Rate  0x000f  078  061  045  Pre-fail  Always  -  62858440
  9 Power_On_Hours  0x0032  100  100  000  Old_age  Always  -  279 (117 193 0)
10 Spin_Retry_Count  0x0013  100  100  097  Pre-fail  Always  -  0
12 Power_Cycle_Count  0x0032  100  100  020  Old_age  Always  -  1
184 End-to-End_Error  0x0032  100  100  099  Old_age  Always  -  0
187 Reported_Uncorrect  0x0032  100  100  000  Old_age  Always  -  0
188 Command_Timeout  0x0032  100  100  000  Old_age  Always  -  0
189 High_Fly_Writes  0x003a  098  098  000  Old_age  Always  -  2
190 Airflow_Temperature_Cel 0x0022  066  059  040  Old_age  Always  -  34 (Min/Max 24/41)
191 G-Sense_Error_Rate  0x0032  100  100  000  Old_age  Always  -  279
192 Power-Off_Retract_Count 0x0032  100  100  000  Old_age  Always  -  0
193 Load_Cycle_Count  0x0032  100  100  000  Old_age  Always  -  136
194 Temperature_Celsius  0x0022  034  041  000  Old_age  Always  -  34 (0 24 0 0 0)
195 Hardware_ECC_Recovered  0x001a  013  004  000  Old_age  Always  -  86799815
197 Current_Pending_Sector  0x0012  100  100  000  Old_age  Always  -  0
198 Offline_Uncorrectable  0x0010  100  100  000  Old_age  Offline  -  0
199 UDMA_CRC_Error_Count  0x003e  200  200  000  Old_age  Always  -  0
200 Multi_Zone_Error_Rate  0x0023  100  100  001  Pre-fail  Always  -  0
240 Head_Flying_Hours  0x0000  100  253  000  Old_age  Offline  -  221 (116 197 0)
241 Total_LBAs_Written  0x0000  100  253  000  Old_age  Offline  -  5817608007
242 Total_LBAs_Read  0x0000  100  253  000  Old_age  Offline  -  5743767161

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description  Status  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline  Completed without error  00%  268  -
# 2  Extended offline  Completed without error  00%  143  -
# 3  Extended offline  Completed without error  00%  126  -
# 4  Extended offline  Interrupted (host reset)  80%  104  -
# 5  Short offline  Completed without error  00%  0  -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
  1  0  0  Not_testing
  2  0  0  Not_testing
  3  0  0  Not_testing
  4  0  0  Not_testing
  5  0  0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

nifoc · Jul 12, 2017

Pheran said:
That drive might actually have a legitimate problem because I think it's noisy.

By noisy, do you mean some weird "clicking"/clatter? Obviously not the normal noise a HDD makes, but something else? I could swear the IronWolf I returned made these weird noises.

Btw. I did not have any issues since I returned the "faulty" drive.

Important Announcement for the TrueNAS Community.

SYNCHRONIZE CACHE command timeout error

Patron

Wizard

Patron

Wizard

Patron

Dabbler

Patron

Dabbler

Patron

Dabbler

Dabbler

Patron

Patron

Dabbler

Patron

Dabbler

Patron

Dabbler

Patron

Dabbler

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "SYNCHRONIZE CACHE command timeout error"

Similar threads