SYNCHRONIZE CACHE command timeout error

Pheran · Feb 10, 2019

So much for that da1 theory, da4 bombed out this morning.

cobrakiller58 · Feb 11, 2019

Thank you for keeping us updated, did anyone ever report this as a bug? I do not see any reference to one.

spoke too soon it was brought up in this one which is closed https://redmine.ixsystems.com/issues/31398

MikeyG · Mar 6, 2019

Same exact issues here with multiple drives. Always seems to be the same 3 drives. Smart tests and running bad blocks don't report any errors. Tried changing out cables, drive cage, etc. Seagate ST8000NM0055. LSI 3008

Drives that are failing have different firmware versions. Have not tried updating LSI firmware yet as I'm hesitant to mess with that.

sas3flash -listall reports firmware 10.00.03

I suppose I could replace all 3 drives, but they aren't under warranty. :-(

Pheran · Mar 6, 2019

Yeah, I only have 10TB Ironwolfs but I've heard it affects the 8TB models as well. I find it curious that you say this only affects 3 of your 8 drives. It hits all of my Ironwolfs.

MikeyG · Mar 6, 2019

Yeah, it does seem to hit the same 3 over and over again. After a while i started recording the failures. They are up to months apart though. This one that failed out today last failed out in December.

ivosevb · Mar 20, 2019

Hi, I just want to say that I have the same kind of errors with plain FreeBSD 12, LSI 3008, 18x WD 4 TB Red in Supermicro chasis 5049p. So, i think the problem is not with FreeNAS or Seagate disks exclusively. It happens from time to time on different storage bays. Today it happened after 20 days of work. System just rebooted after five retries "SYNCHRONIZE CACHE" errors.

MikeyG · Mar 20, 2019

I got it again the other day on a drive. There's got to be something that can be done about this. It seems too spread out among drive models and controllers for it to be a compatibility issue. It seems that it off-lines the drive very quickly (within a second). Is there any way to have Freenas attempt to recover the drive back into the pool, or if there is a communication issue give a another second to try again? I think even if I was running VMs on the pool, they could survive for 2-3 seconds without being able to access the disk if the pool needed to be temporarily frozen.

Pheran · Mar 24, 2019

My hard drives have recently started dropping like flies - nearly once a day. The last change I made was upgrading my Plex plugin from 1.14.0 to 1.14.1. I'm wondering if the bandwidth graph that 1.14.1 makes is exercising the drives more frequently and causing more failures, even though it's writing a trivial amount of data. This is very painful. Given how far SSDs have dropped in price, I'm tempted to pick up 2 1TB SSDs and mirror them for use as a dedicated jail dataset.

MikeyG · Mar 26, 2019

This is grasping as straws here, but is anyone experiencing these issues with a Mirror Vdev pool? I've had my Freenas server for about a year now. At first I set it up with 4X mirror vdevs. For about 8 months it was that way, and I don't think I ever saw this error.

Around August/September I switched to RAIDZ2, and starting at least in November I get these every few weeks. Looking back through the posts, everyone seems to mention RAIDZ2.

Anyone notice anything like this? I'm tempted to switch back to test myself, but that's a major PITA.

Pheran · Mar 26, 2019

mgittelman said:
This is grasping as straws here, but is anyone experiencing these issues with a Mirror Vdev pool? I've had my Freenas server for about a year now. At first I set it up with 4X mirror vdevs. For about 8 months it was that way, and I don't think I ever saw this error.

Interesting idea, I've never tried it. But you're right, switching over is a huge pain in the ass.

ivosevb · Apr 17, 2019

Just to update my previous post. We made full backup and migrated data to another FreeBSD ZFS server, and installed 11.1-RELEASE from scratch. We recreated ZFS pool because of 12.0 ZFS features that 11.1 doesn't support. Since then (two weeks) everything is stable. No more errors. 11.2 has the same MPR driver version like 12.0, so we go with 11.1 and ... we are happy now. There is a MPR version "23.00.00.00-fbsd" in current, but we didn't try it. Our system:

FreeBSD 11.1-RELEASE-p15
mpr0: Firmware: 16.00.01.00, Driver: 15.03.00.00-fbsd

SuperMicro SuperStorage Server 5049P-E1CTR36L
Intel Xeon Bronze 3104 CPU @ 1.70GHz
64 GB RAM
Avago Technologies (LSI) SAS3008
18x WD Red 4 TB
Intel SSDPE2KX010T8 NVMe

MikeyG · Apr 17, 2019

@ivosevb If I'm understanding you correctly, are you saying the issue appears to be a driver introduced in Freenas 11.2?

ivosevb · Apr 17, 2019

I really can't tell for sure, i'm just an ordinary user, we also think about bad backplane, but we deside to try different driver version ... or in the end, oh my God, Linux.

Pheran · Apr 20, 2019

To be honest, I do not think that ivosevb has the same problem as the other folks in this thread. This known issue with large Seagate Ironwolf drives does not impact 4TB WD Red drives. I can certainly confirm that the Ironwolf problem is not caused or solved by FreeNAS 11.2.

soulburn · May 1, 2019

I have this same issue with the IronWolf 10 TB drives. My pool is set up with mirrors. I have an LSI 3008 and another system with an LSI 2008 and some amount of drives always fails every few days in both systems, especially during a scrub. Has anyone found any solutions? It's become a huge problem for me.

Pheran · May 1, 2019

soulburn said:
I have this same issue with the IronWolf 10 TB drives. My pool is set up with mirrors. I have an LSI 3008 and another system with an LSI 2008 and some amount of drives always fails every few days in both systems, especially during a scrub. Has anyone found any solutions? It's become a huge problem for me.

Sadly this thread has been going for about 2 years now and there's no solution that I know of. But at least you just answered @mgittelman about whether or not mirrors would solve this - I guess not.

wanrover · May 8, 2019

Hi! I just found this interesting tread and want to inform you all about similar experience.

I have a HP gen8 DL380P here with the P420i controller in HBA mode and 10 x 5TB Seagate Barracuda ST5000LM000 2.5" And these drives are thrown out at random under heavy load.
It usually starts with one drive (random which one) first getting kicked offline and replaced with the spare, then the re-silvering starts. Then, after a few hours of re-silvering, it starts to throw out more drives. I have seen up to four drives kicked out at one time, before the server totally kneels and reboots itself.

Code:

May  6 16:09:19 freenas-232 (da2:ciss0:32:10:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
May  6 16:09:19 freenas-232 (da2:ciss0:32:10:0): CAM status: CCB request completed with an error
May  6 16:09:19 freenas-232 (da2:ciss0:32:10:0): Retrying command
May  6 16:09:25 freenas-232 ciss0: *** Hot-plug drive removed, Port=1I Box=1 Bay=3 SN=            WCJ16HQB
May  6 16:09:25 freenas-232 (da2:ciss0:32:10:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
May  6 16:09:25 freenas-232 (da2:ciss0:32:10:0): CAM status: CCB request completed with an error
May  6 16:09:25 freenas-232 (da2:ciss0:32:10:0): Error 5, Retries exhausted
May  6 16:09:25 freenas-232 (da2:ciss0:32:10:0): READ(10). CDB: 28 00 5d 27 df 08 00 00 80 00
May  6 16:09:25 freenas-232 (da2:ciss0:32:10:0): CAM status: SCSI Status Error
May  6 16:09:25 freenas-232 (da2:ciss0:32:10:0): SCSI status: Check Condition
May  6 16:09:25 freenas-232 (da2:ciss0:32:10:0): SCSI sense: ILLEGAL REQUEST asc:25,0 (Logical unit not supported)
May  6 16:09:25 freenas-232 (da2:ciss0:32:10:0): Command Specific Info: 0xc24f00
May  6 16:09:25 freenas-232 (da2:ciss0:32:10:0): Error 6, Unretryable error
May  6 16:09:25 freenas-232 (da2:ciss0:32:10:0): Invalidating pack
May  6 16:09:25 freenas-232 da2 at ciss0 bus 32 scbus1 target 10 lun 0
May  6 16:09:33 freenas-232 (da2:ciss0:32:10:0): Periph destroyed
May  6 16:39:40 freenas-232 ciss0: *** Physical drive failure, Port=1I Box=1 Bay=3 reason=0x14
[CODE)

DGenerateKane · May 8, 2019

Been dealing with this basically from the day I got my used server and added the 10TB Ironwolf drives to it. Replaced the HBA with an identical one right away, then 6 months later with a IBM M1015 crossflashed with LSI IT firmware. Replaced all 8 drives (RAIDZ2) within a week, issues seemed gone but eventually started up again. All 8 drives have issues at some point forcing a reboot (sometimes over IPMI because I lose the WebGUI if two fault at once). I actually got a warning about one drive last week that actually looked like a bad drive. It failed to get past 20% of a long test, so I ordered a WD Easystore 10TB and shucked it. I'm still testing it, but that bad drive not only recovered, it completed two long tests without error. I notice my errors seem to crop up most frequently when my Windows 10 VM is running, particularly if I'm also transferring lots of data at the same time. Not sure if a drive has errored while idle.

I'm honestly not sure where to go at this point. Ideally, replace all the drives but I absolutely do not have the funds for it right now. Eventually when I do, I may swap out the drives in my other NAS and see what happens. The board has an LSI controller for the 8 SAS ports, and 6 SATA ports so who knows what will happen if I mix and match. I'm also still trying to figure out how I'm going to fill all 36 bays, and having to replace drives isn't helping.

jegonzo71 · May 10, 2019

I am now having the same issue. I am running FreeNAS 11.1-U7, with 12 disks in a RAID 10 with a mixture of ST3000DM001, ST5000DM000, ST6000DM003. System has been running for 4 years, originally FreeNAS 9.3. I have no Ironwolf drives in the system. The only change that has been made to the system recently was that about 2 weeks ago I had added (2) of the ST6000DM003 to the pool, these are the first ST6000DM003 to be added to the pool. Also, system was updated from 11.1-U6 in mid January. Controller is an LSI-920016e.

Output:
FreeNAS.local kernel log messages:
> (da1:mps0:0:9:0): WRITE(16). CDB: 8a 00 00 00 00 01 78 b2 8a 40 00 00 00 10 00 00 length 8192 SMID 198 terminated ioc 804b loginfo 31120303 scsi 0 state c xfer 0
> (da1:mps0:0:9:0): WRITE(16). CDB: 8a 00 00 00 00 01 78 b2 8a 40 00 00 00 10 00 00
> (da1:mps0:0:9:0): CAM status: CCB request completed with an error
> (da1:mps0:0:9:0): Retrying command
> (da1:mps0:0:9:0): WRITE(16). CDB: 8a 00 00 00 00 01 78 b2 8a 40 00 00 00 10 00 00
> (da1:mps0:0:9:0): CAM status: SCSI Status Error
> (da1:mps0:0:9:0): SCSI status: Check Condition
> (da1:mps0:0:9:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> (da1:mps0:0:9:0): Retrying command (per sense data)
> (da1:mps0:0:9:0): READ(10). CDB: 28 00 23 4c 22 38 00 00 08 00
> (da1:mps0:0:9:0): CAM status: SCSI Status Error
> (da1:mps0:0:9:0): SCSI status: Check Condition
> (da1:mps0:0:9:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> (da1:mps0:0:9:0): Retrying command (per sense data)

Not at home right now so I couldn't say which drive da1 is (it should be an ST5000DM000). Will update later.

Pheran · May 10, 2019

@jegonzo71 Given that you have no large Ironwolfs and your problem just started, the root cause is likely something other than what's being discussed in this thread.

Important Announcement for the TrueNAS Community.

SYNCHRONIZE CACHE command timeout error

Patron

Guru

Patron

Patron

Patron

Cadet

Patron

Patron

Patron

Patron

Cadet

Patron

Cadet

Patron

Contributor

Patron

Cadet

Explorer

Cadet

Patron

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "SYNCHRONIZE CACHE command timeout error"

Similar threads