SYNCHRONIZE CACHE command timeout error

Joined
Jan 18, 2017
Messages
524
Thank you for keeping us updated. Did anyone ever report this as a bug? I don't see any reference to one.

Spoke too soon: it was brought up in this ticket, which is now closed: https://redmine.ixsystems.com/issues/31398
 

MikeyG

Patron
Joined
Dec 8, 2017
Messages
442
Exact same issues here with multiple drives. It always seems to be the same 3 drives. SMART tests and badblocks runs don't report any errors. I've tried changing out cables, the drive cage, etc. Drives are Seagate ST8000NM0055, controller is an LSI 3008.

The drives that are failing have different firmware versions. I haven't tried updating the LSI firmware yet, as I'm hesitant to mess with that.

sas3flash -listall reports firmware 10.00.03
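
In case it helps anyone else comparing notes, this is roughly how I pull the firmware and driver versions on FreeBSD before and after changing anything (a sketch assuming a single SAS3 adapter handled by the mpr driver; adjust the adapter number if you have more than one):

Code:
# Flashed firmware/BIOS versions for all LSI SAS3 adapters the tool can see
sas3flash -listall

# The mpr(4) driver logs its own version and the card firmware at attach time
dmesg | grep '^mpr0:'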

I suppose I could replace all 3 drives, but they aren't under warranty. :-(
 

MikeyG

Patron
Joined
Dec 8, 2017
Messages
442
Yeah, it does seem to hit the same 3 over and over again. After a while I started recording the failures. They can be months apart, though; the one that failed out today last failed out in December.
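
In case anyone else wants to keep score, the kick-outs are easy to tally from the kernel log (a rough sketch; FreeNAS writes these messages to /var/log/messages, and the sed pattern below just pulls the daX name out of each matching line):

Code:
# Every SYNCHRONIZE CACHE retry/failure in the current log, with timestamps
grep 'SYNCHRONIZE CACHE' /var/log/messages

# Count events per device to see whether it really is the same disks each time
grep 'SYNCHRONIZE CACHE' /var/log/messages | sed 's/.*(\(da[0-9]*\):.*/\1/' | sort | uniq -c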
 

ivosevb

Cadet
Joined
Aug 30, 2015
Messages
5
Hi, I just want to say that I see the same kind of errors with plain FreeBSD 12, an LSI 3008, and 18x WD Red 4 TB in a Supermicro 5049P chassis. So I think the problem is not exclusive to FreeNAS or to Seagate disks. It happens from time to time in different storage bays. Today it happened after 20 days of uptime: the system rebooted after five retried "SYNCHRONIZE CACHE" errors.
 

MikeyG

Patron
Joined
Dec 8, 2017
Messages
442
I got it again the other day on a drive. There has got to be something that can be done about this. It seems too spread out among drive models and controllers to be a compatibility issue. It offlines the drive very quickly (within a second). Is there any way to have FreeNAS attempt to recover the drive back into the pool, or, if there is a communication issue, give it another second to try again? I think even if I were running VMs on the pool, they could survive 2-3 seconds without being able to access the disk if the pool needed to be temporarily frozen.
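
For what it's worth, when a drive is kicked out but is otherwise healthy, it can sometimes be coaxed back by hand instead of rebooting. A rough sketch (pool name tank and device da2 are placeholders for your own):

Code:
# See which device was faulted/removed and why
zpool status -v tank

# Ask ZFS to bring the device back online; it will resilver the writes it missed
zpool online tank da2

# Once it is healthy again, clear the error counters so the pool shows ONLINE
zpool clear tank

Whether that works without a reboot seems to depend on whether the daX device node still exists; if CAM has already destroyed it, a "camcontrol rescan all" (or a reboot) is needed first.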
 

Pheran

Patron
Joined
Jul 14, 2015
Messages
280
My hard drives have recently started dropping like flies - nearly once a day. The last change I made was upgrading my Plex plugin from 1.14.0 to 1.14.1. I'm wondering if the bandwidth graph that 1.14.1 generates is exercising the drives more frequently and causing more failures, even though it writes a trivial amount of data. This is very painful. Given how far SSDs have dropped in price, I'm tempted to pick up two 1 TB SSDs and mirror them for use as a dedicated jail dataset.
 

MikeyG

Patron
Joined
Dec 8, 2017
Messages
442
This is grasping at straws here, but is anyone experiencing these issues with a mirror vdev pool? I've had my FreeNAS server for about a year now. At first I set it up with 4x mirror vdevs. For about 8 months it stayed that way, and I don't think I ever saw this error.

Around August/September I switched to RAIDZ2, and since at least November I've been getting these every few weeks. Looking back through the posts, everyone seems to mention RAIDZ2.

Anyone notice anything like this? I'm tempted to switch back to test myself, but that's a major PITA.
 

Pheran

Patron
Joined
Jul 14, 2015
Messages
280
This is grasping at straws here, but is anyone experiencing these issues with a mirror vdev pool? I've had my FreeNAS server for about a year now. At first I set it up with 4x mirror vdevs. For about 8 months it stayed that way, and I don't think I ever saw this error.

Interesting idea, I've never tried it. But you're right, switching over is a huge pain in the ass.
 

ivosevb

Cadet
Joined
Aug 30, 2015
Messages
5
Just to update my previous post. We made a full backup, migrated the data to another FreeBSD ZFS server, and installed 11.1-RELEASE from scratch. We had to recreate the ZFS pool because of 12.0 ZFS features that 11.1 doesn't support. Since then (two weeks) everything has been stable. No more errors. 11.2 has the same MPR driver version as 12.0, so we went with 11.1 and ... we are happy now. There is an MPR version "23.00.00.00-fbsd" in CURRENT, but we didn't try it. Our system:

FreeBSD 11.1-RELEASE-p15
mpr0: Firmware: 16.00.01.00, Driver: 15.03.00.00-fbsd

SuperMicro SuperStorage Server 5049P-E1CTR36L
Intel Xeon Bronze 3104 CPU @ 1.70GHz
64 GB RAM
Avago Technologies (LSI) SAS3008
18x WD Red 4 TB
Intel SSDPE2KX010T8 NVMe
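
On the feature-flag point, it's easy to check up front whether a pool created on a newer release can still be imported on an older one (a sketch; tank is a placeholder pool name):

Code:
# Feature flags currently enabled/active on the pool
zpool get all tank | grep feature@

# Feature flags the running release's ZFS actually supports
zpool upgrade -v

If the pool has an active feature that the older release doesn't list, it generally won't import there, which is why we had to recreate ours.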
 

MikeyG

Patron
Joined
Dec 8, 2017
Messages
442
@ivosevb If I'm understanding you correctly, are you saying the issue appears to be with a driver introduced in FreeNAS 11.2?
 

ivosevb

Cadet
Joined
Aug 30, 2015
Messages
5
I really can't tell for sure, I'm just an ordinary user. We also suspected a bad backplane, but we decided to try a different driver version first ... or, in the end, oh my God, Linux.
 

Pheran

Patron
Joined
Jul 14, 2015
Messages
280
To be honest, I do not think that ivosevb has the same problem as the other folks in this thread. The known issue with large Seagate IronWolf drives does not affect 4 TB WD Red drives. I can certainly confirm that the IronWolf problem is not caused or solved by FreeNAS 11.2.
 

soulburn

Contributor
Joined
Jul 6, 2014
Messages
100
I have this same issue with the IronWolf 10 TB drives. My pool is set up with mirrors. I have an LSI 3008 and another system with an LSI 2008, and some number of drives fails every few days in both systems, especially during a scrub. Has anyone found any solutions? It's become a huge problem for me.
 

Pheran

Patron
Joined
Jul 14, 2015
Messages
280
I have this same issue with the IronWolf 10 TB drives. My pool is set up with mirrors. I have an LSI 3008 and another system with an LSI 2008, and some number of drives fails every few days in both systems, especially during a scrub. Has anyone found any solutions? It's become a huge problem for me.

Sadly this thread has been going for about 2 years now and there's no solution that I know of. But at least you just answered @mgittelman about whether or not mirrors would solve this - I guess not.
 

wanrover

Cadet
Joined
Feb 6, 2019
Messages
1
Hi! I just found this interesting thread and want to share a similar experience.

I have an HP DL380p Gen8 here with the P420i controller in HBA mode and 10x 5 TB Seagate Barracuda ST5000LM000 2.5" drives, and these drives are thrown out at random under heavy load.
It usually starts with one drive (a random one each time) getting kicked offline and replaced with the spare, and then the resilvering starts. After a few hours of resilvering, it starts to throw out more drives. I have seen up to four drives kicked out at once before the server keels over completely and reboots itself.

Code:
May  6 16:09:19 freenas-232 (da2:ciss0:32:10:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
May  6 16:09:19 freenas-232 (da2:ciss0:32:10:0): CAM status: CCB request completed with an error
May  6 16:09:19 freenas-232 (da2:ciss0:32:10:0): Retrying command
May  6 16:09:25 freenas-232 ciss0: *** Hot-plug drive removed, Port=1I Box=1 Bay=3 SN=            WCJ16HQB
May  6 16:09:25 freenas-232 (da2:ciss0:32:10:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
May  6 16:09:25 freenas-232 (da2:ciss0:32:10:0): CAM status: CCB request completed with an error
May  6 16:09:25 freenas-232 (da2:ciss0:32:10:0): Error 5, Retries exhausted
May  6 16:09:25 freenas-232 (da2:ciss0:32:10:0): READ(10). CDB: 28 00 5d 27 df 08 00 00 80 00
May  6 16:09:25 freenas-232 (da2:ciss0:32:10:0): CAM status: SCSI Status Error
May  6 16:09:25 freenas-232 (da2:ciss0:32:10:0): SCSI status: Check Condition
May  6 16:09:25 freenas-232 (da2:ciss0:32:10:0): SCSI sense: ILLEGAL REQUEST asc:25,0 (Logical unit not supported)
May  6 16:09:25 freenas-232 (da2:ciss0:32:10:0): Command Specific Info: 0xc24f00
May  6 16:09:25 freenas-232 (da2:ciss0:32:10:0): Error 6, Unretryable error
May  6 16:09:25 freenas-232 (da2:ciss0:32:10:0): Invalidating pack
May  6 16:09:25 freenas-232 da2 at ciss0 bus 32 scbus1 target 10 lun 0
May  6 16:09:33 freenas-232 (da2:ciss0:32:10:0): Periph destroyed
May  6 16:39:40 freenas-232 ciss0: *** Physical drive failure, Port=1I Box=1 Bay=3 reason=0x14
 

DGenerateKane

Explorer
Joined
Sep 4, 2014
Messages
95
Been dealing with this basically from the day I got my used server and added the 10 TB IronWolf drives to it. I replaced the HBA with an identical one right away, then 6 months later with an IBM M1015 crossflashed to LSI IT firmware. I replaced all 8 drives (RAIDZ2) within a week; the issues seemed gone but eventually started up again. All 8 drives have had issues at some point, forcing a reboot (sometimes over IPMI, because I lose the WebGUI if two fault at once). I did get a warning about one drive last week that actually looked like a bad drive. It failed to get past 20% of a long test, so I ordered a WD Easystore 10TB and shucked it. I'm still testing the replacement, but that "bad" drive not only recovered, it completed two long tests without error. I notice my errors seem to crop up most frequently when my Windows 10 VM is running, particularly if I'm also transferring lots of data at the same time. I'm not sure a drive has ever errored while idle.

I'm honestly not sure where to go at this point. Ideally I'd replace all the drives, but I absolutely do not have the funds for that right now. Eventually, when I do, I may swap them with the drives in my other NAS and see what happens. That board has an LSI controller for the 8 SAS ports plus 6 SATA ports, so who knows what will happen if I mix and match. I'm also still trying to figure out how I'm going to fill all 36 bays, and having to replace drives isn't helping.
 

jegonzo71

Cadet
Joined
May 10, 2019
Messages
1
I am now having the same issue. I am running FreeNAS 11.1-U7 with 12 disks in a RAID 10 layout, with a mixture of ST3000DM001, ST5000DM000, and ST6000DM003 drives. The system has been running for 4 years, originally on FreeNAS 9.3. I have no IronWolf drives in the system. The only recent change is that about 2 weeks ago I added two ST6000DM003 drives to the pool; these are the first ST6000DM003 drives in the pool. Also, the system was updated from 11.1-U6 in mid-January. The controller is an LSI 9200-16e.

Output:
FreeNAS.local kernel log messages:
> (da1:mps0:0:9:0): WRITE(16). CDB: 8a 00 00 00 00 01 78 b2 8a 40 00 00 00 10 00 00 length 8192 SMID 198 terminated ioc 804b loginfo 31120303 scsi 0 state c xfer 0
> (da1:mps0:0:9:0): WRITE(16). CDB: 8a 00 00 00 00 01 78 b2 8a 40 00 00 00 10 00 00
> (da1:mps0:0:9:0): CAM status: CCB request completed with an error
> (da1:mps0:0:9:0): Retrying command
> (da1:mps0:0:9:0): WRITE(16). CDB: 8a 00 00 00 00 01 78 b2 8a 40 00 00 00 10 00 00
> (da1:mps0:0:9:0): CAM status: SCSI Status Error
> (da1:mps0:0:9:0): SCSI status: Check Condition
> (da1:mps0:0:9:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> (da1:mps0:0:9:0): Retrying command (per sense data)
> (da1:mps0:0:9:0): READ(10). CDB: 28 00 23 4c 22 38 00 00 08 00
> (da1:mps0:0:9:0): CAM status: SCSI Status Error
> (da1:mps0:0:9:0): SCSI status: Check Condition
> (da1:mps0:0:9:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> (da1:mps0:0:9:0): Retrying command (per sense data)

I'm not at home right now, so I can't say which drive da1 is (it should be an ST5000DM000). Will update later.
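
If it helps, you can usually map daX back to a model and serial remotely, without opening the case (a sketch using da1 as the example device):

Code:
# Model, serial number, and capacity as reported by the drive itself
smartctl -i /dev/da1

# Which controller and bus each daX device hangs off
camcontrol devlist

# gptid labels, handy for matching devices against zpool status output
glabel status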
 