I recently put a new file server together to upgrade from a Synology DS1813+ to FreeBSD+ZFS to get into the ZFS goodness.
The hardware is:
Case: Fractal Design Node 804
Motherboard: Supermicro X11SSL-CF (6 SATA and onboard LSI 3008) https://www.newegg.com/Product/Product.aspx?Item=2W2-0014-00023
CPU: Intel(R) Xeon(R) CPU E3-1230 v6 @ 3.50GHz (3504.14-MHz K8-class CPU) https://www.newegg.com/Product/Product.aspx?Item=N82E16819117788
CPU cooler: Noctua NH-U12S https://www.newegg.com/Product/Product.aspx?Item=N82E16835608040
Memory: 2x Supermicro ECC DDR4-2400 16GB https://www.newegg.com/Product/Product.aspx?Item=9SIA7S67Y99082
Hard drives: 10x HGST Deskstar NAS 6TB https://www.newegg.com/Product/Product.aspx?Item=N82E16822146118
SSD drives: 2x Samsung 960 Pro 256 GB https://www.newegg.com/Product/Product.aspx?Item=N82E16820147685
SAS-SATA cables: CableCreation Internal HD Mini SAS 8643 to 4x SATA
right angle: https://www.amazon.com/gp/product/B01GPDBEDM/ref=oh_aui_detailpage_o01_s00?ie=UTF8&psc=1
straight angle: https://www.amazon.com/gp/product/B01BTEYLEQ/ref=oh_aui_detailpage_o01_s00?ie=UTF8&psc=1
SATA cables: Came with motherboard, 100 cm
Power supply: Seasonic Snow Silent 750W https://www.newegg.com/Product/Product.aspx?Item=N82E16817151221
Operating system: FreeBSD 11.2-RELEASE-p3 #0: Thu Sep 6 07:14:16 UTC 2018 root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC amd64
The SSDs and two of the HGSTs are hooked up to the SATA ports on the motherboard. The other eight HGSTs are hooked up through the integrated LSI 3008 SAS controller.
The OS is installed on the SSDs in a mirror vdev zpool for zroot. The 10 HGSTs are intended to be in a RAIDZ2 zpool for tank.
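For concreteness, tank gets created along these lines (zroot was laid down by the installer; the device names here match how they appear in the zpool status output further down, but are obviously system-specific):
Code:
# 10-wide RAIDZ2: 2 drives on motherboard SATA (ada2, ada3) and
# 8 drives on the LSI 3008 (da0-da7). Device names vary per system.
zpool create tank raidz2 ada2 ada3 da0 da1 da2 da3 da4 da5 da6 da7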
I have updated the motherboard to:
Firmware Revision : 01.45
Firmware Build Time : 02/06/2018
BIOS Version: 2.2
BIOS Build Time: 05/23/2018
Redfish Version : 1.0.1
CPLD Version: 02.b1.01
I have flashed the LSI 3008 to IT mode, firmware version 16.00.01.00 (dmesg reports: mpr0: Firmware: 16.00.01.00, Driver: 18.03.00.00-fbsd). This was the latest available on the Supermicro web site.
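For anyone wanting to double-check my versions, this is how I read them back from the running system (mprutil ships in the FreeBSD 11.x base system for mpr(4) controllers, if I'm reading the man page right):
Code:
# Report controller details, including the firmware revision:
mprutil show adapter
# And the firmware/driver pair the kernel attached with:
dmesg | grep mpr0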
I have read that the firmware and driver versions need to match in some fashion, but it is not clear how: some posts say the numbers must be identical, others say only approximately so.
I have been trying to burn in the system, and it's a good thing too, because I have been battling intermittent but persistent variants of this error:
Code:
Sep 24 00:13:40 fbnas kernel: (da2:mpr0:0:2:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 676 Aborting command 0xfffffe0001067bc0
Sep 24 00:13:40 fbnas kernel: mpr0: Sending reset from mprsas_send_abort for target ID 2
Sep 24 00:13:40 fbnas kernel: (pass2:mpr0:0:2:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID 1111 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
Sep 24 00:13:40 fbnas kernel: mpr0: Unfreezing devq for target ID 2
Sep 24 00:13:40 fbnas kernel: (da2:mpr0:0:2:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Sep 24 00:13:40 fbnas kernel: (da2:mpr0:0:2:0): CAM status: Command timeout
Sep 24 00:13:40 fbnas kernel: (da2:mpr0:0:2:0): Retrying command
Sep 24 00:13:41 fbnas kernel: (da2:mpr0:0:2:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Sep 24 00:13:41 fbnas kernel: (da2:mpr0:0:2:0): CAM status: SCSI Status Error
Sep 24 00:13:41 fbnas kernel: (da2:mpr0:0:2:0): SCSI status: Check Condition
Sep 24 00:13:41 fbnas kernel: (da2:mpr0:0:2:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Sep 24 00:13:41 fbnas kernel: (da2:mpr0:0:2:0): Error 6, Retries exhausted
Sep 24 00:13:41 fbnas kernel: (da2:mpr0:0:2:0): Invalidating pack
Sep 24 00:13:41 fbnas ZFS: vdev state changed, pool_guid=14629618157114388220 vdev_guid=5341139343660289690
Sep 24 00:13:41 fbnas ZFS: vdev state changed, pool_guid=14629618157114388220 vdev_guid=5341139343660289690
Suffice it to say, this results in a dropped disk (see the "Invalidating pack" line above) and this pool status:
Code:
  pool: tank
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          raidz2-0  DEGRADED     0     0     0
            ada2    ONLINE       0     0     0
            ada3    ONLINE       0     0     0
            da0     ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     FAULTED      6     4     0  too many errors
            da3     ONLINE       0     0     0
            da4     ONLINE       0     0     0
            da5     ONLINE       0     0     0
            da6     ONLINE       0     0     0
            da7     ONLINE       0     0     0

errors: No known data errors
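For what it's worth, the suggested action does work in the sense that the pool heals once the device is back. Roughly what I run after each drop (device name varies per incident):
Code:
# If the disk re-appeared on its own, clear the fault and let it resilver:
zpool clear tank da2
# If it didn't, bring it back explicitly first:
zpool online tank da2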
This is only the latest instance of the error.
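Before the next burn-in pass I intend to turn up the mpr(4) driver's diagnostics so the next occurrence logs more context. I am going off the mpr(4) man page here; debug_level is a bitmask and I haven't verified the exact bits myself:
Code:
# 3 = info (0x1) + fault (0x2) bits, per mpr(4):
sysctl dev.mpr.0.debug_level=3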
I have tried reseating the SAS cables and power cables multiple times over the past week of testing, to no avail.
I have also run tests with as many drives as possible on the SATA ports alone; the motherboard has six of them, so I can test 6 of the 10 HGSTs at a time that way. With a 6-disk raidz2 on the SATA ports under heavy testing, neither ZFS nor FreeBSD has reported a single problem; the native SATA ports seem rock solid. But I did get these errors when the other 4 drives, still connected to the SAS controller, were tested as a separate 4-disk raidz1 zpool.
It does not seem to correlate with any particular disk: I have gotten these reports for different disks at different times, always for drives connected to the SAS controller. As you can see in my info dump at the end, SMART doesn't report any errors or reallocations on any of the disks, and they each seem independently OK when tested on the SATA ports directly.
It does seem related to heavy activity on the SAS controller somehow: I have never directly observed this error on an idle system, only when loading it heavily.
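For reference, "heavy activity" in my testing is along these lines, on top of pool-level load like scrubs (device names illustrative; the write variant of this is destructive, so only on disks with nothing on them):
Code:
# Parallel sequential reads across every disk on the SAS controller:
for d in da0 da1 da2 da3 da4 da5 da6 da7; do
    dd if=/dev/$d of=/dev/null bs=1m &
done
wait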
So it seems related to the SAS controller path somehow. For this latest test I have the case completely open in case the problem is heat related: the case currently has only the upper front fans installed, so there is no direct airflow across the bottom of the board where the integrated SAS controller sits. That seemed relevant because I found a thread where someone reported a correlation between these errors and fan speed: he turned the fans down to reduce noise while he was in the room, turned them back up when he left, and the errors tracked the lower fan speed.
I do have Supermicro SAS-SATA cables and more Noctua fans on the way, so I will try those. But given that the errors have occurred across both of my current SAS cables, it is somewhat "aspirational" at this point to expect new SAS-SATA cables to solve the problem. The current SAS-SATA cables are 100 cm, which I have read is a bad idea, but my existing SATA cables are also 100 cm and nothing bad has happened on them. The new ones are 60 cm. I guess we'll see...
At this point, with the case open for this last test, I feel I have given this thing its best chance at success, and this last test is actually scarier: the disk dropped immediately, with very little error output. Nothing is obviously defective or failing consistently, and all error-type counts on the drives are 0 even after full-disk writes (tests of the "dd if=/dev/zero of=<disk>" variety), yet under heavy load on the SAS controller, especially with raidz2, a disk suddenly gets dropped. One time it was even two disks at once, which of course means all redundancy of a raidz2 pool is lost.
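For completeness, this is roughly how I check the drive-side counters after each pass (smartctl is from the smartmontools package; attribute names vary a bit by drive model):
Code:
# Scan each disk for reallocated/pending/uncorrectable/CRC counters:
for d in ada2 ada3 da0 da1 da2 da3 da4 da5 da6 da7; do
    echo "== $d =="
    smartctl -A /dev/$d | egrep -i 'realloc|pending|uncorrect|crc'
done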
Few of the postings I could find about this error seem to reach a clean resolution.
As of right now, I can't trust this server with any data if I am going to constantly worry about random drive drops.
First, can anyone tell me what this error actually means technically?
If not, is this possibly a bad SAS controller, and therefore a motherboard I should RMA?
Is there something simple I am overlooking?
Has anyone else had success or failure with the LSI 3008 and HGST 6TB NAS drives?