Disks or HW Failing for DAS

R2U2 · Jul 13, 2018

I've been running this DAS box for years now, and only after upgrading to FreeNAS 11 have I started having problems (coincidence?).
I have since detached the volume on da11-da21 and also removed another volume of 8(?) 8TB disks, which have also been having issues (I think with staying connected, or with all the disks even being seen from the NAS; they worked fine until around the same time, after writing ~8TB to the volume).

The main reason I suspect the hardware is that I added a few new 8TB drives after fully testing them, and they started having issues around the same time the older disks started to show errors. (Both volumes were on the same DAS. Later today I will plug in the 8TB disks and update the thread.)

The disks in the NAS are acting perfectly fine; I'm only having issues with my DAS.

FreeNAS-11-STABLE

DAS
SUPERMICRO 4U 846
Power board for JBOD
BPN-SAS2-846EL1 (I don't really have $300-400 at the moment to just throw it at another BPN to test/swap)

NAS
Supermicro X8DTH
2x E5620
96GB ECC
LSI SAS2008
DAS is connected to NAS via Dell PERC H200E (Already swapped card and cables. Could the FW need flashing?)

Here is a snippet of the SMART status

Code:


+------+--------+----+-----+-----+-------+-------+-------+--------+------+--------+------+-------+----+
|Device|Serial  |Temp|Power|Start|Spin   |ReAlloc|Current|Offline |Seek  |Total   |High  |Command|Last|
|      |Number  |    |On   |Stop |Retry  |Sectors|Pending|Uncorrec|Errors|Seeks   |Fly   |Timeout|Test|
|      |        |    |Hours|Count|Count  |       |Sectors|Sectors |      |        |Writes|Count  |Age |
+------+--------+----+-----+-----+-------+-------+-------+--------+------+--------+------+-------+----+
|da11 ?|YHGZW4JA| 38 |45186| 2183|      0|      0|      0|       0|     0|     247|   N/A|    N/A|   2|
|da12 ?|YHGZW4RA| 39 |44330| 1953|      0|      0|      0|       0|     0|  196774|   N/A|    N/A|   2|
|da13 ?|YHH26Z0A| 34 |44175| 2570|      0|      0|      0|       0|     0| 7405654|   N/A|    N/A|   6|
|da14 ?|YVH40NGA| 35 |  229| 1146|      0|      0|      0|       0|     0|      16|   N/A|    N/A|  10|
|da15 ?|YHH05LPA| 39 |45784| 2194|      0|      0|      0|       0|     0|  917702|   N/A|    N/A|   2|
|da16 ?|YHH0BH7A| 39 |44363| 2434|      0|      0|      0|       0|     0| 2162897|   N/A|    N/A|   2|
|da17 ?|YVHPGDTA| 38 |37175| 1919|      0|      0|      0|       0|     0|     112|   N/A|    N/A|   2|
|da18 ?|YHH02BLA| 34 |44013| 1648|      0|      0|      0|       0|     0|10420296|   N/A|    N/A|   2|
|da19 ?|P8GJU02P| 35 | 9116| 1338|1638410|      0|      0|       0|     0| 4784132|   N/A|    N/A|   3|
|da20 ?|YHH0A23A| 35 |44208| 2847|      0|      0|      0|       0|     0| 1114267|   N/A|    N/A|   8|
|da21 ?|YHGZE4UA| 34 |44229| 3584|      0|      0|      0|       0|     0| 4128961|   N/A|    N/A|   5|
+------+--------+----+-----+-----+-------+-------+-------+--------+------+--------+------+-------+----+



########## SMART status report for da18 drive (HITACHI HUA723030ALA640 : YHH02BLA) ##########

SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
See vendor-specific Attribute list for failed Attributes.

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   054    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0007   253   253   024    Pre-fail  Always       -       121 (Average 305)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       1648
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   062   062   067    Pre-fail  Always   FAILING_NOW 10420296
  8 Seek_Time_Performance   0x0005   100   100   020    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0012   094   094   000    Old_age   Always       -       44013
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       68
192 Power-Off_Retract_Count 0x0032   098   098   000    Old_age   Always       -       3089
193 Load_Cycle_Count        0x0012   098   098   000    Old_age   Always       -       3089
194 Temperature_Celsius     0x0002   176   176   000    Old_age   Always       -       34 (Min/Max 16/60)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

ATA Error Count: 34 (device log contains only the most recent five errors)
		CR = Command Register [HEX]
		FR = Features Register [HEX]
		SC = Sector Count Register [HEX]
		SN = Sector Number Register [HEX]
		CL = Cylinder Low Register [HEX]
		CH = Cylinder High Register [HEX]
		DH = Device/Head Register [HEX]
		DC = Device Command Register [HEX]
		ER = Error register [HEX]
		ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 34 occurred at disk power-on lifetime: 44013 hours (1833 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 00 9e 50 0d  Error: UNC at LBA = 0x0d509e00 = 223387136

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 80 ff 3f 40 00	  00:26:58.152  READ FPDMA QUEUED
  60 00 08 80 01 40 40 00	  00:26:58.151  READ FPDMA QUEUED
  60 00 08 80 fe 3f 40 00	  00:26:58.150  READ FPDMA QUEUED
  60 00 08 80 fd 3f 40 00	  00:26:58.147  READ FPDMA QUEUED
  60 00 10 80 01 40 40 00	  00:26:58.141  READ FPDMA QUEUED

Error 33 occurred at disk power-on lifetime: 44013 hours (1833 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 80 00 40 00  Error: UNC at LBA = 0x00400080 = 4194432

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 80 03 00 40 00	  00:23:40.786  READ FPDMA QUEUED
  60 00 00 00 03 00 40 00	  00:23:40.786  READ FPDMA QUEUED
  60 00 00 80 02 00 40 00	  00:23:40.786  READ FPDMA QUEUED
  60 00 00 00 02 00 40 00	  00:23:40.785  READ FPDMA QUEUED
  60 00 00 80 01 00 40 00	  00:23:40.785  READ FPDMA QUEUED

Error 32 occurred at disk power-on lifetime: 44012 hours (1833 days + 20 hours)
  When the command that caused the error occurred, the device was in standby mode.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 40 40 01 00 00  Error: UNC at LBA = 0x00000140 = 320

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 18 80 00 40 40 00	  00:09:54.305  READ FPDMA QUEUED
  60 00 10 80 00 40 40 00	  00:09:51.586  READ FPDMA QUEUED
  60 00 08 80 00 00 40 00	  00:09:51.492  READ FPDMA QUEUED
  60 00 00 00 00 00 40 00	  00:09:51.476  READ FPDMA QUEUED
  60 01 00 87 a3 50 40 00	  00:09:50.741  READ FPDMA QUEUED

Error 31 occurred at disk power-on lifetime: 44012 hours (1833 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 01 80 00 00 00  Error: UNC at LBA = 0x00000080 = 128

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 01 00 80 00 00 40 00	  00:47:32.617  READ FPDMA QUEUED
  ec 00 00 00 00 00 00 00	  00:47:32.615  IDENTIFY DEVICE
  ec 00 00 00 00 00 00 00	  00:47:32.610  IDENTIFY DEVICE
  60 01 00 80 00 00 40 00	  00:47:32.599  READ FPDMA QUEUED
  b0 d5 01 01 4f c2 00 00	  00:47:20.086  SMART READ LOG

Error 30 occurred at disk power-on lifetime: 44012 hours (1833 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 01 7f 00 40 00  Error: UNC at LBA = 0x0040007f = 4194431

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 01 00 7f 00 40 40 00	  00:46:02.178  READ FPDMA QUEUED
  ec 00 00 00 00 00 00 00	  00:46:02.176  IDENTIFY DEVICE
  60 01 00 7f 00 40 40 00	  00:46:02.161  READ FPDMA QUEUED
  ec 00 00 00 00 00 00 00	  00:46:02.159  IDENTIFY DEVICE
  2f 00 01 10 00 00 00 00	  00:46:02.076  READ LOG EXT

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Short offline       Completed without error       00%     43954         -




########## SMART status report for da21 drive (HITACHI HUA723030ALA640 : YHGZE4UA) ##########

SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
See vendor-specific Attribute list for failed Attributes.

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       1
  2 Throughput_Performance  0x0005   100   100   054    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0007   253   253   024    Pre-fail  Always       -       359 (Average 108)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       3584
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   061   061   067    Pre-fail  Always   FAILING_NOW 4128961
  8 Seek_Time_Performance   0x0005   100   100   020    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0012   094   094   000    Old_age   Always       -       44229
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       74
192 Power-Off_Retract_Count 0x0032   096   096   000    Old_age   Always       -       4971
193 Load_Cycle_Count        0x0012   096   096   000    Old_age   Always       -       4971
194 Temperature_Celsius     0x0002   176   176   000    Old_age   Always       -       34 (Min/Max 16/58)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

ATA Error Count: 72 (device log contains only the most recent five errors)
		CR = Command Register [HEX]
		FR = Features Register [HEX]
		SC = Sector Count Register [HEX]
		SN = Sector Number Register [HEX]
		CL = Cylinder Low Register [HEX]
		CH = Cylinder High Register [HEX]
		DH = Device/Head Register [HEX]
		DC = Device Command Register [HEX]
		ER = Error register [HEX]
		ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 72 occurred at disk power-on lifetime: 44220 hours (1842 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 e0 a0 9e 50 0d  Error: UNC at LBA = 0x0d509ea0 = 223387296

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 e0 18 a0 a0 50 40 00   1d+21:18:05.013  READ FPDMA QUEUED
  60 e0 10 a0 9e 50 40 00   1d+21:18:05.013  READ FPDMA QUEUED
  60 e0 08 a0 02 40 40 00   1d+21:18:05.012  READ FPDMA QUEUED
  60 e0 00 a0 00 40 40 00   1d+21:18:05.012  READ FPDMA QUEUED
  60 10 20 90 a0 50 40 00   1d+21:16:45.641  READ FPDMA QUEUED

Error 71 occurred at disk power-on lifetime: 44220 hours (1842 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 80 00 04 40 00  Error: UNC at LBA = 0x00400400 = 4195328

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 10 80 a1 50 40 00   1d+21:16:33.156  READ FPDMA QUEUED
  60 00 08 80 9f 50 40 00   1d+21:16:33.156  READ FPDMA QUEUED
  60 00 00 80 03 40 40 00   1d+21:16:33.156  READ FPDMA QUEUED
  60 b0 00 d0 01 40 40 00   1d+21:16:33.155  READ FPDMA QUEUED
  60 08 48 c8 01 40 40 00   1d+21:16:33.141  READ FPDMA QUEUED

Error 70 occurred at disk power-on lifetime: 44220 hours (1842 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 40 c0 00 00 00  Error: UNC at LBA = 0x000000c0 = 192

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 28 80 02 40 40 00   1d+21:11:55.992  READ FPDMA QUEUED
  60 00 20 80 01 00 40 00   1d+21:11:55.992  READ FPDMA QUEUED
  60 00 18 80 01 00 40 00   1d+21:11:55.992  READ FPDMA QUEUED
  60 00 10 80 00 40 40 00   1d+21:11:55.992  READ FPDMA QUEUED
  60 00 08 80 00 00 40 00   1d+21:11:55.992  READ FPDMA QUEUED

Error 69 occurred at disk power-on lifetime: 44220 hours (1842 days + 12 hours)
  When the command that caused the error occurred, the device was in standby mode.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 80 00 40 00  Error: UNC at LBA = 0x00400080 = 4194432

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 28 80 01 00 40 00   1d+21:11:47.657  READ FPDMA QUEUED
  60 00 08 80 01 00 40 00   1d+21:11:47.656  READ FPDMA QUEUED
  60 00 00 00 00 00 40 00   1d+21:11:47.655  READ FPDMA QUEUED
  60 00 20 80 00 00 40 00   1d+21:11:47.651  READ FPDMA QUEUED
  60 00 18 80 00 40 40 00   1d+21:11:47.651  READ FPDMA QUEUED

Error 68 occurred at disk power-on lifetime: 44180 hours (1840 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 01 af a3 50 0d  Error: UNC at LBA = 0x0d50a3af = 223388591

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 01 00 af a3 50 40 00	  05:09:01.855  READ FPDMA QUEUED
  ec 00 00 00 00 00 00 00	  05:08:41.550  IDENTIFY DEVICE
  ef 10 02 00 00 00 00 00	  05:08:41.302  SET FEATURES [Enable SATA feature]
  ef 02 00 00 00 00 00 00	  05:08:41.302  SET FEATURES [Enable write cache]
  ef aa 00 00 00 00 00 00	  05:08:41.302  SET FEATURES [Enable read look-ahead]

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Short offline       Completed without error       00%     44103         -
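
For anyone reading the error log entries above: the LBA printed on each "Error: UNC at LBA" line can be recomputed from the register dump on the same line. The sketch below assumes the 28-bit layout those lines appear to use (low nibble of DH, then CH, CL, SN); the function names are made up for illustration and are not part of smartctl.

```python
def parse_error_registers(line: str) -> dict:
    """Split an 'ER ST SC SN CL CH DH' register line into named integers."""
    names = ["ER", "ST", "SC", "SN", "CL", "CH", "DH"]
    values = [int(tok, 16) for tok in line.split()[:7]]
    return dict(zip(names, values))


def lba28_from_registers(sn: int, cl: int, ch: int, dh: int) -> int:
    """Combine the task-file registers into a 28-bit LBA (assumed layout)."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Register line copied from Error 34 of the da18 report.
regs = parse_error_registers(
    "40 51 00 00 9e 50 0d  Error: UNC at LBA = 0x0d509e00 = 223387136"
)
lba = lba28_from_registers(regs["SN"], regs["CL"], regs["CH"], regs["DH"])
unc = bool(regs["ER"] & 0x40)  # bit 6 of the error register is UNC
print(lba, unc)  # 223387136 True
```

Both drives report UNC errors in the same few LBA regions (around 0x0d509e00, 0x00400080 and the first few hundred sectors), which is worth keeping in mind when deciding whether the disks or something in front of them is at fault.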

Chris Moore · Jul 13, 2018

R2U2 said:
Connected to NAS via Dell PERC H200E (Already swapped card and cables. Could the FW need flashing?)

You could give a lot more information about your hardware configuration so someone might be able to get a clue, but just looking at what you did provide, it purely looks like failed drives.

Johnnie Black · Jul 14, 2018

da18 certainly is failing.
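
As a rough illustration of why the report says FAILED: SMART flags an attribute when its normalized VALUE has dropped to or below THRESH, and on da18 Seek_Error_Rate is at 062 against a threshold of 067. The sketch below applies that rule to a few attribute lines copied from the da18 report; the regex and the function name are assumptions for illustration, not smartctl's own code.

```python
import re

# Matches one row of the smartctl attribute table up to the TYPE column.
ATTR_RE = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+0x[0-9a-fA-F]{4}\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+(?P<type>\S+)"
)


def failing_attributes(report: str) -> list:
    """Return (name, value, thresh) for Pre-fail attributes at or below threshold."""
    failing = []
    for line in report.splitlines():
        m = ATTR_RE.match(line)
        if not m:
            continue
        value, thresh = int(m["value"]), int(m["thresh"])
        # A threshold of 000 means the attribute can never trip.
        if m["type"] == "Pre-fail" and thresh > 0 and value <= thresh:
            failing.append((m["name"], value, thresh))
    return failing


SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   062   062   067    Pre-fail  Always   FAILING_NOW 10420296
  9 Power_On_Hours          0x0012   094   094   000    Old_age   Always       -       44013
"""
print(failing_attributes(SAMPLE))  # [('Seek_Error_Rate', 62, 67)]
```

The same check on the da21 report would flag Seek_Error_Rate as well (061 against 067), so da18 is not the only drive tripping this attribute.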

R2U2 · Jul 14, 2018

Chris Moore said:
You could give a lot more information about your hardware configuration so someone might be able to get a clue, but just looking at what you did provide, it purely looks like failed drives.

The main reason I suspect the hardware is that I added a few new 8TB drives after fully testing them, and they started having issues around the same time the older disks started to show errors. (Both volumes were on the same DAS.)

I've added more HW information.

Chris Moore · Jul 14, 2018

R2U2 said:
DAS is connected to NAS via Dell PERC H200E (Already swapped card and cables. Could the FW need flashing?)

Both controllers were the same model? Is it still the Dell firmware or were these flashed to the LSI firmware?

R2U2 said:
The main reason I suspect it's something with hardware is I added a few new 8TB drives after fully testing them and they were having issues around the same time the disks started to show errors.

Unless these drives really did all fail at once, which sounds a bit unlikely, it could be some incompatibility between the controller and the new 8TB drives. You might need newer firmware, or possibly even a controller with a newer chipset. Something like:
https://www.ebay.com/itm/SAS9207-8E-LSI-8-port-6gb-s-SAS-SATA-TO-PCIe-Host-Bus-Adapter/253413945368

It is based on the newer LSI SAS 2308 6Gb/s I/O controller.

R2U2 · Jul 14, 2018

Chris Moore said:
Both controllers were the same model? Is it still the Dell firmware or were these flashed to the LSI firmware?

They are both the same model with Dell firmware (the one installed right now is on Dell 7.15.08.00-IT). The 8TB drives were shucked from WD easystore enclosures over a year ago, and I've read that the backplane I use is compatible with the 3.3V. If there is another firmware I could try before buying another card, I'm up for it.

If I remember correctly, there was an issue one day with not all the drives showing up on the DAS. When I first started having problems, I picked out a few drives and ran disk tests on the ones that had SMART issues, and only 1 of 11 came back as "I should replace this soon". I have removed the 8TB drives, thinking there was a problem with having them in the same box, and I still get issues: sometimes I don't see all the disks (3TB and 8TB), with 1 or 2 of each missing in FreeNAS, and then I just have to reboot until I see them.
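
To make the "reboot until I see it" check less manual, one option is to compare the device names the system currently reports against the set that should be there. This is only a sketch: how the list of present devices is collected is left open, the `present` list below is a hypothetical example and not data from this system, and both function names are invented for illustration.

```python
import re


def expand_range(spec: str) -> list:
    """Expand a range such as 'da11-da21' into ['da11', ..., 'da21']."""
    m = re.fullmatch(r"([a-z]+)(\d+)-([a-z]+)(\d+)", spec)
    if not m or m[1] != m[3]:
        raise ValueError(f"not a device range: {spec!r}")
    return [f"{m[1]}{n}" for n in range(int(m[2]), int(m[4]) + 1)]


def missing_disks(expected: list, present: list) -> list:
    """Return the expected device names that are not currently present."""
    seen = set(present)
    return [name for name in expected if name not in seen]


expected = expand_range("da11-da21")  # the detached volume from the first post
# Hypothetical snapshot in which two disks failed to show up after a boot.
present = [d for d in expected if d not in ("da14", "da19")]
print(missing_disks(expected, present))  # ['da14', 'da19']
```

Logging the missing names on every boot would show whether it is always the same slots that drop out or a different pair each time. Device numbering can shift when a disk is absent, so tracking serial numbers instead of daN names may be more reliable.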

Stux · Jul 14, 2018

Yeah. So, you need to cross-flash and update the firmware.

It could have happened with the upgrade to 11.0, depending on what you were upgrading from: the SAS driver has been updated in the last few years, and the cards' firmware needs to be updated to match.

Chris Moore · Jul 14, 2018

R2U2 said:
They are both the same model with dell firmware (The one in right now is on Dell 7.15.08.00-IT).

I would look at getting the latest firmware from LSI installed instead of the Dell firmware. The Dell firmware for the regular H200 is restricted in what it can do compared with the LSI version.

R2U2 · Jul 14, 2018

Stux said:
Could’ve happened with the upgrade to 11.0 depending on what you were upgrading from

I think I was on 9.10, then did a fresh install of Corral, and then another fresh install for 11. I'll try flashing the FW to LSI 9211-8i.

I know that I had to update to v20 on my 9211-8 cards back at the end of 9 or when 10 was still around, but I did not have 8TB disks back then.
