May I have your opinion on what happened?

AuBird · Nov 30, 2017

Thank you for taking the time to look at this post. Yesterday my certified 4U FreeNAS running FreeNAS-9.10.2-U6 (561f0d7a1) kicked off an email saying one or more devices experienced an error. When I logged in to investigate the zpool status -v output was something like below, and I did see the CKSUM column increment by 20-30. Then all of the sudden the system rebooted and all the CKSUM numbers are back at zero. All data on the iSCSI share appears to be okay, the backup application using the share hasn't reported any integrity issues yet.

From what I've read this means there is a disk in trouble or bad memory. I've attached the output of smartctl -a for all active multipaths. Is it likely bad memory? Would memtest catch any errors with ECC memory? Or are there a few bad drives? Any idea why the system would just reboot? I wasn't able to see anything in the logs other than cases being built for the raidz2-1 vdev, and I don't know what that means.

EDIT: Just one more thing, I did upgrade from 9.3 to 9.10 a couple days ago.

Code:

[root@freenas] ~# zpool status -v
  pool: fnas01p01
 state: ONLINE
status: One or more devices has experienced an error resulting in data
		corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
		entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 13h41m with 0 errors on Sun Nov 26 13:41:51 2017
config:

		NAME											STATE	 READ WRITE CKSUM
		fnas01p01									   ONLINE	   0	 0   107
		  raidz2-0									  ONLINE	   0	 0	 0
			gptid/dc9a2811-615c-11e5-929b-0cc47a55b944  ONLINE	   0	 0	 0
			gptid/dce660b8-615c-11e5-929b-0cc47a55b944  ONLINE	   0	 0	 0
			gptid/dd342422-615c-11e5-929b-0cc47a55b944  ONLINE	   0	 0	 0
			gptid/dd830db8-615c-11e5-929b-0cc47a55b944  ONLINE	   0	 0	 0
			gptid/ddd197f0-615c-11e5-929b-0cc47a55b944  ONLINE	   0	 0	 0
			gptid/de20caf7-615c-11e5-929b-0cc47a55b944  ONLINE	   0	 0	 0
			gptid/de70e4c0-615c-11e5-929b-0cc47a55b944  ONLINE	   0	 0	 0
			gptid/debecb5a-615c-11e5-929b-0cc47a55b944  ONLINE	   0	 0	 0
			gptid/df0e3e33-615c-11e5-929b-0cc47a55b944  ONLINE	   0	 0	 0
			gptid/df5e1829-615c-11e5-929b-0cc47a55b944  ONLINE	   0	 0	 0
			gptid/dfaea5ca-615c-11e5-929b-0cc47a55b944  ONLINE	   0	 0	 0
		  raidz2-1									  ONLINE	   0	 0   214
			gptid/d880b4e4-62b8-11e5-929b-0cc47a55b944  ONLINE	   0	 0	 0
			gptid/d8ed9dcd-62b8-11e5-929b-0cc47a55b944  ONLINE	   0	 0	 0
			gptid/d95e33ef-62b8-11e5-929b-0cc47a55b944  ONLINE	   0	 0	 0
			gptid/d9c8a064-62b8-11e5-929b-0cc47a55b944  ONLINE	   0	 0	 0
			gptid/da3642fb-62b8-11e5-929b-0cc47a55b944  ONLINE	   0	 0	 0
			gptid/daa1e095-62b8-11e5-929b-0cc47a55b944  ONLINE	   0	 0	 0
			gptid/db0bc938-62b8-11e5-929b-0cc47a55b944  ONLINE	   0	 0	 0
			gptid/db7cfec9-62b8-11e5-929b-0cc47a55b944  ONLINE	   0	 0	 0
			gptid/dbeac94c-62b8-11e5-929b-0cc47a55b944  ONLINE	   0	 0	 0
			gptid/dc567379-62b8-11e5-929b-0cc47a55b944  ONLINE	   0	 0	 0
			gptid/dcc284fc-62b8-11e5-929b-0cc47a55b944  ONLINE	   0	 0	 0
		spares
		  gptid/e00654c8-615c-11e5-929b-0cc47a55b944	AVAIL
		  gptid/bcabc891-62ba-11e5-929b-0cc47a55b944	AVAIL

errors: Permanent errors have been detected in the following files:

		fnas01p01/rep01lun02:<0x1>

Chris Moore · Nov 30, 2017

Do you have any monitoring scripts setup or is the system running the standard build with no adjustments or custom setting?

Sent from my SAMSUNG-SGH-I537 using Tapatalk

AuBird · Nov 30, 2017

It's standard, I just added scheduled SMART tests. No custom settings, there is something about snmp in the tunables, but I'm pretty sure that is stock.

AuBird · Nov 30, 2017

I added the fact this was upgraded from 9.3 to 9.10 just a couple days ago, maybe the day before this error kicked out?

Chris Moore · Nov 30, 2017

I was looking at your attached text file. It appears to be generated by a script that I am not familiar with. Due to the function of the script, the data is incomplete and presented in an unusual format. This makes it difficult to tell if there is a problem with one of your drives, but I will look at it further. The unfiltered output of smartctl -a /dev/daXX would be easier to read.
I would suggest that, if there is not some compelling reason preventing it, you should upgrade to the latest version of the OS.
In addition to that, you should utilize some of the scripts here:
https://forums.freenas.org/index.ph...for-freenas-scripts-including-disk-burnin.28/
to keep an eye on the function of the drives.
Is this a personal system or is it being used in a business environment?

Chris Moore · Nov 30, 2017

AuBird said:
I wasn't able to see anything in the logs other than cases being built for the raidz2-1 vdev, and I don't know what that means.

'cases' are how the system keeps track of accumulated errors that a drive is experiencing. If the drive experiences enough errors, ZFS will mark the drive as failed. Since you have hot-spares, and RAID-z2, I would expect your data to be safe, however there is the error at the end of your zpool status

Code:

errors: Permanent errors have been detected in the following files:

		fnas01p01/rep01lun02:

The naming on that file makes it look like an iSCSI extent, but you said that the applications were working.

AuBird said:
Is it likely bad memory? Would memtest catch any errors with ECC memory?

I don't think this is a RAM error. From what I see, I would expect some disk subsystem error. Either a disk, or disks, or the controller.

I noticed a line, "Drive Trip Temperature: 85 C", in your text file. What is the significance of this?

AuBird · Nov 30, 2017

This is in a business environment, not heavily used and only stores our backups. I'll take a look at the next stable branch and the scripts, thank you.

I threw the script together so I could easily get output of smartctl -a /dev/daXX without running it for each drive. It's just a for loop running through all the ACTIVE multipath drives.

AuBird · Nov 30, 2017

The "Drive Trip Temperature: 85 C" line is part of the output from smartctl -a /dev/da1, and I assume it's the drives temperature limit suggested by the manufacture.

Code:

[root@freenas] ~# smartctl -a /dev/da1
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:			   HGST
Product:			  HUS724040ALS640
Revision:			 A280
Compliance:		   SPC-4
User Capacity:		4,000,787,030,016 bytes [4.00 TB]
Logical block size:   512 bytes
LU is fully provisioned
Rotation Rate:		7200 rpm
Form Factor:		  3.5 inches
Logical Unit id:	  0x5000cca05cd295fc
Serial number:		PCKRU81X
Device type:		  disk
Transport protocol:   SAS (SPL-3)
Local Time is:		Thu Nov 30 09:44:06 2017 EST
SMART support is:	 Available - device has SMART capability.
SMART support is:	 Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:	 33 C
Drive Trip Temperature:		85 C

Manufactured in week 11 of year 2015
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  17
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  2405
Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 17665617016389632

Error counter log:
		   Errors Corrected by		   Total   Correction	 Gigabytes	Total
			   ECC		  rereads/	errors   algorithm	  processed	uncorrected
		   fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:	 198854		0		 0	198854   16679186	 115332.008		   0
write:		 0		0		 0		 0	 150285	  22177.758		   0
verify:		0		0		 0		 0	 817121		  0.000		   0

Non-medium error count:		0

SMART Self-test log
Num  Test			  Status				 segment  LifeTime  LBA_first_err [SK ASC ASQ]
	 Description							  number   (hours)
# 1  Background short  Completed				   -   19450				 - [-   -	-]
# 2  Background long   Completed				   -   19447				 - [-   -	-]

Long (extended) Self Test duration: 37452 seconds [624.2 minutes]

[root@freenas] ~#

Chris Moore · Nov 30, 2017

I have dealt with a lot of drives, even HGST, but this is a very different output than I usually see. The specifics change from one model to another even in the same vendor, but the part at the bottom usually looks more like this:

Code:

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x000f   117   099   006	Pre-fail  Always	   -	   135984832
  3 Spin_Up_Time			0x0003   096   096   000	Pre-fail  Always	   -	   0
  4 Start_Stop_Count		0x0032   100   100   020	Old_age   Always	   -	   95
  5 Reallocated_Sector_Ct   0x0033   100   100   010	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x000f   077   060   030	Pre-fail  Always	   -	   50498920
  9 Power_On_Hours		  0x0032   094   094   000	Old_age   Always	   -	   5370
 10 Spin_Retry_Count		0x0013   100   100   097	Pre-fail  Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   020	Old_age   Always	   -	   87
183 Runtime_Bad_Block	   0x0032   100   100   000	Old_age   Always	   -	   0
184 End-to-End_Error		0x0032   100   100   099	Old_age   Always	   -	   0
187 Reported_Uncorrect	  0x0032   100   100   000	Old_age   Always	   -	   0
188 Command_Timeout		 0x0032   100   100   000	Old_age   Always	   -	   0 0 0
189 High_Fly_Writes		 0x003a   097   097   000	Old_age   Always	   -	   3
190 Airflow_Temperature_Cel 0x0022   070   053   045	Old_age   Always	   -	   30 (Min/Max 28/33)
191 G-Sense_Error_Rate	  0x0032   100   100   000	Old_age   Always	   -	   0
192 Power-Off_Retract_Count 0x0032   100   100   000	Old_age   Always	   -	   52
193 Load_Cycle_Count		0x0032   100   100   000	Old_age   Always	   -	   715
194 Temperature_Celsius	 0x0022   030   047   000	Old_age   Always	   -	   30 (0 20 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0010   100   100   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x003e   200   200   000	Old_age   Always	   -	   0
240 Head_Flying_Hours	   0x0000   100   253   000	Old_age   Offline	  -	   5323h+24m+33.496s
241 Total_LBAs_Written	  0x0000   100   253   000	Old_age   Offline	  -	   26028475929
242 Total_LBAs_Read		 0x0000   100   253   000	Old_age   Offline	  -	   1030194009260

No Errors Logged

Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
Extended offline	Completed without error	   00%	  5365		 -

AuBird · Nov 30, 2017

My drives at home show tables like that as well, but this FreeNAS and and another running FreeNAS 11 show the same strange output with these HGST drives. So far I haven't found a way to get the same sort of information you listed from these drives.

Johnnie Black · Nov 30, 2017

Those are SAS disks, SMART is very different from SATA disks.

AuBird · Nov 30, 2017

I started a scrub on the pool this morning, but just stopped it after seeing this output which spooked me. I might just pull these drives and throw them into our spare and run a scrub there.

Code:

[root@freenas] ~# zpool status -v
  pool: fnas01p01
 state: ONLINE
status: One or more devices has experienced an error resulting in data
		corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
		entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub canceled on Thu Nov 30 10:11:30 2017
config:

		NAME											STATE	 READ WRITE CKSUM
		fnas01p01									   ONLINE	   0	 0   109
		  raidz2-0									  ONLINE	   0	 0   108
			gptid/dc9a2811-615c-11e5-929b-0cc47a55b944  ONLINE	   0	 0	 9
			gptid/dce660b8-615c-11e5-929b-0cc47a55b944  ONLINE	   0	 0	10
			gptid/dd342422-615c-11e5-929b-0cc47a55b944  ONLINE	   0	 0	10
			gptid/dd830db8-615c-11e5-929b-0cc47a55b944  ONLINE	   0	 0	 4
			gptid/ddd197f0-615c-11e5-929b-0cc47a55b944  ONLINE	   0	 0	 7
			gptid/de20caf7-615c-11e5-929b-0cc47a55b944  ONLINE	   0	 0	 6
			gptid/de70e4c0-615c-11e5-929b-0cc47a55b944  ONLINE	   0	 0	 5
			gptid/debecb5a-615c-11e5-929b-0cc47a55b944  ONLINE	   0	 0	 5
			gptid/df0e3e33-615c-11e5-929b-0cc47a55b944  ONLINE	   0	 0	 6
			gptid/df5e1829-615c-11e5-929b-0cc47a55b944  ONLINE	   0	 0	 9
			gptid/dfaea5ca-615c-11e5-929b-0cc47a55b944  ONLINE	   0	 0	 4
		  raidz2-1									  ONLINE	   0	 0   110
			gptid/d880b4e4-62b8-11e5-929b-0cc47a55b944  ONLINE	   0	 0	10
			gptid/d8ed9dcd-62b8-11e5-929b-0cc47a55b944  ONLINE	   0	 0	 3
			gptid/d95e33ef-62b8-11e5-929b-0cc47a55b944  ONLINE	   0	 0	 6
			gptid/d9c8a064-62b8-11e5-929b-0cc47a55b944  ONLINE	   0	 0	 6
			gptid/da3642fb-62b8-11e5-929b-0cc47a55b944  ONLINE	   0	 0	 5
			gptid/daa1e095-62b8-11e5-929b-0cc47a55b944  ONLINE	   0	 0	 5
			gptid/db0bc938-62b8-11e5-929b-0cc47a55b944  ONLINE	   0	 0	 5
			gptid/db7cfec9-62b8-11e5-929b-0cc47a55b944  ONLINE	   0	 0	11
			gptid/dbeac94c-62b8-11e5-929b-0cc47a55b944  ONLINE	   0	 0	 9
			gptid/dc567379-62b8-11e5-929b-0cc47a55b944  ONLINE	   0	 0	 2
			gptid/dcc284fc-62b8-11e5-929b-0cc47a55b944  ONLINE	   0	 0	14
		spares
		  gptid/e00654c8-615c-11e5-929b-0cc47a55b944	AVAIL
		  gptid/bcabc891-62ba-11e5-929b-0cc47a55b944	AVAIL

errors: Permanent errors have been detected in the following files:

		fnas01p01/rep01lun01:<0x1>
		fnas01p01/rep01lun02:<0x1>

Chris Moore · Nov 30, 2017

Every drive in the text file you uploaded had a large number of errors:
- EDIT - except for the two hot spares. They have not been used, much.

Code:

Error counter log:
		   Errors Corrected by		   Total   Correction	 Gigabytes	Total
			   ECC		  rereads/	errors   algorithm	  processed	uncorrected
		   fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:	 271040		3		 0	271043   16170787	 115075.134		   0
write:		 0		0		 0		 0	 320778	  22177.539		   0
verify:		0		0		 0		 0	  34814		  0.000		   0

Non-medium error count:		4

(internal in the drive) but, it appeared that the errors had been corrected.
The scrub information you just posted makes me wonder.
Then the following drives had "Non-medium" errors:

Code:

da24
Serial number:		PCKSEZEX
Non-medium error count:		4

da42
Serial number:		PCKT1WHX
Non-medium error count:		1

I think I would let the scrub finish and see if it can fix the problems.
If the system is holding backups, you might not actually have a viable backup.

Chris Moore · Nov 30, 2017

Would you post the output of sas2flash so we can see the firmware version of the SAS controller.

AuBird · Nov 30, 2017

Just confirmed we got the data copied off yesterday

Chris Moore said:
Would you post the output of sas2flash so we can see the firmware version of the SAS controller.

Code:

[root@freenas] ~# sas2flash
LSI Corporation SAS2 Flash Utility
Version 16.00.00.00 (2013.03.01)
Copyright (c) 2008-2013 LSI Corporation. All rights reserved

AuBird · Nov 30, 2017

Heh, I remember 16 being a thing for the 9.3 branch. I could easily upgrade the firmware, thank you.

EDIT: But the recent purchase from iXsystems with 11 on it is running the same version

Code:

root@freenasb11a:~ # sas2flash
LSI Corporation SAS2 Flash Utility
Version 16.00.00.00 (2013.03.01)
Copyright (c) 2008-2013 LSI Corporation. All rights reserved

		No LSI SAS adapters found! Limited Command Set Available!
		Finished Processing Commands Successfully.
		Exiting SAS2Flash.

AuBird · Nov 30, 2017

I've restarted the scrub.

Chris Moore · Nov 30, 2017

AuBird said:
Heh, I remember 16 being a thing for the 9.3 branch. I could easily upgrade the firmware, thank you.

Sorry, the whole command is sas2flash -list which should give you an output like this:

Code:

LSI Corporation SAS2 Flash Utility
Version 16.00.00.00 (2013.03.01)
Copyright (c) 2008-2013 LSI Corporation. All rights reserved

		Adapter Selected is a LSI SAS: SAS2308_2(D1)

		Controller Number			  : 0
		Controller					 : SAS2308_2(D1)
		PCI Address					: 00:02:00:00
		SAS Address					: removed
		NVDATA Version (Default)	   : 14.01.00.06
		NVDATA Version (Persistent)	: 14.01.00.06
		Firmware Product ID			: 0x2214 (IT)
		Firmware Version			   : 20.00.07.00
		NVDATA Vendor				  : LSI
		NVDATA Product ID			  : SAS9207-8i
		BIOS Version				   : 07.39.00.00
		UEFI BSD Version			   : 07.27.01.00
		FCODE Version				  : N/A
		Board Name					 : SAS9207-8i
		Board Assembly				 : removed
		Board Tracer Number			: removed

		Finished Processing Commands Successfully.
		Exiting SAS2Flash.

The version 16 you are seeing is just the version of the utility.

AuBird · Nov 30, 2017

Code:

[root@freenas] ~# sas2flash -list
LSI Corporation SAS2 Flash Utility
Version 16.00.00.00 (2013.03.01)
Copyright (c) 2008-2013 LSI Corporation. All rights reserved

		Adapter Selected is a LSI SAS: SAS2308_1(D1)

		Controller Number			  : 0
		Controller					 : SAS2308_1(D1)
		PCI Address					: 00:01:00:00
		SAS Address					: 5003048-0-1897-ff01
		NVDATA Version (Default)	   : 14.01.00.12
		NVDATA Version (Persistent)	: 14.01.00.12
		Firmware Product ID			: 0x2714 (IR)
		Firmware Version			   : 20.00.02.00
		NVDATA Vendor				  : LSI
		NVDATA Product ID			  : LSI2308-IR
		BIOS Version				   : 07.39.00.00
		UEFI BSD Version			   : 07.27.01.00
		FCODE Version				  : N/A
		Board Name					 : LSI2308-IR
		Board Assembly				 : N/A
		Board Tracer Number			: N/A

		Finished Processing Commands Successfully.
		Exiting SAS2Flash.

Chris Moore · Nov 30, 2017

AuBird said:
NVDATA Product ID : LSI2308-IR

I feel like this is wrong. That line should end with IT (Initiator Target) instead of IR. The IR firmware has the hardware RAID functionality and FreeNAS doesn't like that. The other thing, which doesn't feel like as big a deal, is this:

AuBird said:
Firmware Version : 20.00.02.00

That firmware is a little out of date, if I recall correctly, the latest is: 20.00.07.00
I am not saying this is the problem, but it looks like a possibility. Did this system come from iXsystems configured like this?

Important Announcement for the TrueNAS Community.

May I have your opinion on what happened?

Dabbler

Attachments

Hall of Famer

Dabbler

Dabbler

Hall of Famer

Hall of Famer

Dabbler

Dabbler

Hall of Famer

Dabbler

Guru

Dabbler

Hall of Famer

Hall of Famer

Dabbler

Dabbler

Dabbler

Hall of Famer

Dabbler

Hall of Famer

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "May I have your opinion on what happened?"

Similar threads