HELP needed -> SCSI sense: NOT READY -> pool lost

IceBoosteR · May 20, 2018

Hi all,
first of all my system config:
Xeon E3-1225v3 on Dell T20 motherboard
24GB of DDR3 ECC memory
10x4TB WD RED in RAIDZ2: 8 drives connected via a flashed M1015 and 2 drives via onboard SATA.
Seasonic Focus Plus Platinum 650W with APC Back UPS Pro 900VA in front of the system.

Today I got this error message:
CRITICAL: May 20, 2018, 11:09 a.m. - The volume RED state is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.

zpool status shows:

Code:

root@freenas:~ # zpool status
  pool: RED
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
		Sufficient replicas exist for the pool to continue functioning in a
		degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
		repaired.
  scan: scrub repaired 0 in 0 days 05:10:32 with 0 errors on Sun May 20 05:10:58 2018
config:

		NAME												STATE	 READ WRITE CKSUM
		RED												 DEGRADED	 0	 0	 0
		  raidz2-0										  DEGRADED	 0	 0	 0
			gptid/5575c4d7-b72f-11e7-96c3-1866da308b0d.eli  ONLINE	   0	 0	 0
			gptid/c7c9cba9-cd27-11e7-b158-1866da308b0d.eli  ONLINE	   0	 0	 0
			gptid/56f73609-b72f-11e7-96c3-1866da308b0d.eli  ONLINE	   0	 0	 0
			gptid/ee2f1174-1e6f-11e8-84e7-1866da308b0d.eli  ONLINE	   0	 0	 0
			gptid/5881e2ae-b72f-11e7-96c3-1866da308b0d.eli  ONLINE	   0	 0	 0
			gptid/593ba0f5-b72f-11e7-96c3-1866da308b0d.eli  ONLINE	   0	 0	 0
			gptid/59f5e3ca-b72f-11e7-96c3-1866da308b0d.eli  ONLINE	   0	 0	 0
			gptid/5aaf3ccd-b72f-11e7-96c3-1866da308b0d.eli  ONLINE	   0	 0	 0
			gptid/5c015dd6-b72f-11e7-96c3-1866da308b0d.eli  FAULTED	  0   601	 0  too many errors
			gptid/5cc902e2-b72f-11e7-96c3-1866da308b0d.eli  ONLINE	   0	 0	 0

errors: No known data errors

A scrub ran last night and finished after about 5 hours (midnight to 5 a.m.). At 11 a.m. these errors appeared in /var/log/messages:

Code:

May 20 11:09:02 freenas		 (da4:mps0:0:6:0): WRITE(10). CDB: 2a 00 eb 15 9b a8 00 00 20 00 length 16384 SMID 347 terminated ioc 804b loginfo 31110630 scsi 0 state c xfer 0
May 20 11:09:02 freenas (da4:mps0:0:6:0): WRITE(10). CDB: 2a 00 eb 15 9b a8 00 00 20 00
May 20 11:09:02 freenas (da4:mps0:0:6:0): CAM status: CCB request completed with an error
May 20 11:09:02 freenas (da4:mps0:0:6:0): Retrying command
May 20 11:09:02 freenas (da4:mps0:0:6:0): WRITE(10). CDB: 2a 00 eb 15 9b a8 00 00 20 00
May 20 11:09:02 freenas (da4:mps0:0:6:0): CAM status: SCSI Status Error
May 20 11:09:02 freenas (da4:mps0:0:6:0): SCSI status: Check Condition
May 20 11:09:02 freenas (da4:mps0:0:6:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
May 20 11:09:02 freenas (da4:mps0:0:6:0): Retrying command (per sense data)
May 20 11:09:02 freenas (da4:mps0:0:6:0): WRITE(10). CDB: 2a 00 eb 15 9b a8 00 00 20 00
May 20 11:09:02 freenas (da4:mps0:0:6:0): CAM status: SCSI Status Error
May 20 11:09:02 freenas (da4:mps0:0:6:0): SCSI status: Check Condition
May 20 11:09:02 freenas (da4:mps0:0:6:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
May 20 11:09:02 freenas (da4:mps0:0:6:0): Retrying command (per sense data)
May 20 11:09:02 freenas (da4:mps0:0:6:0): WRITE(10). CDB: 2a 00 eb 15 9b a8 00 00 20 00
May 20 11:09:02 freenas (da4:mps0:0:6:0): CAM status: SCSI Status Error
May 20 11:09:02 freenas (da4:mps0:0:6:0): SCSI status: Check Condition
May 20 11:09:02 freenas (da4:mps0:0:6:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
May 20 11:09:02 freenas (da4:mps0:0:6:0): Retrying command (per sense data)
May 20 11:09:02 freenas (da4:mps0:0:6:0): WRITE(10). CDB: 2a 00 eb 15 9b a8 00 00 20 00
May 20 11:09:02 freenas (da4:mps0:0:6:0): CAM status: SCSI Status Error
May 20 11:09:02 freenas (da4:mps0:0:6:0): SCSI status: Check Condition
May 20 11:09:02 freenas (da4:mps0:0:6:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
May 20 11:09:02 freenas (da4:mps0:0:6:0): Error 5, Retries exhausted
May 20 11:09:02 freenas (da4:mps0:0:6:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
May 20 11:09:02 freenas (da4:mps0:0:6:0): Retrying command (per sense data)
May 20 11:09:04 freenas GEOM_ELI: g_eli_read_done() failed (error=5) gptid/5c015dd6-b72f-11e7-96c3-1866da308b0d.eli[READ(offset=0, length=4096)]
###This came up a lot of times... ####
May 20 11:09:41 freenas mps0: mpssas_prepare_remove: Sending reset for target ID 6
May 20 11:09:41 freenas da4 at mps0 bus 0 scbus0 target 6 lun 0
May 20 11:09:41 freenas da4: <ATA WDC WD40EFRX-68N 0A82> s/n WD-WCC7K3PD4PNF detached
May 20 11:09:41 freenas mps0: Unfreezing devq for target ID 6
May 20 11:09:41 freenas GEOM_MIRROR: Device swap3: provider da4p1 disconnected.
May 20 11:09:41 freenas GEOM_ELI: Device gptid/5c015dd6-b72f-11e7-96c3-1866da308b0d.eli destroyed.
May 20 11:09:41 freenas (da4:mps0:0:6:0): Periph destroyed
May 20 11:09:43 freenas mps0: SAS Address for SATA device = 4f656150bab0d586
May 20 11:09:43 freenas mps0: SAS Address from SATA device = 4f656150bab0d586
May 20 11:09:43 freenas da4 at mps0 bus 0 scbus0 target 6 lun 0
May 20 11:09:43 freenas da4: <ATA WDC WD40EFRX-68N 0A82> Fixed Direct Access SPC-4 SCSI device
May 20 11:09:43 freenas da4: Serial Number WD-WCC7K3PD4PNF
May 20 11:09:43 freenas da4: 600.000MB/s transfers
May 20 11:09:43 freenas da4: Command Queueing enabled
May 20 11:09:43 freenas da4: 3815447MB (7814037168 512 byte sectors)
May 20 11:09:43 freenas da4: quirks=0x8<4K>
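
For readers decoding the log: the CDB bytes in the WRITE(10) lines can be unpacked by hand. A minimal Python sketch, assuming the standard 10-byte WRITE(10) layout (opcode 0x2a, big-endian LBA in bytes 2-5, transfer length in bytes 7-8) and the 512-byte sectors that da4 reports above. `decode_write10` is a hypothetical helper name, not a FreeBSD tool.

```python
def decode_write10(cdb_hex: str, sector_size: int = 512) -> dict:
    """Decode a WRITE(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a 10-byte WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in blocks
    return {"lba": lba, "blocks": blocks, "bytes": blocks * sector_size}

# CDB copied from the log lines above.
info = decode_write10("2a 00 eb 15 9b a8 00 00 20 00")
print(info)
```

The decoded size (32 blocks of 512 bytes = 16384 bytes) matches the `length 16384` in the first log line, so every retry is the same 16 KiB write to the same LBA.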

SMARTCTL looks good for da4:

Code:

root@freenas:~ # smartctl -a /dev/da4
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Western Digital Red
Device Model:	 WDC WD40EFRX-68N32N0
Serial Number:	WD-WCCXXXXXXXXXXX
LU WWN Device Id: 5 0014ee 20f14a93b
Firmware Version: 82.00A82
User Capacity:	4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	5400 rpm
Form Factor:	  3.5 inches
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Sun May 20 23:30:37 2018 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
										was never started.
										Auto Offline Data Collection: Disabled.
Self-test execution status:	  (   0) The previous self-test routine completed
										without error or no self-test has ever
										been run.
Total time to complete Offline
data collection:				(44400) seconds.
Offline data collection
capabilities:					(0x7b) SMART execute Offline immediate.
										Auto Offline data collection on/off support.
										Suspend Offline collection upon new
										command.
										Offline surface scan supported.
										Self-test supported.
										Conveyance Self-test supported.
										Selective Self-test supported.
SMART capabilities:			(0x0003) Saves SMART data before entering
										power-saving mode.
										Supports SMART auto save timer.
Error logging capability:		(0x01) Error logging supported.
										General Purpose Logging supported.
Short self-test routine
recommended polling time:		(   2) minutes.
Extended self-test routine
recommended polling time:		( 471) minutes.
Conveyance self-test routine
recommended polling time:		(   5) minutes.
SCT capabilities:			  (0x303d) SCT Status supported.
										SCT Error Recovery Control supported.
										SCT Feature Control supported.
										SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x002f   200   200   051	Pre-fail  Always	   -	   0
  3 Spin_Up_Time			0x0027   161   161   021	Pre-fail  Always	   -	   6950
  4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   16
  5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x002e   100   253   000	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0032   095   095   000	Old_age   Always	   -	   3699
 10 Spin_Retry_Count		0x0032   100   253   000	Old_age   Always	   -	   0
 11 Calibration_Retry_Count 0x0032   100   253   000	Old_age   Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   16
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   14
193 Load_Cycle_Count		0x0032   200   200   000	Old_age   Always	   -	   1677
194 Temperature_Celsius	 0x0022   121   117   000	Old_age   Always	   -	   29
196 Reallocated_Event_Count 0x0032   200   200   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0030   100   253   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x0032   200   200   000	Old_age   Always	   -	   0
200 Multi_Zone_Error_Rate   0x0008   200   200   000	Old_age   Offline	  -	   0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline	   Completed without error	   00%	  3586		 -
# 2  Short offline	   Completed without error	   00%	  3466		 -
# 3  Short offline	   Completed without error	   00%	  3346		 -
# 4  Extended offline	Completed without error	   00%	  3238		 -
# 5  Short offline	   Completed without error	   00%	  3226		 -
# 6  Short offline	   Completed without error	   00%	  3106		 -
# 7  Short offline	   Completed without error	   00%	  2986		 -
# 8  Short offline	   Completed without error	   00%	  2866		 -
# 9  Short offline	   Completed without error	   00%	  2747		 -
#10  Short offline	   Completed without error	   00%	  2627		 -
#11  Extended offline	Completed without error	   00%	  2518		 -
#12  Short offline	   Completed without error	   00%	  2507		 -
#13  Short offline	   Completed without error	   00%	  2483		 -
#14  Short offline	   Completed without error	   00%	  2363		 -
#15  Short offline	   Completed without error	   00%	  2244		 -
#16  Short offline	   Completed without error	   00%	  2114		 -
#17  Extended offline	Completed without error	   00%	  2005		 -
#18  Short offline	   Completed without error	   00%	  1994		 -
#19  Short offline	   Completed without error	   00%	  1922		 -
#20  Short offline	   Completed without error	   00%	  1809		 -
#21  Short offline	   Completed without error	   00%	  1689		 -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
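
For readers who want to script the "looks good" check: a minimal Python sketch, assuming the column layout of the `smartctl -a` attribute table shown above. The watch list (IDs 5, 197, 198, 199) and the helper name `nonzero_watched` are assumptions for illustration, not something from this thread. It only flags nonzero raw values; a clean table does not rule out a drive, cable, or controller problem.

```python
WATCH_IDS = {5, 197, 198, 199}  # assumed watch list, not from the thread

def nonzero_watched(attr_lines):
    """Return {attribute_name: raw_value} for watched attributes with raw != 0."""
    flagged = {}
    for line in attr_lines:
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[9]
        if attr_id in WATCH_IDS and raw.isdigit() and int(raw) != 0:
            flagged[name] = int(raw)
    return flagged

# Rows copied from the smartctl output above (whitespace normalized).
sample = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
""".splitlines()

flagged = nonzero_watched(sample)
print(flagged or "no watched attribute has a nonzero raw value")
```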

So basically my question is:
How to proceed?
Should I restart the server?
Offline the disk and replace it?
Power off and check the cable?
How can this happen? :O

I haven't touched the server for days, I literally did not touch the server :D

Thanks for any quick help :)

kdragon75 · May 20, 2018

If ZFS is reporting write errors, I believe it's because the disk wrote bad data. Odds of a cable going bad and ONLY causing write errors seem kinda slim. I would offline the drive and replace it.

IceBoosteR said:
How can this happen? :O

Drives just fail. They don't ask permission first and they don't ask for forgiveness later.

kdragon75 · May 20, 2018

On the SMART tests, they don't test writes, just everything else non-destructive.

IceBoosteR · May 20, 2018

kdragon75 said:
If ZFS is reporting write errors, I believe it's because the disk wrote bad data. Odds of a cable going bad and ONLY causing write errors seem kinda slim. I would offline the drive and replace it.

Drives just fail. They don't ask permission first and they don't ask for forgiveness later.

Thank you for your answer.
Yes, I had two dying disks before, but they were failing slowly, with SMART reporting the errors, not ZFS.
In this case ZFS wants to access the disk, but the disk said no.
I hope the controller and cable are OK.

IceBoosteR · May 20, 2018

kdragon75 said:
On the SMART tests, they don't test writes, just everything else non-destructive.

So in fact, replacing the drive is the best idea now?
If the drive is the cause, I will never buy REDs again. Hello Seagate Ironwolf ...
Just need to check if I can RMA the drive without SMART errors.

kdragon75 · May 20, 2018

You could try swapping drives around. It COULD be a controller but that's much less common. It's not too surprising that ZFS would catch something before SMART.

IceBoosteR · May 20, 2018

kdragon75 said:
You could try swapping drives around. It COULD be a controller but that's much less common. It's not too surprising that ZFS would catch something before SMART.

Yes I could, but that's probably not the best way to determine the root cause. I think the best way is to swap the drive (I always have one spare available) and see whether any other errors are reported on the system. If not, then the disk was faulty; if yes, then baaaaaaaaad. Very baaaaaad :D

I can try to run a badblocks test with the drive on another system. Maybe I will see more and have a reason for an RMA.

Stux · May 20, 2018

Cause is either

1) cable
2) port
3) disk
4) Power

The right diagnostic approach is to swap data cables between the ‘faulty’ drive and another.

If problem stays with drive it was the drive. Time to replace drive

If problem moves it was the port or cable, then you replace the cable because cables are cheap. If it’s the port then you’re sol, and either RMA the board, or ignore the port going forward.

And psu issues are the hardest to diagnose.
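
Stux's swap test can be written down as a small decision table. A sketch in Python with a hypothetical function name; it only restates the reasoning in the post above. The "none" branch is an added assumption for the case where no errors reappear, and the table cannot tell a bad port from a bad cable or detect power problems.

```python
def interpret_swap(errors_now_on: str) -> str:
    """Interpret where errors show up after swapping data cables
    between the suspect drive and a known-good drive."""
    verdicts = {
        # Errors stay with the same physical drive.
        "suspect_drive": "drive: replace the drive",
        # Errors move to the drive now on the old cable/port.
        "other_drive": "port or cable: replace the cable first, it is cheap",
        # Assumption (not from the thread): nothing reappears yet.
        "none": "inconclusive: keep watching (could be intermittent, e.g. power)",
    }
    if errors_now_on not in verdicts:
        raise ValueError(f"unknown observation: {errors_now_on!r}")
    return verdicts[errors_now_on]

print(interpret_swap("other_drive"))
```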

IceBoosteR · May 20, 2018

Stux said:
Cause is either

1) cable
2) port
3) disk
4) Power

The right diagnostic approach is to swap data cables between the ‘faulty’ drive and another.

If problem stays with drive it was the drive. Time to replace drive

If problem moves it was the port or cable, then you replace the cable because cables are cheap.

Good point.
But now I have offlined the disk. How can I put the disk back into the array?
After that I can try swapping cables, I guess.
Please advise how to proceed now, Sir ;)

Stux · May 20, 2018

There should be a GUI button to online the disk I believe.

IceBoosteR · May 20, 2018

Stux said:
There should be a GUI button to online the disk I believe.

No, checked that. Only "replace" is available.

Stux · May 20, 2018

You can online in the command line.

I think I may have reported a bug about that.

IceBoosteR · May 20, 2018

Stux said:
You can online in the command line.

I think I may have reported a bug about that.

I will check that.
On the other hand, I can replace da4 with da4. But is that a good idea? I think it's not.

IceBoosteR · May 20, 2018

Found this in the Oracle docs.
Should I go for it?
The question is: do I need to type

zpool online tank da4

or

zpool online tank /dev/da4

?

Edit: I think this command would not succeed.
See the command in this thread:
https://forums.freenas.org/index.ph...xpected-online-button-but-there-is-none.8847/
It's using the disk's GPTID.
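
To illustrate the point about GPTIDs: a minimal Python sketch, assuming the `zpool status` layout posted earlier in this thread. It picks out the gptid device that is not ONLINE and prints the matching `zpool online` command with the pool name and the gptid label rather than da4. `faulted_devices` is a hypothetical helper; the script only builds the command string and does not run it.

```python
# Abridged copy of the zpool status output from the first post.
STATUS = """\
  pool: RED
 state: DEGRADED
config:

    NAME                                              STATE     READ WRITE CKSUM
    RED                                               DEGRADED     0     0     0
      raidz2-0                                        DEGRADED     0     0     0
        gptid/5aaf3ccd-b72f-11e7-96c3-1866da308b0d.eli  ONLINE     0     0     0
        gptid/5c015dd6-b72f-11e7-96c3-1866da308b0d.eli  FAULTED    0   601     0  too many errors
        gptid/5cc902e2-b72f-11e7-96c3-1866da308b0d.eli  ONLINE     0     0     0
"""

def faulted_devices(status_text):
    """Return (pool, [labels]) for gptid devices whose state is not ONLINE."""
    pool, devices = None, []
    for line in status_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "pool:":
            pool = fields[1]
        elif len(fields) >= 2 and fields[0].startswith("gptid/") and fields[1] != "ONLINE":
            devices.append(fields[0])
    return pool, devices

pool, devices = faulted_devices(STATUS)
for dev in devices:
    print(f"zpool online {pool} {dev}")
```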

IceBoosteR · May 20, 2018

Stux said:
You can online in the command line.

I think I may have reported a bug about that.

Not working ...

Code:

root@freenas:/var/log # zpool online RED gptid/5c015dd6-b72f-11e7-96c3-1866da308b0d.eli
warning: device 'gptid/5c015dd6-b72f-11e7-96c3-1866da308b0d.eli' onlined, but remains in faulted state
use 'zpool replace' to replace devices that are no longer present
root@freenas:/var/log # zpool status
  pool: RED
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
		the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: scrub repaired 0 in 0 days 05:10:32 with 0 errors on Sun May 20 05:10:58 2018
config:

		NAME												STATE	 READ WRITE CKSUM
		RED												 DEGRADED	 0	 0	 0
		  raidz2-0										  DEGRADED	 0	 0	 0
			gptid/5575c4d7-b72f-11e7-96c3-1866da308b0d.eli  ONLINE	   0	 0	 0
			gptid/c7c9cba9-cd27-11e7-b158-1866da308b0d.eli  ONLINE	   0	 0	 0
			gptid/56f73609-b72f-11e7-96c3-1866da308b0d.eli  ONLINE	   0	 0	 0
			gptid/ee2f1174-1e6f-11e8-84e7-1866da308b0d.eli  ONLINE	   0	 0	 0
			gptid/5881e2ae-b72f-11e7-96c3-1866da308b0d.eli  ONLINE	   0	 0	 0
			gptid/593ba0f5-b72f-11e7-96c3-1866da308b0d.eli  ONLINE	   0	 0	 0
			gptid/59f5e3ca-b72f-11e7-96c3-1866da308b0d.eli  ONLINE	   0	 0	 0
			gptid/5aaf3ccd-b72f-11e7-96c3-1866da308b0d.eli  ONLINE	   0	 0	 0
			5217968670663961470							 UNAVAIL	  0   601	 0  was /dev/gptid/5c015dd6-b72f-11e7-96c3-1866da308b0d.eli
			gptid/5cc902e2-b72f-11e7-96c3-1866da308b0d.eli  ONLINE	   0	 0	 0

errors: No known data errors

IceBoosteR · May 20, 2018

Rebooted the system, disk is still unavailable.
I will replace it now.

kdragon75 · May 20, 2018

Yeah ZFS is not liking your drive!

IceBoosteR · May 20, 2018

kdragon75 said:
Yeah ZFS is not liking your drive!

Meeeeh xD

MrToddsFriends · May 20, 2018

While I can't help with your current zpool online / replace problem, did you ever consider an HBA heat problem as the cause of your issue ("CAM status: CCB request completed with an error") as reported in your initial posting?

See also the very informative discussion in
https://forums.freenas.org/index.php?threads/cooling-for-a-dell-perc-h310.61941/

This is just an addition to Stux' cable, port, disk, Power listing posted above.

IceBoosteR · May 20, 2018

MrToddsFriends said:
While I can't help with your current zpool online / replace problem, did you ever consider an HBA heat problem as the cause of your issue ("CAM status: CCB request completed with an error") as reported in your initial posting?

See also the very informative discussion in
https://forums.freenas.org/index.php?threads/cooling-for-a-dell-perc-h310.61941/

This is just an addition to Stux' cable, port, disk, Power listing posted above.

Hi,
no, I had never thought about that one. I will add it to my list of options. The CCB error was the very first one; after that I only see the "Not Ready" error. I have 4 intake fans, which should keep the system cool. However, they are only running at 40% because of the noise they produce. Once my home project is finished, I will move the server and can run them at 75% if I like.
~added to list of things to check.

Now I will start with the resilvering.
With my test system (no ECC memory, AMD APU, but it should do what I need) I will run badblocks and another SMART test.
And I will keep an eye on my main system, of course.
