HELP needed -> SCSI sense: NOT READY -> pool lost

IceBoosteR

Guru
Joined
Sep 27, 2016
Messages
503
Jesus, do I have bad luck?
On the spare drive, the Reallocated_Event_Count is growing while it is resilvering.
Why is Murphy always right?

Edit:
Everything is fine. Due to a lack of coffee and being awake for too long, I mixed up the drive's temperature and the Reallocated_Event_Count. Sorry xD
And I was wondering why the count was increasing, lul. IceBoosteR needs sleep :O
 

IceBoosteR

Guru
Joined
Sep 27, 2016
Messages
503
For your information: the pool is back in the Online state. Resilvering finished without any issues. If the cable had been bad, I am sure I would have seen another SCSI error, because I have restored 1.5 TB to that disk now.
If the controller were the problem, why did only one disk drop out of the array? I have checked the log. At the time the system was showing these errors, each of the other disks had written about 1 GB within those 5 minutes. I would assume that the other disks would show similar problems too.
Nevertheless, thank you guys for your help so far!
Much appreciated :)


Badblocks is now at 50% and all is fine so far.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Never buying a drive brand / model again, just because you have a failed drive is like saying that you will never buy a car again because of a flat tire.

 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
Never buying a drive brand / model again, just because you have a failed drive is like saying that you will never buy a car again because of a flat tire.

More like saying you will never buy that brand of tire again. He's not swearing off the machine!
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Never buying Good-years again... I had this car once, with a set of four, and one failed.

Good-year is dead to me now ;)
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Never buying Good-years again... I had this car once, with a set of four, and one failed.

Good-year is dead to me now ;)
LOL, exactly. I had a set of Firestone tires... Never again. ;)
 

IceBoosteR

Guru
Joined
Sep 27, 2016
Messages
503
Never buying a drive brand / model again, just because you have a failed drive is like saying that you will never buy a car again because of a flat tire.

There is some more background: in the last 6 months I have had to replace two WD Reds and RMA them. I would say it is quite likely that one or two will fail (that's why we use parity), but not three in half a year...

Currently badblocks has written without errors; it is now reading and comparing.

Can we discuss further how the heck that error could happen? It's driving me crazy.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
There is some more background: in the last 6 months I have had to replace two WD Reds and RMA them. I would say it is quite likely that one or two will fail (that's why we use parity), but not three in half a year...
I have a server at work that had 2 drive failures in the first month, another within the first six and another within the first 9. Four drives inside the first 9 months and these are 6TB Red Pro drives. The first six months is the time that a drive is most likely to fail. The more drives you have, the more often you see it. After the first 9 months, that server has run another 9 months without any faults at all. It just takes some time to get the wrinkles ironed out. If a drive survives the first year, it will probably last to between 4.5 and 6 years.
At least, according to the statistics I have read.
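To put the "the more drives you have, the more often you see it" point into numbers, here is a rough sketch using a simple binomial model. It assumes drives fail independently and that each one has the same early-life failure probability `p`; the 3% used below is purely a hypothetical placeholder, not a measured figure for any drive model, and the drive counts are made up as well.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """Probability of at least k failures among n independent drives,
    each failing with probability p during the period (binomial model)."""
    # 1 minus the probability of seeing fewer than k failures.
    return 1.0 - sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))


if __name__ == "__main__":
    p = 0.03  # hypothetical per-drive failure probability in the first months
    for n in (4, 8, 24, 60):  # hypothetical pool sizes
        print(
            f"{n:3d} drives: P(>=1 failure) = {prob_at_least(1, n, p):.1%}, "
            f"P(>=3 failures) = {prob_at_least(3, n, p):.1%}"
        )
```

Real drives from the same batch are probably not independent, so treat this only as an illustration of why larger pools see early failures more often.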
 

IceBoosteR

Guru
Joined
Sep 27, 2016
Messages
503
Thanks for the information Chris.

Badblocks still looks good for that drive. I have looked into the logs again: there are only two seconds of these specific errors before ZFS marked the drive as faulty.
On the other hand, the replacement drive has read errors. I will RMA that drive immediately. If badblocks and a long SMART test look good for the original drive, I can safely use it again, right?

FYI, this is the new, now bad, drive:
Code:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       40
  3 Spin_Up_Time            0x0027   100   253   021    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       5
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       30
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       5
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       4
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       36
194 Temperature_Celsius     0x0022   123   103   000    Old_age   Always       -       29
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
so, you have no errors in the smart report for the original faulty drive right?

So, it *could* be the HBA had a hiccup...
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
so, you have no errors in the smart report for the original faulty drive right?

So, it *could* be the HBA had a hiccup...
Or, what I had once, a bad cable that was causing errors only on the one drive.

 

IceBoosteR

Guru
Joined
Sep 27, 2016
Messages
503
so, you have no errors in the smart report for the original faulty drive right?

So, it *could* be the HBA had a hiccup...
Yes.
So to be clear:
The original disk has now been running badblocks for 40 h. It is not finished yet, but there has not been a single error in SMART or during the badblocks test. It looks like the drive is good.
I will keep testing until Friday, I think.
The drive I used to replace it has bad SMART values, but ZFS has not complained about write errors at all:
Code:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       40
  3 Spin_Up_Time            0x0027   100   253   021    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       5
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       40
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       5
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       4
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       56
194 Temperature_Celsius     0x0022   123   103   000    Old_age   Always       -       29
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       6

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%        30         111995824
# 2  Short offline       Completed without error       00%         8         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):



I have requested an RMA for that one. As you can see, it is really new (a recertified product from a previous RMA).

Maybe you are right @Stux that the HBA had a hiccup. Let's hope that it was only a hiccup and nothing more.
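For anyone who wants to check a report like the one above without eyeballing it, here is a minimal Python sketch that pulls the raw values out of a saved `smartctl` attribute table and lists failed self-tests. The set of "watched" attributes is my own assumption about which counters are worth a look; raw values are vendor-specific, so a non-zero number is a hint, not a verdict. The function names are made up for this example.

```python
import re

# One attribute row: ID, name, flag, value, worst, thresh, type, updated,
# when_failed, raw value (only the leading integer of the raw value is kept).
ATTR_RE = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)"
)

# Assumption: attributes whose non-zero raw value deserves a closer look.
WATCHED = {
    "Raw_Read_Error_Rate",
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
    "Multi_Zone_Error_Rate",
}


def parse_attributes(text):
    """Return {attribute_name: raw_value} for every attribute row found."""
    attrs = {}
    for line in text.splitlines():
        m = ATTR_RE.match(line)
        if m:
            attrs[m.group(2)] = int(m.group(6))
    return attrs


def nonzero_watched(attrs):
    """Keep only watched attributes with a raw value above zero."""
    return {name: raw for name, raw in attrs.items() if name in WATCHED and raw > 0}


def failed_self_tests(text):
    """Return self-test log rows (they start with '#') that mention a failure."""
    return [
        line.strip()
        for line in text.splitlines()
        if line.lstrip().startswith("#") and "failure" in line
    ]


# A few rows copied from the report in this post.
SAMPLE = """\
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       40
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       6
# 1  Extended offline    Completed: read failure       90%        30         111995824
# 2  Short offline       Completed without error       00%         8         -
"""

if __name__ == "__main__":
    print(nonzero_watched(parse_attributes(SAMPLE)))
    print(failed_self_tests(SAMPLE))
```

The same functions should work on the full text of a saved report, as long as the column layout matches the table above.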
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Actually there's an error, look at the last extended test line on the log: read failure at LBA 111995824

IceBoosteR

Guru
Joined
Sep 27, 2016
Messages
503
Actually there's an error, look at the last extended test line on the log: read failure at LBA 111995824
That's the other drive. The one with the SMART errors and the failure in the extended SMART test will be on its way back to WD in a few days.
The original one, which was dropped from the array, is healthy: no errors in badblocks (2 complete write and 2 read passes, 60+ h) and no error in the extended SMART test either. So the drive seems to be good. I will put it back in at the weekend, I guess.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Ok. So it seems like it's good.
 

IceBoosteR

Guru
Joined
Sep 27, 2016
Messages
503
Ok, I have now put the original drive back in and tried to REPLACE the faulty drive with it.
Surprise:
Code:
(da4:mps0:0:6:0): READ(6). CDB: 08 00 00 00 01 00
(da4:mps0:0:6:0): CAM status: SCSI Status Error
(da4:mps0:0:6:0): SCSI status: Check Condition
(da4:mps0:0:6:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da4:mps0:0:6:0): Error 5, Retries exhausted
mps0: mpssas_prepare_remove: Sending reset for target ID 6
da4 at mps0 bus 0 scbus0 target 6 lun 0
da4: <ATA WDC WD40EFRX-68N 0A82> s/n WD-WCCxxxxxx detached
mps0: Unfreezing devq for target ID 6
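If you want to count how often and on which device this happens, a small Python sketch like the following can pull the sense lines out of a saved copy of the console log. The regular expression is written against the lines shown above only, and reading the `asc:` pair as hexadecimal is an assumption on my part; the function name is made up for this example.

```python
import re

# Matches lines such as:
# (da4:mps0:0:6:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, ...)
SENSE_RE = re.compile(
    r"^\((?P<dev>\w+):(?P<ctrl>\w+):(?P<bus>\d+):(?P<target>\d+):(?P<lun>\d+)\): "
    r"SCSI sense: (?P<key>[A-Z ]+?) asc:(?P<asc>[0-9a-f]+),(?P<ascq>[0-9a-f]+) "
    r"\((?P<text>.*)\)$"
)


def parse_sense_lines(log):
    """Return one dict per 'SCSI sense' line found in the log text."""
    events = []
    for line in log.splitlines():
        m = SENSE_RE.match(line.strip())
        if m is None:
            continue
        event = m.groupdict()
        # Assumption: the asc/ascq pair is printed in hexadecimal.
        event["asc"] = int(event["asc"], 16)
        event["ascq"] = int(event["ascq"], 16)
        events.append(event)
    return events


# Lines copied from the log in this post.
SAMPLE_LOG = """\
(da4:mps0:0:6:0): READ(6). CDB: 08 00 00 00 01 00
(da4:mps0:0:6:0): CAM status: SCSI Status Error
(da4:mps0:0:6:0): SCSI status: Check Condition
(da4:mps0:0:6:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da4:mps0:0:6:0): Error 5, Retries exhausted
"""

if __name__ == "__main__":
    for ev in parse_sense_lines(SAMPLE_LOG):
        print(ev["dev"], ev["ctrl"], "target", ev["target"], ev["key"], ev["text"])
```

Grouping the results by `dev` and `target` over a longer log would show whether the errors follow the drive or stay with the port.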

Message in GUI:
Code:
	raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 61] Connection refused>
May 26 19:52:31 freenas uwsgi: [sentry.errors.uncaught:670] ['MiddlewareError: [MiddlewareError: Unable to GPT format the disk "da4": gpart: provider: Operation not supported by device\n]', '  File "django/core/handlers/exception.py", line 42, in inner', '  File "django/core/handlers/base.py", line 249, in _legacy_get_response', '  File "django/core/handlers/base.py", line 178, in _get_response', '  File "freenasUI/freeadmin/middleware.py", line 162, in process_view', '  File "django/contrib/auth/decorators.py", line 23, in _wrapped_view', '  File "freenasUI/storage/views.py", line 782, in zpool_disk_replace', '  File "freenasUI/storage/forms.py", line 2254, in done', '  File "freenasUI/middleware/notifier.py", line 987, in zfs_replace_disk', '  File "freenasUI/middleware/notifier.py", line 359, in __gpt_labeldisk']



So the faulty replacement drive worked fine with this controller and cable (only the HDD itself was bad...), which suggests it may be the original drive after all? A problem with the controller on the HDD itself???
Why could I run badblocks on another computer without any issue?
What else is left that could cause this?
Should I try swapping drives? At this point I guess that would be the worst idea ever.

Need help ....

Update: I have now blown the dust off the cable and the port of that particular HDD; maybe there was some dust on the contacts. I have now started the resilvering. If that process fails, I will wipe the disk, attach it to an onboard SATA port and try again. If that works, it's the SAS cable; if it doesn't, it's the drive.
At this point I need to wait, ZFS is now doing stuff ;)
 

IceBoosteR

Guru
Joined
Sep 27, 2016
Messages
503
It is interesting, by the way, that the resilvering is incredibly slow. I have seen a resilver finish in 6 h, at 100+ MB/s write speed on that drive. This time I am not getting more than 30-45 MB/s, so it is taking longer. I am not sure why it is slower this time. The disk is at 95% IOPS. But no issues so far...
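As a rough feel for what the lower write speed means, here is a back-of-the-envelope calculation in Python. It assumes roughly 1.5 TB has to be written again (the amount restored to the disk last time), a constant average rate, and decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes); a real resilver does not run at a constant rate, so this is only an estimate.

```python
def resilver_hours(data_tb: float, rate_mb_s: float) -> float:
    """Hours needed to write data_tb terabytes at a constant rate_mb_s MB/s."""
    total_bytes = data_tb * 1e12
    seconds = total_bytes / (rate_mb_s * 1e6)
    return seconds / 3600


if __name__ == "__main__":
    # 1.5 TB is an assumption based on the earlier resilver in this thread.
    for rate in (100, 45, 30):
        print(f"{rate:3d} MB/s -> {resilver_hours(1.5, rate):.1f} h")
```

So at 30-45 MB/s the same amount of data takes a bit more than two to three times as long as at 100 MB/s.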
 

pschatz100

Guru
Joined
Mar 30, 2014
Messages
1,184
I had a similar hard-to-diagnose problem with a drive on my machine, and it turned out to be a bad power adapter. I had used a molex to sata adapter on a couple of drives because my power supply did not have enough sata power connectors, and the adapter was bad. I discovered this by accident when I rerouted my power cables and just happened to use the adapter on a different drive. The problems moved to the drive attached to the adapter.

Intermittent power problems can be very difficult to solve.
 

IceBoosteR

Guru
Joined
Sep 27, 2016
Messages
503
I had a similar hard-to-diagnose problem with a drive on my machine, and it turned out to be a bad power adapter. I had used a molex to sata adapter on a couple of drives because my power supply did not have enough sata power connectors, and the adapter was bad. I discovered this by accident when I rerouted my power cables and just happened to use the adapter on a different drive. The problems moved to the drive attached to the adapter.

Intermittent power problems can be very difficult to solve.
Hi @pschatz100 ,
thanks for your answer. I am also using Molex to SATA cables, but only on two drives, and neither of those drives is affected in any way. The bad drive/cable/whatever is connected directly to a SATA power connector on the PSU. But maybe this connector is bad.
I will keep this in mind!
Cheers!
 