HELP needed -> SCSI sense: NOT READY -> pool lost

IceBoosteR · May 20, 2018

Jesus, do I have bad luck?
On the spare drive, the Reallocated_Event_Count is growing while it resilvers.
Why is Murphy always right?

Edit:
Everything is fine. Due to a lack of coffee and being awake for too long, I mixed up the drive's temperature and the Reallocated_Event_Count. Sorry xD
And I was wondering why the count was increasing, lol. IceBoosteR needs sleep :O

IceBoosteR · May 21, 2018

For your information: the pool is back in the Online state. Resilvering finished without any issues. If the cable were bad, I am sure I would have seen another SCSI error, because I have now restored 1.5 TB to that disk.
If the controller were the problem, why did only one disk drop out of the array? I have checked the log. At the time the system was showing these errors, all other disks had written about 1 GB each in those 5 minutes. I would expect the other disks to show similar problems as well.
Nevertheless, thank you guys for your help so far!
Much appreciated :)

Badblocks is now at 50% and all is fine so far.

Chris Moore · May 21, 2018

Never buying a drive brand / model again, just because you have a failed drive is like saying that you will never buy a car again because of a flat tire.

Sent from my SAMSUNG-SGH-I537 using Tapatalk

kdragon75 · May 21, 2018

Chris Moore said:
Never buying a drive brand / model again, just because you have a failed drive is like saying that you will never buy a car again because of a flat tire.

Sent from my SAMSUNG-SGH-I537 using Tapatalk

More like saying you will never buy that brand of tire again. He's not swearing off the machine!

Stux · May 21, 2018

Never buying Good-years again... I had this car once, with a set of four, and one failed.

Good-year is dead to me now ;)

Chris Moore · May 21, 2018

Stux said:
Never buying Good-years again... I had this car once, with a set of four, and one failed.

Good-year is dead to me now ;)

LOL, exactly. I had a set of Firestone tires... Never again. ;)

IceBoosteR · May 21, 2018

Chris Moore said:
Never buying a drive brand / model again, just because you have a failed drive is like saying that you will never buy a car again because of a flat tire.

Sent from my SAMSUNG-SGH-I537 using Tapatalk

There is some more background: in the last 6 months I have had to replace two WD Reds and RMA them. I would say it is quite likely that one or two will fail (that's why we use parity), but not three in half a year...

Currently badblocks has finished writing without errors; it is now reading and comparing.

Can we discuss further how the heck that error could happen? It's driving me crazy.

Chris Moore · May 21, 2018

IceBoosteR said:
There is some more background: in the last 6 months I have had to replace two WD Reds and RMA them. I would say it is quite likely that one or two will fail (that's why we use parity), but not three in half a year...

I have a server at work that had 2 drive failures in the first month, another within the first six and another within the first 9. Four drives inside the first 9 months and these are 6TB Red Pro drives. The first six months is the time that a drive is most likely to fail. The more drives you have, the more often you see it. After the first 9 months, that server has run another 9 months without any faults at all. It just takes some time to get the wrinkles ironed out. If a drive survives the first year, it will probably last to between 4.5 and 6 years.
At least, according to the statistics I have read.
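A rough way to see why more drives means you see early failures more often: a minimal Python sketch, assuming drives fail independently of each other. The 3% per-drive rate is a made-up number for illustration only, not a measured figure for any model.

```python
def prob_at_least_one_failure(p_single, n_drives):
    """Chance that at least one of n independent drives fails in the period."""
    return 1.0 - (1.0 - p_single) ** n_drives


# Hypothetical early-life failure chance per drive (an assumption, not real data).
P_SINGLE = 0.03

for n in (1, 4, 8, 24):
    chance = prob_at_least_one_failure(P_SINGLE, n)
    print(f"{n:2d} drives: {chance:.1%} chance of at least one failure")
```

The per-drive risk stays the same; only the chance of seeing a failure somewhere in the box grows with the drive count.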

IceBoosteR · May 21, 2018

Thanks for the information Chris.

Badblocks still looks good for that drive. I have looked into the logs again: there are only two seconds of these specific errors before ZFS marked the drive as faulty.
On the other hand, the replacement drive has read errors. I will RMA that drive immediately. If badblocks and a long SMART test look good for the original drive, I can safely use it again, right?

FYI, here is the new, now bad drive:

Code:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x002f   200   200   051	Pre-fail  Always	   -	   40
  3 Spin_Up_Time			0x0027   100   253   021	Pre-fail  Always	   -	   0
  4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   5
  5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x002e   100   253   000	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0032   100   100   000	Old_age   Always	   -	   30
 10 Spin_Retry_Count		0x0032   100   253   000	Old_age   Always	   -	   0
 11 Calibration_Retry_Count 0x0032   100   253   000	Old_age   Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   5
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   4
193 Load_Cycle_Count		0x0032   200   200   000	Old_age   Always	   -	   36
194 Temperature_Celsius	 0x0022   123   103   000	Old_age   Always	   -	   29
196 Reallocated_Event_Count 0x0032   200   200   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0030   100   253   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x0032   200   200   000	Old_age   Always	   -	   0
200 Multi_Zone_Error_Rate   0x0008   100   253   000	Old_age   Offline	  -	   0

Stux · May 22, 2018

so, you have no errors in the smart report for the original faulty drive right?

So, it *could* be the HBA had a hiccup...

Chris Moore · May 22, 2018

Stux said:
so, you have no errors in the smart report for the original faulty drive right?

So, it *could* be the HBA had a hiccup...

Or, what I had once, a bad cable that was causing errors only on the one drive.

Sent from my SAMSUNG-SGH-I537 using Tapatalk

IceBoosteR · May 22, 2018

Stux said:
so, you have no errors in the smart report for the original faulty drive right?

So, it *could* be the HBA had a hiccup...

Yes.
So to be clear:
The original disk has been running badblocks for 40 h now. It has not finished yet, but there is not an error in SMART or during the badblocks test. Looks like the drive is good.
I will keep testing until Friday, I think.
The drive I used to replace the one above has bad SMART values, but ZFS has not complained about write errors at all:

Code:

ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x002f   200   200   051	Pre-fail  Always	   -	   40
  3 Spin_Up_Time			0x0027   100   253   021	Pre-fail  Always	   -	   0
  4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   5
  5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x002e   100   253   000	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0032   100   100   000	Old_age   Always	   -	   40
 10 Spin_Retry_Count		0x0032   100   253   000	Old_age   Always	   -	   0
 11 Calibration_Retry_Count 0x0032   100   253   000	Old_age   Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   5
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   4
193 Load_Cycle_Count		0x0032   200   200   000	Old_age   Always	   -	   56
194 Temperature_Celsius	 0x0022   123   103   000	Old_age   Always	   -	   29
196 Reallocated_Event_Count 0x0032   200   200   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0030   100   253   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x0032   200   200   000	Old_age   Always	   -	   0
200 Multi_Zone_Error_Rate   0x0008   200   200   000	Old_age   Offline	  -	   6

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed: read failure	   90%		30		 111995824
# 2  Short offline	   Completed without error	   00%		 8		 -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):

I have requested an RMA for that one. As you can see, it is really new (a recertified product from a previous RMA).

Maybe you are right @Stux that the HBA had a hiccup. Let's hope that it was only a hiccup and nothing more.
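For anyone who wants to check smartctl output like the above by script instead of by eye, here is a minimal Python sketch. The watch list of attributes and the function names are my own assumptions, and the sample text is just a few lines copied from the report above.

```python
import re

# Attributes whose raw value I would expect to stay at 0 on a healthy disk.
# This watch list is an assumption; adjust it for your drive model.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
    "Multi_Zone_Error_Rate",
}

# ID, attribute name, hex flag, then anything, then the trailing raw value.
ATTR_RE = re.compile(r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}\s+.*\s(\d+)\s*$")

SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140  Pre-fail  Always   -   0
197 Current_Pending_Sector  0x0032   200   200   000  Old_age   Always   -   0
200 Multi_Zone_Error_Rate   0x0008   200   200   000  Old_age   Offline  -   6
# 1  Extended offline   Completed: read failure      90%   30   111995824
# 2  Short offline      Completed without error      00%    8   -
"""


def suspicious_attributes(smart_text):
    """Return {attribute_name: raw_value} for watched attributes above zero."""
    found = {}
    for line in smart_text.splitlines():
        m = ATTR_RE.match(line)
        if m and m.group(2) in WATCHED and int(m.group(3)) > 0:
            found[m.group(2)] = int(m.group(3))
    return found


def failed_self_tests(smart_text):
    """Return the self-test log lines that report a failure."""
    return [
        line.strip()
        for line in smart_text.splitlines()
        if line.lstrip().startswith("#") and "failure" in line.lower()
    ]


print(suspicious_attributes(SAMPLE))
print(failed_self_tests(SAMPLE))
```

It only looks at raw values and the self-test log, so it is a first filter, not a verdict on the drive.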

Bidule0hm · May 23, 2018

IceBoosteR said:
not an error in SMART

Actually there's an error, look at the last extended test line on the log: read failure at LBA 111995824

IceBoosteR · May 23, 2018

Bidule0hm said:
Actually there's an error, look at the last extended test line on the log: read failure at LBA 111995824

That's the other drive. The one with the SMART errors and the failure in the extended SMART test will be on its way back to WD in a few days.
The other, original one, which was dropped from the array, is healthy. No error in badblocks (2 complete reads and 2 complete writes, 60+ h) and no error in the extended SMART test either. So the drive seems to be good. I will put that drive back in at the weekend, I guess.

Bidule0hm · May 23, 2018

Ok. So it seems like it's good.

IceBoosteR · May 24, 2018

Bidule0hm said:
Ok. So it seems like it's good.

That's always good :P

IceBoosteR · May 26, 2018

OK, I have now put the original drive back in and tried to REPLACE the faulty drive with the original one.
Surprise:

Code:

(da4:mps0:0:6:0): READ(6). CDB: 08 00 00 00 01 00
(da4:mps0:0:6:0): CAM status: SCSI Status Error
(da4:mps0:0:6:0): SCSI status: Check Condition
(da4:mps0:0:6:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da4:mps0:0:6:0): Error 5, Retries exhausted
mps0: mpssas_prepare_remove: Sending reset for target ID 6
da4 at mps0 bus 0 scbus0 target 6 lun 0
da4: <ATA WDC WD40EFRX-68N 0A82> s/n WD-WCCxxxxxx detached
mps0: Unfreezing devq for target ID 6

Message in GUI:

Code:

	raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 61] Connection refused>
May 26 19:52:31 freenas uwsgi: [sentry.errors.uncaught:670] ['MiddlewareError: [MiddlewareError: Unable to GPT format the disk "da4": gpart: provider: Operation not supported by device\n]', '  File "django/core/handlers/exception.py", line 42, in inner', '  File "django/core/handlers/base.py", line 249, in _legacy_get_response', '  File "django/core/handlers/base.py", line 178, in _get_response', '  File "freenasUI/freeadmin/middleware.py", line 162, in process_view', '  File "django/contrib/auth/decorators.py", line 23, in _wrapped_view', '  File "freenasUI/storage/views.py", line 782, in zpool_disk_replace', '  File "freenasUI/storage/forms.py", line 2254, in done', '  File "freenasUI/middleware/notifier.py", line 987, in zfs_replace_disk', '  File "freenasUI/middleware/notifier.py", line 359, in __gpt_labeldisk']

So, since the faulty replacement drive worked fine with the same controller and cable (only the HDD itself was bad...), maybe it is the original drive after all? A problem with the controller on the HDD itself???
Why could I run badblocks on another computer without any issue?
What options are left that could cause this?
Should I try to swap drives around? At this point I guess that would be the worst idea ever.

Need help ....

Update: I have now blown out the cable connector and the port of that particular HDD; maybe there was some dust on the contacts. I have now started the resilvering. If that process fails, I will wipe the disk, attach it to an onboard SATA port and try again. If that works, it's the SAS cable; if it doesn't, it's the drive.
At this point I need to wait, ZFS is now doing its stuff ;)
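To check whether only one disk or several disks behind the same HBA are throwing these errors, the console log can be tallied per device. A minimal Python sketch, assuming the log lines look like the CAM messages posted above; the function name and the regular expression are my own and are not part of FreeNAS.

```python
import re
from collections import Counter

# Matches lines such as:
# (da4:mps0:0:6:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, ...)
SENSE_RE = re.compile(r"\((\w+):(\w+):\d+:\d+:\d+\): SCSI sense: (.+?) asc:")

SAMPLE_LOG = """\
(da4:mps0:0:6:0): READ(6). CDB: 08 00 00 00 01 00
(da4:mps0:0:6:0): CAM status: SCSI Status Error
(da4:mps0:0:6:0): SCSI status: Check Condition
(da4:mps0:0:6:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da4:mps0:0:6:0): Error 5, Retries exhausted
"""


def sense_errors_by_device(log_text):
    """Count SCSI sense errors per (device, controller, sense key)."""
    counts = Counter()
    for line in log_text.splitlines():
        m = SENSE_RE.search(line)
        if m:
            counts[m.groups()] += 1
    return counts


for (device, controller, sense), n in sense_errors_by_device(SAMPLE_LOG).items():
    print(f"{device} on {controller}: {n} x {sense}")
```

If every hit is the same device, that points towards the drive, its cable or its power connector rather than the whole controller, but it does not tell those three apart.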

IceBoosteR · May 27, 2018

It is interesting, by the way, that the resilvering is incredibly slow. I have seen a resilver finish in 6 h, at 100+ MB/s write speed on that drive. This time I am not getting more than 30-45 MB/s, so it is taking longer. I am not sure why it is slower this time. The disk is at 95% IOPS. But no issues so far...
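To put numbers on that: a quick Python back-of-the-envelope, assuming roughly the same 1.5 TB has to be written as in the earlier resilver and that the rate stays constant (both are assumptions; a real resilver varies in speed).

```python
def resilver_hours(data_bytes, rate_mb_per_s):
    """Hours needed to write data_bytes at a steady rate given in MB/s."""
    seconds = data_bytes / (rate_mb_per_s * 1_000_000)
    return seconds / 3600


DATA_BYTES = 1.5e12  # 1.5 TB, the amount restored to the disk earlier in this thread

for rate in (100, 45, 30):
    print(f"{rate:3d} MB/s -> about {resilver_hours(DATA_BYTES, rate):.1f} h")
```

So at a third of the write speed, the same amount of data takes roughly three times as long.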

pschatz100 · May 27, 2018

I had a similar hard-to-diagnose problem with a drive on my machine, and it turned out to be a bad power adapter. I had used a molex to sata adapter on a couple of drives because my power supply did not have enough sata power connectors, and the adapter was bad. I discovered this by accident when I rerouted my power cables and just happened to use the adapter on a different drive. The problems moved to the drive attached to the adapter.

Intermittent power problems can be very difficult to solve.

IceBoosteR · May 27, 2018

pschatz100 said:
I had a similar hard-to-diagnose problem with a drive on my machine, and it turned out to be a bad power adapter. I had used a molex to sata adapter on a couple of drives because my power supply did not have enough sata power connectors, and the adapter was bad. I discovered this by accident when I rerouted my power cables and just happened to use the adapter on a different drive. The problems moved to the drive attached to the adapter.

Intermittent power problems can be very difficult to solve.

Hi @pschatz100,
thanks for your answer. I am also using Molex to SATA cables, but only on two drives, and neither of these drives is affected in any way. The bad drive/cable/whatever is connected directly to a SATA power connector on the PSU. But maybe this connector is bad.
I will keep this in mind!
Cheers!
