HELP needed -> SCSI sense: NOT READY -> pool lost

IceBoosteR

Guru
Joined
Sep 27, 2016
Messages
503
Hi,

I have the error again:
Code:
root@freenas:~ # zpool status
  pool: RED
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
		Sufficient replicas exist for the pool to continue functioning in a
		degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
		repaired.
  scan: scrub repaired 0 in 0 days 04:12:40 with 0 errors on Sun May 27 17:12:18 2018
config:

		NAME												STATE	 READ WRITE CKSUM
		RED												 DEGRADED	 0	 0	 0
		  raidz2-0										  DEGRADED	 0	 0	 0
			gptid/5575c4d7-b72f-11e7-96c3-1866da308b0d.eli  ONLINE	   0	 0	 0
			gptid/c7c9cba9-cd27-11e7-b158-1866da308b0d.eli  ONLINE	   0	 0	 0
			gptid/56f73609-b72f-11e7-96c3-1866da308b0d.eli  ONLINE	   0	 0	 0
			gptid/ee2f1174-1e6f-11e8-84e7-1866da308b0d.eli  ONLINE	   0	 0	 0
			gptid/5881e2ae-b72f-11e7-96c3-1866da308b0d.eli  ONLINE	   0	 0	 0
			gptid/593ba0f5-b72f-11e7-96c3-1866da308b0d.eli  ONLINE	   0	 0	 0
			gptid/59f5e3ca-b72f-11e7-96c3-1866da308b0d.eli  ONLINE	   0	 0	 0
			gptid/5aaf3ccd-b72f-11e7-96c3-1866da308b0d.eli  ONLINE	   0	 0	 0
			gptid/bbff8686-6113-11e8-9223-1866da308b0d.eli  FAULTED	  0	 4	 0  too many errors
			gptid/5cc902e2-b72f-11e7-96c3-1866da308b0d.eli  ONLINE	   0	 0	 0

errors: No known data errors



And:
Code:
Jun 23 18:54:40 freenas (da4:mps0:0:6:0): READ(10). CDB: 28 00 00 40 00 80 00 00 08 00
Jun 23 18:54:40 freenas (da4:mps0:0:6:0): CAM status: SCSI Status Error
Jun 23 18:54:40 freenas (da4:mps0:0:6:0): SCSI status: Check Condition
Jun 23 18:54:40 freenas (da4:mps0:0:6:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
Jun 23 18:54:40 freenas (da4:mps0:0:6:0): Retrying command (per sense data)
Jun 23 18:54:40 freenas (da4:mps0:0:6:0): READ(10). CDB: 28 00 00 40 00 80 00 00 08 00
Jun 23 18:54:40 freenas (da4:mps0:0:6:0): CAM status: SCSI Status Error
Jun 23 18:54:40 freenas (da4:mps0:0:6:0): SCSI status: Check Condition
Jun 23 18:54:40 freenas (da4:mps0:0:6:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
Jun 23 18:54:40 freenas (da4:mps0:0:6:0): Retrying command (per sense data)
Jun 23 18:54:40 freenas (da4:mps0:0:6:0): READ(10). CDB: 28 00 00 40 00 80 00 00 08 00
Jun 23 18:54:40 freenas (da4:mps0:0:6:0): CAM status: SCSI Status Error
Jun 23 18:54:40 freenas (da4:mps0:0:6:0): SCSI status: Check Condition
Jun 23 18:54:40 freenas (da4:mps0:0:6:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
Jun 23 18:54:40 freenas (da4:mps0:0:6:0): Retrying command (per sense data)
Jun 23 18:54:40 freenas (da4:mps0:0:6:0): READ(10). CDB: 28 00 00 40 00 80 00 00 08 00
Jun 23 18:54:40 freenas (da4:mps0:0:6:0): CAM status: SCSI Status Error
Jun 23 18:54:40 freenas (da4:mps0:0:6:0): SCSI status: Check Condition
Jun 23 18:54:40 freenas (da4:mps0:0:6:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
Jun 23 18:54:40 freenas (da4:mps0:0:6:0): Retrying command (per sense data)
Jun 23 18:54:40 freenas (da4:mps0:0:6:0): READ(10). CDB: 28 00 00 40 00 80 00 00 08 00
Jun 23 18:54:40 freenas (da4:mps0:0:6:0): CAM status: SCSI Status Error
Jun 23 18:54:40 freenas (da4:mps0:0:6:0): SCSI status: Check Condition
Jun 23 18:54:40 freenas (da4:mps0:0:6:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
Jun 23 18:54:40 freenas (da4:mps0:0:6:0): Error 5, Retries exhausted
Jun 23 18:54:40 freenas GEOM_ELI: g_eli_read_done() failed (error=5) gptid/bbff8686-6113-11e8-9223-1866da308b0d.eli[READ(offset=0, length=4096)]



SMART looks alright. I may need to troubleshoot now. First, I guess I will use another SATA cable instead of the SAS/SATA breakout cable. After that, I will resilver the pool again.
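For anyone double-checking: a minimal SMART sanity check, as a sketch, assuming the faulted disk is still da4:
Code:
smartctl -H /dev/da4        # overall health self-assessment
smartctl -A /dev/da4        # attributes; watch IDs 5, 197, 198 (reallocated/pending/offline-uncorrectable)
smartctl -l error /dev/da4  # drive-internal ATA error log
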
Any other guesses?
 

IceBoosteR

Guru
Joined
Sep 27, 2016
Messages
503
New update:
Replaced the cable and moved the drive to onboard SATA. I think I have the same issues again:
Code:
Jun 24 17:51:34 freenas uwsgi: [sentry.errors.uncaught:670] ['MiddlewareError: [MiddlewareError: Failed to wipe ada1p1: dd: /dev/ada1p1: Operation not permitted\n]', '  File "django/core/handlers/exception.py", line 42, in inner', '  File "django/core/handlers/base.py", line 249, in _legacy_get_response', '  File "django/core/handlers/base.py", line 178, in _get_response', '  File "freenasUI/freeadmin/middleware.py", line 162, in process_view', '  File "django/contrib/auth/decorators.py", line 23, in _wrapped_view', '  File "freenasUI/storage/views.py", line 864, in disk_wipe', '  File "freenasUI/middleware/notifier.py", line 3610, in disk_wipe', '  File "freenasUI/middleware/notifier.py", line 3587, in _do_disk_wipe_quick']
Jun 24 17:52:49 freenas ahcich1: Timeout on slot 24 port 0
Jun 24 17:52:49 freenas ahcich1: is 00000000 cs 01000000 ss 00000000 rs 01000000 tfd 10c1 serr 00000000 cmd 0000d817
Jun 24 17:52:49 freenas ahcich1: Error while READ LOG EXT
Jun 24 17:52:49 freenas (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 80 75 c0 40 d1 01 00 01 00 00
Jun 24 17:52:49 freenas (ada1:ahcich1:0:0:0): CAM status: ATA Status Error
Jun 24 17:52:49 freenas (ada1:ahcich1:0:0:0): ATA status: 00 ()
Jun 24 17:52:49 freenas (ada1:ahcich1:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
Jun 24 17:52:49 freenas (ada1:ahcich1:0:0:0): Retrying command
Jun 24 17:53:07 freenas uwsgi: [middleware.exceptions:36] [MiddlewareError: Failed to wipe ada1p1: dd: /dev/ada1p1: Operation not permitted
]
Jun 24 17:53:07 freenas (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 80 b3 c0 40 d1 01 00 01 00 00
Jun 24 17:53:07 freenas (ada1:ahcich1:0:0:0): CAM status: ATA Status Error
Jun 24 17:53:07 freenas (ada1:ahcich1:0:0:0): ATA status: 41 (DRDY ERR), error: 10 (IDNF )
Jun 24 17:53:07 freenas (ada1:ahcich1:0:0:0): RES: 41 10 80 b3 c0 00 d1 01 00 00 00
Jun 24 17:53:07 freenas (ada1:ahcich1:0:0:0): Retrying command
Jun 24 17:53:07 freenas uwsgi: [sentry.errors:648] Sentry responded with an error: <urlopen error [Errno 61] Connection refused> (url: https://sentry.ixsystems.com/api/2/store/)
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/urllib/request.py", line 1318, in do_open
	encode_chunked=req.has_header('Transfer-encoding'))
  File "/usr/local/lib/python3.6/http/client.py", line 1239, in request
	self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/lib/python3.6/http/client.py", line 1285, in _send_request
	self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.6/http/client.py", line 1234, in endheaders
	self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.6/http/client.py", line 1026, in _send_output
	self.send(msg)
  File "/usr/local/lib/python3.6/http/client.py", line 964, in send
	self.connect()
  File "/usr/local/lib/python3.6/site-
Jun 24 17:53:07 freenas uwsgi: packages/raven/utils/http.py", line 31, in connect
	timeout=self.timeout,
  File "/usr/local/lib/python3.6/socket.py", line 724, in create_connection
	raise err
  File "/usr/local/lib/python3.6/socket.py", line 713, in create_connection
	sock.connect(sa)
ConnectionRefusedError: [Errno 61] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/raven/transport/threaded.py", line 174, in send_sync
	super(ThreadedHTTPTransport, self).send(data, headers)
  File "/usr/local/lib/python3.6/site-packages/raven/transport/http.py", line 47, in send
	ca_certs=self.ca_certs,
  File "/usr/local/lib/python3.6/site-packages/raven/utils/http.py", line 66, in urlopen
	return opener.open(url, data, timeout)
  File "/usr/local/lib/python3.6/urllib/request.py", line 526, in open
	response = self._open(req, data)
  File "/u
Jun 24 17:53:07 freenas uwsgi: sr/local/lib/python3.6/urllib/request.py", line 544, in _open
	'_open', req)
  File "/usr/local/lib/python3.6/urllib/request.py", line 504, in _call_chain
	result = func(*args)
  File "/usr/local/lib/python3.6/site-packages/raven/utils/http.py", line 46, in https_open
	return self.do_open(ValidHTTPSConnection, req)
  File "/usr/local/lib/python3.6/urllib/request.py", line 1320, in do_open
	raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 61] Connection refused>
Jun 24 17:53:07 freenas uwsgi: [sentry.errors.uncaught:670] ['MiddlewareError: [MiddlewareError: Failed to wipe ada1p1: dd: /dev/ada1p1: Operation not permitted\n]', '  File "django/core/handlers/exception.py", line 42, in inner', '  File "django/core/handlers/base.py", line 249, in _legacy_get_response', '  File "django/core/handlers/base.py", line 178, in _get_response', '  File "freenasUI/freeadmin/middleware.py", line 162, in process_view', '  File "django/contrib/auth/decorators.py", line 23, in _wrapped_view', '  File "freenasUI/storage/views.py", line 864, in disk_wipe', '  File "freenasUI/middleware/notifier.py", line 3610, in disk_wipe', '  File "freenasUI/middleware/notifier.py", line 3587, in _do_disk_wipe_quick']



Code:
Jun 24 17:56:10 freenas ahcich1: Timeout on slot 8 port 0
Jun 24 17:56:10 freenas ahcich1: is 00000000 cs 00000100 ss 00000000 rs 00000100 tfd 10c1 serr 00000000 cmd 0000c817
Jun 24 17:56:10 freenas ahcich1: Error while READ LOG EXT
Jun 24 17:56:10 freenas (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 80 3f c0 40 d1 01 00 01 00 00
Jun 24 17:56:10 freenas (ada1:ahcich1:0:0:0): CAM status: ATA Status Error
Jun 24 17:56:10 freenas (ada1:ahcich1:0:0:0): ATA status: 00 ()
Jun 24 17:56:10 freenas (ada1:ahcich1:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
Jun 24 17:56:10 freenas (ada1:ahcich1:0:0:0): Retrying command
Jun 24 17:56:34 freenas ahcich1: Timeout on slot 2 port 0
Jun 24 17:56:34 freenas ahcich1: is 00000000 cs 00000004 ss 00000000 rs 00000004 tfd 10c1 serr 00000000 cmd 0000c217
Jun 24 17:56:34 freenas ahcich1: Error while READ LOG EXT
Jun 24 17:56:34 freenas (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 80 51 c0 40 d1 01 00 01 00 00
Jun 24 17:56:34 freenas (ada1:ahcich1:0:0:0): CAM status: ATA Status Error
Jun 24 17:56:34 freenas (ada1:ahcich1:0:0:0): ATA status: 00 ()
Jun 24 17:56:34 freenas (ada1:ahcich1:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
Jun 24 17:56:34 freenas (ada1:ahcich1:0:0:0): Retrying command
Jun 24 17:56:44 freenas (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 80 63 c0 40 d1 01 00 01 00 00
Jun 24 17:56:44 freenas (ada1:ahcich1:0:0:0): CAM status: ATA Status Error
Jun 24 17:56:44 freenas (ada1:ahcich1:0:0:0): ATA status: 41 (DRDY ERR), error: 10 (IDNF )
Jun 24 17:56:44 freenas (ada1:ahcich1:0:0:0): RES: 41 10 80 63 c0 00 d1 01 00 00 00
Jun 24 17:56:44 freenas (ada1:ahcich1:0:0:0): Retrying command
Jun 24 17:56:53 freenas ahcich1: Timeout on slot 15 port 0
Jun 24 17:56:53 freenas ahcich1: is 00000000 cs 00008000 ss 00000000 rs 00008000 tfd 10c1 serr 00000000 cmd 0000cf17
Jun 24 17:56:53 freenas ahcich1: Error while READ LOG EXT
Jun 24 17:56:53 freenas (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 80 74 c0 40 d1 01 00 01 00 00
Jun 24 17:56:53 freenas (ada1:ahcich1:0:0:0): CAM status: ATA Status Error
Jun 24 17:56:53 freenas (ada1:ahcich1:0:0:0): ATA status: 00 ()
Jun 24 17:56:53 freenas (ada1:ahcich1:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
Jun 24 17:56:53 freenas (ada1:ahcich1:0:0:0): Retrying command




What now? Power cable? :O
Or the drive?
 

IceBoosteR

Guru
Joined
Sep 27, 2016
Messages
503
I have now switched back to the SAS breakout cable and changed the power cord. Let's see what happens.
If I see the errors again, it is definitely the drive; if not, it was the power cable or the PSU.
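For reference, the test plan as commands — a sketch using the pool and GPTID from my first post:
Code:
zpool clear RED gptid/bbff8686-6113-11e8-9223-1866da308b0d.eli  # mark the device repaired
zpool status RED                                                # confirm it resilvers and comes back ONLINE
tail -F /var/log/messages | grep da4                            # watch for new CAM/SCSI errors
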

Edit:
Immediately got the error again:
Code:
Jun 24 18:33:11 freenas uwsgi: [middleware.exceptions:36] [MiddlewareError: Failed to wipe da4p1: dd: /dev/da4p1: Operation not permitted
]
Jun 24 18:33:11 freenas (da4:mps0:0:6:0): WRITE(16). CDB: 8a 00 00 00 00 01 d1 c0 49 80 00 00 01 00 00 00
Jun 24 18:33:11 freenas (da4:mps0:0:6:0): CAM status: SCSI Status Error
Jun 24 18:33:11 freenas (da4:mps0:0:6:0): SCSI status: Check Condition
Jun 24 18:33:11 freenas (da4:mps0:0:6:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
Jun 24 18:33:11 freenas (da4:mps0:0:6:0): Info: 0x1d1c04980
Jun 24 18:33:11 freenas (da4:mps0:0:6:0): Error 22, Unretryable error
Jun 24 18:33:11 freenas uwsgi: [sentry.errors:648] Sentry responded with an error: <urlopen error [Errno 61] Connection refused> (url: https://sentry.ixsystems.com/api/2/store/)
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/urllib/request.py", line 1318, in do_open
	encode_chunked=req.has_header('Transfer-encoding'))
  File "/usr/local/lib/python3.6/http/client.py", line 1239, in request
	self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/lib/python3.6/http/client.py", line 1285, in _send_request
	self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.6/http/client.py", line 1234, in endheaders
	self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.6/http/client.py", line 1026, in _send_output
	self.send(msg)
  File "/usr/local/lib/python3.6/http/client.py", line 964, in send
	self.connect()
  File "/usr/local/lib/python3.6/site-
Jun 24 18:33:11 freenas uwsgi: packages/raven/utils/http.py", line 31, in connect
	timeout=self.timeout,
  File "/usr/local/lib/python3.6/socket.py", line 724, in create_connection
	raise err
  File "/usr/local/lib/python3.6/socket.py", line 713, in create_connection
	sock.connect(sa)
ConnectionRefusedError: [Errno 61] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/raven/transport/threaded.py", line 174, in send_sync
	super(ThreadedHTTPTransport, self).send(data, headers)
  File "/usr/local/lib/python3.6/site-packages/raven/transport/http.py", line 47, in send
	ca_certs=self.ca_certs,
  File "/usr/local/lib/python3.6/site-packages/raven/utils/http.py", line 66, in urlopen
	return opener.open(url, data, timeout)
  File "/usr/local/lib/python3.6/urllib/request.py", line 526, in open
	response = self._open(req, data)
  File "/u
Jun 24 18:33:11 freenas uwsgi: sr/local/lib/python3.6/urllib/request.py", line 544, in _open
	'_open', req)
  File "/usr/local/lib/python3.6/urllib/request.py", line 504, in _call_chain
	result = func(*args)
  File "/usr/local/lib/python3.6/site-packages/raven/utils/http.py", line 46, in https_open
	return self.do_open(ValidHTTPSConnection, req)
  File "/usr/local/lib/python3.6/urllib/request.py", line 1320, in do_open
	raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 61] Connection refused>
Jun 24 18:33:11 freenas uwsgi: [sentry.errors.uncaught:670] ['MiddlewareError: [MiddlewareError: Failed to wipe da4p1: dd: /dev/da4p1: Operation not permitted\n]', '  File "django/core/handlers/exception.py", line 42, in inner', '  File "django/core/handlers/base.py", line 249, in _legacy_get_response', '  File "django/core/handlers/base.py", line 178, in _get_response', '  File "freenasUI/freeadmin/middleware.py", line 162, in process_view', '  File "django/contrib/auth/decorators.py", line 23, in _wrapped_view', '  File "freenasUI/storage/views.py", line 864, in disk_wipe', '  File "freenasUI/middleware/notifier.py", line 3610, in disk_wipe', '  File "freenasUI/middleware/notifier.py", line 3587, in _do_disk_wipe_quick']



So it's the drive. What do you guys think about that?
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
So it's the drive. What do you guys think about that?
It is not impossible for a drive to fail, even if it is new. I have a server at work that had 4 drive failures in the first nine months of operation. It is another year on now and no more failures. If the drive is under warranty, I would try to get it replaced.
 

IceBoosteR

Guru
Joined
Sep 27, 2016
Messages
503
It is not impossible for a drive to fail, even if it is new. I have a server at work that had 4 drive failures in the first nine months of operation. It is another year on now and no more failures. If the drive is under warranty, I would try to get it replaced.
I am now an expert in WD RMAs /irony off :O
I have now changed the drive to a spare one. Hopefully it IS the drive and not something caused by cosmic rays :P
 

IceBoosteR

Guru
Joined
Sep 27, 2016
Messages
503
@Chris Moore you often have good ideas; I am running out of mine...
So as I have written in the previous posts - I have changed the power cable, the controller and the SAS/SATA cable. I was sure that the drive was the cause.
Now another drive (the one that was sent to me from my first RMA) has the same problems:
Code:
(da6:mps0:0:12:0): READ(10). CDB: 28 00 31 2c 12 00 00 00 10 00
(da6:mps0:0:12:0): CAM status: SCSI Status Error
(da6:mps0:0:12:0): SCSI status: Check Condition
(da6:mps0:0:12:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da6:mps0:0:12:0): Retrying command (per sense data)
(da6:mps0:0:12:0): WRITE(16). CDB: 8a 00 00 00 00 01 3b 5d 3a 00 00 00 00 08 00 00
(da6:mps0:0:12:0): CAM status: SCSI Status Error
(da6:mps0:0:12:0): SCSI status: Check Condition
(da6:mps0:0:12:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
(da6:mps0:0:12:0): Info: 0x13b5d3a00
(da6:mps0:0:12:0): Error 22, Unretryable error
GEOM_ELI: g_eli_write_done() failed (error=22) gptid/ee2f1174-1e6f-11e8-84e7-1866da308b0d.eli[WRITE(offset=2706810011648, length=4096)]
		(da6:mps0:0:12:0): WRITE(16). CDB: 8a 00 00 00 00 01 76 b6 69 00 00 00 00 08 00 00 length 4096 SMID 920 terminated ioc 804b loginfo 31110d00 scsi 0 state c xfer 0
(da6:mps0:0:12:0): WRITE(16). CDB: 8a 00 00 00 00 01 76 b6 69 00 00 00 00 08 00 00
(da6:mps0:0:12:0): CAM status: CCB request completed with an error
(da6:mps0:0:12:0): Retrying command
(da6:mps0:0:12:0): WRITE(16). CDB: 8a 00 00 00 00 01 76 b6 69 00 00 00 00 08 00 00
(da6:mps0:0:12:0): CAM status: SCSI Status Error
(da6:mps0:0:12:0): SCSI status: Check Condition
(da6:mps0:0:12:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:12:0): Retrying command (per sense data)
(da6:mps0:0:12:0): WRITE(16). CDB: 8a 00 00 00 00 01 76 b6 69 00 00 00 00 08 00 00
(da6:mps0:0:12:0): CAM status: SCSI Status Error
(da6:mps0:0:12:0): SCSI status: Check Condition



This time, the error was more specific: (Power on, reset, or bus device reset occurred)

Do you have any idea how this could all happen? At first only one disk was affected, now it's another one. It's hot here in Germany right now, but the system is at 40 degrees and the disks are at 32-33 degrees. Should be fine. I assume the controller is also cool, but it has to do continuous writes to an SSD running my VMs. Or can it be the power supply?
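(This is roughly how I read the disk temperatures from SMART — a sketch; the device glob is an assumption:)
Code:
for d in /dev/ada? /dev/da?; do
  echo -n "$d: "
  smartctl -A "$d" | awk '/Temperature_Celsius/ {print $10 " C"}'
done
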

I am a little bit depressed now about this...

Thanks for any help!
 

IceBoosteR

Guru
Joined
Sep 27, 2016
Messages
503
HELP HELP, what's going on?!
Code:
mps0: Unfreezing devq for target ID 12
(da6:mps0:0:12:0): Periph destroyed
(da8:umass-sim0:0:0:0): READ(10). CDB: 28 00 00 31 02 bb 00 01 00 00
(da8:umass-sim0:0:0:0): CAM status: SCSI Status Error
(da8:umass-sim0:0:0:0): SCSI status: Check Condition
(da8:umass-sim0:0:0:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da8:umass-sim0:0:0:0): Error 5, Unretryable error
ada1 at ahcich2 bus 0 scbus2 target 0 lun 0
ada1: <WDC WD40EFRX-68N32N0 82.00A82> s/n WD-WCC7K4HY5CAT detached
GEOM_MIRROR: Device swap0: provider ada1p1 disconnected.
ada2 at ahcich3 bus 0 scbus3 target 0 lun 0
GEOM_ELIada2: <WDC WD40EFRX-68WT0N0 82.00A82> s/n WD-WCC4E0YRX68E: Device gptid/5cc902e2-b72f-11e7-96c3-1866da308b0d.eli destroyed.
GEOM_ELI: Detached gptid/5cc902e2-b72f-11e7-96c3-1866da308b0d.eli on last close.
 detached
(ada1:ahcich2:0:0:0): Periph destroyed
GEOM_MIRROR: Request failed (error=6). ada2p1[READ(offset=53280768, length=4096)]
GEOM_ELI: g_eli_read_done() failed (error=6) mirror/swap0.eli[READ(offset=53280768, length=4096)]
swap_pager: I/O error - pagein failed; blkno 13008,size 4096, error 6
vm_fault: pager read error, pid 5050 (zfsd)
GEOM_ELI: Device gptid/5575c4d7-b72f-11e7-96c3-1866da308b0d.eli destroyed.
GEOM_ELI: Detached gptid/5575c4d7-b72f-11e7-96c3-1866da308b0d.eli on last close.
GEOM_MIRROR: Device swap0: provider ada2p1 disconnected.
GEOM_MIRROR: Device swap0: provider destroyed.
GEOM_MIRROR: Device swap0 destroyed.
GEOM_ELI: Device mirror/swap0.eli destroyed.
GEOM_ELI: Detached mirror/swap0.eli on last close.
(ada2:ahcich3:0:0:0): Periph destroyed
swap_pager: I/O error - pagein failed; blkno 13000,size 4096, error 6
vm_fault: pager read error, pid 5050 (zfsd)
Failed to fully fault in a core file segment at VA 0x801711000 with size 0x1000 to be written at offset 0x4b000 for process zfsd
swap_pager: I/O error - pagein failed; blkno 2415,size 4096, error 6
vm_fault: pager read error, pid 5050 (zfsd)
Failed to fully fault in a core file segment at VA 0x802c22000 with size 0x1000 to be written at offset 0x81000 for process zfsd
swap_pager: I/O error - pagein failed; blkno 2416,size 4096, error 6
vm_fault: pager read error, pid 5050 (zfsd)
Failed to fully fault in a core file segment at VA 0x802e24000 with size 0x1000 to be written at offset 0x82000 for process zfsd
swap_pager: I/O error - pagein failed; blkno 2417,size 4096, error 6
vm_fault: pager read error, pid 5050 (zfsd)
Failed to fully fault in a core file segment at VA 0x803400000 with size 0x400000 to be written at offset 0x91000 for process zfsd
pid 5050 (zfsd), uid 0: exited on signal 11 (core dumped)
ada1 at ahcich2 bus 0 scbus2 target 0 lun 0
ada1: <WDC WD40EFRX-68N32N0 82.00A82> ACS-3 ATA SATA 3.x device
ada1: Serial Number WD-WCxxxxxx
ada1: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada1: Command Queueing enabled
ada1: 3815447MB (7814037168 512 byte sectors)
ada1: quirks=0x1<4K>
ada2 at ahcich3 bus 0 scbus3 target 0 lun 0
ada2: <WDC WD40EFRX-68WT0N0 82.00A82> ACS-2 ATA SATA 3.x device
ada2: Serial Number WD-WCC4xxxxxxx
ada2: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada2: Command Queueing enabled
ada2: 3815447MB (7814037168 512 byte sectors)
ada2: quirks=0x1<4K>
mps0: SAS Address for SATA device = 4962494bdf96bd74
mps0: SAS Address from SATA device = 4962494bdf96bd74
da6 at mps0 bus 0 scbus0 target 12 lun 0
da6: <ATA WDC WD40EFRX-68W 0A82> Fixed Direct Access SPC-4 SCSI device
da6: Serial Number WD-WCCxxxxxxx
da6: 600.000MB/s transfers
da6: Command Queueing enabled
da6: 3815447MB (7814037168 512 byte sectors)
da6: quirks=0x8<4K>
swap_pager: I/O error - pagein failed; blkno 17765,size 4096, error 6
vm_fault: pager read error, pid 3383 (consul-alerts)
swap_pager: I/O error - pagein failed; blkno 2368,size 49152, error 6
vm_fault: pager read error, pid 3377 (daemon)
swap_pager: I/O error - pagein failed; blkno 16141,size 12288, error 6
vm_fault: pager read error, pid 5339 (consul)

 

IceBoosteR

Guru
Joined
Sep 27, 2016
Messages
503
Code:
root@freenas:~ # zpool status
  pool: RED
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
		corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
		entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: resilvered 42.1M in 0 days 00:00:25 with 5708 errors on Thu Jul  5 22:15:36 2018
config:

		NAME												STATE	 READ WRITE CKSUM
		RED												 DEGRADED	 0	 0 6.25K
		  raidz2-0										  DEGRADED	 0	 0 14.6K
			10181539944339161351							UNAVAIL	  0	 0	 0  was /dev/gptid/5575c4d7-b72f-11e7-96c3-1866da308b0d.eli
			gptid/c7c9cba9-cd27-11e7-b158-1866da308b0d.eli  DEGRADED	 0	 0	 0  too many errors
			gptid/56f73609-b72f-11e7-96c3-1866da308b0d.eli  DEGRADED	 0	 0	 0  too many errors
			gptid/ee2f1174-1e6f-11e8-84e7-1866da308b0d.eli  ONLINE	   0	 0	 0
			gptid/5881e2ae-b72f-11e7-96c3-1866da308b0d.eli  DEGRADED	 0	 0	 0  too many errors
			gptid/593ba0f5-b72f-11e7-96c3-1866da308b0d.eli  DEGRADED	 0	 0	 0  too many errors
			gptid/59f5e3ca-b72f-11e7-96c3-1866da308b0d.eli  DEGRADED	 0	 0	 0  too many errors
			gptid/5aaf3ccd-b72f-11e7-96c3-1866da308b0d.eli  DEGRADED	 0	 0	 0  too many errors
			gptid/b4258cdf-77d1-11e8-9397-1866da308b0d.eli  DEGRADED	 0	 0	 0  too many errors
			10388524535076039676							UNAVAIL	  0	 0	 0  was /dev/gptid/5cc902e2-b72f-11e7-96c3-1866da308b0d.eli

errors: 5703 data errors, use '-v' for a list

  pool: SSD
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:04:45 with 0 errors on Sun May 27 00:04:45 2018
config:

		NAME										  STATE	 READ WRITE CKSUM
		SSD										   ONLINE	   0	 0	 0
		  gptid/01e17262-b6b2-11e7-8f4b-1866da308b0d  ONLINE	   0	 0	 0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:01:25 with 0 errors on Thu Jul  5 21:15:24 2018
config:

		NAME		STATE	 READ WRITE CKSUM
		freenas-boot  ONLINE	   0	 0	 0
		  mirror-0  ONLINE	   0	 0	 0
			da8p2   ONLINE	   0	 0	 0
			da9p2   ONLINE	   0	 0	 0

errors: No known data errors

 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
There are faults on da6, ada1 and ada2 in the text you posted. The one on da6 is on the SAS controller, and the ada errors are on the SATA controller. It is very unusual to be receiving so many errors ...
Do you have any idea how this could all happen? At first only one disk was affected, now it's another one. It's hot here in Germany right now, but the system is at 40 degrees
It was my first thought that it could be an overheating issue. If it did overheat, the controllers may be damaged to the point that they would need to be replaced. Can you get the temperature in the room down? I would shoot for a room temperature of 24°C if at all possible and be sure there is good airflow inside the case.
 

IceBoosteR

Guru
Joined
Sep 27, 2016
Messages
503
Room temperature is 23°.
I have an IR device to measure the temperature of the system. No component is over 41°, and the LSI controller is the hottest device...

I lost a ton of data. Good that I have a backup. But Jesus!
I pressed the button on the power supply's back that makes the internal fan run all the time.
https://seasonic.com/focus-plus-platinum

I guess that caused the issue. This killed everything, and the previous errors were also caused by the PSU. Can that be true?

Damn all the things.
 

IceBoosteR

Guru
Joined
Sep 27, 2016
Messages
503
But that means the SAS controller (LSI 2008) is not the root cause. Maybe it's the mainboard. But I have no idea how to troubleshoot that! :'(
 
Joined
Dec 29, 2014
Messages
1,135
I would have to say that either power or temperature is the most likely cause because you have errors across multiple high level devices. It could be the main board, but that seems less likely than power or temp.
 

IceBoosteR

Guru
Joined
Sep 27, 2016
Messages
503
I would say that temperature is not the problem. As I guess this big fat failure occurred when I pressed that button, the PSU killed the system. I am not sure why ZFS is killing my pool. I am not yet at the point where I can sync the data I still have access to. Only 50% of the shares are available, as the filesystem is corrupt.
MY GOD...

The best option would be to RMA that PSU. Does Seasonic give me compensation? :/
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
Do you have any other equipment on the power circuit? Refrigerator, air compressor, dehumidifier, microwave, stove, dishwasher, clothes washer, or anything else that may draw a large amount of power or have a large inductive load (think electric motors)? Do you have a UPS inline with your system? If you got all those errors when you turned on the fan on the PSU, I would assume the PSU is toast, but it may not be its fault.
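If a UPS is inline and the FreeNAS UPS service (NUT) is running, the line voltage can be checked directly — a sketch, assuming the default UPS name "ups":
Code:
upsc ups@localhost ups.status      # OL = on line power, OB = on battery
upsc ups@localhost input.voltage   # current input line voltage
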
 

IceBoosteR

Guru
Joined
Sep 27, 2016
Messages
503
No, the NAS, the UPS and a Raspberry Pi are on that rail, which is secured with a 16A fuse (or I guess so). Of course, there is a refrigerator and so on running in my household.
Can you explain your sentence:
If you got all those errors when you turned on the fan on the PSU, I would assume the PSU is toast, but it may not be its fault.
Is the PSU now toast or is it not?

I am 99% sure that it was the PSU; I think I will RMA it. Right now I am rebuilding the array and copying all the files from backup to that pool. I have put in a backup PSU until the other one arrives.

Or am I missing something here?
 


Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
What are the exact model numbers of your WD drives?
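(For reference, one way to read them off the system — a sketch; da6 is just an example device:)
Code:
camcontrol devlist                                            # quick overview (model strings may be truncated)
smartctl -i /dev/da6 | egrep 'Device Model|Firmware Version'  # full ATA identify strings
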
 

MrToddsFriends

Documentation Browser
Joined
Jan 12, 2015
Messages
1,338
For more information about a certain problematic WD Red 6TB series that causes errors resulting in CAM status: SCSI Status Error messages, see

https://forums.freenas.org/index.php?threads/various-scsi-sense-errors-during-scrubbing.54988/
https://forums.freenas.org/index.ph...ors-during-scrubbing.54988/page-5#post-403400
https://forums.freenas.org/index.php?threads/multiple-small-repairs-during-scrubs.48601/
https://forums.freenas.org/index.php?threads/warning-wd-red-wd60efrx-68l0bn1.58676/

It's unknown to me whether there is a WD Red 4TB series causing the same or similar problems, or whether it's only this particular WD Red 6TB series (WD60EFRX-68L0BN1).
 

IceBoosteR

Guru
Joined
Sep 27, 2016
Messages
503
What are the exact model numbers of your WD drives?
Hello Stux,

I have 4x WDC WD40EFRX-68WT0N0 and 6x WDC WD40EFRX-68N32N0.

Rgds, Ice
 

IceBoosteR

Guru
Joined
Sep 27, 2016
Messages
503
Bump up....
 