Disk was removed as a failed disk then swap_pager: I/O error

BurntTech · Oct 17, 2017

The failing disk didn't seem too surprising:

Code:

ahcich7: Timeout on slot 11 port 0
ahcich7: is 00000000 cs 00000800 ss 00000000 rs 00000800 tfd c0 serr 00000000 cmd 0004cb17
(ada6:ahcich7:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
(ada6:ahcich7:0:0:0): CAM status: Command timeout
(ada6:ahcich7:0:0:0): Retrying command
ahcich7: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich7: Timeout on slot 12 port 0
ahcich7: is 00000000 cs 00001000 ss 00000000 rs 00001000 tfd 80 serr 00000000 cmd 0004cc17
(aprobe0:ahcich7:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich7:0:0:0): CAM status: Command timeout
(aprobe0:ahcich7:0:0:0): Retrying command
ahcich7: Timeout on slot 25 port 0
ahcich7: is 00000000 cs 02000000 ss 00000000 rs 02000000 tfd d0 serr 00000000 cmd 0004d917
ahcich7: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich7: Timeout on slot 26 port 0
ahcich7: is 00000000 cs 04000000 ss 00000000 rs 04000000 tfd 80 serr 00000000 cmd 0004da17
(aprobe0:ahcich7:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich7:0:0:0): CAM status: Command timeout
(aprobe0:ahcich7:0:0:0): Retrying command
ahcich7: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich7: Timeout on slot 27 port 0
ahcich7: is 00000000 cs 08000000 ss 00000000 rs 08000000 tfd 80 serr 00000000 cmd 0004db17
(aprobe0:ahcich7:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich7:0:0:0): CAM status: Command timeout
(aprobe0:ahcich7:0:0:0): Error 5, Retries exhausted
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 2100740, size: 49152
ahcich7: AHCI reset: device not ready after 31000ms (tfd = 00000080)
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 2100740, size: 49152
ahcich7: Timeout on slot 28 port 0
ahcich7: is 00000000 cs 10000000 ss 00000000 rs 10000000 tfd 80 serr 00000000 cmd 0004dc17
(aprobe0:ahcich7:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich7:0:0:0): CAM status: Command timeout
(aprobe0:ahcich7:0:0:0): Error 5, Retry was blocked
ada6 at ahcich7 bus 0 scbus7 target 0 lun 0
ada6: <ST32000542AS CC34> s/n 5XW28KWS detached

Then comes the part that concerns me: it appears zfsd core dumped. Did ZFS crash?

Code:

GEOM_ELI: g_eli_read_done() failed (error=6) ada6p1.eli[READ(offset=14680064, length=49152)]
swap_pager: I/O error - pagein failed; blkno 2100740,size 49152, error 6
vm_fault: pager read error, pid 2940 (python3.6)
pid 2940 (python3.6), uid 0: exited on signal 11
GEOM_ELI: Device ada6p1.eli destroyed.
GEOM_ELI: Detached ada6p1.eli on last close.
swap_pager: I/O error - pagein failed; blkno 2101950,size 8192, error 6
vm_fault: pager read error, pid 3769 (zfsd)
swap_pager: I/O error - pagein failed; blkno 2098787,size 4096, error 6
vm_fault: pager read error, pid 3769 (zfsd)
Failed to fully fault in a core file segment at VA 0x800679000 with size 0x2f000 to be written at offset 0x10000 for process zfsd
swap_pager: I/O error - pagein failed; blkno 2098818,size 4096, error 6
vm_fault: pager read error, pid 3769 (zfsd)
Failed to fully fault in a core file segment at VA 0x801dfa000 with size 0x7000 to be written at offset 0x4f000 for process zfsd
swap_pager: I/O error - pagein failed; blkno 2098819,size 4096, error 6
vm_fault: pager read error, pid 3769 (zfsd)
Failed to fully fault in a core file segment at VA 0x80303a000 with size 0x1000 to be written at offset 0x84000 for process zfsd
swap_pager: I/O error - pagein failed; blkno 2098820,size 4096, error 6
vm_fault: pager read error, pid 3769 (zfsd)
Failed to fully fault in a core file segment at VA 0x803400000 with size 0x400000 to be written at offset 0x92000 for process zfsd
swap_pager: I/O error - pagein failed; blkno 2111670,size 8192, error 6
vm_fault: pager read error, pid 3769 (zfsd)
Failed to fully fault in a core file segment at VA 0x7ffffffdf000 with size 0x20000 to be written at offset 0x492000 for process zfsd
pid 3769 (zfsd), uid 0: exited on signal 11 (core dumped)
ahcich7: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich7: Timeout on slot 29 port 0
ahcich7: is 00000000 cs 20000000 ss 00000000 rs 20000000 tfd 80 serr 00000000 cmd 0004dd17
(aprobe0:ahcich7:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich7:0:0:0): CAM status: Command timeout
(aprobe0:ahcich7:0:0:0): Retrying command
ahcich7: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich7: Timeout on slot 30 port 0
ahcich7: is 00000000 cs 40000000 ss 00000000 rs 40000000 tfd 80 serr 00000000 cmd 0004de17
(aprobe0:ahcich7:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich7:0:0:0): CAM status: Command timeout
(aprobe0:ahcich7:0:0:0): Error 5, Retries exhausted
(ada6:ahcich7:0:0:0): Periph destroyed
ada6 at ahcich7 bus 0 scbus7 target 0 lun 0
ada6: <ST32000542AS CC34> ATA8-ACS SATA 2.x device
ada6: Serial Number 5XW28KWS
ada6: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada6: Command Queueing enabled
ada6: 1907729MB (3907029168 512 byte sectors)

This is the rest of the noise that continued. The web GUI stopped responding and started reporting that the HTTP sessions were aborted (HTTP 499), which they were: when a page didn't open after 60 seconds, I closed the tab.

Code:

swap_pager: I/O error - pagein failed; blkno 2098667,size 8192, error 6
vm_fault: pager read error, pid 215 (python3.6)
swap_pager: I/O error - pagein failed; blkno 2098916,size 65536, error 6
vm_fault: pager read error, pid 215 (python3.6)
swap_pager: I/O error - pagein failed; blkno 2098931,size 4096, error 6
vm_fault: pager read error, pid 215 (python3.6)
swap_pager: I/O error - pagein failed; blkno 2098908,size 4096, error 6
vm_fault: pager read error, pid 215 (python3.6)
Failed to fully fault in a core file segment at VA 0x800622000 with size 0x2c000 to be written at offset 0x25000 for process python3.6
swap_pager: I/O error - pagein failed; blkno 2104739,size 4096, error 6
vm_fault: pager read error, pid 215 (python3.6)
Failed to fully fault in a core file segment at VA 0x800665000 with size 0x18e000 to be written at offset 0x68000 for process python3.6
swap_pager: I/O error - pagein failed; blkno 2104961,size 4096, error 6
vm_fault: pager read error, pid 215 (python3.6)
Failed to fully fault in a core file segment at VA 0x8007f3000 with size 0x9000 to be written at offset 0x1f6000 for process python3.6
swap_pager: I/O error - pagein failed; blkno 2099745,size 4096, error 6
vm_fault: pager read error, pid 215 (python3.6)
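
To see at a glance which processes were hit, the `vm_fault` lines in the excerpts above can be tallied per process. This is only a minimal sketch: the function name and the shortened sample text are illustrative, and the regular expression assumes the exact message format shown in these logs. Note that zfsd and python3.6 are userland processes; the log shows them failing to page in from swap on the detached ada6p1.eli device (error 6), which is not the same as ZFS itself crashing.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   vm_fault: pager read error, pid 3769 (zfsd)
PAGER_ERROR = re.compile(r"vm_fault: pager read error, pid (\d+) \(([^)]+)\)")


def summarize_pager_errors(log_text):
    """Count failed swap page-ins per (pid, process name)."""
    counts = Counter()
    for match in PAGER_ERROR.finditer(log_text):
        counts[(int(match.group(1)), match.group(2))] += 1
    return counts


# Shortened sample taken from the log excerpts in this thread.
SAMPLE = """\
swap_pager: I/O error - pagein failed; blkno 2100740,size 49152, error 6
vm_fault: pager read error, pid 2940 (python3.6)
swap_pager: I/O error - pagein failed; blkno 2101950,size 8192, error 6
vm_fault: pager read error, pid 3769 (zfsd)
swap_pager: I/O error - pagein failed; blkno 2098787,size 4096, error 6
vm_fault: pager read error, pid 3769 (zfsd)
"""

for (pid, name), n in summarize_pager_errors(SAMPLE).most_common():
    print(f"pid {pid} ({name}): {n} failed page-ins")
```

On the shortened sample this reports two failed page-ins for zfsd and one for python3.6; feeding it the full console log would give the complete picture.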

I tried to restart nginx and that failed, so I started to assume the boot drive was having issues, but it's mirrored and shows no pool errors. I couldn't find a way to run SMART on those USB sticks with the device parameter. The SATA2TB pool no longer shows the failed disk after I ran a scrub on it, which was weird.

Code:

  pool: SATA2TB
 state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Tue Oct 17 12:40:43 2017
config:

        NAME                                            STATE     READ WRITE CKSUM
        SATA2TB                                         ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/6a9fd7d6-a913-11e7-80cd-bc5ff4a85b10  ONLINE       0     0     0
            gptid/6bc35ce7-a913-11e7-80cd-bc5ff4a85b10  ONLINE       0     0     0
            gptid/6ccf6f39-a913-11e7-80cd-bc5ff4a85b10  ONLINE       0     0     0
            gptid/6dc0a9a4-a913-11e7-80cd-bc5ff4a85b10  ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0h3m with 0 errors on Tue Oct 17 12:05:21 2017
config:

        NAME          STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            da0p2     ONLINE       0     0     0
            da1p2     ONLINE       0     0     0

errors: No known data errors

  pool: small
 state: ONLINE
  scan: scrub repaired 0 in 0h4m with 0 errors on Tue Oct 17 12:07:05 2017
config:

        NAME                                          STATE     READ WRITE CKSUM
        small                                         ONLINE       0     0     0
          gptid/25640af8-a7e0-11e7-86f7-bc5ff4a85b10  ONLINE       0     0     0

errors: No known data errors

Here is the SMART data for the failing drive. A scrub shouldn't add this drive back in, should it?

Code:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   118   100   006    Pre-fail  Always       -       193403480
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       22
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   051   051   030    Pre-fail  Always       -       141738809141
  9 Power_On_Hours          0x0032   069   069   000    Old_age   Always       -       27667
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       22
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   098   098   000    Old_age   Always       -       8590065667
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   071   058   045    Old_age   Always       -       29 (Min/Max 29/35)
194 Temperature_Celsius     0x0022   029   042   000    Old_age   Always       -       29 (0 13 0 0 0)
195 Hardware_ECC_Recovered  0x001a   026   024   000    Old_age   Always       -       193403480
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       18856 (176 241 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       1548236095
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1946007858
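
The very large raw values above (Command_Timeout, Seek_Error_Rate, Raw_Read_Error_Rate) are easier to read once split into bit fields. The sketch below assumes the commonly cited Seagate convention, which is not confirmed anywhere in this thread: Command_Timeout packs three 16-bit counters into its 48-bit raw value, and the two error-rate attributes keep an error count in the upper 16 bits and an operation count in the lower 32 bits. The helper names are made up for illustration.

```python
def split_raw48(raw):
    """Split a 48-bit SMART raw value into three 16-bit words, high to low."""
    return ((raw >> 32) & 0xFFFF, (raw >> 16) & 0xFFFF, raw & 0xFFFF)


def split_hi16_lo32(raw):
    """Split a 48-bit raw value into (upper 16 bits, lower 32 bits)."""
    return ((raw >> 32) & 0xFFFF, raw & 0xFFFFFFFF)


# Raw values copied from the SMART table above.
print(split_raw48(8590065667))        # Command_Timeout     -> (2, 2, 3)
print(split_hi16_lo32(141738809141))  # Seek_Error_Rate     -> (33, 4888373)
print(split_hi16_lo32(193403480))     # Raw_Read_Error_Rate -> (0, 193403480)
```

Under that assumption, the Command_Timeout raw value is a handful of small counters rather than billions of timeouts. The attributes that need no decoding are Current_Pending_Sector and Offline_Uncorrectable, both at 1.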

I've got a replacement drive coming in today, but I'm more interested in what the next steps should be. Since the web GUI isn't working, I would rather not reboot with a failing disk if I don't have to.

BurntTech · Oct 18, 2017

The reboot hung at "All buffers synced". After a restart it came back fine, and I used the GUI to replace the disk, which finished really quickly. When kill -9 doesn't work on nginx, I get worried. I still don't like how unstable the OS/FreeNAS became after a disk failed on a secondary pool, but I guess all is well now.

Stux · Oct 18, 2017

Having swap in use when a disk fails will crash your kernel.

There is a script to periodically page in swap. Or you can relocate swap to other devices.

dlavigne · Oct 18, 2017

Note that this should no longer be an issue as of 11.1: https://bugs.freenas.org/issues/23523.

Arwen · Oct 18, 2017

That was one thing Corral had that has yet to make it into the other released versions (9.10.x & 11.0.x).

I want mirrored swap for just this reason.

BurntTech · Oct 18, 2017

Would swap be on a non-primary ZFS pool? I have mirrored USB sticks for boot and what I thought was a primary ZFS pool.

Stux · Oct 18, 2017

Swap is, by default, striped across the disks that make up your storage pool, not the boot pool. And it shouldn't be on a USB boot pool either.

As mentioned, as a workaround until 11.1, you can run this script, say, every 10 minutes:
https://forums.freenas.org/index.ph...ny-used-swap-to-prevent-kernel-crashes.46206/
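
For readers who cannot open the link, the general idea of such a workaround can be sketched as follows. This is not the linked script, only a hypothetical illustration: it assumes the FreeBSD `swapinfo -k` column layout shown in the sample (device names and numbers in the sample are made up), and it assumes that cycling swap with `swapoff -a` followed by `swapon -a` is an acceptable way to force swapped-out pages back into RAM. `used_swap_kib` and `page_in_swap_if_used` are illustrative names.

```python
import subprocess


def used_swap_kib(swapinfo_output):
    """Sum the Used column (KiB) over the per-device lines of `swapinfo -k`."""
    used = 0
    for line in swapinfo_output.splitlines():
        fields = line.split()
        # Skip the header and the "Total" line printed when there are several devices.
        if len(fields) < 5 or fields[0] in ("Device", "Total"):
            continue
        used += int(fields[2])
    return used


def page_in_swap_if_used():
    """Hypothetical periodic job: cycle swap only when some of it is in use."""
    result = subprocess.run(
        ["swapinfo", "-k"], capture_output=True, text=True, check=True
    )
    if used_swap_kib(result.stdout) > 0:
        # swapoff has to bring swapped-out pages back into RAM before it can
        # release the devices, so it fails if they do not fit in free memory.
        subprocess.run(["swapoff", "-a"], check=True)
        subprocess.run(["swapon", "-a"], check=True)


# Made-up sample in the assumed `swapinfo -k` layout.
SAMPLE = """\
Device          1K-blocks     Used    Avail Capacity
/dev/ada5p1.eli   2097152        0  2097152     0%
/dev/ada6p1.eli   2097152    12345  2084807     1%
Total             4194304    12345  4181959     0%
"""

print(used_swap_kib(SAMPLE))  # 12345
```

Only the parsing helper runs here; `page_in_swap_if_used()` would need root on the FreeNAS box and would be called from a periodic job, for example every 10 minutes as suggested above.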
