SMART long test caused major R/W errors and took the pool offline

SMnasMAN

(11.2-U5) I have a pool, ssd2x6x480GBz1 (it's da0 through da11 that make up this pool), built from 12x 480GB HGST enterprise HUSMR1650ASS204 SSDs, connected to my main FreeNAS 4U box via a BPN-SAS-216EL1 expander / 2U disk shelf (the 2U shelf hangs off an LSI 9207-8e HBA at firmware P20). The pool uses an Intel P900 Optane as its ZIL/SLOG device.
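(For anyone who wants to double-check the HBA firmware on a similar setup, something like sas2flash should show the phase; rough sketch only, output omitted.)

Code:
# rough sketch - confirm the 9207-8e firmware phase (P20) and BIOS revision
sas2flash -listall      # lists all LSI SAS2 controllers with FW/BIOS versions
sas2flash -c 0 -list    # detailed info for controller 0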

Everything had been running great for about a week or two now. (The pool is mainly used as an NFS share for my VMware cluster of 3x hosts running the latest version of ESXi 6.5 with vCenter.)

I woke up this morning and all hell had broken loose on the VMware cluster: the NFS share backed by ssd2x6x480GBz1 was gone.

In FreeNAS, the pool is listed as ONLINE but none of the datastores are shown, and pool status shows a few read and write errors across all of the disks in the pool. (All other pools, on other disks, are fine, but those are all on different controllers too.)

I'm almost 100% sure this was caused by all of the disks starting their scheduled long SMART self-tests at the same time (I had just set this up to run once a month, following a guide here on the forums, which I still agree with). So my current best guess is: with all the SSDs running their long self-tests (which, 56 minutes later, they all passed, by the way), latency increased and IO requests started timing out (or the expander or HBA started timing out IO requests), which in turn caused FreeNAS to start logging errors and eventually take down the pool (as a safeguard, maybe?). After a reboot (shutdown hung, so power off / power on), the disks showed up fine, at which point I ran a scrub: 0 errors.
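For reference, the monthly schedule effectively puts the same long-test directive on every disk, which is why all of them kicked off within the same second (see the 06:21:37 smartd lines below). A rough sketch of what that looks like in smartd.conf terms (FreeNAS generates this file from the GUI, so the exact regex here is just an example of the -s syntax):

Code:
# sketch of the generated smartd.conf entries (example schedule: 1st of the month, 06:00)
# -s L/MM/DD/d/HH  ->  L = long self-test, regex matched against month/day/weekday/hour
/dev/da2  -a -s L/../01/./06
/dev/da3  -a -s L/../01/./06
# ...identical directive on every disk in the pool, so all 12 tests start together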


Here are the logs I was able to grab (and why I think this was the cause).
(Of course my Splunk syslog server was running on this NFS SSD volume, so it died with it! Lesson learned there.)

It's a long log, so here is a pastebin of the whole thing, and I'll try to pull out some of the relevant parts below:
https://pastebin.com/GR863TTA

Note: in nearly 10 months of stress testing this exact hardware (and FreeNAS), I had never once seen any CDB errors from these SSDs / expander / HBA. However, this is the first time I've run long SMART tests on these specific drives (if that is indeed the cause). I'm looking for any info or input from you all, as this was pretty shocking (and it seems like it could have been a lot worse).

SMART data on all 12 SSDs looks about perfect.
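(A quick way to eyeball all twelve, roughly what I did; sketch only, device names per the pool layout above.)

Code:
# quick health status + self-test log for each SSD in the pool (rough sketch)
for d in da0 da1 da2 da3 da4 da5 da6 da7 da8 da9 da10 da11; do
  echo "=== /dev/$d ==="
  smartctl -H -l selftest /dev/$d
done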

Some raw data I pulled:

Code:
during the issue:

  pool: ssd2x6x480GBz1
state: ONLINE
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://illumos.org/msg/ZFS-8000-JQ
  scan: none requested
config:

        NAME                                            STATE     READ WRITE CKSUM
        ssd2x6x480GBz1                                  ONLINE      16     8     0
          raidz1-0                                      ONLINE      15     4     0
            gptid/1eae5c41-c634-11e9-975e-00259084f1c8  ONLINE       0     0     0
            gptid/1f09b571-c634-11e9-975e-00259084f1c8  ONLINE       2     0     0
            gptid/1f7dea92-c634-11e9-975e-00259084f1c8  ONLINE       8   120     0
            gptid/1fde322b-c634-11e9-975e-00259084f1c8  ONLINE       2     0     0
            gptid/203ae8b8-c634-11e9-975e-00259084f1c8  ONLINE       3     2     0
            gptid/20ae8720-c634-11e9-975e-00259084f1c8  ONLINE       0     2     0
          raidz1-1                                      ONLINE      10     4     0
            gptid/212383df-c634-11e9-975e-00259084f1c8  ONLINE       3     6     0
            gptid/218371b5-c634-11e9-975e-00259084f1c8  ONLINE       3     6     0
            gptid/21e22950-c634-11e9-975e-00259084f1c8  ONLINE       3     6     0
            gptid/22567ad2-c634-11e9-975e-00259084f1c8  ONLINE       0     0     0
            gptid/22b87683-c634-11e9-975e-00259084f1c8  ONLINE       2     0     0
            gptid/231a7df4-c634-11e9-975e-00259084f1c8  ONLINE       1     4     0
        logs
          gptid/f166ac78-c7bc-11e9-975e-00259084f1c8    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x0>
        ssd2x6x480GBz1/VMwareNFSssdZ1:<0xa21>
        ssd2x6x480GBz1/VMwareNFSssdZ1:<0x576>
        ssd2x6x480GBz1/VMwareNFSssdZ1:<0x3cc>
        ssd2x6x480GBz1/VMwareNFSssdZ1:<0x2d6>



All of the SSDs' SMART output looks like this:

# smartctl -a /dev/da2
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HSCP0480S5xnNMRI
Revision:             3P02
Compliance:           SPC-4
User Capacity:        480,999,997,440 bytes [480 GB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate:        Solid State Device
Form Factor:          2.5 inches
Logical Unit id:      0x5000cca04e0b9898
Serial number:        0QV6BPRA
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Mon Sep  2 15:34:09 2019 CDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Percentage used endurance indicator: 1%
Current Drive Temperature:     33 C
Drive Trip Temperature:        70 C

Manufactured in week 04 of year 2015
Specified cycle count over device lifetime:  0
Accumulated start-stop cycles:  0
Specified load-unload count over device lifetime:  0
Accumulated load-unload cycles:  0
Vendor (Seagate) cache information
  Blocks sent to initiator = 56558735039922948

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0      80589.840           0
write:         0        0         0         0          0      73225.161           0
verify:        0        0         0         0          0      56863.026           0

Non-medium error count:        0

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                   -   24828                 - [-   -    -]

Long (extended) Self Test duration: 3468 seconds [57.8 minutes]



And items I pulled from the logs during the issue:


Sep  2 06:21:37 freenas smartd[84638]: Device: /dev/da11, starting scheduled Long Self-Test.
Sep  2 06:21:37 freenas smartd[84638]: Device: /dev/da12 [SAT], starting scheduled Long Self-Test.
Sep  2 06:21:37 freenas smartd[84638]: Device: /dev/da2, starting scheduled Long Self-Test.
Sep  2 06:21:37 freenas smartd[84638]: Device: /dev/da21 [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 68 to 69
Sep  2 06:21:37 freenas smartd[84638]: Device: /dev/da21 [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 32 to 31
Sep  2 06:21:37 freenas smartd[84638]: Device: /dev/da21 [SAT], starting scheduled Long Self-Test.
Sep  2 06:21:37 freenas smartd[84638]: Device: /dev/da3, starting scheduled Long Self-Test.
Sep  2 06:21:37 freenas smartd[84638]: Device: /dev/da34, starting scheduled Long Self-Test.
Sep  2 06:21:37 freenas smartd[84638]: Device: /dev/da35, starting scheduled Long Self-Test.
Sep  2 06:21:37 freenas smartd[84638]: Device: /dev/da36, starting scheduled Long Self-Test.
Sep  2 06:21:37 freenas smartd[84638]: Device: /dev/da4, starting scheduled Long Self-Test.
Sep  2 06:21:37 freenas smartd[84638]: Device: /dev/da5, starting scheduled Long Self-Test.
Sep  2 06:21:37 freenas smartd[84638]: Device: /dev/da6, starting scheduled Long Self-Test.
Sep  2 06:21:37 freenas smartd[84638]: Device: /dev/da7, starting scheduled Long Self-Test.
Sep  2 06:21:37 freenas smartd[84638]: Device: /dev/da8, starting scheduled Long Self-Test.
Sep  2 06:21:37 freenas smartd[84638]: Device: /dev/da9, starting scheduled Long Self-Test.
Sep  2 06:36:59 freenas         (da2:mps0:0:63:0): READ(10). CDB: 28 00 34 7b 50 90 00 00 38 00 length 28672 SMID 757 Aborting command 0xfffffe0001527190
Sep  2 06:36:59 freenas mps0: Sending reset from mpssas_send_abort for target ID 63
Sep  2 06:37:02 freenas         (da0:mps0:0:59:0): WRITE(10). CDB: 2a 00 29 87 96 40 00 00 08 00 length 4096 SMID 404 Aborting command 0xfffffe000150a240
Sep  2 06:37:02 freenas         (da4:mps0:0:65:0): WRITE(10). CDB: 2a 00 20 72 c8 98 00 00 38 00 length 28672 SMID 667 Aborting command 0xfffffe000151fb70
Sep  2 06:37:02 freenas mps0: Sending reset from mpssas_send_abort for target ID 59
Sep  2 06:37:02 freenas mps0: Sending reset from mpssas_send_abort for target ID 65
Sep  2 06:37:03 freenas         (da2:mps0:0:63:0): READ(10). CDB: 28 00 34 7b 4f f0 00 00 38 00 length 28672 SMID 400 Aborting command 0xfffffe0001509d00
Sep  2 06:37:03 freenas (da2:mps0:0:63:0): CAM status: CAM subsystem is busy
Sep  2 06:37:03 freenas (da2:mps0:0:63:0): CAM status: Command timeout
Sep  2 06:37:03 freenas (da2:mps0:0:63:0): READ(10). CDB: 28 00 34 7b 50 90 00 00 38 00
Sep  2 06:37:03 freenas (da2:mps0:0:63:0): Retrying command
Sep  2 06:37:03 freenas mps0: mpssas_action_scsiio: Freezing devq for target ID 63
Sep  2 06:37:03 freenas mps0: Sending reset from mpssas_send_abort for target ID 63
Sep  2 06:37:04 freenas (da2:mps0:0:63:0): CAM status: CAM subsystem is busy
Sep  2 06:37:04 freenas (da2:mps0:0:63:0): Error 5, Retries exhausted
Sep  2 06:37:04 freenas (da2:mps0:0:63:0): READ(10). CDB: 28 00 00 40 02 90 00 00 10 00
Sep  2 06:37:04 freenas (da2:mps0:0:63:0): READ(10). CDB: 28 00 34 7b 50 90 00 00 38 00
Sep  2 06:37:04 freenas (da2:mps0:0:63:0): Retrying command
Sep  2 06:37:04 freenas mps0: mpssas_action_scsiio: Freezing devq for target ID 63
Sep  2 06:37:05 freenas (da2:mps0:0:63:0): CAM status: CAM subsystem is busy
Sep  2 06:37:05 freenas (da2:mps0:0:63:0): READ(10). CDB: 28 00 00 40 02 90 00 00 10 00
Sep  2 06:37:05 freenas (da2:mps0:0:63:0): Retrying command
Sep  2 06:37:05 freenas mps0: mpssas_action_scsiio: Freezing devq for target ID 63
Sep  2 06:37:06 freenas (da0:mps0:0:59:0): CAM status: Command timeout
Sep  2 06:37:06 freenas (da0:mps0:0:59:0): Retrying command


Sep  2 06:37:06 freenas Unfreezing devq for target ID 59
Sep  2 06:37:07 freenas         (da2:mps0:0:63:0): READ(10). CDB: 28 00 34 7b 50 58 00 00 38 00 length 28672 SMID 1084 Aborting command 0xfffffe0001541ec0



Sep  2 06:37:13 freenas zfsd:   Vdev State = ONLINE
Sep  2 06:37:13 freenas zfsd: CaseFile(12609516878478959410,8386571431618165161,id1,enc@n5003048000c0fa7d/type@0/slot@8/elmdesc@Slot_08/p2)
Sep  2 06:37:13 freenas zfsd: CaseFile(12609516878478959410,8386571431618165161) closed - State ONLINE
Sep  2 06:37:13 freenas zfsd: Creating new CaseFile:
Sep  2 06:37:14 freenas (da2:mps0:0:63:0): CAM status: CAM subsystem is busy
Sep  2 06:37:14 freenas (da2:mps0:0:63:0): Error 5, Retries exhausted


Sep  2 06:38:32 freenas mps0: mpssas_action_scsiio: Freezing devq for target ID 65
Sep  2 06:38:33 freenas (da4:mps0:0:65:0): CAM status: CAM subsystem is busy
Sep  2 06:38:33 freenas (da4:mps0:0:65:0): Error 5, Retries exhausted
Sep  2 06:38:33 freenas (da4:mps0:0:65:0): READ(10). CDB: 28 00 00 40 02 90 00 00 10 00
Sep  2 06:38:33 freenas (da4:mps0:0:65:0): Retrying command
Sep  2 06:38:33 freenas (da4:mps0:0:65:0): WRITE(10). CDB: 2a 00 09 ed 7f 80 00 00 10 00
Sep  2 06:38:33 freenas mps0: mpssas_action_scsiio: Freezing devq for target ID 65



Sep  2 06:40:34 freenas mps0: Unfreezing devq for target ID 68
Sep  2 06:40:42 freenas zfsd:   Vdev State = ONLINE
Sep  2 06:40:42 freenas zfsd: CaseFile(12609516878478959410,13190870688141053705,id1,enc@n5003048000c0fa7d/type@0/slot@d/elmdesc@Slot_13/p2)
Sep  2 06:40:42 freenas zfsd: Creating new CaseFile:
Sep  2 06:40:42 freenas zfsd: Evaluating existing case file
Sep  2 06:40:42 freenas zfsd: ZFS: Notify  class=ereport.fs.zfs.io ena=14853523331502576641 parent_guid=12609516878478959410 parent_type=root pool=ssd2x6x480GBz1 pool_context=0 pool_failmode=continue pool_guid=12609516878478959410 subsystem=ZFS timestamp=1567424442 type=ereport.fs.zfs.io vdev_guid=16827335826208397105 vdev_type=raidz zio_blkid=11368 zio_err=5 zio_level=0 zio_object=726 zio_objset=79 zio_offset=498760335360 zio_size=112640
Sep  2 06:40:42 freenas zfsd: ZFS: Notify  class=ereport.fs.zfs.io ena=14853523331502576641 parent_guid=16827335826208397105 parent_type=raidz pool=ssd2x6x480GBz1 pool_context=0 pool_failmode=continue pool_guid=12609516878478959410 subsystem=ZFS timestamp=1567424442 type=ereport.fs.zfs.io vdev_guid=13190870688141053705 vdev_path=/dev/gptid/20ae8720-c634-11e9-975e-00259084f1c8 vdev_type=disk zio_blkid=11368 zio_err=5 zio_level=0 zio_object=726 zio_objset=79 zio_offset=83130916864 zio_size=20480
Sep  2 06:40:42 freenas zfsd: ZFS: Notify  class=ereport.fs.zfs.io ena=14853523331502579713 parent_guid=12609516878478959410 parent_type=root pool=ssd2x6x480GBz1 pool_context=0 pool_failmode=continue pool_guid=12609516878478959410 subsystem=ZFS timestamp=1567424442 type=ereport.fs.zfs.io vdev_guid=16827335826208397105 vdev_type=raidz zio_blkid=28281 zio_err=5 zio_level=0 zio_object=726 zio_objset=79 zio_offset=498760695808 zio_size=26112
Sep  2 06:40:42 freenas zfsd: ZfsEvent::Process: Event for unknown vdev 16827335826208397105 queued
Sep  2 06:40:43 freenas         (da4:mps0:0:65:0): WRITE(10). CDB: 2a 00 0a 17 57 58 00 00 38 00 length 28672 SMID 1025 Aborting command 0xfffffe000153d150
Sep  2 06:40:43 freenas mps0: Sending reset from mpssas_send_abort for target ID 65
Sep  2 06:40:44 freenas         (da9:mps0:0:77:0): UNMAP. CDB: 42 00 00 00 00 00 00 00 18 00 length 24 SMID 993 Aborting command 0xfffffe000153a750
Sep  2 06:40:44 freenas mps0: Sending reset from mpssas_send_abort for target ID 77
Sep  2 06:40:47 freenas         (da4:mps0:0:65:0): READ(10). CDB: 28 00 34 7b 51 98 00 00 38 00 length 28672 SMID 303 Aborting command 0xfffffe0001501db0
Sep  2 06:40:47 freenas         (pass0:mps0:0:59:0): LOG SENSE. CDB: 4d 00 43 00 00 00 00 00 58 00 length 88 SMID 358 Aborting command 0xfffffe00015065e0
Sep  2 06:40:47 freenas (da4:mps0:0:65:0): CAM status: CAM subsystem is busy
Sep  2 06:40:47 freenas (da4:mps0:0:65:0): CAM status: Command timeout
Sep  2 06:40:47 freenas (da4:mps0:0:65:0): Retrying command
Sep  2 06:40:47 freenas (da4:mps0:0:65:0): WRITE(10). CDB: 2a 00 0a 17 57 58 00 00 38 00
Sep  2 06:40:47 freenas mps0: mpssas_action_scsiio: Freezing devq for target ID 65
Sep  2 06:40:47 freenas mps0: Sending reset from mpssas_send_abort for target ID 59
Sep  2 06:40:47 freenas mps0: Sending reset from mpssas_send_abort for target ID 65
Sep  2 06:40:48 freenas (da4:mps0:0:65:0): CAM status: CAM subsystem is busy
Sep  2 06:40:48 freenas (da4:mps0:0:65:0): Error 5, Retries exhausted


Sep  2 06:41:01 freenas (da4:mps0:0:65:0): Retrying command
Sep  2 06:41:01 freenas (da4:mps0:0:65:0): WRITE(10). CDB: 2a 00 0a 17 57 c8 00 00 30 00
Sep  2 06:41:01 freenas mps0: mpssas_action_scsiio: Freezing devq for target ID 65
Sep  2 06:41:01 freenas zfsd:   Vdev State = ONLINE
Sep  2 06:41:01 freenas zfsd: CaseFile(12609516878478959410,7983501312401782520,id1,enc@n5003048000c0fa7d/type@0/slot@a/elmdesc@Slot_10/p2)
Sep  2 06:41:01 freenas zfsd: CaseFile(12609516878478959410,7983501312401782520) closed - State ONLINE
Sep  2 06:41:01 freenas zfsd: Creating new CaseFile:
Sep  2 06:41:01 freenas zfsd: ZFS: Notify  class=ereport.fs.zfs.io ena=14873616139029581825 parent_guid=12609516878478959410 parent_type=root pool=ssd2x6x480GBz1 pool_context=0 pool_failmode=continue pool_guid=12609516878478959410 subsystem=ZFS timestamp=1567424461 type=ereport.fs.zfs.io vdev_guid=16827335826208397105 vdev_type=raidz zio_blkid=666701 zio_err=6 zio_level=0 zio_object=1398 zio_objset=79 zio_offset=2691976478720 zio_size=131072
Sep  2 06:41:01 freenas zfsd: ZFS: Notify  class=ereport.fs.zfs.io ena=14873616139029584897 parent_guid=12609516878478959410 parent_type=root pool=ssd2x6x480GBz1 pool_context=0 pool_failmode=continue pool_guid=12609516878478959410 subsystem=ZFS timestamp=1567424461 type=ereport.fs.zfs.io vdev_guid=16827335826208397105 vdev_type=raidz zio_blkid=39370 zio_err=6 zio_level=0 zio_object=2593 zio_objset=79 zio_offset=987861475328 zio_size=23552
Sep  2 06:41:01 freenas zfsd: ZFS: Notify  class=ereport.fs.zfs.io ena=14873616139029593089 parent_guid=12609516878478959410 parent_type=root pool=ssd2x6x480GBz1 pool_context=0 pool_failmode=continue pool_guid=12609516878478959410 subsystem=ZFS timestamp=1567424461 type=ereport.fs.zfs.io vdev_guid=16827335826208397105 vdev_type=raidz zio_blkid=666702 zio_err=6 zio_level=0 zio_object=1398 zio_objset=79 zio_offset=2691976642560 zio_size=131072
Sep  2 06:41:01 freenas zfsd: ZfsEvent::Process: Event for unknown vdev 16827335826208397105 queued
Sep  2 06:41:02 freenas (da4:mps0:0:65:0): CAM status: CAM subsystem is busy
Sep  2 06:41:02 freenas (da4:mps0:0:65:0): Retrying command
Sep  2 06:41:02 freenas (da4:mps0:0:65:0): WRITE(10). CDB: 2a 00 0a 17 57 c8 00 00 30 00
Sep  2 06:41:02 freenas mps0: mpssas_action_scsiio: Freezing devq for target ID 6


Sep  2 06:49:55 freenas (da10:mps0:0:78:0): READ(10). CDB: 28 00 37 fe e8 90 00 00 10 00
Sep  2 06:49:55 freenas (da10:mps0:0:78:0): Retrying command
Sep  2 06:49:55 freenas mps0: mpssas_action_scsiio: Freezing devq for target ID 78
Sep  2 06:49:56 freenas (da10:mps0:0:78:0): CAM status: CAM subsystem is busy
Sep  2 06:49:56 freenas (da10:mps0:0:78:0): CAM status: Command timeout
Sep  2 06:49:56 freenas (da10:mps0:0:78:0): READ(10). CDB: 28 00 37 fe e8 90 00 00 10 00
Sep  2 06:49:56 freenas (da10:mps0:0:78:0): Retrying command
Sep  2 06:49:56 freenas (da10:mps0:0:78:0): UNMAP. CDB: 42 00 00 00 00 00 00 00 18 00
Sep  2 06:49:56 freenas mps0: mpssas_action_scsiio: Freezing devq for target ID 78
Sep  2 06:49:56 freenas mps0: Unfreezing devq for target ID 78
Sep  2 06:50:07 freenas zfsd: Evaluating existing case file
Sep  2 06:50:07 freenas zfsd: ZFS: Notify  class=ereport.fs.zfs.io ena=15446347261536113665 parent_guid=16827335826208397105 parent_type=raidz pool=ssd2x6x480GBz1 pool_context=0 pool_failmode=continue pool_guid=12609516878478959410 subsystem=ZFS timestamp=1567425007 type=ereport.fs.zfs.io vdev_guid=1758322195652712981 vdev_path=/dev/gptid/1fde322b-c634-11e9-975e-00259084f1c8 vdev_type=disk zio_blkid=668231 zio_err=5 zio_level=0 zio_object=1398 zio_objset=79 zio_offset=115123744768 zio_size=24576
Sep  2 06:50:28 freenas zfsd: Evaluating existing case file
Sep  2 06:50:28 freenas zfsd: ZFS: Notify  class=ereport.fs.zfs.io ena=15446347261536113665 parent_guid=12609516878478959410 parent_type=root pool=ssd2x6x480GBz1 pool_context=0 pool_failmode=continue pool_guid=12609516878478959410 subsystem=ZFS timestamp=1567425028 type=ereport.fs.zfs.io vdev_guid=16827335826208397105 vdev_type=raidz zio_blkid=668231 zio_err=5 zio_level=0 zio_object=1398 zio_objset=79 zio_offset=690717302784 zio_size=131072
Sep  2 06:50:28 freenas zfsd: ZFS: Notify  class=ereport.fs.zfs.io ena=15446347261536113665 parent_guid=16827335826208397105 parent_type=raidz pool=ssd2x6x480GBz1 pool_context=0 pool_failmode=continue pool_guid=12609516878478959410 subsystem=ZFS timestamp=1567425028 type=ereport.fs.zfs.io vdev_guid=5936719101646054698 vdev_path=/dev/gptid/1f09b571-c634-11e9-975e-00259084f1c8 vdev_type=disk zio_blkid=668231 zio_err=5 zio_level=0 zio_object=1398 zio_objset=79 zio_offset=115123744768 zio_size=28672
Sep  2 06:50:28 freenas zfsd: ZFS: Notify  class=ereport.fs.zfs.io ena=15467621818753354753 parent_guid=12609516878478959410 parent_type=root pool=ssd2x6x480GBz1 pool_context=0 pool_failmode=continue pool_guid=12609516878478959410 subsystem=ZFS timestamp=1567425027 type=ereport.fs.zfs.io vdev_guid=16827335826208397105 vdev_type=raidz zio_blkid=668230 zio_err=6 zio_level=0 zio_object=1398 zio_objset=79 zio_offset=690717138944 zio_size=131072
Sep  2 06:50:28 freenas zfsd: ZfsEvent::Process: Event for unknown vdev 16827335826208397105 queued
Sep  2 06:50:40 freenas zfsd:   Vdev State = ONLINE
Sep  2 06:50:40 freenas zfsd: CaseFile(12609516878478959410,13579473535141282300,id1,enc@n5003048000c0fa7d/type@0/slot@15/elmdesc@Slot_21/p2)
Sep  2 06:50:40 freenas zfsd: Creating new CaseFile:
Sep  2 06:50:40 freenas zfsd: Evaluating existing case file
Sep  2 06:50:40 freenas zfsd: ZFS: Notify  class=ereport.fs.zfs.io ena=15480865334799371265 parent_guid=12609516878478959410 parent_type=root pool=ssd2x6x480GBz1 pool_context=0 pool_failmode=continue pool_guid=12609516878478959410 subsystem=ZFS timestamp=1567425040 type=ereport.fs.zfs.io vdev_guid=3203759444463917272 vdev_type=raidz zio_blkid=668233 zio_err=5 zio_level=0 zio_object=1398 zio_objset=79 zio_offset=2537086459904 zio_size=131072
Sep  2 06:50:40 freenas zfsd: ZFS: Notify  class=ereport.fs.zfs.io ena=15480865334799371265 parent_guid=3203759444463917272 parent_type=raidz pool=ssd2x6x480GBz1 pool_context=0 pool_failmode=continue pool_guid=12609516878478959410 subsystem=ZFS timestamp=1567425040 type=ereport.fs.zfs.io vdev_guid=13579473535141282300 vdev_path=/dev/gptid/22b87683-c634-11e9-975e-00259084f1c8 vdev_type=disk zio_blkid=668233 zio_err=5 zio_level=0 zio_object=1398 zio_objset=79 zio_offset=422851936256 zio_size=28672
Sep  2 06:50:40 freenas zfsd: ZFS: Notify  class=ereport.fs.zfs.io ena=15480865334799373313 parent_guid=12609516878478959410 parent_type=root pool=ssd2x6x480GBz1 pool_context=0 pool_failmode=continue pool_guid=12609516878478959410 subsystem=ZFS timestamp=1567425040 type=ereport.fs.zfs.io vdev_guid=3203759444463917272 vdev_type=raidz zio_blkid=668232 zio_err=5 zio_level=0 zio_object=1398 zio_objset=79 zio_offset=2537085968384 zio_size=131072
Sep  2 06:50:40 freenas zfsd: ZfsEvent::Process: Event for unknown vdev 3203759444463917272 queued
Sep  2 06:50:56 freenas smartd[84638]: Device: /dev/da27 [SAT], self-test in progress, 90% remaining
 

SMnasMAN

Well, as a test I started a long SMART test on just one of the 480GB HGSTs... IT DID IT AGAIN!! (But again, the test shows completed with 0 errors.)
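(For reference, I kicked it off manually with something along these lines; the device name here is just an example, it was one of da0-da11.)

Code:
# rough sketch - manually start a long (extended) self-test on a single SSD
smartctl -t long /dev/da2
# then check progress / result with:
smartctl -l selftest /dev/da2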

(This is my guess below; I would love to hear what others say or have experienced.)
These drives seem to get to a point in the long SMART test (about 80% in) where the drive locks up and eventually drops out of FreeNAS. Once FreeNAS had given up on the drive, I physically pulled it and put it right back in (all in about 1 second), and everything went back to normal (so to speak).
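(In case it helps anyone hitting the same thing, the recovery afterwards was basically the steps below, from memory; pool name as above. Just a sketch, not a recommendation.)

Code:
# rough sketch - clear the logged errors, then verify the pool with a scrub
zpool clear ssd2x6x480GBz1
zpool scrub ssd2x6x480GBz1
zpool status -v ssd2x6x480GBz1   # watch the scrub; it finished with 0 errors repaired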

This is how the pool looks now (after I ran a scrub):
Code:
 pool: ssd2x6x480GBz1
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 0 days 00:27:19 with 0 errors on Tue Sep  3 01:29:24 2019
config:

        NAME                                            STATE     READ WRITE CKSUM
        ssd2x6x480GBz1                                  ONLINE      23     0     0
          raidz1-0                                      ONLINE      53     0     0
            gptid/1eae5c41-c634-11e9-975e-00259084f1c8  ONLINE       0     0     0
            gptid/1f09b571-c634-11e9-975e-00259084f1c8  ONLINE       0     0     0
            gptid/1f7dea92-c634-11e9-975e-00259084f1c8  ONLINE       4    89     0
            gptid/1fde322b-c634-11e9-975e-00259084f1c8  ONLINE       0     0     0
            gptid/203ae8b8-c634-11e9-975e-00259084f1c8  ONLINE       0     0     0
            gptid/20ae8720-c634-11e9-975e-00259084f1c8  ONLINE       0     0     0
          raidz1-1                                      ONLINE       0     0     0
            gptid/212383df-c634-11e9-975e-00259084f1c8  ONLINE       0     0     0
            gptid/218371b5-c634-11e9-975e-00259084f1c8  ONLINE       0     0     0
            gptid/21e22950-c634-11e9-975e-00259084f1c8  ONLINE       0     0     0
            gptid/22567ad2-c634-11e9-975e-00259084f1c8  ONLINE       0     0     0
            gptid/22b87683-c634-11e9-975e-00259084f1c8  ONLINE       0     0     0
            gptid/231a7df4-c634-11e9-975e-00259084f1c8  ONLINE       0     0     0
        logs
          gptid/f166ac78-c7bc-11e9-975e-00259084f1c8    ONLINE       0     0     0
 