Replacing an encrypted drive (possibly with itself) and failure diagnosis

Touche

Explorer
Joined
Nov 26, 2016
Messages
55
FreeNAS-11.2-U8
CPU: Intel i3-6100
MEM: Samsung 16GB DDR4 2133 ECC X8 2R (M391A2K43BB1-CPB)
MBO: Supermicro X11SSL-CF
Data: 6x Toshiba DT01ACA300 + 1 hot spare
Boot: Samsung 830 SSD
PSU: Corsair RM750x
Encrypted pool with a passphrase already set.

After a long time, I've had a drive drop out. It seems similar to my previous issues with compatibility between the Toshiba DT01ACA3 drives and the LSI 3008. The error is slightly different now, but I'm not sure if it's due to FreeNAS and HBA FW updates, or if this latest one is unrelated.

FreeNAS HDD issue:
Code:
Mar 24 06:03:23 NAS1 ahcich3: Timeout on slot 14 port 0
Mar 24 06:03:23 NAS1 ahcich3: is 00000000 cs 00004000 ss 00000000 rs 00004000 tfd d0 serr 00000000 cmd 0004ce17
Mar 24 06:03:23 NAS1 (ada3:ahcich3:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Mar 24 06:03:23 NAS1 (ada3:ahcich3:0:0:0): CAM status: Command timeout
Mar 24 06:03:23 NAS1 (ada3:ahcich3:0:0:0): Retrying command
Mar 24 06:03:56 NAS1 ahcich3: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Mar 24 06:04:26 NAS1 ahcich3: Timeout on slot 15 port 0
Mar 24 06:04:26 NAS1 ahcich3: is 00000000 cs 00000000 ss 00000000 rs 00008000 tfd 150 serr 00000000 cmd 0004cf17
Mar 24 06:04:26 NAS1 (aprobe0:ahcich3:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Mar 24 06:04:26 NAS1 (aprobe0:ahcich3:0:0:0): CAM status: Command timeout
Mar 24 06:04:26 NAS1 (aprobe0:ahcich3:0:0:0): Retrying command
Mar 24 06:06:22 NAS1 ahcich3: Timeout on slot 2 port 0
Mar 24 06:06:22 NAS1 ahcich3: is 00000000 cs 00000004 ss 00000000 rs 00000004 tfd d0 serr 00000000 cmd 0004c217
Mar 24 06:06:55 NAS1 ahcich3: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Mar 24 06:07:25 NAS1 ahcich3: Timeout on slot 3 port 0
Mar 24 06:07:25 NAS1 ahcich3: is 00000000 cs 00000008 ss 00000000 rs 00000008 tfd 80 serr 00000000 cmd 0004c317
Mar 24 06:07:25 NAS1 (aprobe0:ahcich3:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Mar 24 06:07:25 NAS1 (aprobe0:ahcich3:0:0:0): CAM status: Command timeout
Mar 24 06:07:25 NAS1 (aprobe0:ahcich3:0:0:0): Retrying command
Mar 24 06:08:16 NAS1 ahcich3: Timeout on slot 9 port 0
Mar 24 06:08:16 NAS1 ahcich3: is 00000000 cs 00000000 ss 0000fffc rs 0000fffc tfd 50 serr 00000000 cmd 0004cf17
Mar 24 06:08:16 NAS1 (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 48 11 01 40 2d 00 00 01 00 00
Mar 24 06:08:16 NAS1 (ada3:ahcich3:0:0:0): CAM status: Command timeout
Mar 24 06:08:16 NAS1 (ada3:ahcich3:0:0:0): Retrying command
Mar 24 06:09:26 NAS1 ahcich3: Timeout on slot 23 port 0
Mar 24 06:09:26 NAS1 ahcich3: is 00000000 cs 00800000 ss 00000000 rs 00800000 tfd d0 serr 00000000 cmd 0004d717
Mar 24 06:09:26 NAS1 (ada3:ahcich3:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Mar 24 06:09:26 NAS1 (ada3:ahcich3:0:0:0): CAM status: Command timeout
Mar 24 06:09:26 NAS1 (ada3:ahcich3:0:0:0): Retrying command
Mar 24 06:10:06 NAS1 ahcich3: Timeout on slot 11 port 0
Mar 24 06:10:06 NAS1 ahcich3: is 00000000 cs 00000800 ss 00000000 rs 00000800 tfd d0 serr 00000000 cmd 0004cb17
Mar 24 06:10:06 NAS1 (ada3:ahcich3:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Mar 24 06:10:06 NAS1 (ada3:ahcich3:0:0:0): CAM status: Command timeout
Mar 24 06:10:06 NAS1 (ada3:ahcich3:0:0:0): Retrying command
Mar 24 06:10:39 NAS1 ahcich3: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Mar 24 06:11:09 NAS1 ahcich3: Timeout on slot 12 port 0
Mar 24 06:11:09 NAS1 ahcich3: is 00000000 cs 00000000 ss 00000000 rs 00001000 tfd 150 serr 00000000 cmd 0004cc17
Mar 24 06:11:09 NAS1 (aprobe0:ahcich3:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Mar 24 06:11:09 NAS1 (aprobe0:ahcich3:0:0:0): CAM status: Command timeout
Mar 24 06:11:09 NAS1 (aprobe0:ahcich3:0:0:0): Retrying command
Mar 24 06:11:41 NAS1 ahcich3: Timeout on slot 1 port 0
Mar 24 06:11:41 NAS1 ahcich3: is 00000000 cs 00000000 ss 00000006 rs 00000006 tfd 50 serr 00000000 cmd 0004c217
Mar 24 06:11:41 NAS1 (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 c8 3a 7f 40 2d 00 00 01 00 00
Mar 24 06:11:41 NAS1 (ada3:ahcich3:0:0:0): CAM status: Command timeout
Mar 24 06:11:41 NAS1 (ada3:ahcich3:0:0:0): Retrying command
Mar 24 06:12:39 NAS1 ahcich3: Timeout on slot 26 port 0
Mar 24 06:12:39 NAS1 ahcich3: is 00000000 cs 04000000 ss 00000000 rs 04000000 tfd d0 serr 00000000 cmd 0004da17
Mar 24 06:12:39 NAS1 (ada3:ahcich3:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Mar 24 06:12:39 NAS1 (ada3:ahcich3:0:0:0): CAM status: Command timeout
Mar 24 06:12:39 NAS1 (ada3:ahcich3:0:0:0): Retrying command
Mar 24 06:13:12 NAS1 ahcich3: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Mar 24 06:13:42 NAS1 ahcich3: Timeout on slot 27 port 0
Mar 24 06:13:42 NAS1 ahcich3: is 00000000 cs 08000000 ss 00000000 rs 08000000 tfd 80 serr 00000000 cmd 0004db17
Mar 24 06:13:42 NAS1 (aprobe0:ahcich3:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Mar 24 06:13:42 NAS1 (aprobe0:ahcich3:0:0:0): CAM status: Command timeout
Mar 24 06:13:42 NAS1 (aprobe0:ahcich3:0:0:0): Retrying command
Mar 24 06:14:15 NAS1 ahcich3: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Mar 24 06:14:45 NAS1 ahcich3: Timeout on slot 28 port 0
Mar 24 06:14:45 NAS1 ahcich3: is 00000000 cs 10000000 ss 00000000 rs 10000000 tfd 80 serr 00000000 cmd 0004dc17
Mar 24 06:14:45 NAS1 (aprobe0:ahcich3:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Mar 24 06:14:45 NAS1 (aprobe0:ahcich3:0:0:0): CAM status: Command timeout
Mar 24 06:14:45 NAS1 (aprobe0:ahcich3:0:0:0): Error 5, Retries exhausted
Mar 24 06:15:18 NAS1 ahcich3: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Mar 24 06:15:48 NAS1 ahcich3: Timeout on slot 29 port 0
Mar 24 06:15:48 NAS1 ahcich3: is 00000000 cs 20000000 ss 00000000 rs 20000000 tfd 80 serr 00000000 cmd 0004dd17
Mar 24 06:15:48 NAS1 (aprobe0:ahcich3:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Mar 24 06:15:48 NAS1 (aprobe0:ahcich3:0:0:0): CAM status: Command timeout
Mar 24 06:15:48 NAS1 (aprobe0:ahcich3:0:0:0): Error 5, Retry was blocked
Mar 24 06:15:48 NAS1 ada3 at ahcich3 bus 0 scbus4 target 0 lun 0
Mar 24 06:15:48 NAS1 ada3: <TOSHIBA DT01ACA300 MX6OABB0> s/n 1234ABCD detached
Mar 24 06:16:02 NAS1 GEOM_ELI: g_eli_write_done() failed (error=6) gptid/2c59d96f-7a1f-11e8-9c02-0cc47a868d02.eli[WRITE(offset=225280, length=4096)]
Mar 24 06:16:02 NAS1 GEOM_ELI: g_eli_write_done() failed (error=6) gptid/2c59d96f-7a1f-11e8-9c02-0cc47a868d02.eli[WRITE(offset=487424, length=4096)]
Mar 24 06:16:02 NAS1 GEOM_ELI: g_eli_read_done() failed (error=6) gptid/2c59d96f-7a1f-11e8-9c02-0cc47a868d02.eli[READ(offset=270336, length=8192)]
Mar 24 06:16:02 NAS1 GEOM_ELI: g_eli_read_done() failed (error=6) gptid/2c59d96f-7a1f-11e8-9c02-0cc47a868d02.eli[READ(offset=2998444761088, length=8192)]
Mar 24 06:16:02 NAS1 GEOM_ELI: g_eli_read_done() failed (error=6) gptid/2c59d96f-7a1f-11e8-9c02-0cc47a868d02.eli[READ(offset=2998445023232, length=8192)]
Mar 24 06:16:02 NAS1 GEOM_ELI: g_eli_write_done() failed (error=6) gptid/2c59d96f-7a1f-11e8-9c02-0cc47a868d02.eli[WRITE(offset=2998444978176, length=4096)]
Mar 24 06:16:02 NAS1 GEOM_ELI: g_eli_write_done() failed (error=6) gptid/2c59d96f-7a1f-11e8-9c02-0cc47a868d02.eli[WRITE(offset=2998445240320, length=4096)]
Mar 24 06:16:02 NAS1 GEOM_MIRROR: Device swap2: provider ada3p1 disconnected.
Mar 24 06:16:02 NAS1 ZFS: vdev state changed, pool_guid=1367972315423497889 vdev_guid=8011460992587928332
Mar 24 06:16:03 NAS1 GEOM_ELI: Device gptid/2c59d96f-7a1f-11e8-9c02-0cc47a868d02.eli destroyed.
Mar 24 06:16:03 NAS1 GEOM_ELI: Detached gptid/2c59d96f-7a1f-11e8-9c02-0cc47a868d02.eli on last close.
Mar 24 06:16:03 NAS1 (ada3:ahcich3:0:0:0): Periph destroyed
Mar 24 06:16:03 NAS1 ada3 at ahcich3 bus 0 scbus4 target 0 lun 0
Mar 24 06:16:03 NAS1 ada3: <TOSHIBA DT01ACA300 MX6OABB0> ATA8-ACS SATA 3.x device
Mar 24 06:16:03 NAS1 ada3: Serial Number 1234ABCD
Mar 24 06:16:03 NAS1 ada3: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
Mar 24 06:16:03 NAS1 ada3: Command Queueing enabled
Mar 24 06:16:03 NAS1 ada3: 2861588MB (5860533168 512 byte sectors)
Mar 24 06:16:19 NAS1 ZFS: vdev state changed, pool_guid=1367972315423497889 vdev_guid=8011460992587928332
Mar 24 06:16:19 NAS1 ZFS: vdev state changed, pool_guid=1367972315423497889 vdev_guid=3439061830581111410
Mar 24 06:16:19 NAS1 GEOM_ELI: Device mirror/swap2.eli destroyed.
Mar 24 06:16:19 NAS1 GEOM_MIRROR: Device swap2: provider destroyed.
Mar 24 06:16:19 NAS1 GEOM_MIRROR: Device swap2 destroyed.
Mar 24 06:16:49 NAS1 ahcich3: Timeout on slot 31 port 0
Mar 24 06:16:49 NAS1 ahcich3: is 00000000 cs 00000000 ss 80000001 rs 80000001 tfd 50 serr 00000000 cmd 0004c017
Mar 24 06:16:49 NAS1 (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 80 fc 3f 40 00 00 00 01 00 00
Mar 24 06:16:49 NAS1 (ada3:ahcich3:0:0:0): CAM status: Command timeout
Mar 24 06:16:49 NAS1 (ada3:ahcich3:0:0:0): Retrying command
Mar 24 06:17:21 NAS1 ahcich3: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Mar 24 06:17:51 NAS1 ahcich3: Timeout on slot 1 port 0
Mar 24 06:17:51 NAS1 ahcich3: is 00000000 cs 00000002 ss 00000000 rs 00000002 tfd 80 serr 00000000 cmd 0004c117
Mar 24 06:17:51 NAS1 (aprobe0:ahcich3:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Mar 24 06:17:51 NAS1 (aprobe0:ahcich3:0:0:0): CAM status: Command timeout
Mar 24 06:17:51 NAS1 (aprobe0:ahcich3:0:0:0): Retrying command
Mar 24 06:18:22 NAS1 ahcich3: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Mar 24 06:18:52 NAS1 ahcich3: Timeout on slot 2 port 0
Mar 24 06:18:52 NAS1 ahcich3: is 00000000 cs 00000004 ss 00000000 rs 00000004 tfd 80 serr 00000000 cmd 0004c217
Mar 24 06:18:52 NAS1 (aprobe0:ahcich3:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Mar 24 06:18:52 NAS1 (aprobe0:ahcich3:0:0:0): CAM status: Command timeout
Mar 24 06:18:52 NAS1 (aprobe0:ahcich3:0:0:0): Error 5, Retries exhausted
Mar 24 06:19:23 NAS1 ahcich3: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Mar 24 06:19:53 NAS1 ahcich3: Timeout on slot 3 port 0
Mar 24 06:19:53 NAS1 ahcich3: is 00000000 cs 00000000 ss 00000000 rs 00000008 tfd 150 serr 00000000 cmd 0004c317
Mar 24 06:19:53 NAS1 (aprobe0:ahcich3:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Mar 24 06:19:53 NAS1 (aprobe0:ahcich3:0:0:0): CAM status: Command timeout
Mar 24 06:19:53 NAS1 (aprobe0:ahcich3:0:0:0): Error 5, Retry was blocked
Mar 24 06:19:53 NAS1 ada3 at ahcich3 bus 0 scbus4 target 0 lun 0
Mar 24 06:19:53 NAS1 ada3: <TOSHIBA DT01ACA300 MX6OABB0> s/n 1234ABCD detached
Mar 24 06:19:54 NAS1 smartd[3617]: Device: /dev/ada3, failed to read SMART Attribute Data
Mar 24 06:19:54 NAS1 savecore: /dev/ada1p1: Operation not permitted
Mar 24 06:19:54 NAS1 g_access(944): provider ada3 has error 6 set
Mar 24 06:19:54 NAS1 g_access(944): provider ada3 has error 6 set
Mar 24 06:19:54 NAS1 g_access(944): provider ada3 has error 6 set
Mar 24 06:19:54 NAS1 (ada3:ahcich3:0:0:0): Periph destroyed
Mar 24 06:19:54 NAS1 ada3 at ahcich3 bus 0 scbus4 target 0 lun 0
Mar 24 06:19:54 NAS1 ada3: <TOSHIBA DT01ACA300 MX6OABB0> ATA8-ACS SATA 3.x device
Mar 24 06:19:54 NAS1 ada3: Serial Number 1234ABCD
Mar 24 06:19:54 NAS1 ada3: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
Mar 24 06:19:54 NAS1 ada3: Command Queueing enabled
Mar 24 06:19:54 NAS1 ada3: 2861588MB (5860533168 512 byte sectors)
Mar 24 06:19:57 NAS1 GEOM_MIRROR: Device mirror/swap2 launched (2/2).
Mar 24 06:19:57 NAS1 GEOM_ELI: Device mirror/swap2.eli created.
Mar 24 06:19:57 NAS1 GEOM_ELI: Encryption: AES-XTS 128
Mar 24 06:19:57 NAS1 GEOM_ELI:     Crypto: hardware
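
To get a quick overview of a long log like the one above, the events can be tallied per device. The following is only a rough Python sketch: the regular expressions are written against the message shapes visible in this log, and the function name `tally` is hypothetical.

```python
import re
from collections import Counter

# Patterns modelled on the message shapes in the log above (an assumption,
# not an exhaustive list of what the AHCI driver can print).
TIMEOUT = re.compile(r"(ahcich\d+): Timeout on slot \d+ port \d+")
RESET = re.compile(r"(ahcich\d+): AHCI reset: device not ready")
DETACH = re.compile(r"(ada\d+): <.*> s/n \S+ detached")

def tally(lines):
    """Count timeouts, failed resets and detaches per device name."""
    counts = Counter()
    for line in lines:
        for label, pattern in (("timeout", TIMEOUT), ("reset", RESET), ("detach", DETACH)):
            m = pattern.search(line)
            if m:
                counts[(m.group(1), label)] += 1
    return counts

sample = [
    "Mar 24 06:03:23 NAS1 ahcich3: Timeout on slot 14 port 0",
    "Mar 24 06:03:56 NAS1 ahcich3: AHCI reset: device not ready after 31000ms (tfd = 00000080)",
    "Mar 24 06:04:26 NAS1 ahcich3: Timeout on slot 15 port 0",
    "Mar 24 06:15:48 NAS1 ada3: <TOSHIBA DT01ACA300 MX6OABB0> s/n 1234ABCD detached",
]
result = tally(sample)
for (device, label), n in sorted(result.items()):
    print(device, label, n)
```

Feeding it the full /var/log/messages excerpt instead of `sample` would show how many timeouts preceded each detach.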


Zpool status:
Code:
[root@NAS1 ~]$ zpool status
  pool: DBR-pool1
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 1.93T in 0 days 04:52:05 with 0 errors on Wed Mar 24 11:08:16 2021
config:

        NAME                                                  STATE     READ WRITE CKSUM
        DBR-pool1                                             DEGRADED     0     0     0
          mirror-0                                            DEGRADED     0     0     0
            gptid/7442e5da-31c0-11e8-80b9-0cc47a868d02.eli    ONLINE       0     0     0
            spare-1                                           REMOVED      0     0     0
              8011460992587928332                             REMOVED      0     0     0  was /dev/gptid/2c59d96f-7a1f-11e8-9c02-0cc47a868d02.eli
              gptid/36f62083-7d28-11e8-949e-0cc47a868d02.eli  ONLINE       0     0 1.36K
          mirror-1                                            ONLINE       0     0     0
            gptid/7dd9ce06-31c0-11e8-80b9-0cc47a868d02.eli    ONLINE       0     0     0
            gptid/813153f4-31c0-11e8-80b9-0cc47a868d02.eli    ONLINE       0     0     0
          mirror-2                                            ONLINE       0     0     0
            gptid/8790b712-31c0-11e8-80b9-0cc47a868d02.eli    ONLINE       0     0     0
            gptid/8af9d490-31c0-11e8-80b9-0cc47a868d02.eli    ONLINE       0     0     0
        spares
          6616561076194790823                                 INUSE     was /dev/gptid/36f62083-7d28-11e8-949e-0cc47a868d02.eli

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:00:16 with 0 errors on Thu Mar 18 03:45:16 2021
config:

        NAME        STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          ada2p2    ONLINE       0     0     0

errors: No known data errors
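
The rows of interest in the status output above are the ones that are not ONLINE or that carry non-zero error counters. A minimal Python sketch for picking them out of saved `zpool status` text might look like this; the list of state names in the pattern and the function name `problem_rows` are assumptions made for illustration.

```python
import re

# name, state, READ, WRITE, CKSUM; the state list is an assumption of this sketch.
ROW = re.compile(
    r"^\s+(\S+)\s+(ONLINE|DEGRADED|REMOVED|OFFLINE|FAULTED|UNAVAIL)\s+(\S+)\s+(\S+)\s+(\S+)"
)

def problem_rows(status_text):
    """Return (name, state, cksum) for rows that are not ONLINE or have non-zero counters."""
    rows = []
    for line in status_text.splitlines():
        m = ROW.match(line)
        if not m:
            continue
        name, state, read, write, cksum = m.groups()
        # Counters are compared as strings because zpool abbreviates them (e.g. 1.36K).
        if state != "ONLINE" or (read, write, cksum) != ("0", "0", "0"):
            rows.append((name, state, cksum))
    return rows

sample = """
        DBR-pool1                                             DEGRADED     0     0     0
          mirror-0                                            DEGRADED     0     0     0
            gptid/7442e5da-31c0-11e8-80b9-0cc47a868d02.eli    ONLINE       0     0     0
            spare-1                                           REMOVED      0     0     0
              8011460992587928332                             REMOVED      0     0     0  was /dev/gptid/2c59d96f-7a1f-11e8-9c02-0cc47a868d02.eli
              gptid/36f62083-7d28-11e8-949e-0cc47a868d02.eli  ONLINE       0     0 1.36K
"""
found = problem_rows(sample)
for name, state, cksum in found:
    print(name, state, cksum)
```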


SMART:
Code:
root@NAS1:~ # smartctl -x /dev/ada3
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Toshiba 3.5" DT01ACA... Desktop HDD
Device Model:     TOSHIBA DT01ACA300
Serial Number:    1234ABCD
LU WWN Device Id: 5 000039 fe3d2e1b4
Firmware Version: MX6OABB0
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Mar 24 12:04:44 2021 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Disabled
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (22508) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 376) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     PO-R--   100   100   016    -    0
  2 Throughput_Performance  P-S---   140   140   054    -    68
  3 Spin_Up_Time            POS---   151   151   024    -    365 (Average 395)
  4 Start_Stop_Count        -O--C-   100   100   000    -    135
  5 Reallocated_Sector_Ct   PO--CK   100   100   005    -    6
  7 Seek_Error_Rate         PO-R--   078   078   067    -    3670016
  8 Seek_Time_Performance   P-S---   124   124   020    -    33
  9 Power_On_Hours          -O--C-   097   097   000    -    26629
 10 Spin_Retry_Count        PO--C-   100   100   060    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    135
192 Power-Off_Retract_Count -O--CK   099   099   000    -    1626
193 Load_Cycle_Count        -O--C-   099   099   000    -    1632
194 Temperature_Celsius     -O----   181   181   000    -    33 (Min/Max 17/47)
196 Reallocated_Event_Count -O--CK   100   100   000    -    15
197 Current_Pending_Sector  -O---K   100   100   000    -    0
198 Offline_Uncorrectable   ---R--   100   100   000    -    0
199 UDMA_CRC_Error_Count    -O-R--   200   200   000    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x03       GPL     R/O      1  Ext. Comprehensive SMART error log
0x04       GPL     R/O      7  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x08       GPL     R/O      2  Power Conditions log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x20       GPL     R/O      1  Streaming performance log [OBS-8]
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (1 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     26568         -
# 2  Extended offline    Completed without error       00%     26504         -
# 3  Short offline       Completed without error       00%     26472         -
# 4  Short offline       Completed without error       00%     26377         -
# 5  Short offline       Completed without error       00%     26281         -
# 6  Short offline       Completed without error       00%     26185         -
# 7  Extended offline    Interrupted (host reset)      80%     26116         -
# 8  Short offline       Completed without error       00%     26089         -
# 9  Short offline       Completed without error       00%     25993         -
#10  Short offline       Completed without error       00%     25897         -
#11  Extended offline    Completed without error       00%     25833         -
#12  Short offline       Completed without error       00%     25801         -
#13  Short offline       Completed without error       00%     25705         -
#14  Short offline       Completed without error       00%     25609         -
#15  Short offline       Completed without error       00%     25513         -
#16  Extended offline    Completed without error       00%     25449         -
#17  Short offline       Completed without error       00%     25417         -
#18  Short offline       Completed without error       00%     25345         -
#19  Short offline       Completed without error       00%     25249         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       256 (0x0100)
SCT Support Level:                   1
Device State:                        SMART Off-line Data Collection executing in background (4)
Current Temperature:                    33 Celsius
Power Cycle Min/Max Temperature:     28/36 Celsius
Lifetime    Min/Max Temperature:     17/47 Celsius
Under/Over Temperature Limit Count:   0/0

SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/60 Celsius
Min/Max Temperature Limit:           -40/70 Celsius
Temperature History Size (Index):    128 (24)

Index    Estimated Time   Temperature Celsius
  25    2021-03-24 09:57    33  **************
 ...    ..(  4 skipped).    ..  **************
  30    2021-03-24 10:02    33  **************
  31    2021-03-24 10:03    34  ***************
  32    2021-03-24 10:04    34  ***************
  33    2021-03-24 10:05    34  ***************
  34    2021-03-24 10:06    33  **************
 ...    ..( 15 skipped).    ..  **************
  50    2021-03-24 10:22    33  **************
  51    2021-03-24 10:23    34  ***************
 ...    ..(  2 skipped).    ..  ***************
  54    2021-03-24 10:26    34  ***************
  55    2021-03-24 10:27    33  **************
 ...    ..(  7 skipped).    ..  **************
  63    2021-03-24 10:35    33  **************
  64    2021-03-24 10:36    34  ***************
 ...    ..(  6 skipped).    ..  ***************
  71    2021-03-24 10:43    34  ***************
  72    2021-03-24 10:44    33  **************
 ...    ..( 28 skipped).    ..  **************
 101    2021-03-24 11:13    33  **************
 102    2021-03-24 11:14    34  ***************
 103    2021-03-24 11:15    33  **************
 ...    ..( 16 skipped).    ..  **************
 120    2021-03-24 11:32    33  **************
 121    2021-03-24 11:33    34  ***************
 122    2021-03-24 11:34    34  ***************
 123    2021-03-24 11:35    34  ***************
 124    2021-03-24 11:36    33  **************
 ...    ..( 27 skipped).    ..  **************
  24    2021-03-24 12:04    33  **************

SCT Error Recovery Control:
           Read: Disabled
          Write: Disabled

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 1) ==
0x01  0x008  4             135  ---  Lifetime Power-On Resets
0x01  0x010  4           26629  ---  Power-on Hours
0x01  0x018  6     96059608892  ---  Logical Sectors Written
0x01  0x020  6       429892287  ---  Number of Write Commands
0x01  0x028  6    190742716163  ---  Logical Sectors Read
0x01  0x030  6       705245995  ---  Number of Read Commands
0x03  =====  =               =  ===  == Rotating Media Statistics (rev 1) ==
0x03  0x008  4           26625  ---  Spindle Motor Power-on Hours
0x03  0x010  4           26625  ---  Head Flying Hours
0x03  0x018  4            1632  ---  Head Load Events
0x03  0x020  4               6  ---  Number of Reallocated Logical Sectors
0x03  0x028  4             579  ---  Read Recovery Attempts
0x03  0x030  4               7  ---  Number of Mechanical Start Failures
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4               7  ---  Resets Between Cmd Acceptance and Completion
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              33  ---  Current Temperature
0x05  0x010  1              33  N--  Average Short Term Temperature
0x05  0x018  1              32  N--  Average Long Term Temperature
0x05  0x020  1              47  ---  Highest Temperature
0x05  0x028  1              17  ---  Lowest Temperature
0x05  0x030  1              43  N--  Highest Average Short Term Temperature
0x05  0x038  1              25  N--  Lowest Average Short Term Temperature
0x05  0x040  1              42  N--  Highest Average Long Term Temperature
0x05  0x048  1              25  N--  Lowest Average Long Term Temperature
0x05  0x050  4               0  ---  Time in Over-Temperature
0x05  0x058  1              60  ---  Specified Maximum Operating Temperature
0x05  0x060  4               0  ---  Time in Under-Temperature
0x05  0x068  1               0  ---  Specified Minimum Operating Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4             513  ---  Number of Hardware Resets
0x06  0x010  4             288  ---  Number of ASR Events
0x06  0x018  4               0  ---  Number of Interface CRC Errors
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0009  2           19  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2           13  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000d  2            0  Non-CRC errors within host-to-device FIS



The Reallocated Sector Count isn't ideal, so the drive might actually be failing, but I suspect the dropout might be due to the above-mentioned compatibility issue.

Is the SMART error issue big enough to retire the drive regardless?

My main question is regarding the replacement procedure. I've read the manual and some of the answers on the forum, but I'm still not 100% sure I've got the procedure right.

The manual states:
9.5.1.1. Replacing an Encrypted Disk
If the ZFS pool is encrypted, additional steps are needed when replacing a failed drive.

Ensure a passphrase has been set before attempting to replace the failed drive. To replace a drive in an encrypted pool,
  1. Go to Storage ➞ Pools. Locate the encrypted pool and click (Settings) ➞ Status. Find the disk to replace and click (Options) ➞ Replace. Enter the passphrase that was set for the pool, then click REPLACE DISK.
  2. Click (Encryption Options) ➞ Add Recovery Key and enter the root password to save the new recovery key. The old recovery key will no longer function and can be safely discarded.
The pool can be rekeyed by clicking (Encryption Options) ➞ Encryption Rekey.

Is this the correct procedure:
1. Change ADA3 from REMOVED to OFFLINE
2. Shut down and replace ADA3 with a new drive OR simply restart if ADA3 is presumed OK
3. Boot, Go to Storage ➞ Pools. Locate the encrypted pool and click (Settings) ➞ Status. Find the disk to replace and click (Options) ➞ Replace. Enter the passphrase that was set for the pool, then click REPLACE DISK. (Will it allow me to replace the disk with itself?)
4. Immediately (don't wait for the resilver to complete) Click (Encryption Options) ➞ Add Recovery Key and enter the root password to save the new recovery key

This is where I get confused. Do I have to then do the "The pool can be rekeyed by clicking (Encryption Options) ➞ Encryption Rekey." or is that optional?

What about the currently active hot spare during all this? Will FreeNAS revert it automatically back to hot spare and I just ignore it during the whole procedure?
 

AlexGG

Contributor
Joined
Dec 13, 2018
Messages
171
The Reallocated Sector Count isn't ideal, so the drive might actually be failing, but I suspect the dropout might be due to the above-mentioned compatibility issue.

No, that's not likely due to compatibility. The SMART data shows a drive that is slowly failing. I can't think of any compatibility issue that would cause the drive's internal firmware to declare a sector on the surface defective. I would retire the drive as soon as practical, but your mileage may vary, as there are only six reallocated sectors so far. Another option is to keep an eye on the drive, see whether it develops any further problems, and replace it then. That seems reasonably safe given a mirror and a hot spare.
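
For anyone who wants to keep an eye on a drive like this, a rough Python sketch along these lines could pull the sector-related attributes out of saved `smartctl -x` output. The choice of attribute IDs (5, 196, 197, 198) and the name `sector_health` are assumptions for illustration; the row layout is the one shown in the SMART output earlier in the thread.

```python
import re

# Columns: ID, name, flags, value, worst, thresh, fail, raw (first token of raw only).
ROW = re.compile(r"^\s*(\d+)\s+(\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\S+)\s+(\d+)")

# Attributes related to sector health; which IDs to watch is this sketch's assumption.
WATCH = {5, 196, 197, 198}

def sector_health(text):
    """Return {attribute_name: raw_value} for watched attributes with a non-zero raw value."""
    hits = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m and int(m.group(1)) in WATCH and int(m.group(8)) > 0:
            hits[m.group(2)] = int(m.group(8))
    return hits

sample = """\
  5 Reallocated_Sector_Ct   PO--CK   100   100   005    -    6
  9 Power_On_Hours          -O--C-   097   097   000    -    26629
196 Reallocated_Event_Count -O--CK   100   100   000    -    15
197 Current_Pending_Sector  -O---K   100   100   000    -    0
198 Offline_Uncorrectable   ---R--   100   100   000    -    0
199 UDMA_CRC_Error_Count    -O-R--   200   200   000    -    0
"""
flagged = sector_health(sample)
print(flagged)
```

Comparing the returned values between runs a few days apart would show whether the counts are still climbing.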
 

Touche

No, no, the SMART sector count is definitely not related. What I am wondering is whether an actual drive issue is the cause of the dropout (which doesn't seem likely to me, given the relatively small number of SMART errors and the FreeNAS messages), or whether the compatibility issue caused the dropout and the SMART parameters are a separate issue.

I'll definitely keep monitoring the drive, but I need help with the exact steps to reintroduce it to the encrypted pool.
 

Touche

Is this the correct procedure:
1. Change ADA3 from REMOVED to OFFLINE
2. Shut down and replace ADA3 with a new drive OR simply restart if ADA3 is presumed OK
3. Boot, Go to Storage ➞ Pools. Locate the encrypted pool and click (Settings) ➞ Status. Find the disk to replace and click (Options) ➞ Replace. Enter the passphrase that was set for the pool, then click REPLACE DISK. (Will it allow me to replace the disk with itself?)
4. Immediately (don't wait for the resilver to complete) Click (Encryption Options) ➞ Add Recovery Key and enter the root password to save the new recovery key

This is where I get confused. Do I have to then do the "The pool can be rekeyed by clicking (Encryption Options) ➞ Encryption Rekey." or is that optional?

What about the currently active hot spare during all this? Will FreeNAS revert it automatically back to hot spare and I just ignore it during the whole procedure?

The dropped drive seems to be increasing in SMART fail values so I won't be using it and will replace it with either a new drive or the current hot spare.
Could someone confirm if the above is the correct way to do a replacement for an encrypted pool?

If I DETACH the failing drive from the pool and thus promote the hot spare as a permanent replacement, are any of the above encryption procedures necessary?
 

gary_1

Explorer
Joined
Sep 26, 2017
Messages
78
Are you using legacy encryption? If you're using ZFS native encryption, ignore the remainder of this post; it's not relevant.

For legacy encryption, the issue you face when you resilver is that FreeNAS only knows about one of your two keys, either the user key or the recovery key, depending on which one you used to unlock the pool.

Let's say you unlocked a 4 drive pool (drives A-D) with the user key. You remove drive D, replace it with a new drive and allow resilvering to complete. At this point, all four drives can be unlocked by the same user key, however _only_ drives A-C can be unlocked with your previous recovery key. Drive D has no knowledge of this key.

This is a bad position to be in long term: should you lose your user key, your recovery key wouldn't be sufficient to unlock all drives in the pool.

However, as long as you don't lose your user key, you do not have to rekey immediately. You should be able to reboot and still unlock the pool via the user key. That said, it's prudent to rekey as soon as you can and download new user and recovery keys, thus ensuring all drives in the pool can be unlocked by the same user and recovery key. Double check with the manual for the correct process.
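To make the A-D scenario above concrete, here is a toy Python model. It is not FreeNAS or GELI code: the `Drive` class and both helper functions are invented for illustration, and it simply assumes, as described above, that a replacement drive is set up with only the key that was used to unlock the pool.

```python
from dataclasses import dataclass, field


@dataclass
class Drive:
    """Hypothetical stand-in for one encrypted disk and the keys that open it."""
    name: str
    keys: set = field(default_factory=set)


def replace_drive(pool, old_name, new_name, unlock_key):
    """Return a new pool where old_name is swapped for a fresh drive.

    Assumption taken from the post above: the new drive only learns the key
    that was used to unlock the pool, not the other one.
    """
    return [
        Drive(new_name, {unlock_key}) if d.name == old_name else d
        for d in pool
    ]


def unlockable(pool, key):
    """List the drives that the given key can unlock."""
    return [d.name for d in pool if key in d.keys]


# Four drives, all set up with both the user key and the recovery key.
pool = [Drive(n, {"user", "recovery"}) for n in "ABCD"]

# Pool was unlocked with the user key, then drive D was replaced by D2.
after = replace_drive(pool, "D", "D2", unlock_key="user")

print(unlockable(after, "user"))      # ['A', 'B', 'C', 'D2']
print(unlockable(after, "recovery"))  # ['A', 'B', 'C']
```

In this model the old recovery key no longer covers the whole pool after the replacement, which is why new keys should be generated and downloaded once the new drive is in.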
 

Touche

Explorer
Joined
Nov 26, 2016
Messages
55
@gary_1 I'm not sure what the distinction is between legacy and ZFS native encryption. I encrypted the drives via the FreeNAS GUI when I first created the pool.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
"Legacy" encryption uses GELI to encrypt whole disks and ZFS doesn't have anything to do with it.
Native encryption is per dataset, with some advantages (e.g. granularity, portability) and drawbacks (e.g. not all metadata is encrypted).
 

gary_1

Explorer
Joined
Sep 26, 2017
Messages
78
Which version of FreeNAS were you using when you created the pools?

Legacy GELI was the only type of pool encryption that existed on older versions of FreeNAS. Native ZFS encryption is the newer type, added in TrueNAS 12.

It's important to know which one your pool is using, as the steps you need to take will differ. For legacy you have to be careful after a resilver and take some extra steps. With native, you just let it resilver and get on with your life.
 

Touche

Explorer
Joined
Nov 26, 2016
Messages
55
It was done on 11.1 and is the whole disk encryption, which would make it Legacy.

So, if I understood everything correctly, the correct procedure would be:

1. Go to Storage ➞ Pools. (Settings) ➞ Status. Change the failed disk from REMOVED to OFFLINE
2. Shut down and physically replace the failed drive
3. Boot, go to Storage ➞ Pools. (Settings) ➞ Status. Find the disk to replace and click (Options) ➞ Replace. Enter the passphrase that was set for the pool, then click REPLACE DISK. Choose the new drive.
4. Immediately (don't wait for the resilver to complete) click (Encryption Options) ➞ Add Recovery Key. This replaces the Recovery Key for all disks.

At this point, all the drives should have the same user key w/ password set, and the same recovery key?

Is it necessary to go through the complete Encryption Rekey at this point?

P.S. If, instead of using a new replacement disk, I DETACH the failing drive from the pool and thus promote the hot spare as the permanent replacement, are any of the above encryption procedures necessary? I'm guessing the hot spare should have the same active user and recovery keys as the rest of the pool already?
 

gary_1

Explorer
Joined
Sep 26, 2017
Messages
78
Yes, when you "replace" the disk, the key you originally used to unlock the pool will also unlock the replaced disk. However, the other key (usually the recovery key) will no longer unlock the whole pool.

For step 4 you should reset the encryption as described in https://www.ixsystems.com/documentation/freenas/11.3-U1/storage.html#replacing-a-failed-disk

That section points to https://www.ixsystems.com/documentation/freenas/11.3-U1/storage.html#reset-encryption, which will result in new recovery and user keys.

"Immediately" isn't really required. You could do it straight away or you could do it a year from now. The risk if you leave it is that, of the user key and recovery key, only one (whichever key you unlocked your pool with prior to the resilver, generally the user key) will unlock your pool. So if you lose that key or forget the passphrase before you've completed step 4, the recovery key will not save the day. Even then all may not be lost, as you should still be able to unlock the older drives with the recovery key, although perhaps not via the UI.

As for whether you can just "Add recovery key" for step 4: I believe so. The user key (assuming you unlocked your pool with that originally) will continue to work with all drives. However, don't dispose of any keys/passwords until you know you can lock/unlock the pool with the new user/recovery keys, just to be safe.

The manual recommends a "Reset Key" to generate a new user key and a new recovery key as a security precaution, since you'll be disposing of an HDD that contains an encrypted copy of the master key. If that's a concern, do a "Reset key" rather than just adding a recovery key. I'm not sure I share the concern there though.

I'm not sure what difference detach makes. However, as far as encryption goes, the above steps are likely always needed, as the new drive will only know about one of the two encryption keys.
 

Touche

Explorer
Joined
Nov 26, 2016
Messages
55
Thank you all! I'll post an update as soon as my replacement drive arrives.
 

Touche

Explorer
Joined
Nov 26, 2016
Messages
55
I followed the procedure from my previous post (steps 1-4) and it went smoothly.

The one issue I've had was not being able to promote the hot spare as the permanent replacement. There are no detach options in the GUI. There is a manual procedure explained here, but I wasn't sure if it was safe to do it behind the GUI's back because of the encryption. Does it make any difference? Also, the GUI status shows disk names while CLI zpool status shows gptid. Are they interchangeable?
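On the disk name vs. gptid question, one way to see which disk name a gptid label belongs to is the FreeBSD `glabel status` command, which prints a Name / Status / Components table. The Python sketch below parses text in that three-column layout. The sample text and its gptid values are invented for illustration (not taken from this system), and the helper names are hypothetical. It also assumes that an encrypted member shows up in `zpool status` as the gptid label with a `.eli` suffix.

```python
def parse_glabel_status(text):
    """Map gptid labels to their underlying partitions.

    Expects text in the three-column Name / Status / Components layout.
    """
    mapping = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip the header row and anything that is not a 3-column entry.
        if len(parts) != 3 or parts[0] == "Name":
            continue
        label, _status, component = parts
        if label.startswith("gptid/"):
            mapping[label] = component
    return mapping


def device_for(vdev_name, mapping):
    """Look up the partition behind a pool member name, ignoring a .eli suffix."""
    label = vdev_name[:-4] if vdev_name.endswith(".eli") else vdev_name
    return mapping.get(label)


# Invented sample in the same layout as `glabel status` output.
SAMPLE = """\
                                      Name  Status  Components
gptid/11111111-aaaa-11e6-bbbb-0cc47a000001     N/A  ada0p2
gptid/22222222-aaaa-11e6-bbbb-0cc47a000002     N/A  ada3p2
"""

labels = parse_glabel_status(SAMPLE)
print(device_for("gptid/22222222-aaaa-11e6-bbbb-0cc47a000002.eli", labels))
```

This only shows how the two naming schemes can be matched up. It does not settle whether the manual detach is safe on an encrypted pool.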
 