systemofapwne
Dabbler
- Joined
- Oct 6, 2019
- Messages
- 16
Hello,
on my TrueNAS box, I use one single consumer SMR* drive as an on-site backup/archive (Seagate ST8000DM004). The disk is brand-new and shows no signs of degradation so far.
Every night, the data is replicated from the main pool to the backup pool, living on that disk. The majority of the time, the disk is in standby (spins down after 5 min of idle). This works flawless so far.
The problem now is, that the disk randomly detaches from the HBA due to SATA timeouts.
Even though this log started almost exactly at 4:00 in the night before the disk has been detached, there is no task** or whatsoever at that time (to my knowledge), that could have triggered it. The previous time this happened, it was at 2:07 (also in the night).
I was able to re-attach the disk and getting the pool running again by un- & re-plugging the disk and running zpool clear Backup.
The pool now shows errors, which vanish after running a scrub twice with always the same result: scrub repaired 0B in 05:39:57 with 0 errors
Here is the current state after todays incident:
What could be the reason for that behavior? How to fix it? And how to clear the "Permanent errors" (which aren't real) without scrubbing the disk twice?
Here is a SMART log right after performing a short self-test.
*: I know, that SMR is bad for ZFS, especially when resilvering. That drive is not in an active pool and is just for archiving and works absolutely fine.
**: I disabled SMART on that disk vie the TrueNAS UI, since I noticed that on a different spun-down disk, the drive would wake whenever it gets queried for its temperature. A SMART self-check is anway supposed to run once a week.
on my TrueNAS box, I use one single consumer SMR* drive as an on-site backup/archive (Seagate ST8000DM004). The disk is brand-new and shows no signs of degradation so far.
Every night, the data is replicated from the main pool to the backup pool, living on that disk. The majority of the time, the disk is in standby (spins down after 5 min of idle). This works flawless so far.
The problem now is, that the disk randomly detaches from the HBA due to SATA timeouts.
Code:
Dec 31 04:00:16 truenas ahcich3: Timeout on slot 19 port 0 Dec 31 04:00:16 truenas ahcich3: is 00000000 cs 00000000 ss 00080000 rs 00080000 tfd 40 serr 00000000 cmd 0060d317 Dec 31 04:00:16 truenas (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 d8 85 31 40 7a 00 00 00 00 00 Dec 31 04:00:16 truenas (ada3:ahcich3:0:0:0): CAM status: Command timeout Dec 31 04:00:16 truenas (ada3:ahcich3:0:0:0): Retrying command, 3 more tries remain Dec 31 04:10:16 truenas ahcich3: Timeout on slot 16 port 0 Dec 31 04:10:16 truenas ahcich3: is 00000000 cs 00000000 ss 00010000 rs 00010000 tfd 40 serr 00000000 cmd 0060d017 Dec 31 04:10:16 truenas (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 f8 b9 50 40 52 00 00 00 00 00 Dec 31 04:10:16 truenas (ada3:ahcich3:0:0:0): CAM status: Command timeout Dec 31 04:10:16 truenas (ada3:ahcich3:0:0:0): Retrying command, 3 more tries remain Dec 31 04:20:17 truenas ahcich3: AHCI reset: device not ready after 31000ms (tfd = 00000080) Dec 31 04:20:17 truenas ahcich3: Timeout on slot 25 port 0 Dec 31 04:20:17 truenas ahcich3: is 00000000 cs 00000000 ss 02000000 rs 02000000 tfd 40 serr 00000000 cmd 0060d917 Dec 31 04:20:17 truenas (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 d8 85 31 40 7a 00 00 00 00 00 Dec 31 04:20:17 truenas (ada3:ahcich3:0:0:0): CAM status: Command timeout Dec 31 04:20:17 truenas (ada3:ahcich3:0:0:0): Retrying command, 3 more tries remain Dec 31 04:21:18 truenas ahcich3: AHCI reset: device not ready after 31000ms (tfd = 00000080) Dec 31 04:21:18 truenas ahcich3: Timeout on slot 26 port 0 Dec 31 04:21:18 truenas ahcich3: is 00000000 cs 04000000 ss 00000000 rs 04000000 tfd 00 serr 00000000 cmd 0060da17 Dec 31 04:21:18 truenas (aprobe0:ahcich3:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 Dec 31 04:21:18 truenas (aprobe0:ahcich3:0:0:0): CAM status: Command timeout Dec 31 04:21:18 truenas (aprobe0:ahcich3:0:0:0): Retrying command, 0 more tries remain Dec 31 04:30:17 truenas ahcich3: Timeout on slot 3 port 0 Dec 31 04:30:17 truenas ahcich3: is 00000000 cs 00000000 ss 00000008 rs 00000008 tfd 40 serr 00000000 cmd 0060c317 Dec 31 04:30:17 truenas (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 d8 85 31 40 7a 00 00 00 00 00 Dec 31 04:30:17 truenas (ada3:ahcich3:0:0:0): CAM status: Command timeout Dec 31 04:30:17 truenas (ada3:ahcich3:0:0:0): Retrying command, 3 more tries remain Dec 31 04:31:18 truenas ahcich3: AHCI reset: device not ready after 31000ms (tfd = 00000080) Dec 31 04:31:18 truenas ahcich3: Timeout on slot 4 port 0 Dec 31 04:31:18 truenas ahcich3: is 00000000 cs 00000010 ss 00000000 rs 00000010 tfd 00 serr 00000000 cmd 0060c417 Dec 31 04:31:18 truenas (aprobe0:ahcich3:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 Dec 31 04:31:18 truenas (aprobe0:ahcich3:0:0:0): CAM status: Command timeout Dec 31 04:31:18 truenas (aprobe0:ahcich3:0:0:0): Retrying command, 0 more tries remain Dec 31 04:40:17 truenas ahcich3: AHCI reset: device not ready after 31000ms (tfd = 00000080) Dec 31 04:40:17 truenas ahcich3: Timeout on slot 13 port 0 Dec 31 04:40:17 truenas ahcich3: is 00000000 cs 00000000 ss 00002000 rs 00002000 tfd 40 serr 00000000 cmd 0060cd17 Dec 31 04:40:17 truenas (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 d8 85 31 40 7a 00 00 00 00 00 Dec 31 04:40:17 truenas (ada3:ahcich3:0:0:0): CAM status: Command timeout Dec 31 04:40:17 truenas (ada3:ahcich3:0:0:0): Retrying command, 3 more tries remain Dec 31 04:41:18 truenas ahcich3: AHCI reset: device not ready after 31000ms (tfd = 00000080) Dec 31 04:41:18 truenas ahcich3: Timeout on slot 14 port 0 Dec 31 04:41:18 truenas ahcich3: is 00000000 cs 00004000 ss 00000000 rs 00004000 tfd 00 serr 00000000 cmd 0060ce17 Dec 31 04:41:18 truenas (aprobe0:ahcich3:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 Dec 31 04:41:18 truenas (aprobe0:ahcich3:0:0:0): CAM status: Command timeout Dec 31 04:41:18 truenas (aprobe0:ahcich3:0:0:0): Retrying command, 0 more tries remain Dec 31 04:42:19 truenas ahcich3: AHCI reset: device not ready after 31000ms (tfd = 00000080) Dec 31 04:42:19 truenas ahcich3: Timeout on slot 15 port 0 Dec 31 04:42:19 truenas ahcich3: is 00000000 cs 00008000 ss 00000000 rs 00008000 tfd 00 serr 00000000 cmd 0060cf17 Dec 31 04:42:19 truenas (aprobe0:ahcich3:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 Dec 31 04:42:19 truenas (aprobe0:ahcich3:0:0:0): CAM status: Command timeout Dec 31 04:42:19 truenas (aprobe0:ahcich3:0:0:0): Error 5, Retries exhausted Dec 31 04:43:20 truenas ahcich3: AHCI reset: device not ready after 31000ms (tfd = 00000080) Dec 31 04:43:20 truenas ahcich3: Timeout on slot 16 port 0 Dec 31 04:43:20 truenas ahcich3: is 00000000 cs 00010000 ss 00000000 rs 00010000 tfd 00 serr 00000000 cmd 0060d017 Dec 31 04:43:20 truenas (aprobe0:ahcich3:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 Dec 31 04:43:20 truenas (aprobe0:ahcich3:0:0:0): CAM status: Command timeout Dec 31 04:43:20 truenas (aprobe0:ahcich3:0:0:0): Error 5, Retry was blocked Dec 31 04:43:20 truenas ada3 at ahcich3 bus 0 scbus3 target 0 lun 0 Dec 31 04:43:20 truenas ada3: <ST8000DM004-2U9188 0001> s/n WSC0LZHJ detached Dec 31 04:44:00 truenas ahcich3: AHCI reset: device not ready after 31000ms (tfd = 00000080) Dec 31 04:44:00 truenas (ada3:ahcich3:0:0:0): Periph destroyed Dec 31 04:44:00 truenas ada3 at ahcich3 bus 0 scbus3 target 0 lun 0 Dec 31 04:44:00 truenas ada3: <ST8000DM004-2U9188 0001> ACS-3 ATA SATA 3.x device Dec 31 04:44:00 truenas ada3: Serial Number WSC0LZHJ Dec 31 04:44:00 truenas ada3: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes) Dec 31 04:44:00 truenas ada3: Command Queueing enabled Dec 31 04:44:00 truenas ada3: 7630885MB (15628053168 512 byte sectors) Dec 31 04:44:00 truenas ada3: quirks=0x1<4K> Dec 31 04:44:00 truenas Solaris: WARNING: Pool 'Backup' has encountered an uncorrectable I/O failure and has been suspended.
Even though this log started almost exactly at 4:00 in the night before the disk has been detached, there is no task** or whatsoever at that time (to my knowledge), that could have triggered it. The previous time this happened, it was at 2:07 (also in the night).
I was able to re-attach the disk and getting the pool running again by un- & re-plugging the disk and running zpool clear Backup.
The pool now shows errors, which vanish after running a scrub twice with always the same result: scrub repaired 0B in 05:39:57 with 0 errors
Here is the current state after todays incident:
Code:
#zpool status -v Backup pool: Backup state: ONLINE status: One or more devices has experienced an error resulting in data corruption. Applications may be affected. action: Restore the file in question if possible. Otherwise restore the entire pool from backup. see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A scan: scrub repaired 0B in 05:39:57 with 0 errors on Wed Dec 28 02:35:22 2022 config: NAME STATE READ WRITE CKSUM Backup ONLINE 0 0 0 gptid/fe817121-7be9-11ed-a0fd-977792743385 ONLINE 0 0 0 errors: Permanent errors have been detected in the following files: <metadata>:<0x0> <metadata>:<0x1> <metadata>:<0x3e> <metadata>:<0x51> Backup/Server/vm/Debian:<0x0>
What could be the reason for that behavior? How to fix it? And how to clear the "Permanent errors" (which aren't real) without scrubbing the disk twice?
Here is a SMART log right after performing a short self-test.
Code:
#smartctl -s on -a /dev/ada3 smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p2 amd64] (local build) Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org === START OF INFORMATION SECTION === Model Family: Seagate BarraCuda 3.5 (SMR) Device Model: ST8000DM004-2U9188 Serial Number: WSC0LZHJ LU WWN Device Id: 5 000c50 0dfe51df6 Add. Product Id: DELL(tm) Firmware Version: 0001 User Capacity: 8,001,563,222,016 bytes [8.00 TB] Sector Sizes: 512 bytes logical, 4096 bytes physical Rotation Rate: 5400 rpm Form Factor: 3.5 inches Device is: In smartctl database [for details use: -P show] ATA Version is: ACS-3 T13/2161-D revision 5 SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s) Local Time is: Sat Dec 31 17:00:39 2022 CET SMART support is: Available - device has SMART capability. SMART support is: Enabled === START OF ENABLE/DISABLE COMMANDS SECTION === SMART Enabled. === START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED General SMART Values: Offline data collection status: (0x00) Offline data collection activity was never started. Auto Offline Data Collection: Disabled. Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run. Total time to complete Offline data collection: ( 0) seconds. Offline data collection capabilities: (0x73) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. No Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported. SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer. Error logging capability: (0x01) Error logging supported. General Purpose Logging supported. Short self-test routine recommended polling time: ( 1) minutes. Extended self-test routine recommended polling time: ( 974) minutes. Conveyance self-test routine recommended polling time: ( 2) minutes. SCT capabilities: (0x30a5) SCT Status supported. SCT Data Table supported. SMART Attributes Data Structure revision number: 10 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x000f 073 070 006 Pre-fail Always - 20939536 3 Spin_Up_Time 0x0003 081 081 000 Pre-fail Always - 0 4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 17 5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0 7 Seek_Error_Rate 0x000f 100 253 045 Pre-fail Always - 32657 9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 353h+08m+52.522s 10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0 12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 18 183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0 184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0 187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0 188 Command_Timeout 0x0032 100 100 000 Old_age Always - 2 2 2 189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0 190 Airflow_Temperature_Cel 0x0022 048 044 040 Old_age Always - 52 (Min/Max 45/52) 191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0 192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 8 193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 115 194 Temperature_Celsius 0x0022 052 056 000 Old_age Always - 52 (0 20 0 0 0) 195 Hardware_ECC_Recovered 0x001a 073 070 000 Old_age Always - 20939536 197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0 198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0 199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0 240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 26h+01m+33.183s 241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 5786563929 242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 16949118876 SMART Error Log Version: 1 No Errors Logged SMART Self-test log structure revision number 1 Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error # 1 Short offline Completed without error 00% 353 - SMART Selective self-test log data structure revision number 1 SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS 1 0 0 Not_testing 2 0 0 Not_testing 3 0 0 Not_testing 4 0 0 Not_testing 5 0 0 Not_testing Selective self-test flags (0x0): After scanning selected spans, do NOT read-scan remainder of disk. If Selective self-test is pending on power-up, resume after 0 minute delay.
*: I know, that SMR is bad for ZFS, especially when resilvering. That drive is not in an active pool and is just for archiving and works absolutely fine.
**: I disabled SMART on that disk vie the TrueNAS UI, since I noticed that on a different spun-down disk, the drive would wake whenever it gets queried for its temperature. A SMART self-check is anway supposed to run once a week.
Last edited by a moderator: