Hello everyone,
I recently set up a NAS with TrueNAS.
The following hardware is in use:
A VM with:
Hard drives: Hitachi HGST Ultrastar HUH721010ALE600, 10 TB, 3.5", SATA 6Gb/s, 7200 rpm, 256 MB cache
HBA controller passed through to the VM: HBA 1100-8i
Backplane: ICY Box IB-563SSK
32 GB ECC RAM
The pool uses RAIDZ2.
I have the following problem:
TrueNAS reports the following alert:
New alert:
* Pool datenpool1 state is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.
The following devices are not healthy:
- Disk 9518944293388547218 is FAULTED
The following alert has been cleared:
* Pool datenpool1 state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.
While I was writing this, there was a restart that changed the status, with the following message:
The following alert has been cleared:
* Pool datenpool1 state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.
The output of "zpool status" is currently:
Code:
  pool: datenpool1
 state: ONLINE
  scan: resilvered 53.7M in 00:00:12 with 0 errors on Tue Jan 19 18:41:35 2021
config:

    NAME                                            STATE     READ WRITE CKSUM
    datenpool1                                      ONLINE       0     0     0
      raidz2-0                                      ONLINE       0     0     0
        gptid/b7c9c85e-fcc2-11ea-bc6d-358825bfcd9e  ONLINE       0     0     0
        gptid/b7d4082a-fcc2-11ea-bc6d-358825bfcd9e  ONLINE       0     0     0
        gptid/b7cdd27c-fcc2-11ea-bc6d-358825bfcd9e  ONLINE       0     0     0
        gptid/b7da6581-fcc2-11ea-bc6d-358825bfcd9e  ONLINE       0     0     0
        gptid/b803ef6e-fcc2-11ea-bc6d-358825bfcd9e  ONLINE       0     0     0
        gptid/b810db71-fcc2-11ea-bc6d-358825bfcd9e  ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(5) for details.
  scan: scrub repaired 0B in 00:01:15 with 0 errors on Sat Jan 16 03:46:15 2021
config:

    NAME          STATE     READ WRITE CKSUM
    freenas-boot  ONLINE       0     0     0
      da0p2       ONLINE       0     0     0

errors: No known data errors
and "smartctl -a /dev/da6" gives:
Code:
smartctl 7.1 2019-12-30 r5022 [FreeBSD 12.2-RELEASE-p2 amd64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     HGST Ultrastar He10
Device Model:     HGST HUH721010ALE600
Serial Number:    1EJJAB1Z
LU WWN Device Id: 5 000cca 27ee39bd2
Firmware Version: LHGNT384
User Capacity:    10,000,831,348,736 bytes [10.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Jan 21 22:05:33 2021 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 241) Self-test routine in progress...
                                        10% of test remaining.
Total time to complete Offline
data collection:                 (  93) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (1228) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   133   133   054    Pre-fail  Offline      -       100
  3 Spin_Up_Time            0x0007   146   146   024    Pre-fail  Always       -       450 (Average 449)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       35
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   128   128   020    Pre-fail  Offline      -       18
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       5299
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       8
 22 Helium_Level            0x0023   100   100   025    Pre-fail  Always       -       100
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       163
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       163
194 Temperature_Celsius     0x0002   176   176   000    Old_age   Always       -       34 (Min/Max 23/45)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       112921

SMART Error Log Version: 1
ATA Error Count: 65535 (device log contains only the most recent five errors)
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 65535 occurred at disk power-on lifetime: 4056 hours (169 days + 0 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 43 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 70 80 8d a7 40 00   1d+22:10:28.632  READ FPDMA QUEUED
  60 f0 c0 50 97 a7 40 00   1d+22:10:28.626  READ FPDMA QUEUED
  60 00 b8 50 96 a7 40 00   1d+22:10:28.626  READ FPDMA QUEUED
  60 00 b0 50 95 a7 40 00   1d+22:10:28.626  READ FPDMA QUEUED
  60 00 a8 50 94 a7 40 00   1d+22:10:28.626  READ FPDMA QUEUED

Error 65534 occurred at disk power-on lifetime: 4056 hours (169 days + 0 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 43 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 c8 50 c4 a6 40 00   1d+22:10:28.452  READ FPDMA QUEUED
  60 a8 68 20 d0 a6 40 00   1d+22:10:28.447  READ FPDMA QUEUED
  60 00 60 20 cf a6 40 00   1d+22:10:28.447  READ FPDMA QUEUED
  60 00 58 20 ce a6 40 00   1d+22:10:28.447  READ FPDMA QUEUED
  60 00 50 20 cd a6 40 00   1d+22:10:28.447  READ FPDMA QUEUED

Error 65533 occurred at disk power-on lifetime: 4056 hours (169 days + 0 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 43 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 28 e0 9b a5 40 00   1d+22:10:28.112  READ FPDMA QUEUED
  60 c8 88 c8 a7 a5 40 00   1d+22:10:28.087  READ FPDMA QUEUED
  60 00 80 c8 a6 a5 40 00   1d+22:10:28.087  READ FPDMA QUEUED
  60 00 78 c8 a5 a5 40 00   1d+22:10:28.087  READ FPDMA QUEUED
  60 00 70 c8 a4 a5 40 00   1d+22:10:28.087  READ FPDMA QUEUED

Error 65532 occurred at disk power-on lifetime: 4056 hours (169 days + 0 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 43 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 a0 e0 97 a5 40 00   1d+22:10:28.085  READ FPDMA QUEUED
  60 e8 e0 e0 9f a5 40 00   1d+22:10:28.079  READ FPDMA QUEUED
  60 00 d8 e0 9e a5 40 00   1d+22:10:28.079  READ FPDMA QUEUED
  60 00 d0 e0 9d a5 40 00   1d+22:10:28.079  READ FPDMA QUEUED
  60 00 c8 e0 9c a5 40 00   1d+22:10:28.079  READ FPDMA QUEUED

Error 65531 occurred at disk power-on lifetime: 4056 hours (169 days + 0 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 43 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 d8 10 7a a5 40 00   1d+22:10:28.047  READ FPDMA QUEUED
  60 f0 60 10 88 a5 40 00   1d+22:10:28.042  READ FPDMA QUEUED
  60 00 40 10 87 a5 40 00   1d+22:10:28.042  READ FPDMA QUEUED
  60 00 38 10 86 a5 40 00   1d+22:10:28.042  READ FPDMA QUEUED
  60 00 30 10 85 a5 40 00   1d+22:10:28.042  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                     Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Aborted by host               10%      5299         -
# 2  Short offline       Aborted by host               10%      5243         -
# 3  Short offline       Interrupted (host reset)      10%      5149         -
# 4  Short offline       Interrupted (host reset)      10%      4507         -
# 5  Short offline       Interrupted (host reset)      10%      4452         -
# 6  Short offline       Interrupted (host reset)      10%      4392         -
# 7  Short offline       Interrupted (host reset)      10%      4286         -
# 8  Short offline       Interrupted (host reset)      10%      4262         -
# 9  Short offline       Interrupted (host reset)      10%      4038         -
#10  Short offline       Interrupted (host reset)      10%      3996         -
#11  Short offline       Interrupted (host reset)      10%      3897         -
#12  Short offline       Interrupted (host reset)      10%      3844         -
#13  Short offline       Interrupted (host reset)      10%      3739         -
#14  Short offline       Interrupted (host reset)      10%      3709         -
#15  Short offline       Interrupted (host reset)      10%      3627         -
#16  Short offline       Interrupted (host reset)      10%      3496         -
#17  Short offline       Interrupted (host reset)      10%      3473         -
#18  Short offline       Interrupted (host reset)      10%      3388         -
#19  Short offline       Interrupted (host reset)      10%      3340         -
#20  Short offline       Interrupted (host reset)      10%      3309         -
#21  Short offline       Interrupted (host reset)      10%      3272         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
What is strangest, though, is the console output:
I have already swapped the cable once, without success.
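One way to check whether a cable swap really made no difference is to compare the raw value of attribute 199 (UDMA_CRC_Error_Count) from before and after the swap, since that counter only ever increases. Below is a minimal Python sketch, assuming the attribute table from "smartctl -A /dev/da6" has been saved as text at two points in time. The names parse_smart_attributes and crc_delta are made up for illustration and are not part of smartmontools.

```python
import re

# One row of the "Vendor Specific SMART Attributes with Thresholds" table:
# ID, name, flag, value, worst, thresh, type, updated, when_failed, raw value.
# Only the leading integer of the raw value is captured, so trailing notes
# such as "(Average 449)" are ignored.
ATTR_ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}"
    r"\s+\d+\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def parse_smart_attributes(text):
    """Return {attribute_id: (name, raw_value)} for each attribute row in text."""
    attrs = {}
    for line in text.splitlines():
        match = ATTR_ROW.match(line)
        if match:
            attrs[int(match.group(1))] = (match.group(2), int(match.group(3)))
    return attrs


def crc_delta(before_text, after_text, attr_id=199):
    """Return how much the raw value of attr_id grew between two outputs.

    attr_id defaults to 199 (UDMA_CRC_Error_Count). A KeyError is raised
    if the attribute is missing from either output.
    """
    before = parse_smart_attributes(before_text)[attr_id][1]
    after = parse_smart_attributes(after_text)[attr_id][1]
    return after - before
```

If the delta stays at 0 under load after a component has been swapped, that component was probably involved. If the counter keeps climbing, the parts that have not been swapped yet remain the more likely suspects.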
Is it the hard drive, the cable, the backplane, the controller, or ...?
Does anyone know their way around SMART values, or can anyone make sense of these error messages?