Critical disk error alerts - FreeNAS 2016 - RAID0 (Striping)

clon¥

Cadet
Joined
Oct 22, 2019
Messages
3
Good morning,
I am getting critical error alerts on one of the 3 disks installed in a FreeNAS server.
The disks are configured as RAID0 (striping).

What is the best and safest way to handle this situation, where one of the disks is failing?

=========================
  • CRITICAL: October 29, 2019 at 13:43 - Device: /dev/ada1, 10784 Offline uncorrectable sectors
    CRITICAL: October 29, 2019 at 13:43 - Device: /dev/ada1, 10784 Currently unreadable (pending) sectors
  • CRITICAL: October 29, 2019 at 15:13 - Device: /dev/ada1, ATA error count increased from 18199 to 18298
  • OK: October 29, 2019 at 13:43 - There is a new update available! Apply it in System -> Update tab.

  • CRITICAL: October 29, 2019 at 13:43 - The volume jobs (ZFS) state is DEGRADED: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
=========================

Disk ada1
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.14 (AF)
Device Model: ST3000DM001-1ER166
Serial Number: Z502SB9B
LU WWN Device Id: 5 000c50 090c7f14e
Firmware Version: CC26
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Thu Oct 31 11:41:02 2019 BRST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is: Unavailable
APM feature is: Disabled
Rd look-ahead is: Enabled
Write cache is: Enabled
ATA Security is: Disabled, frozen [SEC2]
Wt Cache Reorder: Unavailable

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 80) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 326) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x1085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   076   076   006    -    106143015
  3 Spin_Up_Time            PO----   096   093   000    -    0
  4 Start_Stop_Count        -O--CK   100   100   020    -    57
  5 Reallocated_Sector_Ct   PO--CK   098   098   010    -    2368
  7 Seek_Error_Rate         POSR--   084   060   030    -    323216881
  9 Power_On_Hours          -O--CK   066   066   000    -    29790
 10 Spin_Retry_Count        PO--C-   100   100   097    -    0
 12 Power_Cycle_Count       -O--CK   100   100   020    -    52
183 Runtime_Bad_Block       -O--CK   100   100   000    -    0
184 End-to-End_Error        -O--CK   100   100   099    -    0
187 Reported_Uncorrect      -O--CK   001   001   000    -    22257
188 Command_Timeout         -O--CK   100   100   000    -    0 0 0
189 High_Fly_Writes         -O-RCK   039   039   000    -    61
190 Airflow_Temperature_Cel -O---K   060   044   045    Past 40 (Min/Max 19/46 #120)
191 G-Sense_Error_Rate      -O--CK   100   100   000    -    0
192 Power-Off_Retract_Count -O--CK   100   100   000    -    35
193 Load_Cycle_Count        -O--CK   100   100   000    -    251
194 Temperature_Celsius     -O---K   040   056   000    -    40 (0 14 0 0 0)
197 Current_Pending_Sector  -O--C-   035   016   000    -    10808
198 Offline_Uncorrectable   ----C-   035   016   000    -    10808
199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    1
240 Head_Flying_Hours       ------   100   253   000    -    29789h+08m+08.833s
241 Total_LBAs_Written      ------   100   253   000    -    68663041296
242 Total_LBAs_Read         ------   100   253   000    -    818014471826
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning
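
For readers who want to pick the worrying rows out of a table like the one above, here is a minimal Python sketch. The choice of attribute IDs to watch (5, 187, 197, 198) is an assumption made for illustration, not a rule defined by smartctl, and the names `WATCHED_IDS`, `SAMPLE`, and `nonzero_watched` are hypothetical. The sample rows are copied from the output above.

```python
import re

# Attribute IDs treated as failure indicators in this sketch (an assumption,
# not something smartctl itself defines).
WATCHED_IDS = {5, 187, 197, 198}

# A few rows copied from the smartctl attribute table in this post.
SAMPLE = """\
5 Reallocated_Sector_Ct PO--CK 098 098 010 - 2368
9 Power_On_Hours -O--CK 066 066 000 - 29790
187 Reported_Uncorrect -O--CK 001 001 000 - 22257
197 Current_Pending_Sector -O--C- 035 016 000 - 10808
198 Offline_Uncorrectable ----C- 035 016 000 - 10808
"""

# ID, name, flags, value, worst, thresh, fail, then the leading integer of RAW_VALUE.
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+(?P<flags>\S+)\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"(?P<fail>\S+)\s+(?P<raw>\d+)"
)


def nonzero_watched(text):
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    found = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match and int(match["id"]) in WATCHED_IDS and int(match["raw"]) > 0:
            found[match["name"]] = int(match["raw"])
    return found


suspect = nonzero_watched(SAMPLE)
for name, raw in suspect.items():
    print(f"{name}: {raw}")
```

On a healthy disk these four raw values are normally expected to be zero, so every row this prints is worth a closer look.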

ATA_READ_LOG_EXT (addr=0x00:0x00, page=0, n=1) failed: 48-bit ATA commands not implemented for legacy controllers
Read GP Log Directory failed

SMART Log Directory Version 1 [multi-sector log support]
Address Access R/W Size Description
0x00 SL R/O 1 Log Directory
0x01 SL R/O 1 Summary SMART error log
0x02 SL R/O 5 Comprehensive SMART error log
0x06 SL R/O 1 SMART self-test log
0x09 SL R/W 1 Selective self-test log
0x30 SL R/O 9 IDENTIFY DEVICE data log
0x80-0x9f SL R/W 16 Host vendor specific log
0xa1 SL VS 20 Device vendor specific log
0xa8 SL VS 129 Device vendor specific log
0xa9 SL VS 1 Device vendor specific log
0xc0 SL VS 1 Device vendor specific log
0xc1 SL VS 10 Device vendor specific log
0xc3 SL VS 8 Device vendor specific log
0xe0 SL R/W 1 SCT Command/Status
0xe1 SL R/W 1 SCT Data Transfer

SMART Extended Comprehensive Error Log (GP Log 0x03) not supported

SMART Error Log Version: 1
ATA Error Count: 22235 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 22235 occurred at disk power-on lifetime: 29767 hours (1240 days + 7 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 40 ff ff ff 4f 00 22d+22:12:56.904 READ DMA EXT
25 00 40 ff ff ff 4f 00 22d+22:12:53.287 READ DMA EXT
25 00 40 ff ff ff 4f 00 22d+22:12:49.703 READ DMA EXT
25 00 40 ff ff ff 4f 00 22d+22:12:46.152 READ DMA EXT
25 00 40 ff ff ff 4f 00 22d+22:12:42.558 READ DMA EXT

Error 22234 occurred at disk power-on lifetime: 29767 hours (1240 days + 7 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 40 ff ff ff 4f 00 22d+22:12:53.287 READ DMA EXT
25 00 40 ff ff ff 4f 00 22d+22:12:49.703 READ DMA EXT
25 00 40 ff ff ff 4f 00 22d+22:12:46.152 READ DMA EXT
25 00 40 ff ff ff 4f 00 22d+22:12:42.558 READ DMA EXT
35 00 38 ff ff ff 4f 00 22d+22:12:42.557 WRITE DMA EXT

Error 22233 occurred at disk power-on lifetime: 29767 hours (1240 days + 7 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 40 ff ff ff 4f 00 22d+22:12:49.703 READ DMA EXT
25 00 40 ff ff ff 4f 00 22d+22:12:46.152 READ DMA EXT
25 00 40 ff ff ff 4f 00 22d+22:12:42.558 READ DMA EXT
35 00 38 ff ff ff 4f 00 22d+22:12:42.557 WRITE DMA EXT
ca 00 10 90 02 40 e0 00 22d+22:12:42.556 WRITE DMA

Error 22232 occurred at disk power-on lifetime: 29767 hours (1240 days + 7 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 40 ff ff ff 4f 00 22d+22:12:46.152 READ DMA EXT
25 00 40 ff ff ff 4f 00 22d+22:12:42.558 READ DMA EXT
35 00 38 ff ff ff 4f 00 22d+22:12:42.557 WRITE DMA EXT
ca 00 10 90 02 40 e0 00 22d+22:12:42.556 WRITE DMA
35 00 10 ff ff ff 4f 00 22d+22:12:42.556 WRITE DMA EXT

Error 22231 occurred at disk power-on lifetime: 29767 hours (1240 days + 7 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 40 ff ff ff 4f 00 22d+22:12:42.558 READ DMA EXT
35 00 38 ff ff ff 4f 00 22d+22:12:42.557 WRITE DMA EXT
ca 00 10 90 02 40 e0 00 22d+22:12:42.556 WRITE DMA
35 00 10 ff ff ff 4f 00 22d+22:12:42.556 WRITE DMA EXT
35 00 10 ff ff ff 4f 00 22d+22:12:42.555 WRITE DMA EXT

SMART Extended Self-test Log (GP Log 0x07) not supported

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version: 3
SCT Version (vendor specific): 522 (0x020a)
SCT Support Level: 1
Device State: Active (0)
Current Temperature: 40 Celsius
Power Cycle Min/Max Temperature: 19/46 Celsius
Lifetime Min/Max Temperature: 14/56 Celsius
Under/Over Temperature Limit Count: 0/0

SCT Data Table command not supported

SCT Error Recovery Control command not supported

Device Statistics (GP/SMART Log 0x04) not supported

ATA_READ_LOG_EXT (addr=0x11:0x00, page=0, n=1) failed: 48-bit ATA commands not implemented for legacy controllers
Read SATA Phy Event Counters failed
 
Joined
Aug 23, 2016
Messages
35
This RAID configuration provides no tolerance for disk failure.
Looking at the SMART table:
5 Reallocated_Sector_Ct PO--CK 098 098 010 - 2368 means that 2368 sectors have been reallocated: the drive's controller found them to be bad and remapped them to spare sectors.
197 Current_Pending_Sector -O--C- 035 016 000 - 10808 means that 10808 sectors are suspected of being bad.
My recommendation:
As soon as possible, install another disk of equal or larger size, replace the failing disk with the new one, and then check whether the pool leaves the DEGRADED state.
 

clon¥

Cadet
Joined
Oct 22, 2019
Messages
3
I am aware that this disk configuration offers no safety. This is a client's server that I took over; it was set up by another technician.

Question:
You suggested replacing the disk with an identical one, but won't I lose data when I replace it, given that the disks are "added together" (3 + 3 + 3 = 9 TB) and a little over 5 TB is already in use? Will the pool (striping, RAID 0) still have access to the data?
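
The arithmetic behind this question can be sketched in a few lines of Python. This is a toy model under stated assumptions: blocks are placed round-robin across the disks, which is a simplification (ZFS distributes writes dynamically), and the function names are hypothetical. It only illustrates why a stripe adds capacity but no redundancy.

```python
def place_blocks(num_blocks, num_disks):
    """Toy placement: block i lives on disk i % num_disks (a simplification)."""
    return [i % num_disks for i in range(num_blocks)]


def readable_fraction(placement, lost_disk):
    """Fraction of blocks still readable if one disk disappears."""
    kept = [disk for disk in placement if disk != lost_disk]
    return len(kept) / len(placement)


disk_sizes_tb = [3, 3, 3]
capacity_tb = sum(disk_sizes_tb)          # a stripe simply adds the disks up

placement = place_blocks(num_blocks=12, num_disks=len(disk_sizes_tb))
fraction = readable_fraction(placement, lost_disk=1)

print(f"capacity: {capacity_tb} TB")
print(f"readable after losing one disk: {fraction:.0%}")
```

In this model about a third of the blocks vanish with one disk, and since files are spread over all members, pulling a disk out of a stripe without first copying its contents elsewhere generally leaves the pool unusable.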

Thanks
 
Joined
Aug 23, 2016
Messages
35
Question:
You suggested replacing the disk with an identical one, but won't I lose data when I replace it, given that the disks are "added together" (3 + 3 + 3 = 9 TB) and a little over 5 TB is already in use? Will the pool (striping, RAID 0) still have access to the data?

As I understand it, one disk in your setup is failing.
To fix this, ask ZFS to replace the failing disk with a new one. ZFS will copy the data over, and only after the copy finishes can the failing disk be removed from the server.
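
The replacement described above could be sketched as the command sequence below. This is a rough outline under assumptions, not a verified procedure: the pool name `jobs` comes from the alert earlier in the thread, `ada1` is the failing disk, and `ada3` is a hypothetical name for the new disk, which must be installed while the old one is still connected. FreeNAS normally expects disk replacement to be done through its web interface, and the pool may list its members by gptid labels rather than `adaX` names, so check `zpool status` first. The script only prints the commands; it does not run them.

```python
def replacement_plan(pool, old_dev, new_dev):
    """Build the zpool commands for an in-place disk replacement (not executed)."""
    return [
        # 1. Inspect the pool and confirm how the failing member is named.
        ["zpool", "status", "-v", pool],
        # 2. Ask ZFS to copy the old disk's data onto the new disk (resilver).
        ["zpool", "replace", pool, old_dev, new_dev],
        # 3. Re-check until the resilver completes; -v lists files with errors.
        ["zpool", "status", "-v", pool],
    ]


# "jobs" and "ada1" come from the thread; "ada3" is a hypothetical new disk.
plan = replacement_plan("jobs", "ada1", "ada3")
for argv in plan:
    print(" ".join(argv))
```

Because the failing disk already has unreadable sectors, some blocks may not copy; `zpool status -v` lists any files with permanent errors once the resilver finishes. Copying the most important data to separate storage before starting is the safer order.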
 

clon¥

Cadet
Joined
Oct 22, 2019
Messages
3
As I understand it, one disk in your setup is failing.
To fix this, ask ZFS to replace the failing disk with a new one. ZFS will copy the data over, and only after the copy finishes can the failing disk be removed from the server.

Understood, but since I am a newbie with FreeNAS, I don't know how to carry out your suggestion. Could you help me with a bit more detail?
A step-by-step guide, if possible. It would help me a lot!

Thanks in advance
 

Alfredo_Ak47

Cadet
Joined
May 5, 2021
Messages
2
Friends, I have a disk that is degraded. How do I copy its data to another disk?
 

modchips

Cadet
Joined
Jan 4, 2022
Messages
2
DO NOT USE RAID 0 unless you have another backup of your data.
I lost 14 TB of data. It took me 1 year and 2 months to recover most of the files.

A huge headache
 