Hi
I appear to be getting the following errors in the messages log on my server.
Originally appeared on the 20th March. The drive appears to be online and working. It is part of my Boot pool in a mirror.
I ran a long smart test which appears to have failed and the message appeared again.
I ran a short test and that passed fine. In the GUI the pool is showing fine and ZPOOL status shows the following
SMART Results are:
I am wondering if there is a loose cable somewhere, I know the SATA cables used on these two drives are pretty tough to bend so wondering if one of them is causing this issue. If not then maybe the drive is going bad.
Can someone advise if it is safe to do the following:
Can I just remove the SATA cable from one drive while the server is on so it shows the drive offline, this way I can identify the drive's physically and label them, or would I have to turn it off?
The boot drive is mirrored.
I appear to be getting the following errors in the messages log on my server.
Mar 20 06:27:19 S-STORE01 ahcich1: Timeout on slot 4 port 0
Mar 20 06:27:19 S-STORE01 ahcich1: is 00000000 cs 00000000 ss 00000010 rs 00000010 tfd 40 serr 00000000 cmd 0004c417
Mar 20 06:27:19 S-STORE01 (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 5c ef dd 7e 40 0e 00 00 00 00 00
Mar 20 06:27:19 S-STORE01 (ada1:ahcich1:0:0:0): CAM status: Command timeout
Mar 20 06:27:19 S-STORE01 (ada1:ahcich1:0:0:0): Retrying command, 3 more tries remain
Originally appeared on the 20th March. The drive appears to be online and working. It is part of my Boot pool in a mirror.
I ran a long smart test which appears to have failed and the message appeared again.
Mar 31 14:35:33 S-STORE01 ahcich1: Timeout on slot 30 port 0
Mar 31 14:35:33 S-STORE01 ahcich1: is 00000000 cs 00000000 ss 40000000 rs 40000000 tfd 40 serr 00000000 cmd 0004de17
Mar 31 14:35:33 S-STORE01 (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 af d9 7e 33 40 00 00 00 00 00 00
Mar 31 14:35:33 S-STORE01 (ada1:ahcich1:0:0:0): CAM status: Command timeout
Mar 31 14:35:33 S-STORE01 (ada1:ahcich1:0:0:0): Retrying command, 3 more tries remain
I ran a short test and that passed fine. In the GUI the pool is showing fine and ZPOOL status shows the following
pool: boot-pool
state: ONLINE
scan: scrub repaired 0B in 00:00:07 with 0 errors on Mon Mar 25 03:45:08 2024
config:
NAME STATE READ WRITE CKSUM
boot-pool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ada1p2 ONLINE 0 0 0
ada0p2 ONLINE 0 0 0
errors: No known data errors
SMART Results are:
=== START OF INFORMATION SECTION ===
Device Model: SPCC Solid State Disk
Serial Number: AA230202S3128003036
Firmware Version: V0823A0
User Capacity: 128,035,676,160 bytes [128 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: M.2
TRIM Command: Available
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-2 T13/2015-D revision 3
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sun Mar 31 15:09:56 2024 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x02) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 120) seconds.
Offline data collection
capabilities: (0x11) SMART execute Offline immediate.
No Auto Offline data collection support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
No Selective Self-test supported.
SMART capabilities: (0x0002) Does not save SMART data before
entering power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 10) minutes.
SCT capabilities: (0x0001) SCT Status supported.
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x0032 100 100 050 Old_age Always - 0
5 Reallocated_Sector_Ct 0x0032 100 100 050 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 050 Old_age Always - 5271
12 Power_Cycle_Count 0x0032 100 100 050 Old_age Always - 70
160 Unknown_Attribute 0x0032 100 100 050 Old_age Always - 0
161 Unknown_Attribute 0x0033 100 100 050 Pre-fail Always - 100
163 Unknown_Attribute 0x0032 100 100 050 Old_age Always - 6
164 Unknown_Attribute 0x0032 100 100 050 Old_age Always - 7796
165 Unknown_Attribute 0x0032 100 100 050 Old_age Always - 47
166 Unknown_Attribute 0x0032 100 100 050 Old_age Always - 2
167 Unknown_Attribute 0x0032 100 100 050 Old_age Always - 15
168 Unknown_Attribute 0x0032 100 100 050 Old_age Always - 5050
169 Unknown_Attribute 0x0032 100 100 050 Old_age Always - 100
175 Program_Fail_Count_Chip 0x0032 100 100 050 Old_age Always - 0
176 Erase_Fail_Count_Chip 0x0032 100 100 050 Old_age Always - 0
177 Wear_Leveling_Count 0x0032 100 100 050 Old_age Always - 0
178 Used_Rsvd_Blk_Cnt_Chip 0x0032 100 100 050 Old_age Always - 0
181 Program_Fail_Cnt_Total 0x0032 100 100 050 Old_age Always - 0
182 Erase_Fail_Count_Total 0x0032 100 100 050 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 050 Old_age Always - 29
194 Temperature_Celsius 0x0022 100 100 050 Old_age Always - 40
195 Hardware_ECC_Recovered 0x0032 100 100 050 Old_age Always - 0
196 Reallocated_Event_Count 0x0032 100 100 050 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 050 Old_age Always - 0
198 Offline_Uncorrectable 0x0032 100 100 050 Old_age Always - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 050 Old_age Always - 0
232 Available_Reservd_Space 0x0032 100 100 050 Old_age Always - 100
241 Total_LBAs_Written 0x0030 100 100 050 Old_age Offline - 13798
242 Total_LBAs_Read 0x0030 100 100 050 Old_age Offline - 5628
245 Unknown_Attribute 0x0032 100 100 050 Old_age Always - 36555
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 5271 -
# 2 Extended offline Interrupted (host reset) 80% 5271 -
# 3 Extended offline Interrupted (host reset) 90% 4999 -
# 4 Short offline Completed without error 00% 4759 -
# 5 Short offline Completed without error 00% 4064 -
# 6 Extended offline Completed without error 00% 3560 -
# 7 Short offline Completed without error 00% 3320 -
# 8 Short offline Completed without error 00% 2576 -
# 9 Extended offline Completed without error 00% 2096 -
#10 Short offline Completed without error 00% 1856 -
#11 Short offline Completed without error 00% 1393 -
#12 Short offline Completed without error 00% 1225 -
#13 Short offline Completed without error 00% 1084 -
#14 Short offline Completed without error 00% 917 -
#15 Short offline Completed without error 00% 749 -
#16 Short offline Completed without error 00% 582 -
#17 Extended offline Completed without error 00% 573 -
#18 Extended offline Completed without error 00% 550 -
#19 Short offline Completed without error 00% 82 -
Selective Self-tests/Logging not supported
I am wondering if there is a loose cable somewhere, I know the SATA cables used on these two drives are pretty tough to bend so wondering if one of them is causing this issue. If not then maybe the drive is going bad.
Can someone advise if it is safe to do the following:
Can I just remove the SATA cable from one drive while the server is on so it shows the drive offline, this way I can identify the drive's physically and label them, or would I have to turn it off?
The boot drive is mirrored.