Help with timeout messages on boot pool drive.

serverboy

Dabbler
Joined
Aug 11, 2023
Messages
39
Hi

I appear to be getting the following errors in the messages log on my server.

Mar 20 06:27:19 S-STORE01 ahcich1: Timeout on slot 4 port 0
Mar 20 06:27:19 S-STORE01 ahcich1: is 00000000 cs 00000000 ss 00000010 rs 00000010 tfd 40 serr 00000000 cmd 0004c417
Mar 20 06:27:19 S-STORE01 (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 5c ef dd 7e 40 0e 00 00 00 00 00
Mar 20 06:27:19 S-STORE01 (ada1:ahcich1:0:0:0): CAM status: Command timeout
Mar 20 06:27:19 S-STORE01 (ada1:ahcich1:0:0:0): Retrying command, 3 more tries remain

Originally appeared on the 20th March. The drive appears to be online and working. It is part of my Boot pool in a mirror.

I ran a long smart test which appears to have failed and the message appeared again.

Mar 31 14:35:33 S-STORE01 ahcich1: Timeout on slot 30 port 0
Mar 31 14:35:33 S-STORE01 ahcich1: is 00000000 cs 00000000 ss 40000000 rs 40000000 tfd 40 serr 00000000 cmd 0004de17
Mar 31 14:35:33 S-STORE01 (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 af d9 7e 33 40 00 00 00 00 00 00
Mar 31 14:35:33 S-STORE01 (ada1:ahcich1:0:0:0): CAM status: Command timeout
Mar 31 14:35:33 S-STORE01 (ada1:ahcich1:0:0:0): Retrying command, 3 more tries remain

I ran a short test and that passed fine. In the GUI the pool is showing fine and ZPOOL status shows the following

pool: boot-pool
state: ONLINE
scan: scrub repaired 0B in 00:00:07 with 0 errors on Mon Mar 25 03:45:08 2024
config:

NAME STATE READ WRITE CKSUM
boot-pool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ada1p2 ONLINE 0 0 0
ada0p2 ONLINE 0 0 0

errors: No known data errors

SMART Results are:

=== START OF INFORMATION SECTION ===
Device Model: SPCC Solid State Disk
Serial Number: AA230202S3128003036
Firmware Version: V0823A0
User Capacity: 128,035,676,160 bytes [128 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: M.2
TRIM Command: Available
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-2 T13/2015-D revision 3
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sun Mar 31 15:09:56 2024 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x02) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 120) seconds.
Offline data collection
capabilities: (0x11) SMART execute Offline immediate.
No Auto Offline data collection support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
No Selective Self-test supported.
SMART capabilities: (0x0002) Does not save SMART data before
entering power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 10) minutes.
SCT capabilities: (0x0001) SCT Status supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x0032 100 100 050 Old_age Always - 0
5 Reallocated_Sector_Ct 0x0032 100 100 050 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 050 Old_age Always - 5271
12 Power_Cycle_Count 0x0032 100 100 050 Old_age Always - 70
160 Unknown_Attribute 0x0032 100 100 050 Old_age Always - 0
161 Unknown_Attribute 0x0033 100 100 050 Pre-fail Always - 100
163 Unknown_Attribute 0x0032 100 100 050 Old_age Always - 6
164 Unknown_Attribute 0x0032 100 100 050 Old_age Always - 7796
165 Unknown_Attribute 0x0032 100 100 050 Old_age Always - 47
166 Unknown_Attribute 0x0032 100 100 050 Old_age Always - 2
167 Unknown_Attribute 0x0032 100 100 050 Old_age Always - 15
168 Unknown_Attribute 0x0032 100 100 050 Old_age Always - 5050
169 Unknown_Attribute 0x0032 100 100 050 Old_age Always - 100
175 Program_Fail_Count_Chip 0x0032 100 100 050 Old_age Always - 0
176 Erase_Fail_Count_Chip 0x0032 100 100 050 Old_age Always - 0
177 Wear_Leveling_Count 0x0032 100 100 050 Old_age Always - 0
178 Used_Rsvd_Blk_Cnt_Chip 0x0032 100 100 050 Old_age Always - 0
181 Program_Fail_Cnt_Total 0x0032 100 100 050 Old_age Always - 0
182 Erase_Fail_Count_Total 0x0032 100 100 050 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 050 Old_age Always - 29
194 Temperature_Celsius 0x0022 100 100 050 Old_age Always - 40
195 Hardware_ECC_Recovered 0x0032 100 100 050 Old_age Always - 0
196 Reallocated_Event_Count 0x0032 100 100 050 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 050 Old_age Always - 0
198 Offline_Uncorrectable 0x0032 100 100 050 Old_age Always - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 050 Old_age Always - 0
232 Available_Reservd_Space 0x0032 100 100 050 Old_age Always - 100
241 Total_LBAs_Written 0x0030 100 100 050 Old_age Offline - 13798
242 Total_LBAs_Read 0x0030 100 100 050 Old_age Offline - 5628
245 Unknown_Attribute 0x0032 100 100 050 Old_age Always - 36555

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 5271 -
# 2 Extended offline Interrupted (host reset) 80% 5271 -
# 3 Extended offline Interrupted (host reset) 90% 4999 -
# 4 Short offline Completed without error 00% 4759 -
# 5 Short offline Completed without error 00% 4064 -
# 6 Extended offline Completed without error 00% 3560 -
# 7 Short offline Completed without error 00% 3320 -
# 8 Short offline Completed without error 00% 2576 -
# 9 Extended offline Completed without error 00% 2096 -
#10 Short offline Completed without error 00% 1856 -
#11 Short offline Completed without error 00% 1393 -
#12 Short offline Completed without error 00% 1225 -
#13 Short offline Completed without error 00% 1084 -
#14 Short offline Completed without error 00% 917 -
#15 Short offline Completed without error 00% 749 -
#16 Short offline Completed without error 00% 582 -
#17 Extended offline Completed without error 00% 573 -
#18 Extended offline Completed without error 00% 550 -
#19 Short offline Completed without error 00% 82 -

Selective Self-tests/Logging not supported

I am wondering if there is a loose cable somewhere, I know the SATA cables used on these two drives are pretty tough to bend so wondering if one of them is causing this issue. If not then maybe the drive is going bad.

Can someone advise if it is safe to do the following:

Can I just remove the SATA cable from one drive while the server is on so it shows the drive offline, this way I can identify the drive's physically and label them, or would I have to turn it off?

The boot drive is mirrored.
 

serverboy

Dabbler
Joined
Aug 11, 2023
Messages
39
Just ran another long smart test and it completed successfully.

smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model: SPCC Solid State Disk
Serial Number: AA230202S3128003036
Firmware Version: V0823A0
User Capacity: 128,035,676,160 bytes [128 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: M.2
TRIM Command: Available
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-2 T13/2015-D revision 3
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sun Mar 31 16:02:11 2024 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x02) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 120) seconds.
Offline data collection
capabilities: (0x11) SMART execute Offline immediate.
No Auto Offline data collection support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
No Selective Self-test supported.
SMART capabilities: (0x0002) Does not save SMART data before
entering power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 10) minutes.
SCT capabilities: (0x0001) SCT Status supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x0032 100 100 050 Old_age Always - 0
5 Reallocated_Sector_Ct 0x0032 100 100 050 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 050 Old_age Always - 5272
12 Power_Cycle_Count 0x0032 100 100 050 Old_age Always - 70
160 Unknown_Attribute 0x0032 100 100 050 Old_age Always - 0
161 Unknown_Attribute 0x0033 100 100 050 Pre-fail Always - 100
163 Unknown_Attribute 0x0032 100 100 050 Old_age Always - 6
164 Unknown_Attribute 0x0032 100 100 050 Old_age Always - 7796
165 Unknown_Attribute 0x0032 100 100 050 Old_age Always - 47
166 Unknown_Attribute 0x0032 100 100 050 Old_age Always - 2
167 Unknown_Attribute 0x0032 100 100 050 Old_age Always - 15
168 Unknown_Attribute 0x0032 100 100 050 Old_age Always - 5050
169 Unknown_Attribute 0x0032 100 100 050 Old_age Always - 100
175 Program_Fail_Count_Chip 0x0032 100 100 050 Old_age Always - 0
176 Erase_Fail_Count_Chip 0x0032 100 100 050 Old_age Always - 0
177 Wear_Leveling_Count 0x0032 100 100 050 Old_age Always - 0
178 Used_Rsvd_Blk_Cnt_Chip 0x0032 100 100 050 Old_age Always - 0
181 Program_Fail_Cnt_Total 0x0032 100 100 050 Old_age Always - 0
182 Erase_Fail_Count_Total 0x0032 100 100 050 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 050 Old_age Always - 29
194 Temperature_Celsius 0x0022 100 100 050 Old_age Always - 40
195 Hardware_ECC_Recovered 0x0032 100 100 050 Old_age Always - 0
196 Reallocated_Event_Count 0x0032 100 100 050 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 050 Old_age Always - 0
198 Offline_Uncorrectable 0x0032 100 100 050 Old_age Always - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 050 Old_age Always - 0
232 Available_Reservd_Space 0x0032 100 100 050 Old_age Always - 100
241 Total_LBAs_Written 0x0030 100 100 050 Old_age Offline - 13800
242 Total_LBAs_Read 0x0030 100 100 050 Old_age Offline - 5628
245 Unknown_Attribute 0x0032 100 100 050 Old_age Always - 36555

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 5272 -
# 2 Short offline Completed without error 00% 5271 -
# 3 Extended offline Interrupted (host reset) 80% 5271 -
# 4 Extended offline Interrupted (host reset) 90% 4999 -
# 5 Short offline Completed without error 00% 4759 -
# 6 Short offline Completed without error 00% 4064 -
# 7 Extended offline Completed without error 00% 3560 -
# 8 Short offline Completed without error 00% 3320 -
# 9 Short offline Completed without error 00% 2576 -
#10 Extended offline Completed without error 00% 2096 -
#11 Short offline Completed without error 00% 1856 -
#12 Short offline Completed without error 00% 1393 -
#13 Short offline Completed without error 00% 1225 -
#14 Short offline Completed without error 00% 1084 -
#15 Short offline Completed without error 00% 917 -
#16 Short offline Completed without error 00% 749 -
#17 Short offline Completed without error 00% 582 -
#18 Extended offline Completed without error 00% 573 -
#19 Extended offline Completed without error 00% 550 -
#20 Short offline Completed without error 00% 82 -

Selective Self-tests/Logging not supported
Are these AHCI messages something to be concerned about?
Mar 19 00:00:00 S-STORE01 syslog-ng[5881]: Configuration reload request received, reloading configuration;
Mar 19 00:00:00 S-STORE01 syslog-ng[5881]: Configuration reload finished;
Mar 20 00:00:00 S-STORE01 syslog-ng[5881]: Configuration reload request received, reloading configuration;
Mar 20 00:00:00 S-STORE01 syslog-ng[5881]: Configuration reload finished;
Mar 20 06:27:19 S-STORE01 ahcich1: Timeout on slot 4 port 0
Mar 20 06:27:19 S-STORE01 ahcich1: is 00000000 cs 00000000 ss 00000010 rs 00000010 tfd 40 serr 00000000 cmd 0004c417
Mar 20 06:27:19 S-STORE01 (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 5c ef dd 7e 40 0e 00 00 00 00 00
Mar 20 06:27:19 S-STORE01 (ada1:ahcich1:0:0:0): CAM status: Command timeout
Mar 20 06:27:19 S-STORE01 (ada1:ahcich1:0:0:0): Retrying command, 3 more tries remain
Mar 21 00:00:00 S-STORE01 syslog-ng[5881]: Configuration reload request received, reloading configuration;
Mar 21 00:00:00 S-STORE01 syslog-ng[5881]: Configuration reload finished;
Mar 22 00:00:00 S-STORE01 syslog-ng[5881]: Configuration reload request received, reloading configuration;
Mar 22 00:00:00 S-STORE01 syslog-ng[5881]: Configuration reload finished;
Mar 23 00:00:00 S-STORE01 syslog-ng[5881]: Configuration reload request received, reloading configuration;
Mar 23 00:00:00 S-STORE01 syslog-ng[5881]: Configuration reload finished;
Mar 24 00:00:00 S-STORE01 syslog-ng[5881]: Configuration reload request received, reloading configuration;
Mar 24 00:00:00 S-STORE01 syslog-ng[5881]: Configuration reload finished;
Mar 25 00:00:00 S-STORE01 syslog-ng[5881]: Configuration reload request received, reloading configuration;
Mar 25 00:00:00 S-STORE01 syslog-ng[5881]: Configuration reload finished;
Mar 26 00:00:00 S-STORE01 syslog-ng[5881]: Configuration reload request received, reloading configuration;
Mar 26 00:00:00 S-STORE01 syslog-ng[5881]: Configuration reload finished;
Mar 27 00:00:00 S-STORE01 syslog-ng[5881]: Configuration reload request received, reloading configuration;
Mar 27 00:00:00 S-STORE01 syslog-ng[5881]: Configuration reload finished;
Mar 28 00:00:00 S-STORE01 syslog-ng[5881]: Configuration reload request received, reloading configuration;
Mar 28 00:00:00 S-STORE01 syslog-ng[5881]: Configuration reload finished;
Mar 29 00:00:00 S-STORE01 syslog-ng[5881]: Configuration reload request received, reloading configuration;
Mar 29 00:00:00 S-STORE01 syslog-ng[5881]: Configuration reload finished;
Mar 30 00:00:00 S-STORE01 syslog-ng[5881]: Configuration reload request received, reloading configuration;
Mar 30 00:00:00 S-STORE01 syslog-ng[5881]: Configuration reload finished;
Mar 31 00:00:00 S-STORE01 syslog-ng[5881]: Configuration reload request received, reloading configuration;
Mar 31 00:00:00 S-STORE01 syslog-ng[5881]: Configuration reload finished;
Mar 31 14:35:33 S-STORE01 ahcich1: Timeout on slot 30 port 0
Mar 31 14:35:33 S-STORE01 ahcich1: is 00000000 cs 00000000 ss 40000000 rs 40000000 tfd 40 serr 00000000 cmd 0004de17
Mar 31 14:35:33 S-STORE01 (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 af d9 7e 33 40 00 00 00 00 00 00
Mar 31 14:35:33 S-STORE01 (ada1:ahcich1:0:0:0): CAM status: Command timeout
Mar 31 14:35:33 S-STORE01 (ada1:ahcich1:0:0:0): Retrying command, 3 more tries remain
 

chuck32

Guru
Joined
Jan 14, 2023
Messages
623
Can I just remove the SATA cable from one drive while the server is on so it shows the drive offline, this way I can identify the drive's physically and label them, or would I have to turn it off?

So you don't know how the drive is connected? Because you are using an HBA that would be interesting.

If your mainboard is hot swap capable you could probably do that. I would shut down the server though.

Under storage -> disks you can see the serial number of your drives. With my WD they also have part of the serial printed on the front of.the drive so you can identify without removing them.

Try to switch the sata cable for a new one / known good while your at it.
 

serverboy

Dabbler
Joined
Aug 11, 2023
Messages
39
So you don't know how the drive is connected? Because you are using an HBA that would be interesting.

If your mainboard is hot swap capable you could probably do that. I would shut down the server though.

Under storage -> disks you can see the serial number of your drives. With my WD they also have part of the serial printed on the front of.the drive so you can identify without removing them.

Try to switch the sata cable for a new one / known good while your at it.
Sorry forgot to mention the boot drives are connected to the onboard SATA ports. So it will be either one of the two that I will need to identify, which is ADA0 or ADA1.
The rest of the drives are all connected to the front backplane which are going to my HBA.
Ye just thinking about it probably best bet is to shut it down and do it.
 
Top