SSD Drives read/write error

PhlMike

Cadet
Joined
Dec 5, 2022
Messages
4
System: Supermicro SYS-7048R-C1R4+
Motherboard: Supermicro X10DRC-LN4+
Processors: 2x Intel Xeon 2670v3
Memory: 128 GB ECC - DDR4 2133
OS: TrueNAS-13.0-U3.1
Mechanical: 8x Seagate Exos X16 12TB
NVMe: 4x MSI M470 2TB
SSD: 8x Kingston DC450R 1.92TB
HBA: LSI 9300-8i for the SSDs
Motherboard SATA ports for the mechanical drives
PCIe x16 carrier card for the NVMe drives
Network: 4x Intel 82599 10Gb SFP+ ports

Problem: The Kingston DC450Rs are acting weird, specifically whichever drive shows up as da3. I moved drives around on the backplane: no change. I swapped in a different drive: no change, da3 still errors.
Error:
Code:
Error: [Errno 5] Input/output error


smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p2 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Phison Driven SSDs
Device Model: KINGSTON SEDC450R1920G
Serial Number: 50026B7282DA0F0D
LU WWN Device Id: 5 0026b7 282da0f0d
Firmware Version: SCEKH3.7
User Capacity: 1,920,383,410,176 bytes [1.92 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
TRIM Command: Available, deterministic, zeroed
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-3 (minor revision not indicated)
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Mon Dec 5 10:21:31 2022 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (  30) seconds.
Offline data collection
capabilities:                    (0x79) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (   2) minutes.
Conveyance self-test routine
recommended polling time:        (   3) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       3
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       11
167 Write_Protect_Mode      0x0000   100   100   000    Old_age   Offline      -       0
168 SATA_Phy_Error_Count    0x0012   100   100   000    Old_age   Always       -       19
169 Bad_Block_Rate          0x0000   100   100   000    Old_age   Offline      -       0
170 Bad_Blk_Ct_Erl/Lat      0x0000   100   100   010    Old_age   Offline      -       0/655
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 MaxAvgErase_Ct          0x0000   100   100   000    Old_age   Offline      -       1
181 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
182 Erase_Fail_Count        0x0000   100   100   000    Old_age   Offline      -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
192 Unsafe_Shutdown_Count   0x0012   100   100   000    Old_age   Always       -       8
193 Power_Fail_Uncompl_Cnt  0x0000   100   100   000    Old_age   Offline      -       0
194 Temperature_Celsius     0x0022   071   070   000    Old_age   Always       -       29 (Min/Max 22/30)
195 Power_Fail_Health       0x0000   100   100   000    Old_age   Offline      -       100
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
199 SATA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       17
207 Thermal_Throttling_Cnt  0x0000   100   100   000    Old_age   Offline      -       0
218 CRC_Error_Count         0x0032   100   100   000    Old_age   Always       -       17
231 SSD_Life_Left           0x0000   100   100   000    Old_age   Offline      -       100
232 Read_Fail_Count         0x0013   100   100   000    Pre-fail  Always       -       0
233 Flash_Writes_GiB        0x0032   100   100   000    Old_age   Always       -       0
241 Lifetime_Writes_GiB     0x0032   100   100   000    Old_age   Always       -       0
242 Lifetime_Reads_GiB      0x0032   100   100   000    Old_age   Always       -       0
244 Average_Erase_Count     0x0000   100   100   000    Old_age   Offline      -       0
245 Max_Erase_Count         0x0000   100   100   000    Old_age   Offline      -       1
246 Total_Erase_Count       0x0000   100   100   000    Old_age   Offline      -       384

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error        00%                3  -
# 2  Extended offline    Completed without error        00%                2  -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

I did, however, see SMART errors from da2 and da3 complaining about attributes, but they didn't come back after I cleared them. Even after I replaced the drive, da3 refuses to wipe, so I can't make a pool with it. The SMART errors stopped once I upgraded the BIOS and firmware on the LSI card. I used the latest available: Firmware 17.0

We are going to try another backplane as well as another cable to the backplane.
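
For anyone reading along: in the SMART output above, the wear and media counters are all zero, while the link-level counters (168, 199, 218) are not. Interface CRC and PHY errors are commonly associated with the path between drive and controller (cable, connector, backplane) rather than with the flash itself, though that is a rule of thumb and not a diagnosis. Below is a rough Python sketch that pulls those counters out of pasted `smartctl -a` text. The function name `link_errors` and the set `LINK_ATTRS` are made up for illustration, and the attribute names are the ones this particular drive reports; other models use different names.

```python
import re

# Attribute names as printed for this Phison-based Kingston drive (assumption:
# other drive families label their link counters differently).
LINK_ATTRS = {"SATA_Phy_Error_Count", "SATA_CRC_Error_Count", "CRC_Error_Count"}

# One attribute row: ID, name, flag, value, worst, thresh, type, updated,
# when_failed, raw value. Columns may be separated by any amount of whitespace.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}\s+\d+\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+(\S.*)$"
)

def link_errors(smartctl_text):
    """Return {attribute_name: raw_count} for non-zero link-level counters."""
    found = {}
    for line in smartctl_text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # not an attribute row
        name = m.group(2)
        raw = m.group(3).split()[0]  # first token of the raw value
        if name in LINK_ATTRS and raw.isdigit() and int(raw) > 0:
            found[name] = int(raw)
    return found

# A few rows copied from the output above.
sample = """\
1 Raw_Read_Error_Rate 0x0032 100 100 000 Old_age Always - 0
168 SATA_Phy_Error_Count 0x0012 100 100 000 Old_age Always - 19
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
199 SATA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 17
218 CRC_Error_Count 0x0032 100 100 000 Old_age Always - 17
"""

if __name__ == "__main__":
    print(link_errors(sample))
```

On the sample rows this reports 19 PHY errors and 17 CRC errors. Running it again after a cable or backplane swap and checking whether the counts keep climbing is more telling than the absolute numbers.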
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
PhlMike said: "Firmware 17.0"

You're running mismatched firmware, which can cause all sorts of bizarre problems. The approved firmware is 16.00.12.00. Please correct this and then let us know what happens.
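
The check being asked for here boils down to comparing a dotted version string against 16.00.12.00. A minimal Python sketch of that comparison follows; `parse_version` and `firmware_status` are hypothetical helper names, and how you read the installed version off the card is left out on purpose, since that depends on the flashing tool you use.

```python
APPROVED = "16.00.12.00"  # the version named in the post above

def parse_version(text):
    """Turn '16.00.12.00' into (16, 0, 12, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))

def firmware_status(installed, approved=APPROVED):
    """Classify an installed firmware version relative to the approved one."""
    have, want = parse_version(installed), parse_version(approved)
    if have == want:
        return "match"
    return "newer than approved" if have > want else "older than approved"

if __name__ == "__main__":
    for version in ("16.00.12.00", "17.0"):
        print(version, "->", firmware_status(version))
```

Comparing tuples of integers avoids the trap of comparing the strings directly, where zero padding and field count would change the result.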
 

PhlMike

jgreco said: "You're running mismatched firmware, which can cause all sorts of bizarre problems. The approved firmware is 16.00.12.00. Please correct this and then let us know what happens."
I was running 16.00.12.00 and I was having problems. Upgrading to 17.00.00.00 resolved the SMART error issues, or at least SMART stopped reporting errors. "Resolved" is nebulous.

We replaced the SFF-8643 cables and the backplane for the 2.5" SAS tray, and the issue went away. One of the SFF-8643 cables had a broken retention clip, which may have been the cause. We can't tell, because we replaced the backplane at the same time.

Logically, this could all have been an SFF-8643 cable with a bad clip all along. When we moved drives around, we didn't move them far enough, so the "drive" was still on the SFF-8643 cable with the wonky connection.
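
To spell out that reasoning: an SFF-8643 connector carries four lanes, so the two cables from a 9300-8i each feed a group of four bays, and a swap inside one group never leaves the suspect cable. Here is a small Python sketch of the idea. The bay numbering (bays 0-3 on cable 0, bays 4-7 on cable 1) is an assumption for illustration only; the real wiring depends on the backplane, and FreeBSD device names like da3 do not necessarily follow bay order.

```python
LANES_PER_CABLE = 4  # one SFF-8643 connector carries four SAS/SATA lanes

def cable_for_bay(bay):
    """Hypothetical wiring: bays 0-3 hang off cable 0, bays 4-7 off cable 1."""
    return bay // LANES_PER_CABLE

def swap_changes_cable(old_bay, new_bay):
    """True only if moving the drive actually puts it on a different cable."""
    return cable_for_bay(old_bay) != cable_for_bay(new_bay)

if __name__ == "__main__":
    print(swap_changes_cable(3, 1))  # same group of four, same cable
    print(swap_changes_cable(3, 5))  # other group, other cable
```

Under that assumed layout, moving a drive from bay 3 to bay 1 tests nothing about the cable, while moving it to bay 5 does.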

Or none of that fixed anything, and what actually fixed it was powering the machine down completely, letting it sit for an hour, and powering it back on, rather than just rebooting.

I can go back and reflash the firmware later; I saved the old firmware before upgrading.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Where did you get P17 firmware? The latest I've seen for any 93xx card is P16.0.0.12.
 

PhlMike

Ericloewe said: "Where did you get P12 firmware? The latest I've seen for any 93xx card is P16.0.0.12."
P16, from Broadcom Support: check "Include Legacy Products", then select Group: Storage Adapters, Controllers, and ICs; Family: Storage Adapters, Controllers, and ICs; Product: SAS 9300-8i Host Bus Adapter.

However, as jgreco said, you should stick to 16.00.12.00, since bad things may happen otherwise.
 

Ericloewe

Looks to me like it's the installer that's at version 17, while the actual firmware is still 16.00.10.00. Though the 9305s seem to have 16.00.12.00 available.

What a mess. I wish we'd finally get NVMe HDDs so that we can get rid of SAS and all this crazy firmware/driver nonsense.
 

PhlMike

They try to make these products too accommodating by having both an IR mode and an IT mode.

The issue is cost. An NVMe drive at 8-12 TB would be 300x the cost of a spinning-rust drive. I've built arrays with dozens of spinning drives plus some L2ARC and ZIL, and I can use them as backend storage for VMware and easily run 100+ VMs on them. All for what, $30k?

Even moving to a significantly smaller scale with, say, 15 NVMe drives would cost $60k. Then there's the matter of handling them. I take retired servers and turn them into NAS units for everything from backup to a temporary dump for doing NAS maintenance. With NVMe you would need brand-new hardware.

The unit I had the problem with was scrabbled together from spare parts to serve as temporary storage for my primary NAS units, so I can refresh them with all-new drives and NVMe devices, upgrade firmware, and upgrade RAM.
 

Ericloewe

PhlMike said: "The issue is cost. An NVMe drive at 8-12 TB would be 300x the cost of a spinning-rust drive. I've built arrays with dozens of spinning drives plus some L2ARC and ZIL, and I can use them as backend storage for VMware and easily run 100+ VMs on them. All for what, $30k?"
Well, let's not overstate this: it's around 10x the cost to go from spinning rust to a midrange enterprise NVMe SSD. But I did mean spinning rust attached over NVMe, which is a thing some people are working on.
 