having issues with my hardware - looking for suggestions

Joined
Mar 5, 2022
Messages
224
I had some issues with heat causing reboots (replacing the heatsink grease made a huge difference.) Now, the only reason the system reboots is because of a power failure (I live in Florida but have a UPS that is only good for so long.)

The previous power supply gave up the ghost so I replaced it with a brand new EVGA Supernova 1600 a couple of months ago, which has been working fine until recently: The system won't boot unless I remove one of the 14 HDD's. Once it gets to the TrueNAS boot screen, I can safely insert it and it will boot successfully every time. I had a similar problem with the previous power supply, but I would have to remove all of the drives to successfully boot.


The system consists of an Asus M5A99X EVO with 32GB of RAM (pretty sure it's DDR3) and an AMD processor of some sort (AM3+/AM3 socket.) There are two SSDs for the OS, 14 2TB drives for the pools, and a single USB drive with an external power supply for backup. From what I understand, the HDDs only use about 8W each, so there is no way they are the problem unless several are failing. The motherboard should take no more than 150W.

I am not seeing any warnings or errors for the HDD's.
Any idea how to check how much power a HDD is actually consuming?

Any suggestions for my dilemma?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Extensive discussions of the topic exist.

 

rvassar

Guru
Joined
May 2, 2018
Messages
972
AM3/AM3+ is getting pretty long in the tooth, and had terrible thermal engineering numbers even by 2011 standards. You didn't mention exactly which CPU you are using, but the Bulldozer & Piledriver CPUs span 95 to 220 watts, and have very poor idle/sleep capability. Using one for a NAS is kind of starting off from a bad place, but... I'm also going to point out that 2TB HDDs are also pretty close to relic status, and most solutions with that many drives implement a staggered spin up solution to prevent overloading the 12 V rail. 14 x 2TB in RAIDZ1 nets you 26TB, down to 14TB mirrored, depending on vdevs and such. You can buy 22TB single drives these days, and 10 - 14TB drives are available at discount prices. I would consider an upgrade/migration that reduces the device count. A 4 drive solution should be simple, and a 6 drive solution with some capacity expansion within reach. If you're doing block storage with those drive counts for the IOPS rate, you should be moving to SSDs.
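
The capacity figures above can be checked with a few lines of arithmetic. This is a minimal Python sketch, assuming a single RAIDZ1 vdev on one side and striped 2-way mirrors on the other; the function names are made up for illustration, and ZFS metadata, padding, and free-space reserve are ignored, so real usable space will be lower.

```python
def raidz_usable_tb(drives, drive_tb, parity=1):
    """Raw usable capacity of one RAIDZ vdev: data drives times drive size."""
    return (drives - parity) * drive_tb


def mirror_usable_tb(drives, drive_tb, way=2):
    """Raw usable capacity of striped N-way mirrors: one drive's worth per mirror."""
    return (drives // way) * drive_tb


if __name__ == "__main__":
    print("14 x 2TB, single RAIDZ1 vdev:", raidz_usable_tb(14, 2), "TB")
    print("14 x 2TB, 7 two-way mirrors: ", mirror_usable_tb(14, 2), "TB")
    # The same space with fewer, larger drives (RAIDZ2, 6 x 10TB as an example):
    print("6 x 10TB, single RAIDZ2 vdev:", raidz_usable_tb(6, 10, parity=2), "TB")
```

Splitting the 14 drives into several RAIDZ vdevs costs one more drive of parity per extra vdev, which is why the answer depends on the vdev layout.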

Possible lower budget solutions:
1. The EVGA Supernova 1600 appears to have a single 12 V rail. Yes, at 133 amps of 12 V it's beefy, but a dual rail 12 V supply is probably a better choice.
2. Swap out two of the 2TB drives for 2TB SSDs. Yes, it's dumb/wasteful... But it saves you the spin up power you seem to need and costs about the same amount as a 1+kW PSU swap in my locale. Bonus: You can keep swapping out a drive every couple weeks at lunch money rates and end up with an all SSD pool...
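
To put the spin-up concern into rough numbers, here is a minimal Python sketch. The per-drive currents (2.0 A on the 12 V line during spin-up, 0.4 A once running) are placeholder assumptions, not measurements from this system; the real figures are on each drive's datasheet or label. The function name is hypothetical.

```python
RAIL_VOLTS = 12.0


def peak_12v_load(drives, spinup_amps=2.0, running_amps=0.4, group_size=None):
    """Estimate the peak 12 V load (amps, watts) while the drives spin up.

    group_size=None models every drive spinning up at once. Otherwise the
    worst case is the last group spinning up while the rest already run.
    The default currents are assumed placeholder values.
    """
    if group_size is None or group_size >= drives:
        amps = drives * spinup_amps
    else:
        amps = group_size * spinup_amps + (drives - group_size) * running_amps
    return amps, amps * RAIL_VOLTS


if __name__ == "__main__":
    for label, group in (("all at once", None), ("staggered, 2 at a time", 2)):
        amps, watts = peak_12v_load(14, group_size=group)
        print(f"{label}: {amps:.1f} A, {watts:.0f} W on the 12 V rail")
```

Under these assumptions the simultaneous spin-up peak is several times the steady-state draw, which is the gap staggered spin-up is meant to close. Whether that peak is actually what trips a given supply is a separate question this estimate cannot answer.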
 
Joined
Mar 5, 2022
Messages
224
I'm slowly cozying up to this idea... I had no idea that SSDs had dropped so much in price! I'm seeing 2TB SSDs for under $70! Would you recommend one over another for a server application? AFAIK, as long as the transfer rate is equivalent, they are all pretty much the same.
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
For a server application you want to check the lifetime TBW rating, and power failure protection status. Remember it's the writing that wears out SSD memory cells, so you need to determine what your write/rewrite rate is and plan accordingly. ZFS will beat up a weak SSD, so check the reviews at places catering to IT pro's. I'd probably avoid QLC NAND at this point. But you may find the cost trade off vs redundancy immaterial if you're running a multi-vdev RAIDz2 that can soak up a lot of failures vs say a 7 vdev mirror-stripe that is always 2 failures away from loss. Only you know your config & workload, and can determine your aversion to risk.

Keep in mind, SSD's tend to fail suddenly and differently than a HDD. There's no weeks or months of bearing noises and head actuator reset clicks. So make sure the models you do deploy have proper SMART reporting and offer a media wear out indicator.

I concluded that a SSD pool would wear out too quickly for my applications, so I have a two pool solution. VM's & performance stuff on a block storage optimized pool, and bulk data like surveillance video on a RAIDZ2 HDD pool. The block storage will move to SSD soon.
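
The TBW check described above comes down to dividing the rated endurance by the yearly write volume. A minimal Python sketch follows; the 600 TBW rating, the daily write volumes, and the write amplification factor are hypothetical placeholders, so substitute the drive's datasheet rating and a write rate measured on your own pool.

```python
def years_to_reach_tbw(tbw_rating_tb, host_writes_gb_per_day, write_amplification=1.0):
    """Years until cumulative writes reach the drive's rated TBW.

    write_amplification is an assumed multiplier for extra writes generated
    below the host (1.0 means none); it is a rough knob, not a measured value.
    """
    tb_per_year = host_writes_gb_per_day * write_amplification * 365 / 1000
    if tb_per_year <= 0:
        return float("inf")
    return tbw_rating_tb / tb_per_year


if __name__ == "__main__":
    for daily_gb in (10, 50, 200):
        years = years_to_reach_tbw(tbw_rating_tb=600, host_writes_gb_per_day=daily_gb)
        print(f"{daily_gb:>4} GB/day -> about {years:.1f} years to reach 600 TBW")
```

A mostly-read media pool sits at the low end of that range, while VM or block storage workloads can land at the high end, which is the trade-off described in the post above.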
 
Joined
Mar 5, 2022
Messages
224
I think this will work great for at least one of my pools (plex). I rarely add media but occasionally review it. Four of my drives are dedicated to it, so I'm planning to replace 4 spinning disks with 4 SSD's. I've been reading that the cheap ones don't have decent (any?) DRAM cache, but for my application, that shouldn't be an issue. At some point I'll be moving from CORE to SCALE and will look at larger drives then for my pool (and reduce the number required.)
 
Joined
Mar 5, 2022
Messages
224
I've replaced my four Plex pool drives and the reboot issue is gone. I am now getting what look like controller warnings: smid command timeout on target:
1000009882.jpg
(perhaps the card was damaged by low voltage?)

The card I have now is an LSI 9207-8i, sold as "LSI 9207-8i RAID Controller Card 6Gbs SAS SATA PCI-E 3.0 HBA IT Mode Expander Card for ZFS FreeNAS unRAID" (https://a.co/d/4pB74Cy).

Any suggestions for a replacement (or is it a good card that's just failing?)
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
All clustered on targets 4 - 7? Or is it random? Bad cable? Bend stress, etc...? If it's clustered on a group of 4 addresses, it's likely the cable.
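
One way to check for that kind of clustering is to tally the target numbers copied from the console messages by breakout connector. This is a minimal Python sketch under a loud assumption: that target N sits on connector N // 4, which HBA target numbering does not guarantee. The function names and the sample target lists are made up for illustration.

```python
from collections import Counter


def timeouts_per_connector(target_ids, lanes_per_connector=4):
    """Count timeouts per connector, ASSUMING target N is on connector N // lanes."""
    return Counter(t // lanes_per_connector for t in target_ids)


def looks_clustered(target_ids, lanes_per_connector=4, share=0.9):
    """True when at least `share` of the timeouts fall on a single connector."""
    counts = timeouts_per_connector(target_ids, lanes_per_connector)
    if not counts:
        return False
    _, top = counts.most_common(1)[0]
    return top / sum(counts.values()) >= share


if __name__ == "__main__":
    seen = [4, 5, 6, 7, 4, 6]  # hypothetical targets read off the console
    print(dict(timeouts_per_connector(seen)))
    print("clustered on one cable?", looks_clustered(seen))
```

If the tally points at one connector, swapping that cable to another port and watching whether the errors follow it is the cheap next test.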
 
Joined
Mar 5, 2022
Messages
224
Forgive my ignorance, but I thought each HD had its own cable between the card and the HD.
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
I'm not familiar with your hardware, case, backplane(s), etc... I saw "LSI controller" and assumed there was a breakout cable of some sort, ala: SFF-8087 to 4 x SATA, which would give a common point of failure for a set of 4 drives. Only you know your cabling, I was just suggesting things to inspect.

If the controller is failing it would likely be producing errors on all the drives connected to it, not just a select grouping. The screenshot only showed a group of 4, which may not even be the case; I may have just inferred that incorrectly. Generally, the LSI controllers will throw a fault when they detect controller failures and the firmware halts until it gets a PCIe reset, but you'd need a debug cable to see it. You would only see the controller go missing abruptly, and if you have an advanced enough BIOS with some kind of LSI aware LOM, perhaps a bit more detail about a controller fault.
 
Joined
Mar 5, 2022
Messages
224
Actually, I think you may be on to something. That controller card handles a total of four drives. The rest of the drives are handled directly from the motherboard. I'll check which drives it handles and try moving the new SSD drives to this controller. There is less traffic on the pool they solely comprise, so hopefully this problem may actually go away.

I had one of these same controller cards fail just at the end of the warranty. The seller graciously sent me this replacement (which worked fine for the last year or so.) This is why I was asking about replacement suggestions.
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
Ok. In that case, you have to expand the list of suspects to the controller as well, just as you suggest. Remember, ZFS identifies pool members by an ID stored on the disks themselves (GUID/UUID), not by which port they are plugged into. You can swap cables between a motherboard drive and an LSI controller drive, and it should just find them and import them into the correct pool at boot regardless. ZFS handles drive roaming quite seamlessly, and this may help you narrow down the problem.
 
Joined
Mar 5, 2022
Messages
224
Not sure whether to start another thread or not. I'm sure someone will tell me if I am wrong ;-)

I replaced my controller card with a 16 port card (SVNXINGTII SAS9300-16i 9300-16i16-Port 12Gb/s SAS-3 SAS/SATA PCIe HBA IT Model ZFS) from Amazon yesterday. I have put ALL of my drives on this new card.

I saw this error on the console earlier today:
PXL_20231214_161026011.jpg

I looked up the error from the console and saw that it could be a loose cable or connector, so I moved the connectors around (to see if the error stayed on the same drive), plugged them back in (to reseat the other end of the cables), and rebooted.

Then I noticed this error (it could be that the pool was degraded earlier, but unfortunately I didn't look at the web page to see):
pool.png


I ran the smart_report.sh script from FreeNAS-scripts-master and got this:

########## SMART status report summary for all SATA drives on server PHONG ##########

+-------+------------------------+----+------+-----+-----+-------+-------+--------+------+----------+------+-----------+----+
|Device |Serial                  |Temp| Power|Start|Spin |ReAlloc|Current|Offline |Seek  |Total     |High  |    Command|Last|
|       |Number                  |    |    On|Stop |Retry|Sectors|Pending|Uncorrec|Errors|Seeks     |Fly   |    Timeout|Test|
|       |                        |    | Hours|Count|Count|       |Sectors|Sectors |      |          |Writes|    Count  |Age |
+-------+------------------------+----+------+-----+-----+-------+-------+--------+------+----------+------+-----------+----+
|da0    |PNY21442111050102589    |33  | 12766|  218|     |       |       |        |   N/A|       N/A|   N/A|        N/A|   1|
|da1    |PNY21442111050102587    |33  | 12765|  218|     |       |       |        |   N/A|       N/A|   N/A|        N/A|   1|
|da2    |WD-WX72A9232NL9         |31  |  3118|   69|    0|      0|      0|       0|   N/A|       N/A|   N/A|        N/A|   1|
|da3    |WD-WX42D517A0YV         |33  | 13744|  219|    0|      0|      0|       0|   N/A|       N/A|   N/A|        N/A|   1|
|da4    |WD-WX32DB08YEHU         |33  | 10387|  167|    0|      0|      0|       0|   N/A|       N/A|   N/A|        N/A|   1|
|da5    |WD-WX42D614DP0Z         |33  | 13981|  214|    0|      0|      0|       0|   N/A|       N/A|   N/A|        N/A|   1|
|da6    |WD-WX12D415NYJC         |32  | 11052|  152|    0|      0|      0|       0|   N/A|       N/A|   N/A|        N/A|   1|
|da7    |WD-WX12D41361E7         |32  | 11042|  152|    0|      0|      0|       0|   N/A|       N/A|   N/A|        N/A|   1|
|da8    |WD-WX92DA0LJUKF         |31  | 11344|  161|    0|      0|      0|       0|   N/A|       N/A|   N/A|        N/A|   1|
|da9    |WD-WX92DA0L9CVT         |32  | 11292|  158|    0|      0|      0|       0|   N/A|       N/A|   N/A|        N/A|   1|
|da10   |NF9351W1004740S30B      |25  |   753|   14|     |      0|       |        |   N/A|       N/A|   N/A|        N/A|   1|
|da11   |NHH737W1039820S30B      |25  |   852|   16|     |      0|       |        |   N/A|       N/A|   N/A|        N/A|   1|
|da12   |NF9351W1005300S30B      |34  |   775|   14|     |      0|       |        |   N/A|       N/A|   N/A|        N/A|   1|
|da13   |NHH737W1039630S30B      |23  |   838|   14|     |      0|       |        |   N/A|       N/A|   N/A|        N/A|   1|
|da14 ! |WWY0ACWY                |54* |  3996|   50|    0|      0|      0|       0|     0| 395656636|   N/A| 8590065666|   1|
+-------+------------------------+----+------+-----+-----+-------+-------+--------+------+----------+------+-----------+----+

########## SATA drive /dev/da0 Serial: XXXXXXXXXXX2589 ##########
Phison Driven SSDs (PNY CS900 120GB SSD)
SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       12766
 12 Power_Cycle_Count       0x0012   100   100   000    Old_age   Always       -       218
168 SATA_Phy_Error_Count    0x0012   100   100   000    Old_age   Always       -       0
170 Bad_Blk_Ct_Erl/Lat      0x0003   100   100   000    Pre-fail  Always       -       0/61
173 MaxAvgErase_Ct          0x0012   100   100   000    Old_age   Always       -       22 (Average 1)
192 Unsafe_Shutdown_Count   0x0012   100   100   000    Old_age   Always       -       181
194 Temperature_Celsius     0x0023   067   067   000    Pre-fail  Always       -       33 (Min/Max 33/33)
218 CRC_Error_Count         0x000b   100   100   050    Pre-fail  Always       -       0
231 SSD_Life_Left           0x0013   100   100   000    Pre-fail  Always       -       99
241 Lifetime_Writes_GiB     0x0012   100   100   000    Old_age   Always       -       27

No Errors Logged

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Short offline       Completed without error       00%     12745         -

########## SATA drive /dev/da1 Serial: XXXXXXXXXXX2587 ##########
Phison Driven SSDs (PNY CS900 120GB SSD)
SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       12765
 12 Power_Cycle_Count       0x0012   100   100   000    Old_age   Always       -       218
168 SATA_Phy_Error_Count    0x0012   100   100   000    Old_age   Always       -       0
170 Bad_Blk_Ct_Erl/Lat      0x0003   100   100   000    Pre-fail  Always       -       0/92
173 MaxAvgErase_Ct          0x0012   100   100   000    Old_age   Always       -       25 (Average 1)
192 Unsafe_Shutdown_Count   0x0012   100   100   000    Old_age   Always       -       181
194 Temperature_Celsius     0x0023   067   067   000    Pre-fail  Always       -       33 (Min/Max 33/33)
218 CRC_Error_Count         0x000b   100   100   050    Pre-fail  Always       -       0
231 SSD_Life_Left           0x0013   100   100   000    Pre-fail  Always       -       99
241 Lifetime_Writes_GiB     0x0012   100   100   000    Old_age   Always       -       27

No Errors Logged

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Short offline       Completed without error       00%     12745         -

########## SATA drive /dev/da2 Serial: XXXXXXXXXXX2NL9 ##########
WDC WD20EFZX-68AWUN0
SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   204   203   021    Pre-fail  Always       -       2791
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       70
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       3118
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       69
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       63
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       16
194 Temperature_Celsius     0x0022   116   097   000    Old_age   Always       -       31
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

No Errors Logged

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Short offline       Completed without error       00%     3104          -

########## SATA drive /dev/da3 Serial: XXXXXXXXXXXA0YV ##########
WDC WD20EFZX-68AWUN0
SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   198   197   021    Pre-fail  Always       -       3058
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       224
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   082   082   000    Old_age   Always       -       13744
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       219
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       206
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       50
194 Temperature_Celsius     0x0022   114   098   000    Old_age   Always       -       33
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

No Errors Logged

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Short offline       Completed without error       00%     13723         -

########## SATA drive /dev/da4 Serial: XXXXXXXXXXXYEHU ##########
WDC WD20EFZX-68AWUN0
SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   199   197   021    Pre-fail  Always       -       3016
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       167
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   086   086   000    Old_age   Always       -       10387
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       167
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       161
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       15
194 Temperature_Celsius     0x0022   114   099   000    Old_age   Always       -       33
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

No Errors Logged

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Short offline       Completed without error       00%     10367         -

########## SATA drive /dev/da5 Serial: XXXXXXXXXXXDP0Z ##########
WDC WD20EFZX-68AWUN0
SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   198   197   021    Pre-fail  Always       -       3066
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       215
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   081   081   000    Old_age   Always       -       13981
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       214
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       201
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       48
194 Temperature_Celsius     0x0022   114   100   000    Old_age   Always       -       33
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

No Errors Logged

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Short offline       Completed without error       00%     13960         -

########## SATA drive /dev/da6 Serial: XXXXXXXXXXXNYJC ##########
WDC WD20EFZX-68AWUN0
SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   198   197   021    Pre-fail  Always       -       3100
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       153
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       11052
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       152
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       149
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       14
194 Temperature_Celsius     0x0022   115   101   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

No Errors Logged

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Short offline       Completed without error       00%     11031         -

########## SATA drive /dev/da7 Serial: XXXXXXXXXXX61E7 ##########
WDC WD20EFZX-68AWUN0
SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   198   197   021    Pre-fail  Always       -       3075
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       153
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       11042
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       152
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       149
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       15
194 Temperature_Celsius     0x0022   115   100   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

No Errors Logged

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Short offline       Completed without error       00%     11021         -

########## SATA drive /dev/da8 Serial: WDXXXXXXXXXXXJUKF ##########
WDC WD20EFZX-68AWUN0
SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   197   197   021    Pre-fail  Always       -       3116
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       162
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       11344
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       161
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       158
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       14
194 Temperature_Celsius     0x0022   116   101   000    Old_age   Always       -       31
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

No Errors Logged

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Short offline       Completed without error       00%     11323         -

########## SATA drive /dev/da9 Serial: XXXXXXXXXXX9CVT ##########
WDC WD20EFZX-68AWUN0
SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   198   198   021    Pre-fail  Always       -       3066
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       159
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       11292
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       158
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       155
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       14
194 Temperature_Celsius     0x0022   115   101   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

No Errors Logged

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Short offline       Completed without error       00%     11271         -

########## SATA drive /dev/da10 Serial: XXXXXXXXXXXS30B ##########
Lexar SSD NS100 2TB
SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0013   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       753
 12 Power_Cycle_Count       0x0012   100   100   000    Old_age   Always       -       14
164 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       21475295238
165 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       7
166 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       5
167 Unknown_Attribute       0x0022   100   100   000    Old_age   Always       -       6
194 Temperature_Celsius     0x0022   026   026   000    Old_age   Always       -       26 (Min/Max 20/46)
199 UDMA_CRC_Error_Count    0x0012   100   100   000    Old_age   Always       -       0
241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       27093
242 Total_LBAs_Read         0x0032   100   100   000    Old_age   Always       -       83184

No Errors Logged

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Short offline       Completed without error       00%     731           -

########## SATA drive /dev/da11 Serial: XXXXXXXXXXXS30B ##########
Lexar SSD NS100 2TB
SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0013   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       852
 12 Power_Cycle_Count       0x0012   100   100   000    Old_age   Always       -       16
164 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       12885295108
165 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       6
166 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       3
167 Unknown_Attribute       0x0022   100   100   000    Old_age   Always       -       4
194 Temperature_Celsius     0x0022   024   024   000    Old_age   Always       -       24 (Min/Max 19/38)
199 UDMA_CRC_Error_Count    0x0012   100   100   000    Old_age   Always       -       0
241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       27137
242 Total_LBAs_Read         0x0032   100   100   000    Old_age   Always       -       171598

No Errors Logged

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Short offline       Completed without error       00%     830           -

########## SATA drive /dev/da12 Serial: XXXXXXXXXXXS30B ##########
Lexar SSD NS100 2TB
SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0013   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       775
 12 Power_Cycle_Count       0x0012   100   100   000    Old_age   Always       -       14
164 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       17180327941
165 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       7
166 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       4
167 Unknown_Attribute       0x0022   100   100   000    Old_age   Always       -       5
194 Temperature_Celsius     0x0022   028   028   000    Old_age   Always       -       28 (Min/Max 20/48)
199 UDMA_CRC_Error_Count    0x0012   100   100   000    Old_age   Always       -       0
241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       27108
242 Total_LBAs_Read         0x0032   100   100   000    Old_age   Always       -       108371

No Errors Logged

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Short offline       Completed without error       00%     752           -

########## SATA drive /dev/da13 Serial: XXXXXXXXXXXS30B ##########
Lexar SSD NS100 2TB
SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0013   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       838
 12 Power_Cycle_Count       0x0012   100   100   000    Old_age   Always       -       14
164 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       12885229572
165 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       5
166 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       3
167 Unknown_Attribute       0x0022   100   100   000    Old_age   Always       -       4
194 Temperature_Celsius     0x0022   024   024   000    Old_age   Always       -       24 (Min/Max 19/43)
199 UDMA_CRC_Error_Count    0x0012   100   100   000    Old_age   Always       -       0
241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       27123
242 Total_LBAs_Read         0x0032   100   100   000    Old_age   Always       -       146414

No Errors Logged

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Short offline       Completed without error       00%     816           -

########## SATA drive /dev/da14 Serial: XXXXXXXXXXXACWY ##########
ST10000DM005-3AW101
SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   083   064   044    Pre-fail  Always       -       197014840
  3 Spin_Up_Time            0x0003   088   088   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       73
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   086   060   045    Pre-fail  Always       -       395656653
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       3996
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       50
 18 Unknown_Attribute       0x000b   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       8590065666
190 Airflow_Temperature_Cel 0x0022   046   025   000    Old_age   Always       -       54 (Min/Max 33/64)
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       25
193 Load_Cycle_Count        0x0032   094   094   000    Old_age   Always       -       12181
194 Temperature_Celsius     0x0022   054   075   000    Old_age   Always       -       54 (0 26 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       2304 (96 214 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       12426814994
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       605359293401

No Errors Logged

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Short offline       Completed without error       00%     3975          -



I don't see anything that stands out, but then again, I never see the ketchup in the refrigerator :-/
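
The very large raw values on da14 (Command_Timeout 8590065666, Seek_Error_Rate 395656653) are easier to read once split into fields. Seagate drives are widely reported to pack several counters into one 48-bit raw value; the sketch below assumes that layout (three 16-bit words for Command_Timeout, a 16-bit error count above a 32-bit operation count for the error rates). The layout is an assumption and not something the report itself states, and the function names are made up for illustration.

```python
def split_words(raw):
    """Split a 48-bit raw value into three 16-bit words, highest first."""
    return (raw >> 32) & 0xFFFF, (raw >> 16) & 0xFFFF, raw & 0xFFFF


def split_error_rate(raw):
    """Split a 48-bit raw value into (errors, operations).

    ASSUMED layout: upper 16 bits are the error count, lower 32 bits are
    the number of operations performed.
    """
    return (raw >> 32) & 0xFFFF, raw & 0xFFFFFFFF


if __name__ == "__main__":
    print("Command_Timeout words:", split_words(8590065666))
    print("Seek_Error_Rate      :", split_error_rate(395656653))
    print("Raw_Read_Error_Rate  :", split_error_rate(197014840))
```

Read this way, the Command_Timeout value is three small counters rather than billions of timeouts, and both error-rate values carry an error count of zero. The 54 degree temperature on the same drive is a plain reading and needs no decoding.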

I'm running a scrub on it now
 
Joined
Mar 5, 2022
Messages
224
I just finished the scrub and it completed with no errors. The pool is still marked as online/unhealthy
 
Joined
Mar 5, 2022
Messages
224
So I reseated all of the cables AGAIN and rebooted. Now the pool is marked as online with no errors. I'm going to let this run for a while and report back in a week or sooner.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
What sort of case is this in?
HBA's do suffer from heat issues as they are designed for enterprise type installations with lots of airflow. You might take a look at possible over heating on the LSI card and add an additional fan to keep it cool
 
Joined
Mar 5, 2022
Messages
224
CPU temps are below 40 degrees but I did notice the controller card was pretty warm. Unfortunately it is toward the bottom of the case where there is less air flow. I will look into adding another case fan.

On the other hand, so far, so good. The top-level menu is showing all the pools are up and running.
 
Joined
Mar 5, 2022
Messages
224
It ran overnight and I am getting the same errors:
PXL_20231216_141249681.jpg


I am really at a loss and any help would be most welcome
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
CPU temps are below 40 degrees but I did notice the controller card was pretty warm. Unfortunately it is toward the bottom of the case where there is less air flow. I will look into adding another case fan.
You should take this VERY seriously. An overheated HBA can permanently damage data beyond repair.
 
Joined
Mar 5, 2022
Messages
224
I've moved it to another PCIE slot that is higher in the case and is more open to air flow. I've also removed the case cover for now.
How do I know if it is overheating?

I rebooted and the error cleared temporarily, but it is back. I could re-run the RAID tests, but I assume they would be the same. I honestly have no idea what else to try
 