having issues with my hardware - looking for suggestions

jordanthompson · Nov 7, 2023

I had some issues with heat causing reboots (replacing the heatsink grease made a huge difference.) Now, the only reason the system reboots is because of a power failure (I live in Florida but have a UPS that is only good for so long.)

The previous power supply gave up the ghost, so I replaced it with a brand new EVGA Supernova 1600 a couple of months ago. It had been working fine until recently: the system won't boot unless I remove one of the 14 HDDs. Once it gets to the TrueNAS boot screen, I can safely insert it and it will boot successfully every time. I had a similar problem with the previous power supply, but I would have to remove all of the drives to successfully boot.

The system consists of an Asus M5A99X EVO with 32GB of RAM (pretty sure it's DDR3) and an AMD processor of some sort (AM3+/AM3 socket.) There are two SSDs for the OS, 14 2TB drives for the pools, and a single USB drive with an external power supply for backup. From what I understand, the HDDs only use about 8W each, so there is no way they are the problem unless several are failing. The motherboard should take no more than 150W.

I am not seeing any warnings or errors for the HDDs.
Any idea how to check how much power an HDD is actually consuming?

Any suggestions for my dilemma?

jgreco · Nov 7, 2023

Extensive discussions of the topic exist.

Proper Power Supply Sizing Guidance

I've seen about 1,000 threads like this one where people decide that they can power a dozen hard drives off a 360 watt supply. DO NOT DO THIS. I've seen another 1,000 threads where people decide to buy the cheapest power supply that they can...

www.truenas.com

rvassar · Nov 7, 2023

AM3/AM3+ is getting pretty long in the tooth, and had terrible thermal engineering numbers even by 2011 standards. You didn't mention exactly which CPU you are using, but the Bulldozer & Piledriver CPUs span 95 to 220 watts, and have very poor idle/sleep capability. Using one for a NAS is kind of starting off from a bad place, but... I'm also going to point out that 2TB HDDs are also pretty close to relic status, and most solutions with that many drives implement a staggered spin-up solution to prevent overloading the 12V rail. 14 x 2TB in RAIDZ1 nets you 26TB, down to 14TB mirrored, depending on vdevs and such. You can buy 22TB single drives these days, and 10-14TB drives are available at discount prices. I would consider an upgrade/migration that reduces the device count. A 4 drive solution should be simple, and a 6 drive solution with some capacity expansion within reach. If you're doing block storage with those drive counts for the IOPS rate, you should be moving to SSDs.

Possible lower budget solutions:
1. The EVGA Supernova 1600 appears to have a single 12V rail. Yes, at 133 amps of 12V it's beefy, but a dual-rail 12V supply is probably a better choice.
2. Swap out two of the 2TB drives for 2TB SSDs. Yes, it's dumb/wasteful... But it saves you the spin-up power you seem to need and costs about the same amount as a 1+kW PSU swap in my locale. Bonus: You can keep swapping out a drive every couple weeks at lunch money rates and end up with an all-SSD pool...
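
To put rough numbers on the spin-up concern, here is a back-of-the-envelope sketch in Python. The 8W per drive and 150W motherboard figures come from the first post, and 133A is the 12V rating quoted above. The 2.0A per-drive spin-up current is purely an assumed placeholder (check the drive datasheet), and the function names are made up for illustration. CPU, HBA and fans are not counted.

```python
# Rough 12V budget for all drives spinning up at once.
DRIVES = 14
IDLE_WATTS_PER_DRIVE = 8.0   # from the first post
BOARD_WATTS = 150.0          # from the first post
RAIL_AMPS_12V = 133.0        # PSU 12V rating quoted above
SPINUP_AMPS_12V = 2.0        # ASSUMPTION: per-drive surge, not a measured value


def steady_state_watts(drives=DRIVES):
    """Drives at idle plus the motherboard estimate."""
    return drives * IDLE_WATTS_PER_DRIVE + BOARD_WATTS


def spinup_amps(drives=DRIVES, amps_per_drive=SPINUP_AMPS_12V):
    """Worst case: every drive draws its spin-up current simultaneously."""
    return drives * amps_per_drive


if __name__ == "__main__":
    print(f"steady state: {steady_state_watts():.0f} W")
    print(f"spin-up surge: {spinup_amps():.0f} A of {RAIL_AMPS_12V:.0f} A on 12V")
```

If the assumed surge figure is anywhere near right, the total stays far below the rail rating on paper, so raw PSU capacity alone may not explain the boot failure; cabling, splitters, or a marginal drive would be worth a look too.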

jordanthompson · Nov 8, 2023

I'm slowly cozying up to this idea... I had no idea that SSDs had dropped so much in price! I'm seeing 2TB SSDs for under $70! Would you recommend one over another for a server application? AFAIK, as long as the transfer rate is equivalent, they are all pretty much the same.

rvassar · Nov 8, 2023

For a server application you want to check the lifetime TBW rating, and power failure protection status. Remember it's the writing that wears out SSD memory cells, so you need to determine what your write/rewrite rate is and plan accordingly. ZFS will beat up a weak SSD, so check the reviews at places catering to IT pros. I'd probably avoid QLC NAND at this point. But you may find the cost trade-off vs redundancy immaterial if you're running a multi-vdev RAIDZ2 that can soak up a lot of failures vs, say, a 7 vdev mirror-stripe that is always 2 failures away from loss. Only you know your config & workload, and can determine your aversion to risk.
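
As a rough illustration of "determine your write rate and plan accordingly", the sketch below projects how long a drive would take to reach its rated TBW at the write rate observed so far. It assumes the drive reports lifetime writes in GiB and power-on hours via SMART, and that the workload stays constant. The function name and every number in the usage line are hypothetical.

```python
def years_until_tbw(tbw_rating_tb, written_gib, power_on_hours):
    """Estimate years until the rated TBW is reached at the observed write rate."""
    if written_gib <= 0 or power_on_hours <= 0:
        raise ValueError("need positive lifetime writes and power-on hours")
    written_tb = written_gib * 2**30 / 1e12              # GiB -> decimal TB
    tb_per_year = written_tb / power_on_hours * 24 * 365
    remaining_tb = max(tbw_rating_tb - written_tb, 0.0)
    return remaining_tb / tb_per_year


if __name__ == "__main__":
    # Hypothetical drive: 600 TBW rating, 5000 GiB written over one year of uptime.
    print(f"{years_until_tbw(600, 5000, 8760):.1f} years at the current rate")
```

This ignores write amplification inside the drive, so treat the result as optimistic.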

Keep in mind, SSDs tend to fail suddenly and differently than an HDD. There's no weeks or months of bearing noises and head actuator reset clicks. So make sure the models you do deploy have proper SMART reporting and offer a media wear-out indicator.
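
One way to check whether a given SSD exposes such an indicator is to scan its `smartctl -A` attribute table. The Python sketch below does that for pasted output. The set of attribute names is an assumed, incomplete list (vendors name and scale these differently), the helper name is made up, and the sample row is just an illustration of the usual column layout.

```python
# ASSUMPTION: a few attribute names that commonly carry wear information.
WEAR_ATTRIBUTES = {
    "SSD_Life_Left",
    "Media_Wearout_Indicator",
    "Wear_Leveling_Count",
    "Percent_Lifetime_Remain",
}

SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       12766
231 SSD_Life_Left           0x0013   100   100   000    Pre-fail  Always       -       99
"""


def find_wear_attributes(smartctl_text, names=WEAR_ATTRIBUTES):
    """Return {attribute_name: (normalized_value, raw_value)} for wear rows."""
    found = {}
    for line in smartctl_text.splitlines():
        fields = line.split(None, 9)      # RAW_VALUE may itself contain spaces
        if len(fields) == 10 and fields[0].isdigit() and fields[1] in names:
            found[fields[1]] = (int(fields[3]), fields[9].strip())
    return found


if __name__ == "__main__":
    print(find_wear_attributes(SAMPLE) or "no wear indicator found")
```

An empty result does not prove the drive lacks a wear indicator, only that none of the assumed names matched.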

I concluded that an SSD pool would wear out too quickly for my applications, so I have a two pool solution. VMs & performance stuff on a block storage optimized pool, and bulk data like surveillance video on a RAIDZ2 HDD pool. The block storage will move to SSD soon.

jordanthompson · Nov 8, 2023

I think this will work great for at least one of my pools (Plex). I rarely add media but occasionally review it. Four of my drives are dedicated to it, so I'm planning to replace 4 spinning disks with 4 SSDs. I've been reading that the cheap ones don't have decent (any?) DRAM cache, but for my application, that shouldn't be an issue. At some point I'll be moving from CORE to SCALE and will look at larger drives then for my pool (and reduce the number required.)

jordanthompson · Nov 19, 2023

I've replaced my four Plex pool drives and the reboot issue is gone. I am now getting what looks like controller warnings: smid command timeout on target:

(perhaps the card was damaged by low voltage?)

The card I have now is an LSI 9207-8i (LSI 9207-8i RAID Controller Card 6Gbs SAS SATA PCI-E 3.0 HBA IT Mode Expander Card for ZFS FreeNAS unRAID, https://a.co/d/4pB74Cy).

Any suggestions for a replacement (or is it a good card that's just failing?)

rvassar · Nov 19, 2023

All clustered on targets 4 - 7? Or is it random? Bad cable? Bend stress, etc...? If it's clustered on a group of 4 addresses, it's likely the cable.

jordanthompson · Nov 20, 2023

Forgive my ignorance, but I thought each HD had its own cable between the card and the HD.

rvassar · Nov 20, 2023

I'm not familiar with your hardware, case, backplane(s), etc... I saw "LSI controller" and assumed there was a breakout cable of some sort, ala: SFF-8087 to 4 x SATA, which would give a common point of failure for a set of 4 drives. Only you know your cabling, I was just suggesting things to inspect.

If the controller is failing it would likely be producing errors on all the drives connected to it, not just a select grouping. The screenshot only showed a group of 4, which may not even be the case; I may have just inferred that incorrectly. Generally, the LSI controllers will throw a fault when they detect controller failures and the firmware halts until it gets a PCIe reset, but you'd need a debug cable to see it. You would only see the controller go missing abruptly, and if you have an advanced enough BIOS with some kind of LSI aware LOM, perhaps a bit more detail about a controller fault.
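
The "clustered on a group of 4" check can be done by hand, but here is a small Python sketch of the idea. It assumes target numbers map contiguously onto breakout cables (targets 0-3 on the first, 4-7 on the second), which may not hold for every controller or backplane. The function names and the target lists in the usage lines are hypothetical, not taken from the screenshot.

```python
from collections import Counter


def timeouts_by_cable(targets, lanes_per_cable=4):
    """Count timeout events per assumed cable group (target // lanes_per_cable)."""
    return Counter(t // lanes_per_cable for t in targets)


def looks_clustered(targets, lanes_per_cable=4):
    """True if every timeout falls within a single assumed cable group."""
    return len(timeouts_by_cable(targets, lanes_per_cable)) == 1


if __name__ == "__main__":
    print(looks_clustered([4, 5, 7, 6, 4]))   # all within targets 4-7
    print(looks_clustered([1, 5, 9]))         # spread across groups
```

A clustered result points toward the shared cable or connector; a spread result points more toward the controller or power.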

jordanthompson · Nov 21, 2023

Actually, I think you may be on to something. That controller card handles a total of four drives. The rest of the drives are handled directly from the motherboard. I'll check which drives it handles and try moving the new SSDs to this controller. There is less traffic on the pool they solely comprise, so hopefully this problem may actually go away.

I had one of these same controller cards fail just at the end of the warranty. The seller graciously sent me this replacement (which worked fine for the last year or so.) This is why I was asking about replacement suggestions.

rvassar · Nov 21, 2023

jordanthompson said:
Actually, I think you may be on to something. That controller card handles a total of four drives. The rest of the drives are handled directly from the motherboard.

Ok. In that case, you have to expand the list of suspects to the controller as well, just as you suggest. Remember, ZFS identifies drives by UUID. You can swap cables between a motherboard drive and an LSI controller drive, and it should just find them and import them into the correct pool at boot regardless. ZFS handles drive roaming quite seamlessly, and this may help you narrow down the problem.

jordanthompson · Dec 14, 2023

Not sure whether to start another thread or not. I'm sure someone will tell me if I am wrong ;-)

I replaced my controller card with a 16 port card (SVNXINGTII SAS9300-16i 9300-16i 16-Port 12Gb/s SAS-3 SAS/SATA PCIe HBA IT Model ZFS) from Amazon yesterday. I have put ALL of my drives on this new card.

I saw this error on the console earlier today:

I looked up the error from the console and saw that it could be a loose cable or connector, so I moved the connectors around (to see if the error stayed on the same drive), plugged them back in (to reset the other end of the cables), and rebooted.

Then I noticed this error (it could be that the pool was degraded earlier, but unfortunately I didn't look at the web page to see):

I ran the smart_report.sh script from FreeNAS-scripts-master and got this:


########## SMART status report summary for all SATA drives on server PHONG ##########

+-------+------------------------+----+------+-----+-----+-------+-------+--------+------+----------+------+-----------+----+
|Device |Serial                  |Temp| Power|Start|Spin |ReAlloc|Current|Offline |Seek  |Total     |High  |    Command|Last|
|       |Number                  |    | On   |Stop |Retry|Sectors|Pending|Uncorrec|Errors|Seeks     |Fly   |    Timeout|Test|
|       |                        |    | Hours|Count|Count|       |Sectors|Sectors |      |          |Writes|    Count  |Age |
+-------+------------------------+----+------+-----+-----+-------+-------+--------+------+----------+------+-----------+----+
|da0    |PNY21442111050102589    |33  | 12766|  218|     |       |       |        |   N/A|       N/A|   N/A|        N/A|   1|
|da1    |PNY21442111050102587    |33  | 12765|  218|     |       |       |        |   N/A|       N/A|   N/A|        N/A|   1|
|da2    |WD-WX72A9232NL9         |31  |  3118|   69|    0|      0|      0|       0|   N/A|       N/A|   N/A|        N/A|   1|
|da3    |WD-WX42D517A0YV         |33  | 13744|  219|    0|      0|      0|       0|   N/A|       N/A|   N/A|        N/A|   1|
|da4    |WD-WX32DB08YEHU         |33  | 10387|  167|    0|      0|      0|       0|   N/A|       N/A|   N/A|        N/A|   1|
|da5    |WD-WX42D614DP0Z         |33  | 13981|  214|    0|      0|      0|       0|   N/A|       N/A|   N/A|        N/A|   1|
|da6    |WD-WX12D415NYJC         |32  | 11052|  152|    0|      0|      0|       0|   N/A|       N/A|   N/A|        N/A|   1|
|da7    |WD-WX12D41361E7         |32  | 11042|  152|    0|      0|      0|       0|   N/A|       N/A|   N/A|        N/A|   1|
|da8    |WD-WX92DA0LJUKF         |31  | 11344|  161|    0|      0|      0|       0|   N/A|       N/A|   N/A|        N/A|   1|
|da9    |WD-WX92DA0L9CVT         |32  | 11292|  158|    0|      0|      0|       0|   N/A|       N/A|   N/A|        N/A|   1|
|da10   |NF9351W1004740S30B      |25  |   753|   14|     |      0|       |        |   N/A|       N/A|   N/A|        N/A|   1|
|da11   |NHH737W1039820S30B      |25  |   852|   16|     |      0|       |        |   N/A|       N/A|   N/A|        N/A|   1|
|da12   |NF9351W1005300S30B      |34  |   775|   14|     |      0|       |        |   N/A|       N/A|   N/A|        N/A|   1|
|da13   |NHH737W1039630S30B      |23  |   838|   14|     |      0|       |        |   N/A|       N/A|   N/A|        N/A|   1|
|da14 ! |WWY0ACWY                |54* |  3996|   50|    0|      0|      0|       0|     0| 395656636|   N/A| 8590065666|   1|
+-------+------------------------+----+------+-----+-----+-------+-------+--------+------+----------+------+-----------+----+
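
The Command Timeout figure for da14 may not be a plain count. Some drives are reported to pack several 16-bit counters into the 48-bit raw value of this attribute; whether da14 does so is an assumption, since its model isn't shown in the summary. A quick Python check (the helper name is made up):

```python
def split_packed_raw(raw):
    """Split a 48-bit SMART raw value into three 16-bit counters (high, mid, low)."""
    return ((raw >> 32) & 0xFFFF, (raw >> 16) & 0xFFFF, raw & 0xFFFF)


if __name__ == "__main__":
    print(split_packed_raw(8590065666))   # da14's Command Timeout column
```

For 8590065666 this gives (2, 2, 2), i.e. 0x000200020002, which under that assumption would read as a handful of timeouts rather than billions. The 54* temperature flag on the same drive is a separate matter.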

########## SATA drive /dev/da0 Serial: XXXXXXXXXXX2589
########## Phison Driven SSDs (PNY CS900 120GB SSD)

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       12766
 12 Power_Cycle_Count       0x0012   100   100   000    Old_age   Always       -       218
168 SATA_Phy_Error_Count    0x0012   100   100   000    Old_age   Always       -       0
170 Bad_Blk_Ct_Erl/Lat      0x0003   100   100   000    Pre-fail  Always       -       0/61
173 MaxAvgErase_Ct          0x0012   100   100   000    Old_age   Always       -       22 (Average 1)
192 Unsafe_Shutdown_Count   0x0012   100   100   000    Old_age   Always       -       181
194 Temperature_Celsius     0x0023   067   067   000    Pre-fail  Always       -       33 (Min/Max 33/33)
218 CRC_Error_Count         0x000b   100   100   050    Pre-fail  Always       -       0
231 SSD_Life_Left           0x0013   100   100   000    Pre-fail  Always       -       99
241 Lifetime_Writes_GiB     0x0012   100   100   000    Old_age   Always       -       27

No Errors Logged

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Short offline       Completed without error       00%     12745         -

########## SATA drive /dev/da1 Serial: XXXXXXXXXXX2587
########## Phison Driven SSDs (PNY CS900 120GB SSD)

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       12765
 12 Power_Cycle_Count       0x0012   100   100   000    Old_age   Always       -       218
168 SATA_Phy_Error_Count    0x0012   100   100   000    Old_age   Always       -       0
170 Bad_Blk_Ct_Erl/Lat      0x0003   100   100   000    Pre-fail  Always       -       0/92
173 MaxAvgErase_Ct          0x0012   100   100   000    Old_age   Always       -       25 (Average 1)
192 Unsafe_Shutdown_Count   0x0012   100   100   000    Old_age   Always       -       181
194 Temperature_Celsius     0x0023   067   067   000    Pre-fail  Always       -       33 (Min/Max 33/33)
218 CRC_Error_Count         0x000b   100   100   050    Pre-fail  Always       -       0
231 SSD_Life_Left           0x0013   100   100   000    Pre-fail  Always       -       99
241 Lifetime_Writes_GiB     0x0012   100   100   000    Old_age   Always       -       27

No Errors Logged

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Short offline       Completed without error       00%     12745         -

########## SATA drive /dev/da2 Serial: XXXXXXXXXXX2NL9
########## WDC WD20EFZX-68AWUN0

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   204   203   021    Pre-fail  Always       -       2791
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       70
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       3118
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       69
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       63
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       16
194 Temperature_Celsius     0x0022   116   097   000    Old_age   Always       -       31
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

No Errors Logged

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Short offline       Completed without error       00%      3104         -

########## SATA drive /dev/da3 Serial: XXXXXXXXXXXA0YV
########## WDC WD20EFZX-68AWUN0

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   198   197   021    Pre-fail  Always       -       3058
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       224
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   082   082   000    Old_age   Always       -       13744
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       219
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       206
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       50
194 Temperature_Celsius     0x0022   114   098   000    Old_age   Always       -       33
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

No Errors Logged

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Short offline       Completed without error       00%     13723         -

########## SATA drive /dev/da4 Serial: XXXXXXXXXXXYEHU
########## WDC WD20EFZX-68AWUN0

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   199   197   021    Pre-fail  Always       -       3016
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       167
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   086   086   000    Old_age   Always       -       10387
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       167
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       161
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       15
194 Temperature_Celsius     0x0022   114   099   000    Old_age   Always       -       33
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

No Errors Logged

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Short offline       Completed without error       00%     10367         -

########## SATA drive /dev/da5 Serial: XXXXXXXXXXXDP0Z
########## WDC WD20EFZX-68AWUN0

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   198   197   021    Pre-fail  Always       -       3066
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       215
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   081   081   000    Old_age   Always       -       13981
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       214
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       201
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       48
194 Temperature_Celsius     0x0022   114   100   000    Old_age   Always       -       33
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

No Errors Logged

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Short offline       Completed without error       00%     13960         -

########## SATA drive /dev/da6 Serial: XXXXXXXXXXXNYJC
########## WDC WD20EFZX-68AWUN0

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   198   197   021    Pre-fail  Always       -       3100
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       153
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       11052
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       152
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       149
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       14
194 Temperature_Celsius     0x0022   115   101   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

No Errors Logged

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Short offline       Completed without error       00%     11031         -

########## SATA drive /dev/da7 Serial: XXXXXXXXXXX61E7
########## WDC WD20EFZX-68AWUN0

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   198   197   021    Pre-fail  Always       -       3075
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       153
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       11042
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       152
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       149
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       15
194 Temperature_Celsius     0x0022   115   100   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

No Errors Logged

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Short offline       Completed without error       00%     11021         -

########## SATA drive /dev/da8 Serial: WDXXXXXXXXXXXJUKF
########## WDC WD20EFZX-68AWUN0

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   197   197   021    Pre-fail  Always       -       3116
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       162
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       11344
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       161
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       158
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       14
194 Temperature_Celsius     0x0022   116   101   000    Old_age   Always       -       31
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

No Errors Logged

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Short offline       Completed without error       00%     11323         -

########## SATA drive /dev/da9 Serial: XXXXXXXXXXX9CVT
########## WDC WD20EFZX-68AWUN0

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   198   198   021    Pre-fail  Always       -       3066
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       159
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       11292
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       158
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       155
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       14
194 Temperature_Celsius     0x0022   115   101   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

No Errors Logged

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Short offline       Completed without error       00%     11271         -

########## SATA drive /dev/da10 Serial: XXXXXXXXXXXS30B
########## Lexar SSD NS100 2TB

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0013   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       753
 12 Power_Cycle_Count       0x0012   100   100   000    Old_age   Always       -       14
164 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       21475295238
165 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       7
166 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       5
167 Unknown_Attribute       0x0022   100   100   000    Old_age   Always       -       6
194 Temperature_Celsius     0x0022   026   026   000    Old_age   Always       -       26 (Min/Max 20/46)
199 UDMA_CRC_Error_Count    0x0012   100   100   000    Old_age   Always       -       0
241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       27093
242 Total_LBAs_Read         0x0032   100   100   000    Old_age   Always       -       83184

No Errors Logged

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Short offline       Completed without error       00%       731         -

########## SATA drive /dev/da11 Serial: XXXXXXXXXXXS30B
########## Lexar SSD NS100 2TB

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0013   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       852
 12 Power_Cycle_Count       0x0012   100   100   000    Old_age   Always       -       16
164 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       12885295108
165 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       6
166 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       3
167 Unknown_Attribute       0x0022   100   100   000    Old_age   Always       -       4
194 Temperature_Celsius     0x0022   024   024   000    Old_age   Always       -       24 (Min/Max 19/38)
199 UDMA_CRC_Error_Count    0x0012   100   100   000    Old_age   Always       -       0
241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       27137
242 Total_LBAs_Read         0x0032   100   100   000    Old_age   Always       -       171598

No Errors Logged

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Short offline       Completed without error       00%       830         -

########## SATA drive /dev/da12 Serial: XXXXXXXXXXXS30B
########## Lexar SSD NS100 2TB

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0013   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       775
 12 Power_Cycle_Count       0x0012   100   100   000    Old_age   Always       -       14
164 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       17180327941
165 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       7
166 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       4
167 Unknown_Attribute       0x0022   100   100   000    Old_age   Always       -       5
194 Temperature_Celsius     0x0022   028   028   000    Old_age   Always       -       28 (Min/Max 20/48)
199 UDMA_CRC_Error_Count    0x0012   100   100   000    Old_age   Always       -       0
241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       27108
242 Total_LBAs_Read         0x0032   100   100   000    Old_age   Always       -       108371

No Errors Logged

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Short offline       Completed without error       00%       752         -

########## SATA drive /dev/da13 Serial: XXXXXXXXXXXS30B
########## Lexar SSD NS100 2TB

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0013   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       838
 12 Power_Cycle_Count       0x0012   100   100   000    Old_age   Always       -       14
164 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       12885229572
165 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       5
166 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       3
167 Unknown_Attribute       0x0022   100   100   000    Old_age   Always       -       4
194 Temperature_Celsius     0x0022   024   024   000    Old_age   Always       -       24 (Min/Max 19/43)
199 UDMA_CRC_Error_Count    0x0012   100   100   000    Old_age   Always       -       0
241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       27123
242 Total_LBAs_Read         0x0032   100   100   000    Old_age   Always       -       146414

No Errors Logged

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Short offline       Completed without error       00%       816         -

########## SATA drive /dev/da14 Serial: XXXXXXXXXXXACWY
########## ST10000DM005-3AW101

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   083   064   044    Pre-fail  Always       -       197014840
  3 Spin_Up_Time            0x0003   088   088   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       73
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   086   060   045    Pre-fail  Always       -       395656653
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       3996
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       50
 18 Unknown_Attribute       0x000b   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       8590065666
190 Airflow_Temperature_Cel 0x0022   046   025   000    Old_age   Always       -       54 (Min/Max 33/64)
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       25
193 Load_Cycle_Count        0x0032   094   094   000    Old_age   Always       -       12181
194 Temperature_Celsius     0x0022   054   075   000    Old_age   Always       -       54 (0 26 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       2304 (96 214 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       12426814994
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       605359293401

No Errors Logged

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Short offline       Completed without error       00%      3975         -

I don't see anything that stands out, but then again, I never see the ketchup in the refrigerator :-/

I'm running a scrub on it now.
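With this many drives it is easy to miss one odd value by eye, so here is a minimal sketch of scanning the attribute tables with a script. It assumes the smartctl attribute layout shown above, with the raw value as the tenth column. The helper names (`parse_attributes`, `flag_hot`) and the 50 degree limit are illustrative assumptions, not anything from TrueNAS or smartmontools. The two sample rows are copied from the /dev/da9 and /dev/da14 reports above.

```python
import re

# One row of a smartctl attribute table:
# ID, NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}\s+\d+\s+\d+\s+\d+\s+"
    r"\S+\s+\S+\s+\S+\s+(\d+)"
)


def parse_attributes(text):
    """Return {attribute_name: leading integer of the raw value}."""
    attrs = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            attrs[match.group(2)] = int(match.group(3))
    return attrs


def flag_hot(reports, limit_c=50):
    """Return sorted (device, temperature) pairs at or above limit_c.

    limit_c is an assumed threshold for illustration; check the drive's
    data sheet for its real operating range.
    """
    hot = []
    for device, text in reports.items():
        temp = parse_attributes(text).get("Temperature_Celsius")
        if temp is not None and temp >= limit_c:
            hot.append((device, temp))
    return sorted(hot)


# Sample rows copied from the reports above.
reports = {
    "/dev/da9": "194 Temperature_Celsius     0x0022   115   101   000"
                "    Old_age   Always       -       32",
    "/dev/da14": "194 Temperature_Celsius     0x0022   054   075   000"
                 "    Old_age   Always       -       54 (0 26 0 0 0)",
}

for device, temp in flag_hot(reports):
    print(f"{device}: {temp} C")
```

Run against the full output, a check like this would single out /dev/da14 (the ST10000DM005), which reports 54 degrees while the other drives listed here sit between 24 and 32.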

jordanthompson · Dec 14, 2023

I just finished the scrub and it completed with no errors, but the pool is still marked as online/unhealthy.

jordanthompson · Dec 14, 2023

So I reseated all of the cables AGAIN and rebooted. Now the pool is marked as online with no errors. I'm going to let this run for a while and report back in a week or sooner.

NugentS · Dec 15, 2023

What sort of case is this in?
HBAs do suffer from heat issues, as they are designed for enterprise-type installations with lots of airflow. You might check the LSI card for overheating and add an additional fan to keep it cool.

jordanthompson · Dec 16, 2023

CPU temps are below 40 degrees, but I did notice the controller card was pretty warm. Unfortunately it is toward the bottom of the case, where there is less airflow. I will look into adding another case fan.

On the other hand, so far, so good. The top-level menu is showing all the pools are up and running.

jordanthompson · Dec 16, 2023

It ran overnight and I am getting the same errors:

I am really at a loss, and any help would be most welcome.

ChrisRJ · Dec 16, 2023

jordanthompson said:
CPU temps are below 40 degrees, but I did notice the controller card was pretty warm. Unfortunately it is toward the bottom of the case, where there is less airflow. I will look into adding another case fan.

You should take this VERY seriously. An overheated HBA can permanently damage data beyond repair.

jordanthompson · Dec 16, 2023

I've moved it to another PCIe slot that is higher in the case and more open to airflow. I've also removed the case cover for now.
How do I know if it is overheating?

I rebooted and the error cleared temporarily, but it is back. I could re-run the RAID tests, but I assume the results would be the same. I honestly have no idea what else to try.
