TrueNAS SCALE pool degraded after a power outage: 15 read errors on random drives (R720xd)

cole4011

Dabbler
Joined
Sep 29, 2023
Messages
18
Recently I had a power outage that has affected my TrueNAS SCALE system. Since then, different drives have been marked as faulted with 15 read errors during scrub tasks. I'm 80% sure the drives are fine, as they are only 2-3 months old and it's not the same drive or the same slot each time. The pool is all-SSD: a 6-wide RAIDZ2 vdev with 2 hot spares. The data is backed up in multiple locations and safe, so that's not a concern. However, I do not want to rebuild the pool or nuke the data. I'd like to recover from this in place, to prove it can be fixed without data loss in the event it happens again.

System info:
  • TrueNAS-SCALE-22.12.3.3
  • R720xd, 24 x 2.5" bays, with rear flex bay
  • 2x Xeon E5-2697 v2
  • H710 HBA (IT mode)
  • 256 GB ECC Registered DDR3 RAM
  • 10GbE networking card
  • 2x 1100 W PSU
  • 8x Samsung_SSD_870_EVO_4TB (6-wide RAIDZ2 vdev with 2 hot spares)
  • 2x Samsung_SSD_870_EVO_250GB (mirrored boot pool)
The R720xd has a mirrored boot pool in the rear flex bay and 8 drives in the 24-bay backplane at the front of the server.

What I've Tried:
  • Replaced the HBA (multiple times. I even bought an HBA already in IT mode from the Art of Server on eBay, and I've flashed 2 HBAs to IT mode myself, an H310 and an H710)
  • Replaced the backplane and SAS cables (more SAS cables on the way)
  • Replaced the PSUs
  • Replaced the RAM
  • Replaced the motherboard
  • Swapped drive locations (not including the rear flex bay)
SMART Output:

1. S6PJNS0W501176B

Code:
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       2094
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       79
177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       8
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   062   060   000    Old_age   Always       -       38
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       73
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       12615480528
252 Added_Bad_Flash_Blk_Ct  0x0032   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

2. S6PJNS0W500034H
Code:
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       2094
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       76
177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       7
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   062   060   000    Old_age   Always       -       38
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       74
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       11553271266
252 Added_Bad_Flash_Blk_Ct  0x0032   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

3. S6PJNS0W500042J
Code:
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       2089
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       84
177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       8
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   063   045   000    Old_age   Always       -       37
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   099   099   000    Old_age   Always       -       1
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       81
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       16103206435
252 Added_Bad_Flash_Blk_Ct  0x0032   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

4. S6PJNS0W502767R
Code:
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       2093
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       79
177 Wear_Leveling_Count     0x0013   100   100   000    Pre-fail  Always       -       0
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   063   060   000    Old_age   Always       -       37
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       76
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       3863904288
252 Added_Bad_Flash_Blk_Ct  0x0032   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

5. S6PJNS0W501290Y
Code:
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       408
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       111
177 Wear_Leveling_Count     0x0013   100   100   000    Pre-fail  Always       -       0
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   064   050   000    Old_age   Always       -       36
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   099   099   000    Old_age   Always       -       1
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       105
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       5267480562
252 Added_Bad_Flash_Blk_Ct  0x0032   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

6. S6PJNS0W501175A
Code:
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       2094
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       75
177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       7
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   062   058   000    Old_age   Always       -       38
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       73
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       12045385001
252 Added_Bad_Flash_Blk_Ct  0x0032   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

7. S6PJNS0W501171K
Code:
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       2094
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       73
177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       8
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   074   057   000    Old_age   Always       -       26
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       70
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       18555807651
252 Added_Bad_Flash_Blk_Ct  0x0032   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

8. S6PJNS0W501170P
Code:
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       2094
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       80
177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       11
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   074   058   000    Old_age   Always       -       26
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   099   099   000    Old_age   Always       -       3
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       77
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       40138025532
252 Added_Bad_Flash_Blk_Ct  0x0032   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged


Notes on SMART Tests:
I know I have some CRC errors on 3 drives, but this doesn't really concern me: those counts have not increased any further, and I'm only getting read errors, no checksum errors. I did get a write error one time, but I don't have that data to share and it has not occurred again. I think the drives are fine!
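
For anyone who wants to watch the CRC counter over time instead of eyeballing the tables, here is a minimal Python sketch (not something that was run for this post). It assumes the ten-column attribute table layout shown above and a plain integer in the RAW_VALUE column. The function name `smart_raw_value` is made up for illustration, and the sample rows are copied from drive 8 above.

```python
from typing import Optional


def smart_raw_value(smart_text: str, attr_id: int) -> Optional[int]:
    """Return the RAW_VALUE of one attribute from `smartctl -A` style text.

    Assumes the ten-column table shown in this thread:
    ID#, NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE.
    Returns None when the attribute is not present.
    """
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows start with the numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0] == str(attr_id):
            return int(fields[9])
    return None


# Sample rows copied from drive 8 (S6PJNS0W501170P) in this post.
SAMPLE = """\
190 Airflow_Temperature_Cel 0x0032   074   058   000    Old_age   Always       -       26
199 CRC_Error_Count         0x003e   099   099   000    Old_age   Always       -       3
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       77
"""

if __name__ == "__main__":
    print("CRC_Error_Count raw value:", smart_raw_value(SAMPLE, 199))
```

Saving the output once per day and comparing the value for ID 199 would show whether the count is still climbing, which usually points at cabling, backplane, or controller rather than the flash itself.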

Latest Resilver Email:
Code:
ZFS has finished a resilver:

   eid: 51
 class: resilver_finish
  host: truenas-scale
  time: 2023-10-05 03:13:46-0400
  pool: mainframe
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
    Sufficient replicas exist for the pool to continue functioning in a
    degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
    repaired.
  scan: resilvered 990G in 00:40:39 with 0 errors on Thu Oct  5 03:13:46 2023
config:

    NAME                                        STATE     READ WRITE CKSUM
    mainframe                                   DEGRADED     0     0     0
      raidz2-0                                  DEGRADED     0     0     0
        e7ca087b-3fcc-4c87-a72e-269a50751068    ONLINE       0     0     0
        spare-1                                 DEGRADED     0     0     0
          4ae20dae-448b-43f8-93f7-6a3577743cba  FAULTED     15     0     0  too many errors
          2bdfb31c-c361-4a15-8ee7-9f83f938b1b7  ONLINE       0     0     0
        a6b4e4cf-4d8a-4363-bb95-e7d88b6874cc    ONLINE       0     0     0
        1c25a11a-96bb-4d61-830e-8b80684b2590    ONLINE       0     0     0
        f5015a2d-30f7-42f9-ae18-5c2e6feade1b    ONLINE       0     0     0
        87d21052-4dbe-4bdf-afa9-94aa2353e678    ONLINE       0     0     0
    spares
      2bdfb31c-c361-4a15-8ee7-9f83f938b1b7      INUSE     currently in use
      adb4cb64-58fa-4d18-b965-0afcad5dac81      AVAIL  

errors: No known data errors
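
A rough sketch, assuming the column layout of the status block above, for picking out the rows that are FAULTED or have non-zero READ/WRITE/CKSUM counters. The names `DeviceRow` and `problem_devices` are hypothetical, and counters that ZFS abbreviates (for example with a K suffix) are skipped by this simple version.

```python
from typing import List, NamedTuple


class DeviceRow(NamedTuple):
    name: str
    state: str
    read: int
    write: int
    cksum: int


def problem_devices(status_text: str) -> List[DeviceRow]:
    """Return config rows that are FAULTED or have any non-zero error counter."""
    rows = []
    for line in status_text.splitlines():
        fields = line.split()
        # Device rows look like: NAME STATE READ WRITE CKSUM [optional note]
        if len(fields) < 5 or not all(f.isdigit() for f in fields[2:5]):
            continue
        row = DeviceRow(fields[0], fields[1], *(int(f) for f in fields[2:5]))
        if row.state == "FAULTED" or (row.read, row.write, row.cksum) != (0, 0, 0):
            rows.append(row)
    return rows


# Rows copied from the resilver email above.
SAMPLE = """\
    NAME                                        STATE     READ WRITE CKSUM
    mainframe                                   DEGRADED     0     0     0
      raidz2-0                                  DEGRADED     0     0     0
        spare-1                                 DEGRADED     0     0     0
          4ae20dae-448b-43f8-93f7-6a3577743cba  FAULTED     15     0     0  too many errors
          2bdfb31c-c361-4a15-8ee7-9f83f938b1b7  ONLINE       0     0     0
"""

if __name__ == "__main__":
    for row in problem_devices(SAMPLE):
        print(row.name, row.state, row.read, row.write, row.cksum)
```

Logging the result after every scrub makes it easier to confirm that the faulted device really does move around between runs.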


Final Notes:
Brain hurts. Not sure how to proceed; I'll probably replace the CPUs next, depending on what the community has to say. I've dropped BANK on an automatic generator system to make sure this never happens again. Knock on wood. I'd like to fix the server. I could get another R720xd for $200 and call it a day, but that doesn't help figure out the issue.

I've read a LOT in the forums and tried to avoid posting but I'm at my wit's end.

This is my first post. Hopefully, I've included all the relevant information. How to proceed?
Thank you in advance for any assistance.
 

cole4011

This may be useful as well. Sorry that it's a screenshot and not a clean code block.
 

Attachments

  • testytest.PNG

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Please post the output of zpool status -v in code blocks.

You might consider:
1. Running a scrub
2. Running zpool clear
3. Running a scrub

See what happens
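
The three steps above can be sketched as a small Python wrapper. This is only an illustration under a few assumptions: it uses the pool name `mainframe` from the first post, it relies on the `-w` (wait) flag of `zpool scrub`, which needs OpenZFS 2.0 or later, and it must run as root when not in dry-run mode. The function names are made up for this example.

```python
import subprocess
from typing import List


def scrub_clear_scrub(pool: str) -> List[List[str]]:
    """Build the commands for the scrub / clear / scrub sequence."""
    return [
        ["zpool", "scrub", "-w", pool],  # -w: wait until the scrub finishes
        ["zpool", "clear", pool],        # reset the pool's error counters
        ["zpool", "scrub", "-w", pool],  # scrub again with clean counters
    ]


def run_sequence(pool: str, dry_run: bool = True) -> None:
    """Print each command and, unless dry_run is set, execute it in order."""
    for cmd in scrub_clear_scrub(pool):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run by default: only prints the commands.
    run_sequence("mainframe")
```

Checking zpool status -v after each scrub shows whether errors come back once the counters start from zero.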
 

cole4011

Code:
zpool status -v
  pool: boot-pool
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 00:00:22 with 0 errors on Sat Sep 30 03:45:23 2023
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda3    ONLINE       0     0     0
            sdc3    ONLINE       0     0     0

errors: No known data errors

  pool: mainframe
 state: ONLINE
  scan: resilvered 2.22M in 00:00:00 with 0 errors on Thu Oct  5 05:39:55 2023
config:

        NAME                                      STATE     READ WRITE CKSUM
        mainframe                                 ONLINE       0     0     0
          raidz2-0                                ONLINE       0     0     0
            e7ca087b-3fcc-4c87-a72e-269a50751068  ONLINE       0     0     0
            4ae20dae-448b-43f8-93f7-6a3577743cba  ONLINE       0     0     0
            a6b4e4cf-4d8a-4363-bb95-e7d88b6874cc  ONLINE       0     0     0
            1c25a11a-96bb-4d61-830e-8b80684b2590  ONLINE       0     0     0
            f5015a2d-30f7-42f9-ae18-5c2e6feade1b  ONLINE       0     0     0
            87d21052-4dbe-4bdf-afa9-94aa2353e678  ONLINE       0     0     0
        spares
          2bdfb31c-c361-4a15-8ee7-9f83f938b1b7    AVAIL
          adb4cb64-58fa-4d18-b965-0afcad5dac81    AVAIL

errors: No known data errors
 

cole4011

I've run multiple scrubs and always do a zpool clear after the resilver if a scrub fails. Made it to 4 scrubs without issue one time lol.
 

cole4011

So I have some new developments on this issue. I've installed TrueNAS CORE and am getting the error "IOC Fault 0x40007e23". There's lots of stuff on the forums about this issue with HBAs and drivers for FreeBSD. I've repasted the HBA and installed a little 40mm fan on it, which should rule out a cooling issue.

Attached is a screenshot of what's currently going on. I was moving a couple of TB of files while running a scrub, and one of the drives was taken offline and replaced by a spare. It's resilvering now, almost done.

What I might try:
1. Replace the HBA with 2 PCIe cards flashed to IT mode
2. Install TrueNAS CORE 11
3. Flash a newer version of the IT-mode firmware to the HBA

1697087401164.png
 

NugentS

Is the H710 a true HBA or is it a lobotomized RAID card? You stated in your kit list that the HBA was running in IT mode - how exactly?

From what I can find on the forums, the H710 is not a recommended device for use with TrueNAS.
 

cole4011

Just to note: any HBA I mention is a Mini designed specifically for the R720xd and attached directly to the motherboard, not through PCIe.

The H710 is a true HBA in the sense that I flashed it to IT mode myself (I think you mean using the default firmware when you say lobotomized). Thanks for the info on that! I've replaced the H710 with an H310 that I also flashed to IT mode (screenshot below).

1697123451500.png


Still getting the "IOC Fault 0x40007e23" error (screenshot)

1697125031695.png


I'm hoping replacing these onboard Mini HBAs with a PCIe HBA will fix the issues. I'll report back when the cards arrive and have been tested.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
I would scrub the boot pool as well. Also a clean install might help (just import back the config).

Did you try power cycling?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112

cole4011

Davvo said: "I would scrub the boot pool as well. Also a clean install might help (just import back the config). Did you try power cycling?"

TrueNAS CORE was installed just a few days ago, but no scrubs yet. I'm actually a little worried about scrubbing the boot-pool, as it's attached to the HBA as well and is a single drive for now. (The system initially used a mirrored boot pool with TrueNAS SCALE, but this is all just testing for now.) I've seen people say don't boot off of the HBA and others say it's no issue. I plan on booting from the HBA.

I may reflash the HBA again, but I have the tried and true LSI 9211-8i PCIe cards on the way now.

HoneyBadger said: "Have you checked for a firmware update on the 870 EVO drives? The 2TB and 4TB models have been reported to have firmware issues by some third-party users (I have no formal testing data of my own)."

https://www.techpowerup.com/forums/...ware-certain-batches-prone-to-failure.291504/

I've tried checking, but this is old hardware and there are no firmware updates from Dell for the R720xd regarding these drives. Do you have any recommendations on how to update the firmware considering this?

This may just be a $2,000 mistake. In the future I plan on using only enterprise Intel SSDs. (Found some that are actually $30 cheaper than the EVOs, NEW.)


Updates on scrub:
The scrub passed with "no errors", but I'm still seeing CAM errors and the IOC fault error.

I also haven't run SMART tests since changing the OS, so I just set up daily short tests and weekly long tests. Running a manual short test now, then a manual long test.
 

Davvo

The power outage might have damaged something... have fun verifying the possibility. I should move my lazy ass and have my UPS fixed.
 

cole4011

Davvo said: "The power outage might have damaged something... have fun verifying the possibility. I should move my lazy ass and have my UPS fixed."

I've replaced everything possible haha. I have UPS systems, but this was a longggg outage and I was not able to safely shut things down as I was away. A $12,000 automatic generator backup system is in the works.

My initial post shows everything I've replaced. Do you recommend I try replacing anything else? I have a new SAS cable that just arrived that I should probably try to replace as well. Would be cool if it was a $12 fix.
 

Davvo

cole4011 said: "I have UPS systems but this was a longggg outage, I was not able to safely shut things down as I was away."

You should be able to configure TN to do so on its own (at least with certain UPSes).
 

cole4011

Davvo said: "You should be able to configure TN to do so on its own (at least with certain UPSes)."

Not this one, sadly. I looked into that as well. Lots of lessons have been learned the hard way. I wish I was made of money haha.
 

cole4011

Also, just to note: a couple of drives have increased in "199 CRC_Error_Count". That is related to the issue at hand, correct?

Edit: only 2 increases in error count (2 drives, 1 error each). This makes sense because a drive was offlined and replaced by a spare during my tests last night, or am I incorrect in assuming this?
 

Davvo

cole4011 said: "Also, just to note: a couple of drives have increased in '199 CRC_Error_Count'. That is related to the issue at hand, correct?"

Possibly.

cole4011 said: "This makes sense because a drive was offlined and replaced by a spare during my tests last night, or am I incorrect in assuming this?"

It shouldn't be correlated.
 

HoneyBadger

cole4011 said: "I've tried checking but this is old hardware, no firmware updates from Dell on the R720xd regarding these drives. Do you have any recommendations on how to update the firmware considering this?"

Dell isn't likely to have firmware for a consumer SSD included in their regular downloads; you'll have to go directly to Samsung, and likely update through their proprietary tools:

https://semiconductor.samsung.com/consumer-storage/magician/

You should be able to see the current firmware version through the SMART output in TrueNAS though, to compare it against the latest available version, and determine if there's a need to go through the trouble.
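
As a small illustration of that comparison, the sketch below pulls the model, serial, and firmware fields out of saved `smartctl -i` text. It assumes the usual `Label: value` layout for ATA drives. The function name `smart_identity` is made up, and the firmware string in the sample is a placeholder, not a real Samsung version.

```python
from typing import Dict


def smart_identity(info_text: str) -> Dict[str, str]:
    """Extract model, serial, and firmware from `smartctl -i` style text."""
    wanted = ("Device Model", "Serial Number", "Firmware Version")
    found = {}
    for line in info_text.splitlines():
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep and key.strip() in wanted:
            found[key.strip()] = value.strip()
    return found


# Model and serial taken from this thread; the firmware value is a PLACEHOLDER.
SAMPLE = """\
Device Model:     Samsung SSD 870 EVO 4TB
Serial Number:    S6PJNS0W501176B
Firmware Version: EXAMPLE01
"""

if __name__ == "__main__":
    ident = smart_identity(SAMPLE)
    print(ident["Serial Number"], ident["Firmware Version"])
```

Running this across all eight drives and comparing the firmware field against the latest version Samsung lists would show which drives, if any, need the update.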
 

cole4011

HoneyBadger said: "Dell isn't likely to have firmware for a consumer SSD included in their regular downloads; you'll have to go directly to Samsung, and likely update through their proprietary tools:

https://semiconductor.samsung.com/consumer-storage/magician/

You should be able to see the current firmware version through the SMART output in TrueNAS though, to compare it against the latest available version, and determine if there's a need to go through the trouble."

I have checked the firmware version in the SMART output, and it's all good there. Thanks for the tip!
 

cole4011

Earlier in this thread I said I might try installing TrueNAS 11, as people report they didn't experience this issue on that version. I can confirm I am not experiencing the issue on TrueNAS 11.3-U5, which is the latest version of 11.

More information about this can be found here: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=224496

Not that that really helps if you want to maintain an up-to-date system, but this does point more towards a bug than a hardware failure IMO.

Still waiting to replace the HBA with LSI PCIe cards and be done with this issue.

Edit: Technically I guess this is FreeNAS, not TrueNAS. TrueNAS starts at version 12.

Edit 2: I did have to nuke the data on my pool and recreate it. Something I stated I wouldn't like to do, butttttt...
 