Pool is not healthy

JBtje

Cadet
Joined
Nov 5, 2023
Messages
4
Hello!

I've got a completely new system (hopefully it will show in the signature) and installed TrueNAS yesterday (or actually, reinstalled it today, because I managed to break it within a day...). Still better than Unraid, which I broke in 10 minutes (GUI only, no CLI!). In Unraid I created a ZFS pool (that worked), then made the disks ZFS-encrypted, and the system became unusable... So I switched over to TrueNAS.

I created 3 ZFS pools (see signature) from scratch, played around with apps, and at the end of the day I copied 150 GiB to the Toshiba pool. While this was happening, the system crashed(?) and restarted (around 23:49). After the restart, I saw "Last Scan Errors: 6" under ZFS Health for the Toshiba pool.
However, the "Samsung pool" also shows "ZFS Health: Pool is not healthy". The Lexar pool shows 4 green checks and is fine.

Today I copied more data to the Toshiba pool, and... while it was receiving data, the system crashed again (around 23:53). Now it shows "Last Scan Errors: 18" for the Toshiba pool.

I'm running SMART tests on the Toshibas, but they have been stuck at 10% for the past hour...

My assumption is that one or more drives are not OK. Write performance is around 53-60 MB/s to the pool, which I think is not right.

So here are my questions:
- The Samsung pool is not healthy according to the system, but it shows no errors. How can I debug / find out what the system means by "not healthy"?
- What other tests can I do for the Toshiba pool, assuming the SMART test remains stuck at 10%?
- I assume the system crashes have to do with the data transfer? Where could I find the cause of the crashes? I did some looking around in the debug zip file, but did not find anything useful so far.

Thank you!
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
How are the drives connected to the motherboard?
Please post the output of zpool status using [CODE][/CODE] tags; you can use the Shell in the WebUI for this.
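
For reference, a minimal sketch of what a healthy pool looks like (the pool and device names below are made up for illustration):

Code:
# run from System Settings > Shell in the WebUI
sudo zpool status tank

  pool: tank
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda2    ONLINE       0     0     0
            sdb2    ONLINE       0     0     0

errors: No known data errors

Anything other than ONLINE, and any non-zero value in the READ/WRITE/CKSUM columns, is worth posting.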

Please post the results of the long SMART tests once they are available, using smartctl -a /dev/adaX, where X is the number of the drive (0, 1, 2, etc...); use the [CODE][/CODE] tags. Given the size of the drives, a long SMART test will likely take at least a day to complete.
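
A sketch of how that could look; note that on SCALE (Linux-based) the drives usually show up as /dev/sdX rather than /dev/adaX (device names below are examples):

Code:
# list the drives smartctl can see
sudo smartctl --scan

# full SMART report for one drive
sudo smartctl -a /dev/sda

The parts to look at are the attribute table (especially Reallocated_Sector_Ct, Current_Pending_Sector, and UDMA_CRC_Error_Count) and the self-test log at the bottom.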

Also check the power and data connections of the drives, make sure they are properly latched and there is no apparent damage.

Finally, I personally wouldn't use latest-gen CPUs for this kind of work: cutting-edge hardware is often unstable. Which SCALE version are you on? 23.10.1 shouldn't have issues with the scheduler; check your temps. On which drive have you installed TrueNAS?
 

JBtje

Cadet
Joined
Nov 5, 2023
Messages
4
Thank you for your reply!

I did see the zpool status command mentioned somewhere else, but it reported that zpool was not found. I'll check it later when I have TrueNAS running again. I'm currently running memtest86, and the results so far are: error...

With all 4 DIMMs, a few thousand errors; with either pair alone, no errors. So my guess is the memory is unstable at the XMP1 (DDR5-5600) profile with all 4 inserted (even though the CPU, motherboard, and RAM all support it). I'm now testing the default (non-XMP) profile (DDR5-4800) with all 4 DIMMs, and so far no errors. I will do more memory testing to see when it becomes unstable.
My guess is that this might have been the cause of the corruption; bit-flipping RAM could certainly produce these weird symptoms.

The SMART test took about 25 hours, but the GUI didn't show any errors. I will use your command to see if there is anything notable in the output and report back.

I would love to call the 14th gen Intel "cutting-edge", but the reality is that it is essentially the same as the 13th gen.

TrueNAS is actually installed on an "old" Crucial m4 CT128M4SSD2 128GB. That was the smallest SSD I had available; since I was originally planning on using Unraid, I did not account for a dedicated drive for the TrueNAS boot-pool.

Is it possible to make a daily backup of the boot-pool drive? If that disk fails, some downtime is fine by me, but I would like to be able to restore everything onto a different disk. Also, it's not possible to use just part of an NVMe drive, right? Using the 2TB drive for TrueNAS means it will create a small partition for TrueNAS and render the rest of the disk unavailable?
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
With all 4 DIMMs, a few thousand errors; with either pair alone, no errors. So my guess is the memory is unstable at the XMP1 (DDR5-5600) profile with all 4 inserted (even though the CPU, motherboard, and RAM all support it). I'm now testing the default (non-XMP) profile (DDR5-4800) with all 4 DIMMs, and so far no errors. I will do more memory testing to see when it becomes unstable.
It's usually suggested not to use XMP and similar profiles on RAM... because there is no reason to do so and it only adds instability (and DDR5 is unstable per se).

I would love to call the 14th gen Intel "cutting-edge", but the reality is that it is essentially the same as the 13th gen.
13th gen and DDR5 is cutting-edge technology for this kind of application.
Just going with gaming/consumer hardware can generate instability; adding new tech on top... eh, there goes the uptime, as you found out. Honestly, I wouldn't trust the data already on the system, and once I got it stable I would nuke the data pools and copy the files over again in order to guarantee (big word without ECC RAM, but we work with what we have) data integrity.

TL;DR: disable any overclock and let memtest do its thing for at least a few days.

That was the smallest SSD I had available
That's perfectly fine. With a config backup, you can just reinstall the OS on a new drive when needed and, bam, everything works as before. There are scripts that provide an easy way to get regular config backups, among other things.
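
Purely as an illustration (this is not those scripts, and the destination path is hypothetical), even a cron-driven copy of the config database goes a long way; TrueNAS SCALE keeps its config in /data/freenas-v1.db:

Code:
#!/bin/sh
# Hypothetical sketch: copy the TrueNAS config database to a dataset,
# keeping one dated file per day. Point DEST at a dataset that exists.
DEST=/mnt/tank/configbackups
cp /data/freenas-v1.db "${DEST}/config-$(date +%Y-%m-%d).db"

Schedule it daily under System Settings > Advanced > Cron Jobs and there is always a recent config to restore onto a fresh install.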

Also, it's not possible to use just part of an NVMe drive, right? Using the 2TB drive for TrueNAS means it will create a small partition for TrueNAS and render the rest of the disk unavailable?
Yes. Using the boot drive for something else is possible, but heavily discouraged and not supported.
 

chuck32

Guru
Joined
Jan 14, 2023
Messages
623
I did see the zpool status command mentioned somewhere else, but it reported that zpool was not found. I'll check it later when I have TrueNAS running again. I'm currently running memtest86, and the results so far are: error...
sudo zpool ...

Unfortunately it's not clear from the "not found" error message that you are lacking privileges ;)
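
Illustrative only (the exact wording depends on your shell):

Code:
$ zpool status
zpool: command not found

$ sudo zpool status
  pool: ...

For an unprivileged user the sbin directories may not be in PATH, so the shell can report "command not found" even though the binary is there; sudo runs the command with root's PATH and the needed privileges.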
 

JBtje

Cadet
Joined
Nov 5, 2023
Messages
4
It's usually suggested not to use XMP and similar profiles on RAM... because there is no reason to do so and it only adds instability (and DDR5 is unstable per se).
The result of memtest is that it runs "stable" with XMP profile 2 (DDR5-5200) for 5 consecutive memtest runs. But yeah... what is that extra 400MHz going to do? So I followed your advice, turned off XMP, and now run it at the default (DDR5-4800). It's too bad there are hardly any socket-1700 motherboards with ECC support.

13th gen and DDR5 is cutting-edge technology for this kind of application.
Just going with gaming/consumer hardware can generate instability; adding new tech on top... eh, there goes the uptime, as you found out. Honestly, I wouldn't trust the data already on the system, and once I got it stable I would nuke the data pools and copy the files over again in order to guarantee (big word without ECC RAM, but we work with what we have) data integrity.
Data nuked, currently transferring again onto newly created ZFS pools! All pools are healthy again. sudo zpool status also reports there are no known errors.

That's perfectly fine. With a config backup, you can just reinstall the OS on a new drive when needed and, bam, everything works as before. There are scripts that provide an easy way to get regular config backups, among other things.
Thank you!

3 of the 4 20TB disks have no errors whatsoever; one shows the error below. Do I understand correctly that this error happened around 4 minutes after starting the system? (I was messing with Unraid then.) I guess it is an irrelevant error?
Code:
=== START OF INFORMATION SECTION ===
Device Model:     TOSHIBA MG10ACA20TE
Serial Number:    7370A1B9F4MJ
LU WWN Device Id: 5 000039 ca8c8302c
Firmware Version: 0102
User Capacity:    20,000,588,955,648 bytes [20.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database 7.3/5319
ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Nov 13 16:49:49 2023 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  120) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (1660) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0027   100   100   001    Pre-fail  Always       -       9393
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       16
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   050    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   050    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       194
 10 Spin_Retry_Count        0x0033   100   100   030    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       15
 23 Unknown_Attribute       0x0023   100   100   075    Pre-fail  Always       -       0
 24 Unknown_Attribute       0x0023   100   100   075    Pre-fail  Always       -       0
 27 Unknown_Attribute       0x0023   100   100   030    Pre-fail  Always       -       1052764
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       6
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       23
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       33 (Min/Max 15/53)
196 Reallocated_Event_Count 0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       1
220 Disk_Shift              0x0002   100   100   000    Old_age   Always       -       1310720
222 Loaded_Hours            0x0032   100   100   000    Old_age   Always       -       179
226 Load-in_Time            0x0026   100   100   000    Old_age   Always       -       679
240 Head_Flying_Hours       0x0001   100   100   001    Pre-fail  Offline      -       0
241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       7330185392
242 Total_LBAs_Read         0x0032   100   100   000    Old_age   Always       -       7098182781

SMART Error Log Version: 1
ATA Error Count: 1
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 43 c8 3f 8a 00 40  Error: ICRC, ABRT at LBA = 0x00008a3f = 35391

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 40 c8 00 85 00 40 00      00:04:42.477  READ FPDMA QUEUED
  60 40 c0 c0 7f 00 40 00      00:04:42.473  READ FPDMA QUEUED
  60 40 b8 80 7a 00 40 00      00:04:42.472  READ FPDMA QUEUED
  60 40 b0 40 75 00 40 00      00:04:42.469  READ FPDMA QUEUED
  60 40 a8 00 70 00 40 00      00:04:42.467  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%        44         -
# 2  Short offline       Completed without error       00%         2         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

I saw the HDD temperatures on the reporting page and pointed a fan directly at the HDDs. They are now all <= 38 °C.
 