ZFS pool checksum error and GUI stuck

Blastmun

Dabbler
Joined
Jun 4, 2022
Messages
13
Hi everybody!
After checksum errors appeared on 2 of the 3 HDDs in my ZFS pool, I first replaced one of the HDDs. That went well, except that the next day the message came back. So I suspected the SATA cables, shut down the machine and replaced them.
I had bought a cable like this, which I have now replaced with 4 classic SATA cables.
But after restarting, the GUI keeps loading endlessly and two tasks remain stuck:
pool.import_on_boot at 53% and pool.dataset.sync_db_keys.
truenas1.PNG



As you can see, the monitoring does not load, and various menus are inaccessible.

Oddly enough, my ZFS pool seems to be working: the iSCSI share on it works fine.
On the other hand, my simple storage "bank", which had never had any problems, is not being updated.
And the CLI takes a very long time to respond...

Do you have any idea? :frown:
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,996
Run a SMART Long/Extended test on all of your drives; you can run them all at the same time if you like. Run the command
Code:
smartctl -t long /dev/sda
and likewise for sdb, sdc, and so on. The test will take many hours to complete. If you run the command
Code:
smartctl -a /dev/sda
you will typically find, near the top, a list of estimated times (normally in minutes), including how long a Long/Extended test takes to complete. Wait that amount of time plus 30 minutes. The SMART test can take longer if the pool is in use, because the test runs at the lowest priority. To verify the tests passed, run the command
Code:
smartctl -a /dev/sda
again; near the bottom there should be a log of completed self-tests. You are looking for the extended test in entry #1, and it should show no errors.
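
The procedure above could be scripted roughly as follows. This is only a sketch: the function names (`start_long_tests`, `extended_wait_minutes`, `latest_selftest_ok`) are made up for illustration, the parsing assumes the ATA-style `smartctl -a` layout, and nothing touches a drive until you call `start_long_tests` yourself as root.

```shell
# Start a Long/Extended self-test on every device given as an argument.
start_long_tests() {
  for dev in "$@"; do
    smartctl -t long "$dev"
  done
}

# Read `smartctl -a` output on stdin and print the estimated Extended
# self-test time plus the 30-minute margin, in minutes.
extended_wait_minutes() {
  awk '
    /Extended self-test routine/ { want = 1; next }
    want {
      if (/recommended polling time/) {
        gsub(/[^0-9]/, "")      # keep only the digits of the estimate
        print $0 + 30
      }
      exit
    }
  '
}

# Read `smartctl -a` output on stdin. Exit status 0 only if entry #1 of
# the self-test log is an Extended test that completed without error.
latest_selftest_ok() {
  awk '
    /^# *1 / {
      found = 1
      ok = (/Extended offline/ && /Completed without error/)
      exit
    }
    END { exit !(found && ok) }
  '
}
```

For a drive that reports 607 minutes, `smartctl -a /dev/sda | extended_wait_minutes` prints 637. Once that time has passed, `smartctl -a /dev/sda | latest_selftest_ok && echo "sda passed"` checks the newest log entry.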

In the meantime, you can post the smartctl -a results for each drive, in code brackets so we can read them, and we can find out whether any of your drives are on the edge. You can also do this yourself: see the hard drive troubleshooting link in my signature. It still applies, though I should update it for NVMe.
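
As a rough first pass before reading the full reports, the raw counters most often checked are attributes 5, 197, 198 and 199. That selection is a common convention and an assumption here, not something stated in this thread, and `smart_red_flags` is a hypothetical helper name. The sketch reads `smartctl -a` output and prints any of those attributes whose raw value is not zero.

```shell
# Read `smartctl -a` output on stdin and print ID, name and raw value for
# attributes 5, 197, 198 and 199 when the raw value is not zero.
# Assumes the ATA attribute table layout, where RAW_VALUE is column 10.
smart_red_flags() {
  awk '
    ($1 == 5 || $1 == 197 || $1 == 198 || $1 == 199) &&
    NF >= 10 && $2 ~ /_/ && $10 != "0" {
      print $1, $2, $10
    }
  '
}
```

Usage would be `smartctl -a /dev/sda | smart_red_flags`. No output means all four raw counters are zero; it does not replace reading the whole report.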

If you have not done so already, make sure you back up any important files while you can. Checksum errors mean you have data corruption. A scrub may fix it, but it may also tell you that some files are gone and that you need to manually delete the corrupt files.
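
To see which devices are reporting checksum errors before and after a scrub, something like the following could help. It is a sketch: `cksum_errors` is a hypothetical helper, `tank` stands in for your pool name, and the parsing assumes the usual NAME/STATE/READ/WRITE/CKSUM columns of `zpool status`.

```shell
# Read `zpool status` output on stdin and print each pool, vdev or disk
# whose CKSUM column is not zero, together with the count.
cksum_errors() {
  awk '
    $1 == "NAME" && $5 == "CKSUM"     { in_config = 1; next }
    in_config && NF == 0              { in_config = 0 }
    in_config && NF >= 5 && $5 != "0" { print $1, $5 }
  '
}

# Typical use (run as root; "tank" is a placeholder pool name):
#   zpool scrub tank
#   zpool status -v tank | cksum_errors
```

`zpool status -v` also lists any files with permanent errors once the scrub has finished, which is where you would learn which files need to be deleted or restored.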
 

Blastmun

I'll see whether I can run the test because, as I said, since the restart some tasks seem to be stuck and most of the GUI is inaccessible...
I suspected a SATA cable problem because I had already replaced an HDD showing the famous "checksum 2" error with a new one, and the next day two disks in the pool showed a "checksum 4" error.
Meanwhile, I tested the "defective" disk on Windows with "chkdsk /r" and everything was OK.
I would like a way to stop the mounting of all the disks and the stuck tasks in order to regain control of the GUI.
Even in the CLI, when I run "zpool status" nothing is displayed, as if it hangs.

For example, at startup I get this kind of error:
 

Attachments

  • Capture d'écran 2023-12-25 182048.png

joeschmuck

Something I thought I had already asked (there have been a lot of failures recently): please post your system specs in accordance with the forum rules. That means motherboard, CPU, RAM, add-on cards, all the hardware, and the version of TrueNAS you are running.

After reading what you posted, a light came on in my head: conduct stability testing (a CPU stress test and MemTest86). Run the CPU stress test for at least 30 minutes and MemTest86 for at least 3 complete passes. This will hopefully tell you whether your system is stable. You can also run the solnet-array-test-v3 script to test the drives for data connectivity.
 

Blastmun

I managed to regain control by doing a fresh installation and restoring the backup. However, the "Bank" pool crashes my startup, so I had to remove its HDD.
smartctl -a /dev/sda (this is a new disk)

root@truenas[~]# smartctl -a /dev/sda
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.63-production+truenas] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Seagate Skyhawk
Device Model: ST4000VX007-2DT166
Serial Number: ZGY8NAW0
LU WWN Device Id: 5 000c50 0c861a448
Firmware Version: CV11
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5980 rpm
Form Factor: 3.5 inches
Device is: In smartctl database 7.3/5319
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Tue Dec 26 14:18:24 2023 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 581) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 607) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x50bd) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 076 064 044 Pre-fail Always - 36621736
3 Spin_Up_Time 0x0003 094 093 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 310
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 096 060 045 Pre-fail Always - 3902808667
9 Power_On_Hours 0x0032 082 082 000 Old_age Always - 15853h+00m+00.000s
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 234
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 057 039 040 Old_age Always In_the_past 43 (Min/Max 34/45 #4)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 220
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 1347
194 Temperature_Celsius 0x0022 043 061 000 Old_age Always - 43 (0 15 0 0 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 15820h+39m+44.687s
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 27886290135
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 17617007981

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 15853 -
# 2 Extended offline Completed without error 00% 15248 -
# 3 Extended offline Completed without error 00% 14527 -
# 4 Extended offline Completed without error 00% 13782 -
# 5 Extended offline Completed without error 00% 13062 -
# 6 Extended offline Completed without error 00% 12318 -
# 7 Extended offline Completed without error 00% 11574 -
# 8 Extended offline Completed without error 00% 10854 -
# 9 Short offline Completed without error 00% 10275 -
#10 Short offline Completed without error 00% 10254 -
#11 Short offline Completed without error 00% 10228 -
#12 Short offline Completed without error 00% 10204 -
#13 Short offline Completed without error 00% 10180 -
#14 Short offline Completed without error 00% 10156 -
#15 Short offline Completed without error 00% 10132 -
#16 Short offline Completed without error 00% 10108 -
#17 Short offline Completed without error 00% 10084 -
#18 Short offline Completed without error 00% 10060 -
#19 Short offline Completed without error 00% 10036 -
#20 Short offline Completed without error 00% 10012 -
#21 Short offline Completed without error 00% 9988 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 

Blastmun

smartctl -a /dev/sdb

root@truenas[~]# smartctl -a /dev/sdb
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.63-production+truenas] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model: ST4000VN006-3CW104
Serial Number: WW60PFX1
LU WWN Device Id: 5 000c50 0f178153f
Firmware Version: SC60
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Form Factor: 3.5 inches
Device is: Not in smartctl database 7.3/5319
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Tue Dec 26 14:22:47 2023 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x73) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 455) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x70bd) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 070 064 006 Pre-fail Always - 9257720
3 Spin_Up_Time 0x0003 096 095 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 98
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 067 060 045 Pre-fail Always - 4955995
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 150
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 25
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 058 056 040 Old_age Always - 42 (Min/Max 35/44)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 20
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 131
194 Temperature_Celsius 0x0022 042 044 000 Old_age Always - 42 (0 25 0 0 0)
195 Hardware_ECC_Recovered 0x001a 070 064 000 Old_age Always - 9257720
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 132 (192 163 0)
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 968588619
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 36723677709

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 149 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 

Blastmun

smartctl -a /dev/sdd

root@truenas[~]# smartctl -a /dev/sdd
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.63-production+truenas] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Red
Device Model: WDC WD40EFRX-68WT0N0
Serial Number: WD-WCC4E5CCC8F2
LU WWN Device Id: 5 0014ee 261693011
Firmware Version: 82.00A82
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Device is: In smartctl database 7.3/5319
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Tue Dec 26 14:27:26 2023 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 241) Self-test routine in progress...
10% of test remaining.
Total time to complete Offline
data collection: (54000) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 540) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x703d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 206 177 021 Pre-fail Always - 6691
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 704
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 069 069 000 Old_age Always - 22771
10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 504
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 356
193 Load_Cycle_Count 0x0032 199 199 000 Old_age Always - 5898
194 Temperature_Celsius 0x0022 108 089 000 Old_age Always - 44
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Interrupted (host reset) 90% 22757 -
# 2 Extended offline Completed without error 00% 22633 -
# 3 Extended offline Completed without error 00% 22169 -
# 4 Extended offline Completed without error 00% 21449 -
# 5 Extended offline Completed without error 00% 20705 -
# 6 Extended offline Completed without error 00% 19986 -
# 7 Extended offline Completed without error 00% 19243 -
# 8 Extended offline Completed without error 00% 18500 -
# 9 Extended offline Completed without error 00% 17780 -
#10 Short offline Completed without error 00% 17199 -
#11 Short offline Completed without error 00% 17194 -
#12 Short offline Completed without error 00% 17178 -
#13 Short offline Completed without error 00% 17153 -
#14 Short offline Completed without error 00% 17129 -
#15 Short offline Completed without error 00% 17105 -
#16 Short offline Completed without error 00% 17081 -
#17 Short offline Completed without error 00% 17057 -
#18 Short offline Completed without error 00% 17033 -
#19 Short offline Completed without error 00% 17009 -
#20 Short offline Completed without error 00% 16985 -
#21 Short offline Completed without error 00% 16961 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 

Blastmun

Concerning my HDD "bank":


1703597478913.png


1703597591414.png




root@truenas[~]# smartctl -a /dev/sdg
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.63-production+truenas] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Seagate BarraCuda 3.5 (SMR)
Device Model: ST6000DM003-2CY186
Serial Number: ZF205QHS
LU WWN Device Id: 5 000c50 0c333028e
Firmware Version: 0001
User Capacity: 6,001,175,126,016 bytes [6.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5425 rpm
Form Factor: 3.5 inches
Device is: In smartctl database 7.3/5319
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Tue Dec 26 14:29:39 2023 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 243) Self-test routine in progress...
30% of test remaining.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x73) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 722) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x30a5) SCT Status supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 082 064 006 Pre-fail Always - 175496733
3 Spin_Up_Time 0x0003 093 091 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 345
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 086 061 045 Pre-fail Always - 444820705
9 Power_On_Hours 0x0032 084 084 000 Old_age Always - 14259h+33m+21.956s
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 230
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0 0 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 054 042 040 Old_age Always - 46 (Min/Max 27/47)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 611
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 1109
194 Temperature_Celsius 0x0022 046 058 000 Old_age Always - 46 (0 16 0 0 0)
195 Hardware_ECC_Recovered 0x001a 082 064 000 Old_age Always - 175496733
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 13052h+49m+37.763s
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 34454208786
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 37359183498

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Self-test routine in progress 30% 14259 -
# 2 Extended offline Completed without error 00% 13674 -
# 3 Extended offline Completed without error 00% 12955 -
# 4 Extended offline Completed without error 00% 12209 -
# 5 Extended offline Completed without error 00% 11484 -
# 6 Extended offline Completed without error 00% 10734 -
# 7 Extended offline Completed without error 00% 9991 -
# 8 Extended offline Completed without error 00% 9270 -
# 9 Short offline Completed without error 00% 8687 -
#10 Short offline Completed without error 00% 8665 -
#11 Short offline Completed without error 00% 8640 -
#12 Short offline Completed without error 00% 8616 -
#13 Short offline Completed without error 00% 8592 -
#14 Short offline Completed without error 00% 8568 -
#15 Short offline Completed without error 00% 8544 -
#16 Short offline Completed without error 00% 8520 -
#17 Short offline Completed without error 00% 8496 -
#18 Short offline Completed without error 00% 8472 -
#19 Short offline Completed without error 00% 8448 -
#20 Short offline Completed without error 00% 8424 -
#21 Short offline Completed without error 00% 8400 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 

Blastmun

I'll try these tests when I'm home, thanks!
OS: TrueNAS-SCALE-23.10.1

MB: MSI MAG B550 Tomahawk (MS-7C91)
CPU: Ryzen 5 5600G
RAM: Corsair Vengeance CMU16GX4M2C3000C15R (2*8)
 

joeschmuck

Here is your problem, and it is easy to spot.
root@truenas[~]# smartctl -a /dev/sdg
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.63-production+truenas] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Seagate BarraCuda 3.5 (SMR)
Device Model: ST6000DM003-2CY186
Serial Number: ZF205QHS

You are using an SMR drive.

I don't want to sound like a broken record repeating what some of my colleagues would say, but it is true here: you are using substandard hardware for a ZFS-based NAS. We see this when someone watches a YouTube video that says "Turn your old computer into a NAS", and that is such a wrong statement to make. I just built my last NAS (I hope); it is all new hardware and cost me $2600 USD. Of course, I could have done it cheaper if I had used anything other than NVMe drives. I will admit that your motherboard does look nice, as a gaming board. I'm not sure if it supports ECC RAM, but if it does, and your AMD CPU supports ECC RAM, and you install ECC RAM, you might have a viable motherboard for TrueNAS.

But in your case, all your drives are fine except for the one listed above. You cannot use an SMR drive for normal ZFS operations; it just is not fast enough. You can google SMR and read about what it means, then google CMR and see the differences. SMR drives, a.k.a. archive drives, are all about writing once and leaving the data alone. That is simplified but true. You can rewrite data, but it takes the drive a long time to process it and re-record the data.
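
One way to screen the other drives for this is to look for the same "(SMR)" label on the Model Family line that smartctl printed for the drive quoted above. This is a heuristic sketch with a made-up function name, and it assumes the drive is in smartctl's database and labelled there: a drive that is missing from the database, or not labelled, will not be flagged, so the manufacturer's datasheet remains the real answer.

```shell
# Read `smartctl -i` (or -a) output on stdin.
# Exit status 0 if the Model Family line carries an SMR label, 1 otherwise.
is_labelled_smr() {
  awk -F': *' '
    $1 == "Model Family" { if ($2 ~ /SMR/) hit = 1; exit }
    END { exit !hit }
  '
}
```

For example, `smartctl -i /dev/sdg | is_labelled_smr && echo "sdg is labelled SMR"`.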

I would recommend that you read the TrueNAS user guide, and the Hardware Guide in the resources section of the forums, if you plan to continue to use TrueNAS. As for your RAM, you should know how much you have in the system: 8GB, 16GB, 32GB, 64GB.

So copy the data you can off the SMR drive to a better location, replace the SMR drive with a proper CMR drive, and you should be back in operation.

Good Luck
 

Blastmun

I understand what you mean. As for the SMR HDD, I know exactly what it is; that is why my RAIDZ uses CMR drives. The SMR HDD is mainly used to back up fairly large files such as archives, videos, etc., which I only need to access very rarely.

And regarding RAIDZ storage, why is it not healthy?
 

joeschmuck

Blastmun said: "And regarding RAIDZ storage, why is it not healthy?"
Pretty sure it is because the pool is over 80% full. You should really read the user guide; this is basic knowledge. Your system is not slow yet, but wait until you hit 90%: then a special algorithm for writing is enabled, and it is very slow because you have very little space to write to.
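
The fill level can be checked from the shell as well. A minimal sketch, taking the 80% figure mentioned above as the default warning threshold; `pools_over` is a hypothetical helper that reads the output of `zpool list -H -o name,capacity`.

```shell
# Read `zpool list -H -o name,capacity` output on stdin (e.g. "tank  83%")
# and print the pools whose used capacity is at or above the threshold.
# Usage: zpool list -H -o name,capacity | pools_over 80
pools_over() {
  awk -v limit="${1:-80}" '
    {
      pct = $2
      sub(/%$/, "", pct)          # "83%" -> "83"
      if (pct + 0 >= limit + 0) print $1, pct "%"
    }
  '
}
```

Calling it with 90 instead of 80 shows which pools have reached the level where writes are expected to become very slow.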

I'm not trying to "bust your chops" or give you a hard time but you do need to read the user guide and examine the resources section of the forum.
 