Drive(s) read/write errors since switching from Unraid

MrYoshii

Cadet
Joined
Mar 19, 2023
Messages
9
My System:
  • Motherboard: Gigabyte B760 Gaming X AX DDR4
  • CPU: Intel Core I5 13500
  • RAM: Corsair Vengeance LPX 32GB (2 x 16 GB) DDR4 3200MHz
  • Hard drives, quantity, model numbers, and RAID configuration, including boot drives:
    4x ST8000DM004 1x ST8000VN004 RAIDZ1
  • Hard disk controllers: Adaptec ASR-72405
  • Network cards: onboard from MB

The Problem:
On the weekend, I moved from Unraid to TrueNAS SCALE. I restored all my data from my backup onto the array.
Overnight I ran a scrub task.
In the morning, I saw a few messages via Telegram (see the log below).
This was the array status:
1679313084620.png

The one drive with no errors was the missing one.
Code:
HomeServer01, [20.03.2023 00:00]
TrueNAS @ HomeServer01 
 
New alerts:

  * Scrub of pool 'Array1_32TB' started.

Current alerts:

  * Scrub of pool 'Array1_32TB' started.

HomeServer01, [20.03.2023 00:04]
TrueNAS @ HomeServer01 
 
New alerts:

  * Pool Array1_32TB state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.

Current alerts:

  * Scrub of pool 'Array1_32TB' started.
  * Pool Array1_32TB state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.

HomeServer01, [20.03.2023 00:05]
TrueNAS @ HomeServer01 
 
New alert:

  * Pool Array1_32TB state is DEGRADED: One or more devices has been removed by the administrator. Sufficient replicas exist for the pool to continue functioning in a degraded state. 
The following devices are not healthy:

    * Disk ST8000DM004-2U9188 WSC**** is REMOVED

The following alert has been cleared:

  * Pool Array1_32TB state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.

Current alerts:

  * Scrub of pool 'Array1_32TB' started.
  * Pool Array1_32TB state is DEGRADED: One or more devices has been removed by the administrator. Sufficient replicas exist for the pool to continue functioning in a degraded state. 
The following devices are not healthy:

    * Disk ST8000DM004-2U9188 WSC**** is REMOVED

HomeServer01, [20.03.2023 00:52]
TrueNAS @ HomeServer01 
 
New alerts:

  * Scrub of pool 'Array1_32TB' finished.

Current alerts:

  * Scrub of pool 'Array1_32TB' started.
  * Pool Array1_32TB state is DEGRADED: One or more devices has been removed by the administrator. Sufficient replicas exist for the pool to continue functioning in a degraded state. 
The following devices are not healthy:

    * Disk ST8000DM004-2U9188 WSC**** is REMOVED
  * Scrub of pool 'Array1_32TB' finished.

HomeServer01, [20.03.2023 00:53]
TrueNAS @ HomeServer01 
 
New alert:

  * Pool Array1_32TB state is DEGRADED: One or more devices has experienced an error resulting in data corruption. Applications may be affected. 
The following devices are not healthy:

    * Disk ST8000DM004-2U9188 WSC**** is REMOVED

The following alert has been cleared:

  * Pool Array1_32TB state is DEGRADED: One or more devices has been removed by the administrator. Sufficient replicas exist for the pool to continue functioning in a degraded state. 
The following devices are not healthy:

    * Disk ST8000DM004-2U9188 WSC**** is REMOVED

Current alerts:

  * Scrub of pool 'Array1_32TB' started.
  * Scrub of pool 'Array1_32TB' finished.
  * Pool Array1_32TB state is DEGRADED: One or more devices has experienced an error resulting in data corruption. Applications may be affected. 
The following devices are not healthy:

    * Disk ST8000DM004-2U9188 WSC**** is REMOVED


After that, I restarted the system. The drive was back online and the system did a resilver.
That one drive still had errors, so I removed it from the array and checked it with SMART. It looked OK to me, so I quick-formatted the drive and put it back in. It resilvered again, but the drive(s) still get errors.
1679313301118.png

Not sure what's up with that. In Unraid I did not have any problems with the drives.

SMART:
Code:
smartctl -a /dev/sdl
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.79+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate BarraCuda 3.5 (SMR)
Device Model:     ST8000DM004-2U9188
Serial Number:   
LU WWN Device Id: 5 000c50 0df81a45e
Add. Product Id:  DELL(tm)
Firmware Version: 0001
User Capacity:    8,001,563,222,016 bytes [8.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Mar 20 12:09:22 2023 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 974) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x30a5) SCT Status supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   082   064   006    Pre-fail  Always       -       173954785
  3 Spin_Up_Time            0x0003   097   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       558
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   089   060   045    Pre-fail  Always       -       761297024
  9 Power_On_Hours          0x0032   086   086   000    Old_age   Always       -       12840h+30m+46.447s
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       168
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   068   055   040    Old_age   Always       -       32 (Min/Max 32/33)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       116
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       1880
194 Temperature_Celsius     0x0022   032   045   000    Old_age   Always       -       32 (0 21 0 0 0)
195 Hardware_ECC_Recovered  0x001a   082   064   000    Old_age   Always       -       173954785
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       8786h+17m+11.597s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       51032138593
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       437887931800

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     12804         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
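
For reading a dump like the one above, here is a rough Python sketch that pulls out the attributes usually watched for a failing disk or bad cabling (IDs 5, 187, 197, 198 and 199). The function name and the choice of watched IDs are my own assumptions, not anything smartctl defines; the sample rows are copied from the table above. Large raw values on attributes 1, 7 and 195 are, as far as I know, normal for Seagate drives, so they are deliberately not checked.

```python
# Hypothetical helper: flag watched SMART attributes whose raw value is not "0".
# The watched IDs are a common rule of thumb, not an official list.
WATCHED = {5, 187, 197, 198, 199}

def nonzero_watched(smart_text):
    """Return {attribute_name: raw_value} for watched attributes with a nonzero raw value."""
    flagged = {}
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[9]
        if attr_id in WATCHED and raw != "0":
            flagged[name] = raw
    return flagged

# Rows copied from the smartctl output in this post.
SAMPLE = """\
  1 Raw_Read_Error_Rate     0x000f   082   064   006    Pre-fail  Always       -       173954785
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""

print(nonzero_watched(SAMPLE))  # prints {}
```

An empty result only says the drive itself reports nothing unusual; it does not rule out problems elsewhere in the chain.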
 

LarsR

Guru
Joined
Oct 23, 2020
Messages
719
There are two things that could cause your problems:

1. Your Adaptec Card only has an HBA mode, not a real IT mode.
2. You're using SMR drives which are known to cause problems with zfs.

If you didn't use the ZFS plugin on Unraid, then you probably used XFS or Btrfs, which work better with SMR drives.
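
As a side note on spotting SMR drives: the smartctl output in the first post shows "(SMR)" in the Model Family line. A minimal Python sketch that checks for that label is below. The function names are hypothetical, and it assumes the drive database marks the family this way, which is not true for every SMR model, so a negative result proves nothing.

```python
# Hypothetical helpers: look for the "(SMR)" label in smartctl's information section.
def model_family(info_text):
    """Return the value of the 'Model Family' line, or None if it is absent."""
    for line in info_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Model Family":
            return value.strip()
    return None

def labelled_smr(info_text):
    """True only when the Model Family line carries an explicit (SMR) label."""
    family = model_family(info_text)
    return family is not None and "(SMR)" in family

# Lines copied from the smartctl output in the first post.
SAMPLE_INFO = """\
Model Family:     Seagate BarraCuda 3.5 (SMR)
Device Model:     ST8000DM004-2U9188
"""

print(labelled_smr(SAMPLE_INFO))  # prints True
```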
 

MrYoshii

Quote:
1. Your Adaptec Card only has an HBA mode, not a real IT mode.

Yes, I'm using HBA mode. Why is that a problem?

Quote:
2. You're using SMR drives which are known to cause problems with zfs.
If you haven't used the zfs plugin on unraid then you've probably used xfs or btrfs which work better with SMR drives

Uhh damn, ok, that's unfortunate.
 

MrYoshii


Quote:
Here's a resource from this forum explaining why HBA mode and an HBA card are not the same.

Ahh, ok. Since I can see SMART data and the drives' real names, I think the controller is not the problem.
 

MrYoshii

Quote:
@LarsR is correct. Let me be a little clearer for you: THE CONTROLLER IS THE PROBLEM. ADAPTEC'S "HBA MODE" IS NOT COMPATIBLE WITH TRUENAS.

Ok, ok. What controller should I buy?
I've got a 24-bay Inter-Tech 4U-4424 (SFF-8643).
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I'd throw an SAS expander in there and then an LSI 9300-8i as the HBA. I recommend researching any 12G SAS expander prior to purchase; my memory suggests there was one that is strongly disrecommended but my memory isn't being particularly helpful today as to which one. If you are unfamiliar with SAS expanders, please see the SAS Primer posted in the Resources section.

In the future, it's probably better not to get an off-brand chassis. What you really want is to have an SAS backplane with an SAS expander built-in, which is an option simply not available with many of the off-brands. If you had a nice Supermicro CSE-846BE1C or 847BE1C, for example, you'd just need to connect one or two SFF-8643 cables to the chassis backplane(s) and call it a day. It makes everything shockingly easy.
 

MrYoshii

Ok, thanks for the info.
With zpool status -v I found two corrupted trash files on the array. I deleted them, restarted, and ran another scrub; now it seems to be fine. I also changed the controller mode from HBA to RAID RAW. Is there a way to check whether this fixes the issue?
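
One possible way to keep an eye on it, as a sketch rather than a guarantee: reset the counters with `zpool clear Array1_32TB`, run `zpool scrub Array1_32TB`, and then check whether the READ/WRITE/CKSUM columns in `zpool status -v Array1_32TB` stay at zero over the next scrubs. The Python below only parses `zpool status` text. The function name is hypothetical and the sample output is invented for illustration (device names and the count of 7 are made up).

```python
# Hypothetical helper: list devices whose READ/WRITE/CKSUM counters are not all "0".
def devices_with_errors(status_text):
    """Return {device_name: (read, write, cksum)} for rows with any nonzero counter."""
    bad = {}
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_table = True          # device rows follow this header
            continue
        if not in_table:
            continue
        if not fields:
            in_table = False         # a blank line ends the device table
            continue
        if len(fields) < 5:
            continue                 # rows without counters
        name, _state, read, write, cksum = fields[:5]
        # Counters are compared as text because zpool may abbreviate large values.
        if (read, write, cksum) != ("0", "0", "0"):
            bad[name] = (read, write, cksum)
    return bad

# Invented sample output, for illustration only.
SAMPLE_STATUS = """\
  pool: Array1_32TB
 state: ONLINE
config:

    NAME          STATE     READ WRITE CKSUM
    Array1_32TB   ONLINE       0     0     0
      raidz1-0    ONLINE       0     0     0
        sda       ONLINE       0     0     0
        sdb       ONLINE       0     0     7

errors: No known data errors
"""

print(devices_with_errors(SAMPLE_STATUS))  # prints {'sdb': ('0', '0', '7')}
```

On a real system the text could come from `subprocess.run(["zpool", "status", "-v", "Array1_32TB"], capture_output=True, text=True).stdout`. Clean counters after a few scrubs would be encouraging, but they would not by themselves show that the controller mode is safe.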
 