Replaced One Drive, Resilvering, Now Multiple Failed AND Data Corruption!?

isopropyl · Feb 24, 2024

dak180 said:
Check the settings in the gui and make sure that sqlite3 /data/freenas-v1.db 'select em_fromemail from system_email;' gets you the same address as it sends from in the mails you are getting.

Yes, that command prints my correct e-mail address.

dak180 · Feb 24, 2024

@isopropyl who are you using for the outgoing smtp provider?

isopropyl · Feb 24, 2024

dak180 said:
@isopropyl who are you using for the outgoing smtp provider?

The outgoing mail server? (GUI > System > Email > Send Mail Method > Outgoing Mail Server)
I'm using my own domain. Self hosted e-mail server (Mailcow).

Like I said, I get e-mails from TrueNAS's built-in alerts with no issues.

dak180 · Feb 25, 2024

isopropyl said:
The outgoing mail server? (GUI > System > Email > Send Mail Method > Outgoing Mail Server)
I'm using my own domain. Self hosted e-mail server (Mailcow).

Do you have any logs that would tell you if the email is even getting to your smtp server or not?

isopropyl · Feb 27, 2024

dak180 said:
Do you have any logs that would tell you if the email is even getting to your smtp server or not?

Well, can we do it this way to at least get the report for the time being: shouldn't the report also be able to just run and print locally into a text file, instead of needing e-mail to deliver it?

dak180 · Feb 27, 2024

isopropyl said:
shouldn't the report also be able to just run and print locally into a text file?

It does and you can see the line where it actually sends the report.
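
For what it's worth, here is a minimal sketch of capturing any report command's output to a local text file, independent of e-mail. `save_report` is a hypothetical helper, not something taken from report.sh, and the directory in the example comment is only the one mentioned earlier in this thread.

```shell
#!/bin/sh
# save_report DIR COMMAND [ARGS...]
# Hypothetical helper (not part of report.sh): runs COMMAND, writes its
# stdout and stderr to a timestamped file under DIR, prints the file's
# path, and returns COMMAND's exit status.
save_report() {
    dir="$1"
    shift
    mkdir -p "$dir" || return 1
    out="$dir/report-$(date +%Y%m%d-%H%M%S).txt"
    "$@" > "$out" 2>&1
    status=$?
    printf '%s\n' "$out"
    return "$status"
}

# Example, run on the NAS as root:
#   save_report /root/report-scripts zpool status -v
```

The printed path makes it easy to copy the file somewhere readable afterwards, for example onto a dataset that is shared over SMB.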

isopropyl · Feb 28, 2024

dak180 said:
It does and you can see the line where it actually sends the report.

Ah, I will take a look when I get home and see if it's generating

isopropyl · Mar 7, 2024

dak180 said:
It does and you can see the line where it actually sends the report.

Something is up, because I changed logfileLocation="/tmp" to an actual directory (/root/report-scripts/),
and when I edit that into the report.conf file, save it, and then run report.sh, it still just prints:
"Please specify a config file location, if none exist one will be created"

joeschmuck · Mar 7, 2024

I know running the script is not directly related to your original problem; however, try this little command:
zpool status -v | mail -s "testing" your_email@address.com
Do you get the email?

If yes, I'm not trying to peddle Multi-Report, but see if you can make that work. It's located in the Resources area (I think I have a link in my signature as well). I've been told that many users like to drop the script into the root folder. I'm not personally a fan, but it makes setting permissions easier.

When you run the script the first time, it will tell you to create a configuration by using multi_report.sh -config. Select the option to generate a New configuration file and answer the questions; if you are unsure what to do, hit Return to skip. All we really need to do is set up your email address. Select to Automatically Scan and adjust for the drives. Once done, exit, then run the script normally using multi_report.sh, and if that works, great. I am a bit perplexed about why FreeNAS-Report isn't working for you; it's a well-written script and many people use it. I'm offering this option to see if it works or not.

dak180 · Mar 7, 2024

isopropyl said:
Something is up, because I changed logfileLocation="/tmp" to an actual directory (/root/report-scripts/),
and when I edit that into the report.conf file, save it, and then run report.sh, it still just prints:
"Please specify a config file location, if none exist one will be created"

Something very odd is definitely happening (or maybe not happening); try changing it back to the tmp location and then grabbing the file from there.

Stux · Mar 18, 2024

Meanwhile, did you verify the serial number before removing the original failing drive?

Is it possible you removed the wrong drive?

Often on SATA disks, you will see the UDMA error count increase in the SMART reports when the cabling or power delivery is bad. If that’s not happening, cabling or power is possibly not the cause.
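
A minimal sketch of that check, assuming the usual `smartctl -A` attribute table layout for ATA/SATA devices, where the raw value is the tenth column. `crc_count` is a hypothetical helper name.

```shell
# crc_count: read `smartctl -A` output on stdin and print the raw value of
# attribute 199 (UDMA_CRC_Error_Count). Prints nothing when the attribute
# is absent, which is expected for SAS drives since they report SCSI error
# counter logs instead of ATA attributes.
crc_count() {
    awk '$1 == 199 { print $10 }'
}

# Example, run on the NAS as root:
#   smartctl -A /dev/da16 | crc_count
```

What matters is whether the number keeps climbing between runs, not the absolute value, since the counter never resets.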

ZFS will detect bit errors in blocks. You probably won’t detect a bit error in a video.

If you lose a sector then you may be missing 4KB. A decent video player would just skip over this corrupted section, maybe glitching a bit.

isopropyl · Mar 20, 2024

Stux said:
Meanwhile, did you verify the serial number before removing the original failing drive?

Is it possible you removed the wrong drive?

Often on SATA disks, you will see the UDMA error count increase in the SMART reports when the cabling or power delivery is bad. If that’s not happening, cabling or power is possibly not the cause.

ZFS will detect bit errors in blocks. You probably won’t detect a bit error in a video.

If you lose a sector then you may be missing 4KB. A decent video player would just skip over this corrupted section, maybe glitching a bit.

Yeah, so I know MPV is pretty decent at playing corrupted files, but I don't know. After all this time, I haven't really noticed any issues with files or videos. Maybe I am just getting "lucky" and missing the files with issues. However, I skimmed through the large list of stuff it says is errored and really don't see issues with those files.

It's tough to read the SMART reports when there are SO many drives, but hopefully I can get this script working.
Yes, I verified the serial number.

isopropyl · Mar 20, 2024

dak180 said:
Something very odd is definitely happening (or maybe not happening); try changing it back to the tmp location and then grabbing the file from there.

Been so busy, let me try this. I haven't had a moment to read through joe's comment thoroughly, but let me see if this works. I didn't even try checking tmp.

isopropyl · Mar 27, 2024

Looks like the script has been running itself on the cron actually, and placing tmp logfiles in the /Logs folder I made. But just not e-mailing anything. Not the end of the world rn, care more about the contents of the log than it e-mailing right now.

Just trying to figure out how to copy the log file out of the shell so I can get it on my clipboard to post here..?

Edit: Ended up copying the file to my pool and opening it in samba with windows notepad lol

isopropyl · Mar 27, 2024

But okay, finally. Here is the logfile.
Looks like a few of the drives at the bottom show old-age and pre-fail?
Unless I'm overlooking something, all the SMART Health Statuses seem to show "OK". Is this a good sign, or am I missing something?
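
As a rough aid for reading this many SCSI/SAS SMART reports at once, here is a minimal sketch that scans report text like the log below and prints only the nonzero grown-defect, uncorrected-error, and non-medium-error counters. `smart_flags` is a hypothetical name, and it assumes the exact line layout seen in this log. It does not interpret ATA attribute tables.

```shell
# smart_flags: read the report text on stdin and print one line per
# nonzero counter, prefixed with the drive name from the section header.
smart_flags() {
    awk '
        /SMART status report for/ {
            # Header looks like: "... report for da14 drive (MODEL: SERIAL) ..."
            for (i = 1; i <= NF; i++) if ($i == "for") drive = $(i + 1)
        }
        /^Elements in grown defect list:/ && $NF > 0 {
            print drive ": grown defects " $NF
        }
        /^(read|write|verify):/ && $NF > 0 {
            # Last column of the error counter log is total uncorrected errors.
            kind = $1
            sub(/:$/, "", kind)
            print drive ": " kind " uncorrected " $NF
        }
        /^Non-medium error count:/ && $NF > 0 {
            print drive ": non-medium errors " $NF
        }
    '
}

# Example (the path is a placeholder):
#   smart_flags < /path/to/logfile
```

Going by the numbers in the log below, among da0 to da15 this would be expected to print lines for da2, da5, da6, da9 and da14. Nonzero counters are a prompt to look closer rather than a verdict on the drive, but they do show that "SMART Health Status: OK" can sit alongside uncorrected errors.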

Just some reminder context:

- Yes, some of the 4tb drives are old, however remember these are in pairs of 2, and I always swapped them out as soon as they threw any errors, AND I had 2 live spares in the system.

- I replaced all the cables between the backplanes and HBAs
- Furthermore, remember everything was working perfect until I went to go and replace that 1 20tb drive.
- The x3 20tb drives are on the rear of the chassis (so separate backplane) from all the 4tb drives (on the front).
- There are 2 HBA cards, each have 2 cables. 2 cables (HBA #1) go to front backplane, 2 cables (HBA #2) go to rear backplane.

I had replaced like x3 2-way mirror vdevs of 4tb drives with x1 3-way mirror of 20tb drives. Everything was working well for a few weeks, resilvered with no issues, no errors. Then suddenly I got an error saying it could not read SMART data on da17, so I RMAed that one drive just in case. After placing the new drive in the slot, during the resilver a BUNCH of my 4tb drives threw errors and showed degraded, as well as all 3 of the 20tb drives.

A crapload of files show under "errors: Permanent errors have been detected in the following files:", however from what I can tell the files SEEEEEM fine. Keyword seem. Not 100% sure.
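
To pull just the devices with checksum errors out of a long `zpool status` listing, a minimal sketch follows. `cksum_errors` is a hypothetical helper, and it assumes the NAME/STATE/READ/WRITE/CKSUM column layout shown in the output below.

```shell
# cksum_errors: read `zpool status` output on stdin and print the name,
# state, and CKSUM count of every config entry whose CKSUM column is
# nonzero. The spares list and everything outside the config table are
# skipped.
cksum_errors() {
    awk '
        $1 == "NAME" && $NF == "CKSUM" { in_config = 1; next }
        in_config && NF == 0           { in_config = 0 }
        in_config && $1 == "spares"    { in_config = 0 }
        in_config && NF >= 5 && $5 ~ /^[0-9]/ && $5 != "0" {
            print $1, $2, $5
        }
    '
}

# Example, run on the NAS as root:
#   zpool status -v | cksum_errors
```

Counts are printed as ZFS abbreviates them (for example `727K`), so the output is meant for reading, not for arithmetic.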

Code:

<b>########## Glabel Status ##########</b>
                                      Name  Status  Components
gptid/db71bcb5-32ca-11ec-b815-002590f52cc2     N/A  da1p2
gptid/d8b2f42f-32ca-11ec-b815-002590f52cc2     N/A  da0p2
gptid/9ff0f041-8f64-11ec-8462-002590f52cc2     N/A  da3p2
gptid/d8d6aa36-32ca-11ec-b815-002590f52cc2     N/A  da4p2
gptid/da1e1121-32ca-11ec-b815-002590f52cc2     N/A  da8p2
gptid/d9fb7757-32ca-11ec-b815-002590f52cc2     N/A  da7p2
gptid/0d56b97d-1e91-11ed-a6aa-ac1f6be66d76     N/A  da6p2
gptid/0cd1e905-3c2e-11ee-96af-ac1f6be66d76     N/A  da5p2
gptid/d96847a9-32ca-11ec-b815-002590f52cc2     N/A  da10p2
gptid/0d48d4ab-1e91-11ed-a6aa-ac1f6be66d76     N/A  da9p2
gptid/d7476d46-32ca-11ec-b815-002590f52cc2     N/A  da12p2
gptid/c774316e-3c2c-11ee-96af-ac1f6be66d76     N/A  da14p2
gptid/14811777-1b6d-11ed-8423-ac1f6be66d76     N/A  da15p2
gptid/749a1891-1b5c-11ee-941f-ac1f6be66d76     N/A  da13p2
gptid/d9a6f5dc-32ca-11ec-b815-002590f52cc2     N/A  da2p2
gptid/9fd0872d-8f64-11ec-8462-002590f52cc2     N/A  da11p2
gptid/0db68b72-32a7-11ec-8c36-002590f52cc2     N/A  da16p1
gptid/8aa4f83e-4c0d-11ee-8b4c-ac1f6be66d76     N/A  da17p2
gptid/8ab75bbc-4c0d-11ee-8b4c-ac1f6be66d76     N/A  da18p2
gptid/82164f5e-ab11-11ee-b98e-ac1f6be66d76     N/A  da19p2
gptid/0dbaec8b-32a7-11ec-8c36-002590f52cc2     N/A  da16p3
gptid/7481d076-1b5c-11ee-941f-ac1f6be66d76     N/A  da13p1
gptid/c75d4fed-3c2c-11ee-96af-ac1f6be66d76     N/A  da14p1
gptid/d7f09611-32ca-11ec-b815-002590f52cc2     N/A  da0p1
gptid/db4dd595-32ca-11ec-b815-002590f52cc2     N/A  da1p1
<br><br>
<b>########## ZPool status report for PrimaryPool ##########</b>
  pool: PrimaryPool
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 128K in 15:01:43 with 11510 errors on Mon Mar 25 18:01:46 2024
remove: Removal of vdev 6 copied 1.39T in 3h55m, completed on Tue Sep  5 16:54:55 2023
    28.2M memory used for removed device mappings
config:

    NAME                                              STATE     READ WRITE CKSUM
    PrimaryPool                                       DEGRADED     0     0     0
      mirror-0                                        ONLINE       0     0     0
        gptid/d7476d46-32ca-11ec-b815-002590f52cc2    ONLINE       0     0     2
        gptid/d8d6aa36-32ca-11ec-b815-002590f52cc2    ONLINE       0     0     2
      mirror-1                                        ONLINE       0     0     0
        gptid/d9a6f5dc-32ca-11ec-b815-002590f52cc2    ONLINE       0     0     2
        gptid/db71bcb5-32ca-11ec-b815-002590f52cc2    ONLINE       0     0     2
      mirror-2                                        ONLINE       0     0     0
        gptid/d8b2f42f-32ca-11ec-b815-002590f52cc2    ONLINE       0     0     0
        gptid/d96847a9-32ca-11ec-b815-002590f52cc2    ONLINE       0     0     0
      mirror-3                                        ONLINE       0     0     0
        gptid/d9fb7757-32ca-11ec-b815-002590f52cc2    ONLINE       0     0     0
        gptid/da1e1121-32ca-11ec-b815-002590f52cc2    ONLINE       0     0     0
      mirror-4                                        DEGRADED     0     0     0
        spare-0                                       DEGRADED     0     0 3.92K
          gptid/9fd0872d-8f64-11ec-8462-002590f52cc2  DEGRADED     0     0     0  too many errors
          gptid/0d56b97d-1e91-11ed-a6aa-ac1f6be66d76  ONLINE       0     0     0
        gptid/9ff0f041-8f64-11ec-8462-002590f52cc2    DEGRADED     0     0 3.92K  too many errors
      mirror-5                                        DEGRADED     0     0     0
        gptid/14811777-1b6d-11ed-8423-ac1f6be66d76    DEGRADED     0     0 4.33K  too many errors
        spare-1                                       DEGRADED     0     0 4.33K
          gptid/0cd1e905-3c2e-11ee-96af-ac1f6be66d76  DEGRADED     0     0     0  too many errors
          gptid/0d48d4ab-1e91-11ed-a6aa-ac1f6be66d76  ONLINE       0     0     0
      mirror-7                                        DEGRADED     0     0     0
        gptid/82164f5e-ab11-11ee-b98e-ac1f6be66d76    DEGRADED     0     0  727K  too many errors
        gptid/8ab75bbc-4c0d-11ee-8b4c-ac1f6be66d76    DEGRADED     0     0  727K  too many errors
        gptid/8aa4f83e-4c0d-11ee-8b4c-ac1f6be66d76    DEGRADED     0     0  727K  too many errors
    spares
      gptid/0d48d4ab-1e91-11ed-a6aa-ac1f6be66d76      INUSE     currently in use
      gptid/0d56b97d-1e91-11ed-a6aa-ac1f6be66d76      INUSE     currently in use

errors: Permanent errors have been detected in the following files:

[REDACTED FILENAMES FOR PRIVACY]


<br><br>
<b>########## ZPool status report for boot-pool ##########</b>
  pool: boot-pool
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
    The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
    the pool may no longer be accessible by software that does not support
    the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 00:01:18 with 0 errors on Mon Mar 25 03:46:18 2024
config:

    NAME        STATE     READ WRITE CKSUM
    boot-pool   ONLINE       0     0     0
      da16p2    ONLINE       0     0     0

errors: No known data errors
<br><br>
<b>########## SMART status report for da0 drive (HGST HUS726040AL4210: N8GG3NYY) ##########</b>

SMART Health Status: OK

Current Drive Temperature:     35 C
Drive Trip Temperature:        85 C

Accumulated power on time, hours:minutes 54200:04
Manufactured in week 29 of year 2016
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  172
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  2427
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        3         0         3   14212378     536317.613           0
write:         0        0         0         0    1494668      38965.916           0
verify:        0        0         0         0     866113          0.000           0

Non-medium error count:        0

<br><br>
<b>########## SMART status report for da1 drive (HGST HUS726040AL4210: N8GEBA2Y) ##########</b>

SMART Health Status: OK

Current Drive Temperature:     39 C
Drive Trip Temperature:        85 C

Accumulated power on time, hours:minutes 54200:24
Manufactured in week 29 of year 2016
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  183
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  2433
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0       70         0        70   13439250     521720.041           0
write:         0       10         0        10    2957502      40015.795           0
verify:        0        0         0         0     534849          0.000           0

Non-medium error count:        0

<br><br>
<b>########## SMART status report for da2 drive (HGST HUS726040AL4210: N8GEX1NY) ##########</b>

SMART Health Status: OK

Current Drive Temperature:     38 C
Drive Trip Temperature:        85 C

Accumulated power on time, hours:minutes 54201:02
Manufactured in week 29 of year 2016
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  176
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  2434
Elements in grown defect list: 10

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0       72         0        72   14205635     522175.606           0
write:         0     4710         0      4710    4104110      40129.779           0
verify:        0        0         0         0     420792          0.000           0

Non-medium error count:        0

<br><br>
<b>########## SMART status report for da3 drive (HGST HUS726040AL4210: NHG9JAAY) ##########</b>

SMART Health Status: OK

Current Drive Temperature:     37 C
Drive Trip Temperature:        85 C

Accumulated power on time, hours:minutes 53933:00
Manufactured in week 06 of year 2016
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  118
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  2362
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0      537         0       537   14732144     312853.241           0
write:         0       14         0        14    4174780      46304.938           0
verify:        0        0         0         0    5796339          0.000           0

Non-medium error count:        0

<br><br>
<b>########## SMART status report for da4 drive (HGST HUS726040AL4210: N8GG150Y) ##########</b>

SMART Health Status: OK

Current Drive Temperature:     33 C
Drive Trip Temperature:        85 C

Accumulated power on time, hours:minutes 54185:00
Manufactured in week 29 of year 2016
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  164
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  2416
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0       11         0        11   12574189     506729.109           0
write:         0        4         0         4    2645730      39118.807           0
verify:        0        0         0         0     393149          0.000           0

Non-medium error count:        0

<br><br>
<b>########## SMART status report for da5 drive (SEAGATE ST4000NXCLAR4000: Z1Z2428B    0000C4118VDG) ##########</b>

SMART Health Status: OK

Current Drive Temperature:     37 C
Drive Trip Temperature:        68 C

Accumulated power on time, hours:minutes 69776:42
Manufactured in week 39 of year 2013
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  257
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  2813
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   856518592        0         0  856518592          0     495093.463           0
write:         0        0         0         0          0     277974.097           0
verify: 2896572367        0         0  2896572367          0     381390.365           0

Non-medium error count: 30238953

<br><br>
<b>########## SMART status report for da6 drive (HITACHI HUS72604CLAR4000: K4K6EM8B) ##########</b>

SMART Health Status: OK

Current Drive Temperature:     37 C
Drive Trip Temperature:        60 C

Accumulated power on time, hours:minutes 42495:39
Manufactured in week 20 of year 2017
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  259
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  1884
Elements in grown defect list: 3

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0       10         0        10   10599811     517748.100           0
write:         0       45         0        45     549721     137089.229           2
verify:        0        0         0         0      77377        771.884           0

Non-medium error count:        0

<br><br>
<b>########## SMART status report for da7 drive (HGST HUS726040AL4210: N8GEX1PY) ##########</b>

SMART Health Status: OK

Current Drive Temperature:     39 C
Drive Trip Temperature:        85 C

Accumulated power on time, hours:minutes 54185:17
Manufactured in week 29 of year 2016
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  169
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  2419
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0       83         0        83   13365247     536096.161           0
write:         0       15         0        15    2901188      40020.071           0
verify:        0        0         0         0     540003          0.000           0

Non-medium error count:        0

<br><br>
<b>########## SMART status report for da8 drive (HGST HUS726040AL4210: N8GEBBRY) ##########</b>

SMART Health Status: OK

Current Drive Temperature:     38 C
Drive Trip Temperature:        85 C

Accumulated power on time, hours:minutes 54199:46
Manufactured in week 29 of year 2016
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  177
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  2428
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0       40         0        40   13898408     531312.770           0
write:         0        1         0         1    2534379      40095.037           0
verify:        0        0         0         0     408775          0.000           0

Non-medium error count:        0

<br><br>
<b>########## SMART status report for da9 drive (HITACHI HUS72604CLAR4000: K4K7KU5B) ##########</b>

SMART Health Status: OK

Current Drive Temperature:     36 C
Drive Trip Temperature:        60 C

Accumulated power on time, hours:minutes 42495:35
Manufactured in week 20 of year 2017
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  262
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  1892
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0       15         0        15   11499630     517541.016           2
write:         0        0         0         0     816743     132353.264           0
verify:        0        0         0         0     140953        785.723           0

Non-medium error count:        0

<br><br>
<b>########## SMART status report for da10 drive (HGST HUS726040AL4210: N8GEW7XY) ##########</b>

SMART Health Status: OK

Current Drive Temperature:     33 C
Drive Trip Temperature:        85 C

Accumulated power on time, hours:minutes 54199:35
Manufactured in week 29 of year 2016
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  167
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  2417
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0       12         0        12   14768514     530673.998           0
write:         0      224         0       224    2460376      39349.768           0
verify:        0        0         0         0     548574          0.000           0

Non-medium error count:        0

<br><br>
<b>########## SMART status report for da11 drive (HGST HUS726040AL4210: NHG9ZP7Y) ##########</b>

SMART Health Status: OK

Current Drive Temperature:     37 C
Drive Trip Temperature:        85 C

Accumulated power on time, hours:minutes 53983:21
Manufactured in week 06 of year 2016
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  115
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  2368
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0       15         0        15   14451693     326191.653           0
write:         0        2         0         2     839661      45552.873           0
verify:        0        0         0         0     947232          0.000           0

Non-medium error count:        0

<br><br>
<b>########## SMART status report for da12 drive (HGST HUS726040AL4210: N8GEBDYY) ##########</b>

SMART Health Status: OK

Current Drive Temperature:     36 C
Drive Trip Temperature:        85 C

Accumulated power on time, hours:minutes 54185:06
Manufactured in week 29 of year 2016
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  159
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  2408
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0      213         0       213   13899115     521843.978           0
write:         0        9         0         9    1993262      39187.139           0
verify:        0        0         0         0     543875          0.000           0

Non-medium error count:        0

<br><br>
<b>########## SMART status report for da13 drive (HITACHI HUS72604CLAR4000: K3GM0E6L) ##########</b>

SMART Health Status: OK

Current Drive Temperature:     36 C
Drive Trip Temperature:        60 C

Accumulated power on time, hours:minutes 48316:54
Manufactured in week 37 of year 2017
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  32
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  2042
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0       24         0        24   11663594     432468.294           0
write:         0      176         0       176    1486245     448046.144           0
verify:        0      716         0       716     395131       3614.828           0

Non-medium error count:        0

<br><br>
<b>########## SMART status report for da14 drive (HGST HUS726040AL4210: N8G8D8PT) ##########</b>

SMART Health Status: OK

Current Drive Temperature:     35 C
Drive Trip Temperature:        85 C

Accumulated power on time, hours:minutes 48434:38
Manufactured in week 19 of year 2016
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  73
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  2077
Elements in grown defect list: 85

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0       25         0        25    7853714     114333.632          12
write:         0     1422         0      1422    5511819     126037.456          14
verify:        0        0         0         0     190160          0.000           0

Non-medium error count:        0

<br><br>
<b>########## SMART status report for da15 drive (HGST HUS726040AL4210: NHGAJWEY) ##########</b>

SMART Health Status: OK

Current Drive Temperature:     33 C
Drive Trip Temperature:        85 C

Accumulated power on time, hours:minutes 53931:48
Manufactured in week 06 of year 2016
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  124
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  2365
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0       29         0        29   15451797     271767.183           0
write:         0       12         0        12    1763862      42872.005           0
verify:        0        0         0         0     644407          0.000           0

Non-medium error count:        0

<br><br>
<b>########## SMART status report for da16 drive (SanDisk SDSSDH3 512G: 21280S800845) ##########</b>

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   ---    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   ---    Old_age   Always       -       20653
 12 Power_Cycle_Count       0x0032   100   100   ---    Old_age   Always       -       112
165 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       524292
166 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       0
167 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       21
168 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       1
169 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       181
170 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       0
171 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       0
172 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       0
173 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       0
174 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       17
184 End-to-End_Error        0x0032   100   100   ---    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   ---    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   ---    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   074   040   ---    Old_age   Always       -       26 (Min/Max 11/40)
199 UDMA_CRC_Error_Count    0x0032   100   100   ---    Old_age   Always       -       136
230 Unknown_SSD_Attribute   0x0032   001   001   ---    Old_age   Always       -       0
232 Available_Reservd_Space 0x0033   100   100   004    Pre-fail  Always       -       100
233 Media_Wearout_Indicator 0x0032   100   100   ---    Old_age   Always       -       16
234 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       49
241 Total_LBAs_Written      0x0030   253   253   ---    Old_age   Offline      -       28
242 Total_LBAs_Read         0x0030   253   253   ---    Old_age   Offline      -       1115
244 Unknown_Attribute       0x0032   000   100   ---    Old_age   Always       -       0

No Errors Logged

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Short offline       Completed without error       00%     20630         -
<br><br>
<b>########## SMART status report for da17 drive (ST20000NM007D-3DJ103: ZVT5JPF5) ##########</b>

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   084   064   044    Pre-fail  Always       -       230117720
  3 Spin_Up_Time            0x0003   092   090   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       16
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   079   060   045    Pre-fail  Always       -       81572541
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       5103
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       16
 18 Unknown_Attribute       0x000b   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   066   054   000    Old_age   Always       -       34 (Min/Max 30/39)
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       10
193 Load_Cycle_Count        0x0032   099   099   000    Old_age   Always       -       3218
194 Temperature_Celsius     0x0022   034   046   000    Old_age   Always       -       34 (0 25 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0023   100   100   001    Pre-fail  Always       -       0
240 Head_Flying_Hours       0x0000   100   100   000    Old_age   Offline      -       3854 (241 31 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       16933799011
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       32078877649848

No Errors Logged

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Short offline       Completed without error       00%      5079         -
<br><br>
<b>########## SMART status report for da18 drive (ST20000NM007D-3DJ103: ZVT5J3MY) ##########</b>

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   084   064   044    Pre-fail  Always       -       233454560
  3 Spin_Up_Time            0x0003   092   090   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       15
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   080   060   045    Pre-fail  Always       -       96532293
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       5103
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       15
 18 Unknown_Attribute       0x000b   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   066   052   000    Old_age   Always       -       34 (Min/Max 30/39)
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       11
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       1403
194 Temperature_Celsius     0x0022   034   048   000    Old_age   Always       -       34 (0 24 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0023   100   100   001    Pre-fail  Always       -       0
240 Head_Flying_Hours       0x0000   100   100   000    Old_age   Offline      -       4748 (31 11 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       19340798947
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       35503816551615

No Errors Logged

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Short offline       Completed without error       00%      5079         -
<br><br>
<b>########## SMART status report for da19 drive (ST20000NM007D-3DJ103: ZVT06KLB) ##########</b>

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   083   064   044    Pre-fail  Always       -       192017240
  3 Spin_Up_Time            0x0003   092   092   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       8
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   076   060   045    Pre-fail  Always       -       42108942
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       1979
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       8
 18 Unknown_Attribute       0x000b   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   067   047   000    Old_age   Always       -       33 (Min/Max 28/37)
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       7
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       84
194 Temperature_Celsius     0x0022   033   042   000    Old_age   Always       -       33 (0 18 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0023   100   100   001    Pre-fail  Always       -       0
240 Head_Flying_Hours       0x0000   100   100   000    Old_age   Offline      -       1848 (13 202 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       8054649956
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       16803461162

No Errors Logged

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Short offline       Completed without error       00%      1955         -
<br><br>
</pre>
--88c82a9dd0a2325b9a788f54660399c1--

I am nervous about all this. Even though I have backups of all my critical data, this is a lot of money, and there is less-critical data that I have not backed up, simply because of its size, that I would love not to lose. I have enough stress in life right now, and this has been in the back of my mind driving me absolutely insane.
I am at a complete loss as to what to do from here, and honestly feeling so frustrated.

isopropyl · Apr 2, 2024

Anyone have any ideas?

Stux · Apr 2, 2024

Well... like I said... subtle errors in videos are subtle. Most likely cause is removing the wrong drive.

BTW, in a CORE->SCALE migrated pool I just made that mistake and nearly lost the pool. I.e., I had a failure in a mirror vdev, and the UI said "was sdBlah", and helpfully provided the serial, so I found the drive by serial number, removed that...

Oh Dear.

Yeah... that "was blah blah" is bull, and then TrueNAS SCALE reports the serial for the WRONG drive.

SO, I ended up checking every single serial that WAS a member of the pool, and replacing the drive that was NOT showing up as a member of the pool... rather than just looking up the drive by serial as reported by the UI... basically it lied...

So, if @isopropyl removed the drive by serial as reported by the UI, then he probably made the mistake I did... and removed the wrong drive.

Other theory is a power blip.

So, scrub, replace any failing drives... and then replace the files that you can from backup. That's all you can do.
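
A rough sketch of the "check every serial" approach, for anyone who wants to do it from the emailed report rather than by hand. It assumes the per-drive header lines look like the ones in the report posted above ("SMART status report for da17 drive (MODEL: SERIAL)"). The function names and the list of expected serials are made up for illustration; you would supply your own list from your records of what is in the pool.

```python
import re

# Matches the per-drive header lines of the report, e.g.
# "SMART status report for da17 drive (ST20000NM007D-3DJ103: ZVT5JPF5)"
HEADER_RE = re.compile(r"SMART status report for (\S+) drive \((.+): (\S+)\)")


def serials_by_device(report_text):
    """Return {device: serial} for every drive header found in the report."""
    return {m.group(1): m.group(3) for m in HEADER_RE.finditer(report_text)}


def missing_serials(expected_serials, report_text):
    """Serials you expect to see that no drive in the report carries."""
    present = set(serials_by_device(report_text).values())
    return sorted(set(expected_serials) - present)


if __name__ == "__main__":
    # Two header lines copied from the report above.
    sample = (
        "<b>########## SMART status report for da16 drive "
        "(SanDisk SDSSDH3 512G: 21280S800845) ##########</b>\n"
        "<b>########## SMART status report for da17 drive "
        "(ST20000NM007D-3DJ103: ZVT5JPF5) ##########</b>\n"
    )
    # Hypothetical list of serials you believe belong to the pool.
    expected = ["ZVT5JPF5", "ZVT5J3MY"]
    print(serials_by_device(sample))
    print(missing_serials(expected, sample))
```

The point is to work out which serial is absent from the set that is still present, instead of trusting the single serial the UI shows for the failed member.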

dak180 · Apr 2, 2024

In the bottom of the file (not sure why the top summary table is cut off) you can see that several drives are reporting uncorrected errors and grown defect lists (I am not as familiar with SAS drives as with SATA), and another is reporting CRC errors. The CRC errors are indicative of bad cables, but the defect lists are a good reason to replace the drives.
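
If it helps to pick these out of a long report, here is a minimal sketch that scans an ATA-style attribute table (the ten-column "ID# ATTRIBUTE_NAME ... RAW_VALUE" layout seen in the report) and lists watched attributes with a non-zero raw value. The watch-list and the function name are assumptions chosen for illustration, not something the report script itself defines, and the SAS "Error counter log" sections use a different layout that this does not parse.

```python
# Attributes whose non-zero raw value usually deserves a closer look.
# This list is an illustrative choice, not taken from the report script.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Reported_Uncorrect",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
)


def nonzero_watched(table_text, watched=WATCHED):
    """Return {attribute_name: raw_value} for watched attributes that are non-zero."""
    flagged = {}
    for line in table_text.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID;
        # this skips the header row and any surrounding text.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in watched and raw.isdigit() and int(raw) != 0:
            flagged[name] = int(raw)
    return flagged


if __name__ == "__main__":
    # Rows copied from the da16 section of the report.
    sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   ---    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   074   040   ---    Old_age   Always       -       26 (Min/Max 11/40)
199 UDMA_CRC_Error_Count    0x0032   100   100   ---    Old_age   Always       -       136
"""
    print(nonzero_watched(sample))
```

On the da16 rows this flags only the CRC count, which matches the cabling suspicion above. A non-zero value is a prompt to investigate, not proof of a failed drive.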
