Faulted Drive or Hardware?

702enigma

Cadet
Joined
Mar 1, 2023
Messages
6
Dell R730xd running TrueNAS in a VM under Proxmox, using an HBA330 Mini.
Boot drive: USB running Clover to boot from a PCIe M.2
RAM: 64 GB
CPU: 2x Xeon E5-2660 v3
OS version: TrueNAS-SCALE-22.12.1

Drives are passed through using the HBA330

I have 4 identical HGST HUH721010AL4200 drives in a storage pool; I bought them used. A day after installing them I got a faulted message for one drive. I contacted the seller and was sent a replacement. After resilvering, it did the exact same thing. So I cleared the zpool status and moved it to a different bay, and it ran fine for two days. I switched it back to the previous bay and within 15 minutes it logged one read error, turning my pool unhealthy. I'm trying to determine whether it's just another bad drive from this seller or possibly a hardware issue.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
If you'd like some help working out your issue, please respect the forum rules (link in red above) and share the details of how you're connecting your disks (as well as your other details, like versions of software and hardware specs).

What's going to be very important is how you're getting access to your disks inside the VM.

 
Joined
Jun 15, 2022
Messages
674
Sounds like a bad SAS cable. Maybe it's not plugged in completely.*

Could also be a bad SAS port (they can fail).

Buy some backup hardware and learn how to run that disk check whose name I can't remember, but it starts with a 'c' and ends with an 'l'. That will tell you a lot. Then run badblocks, then that other program again; it'll tell you a lot about what's going on. Change out the suspect hardware and retest.

You really should do a hardware burn-in, it's documented all over these forums.**

---
*I mean it should be, but if you had the hangover I have things can happen.
**Man my head is killing me.
 

702enigma

Cadet
Joined
Mar 1, 2023
Messages
6
I will reinsert all the cables and see if that changes anything. It's just odd that the other 7 disks have no issues, which is what makes me suspect either a bad drive or a bad SAS port in the backplane.
 
Joined
Jun 15, 2022
Messages
674
There's another thread on here where a person was upset something "suddenly failed." Like they expected a Post-It* note.

Errors happen all the time in computers. When checking with smartctl (I remember the name now!) you can see them; if people would check the logs they'd see frequent errors. Gotta be proactive.

smartctl --xall /dev/[device]

---
*Pretty sure that's trademarked.
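For anyone wanting to capture that output for every drive at once, here's a minimal shell sketch; the /dev/sd? glob and the log path are assumptions, so adjust them for your system:

```shell
#!/bin/sh
# Minimal sketch: dump full SMART data for each whole-disk device into a
# dated log file so drive behavior can be compared over time.
# Assumptions: disks appear as /dev/sd?; /tmp/smart-logs is a scratch path.
logdir=/tmp/smart-logs
mkdir -p "$logdir"
for dev in /dev/sd?; do
    [ -e "$dev" ] || continue              # glob matched nothing
    smartctl --xall "$dev" > "$logdir/$(basename "$dev")-$(date +%F).txt"
done
echo "logs written to $logdir"
```

Keeping dated copies makes it easy to diff the error counters after a resilver or bay swap.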
 

702enigma

Cadet
Joined
Mar 1, 2023
Messages
6
I am pretty new to this, so I'm finding out all the things I did wrong, LOL.

Code:
=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HUH721010AL4200
Revision:             A21D
Compliance:           SPC-4
User Capacity:        10,000,831,348,736 bytes [10.0 TB]
Logical block size:   4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000cca3c41fa3f0
Serial number:        7P0KDEGC
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Fri Mar 17 09:40:10 2023 PDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled
Read Cache is:        Enabled
Writeback Cache is:   Disabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     35 C
Drive Trip Temperature:        85 C

Manufactured in week 42 of year 2017
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  101
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  552
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 5932197152292864

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0     3773         0      3773     737362     149031.614           0
write:         0        1         0         1      99472     161387.848           0
verify:        0        0         0         0      19922          0.000           0

Non-medium error count:        0

Self-test execution status:             100% of test remaining
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Self test in progress ...   -     NOW                 - [-   -    -]

Long (extended) Self-test duration: 65535 seconds [1092.2 minutes]

Background scan results log
  Status: scan is active
    Accumulated power on time, hours:minutes 10866:33 [651993 minutes]
    Number of background scans performed: 22,  scan progress: 0.18%
    Number of background medium scans performed: 22

Protocol Specific port log page for SAS SSP
relative target port id = 1
  generation code = 2
  number of phys = 1
  phy identifier = 0
    attached device type: expander device
    attached reason: SMP phy control function
    reason: unknown
    negotiated logical link rate: phy enabled; 12 Gbps
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=1
    SAS address = 0x5000cca3c41fa3f1
    attached SAS address = 0x500056b3a2197bff
    attached phy identifier = 11
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 0
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 0
     Running disparity error count: 0
     Loss of dword synchronization count: 0
     Phy reset problem count: 0
relative target port id = 2
  generation code = 2
  number of phys = 1
  phy identifier = 1
    attached device type: no device attached
    attached reason: unknown
    reason: power on
    negotiated logical link rate: phy enabled; unknown
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x5000cca3c41fa3f2
    attached SAS address = 0x0
    attached phy identifier = 0
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 0
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 0
     Running disparity error count: 0
     Loss of dword synchronization count: 0
     Phy reset problem count: 0
 
Joined
Jun 15, 2022
Messages
674
I'm still new to S.M.A.R.T. reporting, so this may be a bit off; if so, hopefully other members will correct me and we'll both learn something.

The Delayed Read Errors count is high, but not necessarily bad if the data sat on the drive a long time; the drive was able to correct them without remapping any blocks, so that's good.

The Delayed Write Error is something to keep an eye on, save this log for future reference.

The Non-Medium Error Count is 0, which generally means the controller board on the drive itself is in great condition.

Don't shut the system down until the long background test completes, or the drive will record it as an error and TrueNAS will probably throw an alert that has to be cleared (which is all more work than necessary).
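If you want to wait for the test automatically, a loop like this (just a sketch; /dev/sdX is a placeholder and the 10-minute interval is arbitrary) checks the drive's self-test log until it stops reporting the test as in progress:

```shell
#!/bin/sh
# Sketch: poll the drive's self-test log until the background long test
# is no longer reported as in progress. /dev/sdX is a placeholder device.
dev=/dev/sdX
while smartctl -l selftest "$dev" 2>/dev/null | grep -q 'in progress'; do
    sleep 600    # check again in 10 minutes
done
echo "self-test on $dev finished"
```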

After the long test completes you might want to try badblocks -w, as this will show the system in action because the test runs on the host CPU instead of inside the drive; a smartctl --xall afterwards will show how the drive behaved through all that. HOWEVER, -w writes over the entire drive and blanks any information on it, so be careful! If the drive contains information there's a badblocks option for a non-destructive read-write test that should read the data, do a write test, then restore the data; I've not used it. I'm guessing it's somewhat dangerous and could take a really long, long time to complete, but I don't know because I haven't tried it.

The smartctl long test (-t long) is read-only, much like badblocks without a write flag. badblocks -w writes one test pattern to the whole drive, then reads the whole drive back to see if it stored the information correctly. It then does this three more times with three different patterns, which should uncover any issues with the platters (it does a really good job of this). You'll need the 4096-byte block-size option (-b 4096), because the 512-byte default won't work on these 4Kn drives.
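As a concrete example, a destructive run on one of these 4Kn drives might be invoked like this. The snippet only echoes the command rather than running it, since -w wipes the disk; /dev/sdX and the log path are placeholders:

```shell
#!/bin/sh
# Sketch: build (but do not run) the badblocks invocation for a 4Kn drive.
# -b 4096 matches the drive's logical block size; -w is DESTRUCTIVE;
# -s shows progress; -o writes bad-block numbers to a log file.
dev=/dev/sdX                                   # placeholder device
log="/tmp/$(basename "$dev")-badblocks.txt"
echo badblocks -b 4096 -ws -o "$log" "$dev"    # drop 'echo' to actually run it
```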

---
I'm feeling a lot better, which is good because today is Friday and I usually go out after work for dinner and drinks. Well, mainly drinks. Dinner is just so I can hold my liquor, which of course I do. Or is it supper? I always get confused; it's supper, I'm pretty sure of that. Whatever. I'm getting pretty hungry, so maybe I'll leave work early given it's Friday. That'll give me more time for drinking anyway.
 
Joined
Jun 15, 2022
Messages
674
Addition: badblocks -n
Use non-destructive read-write mode.

Without any options, badblocks makes one read-only pass over the whole disk and reports any block it cannot read. (Any ECC correction or remapping of a failing sector to a spare happens inside the drive's own firmware, not in badblocks itself.) A 5 TB drive takes approximately 10 hours.

-w will overwrite all sectors (the whole disk) with 0xaa, then go back and read the disk to verify the data (two passes: one write, one read). It then does this three more times with the patterns 0x55, 0xff, and 0x00, for a total of eight passes over the disk surface. Total time depends on how fast the drive can write data; older drives write much more slowly than they read. With that said, one run (totaling eight passes) takes somewhere between 4 and 6 days, depending on various factors.
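As a rough sanity check on that 4-to-6-day figure, here's a back-of-envelope calculation; the 10 TB capacity matches the OP's drives, but the ~150 MB/s sustained rate is an assumption (real throughput drops toward the inner tracks):

```shell
#!/bin/sh
# Back-of-envelope estimate for a full badblocks -w run (8 passes).
# Assumptions: 10 TB capacity, ~150 MB/s sustained; real speeds vary.
capacity_mb=10000000     # 10 TB expressed in MB
speed_mb_s=150
passes=8
seconds=$(( capacity_mb * passes / speed_mb_s ))
hours=$(( seconds / 3600 ))
echo "approx $hours hours ($(( hours / 24 )) days)"
```

That lands at roughly 6 days, right at the top of the quoted range.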

My apparently incorrect understanding:
-n will read 64 blocks (by default; -c changes that count), overwrite them with a test pattern, verify them, repeat for the other three test patterns, then put the original data back and verify it was written correctly (assuming there were no prior errors). That's 8 passes plus the read/write/verify of the original data, 11 passes in all (as I understand it), which is approximately 40% longer than -w.

I thought this was what happened because -n reportedly takes longer than -w, though I haven't tried -n myself. However, from a more definitive source:

Alternate information:
This test is designed for devices with data already on them. A non-destructive read-write test makes a backup of the original content of a sector before testing with a single random pattern and then restoring the content from the backup. This is a single pass test and is useful as a general maintenance test.

---
Non-destructive... maybe it does make only a single pass but writes the blocks under test to a remapped location so they're not lost in the event of sudden power loss. That I don't know, but it would explain the longer run time. Stack Exchange seems a bit vague on the exact mechanics, and it was the most descriptive source I found.

Guessing at this is only that, maybe someone who knows can comment on what's really happening.
 