Do I have a disk problem or something else?

arrowd (Dabbler, joined Jul 12, 2019, 16 messages)
First scary alert greeted me this morning: "Pool pool1 state is ONLINE: One or more devices has experienced an error resulting in data corruption. Applications may be affected."

It's an ASUS Z170-A with an SSD boot drive and four 2TB WD Red Pro drives (each with < 14,000 hours) in one pool, running TrueNAS 13.0 (I haven't installed U1 yet). I think the pool scrub ran last night, and the pool is now showing as unhealthy. Storage > Disks shows 2 checksum errors per disk (matching the output below), and the last SMART status is good for all four data disks.

I went to /var/db/system/syslog-.../log and ran 'zpool status -v', which gave this:

Code:
pool: pool1
 state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 02:29:58 with 1 errors on Sun Jul 17 02:44:58 2022
config:

    NAME                                            STATE     READ WRITE CKSUM
    pool1                                           ONLINE       0     0     0
      raidz1-0                                      ONLINE       0     0     0
        gptid/39686f88-46bf-11eb-9bc5-305a3a5a2315  ONLINE       0     0     2
        gptid/c6be12f6-46e4-11eb-80bd-305a3a5a2315  ONLINE       0     0     2
        gptid/ce865e08-4714-11eb-a895-305a3a5a2315  ONLINE       0     0     2
        gptid/38966df8-4776-11eb-840c-305a3a5a2315  ONLINE       0     0     2

errors: Permanent errors have been detected in the following files:

        /mnt/pool1/Backups/Berman/Berman image backup/Berman image backup2022-06-04T123009.vbk

I don't care about this older backup file from a Windows client and can happily delete it. Questions:
1. Is this error from one drive?
1.1 If so, which?
1.2 If not, what does the error mean?
2. Why wasn't ZFS able to repair it?
3. Will deleting the file make the pool healthy again?
3.1 If that won't do it, how do I get the pool to show healthy again?
4. Any other suggestions about what to do now?
 
(Member joined Jun 2, 2019, 591 messages)
More details about your NAS might be helpful. See forum rules.


1. Mobo, RAM, etc.
2. HBA controller. It looks like the mobo has HW RAID. Are you running it in AHCI or IT mode?
3. Drive model, age, etc. Are the disks CMR or SMR?
4. Have you run an extended (a.k.a. long) SMART self-test on each drive? (See the sketch below.)
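
If not, something like this will kick one off; a sketch, assuming TrueNAS CORE/FreeBSD device names ada0-ada3 (check yours under Storage > Disks):

Code:
smartctl -t long /dev/ada0   # start an extended self-test (repeat for ada1..ada3)
smartctl -a /dev/ada0        # once it finishes, review the self-test log and attributes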
 

arrowd (Dabbler, joined Jul 12, 2019, 16 messages)
As revealed by the Show:System button:
TrueNAS 13.0, ASUS Z170-A, Core i5-6600, 24GB DDR4, Samsung 850 PRO 256GB SSD boot, four 2TB WD Red Pro (CMR), StarTech 4-port PCIe SATA III 6Gbps disk controller, on-board Intel i219-V 1GbE network, APC Back-UPS Pro 1000 S

I have not run a long test, as I didn't know which drive to run it on and didn't want to make anything worse.
 
(Member joined Jun 2, 2019, 591 messages)

Is the StarTech SATA controller this model, with HW RAID support?


The recommendation here will likely be to replace the StarTech controller with a proper non-RAID SATA card, e.g. an LSI HBA flashed to IT mode.

If you can't put it in AHCI or IT mode, that may be your problem. My understanding is that ZFS can't fix data-corruption errors if it doesn't have direct access to the disks because a RAID controller sits in the path, even when that controller is set to AHCI mode.
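
A quick sanity check, as a sketch assuming TrueNAS CORE/FreeBSD: verify that each physical drive shows up individually and reports its real identity rather than a RAID volume:

Code:
camcontrol devlist      # each WD Red should appear as its own device
smartctl -i /dev/ada0   # should report the actual WD model and serial, not a RAID volume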

Some background reading material:


While ZFS can work with hardware RAID devices, ZFS will usually work more efficiently and with greater data protection if it has raw access to all storage devices. ZFS relies on the disk for an honest view to determine the moment data is confirmed as safely written and it has numerous algorithms designed to optimize its use of caching, cache flushing, and disk handling.

Disks connected to the system using a hardware, firmware, other "soft" RAID, or any other controller that modifies the ZFS-to-disk I/O path will affect ZFS performance and data integrity. If a third-party device performs caching or presents drives to ZFS as a single system without the low level view ZFS relies upon, there is a much greater chance that the system will perform less optimally and that ZFS will be less likely to prevent failures, recover from failures more slowly, or lose data due to a write failure. For example, if a hardware RAID card is used, ZFS may not be able to: determine the condition of disks; determine if the RAID array is degraded or rebuilding; detect all data corruption; place data optimally across the disks; make selective repairs; control how repairs are balanced with ongoing use; or make repairs that ZFS could usually undertake. The hardware RAID card will interfere with ZFS' algorithms. RAID controllers also usually add controller-dependent data to the drives which prevents software RAID from accessing the user data. In the case of a hardware RAID controller failure, it may be possible to read the data with another compatible controller, but this isn't always possible and a replacement may not be available. Alternate hardware RAID controllers may not understand the original manufacturer's custom data required to manage and restore an array.

Unlike most other systems where RAID cards or similar hardware can offload resources and processing to enhance performance and reliability, with ZFS it is strongly recommended that these methods not be used as they typically reduce the system's performance and reliability.

If disks must be attached through a RAID or other controller, it is recommended to minimize the amount of processing done in the controller by using a plain HBA (host adapter), a simple fanout card, or configure the card in JBOD mode (i.e. turn off RAID and caching functions), to allow devices to be attached with minimal changes in the ZFS-to-disk I/O pathway. A RAID card in JBOD mode may still interfere if it has a cache or, depending upon its design, may detach drives that do not respond in time (as has been seen with many energy-efficient consumer-grade hard drives), and as such, may require Time-Limited Error Recovery (TLER)/CCTL/ERC-enabled drives to prevent drive dropouts, so not all cards are suitable even with RAID functions disabled.

 

arrowd (Dabbler, joined Jul 12, 2019, 16 messages)
Thanks for the prompt and detailed reply. Yes, it is that StarTech model (it uses a Marvell 9230 chip). It is set in JBOD mode - no hardware RAID. It has to be configured during POST, but I can post its current settings if that would help. I don't see that it has any caching in place.

The system had been working perfectly for perhaps 4+ years until now. (I swapped out the HDDs last fall for new ones, so the system's run time is much greater than the ~14,000 hours I posted for the data drives.)

If I were to replace the controller with an LSI model, would I have to rebuild TrueNAS and the pool, or do you think I could just swap out the cards and reboot?

Is there any way to see if data recovery is possible with the current card?
 
(Member joined Jun 2, 2019, 591 messages)
"If I were to replace the controller with an LSI model, would I have to rebuild TrueNAS and the pool, or do you think I could just swap out the cards and reboot?"
Won't know until you try. I think you are blazing new trails.
"Is there any way to see if data recovery is possible with the current card?"
Unknown. Once again, you are blazing new trails.
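
For what it's worth, ZFS identifies pool members by on-disk labels rather than by controller or device name, so pools often survive a controller swap. If the pool doesn't import automatically on the new card, something like this sketch (using the pool name above) should bring it back:

Code:
zpool import          # list pools ZFS can see on the new controller
zpool import pool1    # import the pool by name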

You should be able to restore any data from your existing 3-2-1 backup strategy, right? RAID is not a backup; RAID is resiliency.
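
As for getting the pool to show healthy again: since you don't care about the flagged file, the usual sequence is to delete (or restore) it, clear the error counters, and scrub. Roughly, as a sketch using the pool and path from your output:

Code:
rm "/mnt/pool1/Backups/Berman/Berman image backup/Berman image backup2022-06-04T123009.vbk"
zpool clear pool1       # reset the READ/WRITE/CKSUM counters
zpool scrub pool1       # re-verify the whole pool
zpool status -v pool1   # confirm the error list is gone

Note that the "Permanent errors" entry may not disappear until the scrub completes (occasionally it takes a second scrub).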
 

arrowd (Dabbler, joined Jul 12, 2019, 16 messages)
Here's a follow-up to close this out: I replaced the StarTech PEXSAT34RH adapter (set in JBOD mode - no hardware RAID) with an LSI 9211-8i (from Art of Server). To my great surprise, TrueNAS booted right up and carried on right where it had left off. The four data drives have different device names and may now be in a different order than before - since ZFS tracks pool members by GPTID rather than device name, it doesn't matter.
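
In case it helps anyone wanting to map the gptid labels in zpool status back to device names, glabel shows the mapping on FreeBSD/TrueNAS CORE:

Code:
glabel status   # lists each gptid/... label and the adaX/daX device carrying it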

I took the opportunity to add a second SSD as a mirrored boot drive, and that works fine also.

Thanks for providing helpful comments.
 