SOLVED: zpool degraded on TrueNAS

polakkenak

Cadet
Joined
Jun 10, 2022
Messages
3
Hi folks. I'm fairly new to TrueNAS and ZFS, so please bear with me.

I recently bought a Minisforum HM90 to serve my need for running VMs at home.
I put in 2x8GB of new memory and two older SATA SSDs I had lying around.

After running for a short while, SCALE warned me that one of the pools is in a degraded state. This is on a fresh install of SCALE 22.02.1.

Code:
root@truenas[~]# zpool status boot-pool
  pool: boot-pool
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 00:00:13 with 0 errors on Fri Jun 10 07:59:52 2022
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   DEGRADED     0     0     0
          sdb3      DEGRADED    23     0    34  too many errors

errors: No known data errors


After clearing the pool status and running a new scrub, the errors reappear almost immediately.

Code:
root@truenas[~]# zpool clear boot-pool
root@truenas[~]# zpool scrub boot-pool
root@truenas[~]# zpool status boot-pool
  pool: boot-pool
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 00:00:13 with 0 errors on Fri Jun 10 08:42:57 2022
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   DEGRADED     0     0     0
          sdb3      DEGRADED    16     0    79  too many errors

errors: No known data errors


Doing some testing, I found that the problem would appear on whichever disk was plugged into the "JFPC1" port, while the drive attached to the other port had no problem.

I suspected a hardware fault (e.g. a faulty SATA port/controller) and contacted the reseller to enquire about repairs. The reseller said this could be a compatibility problem and asked me to try to replicate the problem on Windows before sending the unit in for repairs.

I tried this, and I wasn't able to get Windows to complain about the drive during my testing. Windows doesn't really offer ZFS support, though, so take that with a grain of salt.

I also tried replicating the problem with live USBs for both Proxmox and Ubuntu Server, and neither detected any problem. Running TrueNAS in recovery mode (off the boot drive, not a USB) doesn't show any problems with the drive either.
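I haven't run SMART self-tests on the drives yet; if that would be useful, I'd do something along these lines with smartmontools (device names are just what they happen to be on my box):

Code:
# quick health summary for both SSDs
smartctl -H /dev/sda
smartctl -H /dev/sdb

# start a long self-test on the suspect drive, then read back the results later
smartctl -t long /dev/sdb
smartctl -a /dev/sdb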

So, to summarize:
  • zpool status says a pool is degraded
  • Swapping SATA ports moves the problem to the other drive
  • Testing while running off a live USB doesn't exhibit the same problem
  • Testing while in truenas recovery mode doesn't exhibit the same problem

I'm stumped as to what is causing this. Could a setting be the culprit?
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
reseller said this could be a compatibility problem
I think the reseller is on the right track.
I'd guesstimate, based on your excellent attempts, that the SATA controller (or part of its setup) might not be well supported by the underlying OS.

Have a look at the recommended hardware guide - it might appear slightly dated - but the gist of it holds true to this day.
If you can't find a system similar to yours that's of a similar age to the guide (spoiler - you will not), it's safe to say that's exactly why there is a handy hardware recommendations guide.

Kudos to your efforts.
 

polakkenak

Cadet
Joined
Jun 10, 2022
Messages
3
Thank you for the link to the hardware guide; I'll keep it handy for the next time I'm putting something together.

I managed to solve this by modifying the power settings as described in this post over at Minisforum.

My guess is that recovery mode (and the live USBs) wasn't mounting the other SSD, so it might be related to the power supplied to the two drives, since they appear to share the same SATA controller according to the lshw output.
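In case anyone wants to check the same thing, this is roughly the lshw invocation I mean (output omitted; controller and device names will differ from system to system):

Code:
# list storage controllers and the disks hanging off them
lshw -class storage -class disk -short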
 

polakkenak

Cadet
Joined
Jun 10, 2022
Messages
3
For what it's worth, TrueNAS SCALE uses med_power_with_dipm as the link power management policy for the SCSI hosts.
Testing showed that the problem appeared with all of the DIPM-enabled variants, but medium_power or max_performance works great.
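For anyone hitting the same thing, the policy lives in sysfs, so checking and changing it at runtime looks roughly like the below (host numbers vary per system, and the change doesn't persist across reboots, so it would need to be reapplied, e.g. via an init script):

Code:
# show the current link power management policy for every SATA/SCSI host
grep . /sys/class/scsi_host/host*/link_power_management_policy

# switch every host to max_performance (medium_power also worked for me)
for p in /sys/class/scsi_host/host*/link_power_management_policy; do
    echo max_performance > "$p"
done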
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
Thank you for your contributions!
 