Daily disk fault at 04:30

liteswap

Dabbler
Joined
Jun 7, 2011
Messages
37
My TrueNAS instance (TrueNAS-12.0-U4) runs as a VM under VMware ESXi, with the SAS controller and disks passed through for direct access. Every morning at around 04:30, I get an alert that one disk is faulted. A reboot of the VM resilvers a tiny amount and the fault indication is removed (see below). Note that the disks do not power cycle because the host server stays up.

The host server is powered through a UPS, so I would assume that the power is smoothed, which in my mind would rule out a power glitch. I also find it unlikely that there's an actual fault with the disks, since they are only just over a year old (they are 14TB Seagate IronWolfs) and this behaviour has only recently evidenced itself even though they've been powered up continuously since installation. Nor would I expect a real disk fault to be a daily occurrence.

Any thoughts as to what may be causing this?

Thank you.

Code:
  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 51.5M in 00:00:04 with 0 errors on Wed Sep  8 08:36:20 2021
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank                                            ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/a5b2b36a-a99d-11ea-8b4f-000c29f4b07c  ONLINE       0     0     0
            gptid/a5d5643f-a99d-11ea-8b4f-000c29f4b07c  ONLINE       0     0     0
            gptid/a5e3f313-a99d-11ea-8b4f-000c29f4b07c  ONLINE       0     0     1

errors: No known data errors
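
For anyone wanting to check this kind of output automatically, here is a minimal sketch (not part of TrueNAS or ZFS) that picks out the rows of the `config:` table with a non-zero READ, WRITE or CKSUM counter. The function name `devices_with_errors` is made up for illustration, and the sketch assumes the counters are plain integers as in the output above; abbreviated values such as `1.2K` would be skipped.

```python
import re

# The config table from the status output above, pasted verbatim.
ZPOOL_STATUS = """\
        NAME                                            STATE     READ WRITE CKSUM
        tank                                            ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/a5b2b36a-a99d-11ea-8b4f-000c29f4b07c  ONLINE       0     0     0
            gptid/a5d5643f-a99d-11ea-8b4f-000c29f4b07c  ONLINE       0     0     0
            gptid/a5e3f313-a99d-11ea-8b4f-000c29f4b07c  ONLINE       0     0     1
"""

# name, state, then three integer counters; the header row does not match
# because READ/WRITE/CKSUM are not digits.
ROW = re.compile(r"^\s*(\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def devices_with_errors(status_text):
    """Return (name, state, read, write, cksum) for rows with any non-zero counter."""
    flagged = []
    for line in status_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header, blank lines and prose are ignored
        name, state = match.group(1), match.group(2)
        read, write, cksum = (int(match.group(i)) for i in (3, 4, 5))
        if read or write or cksum:
            flagged.append((name, state, read, write, cksum))
    return flagged


for row in devices_with_errors(ZPOOL_STATUS):
    print(row)
```

On the output above, only the third gptid device is flagged, with a single checksum error.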
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Every morning at around 0430
Is that when you run a SMART test?

Any other maintenance tasks run on the ESXi at that time?

with the SAS controller and disks passed through for direct access
And in your signature:
3 x 14TB Seagate drives via direct-accessed

Can you clarify how the disks are passed through?

Are you passing through the entire PCI HBA, or is the HBA managed by ESXi and only the disks passed in?
 

liteswap

Dabbler
Joined
Jun 7, 2011
Messages
37
As per my first post, both controller and disks are passed through.

I shall check for SMART tests timing (could be that I guess) and for other tasks. Thanks for the steer.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
As per my first post, both controller and disks are passed through.
I get it, but the statement here and the way you worded it before had me wondering if I understood correctly what you had done.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Can you post the exact alert you get? This is sounding like a disk firmware bug of some sort, but more data is better.
 

liteswap

Dabbler
Joined
Jun 7, 2011
Messages
37
Sure:
New alerts:
* Pool tank state is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.
The following devices are not healthy:
  • Disk 12982955141895746249 is FAULTED

I have also checked the times of the SMART tests, and they're not run daily (I set them up so long ago that I'd forgotten when they run!), nor are they run at 04:30.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Is there some pattern to the time of day when this happens? E.g. does it decrease by a minute every day or something like that?
 

liteswap

Dabbler
Joined
Jun 7, 2011
Messages
37
Sorry for the delay. I must have deleted a few but here are some that crept through the net:
Tuesday, 31 August 2021 04:23
Sunday, 5 September 2021 04:16
Wednesday, 8 September 2021 04:28

So they're semi-random but there definitely seems to be a pattern...
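
A quick way to test for the kind of steady drift asked about above is to convert each alert time to minutes after midnight and compare consecutive alerts. This is only a sketch using the three timestamps listed; the helper names are made up for illustration, and three samples are far too few to prove anything.

```python
from datetime import datetime

# The three alert timestamps quoted above.
ALERTS = [
    "Tuesday, 31 August 2021 04:23",
    "Sunday, 5 September 2021 04:16",
    "Wednesday, 8 September 2021 04:28",
]


def parse(stamp):
    # Drop the weekday name; the rest of the string identifies the moment.
    return datetime.strptime(stamp.split(", ", 1)[1], "%d %B %Y %H:%M")


def minutes_after_midnight(moment):
    return moment.hour * 60 + moment.minute


times = [parse(s) for s in ALERTS]
minutes = [minutes_after_midnight(t) for t in times]
spread = max(minutes) - min(minutes)

# For each consecutive pair: (days between alerts, shift in time of day in minutes).
drift = []
for earlier, later in zip(times, times[1:]):
    days = (later.date() - earlier.date()).days
    shift = minutes_after_midnight(later) - minutes_after_midnight(earlier)
    drift.append((days, shift))

print("minutes after midnight:", minutes)
print("spread in minutes:", spread)
print("(days apart, shift in minutes):", drift)
```

For these samples the alerts fall within a 12-minute window, and the shift is 7 minutes earlier over the first five days and then 12 minutes later over the next three, so there is no steady per-day drift in either direction.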
 
Joined
Jun 2, 2019
Messages
591
Does your UPS automatically perform a run time self test?

Some UPSes output a simulated sine wave when running on battery power, which could cause unexpected behavior if the server power supply doesn't tolerate the choppy waveform.


Also, if the battery is nearing its EOL, the transfer may take longer and/or there may be a momentary glitch in power.

If your UPS supports it, you should be able to issue a command to manually initiate a self test, to at least rule it out as a contributor.

P.S. My UPS automatic self test alerted me the battery needed to be replaced. I already had one on order when the UPS battery gave up and the UPS automatically shut itself off. I removed the bad battery and ran without one for a couple of days until I hot installed the replacement.
 

liteswap

Dabbler
Joined
Jun 7, 2011
Messages
37
You may have something there. I shall check. Thank you for your suggestions.
 

liteswap

Dabbler
Joined
Jun 7, 2011
Messages
37
Just to round off this thread, which has effectively died, the daily alerts have stopped since I had cause to completely shut down the host VMware machine, and restart it some time later.

Points to a possible hardware fault...
 