Daily disk fault at 04:30

liteswap · Sep 8, 2021

My TrueNAS instance (TrueNAS-12.0-U4) runs as a VM under VMware ESXi, with the SAS controller and disks passed through for direct access. Every morning at around 0430, I get an alert that one disk is faulted. Rebooting the VM resilvers a tiny amount and clears the fault indication (see below). Note that the disks do not power cycle because the host server stays up.

The host server is powered through a UPS, so I would assume the power is smoothed, which in my mind rules out a power glitch. I also find it unlikely that there's an actual fault with the disks, since they are only just over a year old (they are 14TB Seagate IronWolfs) and this behaviour has only recently appeared even though they've been powered up continuously since installation. Nor would I expect a genuine disk fault to be a daily occurrence.

Any thoughts as to what may be causing this?

Thank you.

Code:

  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 51.5M in 00:00:04 with 0 errors on Wed Sep  8 08:36:20 2021
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank                                            ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/a5b2b36a-a99d-11ea-8b4f-000c29f4b07c  ONLINE       0     0     0
            gptid/a5d5643f-a99d-11ea-8b4f-000c29f4b07c  ONLINE       0     0     0
            gptid/a5e3f313-a99d-11ea-8b4f-000c29f4b07c  ONLINE       0     0     1

errors: No known data errors
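
For anyone wanting to log this over several days rather than eyeball it, here is a minimal Python sketch that picks out the rows of a `zpool status` config table with a non-zero READ/WRITE/CKSUM counter or a state other than ONLINE. The function name `devices_with_errors` and the `SAMPLE` constant are made up for illustration, and it assumes plain integer counters as in the output above (abbreviated counts such as `1.2K` would be skipped).

```python
import re

# Matches config rows such as:
#   "    gptid/...  ONLINE       0     0     1"
# The header row (NAME STATE READ WRITE CKSUM) does not match because
# its last three columns are not integers.
ROW = re.compile(r"^\s+(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def devices_with_errors(status_text):
    """Return (name, state, read, write, cksum) for rows that look unhealthy."""
    flagged = []
    for line in status_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue
        name, state = match.group(1), match.group(2)
        read, write, cksum = (int(match.group(i)) for i in (3, 4, 5))
        # Flag any non-zero counter, or any state other than ONLINE.
        if read or write or cksum or state != "ONLINE":
            flagged.append((name, state, read, write, cksum))
    return flagged


# Config table copied from the post above.
SAMPLE = """\
        NAME                                            STATE     READ WRITE CKSUM
        tank                                            ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/a5b2b36a-a99d-11ea-8b4f-000c29f4b07c  ONLINE       0     0     0
            gptid/a5d5643f-a99d-11ea-8b4f-000c29f4b07c  ONLINE       0     0     0
            gptid/a5e3f313-a99d-11ea-8b4f-000c29f4b07c  ONLINE       0     0     1
"""

if __name__ == "__main__":
    for row in devices_with_errors(SAMPLE):
        print(row)
```

On the table above it flags only the third disk (`gptid/a5e3f313-...`), the one with the single checksum error.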

sretalla · Sep 8, 2021

liteswap said:
Every morning at around 0430

Is that when you run a SMART test?

Any other maintenance tasks run on the ESXi at that time?

liteswap said:
with the SAS controller and disks passed through for direct access

And in your signature:
3 x 14TB Seagate drives via direct-accessed

Can you clarify how the disks are passed through?

Are you passing through the entire PCI HBA? or is the HBA managed by ESXi and only the disks passed in?

liteswap · Sep 8, 2021

As per my first post, both controller and disks are passed through.

I shall check for SMART tests timing (could be that I guess) and for other tasks. Thanks for the steer.

sretalla · Sep 8, 2021

liteswap said:
As per my first post, both controller and disks are passed through.

I get it, but the statement here and the way you worded it before had me wondering if I understood correctly what you had done.

Ericloewe · Sep 8, 2021

Can you post the exact alert you get? This is sounding like a disk firmware bug of some sort, but more data is better.

liteswap · Sep 8, 2021

Sure:

New alerts:
* Pool tank state is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.
The following devices are not healthy:

Disk 12982955141895746249 is FAULTED

I have also checked the times of the SMART tests. They're not run daily (I set them up so long ago that I had forgotten when they run!), nor are they run at 04:30.

Ericloewe · Sep 8, 2021

Is there some pattern to the time of day when this happens? E.g. does it decrease by a minute every day or something like that?

liteswap · Sep 9, 2021

Sorry for the delay. I must have deleted a few but here are some that crept through the net:
Tuesday, 31 August 2021 04:23
Sunday, 5 September 2021 04:16
Wednesday, 8 September 2021 04:28

So they're semi-random but there definitely seems to be a pattern...
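
As a rough way to check the "does it drift by a minute a day" idea against these three timestamps, here is a small Python sketch. The helper names `minutes_after_midnight` and `drift_per_day` are made up for illustration, and the parsing assumes an English locale for the day and month names.

```python
from datetime import datetime

# Alert timestamps exactly as listed above.
ALERTS = [
    "Tuesday, 31 August 2021 04:23",
    "Sunday, 5 September 2021 04:16",
    "Wednesday, 8 September 2021 04:28",
]

FORMAT = "%A, %d %B %Y %H:%M"  # assumes English day/month names


def minutes_after_midnight(stamps):
    """Return (ISO date, minutes since midnight) for each timestamp."""
    parsed = [datetime.strptime(s, FORMAT) for s in stamps]
    return [(p.date().isoformat(), p.hour * 60 + p.minute) for p in parsed]


def drift_per_day(stamps):
    """Average change in time-of-day (minutes per day) between consecutive alerts."""
    parsed = sorted(datetime.strptime(s, FORMAT) for s in stamps)
    drifts = []
    for earlier, later in zip(parsed, parsed[1:]):
        days = (later.date() - earlier.date()).days
        delta = (later.hour * 60 + later.minute) - (earlier.hour * 60 + earlier.minute)
        drifts.append(delta / days)
    return drifts


if __name__ == "__main__":
    print(minutes_after_midnight(ALERTS))
    print(drift_per_day(ALERTS))
```

For these three alerts the drift comes out at -1.4 and then +4.0 minutes per day, so the time moves in both directions within a 12-minute window rather than creeping steadily one way. Three samples is too few to conclude much, though.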

elvisimprsntr · Sep 9, 2021

Does your UPS automatically perform a run time self test?

Some UPSes output a simulated sine wave when running on battery power, which could cause unexpected behavior if the server power supply doesn't tolerate the choppy waveform.

Sine Wave vs. Simulated Sine Wave - Which is Best? - Minuteman UPS (minutemanups.com)

Also, if the battery is nearing its EOL, the transfer to battery may take longer and/or there may be a momentary glitch in power.

If your UPS supports it, you should be able to issue a command to manually initiate a self test, to at least rule it out as a contributor.

P.S. My UPS automatic self test alerted me the battery needed to be replaced. I already had one on order when the UPS battery gave up and the UPS automatically shut itself off. I removed the bad battery and ran without one for a couple of days until I hot installed the replacement.

liteswap · Sep 10, 2021

You may have something there. I shall check. Thank you for your suggestions.

liteswap · Sep 25, 2021

Just to round off this thread, which has effectively died: the daily alerts have stopped since I had cause to completely shut down the host VMware machine and restart it some time later.

Points to a possible hardware fault...
