"Phantom" Pool is Offline: None

Emile.Belcourt

Dabbler
Joined
Feb 16, 2021
Messages
23
Hello all,

I have a bit of a weird one: I cannot find any reference to it in the logs, but last night our primary storage system emailed me an alert that our primary drive pool was OFFLINE: None, as below:

TrueNAS @ stor-01

New alerts:
* Pool Warm_HDD_Pool state is OFFLINE: None

Current alerts:
* Pool Warm_HDD_Pool state is OFFLINE: None

I first got this alert at 5:30 in the morning, then received a second email two minutes later:

TrueNAS @ stor-01

The following alert has been cleared:
* Pool Warm_HDD_Pool state is OFFLINE: None

I've gone through and run a grep -rnw over the logs directory and cannot find any reference to the pool status at 5:30am this morning. There are also no alerts in the TrueNAS alerts interface, and our monitoring system (Zabbix) did not pick up any health status changes either.
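For reference, the search was along these lines (log locations are the TrueNAS CORE defaults, so treat the paths as approximate):

# Any mention of the pool name anywhere under the log directory
cd /var/log
grep -rnw -e "Warm_HDD_Pool" .
# Any OFFLINE transitions recorded by the middleware itself
grep -n "OFFLINE" middlewared.log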

The wording is a bit unusual because the alert detail just says "None". Is this something that has been seen before?

This pool is primarily used for our company SMB shares as well as hosting the iSCSI target for VMware, so if the pool had gone offline for even a few seconds it would have taken out one of our virtual hosts. However, there are no observable issues in any of our connected systems.

Many thanks,
Emile
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Was there a pool scrub under way during this time? That's the only thing I can think of that might result in a transient OFFLINE.
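If you want to confirm from the pool itself, something along these lines should show whether a scrub was running or had recently been kicked off (adjust the pool name as needed):

# Current/last scrub state for the pool
zpool status Warm_HDD_Pool
# When scrubs were started historically
zpool history Warm_HDD_Pool | grep -i scrub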
 

Emile.Belcourt

Dabbler
Joined
Feb 16, 2021
Messages
23
Was there a pool scrub under way during this time? That's the only thing I can think of that might result in a transient OFFLINE.
Hi Samuel, there were no scrubs running last night, but there is a long-running replication task that was going over the weekend. I wonder if this happened when it was switching from replicating one dataset to the next (stor-03 pulling from stor-01)?

On the pulling storage server I can see this at that exact time (I had initially been looking only at stor-01):

./zettarepl.log:160:[2021/09/25 17:05:30] INFO [replication_task__task_1] [zettarepl.replication.run] For replication task 'task_1': doing pull from 'Warm_HDD_Pool/dataset1' to 'Warm_HDD_Pool/dataset1' of snapshot='dataset1_Daily-2021-09-21_00-00' incremental_base='dataset1_Daily-2021-09-20_00-00' receive_resume_token=None encryption=False

(dataset names obfuscated)

That time stamp is exactly when the transient error occurred on stor-01.
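For completeness, this is roughly how I lined the two boxes up (patterns are approximate):

# On stor-03 (the pulling side): recent activity for that replication task
grep -n "replication_task__task_1" /var/log/zettarepl.log | tail
# On stor-01: middleware activity in the same minute as the alert
grep -n "05:30" /var/log/middlewared.log | tail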

Emile
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
That's bizarre, but that's the only candidate for now. As Sherlock Holmes said: "When you've eliminated the impossible, what remains, however improbable, must be the truth."
 

Emile.Belcourt

Dabbler
Joined
Feb 16, 2021
Messages
23
That's bizarre, but that's the only candidate for now. As Sherlock Holmes said: "When you've eliminated the impossible, what remains, however improbable, must be the truth."

I had an error in my original grep through the logs; having corrected it, I have found the following:

./cron:186:Nov 1 05:30:00 med-stor-01 1 2021-11-01T05:30:00.009338+00:00 stor-01 /usr/sbin/cron 91256 - - (root) CMD (/etc/ix.rc.d/ix-kinit renew > /dev/null 2>&1)
./cron:187:Nov 1 05:30:00 med-stor-01 1 2021-11-01T05:30:00.009920+00:00 stor-01 /usr/sbin/cron 91258 - - (root) CMD (/usr/libexec/atrun)
./cron:188:Nov 1 05:30:00 med-stor-01 1 2021-11-01T05:30:00.009987+00:00 stor-01 /usr/sbin/cron 91257 - - (root) CMD (/usr/libexec/atrun > /dev/null 2>&1)

./middlewared.log:43920:[2021/11/01 05:30:00] (WARNING) middlewared._loop_monitor_thread():1415 - Task seems blocked:
./middlewared.log:43942:[2021/11/01 05:30:04] (WARNING) middlewared._loop_monitor_thread():1415 - Task seems blocked:
./middlewared.log:43964:[2021/11/01 05:30:06] (WARNING) middlewared._loop_monitor_thread():1415 - Task seems blocked:
./middlewared.log:43986:[2021/11/01 05:30:09] (WARNING) middlewared._loop_monitor_thread():1415 - Task seems blocked:


./middlewared.log:44008:[2021/11/01 05:30:09] (DEBUG) EtcService.generate():378 - No new changes for /etc/krb5.conf
./middlewared.log:44009:[2021/11/01 05:30:10] (DEBUG) EtcService.generate():378 - No new changes for /etc/krb5.conf

./maillog:2:Nov 1 05:30:10 med-stor-01 1 2021-11-01T05:30:10.887106+00:00 stor-01 /middlewared 2120 - - sending mail to emile.belcourt@email
./maillog:8:Date: Mon, 01 Nov 2021 05:30:09 -0000

Could it be that the loop monitoring thread locked up when the replication task moved on to the next dataset, creating a race condition where it could not respond correctly or in time and therefore flagged a spurious error?

Looking through middlewared.log, there are actually a lot of these warnings throughout, with the stack caught at differing points in the code. Here are two samples (with a quick frequency check after them):
[2021/11/01 05:29:58] (WARNING) middlewared._loop_monitor_thread():1415 - Task seems blocked:
File "/usr/local/lib/python3.8/site-packages/middlewared/main.py", line 1195, in _call
return await methodobj(*prepared_call.args)
File "/usr/local/lib/python3.8/site-packages/middlewared/schema.py", line 973, in nf
return await f(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/middlewared/plugins/system.py", line 570, in info
product_version = (await(await Popen(
File "/usr/local/lib/python3.8/asyncio/subprocess.py", line 236, in create_subprocess_exec
transport, protocol = await loop.subprocess_exec(
File "/usr/local/lib/python3.8/asyncio/base_events.py", line 1630, in subprocess_exec
transport = await self._make_subprocess_transport(
File "/usr/local/lib/python3.8/asyncio/unix_events.py", line 197, in _make_subprocess_transport
transp = _UnixSubprocessTransport(self, protocol, args, shell,
File "/usr/local/lib/python3.8/asyncio/base_subprocess.py", line 36, in __init__
self._start(args=args, shell=shell, stdin=stdin, stdout=stdout,
File "/usr/local/lib/python3.8/asyncio/unix_events.py", line 789, in _start
self._proc = subprocess.Popen(
File "/usr/local/lib/python3.8/subprocess.py", line 854, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "/usr/local/lib/python3.8/subprocess.py", line 1658, in _execute_child
part = os.read(errpipe_read, 50000)

[2021/11/01 05:30:00] (WARNING) middlewared._loop_monitor_thread():1415 - Task seems blocked:
File "/usr/local/lib/python3.8/site-packages/middlewared/job.py", line 403, in __run_body
rv = await self.method(*([self] + args))
File "/usr/local/lib/python3.8/site-packages/middlewared/plugins/alert.py", line 496, in send_alerts
await alert_service.send(alerts, service_gone_alerts, service_new_alerts)
File "/usr/local/lib/python3.8/site-packages/middlewared/alert/service/mail.py", line 22, in send
text = await self._format_alerts(alerts, gone_alerts, new_alerts)
File "/usr/local/lib/python3.8/site-packages/middlewared/alert/base.py", line 246, in _format_alerts
hostname = (await self.middleware.call("system.info"))["hostname"]
File "/usr/local/lib/python3.8/site-packages/middlewared/main.py", line 1238, in call
return await self._call(
File "/usr/local/lib/python3.8/site-packages/middlewared/main.py", line 1195, in _call
return await methodobj(*prepared_call.args)
File "/usr/local/lib/python3.8/site-packages/middlewared/schema.py", line 973, in nf
return await f(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/middlewared/plugins/system.py", line 594, in info
'physmem': psutil.virtual_memory().total,
File "/usr/local/lib/python3.8/site-packages/psutil/__init__.py", line 1991, in virtual_memory
ret = _psplatform.virtual_memory()
File "/usr/local/lib/python3.8/site-packages/psutil/_psbsd.py", line 185, in virtual_memory
mem = cext.virtual_mem()
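For scale, a rough count along these lines shows how often they appear (middlewared.log lives under /var/log on CORE):

# Total blocked-loop warnings in the current log, then a per-day breakdown
grep -c "Task seems blocked" /var/log/middlewared.log
grep "Task seems blocked" /var/log/middlewared.log | cut -d' ' -f1 | sort | uniq -c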

Are these a cause for concern? Our other storage servers do not show them.

Many thanks,
Emile
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
I wonder if these are early indications of memory failing on this box. Can you take this box down to run a memory test?
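Before the outage, it might also be worth a quick, non-disruptive look for machine-check events in the kernel log; FreeBSD reports these as MCA lines:

# Any machine-check / memory events logged since boot
dmesg | grep -i -E "mca|machine check|ecc"
grep -i -E "mca|machine check" /var/log/messages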
 

Emile.Belcourt

Dabbler
Joined
Feb 16, 2021
Messages
23
I wonder if these are early indications of memory failing on this box. Can you take this box down to run a memory test?

Hi Samuel,

I was planning a maintenance outage this Friday, so I will look into it. The box hasn't been rebooted in nearly half a year; the outage was mainly to apply the U5 update, which has passed internal testing on our test and qual environments.
 