Multiple ungraceful reboots and a temporarily unhealthy pool on "scrub-day"

Mastakilla · Dec 2, 2020

Hi all,

A few days ago, I had some very unpleasant events with my TrueNAS server. When I woke up, before even touching my NAS or computer, I noticed that my NAS had rebooted by itself. So I quickly turned on my computer to log in to the IPMI and check what was happening, and a few minutes later (I think it was before I managed to log in to the IPMI), the NAS rebooted by itself again. Later I found in the logs that the NAS had rebooted 6 times by itself that morning. This had never happened before in about a year of usage (and monthly scrubs)...

Also, at first it claimed the pool was healthy, but then it suddenly claimed the pool was unhealthy and a (few) day(s) later (after perhaps some more graceful reboots) the pool is suddenly healthy again?? When it temporarily was unhealthy, I was able to grab this screenshot:

However, when I look at the status of this same pool (without "changing" or "fixing" anything), it now says

Also, in the alerts there was first a critical error about my pool being unhealthy, which disappeared again by itself a (few) day(s) later??

The checksum errors occur on multiple disks, so it seems unlikely to me that the HDDs themselves are dying
No SMART errors on any of the disks. No issues after a short self test either.
I couldn't find any logs of the scrub. I know it starts at 4h00 for hgstpool, but I don't know until which time it runs or if it resumes after an (un)graceful reboot or not. Why isn't there a log of the scrub process??
When looking at my alert emails, the reboots occurred at 8h21, 9h18, 10h39, 10h44. However, when looking in /var/log/messages, I can see ungraceful reboots at 8h17, 8h21, 9h14, 9h18, 10h39 and 10h44.
When looking at my alert emails, I can see messages like "Pool hgstpool state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected." in emails at 9h35, which was then cleared at both 10h40 and 10h45 (no idea how it got cleared, I certainly didn't want it to get cleared, I wanted to investigate it! Also no idea how it can get cleared twice??). Then the same message as a "new alert" at 12h48 (I suppose this means that the scrub was still running?), which got cleared at 16h07. And then one more new alert with the same message at 16h12, for which I never received a cleared email.
I've checked my IPMI event log, but it doesn't contain anything around those reboots
I've thoroughly studied my /var/log/messages and /var/log/console.log, but couldn't really find anything in there that is not in there during a normal or graceful boot. However, there is nothing in there regarding the shutdown around those times (which tells me that it was not a graceful reboot).
/var/log/messages does contain the following recurring error messages (also with graceful reboots), but I doubt this is related

In /var/log/middlewared.log I did find multiple entries of failures to send emails with a time that's 14 seconds before the first entries of /var/log/messages (so I guess before the interface was brought up, so no wonder it fails... Could this be a bug perhaps?)

Code:

[2020/11/28 08:17:24] (DEBUG) middlewared.setup():1641 - Timezone set to Europe/Brussels
[2020/11/28 08:17:25] (DEBUG) middlewared.setup():2834 - Certificate setup for System complete
[2020/11/28 08:17:25] (WARNING) MailService.send_raw():432 - Failed to send email: [Errno 8] Name does not resolve
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/middlewared/plugins/mail.py", line 407, in send_raw
server = self._get_smtp_server(config, message['timeout'], local_hostname=local_hostname)
File "/usr/local/lib/python3.8/site-packages/middlewared/plugins/mail.py", line 454, in _get_smtp_server
server = smtplib.SMTP(
File "/usr/local/lib/python3.8/smtplib.py", line 253, in __init__
(code, msg) = self.connect(host, port)
File "/usr/local/lib/python3.8/smtplib.py", line 339, in connect
self.sock = self._get_socket(host, port, self.timeout)
File "/usr/local/lib/python3.8/smtplib.py", line 308, in _get_socket
return socket.create_connection((host, port), timeout,
File "/usr/local/lib/python3.8/socket.py", line 787, in create_connection
for res in getaddrinfo(host, port, 0, SOCK_STREAM):
File "/usr/local/lib/python3.8/socket.py", line 918, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno 8] Name does not resolve
[2020/11/28 08:17:25] (ERROR) middlewared.job.run():373 - Job <bound method accepts.<locals>.wrap.<locals>.nf of <middlewared.plugins.mail.MailService object at 0x81cb7afa0>> failed
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/middlewared/plugins/mail.py", line 407, in send_raw
server = self._get_smtp_server(config, message['timeout'], local_hostname=local_hostname)
File "/usr/local/lib/python3.8/site-packages/middlewared/plugins/mail.py", line 454, in _get_smtp_server
server = smtplib.SMTP(
File "/usr/local/lib/python3.8/smtplib.py", line 253, in __init__
(code, msg) = self.connect(host, port)
File "/usr/local/lib/python3.8/smtplib.py", line 339, in connect
self.sock = self._get_socket(host, port, self.timeout)
File "/usr/local/lib/python3.8/smtplib.py", line 308, in _get_socket
return socket.create_connection((host, port), timeout,
File "/usr/local/lib/python3.8/socket.py", line 787, in create_connection
for res in getaddrinfo(host, port, 0, SOCK_STREAM):
File "/usr/local/lib/python3.8/socket.py", line 918, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno 8] Name does not resolve

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/middlewared/job.py", line 361, in run
await self.future
File "/usr/local/lib/python3.8/site-packages/middlewared/job.py", line 399, in __run_body
rv = await self.middleware.run_in_thread(self.method, *([self] + args))
File "/usr/local/lib/python3.8/site-packages/middlewared/utils/run_in_thread.py", line 10, in run_in_thread
return await self.loop.run_in_executor(self.run_in_thread_executor, functools.partial(method, *args, **kwargs))
File "/usr/local/lib/python3.8/site-packages/middlewared/utils/io_thread_pool_executor.py", line 25, in run
result = self.fn(*self.args, **self.kwargs)
File "/usr/local/lib/python3.8/site-packages/middlewared/schema.py", line 977, in nf
return f(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/middlewared/plugins/mail.py", line 276, in send
return self.send_raw(job, message, config)
File "/usr/local/lib/python3.8/site-packages/middlewared/schema.py", line 977, in nf
return f(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/middlewared/plugins/mail.py", line 436, in send_raw
raise CallError(f'Failed to send email: {e}')
middlewared.service_exception.CallError: [EFAULT] Failed to send email: [Errno 8] Name does not resolve

I'm not sure how to proceed... After hours of digging through the log files, I still have no clue why my system suddenly rebooted a couple of times (I guess during a scrub) and why my pool first became unhealthy and later automagically turned healthy again??

I did do proper burn-in testing. Since then I did perform 2 changes to my hardware:

I've replaced my 32GB of ECC RAM with 64GB ECC RAM and re-did Memtest86 testing (This was about 1 month ago, so this might be the first scrub with the new RAM. I did however test ECC functionality of this RAM using memtest86, so it should properly report ECC errors in /var/log/messages when they occur - and it didn't report any)
I've added an Optane drive that I use as SLOG (this was done some months ago, so the server certainly has undergone some scrubs since then)

My power supply is a high quality Platinum power supply that survived days of running prime95 and the HDD burn-in tools and which should be plenty overkill for my setup. I don't have a UPS yet (planning on getting one though), but normally power is very stable in our region.

Does anyone know which log files or screens I should check to dig deeper? Or how I could best "prepare" (like setting a verbose flag somewhere) a test-scrub to see if the issues re-occur?

Thanks!

Mastakilla · Dec 4, 2020

After giving this some more thought, I think I should probably focus on the new RAM first.

The RAM is probably the only thing that changed since the last successful scrub
The RAM was 100% Memtest86 stable
I've confirmed that the ECC functionality of the RAM works in Memtest86
I've confirmed with my previous ECC memory that my platform in TrueNAS 12 properly reports ECC errors in /var/log/messages. I didn't re-confirm this with the new memory. It would very much surprise me if it would be any different though
Memtest86 only stresses the RAM and some parts of the CPU; it doesn't stress the HDDs, nor does it fully stress the CPU
Highly unstable RAM can cause sudden reboots, but as my RAM is 100% stable when stressed alone, it seems like perhaps some other factor is causing the RAM to become unstable
A scrub of my 8 disk pool will cause intensive use of those 8 disks at the same time for a long period. As far as I know, the only impact this can have on the rest of the system, is long time high power usage. I don't think using HDDs can cause "electrical disturbance" to other remote components, like for example a microwave that can affect Wifi signals...
The scrub might also cause some higher usage of the CPU for doing checksums, but with a Ryzen 3600, that should be still very minimal
A cheap / bad power supply can deliver too low a voltage on a rail when that rail is stressed too much. My PSU should be high quality, so really should not do this, but...
RAM uses the 3.3V rail, HDDs use the 5V and 12V rail, so normally stressing the HDDs should not impact the RAM voltage
The reboots only occurred after the scrub was already running for more than 4 hours. The reboots seemed to occur in pairs, with 4 or 5 minutes in between... Very weird...
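
To make the "pairs" pattern concrete, here is a minimal Python sketch that computes the gaps between the six ungraceful reboot times found in /var/log/messages (8h17, 8h21, 9h14, 9h18, 10h39 and 10h44). The helper name `gaps_in_minutes` is purely illustrative and not part of any TrueNAS tooling.

```python
from datetime import datetime

# Ungraceful reboot times on 28 November, as read from /var/log/messages.
REBOOTS = ["08:17", "08:21", "09:14", "09:18", "10:39", "10:44"]


def gaps_in_minutes(times):
    """Return the gap in whole minutes between each consecutive pair of HH:MM times."""
    parsed = [datetime.strptime(t, "%H:%M") for t in times]
    return [int((b - a).total_seconds() // 60) for a, b in zip(parsed, parsed[1:])]


gaps = gaps_in_minutes(REBOOTS)
print(gaps)  # [4, 53, 4, 81, 5]
```

The short gaps are 4, 4 and 5 minutes, separated by quieter stretches of 53 and 81 minutes. That would be consistent with the system crashing again shortly after coming back up, although that interpretation is only a guess.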

So how about some testing?

I'd like to confirm that this issue is reproducible by starting another scrub, but I'm a bit scared that it might damage my pool. If it is indeed the RAM that is causing the reboots, it could cause incorrect checksum failures and perhaps the scrub will do damage by trying to fix non-existing problems? I don't know much about scrub, but can anyone perhaps tell me, if it could be dangerous to run scrub?
I do have an offline backup, but it is without any parity (just loose HDDs) and is also some weeks old, so I'd rather not end up in a situation where I need to use it
Perhaps I can run some memtest tool in a VM to stress the memory a bit extra while doing the scrub

But, as I'm not sure of the impact / risk of running a scrub in this situation, I'm trying this first:
Boot into Windows and start a long self-test in Hard Disk Sentinel on all disks, with prime95 (blend mode) running in the background. That should stress all HDDs, RAM, CPU and PSU.
This has been running without fault for about an hour now and it should be non-intrusive in case it fails...

If anyone has any thoughts on this, please let me know...

AlexGG · Dec 4, 2020

The other component (other than RAM, that is) which will cause behavior like this is the motherboard, and I don't know any way to test it short of replacement.

diversity · Dec 4, 2020

Mastakilla said:
I've confirmed that the ECC functionality of the RAM works in Memtest86

you and i both know this is not how we test ram brother ;)

diversity · Dec 4, 2020

Haven't really read the rest yet so I might just be rambling

diversity · Dec 4, 2020

AlexGG said:
The other component (other than RAM, that is) which will cause behavior like this is the motherboard, and I don't know any way to test it short of replacement.

Are we talking about the ASRock x470d40-t2?

Mastakilla · Dec 4, 2020

It could indeed also be the motherboard I guess (the Asrock x470d40-t2, yes). But that would be very hard to test. As far as I know, memtest86 is still the best way to test memory stability. As for the functioning of ECC, it should not matter which brand of ECC RAM you use. If the platform properly "supports" it, it should support "all" ECC memory. And we did properly test this platform support, right Diversity? ;)
But still, at one point in this troubleshooting, I might try to confirm this anyway...

Anyway, Prime95 blended (which also stresses the memory really hard) has now been running with an Extended Self Test of ALL HDDs at the same time for 4 hours... Without any error, reboot or crash...

This is about the hardest test I can imagine for my server, so I'm a bit surprised it actually hasn't crashed, if the problem is indeed hardware related...

This should certainly stress the CPU and RAM a lot harder than that scrub in TrueNAS. The HDDs and RAID controller are stressed in a different way though, so I'm not entirely sure if that could still be a factor... I don't think issues with the HDDs themselves can cause a reboot like this... Perhaps the RAID controller? But that one was VERY properly "burn-in-tested" for an extremely long time...

diversity · Dec 4, 2020

I'd suggest getting rid of the raid controller. Let zfs worry about what that controller does. 1 less thing to worry about. I don't think it is related to your headache though. Btw. your board has sata hot swap ability baked in. So do get rid of that controller

Respect

Mastakilla · Dec 4, 2020

Euh... Not really an option I'm afraid. The onboard controller doesn't have sufficient ports and is not properly supported in IT-mode on TrueNAS, as far as I know (certainly not recommended)

diversity · Dec 4, 2020

? You have like at least 6 sata ports correct?

diversity · Dec 4, 2020

Just get your data safe, take a sledge hammer, break out the raid controller, plug in the drives directly and recover your data

diversity · Dec 4, 2020

Mastakilla · Dec 5, 2020

After 19 hours, all "Extended Self Tests" were successfully completed while running Prime95 blended simultaneously. No issue at all... This confirms to me that my server is pretty damn stable and that all my HDDs are in good health...

But still, apparently something is wrong with it... I guess I don't have any other choice right now than to try another scrub and see if it happens again. I will make an up-to-date offline backup of my pool first...

A few questions remain:

Can I increase the verbosity of scrub (or actually: make it log anything at all)?
Can I increase the verbosity of TrueNAS?
Is there any log of the pool status changes anywhere? So far I only found it temporarily in the pool status screen (which cleared itself somehow - I think I'll file a bug for this, this can't be right...) and in the emails sent to me, but it should be somewhere in some log as well, no? Hopefully with a bit more detail?

Mastakilla · Dec 6, 2020

offline backup is running now...

In the meantime I had another look at zpool status

Code:

data# zpool status -v hgstpool
  pool: hgstpool
 state: ONLINE
  scan: scrub repaired 192K in 21:31:10 with 0 errors on Sun Nov 29 01:31:11 2020
config:

        NAME                                                STATE     READ WRITE CKSUM
        hgstpool                                            ONLINE       0     0     0
          raidz2-0                                          ONLINE       0     0     0
            gptid/ef1cec0a-cdd4-11ea-a82a-d05099d3fdfe.eli  ONLINE       0     0     0
            gptid/ef5b1297-cdd4-11ea-a82a-d05099d3fdfe.eli  ONLINE       0     0     0
            gptid/f00c5450-cdd4-11ea-a82a-d05099d3fdfe.eli  ONLINE       0     0     0
            gptid/efd8967a-cdd4-11ea-a82a-d05099d3fdfe.eli  ONLINE       0     0     0
            gptid/f04a1e2c-cdd4-11ea-a82a-d05099d3fdfe.eli  ONLINE       0     0     0
            gptid/f01fd8e3-cdd4-11ea-a82a-d05099d3fdfe.eli  ONLINE       0     0     0
            gptid/efde27c0-cdd4-11ea-a82a-d05099d3fdfe.eli  ONLINE       0     0     0
            gptid/f03634f0-cdd4-11ea-a82a-d05099d3fdfe.eli  ONLINE       0     0     0
        logs
          gptid/cf3dbeb8-c5dc-11ea-8b09-d05099d3fdfe        ONLINE       0     0     0

errors: No known data errors
data#

Note the "scrub repaired 192K in 21:31:10 with 0 errors on Sun Nov 29 01:31:11 2020"

Is it possible that the following has happened:

28 Nov 4h00 - Scrub starts
Scrub finds issues and tries to repair them. While scrub is still running, these issues are shown in the CKSUM column and the pool state is changed to UNHEALTHY
29 Nov 1h31 - (21h31min later) Scrub finally is completed and has managed to fix all issues, so it changes the pool state back to ONLINE and clears all CKSUM counters

Meaning that it did resume after the server crashed a couple of times.
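
As a sanity check on this timeline, the start time can be worked back from the scan line itself. The Python sketch below assumes that the duration printed by `zpool status` is plain wall-clock time between the start and the end of the scrub; the function name `scrub_window` and the regular expression are illustrative only and are matched against the exact line shown above, not against every format `zpool status` can print.

```python
import re
from datetime import datetime, timedelta

# The scan line exactly as printed by `zpool status -v hgstpool` above.
SCAN_LINE = "scan: scrub repaired 192K in 21:31:10 with 0 errors on Sun Nov 29 01:31:11 2020"


def scrub_window(scan_line):
    """Return (start, end) of a finished scrub, derived from its scan line.

    Assumption: the HH:MM:SS value is elapsed wall-clock time, so
    start = end - duration.
    """
    m = re.search(r"in (\d+):(\d+):(\d+) with \d+ errors on (.+)$", scan_line)
    if m is None:
        raise ValueError("unrecognised scan line")
    hours, minutes, seconds = (int(g) for g in m.groups()[:3])
    end = datetime.strptime(m.group(4).strip(), "%a %b %d %H:%M:%S %Y")
    return end - timedelta(hours=hours, minutes=minutes, seconds=seconds), end


start, end = scrub_window(SCAN_LINE)
print(start, "->", end)  # 2020-11-28 04:00:01 -> 2020-11-29 01:31:11
```

Under that assumption the scrub started at 04:00:01 on 28 November, which matches the 4h00 schedule. So the 21h31 appears to span the whole morning of reboots, rather than only the time after the last one.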

Still I find it very weird that only during the scrub you can see on which disk it has found issues...? Is this normal? So far, I didn't find any way to increase verbosity of the scrub or to "enable" logging.

I would really like to know the time that it found these errors to figure out if they are related to those reboots or not

Mastakilla · Dec 7, 2020

Offline backup is completed and I've created my own script to do the scrub logging, which is running now...

More details on the scrub-logging-script here:

scrub troubleshooting - timestamped logging of scrub

As I've had some crashes / reboots during a scrub with (repaired) CKSUM errors, I wanted to know if the scrub errors occurred around the same time as the reboots. For this, I've created some very basic scripts that does a 'zpool status' every x...

www.truenas.com
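
For readers who cannot follow the link: the idea is simply to capture `zpool status` at a fixed interval and put a timestamp in front of every line. The Python sketch below is not the script from the linked resource, only a minimal illustration of the same idea. The function names, the log file name `scrub.log` and the 60 second interval are all assumptions; the only thing taken from this thread is the `zpool status -v hgstpool` command.

```python
import os
import subprocess
import time
from datetime import datetime


def timestamped_status(cmd):
    """Run cmd once and return its output lines, each prefixed with the current time."""
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    result = subprocess.run(cmd, capture_output=True, text=True)
    output = result.stdout + result.stderr
    return [f"{stamp} {line}" for line in output.splitlines()]


def log_status(cmd, logfile, interval_s, iterations):
    """Append a timestamped snapshot of cmd's output to logfile, `iterations` times."""
    for i in range(iterations):
        with open(logfile, "a") as fh:
            for line in timestamped_status(cmd):
                fh.write(line + "\n")
            # Push the data to disk right away, so that the last snapshot
            # has a chance of surviving an ungraceful reboot.
            fh.flush()
            os.fsync(fh.fileno())
        if i + 1 < iterations:
            time.sleep(interval_s)


# Hypothetical usage: one snapshot per minute for 24 hours.
# log_status(["zpool", "status", "-v", "hgstpool"], "scrub.log", 60, 24 * 60)
```

Reopening and syncing the file on every snapshot is slower than keeping it open, but with unexplained reboots in play it seems the safer choice. The log should preferably live on a disk other than the pool being scrubbed.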

Edit:

Scrub completed
Again 4 CKSUM errors on 2 different disks
No reboots

Started another scrub run now, to check whether there are fewer CKSUM errors when there is little time between the scrubs

One weird difference with last time:
29 Nov: scrub repaired 192K in 21:31:10 with 0 errors on Sun Nov 29 01:31:11 2020
vs
8 Dec: scrub repaired 96K in 11:28:53 with 0 errors on Tue Dec 8 02:35:28 2020

So somehow the last scrub was almost twice as fast as the previous one???
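
To put a number on "almost twice as fast", here is the arithmetic on the two durations as a short Python sketch (the helper `to_seconds` is just for illustration). It only compares the reported durations and says nothing about why they differ.

```python
def to_seconds(hms):
    """Convert an H:MM:SS duration string to a number of seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


nov = to_seconds("21:31:10")  # scrub that finished on 29 Nov
dec = to_seconds("11:28:53")  # scrub that finished on 8 Dec

ratio = nov / dec
extra = nov - dec
print(f"{ratio:.2f}x, {extra // 3600}h{extra % 3600 // 60:02d}m longer")
# prints: 1.87x, 10h02m longer
```

So the November scrub took about 1.87 times as long, roughly 10 hours more than the December one.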

Mastakilla · Dec 9, 2020

Here are my findings so far:

The catastrophe

28 November at 8h17, 8h21, 9h14, 9h18, 10h39 and 10h44 my TrueNAS server rebooted without properly shutting down
28 November from 4h00 till 29 November 1h30 a scrub ran which detected and fixed 8 errors on 5 different HDDs

More on this

Reboots
- First time ever after about a year of uptime that my server has an unexpected reboot (6 actually!)
- "Recently" the following changes were made to the server
  - Intel Optane disk was added as SLOG about 2 months ago
  - 32GB ECC RAM was replaced by 64GB ECC RAM about 1 month ago
- Weird, unexpected, unlogged reboots can happen because of
  - extreme RAM instability
  - cheap / broken PSU
  - maybe a broken motherboard or HBA?
Scrub
- My scrubs are scheduled monthly, so the added Optane disk certainly had a problem-free scrub last month, but the RAM might have had its first scrub now
- I noticed that scrub checksum errors are cleared on reboot. (Is this a bug?)
- I noticed that sometimes TrueNAS tries to send emails on boot before the network interfaces are online (so maybe some scrub error mails didn't get through)
- As the scrub completed only after all those reboots, it seems like scrub automatically resumes after a crash / reboot, but it also seems like it requires the pool to be unlocked before it can start the resume (still need to confirm this). This could (partly?) explain why it took 21h30min to complete the 28 November scrub, compared to about 12h for all scrubs I did for testing afterwards.

Tests before the catastrophe

The "old" 32GB ECC RAM was tested with Memtest86 / Linux / TrueNAS 12 and confirmed that
- it was 100% stable
- the platform fully supported ECC error reporting to the OS (both corrected and uncorrected errors) - FreeNAS 11 doesn't support the platform
The "new" 64GB ECC RAM was also tested with Memtest86 to be 100% stable and to fully support ECC error reporting to Memtest86, but I didn't re-validate this in TrueNAS 12, as the platform should be similar enough to still support this and as my server was already "live" in the meantime (I might re-test this later though)
I've done all recommended burn-in tests that I could find and the random access dd test, for example, didn't error, but did make me aware of large duration differences between the HDDs, which caused me to suspect overheating (or another problem) of my HBA. So I replaced the HBA, added a fan to it and re-tested it. And the large duration difference between the HDDs was solved and the full burn-in test was consistently passed. In short, it is very well tested and, although I realize it (everything) is still possible, I'm quite confident my HBA is pretty ok now...

Tests after the catastrophe

I've checked the SMART values of all my HDDs and they are all 100% healthy
In Windows, using Hard Disk Sentinel, I've performed an "Extended Self Test" of every HDD simultaneously, while running Prime95 (blended mode with AVX) at the same time. This took about 19 hours and was also 100% successful. This is an extreme test for testing
- RAM stability
- CPU stability
- PSU stability
- HDD health
I've created a logging script for scrub and re-ran scrub on 7/8 December and it again found 4 errors on 2 different HDDs. There were no reboots this time.
On 8 December, a couple hours later, I ran another scrub and it again found 1 error on yet another HDD. Again no reboots.

Conclusions thus far

The scrub checksum errors
- are easy to reproduce. So something is certainly VERY wrong with my NAS.
- seem to accumulate over time. I think this means that
  - The errors are not caused by scrub incorrectly detecting (and fixing) issues
  - The errors are actually being written to my HDDs and scrub correctly detects and fixes them (for now at least)
- It seems like they are caused not by the HDDs themselves, but by something "before" the HDDs
  - Could be the HBA. But as said earlier, this HBA was confirmed to be rock solid stable during burn-in testing.
  - Could be the RAM, as that is the only new piece of hardware since the scrubs started failing. But also this piece was thoroughly tested and I would expect issues with the RAM to get logged in /var/log/messages. I'm still performing extra tests for the RAM to make sure though...
  - Could be something with the motherboard
The reboots seem harder to reproduce; I'm not sure what exactly triggered them or how to trigger them again
- Perhaps the load of the scrub itself triggered them (but then I would expect them to occur during new scrubs as well)
- Probably the same hardware problem that is causing the checksum errors on my HDDs also caused those reboots (coincidentally at the same time as the scrub ran)

Upcoming tests

Today (10 December), I underclocked the hell out of my RAM and I'm running another scrub now
I will run another scrub after this to confirm if new errors still get written to my HDDs or not, even after underclocking the RAM
Overclock the RAM to confirm that RAM ECC errors are properly logged in /var/log/messages, like before
If that still does not solve the scrub errors, I'll probably try removing my Optane SLOG
If that also doesn't solve the scrub errors, I'll try running solnet-array-test again (but that doesn't really clarify which component is the problem if it errors) and perhaps try to replace the HBA?

AlexGG · Dec 9, 2020

Mastakilla said:
Could be the HBA. But as said earlier, this HBA was confirmed to be rock solid stable during burn-in testing.

How did you do the burn-in? My understanding is that HBA is not involved much, if at all, in the "Extended self test". Self-tests are performed inside the drive, without transferring data over the bus.

Mastakilla · Dec 10, 2020

AlexGG said:
How did you do the burn-in? My understanding is that HBA is not involved much, if at all, in the "Extended self test". Self-tests are performed inside the drive, without transferring data over the bus.

That was my assumption as well... Currently I'm still focusing on the RAM, as it is "easy" to test. But if that isn't able to pinpoint the problem, then I'll probably need to focus on the HBA again, but the problem is that there is no "tweaking" (like underclocking) possible to the HBA, so it is very hard to confirm that the HBA is actually the problem

I guess all I can do is first make sure that all other (easier to test) components are 100% ok (which I think I now did already for the CPU, HDDs and soon the RAM as well), but figuring out if it's the motherboard, HBA or PSU is tricky...

I've done the following as burn-in (before the replacement of the 32GB RAM by 64GB RAM and adding the Optane)
1) day of memtest86
2) day of prime95 blended with AVX
3) couple runs of solnet-array-test from the link below (took about a week to complete per

Building, Burn-In, and Testing your FreeNAS system

I've been meaning to post some guidance here for a while now. We frequently see people come to the forums with hardware problems that should have washed out in the system build process, but since many of the users here are DIY'ers without...

www.truenas.com

More details on why I replaced my HBA and added active cooling to it here (which solved the duration variances):

Does the SAS 2008 chip throttle when overheating?

Hi all, While running the https://www.ixsystems.com/community/resources/solnet-array-test.1/ script on FreeNAS, I had some very weird behaviour during the seek-tress-read test (which uses dd and takes about a week to complete). During the test my IPMI and network interfaces became unreachable...

forums.servethehome.com

4) I think I might have run some other HDD burn-in test, but I can't directly find it... But the solnet-array-test was definitely the most crazy one...

Mastakilla · Dec 10, 2020

I just realized that something else has changed between the last "problem-free" scrub and the "problem scrubs"...

I've upgraded from FreeNAS11 to TrueNAS12... Although I thought AMD support was supposed to be better in TrueNAS12, it might also be a software issue (TrueNAS12 bug?) I guess...

But this would certainly be extremely hard / annoying to test, as

I've already upgraded the feature flags of my pool in the meantime :(
I've only an offline backup without parity (just separate disks)
There isn't anything in the logs that I could find

normanlu · Dec 10, 2020

Same problem here.

A few things match your issue:
1. HBA card, LSI SAS 9211-8i
2. Upgraded from FreeNAS to TrueNAS recently
3. Reboot loops during zpool scrubbing; when I stop the scrub, everything is fine and there are no reboots

I ordered a new HBA card, an LSI SAS 9300-8i, and will test it this weekend.
