Citadel - Build Plan and Log

ctag · Jun 4, 2019

The AC at the house went out over the weekend, and I forgot to shut down the FreeNAS box. I woke up to the email message attached to this post. It looks something like this:

Code:

bns-citadel.csb.sh kernel log messages:
> MCA: Bank 8, Status 0x8c0000400001009f
> MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
> MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 0
> MCA: CPU 0 COR (1) RD channel ?? memory error
> MCA: Address 0x2a9c7880
> MCA: Misc 0x10000000080681
> MCA: Bank 8, Status 0x8c0000400001009f
> MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
> MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 0
> MCA: CPU 0 COR (1) RD channel ?? memory error
> MCA: Address 0x2a9c7880
> MCA: Misc 0x10000000080684
> MCA: Bank 8, Status 0x8c0000400001009f
> MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
> MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 0
> MCA: CPU 0 COR (1) RD channel ?? memory error
> MCA: Address 0x2a9c7880
> MCA: Misc 0x10000000081681
...

Which is pretty scary. I've shut it down, and will take a look at things after work.
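
For anyone trying to read those lines, here is a minimal sketch of how the `Status` value breaks down. It assumes the architectural IA32_MCi_STATUS layout from Intel's Software Developer's Manual (valid/overflow/uncorrected flags in the top bits, corrected-error count in bits 52:38, MCA error code in the low 16 bits). `decode_mca_status` is a hypothetical helper name, not something FreeNAS provides.

```python
# Hypothetical helper; bit positions assume the architectural
# IA32_MCi_STATUS layout described in Intel's SDM.

MEM_TXN = {
    0b000: "generic",
    0b001: "read",
    0b010: "write",
    0b011: "address/command",
    0b100: "scrub",
}


def decode_mca_status(status: int) -> dict:
    """Split a 64-bit MCi_STATUS value into its main architectural fields."""
    code = status & 0xFFFF
    info = {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        "corrected_count": (status >> 38) & 0x7FFF,
        "mca_error_code": code,
    }
    # Memory-controller errors use the pattern 0000 0000 1MMM CCCC.
    if (code & 0xFF80) == 0x0080:
        info["memory_transaction"] = MEM_TXN.get((code >> 4) & 0x7, "reserved")
        channel = code & 0xF
        # 0xF means the channel was not specified.
        info["channel"] = None if channel == 0xF else channel
    return info


if __name__ == "__main__":
    for key, value in decode_mca_status(0x8C0000400001009F).items():
        print(f"{key}: {value}")
```

Under those assumptions the value in the email decodes as a valid, corrected (not uncorrected) memory read error with a count of 1 and no channel reported, which lines up with the kernel's own summary line `COR (1) RD channel ?? memory error`.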

~~I had to edit the email content to get the post under 30,000 characters...~~

*edit: email added as an attachment, per feedback.

Ericloewe · Jun 4, 2019

ctag said:
I had to edit the email content to get the post under 30,000 characters

In those situations, it's best to attach a file and maybe quote a small, interesting bit.

Chris Moore · Jun 4, 2019

ctag said:
Which is pretty scary. I've shut it down, and will take a look at things after work.

That is a lot of memory errors, but it is possible that they happened only because the system overheated. You will not know for sure whether this is a real problem until the room temperature is under control and you bring the system back online.
If I remember correctly, you used a Dell Precision workstation for your build, and that gives you an advantage. Dell systems have a built-in diagnostic utility, and its long test checks the memory extensively.
You could have a bad memory module, but the first step I would take is to re-seat all the memory modules: actually take them out and put them back in. Sometimes high temperatures can cause things to shift around. Once you have re-seated all the modules, boot into the diagnostics and run through them. That should tell you if there is a hardware fault.
The good news is that your data should be safe.

ctag · Jun 5, 2019

Thanks Chris.

I couldn't find any memory tests in the system BIOS menu. From this manual it looks like I'm missing a Diagnostics CD?

http://sivirt.utsa.edu/Documents/Manuals/precision-t7500_Service Manual_en-us.pdf

Chris Moore · Jun 5, 2019

If it is the Precision T7500 then, if I recall correctly, you should be able to press F12 during boot to get a boot menu, and one of the options on that menu is to boot into diagnostics. I know that newer Dell workstations have diagnostics integrated on the system board, and I thought the T7500 was new enough to have them too. They are not accessed from the BIOS configuration; you need to choose the diagnostics option from the boot menu, just as if you were selecting a different hard drive or CD drive to boot from.

ctag · Jun 6, 2019

Ah, I had to hit the key blind; the computer moves past that screen before my monitor wakes up.

The memory test took half an hour (which was much faster than I expected) and appears to have passed. I restarted the system and it booted up OK.

Thank you for the help!

ctag · Jun 14, 2019

I've been noticing a jitter/bug in scrolling recently. Has anyone else seen this?

I haven't found a corresponding bug in Jira, so if Chris and maybe some others don't know about it, I'll try submitting a bug report.

https://imgur.com/HoYOsw3

Chris Moore · Jun 14, 2019

That does look like a bug.

ctag · Jun 15, 2019

OK, I filed a bug report.

ctag · Jun 18, 2019

Got another memory error email.

Code:

bns-citadel.local kernel log messages:
> MCA: Bank 8, Status 0x8c0000400001009f
> MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
> MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 0
> MCA: CPU 0 COR (1) RD channel ?? memory error
> MCA: Address 0x2a9c7880
> MCA: Misc 0x10000000080389
> ugen6.2: <CP1000PFCLCD CRDA103BJ1> at usbus6 (disconnected)
> ugen6.2: <CP1000PFCLCD CRDA103BJ1> at usbus6

-- End of security output --

I'm away on business travel, will look at the box when I get home :-/
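
One detail worth checking across these emails is whether the reported `Address` keeps repeating. A rough sketch of tallying it, assuming the log lines keep the `MCA: Address 0x...` format shown above; `count_mca_addresses` is a hypothetical helper, and the sample is two records copied from the emails in this thread.

```python
import re
from collections import Counter

# Two records copied from the emails in this thread (June 4 and June 18).
SAMPLE = """\
> MCA: Bank 8, Status 0x8c0000400001009f
> MCA: CPU 0 COR (1) RD channel ?? memory error
> MCA: Address 0x2a9c7880
> MCA: Misc 0x10000000080681
> MCA: Bank 8, Status 0x8c0000400001009f
> MCA: CPU 0 COR (1) RD channel ?? memory error
> MCA: Address 0x2a9c7880
> MCA: Misc 0x10000000080389
"""

ADDRESS_RE = re.compile(r"MCA: Address (0x[0-9a-fA-F]+)")


def count_mca_addresses(log_text: str) -> Counter:
    """Count how many MCA records mention each physical address."""
    return Counter(int(m.group(1), 16) for m in ADDRESS_RE.finditer(log_text))


if __name__ == "__main__":
    for address, hits in count_mca_addresses(SAMPLE).most_common():
        print(f"{address:#x}: {hits} record(s)")
```

Every record quoted so far reports the same address, `0x2a9c7880`. That may point at one specific location rather than random errors, but it is not proof of a bad module on its own.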

ctag · Jun 21, 2019

Home now. I woke up to some emails:

3:01AM

Code:

bns-citadel.local kernel log messages:
> MCA: Bank 8, Status 0x8c0000400001009f
> MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
> MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 0
> MCA: CPU 0 COR (1) RD channel ?? memory error
> MCA: Address 0x2a9c7880
> MCA: Misc 0x10000000081184

-- End of security output --

3:45AM

Code:

starting scrub of pool 'freenas-boot'

4:06AM

Code:

FreeNAS @ bns-citadel.local

New alerts:
* The boot volume state is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.

Current alerts:
* New feature flags are available for volume main-pool. Refer to the "Upgrading a ZFS Pool" subsection in the User Guide "Installing and Upgrading" chapter and "Upgrading" section for more instructions.
* The boot volume state is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.

4:06AM

Code:

FreeNAS @ bns-citadel.local

New alerts:
* The boot volume state is DEGRADED: One or more devices are faulted in response to IO failures.

Gone alerts:
* The boot volume state is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.

Current alerts:
* New feature flags are available for volume main-pool. Refer to the "Upgrading a ZFS Pool" subsection in the User Guide "Installing and Upgrading" chapter and "Upgrading" section for more instructions.
* The boot volume state is DEGRADED: One or more devices are faulted in response to IO failures.

I couldn't reach the web UI or SSH, so I held the power button to power down the machine and then had to leave for work.

ctag · Jun 22, 2019

Re-seated the RAM (because I forgot to last time) and booted the machine back up.

The boot volume resilvered, and then appears to have failed again.

Code:

FreeNAS @ bns-citadel.local

New alerts:
* The boot volume state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.

Gone alerts:
* The boot volume state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.

Current alerts:
* New feature flags are available for volume main-pool. Refer to the "Upgrading a ZFS Pool" subsection in the User Guide "Installing and Upgrading" chapter and "Upgrading" section for more instructions.
* The boot volume state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.

ctag · Jun 23, 2019

Got another email:

Code:

Checking status of zfs pools:
NAME           SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
freenas-boot    29G  6.52G  22.5G        -         -      -    22%  1.00x  ONLINE  -
main-pool     43.5T  8.21T  35.3T        -         -     1%    18%  1.00x  ONLINE  /mnt

  pool: freenas-boot
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 4.36G in 0 days 00:53:24 with 0 errors on Sat Jun 22 20:12:28 2019
config:

        NAME        STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            da6p2   ONLINE       0     0     0
            da7p2   ONLINE       0     0     4

errors: No known data errors

-- End of daily output --

The reference link explains that this was a checksum error and says severity: minor. The email sounds like the resilvering worked properly, so I must have read the alerts out of sequence yesterday.
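
For picking the affected device out of a long status email, something like the following could work. It is only a sketch: it assumes the `NAME STATE READ WRITE CKSUM` table layout shown above, and it skips any row whose counters are not plain integers. `devices_with_errors` is a hypothetical helper name.

```python
# Config table copied from the email above.
SAMPLE = """\
config:

        NAME        STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            da6p2   ONLINE       0     0     0
            da7p2   ONLINE       0     0     4

errors: No known data errors
"""


def devices_with_errors(status_text: str) -> dict:
    """Return {name: (read, write, cksum)} for rows with a nonzero counter."""
    flagged = {}
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_config = True  # header row of the config table
        elif in_config and len(fields) == 5 and all(f.isdigit() for f in fields[2:]):
            counts = tuple(int(f) for f in fields[2:])
            if any(counts):
                flagged[fields[0]] = counts
        elif in_config and fields and fields[0].startswith("errors:"):
            in_config = False  # end of the config table
    return flagged


if __name__ == "__main__":
    for name, (read, write, cksum) in devices_with_errors(SAMPLE).items():
        print(f"{name}: read={read} write={write} cksum={cksum}")
```

On the table from this email it flags only `da7p2`, with 4 checksum errors. That is the device the `zpool clear` or `zpool replace` suggestion in the action text would apply to.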

ctag · Jul 1, 2019

I've gotten a couple of these:

Code:

bns-citadel.local kernel log messages:
> arp: 192.168.13.23 moved from 02:14:10:00:05:0a to f0:4d:a2:30:14:44 on epair1b

-- End of security output --

And a few more of these:

Code:

bns-citadel.local kernel log messages:
> MCA: Bank 8, Status 0x8c0000400001009f
> MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
> MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 0
> MCA: CPU 0 COR (1) RD channel ?? memory error
> MCA: Address 0x2a9c7880
> MCA: Misc 0x10000000080380
> arp: 192.168.13.23 moved from 02:14:10:00:05:0a to f0:4d:a2:30:14:44 on epair1b

-- End of security output --

Should I go find a memtest disk and try a few passes of that?

ctag · Jul 2, 2019

That `192.168.13.23` IP is the FreeNAS host. I'm not sure why it's moving between MAC addresses, but I looked up that message and it appears to be something that doesn't require action.
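
One thing that may help interpret the two MAC addresses: in a MAC address, the second-least-significant bit of the first octet marks the address as locally administered (assigned by software) rather than vendor-assigned. A minimal sketch of that check, with `is_locally_administered` as a hypothetical helper name:

```python
def is_locally_administered(mac: str) -> bool:
    """True if the MAC's first octet has the locally-administered bit (0x02) set."""
    first_octet = int(mac.split(":")[0], 16)
    return bool(first_octet & 0x02)


if __name__ == "__main__":
    for mac in ("02:14:10:00:05:0a", "f0:4d:a2:30:14:44"):
        kind = "locally administered" if is_locally_administered(mac) else "vendor-assigned"
        print(f"{mac}: {kind}")
```

By that test `02:14:10:00:05:0a` is a software-assigned address and `f0:4d:a2:30:14:44` is not. That would be consistent with the IP being seen on both a virtual interface (such as the `epair1b` named in the message) and a physical NIC, though that is an inference and not something the log itself confirms.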

I'm hoping to get memtest running this weekend. Those memory errors are worrying me. Until then I need some of the jails to keep working...

ctag · Jul 17, 2019

OK, finally got memtest running again. I'm going to give it a few days' worth of passes and see if anything comes up. Sucks to not have my services while the system is down, though.

ctag · Jul 19, 2019

Two days of memtesting, 7 passes and no errors so far.

I'm going to let it keep running, but have some observations.

I had suspected that the case temperature was playing a role in the memory errors. I've had alerts before because the hard drives got too warm, and the AC system in the house has indeed been struggling to keep the building cool this summer. It felt like the memory errors were more common when the temperature was higher. That AC system died outright yesterday, and the temperature got pretty high as a result. Since memtest still shows no errors, I'm led to believe that temperature, while an issue for the hard drives, is not causing the memory errors.

My roommate informed me that the system, while it was running FreeNAS, had a high-pitched whine that I couldn't hear but was quite annoying to him. He promised it didn't sound like a normal computer, and informed me that something in there was probably getting ready to break. That sounds like either the hard drives or power supply to me, and I'm wondering if the power supply is going bad and is responsible for the RAM issues. It could explain why the memory errors only take place while the full system is running (6 hard drives included) but not with just memtest.

ctag · Jul 21, 2019

15 passes now.

AC is still out. Our rental company has been useless for over a week now. I'm starting to think I don't have an adequate environment to be playing with computers.

ctag · Jul 23, 2019

OK, final count: 23 passes, no errors. The AC is fixed, and I've booted FreeNAS back up. If the MCA errors keep happening, what should my next step be?

Also, the first time it rebooted I got this crazy glitch screen, and it was unresponsive (no DHCP, no login). Rebooted again and it was fine.

ctag · Jul 27, 2019

It's been a few days now, and there haven't been any more memory errors, which is weird.

Maybe having working AC did it, or running memtest for a few days had some effect somehow. I want to give it a few more days to prove itself, but then I'd really like to get back to actually using the system and trying to fix the jails up.
