Citadel - Build Plan and Log

ctag

Patron
Joined
Jun 16, 2017
Messages
225
The AC at the house went out over the weekend, and I forgot to shut down the FreeNAS box. I woke up to the email attached to this post. It looks something like this:

Code:
bns-citadel.csb.sh kernel log messages:
> MCA: Bank 8, Status 0x8c0000400001009f
> MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
> MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 0
> MCA: CPU 0 COR (1) RD channel ?? memory error
> MCA: Address 0x2a9c7880
> MCA: Misc 0x10000000080681
> MCA: Bank 8, Status 0x8c0000400001009f
> MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
> MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 0
> MCA: CPU 0 COR (1) RD channel ?? memory error
> MCA: Address 0x2a9c7880
> MCA: Misc 0x10000000080684
> MCA: Bank 8, Status 0x8c0000400001009f
> MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
> MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 0
> MCA: CPU 0 COR (1) RD channel ?? memory error
> MCA: Address 0x2a9c7880
> MCA: Misc 0x10000000081681
...


Which is pretty scary. I've shut it down, and will take a look at things after work.
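
For anyone reading these emails later, the Status value can be unpacked by hand. Below is a minimal sketch, assuming the IA32_MCi_STATUS bit layout documented in Intel's Software Developer's Manual. The function name and dictionary keys are made up for illustration and are not part of FreeNAS or FreeBSD.

```python
# Hypothetical helper: field positions assume the IA32_MCi_STATUS layout
# described in Intel's SDM (machine-check architecture chapter).
MEM_TRANSACTIONS = {0: "GEN", 1: "RD", 2: "WR", 3: "AC", 4: "MS"}


def decode_mca_status(status: int) -> dict:
    """Split a 64-bit MCA status value into its main fields."""
    code = status & 0xFFFF  # bits 15:0, the architectural MCA error code
    result = {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        "corrected_count": (status >> 38) & 0x7FFF,  # bits 52:38
        "mca_code": code,
    }
    # Memory controller errors use the pattern 0000 0000 1MMM CCCC.
    if (code & 0xFF80) == 0x0080:
        result["transaction"] = MEM_TRANSACTIONS.get((code >> 4) & 0x7, "reserved")
        channel = code & 0xF
        result["channel"] = None if channel == 0xF else channel  # 0xF = not specified
    return result


if __name__ == "__main__":
    for key, value in decode_mca_status(0x8C0000400001009F).items():
        print(f"{key}: {value}")
```

If the layout assumption holds, this agrees with the kernel's own summary line: a valid, corrected (not uncorrected) memory read error with a count of 1 and no channel identified, which is what `COR (1) RD channel ??` reports.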

I had to edit the email content to get the post under 30,000 characters...

*edit: email added as an attachment from feedback.
 

Attachments

  • citadel_email.txt
    80.2 KB · Views: 583

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
ctag said: "I had to edit the email content to get the post under 30,000 characters"
In those situations, it's best to attach a file and maybe quote a small, interesting bit.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
ctag said: "Which is pretty scary. I've shut it down, and will take a look at things after work."
That is a lot of memory errors, but it is possible that they only happened because the system overheated. You will not know for sure whether this is a real problem until the room temperature is under control and you bring the system back online.
If I remember correctly, you used a Dell Precision workstation for your build, and that gives you an advantage. Dell systems have a built-in diagnostic utility, and its long test exercises the memory extensively.
You could have a bad memory module, but the first step I would take is to re-seat all the memory modules: actually take them out and put them back in. High temperatures can sometimes cause things to shift around. Once the modules are re-seated, boot into the diagnostics and run through the long test. It should tell you if there is a hardware fault.
The good news is that your data should be safe.
 

ctag

Patron
Joined
Jun 16, 2017
Messages
225
Thanks Chris.

I couldn't find any memory tests in the system BIOS menu. From this manual it looks like I'm missing a Diagnostics CD?

http://sivirt.utsa.edu/Documents/Manuals/precision-t7500_Service Manual_en-us.pdf
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
If it is the Precision T7500 then, if I recall correctly, you should be able to press F12 during boot to get a boot menu, and one of the options on that menu is to boot into diagnostics. I know that newer Dell workstations have diagnostics integrated on the system board, and I thought the T7500 was new enough to have them too. The diagnostics are not accessed from the BIOS configuration; you need to choose the diagnostics option from the boot menu, just as if you were selecting a different hard drive or CD drive to boot from.
 

ctag

Patron
Joined
Jun 16, 2017
Messages
225
Ah, I had to hit the key blind because the computer moves past that screen before my monitor wakes up.

The memory test took half an hour (which was much faster than I expected) and appears to have passed. I restarted the system and it booted up OK.

Thank you for the help!

IMG_20190605_221303.jpg


IMG_20190606_071411.jpg
 

ctag

Patron
Joined
Jun 16, 2017
Messages
225
I've been noticing a jitter/bug in scrolling recently. Has anyone else seen this?

I haven't found a corresponding bug in Jira, so if Chris and maybe some others don't know about it, I'll try submitting a bug report.

https://imgur.com/HoYOsw3
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
That does look like a bug.
 

ctag

Patron
Joined
Jun 16, 2017
Messages
225
Got another memory error email.
Code:
bns-citadel.local kernel log messages:
> MCA: Bank 8, Status 0x8c0000400001009f
> MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
> MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 0
> MCA: CPU 0 COR (1) RD channel ?? memory error
> MCA: Address 0x2a9c7880
> MCA: Misc 0x10000000080389
> ugen6.2: <CP1000PFCLCD CRDA103BJ1> at usbus6 (disconnected)
> ugen6.2: <CP1000PFCLCD CRDA103BJ1> at usbus6

-- End of security output --


I'm away on business travel, will look at the box when I get home :-/
 

ctag

Patron
Joined
Jun 16, 2017
Messages
225
Home now. I woke up to some emails:

3:01AM
Code:
bns-citadel.local kernel log messages:
> MCA: Bank 8, Status 0x8c0000400001009f
> MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
> MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 0
> MCA: CPU 0 COR (1) RD channel ?? memory error
> MCA: Address 0x2a9c7880
> MCA: Misc 0x10000000081184

-- End of security output --


3:45AM
Code:
starting scrub of pool 'freenas-boot'


4:06AM
Code:
FreeNAS @ bns-citadel.local

New alerts:
* The boot volume state is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.

Current alerts:
* New feature flags are available for volume main-pool. Refer to the "Upgrading a ZFS Pool" subsection in the User Guide "Installing and Upgrading" chapter and "Upgrading" section for more instructions.
* The boot volume state is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.


4:06AM
Code:
FreeNAS @ bns-citadel.local

New alerts:
* The boot volume state is DEGRADED: One or more devices are faulted in response to IO failures.

Gone alerts:
* The boot volume state is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.

Current alerts:
* New feature flags are available for volume main-pool. Refer to the "Upgrading a ZFS Pool" subsection in the User Guide "Installing and Upgrading" chapter and "Upgrading" section for more instructions.
* The boot volume state is DEGRADED: One or more devices are faulted in response to IO failures.


I couldn't reach the web UI or SSH, so I held the power button to power down the machine and then had to leave for work.
 

ctag

Patron
Joined
Jun 16, 2017
Messages
225
Re-seated the RAM (because I forgot to last time) and booted the machine back up.

The boot volume resilvered, and then appears to have failed again.

Code:
FreeNAS @ bns-citadel.local

New alerts:
* The boot volume state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.

Gone alerts:
* The boot volume state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.

Current alerts:
* New feature flags are available for volume main-pool. Refer to the "Upgrading a ZFS Pool" subsection in the User Guide "Installing and Upgrading" chapter and "Upgrading" section for more instructions.
* The boot volume state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.
 

ctag

Patron
Joined
Jun 16, 2017
Messages
225
Got another email:

Code:
Checking status of zfs pools:
NAME           SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
freenas-boot    29G  6.52G  22.5G        -         -      -    22%  1.00x  ONLINE  -
main-pool     43.5T  8.21T  35.3T        -         -     1%    18%  1.00x  ONLINE  /mnt

  pool: freenas-boot
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 4.36G in 0 days 00:53:24 with 0 errors on Sat Jun 22 20:12:28 2019
config:

        NAME        STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            da6p2   ONLINE       0     0     0
            da7p2   ONLINE       0     0     4

errors: No known data errors

-- End of daily output --


The reference link explains that this was a checksum error and says severity: minor. The email sounds like the resilvering worked properly, so I must have read the alerts out of sequence yesterday.
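
To make the nightly email easier to scan, the config table can be checked for nonzero counters automatically. This is only a rough sketch: the function name is made up, and it assumes the plain-integer column layout shown in the email above (abbreviated counts such as `1.2K` would be skipped).

```python
def devices_with_errors(zpool_status: str) -> dict:
    """Return {name: (read, write, cksum)} for rows with any nonzero counter."""
    flagged = {}
    in_config = False
    for line in zpool_status.splitlines():
        fields = line.split()
        # The device table starts right after this header row.
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_config = True
            continue
        # The "errors:" summary line marks the end of the table.
        if line.startswith("errors:"):
            in_config = False
        if in_config and len(fields) >= 5 and all(f.isdigit() for f in fields[2:5]):
            counts = tuple(int(f) for f in fields[2:5])
            if any(counts):
                flagged[fields[0]] = counts
    return flagged


sample = """\
config:

        NAME        STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            da6p2   ONLINE       0     0     0
            da7p2   ONLINE       0     0     4

errors: No known data errors
"""

if __name__ == "__main__":
    for name, (read, write, cksum) in devices_with_errors(sample).items():
        print(f"{name}: read={read} write={write} cksum={cksum}")
```

On the table from this email it flags only `da7p2`, the mirror member with 4 checksum errors, which matches the "unrecoverable error" wording in the alert.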
 

ctag

Patron
Joined
Jun 16, 2017
Messages
225
I've gotten a couple of
Code:
bns-citadel.local kernel log messages:
> arp: 192.168.13.23 moved from 02:14:10:00:05:0a to f0:4d:a2:30:14:44 on epair1b

-- End of security output --


And a few more
Code:
bns-citadel.local kernel log messages:
> MCA: Bank 8, Status 0x8c0000400001009f
> MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
> MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 0
> MCA: CPU 0 COR (1) RD channel ?? memory error
> MCA: Address 0x2a9c7880
> MCA: Misc 0x10000000080380
> arp: 192.168.13.23 moved from 02:14:10:00:05:0a to f0:4d:a2:30:14:44 on epair1b

-- End of security output --


Should I go find a memtest disk and try a few passes of that?
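
One thing that stands out across these emails is that every MCA report so far lists the same Address, 0x2a9c7880. A quick tally makes that easier to confirm on a long log. This is a sketch only: the function name is made up, the sample is trimmed from the emails above, and a physical address does not map directly to a particular DIMM without knowing how the board interleaves memory.

```python
import re
from collections import Counter

# Matches the "MCA: Address 0x..." lines from the kernel log emails.
ADDRESS_RE = re.compile(r"MCA: Address (0x[0-9a-fA-F]+)")


def tally_mca_addresses(log_text: str) -> Counter:
    """Count how many MCA reports mention each physical address."""
    return Counter(int(m.group(1), 16) for m in ADDRESS_RE.finditer(log_text))


sample = """\
> MCA: Bank 8, Status 0x8c0000400001009f
> MCA: CPU 0 COR (1) RD channel ?? memory error
> MCA: Address 0x2a9c7880
> MCA: Misc 0x10000000080380
> MCA: Bank 8, Status 0x8c0000400001009f
> MCA: CPU 0 COR (1) RD channel ?? memory error
> MCA: Address 0x2a9c7880
> MCA: Misc 0x10000000081184
"""

if __name__ == "__main__":
    for address, count in tally_mca_addresses(sample).most_common():
        print(f"{address:#x}: {count} report(s)")
```

A single address that keeps repeating would be consistent with one weak spot on one module rather than random errors across all of the RAM, though that is an inference and not something the log proves.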
 

ctag

Patron
Joined
Jun 16, 2017
Messages
225
That `192.168.13.23` IP is the FreeNAS host. I'm not sure why it's moving between MAC addresses, but I looked up that message and it appears to be something that doesn't require action.
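
A small check that may help narrow this down: the second-lowest bit of a MAC address's first octet marks it as locally administered, which is typical of software-created interfaces such as the epair devices that jails use. The sketch below only classifies the two addresses from the message. It does not explain why ARP saw the IP move, and the function name is made up.

```python
def is_locally_administered(mac: str) -> bool:
    """True if the locally-administered bit (0x02 of the first octet) is set."""
    first_octet = int(mac.split(":")[0], 16)
    return bool(first_octet & 0x02)


if __name__ == "__main__":
    for mac in ("02:14:10:00:05:0a", "f0:4d:a2:30:14:44"):
        kind = "locally administered" if is_locally_administered(mac) else "vendor-assigned"
        print(f"{mac}: {kind}")
```

Here the `02:...` address has the bit set and the `f0:...` address does not, so the message looks like the same IP being seen on both a virtual interface and a physical NIC.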

I'm hoping to get memtest running this weekend. Those memory errors are worrying me. Until then I need some of the jails to keep working...
 

ctag

Patron
Joined
Jun 16, 2017
Messages
225
OK, finally got memtest running again. I'm going to give it a few days' worth of passes and see if anything comes up. Sucks to not have my services while the system is down though o_O

IMG_20190717_084213.jpg
 

ctag

Patron
Joined
Jun 16, 2017
Messages
225
Two days of memtesting, 7 passes and no errors so far.

I'm going to let it keep running, but have some observations.

I've suspected for a while that the case temperature was playing a role in the memory errors. I've gotten alerts before because the hard drives were too warm, and the AC system in the house has indeed been struggling to keep the building cool this summer. It felt like the memory errors were more common when the temperature was higher. That AC system died outright yesterday, and the temperature got pretty high as a result. Since memtest still shows no errors, I'm led to believe that temperature, while an issue for the hard drives, is not causing the memory errors.

My roommate informed me that the system, while it was running FreeNAS, had a high-pitched whine that I couldn't hear but was quite annoying to him. He promised it didn't sound like a normal computer, and informed me that something in there was probably getting ready to break. That sounds like either the hard drives or power supply to me, and I'm wondering if the power supply is going bad and is responsible for the RAM issues. It could explain why the memory errors only take place while the full system is running (6 hard drives included) but not with just memtest.
 

ctag

Patron
Joined
Jun 16, 2017
Messages
225
15 passes now.

AC is still out. Our rental company has been useless for over a week now. I'm starting to think I don't have an adequate environment to be playing with computers.

IMG_20190721_092115.jpg
 

ctag

Patron
Joined
Jun 16, 2017
Messages
225
OK, final count: 23 passes, no errors. The AC is fixed, and I've booted FreeNAS back up. If the MCA errors keep happening, what should be my next step?

IMG_20190723_190941.jpg


Also, the first time it rebooted I got this crazy glitch screen, and it was unresponsive (no DHCP, no login). Rebooted again and it was fine.
IMG_20190723_192038.jpg
 

ctag

Patron
Joined
Jun 16, 2017
Messages
225
It's been a few days now, and there haven't been any more memory errors, which is weird.

Maybe having working AC did it, or just running memtest for a few days had some effect somehow. I want to give it a few more days to prove itself, but then I'd really like to get back to actually using the system and trying to fix the jails up.
 