memory / disk errors

craig51

Dabbler
Joined
Oct 29, 2017
Messages
19
Freenas Friends,

I am having an issue with two of my FreeNAS units. Both are Dell R510s with 12 x 3 TB SAS 7200 rpm drives in mirrors, 128 GB RAM, and 10 Gb Ethernet. My issue is this:

On one of the units (call it FN134) I have a drive that is showing up as suspect (da4) but not degrading the volume; I am going to replace this drive this weekend. I have also been getting error messages in the nightly log about memory errors, but I cannot determine which banks are affected or whether they are related to the disk errors. I was moving the VMs from it to FN131 last night using Storage vMotion and the process failed.
Here is the output from that unit last night. The log is extremely long, but I thought it might give some insight into what is going on.
I also have the log from FN131, which references many memory errors. This is the first time I have had any abnormal logs from FN131, and it only happened
during this movement process. Any assistance would be appreciated. If having more info on the hardware comnsfi

FN134 Log
freenas134.local kernel log messages:
> MCA: Bank 8, Status 0x8c0000400001009f
> MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
> MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 3
> MCA: CPU 11 COR (1) RD channel ?? memory error
> MCA: Address 0x713fbf740
> MCA: Misc 0x3957121000011a83
> MCA: Bank 8, Status 0x8c0000400001009f
> MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
> MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 2
> MCA: CPU 10 COR (1) RD channel ?? memory error
> MCA: Address 0x713fbf740
> MCA: Misc 0x3957121000010285
> MCA: Bank 8, Status 0x8c0000400001009f
> MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
> MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 3
> MCA: CPU 11 COR (1) RD channel ?? memory error
> MCA: Address 0x713fbf740
> MCA: Misc 0x3957121000010285
> MCA: Bank 8, Status 0x88000040000200cf
> MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
> MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 2
> MCA: CPU 10 COR (1) MS channel ?? memory error
> MCA: Misc 0x3957121000014185
> MCA: Bank 8, Status 0x8c0000400001009f
> MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
> MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 2
> MCA: CPU 10 COR (1) RD channel ?? memory error
> MCA: Address 0x713fbf740
> MCA: Misc 0x3957121000010380
> MCA: Bank 8, Status 0x8c0000400001009f
> MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
> MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 3
> MCA: CPU 11 COR (1) RD channel ?? memory error
> MCA: Address 0x713fbf740
> MCA: Misc 0x3957121000010380
> MCA: Bank 8, Status 0x8c0000400001009f
> MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
> MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 2
> MCA: CPU 10 COR (1) RD channel ?? memory error
> MCA: Address 0x713fbf740
> MCA: Misc 0x3957121000010587
> MCA: Bank 8, Status 0x8c0000400001009f
> MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
> MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 3
> MCA: CPU 11 COR (1) RD channel ?? memory error
> MCA: Address 0x713fbf740
> MCA: Misc 0x3957121000016747
> MCA: Bank 8, Status 0x8c0000400001009f
> MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
> MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 2
> MCA: CPU 10 COR (1) RD channel ?? memory error
> MCA: Address 0x713fbf740
> MCA: Misc 0x3957121000016747
> MCA: Bank 8, Status 0x8c0000400001009f
> MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
> MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 2
> MCA: CPU 10 COR (1) RD channel ?? memory error
> MCA: Address 0x713fbf740
> MCA: Misc 0x3957121000016242
> MCA: Bank 8, Status 0x8c0000400001009f
> MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
> MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 2
> MCA: CPU 10 COR (1) RD channel ?? memory error
> MCA: Address 0x713fbf740
> MCA: Misc 0x3957121000011782
> MCA: Bank 8, Status 0x8c0000400001009f
> MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
> MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 3
> MCA: CPU 11 COR (1) RD channel ?? memory error
> MCA: Address 0x713fbf740
> MCA: Misc 0x395712100001068f
> MCA: Bank 8, Status 0x8c0000400001009f
> MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
> MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 2
> MCA: CPU 10 COR (1) RD channel ?? memory error
> MCA: Address 0x713fbf740
> MCA: Misc 0x3957121000016545
> MCA: Bank 8, Status 0x8c0000400001009f
> MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
> MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 3
> MCA: CPU 11 COR (1) RD channel ?? memory error
> MCA: Address 0x713fbf740
> MCA: Misc 0x3957121000016545
> MCA: Bank 8, Status 0x8c0000400001009f
> MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
> MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 2
> MCA: CPU 10 COR (1) RD channel ?? memory error
> MCA: Address 0x713fbf740
> MCA: Misc 0x3957121000011381
> MCA: Bank 8, Status 0x8c0000400001009f
> MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
> MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 3
> MCA: CPU 11 COR (1) RD channel ?? memory error
> MCA: Address 0x713fbf740
> MCA: Misc 0x3957121000011585
> (da4:mps0:0:29:0): WRITE(16). CDB: 8a 00 00 00 00 01 43 da 27 a8 00 00 00 68 00 00 length 53248 SMID 79 Aborting command 0xfffffe00012c37b0
> mps0: Sending reset from mpssas_send_abort for target ID 29
> mps0: Unfreezing devq for target ID 29
> (da4:mps0:0:29:0): WRITE(16). CDB: 8a 00 00 00 00 01 43 da 27 a8 00 00 00 68 00 00
> (da4:mps0:0:29:0): CAM status: Command timeout
> (da4:mps0:0:29:0): Retrying command
> (da4:mps0:0:29:0): READ(10). CDB: 28 00 d9 05 0c a8 00 00 08 00 length 4096 SMID 170 Aborting command 0xfffffe00012caf20
> mps0: Sending reset from mpssas_send_abort for target ID 29
> mps0: Unfreezing devq for target ID 29
> (da4:mps0:0:29:0): READ(10). CDB: 28 00 d9 05 0c a8 00 00 08 00
> (da4:mps0:0:29:0): CAM status: Command timeout
> (da4:mps0:0:29:0): Retrying command
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1653985, size: 4096
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1586727, size: 8192
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1577732, size: 4096
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1653985, size: 4096
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1586727, size: 8192
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1577732, size: 4096
> (da4:mps0:0:29:0): WRITE(16). CDB: 8a 00 00 00 00 01 43 da 27 a8 00 00 00 68 00 00 length 53248 SMID 153 Aborting command 0xfffffe00012c98d0
> mps0: Sending reset from mpssas_send_abort for target ID 29
> (da4:mps0:0:29:0): READ(6). CDB: 08 09 e7 70 08 00 length 4096 SMID 539 Aborting command 0xfffffe00012e9370
> (da4:mps0:0:29:0): WRITE(16). CDB: 8a 00 00 00 00 01 43 da 27 a8 00 00 00 68 00 00
> mps0: Sending reset from mpssas_send_abort for target ID 29
> (da4:mps0:0:29:0): CAM status: Command timeout
> (da4:mps0:0:29:0): Retrying command
> mps0: mpssas_action_scsiio: Freezing devq for target ID 29
> (da4:mps0:0:29:0): WRITE(16). CDB: 8a 00 00 00 00 01 43 da 27 a8 00 00 00 68 00 00
> (da4:mps0:0:29:0): CAM status: CAM subsystem is busy
> (da4:mps0:0:29:0): Retrying command
> mps0: Unfreezing devq for target ID 29
> (da4:mps0:0:29:0): READ(6). CDB: 08 09 e7 70 08 00
> (da4:mps0:0:29:0): CAM status: Command timeout
> (da4:mps0:0:29:0): Retrying command
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1653985, size: 4096
> (da4:mps0:0:29:0): READ(6). CDB: 08 01 b1 a0 10 00 length 8192 SMID 574 Aborting command 0xfffffe00012ec160
> mps0: Sending reset from mpssas_send_abort for target ID 29
> mps0: Unfreezing devq for target ID 29
> (da4:mps0:0:29:0): READ(6). CDB: 08 01 b1 a0 10 00
> (da4:mps0:0:29:0): CAM status: Command timeout
> (da4:mps0:0:29:0): Retrying command
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1586727, size: 8192
> (da4:mps0:0:29:0): READ(6). CDB: 08 00 98 88 08 00 length 4096 SMID 598 Aborting command 0xfffffe00012ee0e0
> mps0: Sending reset from mpssas_send_abort for target ID 29
> mps0: Unfreezing devq for target ID 29
> (da4:mps0:0:29:0): READ(6). CDB: 08 00 98 88 08 00
> (da4:mps0:0:29:0): CAM status: Command timeout
> (da4:mps0:0:29:0): Retrying command
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1577732, size: 4096
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1662180, size: 16384
> (da4:mps0:0:29:0): READ(10). CDB: 28 00 d9 05 0c a8 00 00 08 00 length 4096 SMID 280 Aborting command 0xfffffe00012d3f80
> mps0: Sending reset from mpssas_send_abort for target ID 29
> mps0: Unfreezing devq for target ID 29
> (da4:mps0:0:29:0): READ(10). CDB: 28 00 d9 05 0c a8 00 00 08 00
> (da4:mps0:0:29:0): CAM status: Command timeout
> (da4:mps0:0:29:0): Retrying command
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1656878, size: 16384
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1653985, size: 4096
-- End of security output --


FN131 Log
freenas kernel log messages:
> MCA: Bank 8, Status 0x8c0000400001009f
> MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
> MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 0
> MCA: CPU 8 COR (1) RD channel ?? memory error
> MCA: Address 0xfbbbac540
> MCA: Misc 0x8020080200016545
> MCA: Bank 8, Status 0x8c0000400001009f
> MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
> MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 1
> MCA: CPU 9 COR (1) RD channel ?? memory error
> MCA: Address 0xfbbbac540
> MCA: Misc 0x8020080200016545
> MCA: Bank 8, Status 0x8c0000400001009f
> MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
> MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 0
> MCA: CPU 8 COR (1) RD channel ?? memory error
> MCA: Address 0xfbbbac540
> MCA: Misc 0x8020080200016a4a
> MCA: Bank 8, Status 0x8c0000400001009f
> MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
> MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 0
> MCA: CPU 8 COR (1) RD channel ?? memory error
> MCA: Address 0xfbbbac540
> MCA: Misc 0x8020080200011586
> MCA: Bank 8, Status 0x8c0000400001009f
> MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
> MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 1
> MCA: CPU 9 COR (1) RD channel ?? memory error
> MCA: Address 0xfbbbac540
> MCA: Misc 0x8020080200011586
> MCA: Bank 8, Status 0x8c0000400001009f
> MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
> MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 0
> MCA: CPU 8 COR (1) RD channel ?? memory error
> MCA: Address 0xfbbbac540
> MCA: Misc 0x8020080200011287
> MCA: Bank 8, Status 0x8c0000400001009f
> MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
> MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 1
> MCA: CPU 9 COR (1) RD channel ?? memory error
> MCA: Address 0xfbbbac540
> MCA: Misc 0x8020080200011287
> MCA: Bank 8, Status 0x8c0000400001009f
> MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
> MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 1
> MCA: CPU 9 COR (1) RD channel ?? memory error
> MCA: Address 0xfbbbac540
> MCA: Misc 0x8020080200016c4c
> MCA: Bank 8, Status 0x8c0000400001009f
> MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
> MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 0
> MCA: CPU 8 COR (1) RD channel ?? memory error
> MCA: Address 0xfbbbac540
> MCA: Misc 0x8020080200016949
> MCA: Bank 8, Status 0x8c0000400001009f
> MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
> MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 1
> MCA: CPU 9 COR (1) RD channel ?? memory error
> MCA: Address 0xfbbbac540
> MCA: Misc 0x8020080200016949

-- End of security output --
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
I would start with a memtest if you can afford the downtime. I'm also guessing that you have ECC memory if you're detecting memory errors and not crashing. Perhaps the BIOS/iDRAC has some logs for the memory errors. Have you checked /var/log/vmkernel.log on your ESXi host to see if it looks related? I would guess that there are storage timeouts, but then I would expect you to have alerts on your datastore too.
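
For what it's worth, the Status values in those MCA lines can be decoded by hand, which helps confirm that these are corrected (ECC) errors rather than uncorrected ones. Below is a rough sketch in Python, assuming the standard IA32_MCi_STATUS bit layout from the Intel SDM and the generic memory-controller error code format (0000 0000 1MMM CCCC). The function name `decode_mca_status` and the field names are made up for illustration; this is not a FreeNAS or FreeBSD tool, and it does not identify a DIMM slot.

```python
# Hypothetical helper: decode an MCA bank status value as printed in the log.
# Bit positions are assumed from the Intel SDM layout of IA32_MCi_STATUS.

# MMM field of a memory-controller error code -> transaction type
TRANSACTIONS = {0: "GEN", 1: "RD", 2: "WR", 3: "AC", 4: "MS"}


def decode_mca_status(status: int) -> dict:
    code = status & 0xFFFF  # architectural MCA error code, bits 15:0
    info = {
        "valid": bool((status >> 63) & 1),        # VAL
        "overflow": bool((status >> 62) & 1),     # OVER
        "uncorrected": bool((status >> 61) & 1),  # UC
        "misc_valid": bool((status >> 59) & 1),   # MISCV
        "addr_valid": bool((status >> 58) & 1),   # ADDRV
        "corrected_count": (status >> 38) & 0x7FFF,  # bits 52:38
        "mca_code": code,
        "transaction": None,
        "channel": None,
    }
    # Memory controller errors have the form 0000 0000 1MMM CCCC.
    if (code & 0xFF80) == 0x0080:
        mmm = (code >> 4) & 0x7
        cccc = code & 0xF
        info["transaction"] = TRANSACTIONS.get(mmm, "??")
        # CCCC == 1111 means the channel was not specified ("??" in the log).
        info["channel"] = None if cccc == 0xF else cccc
    return info


if __name__ == "__main__":
    for raw in (0x8C0000400001009F, 0x88000040000200CF):
        print(hex(raw), decode_mca_status(raw))
```

Run against the two distinct Status values in the logs above, both decode as valid and not uncorrected. The first is a read (RD) with a valid address, the second is a memory scrub (MS) without one, which lines up with the missing Address line in that entry. The channel is unspecified in both, so the iDRAC/BIOS log is still the better place to look for the actual DIMM.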
 

craig51

Dabbler
Joined
Oct 29, 2017
Messages
19
Thanks for your input. I was able to reseat the RAM and replace the offending drive. Since then the logs have been clean, so it looks like it must have been a connection issue.
 