JayG30
Contributor
- Joined
- Jun 26, 2013
- Messages
- 158
Hello,
Came in today to find that over the weekend my server was unresponsive and I'm not sure what is going on. No SSH, web GUI, CIFS, NFS...not even a response to pings. A reboot brought it back.
Equipment is server grade.
- FreeNAS-9.10.1-U2 (f045a8b)
- Intel R2312GL4GS barebones server
- 128GB of RAM
- 2 x E5-2670 CPUs
- 6 x 4TB HDD on an LSI 2308 (IT firmware)
- Intel S3710 SLOG on SATA port
- 32GB USB 3.0 Sandisk Ultra Fit
Code:
[root@freenas] ~# zpool status
  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Wed May 17 03:45:46 2017
config:

        NAME                                          STATE     READ WRITE CKSUM
        freenas-boot                                  ONLINE       0     0     0
          gptid/954aa8f3-02a9-11e7-b39b-001e67d50a3e  ONLINE       0     0     0

errors: No known data errors

  pool: store
 state: ONLINE
  scan: scrub repaired 0 in 55h13m with 0 errors on Mon Jun 19 09:14:09 2017
config:

        NAME                                          STATE     READ WRITE CKSUM
        store                                         ONLINE       0     0     0
          raidz2-0                                    ONLINE       0     0     0
            da5p2                                     ONLINE       0     0     0
            da4p2                                     ONLINE       0     0     0
            da3p2                                     ONLINE       0     0     0
            da2p2                                     ONLINE       0     0     0
            da1p2                                     ONLINE       0     0     0
            da0p2                                     ONLINE       0     0     0
        logs
          gptid/86bc1495-0ab8-11e7-b9df-001e67d50a3e  ONLINE       0     0     0

errors: No known data errors
Notice the "scrub repaired 0 in 55h13m with 0 errors on Mon Jun 19 09:14:09 2017". The timestamp is when I rebooted the server, and 55h13m seems very long compared with how fast I remember scrubs being on this machine. The pool doesn't hold much data.
Code:
[root@freenas] ~# zpool list
NAME           SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
freenas-boot  28.8G   658M  28.1G         -      -     2%  1.00x  ONLINE  -
store         21.8T  1.06T  20.7T         -     1%     4%  1.00x  ONLINE  /mnt
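To put a number on "very long", here is a rough back-of-the-envelope sketch in Python using only the figures from the two outputs above. It assumes that the ALLOC value is in TiB, that the scrub read the whole allocation exactly once, and that the reported 55h13m is plain wall-clock time between scrub start and finish (so it may include time the machine spent hung). The function names are just illustrative.

```python
from datetime import datetime, timedelta

# Figures copied from the zpool output in this post.
ALLOC_TIB = 1.06                                    # ALLOC column for pool "store"
SCRUB_DURATION = timedelta(hours=55, minutes=13)    # "in 55h13m"
SCRUB_FINISHED = datetime(2017, 6, 19, 9, 14, 9)    # Mon Jun 19 09:14:09 2017


def average_scrub_rate_mib_s(alloc_tib: float, duration: timedelta) -> float:
    """Average MiB/s if the whole allocation was read once over `duration`."""
    return alloc_tib * 1024 * 1024 / duration.total_seconds()


def implied_scrub_start(finished: datetime, duration: timedelta) -> datetime:
    """Start time implied by the finish timestamp and the reported duration."""
    return finished - duration


if __name__ == "__main__":
    rate = average_scrub_rate_mib_s(ALLOC_TIB, SCRUB_DURATION)
    start = implied_scrub_start(SCRUB_FINISHED, SCRUB_DURATION)
    print(f"average scrub rate: {rate:.1f} MiB/s")
    print(f"implied scrub start: {start.isoformat(sep=' ')}")
```

Under those assumptions the average works out to roughly 5.6 MiB/s, and the implied start is early Saturday morning (2017-06-17 02:01:09). That would fit a scrub that started on the old schedule, stalled when the machine hung, and only completed after the reboot, but that is a guess from the arithmetic, not something the logs confirm.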
I did just alter the SMART and scrub schedules. This weekend a short SMART test should have run, but NOT a scrub. I don't know why a scrub is reported as having run, or whether ZFS thought one was scheduled. My guess is that the schedule is out of whack after my change: the scrub ran on the old schedule and the system is only now switching to the new one (I have to look into this, because I think it will prevent the scrub from running when I want it to). Still, I wouldn't expect a scrub to lock the system up. I also noticed that the short SMART test doesn't show as having run when I check with smartctl. It is supposed to run at 5am on Sunday, so I'm guessing the machine was already locked up by then.
This is the only thing I could see on the screen (over IPMI).

This is what I see in dmesg.yesterday (which I guess is the only relevant dmesg):
Code:
[root@freenas] ~# vi /var/log/dmesg.yesterday
MCA: Bank 5, Status 0x8c00004000010091
MCA: Global Cap 0x0000000001000c14, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x206d7, APIC ID 0
MCA: CPU 0 COR (1) RD channel 1 memory error
MCA: Address 0x6578cfc40
MCA: Misc 0x40666686
MCA: Bank 11, Status 0x8c000046000800c3
MCA: Global Cap 0x0000000001000c14, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x206d7, APIC ID 46
MCA: CPU 30 COR (1) MS channel 3 memory error
MCA: Address 0x1b101545c0
MCA: Misc 0x908420002000c8c
Limiting closed port RST response from 250 to 200 packets/sec
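For anyone who wants to check what the two Status values in that dmesg output encode, here is a minimal Python sketch. It assumes the architectural IA32_MCi_STATUS layout from Intel's Software Developer's Manual (VAL in bit 63, UC in bit 61, corrected error count in bits 52:38, MCA error code in bits 15:0) and the memory-controller error code format `0000 0000 1MMM CCCC`. The class and function names are made up for illustration, and the short tags (RD, WR, AC, MS) simply mirror what the kernel prints.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# MMM field of a memory-controller error code, with kernel-style short tags.
MMM_TAGS = {0: "??", 1: "RD", 2: "WR", 3: "AC", 4: "MS"}


@dataclass
class McaStatus:
    valid: bool            # bit 63: register holds a valid error
    overflow: bool         # bit 62: an earlier error was overwritten
    uncorrected: bool      # bit 61: error was NOT corrected by hardware
    misc_valid: bool       # bit 59: the Misc register is meaningful
    addr_valid: bool       # bit 58: the Address register is meaningful
    corrected_count: int   # bits 52:38: corrected error count
    error_code: int        # bits 15:0: MCA error code


def decode_status(status: int) -> McaStatus:
    """Split a 64-bit MCi_STATUS value into its architectural fields."""
    return McaStatus(
        valid=bool((status >> 63) & 1),
        overflow=bool((status >> 62) & 1),
        uncorrected=bool((status >> 61) & 1),
        misc_valid=bool((status >> 59) & 1),
        addr_valid=bool((status >> 58) & 1),
        corrected_count=(status >> 38) & 0x7FFF,
        error_code=status & 0xFFFF,
    )


def describe_memory_error(code: int) -> Optional[Tuple[str, int]]:
    """Return (request tag, channel) for a memory-controller code, else None."""
    if code & 0xFF80 != 0x0080:      # not of the form 0000 0000 1MMM CCCC
        return None
    tag = MMM_TAGS.get((code >> 4) & 0x7, "??")
    channel = code & 0xF             # 0xF would mean "channel not specified"
    return tag, channel


if __name__ == "__main__":
    for raw in (0x8C00004000010091, 0x8C000046000800C3):
        s = decode_status(raw)
        kind = "uncorrected" if s.uncorrected else "corrected"
        print(hex(raw), kind, s.corrected_count, describe_memory_error(s.error_code))
```

Decoded this way, both values have the UC bit clear and a corrected count of 1, with one read error on channel 1 and one memory scrubbing error on channel 3. That matches the "COR (1) RD channel 1" and "COR (1) MS channel 3" lines the kernel already printed, so the sketch only confirms the log; it does not add new evidence.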
dmesg.today is showing:
Code:
MCA: Bank 11, Status 0x8c000046000800c3
MCA: Global Cap 0x0000000001000c14, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x206d7, APIC ID 46
MCA: CPU 30 COR (1) MS channel 3 memory error
MCA: Address 0x1b101545c0
MCA: Misc 0x908420002000c8c
Limiting closed port RST response from 250 to 200 packets/sec
nfsd: can't register svc name
And here is a link to /var/log/messages, since it is a bit long to post here (I don't think the .login_conf errors are related):
https://pastebin.com/pYqCSfzJ
Also, in the Intel BMC Web Console I saw this:
Code:
Event ID  Time Stamp           Sensor Name      Sensor Type  Description
442       05/14/2017 13:00:39  Mmry ECC Sensor  Memory       Correctable ECC. CPU: 2, DIMM: H1. - Asserted
Is there anywhere else I can look for clues? Anyone have an idea?
Thanks.