TrueNAS SCALE errors after a few hours

wackymole

Explorer
Joined
Aug 21, 2017
Messages
59
Hello,

I have been trying to troubleshoot this server for a while now. I had it running FreeNAS for a couple of years, and then it started to crash. It would reboot every couple of days.

I took it offline to diagnose it. I didn't see anything, so I thought I would update it and remove the encryption.

I updated it to TrueNAS SCALE. It would hang and never recover until rebooted.

I checked ram, it passed MemTest86

I replaced Power Supply. No change.

I updated BIOS from 4.2 to 8.02 and it no longer hangs, but I get constant errors after 5-12 hours. Then every hour at least, sometimes every 10 minutes, until reboot.

Specs
TrueNAS-SCALE-22.12.3.3
AMD Ryzen 7 3700X 8-Core Processor
ASRock B450 Pro4
L8.02 BIOS
64 GB ECC ~ 2600 MHz
2 pools
1 pool ZFS RAIDZ3 - 10 drives, 14 TB each - no encryption
1 pool just a 2 TB SSD
2 x expansion cards
1 x Intel modem expansion card for network
GPU Nvidia P2000
Main OS on an NVMe, ~250 GB I think

I am thinking next step is replace the processor, but I don't know.


Code:
 Error: Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 204, in call_method
    result = await self.middleware._call(message['method'], serviceobj, methodobj, params, app=self)
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1344, in _call
    return await methodobj(*prepared_call.args)
  File "/usr/lib/python3/dist-packages/middlewared/schema.py", line 1246, in nf
    res = await f(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/schema.py", line 1378, in nf
    return await func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/plugins/docker_linux/images.py", line 68, in query
    for image in await docker.images.list():
  File "/usr/lib/python3/dist-packages/aiodocker/images.py", line 31, in list
    response = await self.docker._query_json("images/json", "GET", params=params)
  File "/usr/lib/python3/dist-packages/aiodocker/docker.py", line 302, in _query_json
    async with self._query(
  File "/usr/lib/python3/dist-packages/aiodocker/utils.py", line 309, in __aenter__
    resp = await self._coro
  File "/usr/lib/python3/dist-packages/aiodocker/docker.py", line 240, in _do_query
    await self._check_version()
  File "/usr/lib/python3/dist-packages/aiodocker/docker.py", line 192, in _check_version
    ver = await self._query_json("version", versioned_api=False)
  File "/usr/lib/python3/dist-packages/aiodocker/docker.py", line 302, in _query_json
    async with self._query(
  File "/usr/lib/python3/dist-packages/aiodocker/utils.py", line 309, in __aenter__
    resp = await self._coro
  File "/usr/lib/python3/dist-packages/aiodocker/docker.py", line 250, in _do_query
    response = await self.session.request(
  File "/usr/lib/python3/dist-packages/aiohttp/client.py", line 544, in _request
    await resp.start(conn)
  File "/usr/lib/python3/dist-packages/aiohttp/client_reqrep.py", line 905, in start
    self._continue = None
  File "/usr/lib/python3/dist-packages/aiohttp/helpers.py", line 656, in __exit__
    raise asyncio.TimeoutError from None
asyncio.exceptions.TimeoutError


Code:
New alerts:

    Failed to check for alert Smartd: Traceback (most recent call last): File "/usr/lib/python3/dist-packages/middlewared/plugins/alert.py", line 784, in __run_source alerts = (await alert_source.check()) or [] File "/usr/lib/python3/dist-packages/middlewared/alert/base.py", line 223, in check return await self.middleware.run_in_thread(self.check_sync) File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1261, in run_in_thread return await self.run_in_executor(self.thread_pool_executor, method, *args, **kwargs) File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1258, in run_in_executor return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs)) File "/usr/lib/python3.9/concurrent/futures/thread.py", line 52, in run result = self.fn(*self.args, **self.kwargs) File "/usr/lib/python3/dist-packages/middlewared/alert/source/smartd.py", line 22, in check_sync if not self.middleware.call_sync("service.started", "smartd"): File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1414, in call_sync return self.run_coroutine(methodobj(*prepared_call.args)) File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1454, in run_coroutine return fut.result() File "/usr/lib/python3.9/concurrent/futures/_base.py", line 433, in result return self.__get_result() File "/usr/lib/python3.9/concurrent/futures/_base.py", line 389, in __get_result raise self._exception File "/usr/lib/python3/dist-packages/middlewared/schema.py", line 1378, in nf return await func(*args, **kwargs) File "/usr/lib/python3/dist-packages/middlewared/schema.py", line 1246, in nf res = await f(*args, **kwargs) File "/usr/lib/python3/dist-packages/middlewared/plugins/service.py", line 201, in started state = await service_object.get_state() File "/usr/lib/python3/dist-packages/middlewared/plugins/service_/services/base.py", line 37, in get_state return await self.middleware.run_in_thread(self._get_state_sync) File 
"/usr/lib/python3/dist-packages/middlewared/main.py", line 1261, in run_in_thread return await self.run_in_executor(self.thread_pool_executor, method, *args, **kwargs) File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1258, in run_in_executor return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs)) File "/usr/lib/python3.9/concurrent/futures/thread.py", line 52, in run result = self.fn(*self.args, **self.kwargs) File "/usr/lib/python3/dist-packages/middlewared/plugins/service_/services/base.py", line 42, in _get_state_sync state = unit.Unit.ActiveState File "/usr/lib/python3/dist-packages/pystemd/base.py", line 191, in _call return func(self, name, *args) File "/usr/lib/python3/dist-packages/pystemd/base.py", line 127, in _get_property return bus.get_property( File "pystemd/dbuslib.pyx", line 478, in pystemd.dbuslib.DBus.get_property pystemd.dbusexc.DBusTimeoutError: [err -110]: b'Connection timed out'




Both logs indicate some sort of timeout error, for example:
[err -110]: b'Connection timed out'


The attached jpg is what I got when I tried to restart through the GUI.

I have a similar system with an AMD Ryzen 5 PRO 4650G that is running fine.

If anyone has any ideas I would love to hear them.
 

Attachments

  • IMG_1853.jpg

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
It's clearly some type of hardware failure.... but I doubt it's the processor.

The debug file or logs might show an error event....
 

wackymole

I also saw this error, but I haven't seen it for a while.

Code:
These alerts have been cleared:

    Device /dev/disk/by-partuuid/8ce0e87a-4f1b-4953-933f-d4964c35762a is causing slow I/O on pool primedev.
    Device /dev/disk/by-partuuid/45354b7a-c425-4978-8eeb-06b3f3db9621 is causing slow I/O on pool primedev.
    Device /dev/disk/by-partuuid/6c49bb3b-67d8-4f11-82b7-1f934b35e12b is causing slow I/O on pool primedev.
    Device /dev/disk/by-partuuid/f7e8b9bb-d04a-49fc-bbfe-5aa00bdbb75b is causing slow I/O on pool primedev.
    Failed to check for alert Smartd: Traceback (most recent call last): File "/usr/lib/python3/dist-packages/middlewared/plugins/alert.py", line 784, in __run_source alerts = (await alert_source.check()) or [] File "/usr/lib/python3/dist-packages/middlewared/alert/base.py", line 223, in check return await self.middleware.run_in_thread(self.check_sync) File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1261, in run_in_thread return await self.run_in_executor(self.thread_pool_executor, method, *args, **kwargs) File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1258, in run_in_executor return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs)) File "/usr/lib/python3.9/concurrent/futures/thread.py", line 52, in run result = self.fn(*self.args, **self.kwargs) File "/usr/lib/python3/dist-packages/middlewared/alert/source/smartd.py", line 22, in check_sync if not self.middleware.call_sync("service.started", "smartd"): File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1414, in call_sync return self.run_coroutine(methodobj(*prepared_call.args)) File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1454, in run_coroutine return fut.result() File "/usr/lib/python3.9/concurrent/futures/_base.py", line 433, in result return self.__get_result() File "/usr/lib/python3.9/concurrent/futures/_base.py", line 389, in __get_result raise self._exception File "/usr/lib/python3/dist-packages/middlewared/schema.py", line 1378, in nf return await func(*args, **kwargs) File "/usr/lib/python3/dist-packages/middlewared/schema.py", line 1246, in nf res = await f(*args, **kwargs) File "/usr/lib/python3/dist-packages/middlewared/plugins/service.py", line 201, in started state = await service_object.get_state() File "/usr/lib/python3/dist-packages/middlewared/plugins/service_/services/base.py", line 37, in get_state return await self.middleware.run_in_thread(self._get_state_sync) File 
"/usr/lib/python3/dist-packages/middlewared/main.py", line 1261, in run_in_thread return await self.run_in_executor(self.thread_pool_executor, method, *args, **kwargs) File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1258, in run_in_executor return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs)) File "/usr/lib/python3.9/concurrent/futures/thread.py", line 52, in run result = self.fn(*self.args, **self.kwargs) File "/usr/lib/python3/dist-packages/middlewared/plugins/service_/services/base.py", line 40, in _get_state_sync unit = self._get_systemd_unit() File "/usr/lib/python3/dist-packages/middlewared/plugins/service_/services/base.py", line 69, in _get_systemd_unit unit.load() File "/usr/lib/python3/dist-packages/pystemd/base.py", line 89, in load unit_xml = self.get_introspect_xml() File "/usr/lib/python3/dist-packages/pystemd/base.py", line 75, in get_introspect_xml bus.call_method( File "pystemd/dbuslib.pyx", line 442, in pystemd.dbuslib.DBus.call_method pystemd.dbusexc.DBusTimeoutError: [err -110]: b'Connection timed out'
    Failed to get properties: Transport endpoint is not connected



I checked the error log: lots of corrected ECC DRAM errors. MemTest was clean, though.
Code:
Sep  9 16:46:49 truenas blkmapd[1024]: open pipe file /run/rpc_pipefs/nfs/blocklayout failed: No such file or directory
Sep  9 16:46:49 truenas kernel: Error: Driver 'pcspkr' is already registered, aborting...
Sep  9 16:47:30 truenas systemd-modules-load[7558]: Failed to find module 'nvidia-drm'
Sep  9 16:48:01 truenas systemd-tmpfiles[7807]: "/var/log" already exists and is not a directory.
Sep  9 16:48:01 truenas systemd[1]: Failed to start console-setup.service - Set console font and keymap.
Sep  9 16:48:01 truenas systemd[1]: Failed to start nslcd.service - LSB: LDAP connection daemon.
Sep  9 16:51:51 truenas kernel: [Hardware Error]: Corrected error, no action required.
Sep  9 16:51:51 truenas kernel: [Hardware Error]: CPU:0 (17:71:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Sep  9 16:51:51 truenas kernel: [Hardware Error]: Error Addr: 0x00000000a1618d80
Sep  9 16:51:51 truenas kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0xe4da80000a800603
Sep  9 16:51:51 truenas kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Sep  9 16:51:51 truenas kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Sep  9 16:53:55 truenas systemd[1]: Failed to start collectd.service - Statistics collection and monitoring daemon.
Sep  9 16:54:14 truenas collectd[16588]: rrdcached plugin: Failed to connect to RRDCacheD at unix:/var/run/rrdcached.sock: Unable to connect to rrdcached: No such file or directory (status=2)
Sep  9 16:54:14 truenas collectd[16588]: rrdcached plugin: Failed to connect to RRDCacheD at unix:/var/run/rrdcached.sock: Unable to connect to rrdcached: No such file or directory (status=2)
Sep  9 16:57:02 truenas kernel: [Hardware Error]: Corrected error, no action required.
Sep  9 16:57:02 truenas kernel: [Hardware Error]: CPU:0 (17:71:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Sep  9 16:57:02 truenas kernel: [Hardware Error]: Error Addr: 0x00000000a28915c0
Sep  9 16:57:02 truenas kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0xe4da80000a800603
Sep  9 16:57:02 truenas kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Sep  9 16:57:02 truenas kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Sep  9 17:02:13 truenas kernel: [Hardware Error]: Corrected error, no action required.
Sep  9 17:02:13 truenas kernel: [Hardware Error]: CPU:0 (17:71:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Sep  9 17:02:13 truenas kernel: [Hardware Error]: Error Addr: 0x00000000a24518a0
Sep  9 17:02:13 truenas kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0xe4da80000a800603
Sep  9 17:02:13 truenas kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Sep  9 17:02:13 truenas kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Sep  9 17:07:25 truenas kernel: [Hardware Error]: Corrected error, no action required.
Sep  9 17:07:25 truenas kernel: [Hardware Error]: CPU:0 (17:71:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Sep  9 17:07:25 truenas kernel: [Hardware Error]: Error Addr: 0x00000000a592c580
Sep  9 17:07:25 truenas kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0xe4da80000a800603
Sep  9 17:07:25 truenas kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Sep  9 17:07:25 truenas kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Sep  9 17:12:36 truenas kernel: [Hardware Error]: Corrected error, no action required.
Sep  9 17:12:36 truenas kernel: [Hardware Error]: CPU:0 (17:71:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Sep  9 17:12:36 truenas kernel: [Hardware Error]: Error Addr: 0x00000000a0358d80
Sep  9 17:12:36 truenas kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0xe4da80000a800603
Sep  9 17:12:36 truenas kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Sep  9 17:12:36 truenas kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Sep  9 17:17:47 truenas kernel: [Hardware Error]: Corrected error, no action required.
Sep  9 17:17:47 truenas kernel: [Hardware Error]: CPU:0 (17:71:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Sep  9 17:17:47 truenas kernel: [Hardware Error]: Error Addr: 0x00000000a6479d40
Sep  9 17:17:47 truenas kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0xe4da80000a800603
Sep  9 17:17:47 truenas kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Sep  9 17:17:47 truenas kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Sep  9 17:22:58 truenas kernel: [Hardware Error]: Corrected error, no action required.
Sep  9 17:22:58 truenas kernel: [Hardware Error]: CPU:0 (17:71:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Sep  9 17:22:58 truenas kernel: [Hardware Error]: Error Addr: 0x00000000a54a7ac0
Sep  9 17:22:58 truenas kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0xe4da80000a800603
Sep  9 17:22:58 truenas kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Sep  9 17:22:58 truenas kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Sep  9 17:28:10 truenas kernel: [Hardware Error]: Corrected error, no action required.
Sep  9 17:28:10 truenas kernel: [Hardware Error]: CPU:0 (17:71:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Sep  9 17:28:10 truenas kernel: [Hardware Error]: Error Addr: 0x00000000a2451cc0
Sep  9 17:28:10 truenas kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0xe4da80000a800603
Sep  9 17:28:10 truenas kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Sep  9 17:28:10 truenas kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Sep  9 17:33:21 truenas kernel: [Hardware Error]: Corrected error, no action required.
Sep  9 17:33:21 truenas kernel: [Hardware Error]: CPU:0 (17:71:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
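
For readers who want to check what the MC18_STATUS value in these lines encodes, here is a minimal Python sketch that decodes the flag bits. The bit positions are an assumption based on how the Linux kernel's AMD machine-check decoder prints them, and the function name is hypothetical; this is an illustration, not a replacement for the kernel's own decoding.

```python
# Flag bit positions as assumed from the Linux kernel's AMD MCE decoder.
# Verify against your kernel/CPU documentation before relying on them.
MCA_STATUS_FLAGS = {
    63: "Val",       # register contents are valid
    62: "Over",      # overflow: an earlier error was overwritten
    61: "UC",        # uncorrected error
    60: "En",        # error reporting enabled
    59: "MiscV",     # MISC register valid
    58: "AddrV",     # address register valid
    57: "PCC",       # processor context corrupt
    53: "SyndV",     # syndrome register valid
    46: "CECC",      # corrected by ECC
    45: "UECC",      # uncorrectable ECC error
    44: "Deferred",
    43: "Poison",
}


def decode_mca_status(status: int) -> list[str]:
    """Return the names of the flag bits set in an MCA status value."""
    return [
        name
        for bit, name in sorted(MCA_STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]


# Status value copied from the log lines above.
flags = decode_mca_status(0xDC2040000000011B)
print(flags)
```

With the value from the log, the set flags include Over and CECC and exclude UC and UECC, which is consistent with the kernel's "Corrected error, no action required" message. The Over flag suggests errors are arriving faster than they are being read out.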



The messages log is full of this stuff:
Code:
Sep 17 10:38:13 truenas kernel: mce: [Hardware Error]: Machine check events logged
Sep 17 10:38:13 truenas kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x2b7f47 offset:0x1c0 grain:64 syndrome:0x8000)
Sep 17 10:43:24 truenas kernel: mce: [Hardware Error]: Machine check events logged
Sep 17 10:43:24 truenas kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x2b856d offset:0x500 grain:64 syndrome:0x8000)
Sep 17 10:53:47 truenas kernel: mce: [Hardware Error]: Machine check events logged
Sep 17 10:53:47 truenas kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x2b7bcf offset:0x1c0 grain:64 syndrome:0x8000)
Sep 17 11:09:21 truenas kernel: mce: [Hardware Error]: Machine check events logged
Sep 17 11:09:21 truenas kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x2b938b offset:0xdc0 grain:64 syndrome:0x8000)
Sep 17 11:14:32 truenas kernel: mce: [Hardware Error]: Machine check events logged
Sep 17 11:14:32 truenas kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x2c3c0d offset:0xd80 grain:64 syndrome:0x8000)
Sep 17 11:19:43 truenas kernel: mce: [Hardware Error]: Machine check events logged
Sep 17 11:19:43 truenas kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x2b6c86 offset:0xd80 grain:64 syndrome:0x8000)
Sep 17 11:24:54 truenas kernel: mce: [Hardware Error]: Machine check events logged
Sep 17 11:24:54 truenas kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x2c2fc4 offset:0xd60 grain:64 syndrome:0x8000)
Sep 17 11:30:06 truenas kernel: mce: [Hardware Error]: Machine check events logged
Sep 17 11:30:06 truenas kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x2b7a6b offset:0xd80 grain:64 syndrome:0x8000)
Sep 17 11:35:17 truenas kernel: mce: [Hardware Error]: Machine check events logged
Sep 17 11:35:17 truenas kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x2c3c0d offset:0xd80 grain:64 syndrome:0x8000)
Sep 17 11:40:28 truenas kernel: mce: [Hardware Error]: Machine check events logged
Sep 17 11:40:28 truenas kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x2b571a offset:0x900 grain:64 syndrome:0x8000)
Sep 17 11:45:40 truenas kernel: mce: [Hardware Error]: Machine check events logged
Sep 17 11:45:40 truenas kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x2b7f49 offset:0xd80 grain:64 syndrome:0x8000)
Sep 17 11:50:51 truenas kernel: mce: [Hardware Error]: Machine check events logged
Sep 17 11:50:51 truenas kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x2b74bf offset:0xd80 grain:64 syndrome:0x8000)
Sep 17 11:56:02 truenas kernel: mce: [Hardware Error]: Machine check events logged
Sep 17 11:56:02 truenas kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x2bbc3d offset:0xd80 grain:64 syndrome:0x8000)
Sep 17 12:01:14 truenas kernel: mce: [Hardware Error]: Machine check events logged
Sep 17 12:01:14 truenas kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x2b7f4a offset:0xd80 grain:64 syndrome:0x8000)
Sep 17 12:06:25 truenas kernel: mce: [Hardware Error]: Machine check events logged
Sep 17 12:06:25 truenas kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x2b7515 offset:0x5c0 grain:64 syndrome:0x8000)
Sep 17 12:16:47 truenas kernel: mce: [Hardware Error]: Machine check events logged
Sep 17 12:16:47 truenas kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#0 (csrow:3 channel:0 page:0x6b5c71 offset:0xc00 grain:64 syndrome:0x100)
Sep 17 12:16:47 truenas kernel: mce: [Hardware Error]: Machine check events logged
Sep 17 12:16:47 truenas kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x2c1e04 offset:0xd60 grain:64 syndrome:0x8000)
Sep 17 12:21:59 truenas kernel: mce: [Hardware Error]: Machine check events logged
Sep 17 12:21:59 truenas kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x2c2ca1 offset:0xd80 grain:64 syndrome:0x8000)
Sep 17 12:27:10 truenas kernel: mce: [Hardware Error]: Machine check events logged
Sep 17 12:27:10 truenas kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x2c7097 offset:0xd40 grain:64 syndrome:0x8000)
Sep 17 12:32:21 truenas kernel: mce: [Hardware Error]: Machine check events logged
Sep 17 12:32:21 truenas kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x2b790f offset:0x1c0 grain:64 syndrome:0x8000)
Sep 17 12:47:55 truenas kernel: mce: [Hardware Error]: Machine check events logged
Sep 17 12:47:55 truenas kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x2b87a4 offset:0xd80 grain:64 syndrome:0x8000)
Sep 17 12:53:06 truenas kernel: mce: [Hardware Error]: Machine check events logged
Sep 17 12:53:06 truenas kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x2c557b offset:0x580 grain:64 syndrome:0x8000)
Sep 17 12:58:18 truenas kernel: mce: [Hardware Error]: Machine check events logged
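
To see whether the corrected errors cluster on one memory location, the EDAC lines can be tallied per controller, csrow, and channel. The Python sketch below does this for a few lines copied from the log above; the regular expression and function name are assumptions written for this log format only. How csrow and channel map to a physical DIMM slot depends on the board and is not something this script can tell you.

```python
import re
from collections import Counter

# Matches lines such as:
# "... EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:...)"
EDAC_CE = re.compile(
    r"EDAC MC(\d+): (\d+) CE on .*?\(csrow:(\d+) channel:(\d+) "
)


def tally_corrected_errors(lines):
    """Count corrected errors per (memory controller, csrow, channel)."""
    counts = Counter()
    for line in lines:
        match = EDAC_CE.search(line)
        if match:
            mc, n, csrow, channel = map(int, match.groups())
            counts[(mc, csrow, channel)] += n
    return counts


# A few lines copied from the log above; in practice, read them from a file.
sample = [
    "Sep 17 12:16:47 truenas kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#0 (csrow:3 channel:0 page:0x6b5c71 offset:0xc00 grain:64 syndrome:0x100)",
    "Sep 17 12:16:47 truenas kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x2c1e04 offset:0xd60 grain:64 syndrome:0x8000)",
    "Sep 17 12:21:59 truenas kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x2c2ca1 offset:0xd80 grain:64 syndrome:0x8000)",
    "Sep 17 12:58:18 truenas kernel: mce: [Hardware Error]: Machine check events logged",
]

counts = tally_corrected_errors(sample)
for location, total in counts.most_common():
    print(location, total)
```

In the excerpt above, almost every entry is on csrow 3, channel 1, with a single entry on channel 0, so a tally like this would point at one location rather than errors spread across all the memory.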


I am going to try without ECC memory.

Interesting read I found.
 

morganL

I doubt ECC memory is causing the issue.... but perhaps one of the memory sticks is bad?
ECC hides some of the problems, but if there are too many errors it will cause a crash.
How long did the memtest run and does it detect ECC errors that were corrected?
 

wackymole

Memtest ran for at least 12 hours. I did four (4) passes and received the green checkmark. I didn't see any ECC errors in its logs, but I know this computer has ECC errors all the time. My server didn't crash last night; I had stopped all the Docker containers. I'm going to move Docker to the SSD and start it again to see if it fails tonight.
 

wackymole

It actually appears to be coming from my Nvidia P2000 GPU with Plex.
I did replace one stick of DDR4, which did reduce the ECC errors, but it is still crashing and hanging.

I have this log from the crash, at about 12:55:
Oct 17 20:12:38 primenas kernel: Error: Driver 'pcspkr' is already registered, aborting...
Oct 17 20:12:38 primenas kernel:
Oct 17 20:12:55 primenas systemd-modules-load[5090]: Failed to find module 'nvidia-drm'
Oct 17 20:13:26 primenas systemd-tmpfiles[7288]: "/var/log" already exists and is not a directory.
Oct 17 20:13:26 primenas systemd[1]: Failed to start nslcd.service - LSB: LDAP connection daemon.
Oct 17 20:13:26 primenas systemd[1]: Failed to start console-setup.service - Set console font and keymap.

And the display had a lot of error messages about Nvidia.
 

wackymole

Final update: it was the CPU. One stick of ECC RAM was also going bad, but it wasn't the cause of the crashes. I changed the 3700X to a 5700X on the same motherboard, and it's been up for 2+ days now. Plex, VMs, etc. are all running with the GPU.
 