mehran
Cadet
- Joined: Apr 24, 2023
- Messages: 6
I have TrueNAS SCALE (TrueNAS-SCALE-23.10.2) set up at home. I use it for data storage as well as for a few home-lab services. For example, I run an MLflow service (a Docker image) that logs the metrics of my machine learning experiments. Simply put, it's a web application whose API I call to save some data into a database. To be clear, the ML experiments do not run on TrueNAS SCALE; only the results are saved there.
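For context, this is roughly how the experiment machine talks to that server; a minimal sketch against MLflow's documented REST endpoint, where the hostname, port, and run ID are placeholders for my setup:

Code:
# Host/port are placeholders for my box; the run ID below is made up.
MLFLOW_URL="http://truenas.local:5000"

# Quick liveness check against the tracking server's /health endpoint.
curl -sf "$MLFLOW_URL/health" || echo "MLflow API unreachable"

# Log a metric over the REST API (roughly what the Python client does underneath).
curl -s -X POST "$MLFLOW_URL/api/2.0/mlflow/runs/log-metric" \
    -H "Content-Type: application/json" \
    -d '{"run_id": "<run-id>", "key": "accuracy", "value": 0.93, "timestamp": 1710128561000, "step": 1}'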
If this web server is unavailable, the machine running the actual ML experiment fails with an "API inaccessible" error. This happens from time to time, and I don't know why; there's no obvious cause. If it were a power outage (I don't have a UPS), my other machine would be affected too (it isn't), and TrueNAS wouldn't come back on by itself afterwards (it's still running). That has happened to me before, but it's not what's happening here.
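To pin down the exact moment the API drops (and to decouple that from the experiment script), I'm considering a simple watchdog on the experiment machine, along these lines (host/port assumed as above):

Code:
# Poll the MLflow health endpoint and timestamp every failure,
# so outages can be matched against uptime-kuma notifications.
while true; do
    if ! curl -sf --max-time 5 "http://truenas.local:5000/health" > /dev/null; then
        echo "$(date -Is) MLflow unreachable" >> mlflow_outages.log
    fi
    sleep 30
done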
One other observation: I have an uptime-kuma app running on the same TrueNAS box, and it sends me a notification each time it restarts. Every time the MLflow API fails, I also get a restart notification from uptime-kuma (uptime-kuma isn't monitoring MLflow; it monitors some other apps). So it looks like the whole Kubernetes stack, or TrueNAS itself, is being reset.
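To confirm whether all the pods really go down together, I suppose I can check restart counters and recent cluster events through the bundled k3s (assuming k3s kubectl is the right way to poke at apps on SCALE):

Code:
# Restart counts per pod; a simultaneous jump across namespaces would
# point at a cluster-wide reset rather than one flaky app.
sudo k3s kubectl get pods -A

# Recent cluster events, oldest first, around the restarts.
sudo k3s kubectl get events -A --sort-by=.lastTimestamp | tail -n 50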
I checked the logs and this is all I see:
Code:
$ tail /var/log/messages -n 500 | grep "Mar 11"
Mar 11 04:02:41 truenas kernel: kube-bridge: port 51(veth43d2ef41) entered blocking state
Mar 11 04:02:41 truenas kernel: kube-bridge: port 51(veth43d2ef41) entered forwarding state
Mar 11 04:02:51 truenas kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth8aca59cb: link becomes ready
Mar 11 04:02:51 truenas kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Mar 11 04:02:51 truenas kernel: kube-bridge: port 55(veth8aca59cb) entered blocking state
Mar 11 04:02:51 truenas kernel: kube-bridge: port 55(veth8aca59cb) entered disabled state
Mar 11 04:02:51 truenas kernel: device veth8aca59cb entered promiscuous mode
Mar 11 04:02:51 truenas kernel: kube-bridge: port 55(veth8aca59cb) entered blocking state
Mar 11 04:02:51 truenas kernel: kube-bridge: port 55(veth8aca59cb) entered forwarding state
Mar 11 04:03:19 truenas kernel: kube-bridge: port 56(vethf71e63c4) entered disabled state
Mar 11 04:03:19 truenas kernel: device vethf71e63c4 left promiscuous mode
Mar 11 04:03:19 truenas kernel: kube-bridge: port 56(vethf71e63c4) entered disabled state
Mar 11 20:59:11 truenas systemd-journald[473]: Data hash table of /var/log/journal/a7d8b70ff4f9462d8d4f33d50337384c/system.journal has a fill level at 75.0 (8533 of 11377 items, 6553600 file size, 768 bytes per hash table item), suggesting rotation.
Mar 11 20:59:11 truenas systemd-journald[473]: /var/log/journal/a7d8b70ff4f9462d8d4f33d50337384c/system.journal: Journal header limits reached or header out-of-date, rotating.
Mar 11 20:59:11 truenas systemd-journald[473]: Failed to set ACL on /var/log/journal/a7d8b70ff4f9462d8d4f33d50337384c/user-3000.journal, ignoring: Operation not supported
These entries are close to the times when I suspect the problem occurred, but they don't quite match the exact moment.
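Besides /var/log/messages, I figure I can also check whether the whole box rebooted and pull the k3s service logs around the suspected window (assuming SCALE runs k3s as a systemd unit; the timestamps below are just examples):

Code:
# Did the machine itself reboot, or only the apps?
last reboot | head

# k3s service journal around the suspected window.
sudo journalctl -u k3s --since "2024-03-11 03:50" --until "2024-03-11 04:10"

# The middleware log sometimes records app/k3s restarts as well.
sudo tail -n 200 /var/log/middlewared.log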
Any idea how I can figure out the problem?