murarustefaan (Cadet; joined Nov 20, 2023; 4 messages)
Hello everyone,
I'm new here, so pardon me if I've left out any information; just let me know and I'll share it.
I also searched the forums but did not find a similar issue.
To start with a hopefully brief history: I've been running TrueNAS SCALE for almost 2 years now on an old PC, incrementally improving it where needed. I've also been running ~3 VMs, all Ubuntu-based with little to no load (Plex, Ghost, and a monitoring one with Prometheus and Cloudflare Tunnels). For that reason, the system was also configured with a br0 bridge NIC for VM access.
Now to the present: since the Cobia upgrade I've been having some strange issues that I can't definitively pin on the upgrade itself, yet I don't have another solid explanation either. After a few hours of running perfectly fine, SCALE loses all connectivity: I can't access the UI, can't ping it (No route to host), and can't mount NFS shares. SCALE is also missing all metrics for the period between the "crash" and the restart.
The only way to recover is a hard power-off of the whole system.
While trying to fix it, I stopped all VMs and removed the bridge NIC altogether (I wanted to migrate the VMs anyway), hoping the issue was the bridge NIC behaving erratically. However, this did not fix the issue.
I also searched the system logs, journal, and dmesg, but did not find anything related.
Journalctl logs for the 2 hours before the last crash, which happened on Nov 20 at 01:12:
Code:
Nov 19 23:00:01 warehouse CRON[16412]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Nov 19 23:00:01 warehouse CRON[16413]: (root) CMD (PATH="/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin:/root/bin" midclt call cloudsync.sync 1 > /dev/null 2> /dev/null)
Nov 19 23:00:01 warehouse CRON[16412]: pam_unix(cron:session): session closed for user root
Nov 19 23:00:20 warehouse systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
Nov 19 23:00:20 warehouse systemd[1]: sysstat-collect.service: Deactivated successfully.
Nov 19 23:00:20 warehouse systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
Nov 19 23:10:20 warehouse systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
Nov 19 23:10:20 warehouse systemd[1]: sysstat-collect.service: Deactivated successfully.
Nov 19 23:10:20 warehouse systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
Nov 19 23:15:31 warehouse smartd[4155]: Device: /dev/sdg [SAT], 16 Currently unreadable (pending) sectors
Nov 19 23:15:31 warehouse smartd[4155]: Device: /dev/sdg [SAT], 16 Offline uncorrectable sectors
Nov 19 23:15:32 warehouse smartd[4155]: Device: /dev/sdg [SAT], 16 Currently unreadable (pending) sectors
Nov 19 23:15:32 warehouse smartd[4155]: Device: /dev/sdg [SAT], 16 Offline uncorrectable sectors
Nov 19 23:15:32 warehouse smartd[4155]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 64 to 63
Nov 19 23:15:32 warehouse smartd[4155]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 36 to 37
Nov 19 23:15:33 warehouse smartd[4155]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 64 to 63
Nov 19 23:15:33 warehouse smartd[4155]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 36 to 37
Nov 19 23:15:33 warehouse smartd[4155]: Device: /dev/sdh [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 74 to 75
Nov 19 23:15:33 warehouse smartd[4155]: Device: /dev/sdh [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 64 to 63
Nov 19 23:15:33 warehouse smartd[4155]: Device: /dev/sdh [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 36 to 37
Nov 19 23:15:33 warehouse smartd[4155]: Device: /dev/sdh [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 74 to 75
Nov 19 23:15:33 warehouse smartd[4155]: Device: /dev/sdh [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 74 to 75
Nov 19 23:15:33 warehouse smartd[4155]: Device: /dev/sdh [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 64 to 63
Nov 19 23:15:33 warehouse smartd[4155]: Device: /dev/sdh [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 36 to 37
Nov 19 23:15:33 warehouse smartd[4155]: Device: /dev/sdh [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 74 to 75
Nov 19 23:15:49 warehouse nscd[16801]: 16801 monitoring file `/etc/hosts` (1)
Nov 19 23:15:49 warehouse nscd[16801]: 16801 monitoring directory `/etc` (2)
Nov 19 23:15:49 warehouse nscd[16801]: 16801 monitoring file `/etc/resolv.conf` (3)
Nov 19 23:15:49 warehouse nscd[16801]: 16801 monitoring directory `/etc` (2)
Nov 19 23:15:49 warehouse nscd[16801]: 16801 monitoring file `/etc/nsswitch.conf` (4)
Nov 19 23:15:49 warehouse nscd[16801]: 16801 monitoring directory `/etc` (2)
Nov 19 23:17:01 warehouse CRON[16860]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Nov 19 23:17:01 warehouse CRON[16861]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Nov 19 23:17:01 warehouse CRON[16860]: pam_unix(cron:session): session closed for user root
Nov 19 23:20:20 warehouse systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
Nov 19 23:20:20 warehouse systemd[1]: sysstat-collect.service: Deactivated successfully.
Nov 19 23:20:20 warehouse systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
Nov 19 23:30:20 warehouse systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
Nov 19 23:30:20 warehouse systemd[1]: sysstat-collect.service: Deactivated successfully.
Nov 19 23:30:20 warehouse systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
Nov 19 23:40:20 warehouse systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
Nov 19 23:40:20 warehouse systemd[1]: sysstat-collect.service: Deactivated successfully.
Nov 19 23:40:20 warehouse systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
Nov 19 23:45:31 warehouse smartd[4155]: Device: /dev/sdg [SAT], 16 Currently unreadable (pending) sectors
Nov 19 23:45:31 warehouse smartd[4155]: Device: /dev/sdg [SAT], 16 Offline uncorrectable sectors
Nov 19 23:45:32 warehouse smartd[4155]: Device: /dev/sdg [SAT], 16 Currently unreadable (pending) sectors
Nov 19 23:45:32 warehouse smartd[4155]: Device: /dev/sdg [SAT], 16 Offline uncorrectable sectors
Nov 19 23:45:32 warehouse smartd[4155]: Device: /dev/sdc [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 63 to 64
Nov 19 23:45:32 warehouse smartd[4155]: Device: /dev/sdc [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 37 to 36
Nov 19 23:45:33 warehouse smartd[4155]: Device: /dev/sdc [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 63 to 64
Nov 19 23:45:33 warehouse smartd[4155]: Device: /dev/sdc [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 37 to 36
Nov 19 23:45:33 warehouse smartd[4155]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 63 to 64
Nov 19 23:45:33 warehouse smartd[4155]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 37 to 36
Nov 19 23:45:33 warehouse smartd[4155]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 63 to 64
Nov 19 23:45:33 warehouse smartd[4155]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 37 to 36
Nov 19 23:45:33 warehouse smartd[4155]: Device: /dev/sdf [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 64 to 65
Nov 19 23:45:33 warehouse smartd[4155]: Device: /dev/sdf [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 36 to 35
Nov 19 23:45:33 warehouse smartd[4155]: Device: /dev/sdf [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 64 to 65
Nov 19 23:45:33 warehouse smartd[4155]: Device: /dev/sdf [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 36 to 35
Nov 19 23:45:34 warehouse smartd[4155]: Device: /dev/sdh [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 63 to 65
Nov 19 23:45:34 warehouse smartd[4155]: Device: /dev/sdh [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 37 to 35
Nov 19 23:45:34 warehouse smartd[4155]: Device: /dev/sdh [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 63 to 65
Nov 19 23:45:34 warehouse smartd[4155]: Device: /dev/sdh [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 37 to 35
Nov 19 23:50:20 warehouse systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
Nov 19 23:50:20 warehouse systemd[1]: sysstat-collect.service: Deactivated successfully.
Nov 19 23:50:20 warehouse systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
Nov 20 00:00:01 warehouse CRON[17882]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Nov 20 00:00:01 warehouse CRON[17883]: (root) CMD (PATH="/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin:/root/bin" midclt call cloudsync.sync 1 > /dev/null 2> /dev/null)
Nov 20 00:00:01 warehouse CRON[17882]: pam_unix(cron:session): session closed for user root
Nov 20 00:00:20 warehouse systemd[1]: Starting dpkg-db-backup.service - Daily dpkg database backup service...
Nov 20 00:00:20 warehouse systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
Nov 20 00:00:20 warehouse systemd[1]: Starting logrotate.service - Rotate log files...
Nov 20 00:00:20 warehouse systemd[1]: sysstat-collect.service: Deactivated successfully.
Nov 20 00:00:20 warehouse systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
Nov 20 00:00:20 warehouse systemd[1]: logrotate.service: Deactivated successfully.
Nov 20 00:00:20 warehouse systemd[1]: Finished logrotate.service - Rotate log files.
Nov 20 00:00:20 warehouse systemd[1]: dpkg-db-backup.service: Deactivated successfully.
Nov 20 00:00:20 warehouse systemd[1]: Finished dpkg-db-backup.service - Daily dpkg database backup service.
Nov 20 00:07:20 warehouse systemd[1]: Starting sysstat-summary.service - Generate a daily summary of process accounting...
Nov 20 00:07:21 warehouse systemd[1]: sysstat-summary.service: Deactivated successfully.
Nov 20 00:07:21 warehouse systemd[1]: Finished sysstat-summary.service - Generate a daily summary of process accounting.
Nov 20 00:10:20 warehouse systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
Nov 20 00:10:20 warehouse systemd[1]: sysstat-collect.service: Deactivated successfully.
Nov 20 00:10:20 warehouse systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
Nov 20 00:15:31 warehouse smartd[4155]: Device: /dev/sdg [SAT], 16 Currently unreadable (pending) sectors
Nov 20 00:15:31 warehouse smartd[4155]: Device: /dev/sdg [SAT], 16 Offline uncorrectable sectors
Nov 20 00:15:32 warehouse smartd[4155]: Device: /dev/sdg [SAT], 16 Currently unreadable (pending) sectors
Nov 20 00:15:32 warehouse smartd[4155]: Device: /dev/sdg [SAT], 16 Offline uncorrectable sectors
Nov 20 00:15:33 warehouse smartd[4155]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 64 to 65
Nov 20 00:15:33 warehouse smartd[4155]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 36 to 35
Nov 20 00:15:33 warehouse smartd[4155]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 64 to 65
Nov 20 00:15:33 warehouse smartd[4155]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 36 to 35
Nov 20 00:15:51 warehouse nscd[18427]: 18427 monitoring file `/etc/hosts` (1)
Nov 20 00:15:51 warehouse nscd[18427]: 18427 monitoring directory `/etc` (2)
Nov 20 00:15:51 warehouse nscd[18427]: 18427 monitoring file `/etc/resolv.conf` (3)
Nov 20 00:15:51 warehouse nscd[18427]: 18427 monitoring directory `/etc` (2)
Nov 20 00:15:51 warehouse nscd[18427]: 18427 monitoring file `/etc/nsswitch.conf` (4)
Nov 20 00:15:51 warehouse nscd[18427]: 18427 monitoring directory `/etc` (2)
Nov 20 00:17:01 warehouse CRON[18450]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Nov 20 00:17:01 warehouse CRON[18451]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Nov 20 00:17:01 warehouse CRON[18450]: pam_unix(cron:session): session closed for user root
Nov 20 00:20:00 warehouse systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
Nov 20 00:20:00 warehouse systemd[1]: sysstat-collect.service: Deactivated successfully.
Nov 20 00:20:00 warehouse systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
Nov 20 00:30:20 warehouse systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
Nov 20 00:30:20 warehouse systemd[1]: sysstat-collect.service: Deactivated successfully.
Nov 20 00:30:20 warehouse systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
Nov 20 00:40:00 warehouse systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
Nov 20 00:40:00 warehouse systemd[1]: sysstat-collect.service: Deactivated successfully.
Nov 20 00:40:00 warehouse systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
Nov 20 00:45:31 warehouse smartd[4155]: Device: /dev/sdg [SAT], 16 Currently unreadable (pending) sectors
Nov 20 00:45:31 warehouse smartd[4155]: Device: /dev/sdg [SAT], 16 Offline uncorrectable sectors
Nov 20 00:45:32 warehouse smartd[4155]: Device: /dev/sdg [SAT], 16 Currently unreadable (pending) sectors
Nov 20 00:45:32 warehouse smartd[4155]: Device: /dev/sdg [SAT], 16 Offline uncorrectable sectors
Nov 20 00:47:42 warehouse systemd[1]: Starting fstrim.service - Discard unused blocks on filesystems from /etc/fstab...
Nov 20 00:47:42 warehouse systemd[1]: fstrim.service: Deactivated successfully.
Nov 20 00:47:42 warehouse systemd[1]: Finished fstrim.service - Discard unused blocks on filesystems from /etc/fstab.
Nov 20 00:50:00 warehouse systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
Nov 20 00:50:00 warehouse systemd[1]: sysstat-collect.service: Deactivated successfully.
Nov 20 00:50:00 warehouse systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
Nov 20 01:00:01 warehouse systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
Nov 20 01:00:01 warehouse systemd[1]: sysstat-collect.service: Deactivated successfully.
Nov 20 01:00:01 warehouse systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
Nov 20 01:00:01 warehouse CRON[19574]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Nov 20 01:00:01 warehouse CRON[19575]: (root) CMD (PATH="/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin:/root/bin" midclt call cloudsync.sync 1 > /dev/null 2> /dev/null)
Nov 20 01:00:02 warehouse CRON[19574]: pam_unix(cron:session): session closed for user root
Nov 20 01:10:01 warehouse systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
Nov 20 01:10:01 warehouse systemd[1]: sysstat-collect.service: Deactivated successfully.
Nov 20 01:10:01 warehouse systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
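In case it helps anyone checking an excerpt like this, here is a minimal Python sketch that finds the last journal entry before the hang and flags any unusually long silence between entries. It is only an illustration: the names `parse_entries` and `find_gaps` are made up for this post, the year 2023 is an assumption (these timestamps do not carry one), and the 11-minute threshold is an assumption based on the roughly 10-minute sysstat-collect cadence visible in the log.

```python
import re
from datetime import datetime, timedelta

# Matches syslog-style lines such as:
# "Nov 20 01:10:01 warehouse systemd[1]: Finished sysstat-collect.service ..."
LINE_RE = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<unit>[^\[:]+)(?:\[\d+\])?: (?P<msg>.*)$"
)


def parse_entries(text, year=2023):
    """Return (timestamp, unit, message) tuples; the year is an assumption."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip blank or unrecognised lines
        ts = datetime.strptime(f"{year} {match['ts']}", "%Y %b %d %H:%M:%S")
        entries.append((ts, match["unit"], match["msg"]))
    return entries


def find_gaps(entries, threshold=timedelta(minutes=11)):
    """Return (previous, current) timestamp pairs further apart than threshold."""
    gaps = []
    for (prev, _, _), (cur, _, _) in zip(entries, entries[1:]):
        if cur - prev > threshold:
            gaps.append((prev, cur))
    return gaps


# Three lines copied from the excerpt above.
SAMPLE = """\
Nov 20 00:50:00 warehouse systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
Nov 20 01:00:02 warehouse CRON[19574]: pam_unix(cron:session): session closed for user root
Nov 20 01:10:01 warehouse systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
"""

entries = parse_entries(SAMPLE)
print("last entry before the hang:", entries[-1][0], entries[-1][1])
for prev, cur in find_gaps(entries):
    print("silence between", prev, "and", cur)
```

Run against the full journal of the previous boot, the last timestamp shows when logging stopped, and a gap well before that point might hint that the system was already stalling.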
Also, I believe you can ignore the /dev/sdg disk errors, as they have been there for about 2 months now. I plan to either swap that single disk or replace the whole vdev with a larger-capacity one, but haven't decided yet which approach to take.
I'd appreciate any ideas you have that I can try.
Thanks!