whodis
Cadet
- Joined: Mar 3, 2022
- Messages: 4
- Asrock Rack X470du
- Ryzen 7 3800X
- 64GB RAM
- (8) 14TB Seagate Exos X16 - RAIDZ2
- (1) 480GB Intel Optane 900P PCIe - SLOG
- (2) 500GB Samsung 970 Evo M.2 - RAID 0 Boot
- Onboard hard disk controllers through SATA backplanes.
- Onboard networking - (2) 1Gbps lagg
System has been running fine in this configuration for over 2 years. I've been upgrading this server for roughly 10 years, through OVM to FreeNAS to TrueNAS. This is a personal rig for media and backups. It's in a server rack in my house and has a UPS. Never had a single issue with it.
One of the pool disks started throwing unreadable sector errors, so I pulled it and had it replaced under warranty. The first 2 replacement drives they sent me were DOA. The third worked. I didn't really trust it after 2 DOA drives, so I put it on the test bench and ran it through a series of tests. It passed chkdsk, SeaTools, wmic, and a single pass of badblocks from a live CD. I put it in the server, added it to the pool, and it began resilvering. I let it run overnight and into the next day. When I went to check on it, it was only at 2.x% and I had many "unscheduled system reboot" alerts, ranging from 1 minute to ~3 hours apart. Nothing in the logs/dmesg. If I offline the replacement disk, the rest of the pool resilvers without incident. If I then online the disk, the resilvering process begins again, and so do the random unscheduled reboots.
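For anyone following along, the offline/online cycle described above looks roughly like this on the command line. This is a dry-run sketch rather than a transcript of my session: the pool name and gptid are the ones from the zpool status output in this post, while `run` and `DRY_RUN` are made-up helpers so the script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry-run sketch of the offline/online cycle.
# POOL and DISK come from the zpool status output in this post.
# run() and DRY_RUN are illustrative helpers, not part of ZFS.
POOL="tank"
DISK="gptid/fca9c832-99a6-11ec-8333-d05099d48bc5"
DRY_RUN="${DRY_RUN:-1}"   # default: print the commands instead of running them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Take the replacement disk offline; the rest of the pool then resilvers.
run zpool offline "$POOL" "$DISK"
# Check the pool state and resilver progress.
run zpool status "$POOL"
# Bring the disk back; this is the step after which the reboots return.
run zpool online "$POOL" "$DISK"
```

With the default DRY_RUN=1 this touches nothing and just echoes the three commands, which makes it safe to paste and adapt.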
I thought it was maybe a heat issue at first, but this is under load:
Code:
root@media[~]# smartctl -A /dev/ada5 | grep -i temperature
190 Airflow_Temperature_Cel 0x0022   057   047   040    Old_age   Always       -       43 (Min/Max 27/47)
194 Temperature_Celsius     0x0022   043   053   000    Old_age   Always       -       43 (0 26 0 0 0)
Even so, I pulled it from the rack and set it up in a cooler room (about 67°F; it normally sits in a CCed closet at about 73°F). With the case open, a couple of additional 120mm fans pulling, and a desk fan for the push, I tried running it again. Same problems. Power supply? I replaced the Seasonic Prime 650W with a new, cheap EVGA G3 850W. Same problem. Did the SATA cable go bad? I took the drive out of the chassis and connected it directly to the motherboard with a new SATA cable. Same problem. This is when I discovered that it works when I offline the disk. I just offlined it yet again, and the resilver of the pool plus a long S.M.A.R.T. test on the disk have been running for almost 5 hours now.
Code:
root@media[~]# zpool status
  pool: freenas-boot
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(5) for details.
  scan: scrub repaired 0B in 00:00:18 with 0 errors on Sun Feb 27 03:45:18 2022
config:

        NAME          STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            nvd1p2    ONLINE       0     0     0
            nvd2p2    ONLINE       0     0     0

errors: No known data errors

  pool: tank
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Mar 3 09:01:56 2022
        24.3T scanned at 1.19G/s, 22.7T issued at 1.11G/s, 27.4T total
        5.68G resilvered, 82.66% done, 01:12:59 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank                                            DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/d1fa7f1a-fde0-11e9-b9b1-d05099d48bc5  ONLINE       0     0     0
            gptid/d3344d35-fde0-11e9-b9b1-d05099d48bc5  ONLINE       0     0     0
            gptid/d47ca4f1-fde0-11e9-b9b1-d05099d48bc5  ONLINE       0     0     0
            gptid/d5c3c843-fde0-11e9-b9b1-d05099d48bc5  ONLINE       0     0     0
            gptid/d7195b2a-fde0-11e9-b9b1-d05099d48bc5  ONLINE       0     0     0
            gptid/d87d908f-fde0-11e9-b9b1-d05099d48bc5  ONLINE       0     0     0
            gptid/fca9c832-99a6-11ec-8333-d05099d48bc5  OFFLINE      0     0     0  (resilvering)
            gptid/db769a3b-fde0-11e9-b9b1-d05099d48bc5  ONLINE       0     0     0
        logs
          gptid/dcd3df64-fde0-11e9-b9b1-d05099d48bc5    ONLINE       0     0     0

errors: No known data errors
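Since the resilver keeps restarting, it may help to log the progress figure over time so a reset is easy to spot. Below is a rough sketch that pulls the "% done" value out of zpool status text. `resilver_pct` is a name I made up for illustration, and the sample text is copied from the output above, so the sketch runs without a pool attached.

```shell
#!/bin/sh
# resilver_pct: read `zpool status` text on stdin and print the resilver
# percentage (the number in front of "% done"). Hypothetical helper name.
resilver_pct() {
    sed -n 's/.* \([0-9.][0-9.]*\)% done.*/\1/p' | head -n 1
}

# Sample scan lines copied from the status output in this post.
sample='  scan: resilver in progress since Thu Mar 3 09:01:56 2022
        24.3T scanned at 1.19G/s, 22.7T issued at 1.11G/s, 27.4T total
        5.68G resilvered, 82.66% done, 01:12:59 to go'

printf '%s\n' "$sample" | resilver_pct

# On the live system (assuming the pool is named tank, as in this post):
#   zpool status tank | resilver_pct
```

Running that from cron or a loop and appending the value with a timestamp to a file on another machine would show whether the percentage drops back after each unscheduled reboot.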
I'm out of ideas. Any suggestions why the system works when the disk is offline, but has constant crashes with it online? Why does the resilvering process restart after I online the disk? What am I overlooking? What am I doing wrong? Thanks in advance.