Can't replace drive. Crashing during resilver.

whodis

Cadet
Joined
Mar 3, 2022
Messages
4
  • ASRock Rack X470du
  • Ryzen 7 3800X
  • 64GB RAM
  • (8) 14TB Seagate Exos X16 - RAIDZ2
  • (1) 480GB Intel Optane 900P PCIe - SLOG
  • (2) 500GB Samsung 970 Evo M.2 - RAID 0 Boot
  • Onboard hard disk controllers through SATA backplanes.
  • Onboard networking - (2) 1Gbps lagg

System has been running fine in this configuration for over 2 years. I've been upgrading this server for roughly 10 years, through OMV to FreeNAS to TrueNAS. This is a personal rig for media and backups. It's in a server rack in my house and has a UPS. Never had a single issue with it.

One of the pool disks started throwing unreadable sector errors, so I pulled it and had it replaced under warranty. The first 2 replacement drives they sent me were DOA. The third worked. I didn't really trust it after 2 DOA drives, so I put it on the test bench and ran it through a series of tests. It passed chkdsk, SeaTools, wmic, and a single pass of badblocks from a live CD. I put it in the server, added it to the pool, and it began resilvering. I let it run overnight and into the next day. When I went to check on it, it was only at 2.x% and I had many "unscheduled system reboot" alerts, ranging from 1 minute to ~3 hours apart. Nothing in the logs or dmesg. If I offline the replacement disk, the rest of the pool resilvers without incident. If I then online the disk, the resilver starts over, and so do the random unscheduled reboots.

I thought it was maybe a heat issue at first, but this is under load:
Code:
root@media[~]# smartctl -A /dev/ada5 | grep -i temperature
190 Airflow_Temperature_Cel 0x0022   057   047   040    Old_age   Always       -       43 (Min/Max 27/47)
194 Temperature_Celsius     0x0022   043   053   000    Old_age   Always       -       43 (0 26 0 0 0)
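
The SMART readings above are in Celsius while the room temperatures in this post are in Fahrenheit. For anyone who wants to log these readings between reboots, here is a minimal Python sketch that pulls the raw value out of each temperature row and converts it. The sample text is copied from the output above; `parse_temperatures` and `c_to_f` are hypothetical helper names, and the column layout is assumed to match what `smartctl -A` printed here.

```python
# Sketch: extract raw temperatures from `smartctl -A` attribute rows.
# SAMPLE is copied from the output above; the helper names are hypothetical.

SAMPLE = """\
190 Airflow_Temperature_Cel 0x0022   057   047   040    Old_age   Always       -       43 (Min/Max 27/47)
194 Temperature_Celsius     0x0022   043   053   000    Old_age   Always       -       43 (0 26 0 0 0)
"""


def parse_temperatures(text):
    """Return {attribute_name: raw_celsius} for temperature attribute rows."""
    temps = {}
    for line in text.splitlines():
        fields = line.split()
        # Assumed layout: ID, NAME, FLAG, VALUE, WORST, THRESH, TYPE,
        # UPDATED, WHEN_FAILED, RAW_VALUE... so the raw reading is field 9.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name = fields[1]
        if "temperature" in name.lower():
            temps[name] = int(fields[9])
    return temps


def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


if __name__ == "__main__":
    for name, celsius in parse_temperatures(SAMPLE).items():
        print(f"{name}: {celsius} C ({c_to_f(celsius):.1f} F)")
```

In practice the text would come from running `smartctl -A` on the disk in question rather than from a pasted string.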


Even so, I pulled it from the rack and set it up in a cooler room (about 67°F; it normally sits in a climate-controlled closet at about 73°F). With the case open, a couple of additional 120mm fans pulling, and a desk fan pushing, I tried running it again. Same problem. Power supply? I replaced the Seasonic Prime 650W with a new, cheap EVGA G3 850W. Same problem. Did the SATA cable go bad? I took the drive out of the chassis and connected it directly to the motherboard with a new SATA cable. Same problem. This is when I discovered it works when I offline the disk. I just offlined it yet again, and the pool resilver and a long S.M.A.R.T. test on the disk have been running for almost 5 hours now.


Code:
root@media[~]# zpool status
  pool: freenas-boot
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(5) for details.
  scan: scrub repaired 0B in 00:00:18 with 0 errors on Sun Feb 27 03:45:18 2022
config:


        NAME          STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            nvd1p2    ONLINE       0     0     0
            nvd2p2    ONLINE       0     0     0


errors: No known data errors


  pool: tank
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Mar  3 09:01:56 2022
        24.3T scanned at 1.19G/s, 22.7T issued at 1.11G/s, 27.4T total
        5.68G resilvered, 82.66% done, 01:12:59 to go
config:


        NAME                                            STATE     READ WRITE CKSUM
        tank                                            DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/d1fa7f1a-fde0-11e9-b9b1-d05099d48bc5  ONLINE       0     0     0
            gptid/d3344d35-fde0-11e9-b9b1-d05099d48bc5  ONLINE       0     0     0
            gptid/d47ca4f1-fde0-11e9-b9b1-d05099d48bc5  ONLINE       0     0     0
            gptid/d5c3c843-fde0-11e9-b9b1-d05099d48bc5  ONLINE       0     0     0
            gptid/d7195b2a-fde0-11e9-b9b1-d05099d48bc5  ONLINE       0     0     0
            gptid/d87d908f-fde0-11e9-b9b1-d05099d48bc5  ONLINE       0     0     0
            gptid/fca9c832-99a6-11ec-8333-d05099d48bc5  OFFLINE      0     0     0  (resilvering)
            gptid/db769a3b-fde0-11e9-b9b1-d05099d48bc5  ONLINE       0     0     0
        logs
          gptid/dcd3df64-fde0-11e9-b9b1-d05099d48bc5    ONLINE       0     0     0


errors: No known data errors
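
Since the resilver appears to start over after each reboot, it may help to record the progress figure and the non-ONLINE devices each time the box comes back up. The following is a rough Python sketch that reads both from `zpool status` text. The sample is an excerpt of the output above; `resilver_percent` and `unhealthy_devices` are hypothetical helper names, and the parsing assumes the column layout shown here with plain integer error counters.

```python
import re

# Excerpt of the `zpool status` output above, used as sample input.
SAMPLE = """\
  scan: resilver in progress since Thu Mar  3 09:01:56 2022
        24.3T scanned at 1.19G/s, 22.7T issued at 1.11G/s, 27.4T total
        5.68G resilvered, 82.66% done, 01:12:59 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank                                            DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/d1fa7f1a-fde0-11e9-b9b1-d05099d48bc5  ONLINE       0     0     0
            gptid/fca9c832-99a6-11ec-8333-d05099d48bc5  OFFLINE      0     0     0  (resilvering)
"""


def resilver_percent(text):
    """Return the 'NN.NN% done' figure as a float, or None if absent."""
    match = re.search(r"([\d.]+)% done", text)
    return float(match.group(1)) if match else None


def unhealthy_devices(text):
    """Return (name, state) for every config row whose state is not ONLINE."""
    devices = []
    for line in text.splitlines():
        fields = line.split()
        # A config row is NAME STATE READ WRITE CKSUM [note].
        # Assumption: the three counters are plain integers, as in this output.
        if len(fields) < 5:
            continue
        if not all(field.isdigit() for field in fields[2:5]):
            continue
        if fields[1] != "ONLINE":
            devices.append((fields[0], fields[1]))
    return devices


if __name__ == "__main__":
    print("resilver progress:", resilver_percent(SAMPLE))
    for name, state in unhealthy_devices(SAMPLE):
        print(f"{state:10s} {name}")
```

Appending the result to a file with a timestamp after every boot would show whether the percentage ever gets past the point where the reboots begin.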


I'm out of ideas. Any suggestions as to why the system works when the disk is offline but crashes constantly with it online? Why does the resilvering process restart after I online the disk? What am I overlooking? What am I doing wrong? Thanks in advance.
 

whodis

Update: Extended test finished with no errors. Resilver still completes successfully with drive offline. Tried to online the disk again this morning. It started resilvering and has already rebooted 3 times since, restarting the resilver each time. I don't know where to go from here.
 

whodis

It seems I don't have the ability to edit posts, but wanted to share some more small bits of info:

Not that it matters for this issue, but I noticed I listed the boot drives as RAID 0. They are actually RAID 1.

TrueNAS is version 12.0-U8
 

whodis

Ugh... having a conversation with myself here. I second-guessed my second guess and actually looked at the pool... boot is indeed 0. :/ I can't edit or delete my posts, so sorry for "spamming."
 