Cannot offline disk because the system crashes and reboots

msc

Cadet
Joined
Sep 27, 2023
Messages
6
We have a backup system consisting of a Supermicro server and two Supermicro JBODs. Here is some system info:

FreeNAS version: FreeNAS-11.3-U3.2

Server:
Manufacturer: Supermicro
Product Name: SYS-5019P-WTR
Bios Vendor: American Megatrends Inc.
Bios Version: 3.2
Bios Release: 10/17/2019

JBOD1:
Vendor: Supermicro

JBOD2:
Vendor: Supermicro

My system keeps crashing because of what I suspect is a faulty drive. Here is some of the log regarding the error:
Sep 27 14:30:40 backupserverMTP syslog-ng[2659]: syslog-ng starting up; version='3.23.1'
Sep 27 14:30:40 backupserverMTP kernel: pid 1461 (syslog-ng), jid 0, uid 0: exited on signal 6 (core dumped)
Sep 27 14:32:46 backupserverMTP smartd[1705]: Device: /dev/da104, SMART Failure: DATA CHANNEL IMPENDING FAILURE DATA ERROR RATE TOO HIGH
Sep 27 14:34:39 backupserverMTP collectd[1194]: Traceback (most recent call last):
File "/usr/local/lib/collectd_pyplugins/disktemp.py", line 46, in init
self.powermode = c.call('smart.config')['powermode']
File "/usr/local/lib/collectd_pyplugins/disktemp.py", line 45, in init
self.smartctl_args = c.call('disk.smartctl_args_for_devices', self.disks)
File "/usr/local/lib/python3.7/site-packages/middlewared/client/client.py", line 386, in call


So the faulty disk is da104.
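For anyone who wants to check the same thing from the shell, something like this should show the SMART details (a sketch; the device name is taken from the log above):
Code:
# Full SMART report for the suspect drive, including the error counters smartd complains about
smartctl -a /dev/da104
# Overall health self-assessment only
smartctl -H /dev/da104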

All the disks in the two JBODs are part of a RAIDZ2 pool. Here is the high-level structure:
[Screenshot: high-level pool structure]


The detailed structure is the following:
[Screenshot: detailed pool structure]


Every RAIDZ2 above contains 30 HDDs (Seagate Exos X16 14TB).

The faulty disk (da104) does not show an error in the FreeNAS GUI:
[Screenshot: disk status in the FreeNAS GUI, no error shown for da104]


Obviously, I want to replace disk da104. However, when I try to take the disk offline via the FreeNAS GUI, it does not succeed: the operation takes a very long time and the system crashes before it completes. At least that is my understanding.

What additional info do you need, and what is your recommendation on how to replace the disk properly if I cannot offline it? Or how might I be able to offline it?
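From the shell I assume the equivalent would be something like the following (a sketch; the gptid is just a placeholder for whatever zpool status shows for da104), but I have not dared to run it while the box keeps rebooting:
Code:
# Placeholder gptid; the real label would come from "zpool status -v MTP-NAS-HDD-Backup"
zpool offline MTP-NAS-HDD-Backup gptid/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx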
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
I don't have any suggestions, perhaps someone else will.


If I am reading your configuration right, you have 30 disks in each RAID-Z2. This is close to 3 times the suggested maximum: usually 10 to 12 disks per RAID-Zx vDev is considered the limit. Once you go above that, things can get slow for certain operations, especially as more data is put into the vDev / pool.

Some people manage to make very wide RAID-Zx vDevs work because they understand their workload (only large files, written in a single event, not appended). So, you may know and understand all this already.
 

msc

Cadet
Joined
Sep 27, 2023
Messages
6
Attached is a large portion of the most recent /var/log/messages (see file).

I guess the relevant part might be here:
Sep 28 12:54:15 backupserverMTP ntpd[1538]: ntpd 4.2.8p12-a (1): Starting
Sep 28 12:54:15 backupserverMTP ntpd[1538]: Command line: /usr/sbin/ntpd -g -c /etc/ntp.conf -p /var/run/ntpd.pid -f /var/db/ntpd.drift
Sep 28 12:54:15 backupserverMTP ntpd[1538]: ----------------------------------------------------
Sep 28 12:54:15 backupserverMTP ntpd[1538]: ntp-4 is maintained by Network Time Foundation,
Sep 28 12:54:15 backupserverMTP ntpd[1538]: Inc. (NTF), a non-profit 501(c)(3) public-benefit
Sep 28 12:54:15 backupserverMTP ntpd[1538]: corporation. Support and training for ntp-4 are
Sep 28 12:54:15 backupserverMTP ntpd[1538]: available at https://www.nwtime.org/support
Sep 28 12:54:15 backupserverMTP ntpd[1538]: ----------------------------------------------------
Sep 28 12:55:05 backupserverMTP GEOM_MIRROR: Device mirror/swap0 launched (2/2).
Sep 28 12:55:05 backupserverMTP GEOM_MIRROR: Device mirror/swap1 launched (2/2).
Sep 28 12:55:05 backupserverMTP GEOM_MIRROR: Device mirror/swap2 launched (2/2).
Sep 28 12:55:05 backupserverMTP GEOM_MIRROR: Device mirror/swap3 launched (2/2).
Sep 28 12:55:05 backupserverMTP GEOM_MIRROR: Device mirror/swap4 launched (2/2).
Sep 28 12:55:06 backupserverMTP GEOM_ELI: Device mirror/swap0.eli created.
Sep 28 12:55:06 backupserverMTP GEOM_ELI: Encryption: AES-XTS 128
Sep 28 12:55:06 backupserverMTP GEOM_ELI: Crypto: hardware
Sep 28 12:55:06 backupserverMTP GEOM_ELI: Device mirror/swap1.eli created.
Sep 28 12:55:06 backupserverMTP GEOM_ELI: Encryption: AES-XTS 128
Sep 28 12:55:06 backupserverMTP GEOM_ELI: Crypto: hardware
Sep 28 12:55:06 backupserverMTP GEOM_ELI: Device mirror/swap2.eli created.
Sep 28 12:55:06 backupserverMTP GEOM_ELI: Encryption: AES-XTS 128
Sep 28 12:55:06 backupserverMTP GEOM_ELI: Crypto: hardware
Sep 28 12:55:06 backupserverMTP GEOM_ELI: Device mirror/swap3.eli created.
Sep 28 12:55:06 backupserverMTP GEOM_ELI: Encryption: AES-XTS 128
Sep 28 12:55:06 backupserverMTP GEOM_ELI: Crypto: hardware
Sep 28 12:55:06 backupserverMTP GEOM_ELI: Device mirror/swap4.eli created.
Sep 28 12:55:06 backupserverMTP GEOM_ELI: Encryption: AES-XTS 128
Sep 28 12:55:06 backupserverMTP GEOM_ELI: Crypto: hardware
Sep 28 12:55:57 backupserverMTP collectd[2091]: Traceback (most recent call last):
File "/usr/local/lib/collectd_pyplugins/disktemp.py", line 46, in init
self.powermode = c.call('smart.config')['powermode']
File "/usr/local/lib/collectd_pyplugins/disktemp.py", line 45, in init
self.smartctl_args = c.call('disk.smartctl_args_for_devices', self.disks)
File "/usr/local/lib/python3.7/site-packages/middlewared/client/client.py", line 386, in call
raise CallTimeout("Call timeout")
middlewared.client.client.CallTimeout: Call timeout
Sep 28 12:56:57 backupserverMTP collectd[2091]: Traceback (most recent call last):
File "/usr/local/lib/collectd_pyplugins/disktemp.py", line 46, in init
self.powermode = c.call('smart.config')['powermode']
File "/usr/local/lib/collectd_pyplugins/disktemp.py", line 45, in init
self.smartctl_args = c.call('disk.smartctl_args_for_devices', self.disks)
File "/usr/local/lib/python3.7/site-packages/middlewared/client/client.py", line 386, in call
raise CallTimeout("Call timeout")
middlewared.client.client.CallTimeout: Call timeout
Sep 28 12:58:20 backupserverMTP smartd[1576]: Device: /dev/da104, SMART Failure: DATA CHANNEL IMPENDING FAILURE DATA ERROR RATE TOO HIGH
Sep 28 13:04:58 backupserverMTP ZFS: slow vdev I/O, zpool=MTP-NAS-HDD-Backup path=/dev/gptid/ddba1b16-7dd4-11ea-9d9d-ac1f6bbd0404
Sep 28 13:05:58 backupserverMTP ZFS: slow vdev I/O, zpool=MTP-NAS-HDD-Backup path=/dev/gptid/ddba1b16-7dd4-11ea-9d9d-ac1f6bbd0404
[the slow vdev I/O message for gptid/ddba1b16-7dd4-11ea-9d9d-ac1f6bbd0404 repeats once per minute until 13:27:59]
Sep 28 13:28:59 backupserverMTP ZFS: slow vdev I/O, zpool=MTP-NAS-HDD-Backup path=/dev/gptid/ddba1b16-7dd4-11ea-9d9d-ac1f6bbd0404

Sep 28 13:29:40 backupserverMTP smartd[1576]: Device: /dev/da104, SMART Failure: DATA CHANNEL IMPENDING FAILURE DATA ERROR RATE TOO HIGH
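For reference, the gptid in those slow vdev I/O lines can be mapped back to a device name roughly like this (a sketch; the gptid and pool name are taken from the log above):
Code:
# glabel lists every gptid label together with the daX device that backs it
glabel status | grep ddba1b16-7dd4-11ea-9d9d-ac1f6bbd0404
# zpool status shows which vdev that gptid belongs to and its current state
zpool status -v MTP-NAS-HDD-Backup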
 

Attachments

  • log.txt
    194.2 KB

msc

Cadet
Joined
Sep 27, 2023
Messages
6
Thanks @Arwen for your reply! The layout was designed long before my time, so I do not know what the reasoning was.
However, this storage is for backup only.
 

jlpellet

Patron
Joined
Mar 21, 2012
Messages
287
While I've never used such a system, my first thought is: 1) make sure you know which physical disk is da104, 2) power down the server, 3) remove the physical disk (da104), 4) install its blank replacement, 5) power up the server, and 6) from the GUI, replace the now-offline da104 with the spare from #4. (See the sketch below for identifying the physical drive in step 1.) Good luck.
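For step 1, something along these lines from the FreeNAS shell should help you find the physical drive (a sketch; sesutil needs the enclosure's SES services, which Supermicro JBODs normally provide):
Code:
# Read the serial number so it can be matched against the label on the drive tray
smartctl -i /dev/da104 | grep -i serial
# Blink the locate LED on the slot holding da104; turn it off again with "off"
sesutil locate da104 on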
John
 

msc

Cadet
Joined
Sep 27, 2023
Messages
6
Is it safe to do this before I have offlined the disk in FreeNAS?
 

jlpellet

Patron
Joined
Mar 21, 2012
Messages
287
Removing a disk from a powered-down server has worked every time I've done it. For safety, I would make a fresh config backup to a network location, but the process I outlined should not need it as it does not affect the system config. Of course, YMMV.
John
 

msc

Cadet
Joined
Sep 27, 2023
Messages
6
OK, the sequence 1) power off, 2) replace the disk physically, 3) power on, 4) replace the disk in FreeNAS seems to have worked.

At least it is happily resilvering now and the random reboots seem to have stopped.
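In case it is useful, I am keeping an eye on the resilver roughly like this (pool name from my system):
Code:
# Shows resilver progress, estimated time remaining, and per-disk error counters
zpool status -v MTP-NAS-HDD-Backup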

What I do not get is how a faulty disk can crash the system. Isn't a RAID supposed to handle exactly those problems?
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Except you have a non-recommended configuration of 30-disk-wide RAID-Z2 vDevs, and not just one, but five of them. This can cause behavior slow enough that something else fails on the software side, and potentially a watchdog timer expires and forces a reboot or crash.

That said, yes, RAID covers the DATA on the disks, not the bad behavior of other software on the server.
 

jlpellet

Patron
Joined
Mar 21, 2012
Messages
287
msc said:
OK, the sequence 1) power off, 2) replace the disk physically, 3) power on, 4) replace the disk in FreeNAS seems to have worked.

At least it is happily resilvering now and the random reboots seem to have stopped.

What I do not get is how a faulty disk can crash the system. Isn't a RAID supposed to handle exactly those problems?
As I understand it, RAID/ZFS/btrfs/etc. protect, in varying degrees, against disk read/write/bitrot errors on the storage media. Since ZFS is a file system, it cannot prevent hardware corruption or failure, which is what I think your symptoms indicate (the physical drive was generating a hardware-level error somewhere between the disk, controller, and power supply). However, ZFS does make an almost-always-successful effort to prevent actual data loss from an unexpected hardware crash, so FreeNAS/TrueNAS protected your data from the hardware failure. I hope the resilver completes successfully. While recognizing that the hardware config precedes you, if you ever have to rebuild the server it may be wise to read about the issues with wide arrays before simply replicating what exists. Good luck.
John
 