Cannot offline disk because the system crashes and reboots

msc

Cadet
Joined
Sep 27, 2023
Messages
6
We have a backup system consisting of a Supermicro server and two Supermicro JBODs. Here is some system info:

FreeNAS version: FreeNAS-11.3-U3.2

Server:
Manufacturer: Supermicro
Product Name: SYS-5019P-WTR
Bios Vendor: American Megatrends Inc.
Bios Version: 3.2
Bios Release: 10/17/2019

JBOD1:
Vendor: Supermicro

JBOD2:
Vendor: Supermicro

My system keeps crashing because of what I suspect is a faulty drive. Here is some of the log regarding the error:
Sep 27 14:30:40 backupserverMTP syslog-ng[2659]: syslog-ng starting up; version='3.23.1'
Sep 27 14:30:40 backupserverMTP kernel: pid 1461 (syslog-ng), jid 0, uid 0: exited on signal 6 (core dumped)
Sep 27 14:32:46 backupserverMTP smartd[1705]: Device: /dev/da104, SMART Failure: DATA CHANNEL IMPENDING FAILURE DATA ERROR RATE TOO HIGH
Sep 27 14:34:39 backupserverMTP collectd[1194]: Traceback (most recent call last):
File "/usr/local/lib/collectd_pyplugins/disktemp.py", line 46, in init
self.powermode = c.call('smart.config')['powermode']
File "/usr/local/lib/collectd_pyplugins/disktemp.py", line 45, in init
self.smartctl_args = c.call('disk.smartctl_args_for_devices', self.disks)
File "/usr/local/lib/python3.7/site-packages/middlewared/client/client.py", line 386, in call


So the faulty disk is da104.
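For anyone who wants to check the same thing from the shell, something like this should show the SMART details (a sketch; the device name is taken from the log above):
Code:
# Full SMART report for the suspect drive, including the error counters smartd complains about
smartctl -a /dev/da104
# Overall health self-assessment only
smartctl -H /dev/da104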

All the disks in the two JBODs are part of a RAIDZ2 pool. Here is the high-level structure:
[Screenshot: high-level pool structure]


The detailed structure is the following:
[Screenshot: detailed pool structure]


Every RAIDZ2 above contains 30 HDDs (Seagate Exos X16 14TB).

The faulty disk (da104) does not show an error in the FreeNAS GUI:
[Screenshot: disk status in the FreeNAS GUI, no error shown for da104]


Obviously, I want to replace disk da104. However, when I try to take the disk offline via the FreeNAS GUI, it does not succeed: the operation takes a very long time and the system crashes before it completes. At least that is my understanding.

What additional info do you need, and what is your recommendation on how to replace the disk properly if I cannot offline it? Or how might I be able to offline it?
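From the shell I assume the equivalent would be something like the following (a sketch; the gptid is just a placeholder for whatever zpool status shows for da104), but I have not dared to run it while the box keeps rebooting:
Code:
# Placeholder gptid; the real label would come from "zpool status -v MTP-NAS-HDD-Backup"
zpool offline MTP-NAS-HDD-Backup gptid/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx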
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
I don't have any suggestions, perhaps someone else will.


If I am reading your configuration right, you have 30 disks in each RAID-Z2. This is close to 3 times the suggested maximum: usually 10 to 12 disks per RAID-Zx vDev is considered the limit. Once you go above that, things can get slow for certain operations, especially as more data is put into the vDev / pool.

Some people manage to make very wide RAID-Zx vDevs work because they understand their workload (only large files, written in a single event, not appended). So, you may know and understand all this already.
 

msc

Cadet
Joined
Sep 27, 2023
Messages
6
Attached is a large portion of the most recent /var/log/messages (see file).

I guess the relevant part might be here:
Sep 28 12:54:15 backupserverMTP ntpd[1538]: ntpd 4.2.8p12-a (1): Starting
Sep 28 12:54:15 backupserverMTP ntpd[1538]: Command line: /usr/sbin/ntpd -g -c /etc/ntp.conf -p /var/run/ntpd.pid -f /var/db/ntpd.drift
Sep 28 12:54:15 backupserverMTP ntpd[1538]: ----------------------------------------------------
Sep 28 12:54:15 backupserverMTP ntpd[1538]: ntp-4 is maintained by Network Time Foundation,
Sep 28 12:54:15 backupserverMTP ntpd[1538]: Inc. (NTF), a non-profit 501(c)(3) public-benefit
Sep 28 12:54:15 backupserverMTP ntpd[1538]: corporation. Support and training for ntp-4 are
Sep 28 12:54:15 backupserverMTP ntpd[1538]: available at https://www.nwtime.org/support
Sep 28 12:54:15 backupserverMTP ntpd[1538]: ----------------------------------------------------
Sep 28 12:55:05 backupserverMTP GEOM_MIRROR: Device mirror/swap0 launched (2/2).
Sep 28 12:55:05 backupserverMTP GEOM_MIRROR: Device mirror/swap1 launched (2/2).
Sep 28 12:55:05 backupserverMTP GEOM_MIRROR: Device mirror/swap2 launched (2/2).
Sep 28 12:55:05 backupserverMTP GEOM_MIRROR: Device mirror/swap3 launched (2/2).
Sep 28 12:55:05 backupserverMTP GEOM_MIRROR: Device mirror/swap4 launched (2/2).
Sep 28 12:55:06 backupserverMTP GEOM_ELI: Device mirror/swap0.eli created.
Sep 28 12:55:06 backupserverMTP GEOM_ELI: Encryption: AES-XTS 128
Sep 28 12:55:06 backupserverMTP GEOM_ELI: Crypto: hardware
Sep 28 12:55:06 backupserverMTP GEOM_ELI: Device mirror/swap1.eli created.
Sep 28 12:55:06 backupserverMTP GEOM_ELI: Encryption: AES-XTS 128
Sep 28 12:55:06 backupserverMTP GEOM_ELI: Crypto: hardware
Sep 28 12:55:06 backupserverMTP GEOM_ELI: Device mirror/swap2.eli created.
Sep 28 12:55:06 backupserverMTP GEOM_ELI: Encryption: AES-XTS 128
Sep 28 12:55:06 backupserverMTP GEOM_ELI: Crypto: hardware
Sep 28 12:55:06 backupserverMTP GEOM_ELI: Device mirror/swap3.eli created.
Sep 28 12:55:06 backupserverMTP GEOM_ELI: Encryption: AES-XTS 128
Sep 28 12:55:06 backupserverMTP GEOM_ELI: Crypto: hardware
Sep 28 12:55:06 backupserverMTP GEOM_ELI: Device mirror/swap4.eli created.
Sep 28 12:55:06 backupserverMTP GEOM_ELI: Encryption: AES-XTS 128
Sep 28 12:55:06 backupserverMTP GEOM_ELI: Crypto: hardware
Sep 28 12:55:57 backupserverMTP collectd[2091]: Traceback (most recent call last):
File "/usr/local/lib/collectd_pyplugins/disktemp.py", line 46, in init
self.powermode = c.call('smart.config')['powermode']
File "/usr/local/lib/collectd_pyplugins/disktemp.py", line 45, in init
self.smartctl_args = c.call('disk.smartctl_args_for_devices', self.disks)
File "/usr/local/lib/python3.7/site-packages/middlewared/client/client.py", line 386, in call
raise CallTimeout("Call timeout")
middlewared.client.client.CallTimeout: Call timeout
Sep 28 12:56:57 backupserverMTP collectd[2091]: Traceback (most recent call last):
File "/usr/local/lib/collectd_pyplugins/disktemp.py", line 46, in init
self.powermode = c.call('smart.config')['powermode']
File "/usr/local/lib/collectd_pyplugins/disktemp.py", line 45, in init
self.smartctl_args = c.call('disk.smartctl_args_for_devices', self.disks)
File "/usr/local/lib/python3.7/site-packages/middlewared/client/client.py", line 386, in call
raise CallTimeout("Call timeout")
middlewared.client.client.CallTimeout: Call timeout
Sep 28 12:58:20 backupserverMTP smartd[1576]: Device: /dev/da104, SMART Failure: DATA CHANNEL IMPENDING FAILURE DATA ERROR RATE TOO HIGH
Sep 28 13:04:58 backupserverMTP ZFS: slow vdev I/O, zpool=MTP-NAS-HDD-Backup path=/dev/gptid/ddba1b16-7dd4-11ea-9d9d-ac1f6bbd0404
Sep 28 13:05:58 backupserverMTP ZFS: slow vdev I/O, zpool=MTP-NAS-HDD-Backup path=/dev/gptid/ddba1b16-7dd4-11ea-9d9d-ac1f6bbd0404
[the slow vdev I/O message for gptid/ddba1b16-7dd4-11ea-9d9d-ac1f6bbd0404 repeats once per minute until 13:27:59]
Sep 28 13:28:59 backupserverMTP ZFS: slow vdev I/O, zpool=MTP-NAS-HDD-Backup path=/dev/gptid/ddba1b16-7dd4-11ea-9d9d-ac1f6bbd0404

Sep 28 13:29:40 backupserverMTP smartd[1576]: Device: /dev/da104, SMART Failure: DATA CHANNEL IMPENDING FAILURE DATA ERROR RATE TOO HIGH
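For reference, the gptid in those slow vdev I/O lines can be mapped back to a device name roughly like this (a sketch; the gptid and pool name are taken from the log above):
Code:
# glabel lists every gptid label together with the daX device that backs it
glabel status | grep ddba1b16-7dd4-11ea-9d9d-ac1f6bbd0404
# zpool status shows which vdev that gptid belongs to and its current state
zpool status -v MTP-NAS-HDD-Backup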
 

Attachments

  • log.txt
    194.2 KB

msc

Cadet
Joined
Sep 27, 2023
Messages
6
Thanks @Arwen for your reply! The layout was designed long before my time, so I do not know what the reasoning was.
However, this storage is for backup only.
 

jlpellet

Patron
Joined
Mar 21, 2012
Messages
287
While I've never used such a system, my first thought is: 1) make sure you know which physical disk is da104, 2) power down the server, 3) remove the physical disk (da104), 4) install its blank replacement, 5) power up the server, and 6) from the GUI, replace the now-offline da104 with the spare from #4. (See the sketch below for identifying the physical drive in step 1.) Good luck.
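For step 1, something along these lines from the FreeNAS shell should help you find the physical drive (a sketch; sesutil needs the enclosure's SES services, which Supermicro JBODs normally provide):
Code:
# Read the serial number so it can be matched against the label on the drive tray
smartctl -i /dev/da104 | grep -i serial
# Blink the locate LED on the slot holding da104; turn it off again with "off"
sesutil locate da104 on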
John
 

msc

Cadet
Joined
Sep 27, 2023
Messages
6
Is it safe to do this before I have offlined the disk in FreeNAS?
 

jlpellet

Patron
Joined
Mar 21, 2012
Messages
287
Removing a disk from a powered-down server has worked every time I've done it. For safety, I would make a fresh config backup to a network location, but the process I outlined should not need it as it does not affect the system config. Of course, YMMV.
John
 

msc

Cadet
Joined
Sep 27, 2023
Messages
6
OK, the sequence 1) power off, 2) replace the disk physically, 3) power on, 4) replace the disk in FreeNAS seems to have worked.

At least it is happily resilvering now and the random reboots seem to have stopped.
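In case it is useful, I am keeping an eye on the resilver roughly like this (pool name from my system):
Code:
# Shows resilver progress, estimated time remaining, and per-disk error counters
zpool status -v MTP-NAS-HDD-Backup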

What I do not get is how a faulty disk can crash the system. Isn't a RAID supposed to handle exactly those problems?
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Except you have a non-recommended configuration of 30-disk-wide RAID-Z2 vDevs, and not just one, but five of them. This can cause behavior slow enough that something else fails on the software side, and potentially a watchdog timer expires and forces a reboot or crash.

That said, yes, RAID covers the DATA on the disks, not the bad behavior of other software on the server.
 

jlpellet

Patron
Joined
Mar 21, 2012
Messages
287
msc said:
OK, the sequence 1) power off, 2) replace the disk physically, 3) power on, 4) replace the disk in FreeNAS seems to have worked.

At least it is happily resilvering now and the random reboots seem to have stopped.

What I do not get is how a faulty disk can crash the system. Isn't a RAID supposed to handle exactly those problems?
As I understand it, RAID/ZFS/btrfs/etc. protect, in varying degrees, against disk read/write/bitrot errors on the storage media. Since ZFS is a file system, it cannot prevent hardware corruption or failure, which is what I think your symptoms indicate (the physical drive was generating a hardware-level error somewhere between the disk, controller, and power supply). However, ZFS does make an almost-always-successful effort to prevent actual data loss from an unexpected hardware crash, so FreeNAS/TrueNAS protected your data from the hardware failure. I hope the resilver completes successfully. While recognizing that the hardware config precedes you, if you ever have to rebuild the server it may be wise to read about the issues with wide arrays before simply replicating what exists. Good luck.
John
 