TrueNAS does not recognize new inserted disk after failure.

jabster · Mar 21, 2024

Hello everyone,

Hopefully someone has some additional insight for me - I have spent a couple of hours going through the forums,
but none of the similar issues I found matches my problem exactly, or the proposed solution just isn't it...
Thank you for reading -- hopefully I listed all the information you need, below.

Issue summary: TrueNAS does not recognize new inserted disk after disk failure.

System:
ProLiant ML350 Gen10
2x TrueNAS OS drives, SSD, Mirrored
10x Data drives, ATA, 14TB Seagate Ironwolf Pro
Raid: HPE Smart Array P816i-a SR Gen10
Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz
32GB RAM, 1 DIMM
HPE Ethernet 10Gb 2-port 562SFP+ Adapter

HP ILO shows all 10 data drives listed as being in good working order
HP ILO shows the 10 data drives as Unconfigured - this is done by TrueNAS

TrueNAS:
TrueNAS-13.0-RELEASE
Data store: 1 vdev; RAIDZ2, 10 disks

Issue detail:
Disk da5 died after power outage.
Now, after replacing the disk and wiring:
- /Storage/Pools status shows degraded, da5 is Faulted and is listed as /dev/gptid/###
- /Storage/Disks shows da11 with serial number, disk size 0K, pool N/A
- (New) Disk specs: Disk Type: UNKNOWN Model: Generic- SD/MMC CRW

This setup has been working properly for months
After replacing the faulted drive with a new drive, the UI does not update/respond.
The same serial number and /dev/gptid hook are showing after refresh, off/onlining, replacement.
HP ILO has the replacement drive listed as being in good working order (the faulted drive was showing as Smart Failure Imminent).

Troubleshooting steps:
- tried multiple replacement disks
- replaced cable
- took disk offline -- no change in console UI
- replaced disk -- no change; new disk not recognized
- wiped new disk - no change
- put old disk back, onlined disk and scrubbed pool -- no change
- took disk offline, replaced with (other) new disk and Onlined -- no change
- I tried the 'Replace' option as well; it offers da5 twice and da11 once.

'Alerts' notes:
CRITICAL -- Pool DATA state is DEGRADED: One or more devices are faulted in response to persistent errors.
Sufficient replicas exist for the pool to continue functioning in a degraded state.
The following devices are not healthy: Disk ATA ST14000NT001-3LW ZR9088QL is FAULTED

My box 'o tricks is empty - what am I missing?
The other 9 disks contain production data; changing/endangering the pool is not a quick option (without offloading all data first, which I hope to avoid if at all possible...)

Samuel Tai · Mar 21, 2024

jabster said:
Raid: HPE Smart Array P816i-a SR Gen10

This is likely your problem. ZFS will often encounter issues like these with RAID controllers. You should replace this with a plain HBA.

What's all the noise about HBAs, and why can't I use a RAID controller?

1) An HBA is a Host Bus Adapter. This is a controller that allows SAS and SATA devices to be attached to, and communicate directly with, a server. RAID controllers typically aggregate several disks into a Virtual Disk abstraction of some sort...

www.truenas.com

HoneyBadger · Mar 21, 2024

As mentioned by @Samuel Tai, this is likely related to your HPE SmartArray RAID controller; it is likely masking the plug/unplug SAS events that would prompt TrueNAS to rescan the disks for new UUID information. It appears to operate in "Mixed Mode", so a combination of both RAID for defined virtual disks and "passthrough" for unassigned disks.

Do you see the SCSI device detaching/attaching in your dmesg log when you hot swap the drive?

jabster · Mar 28, 2024

Thanks @Samuel Tai and @HoneyBadger for your time in formulating a response; much appreciated.

I will reach out to the HP support team, as this issue appears to be related to communication between the OS and the controller. Hopefully there will be at least a workaround we may be able to use. I'll keep you posted on that.

DMESG does show activity with da5 and looks like it is obtaining all the information it needs (snippet below), but the tool does not show time stamps, so I'm not quite sure what I'm looking at. It's been a while since I have been at the server physically.

The HP controller has no configuration relating to arrays; /mnt/data is a set of 10 unconfigured drives - all config is done by TrueNAS. All drives are properly seen and identified by HP.

Code:

[INFO]:[ pqisrc_display_device_info ] [ 324 ]removed scsi BTL 0:68:0:  ATAST14000NT001-3LW Physical     SSDSmartPathCap- En- Exp+ qd=32
[INFO]:[ pqisrc_remove_device ] [ 1302 ]vendor: ATA     ST14000NT001-3LW model:ST14000NT001-3LW bus:0 target:68 lun:0 is_physical_device:0x1 expose_device:0x1volume_offline 0x0 volume_status 0x0
[INFO]:[ pqisrc_wait_for_device_commands_to_complete ] [ 515 ]Device Outstanding IO count = 0
da5 at smartpqi0 bus 0 scbus0 target 68 lun 0
da5: <ATA ST14000NT001-3LW EN01>  s/n ZR9089K0 detached
(da5:smartpqi0:0:68:0): Periph destroyed
[INFO]:[ pqisrc_add_device ] [ 1282 ]vendor: ATA     ST14000NT001-3LW model: ST14000NT001-3LW bus:0 target:68 lun:0 is_physical_device:0x1 expose_device:0x1 volume_offline 0x0 volume_status 0x0
[INFO]:[ pqisrc_display_device_info ] [ 324 ]added scsi BTL 0:68:0:  ATA      ST14000NT001-3LW Physical     SSDSmartPathCap- En- Exp+ qd=32
da5 at smartpqi0 bus 0 scbus0 target 68 lun 0
da5: <ATA ST14000NT001-3LW EN01> Fixed Direct Access SPC-4 SCSI device
da5: Serial Number ZR9089K0
da5: 1200.000MB/s transfers
da5: Command Queueing enabled
da5: 13351936MB (27344764928 512 byte sectors)
ses0: da5,pass5 in 'ArrayElement0004', SAS Slot: 1 phys at slot 5
ses0:  phy 0: SATA device
ses0:  phy 0: parent 51402ec0192d5de4 addr 31402ec0192d5de4
[INFO]:[ pqisrc_display_device_info ] [ 324 ]removed scsi BTL 0:68:0:  ATAST14000NT001-3LW Physical     SSDSmartPathCap- En- Exp+ qd=32
[INFO]:[ pqisrc_remove_device ] [ 1302 ]vendor: ATA     ST14000NT001-3LW model:ST14000NT001-3LW bus:0 target:68 lun:0 is_physical_device:0x1 expose_device:0x1volume_offline 0x0 volume_status 0x0
[INFO]:[ pqisrc_wait_for_device_commands_to_complete ] [ 515 ]Device Outstanding IO count = 0
da5 at smartpqi0 bus 0 scbus0 target 68 lun 0
da5: <ATA ST14000NT001-3LW EN01>  s/n ZR9089K0 detached
(da5:smartpqi0:0:68:0): Periph destroyed
arp: 10.100.3.134 moved from 80:fa:5b:72:a7:d2 to 0c:dd:24:2d:74:b3 on ixl0
[INFO]:[ pqisrc_add_device ] [ 1282 ]vendor: ATA     ST14000NT001-3LW model: ST14000NT001-3LW bus:0 target:68 lun:0 is_physical_device:0x1 expose_device:0x1 volume_offline 0x0 volume_status 0x0
[INFO]:[ pqisrc_display_device_info ] [ 324 ]added scsi BTL 0:68:0:  ATA      ST14000NT001-3LW Physical     SSDSmartPathCap- En- Exp+ qd=32
da5 at smartpqi0 bus 0 scbus0 target 68 lun 0
da5: <ATA ST14000NT001-3LW EN01> Fixed Direct Access SPC-4 SCSI device
da5: Serial Number ZR908AHZ
da5: 1200.000MB/s transfers
da5: Command Queueing enabled
da5: 13351936MB (27344764928 512 byte sectors)
ses0: da5,pass5 in 'ArrayElement0004', SAS Slot: 1 phys at slot 5
ses0:  phy 0: SATA device
ses0:  phy 0: parent 51402ec0192d5de4 addr 31402ec0192d5de4
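
Since dmesg here carries no timestamps, one way to make the snippet easier to read is to reduce it to an ordered list of detach/attach events with their serial numbers. Below is a minimal Python sketch of that idea. The function name `disk_events` and the two regular expressions are illustrative assumptions, written only against the line shapes visible in the snippet above; other drivers or releases may word these lines differently.

```python
import re

# Assumption: these patterns mirror the two CAM line shapes in the snippet
# above ("... s/n <serial> detached" and "Serial Number <serial>").
DETACH_RE = re.compile(r"^(da\d+): <[^>]*>\s+s/n (\S+) detached")
SERIAL_RE = re.compile(r"^(da\d+): Serial Number (\S+)")


def disk_events(dmesg_text):
    """Return (event, device, serial) tuples in the order they appear."""
    events = []
    for line in dmesg_text.splitlines():
        line = line.strip()
        m = DETACH_RE.match(line)
        if m:
            events.append(("detached", m.group(1), m.group(2)))
            continue
        m = SERIAL_RE.match(line)
        if m:
            # The kernel prints the serial number right after a device attaches.
            events.append(("attached", m.group(1), m.group(2)))
    return events


# A few lines copied from the snippet above, used as sample input.
SAMPLE = """\
da5: <ATA ST14000NT001-3LW EN01>  s/n ZR9089K0 detached
(da5:smartpqi0:0:68:0): Periph destroyed
da5: <ATA ST14000NT001-3LW EN01> Fixed Direct Access SPC-4 SCSI device
da5: Serial Number ZR908AHZ
"""

if __name__ == "__main__":
    for event, dev, serial in disk_events(SAMPLE):
        print(f"{dev} {event}: {serial}")
```

Fed the full snippet, this would list da5 attaching as ZR9089K0, detaching, and re-attaching as ZR908AHZ. Neither serial matches the ZR9088QL named in the alert, which would suggest the kernel itself does register the swaps even though the UI does not update.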

jabster · Apr 3, 2024

OK, we hit a brick wall, according to HP Support:

The HPE Smart Array P816i-a SR Gen10 controller:
- supports mixed mode, and cannot be configured to HBA mode only
- is not able to turn off/on one bay or declare it HBA only to help TrueNAS recognize that change/the replaced drive.

This leads me to my next question @Samuel Tai and @HoneyBadger:
>> Which Full-HBA controllers are recommended for use with TrueNAS?
(The literature I found discussing this topic is a couple of years old.)

Second question:
>> Does TrueNAS store RAID/zpool information on the disks? In other words -- if I switch controllers, will the data pool still be intact, or will all data be lost and the RAID have to be recreated?

Thanks once more,
- J
