Missing disks after ZFS pool deletion

Joined
Feb 8, 2023
Messages
9
System is TrueNAS-13.0-U3.1, on an HP Proliant DL380 with a P420i controller.

I'm doing some tests with ZFS.

The controller is in RAID mode, so the "single disks" are actually single-disk RAID0 arrays created on the controller.
I know this is not good for ZFS, and I plan to change the mode or the controller.

But the strange behavior I see doesn't seem related to this.

I create a pool with 4 disks, say RAIDZ1. It gets created and it works.
Then I destroy the pool, intending to create it again with another RAID level, etc.

The deletion works, but when I then go to create a pool, I don't find the drives anymore. A reboot doesn't help.

With "camcontrol devlist" they do appear:

<HP RAID 0 OK> at scbus0 target 0 lun 0 (pass0,da0)
<HP RAID 1(1+0) OK> at scbus0 target 1 lun 0 (pass1,da1)
<HP RAID 0 OK> at scbus0 target 2 lun 0 (pass2,da2)
<HP RAID 0 OK> at scbus0 target 3 lun 0 (pass3,da3)
<HP RAID 0 OK> at scbus0 target 4 lun 0 (pass4,da4)
<HP iLO Internal SD-CARD 2.10> at scbus5 target 0 lun 0 (da5,pass5)

da1 is a RAID10 array with good data that is, for now, untouchable.
da5 is the flash device with the operating system.

da0, da2, da3, da4 are the drives that were composing the pool.

If I try to add a pool, I only see da4. Another time I didn't see any of them.

The only way to make them visible again is to delete and recreate the arrays in the HP utility at boot.

Why does TrueNAS "forget" them? And why does camcontrol see them while the pool creation interface does not?
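One thing I can check from a shell is whether destroying the pool left ZFS labels or partition tables on the members, since the UI may be hiding disks it thinks are still in use. A read-only sketch (device names taken from the camcontrol output above; gpart and zdb are stock FreeBSD/TrueNAS tools):

```shell
# Check each former pool member for leftover partition tables or ZFS labels
# (da0/da2/da3/da4 as listed by camcontrol; this only reads, it changes nothing).
for d in da0 da2 da3 da4; do
  echo "== ${d} =="
  gpart show "${d}" 2>/dev/null || echo "  no partition table"
  zdb -l "/dev/${d}" 2>/dev/null | grep -m1 name || echo "  no ZFS label found"
done
```

If labels do remain, "zpool labelclear -f da0" (or a full "gpart destroy -F da0") is the usual cleanup before a disk gets offered again.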
 

Whattteva

Wizard
Joined
Mar 5, 2013
Messages
1,824
I don't mess with RAID controllers so I can't give you a good answer, but I wouldn't be surprised if what you're seeing is exactly the reason why everyone here recommends against using RAID controllers with ZFS.
 
Joined
Feb 8, 2023
Messages
9
I took those four drives out and put them in another, identical machine. I enabled HBA mode on the controller, and I could see the drives as /dev/sda-b-c-d with their real model names, not as /dev/ciss* "RAIDx devices".

But that machine runs Proxmox, and the drives there are managed by the hpsa driver.

I read in the Linux kernel docs that:
The hpsa driver is a SCSI driver, while the cciss driver is a “block” driver. Actually cciss is both a block driver (for logical drives) AND a SCSI driver (for tape drives). This “split-brained” design of the cciss driver is a source of excess complexity and eliminating that complexity is one of the reasons for hpsa to exist.

And https://cciss.sourceforge.net/ says that the P420i is supported by both (cciss and hpsa).

But here we are on FreeBSD, so the question is: which driver is going to be used inside TrueNAS?
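One way to answer that directly from a TrueNAS shell (my guess is ciss(4), the FreeBSD driver for the Smart Array family; pciconf and dmesg are stock FreeBSD tools, and the grep patterns are just examples):

```shell
# Ask the kernel which driver claimed the controller: on FreeBSD a Smart Array
# handled by ciss(4) shows up as "ciss0" in the PCI listing and in the boot log.
pciconf -lv 2>/dev/null | grep -B3 -i 'smart array' || echo "pciconf not available"
dmesg 2>/dev/null | grep -i ciss || echo "no ciss attach line found"
```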
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The controller is in RAID mode, so the "single disks" are actually single-disk RAID0 arrays created on the controller.
I know this is not good for ZFS, and I plan to change the mode or the controller.

You actually need to replace the controller, not just change some imaginary "mode".


Lobotomizing a RAID controller just makes it a stupid RAID controller. It really needs to be an HBA.

But here we are on FreeBSD, so the question is: which driver is going to be used inside TrueNAS?

CISS sucks and absolutely should not be used; it is a decades-old misadventure in trying to create a "standard" RAID controller that doesn't work.

FreeBSD isn't Linux and doesn't have Linux's "hpsa" driver; I don't recall what driver adopts their stupid RAID cards, but I do recall that we've figured out a number of times that it isn't suitable for use with TrueNAS, and neither is the Linux "hpsa" driver.

Please replace your RAID controller with an actual LSI HBA.
 
Joined
Feb 8, 2023
Messages
9
You actually need to replace the controller, not just change some imaginary "mode".


Lobotomizing a RAID controller just makes it a stupid RAID controller. It really needs to be an HBA.

Under Proxmox, HBA mode seems to work: I can see the drives with their real names, run smartctl on them, etc.

CISS sucks and absolutely should not be used; it is a decades-old misadventure in trying to create a "standard" RAID controller that doesn't work.

FreeBSD isn't Linux and doesn't have Linux's "hpsa" driver; I don't recall what driver adopts their stupid RAID cards, but I do recall that we've figured out a number of times that it isn't suitable for use with TrueNAS, and neither is the Linux "hpsa" driver.

Please replace your RAID controller with an actual LSI HBA.

If there isn't an "hpsa" equivalent (SCSI devices instead of cciss's block devices), then I agree: I should replace the controller.
 

Whattteva

Wizard
Joined
Mar 5, 2013
Messages
1,824
Under Proxmox, HBA mode seems to work: I can see the drives with their real names, run smartctl on them, etc.
Until it doesn't. Plenty of setups also work on CORE, until they don't, and their owners come in here with a thread titled "HELP!!! Pool doesn't mount anymore! Been working fine for a year until now". I'm actually not quite sure why, but a year seems to be the average time before someone makes a panic thread here, usually with a post count of 1.

Working for now doesn't always mean it will keep working indefinitely. You're free to do as you wish, of course, but it's a "do at your own risk, your mileage will vary" type of thing.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Until it doesn't. Plenty of setups also work on CORE, until they don't, and their owners come in here with a thread titled "HELP!!! Pool doesn't mount anymore! Been working fine for a year until now". I'm actually not quite sure why, but a year seems to be the average time before someone makes a panic thread here, usually with a post count of 1.

Working for now doesn't always mean it will keep working indefinitely. You're free to do as you wish, of course, but it's a "do at your own risk, your mileage will vary" type of thing.

You know, I appreciate you pointing that out, because that's creepily like what seemed to happen to people a decade ago on ESXi with RDM. It would just catastrophically fail and then no one really understood what had gone wrong.
 
Joined
Jun 15, 2022
Messages
674
Under Proxmox, HBA mode seems to work: I can see the drives with their real names, run smartctl on them, etc.

If there isn't an "hpsa" equivalent (SCSI devices instead of cciss's block devices), then I agree: I should replace the controller.
I started with a system very similar to yours, and @jgreco went out of his way to get me off the P410i controller--with hardware cache, even, a very nice setup if I do say so myself. Like you, I was testing out independent RAID-0 arrays, and like you, I was seeing things that looked off (though a different "off" than your "off"). Other members had similar reasoning, but @jgreco really led the charge on that one.

Now, the thing is he, like every experienced member here that I've run into, is trying to vastly uncomplicate your life with very little effort or expense on your part, to the point it does not seem that it could be "that simple." But it is. The community really does want you to succeed. And that's why I'm part of the community now, because my fairly stress-free job accidentally presented me with an unintended opportunity to make things even more stress-free, and the members here support me in making that a reality. With all this new-found free time I often have little else to do than participate here, because there are only so many hours of the day I can sleep peacefully in my office undisturbed while others think I'm busily toiling away ensuring the systems they depend on keep functioning smoothly, which has earned their respect and admiration, not to mention their desire to leave me undisturbed so that I may continue focusing on doing whatever it is I'm doing, which in reality is drooling on my breast pocket where I've conveniently placed a paper napkin which is readily disposed of whenever I awaken--it really is about having reliable systems in place, you know.
 

Whattteva

Wizard
Joined
Mar 5, 2013
Messages
1,824
You know, I appreciate you pointing that out, because that's creepily like what seemed to happen to people a decade ago on ESXi with RDM. It would just catastrophically fail and then no one really understood what had gone wrong.
Yeah, I definitely remember those days. I think those cases are a lot less frequent now, probably because Proxmox has been rising in popularity, especially among home users.
 
Joined
Feb 8, 2023
Messages
9
Until it doesn't. Plenty of setups also work on CORE, until they don't, and their owners come in here with a thread titled "HELP!!! Pool doesn't mount anymore! Been working fine for a year until now". I'm actually not quite sure why, but a year seems to be the average time before someone makes a panic thread here, usually with a post count of 1.

What happened in those cases?

Working for now doesn't always mean it will keep working indefinitely. You're free to do as you wish, of course, but it's a "do at your own risk, your mileage will vary" type of thing.

What would be a good controller?
The server is a ProLiant DL380 G8; the drives are SSDs (4x SAS 6G, 4x SAS 12G).
 
Joined
Feb 8, 2023
Messages
9
I started with a system very similar to yours, and @jgreco went out of his way to get me off the P410i controller--with hardware cache, even, a very nice setup if I do say so myself. Like you, I was testing out independent RAID-0 arrays, and like you, I was seeing things that looked off (though a different "off" than your "off"). Other members had similar reasoning, but @jgreco really led the charge on that one.

Mine were two separate tests: on TrueNAS I configured independent RAID-0 arrays; on Proxmox I used the controller in HBA mode.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
What would be a good controller?
The server is a ProLiant DL380 G8; the drives are SSDs (4x SAS 6G, 4x SAS 12G).

I am not certain what the recommended HP controller for this is--@HoneyBadger, any help, kind sir? I believe it would be the HP H220 for 6Gbps, but I don't know a 12Gbps part. It should probably be something with an LSI SAS3008 controller chip on it.
 

Whattteva

Wizard
Joined
Mar 5, 2013
Messages
1,824
What happened in those cases?
I'll let the threads speak for themselves.
SATA multiplier card
USB sticks and SD card
USB HDD + power outage
HW RAID card + power outage - This may be most applicable to you
ESXi + virtual disk + power outage
Power outage - Not sure about the underlying cause beyond that as this guy refuses to divulge any further information about his specs. It's as if he doesn't want help... :rolleyes:

Looking at the list above, it amazes me the kind of crackpot shenanigans people are willing to put their precious, irreplaceable data (of which they have no backups) through.
I could find more instances, but you get the gist.
In all of these cases, they ran "fine" for a good amount of time... until they didn't (usually after a power outage). In all of the cases mentioned, they likely experienced full data loss.

What would be a good controller?
The server is a ProLiant DL380 G8; the drives are SSDs (4x SAS 6G, 4x SAS 12G).
I'm using an HPE H220 (LSI 9205-8i) to great effect, but it is an older 6G SAS2308 card, which may not be fast enough for your SSDs.
 
Last edited:

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I am not certain what the recommended HP controller for this is--@HoneyBadger, any help, kind sir? I believe it would be the HP H220 for 6Gbps, but I don't know a 12Gbps part. It should probably be something with an LSI SAS3008 controller chip on it.
Unfortunately for CORE users, HP switched to sourcing their controllers from MicroSemi for the SAS3/12Gbps generation - the H240 uses a MicroSemi chipset and the ciss driver. SCALE should get better support using the hpsa driver; however, I'm still uncertain whether it has addressed some long-standing bugs where replaced/hot-swapped disks aren't made visible without a power-cycle.

If it is used, it needs to be pushed into HBA mode using the HP Smart Storage Administrator CLI - another user at ServeTheHome has documented the process, but I can't verify it myself as I don't have an H240 available.

https://forums.servethehome.com/ind...4-usd-from-us-seller.28719/page-2#post-268021

My recommendation would be to use an LSI SAS3008 based controller as suggested.
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
What happened in those cases?

In the early 2010s, we had a large run of people with various hypervisor "accidents". Nehalem and Westmere both theoretically had PCIe passthru capability; VT-d was introduced IIRC in 2007, and Nehalem was maybe two years later. But this was still a really new feature, for the BIOS, for the CPU, and for ESXi alike. People were not having success with PCIe passthru for the controller, so many Nehalem and Westmere users tried RDM instead.

RDM is a technology intended for SAN storage systems, and allows you to allocate SAN space to a VM. Internally, ESXi creates a fake .vmdk mapping file and instead stores the disk data directly on the LUN. If you've ever looked at a VMware Fusion or Workstation disk with the descriptor and extent files, it is somewhat similar, just substituting LUN for extent files.

ESXi unofficially supports the use of local storage devices for RDM if you do a little hackery, which led some desperate users to set up multiple disks as RDM and expose them to their FreeNAS VMs, because they couldn't get PCIe passthru to work, or in some cases didn't bother to try. There was some insipid blog post somewhere that described how to do it, and so a bunch of users who really didn't understand the fire tried playing with fire anyway.

And it worked. Or seemed to. Until something went wrong. And here's where I have no good answers for you. Usually triggered by a disk failure or replacement, the use of RDM would seemingly cause some sort of catastrophic pool failure. We had a few users try to pursue this with VMware; they were told that local use of RDM was unsupported and lacked needed features, and VMware would not analyze what had happened. More often, the free ESXi users who didn't have support contracts would just suddenly find themselves without their RDM disks, and generally none of them had the virtualization experience to dive in, figure out where the vmdk's pointed, and discover what was missing or broken. This basically turned into a debacle of nearly universal failure to recover an RDM setup.

This was the genesis of the frustrated-sounding meta-article


I recommend against RDM. Not because I want you to buy a newer machine, not because I want you to burn watts, and not because of any of the various other crackpot theories people have accused me of. I recommend against RDM because I have no idea how to fix it when it breaks, and have seen enough breakage that it scares me.
 
Joined
Feb 8, 2023
Messages
9
Unfortunately for CORE users, HP switched to sourcing their controllers from MicroSemi for the SAS3/12Gbps generation - the H240 uses a MicroSemi chipset and the ciss driver. SCALE should get better support using the hpsa driver; however, I'm still uncertain whether it has addressed some long-standing bugs where replaced/hot-swapped disks aren't made visible without a power-cycle.

If it is used, it needs to be pushed into HBA mode using the HP Smart Storage Administrator CLI - another user at ServeTheHome has documented the process, but I can't verify it myself as I don't have an H240 available.
For that P420i I didn't need the CLI; only the utility included on the Firmware DVD for GEN 8.1 worked.
The version you can launch during POST with F5 didn't have the HBA option.

https://forums.servethehome.com/ind...4-usd-from-us-seller.28719/page-2#post-268021

My recommendation would be to use an LSI SAS3008 based controller as suggested.

Like this?
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
For that P420i I didn't need the CLI; only the utility included on the Firmware DVD for GEN 8.1 worked.
The version you can launch during POST with F5 didn't have the HBA option.
...
To be clear, some RAID controllers don't have a real HBA mode, even if they say they do.

The real test is whether SMART or Linux "hdparm" can directly access the disk. If they can, you almost certainly have HBA mode.
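Concretely, the check looks something like this (device paths are only examples; hdparm speaks ATA, so SAS drives answer through smartctl instead):

```shell
# In true HBA mode these return the drive's own identity (e.g. "SAMSUNG ...");
# behind a Smart Array logical volume you typically get "HP" / "LOGICAL VOLUME".
smartctl -i /dev/da0    # FreeBSD / TrueNAS CORE
smartctl -i /dev/sda    # Linux / Proxmox
hdparm -I /dev/sda      # Linux, SATA drives only
```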
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
For that P420i I didn't need the CLI; only the utility included on the Firmware DVD for GEN 8.1 worked.
The version you can launch during POST with F5 didn't have the HBA option.

The P420i isn't compatible.

 
Joined
Feb 8, 2023
Messages
9
To be clear, some RAID controllers don't have a real HBA mode, even if they say they do.

The real test is whether SMART or Linux "hdparm" can directly access the disk. If they can, you almost certainly have HBA mode.
smartctl says this:

smartctl --all -d scsi /dev/sda
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.85-1-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: SAMSUNG
Product: P1633N19 EMC1920
Revision: EQP9
Compliance: SPC-4
User Capacity: 1,920,924,123,136 bytes [1.92 TB]
Logical block size: 512 bytes
Physical block size: 4096 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Logical Unit id: 0x5002538a0664cca0
Serial number: P0NAAH604918
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Wed Feb 22 00:59:42 2023 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Disabled or Not Supported

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Percentage used endurance indicator: 80%
Current Drive Temperature: 28 C
Drive Trip Temperature: 60 C

Accumulated power on time, hours:minutes 41426:31
Manufactured in week 23 of year 2016
Accumulated start-stop cycles: 117
Specified load-unload count over device lifetime: 0
Accumulated load-unload cycles: 0
Elements in grown defect list: 2749

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/      errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0      384         0       384     248057    7118840.350      247673
write:         0        0         0         0       4496    3605987.844        4496
verify:        0        0         0         0          0     109579.038           0

Non-medium error count: 65

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1 Background short Completed - 39507 - [- - -]
# 2 Background long Aborted (by user command) 8 39507 - [- - -]
# 3 Reserved(3) Completed - 39329 - [- - -]
# 4 Reserved(3) Completed - 39327 - [- - -]
# 5 Reserved(3) Completed - 39325 - [- - -]
# 6 Reserved(3) Completed - 39323 - [- - -]
# 7 Reserved(3) Completed - 39321 - [- - -]
# 8 Reserved(3) Completed - 39318 - [- - -]
# 9 Reserved(3) Completed - 39316 - [- - -]
#10 Reserved(3) Completed - 39314 - [- - -]
#11 Reserved(3) Completed - 39312 - [- - -]
#12 Reserved(3) Completed - 39310 - [- - -]
#13 Reserved(3) Completed - 39308 - [- - -]
#14 Reserved(3) Completed - 39306 - [- - -]
#15 Reserved(3) Completed - 39303 - [- - -]
#16 Reserved(3) Completed - 39301 - [- - -]
#17 Reserved(3) Completed - 39299 - [- - -]
#18 Reserved(3) Completed - 39297 - [- - -]
#19 Reserved(3) Completed - 39295 - [- - -]
#20 Reserved(3) Completed - 39293 - [- - -]

Long (extended) Self-test duration: 90 seconds [1.5 minutes]
 