Disk Initialisation Issues and Missing Disks

xxGBHxx

(I didn't know which forum to post this into but this one seemed to make the most sense)

Hi,

I'm in the process of building a new NAS. It's my third TrueNAS system and this one was meant to be the one I did "right" but I'm having huge problems with it. As my other two systems "just worked" I'm not very experienced in troubleshooting so I don't really know what to look at.

The system is as follows

Supermicro X11DPH-T
256GB ECC RAM
12 * 18TB Seagate EXOS
1 * Optane 360GB P4800x
1 * LSI 9300-16i SAS3008 HBA
1 * Adaptec AEC-82885T SAS Expander Card
1 * Mellanox MCX455A-ECAT 100GbE ConnectX-4 NIC (currently unused)
1 * 1500W Corsair HX PSU
24 Hot Swap Bay SAS3 Backplane

I have run the full set of disk tests from the Github script linked on this forum and all disks are working 100% with no errors
The motherboard is on the latest firmware
The LSI card is on the latest firmware
The Adaptec card is, I suspect, not on the latest firmware
I currently use the motherboard's onboard 10Gbit networking; the Mellanox is unused

The problem is fairly simple.

I can boot the system up. It sees all the disks, and I can create a pool of 6 two-way mirror VDEVs (12 disks) with the Optane as a Log VDEV. The pool is created without problems and reports everything as healthy. Right after creating the pool, I can create shares and write to it without issues.

If I hard reboot or power down the server and then restart, I randomly lose anywhere from 1 to all 13 disks from the pool, and it either degrades or goes completely offline. However, even when this happens, I can still go into Storage > Manage Disks in the GUI and see that all the disks are available, but a random number of them now show "N/A" for the pool while others show <pool name> (Exported).

The disks have hardware encryption enabled with a single global password set.

I've tried
  • Various combinations of VDEVs and disks - No change to what happens
  • I thought there was not enough power as I was splitting 1 PCI cable into 3 Molex. I now use no more than 2 Molex connectors off each PSU connector, though this didn't change anything
  • I've re-configured the Expander and the HBA. I need 6 SAS cables for the backplane (6*4 drives). The Expander has 7 connectors, the HBA has 4. I had originally used 2 HBA > Expander, then 5 Expander > Backplane and 1 HBA > Backplane. I have since tried removing the Expander completely and connecting only the 3 backplane connections that serve the 12 disks directly to the HBA. No change.
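As a sanity check on the original cabling, the port arithmetic can be sketched as below. The port counts and cable runs are the ones described in the bullet above; the dictionary and function names are made up purely for illustration.

```python
# Port counts per device, as described in the post.
PORTS = {"hba": 4, "expander": 7, "backplane": 6}

# Original cabling: (from, to, number of SAS cables).
LINKS = [
    ("hba", "expander", 2),
    ("expander", "backplane", 5),
    ("hba", "backplane", 1),
]


def port_usage(ports, links):
    """Return {device: (ports_used, ports_available)} for a cabling layout."""
    used = {}
    for src, dst, count in links:
        # Each cable occupies one port on both ends.
        used[src] = used.get(src, 0) + count
        used[dst] = used.get(dst, 0) + count
    return {dev: (used.get(dev, 0), total) for dev, total in ports.items()}


if __name__ == "__main__":
    for dev, (used, total) in port_usage(PORTS, LINKS).items():
        status = "OK" if used <= total else "OVERSUBSCRIBED"
        print(f"{dev}: {used}/{total} ports used ({status})")
```

With this layout the Expander and the backplane are fully populated and the HBA has one port spare, so the connector counts themselves are consistent.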
Looking through /var/log/messages I see a number of messages like this relating to the drives:
Sep 3 17:17:16 truenasfast kernel: sd 8:0:1:0: [sdh] tag#1308 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
Sep 3 17:17:16 truenasfast kernel: sd 8:0:1:0: [sdh] tag#1308 Sense Key : Data Protect [current] [descriptor]
Sep 3 17:17:16 truenasfast kernel: sd 8:0:1:0: [sdh] tag#1308 Add. Sense: Access denied - no access rights

I'm assuming that's because the drives haven't been unlocked by this stage; however, later on in the boot I also see:

Sep 3 17:17:36 truenasfast kernel: ldm_validate_partition_table(): Disk read failed.
Sep 3 17:17:36 truenasfast kernel: sdh: unable to read partition table
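To see which devices are affected on a given boot, the log can be tallied per disk. The following is a minimal sketch that assumes the kernel messages always look like the ones quoted above; the regular expressions and the function name `summarise` are my own and have only been checked against those sample lines.

```python
import re
from collections import defaultdict

# Matches e.g. "kernel: sd 8:0:1:0: [sdh] tag#1308 Sense Key : Data Protect [current] [descriptor]"
SENSE_RE = re.compile(
    r"kernel: sd [\d:]+ \[(?P<dev>sd[a-z]+)\] tag#\d+ Sense Key : (?P<key>[^\[]+)"
)
# Matches e.g. "kernel: sdh: unable to read partition table"
PART_RE = re.compile(r"kernel: (?P<dev>sd[a-z]+): unable to read partition table")


def summarise(lines):
    """Count Data Protect sense errors and partition-table read failures per device."""
    summary = defaultdict(lambda: {"data_protect": 0, "partition_read_failed": 0})
    for line in lines:
        match = SENSE_RE.search(line)
        if match:
            if match.group("key").strip() == "Data Protect":
                summary[match.group("dev")]["data_protect"] += 1
            continue
        match = PART_RE.search(line)
        if match:
            summary[match.group("dev")]["partition_read_failed"] += 1
    return dict(summary)


if __name__ == "__main__":
    # Assumes the log lives at /var/log/messages, as in the post.
    with open("/var/log/messages", errors="replace") as log:
        for dev, counts in sorted(summarise(log).items()):
            print(dev, counts)
```

Comparing the output after a hard reboot with the output after a soft reboot should show whether the disks that drop out of the pool are the same ones that log these errors.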

In System Settings > Advanced I have set the Self-Encrypting Drive parameters to "MASTER" for the user and double-checked that the password is correct. I've done all the checks on the command line and they seem to suggest it's all working as expected and all the drives are configured correctly. If they were not, I'm guessing I wouldn't be able to create a pool or write to a share.

I had also thought it might be a timing issue with the Adaptec Expander and a staggered start; however, removing it from the build completely had no effect.

Finally, the oddest part. If I soft reboot the machine instead of a hard reboot/power down, the drives seem to come back online and the pool becomes healthy again as if nothing has happened. In testing, it recognises all the disks and brings the pool back online every single time if I only soft reboot. But if I then hard reboot, the problem recurs.

Obviously I don't want to trust my data to the system when it literally "breaks" the pool if I shut down or hard reboot, so I'd like to get it sorted.

Things I have not tried

  • A different version of TrueNAS - I am running the latest Bluefin. I haven't tried either Core or Cobia to see if this rectifies the issue
  • Any other hardware - If it's something to do with the hardware then I can change it however I don't think it's a hardware fault (though maybe a hardware configuration or firmware issue)
  • All possible SAS cable combinations between the HBA and the Expander
Any thoughts, ideas or things to try would be very much appreciated.

Thanks

G
 