Another "door bell handshake failed" thread with a SAS2008 and a MD1200

bstev

Explorer
Joined
Dec 16, 2016
Messages
53
I am using TrueNAS as a VM in ESXi and moved my drives from an R720xd that has worked well for many years to an MD1200 connected to an R720, which has worked well in testing with spare SAS drives I had. Initially I was using an H200e that I crossflashed, but as soon as I started using the pool, the VM would power off (crash). Trying to reboot, I would get the "door bell handshake failed" error unless I removed the H200e from the VM. I had another H200e that I had also crossflashed to LSI firmware, so I tried the second card, but it gave the same result.
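(Side note, in case it helps anyone checking a similar setup: assuming the sas2flash utility is installed on the TrueNAS box, which I believe it normally is, you can confirm what firmware a crossflashed card is actually reporting with something like the below. The controller index is just an example and will differ per system.)

    # List every LSI SAS2 controller sas2flash can see, with firmware/BIOS versions
    sas2flash -listall
    # Show full details for controller 0 (match the index against the -listall output)
    sas2flash -c 0 -list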

After reading all the threads I could find about this fault and trying lower RAM amounts, it would still crash. I settled on TrueNAS 13 just not playing well with my SAS2008 for some reason, ordered an LSI 9207-8e (SAS2308), and moved my drives back to the R720xd to wait for it to arrive.

Okay, it came in and I installed it today, then moved the drives back over to the MD1200. Within a few minutes of getting everything back online and using it, the VM crashed just like it had with the H200e card. I rebooted the TrueNAS VM and everything went back to normal, without the door bell handshake error I would get from the H200e. At this point I realize the SAS2008 card may have been giving that error on reboot, but the initial problem is not related to the card; the SAS2008 just gets stuck that way unless the host is rebooted, while the SAS2308 does not. Probably a mistake/misunderstanding on my part in how I have things set up.

At this point I unplugged one of the two cables between the LSI card and the MD1200. TrueNAS has now been running for about 15 minutes with no issue while I type this up.

Now, what am I doing wrong... The MD1200 has a switch on it to be set for single (unified) or split mode, and I have it set to single since I am using a single host. I am using 10 SATA drives, so I understand the two MD1200 controllers cannot split traffic off the drives to the LSI card. My understanding was that the two cable connections and two controllers would just be handled as redundant paths (sure, my LSI card is still a single point of failure). Seeing how having both cables hooked up may contribute to the TrueNAS crash tells me I am mistaken there...
 

bstev

Explorer
Joined
Dec 16, 2016
Messages
53
I spoke too soon; it crashed just now with only one cable connected to the MD1200. I am going to try just the other cable now to see if it is related to one particular cable or controller.
 

bstev

Explorer
Joined
Dec 16, 2016
Messages
53
Further testing has shown me that only controller #0 in the MD1200 will allow TrueNAS to see the disks. With only controller #1 connected, TrueNAS does not show any of the MD1200 disks. I physically swapped the controllers and cables to verify this, and only MD1200 controller slot #0 will let TrueNAS display the disks. Note: this is all with the MD1200 mode switch in unified mode.

I am not sure if this is normal and research is not exactly telling me... Can someone with experience running TrueNAS on an MD1200 please chime in on their configuration and cabling that works, especially if they run SATA on an MD1200?
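(For anyone wanting to reproduce the check from the shell rather than the GUI, I believe something like the below works; camcontrol is part of FreeBSD, and sas2ircu ships with TrueNAS CORE as far as I know. The controller index is just an example.)

    # Enumerate every disk the HBA is currently exposing to FreeBSD/TrueNAS
    camcontrol devlist
    # If sas2ircu is available, show which enclosure/slot each drive is attached through
    sas2ircu 0 DISPLAY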
 

bstev

Explorer
Joined
Dec 16, 2016
Messages
53
Further testing shows that the MD1200 cannot offer "redundancy" for the SATA drives... Unified mode gives me a second path when I put in a SAS drive, but no matter what I tried, I was unable to get controller slot #1 to register any SATA drives while in unified mode, even with controller #0 totally removed.

Since I am not sure what else to try, I am going to try split mode on the MD1200, so that the first 6 drive bays go through controller #0 to connection #0 of my SAS2308 card, and the other 6 drive bays go through controller #1 to connection #1 of the SAS card, to see whether the crashing behaves any differently.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I am not sure if this is normal and research is not exactly telling me...

It's normal. Your SATA disks normally only offer a single connection to whatever is talking to them. Feel free to see the SAS Primer in the Resources section. A SATA to SAS interposer may be used to make two different connections from the hard drive to your primary and secondary controller.
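If you do end up with dual-ported drives (SAS, or SATA behind interposers), TrueNAS CORE handles the two paths with GEOM multipath, and you can check from the shell whether both paths are actually present with something like:

    # Summary of multipath devices and the state of each path
    gmultipath status
    # Verbose view showing which da devices back each multipath node
    gmultipath list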
 

bstev

Explorer
Joined
Dec 16, 2016
Messages
53
I was not familiar with those. That is a great solution! I find them to be very economical in the recycled enterprise equipment world too. Thank you very much!

Might you also have any thoughts as to why the VM is crashing when using a SATA pool located in the MD1200? This pool works perfectly fine in my R720xd. Or any further direction I should take my troubleshooting?
 

bstev

Explorer
Joined
Dec 16, 2016
Messages
53
Further testing in split mode had the first 6 drive bays no longer showing up in TrueNAS, but the last 6 showed up just fine. Since I have 4 free drive bays in the R720 and the last 6 bays in the MD1200 look like they may work okay, I am testing my 10-disk pool across those 10 drive bays.

If it runs okay with no crashes, then I guess it is related to something in the first 6 drive bays. I looked closely at the pins and backplane with a flashlight and did not notice anything too odd. If testing shows the system to be stable now, I will probably order a replacement backplane or closely inspect the solder joints for an issue. I do not know the history of this MD1200, as it was just a used online purchase.
 

bstev

Explorer
Joined
Dec 16, 2016
Messages
53
@jgreco, to use an interposer on a SATA drive in an MD1200, I am not seeing how to secure it to the SATA drive or caddy. I also see them being used on the older 2900-series servers, but I am not finding photos of them with the newer server caddies. I mean, if they were not designed for the newer ones but still do the job very well, then I can use hot glue to attach them to the HDDs. Just wanted to make sure I was not missing something obvious, which would not be the first or last time for me.
 

bstev

Explorer
Joined
Dec 16, 2016
Messages
53
Okay, even using only the last 6 drive bays of the MD1200, it just crashed again. TrueNAS runs stable, though, if I stick to using my other pools made up of NVMe and drives in the R720.

At this point I am not 100% convinced it is the MD1200 backplane, but no other ways to narrow it down further or verify it better are coming to mind.
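The only other checks I can think to run from the TrueNAS shell are along these lines (the mps driver is what FreeBSD uses for the SAS2308, and da0 is just an example device name):

    # Look for HBA driver errors or resets around the time of the crash
    dmesg | grep -i mps
    grep -i mps /var/log/messages
    # Pull SMART data from one of the MD1200 drives (repeat for each device)
    smartctl -a /dev/da0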
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
@jgreco, to use an interposer on a SATA drive in an MD1200, I am not seeing how to secure it to the SATA drive or caddy. I also see them being used on the older 2900-series servers, but I am not finding photos of them with the newer server caddies. I mean, if they were not designed for the newer ones but still do the job very well, then I can use hot glue to attach them to the HDDs. Just wanted to make sure I was not missing something obvious, which would not be the first or last time for me.
You wouldn't really want to. Fun fact: Dell got rid of interposer support on Gen 14, making the drive bays a bit shallower and freeing up chassis depth for actually useful things.
The interposer would just attach to the disk and sit in between. The drive trays have two sets of holes to support interposers (labeled "SAS" for no interposer and "SATA" because someone actually thought they could fool everyone using SATA into paying for interposers).
 

bstev

Explorer
Joined
Dec 16, 2016
Messages
53
You wouldn't really want to. Fun fact: Dell got rid of interposer support on Gen 14, making the drive bays a bit shallower and freeing up chassis depth for actually useful things.
The interposer would just attach to the disk and sit in between. The drive trays have two sets of holes to support interposers (labeled "SAS" for no interposer and "SATA" because someone actually thought they could fool everyone using SATA into paying for interposers).
"I think" the idea was that the interposer will allow the MD1200 to make SATA drives redundant between both the controller modules since without them only the controller module #0 will see the SATA drives when in unified mode
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Yes, but if you really wanted that, you'd use SAS drives. Interposers are somewhere between a silly hack and a silly scam in terms of actual benefits. Using SAS disks is cheaper and more reliable, given the small price difference between SAS and SATA versions of the same disk model.
 

bstev

Explorer
Joined
Dec 16, 2016
Messages
53
Okay, good point.
For further testing on the MD1200 and my crashing TrueNAS VM, I have now set up another TrueNAS VM with just the SAS2308 card and thrown my old 10x 6TB SAS drives in it to stress test it further without keeping my main pools down. I know it is not apples to apples, but for further testing I am trying to make use of what I have to work with.
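For the stress test I do not have anything fancy in mind, just sustained reads/writes and a scrub, something like the below (the pool name and paths are only placeholders):

    # Sustained sequential write into the test pool
    dd if=/dev/zero of=/mnt/testpool/stress.bin bs=1M count=100000
    # Read it back, then let ZFS verify everything
    dd if=/mnt/testpool/stress.bin of=/dev/null bs=1M
    zpool scrub testpool
    zpool status testpool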
 

bstev

Explorer
Joined
Dec 16, 2016
Messages
53
Happy New Year!

Okay, so as of this morning the troubleshooting is moving differently than I expected. I made a copy of my current TrueNAS VM and then moved the SAS2308 card, along with the MD1200 and SATA drives, from the main VM to its own test VM. It has been running flawlessly this way since last night. So at this point I am feeling like it is a mix of everything together that is not playing nice, not exactly a hardware issue...

Any ideas what could be causing it? Below is how I have it set up.

This config is the one that crashes - Main TrueNAS VM config
Qty 3 NVMe in PCIe passthrough
Qty 1 H710 1GB crossflashed and passed through for the 8 local R720 drives
128GB RAM
10 vCPU cores
Qty 4 vNICs for round-robin iSCSI to the ESXi datastore
Qty 1 vNIC for 10Gbit SMB share inside the VMs
Qty 1 vNIC linked to 1Gbit for SMB share to the physical network
Qty 1 LSI SAS 9207-8e passed through and connected to the MD1200 with Qty 10 SATA drives

Working - Main TrueNAS VM config
Qty 3 NVMe in PCIe passthrough
Qty 1 H710 crossflashed and passed through for the 8 local R720 drives
90GB RAM
10 vCPU cores
Qty 4 vNICs for round-robin iSCSI to the ESXi datastore
Qty 1 vNIC for 10Gbit SMB share inside the VMs
Qty 1 vNIC linked to 1Gbit for SMB share to the physical network

Working - Test VM
Qty 1 LSI SAS 9207-8e passed through and connected to the MD1200 with Qty 10 SATA drives
32GB RAM
6 vCPU cores
Qty 1 vNIC for 10Gbit SMB share inside the VMs
Qty 1 vNIC linked to 1Gbit for SMB share to the physical network
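For completeness, these are the passthrough-related bits I understand ESXi wants for a config like this; I am not claiming mine are set wrong, and the MMIO size below is just an example value:

    # Full guest memory reservation is required for PCIe passthrough
    # ("Reserve all guest memory (All locked)" in the vSphere UI)
    # Advanced .vmx options often mentioned for passing HBAs/NVMe to large-memory VMs:
    pciPassthru.use64bitMMIO = "TRUE"
    pciPassthru.64bitMMIOSizeGB = "64"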
 

bstev

Explorer
Joined
Dec 16, 2016
Messages
53
I had an unscheduled reboot a few hours ago, and this was on the VM set up just for the SAS2308 and MD1200. Looking at the reporting, the VM NIC started seeing higher-than-typical utilization, with ~500Mb Rx and ~500Mb Tx at the time of the crash. The HDDs were only at maybe 15% utilization during this time, which is not high compared to usual. Nothing else in the reporting looked abnormal to me for when this happened.

I really am not sure what else to look at or try right now. Please tell me even the simple things I may be missing when trying to understand unscheduled reboots or crashes.
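So far the only places I know to look after one of these crashes are roughly the following; if there are better places, please point me at them. (The datastore and VM folder names are just placeholders, and I believe /data/crash is where TrueNAS CORE keeps kernel crash reports if any get saved.)

    # On the TrueNAS VM after it comes back up
    dmesg
    less /var/log/messages
    ls /data/crash
    # On the ESXi host, the VM's own log usually records why it powered off or reset
    less /vmfs/volumes/<datastore>/<vm-name>/vmware.log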
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
At this point, both of the disk shelf controllers have shown this issue, right?
 

bstev

Explorer
Joined
Dec 16, 2016
Messages
53
At this point, both of the disk shelf controllers have shown this issue, right?
Yes, both of the control modules internal to the MD1200, both cables between the HBA and the control modules, and several different HBAs (Qty 2 SAS2008 and Qty 1 SAS2308) connected to it.
 

bstev

Explorer
Joined
Dec 16, 2016
Messages
53
I must also say that setting it up on its own TrueNAS instance has stabilized it much better, with uptime of a couple of days before a crash. When it was part of my main TrueNAS instance that hosts the iSCSI datastore for all my ESXi VMs, it would crash quickly, typically running for only 10 minutes or less and never more than 30.

Not to mention, I much prefer that just the SMB share goes down with that VM, rather than my entire ESXi datastore dropping out like it was.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Grasping at straws because these things are generally known to work well: Is the firmware up to date? Does the issue persist with a different model of HDD?
 

bstev

Explorer
Joined
Dec 16, 2016
Messages
53
The firmware on the LSI 9207-8e SAS2308 HBA is v20.
[screenshot of firmware details]


I have ordered an RS-232 service cable for the MD1200 control module, and it was delivered today. It will take me a little time to get connected to it, as I am not yet familiar with doing so. I am not sure how much help it will be, but I certainly wanted to have it as an option.

These hard drives are working flawlessly when installed in my R720xd. I am just trying to transition from the R720xd to the MD1200 with a regular R720. It is starting to feel like more hassle than it is worth, but at this point I am in too deep and committed.

Oddly enough, my R720xd that is perfectly stable with them is running v19 FW on an LSI 9207-8i that I installed in 2016.
[screenshot of firmware details]



Something else worth noting is that I have been using a very similar setup in production at another location with the hardware below.
Same ESXi 6.7u3, but TrueNAS is on v12.0 while I am using v13 on the crashing system.
R720xd with an LSI SAS 9207-8i for the internal 12 bays (built before the H710 crossflash had been developed)
MD1200 connected via an LSI SAS 9207-8e
Same exact model of SATA WD HDDs used in the R720xd, the MD1200, and the setup I am having stability issues with.
Here is the firmware list on this similar but stable system, which has been running for a few years.
[screenshot of firmware list]
 