Another "door bell handshake failed" thread with a SAS2008 and a MD1200

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Wow, P14 firmware for SAS2 is ancient. I'm surprised it's been remotely stable, the lead-up to P16 was janky as hell.

Something else worth noting is I have been using a very similar setup in production at another location with the below hardware.
Same ESXi 6.7u3, but TrueNAS is on v12.0 while I am using v13 on the crashing system.
R720xd LSI SAS9207-8i for internal 12 bays (completed before H710 cross flash had been developed)
MD1200 connected via LSI SAS9207-8e
Same exact model SATA WD HDDs used in the R720xd, MD1200, and the setup I am having stability issues with.
Here is the firmware list on this similar, but stable system running for a few years.
That's what I'd always heard (and I have an MD1200 at work I'd like to repurpose as the sneakernet/engineering needs disk chassis, so I'd hate to learn of weird incompatibilities).
I have ordered an RS232 service cable for the MD1200 Control Module and it was delivered today. It will take me a little time to get connected to it as I am not yet familiar with doing so. I am not sure how much help it will be, but I certainly wanted to have it as an option.
Let us know. Could very well be bad firmware.

Although... Have you tried different PSUs on the MD1200? It could be bad power... Hell, just the other day I thought I had a pair of bad LTO-3 tape drives and it turned out to almost certainly be a problem with the tape library's PSUs (almost certainly because I'm not throwing money into that thing, I just need my predecessors' data back!).
 

bstev

Explorer
Joined
Dec 16, 2016
Messages
53
Well dangit, I accidentally deleted my post from earlier because I had an attachment at the bottom of the post that was a mistake and just hanging out there. "Delete" was right under it and, well, yeah, I deleted the post, not that attachment.

Okay, back on track... This MD1200 came in with just one controller in it and I wanted two, so I ordered the second one from a used listing on eBay. I guess that lowers the likelihood of a firmware problem, but it certainly does not eliminate it.

As for the PSUs, I was running off just one PSU originally because I like to put one on mains and the other on UPS, but I ran out of outlets when hooking it up for stress testing back in November. You make a good point about the PSU. It had crossed my mind and I started doing some testing there, but I did not complete it and then it slipped my mind. I will double-check how stable it is with each PSU.
 

Ericloewe

Well dangit, I accidentally deleted my post from earlier because I had an attachment at the bottom of the post that was a mistake and just hanging out there. "Delete" was right under it and, well, yeah, I deleted the post, not that attachment.
Nothing a little moderator intervention can't solve...
As for the PSUs, I was running off just one PSU originally because I like to put one on mains and the other on UPS, but I ran out of outlets when hooking it up for stress testing back in November. You make a good point about the PSU. It had crossed my mind and I started doing some testing there, but I did not complete it and then it slipped my mind. I will double-check how stable it is with each PSU.
Let us know how it goes.
 

bstev

Will do, and thank you for undeleting that post!

I am currently reading through this, which appears to be the go-to place for an in-depth understanding of working with these MD1200s. But I am going to try testing the two PSUs independently first.
 

bstev

I did notice this in the logs from this morning, but it has not crashed today. I also double-checked the PSUs, and I had swapped to the other PSU yesterday when working on it. I will give it some time and see how it does.
Jan 3 09:21:29 truenas (ses1:mps0:0:146:0): RECEIVE DIAGNOSTIC RESULTS. CDB: 1c 01 02 80 00 00 length 32768 SMID 1212 Command timeout on target 146(0x0018) 60000 set, 60.128511376 elapsed
Jan 3 09:21:29 truenas mps0: Sending abort to target 146 for SMID 1212
Jan 3 09:21:29 truenas (ses1:mps0:0:146:0): RECEIVE DIAGNOSTIC RESULTS. CDB: 1c 01 02 80 00 00 length 32768 SMID 1212 Aborting command 0xfffffe0104b75ca0
I am running a SAS SSD for my SLOG on this pool, so, if I am thinking about things correctly, it should be the only device on ses1, since the SATA drives are on ses0.
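For anyone reading those log lines later, here is a minimal sketch of how the CDB bytes in the message break down. It assumes the standard 6-byte SCSI RECEIVE DIAGNOSTIC RESULTS layout (opcode 0x1C, PCV bit, page code, 16-bit allocation length, control byte); the function name is made up for illustration. Page code 0x02 is the SES Enclosure Status page, so the command that timed out looks like a routine enclosure status poll rather than disk I/O.

```python
def decode_receive_diagnostic_cdb(cdb_hex: str) -> dict:
    """Decode a 6-byte RECEIVE DIAGNOSTIC RESULTS CDB given as hex text.

    Hypothetical helper for illustration; field layout follows the standard
    6-byte form of the command.
    """
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 6:
        raise ValueError("expected a 6-byte CDB")
    if cdb[0] != 0x1C:
        raise ValueError("not a RECEIVE DIAGNOSTIC RESULTS command")
    return {
        "opcode": cdb[0],
        # PCV (page code valid) is bit 0 of byte 1
        "pcv": bool(cdb[1] & 0x01),
        # Byte 2 is the diagnostic page code (0x02 = SES Enclosure Status)
        "page_code": cdb[2],
        # Bytes 3-4 are the big-endian allocation length
        "allocation_length": int.from_bytes(cdb[3:5], "big"),
        "control": cdb[5],
    }


if __name__ == "__main__":
    # CDB copied from the log lines above
    print(decode_receive_diagnostic_cdb("1c 01 02 80 00 00"))
```

The decoded allocation length should agree with the "length 32768" that the driver prints in the same log line.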
 

bstev

I see now that the error above was not for my SAS SSD. I believe that message is pointing to my controller #1, right?
1672797060403.png
 

Ericloewe

Right, the driver is not liking what the expander had to say - or not to say.
 

bstev

Right, the driver is not liking what the expander had to say - or not to say.
I continued to get those same diagnostic results messages this morning, at 1 am, 3:30 am, and 7 am. I also found that I had been getting those exact same messages back in 2018 when setting up the now long-stable R720xd/MD1200 I mentioned. You and I were talking about it in a thread back then, too. That was when I was testing it on a spare R510xd, but it moved to the R720xd shortly after. You would think my memory of that would be better, but honestly I do not recall much of it anymore.


I do recall doing Dell firmware updates on the R720xd that is connected to the MD1200, probably after our conversation, but that was just firmware for the R720xd components like iDRAC, BIOS, PSUs, and NIC. I am also fairly certain I did not replace any hardware on the R510xd, R720xd, or the MD1200 to resolve the issue, but I have not been getting those diagnostic results messages since then...
 

Ericloewe

It could just be a fairly benign (non-)issue that causes a bit of logspam, then.
 

bstev

I agree and will just keep an eye on that. I do find it odd that this system gets it on SES1 and the other system got it on SES0. But only the one controller gets it each time.
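To keep an eye on which enclosure device is affected, a rough sketch like the one below can tally the timeout messages per SES device. It assumes the log lines look exactly like the `mps` messages quoted earlier in this thread; the function name and regular expression are illustrative, not part of any TrueNAS tool.

```python
import re
from collections import Counter
from typing import Iterable

# Matches lines such as:
# "... (ses1:mps0:0:146:0): RECEIVE DIAGNOSTIC RESULTS. CDB: ... Command timeout on target ..."
TIMEOUT_RE = re.compile(
    r"\((ses\d+):(mps\d+):\d+:(\d+):\d+\): "
    r"RECEIVE DIAGNOSTIC RESULTS\..*Command timeout"
)


def count_ses_timeouts(lines: Iterable[str]) -> Counter:
    """Count SES command timeouts keyed by (ses device, HBA, target id)."""
    counts: Counter = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            ses, hba, target = match.group(1), match.group(2), int(match.group(3))
            counts[(ses, hba, target)] += 1
    return counts
```

Feeding it the system log, for example `count_ses_timeouts(open("/var/log/messages"))` (that path is an assumption, adjust it to wherever your messages end up), would show whether the timeouts always land on the same controller.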
 

bstev

Things were going well, so I moved the SAS2308 and MD1200 back into the main VM and started stressing it a bit harder with more utilization last night, and it ended up crashing. Although it is not stable, it does appear to stay up longer after swapping to the other power supply and running it off the other controller, cable, and VM. It still puzzles me that that change did not fully resolve it, though.
 

bstev

The standalone VM with the SAS2308 and MD1200 is staying strong under normal use, which gets pretty heavy at times. It will only last a few hours or less if I put it all back into one single VM. For now I will just continue to run two TrueNAS VMs, so long as it is stable that way. I would prefer to have just the one VM to share resources better, but I will take stability over that for sure.
 

Ericloewe

I hate this sort of general suckiness. It's hard to get answers and it just makes things way harder than they need to be.

Any luck with firmware upgrades?
 

bstev

Yeah, I feel the same. Both controllers are on 1.06, and as far as I have seen that is the latest firmware. Having the same issue on all 3 HBAs, all of which are on v20, I feel like the current firmware should be good, right?
1673306066657.png
1673306105389.png
 

Ericloewe

Maybe "good" is wishful thinking, but "the best we can hope for" sounds attainable.

Have you tried different cables? Got any SAS3 HBAs and SFF-8644 to SFF-8088 cables to try out?
 

bstev

Sadly, I do not have any of those. I almost purchased a SAS3 HBA for longer-term use when I bought the SAS2308, but told myself the SAS2308 was already overkill for the 10 spinning disks I will probably stay with for many years, compared to the two SAS2008s I already had on the shelf.

So long as it stays stable in its own VM, I may just leave it be for a while so I can focus on some other issues/projects I am getting behind on. And to let my sanity recharge ;)
 

Ericloewe

Sadly, I do not have any of those. I almost purchased a SAS3 HBA for longer-term use when I bought the SAS2308, but told myself the SAS2308 was already overkill for the 10 spinning disks I will probably stay with for many years, compared to the two SAS2008s I already had on the shelf.
And it is! It's really just me throwing stuff at the wall at this point, hoping that the different firmware works better.
 

bstev

And it is! It's really just me throwing stuff at the wall at this point, hoping that the different firmware works better.
I appreciate the ideas and help! At this point my brain keeps going to the fact that it has only ever had one unscheduled reboot when the HBA and MD1200 are in their own VM. I hate leaving something unfinished, but maybe coming back to it at a later time will allow me to get past whatever I am overlooking.
 

mardel

Cadet
Joined
Aug 11, 2019
Messages
8
Looks like I have similar issue :) - "door bell handshake failed".
My setup had been running smoothly for quite some time on TrueNAS Core 12 + 2 HBAs (2008) + 8x8TB spinners. I decided to upgrade to v13: first to 12-U8.1, I believe, and then to the latest 13 about a week ago.
This is when "door bell handshake failed" started showing up, and my TrueNAS VM just abruptly shuts down.
I checked the forum and it seemed like the HBA could be "dead" (though everyone says it is very hard to kill them). I found the culprit HBA and physically removed it. The system worked fine for a day without much data movement. I was running badblocks on a new drive and playing with mounting/unmounting SMB shares from the Mac when it happened again :). I disconnected the new drive and tried to copy a large file from the existing working pool (3-way mirror), and it crashed the VM again.
BTW, when my first HBA "failed" I ordered a new one and it should be delivered soon, but now I am wondering whether TrueNAS Core 13 is the problem. I wish I hadn't upgraded my pool after updating to v13 :) so it will be hard to test it again on v12.

Anything I can provide to help diagnose it?
 

mardel

One more thing: the HBAs were running super hot without a fan 24/7 in my homelab, but after one HBA "failed" I installed those 40x20 fans. It could be that the HBAs were slowly dying, just a thought. But it is still very strange that it coincided with the TrueNAS Core 12 -> 13 upgrade.

Any thoughts/ideas?
I have been reading this forum since 2019 and these are my first messages, so please forgive me for any wrongdoing.
 