Upgrading to 13 results in controller reinitialization

D0peX

Cadet
Joined
May 23, 2023
Messages
4
Hi Forum

I have been struggling with this one for so long now, without any answers.
Cannot find too much information about it too.

When running Truenas 13.0-U4, the `mps` re-inits after lots of command timeouts.
I can reproduce this issue 100% with starting a bunch of random writes with fio. The issue starts after ram has been filled by zfs cache.

When rolling back to 12.0-U8, no issues at all. This completely singles out the issue to be software.
I've tried some tunables in loader like
Code:
hw.mps.disable_msix=1
hw.mps.disable_msi=1

Without any success.

I also cannot find any mention of this controller starting to be buggy in freebsd 13, which i'd find strange if that were the case. I mean... It's freebsd, used by many for storage. With this in mind, i'm suspecting it to be a config/setting that Truenas ships with 13. But i do not know what/where to look for.

Small snippet of what happens:
Code:
mps0: Reinitializing controller
mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
    (da0:mps0:0:0:0): WRITE(10). CDB: 2a 00 46 74 2e b8 00 08 00 00 length 1048576 SMID 597 Command timeout on target 0(0x0009) 60000 set, 60.780533875 elapsed


I would appreciate any input, and could do some testing too if needed.
Thanks!



Code:
root@sirius[~]# mpsutil show all
Adapter:
mps0 Adapter:
       Board Name: SAS9211-8i
   Board Assembly:
        Chip Name: LSISAS2008
    Chip Revision: ALL
    BIOS Revision: 7.39.02.00
Firmware Revision: 20.00.07.00
  Integrated RAID: no
         SATA NCQ: ENABLED
 PCIe Width/Speed: x8 (5.0 GB/sec)
        IOC Speed: Full
      Temperature: Unknown/Unsupported

PhyNum  CtlrHandle  DevHandle  Disabled  Speed   Min    Max    Device
0       0001        0009       N         6.0     1.5    6.0    SAS Initiator
1       0002        000a       N         6.0     1.5    6.0    SAS Initiator
2                              N                 1.5    6.0    SAS Initiator
3                              N                 1.5    6.0    SAS Initiator
4       0003        000b       N         6.0     1.5    6.0    SAS Initiator
5       0004        000c       N         6.0     1.5    6.0    SAS Initiator
6                              N                 1.5    6.0    SAS Initiator
7                              N                 1.5    6.0    SAS Initiator

Devices:
B____T    SAS Address      Handle  Parent    Device        Speed Enc  Slot  Wdt
00   00   4433221100000000 0009    0001      SATA Target   6.0   0001 03    1
00   01   4433221101000000 000a    0002      SATA Target   6.0   0001 02    1
00   02   4433221104000000 000b    0003      SATA Target   6.0   0001 07    1
00   03   4433221105000000 000c    0004      SATA Target   6.0   0001 06    1

Enclosures:
Slots      Logical ID     SEPHandle  EncHandle    Type
  08    56c92bf00003bcd7               0001     Direct Attached SGPIO

Expanders:
NumPhys   SAS Address     DevHandle   Parent  EncHandle  SAS Level

Code:
mps0: Reinitializing controller
mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
    (da1:mps0:0:1:0): WRITE(10). CDB: 2a 00 bd b3 10 58 00 08 00 00 length 1048576 SMID 440 Command timeout on target 1(0x000a) 60000 set, 60.590619855 elapsed
mps0: Sending abort to target 1 for SMID 440
    (da1:mps0:0:1:0): WRITE(10). CDB: 2a 00 bd b3 10 58 00 08 00 00 length 1048576 SMID 440 Aborting command 0xfffffe00c51edf40
    (da0:mps0:0:0:0): WRITE(10). CDB: 2a 00 db af a1 d8 00 08 00 00 length 1048576 SMID 432 Command timeout on target 0(0x0009) 60000 set, 60.561992076 elapsed
mps0: Sending abort to target 0 for SMID 432
    (da0:mps0:0:0:0): WRITE(10). CDB: 2a 00 db af a1 d8 00 08 00 00 length 1048576 SMID 432 Aborting command 0xfffffe00c51ed480
    (da0:mps0:0:0:0): WRITE(10). CDB: 2a 00 db af a9 d8 00 08 00 00 length 1048576 SMID 392 Command timeout on target 0(0x0009) 60000 set, 60.530714629 elapsed
    (da0:mps0:0:0:0): WRITE(10). CDB: 2a 00 db af b1 d8 00 08 00 00 length 1048576 SMID 306 Command timeout on target 0(0x0009) 60000 set, 60.501117892 elapsed
    (da2:mps0:0:2:0): WRITE(10). CDB: 2a 00 f0 41 46 20 00 01 00 00 length 131072 SMID 211 Command timeout on target 2(0x000b) 60000 set, 60.475944824 elapsed
mps0: Sending abort to target 2 for SMID 211
    (da2:mps0:0:2:0): WRITE(10). CDB: 2a 00 f0 41 46 20 00 01 00 00 length 131072 SMID 211 Aborting command 0xfffffe00c51dab88
    (da0:mps0:0:0:0): WRITE(10). CDB: 2a 00 db af b9 d8 00 08 00 00 length 1048576 SMID 375 Command timeout on target 0(0x0009) 60000 set, 60.483962856 elapsed
    (da1:mps0:0:1:0): WRITE(10). CDB: 2a 00 bd b3 20 58 00 08 00 00 length 1048576 SMID 160 Command timeout on target 1(0x000a) 60000 set, 60.441826032 elapsed
    (da1:mps0:0:1:0): WRITE(10). CDB: 2a 00 bd b3 18 58 00 08 00 00 length 1048576 SMID 158 Command timeout on target 1(0x000a) 60000 set, 60.451334556 elapsed
    (da1:mps0:0:1:0): WRITE(10). CDB: 2a 00 bd b3 30 58 00 08 00 00 length 1048576 SMID 121 Command timeout on target 1(0x000a) 60000 set, 60.423550204 elapsed
    (da1:mps0:0:1:0): WRITE(10). CDB: 2a 00 bd b3 28 58 00 08 00 00 length 1048576 SMID 269 Command timeout on target 1(0x000a) 60000 set, 60.432836445 elapsed
    (da1:mps0:0:1:0): WRITE(10). CDB: 2a 00 be 41 46 e8 00 00 78 00 length 61440 SMID 417 Command timeout on target 1(0x000a) 60000 set, 60.404959286 elapsed
    (da1:mps0:0:1:0): WRITE(10). CDB: 2a 00 bd b3 38 58 00 05 00 00 length 655360 SMID 404 Command timeout on target 1(0x000a) 60000 set, 60.414351577 elapsed
    (da0:mps0:0:0:0): WRITE(10). CDB: 2a 00 db af c1 d8 00 08 00 00 length 1048576 SMID 433 Command timeout on target 0(0x0009) 60000 set, 60.241136485 elapsed
    (da0:mps0:0:0:0): WRITE(10). CDB: 2a 00 f0 41 46 20 00 01 00 00 length 131072 SMID 364 Command timeout on target 0(0x0009) 60000 set, 60.222849270 elapsed
    (da0:mps0:0:0:0): WRITE(10). CDB: 2a 00 db af c9 d8 00 05 00 00 length 655360 SMID 322 Command timeout on target 0(0x0009) 60000 set, 60.232077823 elapsed
    (xpt0:mps0:0:2:0): SMID 9 task mgmt 0xfffffe00c51c9c18 timed out
mps0: Reinitializing controller
mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
    (da0:mps0:0:0:0): WRITE(10). CDB: 2a 00 db b8 18 d8 00 08 00 00 length 1048576 SMID 216 Command timeout on target 0(0x0009) 60000 set, 60.203261674 elapsed
mps0: Sending abort to target 0 for SMID 216
    (da0:mps0:0:0:0): WRITE(10). CDB: 2a 00 db b8 18 d8 00 08 00 00 length 1048576 SMID 216 Aborting command 0xfffffe00c51db240
    (da2:mps0:0:2:0): WRITE(10). CDB: 2a 00 db b8 28 d8 00 08 00 00 length 1048576 SMID 398 Command timeout on target 2(0x000b) 60000 set, 60.204955892 elapsed
mps0: Sending abort to target 2 for SMID 398
    (da2:mps0:0:2:0): WRITE(10). CDB: 2a 00 db b8 28 d8 00 08 00 00 length 1048576 SMID 398 Aborting command 0xfffffe00c51ea6d0
    (da0:mps0:0:0:0): WRITE(10). CDB: 2a 00 db b8 28 d8 00 08 00 00 length 1048576 SMID 535 Command timeout on target 0(0x0009) 60000 set, 60.186371608 elapsed

Code:
Extended brand string: "      Intel(R) Xeon(R) CPU E3-1230 V2 @ 3.30GHz"
CLFLUSH instruction cache line size: 8
Initial APIC ID: 1
Hyper threading siblings: 16


root@sirius[~]# zpool status sirius-b
  pool: sirius-b
 state: ONLINE
  scan: scrub repaired 768K in 1 days 04:32:21 with 0 errors on Mon May 22 04:32:21 2023
config:

    NAME                                            STATE     READ WRITE CKSUM
    sirius-b                                        ONLINE       0     0     0
      mirror-0                                      ONLINE       0     0     0
        gptid/78f45af7-083b-11eb-8d8d-00259077a4ba  ONLINE       0     0     0
        gptid/7d4362ef-083b-11eb-8d8d-00259077a4ba  ONLINE       0     0     0
      mirror-1                                      ONLINE       0     0     0
        gptid/877b5643-083b-11eb-8d8d-00259077a4ba  ONLINE       0     0     0
        gptid/89fe6a49-083b-11eb-8d8d-00259077a4ba  ONLINE       0     0     0

errors: No known data errors
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
And it doesn't happen if you roll it back?

Typically these are caused by:

1) Cabling issues
2) Overheating issues
3) Failing controller
 

D0peX

Cadet
Joined
May 23, 2023
Messages
4
And it doesn't happen if you roll it back?

Typically these are caused by:

1) Cabling issues
2) Overheating issues
3) Failing controller
Yup! That is the strange part.
I've been running 12 for a while now. No issues at all.

When running 13, 'mps' takes a dump and re-inits and does all kind of weird stuff.
Hence the argument i'm fairly certain it has nothing to do with the actual hardware. Which also makes it really difficult to troubleshoot...
I have no deep knowledge about what Truenas adds on top of FreeBSD. I could try to import the pool in fbsd 13, but i'm hesitant.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
There are at least tens of thousands of hosts running TrueNAS controllers and the 2008/2308 based controllers. If these were a general software problem, there would be a firestorm here in the forums.
 

D0peX

Cadet
Joined
May 23, 2023
Messages
4
There are at least tens of thousands of hosts running TrueNAS controllers and the 2008/2308 based controllers. If these were a general software problem, there would be a firestorm here in the forums.
That is why i find the issue really strange. It's a well known controller... On fbsd... Used in many production setups.
I have no clue if it's difficult to import the zfs pool on vanilla freebsd. But if that is the case, i could try that.
 

nihim

Cadet
Joined
Jul 6, 2023
Messages
1
I've been having the same issue with 13U1 and 13U5. Sometimes it takes a few hours to appear and crash TrueNAS. I'm just holding back on the stable 12U8.1 for now. There is also at least another thread which seems related: thread

Below is my card info:

root@freenas[~]# sas2flash -list
LSI Corporation SAS2 Flash Utility
Version 16.00.00.00 (2013.03.01)
Copyright (c) 2008-2013 LSI Corporation. All rights reserved

Adapter Selected is a LSI SAS: SAS2308_2(D1)

Controller Number : 0
Controller : SAS2308_2(D1)
PCI Address : 00:0b:00:00
SAS Address : 500605b-0-069d-b170
NVDATA Version (Default) : 14.01.00.06
NVDATA Version (Persistent) : 14.01.00.06
Firmware Product ID : 0x2214 (IT)
Firmware Version : 20.00.07.00
NVDATA Vendor : LSI
NVDATA Product ID : SAS9207-8i
BIOS Version : 07.39.02.00
UEFI BSD Version : N/A
FCODE Version : N/A
Board Name : SAS9217-8i
Board Assembly : H3-25566-00C
Board Tracer Number : SV31909386

Finished Processing Commands Successfully.
Exiting SAS2Flash.
root@freenas[~]#
 

mardel

Cadet
Joined
Aug 11, 2019
Messages
8
Looks like I have similar issue :) - "door bell handshake failed".

My setup was running smoothly for quite some time now on TrueNAS Core 12 + 2 HBAs 2008 + 8x8TB spinners. I decided to upgrade to v13 and upgrade to 12-U8.1 I believe and then to the latest 13 a week ago +-.
This is when "door bell handshake failed" started showing up and my TrueNAS VM is just abruptly shuts down.
I checked forum and seemed like HBA could be "dead" (though everyone says it is very hard to kill them). I found the culprit HBA and removed it physically. System worked fine for a day without much data movement. I was running badblocks on a new drive and was playing with mounting/unmounting SMB shares from the MAC when it happened again :). I disconnected new drive and tried to copy large file from the existing working pool (3-way mirror) and it crashed VM again.
BTW, when my first HBA "failed" I ordered new one and it should be soon delivered, but now wondering whether it is TrueNAS Core 13 is the problem. I wish I didn't upgrade my pool after updating to v13 :) so it will be hard to test it again on v12.

Anything I can provide to help diagnose it?

@D0peX , @nihim , I am curious whether you solved it somehow and made v13 work.
 
Top