FreeNAS 11.3 SAS3008 probable driver issue

Joined
Feb 17, 2020
Messages
4
I'm having similar issues with my mirrored SSDs on HBA cards; the non-mirrored SSD doesn't appear to be affected. I'm virtualizing FreeNAS with two HBA cards passed through, and the mirrored SSDs sit on separate HBA cards for redundancy. Problems started occurring after the update to 11.3. Unfortunately I've already upgraded my pools, so there's no going back.

FreeNAS 11.3
2 x SAS 9300-8i - SAS3008-IT - 16.00.01
2 x Crucial MX500 2.5" 500GB
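
For anyone comparing notes, the firmware and driver versions above can be double-checked from the FreeNAS shell. A minimal sketch, assuming the stock mpr(4) driver and that the boot messages are still in the dmesg buffer:

Code:
# Firmware and driver version as reported by the mpr driver at attach
dmesg | grep -i "mpr.*firmware"

# Which disks sit behind which controller (scbus -> mpr mapping)
camcontrol devlist -v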

Logs:
-----------------------------------------------------------
Feb 17 01:44:29 HOSTNAME (pass14:mpr1:0:5:0): LOG SENSE. CDB: 4d 00 2f 00 00 00 00 00 40 00 length 64 SMID 150 Aborting command 0xfffffe000153c7a0
Feb 17 01:44:29 HOSTNAME mpr1: Sending reset from mprsas_send_abort for target ID 5
Feb 17 01:44:29 HOSTNAME (pass15:mpr1:0:6:0): LOG SENSE. CDB: 4d 00 0d 00 00 00 00 00 40 00 length 64 SMID 942 Aborting command 0xfffffe0001583a20
Feb 17 01:44:29 HOSTNAME mpr1: Sending reset from mprsas_send_abort for target ID 6
Feb 17 01:44:32 HOSTNAME mpr1: mprsas_action_scsiio: Freezing devq for target ID 6
Feb 17 01:44:32 HOSTNAME (da15:mpr1:0:6:0): WRITE(10). CDB: 2a 00 04 e4 b3 98 00 01 00 00
Feb 17 01:44:32 HOSTNAME (da15:mpr1:0:6:0): CAM status: CAM subsystem is busy
Feb 17 01:44:32 HOSTNAME (da15:mpr1:0:6:0): Retrying command
Feb 17 01:44:32 HOSTNAME mpr1: mprsas_action_scsiio: Freezing devq for target ID 6
Feb 17 01:44:32 HOSTNAME (da15:mpr1:0:6:0): WRITE(10). CDB: 2a 00 04 e4 b3 98 00 01 00 00
Feb 17 01:44:32 HOSTNAME (da15:mpr1:0:6:0): CAM status: CAM subsystem is busy
Feb 17 01:44:32 HOSTNAME (da15:mpr1:0:6:0): Retrying command
Feb 17 01:44:33 HOSTNAME mpr1: mprsas_action_scsiio: Freezing devq for target ID 6
Feb 17 01:44:33 HOSTNAME (da15:mpr1:0:6:0): WRITE(10). CDB: 2a 00 04 e4 b3 98 00 01 00 00
Feb 17 01:44:33 HOSTNAME (da15:mpr1:0:6:0): CAM status: CAM subsystem is busy
Feb 17 01:44:33 HOSTNAME (da15:mpr1:0:6:0): Retrying command
Feb 17 01:44:33 HOSTNAME mpr1: mprsas_action_scsiio: Freezing devq for target ID 6
Feb 17 01:44:33 HOSTNAME (da15:mpr1:0:6:0): WRITE(10). CDB: 2a 00 04 e4 b3 98 00 01 00 00
Feb 17 01:44:33 HOSTNAME (da15:mpr1:0:6:0): CAM status: CAM subsystem is busy
Feb 17 01:44:33 HOSTNAME (da15:mpr1:0:6:0): Retrying command
Feb 17 01:44:34 HOSTNAME mpr1: mprsas_action_scsiio: Freezing devq for target ID 6
Feb 17 01:44:34 HOSTNAME (da15:mpr1:0:6:0): WRITE(10). CDB: 2a 00 04 e4 b3 98 00 01 00 00
Feb 17 01:44:34 HOSTNAME (da15:mpr1:0:6:0): CAM status: CAM subsystem is busy
Feb 17 01:44:34 HOSTNAME (da15:mpr1:0:6:0): Error 5, Retries exhausted
Feb 17 01:44:34 HOSTNAME mpr1: mprsas_action_scsiio: Freezing devq for target ID 6
Feb 17 01:44:34 HOSTNAME (da15:mpr1:0:6:0): WRITE(10). CDB: 2a 00 19 13 af 20 00 00 d0 00
Feb 17 01:44:34 HOSTNAME (da15:mpr1:0:6:0): CAM status: CAM subsystem is busy
Feb 17 01:44:34 HOSTNAME (da15:mpr1:0:6:0): Retrying command
Feb 17 01:44:53 HOSTNAME collectd[6369]: nut plugin: nut_read: upscli_list_start (1500PFCLCD) failed: Driver not connected
Feb 17 01:45:23 HOSTNAME collectd[6369]: nut plugin: nut_read: upscli_list_start (1500PFCLCD) failed: Driver not connected
Feb 17 01:45:24 HOSTNAME collectd[6369]: Traceback (most recent call last):
File "/usr/local/lib/collectd_pyplugins/disktemp.py", line 66, in read
temperatures = c.call('disk.temperatures', self.disks, self.powermode, self.smartctl_args)
File "/usr/local/lib/collectd_pyplugins/disktemp.py", line 66, in read
temperatures = c.call('disk.temperatures', self.disks, self.powermode, self.smartctl_args)
File "/usr/local/lib/python3.7/site-packages/middlewared/client/client.py", line 500, in call
raise CallTimeout("Call timeout")
middlewared.client.client.CallTimeout: Call timeout
Feb 17 01:45:53 HOSTNAME collectd[6369]: nut plugin: nut_read: upscli_list_start (1500PFCLCD) failed: Driver not connected
-----------------------------------------------------------
 

Ctiger256

Dabbler
Joined
Nov 18, 2016
Messages
40
OK, I've pulled my machine completely out of production for now. If anyone can think of any tests I can run to help troubleshoot or identify the problem here, I'm happy to do so. If there's no response, I'll downgrade to 11.2, redo the pool, and see if that solves the issue. (If it does, though, the machine will go back into production at that point and I won't be able to test 11.3 on it...)
 

Machupo

Cadet
Joined
Feb 14, 2020
Messages
6
Nothing useful to add, just wanted to add my name to the list, lol.

I am also getting the random reboots on about a 1.5 day cycle. Nothing untoward in /var/log, just the boot sequence. Behavior was not present with 11.2U6 but started as soon as I installed 11.3.

I am also on a 9305-24i and have a full flash pool (3 vdevs of 8x2TB SSDs in Z2).
 

Ctiger256

Dabbler
Joined
Nov 18, 2016
Messages
40
Have people been able to observe what activity triggers this? Some possible patterns that seem to be emerging in the thread are:

1. SSDs on the HBA
2. mirrored vdevs
3. accessing as an iSCSI target for VMs

Anything else? Or anyone able to reproduce this in some other configuration?

I've been trying to reproduce it with straight reads/writes from an SMB share and haven't had any luck, so for me it's only happened with VMs on an iSCSI target. That could be because of random vs. sequential access rather than the specific protocol, of course.
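
If it is random vs. sequential access, one rough way to separate the two without involving iSCSI might be something like the sketch below. The dataset path is a placeholder, and fio is not installed by default, so the random-I/O part is only indicative:

Code:
# Sequential write load against a dataset on the SSD pool
# (/mnt/ssdpool/test is a placeholder; urandom is used because
#  /dev/zero mostly compresses away on a compressed dataset)
dd if=/dev/urandom of=/mnt/ssdpool/test/seq.bin bs=1m count=20000

# For a random-I/O pattern closer to what VMs on iSCSI generate,
# something like fio would be needed (available as a package):
# fio --name=randw --filename=/mnt/ssdpool/test/rand.bin --size=10G \
#     --rw=randwrite --bs=16k --iodepth=16 --ioengine=posixaio --runtime=600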
 

willuz

Cadet
Joined
Feb 12, 2020
Messages
8
Have people been able to observe what activity triggers this? Some possible patterns that seem to be emerging in the thread are:

1. SSDs on the HBA
2. mirrored vdevs

I have verified that no activity is required. I can create an SSD pool of three Z2 vdevs, not access it at all, and it will still crash within an hour. As soon as I detach the pool, the problem goes away. So far the SSD cache and log disks have not been an issue.
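
For what it's worth, a simple way to confirm the pool really is idle while waiting for the crash is to leave zpool iostat running (the pool name below is a placeholder):

Code:
# Print per-vdev I/O statistics for the SSD pool every 5 seconds
zpool iostat -v ssdpool 5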
 

MikeyG

Patron
Joined
Dec 8, 2017
Messages
442
Similar to @willuz: no particular activity pattern is required or seems to trigger a restart; many restarts happen in the middle of the night, when the pool would be doing nothing but writing log data. The SSD pool on my HBA does not have mirror vdevs or any iSCSI targets on it, purely RAIDZ2 for storage. I have not yet tried detaching the pool to see how that changes things, though, as I use it fairly regularly.
 
Joined
Feb 17, 2020
Messages
4
This may be premature, but disabling SMART (under Pools -> Disks) has been a fix so far, with no reboots in the last 1.5 days. SSDs report different SMART data than spinning disks, and I was getting some other weird logs, so I figured I'd give it a try. Previously I could trigger the issue with I/O-intensive reads/writes.

Feb 17 01:57:21 NASWOLF smartd[7756]: Device: /dev/da15 [SAT], WARNING: This firmware returns bogus raw values in attribute 197
Feb 17 01:57:21 NASWOLF smartd[7756]: Device: /dev/da3 [SAT], WARNING: This firmware returns bogus raw values in attribute 197
Feb 17 01:57:25 NASWOLF smartd[7910]: Device: /dev/da15 [SAT], WARNING: This firmware returns bogus raw values in attribute 197
Feb 17 01:57:25 NASWOLF smartd[7910]: Device: /dev/da3 [SAT], WARNING: This firmware returns bogus raw values in attribute 197
Feb 17 01:57:28 NASWOLF smartd[8067]: Device: /dev/da15 [SAT], WARNING: This firmware returns bogus raw values in attribute 197
Feb 17 01:57:28 NASWOLF smartd[8067]: Device: /dev/da3 [SAT], WARNING: This firmware returns bogus raw values in attribute 197
Feb 17 01:57:31 NASWOLF smartd[8222]: Device: /dev/da15 [SAT], WARNING: This firmware returns bogus raw values in attribute 197
Feb 17 01:57:31 NASWOLF smartd[8222]: Device: /dev/da3 [SAT], WARNING: This firmware returns bogus raw values in attribute 197
Feb 17 01:57:34 NASWOLF smartd[8379]: Device: /dev/da15 [SAT], WARNING: This firmware returns bogus raw values in attribute 197
Feb 17 01:57:34 NASWOLF smartd[8379]: Device: /dev/da3 [SAT], WARNING: This firmware returns bogus raw values in attribute 197
Feb 17 01:57:37 NASWOLF smartd[8533]: Device: /dev/da15 [SAT], WARNING: This firmware returns bogus raw values in attribute 197
Feb 17 01:57:38 NASWOLF smartd[8533]: Device: /dev/da3 [SAT], WARNING: This firmware returns bogus raw values in attribute 197
Feb 17 01:57:41 NASWOLF smartd[8687]: Device: /dev/da15 [SAT], WARNING: This firmware returns bogus raw values in attribute 197
Feb 17 01:57:41 NASWOLF smartd[8687]: Device: /dev/da3 [SAT], WARNING: This firmware returns bogus raw values in attribute 197
Feb 17 01:57:44 NASWOLF smartd[8844]: Device: /dev/da15 [SAT], WARNING: This firmware returns bogus raw values in attribute 197
Feb 17 01:57:44 NASWOLF smartd[8844]: Device: /dev/da3 [SAT], WARNING: This firmware returns bogus raw values in attribute 197
Feb 17 01:57:48 NASWOLF smartd[9001]: Device: /dev/da15 [SAT], WARNING: This firmware returns bogus raw values in attribute 197
Feb 17 01:57:48 NASWOLF smartd[9001]: Device: /dev/da3 [SAT], WARNING: This firmware returns bogus raw values in attribute 197
Feb 17 01:57:51 NASWOLF smartd[9155]: Device: /dev/da15 [SAT], WARNING: This firmware returns bogus raw values in attribute 197
Feb 17 01:57:51 NASWOLF smartd[9155]: Device: /dev/da3 [SAT], WARNING: This firmware returns bogus raw values in attribute 197
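
For anyone who wants to look at the same attribute from the shell before toggling anything, a minimal sketch using the smartmontools that ship with FreeNAS (da3 and da15 are the devices from the log above):

Code:
# Dump the vendor attribute table; 197 is Current_Pending_Sector
smartctl -A /dev/da3
smartctl -A /dev/da15

# SMART can also be switched off on the device itself,
# independently of the GUI setting
smartctl -s off /dev/da3
smartctl -s off /dev/da15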
 
Joined
Feb 17, 2020
Messages
4
This may be premature, but disabling SMART (under Pools -> Disks) has been a fix so far, with no reboots in the last 1.5 days. SSDs report different SMART data than spinning disks, and I was getting some other weird logs, so I figured I'd give it a try. Previously I could trigger the issue with I/O-intensive reads/writes.

I should be clear... I disabled SMART only for the devices that were reporting issues. These happened to be the two Crucial SSDs that are mirrored. My other SSD, a single drive, never came up in any of the logs and still has SMART enabled.
 

MikeyG

Patron
Joined
Dec 8, 2017
Messages
442
@adeeronafoggylake I think you may have found something. I disabled SMART only on my Crucial MX500s and haven't had the HBA or the system reset since. It's been about 24 hours now, and before that I was rebooting at least 3-4 times a day. Not a great solution, but I would think it gives the devs something to work with. If it remains stable, I will update the ticket I submitted to them.
 

willuz

Cadet
Joined
Feb 12, 2020
Messages
8
I can also confirm that disabling SMART on SSDs works. It's a dangerous workaround long term, but it does narrow down the issue for the devs to fix.
 

MikeyG

Patron
Joined
Dec 8, 2017
Messages
442
@Kris Moore I hate to tag you in a thread, but this seems like a pretty critical stability issue that multiple people are experiencing on rather standard hardware. I have filed ticket https://jira.ixsystems.com/browse/NAS-105053 but after 11 days I have yet to see any indication that it's being looked into. It would be nice to see some activity on there.
 

palerstam

Cadet
Joined
Feb 24, 2020
Messages
1
Hi, I'm also experiencing the same behavior: a reboot/shutdown after 1-3 days of uptime.
My HW:
ASUS P11C-C/4L motherboard (latest bios)
Broadcom SAS 9305-24i HBA (latest fw)
Samsung SSD drives and Seagate spin-disks

Also got this kernel log message:
> Failed to fully fault in a core file segment at VA 0x819453000 with
> size 0x16000 to be written at offset 0x5f5e000 for process smbd pid
> 99148 (smbd), uid 0: exited on signal 6 (core dumped)

The following logs from IPMI:

Watchdog2 sensor of type watchdog_2 logged a hard reset
CPU_CATERR sensor of type processor logged a IERR
 

wolfman

Dabbler
Joined
Apr 11, 2018
Messages
13
Same issue here! Since the update to 11.3 FreeNAS reboots approx. every 1.5 days!

System:
- Dell R630 with LSI SAS3008 (Perc HBA330 Mini Mono 12G)
- 5x SSD RaidZ2 pool; no spinning disks

The following errors appear every time in the iDRAC log:
Code:
Mon Feb 24 2020 12:12:49    A fatal error was detected on a component at bus 0 device 1 function 0.   
Mon Feb 24 2020 12:12:49    A fatal error was detected on a component at bus 2 device 0 function 0.
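
If it helps correlate, the bus/device/function in those iDRAC entries can usually be matched to a FreeBSD device with pciconf; a sketch, assuming "bus 2 device 0 function 0" shows up under the usual pci0:2:0:0 selector:

Code:
# Map the iDRAC bus/device/function to a FreeBSD driver instance;
# the selector format is pci<domain>:<bus>:<device>:<function>
pciconf -lv | grep -A 4 "pci0:2:0:0"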


I have also disabled SMART on the SSDs now!
 

incaloid

Cadet
Joined
Feb 25, 2020
Messages
1
Hi all,
Maybe the following observation helps.

I'm running FreeNAS on two servers:

S1) ASRock Rack X470D4U2-2T motherboard, Ryzen 1500X processor, 32 GB ECC RAM, LSI HBA 9211-8i; FreeNAS installed on bare metal.
S2) ASRock Rack X470D4U motherboard, Ryzen 1500X processor, 64 GB ECC RAM, LSI HBA 9211-8i; FreeNAS installed as a VM on Proxmox with all data disks behind the HBA passed through.

Both servers had been stable forever under FreeNAS 11.2. Last weekend I moved both to 11.3. Since then, S1 has been freezing about once a day, but S2, the one running as a VM on Proxmox, is still stable. SMART is enabled on both systems.
 

MikeyG

Patron
Joined
Dec 8, 2017
Messages
442
As an experiment, I tried disabling SMART for each drive within the FreeNAS GUI, then re-enabling it within the OS using smartctl -s on /dev/da0 (and so on for each drive number). This lets me run scripts that make use of SMART data and run manual SMART tests; however, tests scheduled from the GUI don't run as scheduled. The drives show as SMART-disabled in the GUI even though SMART is enabled in FreeBSD. It has also been stable this way for the last 24 hours.

Could someone else try this and test? If turning it on through FreeBSD is actually stable, to me this would point to some interaction between the GUI and how it calls for SMART data, and not necessarily a driver issue.
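
For anyone repeating the experiment, a minimal sketch of the re-enable step (the da numbers are placeholders, and the device-level setting may need to be re-applied after a reboot):

Code:
# Re-enable SMART on the devices themselves after disabling it in the GUI
for d in da0 da1 da2 da3; do
    smartctl -s on /dev/$d
done

# Spot-check that the drive reports SMART as enabled again
smartctl -i /dev/da0 | grep -i "SMART support"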
 

Reece Johnson

Dabbler
Joined
Dec 2, 2015
Messages
13
Chiming in here to combine information. I'm experiencing similar issues running 11.3 with the 3008 built into my Supermicro X10SRH-CLN4F. It was stable on 11.2; this is new behavior with 11.3. Disabling SMART to see if that helps.
 

jenksdrummer

Patron
Joined
Jun 7, 2011
Messages
250
> I have a similar issue with version 11.3 with a SAS3008 PCI-Express Fusion-MPT SAS-3. I'm running firmware version 15.3IT. It crashes every 2 to 3 days. I did not have any issues running 11.2. Below is the SuperMicro Server Info:
>
> Board Product Name: X11SPH-nCTF
> Board Part Num: X11SPH-nCTF
> Product PartNum: SSG-5029P-E1CTR12L
> Machine model: Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz

I have the exact system; same CPU even...lol.

Running 11.3U1, mine seems to be holding up without issue. I'm using the onboard SAS3008 controller, which I believe comes in IR mode, not IT mode; I've opted not to flash that firmware over. I did a fresh reload of the OS rather than an upgrade, but it's been up for a few weeks and I've been beating the crap out of it with VMs lately.
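
For anyone unsure whether their onboard 3008 is on IR or IT firmware, sas3flash will report it; a sketch, assuming the sas3flash utility is present on the system (it can be fetched from Broadcom if not):

Code:
# List all SAS3 controllers the utility can see
sas3flash -listall

# Detailed info for controller 0; the Firmware Product ID line
# usually shows (IT) or (IR) depending on the firmware flashed
sas3flash -list -c 0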
 

MikeyG

Patron
Joined
Dec 8, 2017
Messages
442
@jenksdrummer Do you have any SSDs attached to the 3008 controller?
 

RickM

Cadet
Joined
Jan 24, 2020
Messages
3
Thanks for the reply. I upgraded to 11.3U1 as well, but the server kept crashing almost every day. About a week ago I tried disabling S.M.A.R.T. on all the drives, and my system has not crashed since. All my disks are SSDs; I've listed them below. I know SMART is a useful feature because it can predict that a drive is going to fail before it does, but at this point I'm just going to rely on RAIDZ2 + a hot spare.


Boot - RAID 1: INTEL SSDSC2KG240G8 223.7 GiB
Storage - ZFS RAIDZ2: ATA Micron_5200_MTFD 1.75TiB
 