System crash on SMB file copy

Status
Not open for further replies.

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Code:
Adapter Selected is a LSI SAS: SAS2008(B2)																				
																																	
		Controller Number			  : 0																						
		Controller					 : SAS2008(B2)																				
		PCI Address					: 00:0a:00:00																				
		SAS Address					: 5d4ae52-0-b191-xxxx																		
		NVDATA Version (Default)	   : 14.01.00.08																				
		NVDATA Version (Persistent)	: 14.01.00.08																				
		Firmware Product ID			: 0x2213 (IT)																				
		Firmware Version			   : 20.00.07.00																				
		NVDATA Vendor				  : LSI																						
		NVDATA Product ID			  : SAS9211-8i																				
		BIOS Version				   : N/A																						
		UEFI BSD Version			   : N/A																						
		FCODE Version				  : N/A																						
		Board Name					 : SAS9211-8i																				
		Board Assembly				 : N/A																						
		Board Tracer Number			: N/A																						
																																	
		Finished Processing Commands Successfully.																				
		Exiting SAS2Flash.																										


This is the output. I don't know if that looks normal or not for the SAS address string; I took out the last 4 digits. The other guides I looked at show a '500' prefix before the 16-digit address. It could be some kind of identifier.
Yes, this looks normal. You could flash a BIOS onto the card and get that info, but it isn't required when using IT mode.
 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
Code:
Adapter Selected is a LSI SAS: SAS2008(B2)																				
																																	
		Controller Number			  : 0																						
		Controller					 : SAS2008(B2)																				
		PCI Address					: 00:0a:00:00																				
		SAS Address					: 5d4ae52-0-b191-xxxx																		
		NVDATA Version (Default)	   : 14.01.00.08																				
		NVDATA Version (Persistent)	: 14.01.00.08																				
		Firmware Product ID			: 0x2213 (IT)																				
		Firmware Version			   : 20.00.07.00																				
		NVDATA Vendor				  : LSI																						
		NVDATA Product ID			  : SAS9211-8i																				
		BIOS Version				   : N/A																						
		UEFI BSD Version			   : N/A																						
		FCODE Version				  : N/A																						
		Board Name					 : SAS9211-8i																				
		Board Assembly				 : N/A																						
		Board Tracer Number			: N/A																						
																																	
		Finished Processing Commands Successfully.																				
		Exiting SAS2Flash.																										


This is the output. I don't know if that looks normal or not for the SAS address string; I took out the last 4 digits. The other guides I looked at show a '500' prefix before the 16-digit address. It could be some kind of identifier.
The SAS Address is just a 16-digit Hex number and yours fits the bill. I had gotten the impression that the 2008-series boards have a '500' prefix -- all of mine do -- but I don't know that it really matters. You can always open up the system and look at the HBA; they usually have stickers with the SAS Address on 'em.

This output looks good. I don't believe the HBA firmware is the problem, but of course that doesn't rule out an actual hardware failure.
 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
Suggesting a way forward... Here's what I would do:
  1. Assign the HBA its proper SAS Address (Obtained from sticker on the card. Hopefully...)
  2. Run MemTest for at least 24 hours.
  3. Run a CPU stress test (PassMark/Mersenne prime or similar) for at least 2 hours (be sure to run your fans at full speed during this test.)
  4. Run my disk burnin script (link above) on all of the disks in the system, using a tmux session for each disk.
I understand that you've had these disks in service and have found them to be reliable, but I'm thinking more along the lines of exercising the disk controller subsystem.
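The per-disk tmux step can be sketched like this. This is a dry run under assumptions not in the thread: FreeBSD-style disk names da0–da4 and a burn-in script named burnin.sh are placeholders, and the loop only prints each command. Drop the echo to actually launch the sessions.

```shell
# Dry run: print one tmux invocation per disk instead of executing it.
# Disk names (da0-da4) and burnin.sh are placeholders for your system.
for disk in da0 da1 da2 da3 da4; do
  echo tmux new-session -d -s "burnin-$disk" "./burnin.sh $disk"
done
```

Running each burn-in in its own detached tmux session means the tests keep going even if your SSH connection drops, and you can re-attach to any disk's session to check progress.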
 

hardlivinlow

Dabbler
Joined
May 31, 2017
Messages
30
Suggesting a way forward... Here's what I would do:
  1. Assign the HBA its proper SAS Address (Obtained from sticker on the card. Hopefully...)
  2. Run MemTest for at least 24 hours.
  3. Run a CPU stress test (PassMark/Mersenne prime or similar) for at least 2 hours (be sure to run your fans at full speed during this test.)
  4. Run my disk burnin script (link above) on all of the disks in the system, using a tmux session for each disk.
I understand that you've had these disks in service and have found them to be reliable, but I'm thinking more along the lines of exercising the disk controller subsystem.

The HBA has its proper SAS address as far as I know. It was obtained when I flashed the card with the "megacli.exe -AdpAllInfo -aAll -page 20" command. No sticker was present.

I'm running the burn-in scripts now on all 5 drives. No issues yet. I'll report back after all of that.
 

droeders

Contributor
Joined
Mar 21, 2016
Messages
179
Late to the thread, but the following makes me think this could be network related:


May 31 20:19:11 freenas savecore: reboot after panic: sbdrop
May 31 20:19:11 freenas savecore: writing compressed core to /data/crash/textdump.tar.0.gz



Can you try some extended iperf testing with the same NIC and see if the panic still occurs?

Also, is the crash file referenced above (/data/crash/textdump.tar.0.gz) present on the machine still?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Late to the thread, but the following makes me think this could be network related:


May 31 20:19:11 freenas savecore: reboot after panic: sbdrop
May 31 20:19:11 freenas savecore: writing compressed core to /data/crash/textdump.tar.0.gz



Can you try some extended iperf testing with the same NIC and see if the panic still occurs?

Also, is the crash file referenced above (/data/crash/textdump.tar.0.gz) present on the machine still?
Why would it happen with a dd?
 

droeders

Contributor
Joined
Mar 21, 2016
Messages
179
Why would it happen with a dd?

I certainly agree that if it happens with just dd, it's unlikely to be network related.

That said, I'm still not sure that your dd test has even been run by the OP. All I see is the dd command to clear the MBR.
 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
I certainly agree that if it happens with just dd, it's unlikely to be network related.

That said, I'm still not sure that your dd test has even been run by the OP. All I see is the dd command to clear the MBR.
Good point!

@hardlivinlow : Before you proceed with major hardware testing, it would be easy to fire up iperf and let it run for a good, long while to see if you get a panic. If you do, that narrows it down to a networking problem.
 

hardlivinlow

Dabbler
Joined
May 31, 2017
Messages
30
I'm pulling the crash dumps now. Not sure how to run a dd test, and yes, I will run an iperf test. I would think if it was a major hardware issue I would have already gotten a crash running the burn-in script.
 

hardlivinlow

Dabbler
Joined
May 31, 2017
Messages
30
Late to the thread, but the following makes me think this could be network related:


May 31 20:19:11 freenas savecore: reboot after panic: sbdrop
May 31 20:19:11 freenas savecore: writing compressed core to /data/crash/textdump.tar.0.gz



Can you try some extended iperf testing with the same NIC and see if the panic still occurs?

Also, is the crash file referenced above (/data/crash/textdump.tar.0.gz) present on the machine still?

I have 4 files. Here is the newest one, from running on the Intel NICs. I may have one from when I was using the Broadcom ones also.

http://www.filedropper.com/textdumptar3
http://www.filedropper.com/textdumptar0
 

hardlivinlow

Dabbler
Joined
May 31, 2017
Messages
30
Ran an iperf test for 30 minutes, client to server. No issues.

Code:
Client connecting to 172.16.10.15, TCP port 5001
TCP window size:  128 KByte
------------------------------------------------------------
[  3] local 172.16.10.105 port 58638 connected with 172.16.10.15 port 5001
[ ID] Interval	   Transfer	 Bandwidth
[  3]  0.0-1800.0 sec   197 GBytes   939 Mbits/sec

 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
Ran an iperf test for 30 minutes. No issues.

Code:
Client connecting to 172.16.10.15, TCP port 5001
TCP window size:  128 KByte
------------------------------------------------------------
[  3] local 172.16.10.105 port 58638 connected with 172.16.10.15 port 5001
[ ID] Interval	   Transfer	 Bandwidth
[  3]  0.0-1800.0 sec   197 GBytes   939 Mbits/sec

Did you run it both ways? i.e., as both server and client?
 

droeders

Contributor
Joined
Mar 21, 2016
Messages
179
Did you run it both ways? i.e., as both server and client?

Good idea to run in both directions. The panic dumps point to the network stack, but if it happens on both Intel and Broadcom I would think it's something else.

I would also try the test that @Ericloewe suggested originally and let it run for quite a while (adjust the of= path for your environment):

dd if=/dev/random of=/path/to/pool/filename
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Good idea to run in both directions. The panic dumps point to the network stack, but if it happens on both Intel and Broadcom I would think it's something else.

I would also try the test that @Ericloewe suggested originally and let it run for quite a while (adjust the of= path for your environment):

dd if=/dev/random of=/path/to/pool/filename
I don't think I would use random; it's very slow. /dev/zero is much faster.

 

droeders

Contributor
Joined
Mar 21, 2016
Messages
179
I don't think I would use random; it's very slow. /dev/zero is much faster.

Good point - definitely faster and will stress the disk subsystem more.

I only get about 35MB/s from /dev/random on a (mostly idle) BSD machine, and about triple that from /dev/zero.
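For a rough sense of the gap, here's a small-scale version of that comparison. Assumptions not from the thread: /tmp stands in for a path on the pool, 64 MiB stands in for a long run, and /dev/urandom is used for portability (on FreeBSD, /dev/random and /dev/urandom are the same non-blocking device).

```shell
# dd reports elapsed time and throughput on completion (on stderr).
# Point of= at a file on the pool for a real disk-subsystem test.
dd if=/dev/zero    of=/tmp/dd_zero.bin   bs=1M count=64
dd if=/dev/urandom of=/tmp/dd_random.bin bs=1M count=64
```

Comparing the two reported rates shows how much the random-number generator, rather than the disks, limits a /dev/random-based write test.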
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I don't think I would use random; it's very slow. /dev/zero is much faster.
But zeros compress really well and tend to mask some issues. ;)

Once the file has grown, it can be used for faster tests.
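A sketch of that reuse, with placeholders not from the thread: /tmp stands in for the pool path, and a 16 MiB file stands in for the grown one. Once the file exists, repeated read passes can be run without regenerating random data.

```shell
# One-time setup: a stand-in for the file grown on the pool.
dd if=/dev/urandom of=/tmp/pooltest.bin bs=1M count=16
# Reuse it for read tests; of=/dev/null discards the data, so only reads are exercised.
dd if=/tmp/pooltest.bin of=/dev/null bs=1M
```

On a real ZFS pool, keep in mind that the ARC may serve recently written data from RAM, so a freshly written file won't necessarily hit the disks on read-back.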
 

hardlivinlow

Dabbler
Joined
May 31, 2017
Messages
30
I will test dd with both zero and random to get accurate results.

Reverse iperf, 30 minutes, server to client:
Code:
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size:  128 KByte
------------------------------------------------------------
[  4] local 172.16.10.105 port 5001 connected with 172.16.10.15 port 45286
[ ID] Interval	   Transfer	 Bandwidth
[  4]  0.0-1800.2 sec   184 GBytes   880 Mbits/sec



Memtest86 ran for 7 hours. No errors.

IMG_0232.jpg



Did more testing on file sizes when copying. Small files, 600MB or below, will copy to the share quickly. Anything over a gig in size will panic-crash the server almost immediately once the copy is started.

Panic crash picture snapped last night while testing copying.

IMG_0221.jpg
 

hardlivinlow

Dabbler
Joined
May 31, 2017
Messages
30
dd from random has been running to the pool for 2 hours. Current size of the file is 240GB. No crashes yet.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
What NIC do you have? This sounds like a NIC problem.

 

hardlivinlow

Dabbler
Joined
May 31, 2017
Messages
30
Well, I used the onboard Broadcom NetXtreme II NICs and had crashes, so I disabled them and installed an Intel PCIe 1000 4-port gigabit NIC. So, same issue with 2 different brands of NICs.
 