System crash on SMB file copy

Status
Not open for further replies.

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Code:
Adapter Selected is a LSI SAS: SAS2008(B2)																				
																																	
		Controller Number			  : 0																						
		Controller					 : SAS2008(B2)																				
		PCI Address					: 00:0a:00:00																				
		SAS Address					: 5d4ae52-0-b191-xxxx																		
		NVDATA Version (Default)	   : 14.01.00.08																				
		NVDATA Version (Persistent)	: 14.01.00.08																				
		Firmware Product ID			: 0x2213 (IT)																				
		Firmware Version			   : 20.00.07.00																				
		NVDATA Vendor				  : LSI																						
		NVDATA Product ID			  : SAS9211-8i																				
		BIOS Version				   : N/A																						
		UEFI BSD Version			   : N/A																						
		FCODE Version				  : N/A																						
		Board Name					 : SAS9211-8i																				
		Board Assembly				 : N/A																						
		Board Tracer Number			: N/A																						
																																	
		Finished Processing Commands Successfully.																				
		Exiting SAS2Flash.																										


This is the output. I don't know if that looks normal or not for the SAS address string; I took out the last 4 digits. The other guides I looked at show a '500' prefix before the 16-digit address. It could be some kind of identifier.
Yes, this looks normal. You could flash a BIOS onto the card and get that info, but it isn't required when using IT mode.
 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
Code:
Adapter Selected is a LSI SAS: SAS2008(B2)																				
																																	
		Controller Number			  : 0																						
		Controller					 : SAS2008(B2)																				
		PCI Address					: 00:0a:00:00																				
		SAS Address					: 5d4ae52-0-b191-xxxx																		
		NVDATA Version (Default)	   : 14.01.00.08																				
		NVDATA Version (Persistent)	: 14.01.00.08																				
		Firmware Product ID			: 0x2213 (IT)																				
		Firmware Version			   : 20.00.07.00																				
		NVDATA Vendor				  : LSI																						
		NVDATA Product ID			  : SAS9211-8i																				
		BIOS Version				   : N/A																						
		UEFI BSD Version			   : N/A																						
		FCODE Version				  : N/A																						
		Board Name					 : SAS9211-8i																				
		Board Assembly				 : N/A																						
		Board Tracer Number			: N/A																						
																																	
		Finished Processing Commands Successfully.																				
		Exiting SAS2Flash.																										


This is the output. I don't know if that looks normal or not for the SAS address string; I took out the last 4 digits. The other guides I looked at show a '500' prefix before the 16-digit address. It could be some kind of identifier.
The SAS Address is just a 16-digit Hex number and yours fits the bill. I had gotten the impression that the 2008-series boards have a '500' prefix -- all of mine do -- but I don't know that it really matters. You can always open up the system and look at the HBA; they usually have stickers with the SAS Address on 'em.

This output looks good. I don't believe the HBA firmware is the problem, but of course that doesn't rule out an actual hardware failure.
 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
Suggesting a way forward... Here's what I would do:
  1. Assign the HBA its proper SAS Address (Obtained from sticker on the card. Hopefully...)
  2. Run MemTest for at least 24 hours.
  3. Run a CPU stress test (PassMark/Mersenne prime or similar) for at least 2 hours (be sure to run your fans at full speed during this test.)
  4. Run my disk burnin script (link above) on all of the disks in the system, using a tmux session for each disk.
I understand that you've had these disks in service and have found them to be reliable, but I'm thinking more along the lines of exercising the disk controller subsystem.
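The per-disk tmux step can be sketched like this. This is a dry run under assumptions not in the thread: FreeBSD-style disk names da0–da4 and a burn-in script named burnin.sh are placeholders, and the loop only prints each command. Drop the echo to actually launch the sessions.

```shell
# Dry run: print one tmux invocation per disk instead of executing it.
# Disk names (da0-da4) and burnin.sh are placeholders for your system.
for disk in da0 da1 da2 da3 da4; do
  echo tmux new-session -d -s "burnin-$disk" "./burnin.sh $disk"
done
```

Running each burn-in in its own detached tmux session means the tests keep going even if your SSH connection drops, and you can re-attach to any disk's session to check progress.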
 

hardlivinlow

Dabbler
Joined
May 31, 2017
Messages
30
Suggesting a way forward... Here's what I would do:
  1. Assign the HBA its proper SAS Address (Obtained from sticker on the card. Hopefully...)
  2. Run MemTest for at least 24 hours.
  3. Run a CPU stress test (PassMark/Mersenne prime or similar) for at least 2 hours (be sure to run your fans at full speed during this test.)
  4. Run my disk burnin script (link above) on all of the disks in the system, using a tmux session for each disk.
I understand that you've had these disks in service and have found them to be reliable, but I'm thinking more along the lines of exercising the disk controller subsystem.

The HBA has its proper SAS address as far as I know. It was obtained when I flashed the card with the "megacli.exe -AdpAllInfo -aAll -page 20" command. No sticker was present.

I'm running the burn-in scripts now on all 5 drives. No issues yet. I'll report back after all of that.
 

droeders

Contributor
Joined
Mar 21, 2016
Messages
179
Late to the thread, but the following makes me think this could be network related:


May 31 20:19:11 freenas savecore: reboot after panic: sbdrop
May 31 20:19:11 freenas savecore: writing compressed core to /data/crash/textdump.tar.0.gz



Can you try some extended iperf testing with the same NIC and see if the panic still occurs?

Also, is the crash file referenced above (/data/crash/textdump.tar.0.gz) present on the machine still?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Late to the thread, but the following makes me think this could be network related:


May 31 20:19:11 freenas savecore: reboot after panic: sbdrop
May 31 20:19:11 freenas savecore: writing compressed core to /data/crash/textdump.tar.0.gz



Can you try some extended iperf testing with the same NIC and see if the panic still occurs?

Also, is the crash file referenced above (/data/crash/textdump.tar.0.gz) present on the machine still?
Why would it happen with a dd?
 

droeders

Contributor
Joined
Mar 21, 2016
Messages
179
Why would it happen with a dd?

I certainly agree that if it happens with just dd, it's unlikely to be network related.

That said, I'm still not sure that your dd test has even been run by the OP. All I see is the dd command to clear the MBR.
 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
I certainly agree that if it happens with just dd, it's unlikely to be network related.

That said, I'm still not sure that your dd test has even been run by the OP. All I see is the dd command to clear the MBR.
Good point!

@hardlivinlow : Before you proceed with major hardware testing, it would be easy to fire up iperf and let it run for a good, long while to see if you get a panic. If you do, that narrows it down to a networking problem.
 

hardlivinlow

Dabbler
Joined
May 31, 2017
Messages
30
I'm pulling the crash dumps now. Not sure how to run a dd test, and yes, I will run an iperf test. I would think if it was a major hardware issue I would have already gotten a crash running the burn-in script.
 

hardlivinlow

Dabbler
Joined
May 31, 2017
Messages
30
Late to the thread, but the following makes me think this could be network related:


May 31 20:19:11 freenas savecore: reboot after panic: sbdrop
May 31 20:19:11 freenas savecore: writing compressed core to /data/crash/textdump.tar.0.gz



Can you try some extended iperf testing with the same NIC and see if the panic still occurs?

Also, is the crash file referenced above (/data/crash/textdump.tar.0.gz) present on the machine still?

I have 4 files. Here is the newest one, from running on the Intel NICs. I may have one from when I was using the Broadcom ones also.

http://www.filedropper.com/textdumptar3
http://www.filedropper.com/textdumptar0
 

hardlivinlow

Dabbler
Joined
May 31, 2017
Messages
30
Ran an iperf test for 30 minutes, client to server. No issues.

Code:
Client connecting to 172.16.10.15, TCP port 5001
TCP window size:  128 KByte
------------------------------------------------------------
[  3] local 172.16.10.105 port 58638 connected with 172.16.10.15 port 5001
[ ID] Interval	   Transfer	 Bandwidth
[  3]  0.0-1800.0 sec   197 GBytes   939 Mbits/sec

 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
Ran an iperf test for 30 minutes. No issues.

Code:
Client connecting to 172.16.10.15, TCP port 5001
TCP window size:  128 KByte
------------------------------------------------------------
[  3] local 172.16.10.105 port 58638 connected with 172.16.10.15 port 5001
[ ID] Interval	   Transfer	 Bandwidth
[  3]  0.0-1800.0 sec   197 GBytes   939 Mbits/sec

Did you run it both ways? i.e., as both server and client?
 

droeders

Contributor
Joined
Mar 21, 2016
Messages
179
Did you run it both ways? i.e., as both server and client?

Good idea to run in both directions. The panic dumps point to the network stack, but if it happens on both Intel and Broadcom I would think it's something else.

I would also try the test that @Ericloewe suggested originally and let it run for quite a while (adjust the of= path for your environment):

dd if=/dev/random of=/path/to/pool/filename
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Good idea to run in both directions. The panic dumps point to the network stack, but if it happens on both Intel and Broadcom I would think it's something else.

I would also try the test that @Ericloewe suggested originally and let it run for quite a while (adjust the of= path for your environment):

dd if=/dev/random of=/path/to/pool/filename
I don't think I would use random; it's very slow. /dev/zero is much faster.

 

droeders

Contributor
Joined
Mar 21, 2016
Messages
179
I don't think I would use random; it's very slow. /dev/zero is much faster.

Good point - definitely faster and will stress the disk subsystem more.

I only get about 35MB/s from /dev/random on a (mostly idle) BSD machine, and about triple that from /dev/zero.
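For a rough sense of the gap, here's a small-scale version of that comparison. Assumptions not from the thread: /tmp stands in for a path on the pool, 64 MiB stands in for a long run, and /dev/urandom is used for portability (on FreeBSD, /dev/random and /dev/urandom are the same non-blocking device).

```shell
# dd reports elapsed time and throughput on completion (on stderr).
# Point of= at a file on the pool for a real disk-subsystem test.
dd if=/dev/zero    of=/tmp/dd_zero.bin   bs=1M count=64
dd if=/dev/urandom of=/tmp/dd_random.bin bs=1M count=64
```

Comparing the two reported rates shows how much the random-number generator, rather than the disks, limits a /dev/random-based write test.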
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I don't think I would use random; it's very slow. /dev/zero is much faster.
But zeros compress really well and tend to mask some issues. ;)

Once the file has grown, it can be used for faster tests.
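A sketch of that reuse, with placeholders not from the thread: /tmp stands in for the pool path, and a 16 MiB file stands in for the grown one. Once the file exists, repeated read passes can be run without regenerating random data.

```shell
# One-time setup: a stand-in for the file grown on the pool.
dd if=/dev/urandom of=/tmp/pooltest.bin bs=1M count=16
# Reuse it for read tests; of=/dev/null discards the data, so only reads are exercised.
dd if=/tmp/pooltest.bin of=/dev/null bs=1M
```

On a real ZFS pool, keep in mind that the ARC may serve recently written data from RAM, so a freshly written file won't necessarily hit the disks on read-back.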
 

hardlivinlow

Dabbler
Joined
May 31, 2017
Messages
30
I will test dd with both zero and random to get accurate results.

Reverse iperf, 30 minutes, server to client:
Code:
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size:  128 KByte
------------------------------------------------------------
[  4] local 172.16.10.105 port 5001 connected with 172.16.10.15 port 45286
[ ID] Interval	   Transfer	 Bandwidth
[  4]  0.0-1800.2 sec   184 GBytes   880 Mbits/sec



Memtest86 ran for 7 hours. No errors.

IMG_0232.jpg



Did more testing on file sizes when copying. Small files, 600MB or below, will copy to the share quickly. Anything over a gig in size will panic-crash the server almost immediately once the copy is started.

Panic crash picture snapped last night while testing copying.

IMG_0221.jpg
 

hardlivinlow

Dabbler
Joined
May 31, 2017
Messages
30
dd from random has been running to the pool for 2 hours. Current size of the file is 240GB. No crashes yet.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
What NIC do you have? This sounds like a NIC problem.

 

hardlivinlow

Dabbler
Joined
May 31, 2017
Messages
30
Well, I used the onboard Broadcom NetXtreme II NICs and had crashes, so I disabled them and installed an Intel PCIe 1000 4-port gigabit NIC. So, same issue with 2 different brands of NICs.
 