Major drop-off in read/write speed after 3.5 yrs

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
Have you enabled SMART for each disk under Storage, Disks...?
 

VictorR

Contributor
Joined
Dec 9, 2015
Messages
143
Yes, they are

disks.png
 

VictorR

Contributor
Joined
Dec 9, 2015
Messages
143
Ok, test results are in.
What is the code to use in this forum to get the columns to display properly?

One obvious thing I noticed is that an entire 10-drive vdev (da20-da29) was marked slow in the previous test.

vdev.png

T3.png

T4.png

All of them look like this drive (except da28 and da29, which are 1-year-old WD Golds):

da26.png

da28

da28.png

da17, which was also marked as slow, has a UDMA_CRC_Error_Count of 439. A quick web search indicates this is usually a bad cable or loose power connection. Could this be the suspect drive?

da17.png
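For anyone who wants to track that CRC counter over time instead of comparing screenshots, here is a minimal Python sketch. It assumes the usual ten-column attribute table that `smartctl -A` prints. The sample line is made up for illustration (only the raw value 439 comes from this thread), and `raw_attribute` is a hypothetical helper name, not part of any tool.

```python
def raw_attribute(smart_output, name):
    """Return the raw value of a named SMART attribute, or None if absent."""
    for line in smart_output.splitlines():
        fields = line.split()
        # Attribute rows: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] == name:
            return int(fields[9])
    return None


# Illustrative sample only: the layout follows the smartctl attribute table,
# and 439 is the count reported for da17 in this thread.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       439
"""

crc_errors = raw_attribute(SAMPLE, "UDMA_CRC_Error_Count")
print(crc_errors)
```

The counter is cumulative, so the useful signal is whether it keeps climbing after the cable and connectors have been reseated. A number that stays put points to an old event rather than an ongoing fault.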
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Is there stuff going on in the pool as you do these tests? If you filled up the first vdev, never removed much, and then added a new vdev later as da20-29, the "slow vdev" behaviour could make sense if there was a lot of traffic to the second vdev. Write traffic or traffic with a lot of temporal locality.

The UDMA_CRC_EC is worth pursuing but doesn't appear to be a huge problem.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Rewinding to the beginning just to check assumptions.

[root@Q30] ~# dd if=/dev/zero of=/mnt/Q30/test.bin bs=2048k count=1024k
1048576+0 records in
1048576+0 records out
2199023255552 bytes transferred in 1098.190928 secs (2,002,405,228 bytes/sec) [commas added for ease of reading]

[root@q30 /mnt/Q30]# dd if=test.bin of=/dev/null bs=2048k count=1024k
1048576+0 records in
1048576+0 records out
2199023255552 bytes transferred in 1392.109668 secs (1,579,633,635 bytes/sec)

Jul 28, 2019:

root@freenas:~ # dd if=/dev/zero of=/mnt/Q30/Test/test.bin bs=2048k count=1024k
1048576+0 records in
1048576+0 records out
2199023255552 bytes transferred in 1593.191745 secs (1,380,262,773 bytes/sec)

root@freenas: dd if=/mnt/Q30/Test/test.bin of=/dev/null bs=2048k count=1024k
1048576+0 records in
1048576+0 records out
2199023255552 bytes transferred in 3142.254662 secs (699,823,373 bytes/sec)

I'm trying to reconcile this with the array test numbers just posted.
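To put the four dd runs side by side, here is a small Python sketch that recomputes the rates from the byte counts and elapsed times quoted above and expresses the change as a percentage. The labels "baseline" and "2019" are just names chosen here for the two sets of runs. By this arithmetic, writes lost roughly 31 percent and reads roughly 56 percent.

```python
# (bytes transferred, seconds) copied from the dd output above
RUNS = {
    "baseline write": (2199023255552, 1098.190928),
    "baseline read": (2199023255552, 1392.109668),
    "2019 write": (2199023255552, 1593.191745),
    "2019 read": (2199023255552, 3142.254662),
}


def rate(label):
    """Throughput of one run in bytes per second."""
    nbytes, secs = RUNS[label]
    return nbytes / secs


def drop_percent(before, after):
    """Percentage of throughput lost going from one run to another."""
    return 100.0 * (1.0 - rate(after) / rate(before))


# bs=2048k count=1024k is 2 MiB x 1,048,576 blocks = 2 TiB
assert 2048 * 1024 * 1024 * 1024 == 2199023255552

for label in RUNS:
    print(f"{label}: {rate(label) / 1e9:.2f} GB/s")
print(f"write drop: {drop_percent('baseline write', '2019 write'):.1f}%")
print(f"read drop: {drop_percent('baseline read', '2019 read'):.1f}%")
```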

The slowness on da16-29 is a problem. Activity on the pool while you're running the test is really the only "acceptable" explanation. The serial test suggests that the drives can support full speed, but the parallel tests show that the system is struggling to deliver that across all drives simultaneously. That would seem to be contention at some point, or a failing controller, or a failing SAS expander. Are your HBAs running cool? Is there airflow across them? This is *required* for LSI HBAs. They are designed to be used in rackmount systems that have a lot of airflow. It isn't clear to me that the Q30 is that, being designed for "whisper quiet". The PCIe backplate should be vented. This kind may be marginal:

16-118-142-03.jpg



I suspect your server may not have sufficient airflow. Storinator is famous for doing stuff like this:

mobo_with_lsi_cards.jpg

You have to do that to make it "quiet" but still have enough airflow not to kill the drives. Unfortunately it kills the airflow in the server compartment, and if you place the HBAs in this particular manner, the one closer to the CPU cooks. Check to see if that corresponds to the HBA serving da16 and beyond. LSI 92xx HBAs do not have temp sensors so this sucks to check, I'm sorry. And I realize that's not a picture of a Q30; it's just for discussion purposes.

A conventional rackmount server forces all the air out through the back so that the air traverses the entire system, cooling RAM, CPU, chipset, PSU, and expansion cards. A well-built one will include an air shroud to make sure air is flowing over critical bits. The Storinator design just leaks air all over the place. If you have two massive HBA cards slotted next to each other, the tendency of air to prefer an unobstructed flow means that a very low volume of air will go between those two HBA cards. And because of the tight mesh venting on the PCIe bracket pictured above, the air isn't going to want to exit through those little pin holes in the bracket, and is instead going to sail around and go out any of the massive holes Storinator made in this chassis.

By way of comparison, Supermicro has been putting brackets like these

AOC-S3108L-H8iR.gif

on their recent HBAs. This does a better job of encouraging airflow.

So the question I'm really interested in is what the inside of your chassis looks like and which HBA is exhibiting the slow drive behaviour. Also, are all your fans working properly, and are any filters that may exist clean? Is there any dust buildup inside your server? Because I agree with you that the da16-29 issue is odd.

The numbers I quoted above from the first post in the thread do suggest something amiss. My only other real theory besides hardware issues is that you are somehow wrong about the state of the pool. Those numbers generally match up with what I might expect to see out of a fragmented pool that was experiencing lots of seek activity in order to perform the operations. But you seem to have a grasp on this, so I have to think that's an unreasonable explanation. If you deleted and recreated the pool, and you're still seeing issues, and solnet-array-test is clearly showing an abnormality, I think we are looking at hardware issues.
 

VictorR

Contributor
Joined
Dec 9, 2015
Messages
143
Awesome points, jgreco!! There is no data on the NAS, and no traffic to it. It is set up strictly for testing, right now. I'm the only one accessing it.

I was actually looking at some of the temperature readings of the hard drives in the FreeNAS GUI a couple of nights ago, and noticed that a certain block of them were 4-8 degrees C hotter. da0-da15 are 30-33C, while da16-da29 are all 36-37C. While running those dd tests, the warmer block was getting to 40-41C. So, that indicates there is a 10-15F difference between certain areas of the chassis, which points to some airflow issues in the layout.
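As a quick sanity check on the unit conversion, here is a small Python sketch. A temperature difference converts with the 9/5 factor only, without the 32-degree offset, so a 4-8 C gap is about 7-14 F, in the same ballpark as the 10-15F quoted. The helper names are made up for illustration, and the ranges are the idle readings reported above.

```python
def delta_c_to_f(delta_c):
    """Convert a temperature *difference* from Celsius to Fahrenheit degrees."""
    return delta_c * 9.0 / 5.0


def spread(warm, cool):
    """Smallest and largest possible gap between two (low, high) ranges."""
    return warm[0] - cool[1], warm[1] - cool[0]


# Idle readings reported above, in C: (low, high)
cool_block = (30, 33)  # da0-da15
warm_block = (36, 37)  # da16-da29

# The gap as quoted in the post
print(f"quoted: 4-8 C = {delta_c_to_f(4):.1f}-{delta_c_to_f(8):.1f} F")

# The gap implied by the two idle ranges
low, high = spread(warm_block, cool_block)
print(f"idle: {low}-{high} C = {delta_c_to_f(low):.1f}-{delta_c_to_f(high):.1f} F")
```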

And this Q30 is far more cramped than the picture you shared. Add in dual processors, 4 rows of RAM, and all the open slots filled with 10GbE NICs. I'll take a pic when I get to the office tonight.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Okay. Good clues. Check all the fans and make sure none are failing. If one fan is in front of a block of "failing" drives, it could be the fan instead of the HBA. All things considered, it may be best just to order replacement fans anyway. Fans are cheap compared to drives, and then you'll have a spare in case one fails someday.
 

VictorR

Contributor
Joined
Dec 9, 2015
Messages
143
Some pics of the Q30 internals. There are two U-shaped hoods that clamp down over each row of drives to minimize vibration and hold them firmly in their sockets.

8B5D153C-57FB-4BB4-BFDB-1EE5DE2CCF08.jpeg

DF917BAA-B4DC-4778-BF22-86D59E3D99A6.jpeg

70C10ADB-AEB4-4644-83DF-993A1B02DCA4.jpeg
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
I have a Storinator XL60 at work. Someone ordered it before I started working there. The airflow / cooling is terrible in it, especially by comparison to the Supermicro 60 bay system I have right next to it. Drive temperatures in the Supermicro chassis are all under 30C, where the Storinator drives are over 45C by the time you get to the back row of drives, with each row of drives getting progressively warmer. Getting warmer like that isn't unusual for the format of the system; I have a 48 bay system from Chenbro at home and it does the same thing. However, the "45Drives" company makes systems that are very poorly designed from both an airflow and a performance standpoint. That is something I would not have known without being able to make direct comparisons between the Storinator and the Supermicro.
This is the model I have from Supermicro: https://www.supermicro.com/en/products/system/4U/6048/SSG-6048R-E1CR60l.cfm?parts=SHOW
It holds just as many drives as the XL60, but it is about 6 or 8 inches shorter overall AND it cools better AND it performs better. If you want a good system, don't listen to the hype train from the folks at "45Drives". They really don't know what they are doing.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Yeah. I'd place a wager on bad airflow. With 4xSFF8087 hanging off the end of each of those and all that cabling, I bet very little air is flowing over the HBAs. Take an 80mm fan and see if you can use some electrical tape or zip ties or something to position it over the top of the cards to push air down towards the motherboard and get some air moving in those slots. Not a fix, just a test, don't work too hard at it. I'd not be shocked if things got better.
 

VictorR

Contributor
Joined
Dec 9, 2015
Messages
143
Chris Moore said:
However, the "45Drives" company makes systems that are very poorly designed from both an airflow and a performance standpoint. That is something I would not have known without being able to make direct comparisons between the Storinator and the Supermicro.

In 45 Drives' defense, this version of the Q30 was designed for "desktop quiet". So, the fan speed/flow is probably a lot lower than it should be. I should look into kicking that up, as the system resides in a Noren 24TD AcoustiLock cabinet that cuts sound by 40 dBA.

noren.png

At the time I was looking for a low-priced 100TB+ unit for 4K video editing (in late 2015), most systems from companies specializing in film production were $60k-130k. The Q30 exceeded all those systems' performance for $19k. On top of that, the engineers at 45 Drives really went out of their way to help squeeze every bit of performance out of it for our use. But, like you point out, I could probably get the same today for $10k from Supermicro, whereas it would cost $15k from 45 Drives. Of course, I know quite a bit more about what we need today than I did in 2015.

Chris Moore said:
Drive temperatures in the Supermicro chassis are all under 30C, where the Storinator drives are over 45C by the time you get to the back row of drives, with each row of drives getting progressively warmer.

The front row is steady at 30-31C at idle. The back row rides at 36-37C. I do think the "hoods" used to lock the drive rows down cause a significant amount of heat to be held in. Removing the chassis top cover over the drives dropped the overall drive temps by about 3C. I think the highest temp I've seen under load was 41-42C. I'd like that to be <40C, but that's still well under the temp for concern.

I don't have a way to check temp at the HBA, though. I really wish they were built with sensors.
 

VictorR

Contributor
Joined
Dec 9, 2015
Messages
143
Ok, while tracking down the location of another bad memory chip (extremely hard), I noticed that the rear row of HDs had reached a high of 50C today (now at 45C). There was nothing running on the box - no scrub, etc.

I was diddling around in IPMI and noticed that one of the 6 fans was not reporting any sensor data to IPMI.

Q30_fans.png

I need to get back up there sometime this week and see if it has died or is spinning at a different rate. I'm betting it is the middle of three, because while taking the pictures posted above, I noticed that three drives in the line were noticeably warmer than the others. It looked, at the time, like all three fans were spinning. But maybe that one is intermittent.
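In case it helps to catch this without opening the IPMI page, here is a rough Python sketch that scans pipe-delimited sensor text for fans with no usable reading. It assumes output shaped like what `ipmitool sensor` prints (name, reading, unit, status, then thresholds), trimmed here to the first four columns. The fan names and RPM values in the sample are hypothetical, not taken from this Q30, and `silent_fans` is a made-up helper name.

```python
def silent_fans(sensor_output):
    """Return names of fan sensors that report no reading or zero RPM."""
    missing = []
    for line in sensor_output.splitlines():
        cols = [c.strip() for c in line.split("|")]
        if len(cols) < 4 or not cols[0].upper().startswith("FAN"):
            continue
        name, reading = cols[0], cols[1]
        try:
            rpm = float(reading)
        except ValueError:
            rpm = None  # "na" or blank: the BMC has no reading for this fan
        if rpm is None or rpm <= 0:
            missing.append(name)
    return missing


# Hypothetical sample: six fans, one of them not reporting
SAMPLE = """\
FAN1 | 4200.000 | RPM | ok
FAN2 | 4100.000 | RPM | ok
FAN3 | na       |     | na
FAN4 | 4300.000 | RPM | ok
FAN5 | 4200.000 | RPM | ok
FAN6 | 4150.000 | RPM | ok
"""

print(silent_fans(SAMPLE))
```

A missing reading only says the BMC cannot see the tachometer signal. The fan may be dead, unplugged, or spinning with a broken tach wire, so a look inside the chassis is still needed.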
 