Checking new HDDs in RAID


Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
You are correct @Ericloewe but the problem is the M1015 docs say 32 disks. No idea if there is a limitation with the IBM or not. But you are correct that the true LSI card supports 128 or 256 (I forget what my box says and I can't check it right now).
 

Mlovelace

Guru
Joined
Aug 19, 2014
Messages
1,111
The controller should support at least 128 drives in IT mode, maybe even 256.

I was just going off the white sheet IBM put out. If it uses the same LSI chip as the 9211-8i then it should support 256 in IT mode.
IMG_0727.jpg
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I was just going off the white sheet IBM put out. If it uses the same LSI chip as the 9211-8i then it should support 256 in IT mode.
View attachment 5338


Yes, it "should". Are you 100% sure that IBM doesn't have some limitation in the silicon that makes 256 disk support not work? That is a very real possibility and as far as I know nobody has tried to hook up enough drives to validate the 32 drive limit.
 

Mlovelace

Guru
Joined
Aug 19, 2014
Messages
1,111
Yes, it "should". Are you 100% sure that IBM doesn't have some limitation in the silicon that makes 256 disk support not work? That is a very real possibility and as far as I know nobody has tried to hook up enough drives to validate the 32 drive limit.

I don't know but being an IBM card they probably do have a limit on the drive count. I don't run the IBM card myself so I couldn't test the stated 32 drive count limit per their white sheet. It would be interesting to see, when flashed with the LSI firmware, if the drive count is written into the IBM firmware or if it's hard coded on the card.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Yep. That's why I'd rather say it has the 32 drive limit that IBM calls for. If I'm wrong, there's no downside. But if I were to tell people it supports 128 drives and it turns out it doesn't, that could mean seriously lost money and uptime. So the conservative answer is the "proper" answer until proven otherwise. ;)
 

Mlovelace

Guru
Joined
Aug 19, 2014
Messages
1,111
Yep. That's why I'd rather say it has the 32 drive limit that IBM calls for. If I'm wrong, there's no downside. But if I were to tell people it supports 128 drives and it turns out it doesn't, that could mean seriously lost money and uptime. So the conservative answer is the "proper" answer until proven otherwise. ;)

I agree 110%, which is why I included the IBM white sheet in my first reply about the SAS expander. I don't want to mislead anyone; I thought we were speaking in theory about the card's potential. Please feel free to delete the other posts if you think someone might interpret them incorrectly and come away with false expectations.
 

pjc

Contributor
Joined
Aug 26, 2014
Messages
187
Now I'm looking forward to jgreco's seek-heavy stress test even more...
jgreco posted his script here, but that thread doesn't allow discussion.

Here are the results from the first pass (the second pass is currently running), apologies for the novel:
Code:
Selected disks: da2 da4
<HGST HUS724040ALS640 A1C4>        at scbus0 target 2 lun 0 (da2,pass2)
<HGST HUS724040ALS640 A1C4>        at scbus0 target 8 lun 0 (da4,pass4)
Is this correct? (y/N): y
Performing initial serial array read (baseline speeds)
Tue Oct 28 16:45:19 EDT 2014
Tue Oct 28 16:49:49 EDT 2014
Completed: initial serial array read (baseline speeds)

Array's average speed is 174.075 MB/sec per disk

Disk    Disk Size  MB/sec %ofAvg
------- ---------- ------ ------
da2      3815447MB    175    101
da4      3815447MB    173     99

Performing initial parallel array read
Tue Oct 28 16:49:49 EDT 2014
The disk da2 appears to be 3815447 MB.
Disk is reading at about 175 MB/sec
This suggests that this pass may take around 363 minutes

                  Serial Parall % of
Disk    Disk Size  MB/sec MB/sec Serial
------- ---------- ------ ------ ------
da2      3815447MB    175    175    100
da4      3815447MB    173    173    100

Awaiting completion: initial parallel array read
Wed Oct 29 00:29:26 EDT 2014
Completed: initial parallel array read

Disk's average time is 27364 seconds per disk

Disk    Bytes Transferred Seconds %ofAvg
------- ----------------- ------- ------
da2         4000787030016   27152     99
da4         4000787030016   27576    101

Performing initial parallel seek-stress array read
Wed Oct 29 00:29:26 EDT 2014
The disk da2 appears to be 3815447 MB.
Disk is reading at about 136 MB/sec
This suggests that this pass may take around 466 minutes

                  Serial Parall % of
Disk    Disk Size  MB/sec MB/sec Serial
------- ---------- ------ ------ ------
da2      3815447MB    175    136     77
da4      3815447MB    173    132     77

Awaiting completion: initial parallel seek-stress array read
Thu Oct 30 17:21:25 EDT 2014
Completed: initial parallel seek-stress array read

Disk's average time is 105024 seconds per disk

Disk    Bytes Transferred Seconds %ofAvg
------- ----------------- ------- ------
da2         4000787030016  109143    104
da4         4000787030016  100905     96

Performing pass 2 parallel array read
Thu Oct 30 17:21:25 EDT 2014
The disk da2 appears to be 3815447 MB.
Disk is reading at about 175 MB/sec
This suggests that this pass may take around 363 minutes

                  Serial Parall % of
Disk    Disk Size  MB/sec MB/sec Serial
------- ---------- ------ ------ ------
da2      3815447MB    175    175    100
da4      3815447MB    173    173    100

Awaiting completion: pass 2 parallel array read


I find it interesting that da2 is slightly faster at streaming (175MB/s vs. 173, ~1%), but quite a bit slower at seeking (109143 sec vs. 100905, ~8%!). Does that seem vaguely sane?

I'll be interested to see if the seek stress test disparity was a fluke once the second pass completes.

The only problems I've found with the script so far are:

1) As posted, it doesn't loop. I've tweaked that in my copy so that it'll just keep running indefinitely.

2) The initial sampling and time estimate is WAY off. It's not so bad for the streaming pass (456 minutes instead of 363, about 25% over), but for the seek-stress pass something doesn't seem right (over 1800 minutes instead of 466 - nearly 4x the estimate!).

When I first ran the script, I had accidentally left "sysctl kern.geom.debugflags=0x10" set from running badblocks, and the estimate was quite a bit closer (1121 minutes):

Code:
Disk    Bytes Transferred Seconds %ofAvg
------- ----------------- ------- ------
da2         4000787030016   27154     99
da4         4000787030016   27578    101

Performing initial parallel seek-stress array read
Tue Oct 28 15:38:51 EDT 2014
The disk da2 appears to be 3815447 MB.       
Disk is reading at about 57 MB/sec       
This suggests that this pass may take around 1121 minutes

                  Serial Parall % of
Disk    Disk Size  MB/sec MB/sec Serial
------- ---------- ------ ------ ------
da2      3815447MB    175    138     79
da4      3815447MB    173    133     77
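
(Side note: if you do find kern.geom.debugflags still set after a badblocks run, it's easy to put back - a minimal sketch, assuming the stock default of 0:)
Code:
# see what it's currently set to; 0x10 is the value that lets tools write to disks GEOM considers in use
sysctl kern.geom.debugflags
# restore the default
sysctl kern.geom.debugflags=0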


One interesting thing I have noticed is that throughput slowly drops toward the end of the disk. Even with badblocks, I got nearly 175MB/s at the start but only about 100MB/s at the end, consistently for each read and write pass, across both drives. That would account for the variance between the streaming estimate and the actual throughput: the average of 175 and 100 is 137.5, about 21% below the initial 175MB/s estimate, which roughly matches the pass running ~25% longer than predicted.

But I'm not sure why the stress test estimate is so wildly off. The disk usage charts show about 150MB/s, presumably aggregated across the 5 processes per drive, and there's one odd spike up to 400MB/s that returns back to sanity over the course of about 6 hours. Maybe ARC is interfering? The two drives are a mirrored vdev, but they're not in use (no scrubbing or anything else).

On the plus side, no kernel warnings about connectivity, and no errors in the SAS PHY layer.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
One interesting thing I have noticed is that throughput seems to slowly drop towards the end of the disk. Even with badblocks, I got nearly 175MB/s at the start, but then 100MB/s at the end, consistently for each read and write pass, across both drives. That seems like it would account for the variance in streaming estimate vs. throughput (the average of 175 and 100 is 137.5, which is 25% slower than the initial estimate of 175).

That's typical of hard drives. Historically, it was because every sector spanned the same angle of arc, regardless of its distance from the center of the disk - that way, the outer sectors were physically larger and took more time to read. I'd venture a guess this is no longer quite the case, so other things are involved.
 

pjc

Contributor
Joined
Aug 26, 2014
Messages
187
...the outer sectors were physically larger and took more time to read.
With a constant RPM, though, you'd travel the same angle in the same amount of time regardless of which track you were on. I thought the old CHS drives just had limited capacity due to lower bit density on the outer tracks (since they were capped by the platter bit density on the innermost tracks).

With LBA drives, in contrast, you can have different numbers of sectors on different tracks, so that you can keep a fixed bit density on the platter, but have more sectors of data on outer tracks. That would be consistent with higher throughput on the outer edge vs. inner.

So if there were a way to sample the speed at both a low LBA and a high LBA (or maybe just a middle LBA, though the middle track would be hard to pin down with variable sectors per track) and average the two, you'd get a better estimate of streaming throughput over the entire drive.
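
Purely as a sketch of what that sampling might look like (the offsets and sizes here are mine, not anything the script does), reading ~2 GiB at each end of da2 and comparing the two rates would bracket the drive's streaming range:
Code:
# ~2 GiB from the start of the LBA range (outer tracks)
dd if=/dev/da2 of=/dev/null bs=1m count=2048
# ~2 GiB ending at the last of the 3815447 1m-blocks the script reports (inner tracks)
dd if=/dev/da2 of=/dev/null bs=1m count=2048 skip=3813399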

That still doesn't explain the seek stress test disparity, though.
 

pjc

Contributor
Joined
Aug 26, 2014
Messages
187
I just noticed another strange result:
Code:
Performing initial parallel seek-stress array read
Wed Oct 29 00:29:26 EDT 2014
...
Awaiting completion: initial parallel seek-stress array read
Thu Oct 30 17:21:25 EDT 2014
Completed: initial parallel seek-stress array read

Disk's average time is 105024 seconds per disk

Disk    Bytes Transferred Seconds %ofAvg
------- ----------------- ------- ------
da2         4000787030016  109143    104
da4         4000787030016  100905     96
This claims the pass finished in at most 109143 seconds, which is about 30h15m, but if you look at the two timestamps (Wed and Thu), it actually took almost 41 hours!

It looks like the script is having all 5 dd processes per drive write to the same log file, which seems a little odd. But in theory the last one running will provide the time estimate. That suggests that the 5th run of dd only took 30 hours, but the whole run for all 5 took 41.

But that's also a bit surprising given that the 5 runs of dd are started a minute apart from each other. Also, once the other 4 are done, I'd expect throughput to approach the streaming throughput, which can scan the entire drive in less than 8 hours. So that still doesn't account for a 10-hour discrepancy between dd and wall clock.
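
I don't know exactly how the script arranges its 5 reads, but purely as a hypothetical sketch (offsets, counts and file names here are mine, not the script's), giving each dd its own log and timestamping its exit would show which one actually finishes last and when:
Code:
# hypothetical: 5 parallel reads covering ~1/5 of da2 each, with separate logs and completion timestamps
for n in 0 1 2 3 4; do
    ( dd if=/dev/da2 of=/dev/null bs=1048576 skip=$((n * 763089)) count=763089 \
          2> /tmp/seek.da2.dd${n}.err
      date > /tmp/seek.da2.dd${n}.done ) &
done
wait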
 

pjc

Contributor
Joined
Aug 26, 2014
Messages
187
The plot thickens:
Code:
Performing pass 2 parallel seek-stress array read
Fri Oct 31 01:01:44 EDT 2014
The disk da2 appears to be 3815447 MB.
Disk is reading at about 133 MB/sec
This suggests that this pass may take around 477 minutes

                  Serial Parall % of
Disk    Disk Size  MB/sec MB/sec Serial
------- ---------- ------ ------ ------
da2      3815447MB    175    132     75
da4      3815447MB    173    133     77

Awaiting completion: pass 2 parallel seek-stress array read
Sat Nov  1 15:57:24 EDT 2014
Completed: pass 2 parallel seek-stress array read

Disk's average time is 101333 seconds per disk

Disk    Bytes Transferred Seconds %ofAvg
------- ----------------- ------- ------
da2         4000787030016  113990    112 --SLOW--
da4         4000787030016   88676     88 ++FAST++

Not sure what to make of this one. At least the drives are getting a workout.
 

pjc

Contributor
Joined
Aug 26, 2014
Messages
187
Here's my patch to make it loop forever:
Code:
--- solnet-array-test-v2.sh.orig    2014-11-04 13:34:12.000000000 -0500
+++ solnet-array-test-v2.sh    2014-11-04 13:35:00.000000000 -0500
@@ -160,5 +160,11 @@
done

-pass=initial
+# loop forever
+while [ 1 ]; do
+if [ ${passnumber} -eq 1 ]; then
+    pass=initial
+else
+    pass="pass ${passnumber}"
+fi

echo ""
@@ -273,2 +279,6 @@
    echo "${disk}" "${avgtime}" `grep -h " bytes transferred in" /tmp/sat.${disk}.out /tmp/sat.${disk}.err` "${avgtime}" | awk '{ if ($2 != 0) {percent=100 * $7 / $2} else {percent = 0}; printf "%-7s %17s %7.0f %6.0f", $1, $3, $7, percent; if (percent < 92) printf " ++FAST++"; if (percent > 107) printf " --SLOW--"; printf "\n"}'
done
+
+passnumber=`expr $passnumber + 1`
+done  # while
+


Copy/paste the above into a file (e.g. "solnet-array-test-loop.patch"). Apply with:
Code:
patch < solnet-array-test-loop.patch
 

Fraoch

Patron
Joined
Aug 14, 2014
Messages
395
Thanks pjc!
 

Fraoch

Patron
Joined
Aug 14, 2014
Messages
395
Oops, I get:

Code:
[root@Minas-Tirith] ~# patch < solnet-array-test-loop.patch
Hmm...  Looks like a unified diff to me...
(Patch is indented 4 spaces.)
The text leading up to this was:
--------------------------
|  --- solnet-array-test-v2.sh.orig  2014-11-04 13:34:12.000000000
|-0500
|  +++ solnet-array-test-v2.sh  2014-11-04 13:35:00.000000000 -0500
--------------------------
Patching file solnet-array-test-v2.sh using Plan A...
patch: **** malformed patch at line 5: done


I'm not too familiar with patching so I don't know if this was successful or not...
 

pjc

Contributor
Joined
Aug 26, 2014
Messages
187
It looks like your lines wrapped...that "-0500" should be part of the first line.

(I just copied it, did "cat > test.txt", pasted, and hit ctrl-C, and it came out fine. But different terminals do different things with wrapping.)
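
Spelled out, in case it helps (Ctrl-D on a blank line is the tidy way to end cat's input; Ctrl-C also worked for me since the paste had already been written out):
Code:
cat > solnet-array-test-loop.patch
# paste the patch text here, then press Ctrl-D
patch < solnet-array-test-loop.patch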
 

pjc

Contributor
Joined
Aug 26, 2014
Messages
187
Here are the next few seek stress tests:
Code:
Performing pass 3 parallel seek-stress array read
Sat Nov  1 23:37:01 EDT 2014
The disk da2 appears to be 3815447 MB.       
Disk is reading at about 133 MB/sec       
This suggests that this pass may take around 480 minutes

                  Serial Parall % of
Disk    Disk Size  MB/sec MB/sec Serial
------- ---------- ------ ------ ------
da2      3815447MB    175    132     75
da4      3815447MB    173    133     77

Awaiting completion: pass 3 parallel seek-stress array read
Mon Nov  3 14:39:55 EST 2014
Completed: pass 3 parallel seek-stress array read

Disk's average time is 97941 seconds per disk

Disk    Bytes Transferred Seconds %ofAvg
------- ----------------- ------- ------
da2         4000787030016  109567    112 --SLOW--
da4         4000787030016   86314     88 ++FAST++
...
Performing pass 4 parallel seek-stress array read
Mon Nov  3 22:19:28 EST 2014
The disk da2 appears to be 3815447 MB.       
Disk is reading at about 133 MB/sec       
This suggests that this pass may take around 478 minutes

                  Serial Parall % of
Disk    Disk Size  MB/sec MB/sec Serial
------- ---------- ------ ------ ------
da2      3815447MB    175    132     76
da4      3815447MB    173    133     77

Awaiting completion: pass 4 parallel seek-stress array read
Thu Nov  6 15:11:48 EST 2014
Completed: pass 4 parallel seek-stress array read

Disk's average time is 126745 seconds per disk

Disk    Bytes Transferred Seconds %ofAvg
------- ----------------- ------- ------
da2         4000787030016  129153    102
da4         4000787030016  124337     98
...
Performing pass 5 parallel seek-stress array read
Thu Nov  6 22:51:27 EST 2014
The disk da2 appears to be 3815447 MB.       
Disk is reading at about 133 MB/sec       
This suggests that this pass may take around 478 minutes

                  Serial Parall % of
Disk    Disk Size  MB/sec MB/sec Serial
------- ---------- ------ ------ ------
da2      3815447MB    175    132     76
da4      3815447MB    173    132     76

Awaiting completion: pass 5 parallel seek-stress array read
Sat Nov  8 13:37:29 EST 2014
Completed: pass 5 parallel seek-stress array read

Disk's average time is 88768 seconds per disk

Disk    Bytes Transferred Seconds %ofAvg
------- ----------------- ------- ------
da2         4000787030016   90015    101
da4         4000787030016   87521     99
drives.png


This just seems weird. Passes 2 and 3 of the seek stress test show a wide variation between da2 and da4, and looking at the chart, we do see a slight difference in speed, but not the 6+-hour difference reported by the script.

Stranger still, pass 4 looks really weird in the chart (with crazy low throughput on da2 at the end, 24MB/s), but the script thinks that run was fine.

And what's with the spikes up to 400MB/s in passes 1-3 and 5 (and on da4 in pass 4, and on da2 in pass 6)?

These two drives are a mirrored vdev with no activity, and no scrubs.

There's nothing in the logs that indicates any problems, and there aren't any SAS PHY errors or reallocated sectors.
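
(For reference, the checks behind that last sentence are along these lines - the device names are mine and the grep string is just what matches the output on these HGST SAS drives:)
Code:
# SAS PHY event counters (invalid dwords, running disparity errors, loss of dword sync, ...)
smartctl -l sasphy /dev/da2
# reallocations on a SAS drive show up as grown defects
smartctl -a /dev/da2 | grep -i "grown defect"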
 

pjc

Contributor
Joined
Aug 26, 2014
Messages
187
Well, now we have pass 6, and the chart looks like pass 4 except that da2 and da4 have traded places:
Code:
Performing pass 6 parallel seek-stress array read
Sat Nov  8 21:18:19 EST 2014
The disk da2 appears to be 3815447 MB.       
Disk is reading at about 133 MB/sec       
This suggests that this pass may take around 477 minutes

                  Serial Parall % of
Disk    Disk Size  MB/sec MB/sec Serial
------- ---------- ------ ------ ------
da2      3815447MB    175    132     76
da4      3815447MB    173    133     77

Awaiting completion: pass 6 parallel seek-stress array read
Tue Nov 11 16:21:52 EST 2014
Completed: pass 6 parallel seek-stress array read

Disk's average time is 154824 seconds per disk

Disk    Bytes Transferred Seconds %ofAvg
------- ----------------- ------- ------
da2         4000787030016  140787     91 ++FAST++
da4         4000787030016  168861    109 --SLOW--

The only difference between the drives that I can see is higher ECC activity on da2: 94251 corrected read errors vs. 47533.
Code:
Manufactured in week 10 of year 2014
...
Error counter log:
          Errors Corrected by           Total   Correction     Gigabytes    Total
              ECC          rereads/    errors   algorithm      processed    uncorrected
          fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:      94251       24         0     94275      11622     213946.009           0
write:         0        0         0         0        630      16392.261           0
verify:      126        0         0       126         85         16.143           0

Non-medium error count:        0
...
   Accumulated power on time, hours:minutes 1241:01 [74461 minutes]
Code:
Manufactured in week 34 of year 2014
...
Error counter log:
          Errors Corrected by           Total   Correction     Gigabytes    Total
              ECC          rereads/    errors   algorithm      processed    uncorrected
          fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:      47533        0         0     47533       4377     213117.450           0
write:         0        0         0         0       1013      16003.161           0
verify:       40        0         0        40         27         15.575           0

Non-medium error count:        0
...
   Accumulated power on time, hours:minutes 780:20 [46820 minutes]
For reference, the verify stats haven't changed while running the solnet test script, but the read stats started out at 8804 for da2 and 7926 for da4, so they've diverged pretty significantly during the seek tests.
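
(For anyone who wants to pull the same counters, something along these lines is enough - device names are from my box, and the grep just grabs the table shown above:)
Code:
for d in da2 da4; do
    echo "=== ${d} ==="
    smartctl -a /dev/${d} | grep -A 7 "Error counter log"
done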

Curiously, the two drives report identical firmware and revisions, but they're clearly not behaving identically.

@jgreco, what do you make of this? Does any of this seem like cause for concern to you? I can't tell how much of the timing difference is due to script wonkiness (see posts above) and how much is due to any issue with the drives. Again, there aren't any kernel/log messages or SAS PHY errors.
 

WillM

Dabbler
Joined
Jan 1, 2016
Messages
12
The SMART long test scans the entire disk surface, so it will take a long time - estimate at least 1 minute per GB. Start it before you go to sleep and it will probably be done by morning.
To start it: smartctl -t long /dev/ada0 (repeat for each disk)
Then check the results the next morning: smartctl -l xselftest /dev/ada0

A second recommended way to check drive performance and do some stress testing is a "dd" read of each of the drives. From the Shell you can run the following - note this tests 6 drives, so adjust the list if you have fewer or more:
Code:
for i in 0 1 2 3 4 5; do
    dd if=/dev/ada${i} of=/dev/null bs=1048576 &
done
(Press return after each line; the loop only starts once you enter "done".) This will take a long time (hundreds of minutes). As long as you keep the Shell open, each dd will report how long it took when it finishes its drive. If one drive takes much longer than the others, it could have a performance issue, even if it passes SMART. When I ran this on 6 drives I left it overnight, and the next morning they were all within 10% of each other.

I just did a smartctl -t long on ada{1,2,3,4} and got the following durations for the scans: 12 hours, 10 hours, 10 hours and 9 hours. I wondered if you would suspect a problem with the 12 hour drive?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I just did a smartctl -t long on ada{1,2,3,4} and got the following durations for the scans: 12 hours, 10 hours, 10 hours and 9 hours. I wondered if you would suspect a problem with the 12 hour drive?
No, the estimates are almost always off by at least 10% anyway.
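
If you want to compare against the drive's own estimate, it's in the capabilities output - a quick check, with the device name just an example:
Code:
# look for the "Extended self-test routine recommended polling time" line
smartctl -c /dev/ada1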
 