HDD stress test

Glorious1 · Dec 6, 2014

I've been trying for about a week to download the stress test script that jgreco posted in the Building, Burn-in and Testing sticky. The incoming directory just seems to be empty.
http://ftp//ftp.sol.net/incoming/solnet-array-test-v2.sh

But I have a further question that I'm sure will seem trivial to the experts. How would one best go about running a script remotely on the server? Do you have to upload it somehow, or can you run it from the remote computer through SSH or the KVM console or something? And can I assume the script would run under FreeBSD, no other operating system needed?

depasseg · Dec 6, 2014

Works for me. The URL is ftp://ftp.sol.net/incoming/solnet-array-test-v2.sh (no http).

Look into tmux. It lets a CLI command keep running after you disconnect. When you reconnect, just type tmux attach and you are back in the session where the test is running.
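
A minimal sketch of that workflow, assuming tmux is installed on the server. The session name `burnin` and the helper name `start_or_attach` are made up for illustration:

```shell
#!/bin/sh
# Hypothetical session name; pick any label you like.
SESSION=burnin

# Reattach to the session if it already exists, otherwise create it.
start_or_attach() {
    if tmux has-session -t "$SESSION" 2>/dev/null; then
        tmux attach-session -t "$SESSION"
    else
        tmux new-session -s "$SESSION"
    fi
}
```

Call `start_or_attach` after logging in over SSH, start the test inside the session, then press Ctrl-b followed by d to detach. If the SSH connection drops, the session should keep running on the server.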

Glorious1 · Dec 6, 2014

Thanks, my browser added that http when I copied the URL. I usually don't have trouble with FTP. But browsers I've tried are flummoxed. Even in an FTP client, all I find is an empty second incoming folder inside the first one. Strange.
Edit: I finally managed to get it through a terminal.

My question about running it was basically, where is the script physically saved when you run it (can it be on the remote computer), and how do you invoke it if it isn't on the server? And, does it run on FreeBSD?

depasseg · Dec 6, 2014

Oh, sorry. Yes, run it on the computer you want to test. An easy way to get the file onto it is to use wget.

Code:

Welcome to FreeNAS
[root@freenas] ~#
[root@freenas] ~#
[root@freenas] ~# wget ftp://ftp.sol.net/incoming/solnet-array-test-v2.sh
converted 'ftp://ftp.sol.net/incoming/solnet-array-test-v2.sh' (US-ASCII) -> 'ftp://ftp.sol.net/incoming/solnet-array-test-v2.sh' (UTF-8)
--2014-12-06 16:59:10--  ftp://ftp.sol.net/incoming/solnet-array-test-v2.sh
           => 'solnet-array-test-v2.sh'
Resolving ftp.sol.net (ftp.sol.net)... 206.55.64.92
Connecting to ftp.sol.net (ftp.sol.net)|206.55.64.92|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /incoming ... done.
==> SIZE solnet-array-test-v2.sh ... 9578
==> PASV ... done.    ==> RETR solnet-array-test-v2.sh ... done.
Length: 9578 (9.4K) (unauthoritative)

solnet-array-test-v2.sh                              100%[=====================================================================================================================>]   9.35K  7.73KB/s   in 1.2s  

2014-12-06 16:59:15 (7.73 KB/s) - 'solnet-array-test-v2.sh' saved [9578]

[root@freenas] ~# ls -la
total 97
drwxr-xr-x   3 root  wheel    17 Dec  6 16:59 ./
drwxr-xr-x  19 root  wheel    28 Dec  6 16:21 ../
-rw-r--r--   1 root  wheel  1128 Dec  6 16:21 .bashrc
-rw-r--r--   1 root  wheel  1229 Dec  6 16:21 .cshrc
-rw-------   1 root  wheel    63 Dec  6 15:54 .history
-rw-r--r--   1 root  wheel    80 Dec  6 16:21 .k5login
-rw-r--r--   1 root  wheel   229 Dec  6 16:21 .login
-rw-r--r--   1 root  wheel   560 Dec  6 16:21 .profile
-rw-r--r--   1 root  wheel  1128 Dec  6 16:21 .shrc
drwxr-xr-x   2 root  wheel     2 Dec  6 07:49 .ssh/
-rwxr-xr-x   1 root  wheel  1677 Dec  6 16:21 change_password*
-rwxr-xr-x   1 root  wheel  1630 Dec  6 16:21 save_cfg*
-rwxr-xr-x   1 root  wheel  1591 Dec  6 16:21 save_sshkeys*
-rw-r--r--   1 root  wheel  9578 Dec  6 16:59 solnet-array-test-v2.sh
-rwxr-xr-x   1 root  wheel  1607 Dec  6 16:21 update*
-rwxr-xr-x   1 root  wheel  2949 Dec  6 16:21 updatep1*
-rwxr-xr-x   1 root  wheel  3225 Dec  6 16:21 updatep2*
[root@freenas] ~# chmod +x solnet-array-test-v2.sh
[root@freenas] ~# ./solnet-array-test-v2.sh
sol.net disk array test v2

1) Use all disks (from camcontrol)
2) Use selected disks (from camcontrol|grep)
3) Specify disks
4) Show camcontrol list

Option: 2

Enter grep match pattern (e.g. ST150176): WDC

Selected disks: da2 da4 da6 da8 da9
<WD WDC WD4001FYYG-0 VR07>         at scbus0 target 12 lun 0 (pass2,da2)
<WD WDC WD4001FYYG-0 VR07>         at scbus0 target 14 lun 0 (pass4,da4)
<WD WDC WD4001FYYG-0 VR07>         at scbus0 target 16 lun 0 (pass6,da6)
<WD WDC WD4001FYYG-0 VR07>         at scbus0 target 18 lun 0 (pass8,da8)
<WD WDC WD4001FYYG-0 VR07>         at scbus0 target 19 lun 0 (pass9,da9)
Is this correct? (y/N): n

1) Use all disks (from camcontrol)
2) Use selected disks (from camcontrol|grep)
3) Specify disks
4) Show camcontrol list

Option: 2

Enter grep match pattern (e.g. ST150176): WD

Selected disks: da0 da1 da2 da3 da4 da5 da6 da7 da8 da9 da10 da11
<WD WD4001FYYG-01SL3 VR07>         at scbus0 target 10 lun 0 (pass0,da0)
<WD WD4001FYYG-01SL3 VR07>         at scbus0 target 11 lun 0 (pass1,da1)
<WD WDC WD4001FYYG-0 VR07>         at scbus0 target 12 lun 0 (pass2,da2)
<WD WD4001FYYG-01SL3 VR07>         at scbus0 target 13 lun 0 (pass3,da3)
<WD WDC WD4001FYYG-0 VR07>         at scbus0 target 14 lun 0 (pass4,da4)
<WD WD4001FYYG-01SL3 VR07>         at scbus0 target 15 lun 0 (pass5,da5)
<WD WDC WD4001FYYG-0 VR07>         at scbus0 target 16 lun 0 (pass6,da6)
<WD WD4001FYYG-01SL3 VR07>         at scbus0 target 17 lun 0 (pass7,da7)
<WD WDC WD4001FYYG-0 VR07>         at scbus0 target 18 lun 0 (pass8,da8)
<WD WDC WD4001FYYG-0 VR07>         at scbus0 target 19 lun 0 (pass9,da9)
<WD WD4001FYYG-01SL3 VR07>         at scbus0 target 20 lun 0 (pass10,da10)
<WD WD4001FYYG-01SL3 VR07>         at scbus0 target 21 lun 0 (pass11,da11)
Is this correct? (y/N): y
Performing initial serial array read (baseline speeds)

Glorious1 · Dec 7, 2014

Wow, that's some real unix kungfu! Thanks for the explicit example with output, depasseg, that's very helpful. Since I had already downloaded it on my laptop, I figured out how to scp it into my home directory on the server instead, after I set up a dataset and gave myself permissions. It's running now.

Glorious1 · Dec 7, 2014

Doh! I've been using SSH, not knowing that it would eventually time out, stopping the script when it did. When I went back in, I discovered that the script wiped out the pool and dataset where I had stored the script on the disks. Who'dathunkit? :( So I stuck in a USB flash drive and created a pool and dataset on it. Also using tmux through SSH so the session will survive me getting logged out. Learning all the time thanks to you guys.

Glorious1 · Dec 8, 2014

The stress test is continuing. I received an email from the system (very cool) that a SMART error was detected in one of my new WD Red NAS drives (not so cool). The email reported one "Current_Pending_Sector" error; by the time I looked at it there were two. I also noticed that each of my slightly older WD Greens shows one "UDMA_CRC_Error_Count" error.

Is this cause for concern or RMA? I'm pretty sure the Greens are out of warranty.

A related question: when I installed the drives I carefully placed them in order of serial number and wired them in order to SATA ports 0, 1, 2, etc. I figured that would make it easier to identify physical drives. But the system seems to have assigned identifiers randomly, so my first drive might be ada3 and the fifth one ada0. Is there no easier way to identify a drive than looking up its serial number and then finding the physical drive with that number?
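
One way to build that device-to-serial map is to ask each disk for its serial number. A rough sketch, assuming smartctl is available and that the disks show up as /dev/ada0, /dev/ada1, and so on; the helper names `serial_of` and `list_serials` are made up:

```shell
#!/bin/sh
# Pull the serial number out of `smartctl -i` output read from stdin.
serial_of() {
    awk -F': *' '/^Serial Number:/ { print $2 }'
}

# Print one "device serial" line per whole disk (partitions are not
# matched by these globs) so the list can be compared with drive labels.
list_serials() {
    for dev in /dev/ada[0-9] /dev/ada[0-9][0-9]; do
        [ -e "$dev" ] || continue   # skip a glob that matched nothing
        printf '%s %s\n' "$dev" "$(smartctl -i "$dev" | serial_of)"
    done
}
```

Running `list_serials` as root should print something like one line per adaN device, which can be taped inside the case next to the drive bays.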

Fraoch · Dec 8, 2014

Glorious1 said:
The stress test is continuing. I received an email from the system (very cool) that a SMART error was detected in one of my new WD Red NAS drives (not so cool). The email reported one "Current_Pending_Sector" error; by the time I looked at it there were two. I also noticed that each of my slightly older WD Greens shows one "UDMA_CRC_Error_Count" error.

Contact WD support - I'm pretty sure even one pending sector is justification for an RMA.

It's probably moot at this point, but you might want to also try the SMART short, long and conveyance tests.

Glorious1 · Dec 8, 2014

Thanks, Fraoch. I ran all the SMART tests before beginning the stress test. All passed with no errors.

Fraoch · Dec 8, 2014

Try them now. They may show even more pending sectors.

It's an academic exercise at this point because that drive will probably be RMAed anyway.

Glorious1 · Dec 8, 2014

I guess I can run those SMART tests while the stress test is ongoing? I would hate to interrupt that, as it would start at the beginning next time, and it's worse than watching paint dry. Can I even pull that drive out while the stress test is running? Is there something like an eject command?

Fraoch · Dec 8, 2014

You can run these tests while the drives are doing something else, but it will slow the tests down - particularly the extended test.

Don't pull the drive before setting it to "OFFLINE" in the FreeNAS web GUI! That will probably interrupt all tests except for the SMART tests. Also, unless you've set the drive as "hotswap" in the BIOS, don't pull it while under power.

Glorious1 · Dec 8, 2014

VERY valuable information. Thank you!

Glorious1 · Dec 9, 2014

I read a bit about these "current pending" sector errors and decided to write to the whole disk and see what would happen. I wrote up a script to make 5 parallel writes of about 3 TB each just to give it some more stress.

It didn't finish, but the Current Pending Sector count went from 2 to 0, and the Raw Read Error Rate went from 0 to 2. Reallocated sectors are still 0. Not sure whether that means it fixed itself or not, but during the writing in the middle of the night I got a cryptic email. Anyone know what it means (ada1 is the troubled disk)?

I already set up the RMA, but I wanted to check further since I've heard that WD often replaces new RMA'd drives with used ones.

Code:

Tabernacle.local kernel log messages:
CPU: Intel(R) Atom(TM) CPU  C2758  @ 2.40GHz (2400.06-MHz K8-class CPU)
SMP: AP CPU #5 Launched!
SMP: AP CPU #2 Launched!
SMP: AP CPU #4 Launched!
SMP: AP CPU #1 Launched!
Timecounter "TSC-low" frequency 1200030168 Hz quality 1000
vboxdrv: fAsync=0 offMin=0x780 offMax=0x1440
ahcich1: Timeout on slot 22 port 0
ahcich1: is 00000000 cs 00400000 ss 00000000 rs 00400000 tfd d0 serr 00000000 cmd 0020d617
ahcich1: Error while READ LOG EXT
(ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 00 48 b8 40 2e 00 00 01 00 00
(ada1:ahcich1:0:0:0): CAM status: ATA Status Error
(ada1:ahcich1:0:0:0): ATA status: 00 ()
(ada1:ahcich1:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
(ada1:ahcich1:0:0:0): Retrying command
(ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 00 59 3c 40 2f 00 00 01 00 00
(ada1:ahcich1:0:0:0): CAM status: ATA Status Error
(ada1:ahcich1:0:0:0): ATA status: 00 ()
(ada1:ahcich1:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
(ada1:ahcich1:0:0:0): Retrying command
(ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 00 cc b1 40 2f 00 00 01 00 00
(ada1:ahcich1:0:0:0): CAM status: ATA Status Error
(ada1:ahcich1:0:0:0): ATA status: 00 ()
(ada1:ahcich1:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
(ada1:ahcich1:0:0:0): Retrying command
(ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 00 30 f0 40 2e 00 00 01 00 00
(ada1:ahcich1:0:0:0): ATA status: 00 ()
(ada1:ahcich1:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
(ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 00 b8 c6 40 30 00 00 01 00 00
(ada1:ahcich1:0:0:0): ATA status: 00 ()
(ada1:ahcich1:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00

-- End of security output --

Glorious1 · Dec 9, 2014

Well, just tried to repeat the long test and it failed. Strangely, the printout still says "SMART overall-health self-assessment test result: PASSED".
I'll give up and ship it out for RMA.

Code:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       80%        76         818256240
# 2  Short offline       Completed without error       00%        75         -
# 3  Conveyance offline  Completed without error       00%        75         -
# 4  Extended offline    Completed without error       00%        24         -
# 5  Short offline       Completed without error       00%        17         -
# 6  Conveyance offline  Completed without error       00%        17         -
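
To pick the non-passing entries out of a log like the one above, a small filter can help. This is only a sketch that assumes the log layout shown here; the function name `failed_selftests` is made up:

```shell
#!/bin/sh
# Read `smartctl -l selftest` output on stdin and print every log entry
# whose status is anything other than "Completed without error".
failed_selftests() {
    grep '^# *[0-9]' | grep -v 'Completed without error'
}
```

For example, `smartctl -l selftest /dev/ada1 | failed_selftests` would print only entry # 1 from the log above.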

Fraoch · Dec 9, 2014

ATA Status Error = bad cable? Bad controller? Try swapping cables, see if the problem drive changes.

Glorious1 · Dec 9, 2014

Well, that's a good thought that never occurred to me, but I've got it all boxed up now, so we'll see what WD says. I'll keep the cable swapping in mind if it happens again.

Fraoch · Dec 9, 2014

I believe they read the SMART log. If any bad sectors or pending sectors are logged, you get a replacement. So it depends on whether the SMART error appears in the log.

You can check the log with:

Code:

smartctl -a /dev/adaX

or

Code:

smartctl -x /dev/adaX

where X = the drive number.
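
To watch just the attributes discussed in this thread, the attribute table can be filtered. A sketch, assuming the usual `smartctl -A` table layout where the attribute name is the second column and the raw value is the tenth; the function name `watch_attrs` is made up:

```shell
#!/bin/sh
# Read `smartctl -A` output on stdin and print "name raw_value" for the
# attributes mentioned in this thread.
watch_attrs() {
    awk '$2 ~ /^(Current_Pending_Sector|Reallocated_Sector_Ct|Raw_Read_Error_Rate|UDMA_CRC_Error_Count)$/ { print $2, $10 }'
}
```

Usage would be along the lines of `smartctl -A /dev/adaX | watch_attrs`, run before and after a test to see which raw counts changed.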

Glorious1 · Dec 9, 2014

Well, as mentioned, I HAD pending sector errors, but when I wrote to the whole drive, they changed back to 0 and were replaced by Raw_Read_Error_Rate of 2. Hopefully when they see it failed the long test that will help. Thanks.

cyberjock · Dec 10, 2014

Few things...

Yes, it says "SMART overall-health self-assessment test result: PASSED" and that is normal despite failing a SMART test. That assessment is based solely on whether any monitored parameter has dropped below its threshold value. Hint: none of yours has.

You get an RMA if you have a failed SMART test. You'd think that bad sectors and such would qualify, but that's not entirely correct. A failed SMART test qualifies you. But if you have bad sectors then a SMART test will fail (whether performed by you or by the manufacturer).

As for the RMA sending you a "recertified" drive, there's a possibility it will be some used drive. It could be a drive that had a manufacturing defect at a higher capacity and was recertified at a lower capacity. I don't think anyone has solid evidence (such as from an employee of the company) that they actually send out used drives, but it's a pretty safe conclusion that they get many drives back that work fine, and they simply run them through a gamut of tests, throw a "recertified" sticker on them, and put them in the pool of drives to go out for RMAs.

If you are adamant that you want a new disk then you basically have that 30 day window when you buy a hard drive to get another new drive from the place of purchase. After that you'll almost certainly get a "recertified" drive.
