SOLVED Beginning Burn-in, help with SMART results?

J-NAS · Oct 27, 2014

OK, please go easy on me as this is my first post! I've been reading up on FreeNAS for about a month now--probably longer. I've read all the setup guides, watched the PP presentation (a few times actually) and went with hardware strictly off the "recommended" list. I've read the burn-in threads with great interest and have a game-plan I'm implementing.

I completed the build (actually used an anti-static strap as suggested) and ran 12 hours of memtest with ECC off, and 12 with it on. That got me...3 passes each? No errors.

Moved on to the drives. I'm running the SMART commands from the Shell from within FreeNAS. The conveyance test doesn't appear to work on HGST drives, but that seems to be the expected result after re-searching these forums.

All drives passed the short SMART test. I've begun testing with the long SMART test and I'm collecting my results in an Excel sheet for future reference. I'm only four drives in, but I'd like feedback on two things that have happened:

a) I've been testing the drives with

Code:

smartctl -t long /dev/adaX

I waited the 10 hours requested to print results with

Code:

smartctl -a /dev/adaX

The first drive I tested returned a value very different from the next two drives for SEEK TIME PERFORMANCE. ADA0 returned "0", while the next two drives were "32" and "33". My google-fu only reveals that if this number drops, I should be concerned.

Can someone elaborate as to why my drive would read 0? From what I've read, 33 is also very low...
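One way to line these attributes up across drives is to filter the attribute table instead of reading the full `smartctl -a` dump. This is a minimal sketch, assuming the usual ten-column attribute layout (ID, name, flag, value, worst, threshold, type, updated, when-failed, raw); `smart_pick` is a hypothetical helper name, not part of smartmontools, and the sample line is copied from output posted later in this thread.

```shell
# smart_pick: hypothetical helper, not part of smartmontools.
# Reads `smartctl -A` output on stdin and prints the normalized value
# (column 4) and raw value (column 10) for the attributes discussed here.
smart_pick() {
  awk '$2 ~ /^(Seek_Time_Performance|Start_Stop_Count|Power_Cycle_Count)$/ {
         print $2, "value=" $4, "raw=" $10
       }'
}

# Demonstration on one captured attribute line (no drive needed)
printf '%s\n' ' 12 Power_Cycle_Count  0x0032  100  100  020  Old_age  Always  -  7' | smart_pick
# prints: Power_Cycle_Count value=100 raw=7

# On the NAS itself (assumed device names):
#   for i in 0 1 2 3; do echo "ada$i"; smartctl -A /dev/ada$i | smart_pick; done
```

Printing both the normalized and the raw column makes it clear which of the two numbers is being compared between drives.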

b) I awoke this morning to find that my shell window in FreeNAS was glitched out, showing two screens stacked on top of one another, with a GUI "login" prompt. I was unable to login, as the screen wasn't actually accepting input. Not sure if this was some overlay issue or why I'd been "logged out". I had to close my FreeNAS browser window and login again to re-launch the shell.

I ran

Code:

smartctl -a /dev/ada3

and although it looks like the test completed in the summary, START/STOP COUNT and POWER CYCLE COUNT are "2", while the preceding drives all read "3". All the drives were connected and powered on in unison for each session. They should be identical.

I don't understand this discrepancy and whether I should pay it any mind.

If the drives clear a long SMART test I will move on to a run of dd, and then badblocks and then a final pass of SMART to see if anything has changed.

I am an adult, capable of following directions, and more than willing to put in the legwork required to get this up and running. If there is a resource which directly addresses what I'm seeing then I've missed it, and will gladly go back to do my homework if you'd be so kind as to point me in the correct direction.

I could never have gotten this far without following the wonderful guides provided on the background of FreeNAS, the underpinnings of ZFS, and how to properly create safe Pools. Quite interesting how an entire pool can be brought down by a single drive failure with the wrong collection of vdevs! Anyhow, I believe I have followed best practices to this point and hope to continue.

Would love to hear your thoughts--should I rerun a SMART long on ADA0 and see if I get a different result? I was initially using

Code:

smartctl -l xselftest /dev/adaX

but that only provided me with the briefest of summaries. My own research is what led me to use

Code:

smartctl -a /dev/adaX

I hope that's correct...

Thanks!

edit(s): figured out how to format code, and added hardware to .sig

Oko · Oct 27, 2014

I don't want to hijack your thread but one of the things which bothers me the most about SMART is that it doesn't report the most important info when it comes to HDD failure, at least for classical HDDs with moving platters. That information is vibrations, preferably reported as a time series. Another thing is that SMART is just a reporting tool. As somebody who works in a data mining lab, one of the more interesting problems for me would be to plug SMART data into a data mining analytic tool (my Lab has developed such tools but for different purposes:)) and train the analytic tool using training sets of failed HDDs to reliably predict HDD failure. I am talking about machine learning here. Anybody has 10 million in venture capital around here please send me PP :).

jgreco · Oct 27, 2014

I wouldn't worry about the START/STOP and POWER CYCLE counts unless they're diverging rapidly. Unless you're _positive_ that they were all the same to begin with, at which point, you may have some problem somewhere, but even then, I'd disregard it until the numbers had diverged by more than one.

Ericloewe · Oct 27, 2014

Many drives arrive with one or two power cycles. I always assumed those were subjected to some basic testing at the factory and never gave it a second thought.

jgreco · Oct 27, 2014

Yeah, I just got a bunch of Seacrates which have only been powered on and a bunch of them are showing 7.

Code:

# sh -c '( for i in 0 1 2 3 ; do smartctl -a /dev/ada${i} ; done ) | egrep "Start_St|Power_O|Power_Cy"'
  4 Start_Stop_Count  0x0032  100  100  020  Old_age  Always  -  7
  9 Power_On_Hours  0x0032  100  100  000  Old_age  Always  -  72
 12 Power_Cycle_Count  0x0032  100  100  020  Old_age  Always  -  7
  4 Start_Stop_Count  0x0032  100  100  020  Old_age  Always  -  7
  9 Power_On_Hours  0x0032  100  100  000  Old_age  Always  -  33
 12 Power_Cycle_Count  0x0032  100  100  020  Old_age  Always  -  7
  4 Start_Stop_Count  0x0032  100  100  020  Old_age  Always  -  7
  9 Power_On_Hours  0x0032  100  100  000  Old_age  Always  -  7
 12 Power_Cycle_Count  0x0032  100  100  020  Old_age  Always  -  7
  4 Start_Stop_Count  0x0032  100  100  020  Old_age  Always  -  1
  9 Power_On_Hours  0x0032  100  100  000  Old_age  Always  -  620
 12 Power_Cycle_Count  0x0032  100  100  020  Old_age  Always  -  1
# camcontrol devlist
<NECVMWar VMware IDE CDR10 1.00>  at scbus1 target 0 lun 0 (pass0,cd0)
<VMware Virtual disk 1.0>  at scbus2 target 0 lun 0 (pass1,da0)
<ST6000DX000-1H217Z CC46>  at scbus5 target 0 lun 0 (ada0,pass2)
<ST6000DX000-1H217Z CC46>  at scbus6 target 0 lun 0 (ada1,pass3)
<ST6000DX000-1H217Z CC46>  at scbus7 target 0 lun 0 (ada2,pass4)
<ST6000DX000-1H217Z CC47>  at scbus8 target 0 lun 0 (pass5,ada3)

Note also that the ones showing 7 are rev CC46, while the one CC47 drive (purchased elsewhere) shows only 1.

J-NAS · Oct 27, 2014

jgreco said:
I wouldn't worry about the START/STOP and POWER CYCLE counts unless they're diverging rapidly. Unless you're _positive_ that they were all the same to begin with, at which point, you may have some problem somewhere, but even then, I'd disregard it until the numbers had diverged by more than one.

Ericloewe said:
Many drives arrive with one or two power cycles. I always assumed those were subjected to some basic testing at the factory and never gave it a second thought.

Duly noted. Any comment on the SEEK TIME PERFORMANCE?

Ericloewe · Oct 27, 2014

J-NAS said:
Duly noted. Any comment on the SEEK TIME PERFORMANCE?

Right, forgot to answer that this morning (that's what I get for buying lunch while reading the forums on my phone):

The numbers will probably assume reasonable values after the drive determines the data has stabilized enough to write it down. My WD Reds needed a few days (and spin ups) to correctly start showing their spin up time instead of 0. They all show values close to each other now.

J-NAS · Oct 27, 2014

Ericloewe said:
Right, forgot to answer that this morning (that's what I get for buying lunch while reading the forums on my phone):

The numbers will probably assume reasonable values after the drive determines the data has stabilized enough to write it down. My WD Reds needed a few days (and spin ups) to correctly start showing their spin up time instead of 0. They all show values close to each other now.

Awesome--thanks! I'm currently running about 10 hours between tests--can't wait until the stage of testing where I can begin chaining dd commands to all drives simultaneously!

Ericloewe · Oct 27, 2014

J-NAS said:
Awesome--thanks! I'm currently running about 10 hours between tests--can't wait until the stage of testing where I can begin chaining dd commands to all drives simultaneously!

Seek time performance may take a while, since you'll mostly be doing sequential IO for now, so don't worry if the drives are working fine.

Generally, only stuff like bad sectors and other stuff labeled error count is cause for concern. WD drives provide somewhat readable error rates, but since they're not very well defined they're only more readable than Seagate's encoding of error rates - not actually more informative.

cyberjock · Oct 27, 2014

jgreco said:
I wouldn't worry about the START/STOP and POWER CYCLE counts unless they're diverging rapidly. Unless you're _positive_ that they were all the same to begin with, at which point, you may have some problem somewhere, but even then, I'd disregard it until the numbers had diverged by more than one.

If you are one of those weirdos that like to spin down their disks when the server is idle those will diverge naturally. ;)

jgreco · Oct 27, 2014

I'm going to sneak onto your filer and install a little cron script...

Ericloewe · Oct 28, 2014

cyberjock said:
If you are one of those weirdos that like to spin down their disks when the server is idle those will diverge naturally. ;)

It's a very German superficial-environmentalist thing. Call it an irrational fear of the quantifiable danger, as opposed to the greater, more difficult to quantify danger.
Happens in the German section way more often than it does here.

anodos · Oct 28, 2014

Ericloewe said:
It's a very German superficial-environmentalist thing. Call it an irrational fear of the quantifiable danger, as opposed to the greater, more difficult to quantify danger.

Oh, like their fear of nuclear power. :)

cyberjock · Oct 28, 2014

I love how they shut down their own power plants, then bought power from the French (who also use nuclear power almost exclusively). To make matters even more hilarious, if the French had an accident like Fukushima or Chernobyl, Germany would be almost entirely within the area that would be contaminated. Haha.

Ericloewe · Oct 28, 2014

cyberjock said:
I love how they shut down their own power plants, then bought power from the French (who also use nuclear power almost exclusively). To make matters even more hilarious, if the French had an accident like Fukushima or Chernobyl, Germany would be almost entirely within the area that would be contaminated. Haha.

anodos said:
Oh, like their fear of nuclear power. :)

Yup, that's the craziest manifestation of this German mentality.

Before someone accuses me of being an insensitive clod, I'm German, so it's not just a lazy stereotype.

J-NAS · Oct 28, 2014

OK, so SMART long checked out and I moved on to

Code:

for i in 0 1 2 3 4 5; do
dd if=/dev/ada${i} of=/dev/null bs=1048576 &
done

and I've returned 12 hours later to find the following

[root@freenas ~]# for i in 0 1 2 3 4 5; do
> dd if=/dev/ada${i} of=/dev/null bs=1048576 &
> done
[1] 32043
[2] 32044
[3] 32045
[4] 32046
[5] 32047
[6] 32048
[root@freenas ~]# 3815447+1 records in
3815447+1 records out
4000787030016 bytes transferred in 29408.436272 secs (136042154 bytes/sec)
3815447+1 records in
3815447+1 records out
4000787030016 bytes transferred in 29542.926235 secs (135422842 bytes/sec)
3815447+1 records in
3815447+1 records out
4000787030016 bytes transferred in 29735.657377 secs (134545101 bytes/sec)
3815447+1 records in
3815447+1 records out
4000787030016 bytes transferred in 29835.296386 secs (134095770 bytes/sec)
3815447+1 records in
3815447+1 records out
4000787030016 bytes transferred in 29889.979148 secs (133850446 bytes/sec)
3815447+1 records in
3815447+1 records out
4000787030016 bytes transferred in 30033.595088 secs (133210394 bytes/sec)

These results look to be formatted differently from the examples I've seen posted. Will it be obvious when this read test has completed? (it looks done to me, but I don't have control in the shell yet...)
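Each dd job prints a three-line summary (records in, records out, bytes transferred) when it exits, so six summaries correspond to six finished jobs. The summaries can be condensed into elapsed hours and throughput with a small filter. This is a sketch only: `dd_rate` is a hypothetical helper, and it assumes the "bytes transferred in ... secs" wording shown in the output above.

```shell
# dd_rate: hypothetical helper. Reads dd summary lines on stdin and prints
# elapsed hours and throughput in MB/s (1 MB = 1,000,000 bytes).
# Field 1 is the byte count, field 5 is the elapsed time in seconds.
dd_rate() {
  awk '/bytes transferred in/ {
         printf "%.2f h, %.1f MB/s\n", $5 / 3600, $1 / $5 / 1000000
       }'
}

# Demonstration on the first summary line from the output above
echo '4000787030016 bytes transferred in 29408.436272 secs (136042154 bytes/sec)' | dd_rate
# prints: 8.17 h, 136.0 MB/s
```

Counting the lines the filter emits is a quick check that every drive finished its pass.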

edit: nope--it was done. Just didn't draw the prompt. Moved on to writing out to the drives and used the following

Code:

for i in 0 1 2 3 4 5; do
dd if=/dev/zero of=/dev/ada${i} bs=1048576 &
done

In contrast to running the read command, my shell does nothing. I get no indication it is working. Is this expected behaviour? (would hate to wait 12 hours if it's not actually doing anything...)

J-NAS · Oct 28, 2014

Assuming my dd write command is formatted properly and actually executing--my next step will be to run badblocks so I'll ask this in preparation.

I only see one reference to running

Code:

sysctl kern.geom.debugflags=0x10

before executing badblocks. Is this a necessary / suggested practice?

My intended commandline for badblocks after creating a USB dataset named storage:

Code:

for i in 0 1 2 3 4 5; do
badblocks -svw -b 4096 -t 0xFF -t 0x00 -t 0xFF -o /mnt/storage/data/badblocks_ada${i}.txt /dev/ada${i} &
done

Though I'm thinking -t 0xFF -t 0x00 -t 0xFF is limiting it to only two patterns, which if omitted would be more thorough but take longer, correct?
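For a rough sense of the time involved: in write mode badblocks writes each pattern across the whole device and then reads it back, so every `-t` option costs two full passes. The three `-t` options above therefore mean three write-and-verify rounds over two distinct values, while omitting `-t` falls back to the four default patterns (0xaa, 0x55, 0xff, 0x00). The sketch below assumes a write pass takes about as long as the roughly 29,400-second dd read pass reported earlier in the thread, which may not hold on real hardware; `badblocks_hours` is a hypothetical helper.

```shell
# badblocks_hours: hypothetical helper for a rough estimate only.
# $1 = seconds for one full pass over a drive (assumed equal for write and read)
# $2 = number of patterns; each costs one write pass plus one read-back pass
badblocks_hours() {
  awk -v secs="$1" -v patterns="$2" \
    'BEGIN { printf "%.1f\n", secs * 2 * patterns / 3600 }'
}

badblocks_hours 29408 3   # three -t options, as in the command above: prints 49.0
badblocks_hours 29408 4   # four default patterns: prints 65.4
```

Since the dd run showed all six drives sustaining similar rates in parallel, the estimate is per run rather than per drive, provided the controller is not the bottleneck.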

J-NAS · Oct 29, 2014

OK, so this goes back to my initial post. I was logged out of the shell again this morning, and this is what my screen looks like. Any idea what is happening here?

Fraoch · Oct 29, 2014

Interesting because I've seen that screen when I was trying the exact same tests you are. I had to do the badblocks test using 4 simultaneous SSH sessions - SSH does not accept the for ... do statement (it's probably related to the shell environment it's using) so I just used 4 simultaneous sessions.

The initial output of the dd tests is the process IDs. You'll be able to see this is running using at least two methods:

"top", either through an SSH session or through "display running processes" in the web GUI. You should see 6 dd processes (in your case).
Reporting - Disks through the web GUI. You'll see a high read/write rate that slowly drops as the dd process works through the disk. Suddenly it will drop to 0. Done!

If you're also near the server, you will be able to hear and see dd working - the drive access light(s) will be constantly on and you'll hear your drives working.
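The same check can be done from the shell that launched the jobs. The bracketed lines printed at launch, such as `[1] 32043`, are a job number followed by a process ID, and `kill -0` tests whether a PID still exists without sending it a signal. A minimal sketch follows; `still_running` is a hypothetical helper, and a `sleep` stands in for dd so the example is harmless to run.

```shell
# still_running: hypothetical helper. Prints each given PID that is still
# alive. `kill -0` sends no signal; it only checks that the process exists
# and that we are allowed to signal it.
still_running() {
  for pid in "$@"; do
    if kill -0 "$pid" 2>/dev/null; then
      echo "$pid"
    fi
  done
}

# Harmless stand-in for a long-running dd job
sleep 30 &
job=$!
still_running "$job"              # prints the PID while the job runs
kill "$job"
wait "$job" 2>/dev/null || true   # reap the job; its exit status is not needed
still_running "$job"              # prints nothing once the job is gone
```

With the PIDs from the thread this would be `still_running 32043 32044 32045 32046 32047 32048`. FreeBSD's dd should also print its current block counts when it receives SIGINFO, for example via `kill -INFO 32043`.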

Also if you've created a pool, you definitely should use:

Code:

sysctl kern.geom.debugflags=0x10

which allows raw access to the drives. I'm not sure what would happen otherwise - a good outcome will simply be an error, but a bad outcome would be the destruction of the pool. However with your server in testing, any data on your pool is there for testing purposes and should be considered disposable.

Just make sure to reboot after you're finished testing using that sysctl.

Ericloewe · Oct 29, 2014

Why not use tmux instead of trying to hack your way around?
