SMART test overhead & scheduling?

rvassar

Guru
Joined
May 2, 2018
Messages
972
I was reading one of the other threads here this morning about SMART tests, and it occurred to me that I've done some shuffling and recabling since I set up my periodic SMART tests, and the descriptions no longer match my actual pool geometry... So I set about fixing things, and I have a question:

I stagger my SMART tests. I have a 2x2 mirror pool and a 3-drive RAIDZ pool. I run the mirror pool's short test daily in two jobs, several hours apart, so that only one drive in each vdev is under test at any given time. For the RAIDZ pool, I run the daily short tests all in one job. I run the long tests weekly, again splitting the mirror pool into two jobs separated by 12 hours. For the RAIDZ pool, the long tests take 400+ minutes per drive, so I set up three tasks and run them weekly on three consecutive days.
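For illustration, here's roughly what that schedule would look like expressed as cron entries calling smartctl directly (device names and times are placeholders, not my actual layout; on FreeNAS this is normally set up through the S.M.A.R.T. Tests tasks in the GUI rather than raw cron):

Code:
# Mirror pool short tests, staggered so only one drive per vdev is under test at a time
0 2 * * *   /usr/local/sbin/smartctl -t short /dev/ada0   # vdev 1, disk A
0 2 * * *   /usr/local/sbin/smartctl -t short /dev/ada2   # vdev 2, disk A
0 8 * * *   /usr/local/sbin/smartctl -t short /dev/ada1   # vdev 1, disk B
0 8 * * *   /usr/local/sbin/smartctl -t short /dev/ada3   # vdev 2, disk B
# RAIDZ pool short tests, all in one job
0 3 * * *   for d in da0 da1 da2; do /usr/local/sbin/smartctl -t short /dev/$d; done
# Weekly long tests: mirror pool in two jobs, 12 hours apart
0 1 * * 6   /usr/local/sbin/smartctl -t long /dev/ada0; /usr/local/sbin/smartctl -t long /dev/ada2
0 13 * * 6  /usr/local/sbin/smartctl -t long /dev/ada1; /usr/local/sbin/smartctl -t long /dev/ada3
# RAIDZ long tests spread over three consecutive days (400+ minutes each)
0 1 * * 0   /usr/local/sbin/smartctl -t long /dev/da0
0 1 * * 1   /usr/local/sbin/smartctl -t long /dev/da1
0 1 * * 2   /usr/local/sbin/smartctl -t long /dev/da2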

My questions:

What is the overhead of the SMART test execution? Is there any benefit or disadvantage in splitting up the tests like this? I set it up like this, and there are certain times of day when I want no test activity, but I don't really have a justification for it.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
What is the overhead of the SMART test execution?
It is I/O overhead internal to the drive; there is no CPU overhead on the computer. Drives operate more slowly while conducting a test because of the internal I/O for the test, and using the drive for system I/O slows the progress of the test, which extends its duration.
How much impact the self-test I/O has depends very much on the drive. Some drives handle it very poorly while others handle it pretty well. Usually, older drives suffer more than newer drives; I think that is because the integrated controllers on newer drives are more capable and have more memory.
Is there any benefit or disadvantage in splitting up the tests like this?
Short tests usually complete in a matter of a few minutes, usually less than 15 even for drives that take a long time on a short test. Because that is (relatively) quick, I schedule all my drives to do a short self-test at the same time. I run that daily at 5AM, which is a time I am not usually accessing the system for anything else. I do that on the systems I manage for work also.

Long tests are a different matter. The time to complete the test can vary a great deal depending on the drive's age, manufacturer, and capacity. I have some 10TB drives at work that take over 12 hours to complete the test. I schedule those to run at 6AM on Saturdays, when nobody is there and it will not impact anyone. For my home system, I have been running the long test (smaller drives) at 7AM on Tuesdays because I am normally at work and it won't bother me; the test is normally done in 6 hours, so it is finished by the time I get home.
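If you want a rough idea of how long a particular drive's long test should take before you schedule it, the drive reports its own estimate in the capabilities section of the smartctl output (device name here is just an example):

Code:
# The drive's own estimate of short and extended self-test duration
smartctl -c /dev/da0 | grep -A 1 "self-test routine"

The actual time can run well past that estimate if the drive is busy with other I/O, for the reasons above.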
I set it up like this, and there are certain times of day when I want no test activity, but I don't really have a justification for it.
I set my schedule so the slowdown has minimal impact on users (just me, on the home system). I also try to schedule scrubs to coincide with low-use times, but I have a system at work where the scrub takes four days, so there is not much chance of that.
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
It is I/O overhead internal to the drive; there is no CPU overhead on the computer. Drives operate more slowly while conducting a test because of the internal I/O for the test, and using the drive for system I/O slows the progress of the test, which extends its duration.
How much impact the self-test I/O has depends very much on the drive.

I was initially a bit troubled after reading this... 10+ hours later, I'm fully troubled. I have a Seagate ST32000542AS with 57.6k hours that has been stuck at 90% remaining since this morning. It's not throwing errors, but it's not making any progress. It hosts backups and a security camera pool. I'm thinking I'll turn off the security cameras tomorrow and see if it makes any progress.
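For anyone wondering where that figure comes from, I'm reading the self-test execution status out of smartctl (device name here is just an example):

Code:
# Shows whether a self-test is in progress and the percentage remaining
smartctl -a /dev/ada4 | grep -A 2 "Self-test execution status"
# History of completed/aborted self-tests
smartctl -l selftest /dev/ada4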
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
No guarantee, but in my experience, the drives that take a really long time to complete the long test, longer than usual, are the ones that fail in the near future.
 

Snow

Patron
Joined
Aug 1, 2014
Messages
309
I've also seen them start to heat up; going from 27C/28C to 30C/33C seems to be a sign they are about to die as well.
I have mine all spaced out, around 10 days apart: long tests once a month, short tests every other week. I do not do any of the other tests.
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
No guarantee, but in my experience, the drives that take a really long time to complete the long test, longer than usual, are the ones that fail in the near future.

Well... And with 57k hours, it's sort of expected. I've been kind of keeping an eye on this drive; it's the oldest one in my NAS. I took advantage of the Best Buy Easystore sale and replaced the 3TB drives in my VM / mirror pool, so I have a couple of newish, fully burned-in drives with less than 8k hours at the ready.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
I've also seen them start to heat up; going from 27C/28C to 30C/33C seems to be a sign they are about to die as well.
Probably the bearings.
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
Probably the bearings.

It can also be slop in the head assembly. As it wears out, the controller ends up in a feedback loop, repeatedly issuing position corrections to the voice coil, which translates to increased current draw and hence more heat.
 

Snow

Patron
Joined
Aug 1, 2014
Messages
309
One of my oldest drives has over 60K hours; it stays at around 31-32C. I have a replacement for it, I've just been too lazy to swap it out. Also, I have not had any SMART errors from it yet, which is a surprise. Knock on wood!
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
24+ hours have passed, and the next pool member kicked off its weekly long self-test this morning. It has been stuck at 90% remaining, just like the first one. This drive is a Hitachi at half the age. I've taken steps to idle the pool (halted ZM) to see if I can get the drives to make progress.
 

Snow

Patron
Joined
Aug 1, 2014
Messages
309
Sounds like you should replace it. My SMART tests have never gotten stuck that I know of, knock on wood.
How long will it stay like that?
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
Sounds like you should replace it. My SMART tests have never gotten stuck that I know of, knock on wood.
How long will it stay like that?

Perhaps, but the pool is fully functional, responsive, and presenting no errors. The drive that's been stuck for the last 30 hours responded immediately to "smartctl -X", and aborted the test. I reissued the test manually, and it immediately started testing again.

I do know I'm not going to like having to idle the pool to perform testing. Waiting to see if the Hitachi drive resumes on its own.
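For the record, the abort and manual restart were just the following (device name here is just an example):

Code:
smartctl -X /dev/ada4        # abort the self-test currently in progress
smartctl -t long /dev/ada4   # start a new long (extended) self-test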
 

tfran1990

Patron
Joined
Oct 18, 2017
Messages
294
60k hours of spin.... wow
 

Snow

Patron
Joined
Aug 1, 2014
Messages
309
It was an old drive out of a data center, an AT&T communication server. I got it with 40K hours on it, did a burn-in, and it passed without issue.
That's only 7.25 years of spin time; try to guess what brand it is.
Perhaps, but the pool is fully functional, responsive, and presenting no errors. The drive that's been stuck for the last 30 hours responded immediately to "smartctl -X", and aborted the test. I reissued the test manually, and it immediately started testing again.

I mean, it's better to be safe than sorry. If I had any drive other than an SSD showing SMART problems, I would replace it or add a spare to that pool in case it drops. Also, you have to worry that it may be having a firmware problem, and if so, that's also the same chip that does ECC for the controller.
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
OK, the Hitachi disk sped up and is now at 50% remaining. The original problem Seagate, which I restarted, is also making progress, now at 80% remaining.

So the constant trickle of I/O from my security cameras appears to be blocking SMART testing, or at least slowing it down to a snail's pace.
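In case it's useful to anyone else, I just left a quick loop running to watch for movement while the pool was idle (a rough sketch; device name here is an example):

Code:
# Print the remaining-percentage line every 10 minutes
while true; do
    date
    smartctl -a /dev/ada4 | grep "of test remaining"
    sleep 600
done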
 

Snow

Patron
Joined
Aug 1, 2014
Messages
309
Perhaps, but the pool is fully functional, responsive, and presenting no errors. The drive that's been stuck for the last 30 hours responded immediately to "smartctl -X", and aborted the test. I reissued the test manually, and it immediately started testing again.
What do you have the FPS on the streams set to?
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
What do you have the FPS on the streams set to?

I'm only running around 6fps each, but one of the cams generates ~35kb/sec and the other 400kb/sec. I'm still figuring that out...
 

Snow

Patron
Joined
Aug 1, 2014
Messages
309
6 fps is pretty low; it is on the conservative side. I am guessing you have more than 8 cams?
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
6 fps is pretty low; it is on the conservative side. I am guessing you have more than 8 cams?

That's actually one of those things you don't document in a public forum, sorry. ;)

The 57k-hour drive passed the long self-test after 6 hours. The Hitachi passed as well. I'm going to re-enable the cameras and see how long the third drive takes. I'm guessing a week or more...
 

tfran1990

Patron
Joined
Oct 18, 2017
Messages
294
I'm only running around 6fps each, but one of the cams generates ~35kb/sec and the other 400kb/sec. I'm still figuring that out...
Depending on what cameras you have, is it possible to record a substream?
(I have four 4MP IP cameras; I view the live stream at 2K 30fps, but I have them record on motion at a 1080p or 720p substream.)
 