Exos x16 - High head cycle count due to toggle between EPC idle_a and idle_b power states

scrubbertux · Feb 1, 2021

I'm currently trying to build a new storage server based on TrueNAS and am having some trouble: the "load_cycle_count" of the drives increases every 3 minutes, which works out to 480 load cycles per day, or 175,200 per year. I don't think this is intended behaviour.
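
The projection above can be checked with a few lines of Python. This is only a minimal sketch: the function name load_cycles is made up for illustration, and it assumes exactly one load cycle per wake-up, around the clock.

```python
def load_cycles(wake_interval_min: float) -> tuple[float, float]:
    """Return (cycles per day, cycles per year), assuming one head
    load cycle per wake-up and a constant wake-up interval."""
    per_day = 24 * 60 / wake_interval_min
    per_year = per_day * 365
    return per_day, per_year


# A wake-up every 3 minutes, as observed in the post above.
print(load_cycles(3))  # (480.0, 175200.0)
```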

But first things first,
I'm using the following hardware & setup:

CPU: i3 9100
MB: Supermicro X11SCH-LN4
Drives: 4x Seagate Exos x16 16TB (ST16000NM001G), 1x Transcend 128GB M.2 SSD

TrueNAS is directly installed on the SSD, no virtualization or anything special or modified. Just a plain install.
The 4 16TB hdds are directly connected to the motherboard.

What I did so far:
- Changed the HDDs' format to 4K native via SeaChest Utilities
- Executed Spearfoot's burn-in script. It took 8 days and completed without any complications and with 0 errors for all 4 drives.
- Note: The "load_cycle_count" of the drives directly after the burn-in, which completed this morning, was around 60; now it is already around 300 :(
- Note 2: I haven't created any pool or changed any settings so far; it's just a plain install and the burn-in test.

What seems to happen:
The drives unload their heads when there is no access within 2 minutes. This is the default setting, as explained in the very detailed Exos datasheet, which can be found here: https://www.seagate.com/www-content/product-content/enterprise-hdd-fam/exos-x-16/en-us/docs/100845789j.pdf

There are 4 power conditions, with the following behaviour and default timers:

Power Condition Name	Description	Manufacturer Default Timer Values
Idle_a	Reduced electronics	100ms
Idle_b	Heads unloaded. Disks spinning at full RPM	2 min
Idle_c	Heads unloaded. Disks spinning at reduced RPM	4 min
Standby_z	Heads unloaded. Motor stopped (disks not spinning)	15 min

Reading from the SeaChest Utilities I know that only idle_a and idle_b states are enabled by default with the factory settings. idle_c and Standby_z are disabled.

The problem now is that the drives unload their heads 2 minutes after the last operation and go to idle_b (you can hear that clearly). Then something "wakes up" the drive every 3 minutes and it goes back to idle_a, which gives a loud *click* and increases the load_cycle_count counter...
Only disabling S.M.A.R.T. completely for every drive allows the drives to stay in idle_b. So my guess is that S.M.A.R.T. readings are the cause of the described problem.

What I have already tried, without any change:

- I tried all the different APM settings for the drives, but that doesn't help because the drives don't support APM, only EPC
- I also tried different HDD Standby settings, including the "Force HDD Standby" checkbox, but to no effect
- I'm aware of this SpinDown script, but I don't want to spin the drives down, so that doesn't help me
- I'm aware of this solved bug, which addresses a closely related issue. But "Force HDD Standby" isn't usable without spin-down, and I don't want to use spin-down...

What are my options now?
I absolutely do not want to spin down the drives, because that would decrease their lifetime, but I think the current behavior will also decrease the lifetime very quickly...
What I don't understand is that I'm doing nothing special, nor did I change any default settings, neither in TrueNAS nor on the drives. I'm just using enterprise disks with enterprise hardware, without any changes or modifications.

At the moment I only see two options here:

- Disable EPC/idle_b completely for the drives (but is it usual to need to do this?)
- Disable S.M.A.R.T., but then I can't do regular disk health checks (this should also not be the standard, right?)

Do you have any suggestions? Is this a bug (TrueNAS or HDD), or am I doing something wrong here? I searched a lot but couldn't find anything on this problem :/

I would prefer to have the drives go to idle_b when they are not used, but I would also be OK with having them stay in idle_a.
However, I don't know if staying in idle_a all the time is good for the drives, as the manufacturer's default has idle_b enabled...

If you need any further info, just let me know! I'm happy for any suggestions or explanations.

Samuel Tai · Feb 1, 2021

Having the drives operate steady state, with the heads flying, is best for drive longevity. Continuously unloading and reloading the heads does save a negligible amount of power at an increased risk of a head crash. To disable idle_b, you'll need to use

camcontrol epc /dev/<device ID> -c state -d -p Idle_b -s

However, you may want to see the actual state of EPC to see if it matches the documentation before disabling anything:

camcontrol epc /dev/<device ID> -c status

scrubbertux · Feb 1, 2021

Hi Samuel, first of all thanks for taking the time!

Samuel Tai said:
Having the drives operate steady state, with the heads flying, is best for drive longevity. Continuously unloading and reloading the heads does save a negligible amount of power at an increased risk of a head crash. To disable idle_b, you'll need to use

camcontrol epc /dev/<device ID> -c state -d -p Idle_b -s

OK, I wasn't sure whether keeping the heads flying is good or not. Thanks for clarifying this! Then I will go ahead and disable idle_b.
But shouldn't the HDD be usable without any modifications to it? I mean, with the current behavior it is a little bit on the Highway to Hell in this setup, isn't it?

Samuel Tai said:
However, you may want to see the actual state of EPC to see if it matches the documentation before disabling anything:
camcontrol epc /dev/<device ID> -c status

Yeah, I did this a lot over the last few hours, and it behaves exactly as stated in the documentation. It just toggles between idle_a and idle_b all the time.

Code:

camcontrol epc /dev/ada0 -c status
APM: NOT Supported, NOT Enabled
EPC: Supported, Enabled
Low Power Standby NOT Supported
Set EPC Power Source NOT Supported
Current power state: Idle_b(0x82)
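
For anyone who wants to log the state over time rather than eyeball it, the status output above is easy to parse. The following is a rough sketch only: the function name current_power_state is hypothetical, and the sample text is simply the output pasted above. Other drives or camcontrol versions may format the line differently.

```python
import re
from typing import Optional

# Sample text copied from the camcontrol output shown in this post.
SAMPLE = """\
APM: NOT Supported, NOT Enabled
EPC: Supported, Enabled
Low Power Standby NOT Supported
Set EPC Power Source NOT Supported
Current power state: Idle_b(0x82)
"""


def current_power_state(status_text: str) -> Optional[str]:
    """Extract the power state name (e.g. 'Idle_b') from the status text,
    or return None if no 'Current power state' line is present."""
    match = re.search(
        r"^Current power state:\s*(\w+)\(0x[0-9A-Fa-f]+\)",
        status_text,
        re.MULTILINE,
    )
    return match.group(1) if match else None


print(current_power_state(SAMPLE))  # Idle_b
```

In practice the text would come from running the camcontrol command above, for example via subprocess.run, once per polling interval.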

Samuel Tai · Feb 1, 2021

scrubbertux said:
But shouldn't the HDD be usable without any modifications to it? I mean, with the current behavior it is a little bit on the Highway to Hell in this setup, isn't it?

You should remember the Exos drives are intended for data center applications, where it would be very rare for the drive to be idle >2 minutes so that Idle_b would kick in. In your home NAS environment, there's just not enough disk activity to make Idle_b rare enough to match the intended installation. These disks are usable, at the cost of increased load cycles, as you've observed.

If you want to keep Idle_b, but adjust the timers, you could try:

camcontrol epc /dev/<device ID> -c timer -e -p Idle_b -T 300 -s

This bumps up the timer from 2 minutes to 5 minutes (300 seconds), or whatever is longer than your SMART update interval.
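
The reasoning behind "longer than your SMART update interval" can be illustrated with a deliberately simplified model. It assumes a single periodic poll is the only disk access, that every poll resets the idle timer, and that each poll arriving after the heads have unloaded costs exactly one load cycle. The function name unloads_per_day is made up for this sketch.

```python
def unloads_per_day(idle_b_timer_s: float, poll_interval_s: float) -> float:
    """Estimate head unloads per day under the simplified model:
    if the Idle_b timer is longer than the polling interval, the timer
    never expires; otherwise the heads unload once per poll."""
    if idle_b_timer_s > poll_interval_s:
        return 0.0
    return 86400 / poll_interval_s


print(unloads_per_day(120, 180))  # 480.0  (default 2 min timer, 3 min poll)
print(unloads_per_day(300, 180))  # 0.0    (5 min timer, 3 min poll)
```

Under this model the drive never reaches Idle_b once the timer exceeds the polling interval, so the load cycle count stops climbing, but the Idle_b power saving is lost as well.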

scrubbertux · Feb 2, 2021

Samuel Tai said:
You should remember the Exos drives are intended for data center applications, where it would be very rare for the drive to be idle >2 minutes so that Idle_b would kick in. In your home NAS environment, there's just not enough disk activity to make Idle_b rare enough to match the intended installation. These disks are usable, at the cost of increased load cycles, as you've observed.

Well, with another OS the drives go into idle_b and stay there perfectly well, as I observed on Arch, which I used to run the SeaChest Utilities.
As no process wakes up the drives there, they stay in idle_b. So the question is: which process in TrueNAS continuously wakes up the drives, and why?
Is it only the continuous smartctl temperature reading?

If I read the temperatures via SSH with:
smartctl -l scttempsts /dev/ada0 -n idle
the disks do not wake from idle_b.

Samuel Tai said:
If you want to keep Idle_b, but adjust the timers, you could try: camcontrol epc /dev/<device ID> -c timer -e -p Idle_b -T 300 -s. This bumps up the timer from 2 minutes to 5 minutes (300 seconds),

That would have no meaningful effect. It would only reduce the load_cycle_count increase a little bit. The main problem is that something (SMART) continuously wakes up the drive for no reason (at least no real disk access).

Samuel Tai said:
or whatever is longer than your SMART update interval.

What is MY SMART interval?! I never defined any SMART settings or intervals. These must be TrueNAS defaults. Where can I change them?

Nathan1980 · Oct 8, 2021

I have the same Seagate X16 drives and the same issue.
Only unchecking SMART on each disk keeps them in idle_b.

Disabling the SMART service does not make any difference.
So I'm bumping this thread to follow up on the last question in the previous post:

Something happens every 5 minutes that wakes up the drives, and it is not SMART temperature monitoring.
Any idea what it could be?

As for my reason for keeping idle_b:
In my 24-bay NAS, the power difference between idle_a and idle_b is 40 W.
That's up to 105 € of electricity per year.

It would be great if there were a way to keep SMART activated but avoid this 5-minute wake-up.
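
The yearly figure can be reproduced as follows. This is a sketch under stated assumptions: the function name yearly_cost_eur is made up, the drives are assumed to sit in the lower-power state all year, and the price of 0.30 € per kWh is an assumption inferred from the "up to 105 €" figure, not something stated in the post.

```python
def yearly_cost_eur(power_saving_w: float, price_eur_per_kwh: float) -> float:
    """Cost of a constant power draw over one year (365 days)."""
    kwh_per_year = power_saving_w * 24 * 365 / 1000
    return kwh_per_year * price_eur_per_kwh


# 40 W difference between idle_a and idle_b; 0.30 EUR/kWh is an assumed price.
print(round(yearly_cost_eur(40, 0.30), 2))  # 105.12
```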

Alecmascot · Oct 8, 2021

There have been many threads about this "waking up", mainly from those who want to sleep their drives.
I think a forum search is needed.

Nathan1980 · Oct 8, 2021

Sorry Alecmascot, normally you may be right with this comment. But I searched for days on this topic. People ask over and over again, but there is no answer in any of these threads.

Most threads say "don't spin it down". Agreed, I don't.
But I want to use idle_b parking to save 100 € a year.
And because I need the server only once a month, the 12 parking cycles a year are not relevant for lifetime.

Please point me to where to search if you know a thread with a technical answer as to why the drives get woken up every 5 minutes.

Everything else from those threads is done, including unplugging the network cable... It's something internal, related to SMART but not the temperature readout.

Alecmascot · Oct 8, 2021

Nathan1980 said:
And because I need the server only 1 time a Month

Why are you not powering the server down?

Nathan1980 · Oct 8, 2021

Alecmascot said:
why are you not powering the server down

That's unfortunately how all of those threads continue.
Instead of tackling the issue, they all drift into a discussion about the use case.
For this thread, please accept the answer: "it is how it is".
I can tell you all the details in a PM if you like, or in another thread, but I don't want this hundredth thread to go off topic.
In short: I don't know when I will need it, but then it must be available within 1 second or so. And it's not at my physical location.

So I want to get back to the all-decisive question: what is TrueNAS doing every 5 minutes that wakes up the drives? And how can it be disabled?

What we know:
- it's somehow related to SMART
- but it is not the SMART daemon itself
- and it is not the temperature monitoring (or the temperature monitoring has a bug)

Samuel Tai · Oct 8, 2021

You're probably seeing collectd polling to gather the drive IO and temperature data for the GUI reporting feature. So far as I know, it's not possible to disable this, although you could experiment with modifying /etc/local/collectd.conf:

Code:

<Plugin "disk">
        Disk "/^gptid/"
        Disk "/^md/"
        Disk "/^pass/"
        # add disks you want to exempt from collectd monitoring
        Disk "ada0"
        IgnoreSelected true
</Plugin>

I'm not sure this will stick, though, as I believe the middleware will revert the collectd.conf back to the original version every so often.

Nathan1980 · Oct 8, 2021

Good hint, thank you.
I think disabling "SMART" on a drive adds the disk to this list.
Drive I/O monitoring works with SMART disabled and without waking the drives up, so I think it's also not drive I/O.

Is there a list of what collectd does on each run and on each disk? Is there any chance to enable or disable individual features?

Samuel Tai · Oct 9, 2021

Look under Reporting->Disks to see what collectd polls per disk. Unfortunately, this can't be customized, as this is integral to the collectd Disks plugin.

Nathan1980 · Oct 10, 2021

As collectd is open source, I will disable it feature by feature to see what causes the wake-up.

diskdiddler · Jun 29, 2022

I have just ordered 8 of these drives and now I'm terrified.

What is the best course of action to ensure they don't die?
My NAS is regularly busy but it's not busy every 2 minutes for certain.

Will switching between Idle_a and Idle_b all the time genuinely cause issues?

chruk · Jul 5, 2022

This affected me, and it's kind of annoying that I noticed so late.

Two IronWolf drives under a year old, with a load cycle count of nearly 80k. That's roughly a 12:1 ratio to my power-on hours.

I have changed the timer from the default of 2 minutes to 2 hours. I'm considering disabling it entirely, which I might still do, but I expect anything over 15 minutes would be reasonable.

Another interesting note: SMART can be queried on my WD Red drives without affecting whatever idle state they are in, including spun down, whilst my IronWolf drives are pushed into the fully loaded state (heads unparked) by any SMART query other than a query to check the idle state.

collectd probably should be customizable.

Setting the power mode in the SMART service to standby will fix this problem, at the cost of no monitoring whilst the drive is in an idle state. That setting is global; you can do the same thing per drive instead by adding '-n standby' as custom SMART flags in the drive's settings.
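
As a rough cross-check of the numbers in this post, the 12:1 ratio can be turned into power-on hours and an average gap between load cycles. This is only a sketch with rounded inputs ("nearly 80k", "roughly 12:1"); the function name cycle_stats is made up.

```python
def cycle_stats(load_cycles: int, cycles_per_hour: float) -> tuple[float, float]:
    """Return (implied power-on hours, average minutes between load cycles)."""
    power_on_hours = load_cycles / cycles_per_hour
    minutes_between_cycles = 60 / cycles_per_hour
    return power_on_hours, minutes_between_cycles


hours, gap_min = cycle_stats(80_000, 12)
print(round(hours), gap_min)  # 6667 5.0
```

About 6,667 power-on hours is consistent with drives under a year old, and one load cycle every 5 minutes on average lines up with the 5-minute wake-up reported earlier in the thread.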

diskdiddler · Jul 5, 2022

chruk said:
This affected me, and it's kind of annoying that I noticed so late.

Two IronWolf drives under a year old, with a load cycle count of nearly 80k. That's roughly a 12:1 ratio to my power-on hours.

I have changed the timer from the default of 2 minutes to 2 hours. I'm considering disabling it entirely, which I might still do, but I expect anything over 15 minutes would be reasonable.

Another interesting note: SMART can be queried on my WD Red drives without affecting whatever idle state they are in, including spun down, whilst my IronWolf drives are pushed into the fully loaded state (heads unparked) by any SMART query other than a query to check the idle state.

collectd probably should be customizable.

Setting the power mode in the SMART service to standby will fix this problem, at the cost of no monitoring whilst the drive is in an idle state. That setting is global; you can do the same thing per drive instead by adding '-n standby' as custom SMART flags in the drive's settings.

Did you successfully correct the issue with the Seagate tools?

Mine should arrive in the coming week. I intend to test them extremely thoroughly before putting them in to action.

chruk · Jul 5, 2022

diskdiddler said:
Did you successfully correct the issue with the Seagate tools?

Mine should arrive in the coming week. I intend to test them extremely thoroughly before putting them in to action.

The tools report that the setting has changed; however, I don't know if I need to power cycle. I only made the change in the past hour, so I made a note of the existing power-on hours and LCC counter and plan to check how much it goes up as the week goes on.
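
The before/after comparison described here could be sketched like this. The Snapshot class, the function name, and all numbers are hypothetical placeholders for illustration, not measurements from this thread.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """SMART counters noted at one point in time."""
    power_on_hours: int
    load_cycle_count: int


def cycles_per_hour(before: Snapshot, after: Snapshot) -> float:
    """Average load cycles per power-on hour between two snapshots."""
    hours = after.power_on_hours - before.power_on_hours
    if hours <= 0:
        raise ValueError("the second snapshot must have more power-on hours")
    return (after.load_cycle_count - before.load_cycle_count) / hours


# Hypothetical values: one week (168 h) apart, 10 additional load cycles.
before = Snapshot(power_on_hours=6667, load_cycle_count=80_000)
after = Snapshot(power_on_hours=6835, load_cycle_count=80_010)
print(cycles_per_hour(before, after))
```

A rate well below the earlier 12 cycles per hour would suggest the new timer has taken effect.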

diskdiddler · Jul 5, 2022

chruk said:
The tools report that the setting has changed; however, I don't know if I need to power cycle. I only made the change in the past hour, so I made a note of the existing power-on hours and LCC counter and plan to check how much it goes up as the week goes on.

Thanks, well let me know how you go with this.

chruk · Jul 5, 2022

diskdiddler said:
Thanks, well let me know how you go with this.

Sure, I will report back in a week, but for now I have checked with camcontrol a few times and it has not left the Idle_a state.
