SOLVED Unexpected HDD behaviour

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
@ChrisRJ The command you specified above, does that survive a power cycle or does it need to be applied each time?
On my system it did survive. But I forgot to reapply it to a replaced HDD, which is why the HDD with the lowest number of power-on hours (around 10k) also has by far the highest count of load cycles (around 100k).
It sounds like the "-s" would allow it to survive. I understand why you would desire a script if you were replacing hard drives often and I understand that sending the command multiple times has no ill effect.
That is also my understanding.
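
For a sense of scale, the rough figures quoted above (around 100k load cycles over around 10k power-on hours) can be turned into a rate. This is only a back-of-the-envelope sketch; the function name is made up for illustration and the inputs are the approximate numbers from this post, not exact SMART readings.

```python
def load_cycles_per_hour(load_cycles: int, power_on_hours: int) -> float:
    """Average number of head load cycles per power-on hour."""
    if power_on_hours <= 0:
        raise ValueError("power_on_hours must be positive")
    return load_cycles / power_on_hours


# Approximate figures from the post: ~100k load cycles in ~10k hours.
rate = load_cycles_per_hour(100_000, 10_000)
print(f"{rate:.1f} load cycles per power-on hour")
```

That works out to roughly one head park every six minutes on average, which is why the replaced drive stands out.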
 

Nogtail

Cadet
Joined
Dec 29, 2022
Messages
9
I ended up using a similar command to what @ChrisRJ posted on my drives and the problem appears fixed. Almost 24 hours later the load cycle count hasn't increased at all. Is there any reason you use the -s option if you run the command on boot?

When I finally got around to sharing a dataset and copying a significant amount of data I had a couple of the drives fault in response to persistent write errors. The logs are full of CAM status: Uncorrectable parity/CRC error entries for almost all drives. I'm in the process of running a SMART long test on all drives - you're right, it estimates it'll take around 12 hours. SMART reports a high UDMA_CRC_Error_Count for a lot of the drives so I'm hoping it's just bad SATA cables. I've ordered some replacements but at this time of year they're probably going to be sitting in the post for a while.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
The UDMA_CRC_Error_Count is most likely the data cables. I would shut down the system and reseat or replace the drive cables in question. Power back on and see how it goes. Remember, these errors will be permanently recorded on the drive; you can't return them to a zero value. If we could, I'd do that to one drive that has 15 errors all because of a data cable. It drives me nuts.
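
Since the counter can't be reset, one way to tell whether a cable swap helped is to compare the raw value before and after. Below is a minimal sketch that pulls a named attribute out of saved `smartctl -A` text and reports the difference. The function names and the sample text are made up for illustration (the value 15 just mirrors the drive mentioned above), and it assumes the usual ten-column attribute table layout.

```python
import re
from typing import Optional


def smart_raw_value(smartctl_output: str, attribute: str) -> Optional[int]:
    """Return the raw value of a named attribute from `smartctl -A` text, or None."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows: ID, name, flag, value, worst, thresh, type,
        # updated, when_failed, raw value (10 columns or more).
        if len(fields) >= 10 and fields[1] == attribute:
            match = re.match(r"\d+", fields[9])
            return int(match.group()) if match else None
    return None


def new_errors(before: str, after: str,
               attribute: str = "UDMA_CRC_Error_Count") -> Optional[int]:
    """Difference in the raw value between two saved outputs, or None if missing."""
    old = smart_raw_value(before, attribute)
    new = smart_raw_value(after, attribute)
    if old is None or new is None:
        return None
    return new - old


# Illustrative sample text only, not output from a real drive in this thread.
HEADER = ("ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH "
          "TYPE      UPDATED  WHEN_FAILED RAW_VALUE\n")
SAMPLE_BEFORE = HEADER + (
    "199 UDMA_CRC_Error_Count    0x003e   200   200   000    "
    "Old_age   Always       -       15\n")
SAMPLE_AFTER = SAMPLE_BEFORE  # unchanged count after the cable swap

print(new_errors(SAMPLE_BEFORE, SAMPLE_AFTER))
```

If the difference stays at zero after some heavy reads and writes, the old count is just history and the cable was the likely culprit.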

I ended up using a similar command
I'm curious what the command was. Did you just increase the timer amount from 2 minutes to 6 minutes, or did you disable the Idle_b timer as suggested? Why do I ask? Well, to help others who run across the same problem.
 

Nogtail

Cadet
Joined
Dec 29, 2022
Messages
9
The UDMA_CRC_Error_Count is most likely the data cables. I would shut down the system and reseat or replace the drive cables in question. Power back on and see how it goes. Remember, these errors will be permanently recorded on the drive; you can't return them to a zero value. If we could, I'd do that to one drive that has 15 errors all because of a data cable. It drives me nuts.
I think I'm just going to wait for some new cables. If it was just a single drive with issues I would try to reseat the cable, but the fact that 4 of my 8 drives show errors makes me think I might have got some dodgy cables. I'm just hoping it isn't a bad controller. Two of the drives are only showing issues for writes, one just has problems with reads, and the other one has problems with both. As SATA uses a separate pair for transmit and receive, it makes me think it's probably a connection issue. My OCD is going to absolutely hate having a non-zero CRC_Error_Count though.

I'm curious what the command was. Did you just increase the timer amount from 2 minutes to 6 minutes, or did you disable the Idle_b timer as suggested? Why do I ask? Well, to help others who run across the same problem.
Unfortunately I don't remember exactly what I used and it's no longer in my command history. I believe it was functionally identical to what @ChrisRJ posted though. I opted to disable the state instead of adjusting the timer as I figured it would be easier to revert if I wanted to use the drives in a different setup, although it looks like camcontrol can reset a drive to defaults so there is probably no difference for a TrueNAS setup.
 

Nogtail

Cadet
Joined
Dec 29, 2022
Messages
9
I figured I'd follow up with what I ended up doing in case it helps anyone else:

The CRC errors were caused by the SATA cables. After checking the logs I had 5 of my 8 disks spitting out huge numbers of errors with any activity. I replaced all the SATA cables and haven't had a single error since. I don't think I'll be trusting StarTech SATA cables again.

There seems to be a bug where swap created on the boot drive is overwritten when creating a pool. I fixed this by reinstalling and setting "System>Advanced>Swap Size" to 0 before creating my first pool.

Disabling the idle_b state has fixed the load_cycle_count increasing. Ideally I'd like to let it park the heads when not used for an extended period but that doesn't seem possible with TrueNAS and Seagate drives.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Disabling the idle_b state has fixed the load_cycle_count increasing. Ideally I'd like to let it park the heads when not used for an extended period but that doesn't seem possible with TrueNAS and Seagate drives.
If you use the tools you have, you could increase the Idle_B timer to 6000 to keep the heads loaded for ten minutes; if there is no activity after ten minutes, they park. This is a good compromise if you really want the heads to park but still want to minimize how often they load and unload. But many of us prefer to leave the heads loaded. I'm just providing an option you previously had.
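
For anyone working out other timer values: the 6000 = ten minutes figure above implies the timer counts in units of 100 ms. A small sketch of the conversion, assuming that unit holds for your drive and tool (check the tool's documentation, since some tools take seconds or milliseconds instead); the function names are made up for illustration.

```python
# Assumption: the timer is counted in 100 ms units (10 units per second),
# as implied by "6000 = ten minutes" in the post above.
UNITS_PER_SECOND = 10


def timer_units_to_minutes(units: int) -> float:
    """Convert a timer value in 100 ms units to minutes."""
    return units / UNITS_PER_SECOND / 60


def minutes_to_timer_units(minutes: float) -> int:
    """Convert minutes to a timer value in 100 ms units."""
    return round(minutes * 60 * UNITS_PER_SECOND)


print(timer_units_to_minutes(6000))  # the ten-minute value from the post
print(minutes_to_timer_units(2))     # the 2-minute timer mentioned earlier
print(minutes_to_timer_units(6))     # the 6-minute timer mentioned earlier
```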

I don't think I'll be trusting StarTech SATA cables again.
There are a lot of poorly made cables out there. It's always best to buy some good quality locking SATA data cables. But while I do not understand it, SATA data cables do go bad. It's happened to me and I had a difficult time believing it.

Glad you have all the problems corrected. I hope you are problem free for a while now. And thanks for the follow-up.
 