Drives being removed and then cleared.

zerothaught

Dabbler
Joined
Jan 1, 2024
Messages
13
Hello Everyone,

I am quite new to TrueNAS, so excuse me if this is a dumb question, but I couldn't seem to find any existing threads that covered what I am experiencing. I am currently running TrueNAS Scale on a Storinator Q30.

OS Version: TrueNAS-SCALE-22.12.3.2
Product: Q30
Model: Intel(R) Xeon(R) Bronze 3204 CPU @ 1.90GHz
Memory: 125 GiB
3 x RAIDZ2 | 10 wide | 16.37 TiB
Drive Model: ST18000NM000J-2TV103 x30

I woke up today to the following alerts from my TrueNAS system:

• Pool Storinator state is DEGRADED: One or more devices has been removed by the administrator. Sufficient replicas exist for the pool to continue functioning in a degraded state.
The following devices are not healthy:

o Disk ST18000NM000J-2TV103 ZR5D35C0 is REMOVED
o Disk ST18000NM000J-2TV103 ZR5D4FD9 is REMOVED

About a minute after the alert, I received another email saying that the alert was cleared.

A minute after that, I got another alert for one of the same disks:

• Pool Storinator state is DEGRADED: One or more devices has been removed by the administrator. Sufficient replicas exist for the pool to continue functioning in a degraded state.
The following devices are not healthy:
o Disk ST18000NM000J-2TV103 ZR5D35C0 is REMOVED

Then finally I got this email:

• Pool Storinator state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.

Looking at the dashboard, I see 0 disks with errors and the pool shows as ONLINE.

I did a bit more digging, ran dmesg, and saw the following errors:

Disk failure on sdo1, disabling device
md/raid1:md122: Operation continuing on 1 devices
blk_update_request: I/O error, dev sdo, sector 19604053776 op 0x11:(WRITE) flags 0x700 phys_seg 15 prio class 0

blk_update_request: I/O error, dev sde, sector 1920752848 op 0x0: (READ) flags 0x0 phys_seg 1 prio class 0
I see multiple of these for different sectors, all on sde.

Buffer I/O error on dev sde2, logical block 4097, async page read
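For reference, I pulled those lines with something along the lines of (sde/sdo are just the devices named in my alerts):
sudo dmesg -T | grep -iE 'i/o error|blk_update_request|sd[eo]'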

I ran sudo smartctl -a /dev/sdo1 and I'm not seeing a non-zero Reallocated Sector Count or any CRC error count. I'm also not seeing any failures in the SMART self-test log for that drive.
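In case it's useful, I'm checking those specific attributes with something like:
sudo smartctl -A /dev/sdo | grep -iE 'reallocated|pending|crc'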

I ran a scrub on the pool, and after it completed I got an alert saying that a resilver was in progress and that 18 MB had been copied.
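I've also been keeping an eye on it from the shell with something like:
sudo zpool status -v Storinator
and it doesn't list any read/write/checksum errors against the individual disks.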

To my untrained eye it seems as if a disk is failing or there is a hardware problem, but in that case I would expect to see SMART errors on the disk. Are there any other commands I should run, or should I contact the manufacturer?
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
One thing that can happen is that a disk with power-save options enabled, like spin down, causes ZFS to think the drive has failed. The access that ZFS made, and which timed out, then causes the drive to spin back up. ZFS sees the drive as good again and resilvers the difference between when it stopped responding and when it became available again.
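If you want to see whether a drive is actually spun down at a given moment, something like this should report active/idle versus standby (hdparm should be available on SCALE; sde is just an example):
hdparm -C /dev/sde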

I don't know all the troubleshooting steps for that kind of issue, but there are 3 drive features that can be problematic for ZFS:
  • Automatic spin down when the drive thinks it is idle
  • Parking the heads when idle
  • TLER (Time Limited Error Recovery); Seagate calls it ERC (Error Recovery Control). Basically, NASes want to do the error recovery themselves, so TLER/ERC is limited to about 7 seconds. On the other hand, non-redundant desktop configurations want the disk to go to extremes to recover a bad block. That can take over a minute, which ZFS can interpret as a disk failure.
The last does not seem to apply to your case.

Look at the spin down & up counts, as well as the head park counts. They should be roughly similar to the power cycle count.
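For example, something like this pulls the relevant counters from one drive (substitute each device in turn):
smartctl -A /dev/sde | grep -iE 'start_stop|load_cycle|power-off_retract|power_cycle'
Start_Stop_Count, Load_Cycle_Count and Power-Off_Retract_Count should all be in the same ballpark as Power_Cycle_Count if the drive is not spinning down or parking on its own.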

Sometimes a disk comes from the manufacturer with non-default configurations, different from other drives in the same model line. Or the manufacturer changed the defaults, and you got some disks with one configuration and others with a different one.
 

zerothaught

Dabbler
Joined
Jan 1, 2024
Messages
13
Thanks for your reply, Arwen. I've looked at the results (although I'm not entirely sure where to see head park counts), and the Start_Stop_Count for sde and sdo is in line with the Power-Off_Retract_Count, which is 8. So it doesn't look like the drives are going to sleep. I've also checked the disks, and both have HDD Standby set to Always On and Advanced Power Management disabled.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Looks good so far.

The head parking is labeled as below:
225 Load_Cycle_Count 0x0032 096 096 000 Old_age Always - 48512
Ignore that mine is high; that happened in the first few months. Once I noticed, I disabled head parking.

You might check your data cables for the 2 drives. Heat and vibration could wiggle them loose for a few seconds.
 

zerothaught

Dabbler
Joined
Jan 1, 2024
Messages
13
My Load_Cycle_Count is 391. You are saying it should be close to my start_stop_count?

Thanks
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
My Load_Cycle_Count is 391. You are saying it should be close to my start_stop_count?

Thanks
Yes, if I am understanding all the drive details correctly.

Now check the other drives that are NOT giving you problems. See if they are noticeably different.
 

zerothaught

Dabbler
Joined
Jan 1, 2024
Messages
13
It looks like they all have ~394 for the Load_Cycle_Count.

Drive that has never alerted:
[screenshot of SMART attributes]


Here is the drive that threw the error (even though SMART shows none):
[screenshot of SMART attributes]


Thank you for your help so far.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Hmm, then I don't know.

As a test, you could try disabling power save & head parking, and see if the problem repeats. It's a long shot and understandable if you don't want to try.

First, see what the values are now. This is from my miniature PC:
root@media_server:~# hdparm -B -M /dev/sda
/dev/sda:
 APM_level = 254
 acoustic  = 254 (128=quiet ... 254=fast)

Next, you can change them, substituting the proper disk as desired:
hdparm -B 254 -M 254 /dev/sde

See the manual page for hdparm on the parameter details.

Note that this does not make the change permanent. A drive reset or power cycle will almost certainly restore the old value. There is a way to make it permanent; I just don't remember it. See the hdparm manual page...
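One way I'd expect to make it stick on SCALE, untested on my end so treat it as a rough sketch, is a post-init command that loops over the data disks and re-applies the setting, something like:
for d in $(lsblk -dno NAME,TYPE | awk '$2=="disk" && $1 ~ /^sd/ {print $1}'); do hdparm -B 254 /dev/$d; done
added under System Settings > Advanced > Init/Shutdown Scripts (adjust the -B / -M values to whatever you settle on).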
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Yes, that linked article seems like it will solve the problem, though only for TrueNAS Core. I am sure there is a way to do it with SCALE (i.e. under Linux), but I don't know it.

You could make another install media pair (install ISO & temporary boot device) for TrueNAS Core, then run the command listed in the article on all the drives. Shut down, re-install your SCALE boot device, and power up. That is something I would do simply because it is straightforward, even if it takes a bit more time than figuring out the Linux equivalent...
 

zerothaught

Dabbler
Joined
Jan 1, 2024
Messages
13
I'm thinking it might just be something not to worry about. I'm not seeing any errors when I run zpool status or when I look at the SMART information. It seems to happen when the scrub kicks off, and then it resilvers a few MB. I would assume I'd see errors listed against the disk somewhere, but like you said, it might just be that a head is parked and the drive takes a moment to spin up, and TrueNAS doesn't like that.
 