pool degraded - HD204UI - has been removed by the administrator - HDD compatibility list?

flashdrive

Patron
Joined
Apr 2, 2021
Messages
264
Hello,

all too often, the ZFS data pool shows this message:


CRITICAL
Pool 123 state is DEGRADED: One or more devices has been removed by the administrator. Sufficient replicas exist for the pool to continue functioning in a degraded state.
The following devices are not healthy:
• Disk SAMSUNG HD204UI S2... is REMOVED

The workaround to get it working again is to shut down the host, re-plug the power connectors of the HDDs in question, and restart the host.

It seems to happen when the TN system is under some load, e.g. 250 MByte/sec file transfers during backups.


1. Are my "HD204UI" HDDs not up to the task anymore in terms of vibration? I recycled them from an old system with 4x HD204UI. The new system has 6 HDDs altogether in a tower. I never had this kind of error when I was running those 4 in the old Solaris system.

2. My HDDs are only "desktop grade": is there an HDD compatibility list for TrueNAS? WD Red Plus or the like?

3. When will TN Core automatically "remove" those drives during operation?

4. I have enabled and disabled SMART both in the EFI and in TN Core (Services and per-HDD settings), to no effect.

I do not think that this is a power supply issue.

I have also swapped the SATA cables and power connectors of the HDDs in question.

I am at a loss here...
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The HD204UI is a decade-old HDD design, which suggests that your drives may be 8 to 10 years old and may have far more than 50K hours on them. When older drives develop read errors or start to run into problems, some of these failures present as the drive locking up or otherwise "vanishing" from the system. That could be a firmware bug, and might be fixable with a drive firmware update, but more often it doesn't matter, since the drive is likely on the edge of catastrophic failure anyway.

Are my "HD204UI" HDDs not up to the task anymore in terms of vibration?

I suspect age. Check with smartctl to find out how many hours are on them.
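
For example, something along these lines (a rough sketch, assuming the disks enumerate as ada0, ada1, and so on; adjust the glob to your system):

for d in /dev/ada?; do
  echo -n "$d: "
  smartctl -A "$d" | awk '/Power_On_Hours/ {print $10 " hours"}'
done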

When will TN Core automatically "remove" those drives during operation?

NEVER EVER. That is a no-no for ZFS storage systems. If a drive falls off, it should be because the drive itself faulted into such a mode. One of the reasons we're very picky about HBAs instead of RAID controllers is that RAID controllers sometimes DO kick out drives that are marginal, which is really bad for ZFS; HBAs will continue to let the drive be seen by the OS. If the drive has locked up or freaked out, that's not possible, of course.

My HDDs are only "desktop grade": is there an HDD compatibility list for TrueNAS? WD Red Plus or the like?

Avoid SMR drives; a list is available in the resources section. It may also be worth avoiding drives faster than 5900 RPM, as they tend to generate more heat and wear without adding appreciable performance benefits for ZFS in many cases. Drive compatibility issues have largely vanished in the last ten years, as SATA 6Gbps has matured and the number of HDD manufacturers has dwindled to just a few. They likely use similar firmware stacks across all their drives, so the old tropes like "avoid the Seagate Cheetah STblabla" or "WD Blacks suck" are no longer significant factors.
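
If you're not sure what a given drive reports, smartctl shows the model and spindle speed (a sketch; substitute your device name):

smartctl -i /dev/ada0 | grep -E 'Device Model|Rotation Rate'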
 

flashdrive

Patron
Joined
Apr 2, 2021
Messages
264
Hello @jgreco

thank you for the fast reply.

The drives have not been running for the last 10 years :smile:

logged hours:
3352


I will have to have a look into the SMART data.


The HDDs are directly connected to the mainboard. No HBA.


FreeBSD 12.2-RELEASE-p11 75566f060d4(HEAD) TRUENAS

TrueNAS (c) 2009-2021, iXsystems, Inc.
All rights reserved.
TrueNAS code is released under the modified BSD license with some
files copyrighted by (c) iXsystems, Inc.

For more information, documentation, help or support, go here:
http://truenas.com
Welcome to TrueNAS

Warning: settings changed through the CLI are not written to
the configuration database and will be reset on reboot.

s[~]# smartctl -a /dev/ada0
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p11 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: SAMSUNG SpinPoint F4 EG (AF)
Device Model: SAMSUNG HD204UI
Serial Number: S2H7...
LU WWN Device Id: 5 0024e9 203fefb25
Firmware Version: 1AQ10001
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 5400 rpm
Form Factor: 3.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS T13/1699-D revision 6
SATA Version is: SATA 2.6, 3.0 Gb/s
Local Time is: Sun Jan 9 19:31:01 2022 CET

==> WARNING: Using smartmontools or hdparm with this
drive may result in data loss due to a firmware bug.
****** THIS DRIVE MAY OR MAY NOT BE AFFECTED! ******
Buggy and fixed firmware report same version number!
See the following web pages for details:

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (21240) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 354) minutes.
SCT capabilities: (0x003f) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 051 Pre-fail Always - 12
2 Throughput_Performance 0x0026 252 252 000 Old_age Always - 0
3 Spin_Up_Time 0x0023 067 044 025 Pre-fail Always - 10215
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 829
5 Reallocated_Sector_Ct 0x0033 252 252 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 252 252 051 Old_age Always - 0
8 Seek_Time_Performance 0x0024 252 252 015 Old_age Offline - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 3655
10 Spin_Retry_Count 0x0032 252 252 051 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 1
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 614
181 Program_Fail_Cnt_Total 0x0022 095 095 000 Old_age Always - 121186104
191 G-Sense_Error_Rate 0x0022 100 100 000 Old_age Always - 3016
192 Power-Off_Retract_Count 0x0022 252 252 000 Old_age Always - 0
194 Temperature_Celsius 0x0002 064 063 000 Old_age Always - 30 (Min/Max 13/47)
195 Hardware_ECC_Recovered 0x003a 100 100 000 Old_age Always - 0
196 Reallocated_Event_Count 0x0032 252 252 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 252 252 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 252 252 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0036 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x002a 100 100 000 Old_age Always - 1103
223 Load_Retry_Count 0x0032 100 100 000 Old_age Always - 1
225 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 851

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 3352 -

SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Completed [00% left] (0-65535)
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
==> WARNING: Using smartmontools or hdparm with this
drive may result in data loss due to a firmware bug.
****** THIS DRIVE MAY OR MAY NOT BE AFFECTED! ******
This may be worth investigating... It could explain what is happening to you.
11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 1
181 Program_Fail_Cnt_Total 0x0022 095 095 000 Old_age Always - 121186104
191 G-Sense_Error_Rate 0x0022 100 100 000 Old_age Always - 3016
200 Multi_Zone_Error_Rate 0x002a 100 100 000 Old_age Always - 1103
223 Load_Retry_Count 0x0032 100 100 000 Old_age Always - 1
None of these are good signs, but some of them may be just weird ways of reporting data.
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 3352 -
Definitely run a long test. It'll be interesting to see what the results are.
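
If you'd rather kick it off from the shell than the GUI, something like this should do it (a sketch; repeat for each device):

smartctl -t long /dev/ada0      # start the extended self-test (~354 minutes per your output above)
smartctl -l selftest /dev/ada0  # check progress and results afterwards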
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
One thing that can affect ZFS pools is disk drives without time-limited error recovery (aka TLER). Seagate calls it something else, but it's the same concept.

TLER tells the disk not to go to extremes during bad-block recovery. NAS drives generally use 7 seconds, but desktop and laptop drives typically default to something over 60 seconds to attempt recovery of a bad block. During this recovery, the disk can be totally unresponsive to any host requests, which can cause the OS or ZFS to give up on the drive.
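
Your smartctl output above shows "SCT Error Recovery Control supported", so you may be able to query and set the timeout (a sketch; values are in tenths of a second, and on many drives the setting does not survive a power cycle):

smartctl -l scterc /dev/ada0        # show the current read/write recovery timeouts
smartctl -l scterc,70,70 /dev/ada0  # set both to 7.0 seconds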

With proper redundancy (mirror, RAID-Zx, or "copies=2/3"), finding a bad block is nothing to worry about. ZFS will fetch the redundant copy, supply it to the user request, and re-write the failing/failed block with good data. SATA drives automatically do sector sparing when writing to a bad block.
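
ZFS keeps count of this in the pool status output, so you can watch for it there (a sketch; from your alert the pool appears to be named 123):

zpool status -v 123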

With your disks being an older model, AND desktop type, this could be your problem.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
What PSU do you have in that box? Older HDDs can be a bit power-hungry, and you may not have enough power.
 

flashdrive

Patron
Joined
Apr 2, 2021
Messages
264
@Ericloewe
==> WARNING: Using smartmontools or hdparm with this
drive may result in data loss due to a firmware bug.
****** THIS DRIVE MAY OR MAY NOT BE AFFECTED! ******


I was aware of this when I bought the drives and made sure that the new firmware was installed.

The un-fun part of this was that Samsung decided not to change the version number of the new firmware...
 

flashdrive

Patron
Joined
Apr 2, 2021
Messages
264
Are there any log files available in TrueNAS Core to have a look into this? I want to know what triggers the event that makes the drives disappear.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
time limited error recovery, (aka TLER).


Are there any log files available in TrueNAS Core to have a look into this?

Well, yes, the kernel should indicate what led to the device being dropped in the system log. That isn't necessarily going to tell you what's wrong, because the view from the system is likely "something went wrong"/"device went away", along the lines of what happens when you hot-swap a drive. There can be hints of problems leading up to it that could be interesting. These things can fail in so many ways that it's hard to guess what you might find.
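
On Core, the CAM layer logs these events, so something like this is where I'd start (a sketch; adjust the device names, and note that dmesg only holds the recent kernel buffer):

dmesg | grep -i ada
grep -iE 'ada[0-9]|cam' /var/log/messages | tail -n 50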
 

flashdrive

Patron
Joined
Apr 2, 2021
Messages
264
@NugentS
I have finally checked the PSU:

be quiet System Power 7 450W

During boot / startup I do not have issues with lost drives.
 

flashdrive

Patron
Joined
Apr 2, 2021
Messages
264
@Ericloewe

results from the 2 disks tested so far:

Test SMART "Long"

Device: /dev/ada0, 1 Currently unreadable (pending) sectors.

I will run the test for the other drives as well.
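
Something like this should show the relevant counters for all drives at once (a sketch; device names vary per system):

for d in /dev/ada?; do
  echo "== $d =="
  smartctl -A "$d" | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'
done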
 