pool degraded - HD204UI - has been removed by the administrator - HDD compatibility list?

flashdrive

Patron
Joined
Apr 2, 2021
Messages
264
Hello,

all too often, the ZFS data pool shows this message:


CRITICAL
Pool 123 state is DEGRADED: One or more devices has been removed by the administrator. Sufficient replicas exist for the pool to continue functioning in a degraded state.
The following devices are not healthy:
• Disk SAMSUNG HD204UI S2... is REMOVED

The workaround to get it working again is to shut down the host, re-plug the power connectors of the HDDs in question, and restart the host.

It seems to happen when the TN system is under some load, e.g. 250 MByte/sec file transfers during backups.


1. Are my "HD204UI" HDDs not up to the task anymore in terms of vibration? I recycled them from an old system with 4x HD204UI. The new system has 6 HDDs altogether in a tower. I never had this kind of error when I was running those 4 in the old Solaris system.

2. My HDDs are only "desktop grade": is there an HDD compatibility list for TrueNAS? WD Red Plus or the like?

3. When will TN Core automatically "remove" those drives during operation?

4. I have enabled and disabled SMART both in the EFI and in TN Core (Services and per-HDD settings), to no effect.

I do not think that this is a power supply issue.

I have also swapped the SATA cables and power connectors of the HDDs in question.

I am at a loss here...
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The HD204UI is a decade-old HDD design, which suggests that your drives may be 8 to 10 years old and may have far more than 50K hours on them. When older drives develop read errors or start to run into problems, some of these failures present as the drive locking up or otherwise "vanishing" from the system. That could be a firmware bug, and might be fixable with a drive firmware update, but more often it doesn't matter, since the drive is likely on the edge of catastrophic failure anyway.

Are my "HD204UI" HDDs not up to the task anymore in terms of vibration?

I suspect age. Check with smartctl to find out how many hours are on them.
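
For example, something along these lines (a rough sketch, assuming the disks enumerate as ada0, ada1, and so on; adjust the glob to your system):

for d in /dev/ada?; do
  echo -n "$d: "
  smartctl -A "$d" | awk '/Power_On_Hours/ {print $10 " hours"}'
done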

When will TN Core automatically "remove" those drives during operation?

NEVER EVER. That is a no-no for ZFS storage systems. If a drive falls off, it should be because the drive itself faulted into such a mode. One of the reasons we're very picky about HBAs instead of RAID controllers is that RAID controllers sometimes DO kick out drives that are marginal, which is really bad for ZFS; HBAs will continue to let the drive be seen by the OS. If the drive has locked up or freaked out, that's not possible, of course.

My HDDs are only "desktop grade": is there an HDD compatibility list for TrueNAS? WD Red Plus or the like?

Avoid SMR drives; a list is available in the resources section. It may also be worth avoiding drives faster than 5900 RPM, as they tend to generate more heat and wear without adding appreciable performance benefits for ZFS in many cases. Drive compatibility issues have largely vanished in the last ten years, as SATA 6Gbps has matured and the number of HDD manufacturers has dwindled to just a few. They likely use similar firmware stacks across all their drives, so the old tropes like "avoid the Seagate Cheetah STblabla" or "WD Blacks suck" are no longer significant factors.
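
If you're not sure what a given drive reports, smartctl shows the model and spindle speed (a sketch; substitute your device name):

smartctl -i /dev/ada0 | grep -E 'Device Model|Rotation Rate'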
 

flashdrive

Patron
Joined
Apr 2, 2021
Messages
264
Hello @jgreco

thank you for the fast reply.

The drives have not been running for the last 10 years :smile:

logged hours:
3352


I will have to have a look into the SMART data.


The HDDs are directly connected to the mainboard. No HBA.


FreeBSD 12.2-RELEASE-p11 75566f060d4(HEAD) TRUENAS

TrueNAS (c) 2009-2021, iXsystems, Inc.
All rights reserved.
TrueNAS code is released under the modified BSD license with some
files copyrighted by (c) iXsystems, Inc.

For more information, documentation, help or support, go here:
http://truenas.com
Welcome to TrueNAS

Warning: settings changed through the CLI are not written to
the configuration database and will be reset on reboot.

s[~]# smartctl -a /dev/ada0
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p11 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: SAMSUNG SpinPoint F4 EG (AF)
Device Model: SAMSUNG HD204UI
Serial Number: S2H7...
LU WWN Device Id: 5 0024e9 203fefb25
Firmware Version: 1AQ10001
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 5400 rpm
Form Factor: 3.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS T13/1699-D revision 6
SATA Version is: SATA 2.6, 3.0 Gb/s
Local Time is: Sun Jan 9 19:31:01 2022 CET

==> WARNING: Using smartmontools or hdparm with this
drive may result in data loss due to a firmware bug.
****** THIS DRIVE MAY OR MAY NOT BE AFFECTED! ******
Buggy and fixed firmware report same version number!
See the following web pages for details:

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (21240) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 354) minutes.
SCT capabilities: (0x003f) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 051 Pre-fail Always - 12
2 Throughput_Performance 0x0026 252 252 000 Old_age Always - 0
3 Spin_Up_Time 0x0023 067 044 025 Pre-fail Always - 10215
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 829
5 Reallocated_Sector_Ct 0x0033 252 252 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 252 252 051 Old_age Always - 0
8 Seek_Time_Performance 0x0024 252 252 015 Old_age Offline - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 3655
10 Spin_Retry_Count 0x0032 252 252 051 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 1
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 614
181 Program_Fail_Cnt_Total 0x0022 095 095 000 Old_age Always - 121186104
191 G-Sense_Error_Rate 0x0022 100 100 000 Old_age Always - 3016
192 Power-Off_Retract_Count 0x0022 252 252 000 Old_age Always - 0
194 Temperature_Celsius 0x0002 064 063 000 Old_age Always - 30 (Min/Max 13/47)
195 Hardware_ECC_Recovered 0x003a 100 100 000 Old_age Always - 0
196 Reallocated_Event_Count 0x0032 252 252 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 252 252 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 252 252 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0036 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x002a 100 100 000 Old_age Always - 1103
223 Load_Retry_Count 0x0032 100 100 000 Old_age Always - 1
225 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 851

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 3352 -

SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Completed [00% left] (0-65535)
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
==> WARNING: Using smartmontools or hdparm with this
drive may result in data loss due to a firmware bug.
****** THIS DRIVE MAY OR MAY NOT BE AFFECTED! ******
This may be worth investigating... It could explain what is happening to you.
11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 1
181 Program_Fail_Cnt_Total 0x0022 095 095 000 Old_age Always - 121186104
191 G-Sense_Error_Rate 0x0022 100 100 000 Old_age Always - 3016
200 Multi_Zone_Error_Rate 0x002a 100 100 000 Old_age Always - 1103
223 Load_Retry_Count 0x0032 100 100 000 Old_age Always - 1
None of these are good signs, but some of them may be just weird ways of reporting data.
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 3352 -
Definitely run a long test. It'll be interesting to see what the results are.
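
If you'd rather kick it off from the shell than the GUI, something like this should do it (a sketch; repeat for each device):

smartctl -t long /dev/ada0      # start the extended self-test (~354 minutes per your output above)
smartctl -l selftest /dev/ada0  # check progress and results afterwards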
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
One thing that can affect ZFS pools is disk drives without time-limited error recovery (aka TLER). Seagate calls it something else, but it's the same concept.

TLER tells the disk not to go to extremes during bad-block recovery. NAS drives generally use 7 seconds, but desktop and laptop drives typically default to something over 60 seconds to attempt recovery of a bad block. During this recovery, the disk can be totally unresponsive to any host requests, which can cause the OS or ZFS to give up on the drive.
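
Your smartctl output above shows "SCT Error Recovery Control supported", so you may be able to query and set the timeout (a sketch; values are in tenths of a second, and on many drives the setting does not survive a power cycle):

smartctl -l scterc /dev/ada0        # show the current read/write recovery timeouts
smartctl -l scterc,70,70 /dev/ada0  # set both to 7.0 seconds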

With proper redundancy (mirror, RAID-Zx, or "copies=2/3"), finding a bad block is nothing to worry about. ZFS will fetch the redundant copy, supply it to the user request, and re-write the failing/failed block with good data. SATA drives automatically do sector sparing when writing to a bad block.
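
ZFS keeps count of this in the pool status output, so you can watch for it there (a sketch; from your alert the pool appears to be named 123):

zpool status -v 123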

With your disks being an older model, AND desktop type, this could be your problem.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
What PSU do you have in that box? Older HDDs can be a bit power-hungry, and you may not have enough power.
 

flashdrive

Patron
Joined
Apr 2, 2021
Messages
264
@Ericloewe
==> WARNING: Using smartmontools or hdparm with this
drive may result in data loss due to a firmware bug.
****** THIS DRIVE MAY OR MAY NOT BE AFFECTED! ******


I was aware of this when I bought the drives and made sure that the new firmware was installed.

The un-fun part of this was that Samsung decided not to change the version number of the new firmware...
 

flashdrive

Patron
Joined
Apr 2, 2021
Messages
264
Are there any log files available in TrueNAS Core to have a look into this? I want to know what triggers the event that makes the drives disappear.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
time limited error recovery, (aka TLER).


Are there any log files available in TrueNAS Core to have a look into this?

Well, yes, the kernel should indicate what led to the device being dropped in the system log. That isn't necessarily going to tell you what's wrong, because the view from the system is likely "something went wrong"/"device went away", along the lines of what happens when you hot-swap a drive. There can be hints of problems leading up to it that could be interesting. These things can fail in so many ways that it's hard to guess what you might find.
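
On Core, the CAM layer logs these events, so something like this is where I'd start (a sketch; adjust the device names, and note that dmesg only holds the recent kernel buffer):

dmesg | grep -i ada
grep -iE 'ada[0-9]|cam' /var/log/messages | tail -n 50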
 

flashdrive

Patron
Joined
Apr 2, 2021
Messages
264
@NugentS
I have finally checked the PSU:

be quiet System Power 7 450W

During boot / startup I do not have issues with lost drives.
 

flashdrive

Patron
Joined
Apr 2, 2021
Messages
264
@Ericloewe

results from the 2 disks tested so far:

Test SMART "Long"

Device: /dev/ada0, 1 Currently unreadable (pending) sectors.

I will run the test for the other drives as well.
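
Something like this should show the relevant counters for all drives at once (a sketch; device names vary per system):

for d in /dev/ada?; do
  echo "== $d =="
  smartctl -A "$d" | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'
done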
 