One HDD passes all SMART tests, but TrueNAS hates it. Is this a config issue on my part?

Salvorite

Cadet
Joined
Jan 28, 2024
Messages
7
I have four WD Red 4TB HDDs that I'm trying to create a pool with. TrueNAS reports all sorts of errors with one of the drives. I've run Short, Conveyance, and Extended SMART tests on the drive and it passes them all. Is this user error on my part, or is the drive dead in a way that SMART tests don't catch?

Note that my NAS has 10 other unrelated drives in it that are part of different pools and work OK.

Any help is appreciated, thank you!

Attached the system error log for more info. Please let me know if other logs would be helpful.
 

Attachments

  • truenas-log.txt
    37.7 KB · Views: 23

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Are you sure the disk isn't logging interface CRC errors? That many errors screams "cable issue".
 

Salvorite

I thought it might be a cable issue as well, so I tried a couple of other cables, but the issue remained.
 

Salvorite

I double checked the SMART data for all four of these drives and they all show 0 for Interface CRC Errors.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Post the output of smartctl -x /dev/sdk to see what is going on, hopefully.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Are those SMR or CMR disks?

Older WD Red 4TB HDDs were CMR (preferred by ZFS), but newer ones are SMR, which are a poor fit for ZFS.
 

joeschmuck

I'm thinking 'nvme1' based on the log file, but I won't know until the requested info is posted.
Der... it wouldn't be sdk then.

Also, please follow the forum rules (red text at the top of each page) and report your hardware. It may be more than just a hard drive issue.
 

Salvorite

The smartctl output is attached, thank you!
 

Attachments

  • smartctl-output.txt
    11 KB · Views: 30

Ericloewe

The NVMe errors are yet another problem, though not directly related.
 

joeschmuck

Salvorite said:
I thought it might be a cable issue as well, I tried a couple of other cables and the issue remained.

UDMA CRC errors remain recorded even after you fix the problem. A count of 1 is okay; a count that keeps increasing is a problem.
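
Editor's sketch: one way to tell a stale CRC count from a growing one is to read the raw value twice and compare. The snippet below is a minimal sketch, not a TrueNAS tool. It assumes the usual ten-column attribute table printed by `smartctl -A` and that the drive reports the counter as attribute 199; the helper names `crc_count` and `crc_trend` are hypothetical.

```shell
#!/bin/sh
# Print the raw value (10th column) of SMART attribute 199
# (UDMA_CRC_Error_Count) from `smartctl -A` output supplied on stdin.
# Exits non-zero if the attribute is not present (assumed table layout).
crc_count() {
    awk '$1 == 199 { print $10; found = 1 } END { if (!found) exit 1 }'
}

# Compare two readings taken some time apart and report the trend.
crc_trend() {
    before=$1
    after=$2
    if [ "$after" -gt "$before" ]; then
        echo "increasing"
    else
        echo "stable"
    fi
}

# Usage on a live system (device name taken from this thread):
#   first=$(smartctl -A /dev/sdk | crc_count)
#   ... move some data across the cable, then read again ...
#   second=$(smartctl -A /dev/sdk | crc_count)
#   crc_trend "$first" "$second"
```

A "stable" result after swapping cables suggests the recorded errors are historical; "increasing" points back at the cable, backplane, or controller port.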

I see nothing wrong with your hard drive based on the SMART data.

Still waiting on a list of your hardware.
 

Salvorite

Apologies, here are the hardware specifications:

Motherboard make and model
MSI Z490-A PRO

CPU make and model
Intel Core i7-10700T

RAM quantity
128 GB

Hard drives, quantity, model numbers, and RAID configuration, including boot drives
1x Samsung 870 EVO 1TB SATA (Boot)
6x WD Red 6TB SATA (RAID Z2)
2x Samsung 850 EVO 512GB SATA (Write Cache)
2x Samsung 970 EVO Plus 2TB NVMe (Mirrored)
4x WD Red 4TB SATA (Not Working)

Hard disk controllers
LSI 9300-16i

Network cards
Realtek RTL8125B-CG
 

Arwen

Not sure what this means:
ATA Security is: ENABLED, PW level HIGH, **LOCKED** [SEC4]
Wt Cache Reorder: Unknown (SCT not supported if ATA Security is LOCKED)

But it appears SATA security is enabled on the drive, and perhaps on all 4. I would guess that if the drive is not unlocked with the proper password, reads and writes would fail.
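
Editor's sketch: to check whether all four drives carry the same lock, the "ATA Security is:" line can be pulled out of each drive's `smartctl -x` output. This is a minimal sketch under one assumption: that a locked drive is marked with the literal `**LOCKED**` token, as in the line quoted above. How other states are worded is not shown in this thread, so they all fall through to "not-locked". The function name `ata_lock_state` is hypothetical.

```shell
#!/bin/sh
# Classify the first "ATA Security is:" line found on stdin
# (expected input: the output of `smartctl -x <device>`).
ata_lock_state() {
    # Keep only the first matching line; read all input to avoid SIGPIPE.
    line=$(awk '/ATA Security is:/ && !seen { print; seen = 1 }')
    case $line in
        '')             echo "unknown" ;;     # no security line in the input
        *'**LOCKED**'*) echo "locked" ;;      # token taken from the quoted output
        *)              echo "not-locked" ;;  # assumption: anything else
    esac
}

# Usage on a live system (replace the device list with your own):
#   for d in /dev/sdk /dev/sdl; do
#       printf '%s: ' "$d"; smartctl -x "$d" | ata_lock_state
#   done
```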
 

Salvorite

Arwen said:
Not sure what this means;
ATA Security is: ENABLED, PW level HIGH, **LOCKED** [SEC4]
Wt Cache Reorder: Unknown (SCT not supported if ATA Security is LOCKED)

But, it appears SATA security is enabled on the drive. Perhaps all 4. I would guess that if the drive is not un-locked with the proper password, reads and writes would fail.

Agreed, I was curious about this. I pulled these drives from a QNAP and wiped them, but it's possible there's some protection on them I'm not familiar with.
 

joeschmuck

While you are not using the correct parts for a ZFS server (No ECC Support), that should not be causing the issue you are having.

Your LSI cards, what firmware are you running?

Your hard drives, what is the specific connectivity (where is each drive physically connected)? Is the problematic drive connected to the LSI card or to the motherboard? I'm thinking the LSI card is an issue.

Arwen said:
But, it appears SATA security is enabled on the drive. Perhaps all 4. I would guess that if the drive is not un-locked with the proper password, reads and writes would fail.

I saw that and didn't think much of it since SMART data was retrievable. Now I'm not so sure.

Let's see if you can unlock the drive (xxxx stands for the drive's security password)...
hdparm --security-disable xxxx /dev/sdk

You might even need to look into other tools to unlock it.

Salvorite said:
Agreed, I was curious about this. I pulled these drives from a QNAP and wiped them, but it's possible there's some protection on them I'm not familiar with.

Yup, that could cause the lock. What device did you wipe them in? The QNAP? I ask because there may be a password specific to the device.
 

joeschmuck

One other command may help: hdparm --security-unlock xxxx /dev/sdk. But I'm hoping the first command works.

Then check the status by looking at the SMART output again, see if it says it's locked.
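
Editor's sketch: the steps suggested in this thread can be written down as a reviewable command list before anything touches the disk. The helper below only prints the commands; it never runs them. The name `ata_unlock_cmds` is hypothetical, the order simply follows the thread (disable first, unlock as the fallback, then re-read the SMART output), and the password argument is whatever the drive was locked with.

```shell
#!/bin/sh
# Print, without executing, the commands suggested in this thread for
# clearing an ATA security lock and then re-checking the drive.
#   $1 = device node, $2 = drive security password
ata_unlock_cmds() {
    dev=$1
    pass=$2
    echo "hdparm --security-disable $pass $dev"   # first attempt
    echo "hdparm --security-unlock $pass $dev"    # fallback if the first fails
    echo "smartctl -x $dev"                       # re-check the 'ATA Security is:' line
}

# Usage (placeholder password, device name from this thread):
#   ata_unlock_cmds /dev/sdk xxxx
```

Keep in mind that a password passed on the command line ends up in shell history and is visible in the process list while the command runs.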
 

Salvorite

You folks are amazing! It was the lock. Thank you so much!

I ran
Code:
hdparm --security-disable xxxx /dev/sdk
and got an I/O error, but on a whim I tried
Code:
hdparm --security-disable password /dev/sdk
and it unlocked it!

I was able to create the new pool with the 4 HDDs.

Now, for the rest of the issues you've identified.

All the SATA drives are connected to the LSI controller except the boot drive, which is connected to SATA port #1 on the motherboard.
The NVMe drives are in the motherboard's two available M.2 slots.

For the LSI firmware, here are the versions:
Main 16.00.12.00
BIOS 8.37.00.00
Signed BIOS 18.00.00.00

As for the non-ECC memory, I must unfortunately take this risk for the time being. This is a Frankenstein NAS/home server built from parts I had on hand. Am I correct to presume that without ECC, data corrupted in memory could be written to disk as if it were valid? I am "mitigating" this by taking monthly off-site backups, but I do realize I am not eliminating my risk.
 

joeschmuck

You must have searched the internet for that password. I saw it but wasn't sure it was the right one. Glad you figured it out, but @Arwen identified the key issue.
 