252 Failed SMART Tests Shown But SMART Shows No Failures

brando56894

Wizard
Joined
Feb 15, 2014
Messages
1,537
I just upgraded from Bluefin RC1 to the 12/5 nightly, and while looking through the UI I noticed that my largest pool, which has the most disks, is reported as having 252 failed SMART tests.

I just looked through the SMART logs of all the disks in that pool using

for i in {a..z}; do smartctl --attributes --log=selftest /dev/sd$i; done

and not a single one shows a failed test. I don't remember whether I saw this on RC1.
 

justjasch

Dabbler
Joined
May 8, 2022
Messages
20
Hi, I can confirm this. I have the same "Error" on one pool.
Checking manually, there are no SMART errors.
 
Joined
Oct 22, 2019
Messages
3,641
What about checking the error log?

for i in {a..z}; do smartctl --log=error /dev/sd$i; done
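
Since the full error logs for that many drives are long, a shorter per-drive summary may be easier to scan. The following is only a minimal sketch: the helper name ata_error_count is hypothetical, and it assumes the ATA error log reports its total on a line starting with "ATA Error Count:" (and omits that line when nothing is logged).

```shell
# Print the ATA error count found in `smartctl --log=error` output on stdin.
# Assumption: the log contains a line "ATA Error Count: N" when errors exist
# and no such line otherwise, in which case this prints 0.
ata_error_count() {
  awk '/^ATA Error Count:/ { n = $4 } END { print n + 0 }'
}

# Usage (needs root and smartmontools; adjust the device glob to your system):
#   for d in /dev/sd[a-z]; do
#     printf '%s: %s\n' "$d" "$(smartctl --log=error "$d" | ata_error_count)"
#   done
```

Using a glob such as /dev/sd[a-z] instead of {a..z} skips device names that do not exist, which keeps the output free of "No such device" noise.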
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
A Google search shows that the problem has already been reported (link below). Please see if this addresses your issue; if not, submit a bug report. Even so, I'd still report the issue, since the more times a problem is reported, the higher its priority will hopefully go.

Jira ticket: https://ixsystems.atlassian.net/browse/NAS-119323
 
Joined
Oct 22, 2019
Messages
3,641
So my non-expert hunch is that the middleware for this version of SCALE is interpreting any logged SMART selftest as "failed".

"21 logged tests? 21 failures!"

"252 logged tests? 252 failures!"



I noticed that under my largest pool, which has the most disks it says there are 252 failed SMART tests.

Let me wager a guess: this pool has 12 drives? :wink:

252 / 21 = 12

The reason for "21" per drive is that the selftest log only keeps the most recent 21 tests.
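
One way to check this hunch is to compare, per drive, how many selftest entries are logged against how many of them actually failed. This is only a sketch under assumptions: the helper name count_selftests is hypothetical, entry lines are assumed to start with "# <number>" as in typical ATA output, and a failed entry is assumed to have a status containing "failure" or "Fatal".

```shell
# Read `smartctl --log=selftest` output on stdin and print "<logged> <failed>".
# Assumptions: each log entry line starts with "#" followed by its number,
# and failed statuses contain "failure" or "Fatal" (typical ATA wording).
count_selftests() {
  awk '
    /^# *[0-9]+ / {
      logged++
      if ($0 ~ /failure|Fatal/) failed++
    }
    END { printf "%d %d\n", logged, failed }
  '
}

# Usage (needs root and smartmontools):
#   smartctl --log=selftest /dev/sda | count_selftests
```

If every drive prints something like "21 0", then 12 drives give 12 x 21 = 252 logged tests and 0 real failures, which would match the number shown in the UI.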
 

brando56894

Wizard
Joined
Feb 15, 2014
Messages
1,537
You really are a wizard, haha. Yep, 2 vdevs of 6 drives each in raidz2.

It looks like only one drive was throwing errors. They all occurred at a power-on lifetime of 7 days, 12 hours, and they're all the same error.
Code:
Error 147 occurred at disk power-on lifetime: 180 hours (7 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 41 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 b8 00 68 28 81 40 00      00:29:32.601  READ FPDMA QUEUED
  60 48 10 b8 29 81 40 00      00:29:32.600  READ FPDMA QUEUED
  60 80 08 28 29 81 40 00      00:29:32.600  READ FPDMA QUEUED
  60 50 00 10 28 81 40 00      00:29:32.600  READ FPDMA QUEUED
  60 08 00 00 28 81 40 00      00:29:32.600  READ FPDMA QUEUED


I also added my notes to that jira and linked to this thread.
 
brando56894

Wizard
Joined
Feb 15, 2014
Messages
1,537
All the drives are connected to an LSI 9201-16i HBA flashed to IT mode. I had other drives connected to the HBA and to the U.2 and SATA ports on my motherboard, and TrueNAS kept saying that there were tons of read errors and would kick drives out of the pool. That was not the pool above, but one with 3x 6 TB drives, and a mirror of 5 TB Seagate drives. The pool above never had issues, even though some drives from the other pool were also connected to the HBA.

I swapped ports (HBA to onboard SATA) and cables, and it kept happening. I even bought a different HBA, an ATTO H120F, since the LSI keeps throwing an alert (host bus degraded; I can't figure out why, and different slots and BIOS settings don't seem to matter; the ATTO doesn't throw that error), but that seemed to cause a lot more errors for some reason! It's not a RAID card, I've confirmed that; it's just an HBA.

I ran the new MemTest version, which supports EFI and ECC. It ran for about 8 hours and showed no issues. I then put the ATTO HBA and the two Seagate drives in my Windows desktop to see if it threw any errors, and it's been working perfectly, so I'm very confused about why TrueNAS is showing a bunch of read/write errors when nothing else is.
 