252 Failed SMART Tests Shown But SMART Shows No Failures

brando56894 · Dec 10, 2022

I just upgraded from Bluefin RC1 to the 12/5 nightly, and while looking through the UI I noticed that under my largest pool, which has the most disks, it says there are 252 failed SMART tests.

I just looked through the SMART log of all the disks in that pool using

for i in {a..z}; do smartctl --attributes --log=selftest /dev/sd$i; done

and not a single one shows a failed test. I don't remember if I saw this on RC1.

justjasch · Dec 11, 2022

Hi, I can confirm this. I have the same "error" on one pool.
Checking manually, there are no SMART errors.

winnielinnie · Dec 11, 2022

What about checking the error log?

for i in {a..z}; do smartctl --log=error /dev/sd$i; done

joeschmuck · Dec 11, 2022

A Google search shows that the problem has already been reported (Jira link below). Please see if that ticket addresses your issue; if not, submit a bug report. Either way, I'd still report the issue, as the more times a problem is reported, the higher its priority will hopefully go.

Jira link: https://ixsystems.atlassian.net/browse/NAS-119323

winnielinnie · Dec 11, 2022

joeschmuck said:
Jira link: https://ixsystems.atlassian.net/browse/NAS-119323

So my non-expert hunch is that the middleware for this version of SCALE is interpreting any logged SMART selftest as "failed".

"21 logged tests? 21 failures!"

"252 logged tests? 252 failures!"

brando56894 said:
I noticed that under my largest pool, which has the most disks, it says there are 252 failed SMART tests.

Let me wager a guess: this pool has 12 drives?

252 / 21 = 12

The reason for "21" per drive is because the selftest log only saves the most recent 21 tests.
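To make the hunch concrete, here is a minimal bash sketch of the difference between counting every logged self-test and counting only the ones that actually failed. It is not the TrueNAS middleware code. The function names and the sample log lines are made up for illustration, and the status strings are the ones smartctl prints for ATA drives as far as I know; SCSI and NVMe output looks different.

```shell
#!/usr/bin/env bash
# Sketch only: both functions read the text printed by
# `smartctl --log=selftest /dev/sdX` (ATA format) on stdin.

# Count every logged self-test. Entry lines start with "#".
count_logged() {
    grep -c '^#' || true
}

# Count only entries whose status indicates a failure.
# Assumption: failures read "Completed: <something>" or
# "Fatal or unknown error"; a pass reads "Completed without error".
count_failed() {
    grep '^#' | grep -Ec 'Completed:|Fatal or unknown error' || true
}

# Hypothetical sample log, not taken from any drive in this thread.
sample_log='# 1  Short offline       Completed without error       00%      4321         -
# 2  Extended offline    Completed without error       00%      4300         -
# 3  Short offline       Completed: read failure       90%      4200         123456'

echo "logged=$(printf '%s\n' "$sample_log" | count_logged)"   # logged=3
echo "failed=$(printf '%s\n' "$sample_log" | count_failed)"   # failed=1

# 12 drives with a full 21-entry log each:
echo "entries in pool: $((12 * 21))"                          # entries in pool: 252
```

On a real system you would pipe the output of smartctl --log=selftest for each disk into count_failed and add up the results. If the UI number matches count_logged instead, that supports the "every logged test is a failure" theory.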

joeschmuck · Dec 11, 2022

winnielinnie said:
The reason for "21" per drive is because the selftest log only saves the most recent 21 tests.

That is interesting.

brando56894 · Dec 11, 2022

winnielinnie said:
So my non-expert hunch is that the middleware for this version of SCALE is interpreting any logged SMART selftest as "failed".

"21 logged tests? 21 failures!"

"252 logged tests? 252 failures!"

Let me wager a guess: this pool has 12 drives?

252 / 21 = 12

The reason for "21" per drive is because the selftest log only saves the most recent 21 tests.

You really are a wizard, haha. Yep, 2 vdevs of 6 drives each in RAIDZ2.

It looks like only one drive was throwing errors, and they all occurred at a power-on lifetime of 7 days, 12 hours. They're all the same error.

Code:

Error 147 occurred at disk power-on lifetime: 180 hours (7 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 41 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 b8 00 68 28 81 40 00      00:29:32.601  READ FPDMA QUEUED
  60 48 10 b8 29 81 40 00      00:29:32.600  READ FPDMA QUEUED
  60 80 08 28 29 81 40 00      00:29:32.600  READ FPDMA QUEUED
  60 50 00 10 28 81 40 00      00:29:32.600  READ FPDMA QUEUED
  60 08 00 00 28 81 40 00      00:29:32.600  READ FPDMA QUEUED

I also added my notes to that jira and linked to this thread.

winnielinnie · Dec 11, 2022

brando56894 said:
They're all the same error.

It's unrelated to the TrueNAS bug, but it's a lucky coincidence that you can at least catch it early.

I'd check the port and/or cable and/or the connection itself (behind the drive and on the motherboard/controller).
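The reason to look at cabling first is the ER value in the error log entry: 84 hex has the ICRC (interface CRC error) and ABRT (command aborted) bits set, which means the data was corrupted on the link between the drive and the controller and not on the platters. Below is a small bash sketch that decodes the ER byte. The function name is made up, and it only covers the four bits I'm sure about, so treat it as partial.

```shell
#!/usr/bin/env bash
# Sketch only: decode a subset of the ATA error register (the "ER" column
# in `smartctl --log=error` output). Argument is the two-digit hex value.
decode_ata_error() {
    local value=$((16#$1))   # parse the argument as hexadecimal
    local out=""
    if (( value & 0x80 )); then out+="ICRC,"; fi   # interface CRC error
    if (( value & 0x40 )); then out+="UNC,";  fi   # uncorrectable data
    if (( value & 0x10 )); then out+="IDNF,"; fi   # sector ID not found
    if (( value & 0x04 )); then out+="ABRT,"; fi   # command aborted
    echo "${out%,}"          # drop the trailing comma
}

decode_ata_error 84   # ICRC,ABRT
```

An entry with UNC set would point at the disk surface instead, so the decode helps decide whether to swap a cable or a drive.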

brando56894 said:
I also added my notes to that jira and linked to this thread.

@justjasch @joeschmuck @brando56894

Make sure to "vote" for the issue too.

brando56894 · Dec 12, 2022

All the drives are connected to an LSI 9201-16i HBA flashed to IT mode. I had other drives connected to the HBA and to the U.2 and SATA ports on my motherboard, and TrueNAS kept saying that there were tons of read errors and would kick drives out of the pool. That was not the pool above, but one with 3x 6 TB drives, and a mirror of 5 TB Seagate drives. It never had issues with the pool above, even though some drives from the other pool were also connected to the HBA.

I swapped ports (HBA to onboard SATA) and cables, and it kept happening. I even bought a different HBA, an ATTO H120F, since the LSI keeps throwing a "host bus degraded" alert. I can't figure out why; different slots and BIOS settings don't seem to matter, and the ATTO doesn't throw that error. But the ATTO seemed to cause a lot more errors for some reason! It's not a RAID card, I've confirmed that; it's just an HBA.

I ran the new MemTest version, which supports EFI and ECC. It ran for about 8 hours and showed no issues. I decided to throw the ATTO HBA and the two Seagate drives in my Windows desktop to see if it threw any errors, and it's been working perfectly, so I'm very confused why TrueNAS is showing a bunch of read/write errors when nothing else is.
