Boot pool degraded - no errors

chuck32

Guru
Joined
Jan 14, 2023
Messages
623
I started configuring the multi_report.sh script and, to my surprise, the first run greeted me with a critical error message.

My boot pool is apparently degraded.

Hardware:
Code:
TrueNAS-SCALE-23.10.1
Supermicro X10SLL-F, i3-4130, 16 GB ECC RAM, Seasonic Prime PX-750
Data pool: 2*8 TB mirror
Data pool 2: 1*8 TB
boot pool: 1*128 GB SSD
UPS: Eaton Eco 650


Code:
pool: boot-pool
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 00:00:31 with 0 errors on Sat Feb 17 03:45:32 2024
config:

    NAME        STATE     READ WRITE CKSUM
    boot-pool   DEGRADED     0     0     0
      sdc3      DEGRADED     0     0     0  too many errors

errors: No known data errors
sdc3 -> sdc3


Code:
########## SMART status report for sdc drive (INTENSO SSD : AA000000000000001840) ##########

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0032   100   100   050    Old_age   Always       -       0
  5 Reallocated_Sector_Ct   0x0032   100   100   050    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   050    Old_age   Always       -       1632
 12 Power_Cycle_Count       0x0032   100   100   050    Old_age   Always       -       14
160 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       0
161 Unknown_Attribute       0x0033   100   100   050    Pre-fail  Always       -       100
163 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       8
164 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       10351
165 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       86
166 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       1
167 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       42
168 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       5050
169 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       100
175 Program_Fail_Count_Chip 0x0032   100   100   050    Old_age   Always       -       0
176 Erase_Fail_Count_Chip   0x0032   100   100   050    Old_age   Always       -       0
177 Wear_Leveling_Count     0x0032   100   100   050    Old_age   Always       -       0
178 Used_Rsvd_Blk_Cnt_Chip  0x0032   100   100   050    Old_age   Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   050    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   050    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   050    Old_age   Always       -       3
194 Temperature_Celsius     0x0022   100   100   050    Old_age   Always       -       40
195 Hardware_ECC_Recovered  0x0032   100   100   050    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   100   100   050    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   050    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0032   100   100   050    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   050    Old_age   Always       -       0
232 Available_Reservd_Space 0x0032   100   100   050    Old_age   Always       -       100
241 Total_LBAs_Written      0x0030   100   100   050    Old_age   Offline      -       27485
242 Total_LBAs_Read         0x0030   100   100   050    Old_age   Offline      -       2999
245 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       116886

No Errors Logged

Most recent Short & Extended Tests - Listed by test number
# 1 Short offline Completed without error 00% 1487 -
# 2 Extended offline Completed without error 00% 1381 -


SCT Error Recovery Control:  SCT Error Recovery Control command not supported


While running the script, I also received these errors:

Code:
Collecting data, Please wait...
Partition 1 does not start on physical sector boundary.
Partition 2 does not start on physical sector boundary.
Partition 1 does not start on physical sector boundary.
Partition 2 does not start on physical sector boundary.
Partition 1 does not start on physical sector boundary.
Partition 2 does not start on physical sector boundary.


fdisk -l shows the errors relate to zd0, which is the zvol for my pfSense VM.

Code:
Disk /dev/zd0: 12 GiB, 12884918272 bytes, 25165856 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 16384 bytes
I/O size (minimum/optimal): 16384 bytes / 16384 bytes
Disklabel type: gpt
Disk identifier: 9D15D700-C5E8-11EE-BAA7-00A0982879CB

Device       Start      End  Sectors  Size Type
/dev/zd0p1      40   532519   532480  260M EFI System
/dev/zd0p2  532520   533543     1024  512K FreeBSD boot
/dev/zd0p3  534528  2631679  2097152    1G FreeBSD swap
/dev/zd0p4 2631680 25163775 22532096 10.7G FreeBSD ZFS

Partition 1 does not start on physical sector boundary.
Partition 2 does not start on physical sector boundary.


This leads me to think I can ignore those warnings, and to assume they are not related to the degraded boot pool.

My course of action would be to zpool clear the boot pool and wait to see what happens. The main reason for my question here, though, is: why is the status degraded if no error is logged? Where can I check?
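In case it matters, what I have in mind is roughly this (pool name taken from the status output above; the scrub is optional but forces a full re-read):

Code:
# reset the error counters so any new errors stand out
zpool clear boot-pool
# optional: scrub to re-read every block and re-verify its checksum
zpool scrub boot-pool
# watch whether the counters climb again
zpool status boot-pool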

edit: almost forgot another confusing thing: I did not receive an email alert or an alert in the GUI.
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
Why is the status degraded if no error is logged?
what do you mean by this? the pool is degraded because there are too many errors... logged by zfs. this generally means that the data returned didn't match its checksums hundreds of times.
you say you have 2 SSDs in your boot pool, but that zpool status output only shows one. with no redundancy in the pool and the only disk returning garbage, just about anything could happen, as things are being corrupted.

zfs reports actual errors identified by checksums failing to match returned data. smartctl attempts to predict drive failure. these are not directly related, and a drive can be failing yet pass smartctl perfectly.

you need the 2nd drive for your boot pool, and while adding it you should inspect your cables and connections. bad cables, loose connections, and bad controllers can all cause errors when reading data.
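if you want to re-check the disk itself after reseating everything, something like this works (device name taken from your output; smartctl prints the expected test duration when you start it):

Code:
# start a long self-test; it runs in the drive's firmware
smartctl -t long /dev/sdc
# once it finishes, pull the full report including the self-test log
smartctl -a /dev/sdc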
 

chuck32

Guru
Joined
Jan 14, 2023
Messages
623
Thanks for your reply!

what do you mean by this? the pool is degraded because there are too many errors... logged by zfs.
Doesn't ZFS then show the damaged files in zpool status? I assumed the read / write / checksum columns wouldn't all print 0.

Also I expected an alarm in the GUI / via the email alert.

If there are indeed errors, I'll reinstall TrueNAS and see if they come back; in that case I'd RMA the drive, since it's not that old. I have to confess that I never bothered burn-in testing my SSDs for non-mission-critical applications; I only do that with my HDDs for data storage.

you say you have 2 SSDs in your boot pool, but that zpool status output only shows one.
TrueNAS-SCALE-23.10.1
Supermicro X10SLL-F, i3-4130, 16 GB ECC RAM, Seasonic Prime PX-750
Data pool: 2*8 TB mirror
Data pool 2: 1*8 TB
boot pool: 1*128 GB SSD
UPS: Eaton Eco 650
Maybe you clicked on the main system from my signature? This system only has one SSD as the boot pool. I thought about adding a second one for redundancy but ultimately decided that, with current config backups, it's not needed.
 

chuck32

Guru
Joined
Jan 14, 2023
Messages
623
I reinstalled SCALE and swapped the SATA cable just in case. So far so good. I may purchase another SSD so that next time I hopefully only need to replace a disk rather than reinstall.

I'm still curious about the 0 error count and why I didn't receive an alert.

Thinking back to the last time I had a degraded pool, I need to revise this:
Doesn't ZFS then show the damaged files in zpool status? I assumed the read / write / checksum columns wouldn't all print 0.
It did indeed not show the specific files that were corrupt, but at least the checksum error count wasn't 0. In that case it was a rather simple problem: I was running a Y-splitter for the power connection, previously powered via Molex (the somewhat recommended way, if you must use one), and while working on the server I had switched to a SATA cable to power the splitter. After tossing the Y-splitter everything was fine.
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
I'm still curious about the 0 error count and why I didn't receive an alert.
after a certain point, zfs stops bothering to count the number of errors, and just fails the disk out of the pool. since you only HAVE 1 disk in the pool, it marked it as degraded but didn't fail the pool
Maybe you clicked on the main system from my signature?
yup, that's what I was looking at. I thought they were the same; that's my mistake.
Doesn't ZFS then show the damaged files in zpool status?
only if it detects that files are damaged. it depends on exactly what the errors are and what is affected. you can also have errors in the checksums, but not the data itself. (there are checksums of checksums of data)
zfs basically considers any disk that returns enough errors to get marked as "too many errors" as a dead disk. manual intervention is required to determine if it's the disk, controller, or connection.
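a rough triage, assuming the device is still /dev/sdc, could look like this:

Code:
# kernel-level errors (link resets, timeouts) point at cable or controller
dmesg | grep -iE 'sdc|ata'
# extended SMART dump, including phy event counters for cabling problems
smartctl -x /dev/sdc
# zfs's own verbose view of the pool
zpool status -v boot-pool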
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
Also I expected an alarm in the GUI / via the email alert.
not sure why you didn't get alerts about it. do you have alerts disabled or anything?
it is possible that, because the boot disk (which is pretty important) was in a degraded state, that functionality didn't work correctly due to something being damaged.
 

chuck32

Guru
Joined
Jan 14, 2023
Messages
623
not sure why you didn't get alerts about it. do you have alerts disabled or anything?
I think it's a bug.
after a certain point, zfs stops bothering to count the number of errors, and just fails the disk out of the pool.
That's just a detail, but I'd rather it displayed "too many errors" or the like instead of 0. I may start a feature request (if that's the right place) to see if anyone thinks it would be worth changing the output. That should take the confusion out of a degraded but error-free pool.
manual intervention is required to determine if it's the disk, controller, or connection.
Thanks again, time will tell :) So far it looks okay.
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
i'm confused.

chuck32

Guru
Joined
Jan 14, 2023
Messages
623
i'm confused.
Maybe I am ;)

Apparently it took me until now to finally realize what you were trying to tell me all along...

Instead of displaying a number for the error count, ZFS didn't bother and just wrote "too many errors". I stubbornly read this as: 0 errors, and then ZFS came to the conclusion that those were too many errors. Thanks for your patience, now I get it!
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
I stubbornly read this as: 0 errors,
ooooookay. that makes at least some of this thread make more sense.
sometimes it will show errors there, sometimes just zeros; once zfs decides it's over the threshold for "too many errors", those numbers become meaningless.
 

chuck32

Guru
Joined
Jan 14, 2023
Messages
623
This morning I was greeted with 2 write errors on the boot pool, but only thanks to the multi_report script; I haven't updated yet, so the bug that keeps TrueNAS from alerting me is still there.
The automatic long SMART test aborted at 80% (host reset). A manual test ran and passed with no errors.

Since I already switched the SATA cable and doubt it's just that specific SATA port on the mainboard, I will RMA the drive.

I already have the same drive here as a replacement (originally planned to be used as a mirror). I'll order another 128 GB SSD from another manufacturer (just in case these cheap SSDs are generally not good) and set up a mirrored boot pool.
I will use the shutdown to update to 23.10.2.

I will report back on whether the errors went away or further investigation is needed.
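For the mirror itself, TrueNAS has a UI for attaching a second boot device; underneath it is roughly a zpool attach. Just a sketch with hypothetical device names, and the new disk would need matching partitioning first:

Code:
# attach converts the single-disk vdev into a mirror; resilvering starts automatically
zpool attach boot-pool sdc3 sdd3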
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
You might want to run a memtest just to eliminate things. ECC is good but not perfect.
I believe there are cases it can't detect; they are just really, really rare.
 

chuck32

Guru
Joined
Jan 14, 2023
Messages
623
You might want to run a memtest just to eliminate things. ECC is good but not perfect.
Good point. I didn't mention it, but I checked via IPMI whether the event log showed any memory correction events before deciding it's most likely the SSD.

I'll run memtest for a day after the weekend and see if it passes.
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
Always run it till it completes a full pass, however long that takes, though 16 GB on the board you list shouldn't take long.
 

chuck32

Guru
Joined
Jan 14, 2023
Messages
623
I ran 22 passes with no errors, so I am fairly confident it was the SSD itself, unless the errors resurface on the SSD that is connected to the previously used SATA port.

I'll use VeraCrypt to encrypt the disk with several wipe passes and then RMA it. Since my homelab is not open to the internet except for a VPN connection, I doubt there's sensitive information on the boot pool anyway, but better safe than sorry ;)
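As a simpler alternative sketch for wiping an SSD before RMA (device name hypothetical, so triple-check it before running anything):

Code:
# tell the SSD firmware to discard every block (fast)
blkdiscard /dev/sdX
# or overwrite once with random data instead
dd if=/dev/urandom of=/dev/sdX bs=1M status=progress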

Thank you for your support! Hopefully this will finalize my work on this machine for now.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,995
Try zpool status -xv to display all the good data. But it might be too late now since you reinstalled TrueNAS.
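For reference, what the flags do (per the zpool-status man page):

Code:
# -x: only show pools with errors or that are otherwise unavailable
# -v: list the files affected by any known data errors
zpool status -xv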

Your SSD is highly unlikely to be the cause of a ZFS error; there is no data to support that here, nor would I expect the RMA to be honored without a real error in the SMART data.

The error could have occurred during a power-off or reboot event, but you were using a buggy SCALE release at the time, and that could have been the cause (that is where my money is). A ZFS error is generally a data corruption event, but usually not hardware driven. But I don't know it all and learn new things every day; I'm just saying that your SSD is likely not the issue. Your RAM was tested and all good there; next, CPU burn-in.

There is only so much we can test, the chipset could be the issue on the motherboard.

My advice, just monitor it.
 

chuck32

Guru
Joined
Jan 14, 2023
Messages
623
Try zpool status -xv to display all the good data. But it might be too late now since you reinstalled TrueNAS.
I haven't wiped the drive yet, so I could still boot from it if I put it in the server again. This being just the boot pool, I don't think I'd gain anything from that now.
But the command may be handy in the future.

Your SSD is highly unlikely to be the cause of a ZFS error; there is no data to support that here, nor would I expect the RMA to be honored without a real error in the SMART data.
The SSD was 13.99 EUR, so no real loss here. If it's not the cause / not faulty, I'm sure it will find a purpose in the future.
I was / am set on the SSD since I only experienced errors on it and on none of my other drives (luckily).

The error could have occurred during a power-off or reboot event, but you were using a buggy SCALE release at the time, and that could have been the cause (that is where my money is). A ZFS error is generally a data corruption event, but usually not hardware driven.
Can you elaborate on the buggy SCALE release? I'm running the same version on my main machine with basically zero issues. Or was there a bug reported similar to my issue that I missed?
Between reinstalling and the new errors occurring, there was no reboot / power-off event that I know of.
Code:
5    2024/02/17 21:35:56    OEM    AC Power On    AC Power On - Assertion
6    2024/02/25 14:45:49    OEM    AC Power On    AC Power On - Assertion

I reinstalled on the 17th and caught new errors on the 23rd. The server is running off a UPS, so even if an unplanned restart would not result in an AC Power On - Assertion event, the shutdown should at least have been graceful. And I didn't receive a UPS alert either.

next, CPU burn-in.
Nothing lasts forever, but I pulled the CPU from my wife's computer in working condition. She didn't report any problems with it before.
I guess a Linux live USB and then running Prime95 for a few hours will do the trick? IIRC that's what I did on my main machine.
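Something like this, if I remember the Linux build correctly (mprime is the console build of Prime95; binary and package names may differ):

Code:
# run the Prime95 torture test from a console
./mprime -t
# in a second terminal, keep an eye on temperatures (lm-sensors package)
watch -n 2 sensors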

edit: I have to admit I only ran memtest when I purchased the memory and mainboard for this machine; I skimped on a CPU test.

My advice, just monitor it.
Of course! I just hope it's the SSD and not something else in the system, because the machine also holds a single HDD as a backup location for a friend (I advised him to get two drives for redundancy; he knows the risks). Replicating via VPN takes a long time, but even that would just be annoying and not the end of the world.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,995
Can you elaborate on the buggy SCALE release? I'm running the same version on my main machine with basically zero issues. Or was there a bug reported similar to my issue that I missed?
There were a lot of trouble tickets (Jira); it's getting better. This is one issue with an open hardware architecture, a problem you would be hard-pressed to find on an Apple platform. I'm not pro-Apple; it's just an observation that software folks need to accommodate so many versions of hardware. The issue is that SCALE is not nearly as mature as CORE.

I guess a Linux live USB and then running Prime95 for a few hours will do the trick?
Yup

Hopefully the issue does not come back, but my money is on the software or some hardware burp. All you can do is watch to see what happens and try to be attentive to changes you may make. Keep your fingers crossed.
 

chuck32

Guru
Joined
Jan 14, 2023
Messages
623
Hopefully the issue does not come back, but my money is on the software or some hardware burp. All you can do is watch to see what happens and try to be attentive to changes you may make. Keep your fingers crossed.
9.5 hours of small FFT testing -> no crash. Max. temperature was 69 °C.

This leaves software, mainboard or the SSD. I'll upgrade to 23.10.2 when I fire the server up again and see how it goes from there.

I guess it's safe to assume that my actual data on the data pools is not at risk from a certain SCALE release? At least it wouldn't be silent corruption :frown: I'll try not to worry too much; my main system never threw any errors (except that one time I used a Y-splitter).

edit: 23.10.2 installed fresh from ISO with restored config - let's see.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,995
data pools is not at risk from a certain SCALE release?
It shouldn't be; there is always a possibility, but I have not seen it in my many years with FreeNAS/TrueNAS, which is a very good thing.

edit: 23.10.2 installed fresh from ISO with restored config - let's see.
That is the way I would do it as well, and actually have. I try to grab the current version for script development.
 
Top