
Solaris: warning pool has encountered an uncorrectable io error suspended

eleson

Junior Member
Joined
Jul 9, 2020
Messages
17
I need a nudge/hard kick in the right direction.

First:
zpool clear - hangs.
cd /var/log - sometimes hangs.
cat debug.log or cat console.log - hangs.
Single user mode seems to work.
I have not been able to read any logs.

I was worried about physical SATA failures, so I have replaced all the SATA cables.

The UI doesn't function/start. When I enter the IP address I can hear a hard disk spin up,
and that has to be one of the disks in the mirror, since the boot disk is an SSD.

After several evenings of googling I am out of ideas about how to move forward.

Any pointers on my options for moving on from here?

Code:
  pool: data
 state: ONLINE
status: One or more devices are faulted in response to IO failures.

  scan: scrub repaired 0B with 0 errors on Sun Sep 20 xxxx.
NAME              STATE      READ WRITE CKSUM
data              ONLINE        0     0     0
    mirror-0      ONLINE        0     0     0
       gptid/xxx  ONLINE        0     0     4
       gptid/yyy  ONLINE        0     0     4

errors: List of errors unavailable: pool I/O is currently suspended.
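
For anyone reading along: a device table like the one in this status output can be scanned for non-zero error counters with a few lines of Python. This is only a rough sketch and not part of the original post. The function name `devices_with_errors` and the trimmed `sample` text are made up for illustration, and it assumes the usual `NAME STATE READ WRITE CKSUM` column layout.

```python
def devices_with_errors(status_text):
    """Return (name, read, write, cksum) for each device row with a non-zero counter."""
    flagged = []
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "NAME":
            # Header row: the device table starts on the next line.
            in_table = True
            continue
        if fields[0] == "errors:":
            # Trailer line: the device table is over.
            in_table = False
        if in_table and len(fields) >= 5:
            name, _state, read, write, cksum = fields[:5]
            # Counters are compared as strings because zpool may abbreviate
            # large values (for example "1.2K").
            if (read, write, cksum) != ("0", "0", "0"):
                flagged.append((name, read, write, cksum))
    return flagged


# Trimmed copy of the status output posted in this thread.
sample = """\
  pool: data
 state: ONLINE
NAME              STATE      READ WRITE CKSUM
data              ONLINE        0     0     0
    mirror-0      ONLINE        0     0     0
       gptid/xxx  ONLINE        0     0     4
       gptid/yyy  ONLINE        0     0     4

errors: List of errors unavailable: pool I/O is currently suspended.
"""

for row in devices_with_errors(sample):
    print(row)
```

On the sample it flags both mirror members, each with a checksum count of 4. Identical counts on both sides of a mirror may hint at something the disks share (controller, power, memory) rather than two independent disk failures, but that is a guess, not a diagnosis.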
 

sretalla

Dedicated Sage
Joined
Jan 1, 2016
Messages
3,013
I guess dmesg is showing you a bunch of CAM errors

If you're sure the connections and power are good, then it may point to a SATA controller problem. Do you have other SATA ports or a different controller available to test?
 

eleson

Junior Member
Joined
Jul 9, 2020
Messages
17
Thanks for the response.
That thought has occurred to me.
I have an ASRock A320M Pro4 motherboard with 4 SATA connectors (and 2 unused M.2 connectors).
My knowledge of ZFS is not enough to understand the consequences of moving a disk from one SATA connector to another.
I will take the leap and see what happens.
Edit: unplugged the two disks, and now I can read the logs without hanging. I also see that the logs are a lot smaller.
But now they don't say anything about the problems.
The UI works now. Thanks for moving it forward.
 

HoneyBadger

Mushroom! Mushroom!
Joined
Feb 6, 2014
Messages
2,758
My knowledge of ZFS is not enough to understand the consequences of moving a disk from one SATA connector to another.
Assuming you created this pool through the FreeNAS UI, the consequences would be "nothing" as they'll be identified by their gptid rather than physical port. Go ahead.
 

eleson

Junior Member
Joined
Jul 9, 2020
Messages
17
Assuming you created this pool through the FreeNAS UI, the consequences would be "nothing" as they'll be identified by their gptid rather than physical port. Go ahead.
Not a good feeling right now. I moved one of the mirrored disks (still on the same motherboard) to an unused SATA port and got the same IO error.
I removed the other disk in the mirror; still the same problem with only one disk of the mirrored set connected.
I disconnected both 'faulty' disks and moved the boot disk to one of the 'faulty' SATA connectors, and it still boots OK.

I've ordered a new PCIe SATA card, but it feels like I need a bit of luck.
 

eleson

Junior Member
Joined
Jul 9, 2020
Messages
17
Status update: I have moved one of the disks to a Windows PC, and I can see it, read its status, and run a short SMART test successfully.
(I will do the same with the other one too.)

On the NAS, the two mirrored HDDs and the SSD boot disk are all connected to the same power cable, which has three connectors on it.
The SSD boots fine when the others are disconnected, but with one or both of them also connected,
it can boot, but it then probably hangs if I do 'ls -la /var/log'.
What are the odds that I am seeing a PSU problem? (Given that there is a 100% chance that I have an issue right now...)
How would I see if the 12V rail can't deliver enough amps, for instance?
Edit: Now tried the above with the mirrored disks swapped, with the same result. The disks seem OK in SMART tests, but the NAS won't run properly with a disk connected.
 

eleson

Junior Member
Joined
Jul 9, 2020
Messages
17
Some update.
New PCIe SATA card installed. Attached one disk to it and got:
"Solaris: warning pool has encountered an uncorrectable io error suspended", so the problem remains.
And the terminal becomes unresponsive if I do ls -la in /var/log.

Now I really could use some help on how to approach this error.
 

HoneyBadger

Mushroom! Mushroom!
Joined
Feb 6, 2014
Messages
2,758
Can you poll the SMART data of the disk through your FreeNAS machine with the new SATA card (also, which model card?)

This is pointing towards a failed PSU or other hardware fault.
 

eleson

Junior Member
Joined
Jul 9, 2020
Messages
17
Update and more frustration.
My SSD seems to use only +5V, so I bought a new mechanical hard disk, and that works fine in the TrueNAS server,
in all available SATA ports, on both the motherboard and the expansion card.
To move forward with this, I built a computer from scrap parts I had and installed my first FreeBSD server on it. There I could reach both faulting disks.
Put them back in the TrueNAS server and the problem remains.
Used my scrap power supply in the TrueNAS box, and voilà, now the disks work again!

Whilst writing this, I updated the installation, and suddenly the disks don't work again.
I was so hoping to mark it as completed.
Frustrating, but slowly moving forward.
 

eleson

Junior Member
Joined
Jul 9, 2020
Messages
17
Thursday 2020-11-19 22:35 CET+1
... continued.
First, I tried to start the new server from the old boot disk, but it failed. I didn't examine the root cause, given the hardware differences.

Decided to install TrueNAS on the 2nd hardware. Could import the disks fine on the fresh install.
(The faulty environment is newer and AMD/Ryzen; the older one is Intel i5.)
Imported a saved config into the 2nd hardware and ...

warning pool has encountered an uncorrectable io error suspended

Could this somehow be software induced?

I reinstalled from scratch and now I can reach the disks again. A scrub is ongoing; after that, a backup before more fact finding.

2020-11-20 16:30 CET+1
Scrub done, multiple restarts, no IO errors.
I have upgraded the pool, whatever that means.
Next step is to move all disks from this setup into the old chassis/motherboard/PSU and see if the issue reappears. (Hoping it will boot from the disk.)
 

eleson

Junior Member
Joined
Jul 9, 2020
Messages
17
Running this to its end.
All disks moved back to old hardware (mobo and PSU) and it boots and starts without error.
Differences are:
The boot disk is changed to a 2.5" disk I had lying around instead of the SSD.
Installed with the latest TrueNAS V12 version; the older one was some 6-8 weeks old.
The data disks that indicated the problem have been upgraded, whatever that means.

I am leaning more and more towards this being a software problem.

But I still have stuff to solve:
I want to challenge the SSD and see if that was the root cause.
I have no config data on the new setup, so users, shares, ACLs, jails etc. are gone. Last time I tried to load a saved config, I immediately got an IO error. (Hence my strong suspicion of this being a TrueNAS/ZFS related issue.)
 

sretalla

Dedicated Sage
Joined
Jan 1, 2016
Messages
3,013
I want to challenge the SSD and see if that was the root cause.
I have no config data on the new setup, so users, shares, ACLs, jails etc. are gone. Last time I tried to load a saved config, I immediately got an IO error.
It seems to me that you may have either corruption or a bad setting in your config DB.

If you use another install of an OS capable of reading ZFS, you may be able to mount the SSD and if your system dataset was on that disk, you could get back to the config before the current one to see if that's something that still works and has almost all of your settings in it.
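
One cheap first check on a saved config, sketched in Python: FreeNAS/TrueNAS config exports are normally SQLite database files, so SQLite's own `PRAGMA integrity_check` can tell whether a file is structurally damaged. The helper name `check_config_db` and the file name in the usage line are hypothetical, and this assumes the export is a plain `.db` file rather than a tar archive.

```python
import sqlite3


def check_config_db(path):
    """Return the messages from PRAGMA integrity_check (['ok'] when the file is clean)."""
    try:
        # Open read-only so the saved config is never modified by the check.
        conn = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
        try:
            rows = conn.execute("PRAGMA integrity_check").fetchall()
        finally:
            conn.close()
    except sqlite3.DatabaseError as exc:
        # Missing file, not a SQLite file at all, or too damaged to read.
        return [f"unreadable: {exc}"]
    return [row[0] for row in rows]


if __name__ == "__main__":
    # Hypothetical file name; point this at one of the saved configs.
    print(check_config_db("saved-config.db"))
```

A clean result only rules out file-level corruption. A bad setting inside an otherwise intact database would still pass, so this narrows things down rather than settling them.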
 

eleson

Junior Member
Joined
Jul 9, 2020
Messages
17
I am now convinced this is an issue with TrueNAS or ZFS.

I've been running a fresh install on the old hardware now for 4-5 with multiple reboots and no issues.
And saved a new config.
So I restored an old config as per above, and immediately on the first reboot I get the IO error again.
On every reboot.

Then I restored the newly saved config, and the problem disappeared.

So this is an issue with the TrueNAS software, probably a corrupt config.

Should I file a bug report?


Edit: Restoring the new config did remove the IO error. BUT the system becomes unresponsive after a few hours.
So it did not completely get me into a good state.
I probably don't dare to go that route again; it is tiresome and frustrating.
 

eleson

Junior Member
Joined
Jul 9, 2020
Messages
17
It seems to me that you may have either corruption or a bad setting in your config DB.

If you use another install of an OS capable of reading ZFS, you may be able to mount the SSD and if your system dataset was on that disk, you could get back to the config before the current one to see if that's something that still works and has almost all of your settings in it.
Thanks for the answer.
Seems like a corrupt config DB, yes.
I have quite a few configs saved on disk, but it is tiresome to test which one is OK.
I will make a few attempts, mostly to get the Emby media server up and running again.
 