Solaris: warning pool has encountered an uncorrectable io error suspended

eleson

Dabbler
Joined
Jul 9, 2020
Messages
18
I need a nudge/hard kick in the right direction.

First:
zpool clear - hangs.
cd /var/log - sometimes hangs.
cat debug.log or cat console.log - hangs.
Single-user mode seems to work.
I have not been able to read any logs.

I was worried about physical SATA failures, so all SATA cables have been replaced.

The UI doesn't function/start. When I enter the IP address I can hear a hard disk spin up,
and that has to be one of the disks in the mirror, since the boot disk is an SSD.

After several evenings of googling I am out of ideas on how to move forward.

Any pointers on where to go from here?

Code:
  pool: data
 state: ONLINE
status: One or more devices are faulted in response to IO failures.

  scan: scrub repaired 0B with 0 errors on Sun Sep 20 xxxx
config:

        NAME            STATE     READ WRITE CKSUM
        data            ONLINE       0     0     0
          mirror-0      ONLINE       0     0     0
            gptid/xxx   ONLINE       0     0     4
            gptid/yyy   ONLINE       0     0     4

errors: List of errors unavailable: pool I/O is currently suspended.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,702
I guess dmesg is showing you a bunch of CAM errors.

If you're sure the connections and power are good, then it may point to a SATA controller problem. Do you have other SATA ports or a different controller available to test?
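
If you can get to a shell (single-user mode counts), something along these lines should show whether the kernel is in fact logging CAM/ATA trouble; /var/log/messages is only worth trying if the directory is readable at that moment:

Code:
# Look for CAM/ATA command timeouts and retries in the kernel message buffer
dmesg | grep -iE 'cam|ata|timeout|retrying|error'

# The consolidated kernel log, if /var/log isn't hanging at the time
grep -iE 'cam status|uncorrectable|retrying command' /var/log/messages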
 

eleson

Dabbler
Joined
Jul 9, 2020
Messages
18
Thanks for the response.
That thought had occurred to me.
I have an ASRock A320M Pro4 motherboard with 4 SATA connectors (and 2 unused M.2 connectors).
My knowledge of ZFS is not enough to understand the consequences of moving a disk from one SATA connector to another.
I will take the jump and see what happens.
Edit: I unplugged the two disks, and now I can read the logs without hanging. I also see that the logs are a lot smaller.
But now they don't say anything about the problem.
The UI works now. Thanks for moving this forward.
 
Last edited:

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
My knowledge of ZFS is not enough to understand the consequences of moving a disk from one SATA connector to another.
Assuming you created this pool through the FreeNAS UI, the consequences would be "nothing" as they'll be identified by their gptid rather than physical port. Go ahead.
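
If you want to see that mapping for yourself before moving anything, something like this from the shell shows which gptid sits on which physical device node (the pool name 'data' is taken from the zpool status output above):

Code:
# GPT label (gptid) to device-node mapping
glabel status

# The pool members are referenced by gptid, not by ada#/da# port position
zpool status data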
 

eleson

Dabbler
Joined
Jul 9, 2020
Messages
18
Assuming you created this pool through the FreeNAS UI, the consequences would be "nothing" as they'll be identified by their gptid rather than physical port. Go ahead.
Not a good feeling right now. Moved one of the mirrored disks (still the same motherboard) to an unused SATA port and got the same IO error.
Removed the other disk in the mirror; still the same problem with only one disk of the mirrored set connected.
Disconnected both 'faulty' disks and moved the boot disk to one of the 'faulty' SATA connectors, and it still boots fine.

I've ordered a new PCIe SATA card, but it feels like I need a bit of luck.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,175

eleson

Dabbler
Joined
Jul 9, 2020
Messages
18
Status update: I have moved one of the disks to a Windows PC, and I can see it, read status from it, and run a short SMART test successfully.
(I will do the same with the other one.)

On the NAS, the two mirrored HDDs and the SSD boot disk are all connected to the same power cable with three connectors on it.
The SSD boots fine when the others are disconnected, but with one or both of them also connected,
it can boot, but it then hangs if I do 'ls -la /var/log'.
What are the odds of this being a PSU problem? (Given that there is a 100% chance that I have an issue right now...)
How would I see if the 12V rail can't deliver enough amps, for instance?
Edit: Now tried the above with the mirrored disks swapped, same result. The disks seem OK in SMART tests, but the NAS won't run properly with either disk connected.
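(For reference, the equivalent checks can be run from the TrueNAS shell as well; a rough sketch, assuming the disks appear as /dev/ada0 and /dev/ada1 (adjust to whatever camcontrol reports):)

Code:
# List attached SATA devices and their device nodes
camcontrol devlist

# Health summary plus error and self-test logs for one disk (repeat for the other)
smartctl -a /dev/ada0

# Start a short self-test, then read the result a few minutes later
smartctl -t short /dev/ada0
smartctl -l selftest /dev/ada0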
 
Last edited:

eleson

Dabbler
Joined
Jul 9, 2020
Messages
18
An update.
New PCIe SATA card installed. Attached one disk to it and got:
"Solaris: warning pool has encountered an uncorrectable io error suspended", so the problem remains.
And the terminal becomes unresponsive if I do ls -la in /var/log.

Now I really could use some help on how to approach this error.
 

eleson

Dabbler
Joined
Jul 9, 2020
Messages
18
Update and more frustration.
My SSD seems to use only +5V, so I bought a new mechanical hard disk, and that works fine in the TrueNAS server,
in any available SATA port, on both the motherboard and the expansion card.
To move forward with this, I built a computer from scrap parts I had and installed my first FreeBSD server on it. There I could reach both faulting disks.
Put them back in the TrueNAS server and the problem remains.
Used my scrap power supply in the TrueNAS box, and voilà, now the disks work again!

While writing this, I updated the installation, and suddenly the disks don't work again.
I was so hoping to mark this as completed.
Frustrating, but slowly moving forward.
 

eleson

Dabbler
Joined
Jul 9, 2020
Messages
18
Thursday 2020-11-19 22:35 CET+1
... continued.
First, I tried to start the new server on the old boot disk, but it failed. I didn't examine the root cause due to the hardware differences.

Decided to install TrueNAS on the 2nd hardware. Could import the disks fine on the fresh install.
(The faulty environment is newer and AMD/Ryzen; the older one is an Intel i5.)
Imported a saved config into the 2nd hardware and ...

warning pool has encountered an uncorrectable io error suspended

Could this somehow be software-induced?

I reinstalled from scratch and now I can reach the disks again. A scrub is ongoing; after that, a backup before more fact-finding.

2020-11-20 16:30 CET+1
Scrub done, multiple restarts, no IO errors.
I have upgraded the pool, whatever that means.
Next step is to move all disks from this setup into the old chassis/motherboard/PSU and see if the issue reappears. (Hoping it will boot from the disk.)
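
(For anyone following along: "upgrading the pool" is the zpool upgrade step the UI offers a button for, which enables newer feature flags on the pool. Roughly, the scrub and upgrade commands look like this, with 'data' being this pool's name:)

Code:
# Start a scrub and watch its progress
zpool scrub data
zpool status -v data

# List pools with feature flags available, then enable them on 'data'
zpool upgrade
zpool upgrade data

Note that a pool upgrade is one-way: once the newer feature flags are enabled, older TrueNAS/ZFS versions may no longer be able to import the pool.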
 
Last edited:

eleson

Dabbler
Joined
Jul 9, 2020
Messages
18
Running this to its end.
All disks moved back to the old hardware (mobo and PSU) and it boots and starts without error.
Differences are:
The boot disk is changed to a 2.5" drive I had lying around instead of the SSD.
Installed with the latest TrueNAS v12 version; the older install was some 6-8 weeks old.
The data disks that indicated the problem have been upgraded, whatever that means.

I am leaning more and more towards this being a software problem.

But I still have stuff to solve:
I want to challenge the SSD and see if that was the root cause.
I have no config data on the new setup, so users, shares, ACLs, jails etc. are gone. Last time I tried to load a saved config, I immediately got an IO error. (Hence my strong suspicion that this is a TrueNAS/ZFS-related issue.)
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,702
I want to challenge the SSD and see if that was the root cause.
I have no config data on the new setup, so users, shares, ACLs, jails etc. are gone. Last time I tried to load a saved config, I immediately got an IO error.
It seems to me that you may have either corruption or a bad setting in your config DB.

If you use another install of an OS capable of reading ZFS, you may be able to mount the SSD; if your system dataset was on that disk, you could get back to the config that preceded the current one and see whether it still works and has almost all of your settings in it.
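
A rough sketch of how that could look from a FreeBSD-based rescue system or second install; the boot pool name ('boot-pool' on TrueNAS 12, 'freenas-boot' on older FreeNAS) and the exact paths may differ on your system:

Code:
# See which pools the attached disks advertise; nothing is imported yet
zpool import

# Import the boot pool read-only under /mnt so nothing on it gets modified
# (add -f if it complains that the pool was last used by another system)
zpool import -o readonly=on -R /mnt boot-pool

# The middleware config is a SQLite file named freenas-v1.db; locate any
# copies on the imported pool and copy the interesting ones somewhere safe
find /mnt -name 'freenas-v1.db'

zpool export boot-pool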
 

eleson

Dabbler
Joined
Jul 9, 2020
Messages
18
I am now convinced this is an issue with TrueNAS or ZFS.

I've been running a fresh install on the old hardware for 4-5 days now with multiple reboots and no issues.
And saved a new config.
So I restored an old config as described above, and immediately on the first reboot I get the IO error again.
On every reboot.

Then I restored the newly saved config, and the problem disappeared.

So this is an issue with the TrueNAS software, probably a corrupt config.

Should I file a bug report?


Edit: Restoring the new config did remove the IO error. BUT the system becomes unresponsive after a few hours.
So it did not completely get me into a good state.
I probably don't dare to go down that route again; it is tiresome and frustrating.
 
Last edited:

eleson

Dabbler
Joined
Jul 9, 2020
Messages
18
It seems to me that you may have either corruption or a bad setting in your config DB.

If you use another install of an OS capable of reading ZFS, you may be able to mount the SSD; if your system dataset was on that disk, you could get back to the config that preceded the current one and see whether it still works and has almost all of your settings in it.
Thanks for the answer.
Yes, it seems like a corrupt config DB.
I have quite a few configs saved on disk, but it is tiresome to test which one is OK.
I will make some attempts, mostly to get the Emby media server up and running again.
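
If it helps with the triage: a saved config is essentially a SQLite database (or a small tar archive containing one, if it was exported with the secret seed), so each candidate can at least be sanity-checked before restoring it. A sketch, with a made-up file name:

Code:
# Structural check of a saved config database; prints "ok" if the file is sound
sqlite3 saved-config-2020-11-01.db 'PRAGMA integrity_check;'

# A structurally sound file can of course still contain a bad setting,
# so this only rules out outright file corruption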
 

eleson

Dabbler
Joined
Jul 9, 2020
Messages
18
Marking this one as solved.
I also see the disktemp.py problem that has been reported as a bug.
Getting access to the SMB shares working became the last straw.
The solution is OMV. Some work remains, but I have at least been able to retrieve the data I needed.
 
Joined
May 2, 2017
Messages
211
Or even if you didn't.
Just reading through this and I have wondered this myself... What you are saying is that if I pulled all my hardware out of one case, and plugged it all into new hardware and case with more SATA ports, ZFS wouldn't care where I plugged the drives from the existing pool into the new machine? It would look at the GPTID and just reassemble the pool?

Always wondered about this....?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
ZFS wouldn't care where I plugged the drives from the existing pool into the new machine? It would look at the GPTID and just reassemble the pool?

That is correct.

This is also another reason why it is so very important for people to use HBAs or plain SATA ports rather than RAID controllers, which sometimes hide the contents of a disk inside a faux partition table.
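
You can see this in practice with zpool import: run without arguments, it scans all attached disks for ZFS labels and reports whatever pools it finds, no matter which ports or device nodes the member disks currently sit on. A minimal illustration, with 'data' standing in for the pool name:

Code:
# Scan attached devices for ZFS pool labels; nothing is imported yet
zpool import

# Import the pool by the name (or numeric GUID) it reported
zpool import data

# The members show up by gptid, independent of which ada#/da# they landed on
zpool status data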
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
Just reading through this and I have wondered this myself... What you are saying is that if I pulled all my hardware out of one case, and plugged it all into new hardware and case with more SATA ports, ZFS wouldn't care where I plugged the drives from the existing pool into the new machine? It would look at the GPTID and just reassemble the pool?

Always wondered about this....?

And I really wanted to circle back around to this and talk about it for a second, but I had lost the message.

Back in the good old days, we had UNIX devices like "/dev/sd0" (SunOS) and then these could be mapped by kernel config lines to a SCSI target ID.

If you didn't, then the OS would enumerate the devices as they were discovered. In many simple cases, that was just dandy. If you only have two disks on one SCSI bus, one is numbered 0, the other 1, then you have /dev/da0 and /dev/da1 and life is fine.

But if you had multiple SCSI busses and lots of devices... it meant that you had to keep careful track of where your devices were, lest the kernel automatically allocate them in some random order. So back in the days of FreeBSD 3, a large disk kernel config might look like (this is a real config incidentally):

Code:
# SCSI peripherals
# Only one of each of these is needed, they are dynamically allocated.
controller      scbus0 at SCBUS0
controller      scbus1 at SCBUS1
controller      scbus2 at SCBUS2
controller      scbus3 at SCBUS3
controller      scbus4 at SCBUS4

disk            da0     at      scbus0  target 0 unit 0
disk            da1     at      scbus0  target 1 unit 0
disk            da2     at      scbus0  target 2 unit 0
disk            da3     at      scbus0  target 3 unit 0
disk            da4     at      scbus0  target 4 unit 0
disk            da5     at      scbus0  target 5 unit 0
disk            da6     at      scbus0  target 6 unit 0
disk            da8     at      scbus0  target 8 unit 0
disk            da9     at      scbus0  target 9 unit 0

disk            da10    at      scbus1  target 0 unit 0
disk            da11    at      scbus1  target 1 unit 0
disk            da12    at      scbus1  target 2 unit 0
disk            da13    at      scbus1  target 3 unit 0
disk            da14    at      scbus1  target 4 unit 0
disk            da15    at      scbus1  target 5 unit 0
disk            da16    at      scbus1  target 6 unit 0
disk            da18    at      scbus1  target 8 unit 0
disk            da19    at      scbus1  target 9 unit 0

disk            da20    at      scbus2  target 0 unit 0
disk            da21    at      scbus2  target 1 unit 0
disk            da22    at      scbus2  target 2 unit 0
disk            da23    at      scbus2  target 3 unit 0

disk            da24    at      scbus2  target 4 unit 0
disk            da25    at      scbus2  target 5 unit 0
disk            da26    at      scbus2  target 6 unit 0
disk            da28    at      scbus2  target 8 unit 0
disk            da29    at      scbus2  target 9 unit 0

disk            da30    at      scbus3  target 0 unit 0
disk            da31    at      scbus3  target 1 unit 0
disk            da32    at      scbus3  target 2 unit 0
disk            da33    at      scbus3  target 3 unit 0
disk            da34    at      scbus3  target 4 unit 0
disk            da35    at      scbus3  target 5 unit 0
disk            da36    at      scbus3  target 6 unit 0
disk            da38    at      scbus3  target 8 unit 0
disk            da39    at      scbus3  target 9 unit 0

disk            da40    at      scbus4  target 0 unit 0
disk            da41    at      scbus4  target 1 unit 0
disk            da42    at      scbus4  target 2 unit 0
disk            da43    at      scbus4  target 3 unit 0
disk            da44    at      scbus4  target 4 unit 0
disk            da45    at      scbus4  target 5 unit 0
disk            da46    at      scbus4  target 6 unit 0
disk            da48    at      scbus4  target 8 unit 0
disk            da49    at      scbus4  target 9 unit 0


which is actually humanly interpretable to mean that "da41" is the first device on the fourth SCSI bus. This kind of thing was a necessary sanity-hack. Then you could use ccd (disk concatenation) to create your meta-devices:

Code:
ccd0    110592  none    /dev/da10s1f /dev/da20s1f /dev/da30s1f /dev/da40s1f
ccd1    110592  none    /dev/da11s1f /dev/da21s1f /dev/da31s1f /dev/da41s1f
ccd2    110592  none    /dev/da12s1f /dev/da22s1f /dev/da32s1f /dev/da42s1f
ccd3    110592  none    /dev/da13s1f /dev/da23s1f /dev/da33s1f /dev/da43s1f
ccd4    110592  none    /dev/da14s1f /dev/da24s1f /dev/da34s1f /dev/da44s1f
ccd5    110592  none    /dev/da15s1f /dev/da25s1f /dev/da35s1f /dev/da45s1f
ccd6    110592  none    /dev/da16s1f /dev/da26s1f /dev/da36s1f /dev/da46s1f
ccd8    110592  none    /dev/da18s1f /dev/da28s1f /dev/da38s1f /dev/da48s1f
ccd9    110592  none    /dev/da19s1f /dev/da29s1f /dev/da39s1f /dev/da49s1f

ccd10   128     none    /dev/da10s1e /dev/da11s1e /dev/da12s1e /dev/da13s1e /dev/da14s1e /dev/da15s1e /dev/da16s1e /dev/da18s1e /dev/da19s1e /dev/da20s1e /dev/da21s1e /dev/da22s1e /dev/da23s1e /dev/da24s1e /dev/da25s1e /dev/da26s1e /dev/da28s1e /dev/da29s1e /dev/da30s1e /dev/da31s1e /dev/da32s1e /dev/da33s1e /dev/da34s1e /dev/da35s1e /dev/da36s1e /dev/da38s1e /dev/da39s1e /dev/da40s1e /dev/da41s1e /dev/da42s1e /dev/da43s1e /dev/da44s1e /dev/da45s1e /dev/da46s1e /dev/da48s1e /dev/da49s1e


and this worked swimmingly well. However, if you had the misfortune to connect your SCSI busses in the wrong order, all hell would still break loose...

So the thing that's really broken here is that "da14" for example refers to a device by I/O location, "SCSI bus 1, disk 4". Locking these down does indeed prevent lots of headaches, but it was really only a partial solution to a significant problem.

From the UNIX system's point of view, it doesn't really care what the device is named, or what kind of controller it is on. This is already abstracted away, and we could swap out models of SCSI controllers easily (which is substituted into these configs where you see CAPITAL LETTER SCBUS# above). It doesn't know or care that da14 is a Seagate ST150176LW on an Adaptec 3940. It already had all the capability to abstract that out to a generic device name.

The problem is that this was really only half the challenge. What you really want is to refer to a target device by an abstract name that doesn't change as it moves from one SCSI bus to another.

That's where GPT and the gpart tool come in.

Your FreeBSD disk devices still enumerate as /dev/da# but you can use GPT to label partitions. You can either name them explicitly, which is what I do for most modern FreeBSD installs, or the system will assign a UUID when creating it. This means your /dev/da4p1 also shows up as /dev/gptid/1234-some-big-hex-num-5678

But more importantly, when you REMOVE it from the system and put it on a different port, it will no longer show up as /dev/da4p1 but it WILL show up in /dev/gptid/1234-some-big-hex-num-5678

That means that if you access your devices by GPT labels, you are no longer dependent on physical device names. Hooray.
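
To make it concrete, a small example of what this looks like with gpart (the disk, partition index and label here are hypothetical):

Code:
# Give partition 1 on da4 an explicit GPT label
gpart modify -i 1 -l tank-disk0 da4

# The same partition is now reachable under all of these names:
#   /dev/da4p1                              (positional; changes if the disk moves ports)
#   /dev/gpt/tank-disk0                     (explicit label; follows the disk)
#   /dev/gptid/1234-some-big-hex-num-5678   (UUID assigned at creation; follows the disk)
gpart show -l da4
glabel status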
 
Joined
May 2, 2017
Messages
211
Outstanding!

Thanks for the clarification! I know from experience that this question isn't easy to find an answer to, so maybe we could make a sticky or something about physical hardware upgrades?

The first issue that comes up for a home-user newbie a year or two after throwing together a basic NAS, once he realizes how convenient it is, is: how do I expand this thing?

Do I buy SATA expansion cards or a whole new system to fill up that new giant case with drives? Will I break my pool if I don’t plug one of the drives into the right SATA port? Etc…

It’s good to know that you can take that perfectly maintained NAS with all its jails and just plug the drives into a new giant case with more drives. Then install TrueNAS and it will locate your pool.

That information needs to be more prominent somewhere.

Thanks for the clarification!
Steven
 