ghost_za
Dabbler
Joined: Oct 13, 2021
Messages: 42
Hi,
I am still diagnosing the problem, but wanted to get feedback and share my frustrations on the forums.
I bought a 3 TB WD drive for extra storage on my personal TrueNAS system.
I recently upgraded to TrueNAS 13.
I created a single-drive ZFS pool for the data, as I don't have the cash to buy three drives and put them all in one redundant pool. There are several drives in the system, and one of the pools is a three-drive zpool.
Initially I didn't have any problems creating the zpool, creating the share, or copying the data to the drive.
Every three months or so I do a complete drive health check to catch unexpected drive failures early. This time I found a few pending sectors on one of the drives and repaired them, but the new WD failed the EXTENDED SMART test: it got stuck at 90% every time I ran it. Both the SHORT and CONVEYANCE tests completed fine, and SMART did not report any failures, just that hang at 90% every time. I contacted the supplier and informed them of the issue.

I assumed the drive was faulty, but since SMART was not reporting any imminent failure, I waited on an M.2-to-6x-SATA expansion card (ASM1166 chip) to give the TrueNAS system more SATA ports, so I could back up the data and send the drive back to the supplier. Note that the SMART issue presented before the M.2 card was installed. I received the card on Friday (6-10-23) and continued with the installation. Everything worked right off the bat, with some performance issues; I expected the expansion card to have reduced throughput, so no big surprise. I left the new WD on the on-board SATA ports and added two drives to the card as a test.

Shortly thereafter I received some minor I/O failures on this WD. I assumed they were unrelated and it was just the drive's natural failure creeping up. On Saturday it started giving me an "Uncorrectable I/O failure - pool is suspended" warning and the pool was inaccessible.
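For reference, the health check I run is just the standard smartctl self-tests from the shell (the device name below is an example; find yours with smartctl --scan):

# kick off the self-tests (one at a time)
smartctl -t short /dev/ada1
smartctl -t conveyance /dev/ada1
smartctl -t long /dev/ada1

# review progress and the self-test log afterwards
smartctl -a /dev/ada1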
A TrueNAS restart fixed the issue (for a while - like 5 hours or so), which leads me to my first FRUSTRATION with TrueNAS.
I could still access the other drives, but the CPU load went ape-shit: 100% load, running at 75-80 degrees. I didn't investigate exactly which service was causing it; I just restarted. Regardless of what caused it, a server OS like TrueNAS shouldn't allow a runaway CPU load like that. If this were an unattended server and the load went unnoticed for weeks, it could have severely damaged the CPU (or even other components). OK, it could be argued that no preventative measure could have helped and that shit happens... fine.
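Next time I'll at least capture which process is spinning before I restart. TrueNAS 13 is FreeBSD-based, so something like this should do (standard FreeBSD top/ps flags, nothing TrueNAS-specific):

# interactive view: include system processes and threads, sort by CPU
top -SHz -o cpu

# or a one-shot snapshot that can be pasted into a forum post
ps auxww | sort -rnk3 | head -20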
But when I tried to reboot the system remotely, it backfired. While I left the system as-is, the remaining pools were still accessible and functioning - exactly what I'd expect of a high-availability NAS system like TrueNAS. But when I rebooted, it stopped some services, including all the management services like the WebUI and SSH, and then froze on some other part (it's irrelevant what exactly it froze on). So essentially, rebooting the system killed it. If this had been an unattended system in a remote location, I would have entirely killed the server because one drive failed, with no way to reboot it short of physically cutting the power.
Now this is unacceptable - it means I can't (or rather won't) ever install this in a production environment like a business.
I'm now diagnosing the issue and attempting disaster/data recovery on this drive. The data isn't essential, but I would prefer to recover it if possible.
I disconnected/exported the pool to remove it, because by Sunday (yesterday) TrueNAS wouldn't even boot anymore with the WD connected: the same "Uncorrectable I/O" error appears, and then it freezes while creating the cache for a different pool. Once again the whole system is essentially killed by one drive failure; if it were remote, I would have been screwed. So, to diagnose, I disconnected the pool.
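(To be precise, "disconnecting" here was the Export/Disconnect action in the web UI, which as far as I know boils down to a plain ZFS export; the pool name below is an example:)

zpool export wd3tb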
With the pool disconnected I can boot the system, even with the drive connected at boot. In the UI I can see the drive, and here is the interesting part: all SMART tests now succeed, even the LONG test that previously froze at 90%. The sluggishness I noticed on the system is gone and everything seems fine again.
Then I tried to import the pool again - this was meant to be the last step before removing the drive and performing data recovery on it. Doing so kills the system again: the entire middlewared becomes completely unresponsive, and once again I can't reboot remotely because it freezes during the reboot process. So the system is killed by one drive failure, again.
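If I attempt this again, I'll try the import from the shell instead of the UI, and read-only, so ZFS doesn't try to write to the suspect disk. Something like this, assuming a standard OpenZFS import (pool name is an example):

# list pools available for import
zpool import

# import read-only so nothing gets written to the failing drive
zpool import -o readonly=on wd3tb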
I left the first import attempt running for 12 hours and tried twice more after that. After a reboot the pool is back, TrueNAS tries to import it automatically, and then it freezes on the cache part of another pool as described above. So it can't boot again.
As a last step I have now removed the M.2 expansion card from the server and removed the drive, so I can diagnose it independently and perform data recovery if necessary.
So the point of this post is:
1. To vent my frustrations with TrueNAS and to ask the devs at iXsystems to please have a look at how ZFS errors are handled, to avoid a complete system failure in a scenario like this.
2. To get some feedback on what could be causing this behaviour, because at this point in my testing I'm not finding any errors on the drive itself. SMART is succeeding and recovery is not reporting any errors (so far), so it's looking more and more like the drive is fine and this was caused by some other issue.
Thoughts? Ideas?
PS: If someone needs debug logs (or anything else), please just ask for what you need and I will try to provide it.
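In the meantime I'll save a debug myself: on Core that's System > Advanced > Save Debug in the UI, or (if I remember the tool's name right) from the shell:

# dump all debug information
freenas-debug -A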