PSU issue; NVMe disks exported somehow; unable to re-import to pool?

heyitsjel

Dabbler
Joined
Sep 5, 2023
Messages
13
Hey all,

Currently running TrueNAS-SCALE-23.10.2
The system is a 10700K with 32 GB RAM, a Mellanox CX3, and 4x 4 TB NVMe drives in a RAIDZ1 config.
There's a minor mismatch in the sizing of the 4 TB drives (3.73 TiB vs 3.64 TiB), so I lose a little capacity (2x Crucial drives; 2x Lexar drives).

A day or two ago I noticed that my server suddenly had a few issues; namely, two of the four NVMe drives had disappeared / were no longer even detected by the system. Fearing the worst (i.e. two simultaneous NVMe failures... but what's the chance of that, considering they're from different batches), I started to pull the system apart and investigate.
I tried several different drive configurations (e.g. standalone drives, and various combinations across the M.2 slots), and confirmed the drives in question weren't dead (i.e. they were detectable in the BIOS and from the command line). I was still getting erratic behavior, however: sometimes the PC would POST, other times it wouldn't, drives would magically not show up, etc.
Long story short, I'm now 99% sure it's the power supply, as switching to an alternative one seems to have it working just fine. I'll experiment more with PSU cables, but I'm pretty damn sure it's the PSU itself.

Now the issue I'm having is that all the drives for my pool are showing as exported, while the pool itself is not:

[Screenshot: 1710913579618.png]

[Screenshot: 1710913616811.png]


When I try to use the "Add to Pool" function under Storage (which shows 4 unassigned disks) to add them to the existing/original pool (NVME-16TB-Z1), it basically tells me the disks have exported pools on them, and that I'll lose the associated data. Obviously I didn't go ahead with this.

[Screenshot: 1710913697189.png]


I then started looking at the shell options to import the pool, and I'm given this:

[Screenshot: 1710913874242.png]


It seems I can force the pool import (using -f), but is this the correct thing to do? I'm fairly confident the data on the disks is intact, as the pool wouldn't have been under reads/writes when the failure occurred. We had a power outage the other night, and all this started the following morning (despite the server being on a UPS). I suspect the outage may have caused the PSU failure as well (or vice versa).
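From skimming the zpool-import man page, I gather a read-only import would be one way to sanity-check the data without risking any writes; something like this, assuming I'm reading the flags right:
sudo zpool import -o readonly=on -f NVME-16TB-Z1   # import read-only to inspect the data
sudo zpool export NVME-16TB-Z1                     # then export again before a real import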

Any help trying to re-import my disks into the pool is *greatly* appreciated!
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Well, your first screenshot does NOT show the pool imported. It simply states that the pool exists.

Now the command line is more definitive: your pool is exported. The commands you can try are below; stop after one succeeds. Read the manual page for zpool-import before continuing, so you understand the risks.
zpool import NVME-16TB-Z1
zpool import -f NVME-16TB-Z1
zpool import -fF NVME-16TB-Z1
zpool import -fFX NVME-16TB-Z1
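If you want to gauge things before committing, the man page also documents a dry run of the -F recovery:
zpool import -Fn NVME-16TB-Z1   # reports whether recovery can succeed, without actually modifying the pool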

With the "FAULTED corrupted data" tag, that indicates some pool corruption. With the "-fF" option, the pool import will throw out some recent write transactions, hopefully the corrupt data. Note that data thrown out is irretrievably lost. If that "-fF" option does not work, you can move on to the more extreme "-fFX" with even more risk.

Note that ZFS only returns good data. So if a pool finally imports, and does so without errors, any read should return good data. You can run a scrub first to verify pool health.
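Starting and monitoring the scrub is straightforward:
zpool scrub NVME-16TB-Z1
zpool status NVME-16TB-Z1   # shows scrub progress and any errors found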

If ZFS detects corrupt files (by scrub, or by an attempt to read one), it will show them with:
zpool status -v NVME-16TB-Z1
How you deal with any corrupted files is generally personal. I have full backups of my media pool, so if it picks up corruption I delete the affected file, re-run the scrub to make sure the corruption is gone, then restore the file that was lost. But if you don't have a backup and it's an important file, you can attempt to read it, knowing there is missing data somewhere in it (probably shown by blocks of zeros...).
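As a sketch of that cycle (the file path here is hypothetical; zpool status -v prints the real ones):
zpool status -v NVME-16TB-Z1                   # lists damaged files by path
rm "/mnt/NVME-16TB-Z1/path/to/damaged-file"    # hypothetical path, taken from the list above
zpool scrub NVME-16TB-Z1                       # re-scrub to confirm the corruption is gone
# then restore the deleted file from backup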


As for the cause of the PSU failure, some UPSes create a less-than-ideal simulation of a sine wave AC output. Combined with a standby UPS (one that only generates its output on loss of input power), a PSU on the edge of failure can be pushed over that edge. PSUs can also simply get old, and lower-quality components (like electrolytic capacitors, which can dry out) age poorly.

Some higher-quality (read: more expensive) UPSes include one or both of these features:
  • True sine wave generation
  • Double conversion, where the external input power charges the battery at the DC power level, and that DC power is then used to create the output power.
The latter basically means the UPS is "always on", making power continuously. It costs more up front and uses more power, but you get what you pay for.
 

heyitsjel

Dabbler
Joined
Sep 5, 2023
Messages
13

Thanks for the prompt reply, Arwen!

I've done a bit more digging and found info along the lines of what you showed above.
When looking at importing NVME-16TB-Z1 in the shell, it suggested:

Recovery is possible, but will result in some data loss. Returning the pool to its state as of Sat Mar 16 17:32:46 2024 should correct the problem. Approximately 6 minutes of data must be discarded, irreversibly. Recovery can be attempted by executing 'zpool import -F NVME-16TB-Z1'. A scrub of the pool is strongly recommended after recovery.

I ran: sudo zpool import -F NVME-16TB-Z1
After a few seconds it seemed to import successfully, so I rebooted.
I then had to unlock the encrypted dataset; thankfully I had my .json key stored and was able to unlock it successfully.
PS: The warning next to "1x RAIDZ1" is about the mismatched disk sizes (it's always been there).

[Screenshot: 1710931482139.png]


Weirdly, it doesn't seem to mount the dataset; or at least SMB isn't playing ball with accessing the existing shares on it. The SMB service appears to be running, but when I try to enable a specific share it gives:

[Screenshot: 1710931371316.png]


Similarly, if I try to add a new SMB share, I can't actually choose any /mnt location (the list is empty, even when clicking the small arrow as you normally would):

[Screenshot: 1710931427268.png]
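I'm assuming the right way to confirm whether the datasets are actually mounted is plain ZFS from the shell, something like:
sudo zfs get -r mounted,mountpoint NVME-16TB-Z1   # show which datasets are (not) mounted
sudo zfs mount -a                                 # try to mount everything that's unlocked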


I've just run a scrub on the pool; everything was fine, no errors.

[Screenshot: 1710931723023.png]


One annoying thing I've noted is that upon reboot it seems to re-lock the dataset, so I have to unlock it again each time using the .json file with the encryption key.
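In case it helps with diagnosis, I believe the key state can be inspected with standard ZFS properties:
sudo zfs get -r keystatus,keylocation,encryptionroot NVME-16TB-Z1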

Also, I noticed that the Dashboard has decided it won't show my CPU usage %, network activity, etc.; the widgets just keep loading forever. I've tried multiple browsers, machines, mobile, and even multiple reboots:

[Screenshot: 1710931605140.png]


I'm now convinced there's something up with the actual boot/install of SCALE. I remember a month or two back an update was released (possibly 23.10.1?) that caused all sorts of chaos with my system. Funnily enough, I then went to look for updates again, and it seemed like they had rolled back that release (23.10.1.1?). I installed the update/rollback and it seemed to fix things at the time... but maybe it caused some sort of long-standing issue?

Also, weirdly, when I try to export the pool (NVME-16TB-Z1), it now complains that it contains the system dataset... which is very odd, as in my settings (System -> Advanced) the system dataset is set to be on my boot-pool (a separate SSD).

[Screenshot: 1710931902360.png]
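Is there a way to cross-check from the shell where the system dataset actually lives? I'm guessing the .system dataset would show up in a plain listing, and maybe the middleware can be queried too (not sure I have the call name right):
sudo zfs list -o name | grep -i '\.system'   # find which pool actually holds the .system dataset
sudo midclt call systemdataset.config        # query the configured location (assumed call name)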



I'm now thinking a clean re-install of SCALE and a re-import of my pool may be the best option? I'm even hesitant to restore the existing SCALE settings into a new install, in case they're corrupt and screw something up. If I can't export the pool via the GUI, how do I export it properly?
 


Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Export in the CLI, then import in the GUI?
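i.e. something like this from the shell, then re-import from the Storage page:
sudo zpool export NVME-16TB-Z1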
 

heyitsjel

Dabbler
Joined
Sep 5, 2023
Messages
13
Okay, so I manually went back into the GUI and re-selected the system dataset (System --> Advanced) to be on the boot-pool (even though it showed it was supposedly already there... but the fact that I couldn't export the NVME pool earlier, due to the warning that the system dataset was on it, suggests otherwise). It gave a warning, then some error. I repeated the process and it seemed to happily proceed.

I was then able to export the pool normally through the GUI (although I didn't delete the configuration).

I've now installed a fresh copy of TrueNAS-SCALE-23.10.2, pointing it at a different SSD for the boot-pool.
I imported the pool (NVME-16TB-Z1) and unlocked it with the encryption .json key without issue.
I was then able to re-establish users manually, add SMB shares, etc.

Weirdly, after rebooting again, it seems to have tried to use the system dataset on the NVME-16TB-Z1 pool again (or copied itself there again!). This then popped up an alert about "Glusterd work directory dataset is not mounted.". My dashboard then developed the same issue of not showing CPU threads, memory usage, network, etc. (as per this thread / the other person's post).

I manually went back and set the system dataset to the boot-pool *again* (System --> Advanced), rebooted, and everything seems fine.
Is there any downside to having the system dataset on the boot-pool?

[Screenshot: 1710938807427.png]
 