Job to import ZFS pools runs for 15 min on each boot after upgrading from CORE

dbrannon79

Dabbler
Joined
Oct 21, 2022
Messages
32
Hello all. I just joined the forum, but I have been lurking around finding answers to various things in the past. I recently came across an issue after upgrading from CORE to SCALE.

Upon every boot, I see messages on the server showing my RAID is "not clean" and starting background reconstruction for both pools. This always runs for 15 minutes, but once the system is booted and I can access the web GUI, all pools look good and show nothing wrong. I have manually run a SMART test on each drive, and everything seems to pass with flying colors as far as I can tell.

After a few days of seeing this and searching online only to come up empty, I decided to wipe the boot SSD, do a fresh install, and then upload my saved configuration.

Upon reboot, the same messages came back. I have attached a couple of screenshots and a text file containing the boot log, in hopes there is some kind of error in the config that I am missing.

I have two pools in my server, both set up in RAID-Z1. The pool named Storage has four 3 TB SAS drives with one hot spare; the other, named Backup, has three 2 TB SATA drives with no spare. A replication task runs every so often from the main Storage pool to the Backup pool to help save me from losing any data.

I'm hopeful some of you all can help me figure this out.



Attachments

  • Boot Log.txt
    45.7 KB

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
You don't list your hardware, besides disk sizes.

What is the disk connection? Motherboard SATA? LSI HBA?
Are you using a hardware RAID controller, even if it's in JBOD mode?
 

WN1X

Explorer
Joined
Dec 2, 2019
Messages
77
Those references to md devices in the boot log indicate the use of Linux software RAID. Did you use the CLI to create a RAID array? I also get the impression your drives may be connected via USB. TrueNAS does not use md RAID for its data pools... attempting to use Linux software RAID is a mission doomed to failure.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Doesn't SCALE create mirrored swap with md just like CORE does with geom?
 

WN1X

Explorer
Joined
Dec 2, 2019
Messages
77
Doesn't SCALE create mirrored swap with md just like CORE does with geom?
My SCALE system shows /dev/md0 pointing to swap0, and the swap device is listed as /dev/dm-0.
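(For reference, the swap mirrors can be confirmed from a shell with something like the following; a minimal sketch, and device names will differ per system:)

cat /proc/mdstat     # active md arrays and their member partitions
swapon --show        # which devices are actually in use as swap
ls -l /dev/md/       # named links to md devices, if the directory exists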
 

dbrannon79

Dabbler
Joined
Oct 21, 2022
Messages
32
Hardware is a Dell PowerEdge R720 server with dual Xeon E5-2620 CPUs, 64 GB of ECC RAM, and a Dell H200 SAS HBA (LSI 9210-8i, equivalent to 9211-8i, P20 firmware) in IT mode. The pools were created in CORE and imported into SCALE.

I did some more searching online and found a thread with similar output in its logs, but that was all I could find.

 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
The three md mirror devices do appear to be for swap.

However, looking at the boot-time messages a bit more, could the disks be USB attached?

I am no expert at reading Linux boot messages, but there are far too many USB references just before each disk is mentioned.
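One quick way to rule USB in or out (just a sketch; exact columns depend on the lsblk version) is to ask lsblk for the transport of each disk:

lsblk -d -o NAME,TRAN,MODEL,SIZE    # TRAN shows sata, sas, usb, nvme, etc.

Disks behind an LSI HBA should show up as sas or sata; anything USB-attached will show usb.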
 

dbrannon79

Dabbler
Joined
Oct 21, 2022
Messages
32
No USB devices attached. I am wondering if there is something in the config file.

Question: if I were to wipe the boot drive and start over with SCALE, is there a way to edit the backup configuration file so it only imports the users, share privileges, and so on, leaving out all the ZFS pools? If I could do this, basically choosing which settings I want to import, I would manually import the pools on the fresh install before actually restoring the config. That way there would be no reference in the config to any other RAID pools.

A good while back, when I was on CORE, I did have other RAID pools mounted which I have since removed. Those pools were created using OpenMediaVault.

It is possible that the two pools I have now were also created on OMV before I migrated over to CORE and now on to SCALE. Maybe this transition between the three is where the problem came from.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
No USB devices attached. I am wondering if there is something in the config file.

Question: if I were to wipe the boot drive and start over with SCALE, is there a way to edit the backup configuration file so it only imports the users, share privileges, and so on, leaving out all the ZFS pools? If I could do this, basically choosing which settings I want to import, I would manually import the pools on the fresh install before actually restoring the config. That way there would be no reference in the config to any other RAID pools.

A good while back, when I was on CORE, I did have other RAID pools mounted which I have since removed. Those pools were created using OpenMediaVault.

It is possible that the two pools I have now were also created on OMV before I migrated over to CORE and now on to SCALE. Maybe this transition between the three is where the problem came from.
Yes, that would be the case.

TrueNAS allows other pools to be imported, but it doesn't create pools with md-raid.
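(Roughly speaking, the GUI's pool import is the supported path; from a shell the equivalent is something like the following, where the pool name is just an example:)

zpool import             # with no arguments, lists pools available for import
zpool import Storage     # import the named pool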
 

dbrannon79

Dabbler
Joined
Oct 21, 2022
Messages
32
Given that I don't remember everything that was created using md RAID, what would be the best method to remove all entries or references to any md-raid configuration without having to start over? It would take me hours on end to re-create all the users and permissions I have set for all the shares.

I have several users set up with their own private shares in SCALE.
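Presumably the first step is just to see what md arrays actually exist. These checks are read-only and don't modify anything (the device names below are placeholders):

cat /proc/mdstat                  # which md arrays are currently assembled
mdadm --detail /dev/md125         # members and state of an assembled array
mdadm --examine /dev/sdb1         # md superblock (if any) on a specific partition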
 

dbrannon79

Dabbler
Joined
Oct 21, 2022
Messages
32
OK, I have some details I retrieved from my TrueNAS system. I opened mdadm.conf under /etc/mdadm and found nothing listed in there, so I ran a command to show the status of the md arrays.

root@truenas[/etc/mdadm]# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md125 : active raid1 sdd1[1] sdb1[0]
      2097152 blocks super non-persistent [2/2] [UU]

md126 : active raid1 sdh1[1] sdg1[0]
      2097152 blocks super non-persistent [2/2] [UU]

md127 : active raid1 sde1[1] sdc1[0]
      2097152 blocks super non-persistent [2/2] [UU]

unused devices: <none>

Then I looked at my ZFS pools to see which disks overlap with the output above.

root@truenas[/etc/mdadm]# zpool list
NAME        SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
Backup     5.44T  3.84T  1.59T        -         -     1%    70%  1.00x  ONLINE  /mnt
Storage    8.17T  3.78T  4.39T        -         -     3%    46%  1.00x  ONLINE  /mnt
boot-pool   222G  2.71G   219G        -         -     0%     1%  1.00x  ONLINE  -

root@truenas[/etc/mdadm]# zpool status
  pool: Backup
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 03:54:00 with 0 errors on Sun Oct 23 03:54:05 2022
config:

        NAME                                      STATE     READ WRITE CKSUM
        Backup                                    ONLINE       0     0     0
          raidz1-0                                ONLINE       0     0     0
            13eeda96-6ce3-11ec-958b-b8ca3a5cba94  ONLINE       0     0     0
            14107577-6ce3-11ec-958b-b8ca3a5cba94  ONLINE       0     0     0
            14efa27a-6ce3-11ec-958b-b8ca3a5cba94  ONLINE       0     0     0

errors: No known data errors

  pool: Storage
 state: ONLINE
  scan: scrub repaired 0B in 03:56:13 with 0 errors on Sun Oct 23 03:56:17 2022
config:

        NAME                                      STATE     READ WRITE CKSUM
        Storage                                   ONLINE       0     0     0
          raidz1-0                                ONLINE       0     0     0
            b17a9086-666a-11ec-b559-b8ca3a5cba94  ONLINE       0     0     0
            b1839ced-666a-11ec-b559-b8ca3a5cba94  ONLINE       0     0     0
            b1899a42-666a-11ec-b559-b8ca3a5cba94  ONLINE       0     0     0
        spares
          b16cc225-666a-11ec-b559-b8ca3a5cba94    AVAIL

errors: No known data errors

  pool: boot-pool
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
config:

        NAME         STATE     READ WRITE CKSUM
        boot-pool    ONLINE       0     0     0
          sda3       ONLINE       0     0     0

errors: No known data errors

root@truenas[/etc/mdadm]# lsblk -o NAME,SIZE,SERIAL,LABEL,FSTYPE
NAME SIZE SERIAL LABEL FSTYPE
sda 223.6G 212303A016E1
├─sda1 1M
├─sda2 512M EFI vfat
└─sda3 223.1G boot-pool zfs_member
sdb 1.8T WD-WCC4M0ZAHEXJ
├─sdb1 2G
│ └─md125 2G
│ └─md125 2G swap
└─sdb2 1.8T Backup zfs_member
sdc 2.7T YVJSUMRK
├─sdc1 2G
│ └─md127 2G
│ └─md127 2G swap
└─sdc2 2.7T
sdd 1.8T W1H4E85C
├─sdd1 2G
│ └─md125 2G
│ └─md125 2G swap
└─sdd2 1.8T Backup zfs_member
sde 2.7T YVJSRSRK
├─sde1 2G
│ └─md127 2G
│ └─md127 2G swap
└─sde2 2.7T Storage zfs_member
sdf 1.8T W1H236JH
├─sdf1 2G
└─sdf2 1.8T Backup zfs_member
sdg 2.7T YVJSRT2K
├─sdg1 2G
│ └─md126 2G
│ └─md126 2G swap
└─sdg2 2.7T Storage zfs_member
sdh 2.7T YVJSUM1K
├─sdh1 2G
│ └─md126 2G
│ └─md126 2G swap
└─sdh2 2.7T Storage zfs_member
zd0 20G
zd16 20G

With all the info posted above, I put together the map below of which drive belongs to which md array. Does this look like it could explain the delayed boot and the "raid not clean" errors on every boot?

Disk  Serial           Pool name  Size           md-raid member  Connected controller
sda   212303A016E1     Boot       223.57 GB SSD  -               Motherboard SATA
sdb   WD-WCC4M0ZAHEXJ  Backup     1.82 TB        md125(0)        Dell H200 6Gbps SAS HBA LSI 9210-8i (=9211-8i) P20 IT mode
sdc   YVJSUMRK         Storage    2.73 TB        md127(0)        Dell H200 6Gbps SAS HBA LSI 9210-8i (=9211-8i) P20 IT mode
sdd   W1H4E85C         Backup     1.82 TB        md125(1)        Dell H200 6Gbps SAS HBA LSI 9210-8i (=9211-8i) P20 IT mode
sde   YVJSRSRK         Storage    2.73 TB        md127(1)        Dell H200 6Gbps SAS HBA LSI 9210-8i (=9211-8i) P20 IT mode
sdf   W1H236JH         Backup     1.82 TB        (hot spare?)    Dell H200 6Gbps SAS HBA LSI 9210-8i (=9211-8i) P20 IT mode
sdg   YVJSRT2K         Storage    2.73 TB        md126(0)        Dell H200 6Gbps SAS HBA LSI 9210-8i (=9211-8i) P20 IT mode
sdh   YVJSUM1K         Storage    2.73 TB        md126(1)        Dell H200 6Gbps SAS HBA LSI 9210-8i (=9211-8i) P20 IT mode
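For anyone who wants to double-check a mapping like this, the partition UUIDs that zpool status prints can be resolved to plain /dev/sdX names; a read-only sketch:

zpool status -L Storage          # -L resolves the member links to the real /dev/sdX2 devices
ls -l /dev/disk/by-partuuid/     # partition UUID -> device name symlinks
lsblk -o NAME,PARTUUID,LABEL     # the same mapping, per partition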
 


dbrannon79

Dabbler
Joined
Oct 21, 2022
Messages
32
In further analyzing the data above, it seems odd to me that disk sdf (part of the Backup pool) has no reference to an md array, while the other two (sdd and sdb) are in md125. In TrueNAS SCALE the pool Backup is set to use disks sdd, sdf, and sdb in a RAID-Z1. Something is telling me that pool is where I need to start.

Correct me if my reasoning is off here, but I think I should first destroy and wipe all of the disks in the Backup pool, then re-create it and allow my scheduled replication task to copy over all the data from my Storage pool.

Once this is complete, do the same with the Storage pool: destroy it, re-create it in SCALE, and bring back all my data.

My fear is that once I start doing this, I will be faced with errors when I attempt to export either pool, since they have SMB shares and users associated with them, mainly the Storage pool.

Unfortunately, I think I will have to do this on a fresh install of SCALE, and then, once everything is mounted properly without any of these errors, re-create all the shares and users with the same privileges rather than importing any of the old config files.

This part I am not looking forward to, and it will be very time consuming! Is there an updated guide somewhere on creating restricted shares in SCALE? I noticed the GUI is very different from CORE, and in the past it took me a while, watching YouTube guides, to figure that part out.

At the same time, I might as well think about expanding my server capacity. I only have one open bay left on my rack server. I would probably benefit greatly from pulling all the backup disks and putting together a separate machine just to hold the backup data.
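For my own notes, the initial copy into a freshly created Backup pool could also be done by hand with something like this (just a sketch; the snapshot name is made up, and the -F on the receive assumes Backup was just re-created and is empty, since it overwrites the target):

zfs snapshot -r Storage@migrate                      # recursive snapshot of the source pool
zfs send -R Storage@migrate | zfs recv -F Backup     # replicate all child datasets and their properties

The scheduled replication task in the GUI would remain the way to keep Backup up to date afterwards.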
 

oculto

Cadet
Joined
Dec 6, 2022
Messages
1
I had exactly the same situation, and the problem is the ix-applications dataset/snapshot on the Backup pool.
Delete the ix-applications dataset from Backup and exclude that child dataset in the Storage replication task (exclude the child so it isn't re-created). I read about this in another topic not related to this problem (a TrueNAS CORE system that would not boot with ix-applications snapshots on a replicated drive), tried it, and it worked. You will be surprised...

Delete the ix-applications dataset ONLY on the Backup pool (this removes it and all related snapshots), and exclude that child in the replication task on the main pool. That's all. You will keep ix-applications on the main pool (Storage in your case), as well as its snapshots, as usual.

It seems to me this is a bug or a design flaw, but the workaround makes sense, as ix-applications does not need to be replicated on the same server. You can keep the snapshots on the main pool normally; only the replicated copy causes a problem.
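In shell terms, the cleanup amounts to something like this (only a sketch, and the destroy is irreversible, so double-check the dataset path first; Backup/ix-applications is assumed from the pool names in this thread):

zfs list -r -t filesystem,snapshot Backup/ix-applications   # confirm exactly what would be removed
zfs destroy -r Backup/ix-applications                       # remove the replicated copy and its snapshots

The exclusion itself is set in the replication task's options in the GUI (exclude the ix-applications child dataset) so it doesn't come back on the next run.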
 

dbrannon79

Dabbler
Joined
Oct 21, 2022
Messages
32
I will have to look into this further. I haven't posted any updates lately, but I have since moved my backup storage to a separate machine running SCALE. On boot I get the same errors, but it does not sit there trying to import a pool for 15 minutes. I still see the "not clean" message during the boot process; when checking within the GUI, everything is always fine.

As for my main NAS storage, I wiped it clean and re-created the pool, along with upgrading from the four 3 TB SAS drives to a total of eight drives in a RAID-Z2 configuration. I don't recall if I get these messages on the main server; I will need to look.

The main server is still the Dell R720 running SCALE, with a recent upgrade to a Dell PERC H710 HBA and eight Seagate Enterprise 3 TB SAS drives. The backup is now on an older Cisco C200 M1 server with the Dell H200 SAS HBA spinning the four other 3 TB SAS drives.
 

heisian

Dabbler
Joined
Oct 3, 2020
Messages
21
I too upgraded from TrueNAS CORE to SCALE recently and am getting the exact same error. I have an 8-disk vdev, and I NEVER created any Linux RAID arrays; all I did was upgrade. Every other reboot I get this issue. If I reboot again after either waiting 15 minutes or forcing it early, then ZFS seems to pick up the pool and do some kind of... reconstruction? ...and my pool is fine again. I don't know, I'll have to get a screenshot.
 

dbrannon79

Dabbler
Joined
Oct 21, 2022
Messages
32
It seems both of my servers are doing this. So far, after destroying the pools and re-creating them in SCALE, they are not making me wait 15 minutes to boot, but I still get the same behavior and logs, including some sort of reconstruction.
 

heisian

Dabbler
Joined
Oct 3, 2020
Messages
21
This seemed to stop happening for me after I excluded the `ix-applications` dataset from my snapshot and replication jobs. I then deleted that dataset from my backup pool. Not sure if that's the issue, but worth trying?
 

dbrannon79

Dabbler
Joined
Oct 21, 2022
Messages
32
Thanks, I will have to try this.

I had heard someone mention this before, and I'm wondering if the ix-applications dataset is meant to be on its own separate disk or pool versus in the mix with the NAS storage.

Adding a single disk with a partition for this purpose isn't that straightforward without the GUI warning you about not having any parity drives.
 

heisian

Dabbler
Joined
Oct 3, 2020
Messages
21
It may have something to do with Docker trying to mount images found in the backup copies of ix-applications, though I have no idea what the startup procedure is when it comes to determining which containers the Docker service wants to start...

I also just upgraded to the Bluefin (22.12.0) release train yesterday, and that fixed a lot of the deeper issues I was seeing with deploying apps. If you're not on it already, just doing this might fix your issues.
 