SOLVED Kernel panic during import, Invalid Checksum

coquinone

Dabbler
Joined
Mar 6, 2022
Messages
14
Hi all,

I need some help importing my ZFS pool. I'm not an expert in FreeBSD, TrueNAS, or ZFS, but I'm pretty familiar with Linux.
It's a bit of a long story, so I will try to explain the whole process and hopefully some of you can help me out.

Hardware:
HP N54L MicroServer, 8GB of memory
5 x 2TB disks
Running TrueNAS 12U3

I noticed my NFS and Samba shares were offline and not accessible, so I rebooted the server a couple of times with no luck.
I hooked up a monitor and saw a kernel panic; the stack trace pointed to I/O.
I was running TrueNAS from a USB stick, so the first thing I suspected was a buggered USB stick, and I decided to reinstall on an SSD.
I plugged the USB stick into another PC and mounted its ZFS partition so I could copy off the config db.
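Roughly what that looked like on the other PC, as a sketch from memory, assuming the OpenZFS tools are installed; the boot environment name "default" and the db path are the TrueNAS 12 defaults, so adjust if yours differ:

Code:
# import the boot pool read-only, without auto-mounting anything
zpool import -f -o readonly=on -N boot-pool
# boot environments use a legacy mountpoint, so mount the root dataset by hand
mount -t zfs -o ro boot-pool/ROOT/default /mnt
# the config db lives under /data on the root dataset
cp /mnt/data/freenas-v1.db ~/freenas-v1.db
umount /mnt
zpool export boot-pool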

With the config db copied over, I downloaded 12U8 and created an install USB.
During the installation I encountered a checksum error on the copied-over base-os*.tgz file. After a couple of retries I unplugged all my data HDDs and kept trying a few things, including manually copying the file over. Along the way I deleted the partitions that were automatically created at the start of the installation. My memory is a bit hazy here, so I'm not sure whether I unplugged all the HDDs before I started messing around or only after the first few tries.
Half a day went past, and after trying different SSDs and different versions of FreeNAS and TrueNAS, I eventually worked out that one of my two sticks of RAM was faulty. I replaced both with a spare set and got 11.0 installed on the SSD.
Once 11.0 was installed, I immediately created a new install USB with 12U8 and upgraded without ever booting into 11.0.
The upgrade completed successfully and I booted into 12U8 (during startup I saw the entropy get regenerated). I then overwrote the config db with the one I had copied earlier and rebooted. After seeing some sqlite errors I realised the db was from an older version, so I restored it through the web UI instead.
12U8 is now running with my previous configuration, but the ZFS pool was not imported and shows as offline.

That's how I got into my current situation; now I'll explain the troubleshooting I've done on the ZFS pool.

I saw the pool was not online, so I went to the shell and tried to import it:

Code:
root@nas[~]# zpool import vpool
cannot import 'vpool': I/O error
    Destroy and re-create the pool from
    a backup source.
root@nas[~]# zpool status
  pool: boot-pool
 state: ONLINE
config:

    NAME        STATE     READ WRITE CKSUM
    boot-pool   ONLINE       0     0     0
      ada4p2    ONLINE       0     0     0

errors: No known data errors
root@nas[~]# zpool import
   pool: vpool
     id: 12335856088955846037
  state: ONLINE
status: Some supported features are not enabled on the pool.
 action: The pool can be imported using its name or numeric identifier, though
    some features will not be available without an explicit 'zpool upgrade'.
 config:

    vpool       ONLINE
      raidz1-0  ONLINE
        ada0    ONLINE
        ada2    ONLINE
        ada1    ONLINE
        ada3s1  ONLINE
        ada5    ONLINE
root@nas[~]# zpool import 12335856088955846037
cannot import 'vpool': I/O error
    Destroy and re-create the pool from
    a backup source.
root@nas[~]# zpool import -Fn 12335856088955846037
root@nas[~]# zpool list
NAME        SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
boot-pool   101G  1.20G  99.8G        -         -     0%     1%  1.00x    ONLINE  -
root@nas[~]#



Importing it read-only also results in the same error:

Code:
root@nas[~]# zpool import -o readonly=on -NR /mnt 12335856088955846037
cannot import 'vpool': I/O error
    Destroy and re-create the pool from
    a backup source.



The zdb output is as follows:
Code:
root@nas[~]# zdb -e -d vpool
Dataset mos [META], ID 0, cr_txg 4, 185M, 562 objects
Dataset vpool/iocage/jails [ZPL], ID 105, cr_txg 32250219, 153K, 7 objects
Dataset vpool/iocage/templates [ZPL], ID 218, cr_txg 32250225, 153K, 7 objects
Dataset vpool/iocage/download [ZPL], ID 88, cr_txg 32250215, 153K, 7 objects
Dataset vpool/iocage/releases [ZPL], ID 118, cr_txg 32250223, 153K, 7 objects
Dataset vpool/iocage/log [ZPL], ID 111, cr_txg 32250221, 153K, 7 objects
Dataset vpool/iocage/images [ZPL], ID 95, cr_txg 32250217, 153K, 7 objects
Dataset vpool/iocage [ZPL], ID 82, cr_txg 32250213, 9.71M, 597 objects
Dataset vpool/.system/services [ZPL], ID 72, cr_txg 32250130, 153K, 7 objects
Dataset vpool/.system/rrd-e0dba3050c7945dbbaae28f6aa6a3cc9 [ZPL], ID 60, cr_txg 32250124, 105M, 419 objects
Dataset vpool/.system/samba4 [ZPL], ID 54, cr_txg 32250120, 4.05M, 56 objects
Dataset vpool/.system/configs-e0dba3050c7945dbbaae28f6aa6a3cc9 [ZPL], ID 66, cr_txg 32250126, 208M, 191 objects
Dataset vpool/.system/cores [ZPL], ID 227, cr_txg 32250117, 477M, 8 objects
Dataset vpool/.system/webui [ZPL], ID 277, cr_txg 32250128, 153K, 7 objects
Dataset vpool/.system/syslog-e0dba3050c7945dbbaae28f6aa6a3cc9 [ZPL], ID 271, cr_txg 32250122, 42.4M, 73 objects
Dataset vpool/.system [ZPL], ID 46, cr_txg 32250114, 785M, 23 objects
Dataset vpool@auto-20210805-040000 [ZPL], ID 523, cr_txg 32999210, 2.11T, 956559 objects
Dataset vpool@auto-20210813-040000 [ZPL], ID 585, cr_txg 33136519, 2.10T, 956509 objects
Dataset vpool@auto-20210808-040000 [ZPL], ID 543, cr_txg 33050390, 2.10T, 956547 objects
Dataset vpool@auto-20210824-040000 [ZPL], ID 412, cr_txg 33325440, 2.12T, 956514 objects
Dataset vpool@auto-20210729-040000 [ZPL], ID 284, cr_txg 32871730, 1.97T, 956224 objects
Dataset vpool@auto-20210807-040000 [ZPL], ID 538, cr_txg 33033333, 2.10T, 956547 objects
Dataset vpool@auto-20210811-040000 [ZPL], ID 568, cr_txg 33102021, 2.10T, 956520 objects
Dataset vpool@auto-20210826-040000 [ZPL], ID 444, cr_txg 33359509, 2.11T, 956451 objects
Dataset vpool@auto-20210730-040000 [ZPL], ID 290, cr_txg 32888789, 1.97T, 956224 objects
Dataset vpool@auto-20210817-040000 [ZPL], ID 611, cr_txg 33205512, 2.10T, 956509 objects
Dataset vpool@auto-20210801-040000 [ZPL], ID 304, cr_txg 32922901, 1.97T, 956224 objects
Dataset vpool@auto-20210820-040000 [ZPL], ID 635, cr_txg 33257124, 2.10T, 956509 objects
Dataset vpool@auto-20210815-040000 [ZPL], ID 597, cr_txg 33171016, 2.10T, 956509 objects
Dataset vpool@auto-20210818-040000 [ZPL], ID 618, cr_txg 33222759, 2.10T, 956509 objects
Dataset vpool@auto-20210803-040000 [ZPL], ID 320, cr_txg 32957031, 1.97T, 956229 objects
Dataset vpool@auto-20210822-040000 [ZPL], ID 383, cr_txg 33291192, 2.10T, 956492 objects
Dataset vpool@auto-20210821-040000 [ZPL], ID 367, cr_txg 33273936, 2.10T, 956489 objects
Dataset vpool@auto-20210816-040000 [ZPL], ID 602, cr_txg 33188265, 2.10T, 956509 objects
Dataset vpool@auto-20210823-040000 [ZPL], ID 388, cr_txg 33308246, 2.10T, 956492 objects
Dataset vpool@auto-20210819-040000 [ZPL], ID 629, cr_txg 33240005, 2.10T, 956509 objects
Dataset vpool@auto-20210802-040000 [ZPL], ID 313, cr_txg 32939978, 1.97T, 956229 objects
Dataset vpool@auto-20210627-040000 [ZPL], ID 324, cr_txg 32340941, 2.02T, 956646 objects
Dataset vpool@auto-20210814-040000 [ZPL], ID 590, cr_txg 33153768, 2.10T, 956509 objects
Dataset vpool@auto-20210825-040000 [ZPL], ID 437, cr_txg 33342417, 2.11T, 956427 objects
Dataset vpool@auto-20210812-040000 [ZPL], ID 580, cr_txg 33119270, 2.10T, 956509 objects
Dataset vpool@auto-20210809-040000 [ZPL], ID 549, cr_txg 33067534, 2.10T, 956544 objects
Dataset vpool@auto-20210804-040000 [ZPL], ID 356, cr_txg 32975454, 2.00T, 956255 objects
Dataset vpool@auto-20210731-040000 [ZPL], ID 293, cr_txg 32905846, 1.97T, 956224 objects
Dataset vpool@auto-20210727-040000 [ZPL], ID 516, cr_txg 32858267, 1.97T, 956224 objects
Dataset vpool@auto-20210810-040000 [ZPL], ID 560, cr_txg 33084776, 2.10T, 956520 objects
Dataset vpool@auto-20210806-040000 [ZPL], ID 535, cr_txg 33016276, 2.10T, 956547 objects
Dataset vpool [ZPL], ID 21, cr_txg 1, 2.27T, 957584 objects
Verified large_blocks feature refcount of 0 is correct
Verified sha512 feature refcount of 0 is correct
Verified skein feature refcount of 0 is correct
Verified device_removal feature refcount of 0 is correct
Verified indirect_refcount feature refcount of 0 is correct


Also, it seems three of my disks don't show a partition table at all, and the two that gpart can detect both have the wrong kinds of partitions.
ada4 is my 120GB SSD.
I'm also getting messages saying ada3 has unreadable sectors.

Code:
root@nas[~]# gpart show
=>        63  3907029105  ada3  MBR  (1.8T)
          63        1985        - free -  (993K)
        2048  3907026944     1  linux-data  (1.8T)
  3907028992         176        - free -  (88K)

=>       40  246162592  ada4  GPT  (117G)
         40       1024     1  freebsd-boot  (512K)
       1064   33554432     3  freebsd-swap  (16G)
   33555496  212598784     2  freebsd-zfs  (101G)
  246154280       8352        - free -  (4.1M)

=>        63  3907029105  ada0  MBR  (1.8T)
          63  3907024002     1  ntfs  (1.8T)
  3907024065        5103        - free -  (2.5M)

Mar  7 14:46:03 nas 1 2022-03-07T14:46:03.634637+11:00 nas.localdomain smartd 978 - - Device: /dev/ada3, 256 Currently unreadable (pending) sectors
Mar  7 14:46:03 nas 1 2022-03-07T14:46:03.634714+11:00 nas.localdomain smartd 978 - - Device: /dev/ada3, 256 Offline uncorrectable sectors


Smartctl output for ada3 (looks like the disk is faulty)
Code:
root@nas[~]# smartctl -a /dev/ada3
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p12 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda Green (AF)
Device Model:     ST2000DL003-9VT166
Serial Number:    5YD4GDHL
LU WWN Device Id: 5 000c50 03918a8d4
Firmware Version: CC32
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    5900 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Mon Mar  7 15:30:18 2022 AEDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (  633) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   1) minutes.
Extended self-test routine
recommended polling time:      ( 349) minutes.
Conveyance self-test routine
recommended polling time:      (   2) minutes.
SCT capabilities:            (0x30b7)    SCT Status supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   102   099   006    Pre-fail  Always       -       4037456
  3 Spin_Up_Time            0x0003   092   092   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       215
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   079   060   030    Pre-fail  Always       -       4382177005
  9 Power_On_Hours          0x0032   074   074   000    Old_age   Always       -       22984
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       204
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   077   077   000    Old_age   Always       -       23
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   099   099   000    Old_age   Always       -       1
190 Airflow_Temperature_Cel 0x0022   066   047   045    Old_age   Always       -       34 (Min/Max 31/34)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       159
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       225
194 Temperature_Celsius     0x0022   034   053   000    Old_age   Always       -       34 (0 11 0 0 0)
195 Hardware_ECC_Recovered  0x001a   014   004   000    Old_age   Always       -       4037456
197 Current_Pending_Sector  0x0012   097   097   000    Old_age   Always       -       256
198 Offline_Uncorrectable   0x0010   097   097   000    Old_age   Offline      -       256
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       22937 (189 75 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       3222260999
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1273268245

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     22635         -
# 2  Short offline       Completed without error       00%     22467         -
# 3  Short offline       Completed without error       00%     22299         -
# 4  Short offline       Completed without error       00%     22131         -
# 5  Short offline       Completed without error       00%     21963         -
# 6  Short offline       Completed without error       00%     21795         -
# 7  Short offline       Completed without error       00%     21627         -
# 8  Short offline       Completed without error       00%     21459         -
# 9  Short offline       Completed without error       00%     21292         -
#10  Short offline       Completed without error       00%     21124         -
#11  Short offline       Completed without error       00%     20956         -
#12  Short offline       Completed without error       00%     20788         -
#13  Short offline       Completed without error       00%     20620         -
#14  Short offline       Completed without error       00%     20452         -
#15  Short offline       Completed without error       00%     20284         -
#16  Short offline       Completed without error       00%     20116         -
#17  Short offline       Completed without error       00%     19948         -
#18  Short offline       Completed without error       00%     19781         -
#19  Short offline       Completed without error       00%     19613         -
#20  Short offline       Completed without error       00%     19445         -
#21  Short offline       Completed without error       00%     19277         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
you have a raidz1 with 2 damaged disks. it sounds like 1 probably died and then you broke the other by mucking about nuking partitions.

Code:
cannot import 'vpool': I/O error
    Destroy and re-create the pool from
    a backup source.

I'm pretty sure it's dead, Jim. If you don't have at least 4 working disks there is no pool to import; read-only isn't going to help, because there isn't enough to read *from*.
raidz1 is not recommended; at your disk sizes you're right at the threshold where raidz1 just isn't reliable when you need it to be, unless you planned for it with backups in place, or don't care about what's on it.

glabel status, or disklist.pl in my signature, would also be useful here.
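a quick sketch of what I mean, run as root from the TrueNAS shell:

Code:
# show GEOM labels (gptid/...) and which device each one sits on
glabel status
# list attached disks with model and serial, to match physical drives to ada numbers
camcontrol devlist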

there is also a chance that you damaged it by rebooting while the pool was trying to fix itself. You do not have the *minimum* amount of memory, and if it was trying to unFUBAR your pool while leaning on the *emergency* paging partitions due to insufficient RAM, and you rebooted it... RIP vpool.
 

coquinone

Dabbler
Joined
Mar 6, 2022
Messages
14
Thanks for your reply, those are great insights. I did have 16GB previously but needed some of it for another build and thought I'd be fine in the short/medium term.
What's interesting is that I never received any emails about SMART errors, and I have scheduled short tests that run every week :frown:
It seems the combination of too little memory and one RAM stick going faulty is what sent the HDD cactus.

Is there any way of restoring the missing partitions on the other two drives? That would leave me with 1 faulty drive and 4 operational ones. What does a normal partition table look like, and would gpart recover help in any way?
It's very bizarre that one disk ended up as NTFS and another as a Linux partition. Perhaps I could try to manually recreate the partition table?
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
it must be noted that if you really want this data back, expensive professional recovery services are the most reliable option, and to use them you should stop messing about with the disks and ship them out, because the more you do the harder it is for them to recover anything.

with that out of the way, this is what standard data disks should look like:

Code:
=>         40  15628053088  da23  GPT  (7.3T)
           40           88        - free -  (44K)
          128      4194304     1  freebsd-swap  (2.0G)
      4194432  15623858696     2  freebsd-zfs  (7.3T)
=>        40  7814037088  da24  GPT  (3.6T)
          40          88        - free -  (44K)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  7809842696     2  freebsd-zfs  (3.6T)
=>         6  1465130635  da25  GPT  (5.5T)
           6         122        - free -  (488K)
         128     4194304     1  freebsd-swap  (16G)
     4194432  1460936209     2  freebsd-zfs  (5.4T)


just changing the partition table, however, likely won't help. you *might* be able to recover the backup GPT table, since writing an MBR probably wouldn't overwrite that, but if you wrote anything to these mystery partitions...
your gpart output only shows 3 disks, and one of those is the boot disk, so we have no idea what the status of your other disks is. if they are also not available...
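on the gpart recover idea above, a sketch, to be tried on a clone only; it also only helps if GEOM still detects a damaged GPT rather than a valid MBR:

Code:
# attempt to rebuild the primary GPT from the backup copy at the end of the disk
gpart recover ada0
# then inspect the result
gpart show ada0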
 

coquinone

Dabbler
Joined
Mar 6, 2022
Messages
14
Yeah, 100% agree, the data is important but not critical. I've got a couple of spare drives, so the plan is to dd one of the existing disks in the pool to a spare drive and mess with that instead, and hopefully get to a point where the pool can be detected and mounted.
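A sketch of the clone step (device names are examples only; I'd triple-check them against gpart/camcontrol before running anything):

Code:
# raw block copy that carries on past read errors (bad sectors become zeros on the copy)
dd if=/dev/ada0 of=/dev/ada6 bs=1m conv=noerror,sync status=progress

If the source disk is actually failing, ddrescue (available in ports) would be the better tool, since it retries bad areas and keeps a map of what it couldn't read.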
So there's a total of 6 drives, 5 x 2TB plus 1 SSD; gpart sees 3 of the 6, one of which is the boot SSD.
Do you think the problem is with the 2 it can see (obviously the wrong types of partitions) or the 3 it can't? Or maybe all of them?
I'm trying to understand which of the 5 drives require repair and which can be used as a working example. At the moment I have no way to tell what a working one looks like. Any ideas?
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
if you are unable to see 3 whole drives, that starts pointing to something like the controller or cables being possible problems.

I do not know much about this particular hardware, but my first question now is: what is the controller (or controllers)? It's not using any kind of RAID function, is it?

if whatever you stole the RAM for is less important than the data on these disks... put it back in. Minimum specs are the minimum for a reason. Add any more compatible RAM you have available as well; if you manage to get anything restored, you want the box to have everything it needs.

I'm pretty sure you need to clone all 5 drives; just 1 isn't going to get you very far. But I have never tried to resurrect such abused(?) drives before.

sudo dmesg | grep da to try and see what you have, I guess?
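something like this, perhaps; on this box the disks come up as ada rather than da:

Code:
# the kernel's view of the disks it attached at boot
dmesg | grep -E 'ada[0-9]'
# every disk GEOM knows about, with size and serial number
geom disk list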
 

coquinone

Dabbler
Joined
Mar 6, 2022
Messages
14
That's a great idea; it now has 16GB of RAM. Unfortunately there are only two DIMM slots (HP MicroServer N54L), and it officially supports only up to 16GB, I believe. The disks sit in the server's HDD cage, connected via a mini-SAS cable directly to the motherboard, and are presented as individual disks.

Apologies if I have confused you. TrueNAS does see all 6 disks: 1 is the OS disk and 5 are for the ZFS pool. 3 of the 5 don't show partition info in gpart show, and 2 of the 5 show the wrong partition info.

As I'm more familiar with Ubuntu and its commands, I booted up Ubuntu from a live USB.
The following is what I see from parted -l; I've annotated which disk is which.


Code:
Model: ATA ST2000DL003-9VT1 (scsi)  ##### one of the 2 disks gpart could see; has an NTFS partition; this disk is not faulty
Disk /dev/sda: 2000GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Disk Flags:

Number  Start   End     Size    Type     File system  Flags
 1      32.3kB  2000GB  2000GB  primary


Model: ATA WDC WD20EARX-328 (scsi)  ##### one of the 3 disks gpart could not see
Disk /dev/sdb: 2000GB
Sector size (logical/physical): 512B/4096B
Partition Table: loop
Disk Flags:

Number  Start  End     Size    File system  Flags
 1      0.00B  2000GB  2000GB  zfs


Model: ATA WDC WD20EARS-00M (scsi)  ##### one of the 3 disks gpart could not see
Disk /dev/sdc: 2000GB
Sector size (logical/physical): 512B/4096B
Partition Table: loop
Disk Flags:

Number  Start  End     Size    File system  Flags
 1      0.00B  2000GB  2000GB  zfs


Model: ATA ST2000DL003-9VT1 (scsi)  ##### the other disk gpart could see; has a Linux partition
Disk /dev/sdd: 2000GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Disk Flags:

Number  Start   End     Size    Type     File system  Flags
 1      1049kB  2000GB  2000GB  primary


Model: ATA WDC WD20EARS-00M (scsi) ##### An extra 2TB disk I just installed in case I need the space for whatever reason
Disk /dev/sde: 2000GB
Sector size (logical/physical): 512B/512B
Partition Table: loop
Disk Flags:

Number  Start  End     Size    File system  Flags
 1      0.00B  2000GB  2000GB  ext4


Model: ATA WDC WD20EARS-00J (scsi)  ##### one of the 3 disks gpart could not see
Disk /dev/sdf: 2000GB
Sector size (logical/physical): 512B/4096B
Partition Table: loop
Disk Flags:

Number  Start  End     Size    File system  Flags
 1      0.00B  2000GB  2000GB  zfs


Model: Verbatim STORE N GO (scsi) ##### Live USB
Disk /dev/sdg: 15.5GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Disk Flags:

Number  Start   End     Size    Type     File system  Flags
 1      1049kB  15.5GB  15.5GB  primary  fat32        boot, lba


The below is from fdisk -l

Code:
Disk /dev/sdf: 1.84 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: WDC WD20EARS-00J
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes


Disk /dev/sde: 1.84 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: WDC WD20EARS-00M
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes


Disk /dev/sdb: 1.84 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: WDC WD20EARX-328
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes


Disk /dev/sdd: 1.84 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: ST2000DL003-9VT1
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x8f4ede6a

Device     Boot Start        End    Sectors  Size Id Type
/dev/sdd1        2048 3907028991 3907026944  1.8T 83 Linux


Disk /dev/sdc: 1.84 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: WDC WD20EARS-00M
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes


Disk /dev/sda: 1.84 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: ST2000DL003-9VT1
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x8f4ede6d

Device     Boot Start        End    Sectors  Size Id Type
/dev/sda1          63 3907024064 3907024002  1.8T  7 HPFS/NTFS/exFAT


Disk /dev/sdg: 14.45 GiB, 15504900096 bytes, 30283008 sectors
Disk model: STORE N GO     
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x0e6788ca

Device     Boot Start      End  Sectors  Size Id Type
/dev/sdg1  *     2048 30283007 30280960 14.4G  c W95 FAT32 (LBA)



As you can see, the 3 drives that were showing no partitions under FreeBSD actually do carry data, and are detected as having a ZFS filesystem directly on the disk.
This zpool was created around 7-8 years ago and has gone through quite a few version upgrades. Do you think this could be the reason the partitions aren't displaying under FreeBSD while the disks/zpool data are actually OK? If so, how would you recommend I continue with the recovery?
I'm thinking of starting with one of the 2 drives that have wrong partitions (probably the one that's not faulty) and doing a binary copy using dd.
Then try to recover it? As you mentioned, the MBR probably didn't touch the backup GPT table, given it's written at the end of the disk.
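A read-only way to check that from the Ubuntu live session (/dev/sdd is the Linux-partitioned Seagate in my listing above):

Code:
# gdisk -l only reads; it reports the partition table scan,
# including whether a GPT (or its backup) is still detected
gdisk -l /dev/sdd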
Any advice is most welcome and appreciated.
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
we have reached the end of my knowledge. I have never tried to recover something this mangled (mainly because I have never mangled anything in this way), and so anything else I recommend is just as likely to destroy it more (if that's possible).

I was hoping that getting more details would spark the interest of someone who might be able to recover this, but that has not happened.
You could try looking through the resources. I don't know how else you could get attention without annoying the ones most able to help.
 

coquinone

Dabbler
Joined
Mar 6, 2022
Messages
14
Thanks for your help so far, much appreciated. Hopefully someone sees this thread and can chip in.
 

coquinone

Dabbler
Joined
Mar 6, 2022
Messages
14
After reading some more info on ZFS, it seems that when I first created the pool, I chose to create it on the whole disks, without partition tables. That would explain why those 3 disks have no partitions at all (under Linux the partition table shows as loop).
As for the 2 disks with the wrong partitions, both were previously used for other things and were put into the zpool to replace failed drives. I'm pretty sure I just added the disks to the pool and let it sort itself out.
Can anybody confirm the theory?
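One way I think the theory could be checked from the Ubuntu live session, assuming the OpenZFS userland tools (zfsutils-linux) are installed:

Code:
# dump the ZFS labels straight off the raw disk; a whole-disk pool member
# should show the pool name, guid and txg even with no partition table
zdb -l /dev/sdb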

Another piece of information: when booting from the original TrueNAS install on the USB stick (I think it's 12U3), I get a kernel panic during boot and TrueNAS refuses to continue. I'll try to pull the stack trace.
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506

@jgreco is this a lost cause?

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680

<as Marv, looking up> Wow, what a hole.

This is a bad situation to be sure. Lost cause? Maybe, maybe not. However, it is quite possibly beyond the abilities of someone trying to remote-debug it via forum messages.

If this were my mess, which I'm glad it's not, my first step would be to image all the disks onto virtual disks, snapshot them, and then feel safer playing around with a virtual instance of the NAS. That's clearly impractical in most cases. But it's the thing that lets you try the desperation moves.
 

coquinone

Dabbler
Joined
Mar 6, 2022
Messages
14
Agree, I have no clue how it got to this. Possibly lack of memory + one faulty memory stick + one faulty disk.
I've started cloning one of the disks that's showing as NTFS; it will take another 24 hours to complete. Unfortunately, I don't have enough spare disks to clone all 5.
After reading more on ZFS, I feel like the partitions on these disks have always been like this, because I created the pool without partitions a long time ago. NTFS and ext4 were just what those disks had been formatted as previously; adding them to the pool and resilvering did not remove the old partition tables. Do you think this is plausible?

What would you recommend doing next?
I'll provide the kernel stack trace once the cloning finishes.
 

coquinone

Dabbler
Joined
Mar 6, 2022
Messages
14
So here is an update; not sure if it's good or bad, but I think I'm gaining a better understanding of the situation.
I'm reasonably confident that the pool failure has nothing to do with the reported disk partitions.
Quick summary of all 5 disks:
Disk A: WD 2TB, no SMART errors, no partition table
Disk B: WD 2TB, no SMART errors, no partition table
Disk C: WD 2TB, no SMART errors, no partition table
Disk D: Seagate 2TB, a whole bunch of SMART errors, has an ext4 partition
Disk E: Seagate 2TB, no SMART errors, has an NTFS partition

This is the kernel stack trace during boot (please excuse the photos, it's the quickest way for me to capture it).
This is booting from the USB stick with 12U4, with Disks B-E plugged in.
I saw the error about failing to open /dev/ada4; ada4 is Disk D, so I decided to unplug it completely.
After unplugging it, 12U4 started without issues (obviously with the pool not imported). I was not able to get 12U4 to start at all until this point.

[photo: kernel panic stack trace]



Then I plugged in Disks A, B, C and E and started 12U4 from the USB stick.
This gave the same checksum panic as the previous step.
[photo: the same checksum panic]


So I thought: Disk A doesn't have any SMART errors, so the disk itself could still be OK, and the latest 12U8 may use a newer version of ZFS that deals with this better. I started 12U8, and this is what I got: 2 metadata errors and 0 data errors.
This is with Disks A, B, C and E plugged in, and 12U8 started fine. I then tested with just Disks B, C and E; 12U8 also started fine.
[photo: pool status showing 2 metadata errors and 0 data errors]


Could this be a better situation than I was previously in? I don't know, and I hope someone can give me some ideas.
Could I try importing the pool read-only with Disks A, B, C and E, then try the -Fn switch if read-only has no luck? (See the sketch below.)
Or should I try to clone the faulty disk and hope it can somehow provide more data?
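From my reading of the zpool manpage, -F is the rewind and adding -n makes it a dry run, so a cautious order would be something like:

Code:
# dry run: only report whether discarding the last few transactions would allow import
zpool import -F -n vpool
# if that looks promising: rewind for real, read-only and without mounting datasets
zpool import -F -N -R /mnt -o readonly=on vpool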
 

coquinone

Dabbler
Joined
Mar 6, 2022
Messages
14
Some quick updates on the progress
1. I've made a copy of all 5 drives using ddrescue
2. Kicked off zpool import -f -F -N -X -R /mnt -o readonly=on vpool; this has been running for a few days, but I'll give it a few more. From memory I have around 3TB of data.
3. If step 2 fails, I'm planning to run the below:
Code:
# Turn off load-time metadata verification; probably should have done this before
# step 2 was run, given I already knew some data is corrupted
echo 0 > /sys/module/zfs/parameters/spa_load_verify_metadata

# Find the list of uberblocks
zdb -e vpool -ul

# Try rolling back to an earlier one, as I don't care about the latest data
zpool import -N -o readonly=on -f -R /mnt -F -T <transaction_id> vpool



If anyone else has some advice or pointers I would love to hear them.
 

coquinone

Dabbler
Joined
Mar 6, 2022
Messages
14
This will probably be my last update. I've successfully recovered all my data, restored the pool to a functional state (read and write), and am running a scrub at the moment.
Here is what I did:
1. I followed this post (https://www.truenas.com/community/threads/zfs-has-failed-you.11951/):

Reboot FreeNAS, and select the Escape to Loader option in the boot menu.
Enter the following commands:
Code:
set vfs.zfs.recover=1
set vfs.zfs.debug=1
boot -s

Press Enter to start /bin/sh, then:
Code:
sh /etc/rc.initdiskless
sysctl vfs.zfs.spa.load_verify_metadata=0
sysctl vfs.zfs.spa.load_verify_data=0
zpool import -f -F -N -X -R /mnt -o readonly=on vpool
By this point the pool was imported and mounted. I was able to rsync the data onto another machine, which was a huge relief.
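Roughly what the copy looked like (host and paths are placeholders):

Code:
# archive mode plus hardlinks, ACLs, and extended attributes, with progress output
rsync -aHAX --progress /mnt/vpool/ user@otherbox:/backup/vpool/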
I then unmounted the pool and exported it.

I then imported the pool again, but without the -N switch:
Code:
zpool import -f -F -X -R /mnt -o readonly=on vpool

This gave me an error along the lines of "read-only file system, cannot create mount point".
zpool status told me the pool had imported fine but couldn't be mounted.
I realised it must be due to the -R switch, so I reconfigured the mount point with zfs set mountpoint=/mnt/vpool vpool.
The pool mounted fine after this; then it was just unmount, export, and reboot into the latest version of TrueNAS.
After the reboot, vpool imported fine and I have access to my data.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Well, definitely quite the adventure. Happy to hear it worked out, and congratulations on your tenacity. :smile:
 

coquinone

Dabbler
Joined
Mar 6, 2022
Messages
14
Blood oath, it was an adventure. Lots of learning along the way, but most importantly: have a backup of your backup :wink:
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
nice.
you probably want to revisit your pool/partition/disk design so this never happens again, and seriously consider some kind of backup...
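one simple pattern, just as a sketch (pool, snapshot, and host names are placeholders):

Code:
# recursive snapshot of the whole pool, then replicate it to another box
zfs snapshot -r vpool@backup-20220401
zfs send -R vpool@backup-20220401 | ssh backupbox zfs receive -duF backuppool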
 