One drive failed, now a cache drive is unavailable and I can't bring the pool online

MadHatR75

Cadet
Joined
Oct 15, 2023
Messages
2
My pool has gone unavailable because one drive died and another drive's partition UUID does not match.

I am running the newest version of TrueNAS Scale. RAIDZ1 with 4 12TB drives and 2 SSDs. The backplane of my Dell T330 was starting to fail. I ejected the drive and put it in another bay, but it looks like the damage was already done.

Fast forward: I built a new system, this time setting it up in Proxmox, passing through a 12-port SATA card, uploading the config, etc. The 12TB drive was still unavailable, so I ordered a few 10TB drives, made a new pool, and started moving everything to it. Then, after a reboot, the cache drive started showing up as unavailable. Now, with two drives down, I can no longer get the pool online.

I matched up the drives to the pool by comparing partition UUIDs (roughly as shown below) and noticed that the cache drive does not match the UUID of any of the drives. I have tried new TrueNAS VMs, new installs of Proxmox, running TrueNAS on bare metal, and even putting the drives back in the old server. I even tried repairing the bad 12TB drive.
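The matching was basically comparing the partition UUIDs that zpool import reports against what the OS sees; something like this (exact column list from memory, so treat it as a sketch):

zpool import                             # lists the pool config with partition UUIDs and their state
lsblk -o NAME,SIZE,SERIAL,PARTUUID       # device letters, sizes, serials and partition UUIDs as the OS sees them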

Any help would be greatly appreciated.


   pool: tank
     id: 4695044492445768575
  state: UNAVAIL
 status: One or more devices contains corrupted data.
 action: The pool cannot be imported due to damaged devices or data.
    see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-5E
 config:

        tank                                      UNAVAIL  insufficient replicas
          raidz1-0                                UNAVAIL  insufficient replicas
            180a178f-227a-4f59-bf63-13c204ea5e3d  ONLINE   sdb  ZTN0E7JS
            bd500474-e688-4ad9-8218-3baedaae18e3  ONLINE   sde  ZL006VBP
            61f8b3c1-acdf-4536-8fa9-7ebc7081a3d0  UNAVAIL
            ce8e3970-ffa6-4648-8fda-da5518ff2248  ONLINE   sdg  ZHZ75WHF
            2bf0f77f-986f-45a9-96bc-61c6e8a6b43e  UNAVAIL
            b36b9199-be21-4bb0-9390-b94902ff0987  ONLINE   sdd  222303A00691


smartctl -a /dev/sda
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.107+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Exos X14
Device Model:     ST12000NM0008-2H3101
Serial Number:    ZHZ3DPZ7
LU WWN Device Id: 5 000c50 0c29be84f
Firmware Version: SN03
User Capacity:    12,000,138,625,024 bytes [12.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-4 (minor revision not indicated)
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Oct 13 01:11:56 2023 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Read SMART Data failed: scsi error badly formed scsi parameters

=== START OF READ SMART DATA SECTION ===
SMART Status command failed: scsi error badly formed scsi parameters
SMART overall-health self-assessment test result: UNKNOWN!
SMART Status, Attributes and Thresholds cannot be read.

Read SMART Log Directory failed: scsi error badly formed scsi parameters

Read SMART Error Log failed: scsi error badly formed scsi parameters

Read SMART Self-test Log failed: scsi error badly formed scsi parameters

Selective Self-tests/Logging not supported


fdisk -x /dev/sdf
Disk /dev/sdf: 10.91 TiB, 12000138625024 bytes, 23437770752 sectors
Disk model: ST12000NM0008-2H
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: EB48CE70-4290-4FC9-9B75-B84B757886AD
First LBA: 34
Last LBA: 23437770718
Alternative LBA: 23437770751
Partition entries LBA: 2
Allocated partition entries: 128

Device        Start          End      Sectors Type-UUID                             UUID                                  Name Attrs
/dev/sdf1      2048      4194304      4192257 0657FD6D-A4AB-43C4-84E5-0933C84B4F4F  2E1C447A-E993-4F85-A36D-12862075E4F5
/dev/sdf2   4196352  23437770718  23433574367 6A898CC3-1DD2-11B2-99A6-080020736631  6CD68B47-7B86-4E7B-B33E-AD8BE3F81D72


fdisk -x /dev/sdc
Disk /dev/sdc: 931.51 GiB, 1000204886016 bytes, 1953525168 sectors
Disk model: SanDisk SSD PLUS
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: FFB2375D-4962-48F3-8B68-89CE87D03723
First LBA: 34
Last LBA: 1953525134
Alternative LBA: 1953525167
Partition entries LBA: 2
Allocated partition entries: 128

Device        Start          End      Sectors Type-UUID                             UUID                                  Name Attrs
/dev/sdc1        40   1953525134   1953525095 6A898CC3-1DD2-11B2-99A6-080020736631  4D52B9D4-014B-4AE0-B3B4-2276D1C95264
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
I don't see a cache drive in your output. Only a single RAIDZ1 vdev with 6 drives. If that is indeed the case and two drives have failed, your pool is lost beyond repair.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
While not related to the question at hand: the 12-port SATA card is probably not suitable for ZFS.
 

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
I can't tell which drive is which, and the post didn't use code tags, so it's super hard to follow. Put the output of "zpool status -Lx" in code tags, without editing it. That way we can match up the drives. Include smartctl and fdisk output for the unavailable or errored drives, again in code tags.
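Something along these lines, with sdX/sdY standing in for whichever disks show as unavailable or errored:

zpool status -Lx
smartctl -a /dev/sdX
fdisk -x /dev/sdY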

If you have truly lost both drives, you can't repair it; time to restore from backups. But that's not certain yet. Perhaps the failing backplane did some physical damage somehow?
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
I have to agree with @Patrick M. Hausen, it appears your pool is lost. And I don't see a cache drive either. Hopefully you have a backup of your data.
 

MadHatR75

Cadet
Joined
Oct 15, 2023
Messages
2
OK, I was able to get it working. I was confused about one of the drives: I had an extra 12TB drive that I forgot was in there. The original pool was five 12TB drives plus two mirrored SSDs. I also had a small NVMe drive as a log drive.

I physically removed the drive that was not showing up at all. The other failing drive had a corrupt partition table. I had thought I'd fixed it by recreating the lost partitions from copied sector offsets. After more searching, thanks to your very knowledgeable forum, I ran across the trick of rebuilding the table: save the partition table from one of the good drives, write it to the drive with errors, and then generate new random UUIDs, as sketched below.
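In case it helps anyone else, the commands were along these lines with sgdisk (here /dev/sdX is a healthy pool member and /dev/sdY is the drive with the corrupt table; these are placeholders, so substitute your own devices and double-check before writing anything):

sgdisk --backup=/tmp/gpt-backup.bin /dev/sdX         # save the GPT from a known-good pool member
sgdisk --load-backup=/tmp/gpt-backup.bin /dev/sdY    # write that table onto the drive with the corrupt GPT
sgdisk --randomize-guids /dev/sdY                    # give the disk and its partitions new random GUIDs so they don't collide with the donor's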

I was then able to force the pool to import and remove the dead drive (roughly the commands below). I am now copying all the data to a new pool made up of three 8TB drives. After that I can destroy the tank pool and start over with two fewer drives.
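For the record, the import was essentially a forced import by name, and the dead member was then taken out of service; the offline step below is only a sketch of that part (the GUID is a placeholder taken from the zpool output above, not necessarily the right one, and the exact removal step depends on the layout):

zpool import -f tank                                       # force-import the degraded pool
zpool status -L tank                                       # confirm what came back, with real device letters
zpool offline tank 2bf0f77f-986f-45a9-96bc-61c6e8a6b43e    # stop ZFS from retrying the dead member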
 

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
Glad you got it working; great outcome nevertheless. As I mentioned above, zpool status -L is useful for showing drive "letters" instead of the IDs if you ever need it.

It would have been best to have included the procedure you followed as part of the OP so we'd have known the full story.
 