SOLVED: Drive randomly marked as unavailable. Passes all SMART tests and is 3 weeks old

M3PH

Dabbler
Joined
Mar 3, 2024
Messages
21
Hey guys. This is my first post here, so please be nice.

I recently purchased four 2TB Seagate IronWolf Pro drives from Ebuyer here in the UK to replace a bunch of old Seagate Barracudas and FireCudas (I know, don't ever use SSHDs in a zpool) that I had used to create a pool in my server while I was a bit hard up. At the start of last week, one of the new drives, which had been working fine, was marked as "unavailable". ZFS couldn't re-add it because it already had a partition on it.

I have run all the SMART tests available in TrueNAS and the drive passes them all. (As I type this, I'm also waiting for an extended online test to complete in GSmartControl on my Windows 11 desktop; the drive is connected using an Anker SATA-to-USB 3 adapter.) Since this happened, TrueNAS has automatically run a scrub and a resilver, and the box was rebooted before each of those ran, but the drive will not come back online.

I'm a bit of a TrueNAS noob and still learning ZFS, so if you can point me in the right direction on how to fix this, and be a bit tolerant of me not knowing where everything is or all of the console commands, that would be cool.

My server specs are as follows:
Intel Xeon E5-2695 v3
Supermicro X10SRL-F motherboard
128GB of Samsung 2133 MHz ECC RAM (8x 16GB DIMMs)
LSI 9211-8i HBA with a Noctua fan mod and IT-mode (non-RAID) firmware
HighPoint 7204 NVMe HBA
Intel X550 10GbE NIC
Corsair RM1000x PSU
Supermicro AOM-TPM-9655V (TPM 1.2 add-in module)


Drives:
boot pool: 2x 1TB Crucial MX500s
data pool 1 (RAIDZ2): 4x 2TB Seagate IronWolf Pros and 4 very old, soon-to-be-replaced 2TB Seagate Barracudas in 1 data vdev, with a 250GB Samsung 970 EVO Plus NVMe SSD as cache
data pool 2 (VM storage, mirror): 2x Sabrent 500GB Rocket 3 NVMe drives

The server runs the newest version of TrueNAS SCALE (I always update when one is available) and hosts 3 Windows Server 2022 Standard VMs (yes, they are legally licensed and up to date) that run my Windows domain and my DNS server, plus one that hosts a few other things. RAM is split between the VMs/services (60GB) and the ARC, with about 5GB free. The server is also attached to a 2 kVA UPS (though there have not been any power outages), and temperatures are well within tolerance for everything.

I have already started the RMA process with Ebuyer, but they want me to call them to progress it and I can't do that until tomorrow. So I figured I would ask here to see if anyone has ideas on how to avoid sending the drive back, because it seems fine. I will post the results of the SMART test when it is finished. If anyone wants me to run some commands on the TrueNAS box, please reply with them.
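If you'd rather run the checks from the TrueNAS shell than over USB, `smartctl` (smartmontools, which TrueNAS SCALE uses for its SMART tests) can start the same extended test and show the results. The sketch below assumes the drive appears as `/dev/sdX` (a placeholder), and `smart_key_attrs` is a hypothetical helper that pulls out a few attributes that tend to matter on a new drive. A non-zero `UDMA_CRC_Error_Count` usually points at cabling or the controller rather than the disk itself.

```shell
# Start an extended (long) self-test; it runs inside the drive's firmware:
#   smartctl -t long /dev/sdX
# When it has finished, review the full report with:
#   smartctl -a /dev/sdX

# Hypothetical helper: read `smartctl -A` output on stdin and print the raw
# value (10th column) of attributes that commonly indicate media or link trouble.
smart_key_attrs() {
  awk '
    $2 == "Reallocated_Sector_Ct" ||
    $2 == "Current_Pending_Sector" ||
    $2 == "Offline_Uncorrectable" ||
    $2 == "UDMA_CRC_Error_Count" { print $2, $10 }
  '
}
```

Usage: `smartctl -A /dev/sdX | smart_key_attrs`.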

Thanks in advance.
M3PH
 

M3PH

Dabbler
Joined
Mar 3, 2024
Messages
21
As promised, here are the SMART test results:
 

Attachments

  • ST2000NT001-3M3101_WRE0QGW9_2024-03-03_1439.txt (15.7 KB)

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,995
There is nothing wrong with the hard drive. All looks very good.

The server runs the newest version of TrueNAS SCALE (I always update when one is available)
Exactly what version are you running now? You are letting us "assume" something, and that is not a good start. Never let us assume, as you could get not-so-good advice. My assumption could be that you are running 23.10.2, 24.04-BETA.1, or nightlies.

Hopefully you know how to use [ code] tags [ /code] (without the spaces), but here is my request for data:
zpool status -v

Okay, just the one for now.
 

M3PH

Dabbler
Joined
Mar 3, 2024
Messages
21
There is nothing wrong with the hard drive. All looks very good.


Exactly what version are you running now? You are letting us "assume" something and they is not a good start. Never let us assume as you could get not so good advice. My assumption could be that you are running 23.10.2, 24.04-BETA.1, or nightlies.

Hopefully you know how to use (no spaces)

but here is my request for data:
zpool status -v

Okay, just the one for now.
The exact TrueNAS version is: TrueNAS-SCALE-23.10.2

Here is the zpool status:
Code:
  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:14 with 0 errors on Wed Feb 28 03:45:16 2024
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdg3    ONLINE       0     0     0
            sdi3    ONLINE       0     0     0

errors: No known data errors

  pool: datapool_test
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
  scan: resilvered 536G in 01:05:31 with 0 errors on Sat Feb 17 18:03:40 2024
config:

        NAME                                      STATE     READ WRITE CKSUM
        datapool_test                             DEGRADED     0     0     0
          raidz2-0                                DEGRADED     0     0     0
            6082669867442403258                   UNAVAIL      0     0     0  was /dev/disk/by-partuuid/456ebbf4-a3ae-471a-862b-c25f3107ba2c
            e2dff84f-96da-4d2a-a47f-ea310dbaec10  ONLINE       0     0     0
            df4d6519-44b6-4fc8-9318-12339e90f36a  ONLINE       0     0     0
            0f127442-c291-44b0-9d65-9f56f239442b  ONLINE       0     0     0
            236f1e76-fa4d-4d7d-bc81-0240d503889b  ONLINE       0     0     0
            44c909a5-bd8b-48ff-a571-1b726f308db2  ONLINE       0     0     0
            a3a4432b-a0d4-436e-9542-81db9c85bbbd  ONLINE       0     0     0
            e8a7dad4-8c8a-4c00-9a9d-3ddad6bc301e  ONLINE       0     0     0
        cache
          f56dcb5d-e4d6-4f15-aaaa-e223edc2cb3a    ONLINE       0     0     0

errors: No known data errors

  pool: vmstore
 state: ONLINE
  scan: scrub repaired 0B in 00:04:18 with 0 errors on Thu Feb  1 00:04:20 2024
config:

        NAME                                      STATE     READ WRITE CKSUM
        vmstore                                   ONLINE       0     0     0
          mirror-0                                ONLINE       0     0     0
            4ff0f55e-b13d-46ce-acec-d117e7b0d4c3  ONLINE       0     0     0
            6ccdbca6-393e-4b9a-9390-4da21a00262c  ONLINE       0     0     0

errors: No known data errors


It should be noted that the suspect drive is currently on my desk, connected via USB.
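In the `datapool_test` output above, the missing member is listed by its numeric ZFS GUID (`6082669867442403258`), with its last known path after `was`. A small filter can pull those out of `zpool status`; `unavail_members` is a hypothetical helper name, and it assumes the column layout shown above (name, state, three error counters, then an optional `was <path>`).

```shell
# Hypothetical helper: read `zpool status` output on stdin and print, for each
# member in the UNAVAIL state, its name/GUID and its last known path (if shown).
unavail_members() {
  awk '$2 == "UNAVAIL" {
    path = ""
    for (i = 3; i < NF; i++) if ($i == "was") path = $(i + 1)
    print $1, path
  }'
}
```

Usage: `zpool status datapool_test | unavail_members`.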
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,995
The good thing is that it does not look like any corrupt data exists, yet.

I'm going to ask a series of questions. It may sound like you have already provided the answers, but we must communicate clearly, hence the questions.

1) How many "data" drives should be installed? I count 2 drives for your boot-pool, 2 drives for your vmstore, and 7 drives for your 'datapool_test'.
2) How many "data" drives are physically installed right now? If you change the setup or move anything around, you must communicate that information.
3) Provide the output of fdisk -l

EDIT: Sorry, I screwed up line 3, fixed now.
 

M3PH

Dabbler
Joined
Mar 3, 2024
Messages
21
1) How many "data" drives should be installed? I count 2 drives for your boot-pool, 2 drives for your vmstore, and 7 drives for your 'datapool_test'.
Should be 8 drives for datapool_test plus a cache drive (so 9, I guess). As I said, drive 8 is sitting on my desk, connected to my desktop via a SATA-to-USB adapter. If you want me to install it in the server again, let me know.
2) How many "data" drives are physically installed right now? If you change the setup or move anything around, you must communicate that information.
Right now there are 7 HDDs and 5 SSDs installed and working: 2 SSDs for the boot pool, 2 SSDs for vmstore, and 7 HDDs (data drives) plus 1 SSD (cache drive) for datapool_test.
3) Provide the output of fdisk -l
fdisk -l output

Code:
Disk /dev/nvme0n1: 476.94 GiB, 512110190592 bytes, 1000215216 sectors
Disk model: Sabrent                                
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: E8150FCF-63D7-48AA-BB88-4DE859839CFF


Device           Start        End   Sectors   Size Type
/dev/nvme0n1p1     128    4194304   4194177     2G Linux swap
/dev/nvme0n1p2 4194432 1000215182 996020751 474.9G Solaris /usr & Apple ZFS




Disk /dev/nvme2n1: 476.94 GiB, 512110190592 bytes, 1000215216 sectors
Disk model: Sabrent                                
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: EC03AB15-559E-40D1-9F1E-E4F9927DA1D5


Device           Start        End   Sectors   Size Type
/dev/nvme2n1p1     128    4194304   4194177     2G Linux swap
/dev/nvme2n1p2 4194432 1000215182 996020751 474.9G Solaris /usr & Apple ZFS




Disk /dev/nvme1n1: 232.89 GiB, 250059350016 bytes, 488397168 sectors
Disk model: Samsung SSD 970 EVO Plus 250GB        
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 43604192-54B0-428D-B8D2-8AB68CBA5A93


Device         Start       End   Sectors   Size Type
/dev/nvme1n1p1  4096 488396800 488392705 232.9G Solaris /usr & Apple ZFS




Disk /dev/sdi: 931.51 GiB, 1000204886016 bytes, 1953525168 sectors
Disk model: CT1000MX500SSD1
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: F0BC8839-4A8D-4C3F-B727-3DB766826D91


Device        Start        End    Sectors  Size Type
/dev/sdi1      4096       6143       2048    1M BIOS boot
/dev/sdi2      6144    1054719    1048576  512M EFI System
/dev/sdi3  34609152 1953525134 1918915983  915G Solaris /usr & Apple ZFS
/dev/sdi4   1054720   34609151   33554432   16G Linux swap


Partition table entries are not in disk order.




Disk /dev/sdg: 931.51 GiB, 1000204886016 bytes, 1953525168 sectors
Disk model: CT1000MX500SSD1
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 601E025B-F221-482D-A4E0-D5825975619B


Device        Start        End    Sectors  Size Type
/dev/sdg1      4096       6143       2048    1M BIOS boot
/dev/sdg2      6144    1054719    1048576  512M EFI System
/dev/sdg3  34609152 1953525134 1918915983  915G Solaris /usr & Apple ZFS
/dev/sdg4   1054720   34609151   33554432   16G Linux swap


Partition table entries are not in disk order.




Disk /dev/sda: 1.82 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: ST2000NT001-3M31
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 8246F613-DA39-40CE-91E4-255356E83D56


Device     Start        End    Sectors  Size Type
/dev/sda1   4096 3907028992 3907024897  1.8T Solaris /usr & Apple ZFS




Disk /dev/sdc: 1.82 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: ST2000NT001-3M31
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 05BDF6E0-7601-42CC-9BF9-F1173AA84EBC


Device     Start        End    Sectors  Size Type
/dev/sdc1   4096 3907028992 3907024897  1.8T Solaris /usr & Apple ZFS




Disk /dev/sdf: 1.82 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: ST2000NT001-3M31
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 245137E2-258F-4C57-9234-B36E45CB9D40


Device     Start        End    Sectors  Size Type
/dev/sdf1   4096 3907028992 3907024897  1.8T Solaris /usr & Apple ZFS




Disk /dev/sdd: 1.82 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: ST2000DM008-2FR1
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 1C866CC1-53F3-4117-9515-B0B5E5C4F43B


Device     Start        End    Sectors  Size Type
/dev/sdd1   4096 3907028992 3907024897  1.8T Solaris /usr & Apple ZFS




Disk /dev/sde: 1.82 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: ST2000DX002-2DV1
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 8907C32B-CCEE-4EB8-8C4F-F2D7965729F4


Device     Start        End    Sectors  Size Type
/dev/sde1   4096 3907028992 3907024897  1.8T Solaris /usr & Apple ZFS




Disk /dev/sdb: 1.82 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: ST2000DM008-2FR1
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 7DD74584-1005-49F2-A13A-886613BA428D


Device     Start        End    Sectors  Size Type
/dev/sdb1   4096 3907028992 3907024897  1.8T Solaris /usr & Apple ZFS




Disk /dev/sdh: 1.82 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: ST2000DM008-2FR1
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 6CB96559-323F-4CF7-BF72-9C0F438B48AA


Device     Start        End    Sectors  Size Type
/dev/sdh1   4096 3907028992 3907024897  1.8T Solaris /usr & Apple ZFS




Disk /dev/md127: 15.98 GiB, 17162043392 bytes, 33519616 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes




Disk /dev/zd0: 40 GiB, 42949672960 bytes, 83886080 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 16384 bytes
I/O size (minimum/optimal): 16384 bytes / 16384 bytes
Disklabel type: gpt
Disk identifier: 320BD6EA-0AD6-4117-9B5D-407BEDC21D08


Device        Start      End  Sectors  Size Type
/dev/zd0p1     2048   206847   204800  100M EFI System
/dev/zd0p2   206848   239615    32768   16M Microsoft reserved
/dev/zd0p3   239616 81786879 81547264 38.9G Microsoft basic data
/dev/zd0p4 81787008 83886015  2099008    1G Windows recovery environment




Disk /dev/zd16: 60 GiB, 64424525824 bytes, 125829152 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 16384 bytes
I/O size (minimum/optimal): 16384 bytes / 16384 bytes
Disklabel type: gpt
Disk identifier: 08349FDE-9D05-42D6-A819-F664CE9B416B


Device          Start       End   Sectors  Size Type
/dev/zd16p1      2048    206847    204800  100M EFI System
/dev/zd16p2    206848    239615     32768   16M Microsoft reserved
/dev/zd16p3    239616 123729919 123490304 58.9G Microsoft basic data
/dev/zd16p4 123730048 125829087   2099040    1G Windows recovery environment




Disk /dev/zd32: 100 GiB, 107374198784 bytes, 209715232 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 16384 bytes
I/O size (minimum/optimal): 16384 bytes / 16384 bytes
Disklabel type: gpt
Disk identifier: 25E8F040-C969-40AB-98C2-7D81DBF118BD


Device      Start       End   Sectors  Size Type
/dev/zd32p1    34     32767     32734   16M Microsoft reserved
/dev/zd32p2 32768 209711103 209678336  100G Microsoft basic data


Partition 1 does not start on physical sector boundary.




Disk /dev/zd48: 40 GiB, 42949672960 bytes, 83886080 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 16384 bytes
I/O size (minimum/optimal): 16384 bytes / 16384 bytes
Disklabel type: gpt
Disk identifier: 41144EDA-D5BE-4CD3-BE4F-6A137DC84B87


Device         Start      End  Sectors  Size Type
/dev/zd48p1     2048   206847   204800  100M EFI System
/dev/zd48p2   206848   239615    32768   16M Microsoft reserved
/dev/zd48p3   239616 81786879 81547264 38.9G Microsoft basic data
/dev/zd48p4 81787008 83886015  2099008    1G Windows recovery environment




Disk /dev/mapper/md127: 15.98 GiB, 17162043392 bytes, 33519616 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,995
What I'm doing is comparing the drive identifiers to make sure all looks good before proceeding, but I must be losing my mind: not a single ID in the fdisk list is in the zpool list. None!
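The mismatch is expected rather than a sign of trouble: in `fdisk -l`, `Disk identifier` is the GUID of each disk's whole GPT, while TrueNAS SCALE builds pools from partitions, so `zpool status` names members by partition UUID (compare the `was /dev/disk/by-partuuid/...` path on the missing member). To tie a pool member back to a physical disk, resolve it through `/dev/disk/by-partuuid`, or list everything at once with `lsblk -o NAME,MODEL,SERIAL,PARTUUID`. A minimal sketch; `partuuid_to_dev` is a hypothetical helper, and its optional second argument only exists so it can be tried against a scratch directory.

```shell
# Hypothetical helper: resolve a partition UUID (as printed by `zpool status`)
# to the kernel device path it currently maps to. The lookup directory defaults
# to the udev-maintained /dev/disk/by-partuuid symlinks.
partuuid_to_dev() {
  dir=${2:-/dev/disk/by-partuuid}
  if [ -e "$dir/$1" ]; then
    readlink -f "$dir/$1"
  else
    echo "no device with PARTUUID $1" >&2
    return 1
  fi
}
```

Usage: `partuuid_to_dev e2dff84f-96da-4d2a-a47f-ea310dbaec10` prints the matching `/dev/sdX1`-style path, which can then be matched to a model and serial number in the `lsblk` listing.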

Sorry, but I'm baffled right now and I will not give you bad advice. Someone else will need to step in and help. To be honest, I thought it would be an easy fix.

What can you do? Zero out the new drive. It's not just a simple format: you must remove the data at the beginning and the end of the drive that holds its formatting (the partition tables). That will make TrueNAS think the drive was never in a NAS. You can write zeros to the entire drive if you desire.
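On Linux (including the TrueNAS SCALE shell), the beginning-and-end wipe described above can be done with `dd`. Both ends matter: ZFS keeps two labels at the start and two at the end of each member partition, and GPT keeps a backup partition table in the last sectors of the disk. A minimal sketch, assuming GNU `dd` and `blockdev` (util-linux), as on Debian-based SCALE; `wipe_ends` and the 10 MiB default are illustrative choices, not a TrueNAS tool. It is destructive, so confirm the device first (for example with `lsblk -o NAME,MODEL,SERIAL`), since `/dev/sdX` names can change between boots. Purpose-built alternatives are `zpool labelclear -f /dev/sdX1` for the ZFS labels and `wipefs -a /dev/sdX` for the partition-table signatures.

```shell
# Hypothetical helper: overwrite the first and last N MiB (default 10) of a
# block device or image file with zeros. DESTRUCTIVE: double-check the target.
wipe_ends() {
  target=$1
  mib=${2:-10}
  if [ -b "$target" ]; then
    bytes=$(blockdev --getsize64 "$target")   # size of a block device
  else
    bytes=$(wc -c < "$target")                # size of a regular file
  fi
  sectors=$((bytes / 512))
  count=$((mib * 2048))                       # 2048 x 512-byte sectors = 1 MiB
  if [ "$sectors" -lt $((count * 2)) ]; then
    echo "target is smaller than 2 x $mib MiB" >&2
    return 1
  fi
  # Start of disk: protective MBR, primary GPT, front ZFS labels.
  dd if=/dev/zero of="$target" bs=512 count="$count" \
     conv=notrunc,fsync status=none
  # End of disk: trailing ZFS labels and the backup GPT.
  dd if=/dev/zero of="$target" bs=512 seek=$((sectors - count)) \
     count="$count" conv=notrunc,fsync status=none
}
```

Usage: `wipe_ends /dev/sdX` (as root). `conv=notrunc` keeps `dd` from truncating the target when it is an image file; it makes no difference for a block device.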

Next, follow the replacement procedure, using the new drive to replace the failed one. This is what I'd do, but your system is confusing me, so sorry I couldn't be of more help.
 

M3PH

Dabbler
Joined
Mar 3, 2024
Messages
21
OK, no worries. Have you got any recommendations for Windows-based tools that will zero out the drive?
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,995
I like MiniTool Partition Wizard. It's free and does a lot of things, including wiping drives. I'd just write 0s or 1s; either will be fine. Then you can try to add the drive again as a new drive.

Good Luck.
 

M3PH

Dabbler
Joined
Mar 3, 2024
Messages
21
So, just a quick update: I just reinserted the drive into the server, and it seems to have accepted it and is resilvering the pool as I type.
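While a resilver runs, its progress appears in the `scan:` section of `zpool status`, which spans a few indented lines up to `config:`. A minimal filter, assuming that layout; `scan_progress` is a hypothetical helper name.

```shell
# Hypothetical helper: print the scan section of `zpool status` output, from
# the "scan:" line up to (but not including) the "config:" line.
scan_progress() {
  awk '/^[[:space:]]*scan:/ { show = 1 }
       /^[[:space:]]*config:/ { show = 0 }
       show'
}
```

Usage: `zpool status datapool_test | scan_progress`. To simply block until the resilver has finished, OpenZFS also offers `zpool wait -t resilver datapool_test`.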

Thanks very much for helping @joeschmuck.

One last question: do I need to mark this thread as resolved? If so, how? Or do I just leave it as it is?
 

Patrick_3000

Contributor
Joined
Apr 28, 2021
Messages
167
I recently upgraded to SCALE 23.10.2 and started getting random messages that my devices (which are SSDs) have been removed, even though I suspect there is nothing wrong with them. I just posted about it in a separate thread.

Can I ask whether you previously used these same drives in an earlier version of SCALE? I'm wondering if this is a problem with 23.10.2.
 

M3PH

Dabbler
Joined
Mar 3, 2024
Messages
21
I did. The 23.10.2 update came out a few days after I had completely wiped the box and reinstalled everything. The reason for the wipe was moving from a single boot drive to a boot mirror.

I'm also wondering if the mismatched drive identifiers are a bug too, because I didn't change or create them. TrueNAS did all of that itself.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I had completely wiped the box and reinstalled everything. The reason for the wipe was moving from a single boot drive to a boot mirror.
That's akin to setting fire to a shed because one of the two lawnmowers inside got sold off.
 

M3PH

Dabbler
Joined
Mar 3, 2024
Messages
21
That's akin to setting fire to a shed because one of the two lawnmowers inside got sold off.
I was also replacing 4 drives in a pool at the same time. Plus, there were a lot of things I wasn't happy with config-wise, so it was easier to just wipe it and start from scratch. Sometimes when you are learning, it is better to repeat things you have learnt, to make sure you have really learnt them, than to try to learn more new things, especially when the new things add complexity and, if done wrong, could cause more trouble.

(Also, I think you read my last post wrong, @Ericloewe, because your analogy is backwards from what I was trying to achieve. I bought a second lawnmower and burnt down the shed because it was too small.)
 