Failed Drive and I now think I am in trouble

HouliNZ

Cadet
Joined
Mar 15, 2024
Messages
4
I am currently running TrueNAS CORE 13.0-U6.1. My drives and pools are listed below.

The history building up to where I am now:
  1. TrueNAS CORE was working well (aside from one jail causing the machine to lock up every now and again - I think it may be the SATA controller driver not coping under load)
  2. One of my 6TB drives started failing SMART tests
  3. So I shut down the machine, removed the offending drive, installed a replacement drive, and restarted the machine
  4. The storage pool (SmartArrayPool) was gone! Argghhh, complete data loss :(
  5. So I shut down the machine, removed the replacement drive, re-installed the SMART-failing drive, and restarted the machine
  6. The storage pool (SmartArrayPool) was back! Yippee.
  7. But now the storage pool is degraded. It says "/dev/gptid/ac329ad8-8e9c-11ee-8c88-e0d55ef88b07" is offline, while the 7 drives are showing as ONLINE. I'm not sure what this is. Could it be the replacement drive, which I didn't add to the storage pool?
I've done some research and it appears I was supposed to take the offending drive offline before replacing it. I guess I stuffed that one up.

I don't have any spare SATA ports to have the failing one and the new one in the machine at the same time.

What do I do now? I'm keen not to lose the data (it's 90% full).

Any help would be appreciated.

My drives
  • ada0 - 500GB boot SSD
  • ada1 - 7.28TB in SmartArray8TBPool
  • ada2 - 7.28TB in SmartArray8TBPool
  • ada3 - 7.28TB in SmartArray8TBPool
  • ada4 - 7.28TB in SmartArray8TBPool
  • ada5 - 5.46TB in SmartArrayPool
  • ada6 - 5.46TB in SmartArrayPool
  • ada7 - 5.46TB in SmartArrayPool
  • ada8 - 5.46TB in SmartArrayPool
  • ada9 - 5.46TB in SmartArrayPool
  • ada10 - 5.46TB in SmartArrayPool
  • ada11 - 5.46TB in SmartArrayPool
Storage pools
  • SmartArray8TBPool - 20.18TB, RAIDZ1 - Healthy
  • SmartArrayPool - 32.99TB, RAIDZ1 - Degraded
 

chuck32

Guru
Joined
Jan 14, 2023
Messages
623
Please post the output of zpool status and list your hardware in detail!

My assumption would be you are using hardware raid, which is not good. What type is your storage pool? If it's raidz1 or higher you didn't lose any data yet (if the assessment that only one drive failed is correct). Make a backup now.
 

HouliNZ

Cadet
Joined
Mar 15, 2024
Messages
4
Chuck32, thank you for coming back so quickly.

Hardware RAID
I can confirm that I am 100% not using hardware raid. All RAID technology is coming from TrueNAS.

STORAGE POOL DETAIL
Storage Pool Overview.png


Storage Pool Degraded.png


ZPOOL STATUS COMMAND RESULTS
root@truenas[~]# zpool status
  pool: SmartArray8TBPool
 state: ONLINE
  scan: scrub repaired 0B in 02:30:29 with 0 errors on Sun Feb 25 02:30:39 2024
config:

        NAME                                            STATE     READ WRITE CKSUM
        SmartArray8TBPool                               ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/81e4ff91-93f8-11ee-bb2c-e0d55ef88b07  ONLINE       0     0     0
            gptid/81f2f3c9-93f8-11ee-bb2c-e0d55ef88b07  ONLINE       0     0     0
            gptid/81d710e9-93f8-11ee-bb2c-e0d55ef88b07  ONLINE       0     0     0
            gptid/81f21a18-93f8-11ee-bb2c-e0d55ef88b07  ONLINE       0     0     0

errors: No known data errors

  pool: SmartArrayPool
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 16K in 08:36:42 with 0 errors on Sun Feb 25 08:36:43 2024
config:

        NAME                                            STATE     READ WRITE CKSUM
        SmartArrayPool                                  DEGRADED     0     0     0
          raidz1-0                                      DEGRADED     0     0     0
            1548659058482393193                         OFFLINE      0     0     0  was /dev/gptid/ac329ad8-8e9c-11ee-8c88-e0d55ef88b07
            gptid/ac73245f-8e9c-11ee-8c88-e0d55ef88b07  ONLINE       0     0     0
            gptid/ac3fb29e-8e9c-11ee-8c88-e0d55ef88b07  ONLINE       0     0     0
            gptid/ac1e595d-8e9c-11ee-8c88-e0d55ef88b07  ONLINE       0     0     0
            gptid/ac30999d-8e9c-11ee-8c88-e0d55ef88b07  ONLINE       0     0     0
            gptid/ac0d89a6-8e9c-11ee-8c88-e0d55ef88b07  ONLINE       0     0     0
            gptid/ac29e9c3-8e9c-11ee-8c88-e0d55ef88b07  ONLINE       0     0     0
            gptid/ac7e5d25-8e9c-11ee-8c88-e0d55ef88b07  ONLINE       0     0     0

errors: No known data errors

  pool: boot-pool
 state: ONLINE

Hardware
I'm sorry, I'm not sure what you are after or the best way to get it to you. From a SATA controller perspective I am using:
  • motherboard SATA controller (non-RAID mode)
  • ASMedia SATA controller (trying to confirm which model now - sorry)

Does this help?

Thanks Houli
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Seems pretty straightforward: you lost a drive, so simply replace it per the documentation. See the "Documentation" link at the top of any Forum page.

Glad you are not using hardware RAID. There is a RAID controller from HP, (I think), called SmartArray. Thus, @chuck32 probably wanted to make sure you were not using such.

Motherboard SATA ports should be fine for use, though some ASMedia and other brands of SATA expansion chips can be problematic if they use SATA Port Multipliers. Thus, the desire to know the exact make & model of all the SATA ports in use.
 

HouliNZ

Cadet
Joined
Mar 15, 2024
Messages
4
Thanks Arwen.

I checked the documentation. Did you mean this: Replacement disk tutorial?

If so, can I take ada5 offline without data loss? I'm not sure what the '/dev/gptid/ac329ad8-8e9c-11ee-8c88-e0d55ef88b07' device is that is showing as OFFLINE and don't want to stuff it up.

I checked the 8TB pool; it doesn't have one of these mystery devices.

I know I should back it up, but I don't have a spare 30TB lying around to back it all up to. I could back up some of it of course but not all.

Any advice on dealing with this mystery device BEFORE I take ada5 offline and replace it?
 

chuck32

Guru
Joined
Jan 14, 2023
Messages
623
I'm slightly confused, why are you talking about ada5?
The numbering / labeling may change in between reboots, you need to identify the disks by serial number rather than adaX. Ada5 now can be different from ada5 before you shutdown the server.
Identify the failed drive and replace it.
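
To make the serial-number point concrete, here is a minimal sketch, assuming a root shell on FreeBSD-based TrueNAS CORE. The helper name `gptid_to_dev` is made up for illustration; it only filters text in the layout that `glabel status` prints (Name, Status, Components), so a gptid from `zpool status` can be matched to the adaX partition it currently sits on. The serial of that device can then be read with `smartctl -i`.

```shell
#!/bin/sh
# Hypothetical helper: map a gptid label to the device node it currently lives on.
# Reads `glabel status`-style text on stdin (columns: Name Status Components).
gptid_to_dev() {
  awk -v id="$1" '$1 == "gptid/" id { print $3 }'
}

# On a live system (assumption: run as root on TrueNAS CORE):
#   glabel status | gptid_to_dev ac73245f-8e9c-11ee-8c88-e0d55ef88b07
#   smartctl -i /dev/ada5 | grep 'Serial Number'
# Replace ada5 with whatever device the first command reports.
```

Matching on the gptid rather than on adaX keeps the check valid even if the device numbering shifts after a reboot.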

"The storage pool (SmartArrayPool) was gone! Argghhh, complete data loss :("
What do you mean by was gone?

Can you clarify your current setup? You now have all 8 original drives installed, one of which is the failed drive. I assume this is the currently offline drive, as all the others seem to be healthy. See above.
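
To pick the odd one out of a long listing, here is a minimal sketch. The function name `not_online` is made up for illustration; it only filters text in the `zpool status` layout posted earlier in the thread (NAME STATE READ WRITE CKSUM). When a member is offline or cannot be opened, the NAME column shows its numeric GUID and the trailing "was ..." note gives the path it last had.

```shell
#!/bin/sh
# Hypothetical filter: print config rows of `zpool status` text whose STATE is not ONLINE.
# A config row has at least 5 fields and a numeric READ counter in field 3.
# If the row carries a "was <path>" note, the former path (field 7) is printed too.
not_online() {
  awk 'NF >= 5 && $3 ~ /^[0-9]/ && $2 != "ONLINE" {
         if (NF >= 7) print $1, $2, $7; else print $1, $2
       }'
}

# On a live system (assumption: root shell on TrueNAS CORE):
#   zpool status SmartArrayPool | not_online
```

For a degraded pool this also lists the pool and vdev rows, since they report DEGRADED as well; the leaf row is the one to act on.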

It's already offline, so you physically remove it and add your replacement drive, then replace it in the pool according to the documentation. Alternatively, you can leave the old drive connected during the replacement and remove it afterwards.

On a side note, an 8-wide RAIDZ1 is wider than I'd be comfortable with, honestly, especially given you have no backups. If possible, make a backup and/or consider creating a RAIDZ2 pool. Now that the drive is faulty, you cannot afford to lose another drive until you have replaced it.
 

HouliNZ

Cadet
Joined
Mar 15, 2024
Messages
4
Chuck32,

The drive which is ada5 (it hasn't changed for some reason) is a failing (well, it started failing SMART tests) 5.46TB Western Digital drive. I'm tracking its serial number, and this is the one I removed and replaced, only to see that my drive pool had completely gone! I thought I had lost all my data!

However, I removed the replacement drive and returned the failing 5.46TB Western Digital and the pool came back!

It's clear that I should have followed the documented process and
1. marked the drive as offline
2. shut the machine down
3. removed the drive
4. installed a replacement
5. started the machine up
6. followed the web GUI to mark it as the replacement, allowing the RAID to rebuild.
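
For reference, a rough sketch of what those steps map to underneath the web GUI, assuming the standard `zpool offline` and `zpool replace` subcommands. The helper `plan_replace` and the `OLD_MEMBER` and `NEW_DISK` placeholders are made up for illustration, and the helper only prints the commands rather than running them. On TrueNAS the GUI route is still the documented one, since it also prepares the new disk before handing it to ZFS.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints the ZFS commands behind an offline-then-replace.
# It runs nothing against the pool.
# $1 = pool, $2 = old member (gptid path or numeric GUID from zpool status), $3 = new device
plan_replace() {
  echo "zpool offline $1 $2"
  echo "# shut down, swap the physical drive, start up again"
  echo "zpool replace $1 $2 $3"
  echo "zpool status $1"
}

# OLD_MEMBER and NEW_DISK are placeholders, not real device names.
plan_replace SmartArrayPool OLD_MEMBER NEW_DISK
```

The final `zpool status` is there to watch the resilver that starts once the replace is accepted.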

But alas, I didn't do this.

So today, I am still running on the failing drive and have a mystery item OFFLINE in my drive pool. The failing drive is now showing as healthy - but I don't trust it so do want to follow the correct process.

I am reluctant to follow the process now until I know what this mystery item is.

"On a side note, an 8-wide RAIDZ1 is wider than I'd be comfortable with, honestly, especially given you have no backups. If possible, make a backup and/or consider creating a RAIDZ2 pool. Now that the drive is faulty, you cannot afford to lose another drive until you have replaced it."

Is there an in-place upgrade option to go from RAIDZ1 to RAIDZ2?
 

chuck32

Guru
Joined
Jan 14, 2023
Messages
623
"I am reluctant to follow the process now until I know what this mystery item is."
I deleted my original answer because now I also got confused. Your original post lists 7 drives. Now you seem to have 8 drives in your pool, if I'm not mistaken. Please clarify how many drives you originally had. I don't see how you would end up with one more drive if you only swapped them physically for one boot. I'm probably missing something here.

"only to see that my drive pool had completely gone!"
It did not import on boot? You should be able to manually import the pool then.
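
If the pool simply was not imported at boot, a quick check is to list what ZFS considers importable. A minimal sketch: the helper name `pool_names` is made up, and it only pulls the `pool:` lines out of text in the layout that `zpool import` (run with no arguments) or `zpool status` prints. On TrueNAS the import itself is better done from the web UI so the system stays aware of the pool.

```shell
#!/bin/sh
# Hypothetical helper: list pool names found in `zpool import` or `zpool status` text on stdin.
pool_names() {
  awk '$1 == "pool:" { print $2 }'
}

# On a live system (assumption: root shell on TrueNAS CORE):
#   zpool import | pool_names     # pools that could be imported
#   zpool status | pool_names     # pools that are already imported
```

If SmartArrayPool shows up in the first list but not the second, it exists on disk and only needs importing.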

Can you post the output of the failed SMART test for ada5 and make sure it's still ada5?
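
A minimal sketch of what to look at, assuming smartmontools' `smartctl` as shipped with TrueNAS. The helper name `smart_raw` is made up; it only filters text in the attribute-table layout that `smartctl -A` prints and shows the raw values of three attributes commonly watched on failing disks (IDs 5, 197 and 198). The self-test history itself comes from `smartctl -l selftest`.

```shell
#!/bin/sh
# Hypothetical helper: from `smartctl -A` text on stdin, print name and raw value
# (10th column) of attributes 5, 197 and 198.
smart_raw() {
  awk '$1 == 5 || $1 == 197 || $1 == 198 { print $2, $10 }'
}

# On a live system (assumption: the suspect disk is currently ada5):
#   smartctl -A /dev/ada5 | smart_raw
#   smartctl -l selftest /dev/ada5
#   smartctl -i /dev/ada5 | grep 'Serial Number'
```

Posting the self-test log together with the serial number ties the result to the physical disk rather than to the adaX name.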

Probably I'm missing something but it should be rather straightforward.

If you're that concerned that you may lose your data,
"I could back up some of it of course but not all."
Did you do that?

"Is there an in-place upgrade option to go from RAIDZ1 to RAIDZ2?"
Destroy the pool, create new pool and restore from backup (which you don't have unfortunately).
Keep in mind that you need more drives for that, with 90 % usage you don't have enough space. In the near future you need to address that too. You don't want to fill your pool completely.
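
A back-of-the-envelope sketch of the space problem. The helper name `raidz_upper_tb` is made up, the formula (drives minus parity, times drive size) is only an upper bound that ignores ZFS overhead, and the 8-drive width is read off the `zpool status` listing rather than the original drive list.

```shell
#!/bin/sh
# Hypothetical helper: rough upper bound on RAIDZ capacity in TB.
# $1 = number of drives, $2 = parity drives (1 for RAIDZ1, 2 for RAIDZ2), $3 = drive size in TB
raidz_upper_tb() {
  awk -v n="$1" -v p="$2" -v s="$3" 'BEGIN { printf "%.2f\n", (n - p) * s }'
}

# Figures from the thread: 5.46TB drives, pool reported as 32.99TB and 90% full.
raidz_upper_tb 8 1 5.46                         # RAIDZ1 upper bound
raidz_upper_tb 8 2 5.46                         # RAIDZ2 upper bound
awk 'BEGIN { printf "%.2f\n", 0.90 * 32.99 }'   # data currently in use
```

Even before overhead, the same eight drives in RAIDZ2 would offer less than the pool reports today, and the data has to live somewhere else while the pool is rebuilt.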
 