Replaced disk fails resilver and marks drive REMOVED

ArsenalNAS

Cadet
Joined
Apr 21, 2022
Messages
3
I am new to this forum. I am running TrueNAS CORE 12.0-U6 with 22 drives (2 groups of 11). I have replaced drives previously, but I am now running into a problem with rebuilding this time around. Pool usage is high (~95%) and I now have 2 troubled drives, both in the same group. When I attempt to offline and replace a drive with a new one, the process looks standard: the new drive shows up in the replacement pulldown, I start the replacement process (resilver), and shortly thereafter the system pauses and then continues. If I refresh after that pause, the replacement disk is now listed as REMOVED, with its GPTID string shown.
[Attached screenshot: Screen Shot 2022-04-21 at 11.36.07 AM.png]


I have not discovered a way to get the system to make another attempt at using the replacement disk. The system has labeled the failed disk as REMOVED and will not proceed once the UI has been refreshed; the drive has been marked out. If I put in another drive, I can start the process again. However, I have done that, and the second replacement disk failed with the same error.
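From reading around, it sounds like the CLI may let me detach the stuck replacement and retry without burning another disk. A rough sketch of what I am considering (untested; "tank" stands in for my pool name and the gptid strings are placeholders I would copy from zpool status):

```sh
# Show the pool layout; a stuck replacement appears as a "replacing" vdev
zpool status -v tank

# Detach the REMOVED replacement disk (use the gptid/... string shown for it)
zpool detach tank gptid/<failed-replacement-gptid>

# Retry the replacement once the new disk is visible again
zpool replace tank gptid/<old-disk-gptid> da21
```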

When I look at the console at the disk serial history for da21, I can see the different disks that failed the replacement attempts.

[Attached screenshot: Screen Shot 2022-04-21 at 12.25.08 PM.png]


I would appreciate any insights into how to best resolve the issue.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Sounds like a problem with that port on the backplane or the cable to it.

Have a look at dmesg and see if CAM is reporting errors that point at CRC errors or something else.
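Something along these lines should surface them (adjust the pattern to taste; da21 is the slot in question):

```sh
# Filter the kernel message buffer for CAM-layer errors on the suspect disk
dmesg | grep -iE 'da21|cam|crc'

# Confirm the disk is still attached and on which controller/target
camcontrol devlist
```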
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
What brand and model of disk, exactly?
 

ArsenalNAS

Cadet
Joined
Apr 21, 2022
Messages
3
The 2 disk types I have used are both Seagate. The initial installation used 3TB drives, and I have been swapping in 6TB replacements (ST6000NM0044). I am inclined to suspect a faulty connector (or the midplane). Since it is a unified midplane, it is a bit more difficult to replace; I may have to map out the connector location. Although I will then have an imbalanced RAID config between the two groups.
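For mapping the connector, I am thinking sesutil might help, assuming the enclosure's SES processor is supported on FreeBSD; a rough sketch:

```sh
# Map enclosure slots to da device names
sesutil map

# Blink the locate LED on the suspect drive's slot to find it physically
sesutil locate da21 on
# ...and turn it off again afterwards
sesutil locate da21 off
```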

The "dmesg" simply lists SCSI ERROR. (I am not familiar with the CDB addressing)
[Attached screenshot: Screen Shot 2022-04-22 at 1.05.00 PM.png]
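Since the CDB output is opaque to me, I am planning to ask the drive itself what it has logged; a sketch, with da21 as the suspect device:

```sh
# Full SMART report, including the drive's own logged errors and attributes
smartctl -a /dev/da21

# Kick off a long self-test (runs inside the drive in the background)
smartctl -t long /dev/da21
```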
 

ArsenalNAS

Cadet
Joined
Apr 21, 2022
Messages
3
As a side question: if I have a replacement disk which has failed (in this case, a connection failure), would I be able to use it in another slot as a replacement disk? I know if I use it in the same slot, the system seems to ignore it. Or would it be best to erase it first and use it as a replacement in a different slot?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
if I have a replacement disk which has failed (in this case, a connection failure), would I be able to use it in another slot as a replacement disk? I know if I use it in the same slot, the system seems to ignore it. Or would it be best to erase it first and use it as a replacement in a different slot?
I'd start by running badblocks on it for a bit to make sure you're not just introducing trouble for yourself... just search for badblocks and you'll find some good resources on how to run it.
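For example, something along these lines. This is a destructive write test, so make absolutely sure of the device name before running it:

```sh
# DESTRUCTIVE write-mode test with progress output; wipes everything on the disk.
# The 4096-byte block size avoids badblocks' 32-bit block-count limit on a 6TB drive.
badblocks -ws -b 4096 /dev/da21
```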

The errors are certainly pointing at a failure in the path between the system and the disk(s) (only da21 implicated at this point).

Maybe the backplane port, so try working around it and see how that goes.
 