Failed disk replace only added more disks to volume.

MR. T. · Jan 4, 2018

Hi all,

I have a volume with 11 disks in a raidz2 layout.
One of the disks simply died, so I bought a new one and pressed Replace in the GUI.

While I was buying and fitting that disk, another disk started developing a few faults, so I got myself another replacement and set that second disk to be replaced as well.

I have noticed before that, during a replace, FreeNAS adds the new disk to the volume while resilvering and removes the old disk at the end of it. This time it didn't do that; it just kept the new and old disks all in the same volume.

Pressing Detach doesn't seem to do anything other than trigger a resilver.

What can I do?

Edit: pressing Detach gives me the message "Disk detach has been successfully done." but the disk is still in the volume.

For older users of the forum who might recognise me and think it strange that I disappeared:
I built my FreeNAS machine a long time ago, and when I started I had no idea what I was getting myself into. After much pain I got it to a point that many of the people in this forum would be proud of (server mobo, a lot of ECC RAM, etc.).
While trying to work out all the kinks in my setup, I got my entire NAS encrypted by some Russian ransomware (I hadn't gotten to the point of enabling snapshots). That really made me depressed.

After a while I came back to it, but the NAS has been an unending source of pain because it never worked correctly when it simply should. I believe I tracked it down to a faulty disk or cable signalling incorrectly on the 3.3 V line and spinning down/offlining my disks randomly.

I also had 2 boot devices die on me (a pen drive and a regular spinning disk). I haven't been lucky.
After a year and a half of being extremely flaky, the NAS has now been running for 15 days without needing a reboot, without losing disks, etc.
At this point I would really like it to just work without any more problems :)

Chris Moore · Jan 4, 2018

In code tags, please share the output of the zpool status command.


MR. T. · Jan 4, 2018

Hi.

I don't have SSH access right now, but here is the output from the daily email.
I think the information is quite similar.

Code:

Checking status of zfs pools:
NAME           SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH    ALTROOT
freenas-boot  74.5G   850M  73.7G         -      -     1%  1.00x  ONLINE    -
tmp6          1.81T  1.70T   115G         -    52%    93%  1.00x  ONLINE    /mnt
volume1       1.80T  1.19T   626G         -    44%    66%  1.00x  ONLINE    /mnt
volume2       80.0T  24.0T  56.0T         -    12%    30%  1.00x  DEGRADED  /mnt

  pool: volume2
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
		corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
		entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: resilvered 3.08T in 2 days 03:06:51 with 96 errors on Wed Jan  3 12:45:25 2018
config:

        NAME                                              STATE     READ WRITE CKSUM
        volume2                                           DEGRADED     0     0   153
          raidz2-0                                        DEGRADED     0     0   306
            gptid/4fbee96c-5769-11e6-b9dd-d05099c0e76d    ONLINE       0     0     0
            gptid/e8186958-a209-11e6-9ac1-d05099c0e76d    DEGRADED     0     0     0  too many errors
            gptid/197585c6-b5a9-11e6-9917-d05099c0e76d    DEGRADED     0     0     0  too many errors
            da7p1                                         FAULTED     24     0     0  too many errors
            gptid/f38fb3bf-96eb-11e6-b4e3-d05099c0e76d    DEGRADED     0     0     0  too many errors
            replacing-5                                   DEGRADED     0     0     0
              5477075899050874913                         UNAVAIL      0     0     0  was /dev/gptid/263734a0-97bd-11e6-bd78-d05099c0e76d
              gptid/27aad4c9-d2e8-11e7-96d0-d05099c0e76d  ONLINE       0     0     0
              gptid/95e337aa-e2d1-11e7-9f7e-d05099c0e76d  ONLINE       0     0     0
            gptid/19b747eb-71de-11e6-a3d7-d05099c0e76d    DEGRADED     0     0     0  too many errors
            gptid/757de200-71e7-11e6-a3d7-d05099c0e76d    DEGRADED     0     0     0  too many errors
            gptid/c00c56c5-97bf-11e6-bd78-d05099c0e76d    ONLINE       0     0     0
            gptid/f8071135-b5a8-11e6-9917-d05099c0e76d    DEGRADED     0     0     0  too many errors
            gptid/b9868489-9af2-11e6-aef6-d05099c0e76d    DEGRADED     0     0     0  too many errors

errors: 43 data errors, use '-v' for a list

Now that I look at it closely, I wonder if I used both disks to replace the same one.
I wanted to replace the missing disk and the one that is showing as faulted.
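
That suspicion can be checked mechanically from the config section. The snippet below is only a sketch: `replacing_children` is a made-up helper name, the sample text is an abbreviated copy of the config section above, and it assumes the indentation of the status output has been preserved as spaces. It lists which devices sit under each replacing vdev.

```python
# Abbreviated excerpt of the "config:" section from the status email above.
# Only a few raidz2-0 members are kept; indentation is assumed to be spaces.
STATUS_EXCERPT = """\
volume2 DEGRADED 0 0 153
  raidz2-0 DEGRADED 0 0 306
    gptid/4fbee96c-5769-11e6-b9dd-d05099c0e76d ONLINE 0 0 0
    da7p1 FAULTED 24 0 0 too many errors
    replacing-5 DEGRADED 0 0 0
      5477075899050874913 UNAVAIL 0 0 0 was /dev/gptid/263734a0-97bd-11e6-bd78-d05099c0e76d
      gptid/27aad4c9-d2e8-11e7-96d0-d05099c0e76d ONLINE 0 0 0
      gptid/95e337aa-e2d1-11e7-9f7e-d05099c0e76d ONLINE 0 0 0
    gptid/19b747eb-71de-11e6-a3d7-d05099c0e76d DEGRADED 0 0 0 too many errors
"""


def replacing_children(config_text):
    """Map each 'replacing-N' vdev to the device names indented beneath it."""
    result = {}
    current = None       # replacing vdev we are currently inside, if any
    current_indent = 0   # indentation of that replacing vdev's own line
    for line in config_text.splitlines():
        if not line.strip():
            continue
        indent = len(line) - len(line.lstrip())
        name = line.split()[0]
        if current is not None and indent > current_indent:
            # Deeper indentation means this device belongs to the group.
            result[current].append(name)
            continue
        current = None
        if name.startswith("replacing-"):
            current = name
            current_indent = indent
            result[current] = []
    return result


for vdev, children in replacing_children(STATUS_EXCERPT).items():
    print(vdev, "contains", len(children), "devices:")
    for child in children:
        print("  ", child)
```

On the excerpt, both ONLINE gptid devices show up under replacing-5 next to the missing disk, and da7p1 has no replacing group of its own, which fits the idea that both replacements were pointed at the same disk.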

Chris Moore · Jan 4, 2018

I feel bad about this. Do you have a backup of your data? If your data is accessible, you probably should be working on getting a backup.
You have a lot of drives that are listed as degraded and that is not usually a good situation.

MR. T. said:

See where this says da7p1, did you try to do some work on this from the command line?

MR. T. · Jan 4, 2018

hi,

I'm really wary of doing anything over the command line. I avoid it unless I have no other option.

I added da11 and da12 to replace the disk that is unavailable (5477075899050874913) and the disk that developed a fault (da7p1).

My expectation was that one of the new disks would take the place of the missing disk, which would then stop being listed, and that the other would take the place of da7p1, which would then be removed from the volume.
I ended up with 13 disks in the volume instead of the 11 it should have.
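
The 13-versus-11 count can be reproduced from the status email. A small sketch, with the members of raidz2-0 transcribed by hand (gptid labels are shortened to their first eight characters, and `count_slots_and_devices` is just an illustrative name):

```python
# Members of raidz2-0 as listed in the status email.
# A plain string is a single disk; a list is a "replacing" group and its members.
# gptid labels are shortened to their first eight characters for readability.
RAIDZ2_MEMBERS = [
    "gptid/4fbee96c",
    "gptid/e8186958",
    "gptid/197585c6",
    "da7p1",
    "gptid/f38fb3bf",
    ["5477075899050874913", "gptid/27aad4c9", "gptid/95e337aa"],  # replacing-5
    "gptid/19b747eb",
    "gptid/757de200",
    "gptid/c00c56c5",
    "gptid/f8071135",
    "gptid/b9868489",
]


def count_slots_and_devices(members):
    """Return (number of raidz2 slots, number of individual devices listed)."""
    slots = len(members)
    devices = sum(len(m) if isinstance(m, list) else 1 for m in members)
    return slots, devices


slots, devices = count_slots_and_devices(RAIDZ2_MEMBERS)
print(f"{slots} raidz2 slots, {devices} devices listed")
# prints "11 raidz2 slots, 13 devices listed"
```

So the vdev still has 11 slots; the two extra devices are both inside the replacing-5 group.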

At this point I'm starting to freak out, because taking a backup is not an option due to the ridiculous amount of data in that volume.

I have 9 disks running fine, and that keeps my data available, but the 2 disks that give me the extra safety are stuck and I can't do anything about it.

These are 8 TB disks, so it's not easy to find somewhere to back up to.

Chris Moore · Jan 4, 2018

MR. T. said:
At this point I'm starting to freak out, because taking a backup is not an option due to the ridiculous amount of data in that volume.

Well, don't get in a hurry and do something prematurely. How long has this system been running?
Also, can you give as much detail as possible about the hardware we are looking at?

MR. T. · Jan 4, 2018

The system has been running for a year and a half, but it has been extremely unstable until now. Disks would randomly drop out and not come back online again.
I would have to disconnect all the disks and reconnect them with the machine powered off to get it working again.

It's been running for the past 15 days without problems, so something I've done lately must have done some good.

My machine is an ASRock C2550D4I (quad-core Avoton)
32 GB ECC RAM
IBM SAS controller with IT firmware
2 HP SAS expanders
FreeNAS installed on a small (80 GB) spinning disk
Currently 3 volumes:
1 volume, single 2 TB disk, just used as a torrent target
1 volume, raidz1, 4 x 500 GB disks
1 volume, raidz2, 11 x 8 TB disks

I actually have 12 of the 8 TB disks, as I had problems with one of them and a second is faulty (those are the ones I am trying to replace).

The NAS is mostly used as a very large archive, maximizing the amount of free space with just 2 disks for parity just in case something goes wrong.
The intent is to have the data always available but in cold storage, protected against bitrot (my motivation for starting to use FreeNAS).

I also want to expand (once things are stable) and move the VMware disks onto a new volume in the future.

Chris Moore · Jan 4, 2018

I don't know if it is related at all, but I thought you might want to look into this:
https://forums.freenas.org/index.php?threads/please-help.46766/page-2#post-321255
There have been a lot of problems lately with that model of ASRock board. Many users have gotten replacements from ASRock even though the board was ostensibly out of warranty.

There are other possible causes of your system's past instability. What is the model of the HP SAS Expanders you are using?
What model of drive are you using?

We need to tread carefully because your pool is in a very fragile condition, so let's see if we can get some more eyes on this before we do anything that would be destructive.

MR. T. said:
I don't have SSH access right now, but here is the output from the daily email.

Is that just because you are away from the system? Is the interface working properly?

@Stux , @JoshDW19 , @Ericloewe , @Spearfoot I am not sure what happened, but that zpool status looks wrong to me. Perhaps someone else can recognize what the problem is and offer a suggestion or two?

MR. T. · Jan 4, 2018

Chris Moore said:
Is that just because you are away from the system? Is the interface working properly?

Yes, I was away. I had access to the web interface but not much more than that.
I could have used the console in the web interface, but it always gets cut off and I never managed to figure out how to scroll up :)

Chris Moore said:
What is the model of the HP SAS Expanders you are using?

They are 24-port SAS expanders similar to this one:
http://www.kingofservers.com/hpe-sa...MImNa0uJq_2AIVL7HtCh1e8A4BEAQYASABEgLAyfD_BwE

The 2 TB HDD is a Seagate.
The 500 GB disks are random brands I picked up over the years.
The 8 TB disks are all Seagate Archive drives, a mix of v1 and v2.

I think the past instability was due to one of:
bad SAS-to-SATA cables
bad power cables
a failing disk triggering a spindown on other disks by introducing noise on the 3.3 V line

I replaced plenty of the SAS-to-SATA cables and plenty of the power cables, and removed 2 disks that were problematic and one that died.
That seemed to do the trick.

There is also a chance that there is a power supply issue, but it's an 800 W supply, which should be enough.
I have a power meter on the power socket, and on boot the NAS draws at most 450 W.

The original plan was to add all the disks I had collected over the years, but some of them simply died and others were throwing all kinds of errors, so I ended up just buying these 8 TB drives new.
I bought them from all kinds of shops, slowly, over a year (whenever there was some money to spare), so there is a lower chance of many of them failing suddenly due to some batch defect.

MR. T. · Jan 8, 2018

Over the weekend I bit the bullet and did a zpool clear in an attempt to get FreeNAS to release the extra disks.
It did not work. My hope was that the data errors were holding FreeNAS back, and that clearing them would signal that I had accepted the data loss so it could move on.

I also tried to offline the two new disks but was met with an error saying there are not enough replicas.

I did not fiddle with FreeNAS from the command line before this, so it should be fairly simple to recover from a "naturally occurring" scenario, but I can't find a way of doing it.

Bumping the thread to keep it alive.

Edit: Trying to detach the missing disk says that the detach was successful, but the disk is still listed there.
