Array keeps resilvering itself

Status
Not open for further replies.

Th3RadMan

Explorer
Joined
Aug 26, 2017
Messages
76
I had a drive in my array fail, and I had a ton of issues after I got it replaced. Luckily I somehow managed not to lose my data, but now I have a new issue: my system keeps resilvering over and over. I have no idea why and no idea how to fix it. Any ideas?

Specs:
Xeon 2670
16gb ecc ram
lsi 9211-8i IT mode
5 WD 3tb Red drives
 

Th3RadMan

Explorer
Joined
Aug 26, 2017
Messages
76
pool: Media
state: ONLINE
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Sun Sep 17 11:50:09 2017
1.09G scanned out of 1017G at 11.2M/s, 25h41m to go
214M resilvered, 0.11% done
config:

NAME STATE READ WRITE CKSUM
Media ONLINE 0 0 14
raidz1-0 ONLINE 0 0 28
gptid/98921532-8af5-11e7-beee-00e05b880d05 ONLINE 0 0 0 (resilvering)
gptid/9960ad7c-8af5-11e7-beee-00e05b880d05 ONLINE 0 0 0
gptid/9a41eeb0-8af5-11e7-beee-00e05b880d05 ONLINE 0 0 0
gptid/230929e7-902b-11e7-b170-00e05b880d05 ONLINE 0 0 0
gptid/922e7fac-9a93-11e7-958d-00e05b880d05 ONLINE 0 0 0 (resilvering)

errors: 1 data errors, use '-v' for a list
 

m0nkey_

MVP
Joined
Oct 27, 2015
Messages
2,739
You have two drives in a RAIDZ1 resilvering. I'm surprised it's still intact. I strongly recommend you back up your data if you have not already done so.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,079
pool: Media
state: ONLINE
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Sun Sep 17 11:50:09 2017
1.09G scanned out of 1017G at 11.2M/s, 25h41m to go
214M resilvered, 0.11% done
config:

NAME STATE READ WRITE CKSUM
Media ONLINE 0 0 14
raidz1-0 ONLINE 0 0 28
gptid/98921532-8af5-11e7-beee-00e05b880d05 ONLINE 0 0 0 (resilvering)
gptid/9960ad7c-8af5-11e7-beee-00e05b880d05 ONLINE 0 0 0
gptid/9a41eeb0-8af5-11e7-beee-00e05b880d05 ONLINE 0 0 0
gptid/230929e7-902b-11e7-b170-00e05b880d05 ONLINE 0 0 0
gptid/922e7fac-9a93-11e7-958d-00e05b880d05 ONLINE 0 0 0 (resilvering)

errors: 1 data errors, use '-v' for a list
Since it is a RAIDz1 pool and two drives are resilvering, you are lucky that you still have access to your data at all.
Do you have a backup? If not, you had better make one quickly.
You really need to destroy this pool and recreate it as a RAIDz2 pool. Before creating the new pool, these drives need to be thoroughly tested because you probably have another drive (or two) with a hardware fault that you have not detected.
It is the recommendation (for cause) that any pool using drives larger than 1TB should be RAIDz2 or better. RAIDz1 is only suitable for drives 1TB or smaller and even then, I wouldn't recommend it.
I would suggest that you go to six drives in RAIDz2 for your pool. Having only one drive of redundancy is part of the problem here.
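
For reference, the devices with non-zero checksum counts in a status listing like the one quoted above can be picked out with a small filter. This is only a sketch: the function name cksum_errors is made up for illustration, and it assumes the counters are printed as plain integers (zpool may abbreviate large counts, and those lines would be skipped here).

```shell
# Hypothetical helper: print "name count" for every device whose CKSUM
# column (5th field) is a non-zero integer. Reads `zpool status` on stdin.
cksum_errors() {
    awk '$2 ~ /^[A-Z]+$/ && $5 ~ /^[0-9]+$/ && $5 + 0 > 0 { print $1, $5 }'
}
```

It would be used as zpool status Media | cksum_errors. To see which files are affected by the reported data error, zpool status -v Media prints the list, as the status output itself suggests.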
 

Th3RadMan

Explorer
Joined
Aug 26, 2017
Messages
76
What would be the best way to test all the drives? I have a separate system I can use, and one more drive has been ordered; I'm going to pick it up in a couple of hours.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,079
What would be the best way to test all the drives? I have a separate system I can use, and one more drive has been ordered; I'm going to pick it up in a couple of hours.
Have you used smartctl to test your drives before? Are you aware of how to SSH in to a command prompt on the server?
The problem with trying to do this through the web interface is the scrolling of the data out of view. If you SSH into the server with a utility that allows scrolling back to see what scrolled out of view, you can run a command like camcontrol devlist
Which gives you output like this:
Code:
root@Irene-NAS:~ # camcontrol devlist
<ATA ST2000DM001-1ER1 CC25>       at scbus0 target 0 lun 0 (pass0,da0)
<ATA ST2000DM001-1ER1 CC25>       at scbus0 target 3 lun 0 (pass1,da1)
<ATA ST4000DM000-1F21 CC54>       at scbus0 target 10 lun 0 (pass3,da3)
<ATA ST4000DM000-1F21 CC54>       at scbus0 target 11 lun 0 (pass4,da4)
<ATA ST4000DM000-1F21 CC54>       at scbus0 target 12 lun 0 (pass5,da5)
<ATA ST2000DM001-1ER1 CC27>       at scbus0 target 13 lun 0 (da2,pass2)
<ATA ST2000DM001-1ER1 CC25>       at scbus1 target 7 lun 0 (pass6,da6)
<ATA ST2000DM001-1ER1 CC25>       at scbus1 target 8 lun 0 (pass7,da7)
<ATA ST2000DM001-1ER1 CC25>       at scbus1 target 9 lun 0 (pass8,da8)
<ATA ST4000DM000-1F21 CC54>       at scbus1 target 10 lun 0 (pass9,da9)
<ATA ST4000DM000-1F21 CC54>       at scbus1 target 11 lun 0 (pass10,da10)
<ATA ST4000DM000-1F21 CC54>       at scbus1 target 12 lun 0 (pass11,da11)
<ST5000DM000-1FK178 CC49>         at scbus2 target 0 lun 0 (ada2,pass15)
<ST5000DM000-1FK178 CC49>         at scbus2 target 1 lun 0 (ada4,pass17)
<ST5000DM000-1FK178 CC49>         at scbus2 target 2 lun 0 (ada3,pass16)
<ST5000DM000-1FK178 CC49>         at scbus2 target 3 lun 0 (ada5,pass18)
<Port Multiplier 575f197b 000e>   at scbus2 target 15 lun 0 (pmp0,pass14)
<FUJITSU MHW2040BS 00000012>      at scbus8 target 0 lun 0 (pass12,ada0)
<FUJITSU MHW2040BS 00000012>      at scbus9 target 0 lun 0 (pass13,ada1)

In the output above, the part that you are looking for is where it says something like ada5 or da9 as that will tell you how to specify the drive for testing.
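
If you would rather not pick the names out by eye, something like the following could pull them from the listing. Treat it as a rough sketch: extract_disks is a made-up helper name, and it assumes the disk names always look like daN or adaN inside the trailing parentheses, as they do in the output above.

```shell
# Hypothetical helper: read `camcontrol devlist` output on stdin and print
# one disk name (da0, ada5, ...) per line. The passN and pmpN entries are
# dropped because they are not names you would hand to smartctl.
extract_disks() {
    sed -n 's/.*(\(.*\))[[:space:]]*$/\1/p' \
        | tr ',' '\n' \
        | grep -E '^a?da[0-9]+$'
}
```

Usage would be camcontrol devlist | extract_disks. The order of the two names inside the parentheses varies (compare da2,pass2 with pass0,da0), which is why the sketch splits on the comma instead of taking a fixed position.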

You can then use a command like smartctl -a /dev/da3 to get the diagnostic output from the drive. You can post it up here for us to look at if you don't know what it all means. Where it says da3 you would substitute the designation of the drive you want to examine or test
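
The full report is long, so here is a hedged sketch of a filter for a few attributes that are commonly checked first. The helper name smart_key_attrs and the choice of attributes are mine, not something smartctl defines, and it assumes the usual ATA attribute table where the name is the second column and the raw value is the tenth.

```shell
# Hypothetical helper: read `smartctl -a` output on stdin and print
# NAME=RAW_VALUE for a handful of attributes often tied to failing disks
# (the first three) or to cabling problems (UDMA_CRC_Error_Count).
smart_key_attrs() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count)$/ {
        print $2 "=" $10
    }'
}
```

For example: smartctl -a /dev/da3 | smart_key_attrs. Non-zero raw values there are worth posting along with the full report; not every drive model reports all four attributes.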

If you have not run tests on your drives yet, run a long test on each one. A short test will sometimes pass on a drive that fails the long test, so the long test is the better indication of whether the drive really has a fault. Use this command to initiate a test: smartctl -t long /dev/ada1, substituting your drive designation for the ada1 part.

I would start by testing the two drives that are trying to resilver and looking at the output of the report to find out what their problem is.
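
The pool lists its members by gptid label, while smartctl wants a name like da3. On FreeBSD, glabel status shows which partition each label sits on, so a lookup along these lines might help. It is a sketch only: gptid_to_disk is a made-up name, and it assumes the three-column Name / Status / Components layout with a partition suffix such as p2 on the component.

```shell
# Hypothetical helper: read `glabel status` output on stdin and print the
# disk behind the given gptid label, with the partition suffix (p1, p2, ...)
# stripped so the result can be passed to smartctl.
gptid_to_disk() {
    awk -v id="$1" '$1 == id { sub(/p[0-9]+$/, "", $3); print $3 }'
}
```

Usage: glabel status | gptid_to_disk gptid/98921532-8af5-11e7-beee-00e05b880d05, then run smartctl against whatever device name it prints.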
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
28 checksum errors sounds like bad hardware. Could be cables, controller, or enclosure. But first thing to check is your RAM. Run a memtest.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Random restarts are a very bad sign. Your first priority is to ensure your backups are as up to date as possible, since that pool is in very poor shape.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,079
1.09G scanned out of 1017G at 11.2M/s, 25h41m to go

I just noticed this. That scanning speed, 11.2M/s, is that the usual speed you get?
I would have expected a larger number?
What kind of controller are these drives connected to?
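
As a sanity check on those numbers, the remaining time can be worked out from the amount left to scan and the scan rate. A minimal sketch, assuming the G and M in the status line are binary units (1G = 1024M); the function name resilver_eta_hours is made up.

```shell
# Hypothetical helper: estimate remaining resilver time in hours.
# Arguments: total size in G, amount already scanned in G, scan rate in M/s.
resilver_eta_hours() {
    awk -v total="$1" -v scanned="$2" -v rate="$3" \
        'BEGIN { printf "%.1f\n", (total - scanned) * 1024 / rate / 3600 }'
}
```

resilver_eta_hours 1017 1.09 11.2 gives about 25.8 hours, in line with the 25h41m that zpool reported, so the long estimate follows directly from the low scan rate.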
 

Th3RadMan

Explorer
Joined
Aug 26, 2017
Messages
76
Going to long test all drives tonight, and run memtest. I only know the write speed, which maxes out at about 60 MB/s on a good day, using an LSI 9211-8i in IT mode. I always thought that was a bit slow.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,079
I only know the write speed, which maxes out at about 60 MB/s on a good day, using an LSI 9211-8i in IT mode. I always thought that was a bit slow.
That does sound a little slow. How odd.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Going to long test all drives tonight, and run memtest. I only know the write speed, which maxes out at about 60 MB/s on a good day, using an LSI 9211-8i in IT mode. I always thought that was a bit slow.

Could be caused by one drive being a laggard, often because of 'slow' blocks, i.e. blocks that are not quite faulty but are hard to read. That can be a sign of a failing drive.

solnet array tester is good for checking that all drives are relatively equal in performance
 

Th3RadMan

Explorer
Joined
Aug 26, 2017
Messages
76
Well, I destroyed the old volume, checked the RAM (no errors), and rebuilt with another drive in Z2. I'm hitting an average of 100 MB/s write (almost saturating my network?). Looks like everything is going smoothly. I'll SMART test all the drives tonight as I sleep. I'm assuming you can test them all at once, just having to execute the tests individually?
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,079
I'm assuming you can test them all at once, just having to execute the tests individually?
Yes, you can test them all at once.
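
The self-test runs inside each drive, and smartctl -t long returns as soon as the test has been started, so a plain loop is enough to kick them all off. A rough sketch, with made-up names: start_long_tests is a hypothetical helper, and the SMARTCTL variable exists only so the command can be swapped out for a dry run.

```shell
# Command to run; override SMARTCTL (e.g. SMARTCTL=echo) for a dry run.
SMARTCTL=${SMARTCTL:-smartctl}

# Hypothetical helper: start a long self-test on every disk named as an
# argument (da0, ada1, ...). Each call returns once the drive accepts the
# test, so the tests themselves run in parallel on the drives.
start_long_tests() {
    for disk in "$@"; do
        "$SMARTCTL" -t long "/dev/$disk"
    done
}
```

For example: start_long_tests da0 da1 da2 da3 da4 da5, using your own drive designations. Once the tests have had time to finish, smartctl -l selftest /dev/da0 shows the result log for a drive.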
 