Disks dropping from array during scrub

Status
Not open for further replies.

rs225

Guru
Joined
Jun 28, 2014
Messages
878
It's not bizarre at all. You have multiple drives dropping out of the pool randomly all the time without the ability to reconcile with a zpool scrub and you *will* get corruption. There's nothing shocking at all. In fact, when I saw his first output I figured the pool would be done for, the question was whether it would even mount or not. :p

I also thought it would be wiped out. But seriously, I do think it is bizarre. Look at the math: He had to get 6 out of 12 drives to, in that (proximate) instant, write corrupt data for that particular metadata block. Remember, metadata is written twice, and with two vdevs, one copy goes to each vdev. His metadata corruption is absolutely astonishing, and to still have a semi-functional system? Amazing.

That's why I tend toward running the memory test on ECC RAM. It's more believable!
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I also thought it would be wiped out. But seriously, I do think it is bizarre. Look at the math: He had to get 6 out of 12 drives to, in that (proximate) instant, write corrupt data for that particular metadata block. Remember, metadata is written twice, and with two vdevs, one copy goes to each vdev. His metadata corruption is absolutely astonishing, and to still have a semi-functional system? Amazing.

That's why I tend toward running the memory test on ECC RAM. It's more believable!

The underlined is my emphasis. That's not completely true. All he needed was for the disks to get out of sync. Remember, ZFS writes data to the drive (which technically goes to the hard drive's on-disk cache). ZFS would normally also issue a command to the drive to flush the drive's write cache. BUT, this all falls apart when the disk receives the data but ZFS' flush command doesn't get performed because the disk suddenly goes offline. ZFS also *will* initiate the flush command to the other drives (which may or may not be attached still).

Next thing you know, the different drives in the vdevs (and pool) aren't in sync, and ZFS is looking for copies that aren't bad wherever it can. But once you can't get good copies, because ZFS can't figure out whether the alleged newer transactions are good or bad, things go ugly.
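The sequence described above can be sketched as a toy model. This is only an illustration, not how ZFS is implemented: the `ToyDisk` class, the `commit` helper, the disk names, and the transaction numbers are all made up for the example. It assumes a write lands in a volatile cache, only a flush makes it persistent, and a disk that drops offline loses whatever was still cached.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDisk:
    """Hypothetical disk with a volatile write cache in front of its media."""
    name: str
    online: bool = True
    cache: list = field(default_factory=list)  # written but not yet flushed
    media: list = field(default_factory=list)  # survives a drop-out

    def write(self, txg):
        if self.online:
            self.cache.append(txg)

    def flush(self):
        # Only a flush moves cached writes onto persistent media.
        if self.online:
            self.media.extend(self.cache)
            self.cache.clear()

    def drop(self):
        # Going offline loses whatever was still in the volatile cache.
        self.online = False
        self.cache.clear()

    def last_txg(self):
        return self.media[-1] if self.media else None


def commit(disks, txg, drop_before_flush=()):
    """Write one transaction group to every disk, then flush each disk."""
    for d in disks:
        d.write(txg)
    for d in disks:
        if d.name in drop_before_flush:
            d.drop()  # the disk vanishes after the write, before the flush
    for d in disks:
        d.flush()


disks = [ToyDisk(f"da{i}") for i in range(4)]
commit(disks, txg=100)
commit(disks, txg=101, drop_before_flush={"da1", "da3"})

state = {d.name: d.last_txg() for d in disks}
print(state)  # {'da0': 101, 'da1': 100, 'da2': 101, 'da3': 100}
```

In this model, half the disks end up one transaction behind the others without any disk having written bad data, which is the out-of-sync condition described above.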

You haven't been a member of the forums very long, rs225, but we've had users with out-of-sync pools that were corrected by pulling out one or two bad disks (obviously you can't pull more disks than the vdev has redundancy for).

It happens. We know it happens. It's pretty rare for people to have problems, but it's not unheard of either. ;)
 

SirMaster

Patron
Joined
Mar 19, 2014
Messages
241
What firmware version do you have on your RES2SV240?

I had some SATA signal issues with mine until I updated it to the latest v13 firmware. After that it has been working perfectly with my LSI HBA.
 

agreenfi

Cadet
Joined
Nov 23, 2014
Messages
7
Figured I should give an update since everything has been working fine for a few weeks. I put in a new power supply and reconnected all the cables, and I haven't had any more problems with disks disappearing. Other thoughts:
- I did upgrade the firmware of my M1015 to v16 (it was nice of FreeNAS 9.3 to warn me about this, as I was using an older version)
- All problems with scrub not completing went away after deleting the iSCSI file extents. I think these files are more susceptible to corruption due to not using sync writes?
- I didn't end up deleting the pool. I deleted a corrupted file, then ran 'zpool clear' to reset the error counters. No new errors over past three weeks.
- Keep an eye on the disks selected for SMART testing. If a drive drops out or gets replaced, you will have to re-select it in the FreeNAS GUI.
- If I have problems in the future, I should take the pool offline immediately. Then import it as read only if necessary for diagnostics. Is this okay to do from the shell?
- It seems like FreeNAS stores system logs on the pool, and so if the pool becomes inaccessible you may not have access to needed diagnostic information. Am I understanding this correctly, and is there some other best-practice?
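
For reference, here is a minimal sketch of what the export and read-only import mentioned above might look like from the shell. It only prints the commands and does not run them. The pool name `tank`, the `/mnt` altroot, and the helper name `readonly_import_commands` are assumptions for illustration. `zpool export`, `zpool import -o readonly=on`, and `zpool status -v` are standard zpool subcommands, but whether doing this outside the GUI is advisable on FreeNAS is not settled in this thread.

```python
import shlex


def readonly_import_commands(pool, altroot="/mnt"):
    """Return argv lists to export a pool and re-import it read-only.

    Hypothetical helper: it builds the commands but never executes them.
    """
    return [
        ["zpool", "export", pool],
        ["zpool", "import", "-o", "readonly=on", "-R", altroot, pool],
        ["zpool", "status", "-v", pool],  # lists files with known errors
    ]


commands = readonly_import_commands("tank")  # "tank" is a placeholder name
for argv in commands:
    print(shlex.join(argv))
```

Printing the commands first gives a chance to check the pool name before anything touches the disks.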
 