50% of directories not sharing after outage...FSCK?

Status
Not open for further replies.

cyberdan2002

Cadet
Joined
May 13, 2018
Messages
6
I have tried in vain to find the answer to this issue, and I want to confirm before I do something stupid and get pointed to the 'backup your data' page. I will try to be concise, but if you need more info, let me know. If I am posting improperly, please let me know (as I am sure you will :))
  • At 2am, my system started a scrub.
  • At 2:38am, I received an email - Critical Alert - Vol MainRaid state is ONLINE: One or more devices has experienced an error resulting in data corruption.
  • At 3:01am, I received a 'freenas.local kernel log messages' email.
  • That email contained the following: arp: 10.0.1.152 moved from 20:c9:d0:93:38:a7 to 6c:70:9f:cf:ca:72 on epair0b, with the SAME line repeated about 460 times. NOTE: 10.0.1.152 is a Plex client. The FreeNAS server is at 10.0.1.74. The same message also shows up 460 times when I run dmesg.
  • 9:00am I noticed the system was non-responsive. All attempts to awaken the patient failed and a restart was required.
  • The scrub started up again (and still has a few hours to complete).
  • All seemed good with the world, HOWEVER, my main movies directory now shows only 3212 directories when SHARED VIA SMB on my Mac, while the command ls | wc -l shows 6714 directories, WHICH IS WHAT IT SHOULD BE (see the sketch after this list).
  • PLEX still has access to the full media library (I tested about two dozen of the missing files and they worked fine). Accessing everything under /MainMedia/Media/Movies/... through SSH is no issue. It appears that all the files and directories are there.
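Here is what I mean by the mismatch; the Mac-side mount point below is just a placeholder for wherever the share is mounted:

  # On the FreeNAS box itself, under the path from above:
  ls /MainMedia/Media/Movies | wc -l      # reports 6714 here
  # On the Mac, against the mounted SMB share (mount point is a placeholder):
  ls /Volumes/Movies | wc -l              # only 3212 show up this way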
What I have done (I have tried a bunch of things, but this is the only pertinent stuff):
  • I thought this might be a permissions issue or a sharing issue, so I changed the share settings and the volume share settings and let them apply to sub-folders, but no change.
  • zpool status -v shows "Permanent errors have been detected in the following files: MainRaid/Media:<0x3f5e8>", while freenas-boot shows 0 errors (see the zdb sketch after this list).
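From what I've read, zdb can sometimes map that hex object ID back to a file name. This is only a sketch of what I mean: 0x3f5e8 is 259560 in decimal, and since zpool status shows the ID in angle brackets, the object may no longer resolve to a path at all:

  # Ask zdb to dump object 0x3f5e8 (= 259560 decimal) from the affected dataset;
  # if it still exists, the output should include a "path" line with the file name
  zdb -dddd MainRaid/Media 259560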
What I am thinking next:
  • I am currently thinking this is a GPT / partition table issue, and I am looking squarely at unmounting and running fsck, but my concern is that I could either GET my data to share again or LOSE that data in the process.
  • Meanwhile, my data is still accessible through FreeNAS and I can back it up completely (although very painfully and slowly).
  • I feel like this is something that the right 'Dr.' would say 'oh, no, don't give him a shot of adrenaline, he needs a nitroglycerin' or vice versa.

While I do have most of the data backed up, it took me a LONG time to get everything together (weeks) and it would be horrific if I had to start from scratch. I really appreciate this community and all the help it has brought to me through learning from others. This is the first time I have had to ask a question and appreciate any advice, help and the time!

My Hardware:

FreeNAS 11.1-Stable, ASRock H97M Pro4, i5-4440 CPU, 16GB Ram, Marvell 88SE9230 AHCI SATA, 9x8TB WD RED in RAIDZ2, 2x80GB Intel Cache SSD
 
Joined
Jan 4, 2014
Messages
1,644
Sorry to hear of your plight. I had a similar situation several years ago: I crippled a ZFS volume during a disk replacement on a FreeNAS system. The resilver was progressing well but, because of the size of the disk being replaced, was taking a very long time to complete. During the resilver I experienced an extended power outage that my UPS was unable to cope with. For whatever reason, once power was restored, FreeNAS resumed the resilver but became confused and began to gradually corrupt the data. I spent about a week under great duress salvaging my data before I could begin rebuilding the pool and bringing the system back online. I didn't have a complete backup at the time, as I had no effective way of automating the backup of 12TB of data.

Start copying your data off while waiting for expert advice from more experienced forum members. In my case, once I had the data off, the only way I could be confident of a structurally sound volume was to rebuild it from scratch. Post-disaster, I set up replication and snapshots and haven't looked back since; it's a reliable way to back up (and recover) data from large datastores.
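On the command line, that replication setup boils down to roughly the following sketch; the backup host and target pool names here are placeholders, and the FreeNAS GUI wraps the same idea in its Periodic Snapshot and Replication tasks:

  # Take a recursive snapshot of the dataset you want protected
  zfs snapshot -r MainRaid/Media@backup-2018-05-14
  # Send it (and any child datasets) to another ZFS system over SSH
  zfs send -R MainRaid/Media@backup-2018-05-14 | ssh backup@backuphost zfs receive -uF backuppool/Media
  # Subsequent runs only send the changes between two snapshots
  zfs send -R -i MainRaid/Media@backup-2018-05-14 MainRaid/Media@backup-2018-05-21 | ssh backup@backuphost zfs receive -uF backuppool/Media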
 

DaveY

Contributor
Joined
Dec 1, 2014
Messages
141
You're screwed. Go to the 'backup your data page' now. Kidding. ;)

Wow, where to start. Let's separate the important from the less important first. Ignore the "arp: 10.0.1.152 moved from 20:c9:d0:93:38:a7 to 6c:70:9f:cf:ca:72 on epair0b" messages for now. Those tend to show up when you have bonded interfaces, so they don't necessarily mean there's a problem. It's definitely not related to your missing-files issue.

On to the more serious issue with your ZFS: can you post the entire zpool status -v output? Since you're running RAIDZ2, it may be salvageable, but it really depends on the severity of the error, and seeing the full zpool status output will help determine that.

There's no such thing as fsck in the ZFS world; scrub is your fsck equivalent. It goes through and tries to correct any errors. Why do you think it's your GPT? Unless you made changes to it recently, it's unlikely your partition tables changed. The corruption is probably more due to the hard reboot you did at 9am, but you can do a "gpart show" and make sure all the disks have the same partition table.
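That check is quick with stock FreeBSD tools:

  # Dump every disk's partition table; a damaged table is flagged CORRUPT in the output
  gpart show
  # Cross-check that the gptid labels the pool references still resolve to real partitions
  glabel status | grep gptid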

The most important thing right now is to let your scrub finish ASAP. I would shut down Plex for now, along with any other non-essential apps/jails that read from or write to the disks. Backing up your data now will also slow down your scrub, but if there are really critical files you must save, I would cherry-pick and back those up and leave the rest until the scrub is done.
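If there are a few must-save files, something along these lines keeps the extra load small; the paths and destination host here are placeholders:

  # Keep an eye on the scrub; the "scan:" line shows progress and estimated time remaining
  zpool status -v MainRaid
  # Cherry-pick only the critical directories to another machine while the scrub runs
  rsync -avh /mnt/MainRaid/Media/critical-stuff/ backup@otherbox:/backups/critical-stuff/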

What are your 2 80GB SSD used for? ZIL or L2ARC?
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Stop making changes until the scrub is done.

Sent from my SAMSUNG-SGH-I537 using Tapatalk
 

cyberdan2002

Cadet
Joined
May 13, 2018
Messages
6
Thanks, Dave, for the audible 'duh' that came out before I realized it was a joke. :)

Few things...
  • Stopped making any changes long ago, but you are right, I should have checked the scrub status before doing anything.
  • Scrub has finished with no repair or improvement.
  • 2x80GB SSDs - one ZIL, one L2ARC.
  • Checked via Plex and SSH; the files all seem to be there and working in Plex, but SMB on OSX still shows only 3212 directories.
I found some stuff on fsck, but it looks like it was for non-ZFS filesystems. Still learning; thanks for the insight. God knows I would have somehow gotten it installed and had it eat my entire filesystem. lol

zpool status -v dump:
  pool: MainRaid
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 1.07M in 0 days 21:18:30 with 1 errors on Sun May 13 23:18:31 2018
config:

        NAME                                            STATE     READ WRITE CKSUM
        MainRaid                                        ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/250fafd5-137a-11e8-995c-d05099515883  ONLINE       0     0     0
            gptid/25ade059-137a-11e8-995c-d05099515883  ONLINE       0     0     0
            gptid/26444958-137a-11e8-995c-d05099515883  ONLINE       0     0     0
            gptid/26e4386f-137a-11e8-995c-d05099515883  ONLINE       0     0     0
            gptid/277e9874-137a-11e8-995c-d05099515883  ONLINE       0     0     0
            gptid/28066e3b-137a-11e8-995c-d05099515883  ONLINE       0     0     0
            gptid/289d5643-137a-11e8-995c-d05099515883  ONLINE       0     0     0
            gptid/29413979-137a-11e8-995c-d05099515883  ONLINE       0     0     0
            gptid/29eafb84-137a-11e8-995c-d05099515883  ONLINE       0     0     0
        logs
          gptid/bb01cd85-137a-11e8-995c-d05099515883    ONLINE       0     0     0
        cache
          gptid/bd0ad8e4-137a-11e8-995c-d05099515883    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        MainRaid/Media:<0x3f5e8>

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:01:04 with 0 errors on Sun May 6 03:46:04 2018
config:

        NAME          STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          da0p2       ONLINE       0     0     0

errors: No known data errors
 

cyberdan2002

Cadet
Joined
May 13, 2018
Messages
6
I'm including this, but I see the same partitions 1 & 2 on all the 8TB drives, so I'm pretty sure this is OK unless I'm misreading it.

gpart show dump:
=>         40  120127408  da0  GPT  (57G)
           40     532480    1  efi  (260M)
       532520  119594920    2  freebsd-zfs  (57G)
    120127440          8       - free -  (4.0K)

=>          34  5860532157  da1  GPT  (2.7T)
            34        2014       - free -  (1.0M)
          2048  5860528128    1  ms-basic-data  (2.7T)
    5860530176        2015       - free -  (1.0M)

=>           40  15628053088  ada0  GPT  (7.3T)
             40           88        - free -  (44K)
            128      4194304     1  freebsd-swap  (2.0G)
        4194432  15623858688     2  freebsd-zfs  (7.3T)
    15628053120            8        - free -  (4.0K)

=>           40  15628053088  ada1  GPT  (7.3T)
             40           88        - free -  (44K)
            128      4194304     1  freebsd-swap  (2.0G)
        4194432  15623858688     2  freebsd-zfs  (7.3T)
    15628053120            8        - free -  (4.0K)

=>           40  15628053088  ada2  GPT  (7.3T)
             40           88        - free -  (44K)
            128      4194304     1  freebsd-swap  (2.0G)
        4194432  15623858688     2  freebsd-zfs  (7.3T)
    15628053120            8        - free -  (4.0K)

=>           40  15628053088  ada3  GPT  (7.3T)
             40           88        - free -  (44K)
            128      4194304     1  freebsd-swap  (2.0G)
        4194432  15623858688     2  freebsd-zfs  (7.3T)
    15628053120            8        - free -  (4.0K)

=>           40  15628053088  ada4  GPT  (7.3T)
             40           88        - free -  (44K)
            128      4194304     1  freebsd-swap  (2.0G)
        4194432  15623858688     2  freebsd-zfs  (7.3T)
    15628053120            8        - free -  (4.0K)

=>           40  15628053088  ada5  GPT  (7.3T)
             40           88        - free -  (44K)
            128      4194304     1  freebsd-swap  (2.0G)
        4194432  15623858688     2  freebsd-zfs  (7.3T)
    15628053120            8        - free -  (4.0K)

=>           40  15628053088  ada6  GPT  (7.3T)
             40           88        - free -  (44K)
            128      4194304     1  freebsd-swap  (2.0G)
        4194432  15623858688     2  freebsd-zfs  (7.3T)
    15628053120            8        - free -  (4.0K)

=>         40  156301408  ada7  GPT  (75G)
           40         88       - free -  (44K)
          128  156301312    1  freebsd-zfs  (75G)
    156301440          8       - free -  (4.0K)

=>           40  15628053088  ada8  GPT  (7.3T)
             40           88        - free -  (44K)
            128      4194304     1  freebsd-swap  (2.0G)
        4194432  15623858688     2  freebsd-zfs  (7.3T)
    15628053120            8        - free -  (4.0K)

=>         40  156301408  ada9  GPT  (75G)
           40         88       - free -  (44K)
          128  156301312    1  freebsd-zfs  (75G)
    156301440          8       - free -  (4.0K)

=>           40  15628053088  ada10  GPT  (7.3T)
             40           88         - free -  (44K)
            128      4194304      1  freebsd-swap  (2.0G)
        4194432  15623858688      2  freebsd-zfs  (7.3T)
    15628053120            8         - free -  (4.0K)
 

cyberdan2002

Cadet
Joined
May 13, 2018
Messages
6
Feeling a little like a fool, but there is no cure for the patient that doesn't speak up.

It might help to know that I had a drive go missing from the volume the other day and the pool became 'degraded'. I re-connected the drive simply by making sure it was seated in the case properly and, 'poof', ZFS showed it was healthy again. I was VERY impressed with this wizardry. No need to rebuild or anything, just all my good karma finally flowing back to me :smile:. I figured that because I hadn't written anything to the drives since it disappeared, it was able to resync it, or some other Harry Potter style magic.

No fuss, no muss. Now I'm thinking NO WAY, NO HOW, and feeling a little silly that the hairs on the back of my neck didn't stand up when it went right back to healthy. I'm pretty sure I know which drive it was, but not 100%.

I have read some destroy-and-rebuild-the-pool style answers for this, but I'm a bit nervous about destroying a pool and it not coming back (although, from everything I can see, there seems to be a reasonable chance it will).

This might sound crazy, but is there anything I can do now, such as a snapshot, that would let me get back to this point before I do anything else?
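From what I've read so far, a recursive snapshot would look something like the lines below. This is just a sketch of what I mean, and I gather it only freezes the data as it sits now; it wouldn't undo the metadata error that's already there:

  # Snapshot every dataset in the pool as a "this is where I was" marker
  zfs snapshot -r MainRaid@before-recovery
  # Confirm the snapshots were created
  zfs list -t snapshot -r MainRaid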

Thank you for the help!
 

cyberdan2002

Cadet
Joined
May 13, 2018
Messages
6
Sorry to hear of your plight. I had a similar situation several years ago. [...] Start copying your data off while waiting for expert advice from more experienced forum members. [...] Post-disaster, I set up replication and snapshots and haven't looked back since.

I hope you got most of it back. I'm going to start moving mine off here shortly... Luckily I can still get to and see all my data; it's just the shares that are messed up, so I think this could have been a lot worse.
 

cyberdan2002

Cadet
Joined
May 13, 2018
Messages
6
You're screwed. Go to the 'backup your data page' now. Kidding. ;) [...] The most important thing right now is to let your scrub finish ASAP. [...] What are your 2 80GB SSD used for? ZIL or L2ARC?

I added a few responses earlier, but I should have replied to you directly. That was literally my first forum post ever, so at least I'm 'in the mix' now :smile:
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
I'm a bit nervous about destroying a pool and it not coming back
When you destroy a pool, it won't come back. It's destroyed. "Destroy and rebuild" is followed by "...and restore from backup."

The problem is that you have a metadata error, and it's not at all clear that you actually have access to all the files your system says are there. I don't know of a good way to fix this that doesn't involve restoring from backup.
 

DaveY

Contributor
Joined
Dec 1, 2014
Messages
141
I would start backing up everything and rebuild the pool. First try copying a directory that you can see via SSH but not over SMB, and see if you can even read the data. Like danb35 said, you may just be seeing the directories but may have lost access to the files.
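A simple read test would be something like this; the path is a placeholder for one of the directories that shows over SSH but not over SMB:

  # Copy one affected directory somewhere scratch and see whether every file reads back
  cp -R "/path/to/a/missing/movie" /tmp/readtest && echo "read OK" || echo "read FAILED"
  # Or force a full read of every file in place, without copying anything
  find "/path/to/a/missing/movie" -type f -exec sha256 {} +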
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
If you have metadata errors, it's game over. Unless you happen to have a checkpoint, which you definitely don't.
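For reference, on ZFS releases that have the feature (newer than what FreeNAS 11.1 ships), a pool checkpoint is taken and rewound roughly like this; the catch is that it only helps if it was created before the damage:

  zpool checkpoint MainRaid                      # record a rewindable whole-pool state
  zpool export MainRaid
  zpool import --rewind-to-checkpoint MainRaid   # discard everything after the checkpoint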
 