50% of directories not sharing after outage...FSCK?

Status
Not open for further replies.

cyberdan2002

Cadet
Joined
May 13, 2018
Messages
6
I have tried in vain to find the answer to this issue, and I want to confirm before I do something stupid and get pointed to the 'backup your data' page. I will try to be concise, but if you need more info, let me know. If I am posting improperly, please let me know (as I am sure you will :))
  • At 2am, my system started a scrub.
  • At 2:38am, I received an email - Critical Alert - Vol MainRaid state is ONLINE: One or more devices has experienced an error resulting in data corruption.
  • At 3:01am, I received a 'freenas.local kernel log messages' email.
  • That email contained the following: arp: 10.0.1.152 moved from 20:c9:d0:93:38:a7 to 6c:70:9f:cf:ca:72 on epair0b, with the SAME line repeated about 460 times. NOTE: 10.0.1.152 is a Plex client. The FreeNAS server is at 10.0.1.74. The same message also shows up 460 times when I run dmesg.
  • 9:00am I noticed the system was non-responsive. All attempts to awaken the patient failed and a restart was required.
  • The scrub started up again (and still has a few hours to complete).
  • All seemed good with the world, HOWEVER, my main movies directory now shows only 3212 directories when SHARED VIA SMB on my Mac, while the command ls | wc -l shows 6714 directories, WHICH IS WHAT IT SHOULD BE (see the sketch after this list).
  • PLEX still has access to the full media library (I tested about two dozen of the missing files and they worked fine). Accessing everything under /MainMedia/Media/Movies/... through SSH is no issue. It appears that all the files and directories are there.
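Here is what I mean by the mismatch; the Mac-side mount point below is just a placeholder for wherever the share is mounted:

  # On the FreeNAS box itself, under the path from above:
  ls /MainMedia/Media/Movies | wc -l      # reports 6714 here
  # On the Mac, against the mounted SMB share (mount point is a placeholder):
  ls /Volumes/Movies | wc -l              # only 3212 show up this way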
What I have done (I have tried a bunch of things, but this is the only pertinent stuff):
  • I thought this might be a permissions issue or a sharing issue, so I changed the share settings and the volume share settings and let them apply to sub-folders, but no change.
  • zpool status -v shows "Permanent errors have been detected in the following files: MainRaid/Media:<0x3f5e8>", while freenas-boot shows 0 errors (see the zdb sketch after this list).
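From what I've read, zdb can sometimes map that hex object ID back to a file name. This is only a sketch of what I mean: 0x3f5e8 is 259560 in decimal, and since zpool status shows the ID in angle brackets, the object may no longer resolve to a path at all:

  # Ask zdb to dump object 0x3f5e8 (= 259560 decimal) from the affected dataset;
  # if it still exists, the output should include a "path" line with the file name
  zdb -dddd MainRaid/Media 259560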
What I am thinking next:
  • I am currently thinking this is a GPT / partition table issue, and I am looking squarely at unmounting and running fsck, but my concern is that I could either GET my data to share again or LOSE that data in the process.
  • Meanwhile, my data is still accessible through FreeNAS and I can back it up completely (although very painfully and slowly).
  • I feel like this is something that the right 'Dr.' would say 'oh, no, don't give him a shot of adrenaline, he needs a nitroglycerin' or vice versa.

While I do have most of the data backed up, it took me a LONG time to get everything together (weeks) and it would be horrific if I had to start from scratch. I really appreciate this community and all the help it has brought to me through learning from others. This is the first time I have had to ask a question and appreciate any advice, help and the time!

My Hardware:

FreeNAS 11.1-Stable, ASRock H97M Pro4, i5-4440 CPU, 16GB Ram, Marvell 88SE9230 AHCI SATA, 9x8TB WD RED in RAIDZ2, 2x80GB Intel Cache SSD
 
Joined
Jan 4, 2014
Messages
1,644
Sorry to hear of your plight. I had a similar situation several years ago: I crippled a ZFS volume during a disk replacement on a FreeNAS system. The resilver was progressing well but, because of the size of the disk being replaced, was taking a very long time to complete. During the resilver I experienced an extended power outage that my UPS was unable to cope with. For whatever reason, once power was restored, FreeNAS resumed the resilver but became confused and began to gradually corrupt the data. I spent about a week under great duress salvaging my data before I could begin rebuilding the pool and bringing the system back online. I didn't have a complete backup at the time, as I had no effective way of automating the backup of 12TB of data.

Start copying your data off while waiting for expert advice from more experienced forum members. In my case, once I had the data off, the only way I could be confident of a structurally sound volume was to rebuild it from scratch. Post-disaster, I set up replication and snapshots and haven't looked back since; it's a reliable way to back up (and recover) data from large datastores.
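On the command line, that replication setup boils down to roughly the following sketch; the backup host and target pool names here are placeholders, and the FreeNAS GUI wraps the same idea in its Periodic Snapshot and Replication tasks:

  # Take a recursive snapshot of the dataset you want protected
  zfs snapshot -r MainRaid/Media@backup-2018-05-14
  # Send it (and any child datasets) to another ZFS system over SSH
  zfs send -R MainRaid/Media@backup-2018-05-14 | ssh backup@backuphost zfs receive -uF backuppool/Media
  # Subsequent runs only send the changes between two snapshots
  zfs send -R -i MainRaid/Media@backup-2018-05-14 MainRaid/Media@backup-2018-05-21 | ssh backup@backuphost zfs receive -uF backuppool/Media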
 

DaveY

Contributor
Joined
Dec 1, 2014
Messages
141
You're screwed. Go to the 'backup your data page' now. Kidding. ;)

Wow, where to start. Let's separate the important from the less important first. Ignore the "arp: 10.0.1.152 moved from 20:c9:d0:93:38:a7 to 6c:70:9f:cf:ca:72 on epair0b" messages for now. Those tend to show up when you have bonded interfaces, so they don't necessarily mean there's a problem. It's definitely not related to your missing-files issue.

On to the more serious issue with your ZFS: can you post the entire zpool status -v output? Since you're running RAIDZ2, it may be salvageable, but it really depends on the severity of the error, and seeing the full zpool status output will help determine that.

There's no such thing as fsck in the ZFS world; scrub is your fsck equivalent. It goes through and tries to correct any errors. Why do you think it's your GPT? Unless you made changes to it recently, it's unlikely your partition tables changed. The corruption is probably more due to the hard reboot you did at 9am, but you can do a "gpart show" and make sure all the disks have the same partition table.
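That check is quick with stock FreeBSD tools:

  # Dump every disk's partition table; a damaged table is flagged CORRUPT in the output
  gpart show
  # Cross-check that the gptid labels the pool references still resolve to real partitions
  glabel status | grep gptid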

The most important thing right now is to let your scrub finish ASAP. I would shut down Plex for now, along with any other non-essential apps/jails that read from or write to the disks. Backing up your data now will also slow down your scrub, but if there are really critical files you must save, I would cherry-pick and back those up and leave the rest until the scrub is done.
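If there are a few must-save files, something along these lines keeps the extra load small; the paths and destination host here are placeholders:

  # Keep an eye on the scrub; the "scan:" line shows progress and estimated time remaining
  zpool status -v MainRaid
  # Cherry-pick only the critical directories to another machine while the scrub runs
  rsync -avh /mnt/MainRaid/Media/critical-stuff/ backup@otherbox:/backups/critical-stuff/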

What are your 2 80GB SSD used for? ZIL or L2ARC?
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Stop making changes until the scrub is done.

Sent from my SAMSUNG-SGH-I537 using Tapatalk
 

cyberdan2002

Cadet
Joined
May 13, 2018
Messages
6
Thanks, Dave, for the audible 'duh' that came out before I realized it was a joke. :)

Few things...
  • Stopped making any changes long ago, but you are right, I should have checked the scrub status before doing anything.
  • Scrub has finished with no repair or improvement.
  • 2x80GB SSDs - one ZIL, one L2ARC.
  • Checked via Plex and SSH; the files all seem to be there and working in Plex, but SMB on OSX still shows only 3212 directories.
I found some stuff on fsck, but it looks like it was for non-ZFS filesystems. Still learning; thanks for the insight. God knows I would have somehow gotten it installed and had it eat my entire filesystem. lol

zpool status -v dump:
  pool: MainRaid
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 1.07M in 0 days 21:18:30 with 1 errors on Sun May 13 23:18:31 2018
config:

        NAME                                            STATE     READ WRITE CKSUM
        MainRaid                                        ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/250fafd5-137a-11e8-995c-d05099515883  ONLINE       0     0     0
            gptid/25ade059-137a-11e8-995c-d05099515883  ONLINE       0     0     0
            gptid/26444958-137a-11e8-995c-d05099515883  ONLINE       0     0     0
            gptid/26e4386f-137a-11e8-995c-d05099515883  ONLINE       0     0     0
            gptid/277e9874-137a-11e8-995c-d05099515883  ONLINE       0     0     0
            gptid/28066e3b-137a-11e8-995c-d05099515883  ONLINE       0     0     0
            gptid/289d5643-137a-11e8-995c-d05099515883  ONLINE       0     0     0
            gptid/29413979-137a-11e8-995c-d05099515883  ONLINE       0     0     0
            gptid/29eafb84-137a-11e8-995c-d05099515883  ONLINE       0     0     0
        logs
          gptid/bb01cd85-137a-11e8-995c-d05099515883    ONLINE       0     0     0
        cache
          gptid/bd0ad8e4-137a-11e8-995c-d05099515883    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        MainRaid/Media:<0x3f5e8>

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:01:04 with 0 errors on Sun May 6 03:46:04 2018
config:

        NAME          STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          da0p2       ONLINE       0     0     0

errors: No known data errors
 

cyberdan2002

Cadet
Joined
May 13, 2018
Messages
6
I'm including this, but I see the same partitions 1 & 2 on all the 8TB drives, so I'm pretty sure this is OK unless I'm misreading it.

gpart show dump:
=>         40  120127408  da0  GPT  (57G)
           40     532480    1  efi  (260M)
       532520  119594920    2  freebsd-zfs  (57G)
    120127440          8       - free -  (4.0K)

=>          34  5860532157  da1  GPT  (2.7T)
            34        2014       - free -  (1.0M)
          2048  5860528128    1  ms-basic-data  (2.7T)
    5860530176        2015       - free -  (1.0M)

=>           40  15628053088  ada0  GPT  (7.3T)
             40           88        - free -  (44K)
            128      4194304     1  freebsd-swap  (2.0G)
        4194432  15623858688     2  freebsd-zfs  (7.3T)
    15628053120            8        - free -  (4.0K)

=>           40  15628053088  ada1  GPT  (7.3T)
             40           88        - free -  (44K)
            128      4194304     1  freebsd-swap  (2.0G)
        4194432  15623858688     2  freebsd-zfs  (7.3T)
    15628053120            8        - free -  (4.0K)

=>           40  15628053088  ada2  GPT  (7.3T)
             40           88        - free -  (44K)
            128      4194304     1  freebsd-swap  (2.0G)
        4194432  15623858688     2  freebsd-zfs  (7.3T)
    15628053120            8        - free -  (4.0K)

=>           40  15628053088  ada3  GPT  (7.3T)
             40           88        - free -  (44K)
            128      4194304     1  freebsd-swap  (2.0G)
        4194432  15623858688     2  freebsd-zfs  (7.3T)
    15628053120            8        - free -  (4.0K)

=>           40  15628053088  ada4  GPT  (7.3T)
             40           88        - free -  (44K)
            128      4194304     1  freebsd-swap  (2.0G)
        4194432  15623858688     2  freebsd-zfs  (7.3T)
    15628053120            8        - free -  (4.0K)

=>           40  15628053088  ada5  GPT  (7.3T)
             40           88        - free -  (44K)
            128      4194304     1  freebsd-swap  (2.0G)
        4194432  15623858688     2  freebsd-zfs  (7.3T)
    15628053120            8        - free -  (4.0K)

=>           40  15628053088  ada6  GPT  (7.3T)
             40           88        - free -  (44K)
            128      4194304     1  freebsd-swap  (2.0G)
        4194432  15623858688     2  freebsd-zfs  (7.3T)
    15628053120            8        - free -  (4.0K)

=>         40  156301408  ada7  GPT  (75G)
           40         88       - free -  (44K)
          128  156301312    1  freebsd-zfs  (75G)
    156301440          8       - free -  (4.0K)

=>           40  15628053088  ada8  GPT  (7.3T)
             40           88        - free -  (44K)
            128      4194304     1  freebsd-swap  (2.0G)
        4194432  15623858688     2  freebsd-zfs  (7.3T)
    15628053120            8        - free -  (4.0K)

=>         40  156301408  ada9  GPT  (75G)
           40         88       - free -  (44K)
          128  156301312    1  freebsd-zfs  (75G)
    156301440          8       - free -  (4.0K)

=>           40  15628053088  ada10  GPT  (7.3T)
             40           88         - free -  (44K)
            128      4194304      1  freebsd-swap  (2.0G)
        4194432  15623858688      2  freebsd-zfs  (7.3T)
    15628053120            8         - free -  (4.0K)
 

cyberdan2002

Cadet
Joined
May 13, 2018
Messages
6
Feeling a little like a fool, but there is no cure for the patient that doesn't speak up.

It might help to know that I had a drive go missing from the volume the other day and the pool became 'degraded'. I re-connected the drive simply by making sure it was seated in the case properly and, 'poof', ZFS showed it was healthy again. I was VERY impressed with this wizardry. No need to rebuild or anything, just all my good karma finally flowing back to me :smile:. I figured that because I hadn't written anything to the drives since it disappeared, it was able to resync it, or some other Harry Potter style magic.

No fuss, no muss. Now I'm thinking NO WAY, NO HOW, and feeling a little silly that the hairs on the back of my neck didn't stand up when it went right back to healthy. I'm pretty sure I know which drive it was, but not 100%.

I have read some destroy-and-rebuild-the-pool style answers for this, but I'm a bit nervous about destroying a pool and it not coming back (although, from everything I can see, there seems to be a reasonable chance it will).

This might sound crazy, but is there anything I can do now, such as a snapshot, that would let me get back to this point before I do anything else?
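From what I've read so far, a recursive snapshot would look something like the lines below. This is just a sketch of what I mean, and I gather it only freezes the data as it sits now; it wouldn't undo the metadata error that's already there:

  # Snapshot every dataset in the pool as a "this is where I was" marker
  zfs snapshot -r MainRaid@before-recovery
  # Confirm the snapshots were created
  zfs list -t snapshot -r MainRaid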

Thank you for the help!
 

cyberdan2002

Cadet
Joined
May 13, 2018
Messages
6
Sorry to hear of your plight. I had a similar situation several years ago. [...] Start copying your data off while waiting for expert advice from more experienced forum members. [...] Post-disaster, I set up replication and snapshots and haven't looked back since.

I hope you got most of it back. I'm going to start moving mine off here shortly... Luckily I can still get to and see all my data; it's just the shares that are messed up, so I think this could have been a lot worse.
 

cyberdan2002

Cadet
Joined
May 13, 2018
Messages
6
You're screwed. Go to the 'backup your data page' now. Kidding. ;) [...] The most important thing right now is to let your scrub finish ASAP. [...] What are your 2 80GB SSD used for? ZIL or L2ARC?

I added a few responses earlier, but I should have replied to you directly. That was literally my first forum post ever, so at least I'm 'in the mix' now :smile:
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
I'm a bit nervous about destroying a pool and it not coming back
When you destroy a pool, it won't come back. It's destroyed. "Destroy and rebuild" is followed by "...and restore from backup."

The problem is that you have a metadata error, and it's not at all clear that you actually have access to all the files your system says are there. I don't know of a good way to fix this that doesn't involve restoring from backup.
 

DaveY

Contributor
Joined
Dec 1, 2014
Messages
141
I would start backing up everything and rebuild the pool. First try copying a directory that you can see via SSH but not over SMB, and see if you can even read the data. Like danb35 said, you may just be seeing the directories but may have lost access to the files.
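A simple read test would be something like this; the path is a placeholder for one of the directories that shows over SSH but not over SMB:

  # Copy one affected directory somewhere scratch and see whether every file reads back
  cp -R "/path/to/a/missing/movie" /tmp/readtest && echo "read OK" || echo "read FAILED"
  # Or force a full read of every file in place, without copying anything
  find "/path/to/a/missing/movie" -type f -exec sha256 {} +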
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
If you have metadata errors, it's game over. Unless you happen to have a checkpoint, which you definitely don't.
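For reference, on ZFS releases that have the feature (newer than what FreeNAS 11.1 ships), a pool checkpoint is taken and rewound roughly like this; the catch is that it only helps if it was created before the damage:

  zpool checkpoint MainRaid                      # record a rewindable whole-pool state
  zpool export MainRaid
  zpool import --rewind-to-checkpoint MainRaid   # discard everything after the checkpoint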
 