urgent issue with degraded system

Status
Not open for further replies.

sbertsch

Dabbler
Joined
May 12, 2015
Messages
10
Over the night I received a notification of a degraded system. I have done a number of searches and have spent the better part of the morning trying to find out what I need to do to correct this.

Our current system has about 9 TB on it, so moving the data is not really an option.

Here is the error we received:


Checking status of zfs pools:

NAME          SIZE   ALLOC   FREE  EXPANDSZ  FRAG  CAP  DEDUP  HEALTH    ALTROOT
filer01-pool  21.8T  13.3T  8.47T         -   38%  61%  1.00x  DEGRADED  /mnt
freenas-boot  3.78G   954M  2.85G         -     -  24%  1.00x  ONLINE    -

  pool: filer01-pool
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: none requested
config:

        NAME                                            STATE     READ WRITE CKSUM
        filer01-pool                                    DEGRADED     0     0 12.8K
          raidz1-0                                      DEGRADED     0     0 25.6K
            gptid/51077b86-0b99-11e5-aaca-0022190356c4  DEGRADED     0     0     5  too many errors
            gptid/51c32da8-0b99-11e5-aaca-0022190356c4  DEGRADED     0     0     5  too many errors
            gptid/527fb069-0b99-11e5-aaca-0022190356c4  DEGRADED     0     0    10  too many errors
            gptid/533b3346-0b99-11e5-aaca-0022190356c4  DEGRADED     0     0     4  too many errors
            gptid/53f70980-0b99-11e5-aaca-0022190356c4  DEGRADED     0     0     3  too many errors
            gptid/54b22365-0b99-11e5-aaca-0022190356c4  DEGRADED     0     0     8  too many errors

errors: 1 data errors, use '-v' for a list

-- End of daily output --


So I jumped into the host and ran the following command:

zpool status -v

to which I got similar output; however, it also shows this:

errors: Permanent errors have been detected in the following files:

        /var/db/system/samba4/locking.tdb

  pool: freenas-boot
 state: ONLINE
  scan: none requested
config:


So I started looking at Samba info, and it talks about .tdb files being backups.

Please help
Scott
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
Scott, when posting error messages, logs and command results, please post
the output with code tags. This helps maintain formatting and makes them easier to read.
e.g.:
Code:
 formatting
is
easier
to
read
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
You have 6 x 4 TB disks in a RAIDZ1 pool? Ouch. Not a good plan for your data integrity. However, something is causing all of your disks to throw errors at the same time, and that's unlikely to be the disks themselves (six disks failing simultaneously just doesn't happen). What's your hardware and FreeNAS version?
 

sbertsch

Dabbler
Joined
May 12, 2015
Messages
10
We are running 9.3-RELEASE stable on a Dell 2950 with a PERC H700. I am certain that the drives are OK, as MegaCli isn't showing any errors.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
CPU? RAM? Which FreeNAS release? How is your H700 configured? What's the output of 'camcontrol devlist'?

Rebooting your server will clear the error counts, but it won't fix whatever's causing the errors. I'd expect a bad connection somewhere--to the backplane if you're using one, or power perhaps.
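
As an aside, the error counters can also be reset without a full reboot; a minimal sketch, using the pool name from your output:

Code:
# Minimal sketch: reset the READ/WRITE/CKSUM counters for the pool.
# This is only bookkeeping; it does not fix whatever is causing the errors,
# and the counters will climb again if the underlying fault is still there.
zpool clear filer01-pool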

This thread makes it sound like the H700 does not provide a true JBOD mode which would give FreeNAS direct access to your disks. If that's the case, this is potentially fatal to your data.
 

sbertsch

Dabbler
Joined
May 12, 2015
Messages
10
32 GB of RAM and an Intel(R) Xeon(R) CPU E5405 @ 2.00GHz. The H700 is configured with each disk as its own logical volume, to let ZFS span all of the disks.


Code:
camcontrol devlist
<HGST HUS724040ALS640 A1C4>        at scbus0 target 0 lun 0 (pass0)
<HGST HUS724040ALS640 A1C4>        at scbus0 target 1 lun 0 (pass1)
<HGST HUS724040ALS640 A1C4>        at scbus0 target 2 lun 0 (pass2)
<HGST HUS724040ALS640 A1C4>        at scbus0 target 3 lun 0 (pass3)
<HGST HUS724040ALS640 A1C4>        at scbus0 target 6 lun 0 (pass4)
<HGST HUS724040ALS640 A1C4>        at scbus0 target 7 lun 0 (pass5)
<DP BACKPLANE 1.05>                at scbus0 target 32 lun 0 (pass6,ses0)
<Generic Flash Disk 8.07>          at scbus5 target 0 lun 0 (pass7,da0)
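
For reference, the gptid labels from the zpool status output can be matched to these devices; a minimal sketch, using the glabel tool that ships with FreeBSD/FreeNAS:

Code:
# Minimal sketch: map the gptid/... labels shown by 'zpool status'
# back to the partitions/devices that carry them.
glabel status
# Each output line pairs a gptid label with its underlying partition.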


So yes, you are correct. I had read that post; the H700 does not support a true JBOD mode, so we opted to make each disk its own logical volume.
 
Last edited:
Joined
Oct 2, 2014
Messages
925
So each disk is in RAID0? ....
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
32 GB of RAM and an Intel(R) Xeon(R) CPU E5405 @ 2.00GHz. The H700 is configured with each disk as its own logical volume, to let ZFS span all of the disks.


Code:
camcontrol devlist
<HGST HUS724040ALS640 A1C4>        at scbus0 target 0 lun 0 (pass0)
<HGST HUS724040ALS640 A1C4>        at scbus0 target 1 lun 0 (pass1)
<HGST HUS724040ALS640 A1C4>        at scbus0 target 2 lun 0 (pass2)
<HGST HUS724040ALS640 A1C4>        at scbus0 target 3 lun 0 (pass3)
<HGST HUS724040ALS640 A1C4>        at scbus0 target 6 lun 0 (pass4)
<HGST HUS724040ALS640 A1C4>        at scbus0 target 7 lun 0 (pass5)
<DP BACKPLANE 1.05>                at scbus0 target 32 lun 0 (pass6,ses0)
<Generic Flash Disk 8.07>          at scbus5 target 0 lun 0 (pass7,da0)


So yes, you are correct. I had read that post; the H700 does not support a true JBOD mode, so we opted to make each disk its own logical volume.
This is your problem. ZFS needs raw disk access. Your controller is doing things behind ZFS's back, which makes running ZFS largely pointless.
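
One quick sanity check is whether SMART data is readable from the individual disks, which is a rough proxy for raw access; a minimal sketch, with a placeholder device name:

Code:
# Rough sketch: see whether FreeNAS can talk to a disk directly.
# /dev/da0 is a placeholder; substitute a device from 'camcontrol devlist'.
smartctl -a /dev/da0
# When a RAID controller interposes its own logical volumes, SMART data is
# often not visible this way, which is part of the problem described above.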
 

sbertsch

Dabbler
Joined
May 12, 2015
Messages
10
Yes, each disk is a RAID 0.

So, short of spending money and rebuilding the system, is there anything I can do?
 
Joined
Oct 2, 2014
Messages
925
Yes, each disk is a RAID 0.

So, short of spending money and rebuilding the system, is there anything I can do?
You can find an M1015 on eBay for under $100 if you're in the US and cross-flash it to the P16 firmware.
 
Joined
Jun 17, 2015
Messages
4
The trouble is that this system is a swing server for an Openfiler replacement: we need to get the data onto this system, rebuild the Openfiler system, and then put the data back on the other system, which will then be running FreeNAS.
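
For the eventual move back, ZFS replication is the usual way to copy everything between the two boxes once the rebuilt system is running FreeNAS; a rough sketch, with placeholder host and pool names:

Code:
# Rough sketch: copy this pool's datasets to the rebuilt machine.
# 'newhost' and 'newpool' are placeholders for the rebuilt FreeNAS system.
zfs snapshot -r filer01-pool@migrate
zfs send -R filer01-pool@migrate | ssh root@newhost zfs receive -duF newpool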
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Yes, each disk is a RAID 0.

So, short of spending money and rebuilding the system, is there anything I can do?
People always talk about the HBA card as if it were a big expense for their system, but it is the cheapest part. I got mine for under $100, the least expensive component of my build.
 
Joined
Oct 2, 2014
Messages
925
This is unfortunately out of my comfort zone and knowledge; maybe @cyberjock can weigh in.
 
Joined
Jun 17, 2015
Messages
4
OK, so I rebooted the box tonight after hours and ran the following:

Code:
zpool status -v
  pool: filer01-pool
state: ONLINE
status: One or more devices has experienced an error resulting in data
  corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
  entire pool from backup.
  see: http://illumos.org/msg/ZFS-8000-8A
  scan: none requested
config:

  NAME  STATE  READ WRITE CKSUM
  filer01-pool  ONLINE  0  0  0
  raidz1-0  ONLINE  0  0  0
  gptid/51077b86-0b99-11e5-aaca-0022190356c4  ONLINE  0  0  0
  gptid/51c32da8-0b99-11e5-aaca-0022190356c4  ONLINE  0  0  0
  gptid/527fb069-0b99-11e5-aaca-0022190356c4  ONLINE  0  0  0
  gptid/533b3346-0b99-11e5-aaca-0022190356c4  ONLINE  0  0  0
  gptid/53f70980-0b99-11e5-aaca-0022190356c4  ONLINE  0  0  0
  gptid/54b22365-0b99-11e5-aaca-0022190356c4  ONLINE  0  0  0

errors: Permanent errors have been detected in the following files:

  /var/db/system/samba4/locking.tdb

  pool: freenas-boot
state: ONLINE
  scan: none requested
config:

  NAME  STATE  READ WRITE CKSUM
  freenas-boot  ONLINE  0  0  0
  gptid/fa552387-1152-11e5-b576-0022190356c4  ONLINE  0  0  0

errors: No known data errors


So it looks like Samba is locking this file. How would I clean up the locking.tdb file? Am I safe to stop Samba, rename it, and restart Samba?

Thanks in advance
Scott
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Rebooting probably is a bad sign of things to come. Try doing a scrub of the zpool. I have a fishy feeling things are much worse than you think.
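
A minimal sketch of kicking off the scrub and checking on it, using the pool name from your output:

Code:
# Minimal sketch: start a scrub and watch its progress.
zpool scrub filer01-pool
zpool status -v filer01-pool
# Once the scrub finishes, the 'errors:' section lists every file
# with permanent (unrecoverable) errors, not just the first one noticed.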

Normally (as in: you aren't running RAIDZ1 on top of a RAID controller, and yes, you are still using hardware RAID if you can't do JBOD), the correct action would be to stop Samba, delete the offending files, then try to start the service again. Unfortunately, since you are using hardware RAID, most of the usual assumptions go right out the window.

Read this: https://forums.freenas.org/index.php?threads/hardware-recommendations-read-this-first.23069/
What not to do...

So I've told you what stuff you should get. Here's stuff you should NOT get:

Note: this list is compiled by people that have used hardware and lost all of their data as a result. If you can't learn from someone else's screwups you probably shouldn't use FreeNAS at all.
  • Highpoint controllers (they'll work until one day when you suddenly have no data)
  • Adaptec controllers (they'll work until one day when you suddenly have no data)
  • Dell PERC RAID controllers like the 5i (if you have to do RAID0 of individual disks you have failed to follow the "do not use hardware RAID" rule... that is NOT a JBOD. End of discussion.)
  • Try to use less than 8GB of RAM (If you plan to use Plex you should have at least 12GB of RAM)
  • Hardware that has a FSB (front-side bus). The FSB will be a performance killer for ZFS
  • Hardware that is older than the Intel Sandy Bridge (older stuff burns LOTS of watts)
  • Anything that doesn't use DDR3 or newer RAM (DDR2 and FB-DIMMs are just too expensive to try to upgrade later)
Notice I already totally admonished your plan to use RAID0s of individual disks. That decision may be the single fastest way to lose your zpool, based on all the people I've seen lose theirs over the years. I'm so tired of people somehow justifying to themselves that the RAID0 thing is okay that I put that "end of discussion" in there. 99.9% of the time I won't respond to them, since they clearly do not understand what it means to be a JBOD; the chances of having any kind of recovery are nearly 0%, and the fraction of a percent that get lucky spend waaaay too much of my own time trying to unf*ck their system, so I don't play with that anymore (my time is too precious with me trying to move across the country).
 

anodos

Sambassador
iXsystems
Joined
Mar 6, 2014
Messages
9,554
OK, so I rebooted the box tonight after hours and ran the following:

Code:
zpool status -v
  pool: filer01-pool
state: ONLINE
status: One or more devices has experienced an error resulting in data
  corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
  entire pool from backup.
  see: http://illumos.org/msg/ZFS-8000-8A
  scan: none requested
config:

  NAME  STATE  READ WRITE CKSUM
  filer01-pool  ONLINE  0  0  0
  raidz1-0  ONLINE  0  0  0
  gptid/51077b86-0b99-11e5-aaca-0022190356c4  ONLINE  0  0  0
  gptid/51c32da8-0b99-11e5-aaca-0022190356c4  ONLINE  0  0  0
  gptid/527fb069-0b99-11e5-aaca-0022190356c4  ONLINE  0  0  0
  gptid/533b3346-0b99-11e5-aaca-0022190356c4  ONLINE  0  0  0
  gptid/53f70980-0b99-11e5-aaca-0022190356c4  ONLINE  0  0  0
  gptid/54b22365-0b99-11e5-aaca-0022190356c4  ONLINE  0  0  0

errors: Permanent errors have been detected in the following files:

  /var/db/system/samba4/locking.tdb

  pool: freenas-boot
state: ONLINE
  scan: none requested
config:

  NAME  STATE  READ WRITE CKSUM
  freenas-boot  ONLINE  0  0  0
  gptid/fa552387-1152-11e5-b576-0022190356c4  ONLINE  0  0  0

errors: No known data errors


So it looks like Samba is locking this file. How would I clean up the locking.tdb file? Am I safe to stop Samba, rename it, and restart Samba?

Thanks in advance
Scott

I believe you can do those steps safely, but I think it's a better idea to run a scrub first. Try to figure out if this is the tip of the iceberg. It's a good idea to make backups and get appropriate hardware ASAP.
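
For reference, the stop/rename/start sequence being discussed would look roughly like this; a sketch only, assuming FreeNAS 9.3's usual service name for Samba and the file path from your zpool output:

Code:
# Rough sketch of the idea discussed above; run it only after the scrub looks clean.
# 'samba_server' is the usual service name on FreeNAS 9.3; adjust if yours differs.
service samba_server stop
mv /var/db/system/samba4/locking.tdb /var/db/system/samba4/locking.tdb.bad
service samba_server start
# locking.tdb holds volatile locking state and is recreated when Samba starts.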
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
@cyberjock

I forced the reboot; the host didn't reboot by itself.

Thanks
Scott

That's how I took it. I'd do a scrub and find out if there is significant damage or just the one file. The fact that the system identified a bad file so quickly is a really bad sign. Someone might argue that it is a commonly used file that is written to often, so it's not illogical for it to be the one that got corrupted. The problem is that that logic falls apart against the ZFS design model: *any* irreparable corruption, at all, should basically be impossible on a properly designed and administered server. So this means you have major, major problems.
 

sbertsch

Dabbler
Joined
May 12, 2015
Messages
10
@cyberjock

You were correct; the scrub identified many issues with the system. We are in the process of trying to get back to a good state, which will include a complete rebuild of the system.

I wanted to take a moment and thank everyone who commented and assisted me with this.

Thanks again
Scott
 