urgent issue with degraded system

Status
Not open for further replies.

sbertsch

Dabbler
Joined
May 12, 2015
Messages
10
Over the night I received a notification of a degraded system. I have done a number of searches and have spent the better part of the morning trying to find out what I need to do to correct this.

Our current system has about 9 TB on it, so moving the data is not really an option.

Here is the error we received:


Checking status of zfs pools:

NAME          SIZE   ALLOC   FREE  EXPANDSZ  FRAG  CAP  DEDUP  HEALTH    ALTROOT
filer01-pool  21.8T  13.3T  8.47T         -   38%  61%  1.00x  DEGRADED  /mnt
freenas-boot  3.78G   954M  2.85G         -     -  24%  1.00x  ONLINE    -

  pool: filer01-pool
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: none requested
config:

        NAME                                            STATE     READ WRITE CKSUM
        filer01-pool                                    DEGRADED     0     0 12.8K
          raidz1-0                                      DEGRADED     0     0 25.6K
            gptid/51077b86-0b99-11e5-aaca-0022190356c4  DEGRADED     0     0     5  too many errors
            gptid/51c32da8-0b99-11e5-aaca-0022190356c4  DEGRADED     0     0     5  too many errors
            gptid/527fb069-0b99-11e5-aaca-0022190356c4  DEGRADED     0     0    10  too many errors
            gptid/533b3346-0b99-11e5-aaca-0022190356c4  DEGRADED     0     0     4  too many errors
            gptid/53f70980-0b99-11e5-aaca-0022190356c4  DEGRADED     0     0     3  too many errors
            gptid/54b22365-0b99-11e5-aaca-0022190356c4  DEGRADED     0     0     8  too many errors

errors: 1 data errors, use '-v' for a list

-- End of daily output --


So I jumped into the host and ran the following command:

zpool status -v

to which I got similar output; however, it also shows this:

errors: Permanent errors have been detected in the following files:

        /var/db/system/samba4/locking.tdb

  pool: freenas-boot
 state: ONLINE
  scan: none requested
config:


So I started looking at Samba info, and it talks about .tdb files being backups.

Please help
Scott
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
Scott, when posting error messages, logs and command results, please post
the output with code tags. This helps maintain formatting and makes them easier to read.
e.g.:
Code:
 formatting
is
easier
to
read
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
You have 6 x 4 TB disks in a RAIDZ1 pool? Ouch. Not a good plan for your data integrity. However, something is causing all of your disks to throw errors at the same time, and that's unlikely to be the disks themselves (six disks failing simultaneously just doesn't happen). What's your hardware and FreeNAS version?
 

sbertsch

Dabbler
Joined
May 12, 2015
Messages
10
We are running 9.3-RELEASE stable on a Dell 2950 with a PERC H700. I am certain that the drives are OK, as MegaCli isn't showing any errors.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
CPU? RAM? Which FreeNAS release? How is your H700 configured? What's the output of 'camcontrol devlist'?

Rebooting your server will clear the error counts, but it won't fix whatever's causing the errors. I'd expect a bad connection somewhere--to the backplane if you're using one, or power perhaps.
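
As an aside, the error counters can also be reset without a full reboot; a minimal sketch, using the pool name from your output:

Code:
# Minimal sketch: reset the READ/WRITE/CKSUM counters for the pool.
# This is only bookkeeping; it does not fix whatever is causing the errors,
# and the counters will climb again if the underlying fault is still there.
zpool clear filer01-pool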

This thread makes it sound like the H700 does not provide a true JBOD mode which would give FreeNAS direct access to your disks. If that's the case, this is potentially fatal to your data.
 

sbertsch

Dabbler
Joined
May 12, 2015
Messages
10
32 GB of RAM and an Intel(R) Xeon(R) CPU E5405 @ 2.00GHz. The H700 is configured with each disk as its own logical volume, to let ZFS span all of the disks.


Code:
camcontrol devlist
<HGST HUS724040ALS640 A1C4>        at scbus0 target 0 lun 0 (pass0)
<HGST HUS724040ALS640 A1C4>        at scbus0 target 1 lun 0 (pass1)
<HGST HUS724040ALS640 A1C4>        at scbus0 target 2 lun 0 (pass2)
<HGST HUS724040ALS640 A1C4>        at scbus0 target 3 lun 0 (pass3)
<HGST HUS724040ALS640 A1C4>        at scbus0 target 6 lun 0 (pass4)
<HGST HUS724040ALS640 A1C4>        at scbus0 target 7 lun 0 (pass5)
<DP BACKPLANE 1.05>                at scbus0 target 32 lun 0 (pass6,ses0)
<Generic Flash Disk 8.07>          at scbus5 target 0 lun 0 (pass7,da0)
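
For reference, the gptid labels from the zpool status output can be matched to these devices; a minimal sketch, using the glabel tool that ships with FreeBSD/FreeNAS:

Code:
# Minimal sketch: map the gptid/... labels shown by 'zpool status'
# back to the partitions/devices that carry them.
glabel status
# Each output line pairs a gptid label with its underlying partition.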


So yes, you are correct. I had read that post; the H700 does not support a true JBOD mode, so we opted to make each disk its own logical volume.
 
Last edited:
Joined
Oct 2, 2014
Messages
925
So each disk is in RAID0? ....
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
32 GB of RAM and an Intel(R) Xeon(R) CPU E5405 @ 2.00GHz. The H700 is configured with each disk as its own logical volume, to let ZFS span all of the disks.


Code:
camcontrol devlist
<HGST HUS724040ALS640 A1C4>        at scbus0 target 0 lun 0 (pass0)
<HGST HUS724040ALS640 A1C4>        at scbus0 target 1 lun 0 (pass1)
<HGST HUS724040ALS640 A1C4>        at scbus0 target 2 lun 0 (pass2)
<HGST HUS724040ALS640 A1C4>        at scbus0 target 3 lun 0 (pass3)
<HGST HUS724040ALS640 A1C4>        at scbus0 target 6 lun 0 (pass4)
<HGST HUS724040ALS640 A1C4>        at scbus0 target 7 lun 0 (pass5)
<DP BACKPLANE 1.05>                at scbus0 target 32 lun 0 (pass6,ses0)
<Generic Flash Disk 8.07>          at scbus5 target 0 lun 0 (pass7,da0)


So yes, you are correct. I had read that post; the H700 does not support a true JBOD mode, so we opted to make each disk its own logical volume.
This is your problem. ZFS needs raw disk access. Your controller is doing things behind ZFS's back, which makes running ZFS largely pointless.
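
One quick sanity check is whether SMART data is readable from the individual disks, which is a rough proxy for raw access; a minimal sketch, with a placeholder device name:

Code:
# Rough sketch: see whether FreeNAS can talk to a disk directly.
# /dev/da0 is a placeholder; substitute a device from 'camcontrol devlist'.
smartctl -a /dev/da0
# When a RAID controller interposes its own logical volumes, SMART data is
# often not visible this way, which is part of the problem described above.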
 

sbertsch

Dabbler
Joined
May 12, 2015
Messages
10
Yes, each disk is a RAID 0.

So, short of spending money and rebuilding the system, is there anything I can do?
 
Joined
Oct 2, 2014
Messages
925
Yes, each disk is a RAID 0.

So, short of spending money and rebuilding the system, is there anything I can do?
You can find an M1015 on eBay for under $100 if you're in the US and cross-flash it to the P16 firmware.
 
Joined
Jun 17, 2015
Messages
4
The trouble is that this system is a swing server for an Openfiler replacement: we need to get the data onto this system, rebuild the Openfiler system, and then put the data back on the other system, which will then be running FreeNAS.
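
For the eventual move back, ZFS replication is the usual way to copy everything between the two boxes once the rebuilt system is running FreeNAS; a rough sketch, with placeholder host and pool names:

Code:
# Rough sketch: copy this pool's datasets to the rebuilt machine.
# 'newhost' and 'newpool' are placeholders for the rebuilt FreeNAS system.
zfs snapshot -r filer01-pool@migrate
zfs send -R filer01-pool@migrate | ssh root@newhost zfs receive -duF newpool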
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Yes, each disk is a RAID 0.

So, short of spending money and rebuilding the system, is there anything I can do?
People always talk about the HBA card as if it were a big expense for their system, but it is the cheapest part. I got mine for under $100, the least expensive component of my build.
 
Joined
Oct 2, 2014
Messages
925
This is unfortunately out of my comfort zone and knowledge; maybe @cyberjock can weigh in.
 
Joined
Jun 17, 2015
Messages
4
OK, so I rebooted the box tonight after hours and ran the following:

Code:
zpool status -v
  pool: filer01-pool
state: ONLINE
status: One or more devices has experienced an error resulting in data
  corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
  entire pool from backup.
  see: http://illumos.org/msg/ZFS-8000-8A
  scan: none requested
config:

  NAME  STATE  READ WRITE CKSUM
  filer01-pool  ONLINE  0  0  0
  raidz1-0  ONLINE  0  0  0
  gptid/51077b86-0b99-11e5-aaca-0022190356c4  ONLINE  0  0  0
  gptid/51c32da8-0b99-11e5-aaca-0022190356c4  ONLINE  0  0  0
  gptid/527fb069-0b99-11e5-aaca-0022190356c4  ONLINE  0  0  0
  gptid/533b3346-0b99-11e5-aaca-0022190356c4  ONLINE  0  0  0
  gptid/53f70980-0b99-11e5-aaca-0022190356c4  ONLINE  0  0  0
  gptid/54b22365-0b99-11e5-aaca-0022190356c4  ONLINE  0  0  0

errors: Permanent errors have been detected in the following files:

  /var/db/system/samba4/locking.tdb

  pool: freenas-boot
state: ONLINE
  scan: none requested
config:

  NAME  STATE  READ WRITE CKSUM
  freenas-boot  ONLINE  0  0  0
  gptid/fa552387-1152-11e5-b576-0022190356c4  ONLINE  0  0  0

errors: No known data errors


So it looks like Samba is locking this file. How would I clean up the locking.tdb file? Am I safe to stop Samba, rename it, and restart Samba?

Thanks in advance
Scott
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Rebooting probably is a bad sign of things to come. Try doing a scrub of the zpool. I have a fishy feeling things are much worse than you think.
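
A minimal sketch of kicking off the scrub and checking on it, using the pool name from your output:

Code:
# Minimal sketch: start a scrub and watch its progress.
zpool scrub filer01-pool
zpool status -v filer01-pool
# Once the scrub finishes, the 'errors:' section lists every file
# with permanent (unrecoverable) errors, not just the first one noticed.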

Normally (as in: you aren't running RAIDZ1 on top of a RAID controller, and yes, you are still using hardware RAID if you can't do JBOD), the correct action would be to stop Samba, delete the offending files, then try to start the service again. Unfortunately, since you are using hardware RAID, most of the usual assumptions go right out the window.

Read this: https://forums.freenas.org/index.php?threads/hardware-recommendations-read-this-first.23069/
What not to do...

So I've told you what stuff you should get. Here's stuff you should NOT get:

Note: this list is compiled by people that have used hardware and lost all of their data as a result. If you can't learn from someone else's screwups you probably shouldn't use FreeNAS at all.
  • Highpoint controllers (they'll work until one day when you suddenly have no data)
  • Adaptec controllers (they'll work until one day when you suddenly have no data)
  • Dell PERC RAID controllers like the 5i (if you have to do RAID0 of individual disks you have failed to follow the "do not use hardware RAID" rule... that is NOT a JBOD. End of discussion.)
  • Try to use less than 8GB of RAM (If you plan to use Plex you should have at least 12GB of RAM)
  • Hardware that has a FSB (front-side bus). The FSB will be a performance killer for ZFS
  • Hardware that is older than the Intel Sandy Bridge (older stuff burns LOTS of watts)
  • Anything that doesn't use DDR3 or newer RAM (DDR2 and FB-DIMMs are just too expensive to try to upgrade later)
Notice I already totally admonished your plan to use RAID0s of individual disks. That decision may be the single fastest way to lose your zpool, based on all the people I've seen lose theirs over the years. I'm so tired of people somehow justifying to themselves that the RAID0 thing is okay that I put that "end of discussion" in there. 99.9% of the time I won't respond to them, since they clearly do not understand what it means to be a JBOD; the chances of having any kind of recovery are nearly 0%, and the fraction of a percent that get lucky spend waaaay too much of my own time trying to unf*ck their system, so I don't play with that anymore (my time is too precious with me trying to move across the country).
 

anodos

Sambassador
iXsystems
Joined
Mar 6, 2014
Messages
9,554
OK, so I rebooted the box tonight after hours and ran the following:

Code:
zpool status -v
  pool: filer01-pool
state: ONLINE
status: One or more devices has experienced an error resulting in data
  corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
  entire pool from backup.
  see: http://illumos.org/msg/ZFS-8000-8A
  scan: none requested
config:

  NAME  STATE  READ WRITE CKSUM
  filer01-pool  ONLINE  0  0  0
  raidz1-0  ONLINE  0  0  0
  gptid/51077b86-0b99-11e5-aaca-0022190356c4  ONLINE  0  0  0
  gptid/51c32da8-0b99-11e5-aaca-0022190356c4  ONLINE  0  0  0
  gptid/527fb069-0b99-11e5-aaca-0022190356c4  ONLINE  0  0  0
  gptid/533b3346-0b99-11e5-aaca-0022190356c4  ONLINE  0  0  0
  gptid/53f70980-0b99-11e5-aaca-0022190356c4  ONLINE  0  0  0
  gptid/54b22365-0b99-11e5-aaca-0022190356c4  ONLINE  0  0  0

errors: Permanent errors have been detected in the following files:

  /var/db/system/samba4/locking.tdb

  pool: freenas-boot
state: ONLINE
  scan: none requested
config:

  NAME  STATE  READ WRITE CKSUM
  freenas-boot  ONLINE  0  0  0
  gptid/fa552387-1152-11e5-b576-0022190356c4  ONLINE  0  0  0

errors: No known data errors


So it looks like Samba is locking this file. How would I clean up the locking.tdb file? Am I safe to stop Samba, rename it, and restart Samba?

Thanks in advance
Scott

I believe you can do those steps safely, but I think it's a better idea to run a scrub first. Try to figure out if this is the tip of the iceberg. It's a good idea to make backups and get appropriate hardware ASAP.
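
For reference, the stop/rename/start sequence being discussed would look roughly like this; a sketch only, assuming FreeNAS 9.3's usual service name for Samba and the file path from your zpool output:

Code:
# Rough sketch of the idea discussed above; run it only after the scrub looks clean.
# 'samba_server' is the usual service name on FreeNAS 9.3; adjust if yours differs.
service samba_server stop
mv /var/db/system/samba4/locking.tdb /var/db/system/samba4/locking.tdb.bad
service samba_server start
# locking.tdb holds volatile locking state and is recreated when Samba starts.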
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
@cyberjock

I forced the reboot; the host didn't reboot by itself.

Thanks
Scott

That's how I took it. I'd do a scrub and find out if there is significant damage or just the one file. The fact that the system identified a bad file so quickly is a really bad sign. Someone might argue that it is a commonly used file that is written to often, so it's not illogical for it to be the one that got corrupted. The problem is that that logic falls apart against the ZFS design model: *any* irreparable corruption, at all, should basically be impossible on a properly designed and administered server. So this means you have major, major problems.
 

sbertsch

Dabbler
Joined
May 12, 2015
Messages
10
@cyberjock

You were correct; the scrub identified many issues with the system. We are in the process of trying to get back to a good state, which will include a complete rebuild of the system.

I wanted to take a moment and thank everyone who commented and assisted me with this.

Thanks again
Scott
 