Scrub frozen

joeschmuck · Jul 2, 2014

I actually have to disagree for once, and that is not typical of me when it comes to instructions. I feel the wdidle3 program was provided with adequate instructions on how to format the command line, and this is typical for most programs that return some generic command-line help info. And if it doesn't make sense to you, then you should do a Google (or your favorite search engine) search for, in this case, "wdidle3 example". The problem is the user experience level. (I'm starting to sound like Cyberjock.) This type of tool was not designed for a non-technical person. I can see that the OP is on a steep learning curve and we will help out, but you need to seriously learn some basics and do a little research as well. I'm certain you will get there.

solarisguy · Jul 2, 2014

@joeschmuck, my apologies. At least I tried, that is why I said “could have included examples” and not “should have included examples”...

climb2bhi · Jul 2, 2014

Thanks,

I have found far more help here than missing information. But with these systems one missing character or space, and things come to a stop.

I have been searching, reading, watching videos, and learning. With tons of good info and help here.

But every once in a while I feel pretty lost and in over my head.

And my own fault has been to stop learning when my NAS is running smoothly.

joeschmuck · Jul 2, 2014

climb2bhi said:
And my own fault has been to stop learning when my NAS is running smoothly.

You are not the first and not the last person with this problem. I too have limited memory of many things I used to know.

climb2bhi · Jul 6, 2014

Resilvering. But it has reported errors.

Not sure what I should do about that. If the errors are on my data that is fine as I will eventually come across the errors in my data and restore that file from my backups.

But if the errors prevent the resilver, I have weeks of work restoring my NAS from optical backup discs. (7.7TB of transfers)

Code:

status: One or more devices is currently being resilvered.  The pool will                                                     
        continue to function, possibly in a degraded state.                                                                   
action: Wait for the resilver to complete.                                                                                     
  scan: resilver in progress since Sat Jul  5 22:25:30 2014                                                                   
        7.40T scanned out of 7.70T at 158M/s, 0h33m to go                                                                     
        1.23T resilvered, 96.04% done                                                                                         
config:                                                                                                                       
                                                                                                                               
        NAME                                              STATE    READ WRITE CKSUM                                           
        NAS                                              UNAVAIL      7    0    0                                           
          raidz1-0                                        UNAVAIL    24    0    0                                           
            gptid/7ea665e9-9f56-11e3-b135-003048605442    ONLINE      0    0    0                                           
            gptid/7f7af805-9f56-11e3-b135-003048605442    ONLINE      0    0    0                                           
            gptid/8051a313-9f56-11e3-b135-003048605442    ONLINE      0    0    0                                           
            gptid/81035aa0-9f56-11e3-b135-003048605442    ONLINE      0    0    0                                           
            replacing-4                                  OFFLINE      0    0    0                                           
              4840332512403835441                        OFFLINE      0    0    0  was /dev/gptid/81ef60f1-9f56-11e3-b135-003048605442
              gptid/8f5e29e8-04c5-11e4-a1c1-003048605442  ONLINE      0    0    0  (resilvering)                           
            13462019100277585825                          REMOVED      0    0    0  was /dev/gptid/82abc733-9f56-11e3-b135-003048605442
                                                                                                                               
errors: Permanent errors have been detected in the following files:                                                           
                                                                                                                               
        <metadata>:<0x0>                                                                                                       
        <metadata>:<0x1>                                                                                                       
        <metadata>:<0x61>                                                                                                     
        <metadata>:<0x8e>                                                                                                     
        <metadata>:<0xa0>                                                                                                     
        <metadata>:<0xa3>                                                                                                     
        <metadata>:<0xa4>                                                                                                     
        <metadata>:<0xa8>                                                                                                     
        NAS:<0x0>                                                                                                             
        NAS:<0x1e0af>                                                                                                         
        NAS:<0xfffffffffffffffe>                                                                                               
        NAS:<0xffffffffffffffff>                                                                                               
[root@freenas ~]#

Also, when I now go to Reporting, there is no activity on my drives. Even though this report says it is resilvering, the disk activity report (graph in the GUI) says my drives are sitting idle.
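A side note on the "0h33m to go" figure in that output: it looks like plain arithmetic on the other two scan numbers, so on its own it does not show that the resilver is still moving. A rough check in Python, assuming the T and M suffixes are binary units (1T = 1024 * 1024 M); the function name is made up for illustration:

```python
def eta_minutes(scanned_t, total_t, rate_m_per_s):
    """Estimate minutes remaining from the 'scanned out of' line.

    Assumption: T and M are binary units, so 1T = 1024 * 1024 M.
    """
    remaining_m = (total_t - scanned_t) * 1024 * 1024
    return remaining_m / rate_m_per_s / 60


# Numbers copied from the status output above:
# "7.40T scanned out of 7.70T at 158M/s, 0h33m to go"
print(int(eta_minutes(7.40, 7.70, 158)))
```

With the posted numbers this comes out at roughly 33 minutes, which matches the reported estimate. A frozen resilver could keep showing the same estimate indefinitely.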

solarisguy · Jul 6, 2014

If you do zpool status -v now (6 hours later), do you see resilvering inching ahead or finished?
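One way to make that comparison less of an eyeball job is to save the status text twice and compare the "% done" figure. A minimal sketch in Python, assuming the output has the same layout as the one posted above; both function names are hypothetical:

```python
import re


def resilver_percent(status_text):
    """Return the 'NN.NN% done' figure from zpool status text, or None if absent."""
    match = re.search(r"([\d.]+)% done", status_text)
    return float(match.group(1)) if match else None


def is_progressing(earlier_text, later_text):
    """True if the later snapshot shows a higher percentage, None if either lacks one."""
    before = resilver_percent(earlier_text)
    after = resilver_percent(later_text)
    if before is None or after is None:
        return None
    return after > before


# The line below is copied from the status output posted earlier.
snapshot = "        1.23T resilvered, 96.04% done"
print(resilver_percent(snapshot))
print(is_progressing(snapshot, snapshot))
```

A finished resilver no longer prints a "% done" line, so the helper returns None in that case rather than guessing.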

Before I go any further, do you have ECC RAM, ECC capable CPU and motherboard?

climb2bhi · Jul 6, 2014

No, it is not moving forward. It has been stuck at 96.04% done all day.

I have ECC capable CPU and motherboard, but ECC RAM is my next upgrade.

I just purchased a UPS since the power here in the mountains is unreliable, and the ECC memory will have to wait until I return to work this Fall.

Is there some way to clear the errors or scrub the volume and then retry to resilver?

solarisguy · Jul 6, 2014

I was afraid of these permanent errors. Your option is to copy everything you can to 2-3 backup sets and recreate everything anew. Do not proceed yet! Read on.

The bad news is that your errors look like the aftermath of using non-ECC memory... Shut down your FreeNAS and run an extensive memtest. This is not a solution, just a test of whether the RAM errors were transient or are permanent. I would not touch your drives (including backing them up) until you have a system with ECC.

I know of no motherboard that takes either ECC or non-ECC memory and would be able to do ECC when running with ECC memory. All the ones I know of that take either type of memory ignore ECC altogether. But I am interested in learning.

And also run S.M.A.R.T. long tests on all the drives on a regular schedule.

climb2bhi · Jul 6, 2014

Thanks,

I have Long S.M.A.R.T. tests scheduled weekly. So I had some warning of this coming but waited too long to replace the failing drive.

Scrubs scheduled weekly. Emails working.

I received some bad advice and forced a shut down during a scrub that appeared to be stuck.

If I am unable to recover this volume it will be a PITA to restore my data from optical discs, but on the bright side I will switch to raidz2 for my next go at it.

My memory passes all tests. Other than not being correct for use in servers.

ECC memory will have to wait until this fall if I want to pay my mortgage and eat this Summer. And it sounds like you are saying that I need a motherboard that only takes ECC memory?

But I will continue to play with adding my UPS and learning from this system prior to getting ECC memory, because with being unemployed I have lots of time. And I have invested so much time and money in it so far that I can't resist the temptation to poke it with a stick some more.

Eventually I will have a much more reliable system. UPS, Raidz2, and ECC memory (and a new motherboard?).

solarisguy · Jul 6, 2014

OK, good points.

Make at least one copy of your data before shutting down. It can wait; nothing happens if you do not start a scrub or write to the pool.

Your faults could be deleted files, but I would have difficulty saying either way (non-ECC or deleted files).

If you have an Intel chipset and it is not one of the C2xx series, then you do not have ECC.

cyberjock · Jul 6, 2014

climb2bhi said:
I have Long S.M.A.R.T. tests scheduled weekly. So I had some warning of this coming but waited too long to replace the failing drive.

Scrubs scheduled weekly. Emails working.

Sounds like ^^^ is 100% true. With RAIDZ1 there is no margin for error. If something goes wrong you must replace it, immediately. Not having a spare on the shelf ready for immediate replacement is just a recipe for disaster (as you have unfortunately seen first-hand).

If you look at what you posted above right next to the poolname you'll see UNAVAIL. That means that the pool was already unusable at that point so shutting down your server, rebooting, etc had no bearing on the final outcome. You were already screwed at that point.
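To pull that state column out without reading the table by eye, a small parser is enough. The sketch below is in Python and assumes the column layout of the output posted above; the sample is an abridged copy of those rows, and the function name is hypothetical:

```python
def parse_states(status_text):
    """Return (name, state) pairs from the config table of zpool status output."""
    known_states = {"ONLINE", "DEGRADED", "FAULTED", "OFFLINE", "REMOVED", "UNAVAIL"}
    rows = []
    for line in status_text.splitlines():
        fields = line.split()
        # The header row has "STATE" in the second column, so it is skipped here.
        if len(fields) >= 2 and fields[1] in known_states:
            rows.append((fields[0], fields[1]))
    return rows


# Abridged from the status output posted earlier in the thread.
sample = """\
        NAME                                              STATE    READ WRITE CKSUM
        NAS                                              UNAVAIL      7    0    0
          raidz1-0                                        UNAVAIL    24    0    0
            gptid/7ea665e9-9f56-11e3-b135-003048605442    ONLINE      0    0    0
            replacing-4                                  OFFLINE      0    0    0
            13462019100277585825                          REMOVED      0    0    0
"""

for name, state in parse_states(sample):
    if state != "ONLINE":
        print(name, state)
```

In the posted output two member slots of the raidz1 vdev are not healthy (replacing-4 and the REMOVED disk), and single-parity RAIDZ1 can only ride out one.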

Ericloewe · Jul 7, 2014

solarisguy said:
I was afraid of these permanent errors. Your option is to copy everything you can to 2-3 backup sets and recreate everything anew. Do not proceed yet! Read on.

The bad news is that your errors look like the aftermath of using non-ECC memory... Shut down your FreeNAS and run an extensive memtest. This is not a solution, just a test of whether the RAM errors were transient or are permanent. I would not touch your drives (including backing them up) until you have a system with ECC.

I know of no motherboard that takes either ECC or non-ECC memory and would be able to do ECC when running with ECC memory. All the ones I know of that take either type of memory ignore ECC altogether. But I am interested in learning.

And also run S.M.A.R.T. long tests on all the drives on a regular schedule.

I'm fairly certain any C22X board will happily work with non-ECC memory and still use ECC when available. Supermicro X10s, for instance, claim compatibility with both types.

joeschmuck · Jul 7, 2014

My MB also accepts and runs both Non-ECC and ECC RAM, selectable in the BIOS.

@climb2bhi, I didn't go back to the beginning of this thread, but if you haven't posted your hardware configuration, you should. That won't help your current issue, but with respect to the ECC RAM thing, it will help us give you proper advice.

climb2bhi · Jul 7, 2014

My System is:

FreeNAS 9.2.1.3, Super Micro H8SMAi-2, Dual-Core AMD Opteron Processor 1218 HE, 8GB (4x2GB) DDR2 800 Non ECC RAM, APC RS 1000 UPS, 8GB USB stick, Six green WD30EZRX 3TB each (18TB total).
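For the planned move from RAIDZ1 to RAIDZ2 on these six 3TB disks, the capacity trade-off can be estimated with simple arithmetic. This is a rough Python sketch that counts only data disks times disk size; it ignores ZFS metadata, padding and the TB versus TiB difference, and the function name is made up for illustration:

```python
def raidz_usable_tb(disks, disk_tb, parity):
    """Rough usable capacity: (disks - parity) * disk size, ignoring ZFS overhead."""
    if parity >= disks:
        raise ValueError("need more disks than parity devices")
    return (disks - parity) * disk_tb


print(raidz_usable_tb(6, 3, 1))  # six 3TB disks as raidz1
print(raidz_usable_tb(6, 3, 2))  # the same disks as raidz2
```

By this estimate RAIDZ2 gives up about one disk's worth of space (12TB instead of 15TB) in exchange for surviving a second disk failure.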

One thing I noticed is that since running Wdidle3 my system has renumbered all of my drives. For instance I was replacing ada1p2 and resilvering a new drive. But now my system shows ada1p2 as online and ada3p2 is the missing number. This is in GUI>Storage>Volume status. My current status report:

Code:

[root@freenas ~]# zpool status                                                                                                   
  pool: NAS                                                                                                                     
state: DEGRADED                                                                                                                 
status: One or more devices has experienced an error resulting in data                                                           
        corruption.  Applications may be affected.                                                                               
action: Restore the file in question if possible.  Otherwise restore the                                                         
        entire pool from backup.                                                                                                 
  see: http://illumos.org/msg/ZFS-8000-8A                                                                                       
  scan: resilvered 1.24T in 24h46m with 1189114 errors on Sun Jul  6 23:11:33 2014                                               
config:                                                                                                                         
                                                                                                                                 
        NAME                                              STATE    READ WRITE CKSUM                                             
        NAS                                              DEGRADED    0    0 1.14M                                             
          raidz1-0                                        DEGRADED    0    0 2.29M                                             
            gptid/7ea665e9-9f56-11e3-b135-003048605442    ONLINE      0    0    0                                             
            gptid/7f7af805-9f56-11e3-b135-003048605442    ONLINE      0    0    0                                             
            gptid/8051a313-9f56-11e3-b135-003048605442    ONLINE      0    0    0                                             
            gptid/81035aa0-9f56-11e3-b135-003048605442    ONLINE      0    0    0                                             
            replacing-4                                  DEGRADED    0    0    0                                             
              4840332512403835441                        OFFLINE      0    0    0  was /dev/gptid/81ef60f1-9f56-11e3-b135-003048605442
              gptid/8f5e29e8-04c5-11e4-a1c1-003048605442  ONLINE      0    0    0                                             
            13462019100277585825                          REMOVED      0    0    0  was /dev/gptid/82abc733-9f56-11e3-b135-003048605442
                                                                                                                                 
errors: 1189114 data errors, use '-v' for a list                                                                                 
[root@freenas ~]#

cyberjock · Jul 7, 2014

That has nothing to do with the adaXX numbers and has to do with the arrangement of your physical connections from the disks to your SATA/SAS controller. I'm guessing you pulled all your disks out and put them back in in a different order. ;) There are no "consequences" to having them out of order and ZFS will still identify what goes where and work with it appropriately.

climb2bhi · Jul 7, 2014

Good to know. But no, I did not pull any of them except the one being replaced. I did run Wdidle3 on them though.

Ericloewe · Jul 7, 2014

climb2bhi said:
Good to know. But no, I did not pull any of them except the one being replaced. I did run Wdidle3 on them though.

My extremely limited experience suggests that FreeNAS shifts the "old" disks so they are sequentially numbered (if you have ada0-5 and remove 3, 4 becomes 3 and 5 becomes 4) and numbers from there.

Wdidle shouldn't affect that (unless the drives disconnect during the process, which is always possible).
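The renumbering described above can be sketched as a toy model in Python. It only illustrates the "remaining disks close ranks" behavior from this post; it is not how the OS actually probes controllers and ports, and the function name is hypothetical:

```python
def renumber(devices, removed):
    """Map old device names to new ones after one disk disappears.

    Assumption: the remaining disks are simply renumbered sequentially from 0.
    """
    remaining = [d for d in devices if d != removed]
    return {old: f"ada{i}" for i, old in enumerate(remaining)}


disks = [f"ada{i}" for i in range(6)]  # ada0 .. ada5
for old, new in renumber(disks, "ada3").items():
    if old != new:
        print(old, "->", new)
```

Since the pool members are listed by gptid rather than by adaX name in the status output, a shift like this changes what the GUI shows but not which disk ZFS believes is which.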

climb2bhi · Jul 7, 2014

That would explain the numbering. And it is good to know that ZFS will still identify what goes where and work with it.

Although now I'm detaching the volume and destroying my data, so it looks like it is time to start all over. :(

climb2bhi · Jul 7, 2014

Volume detached, data destroyed. New Volume created successfully, with a new share for it. Everything seems good except for this:

Code:

Jul  7 14:43:56 freenas kernel: (ada3:ata5:0:0:0): READ_DMA. ACB: c8 00 a0 02 40 40 00 00 00 00 e0 00
Jul  7 14:43:56 freenas kernel: (ada3:ata5:0:0:0): CAM status: ATA Status Error
Jul  7 14:43:56 freenas kernel: (ada3:ata5:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 40 (UNC )
Jul  7 14:43:56 freenas kernel: (ada3:ata5:0:0:0): RES: 51 40 08 03 40 00 00 00 00 6f 00
Jul  7 14:43:56 freenas kernel: (ada3:ata5:0:0:0): Retrying command
Jul  7 14:43:59 freenas kernel: (ada3:ata5:0:0:0): READ_DMA. ACB: c8 00 a0 02 40 40 00 00 00 00 e0 00
Jul  7 14:43:59 freenas kernel: (ada3:ata5:0:0:0): CAM status: ATA Status Error
Jul  7 14:43:59 freenas kernel: (ada3:ata5:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 40 (UNC )
Jul  7 14:43:59 freenas kernel: (ada3:ata5:0:0:0): RES: 51 40 08 03 40 00 00 00 00 6f 00
Jul  7 14:43:59 freenas kernel: (ada3:ata5:0:0:0): Error 5, Retries exhausted
Jul  7 14:44:00 freenas kernel: Lost cambio swi (2)!!!
Jul  7 14:44:01 freenas kernel: Lost cambio swi (3)!!!
Jul  7 14:44:09 freenas kernel: (ada3:ata5:0:0:0): READ_DMA. ACB: c8 00 a0 02 40 40 00 00 00 00 e0 00
Jul  7 14:44:09 freenas kernel: (ada3:ata5:0:0:0): CAM status: ATA Status Error
Jul  7 14:44:09 freenas kernel: (ada3:ata5:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 40 (UNC )
Jul  7 14:44:09 freenas kernel: (ada3:ata5:0:0:0): RES: 51 40 08 03 40 00 00 00 00 6f 00
Jul  7 14:44:09 freenas kernel: (ada3:ata5:0:0:0): Retrying command

It keeps doing this over and over.
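When a log repeats like this, it can help to count the errors per device before deciding what to swap. A minimal Python sketch, assuming the kernel message format shown above; the function name is hypothetical, the sample lines are copied from the log, and UNC is taken here to mean an uncorrectable read error:

```python
import re
from collections import Counter

# Matches "(ada3:ata5:0:0:0): message" and captures the device and the message.
LINE_RE = re.compile(r"\((ada\d+):[^)]*\): (.*)")


def summarize(log_text):
    """Count UNC status reports and exhausted retries per device."""
    unc = Counter()
    exhausted = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if not match:
            continue  # e.g. the "Lost cambio swi" lines name no device
        device, message = match.groups()
        if message.startswith("ATA status:") and "UNC" in message:
            unc[device] += 1
        elif "Retries exhausted" in message:
            exhausted[device] += 1
    return unc, exhausted


sample = """\
Jul  7 14:43:56 freenas kernel: (ada3:ata5:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 40 (UNC )
Jul  7 14:43:59 freenas kernel: (ada3:ata5:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 40 (UNC )
Jul  7 14:43:59 freenas kernel: (ada3:ata5:0:0:0): Error 5, Retries exhausted
Jul  7 14:44:00 freenas kernel: Lost cambio swi (2)!!!
"""

unc_counts, exhausted_counts = summarize(sample)
print(dict(unc_counts), dict(exhausted_counts))
```

In the log posted above every device-tagged line names ada3, so a tally like this would point at that one disk or its cable and port rather than at the whole controller.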

solarisguy · Jul 7, 2014

It could be either

a physical failure of a cable
cable disconnect (a locking variety of cables is recommended)
SATA port or card failure
(unlikely) a disk electronics misbehaving

or

your SATA card is not compatible
your SATA card is too fast or does not support IOMMU
