Disks go offline after upgrading from 8.3.1 to 9.2

Status
Not open for further replies.

naa41

Dabbler
Joined
Jan 28, 2014
Messages
18
Sorry for the newbie question. While I was running FreeNAS 8.3 I never ran a scrub. Sometimes a disk would disappear and then come back after a restart, and sometimes it was a different disk that disappeared. I think one disk went offline before I upgraded to 9.2.

After the upgrade to 9.2 completed, a red alert warned that I had lost two disks. (My configuration is 8 disks in RAID-Z2, 15 TB total.) I ran zpool status and it replied like this:

Code:
[root@freenas ~]# zpool status
pool: Punya_NAS
state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a degraded state.
action: Online the device using 'zpool online' or replace the device with 'zpool replace'.
scan: resilvered 579M in 0h1m with 0 errors on Wed Jan 29 21:42:33 2014
config:
        NAME                                            STATE     READ WRITE CKSUM
        Punya_NAS                                       DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/5321c047-d64b-11e2-9807-0015173e3e75  ONLINE       0     0     0
            gptid/537be8a4-d64b-11e2-9807-0015173e3e75  ONLINE       0     0     0
            gptid/53d7a95a-d64b-11e2-9807-0015173e3e75  ONLINE       0     0     0
            gptid/54326bf4-d64b-11e2-9807-0015173e3e75  ONLINE       0     0     0
            8558738958178464050                         REMOVED      0     0     0  was /dev/gptid/548c8437-d64b-11e2-9807-0015173e3e75
            13578628206157475055                        REMOVED      0     0     0  was /dev/gptid/54e82bfc-d64b-11e2-9807-0015173e3e75
            gptid/5543a681-d64b-11e2-9807-0015173e3e75  ONLINE       0     0     0
            gptid/559fdcf2-d64b-11e2-9807-0015173e3e75  ONLINE       0     0     0
errors: No known data errors
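
With eight members in the pool, it can help to filter the status output down to the rows that are not ONLINE. The following is only a minimal sketch: the function name list_unhealthy is made up for illustration, and it assumes the usual five-column NAME / STATE / READ / WRITE / CKSUM layout of zpool status.

```shell
# list_unhealthy (hypothetical helper): read `zpool status` text on stdin and
# print "name state" for every config row whose STATE is not ONLINE.
# The pool row and the raidz2 row are printed too when they are DEGRADED.
list_unhealthy() {
    awk 'NF >= 5 &&
         $2 ~ /^(DEGRADED|FAULTED|OFFLINE|REMOVED|UNAVAIL)$/ &&
         $3 ~ /^[0-9]/ { print $1, $2 }'
}
```

On the NAS it would be used as `zpool status Punya_NAS | list_unhealthy`. The `NF >= 5` and numeric READ checks are there so that the summary line "state: DEGRADED" is not mistaken for a device row.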


I'm not sure whether my disks have failed, but there is no "Replace Disk" button in the GUI.
Any advice would be appreciated.
Naa :)

Edit: I later found the "Replace" button on the volume status page.
 

naa41

So I will try "zpool online", then scrub, and report back. I guess my disks have not really failed.
 

naa41

Continuing...
Code:
[root@freenas ~]# zpool online -e Punya_NAS gptid/548c8437-d64b-11e2-9807-0015173e3e75
warning: device 'gptid/548c8437-d64b-11e2-9807-0015173e3e75' onlined, but remains in faulted state
use 'zpool replace' to replace devices that are no longer present

So I will try "zpool replace".
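
For reference, the order of operations can be written down as a dry run before touching the pool. This is a sketch under assumptions: print_recovery_plan is a made-up helper that only prints commands and runs nothing, and on FreeNAS the "Replace" button on the volume status page is probably the safer way to do the last step. Note that the -e flag used above asks ZFS to expand the device to use all available space, so it should not be needed just to bring a member back.

```shell
# print_recovery_plan POOL MEMBER (hypothetical helper): PRINT the commands
# in the order one might try them. Nothing is executed.
print_recovery_plan() {
    pool=$1
    member=$2
    # Step 1: ask ZFS to bring the existing member back online.
    echo "zpool online $pool $member"
    # Step 2: check what state the member ended up in.
    echo "zpool status $pool"
    # Step 3: only if it stays faulted, replace it (same device, in place).
    echo "zpool replace $pool $member"
}
```

For example, `print_recovery_plan Punya_NAS gptid/548c8437-d64b-11e2-9807-0015173e3e75` prints the three commands for the first missing member.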
 

naa41

Before running "zpool replace" I wiped ada4 and rebooted. After that, all disks came back online, but with some CKSUM errors. At least the warning button changed from red to yellow.
Code:
[root@freenas ~]# zpool status                                                                                                 
  pool: Punya_NAS                                                                                                             
state: ONLINE                                                                                                                 
status: One or more devices has experienced an unrecoverable error.  An                                                       
        attempt was made to correct the error.  Applications are unaffected.                                                   
action: Determine if the device needs to be replaced, and clear the errors                                                     
        using 'zpool clear' or replace the device with 'zpool replace'.                                                       
  see: http://illumos.org/msg/ZFS-8000-9P                                                                                     
  scan: resilvered 200K in 0h0m with 0 errors on Thu Jan 30 02:52:21 2014                                                     
config:                                                                                                                       
                                                                                                                               
        NAME                                            STATE    READ WRITE CKSUM                                             
        Punya_NAS                                      ONLINE      0    0    0                                             
          raidz2-0                                      ONLINE      0    0    0                                             
            gptid/5321c047-d64b-11e2-9807-0015173e3e75  ONLINE      0    0    0                                             
            gptid/537be8a4-d64b-11e2-9807-0015173e3e75  ONLINE      0    0    0                                             
            gptid/53d7a95a-d64b-11e2-9807-0015173e3e75  ONLINE      0    0    0                                             
            gptid/54326bf4-d64b-11e2-9807-0015173e3e75  ONLINE      0    0    0                                             
            gptid/548c8437-d64b-11e2-9807-0015173e3e75  ONLINE      0    0    2                                             
            gptid/54e82bfc-d64b-11e2-9807-0015173e3e75  ONLINE      0    0    2                                             
            gptid/5543a681-d64b-11e2-9807-0015173e3e75  ONLINE      0    0    0                                             
            gptid/559fdcf2-d64b-11e2-9807-0015173e3e75  ONLINE      0    0    0                                             
                                                                                                                               
errors: No known data errors

What should I do to clear the CKSUM errors?
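
To keep track of which members are accumulating checksum errors between runs, the CKSUM column can be extracted from the status text. A minimal sketch, assuming the five-column layout shown above; cksum_errors is a made-up name.

```shell
# cksum_errors (hypothetical helper): read `zpool status` text on stdin and
# print "name count" for rows whose CKSUM column (5th field) is not 0.
# Counts can be abbreviated by zpool (for example 33.0K), so the column is
# compared as a string rather than as a number.
cksum_errors() {
    awk 'NF >= 5 &&
         $2 ~ /^(ONLINE|DEGRADED|FAULTED|OFFLINE|REMOVED|UNAVAIL)$/ &&
         $5 != "0" { print $1, $5 }'
}
```

Usage would be `zpool status Punya_NAS | cksum_errors`. As the status text itself says, `zpool clear Punya_NAS` resets the counters; running `zpool scrub Punya_NAS` afterwards re-reads the data, so any counter that climbs again points at a problem that is still there.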
 

SmallGuy

Guru
Joined
Jun 7, 2013
Messages
560
If I were you, I would test my RAM before going further.
 

naa41

I tried to wipe ada5 by taking it OFFLINE, wiping it, and rebooting. That may have caused more problems...

Code:
[root@freenas ~]# zpool status                                                                                                   
  pool: Punya_NAS                                                                                                               
state: DEGRADED                                                                                                                 
status: One or more devices is currently being resilvered.  The pool will                                                       
        continue to function, possibly in a degraded state.                                                                     
action: Wait for the resilver to complete.                                                                                       
  scan: resilver in progress since Thu Jan 30 03:21:07 2014                                                                     
        56.6G scanned out of 91.4G at 331M/s, 0h1m to go                                                                         
        6.92G resilvered, 61.92% done                                                                                           
config:                                                                                                                         
                                                                                                                                 
        NAME                                              STATE    READ WRITE CKSUM                                             
        Punya_NAS                                        DEGRADED    0    0    0                                             
          raidz2-0                                        DEGRADED    0    0    0                                             
            gptid/5321c047-d64b-11e2-9807-0015173e3e75    ONLINE      0    0    0                                             
            gptid/537be8a4-d64b-11e2-9807-0015173e3e75    ONLINE      0    0    0                                             
            gptid/53d7a95a-d64b-11e2-9807-0015173e3e75    ONLINE      0    0    0                                             
            gptid/54326bf4-d64b-11e2-9807-0015173e3e75    ONLINE      0    0    0                                             
            gptid/548c8437-d64b-11e2-9807-0015173e3e75    ONLINE      0    0    4  (resilvering)                             
            replacing-5                                  DEGRADED    0    0    2                                             
              13578628206157475055                        OFFLINE      0    0    0  was /dev/gptid/54e82bfc-d64b-11e2-9807-0015173e3e75
              gptid/5566d4bd-8921-11e3-b212-0015173e3e75  ONLINE      0    0    0  (resilvering)                             
            gptid/5543a681-d64b-11e2-9807-0015173e3e75    ONLINE      0    0    0                                             
            gptid/559fdcf2-d64b-11e2-9807-0015173e3e75    ONLINE      0    0    0                                             
                                                                                                                                 
errors: No known data errors

So once the resilver completes I will try zpool clear again.
 

naa41

Thank you, SmallGuy. I will test the RAM as you advised.
 

naa41

Finally, after resilvering...
Code:
[root@freenas ~]# zpool status                                                                                                     
  pool: Punya_NAS                                                                                                                 
state: ONLINE                                                                                                                     
status: One or more devices has experienced an unrecoverable error.  An                                                           
        attempt was made to correct the error.  Applications are unaffected.                                                       
action: Determine if the device needs to be replaced, and clear the errors                                                         
        using 'zpool clear' or replace the device with 'zpool replace'.                                                           
  see: http://illumos.org/msg/ZFS-8000-9P                                                                                         
  scan: resilvered 11.9G in 0h7m with 0 errors on Thu Jan 30 03:28:26 2014                                                         
config:                                                                                                                           
                                                                                                                                   
        NAME                                            STATE    READ WRITE CKSUM                                                 
        Punya_NAS                                      ONLINE      0    0    0                                                 
          raidz2-0                                      ONLINE      0    0    0                                                 
            gptid/5321c047-d64b-11e2-9807-0015173e3e75  ONLINE      0    0    0                                                 
            gptid/537be8a4-d64b-11e2-9807-0015173e3e75  ONLINE      0    0    0                                                 
            gptid/53d7a95a-d64b-11e2-9807-0015173e3e75  ONLINE      0    0    0                                                 
            gptid/54326bf4-d64b-11e2-9807-0015173e3e75  ONLINE      0    0    0                                                 
            gptid/548c8437-d64b-11e2-9807-0015173e3e75  ONLINE      0    0 33.0K                                                 
            gptid/5566d4bd-8921-11e3-b212-0015173e3e75  ONLINE      0    0    0                                                 
            gptid/5543a681-d64b-11e2-9807-0015173e3e75  ONLINE      0    0    0                                                 
            gptid/559fdcf2-d64b-11e2-9807-0015173e3e75  ONLINE      0    0    0                                                 
                                                                                                                                   
errors: No known data errors

It looks like my system is not stable. Anyway, I will try changing the RAM tomorrow.
 

SmallGuy

Don't change your RAM; just put MemTest86+ on a USB stick, boot from it, and let it run!
http://www.memtest.org
 

naa41

Thanks again, SmallGuy. Sorry for being a noob. I will follow your instructions.
 

naa41

Thanks, Dusan. I'm googling it.
 

Dusan

Guru
Joined
Jan 29, 2013
Messages
1,165
Post output of "glabel status" if you need assistance with SMART.
 

naa41

As follows:
Code:
[root@freenas ~]# glabel status                                               
                                      Name  Status  Components                 
gptid/5321c047-d64b-11e2-9807-0015173e3e75    N/A  ada0p2                     
gptid/537be8a4-d64b-11e2-9807-0015173e3e75    N/A  ada1p2                     
gptid/53d7a95a-d64b-11e2-9807-0015173e3e75    N/A  ada2p2                     
gptid/54326bf4-d64b-11e2-9807-0015173e3e75    N/A  ada3p2                     
gptid/548c8437-d64b-11e2-9807-0015173e3e75    N/A  ada4p2                     
gptid/5566d4bd-8921-11e3-b212-0015173e3e75    N/A  ada5p2                     
gptid/5543a681-d64b-11e2-9807-0015173e3e75    N/A  ada6p2                     
gptid/559fdcf2-d64b-11e2-9807-0015173e3e75    N/A  ada7p2                     
                            ufs/FreeNASs3    N/A  da0s3                     
                            ufs/FreeNASs4    N/A  da0s4                     
                    ufsid/5165e77974bdc41f    N/A  da0s1a                     
                            ufs/FreeNASs1a    N/A  da0s1a                     
                            ufs/FreeNASs2a    N/A  da0s2a
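
Since zpool status reports members by gptid while smartctl wants a disk name, the listing above can be used as a lookup table. A minimal sketch; gptid_to_disk is a made-up helper name, and it assumes the three-column Name / Status / Components layout shown here.

```shell
# gptid_to_disk GPTID (hypothetical helper): read `glabel status` text on
# stdin and print the disk behind the given label, with the partition
# suffix (p1, p2, ...) stripped, e.g. ada4p2 -> ada4.
gptid_to_disk() {
    awk -v id="$1" '$1 == id { sub(/p[0-9]+$/, "", $3); print $3 }'
}
```

For example, `glabel status | gptid_to_disk gptid/548c8437-d64b-11e2-9807-0015173e3e75` should print the disk that is showing the checksum errors in the earlier status output.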
 

Dusan

OK, so please post the output of "smartctl -a /dev/ada4". You can start a long test by running "smartctl -t long /dev/ada4". It will tell you how long to wait for the result (about 6.5 hours with your drives). Later you can run "smartctl -l selftest /dev/ada4" to check the result of the test.
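
The smartctl -a report is long and can scroll out of the terminal, so one option is to write it to a file and post it from there. A minimal sketch: save_output is a made-up helper and the /tmp path in the usage line is only an example.

```shell
# save_output FILE CMD [ARGS...] (hypothetical helper): run CMD, write its
# stdout and stderr to FILE, then print how many lines were saved.
save_output() {
    out=$1
    shift
    "$@" > "$out" 2>&1
    wc -l < "$out" | tr -d ' '
}
```

On the NAS this could be called as `save_output /tmp/smart_ada4.txt smartctl -a /dev/ada4`, and the same wrapper works for `smartctl -l selftest /dev/ada4` once the long test has finished.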
 

naa41

Here is the output of "smartctl -a /dev/ada4". I could only capture the last 50 lines.
Code:
recommended polling time:        (  2) minutes.                                                                                   
Extended self-test routine                                                                                                         
recommended polling time:        ( 382) minutes.                                                                                   
Conveyance self-test routine                                                                                                       
recommended polling time:        (  5) minutes.                                                                                   
SCT capabilities:              (0x70bd) SCT Status supported.                                                                     
                                        SCT Error Recovery Control supported.                                                     
                                        SCT Feature Control supported.                                                             
                                        SCT Data Table supported.                                                                 
                                                                                                                                   
SMART Attributes Data Structure revision number: 16                                                                               
Vendor Specific SMART Attributes with Thresholds:                                                                                 
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE                                   
  1 Raw_Read_Error_Rate    0x002f  200  200  051    Pre-fail  Always      -      0                                           
  3 Spin_Up_Time            0x0027  179  178  021    Pre-fail  Always      -      6050                                       
  4 Start_Stop_Count        0x0032  100  100  000    Old_age  Always      -      47                                         
  5 Reallocated_Sector_Ct  0x0033  200  200  140    Pre-fail  Always      -      0                                           
  7 Seek_Error_Rate        0x002e  100  253  000    Old_age  Always      -      0                                           
  9 Power_On_Hours          0x0032  100  100  000    Old_age  Always      -      292                                         
10 Spin_Retry_Count        0x0032  100  253  000    Old_age  Always      -      0                                           
11 Calibration_Retry_Count 0x0032  100  253  000    Old_age  Always      -      0                                           
12 Power_Cycle_Count      0x0032  100  100  000    Old_age  Always      -      47                                         
192 Power-Off_Retract_Count 0x0032  200  200  000    Old_age  Always      -      9                                           
193 Load_Cycle_Count        0x0032  200  200  000    Old_age  Always      -      37                                         
194 Temperature_Celsius    0x0022  110  106  000    Old_age  Always      -      40                                         
196 Reallocated_Event_Count 0x0032  200  200  000    Old_age  Always      -      0                                           
197 Current_Pending_Sector  0x0032  200  200  000    Old_age  Always      -      0                                           
198 Offline_Uncorrectable  0x0030  100  253  000    Old_age  Offline      -      0                                           
199 UDMA_CRC_Error_Count    0x0032  200  193  000    Old_age  Always      -      3464                                       
200 Multi_Zone_Error_Rate  0x0008  100  253  000    Old_age  Offline      -      0                                           
                                                                                                                                   
SMART Error Log Version: 1                                                                                                         
No Errors Logged                                                                                                                   
                                                                                                                                   
SMART Self-test log structure revision number 1                                                                                   
No self-tests have been logged.  [To run self-tests, use: smartctl -t]                                                             
                                                                                                                                   
                                                                                                                                   
SMART Selective self-test log data structure revision number 1                                                                     
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS                                                                                       
    1        0        0  Not_testing                                                                                               
    2        0        0  Not_testing                                                                                               
    3        0        0  Not_testing                                                                                               
    4        0        0  Not_testing                                                                                               
    5        0        0  Not_testing                                                                                               
Selective self-test flags (0x0):                                                                                                   
  After scanning selected spans, do NOT read-scan remainder of disk.                                                               
If Selective self-test is pending on power-up, resume after 0 minute delay.
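
The capture above gives the extended self-test polling time in minutes, which is where the wait estimate comes from. As a small sketch (extended_test_time is a made-up helper name), the figure can be pulled out of the report and converted to hours and minutes:

```shell
# extended_test_time (hypothetical helper): read `smartctl -a` text on stdin,
# find the polling time listed under "Extended self-test routine", and print
# it as hours and minutes.
extended_test_time() {
    awk '
        /Extended self-test routine/ { want = 1; next }
        want && /recommended polling time/ {
            gsub(/[^0-9]/, "", $0)      # keep only the digits of the line
            printf "%dh%02dm\n", int($0 / 60), $0 % 60
            exit
        }
    '
}
```

Fed with the report for this drive, the 382 minutes come out as 6h22m, which is in line with the estimate given earlier in the thread.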
 

SmallGuy

There are some UDMA CRC errors. Perhaps a bad SATA cable?
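
To compare that counter across all eight disks, or before and after swapping a cable, the raw value can be read out of the attribute table. A minimal sketch: smart_raw is a made-up helper, and it assumes the ten-column attribute layout shown above with the raw value as a single last field.

```shell
# smart_raw ID (hypothetical helper): read `smartctl -a` text on stdin and
# print the RAW_VALUE (last field) of the attribute row with that ID.
# Rows with fewer than 10 fields (e.g. the selective self-test table) are
# ignored so that they are not mistaken for attributes.
smart_raw() {
    awk -v id="$1" '$1 == id && NF >= 10 { print $NF; exit }'
}
```

For example, `smartctl -a /dev/ada4 | smart_raw 199` would print the UDMA_CRC_Error_Count raw value. As far as I know this counter is cumulative and does not reset, so what matters after changing the cable is whether it keeps rising.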
 

naa41

I'm now running "smartctl -t long /dev/ada4" and will wait about 6 hours for the result.
Thank you very much, Dusan.

Thanks also to SmallGuy. Maybe I will try changing the SATA cable.
:)
 
