SOLVED Checksum errors - Degraded Pool too many errors

Status
Not open for further replies.

Varun Chugh

Dabbler
Joined
Dec 15, 2015
Messages
38
Hi,

I am new to FreeNAS and have set up a mirror pool with 2 x 4TB WD Red NAS drives. A couple of days ago the NAS suddenly restarted, and immediately after the restart it started resilvering. Since then the pool has been in a degraded state and the UI shows this error: "The volume SAVA-mirror (ZFS) state is DEGRADED: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state"

I have run a scrub a couple of times but the issue persists. I am copying the pool status below. Any help resolving the error would be appreciated. The specs of my machine:

FreeNAS: FreeNAS-9.3-STABLE-201512121950
Motherboard: Asus P8P67 Pro B3 (Rev 3.0)
CPU: Core i5 2310 (2nd gen)
PSU: Corsair CX750M
RAM: Kingston 4x4GB (non-ECC, could this be the issue?)
HDD: 2x4TB WD Red NAS drives


Code:
  pool: SAVA-mirror
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Dec 16 19:42:41 2015
        192G scanned out of 1.34T at 113M/s, 2h57m to go
        192G resilvered, 14.00% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        SAVA-mirror                                     DEGRADED     0     0   338
          mirror-0                                      DEGRADED     0     0 1.32K
            gptid/31681994-9810-11e5-bd39-f46d041bbf8d  DEGRADED     0     0 1.32K  too many errors
            gptid/321da29c-9810-11e5-bd39-f46d041bbf8d  ONLINE       0     0 1.32K  (resilvering)

errors: 14 data errors, use '-v' for a list


Thanks, happy to provide any other info.

- Varun
 

gpsguy

Active Member
Joined
Jan 22, 2012
Messages
4,472
I'd also suggest running zpool status -v and deleting the files that account for the 14 errors.
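
With -v, zpool status prints the affected files after an "errors: Permanent errors have been detected in the following files:" line. As a rough sketch of pulling those entries out of the command's text output, something like the Python below could work. The function name `files_with_errors` and the sample paths are hypothetical; the thread never shows the actual -v output for this pool.

```python
def files_with_errors(status_output):
    """Return the entries listed after the 'errors: Permanent errors' line."""
    entries = []
    in_list = False
    for line in status_output.splitlines():
        if line.startswith("errors: Permanent errors"):
            in_list = True
            continue
        # A new "pool:" header means the next pool's report has started.
        if line.lstrip().startswith("pool:"):
            in_list = False
        if in_list and line.strip():
            entries.append(line.strip())
    return entries


# Hypothetical sample text; these paths are made up for illustration.
sample_status = """\
errors: Permanent errors have been detected in the following files:

        /mnt/SAVA-mirror/media/example1.mkv
        /mnt/SAVA-mirror/media/example2.mkv
"""

for path in files_with_errors(sample_status):
    print(path)
```

The sketch only lists the entries. Deciding whether to delete or restore each file is left to whoever runs it.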
 

Varun Chugh

Dabbler
Joined
Dec 15, 2015
Messages
38
Thanks for replying, @hugovsky and @gpsguy. Below are the results from the short SMART test for the 2 disks. Are there errors on the disks?

Code:
[root@freenas] ~# smartctl -A /dev/ada0
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p28 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   182   177   021    Pre-fail  Always       -       7866
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       112
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       225
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       112
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       100
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       34
194 Temperature_Celsius     0x0022   104   098   000    Old_age   Always       -       48
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0


Code:
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   187   183   021    Pre-fail  Always       -       7641
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       121
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       302
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       120
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       107
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       39
194 Temperature_Celsius     0x0022   102   096   000    Old_age   Always       -       50
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
You are cooking your drives so that could be causing your problems. Get more or better fans.

You don't have any failing drives though. Do you know why it rebooted? Anything on the logs?
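
The "cooking" comment refers to attribute 194 (Temperature_Celsius) in the smartctl output above, where the raw values are 48 and 50. A minimal Python sketch for reading that value out of `smartctl -A` text might look like the following. The function name `drive_temperature` and the 45 C warning level are assumptions for illustration, not values taken from smartmontools or the drive vendor.

```python
def drive_temperature(smartctl_output):
    """Return the raw Temperature_Celsius value (attribute 194), or None."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with the numeric ID; RAW_VALUE is the 10th column.
        if len(fields) >= 10 and fields[0] == "194":
            return int(fields[9])
    return None


# Two rows copied from the first drive's output in this thread.
sample_attrs = """\
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       34
194 Temperature_Celsius     0x0022   104   098   000    Old_age   Always       -       48
"""

WARN_AT_C = 45  # hypothetical warning level; check the drive's datasheet

temp = drive_temperature(sample_attrs)
if temp is not None and temp >= WARN_AT_C:
    print(f"drive is at {temp} C, at or above the {WARN_AT_C} C warning level")
```

Feeding it the saved output of `smartctl -A` for each disk gives one number per drive that can be checked after adding fans.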
 

Varun Chugh

Dabbler
Joined
Dec 15, 2015
Messages
38
Not sure why it rebooted. I was copying data, streaming Plex, and backing up Time Machine simultaneously. Too much processing? Possibly high temperature?
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Yeah, I bet your CPU overheated. What temperature is it running at?
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Ha, your CPU is roasting. You need to fix your cooling and look at replacing your drives, because they will probably fail soon from the increased temps. And hopefully you didn't kill your CPU.
 

Varun Chugh

Dabbler
Joined
Dec 15, 2015
Messages
38
Thanks. I just bought the drives, so I think they should be fine. I will fix the cooling. Any recommendations? I have the CPU cooler that came with the processor and one additional fan in the case (it is an ordinary home desktop case).
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Provide all hardware specs like the forum rules say. What case? It doesn't really matter, though, because you need more case fans. The CPU fan isn't your problem.
 

Varun Chugh

Dabbler
Joined
Dec 15, 2015
Messages
38
Below are the specs. The case is a local one, bought at a local show here; it has no brand. I just upgraded the PSU and the drives of my old desktop to create the FreeNAS server. My main use is really just streaming media via Plex. Do you think I should upgrade to ECC?

FreeNAS: FreeNAS-9.3-STABLE-201512121950
Motherboard: Asus P8P67 Pro B3 (Rev 3.0)
CPU: Core i5 2310 (2nd gen)
PSU: Corsair CX750M
RAM: Kingston 4x4GB (non-ECC, could this be the issue?)
HDD: 2x4TB WD Red NAS drives; 2x500GB WD Blue Caviar
Add-on card: 1 add-on video card for HDMI support
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
ECC will help to protect your data and improve stability. If you do that, you will need a new CPU and motherboard. Your system isn't great for FreeNAS. Get more fans and see if things improve.
 

Varun Chugh

Dabbler
Joined
Dec 15, 2015
Messages
38
Okay. Can anything be done about the checksum errors that I have right now? Otherwise I am thinking of backing up the data, wiping the disks, and setting up a fresh mirror. Do you think that is a good idea?
 

rsquared

Explorer
Joined
Nov 17, 2015
Messages
81
Drive temps in the 50s aren't good, but that usually just shortens the drive's life (sometimes significantly); it doesn't usually kill a drive within a few weeks.

I'd suggest reading this post on why you need ECC memory with ZFS before going any further. https://forums.freenas.org/index.php?threads/ecc-vs-non-ecc-ram-and-zfs.15449/ Then I'd run memtest to see if your RAM is bad... It's quite possible you're making your data worse with every scrub.
 

Varun Chugh

Dabbler
Joined
Dec 15, 2015
Messages
38
How do I do a memtest?

Yes, have read that post :) and I do understand that ECC is absolutely needed for data safety and stability.
 

Varun Chugh

Dabbler
Joined
Dec 15, 2015
Messages
38
Download a copy of it from here - http://www.memtest.org/ and run it.

The test is still running, but I can already see a lot of errors. That means the RAM is an issue. But I have 4 sticks; how do I check whether only one or more than one of them is faulty?

I am based in Singapore, and most of the ECC motherboards (Supermicro) recommended on the forums are not available over the counter. Any other motherboard recommendations?
 

rsquared

Explorer
Joined
Nov 17, 2015
Messages
81
The other important point from that thread for you is that every scrub (and you said you've done a few) done with bad RAM will corrupt your data further. Even once you get a new board and ECC memory, you can't trust what you've got on those drives.
 