SOLVED Checksum errors - Degraded Pool too many errors

Varun Chugh · Dec 16, 2015

Hi,

I am new to FreeNAS and have set up a mirror pool with 2x4TB WD Red NAS drives. A couple of days ago the NAS suddenly restarted, and immediately after the restart it started resilvering. Since then the pool has been in a degraded state and the UI shows this error: "The volume SAVA-mirror (ZFS) state is DEGRADED: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state"

I have run a scrub a couple of times, but the issue persists. The pool status is copied below. Any help resolving the error would be appreciated. The specs of my machine:

FreeNAS: FreeNAS-9.3-STABLE-201512121950
Motherboard: Asus P8P67 Pro B3 (Rev 3.0)
CPU: Core i5 2310 (2nd gen)
PSU: Corsair CX750M
RAM: Kingston 4x4GB (non-ECC, could this be the issue?)
HDD: 2x4TB WD Red NAS drives

Code:

  pool: SAVA-mirror
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Dec 16 19:42:41 2015
        192G scanned out of 1.34T at 113M/s, 2h57m to go
        192G resilvered, 14.00% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        SAVA-mirror                                     DEGRADED     0     0   338
          mirror-0                                      DEGRADED     0     0 1.32K
            gptid/31681994-9810-11e5-bd39-f46d041bbf8d  DEGRADED     0     0 1.32K  too many errors
            gptid/321da29c-9810-11e5-bd39-f46d041bbf8d  ONLINE       0     0 1.32K  (resilvering)

errors: 14 data errors, use '-v' for a list

Thanks, happy to provide any other info.

- Varun

hugovsky · Dec 16, 2015

You probably have an error on that disk. Can you show the SMART stats for that drive? You should copy all data off that pool; you have no redundancy. Have you tested your hardware for errors? And finally... DON'T USE NON-ECC RAM if you love your data.

gpsguy · Dec 16, 2015

I'd also suggest running zpool status -v and deleting the files listed in the 14 errors.
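
If it helps to see at a glance which devices carry checksum errors, the config table from zpool status can be parsed with a few lines of Python. This is only a rough sketch, assuming the NAME/STATE/READ/WRITE/CKSUM column layout shown in the first post. The function names `parse_zpool_config` and `devices_with_checksum_errors` are made up for illustration, and the sample text is copied from that post.

```python
SAMPLE_STATUS = """\
config:

        NAME                                            STATE     READ WRITE CKSUM
        SAVA-mirror                                     DEGRADED     0     0   338
          mirror-0                                      DEGRADED     0     0 1.32K
            gptid/31681994-9810-11e5-bd39-f46d041bbf8d  DEGRADED     0     0 1.32K  too many errors
            gptid/321da29c-9810-11e5-bd39-f46d041bbf8d  ONLINE       0     0 1.32K  (resilvering)

errors: 14 data errors, use '-v' for a list
"""


def parse_zpool_config(status_text):
    """Return {device_name: {"state", "read", "write", "cksum"}}.

    Counters stay as strings because zpool abbreviates large values
    (for example "1.32K"), so they are not exact integers.
    """
    devices = {}
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_table = True  # device rows start after this header
            continue
        if not in_table:
            continue
        if line.startswith("errors:"):
            break  # the table ends at the error summary line
        if len(fields) < 5:
            continue  # skip blank lines inside the table
        name, state, read, write, cksum = fields[:5]
        devices[name] = {"state": state, "read": read, "write": write, "cksum": cksum}
    return devices


def devices_with_checksum_errors(devices):
    """List the names of devices whose CKSUM column is not zero."""
    return [name for name, info in devices.items() if info["cksum"] != "0"]


devices = parse_zpool_config(SAMPLE_STATUS)
for name in devices_with_checksum_errors(devices):
    print(name, devices[name]["state"], devices[name]["cksum"])
```

In the sample, both sides of the mirror show the same nonzero CKSUM value, which is worth noting before blaming a single disk.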

Varun Chugh · Dec 16, 2015

Thanks for replying, @hugovsky and @gpsguy. Below are the results from the short SMART test for the 2 disks. Are there errors on the disks?

Code:

[root@freenas] ~# smartctl -A /dev/ada0
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p28 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   182   177   021    Pre-fail  Always       -       7866
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       112
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       225
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       112
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       100
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       34
194 Temperature_Celsius     0x0022   104   098   000    Old_age   Always       -       48
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

Code:

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   187   183   021    Pre-fail  Always       -       7641
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       121
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       302
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       120
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       107
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       39
194 Temperature_Celsius     0x0022   102   096   000    Old_age   Always       -       50
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SweetAndLow · Dec 16, 2015

You are cooking your drives so that could be causing your problems. Get more or better fans.

You don't have any failing drives though. Do you know why it rebooted? Anything in the logs?
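
For anyone who wants to check drive temperature and the usual failure indicators from smartctl -A output without reading the table by hand, here is a small Python sketch. It assumes the ten-column attribute layout shown in the post above, with the raw value in the last column. The names `parse_smart_attributes` and `WARN_CELSIUS` are hypothetical, and the 40 C warning level is an assumed value for illustration, not a figure from this thread or from a vendor spec.

```python
SAMPLE_SMART = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
194 Temperature_Celsius     0x0022   104   098   000    Old_age   Always       -       48
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
"""

WARN_CELSIUS = 40  # assumed warning level, chosen only for this example


def parse_smart_attributes(smartctl_text):
    """Map attribute name -> integer raw value for rows of `smartctl -A` output."""
    attrs = {}
    for line in smartctl_text.splitlines():
        fields = line.split()
        # Attribute rows begin with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            attrs[fields[1]] = int(fields[9])
    return attrs


def is_too_hot(attrs, limit=WARN_CELSIUS):
    """True if the drive reports a temperature above the chosen limit."""
    return attrs.get("Temperature_Celsius", 0) > limit


attrs = parse_smart_attributes(SAMPLE_SMART)
print("temperature:", attrs["Temperature_Celsius"], "too hot:", is_too_hot(attrs))
print("reallocated:", attrs["Reallocated_Sector_Ct"],
      "pending:", attrs["Current_Pending_Sector"])
```

Zero reallocated and pending sectors together with a high temperature matches the reading given here: the drives are not failing, but they are running hot.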

Varun Chugh · Dec 16, 2015

SweetAndLow said:
You are cooking your drives so that could be causing your problems. Get more or better fans.

You don't have any failing drives though. Do you know why it rebooted? Anything in the logs?

Not sure why it rebooted. I was copying data, streaming Plex, and backing up Time Machine simultaneously. Too much processing? Possibly high temperature?

SweetAndLow · Dec 16, 2015

Yeah I bet your cpu overheated. What temperature is it running at?

Varun Chugh · Dec 16, 2015

SweetAndLow said:
Yeah I bet your cpu overheated. What temperature is it running at?

At the moment:

dev.cpu.0.temperature: 72.0C

dev.cpu.1.temperature: 77.0C

dev.cpu.2.temperature: 73.0C

dev.cpu.3.temperature: 75.0C
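
Readings in this format can be summarized with a short Python sketch like the one below. It assumes lines of the form dev.cpu.N.temperature: XX.XC exactly as pasted above (they look like FreeBSD sysctl output, but the command itself is not shown in the thread, so that part is an assumption). The function name `parse_cpu_temps` is made up for illustration.

```python
import re

SAMPLE_TEMPS = """\
dev.cpu.0.temperature: 72.0C
dev.cpu.1.temperature: 77.0C
dev.cpu.2.temperature: 73.0C
dev.cpu.3.temperature: 75.0C
"""

# Matches one per-core reading, capturing the core index and the value in Celsius.
TEMP_LINE = re.compile(r"^dev\.cpu\.(\d+)\.temperature:\s*([\d.]+)C$")


def parse_cpu_temps(text):
    """Return {core_index: temperature_in_celsius} from per-core reading lines."""
    temps = {}
    for line in text.splitlines():
        match = TEMP_LINE.match(line.strip())
        if match:
            temps[int(match.group(1))] = float(match.group(2))
    return temps


temps = parse_cpu_temps(SAMPLE_TEMPS)
hottest_core = max(temps, key=temps.get)
print("hottest core:", hottest_core, "at", temps[hottest_core], "C")
print("average:", sum(temps.values()) / len(temps), "C")
```

Logging these values over time, rather than taking one snapshot, would show whether the temperature climbs under the kind of load that preceded the reboot.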

SweetAndLow · Dec 16, 2015

Ha your cpu is roasting. You need to fix your cooling and look at replacing your drives because they will probably fail soon from the increased temps. And hopefully you didn't kill your cpu.

Varun Chugh · Dec 16, 2015

SweetAndLow said:
Ha your cpu is roasting. You need to fix your cooling and look at replacing your drives because they will probably fail soon from the increased temps. And hopefully you didn't kill your cpu.

Thanks. I just bought the drives, so I think they should be fine. I will fix the cooling. Any recommendations? I have the CPU cooler that came with the processor and one additional fan in the case (an ordinary home desktop case).

SweetAndLow · Dec 16, 2015

Provide all hardware specs like forum rules say. What case? And it doesn't really matter because you need more case fans. Cpu fan isn't your problem.

Varun Chugh · Dec 16, 2015

SweetAndLow said:
Provide all hardware specs like forum rules say. What case? And it doesn't really matter because you need more case fans. Cpu fan isn't your problem.

Below are the specs. The case is a generic one, bought at a local show here; it has no brand. I just upgraded the PSU and the drives of my old desktop to create the FreeNAS server. My real use is just streaming media via Plex. Do you think I should upgrade to ECC?

FreeNAS: FreeNAS-9.3-STABLE-201512121950
Motherboard: Asus P8P67 Pro B3 (Rev 3.0)
CPU: Core i5 2310 (2nd gen)
PSU: Corsair CX750M
RAM: Kingston 4x4GB (non-ECC, could this be the issue?)
HDD: 2x4TB WD Red NAS drives; 2x500GB WD Blue Caviar
Add-on card: 1 add-on video card for HDMI support

SweetAndLow · Dec 16, 2015

ECC will help protect your data and improve stability. If you do that, you will need a new CPU and motherboard. Your system isn't great for FreeNAS. Get more fans and see if things improve.

Varun Chugh · Dec 16, 2015

SweetAndLow said:
ECC will help protect your data and improve stability. If you do that, you will need a new CPU and motherboard. Your system isn't great for FreeNAS. Get more fans and see if things improve.

Okay. Can anything be done about the checksum errors I have right now? Otherwise I am thinking of backing up the data, wiping the disks, and setting up a fresh mirror. Do you think that is a good idea?

rsquared · Dec 16, 2015

Varun Chugh said:
Okay. Can anything be done about the checksum errors I have right now? Otherwise I am thinking of backing up the data, wiping the disks, and setting up a fresh mirror. Do you think that is a good idea?

Drive temps in the 50s aren't good, but that usually just shortens the drive's life (sometimes significantly); it doesn't usually kill the drive within a few weeks.

I'd suggest reading this post on why you need ECC memory with ZFS before going any further. https://forums.freenas.org/index.php?threads/ecc-vs-non-ecc-ram-and-zfs.15449/ Then I'd run memtest to see if your RAM is bad... It's quite possible you're making your data worse with every scrub.

Varun Chugh · Dec 16, 2015

rsquared said:
Drive temps in the 50s aren't good, but that usually just shortens the drive's life (sometimes significantly); it doesn't usually kill the drive within a few weeks.

I'd suggest reading this post on why you need ECC memory with ZFS before going any further. https://forums.freenas.org/index.php?threads/ecc-vs-non-ecc-ram-and-zfs.15449/ Then I'd run memtest to see if your RAM is bad... It's quite possible you're making your data worse with every scrub.

How do I do a memtest?

Yes, have read that post :) and I do understand that ECC is absolutely needed for data safety and stability.

gpsguy · Dec 17, 2015

Download a copy of it from here - http://www.memtest.org/ and run it.

Varun Chugh · Dec 17, 2015

gpsguy said:
Download a copy of it from here - http://www.memtest.org/ and run it.

The test is still running, but I can already see a lot of errors, so the RAM is an issue. I have 4 sticks, though; how do I check whether just one or more than one of them is bad?

I am based in Singapore, and most of the ECC motherboards (Supermicro) recommended on the forums are not available over the counter here. Any other motherboard recommendations?

jgreco · Dec 17, 2015

The Supermicro boards tend to be the easiest to find; however, there are other options. The "So you want some hardware suggestions" post talks about the qualities that make a board useful. You can also find prebuilts of various kinds.

rsquared · Dec 17, 2015

Varun Chugh said:
How do I do a memtest?

Yes, have read that post :) and I do understand that ECC is absolutely needed for data safety and stability.

The other important point from that thread for you is that every scrub (and you said you've done a few) done with bad RAM will corrupt your data further. Even once you get a new board and ECC memory, you can't trust what you've got on those drives.
