Checksum errors on 2 drives

Status
Not open for further replies.
Joined
Oct 2, 2014
Messages
925
I recently built a new striped RAIDz3 pool of 14 drives (two 7-drive RAIDz3 vdevs). Last night I received an email saying that 2 drives have 4 checksum errors each.

Code:
pool: Data3
state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 4h9m with 0 errors on Mon Aug 15 04:45:56 2016
config:

        NAME                                            STATE     READ WRITE CKSUM
        Data3                                           ONLINE       0     0     0
          raidz3-0                                      ONLINE       0     0     0
            gptid/27b50fcd-601c-11e6-bf20-001517d98fed  ONLINE       0     0     0
            gptid/28adc681-601c-11e6-bf20-001517d98fed  ONLINE       0     0     0
            gptid/299ef317-601c-11e6-bf20-001517d98fed  ONLINE       0     0     0
            gptid/2a87dcb1-601c-11e6-bf20-001517d98fed  ONLINE       0     0     0
            gptid/2b82a53a-601c-11e6-bf20-001517d98fed  ONLINE       0     0     0
            gptid/2c7bbf23-601c-11e6-bf20-001517d98fed  ONLINE       0     0     0
            gptid/2d726498-601c-11e6-bf20-001517d98fed  ONLINE       0     0     0
          raidz3-1                                      ONLINE       0     0     0
            gptid/2e86c514-601c-11e6-bf20-001517d98fed  ONLINE       0     0     0
            gptid/2f88b854-601c-11e6-bf20-001517d98fed  ONLINE       0     0     0
            gptid/307eb07d-601c-11e6-bf20-001517d98fed  ONLINE       0     0     4
            gptid/316e123e-601c-11e6-bf20-001517d98fed  ONLINE       0     0     4
            gptid/3271fe11-601c-11e6-bf20-001517d98fed  ONLINE       0     0     0
            gptid/335b24bd-601c-11e6-bf20-001517d98fed  ONLINE       0     0     0
            gptid/3458b29c-601c-11e6-bf20-001517d98fed  ONLINE       0     0     0

errors: No known data errors                                                                                                               


That is what the zpool status output shows. When I pull SMART data for all the drives with the SMART report script, no drives show SMART errors.
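For anyone who wants to script this check rather than eyeball it, here is a minimal Python sketch that picks out the rows of zpool status text with non-zero READ/WRITE/CKSUM counters. The function name `devices_with_errors` is made up for illustration, and the sample is an excerpt of the output above. It assumes plain integer counters; zpool can abbreviate large counts (for example with a K suffix), and this sketch would skip such rows.

```python
# Sketch: find devices with non-zero error counters in `zpool status` text.
# ZPOOL_STATUS_SAMPLE is an excerpt of the output posted in this thread.
ZPOOL_STATUS_SAMPLE = """\
        NAME                                            STATE     READ WRITE CKSUM
        Data3                                           ONLINE       0     0     0
          raidz3-1                                      ONLINE       0     0     0
            gptid/2f88b854-601c-11e6-bf20-001517d98fed  ONLINE       0     0     0
            gptid/307eb07d-601c-11e6-bf20-001517d98fed  ONLINE       0     0     4
            gptid/316e123e-601c-11e6-bf20-001517d98fed  ONLINE       0     0     4

errors: No known data errors
"""


def devices_with_errors(status_text):
    """Return {name: (read, write, cksum)} for rows with any non-zero counter."""
    flagged = {}
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        # The device table starts right after the NAME/STATE header row.
        if fields[:2] == ["NAME", "STATE"]:
            in_config = True
            continue
        if not in_config:
            continue
        # The "errors:" summary line marks the end of the device table.
        if line.strip().startswith("errors:"):
            break
        # Device rows look like: NAME STATE READ WRITE CKSUM
        # (assumption: counters are plain integers, not abbreviated).
        if len(fields) >= 5 and all(f.isdigit() for f in fields[2:5]):
            read, write, cksum = (int(f) for f in fields[2:5])
            if read or write or cksum:
                flagged[fields[0]] = (read, write, cksum)
    return flagged


for name, counts in devices_with_errors(ZPOOL_STATUS_SAMPLE).items():
    print(name, counts)
```

Running it on a schedule and comparing the counts between runs would show whether the checksum errors keep growing or stay at 4.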

What is the best way to proceed? The 14 drives were burned in for 7 days, completing 10 or more passes each. After those passes I ran SMART long tests on each drive and pulled the SMART data for all of them, which showed no errors.


System Specs:

FreeNAS Server 9.3
Supermicro X9DR3-LN4F+, dual E5-2670 @ 2.60GHz with Noctua heatsinks
96GB of RAM
M1015 with 2 SAS cables to the backplane
Dual 80GB Intel SSDs for boot
7x 4TB WD Reds, RAIDz3
14x 2TB WD RE4s, striped RAIDz3
14x 2TB WD RE4s, striped RAIDz3
Chelsio T420-CR 10GbE card
Supermicro SC846E16-R1200B case
 

Sakuru

Guru
Joined
Nov 20, 2015
Messages
527
Joined
Oct 2, 2014
Messages
925
If you're on mobile it may not have shown. I have included it here and in the original post:

FreeNAS Server 9.3
Supermicro X9DR3-LN4F+, dual E5-2670 @ 2.60GHz with Noctua heatsinks
96GB of RAM
M1015 with 2 SAS cables to the backplane
Dual 80GB Intel SSDs for boot
7x 4TB WD Reds, RAIDz3
14x 2TB WD RE4s, striped RAIDz3
14x 2TB WD RE4s, striped RAIDz3
Chelsio T420-CR 10GbE card
Supermicro SC846E16-R1200B case
 

Nick2253

Wizard
Joined
Apr 21, 2014
Messages
1,633
What is the best way to proceed?
I guess I'd say "cautiously" :D

Having a few checksum errors is not necessarily a terrible thing. There are many causes for checksum errors that are ultimately no big deal; my bigger concern would be if you keep getting more checksum errors.

What exactly did you do for burn-in, and did you see any odd numbers in the SMART attributes?
 
Joined
Oct 2, 2014
Messages
925
Burn-in was a full 7 days, following the guide found here https://forums.freenas.org/index.php?threads/how-to-hard-drive-burn-in-testing.21451/ which I have used in the past. The command was badblocks -ws /dev/da(X), followed by a SMART long test after the 7 days of badblocks, which returned no errors.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Can you check /var/log/messages to see what drive went offline? You should have also gotten an email.
 
Joined
Oct 2, 2014
Messages
925
The email I received is:
Code:
The volume Data3 (ZFS) state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.


This is what's in /var/log/messages:
Code:
Aug 17 00:00:00 FreeNAS newsyslog[83740]: logfile turned over due to size>100K
Aug 17 00:00:00 FreeNAS syslog-ng[2588]: Configuration reload request received, reloading configuration;
Aug 17 00:18:46 FreeNAS ses1: da21,pass24: Element descriptor: 'Slot 01'
Aug 17 00:18:46 FreeNAS ses1: da21,pass24: SAS Device Slot Element: 1 Phys at Slot 0
Aug 17 00:18:46 FreeNAS ses1:  phy 0: SATA device
Aug 17 00:18:46 FreeNAS ses1:  phy 0: parent 50030480001fe1bf addr 50030480001fe18c
Aug 17 00:18:46 FreeNAS ses1: da22,pass25: Element descriptor: 'Slot 02'
Aug 17 00:18:46 FreeNAS ses1: da22,pass25: SAS Device Slot Element: 1 Phys at Slot 1
Aug 17 00:18:46 FreeNAS ses1:  phy 0: SATA device
Aug 17 00:18:46 FreeNAS ses1:  phy 0: parent 50030480001fe1bf addr 50030480001fe18d
Aug 17 00:18:46 FreeNAS ses1: da23,pass26: Element descriptor: 'Slot 03'
Aug 17 00:18:46 FreeNAS ses1: da23,pass26: SAS Device Slot Element: 1 Phys at Slot 2
Aug 17 00:18:46 FreeNAS ses1:  phy 0: SATA device
Aug 17 00:18:46 FreeNAS ses1:  phy 0: parent 50030480001fe1bf addr 50030480001fe18e
Aug 17 00:18:46 FreeNAS ses1: da24,pass27: Element descriptor: 'Slot 04'
Aug 17 00:18:46 FreeNAS ses1: da24,pass27: SAS Device Slot Element: 1 Phys at Slot 3
Aug 17 00:18:46 FreeNAS ses1:  phy 0: SATA device
Aug 17 00:18:46 FreeNAS ses1:  phy 0: parent 50030480001fe1bf addr 50030480001fe18f
Aug 17 00:18:46 FreeNAS ses1: da25,pass28: Element descriptor: 'Slot 05'
Aug 17 00:18:46 FreeNAS ses1: da25,pass28: SAS Device Slot Element: 1 Phys at Slot 4
Aug 17 00:18:46 FreeNAS ses1:  phy 0: SATA device
Aug 17 00:18:46 FreeNAS ses1:  phy 0: parent 50030480001fe1bf addr 50030480001fe190
Aug 17 00:18:46 FreeNAS ses1: da26,pass29: Element descriptor: 'Slot 06'
Aug 17 00:18:46 FreeNAS ses1: da26,pass29: SAS Device Slot Element: 1 Phys at Slot 5
Aug 17 00:18:46 FreeNAS ses1:  phy 0: SATA device
Aug 17 00:18:46 FreeNAS ses1:  phy 0: parent 50030480001fe1bf addr 50030480001fe191
Aug 17 00:18:46 FreeNAS ses1: da27,pass30: Element descriptor: 'Slot 07'
Aug 17 00:18:46 FreeNAS ses1: da27,pass30: SAS Device Slot Element: 1 Phys at Slot 6
Aug 17 00:18:46 FreeNAS ses1:  phy 0: SATA device
Aug 17 00:18:46 FreeNAS ses1:  phy 0: parent 50030480001fe1bf addr 50030480001fe192
Aug 17 00:18:46 FreeNAS ses1: da28,pass31: Element descriptor: 'Slot 08'
Aug 17 00:18:46 FreeNAS ses1: da28,pass31: SAS Device Slot Element: 1 Phys at Slot 7
Aug 17 00:18:46 FreeNAS ses1:  phy 0: SATA device
Aug 17 00:18:46 FreeNAS ses1:  phy 0: parent 50030480001fe1bf addr 50030480001fe193
Aug 17 00:18:46 FreeNAS ses1: da29,pass32: Element descriptor: 'Slot 09'
Aug 17 00:18:46 FreeNAS ses1: da29,pass32: SAS Device Slot Element: 1 Phys at Slot 8
Aug 17 00:18:46 FreeNAS ses1:  phy 0: SATA device
Aug 17 00:18:46 FreeNAS ses1:  phy 0: parent 50030480001fe1bf addr 50030480001fe194
Aug 17 00:18:46 FreeNAS ses1: da30,pass33: Element descriptor: 'Slot 10'
Aug 17 00:18:46 FreeNAS ses1: da30,pass33: SAS Device Slot Element: 1 Phys at Slot 9
Aug 17 00:18:46 FreeNAS ses1:  phy 0: SATA device
Aug 17 00:18:46 FreeNAS ses1:  phy 0: parent 50030480001fe1bf addr 50030480001fe195
Aug 17 00:18:46 FreeNAS ses1: da31,pass34: Element descriptor: 'Slot 11'
Aug 17 00:18:46 FreeNAS ses1: da31,pass34: SAS Device Slot Element: 1 Phys at Slot 10
Aug 17 00:18:46 FreeNAS ses1:  phy 0: SATA device
Aug 17 00:18:46 FreeNAS ses1:  phy 0: parent 50030480001fe1bf addr 50030480001fe196
Aug 17 00:18:46 FreeNAS ses1: da32,pass35: Element descriptor: 'Slot 12'
Aug 17 00:18:46 FreeNAS ses1: da32,pass35: SAS Device Slot Element: 1 Phys at Slot 11
Aug 17 00:18:46 FreeNAS ses1:  phy 0: SATA device
Aug 17 00:18:46 FreeNAS ses1:  phy 0: parent 50030480001fe1bf addr 50030480001fe197
Aug 17 00:18:46 FreeNAS ses1: da33,pass36: Element descriptor: 'Slot 17'
Aug 17 00:18:46 FreeNAS ses1: da33,pass36: SAS Device Slot Element: 1 Phys at Slot 16
Aug 17 00:18:46 FreeNAS ses1:  phy 0: SATA device
Aug 17 00:18:46 FreeNAS ses1:  phy 0: parent 50030480001fe1bf addr 50030480001fe19c
Aug 17 00:18:46 FreeNAS ses1: da34,pass37: Element descriptor: 'Slot 18'
Aug 17 00:18:46 FreeNAS ses1: da34,pass37: SAS Device Slot Element: 1 Phys at Slot 17
Aug 17 00:18:46 FreeNAS ses1:  phy 0: SATA device
Aug 17 00:18:46 FreeNAS ses1:  phy 0: parent 50030480001fe1bf addr 50030480001fe19d
Aug 17 00:30:34 FreeNAS alert.py: [freenasOS.Configuration:624] Unable to load https://web.ixsystems.com/updates/ix_crl.pem: Host web.ixsyst$
Aug 17 00:30:34 FreeNAS alert.py: [freenasOS.Configuration:638] Unable to load ['https://web.ixsystems.com/updates/ix_crl.pem']: Host web.ix$
Aug 17 00:30:36 FreeNAS alert.py: [freenasOS.Configuration:624] Unable to load https://web.ixsystems.com/updates/ix_crl.pem: Host web.ixsyst$
Aug 17 00:30:36 FreeNAS alert.py: [freenasOS.Configuration:638] Unable to load ['https://web.ixsystems.com/updates/ix_crl.pem']: Host web.ix$
Aug 17 01:30:56 FreeNAS alert.py: [freenasOS.Configuration:624] Unable to load https://web.ixsystems.com/updates/ix_crl.pem: Host web.ixsyst$
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
Unlikely it is the drives. It will be the system. power, cables, backplane.
 
Joined
Oct 2, 2014
Messages
925
Here's the latest SMART long test I ran when I saw the pool errors:

Code:
+------+---------------+----+-----+-----+-----+-------+-------+--------+------+----------+------+-------+
|Device|Serial         |Temp|Power|Start|Spin |ReAlloc|Current|Offline |Seek  |Total     |High  |Command|
|      |               |    |On   |Stop |Retry|Sectors|Pending|Uncorrec|Errors|Seeks     |Fly   |Timeout|
|      |               |    |Hours|Count|Count|       |Sectors|Sectors |      |          |Writes|Count  |
+------+---------------+----+-----+-----+-----+-------+-------+--------+------+----------+------+-------+
|ada0 ?|CVPO037501MJ080JGN|  |16193|    0|     |      1|       |        |   N/A|       N/A|   N/A|    N/A|
|ada1 ?|CVPO011303C8080BGN|  |14922|    0|     |      5|       |        |   N/A|       N/A|   N/A|    N/A|
|da0   |WD-WCC4E0333314| 32 |24063|  101|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|
|da1   |WD-WCC4E0293479| 32 |24066|  101|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|
|da2   |WD-WCC4E0253908| 30 |24065|  101|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|
|da3   |WD-WCC4EKPA04D8| 30 |12350|   30|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|
|da4   |WD-WCC4EJCK7VX3| 30 |12350|   30|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|
|da5   |WD-WCC4EEXVU2D5| 30 |12350|   30|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|
|da6   |WD-WMAY04825619| 39 |13930|   88|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|
|da7   |WD-WMAY01340945| 37 |33612|   54|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|
|da8 ? |WD-WMAY01419853| 36 |34603|   58|    0|      1|      0|       0|   N/A|       N/A|   N/A|    N/A|
|da9   |WD-WMAY01418027| 35 |34496|   58|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|
|da10  |WD-WMAY01754916| 33 |32187|   49|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|
|da11  |WD-WCC4E0359664| 31 |24056|  101|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|
|da12  |WD-WCAY01305863| 38 |18507|   45|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|
|da13  |WD-WMAY04429512| 38 |13615|   62|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|
|da14  |WD-WCAY00488052| 37 |15529|   77|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|
|da15  |WD-WMAY04219143| 36 |13993|   62|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|
|da16  |WD-WMAY01316454| 35 |36097|   47|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|
|da17  |WD-WMAY01407506| 33 |29860|   60|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|
|da18  |WD-WCAY01342905| 35 |18393|   45|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|
|da19  |WD-WMAY05057183| 33 |15520|   75|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|
|da20  |WD-WCAY00454125| 32 |14488|   75|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|
|da21 ?|WD-WMAY01733817| 45 |42090| 6494|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|
|da22 ?|WD-WMAY01675694| 45 |38631|   31|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|
|da23 ?|WD-WCAY00808149| 45 |14590|   24|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|
|da24 ?|WD-WCAY00862938| 43 |19461|   23|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|
|da25 ?|WD-WMAY01648686| 42 |43059|   36|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|
|da26 ?|WD-WCAY00808865| 41 |15662|   26|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|
|da27 !|WD-WMAY01696895| 46 |38605|   69|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|
|da28 !|WD-WCAY00862864| 51 |14589|   23|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|
|da29 !|WD-WCAY00865833| 52 |14671|   26|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|
|da30 !|WD-WCAY00870325| 50 |16855|   26|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|
|da31 !|WD-WCAY00812424| 47 |18048|   26|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|
|da32 ?|WD-WCAY00870260| 44 |20373|  321|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|
|da33 ?|WD-WMAY01666924| 45 |41116|   29|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|
|da34 ?|WD-WCAY00850202| 44 |13276|   24|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|
+------+---------------+----+-----+-----+-----+-------+-------+--------+------+----------+------+-------+


da8 currently has 1 reallocated sector. The SMART reports I pulled show it cropping up between Aug 5th and 10th; on the 10th it was displayed as 1 reallocated sector.

The suspected drives are da30 and da31, identified by cross-referencing the GPTIDs from zpool status with the drive serial numbers.
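For reference, one way to do that cross-reference on FreeBSD is to map each gptid label to its disk with `glabel status`, then read the serial from `smartctl -i /dev/daN`. Below is a rough Python sketch of the first half. The function name `gptid_to_disk` is hypothetical, and the sample text is illustrative rather than captured from this system: in particular the `p2` partition suffix and which of the two gptids sits on da30 versus da31 are assumptions.

```python
import re

# Illustrative `glabel status` text (NOT captured from this server).
# Assumptions: the data partition is p2, and this gptid-to-disk pairing.
GLABEL_SAMPLE = """\
                                      Name  Status  Components
gptid/307eb07d-601c-11e6-bf20-001517d98fed     N/A  da30p2
gptid/316e123e-601c-11e6-bf20-001517d98fed     N/A  da31p2
"""


def gptid_to_disk(glabel_text):
    """Map each gptid label to the disk holding it (partition suffix removed)."""
    mapping = {}
    for line in glabel_text.splitlines():
        fields = line.split()
        # Data rows have three columns: label name, status, component.
        if len(fields) == 3 and fields[0].startswith("gptid/"):
            # Strip a trailing partition suffix such as "p2": da30p2 -> da30.
            mapping[fields[0]] = re.sub(r"p\d+$", "", fields[2])
    return mapping


for label, disk in gptid_to_disk(GLABEL_SAMPLE).items():
    print(label, "->", disk)
```

With the disk name in hand, the serial reported by `smartctl -i` can be matched against the Serial column of the report above.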
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Wow, they are very hot. You should do something about that ASAP; I'm pretty sure it's why you have the errors.
 
Joined
Oct 2, 2014
Messages
925
Sheeeeet, didn't even think of that! The fans are set to full; time to have a better look.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Do you see the question and exclamation marks next to the device labels? The question mark tells you there's probably a problem (here you have some drives with a few bad sectors and a few over 40 °C) and that you should see what's going on. The exclamation mark tells you there's a problem, probably critical (here you have some drives over 45 °C), so you need to do something ASAP. I don't know if you already knew this, but I prefer to be sure you know how to read the output of the script so you don't miss problems with your drives ;)
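As a rough illustration of that flag logic, here is a small Python sketch. It is not the report script's actual code: the function name `flag_for` is made up, and the thresholds (over 45 °C is critical, over 40 °C or any reallocated sector is a warning) are taken only from the description above. The real script may check other attributes as well.

```python
def flag_for(temp_c, reallocated_sectors):
    """Return '!', '?' or '' for one drive.

    Assumed rules, based only on the description in this thread:
      '!' if the temperature is over 45 C
      '?' if it is over 40 C, or if any sectors have been reallocated
    temp_c may be None when the drive reports no temperature.
    """
    if temp_c is not None and temp_c > 45:
        return "!"
    if reallocated_sectors > 0 or (temp_c is not None and temp_c > 40):
        return "?"
    return ""


# (device, temperature in C, reallocated sectors), copied from the report above.
rows = [
    ("da6", 39, 0),
    ("da8", 36, 1),
    ("da21", 45, 0),
    ("da27", 46, 0),
    ("da29", 52, 0),
]

for device, temp, realloc in rows:
    print(device, flag_for(temp, realloc))
```

For these five rows the sketch gives the same marks as the posted report, which suggests the thresholds are at least close to what the script uses.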
 
Joined
Oct 2, 2014
Messages
925
I went ahead and fixed the temp issue. It turns out the fans on my second chassis weren't set properly, and I didn't notice until now because the AC has been on super high for the past 2 weeks due to outside temps.

I changed the fan speed and turned down the AC as well. I'm going to run zpool clear and see what, if anything, comes up by tomorrow.

I also managed to edit the 2 scripts to email me :D
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Perfect ;)
 
Joined
Oct 2, 2014
Messages
925
I managed to work it out. My primary chassis had the CPUs, RAM, HBA, and some hard drives; the second chassis had the rest of the HDDs, including all the ones experiencing high temps, which was due to its fans not spinning fast enough.

Finally got the board and HBA all transferred over to the new 36-bay chassis and got the PWM fans running at the proper speed to cool it all :D NEVER EVERRRR doing all that again.
 