8.3.0 rc1 minor gui bug in volume status?

Status
Not open for further replies.

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
Hi.

Hope this is the right section to post this.

Just noticed something during a scrub.

I'm having issues with 2 hard drives in a raidz2 pool. This is the result of the scrub:

Code:
[root@nas /]# zpool status nasbackup
  pool: nasbackup
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
  scan: scrub repaired 550G in 9h30m with 0 errors on Thu Oct 25 07:52:06 2012
config:

        NAME                                            STATE     READ WRITE CKSUM
        nasbackup                                       ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/0e6e97ab-1975-11e2-8888-6805ca0a1737  ONLINE       0     0     0
            gptid/0ee79e03-1975-11e2-8888-6805ca0a1737  ONLINE       0     0 8.72M
            gptid/0f6308d1-1975-11e2-8888-6805ca0a1737  ONLINE       0     0     0
            gptid/0fdd4351-1975-11e2-8888-6805ca0a1737  ONLINE       0     0 8.72M
            gptid/1041584c-1975-11e2-8888-6805ca0a1737  ONLINE       0     0     0
            gptid/109396d2-1975-11e2-8888-6805ca0a1737  ONLINE       0     0     0

errors: No known data errors


I would interpret this as 8.72 million checksum errors?

Going to the gui, and checking volume status, only 8 checksum errors show. Is this an issue with the web interface only showing whole numbers?

If I hadn't checked 'zpool status' myself, I might have believed that there were only 8 errors, i.e. not a huge deal. Whereas 8 million errors is something different.

I didn't see anywhere that this had been posted before, so I apologize if it's been mentioned.
 

paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,403
It seems suspicious that both drives have the exact same number of errors. How are those two drives connected? What does the SMART info look like on those drives? E.g.:
Code:
smartctl -q noserial -a /dev/adaX


I can't speak to the GUI/CLI mismatch part.
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
Hi..

I'm not all that worried about the pool, or the drives. They were only hooked up for testing.

They are hooked up through eSATA though: a 2-bay enclosure and a 4-bay enclosure. What's funny is that the 2 drives throwing checksum errors are 2 of the 4 in the 4-bay enclosure. With 2 drives showing similar (same?) errors, I had thought the 2-bay enclosure was bad. But the 2 drives in the 2-bay, and the other 2 in the 4-bay, are fine.

The pool is a mix of whatever 1 TB drives I had lying around. Like I said, testing purposes only. Of the 6 drives, 1 is a Seagate 7200, 3 are Samsungs, and 2 are WD Greens. Both 'bad' drives are Samsungs though. Here's the SMART info on the 2 questionable drives:

ada1:
Code:
=== START OF INFORMATION SECTION ===
Model Family:     SAMSUNG SpinPoint F1 DT
Device Model:     SAMSUNG HD103UJ
Firmware Version: 1AA01109
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 3b
Local Time is:    Thu Oct 25 15:20:14 2012 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (11884) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 199) minutes.
Conveyance self-test routine
recommended polling time:        (  21) minutes.
SCT capabilities:              (0x003f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   076   076   011    Pre-fail  Always       -       8100
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       315
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       2
  7 Seek_Error_Rate         0x000f   100   100   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   100   100   015    Pre-fail  Offline      -       9885
  9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       32803
 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       309
 13 Read_Soft_Error_Rate    0x000e   100   100   000    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   099    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   069   058   000    Old_age   Always       -       31 (Min/Max 13/35)
194 Temperature_Celsius     0x0022   065   057   000    Old_age   Always       -       35 (Min/Max 13/36)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       166291039
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x000a   099   099   000    Old_age   Always       -       8611
201 Soft_Read_Error_Rate    0x000a   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 0
Warning: ATA Specification requires self-test log structure revision number = 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     32785         -
# 2  Extended offline    Aborted by host               30%     32631         -

Note: selective self-test log revision number (0) not 1 implies that no selective self-test has ever been run
SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.



ada2:
Code:
=== START OF INFORMATION SECTION ===
Model Family:     SAMSUNG SpinPoint F1 DT
Device Model:     SAMSUNG HD103UJ
Firmware Version: 1AA01109
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 3b
Local Time is:    Thu Oct 25 15:21:23 2012 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (11778) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 197) minutes.
Conveyance self-test routine
recommended polling time:        (  21) minutes.
SCT capabilities:              (0x003f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   076   076   011    Pre-fail  Always       -       7920
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       90
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       1
  7 Seek_Error_Rate         0x000f   253   253   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   100   100   015    Pre-fail  Offline      -       9802
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       10303
 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       84
 13 Read_Soft_Error_Rate    0x000e   100   100   000    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   099    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   067   056   000    Old_age   Always       -       33 (Min/Max 13/37)
194 Temperature_Celsius     0x0022   063   054   000    Old_age   Always       -       37 (Min/Max 13/39)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       221963206
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x000a   099   099   000    Old_age   Always       -       4306
201 Soft_Read_Error_Rate    0x000a   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 0
Warning: ATA Specification requires self-test log structure revision number = 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      9989         -

Note: selective self-test log revision number (0) not 1 implies that no selective self-test has ever been run
SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


Nothing jumps out at me. They both have reallocated sectors, but only 2 on ada1 and 1 on ada2. ada1 has almost 4 years of spin time on it.
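
For anyone wanting to repeat this check without reading the whole table by eye, here is a minimal Python sketch that pulls the raw values out of the smartctl attribute table and reports the non-zero ones from a short watch list. The watch list (IDs 5, 187, 197, 198, 199) and the function names are assumptions chosen for illustration, not anything smartctl defines, and the sample lines are copied from the ada1 output above.

```python
# Hypothetical watch list: attribute IDs commonly checked for media or link trouble.
WATCH = {5, 187, 197, 198, 199}

# A few attribute lines copied from the ada1 output above.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       2
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x000a   099   099   000    Old_age   Always       -       8611
"""

def parse_attributes(text):
    """Map attribute ID -> (name, raw value) for each attribute table row."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            raw = fields[9]
            attrs[int(fields[0])] = (fields[1], int(raw) if raw.isdigit() else raw)
    return attrs

def flagged(attrs, watch=WATCH):
    """Return only the watched attributes whose raw value is not zero."""
    return {i: v for i, v in attrs.items() if i in watch and v[1] != 0}

print(flagged(parse_attributes(SAMPLE)))
```

On the sample lines, only the reallocated sector count is reported. The UDMA CRC error count stays at zero on both drives.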

Both drives are out of warranty, even if they are bad. They both pass the extended 'long' smart test. Although for ada2, it was 13 days ago. I haven't tried running seatools on them yet.

I'm scrubbing the pool again, just out of curiosity. Only 10% done, but no errors yet.
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
If it was 8.72 meg, how come the scrub showed 550 gig repaired? 8 megs of bad data on 2 drives wouldn't have caused 550 gigs of scrub repair.

Plus, when you see a number like "2", is that 2 bytes, then? It seems more likely that it's 2 checksum errors. I don't know if that's per sector, or per zfs 'stripe'.
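
As a rough sanity check on the "count, not bytes" reading, the scrub figures can be divided out. This is only a back-of-the-envelope sketch in Python. It assumes the G and M suffixes are multiples of 1024, that 8.72M is a plain count, and that all of the repaired data belongs to the two drives showing errors. None of that is confirmed.

```python
# Back-of-the-envelope only; the assumptions below are guesses, not confirmed.
repaired_bytes = 550 * 1024**3       # "scrub repaired 550G", assuming G = 1024^3
errors_per_drive = 8.72 * 1024**2    # "8.72M" in the CKSUM column, assuming a count
total_errors = 2 * errors_per_drive  # two drives show the same figure

bytes_per_error = repaired_bytes / total_errors
print(f"{bytes_per_error / 1024:.1f} KiB repaired per checksum error")
```

Under those assumptions it comes to roughly 31.5 KiB per error, far more than a 512-byte sector. So if the numbers are counts, each error probably covers a larger chunk than a single sector.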
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I'm thinking that 550GB repaired means that at some point you replaced a failed drive and didn't do a scrub of the zpool to restore the data to the newly added drive. At least, when I did some drive replacements I saw really high GB/TB values because I was replacing a drive.

If you never replaced a failed drive in the zpool, did you ever use the array with 1 drive (or potentially more) not connected (or perhaps different drives at different times), so that the drive needed to resync with the array? It really sounds like one of your hard drives got out of sync with the zpool somehow and your scrub operation corrected the discrepancy.
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
But upon replacing a drive, doesn't the resilver handle restoring data to the drive? I've replaced drives in my raidz3 pool, and after the resilver, I ran a scrub. The scrub showed zero repaired.

The pool in question has never been used degraded. Not as far as I'm aware at least.

I just noticed on the console that I'm getting timeout errors coming from the controller of the 4-drive eSATA enclosure. I still don't know if it's the drives, the enclosure, or the controller.

I still think the checksum stats are just numbers, not megabytes or anything like that. Just the total number of checksum errors. To keep the output smaller when the numbers get big, it simply appends the correct number suffix. It just seems that the gui doesn't interpret this correctly.
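
To illustrate the difference, here is a minimal Python sketch of expanding a suffixed counter such as "8.72M" back into a plain number. The function name `expand_count` is hypothetical, and the default base of 1024 is an assumption about how zpool abbreviates its counters. Pass `base=1000` if the suffixes turn out to be decimal. How the gui actually parses the column is not known; the last line only shows one way to end up with 8.

```python
# Hypothetical helper: expand a humanized counter such as "8.72M" into an integer.
SUFFIXES = "KMGTPE"

def expand_count(text, base=1024):
    """Turn '8.72M' into a full count; plain digits pass through unchanged.

    base=1024 is an assumption about how the counters are abbreviated.
    """
    text = text.strip()
    if text and text[-1].upper() in SUFFIXES:
        exponent = SUFFIXES.index(text[-1].upper()) + 1
        return round(float(text[:-1]) * base ** exponent)
    return int(text)

print(expand_count("8.72M"))             # full count, binary suffix assumed
print(expand_count("8.72M", base=1000))  # full count, decimal suffix assumed
print(int(float("8.72M"[:-1])))          # dropping suffix and fraction leaves 8
```

Either way the full count is in the millions, while dropping the suffix and the fractional part leaves the 8 that the gui shows.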
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
If you do a hard drive replacement using the GUI, I know a resilvering will start automatically. I'm not sure about the command line though. If you wanted to, and this would be stupid, you could stop the resilvering from the command line after replacing a drive from the GUI. Also, if you replace a drive and reboot before the resilvering finishes, I don't think it will restart automatically.

So there are a few situations where a resilvering may not complete.

If a drive is accidentally unplugged from the system and later reattached FreeNAS will not automatically start resilvering. I don't know how it would be possible for a drive to be detached and then reattached without you either being the cause or knowing of the issue.

In any case the 550GB of data seems to be from a drive that was not in sync with the zpool. Maybe someone else can pipe up with more experience with this or another theory.
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
I'm pretty sure the 550 gig of repair is from the 2 drives showing checksum errors. When the checksum errors were lower, the repaired number was much lower too. Something happened to those two drives that corrupted a significant portion of them. Yeah, replacing a drive via the command line absolutely kicks off a resilver. For the drives I did swap, I used the command line: "offline", "replace", wait for the resilver, then "detach" the old drive.

Anyway, I just thought I'd mention the gui anomaly. Not a big deal, but not a true representation of what the cli would tell you.
 