SOLVED CAM status: uncorrectable parity/crc error


hidperf

Dabbler
Joined
Jun 12, 2012
Messages
34
I did some searching before posting this, but nothing I've tried has fixed my problem.

I've got one drive (ada3) that keeps throwing this error. Per this thread I swapped the SATA cable with no luck.

smartctl -a /dev/ada4 gets this
Code:
smartctl 6.1 2013-03-16 r3800 [FreeBSD 9.1-STABLE amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
 
=== START OF INFORMATION SECTION ===
Model Family:    Western Digital Red (AF)
Device Model:    WDC WD10EFRX-68JCSN0
Serial Number:    WD-WMC1U5316668
LU WWN Device Id: 5 0014ee 6ad43ffe3
Firmware Version: 01.01A01
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Sizes:    512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:  ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Sep  3 21:20:28 2013 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
 
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
 
General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (  0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (12960) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (  2) minutes.
Extended self-test routine
recommended polling time:        ( 148) minutes.
Conveyance self-test routine
recommended polling time:        (  5) minutes.
SCT capabilities:              (0x30bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.
 
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   138   135   021    Pre-fail  Always       -       4091
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       96
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       7105
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       96
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       86
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       9
194 Temperature_Celsius     0x0022   114   107   000    Old_age   Always       -       29
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1
 
SMART Error Log Version: 1
No Errors Logged
 
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline      Completed: read failure      10%      7049        1953523152
# 2  Short offline      Completed: read failure      10%      7046        1953523152
# 3  Short offline      Completed: read failure      10%      7043        1953523152
# 4  Short offline      Completed: read failure      10%      7040        1953523152
# 5  Short offline      Completed: read failure      60%      7037        1953523152
# 6  Extended offline    Completed: read failure      10%      7028        1953523152
# 7  Short offline      Completed: read failure      60%      6801        1953523152
# 8  Short offline      Completed: read failure      60%      6777        1953523152
# 9  Short offline      Completed: read failure      60%      6753        1953523152
#10  Short offline      Completed: read failure      60%      6729        1953523152
#11  Short offline      Completed: read failure      60%      6710        1953523152
#12  Short offline      Interrupted (host reset)      90%      6705        -
#13  Short offline      Completed: read failure      60%      6686        1953523152
#14  Short offline      Completed: read failure      60%      6662        1953523152
#15  Short offline      Completed: read failure      60%      6638        1953523152
#16  Short offline      Completed: read failure      60%      6614        1953523152
#17  Short offline      Completed: read failure      60%      6590        1953523152
#18  Short offline      Completed: read failure      60%      6566        1953523152
#19  Short offline      Completed: read failure      60%      6542        1953523152
#20  Short offline      Completed: read failure      60%      6518        1953523152
#21  Short offline      Completed: read failure      60%      6494        1953523152
 
SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


zpool status -v returns this
Code:
  pool: freenas
 state: ONLINE
  scan: resilvered 1.27M in 0h0m with 0 errors on Sun Sep  1 15:59:45 2013
config:
 
        NAME                                            STATE    READ WRITE CKSUM
        freenas                                        ONLINE      0    0    0
          raidz2-0                                      ONLINE      0    0    0
            gptid/10efd3f1-1a1c-11e2-86a1-000e04b77199  ONLINE      0    0    0
            gptid/114dadc7-1a1c-11e2-86a1-000e04b77199  ONLINE      0    0    0
            gptid/11aa50f8-1a1c-11e2-86a1-000e04b77199  ONLINE      0    0    0
            gptid/1206470c-1a1c-11e2-86a1-000e04b77199  ONLINE      0    0    0
            gptid/1264c2f2-1a1c-11e2-86a1-000e04b77199  ONLINE      0    0    0
            gptid/12c11d81-1a1c-11e2-86a1-000e04b77199  ONLINE      0    0    0
 
errors: No known data errors


I also found another post where they had commands (dd commands) to repair the bad area of the disk, but I can't find it right now. I tried the commands and they wouldn't run. So that's why I'm here.

The only thing I haven't tried yet is swapping SATA ports on the motherboard and seeing if the errors follow the drive or the port. That's next.

Any other ideas?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Well, you should start by not blindly running commands. People who have done that in the past are renowned for their next thread being "ZOMG my data is gone!".

If ada3 is having the problems then you need to post the SMART data for ada3, not ada4. ada4 looks fine though (not surprisingly).

Your zpool status doesn't indicate any errors. What makes you think ada3 is a problem (or that you have a problem at all)? Is it the fact that 1.27M were resilvered? The fact that it finished in 0h0m makes me think that a disk was disconnected, then reconnected, then a scrub (or resilver) was performed.

So can you explain in A LOT more detail what is going on and what you've done? Because all I'm seeing is that there isn't a problem, but if someone (points at YOU) doesn't stop messing with the pool you're going to have problems. The kind that make your wife want to leave you when you lose the family albums ;)
 

Dusan

Guru
Joined
Jan 29, 2013
Messages
1,165
Looks fine? Huh? All smart tests in that log finished with "read failure"! Replace the disk!
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Sorry. I didn't look at the test log. Short tests are typically worthless. What I don't understand is that the short tests fail with a read error, but there are no read errors recorded in the raw SMART data. I hope this isn't a sign of how WD Reds behave normally. I'd replace that drive regardless. Something is very wrong with it. My guess is firmware issues, but who knows. Just replace it and be done with it.
 

Dusan

Guru
Joined
Jan 29, 2013
Messages
1,165
cyberjock said:
What I don't understand is that the short tests fail with a read error, but there are no read errors recorded in the raw SMART data.

That's OK. It shows that the SMART tests discovered bad sector(s) that are not yet used by the pool. zpool status doesn't show any READ errors either. However, as the pool fills with data you would eventually start to see the errors. (SMART tests are far from worthless; they can detect bad sectors before the OS starts to use them.)
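One hedged way to sanity-check the "not yet used by the pool" idea is to look at where the failing LBA sits on the disk. The sketch below uses only figures from the smartctl output in the first post (user capacity, logical sector size, and LBA_of_first_error); the variable names are made up for illustration. It says nothing about where the ZFS partition actually ends, which would need the partition table to confirm.

```python
# Figures copied from the smartctl output in the first post.
CAPACITY_BYTES = 1_000_204_886_016   # "User Capacity"
LOGICAL_SECTOR_BYTES = 512           # "Sector Sizes: 512 bytes logical"
FIRST_ERROR_LBA = 1_953_523_152      # "LBA_of_first_error" in the self-test log

# How many logical sectors the drive exposes, and how far the failing
# LBA is from the last one.
total_sectors = CAPACITY_BYTES // LOGICAL_SECTOR_BYTES
sectors_from_end = total_sectors - FIRST_ERROR_LBA
bytes_from_end = sectors_from_end * LOGICAL_SECTOR_BYTES

print(f"total sectors:    {total_sectors}")
print(f"sectors from end: {sectors_from_end}")
print(f"bytes from end:   {bytes_from_end} (~{bytes_from_end / 2**20:.2f} MiB)")
```

With these inputs the failing LBA is 2016 sectors (just under 1 MiB) short of the end of the disk, so it is at least plausible that nothing in the pool has touched it yet.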

hidperf, could you please verify your smartd settings for me?
  • Do you have Services / S.M.A.R.T enabled?
  • Do you have Email to report filled in with a valid email?
  • Do you have any of the temperatures set to something other than 0?
If the answer is yes to all questions and you did not receive an e-mail about number of failed smart tests increasing then it confirms this bug for ATA drives: https://bugs.freenas.org/issues/2537
The -a smartd directive includes -l selftest, which would log a CRIT message and send an email "if the number of failed tests reported in the SMART Self-Test Log has increased since the last check, or if the timestamp associated with the most recent failed test has increased." (http://smartmontools.sourceforge.net/man5/smartd.conf.5.html)

If you do not have the e-mail warning enabled, you can also check this:
grep "Self-Test Log error" /var/log/messages
The error logged and e-mailed is either "Device: %s, new Self-Test Log error at hour timestamp %d\n" or "Device: %s, Self-Test Log error count increased from %d to %d\n" (http://sourceforge.net/apps/trac/smartmontools/browser/trunk/smartmontools/smartd.cpp#L2399).
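If grep isn't convenient, the same check can be sketched in Python. The two patterns below are modelled on the message formats quoted above; the function name and the sample log lines are invented for illustration and are not taken from the OP's system.

```python
import re

# Patterns modelled on the two smartd message formats quoted above.
NEW_ERROR = re.compile(
    r"Device: (?P<dev>.+?), new Self-Test Log error at hour timestamp (?P<hour>\d+)"
)
COUNT_UP = re.compile(
    r"Device: (?P<dev>.+?), Self-Test Log error count increased "
    r"from (?P<old>\d+) to (?P<new>\d+)"
)

def find_selftest_alerts(lines):
    """Return (device, description) tuples for matching smartd log lines."""
    alerts = []
    for line in lines:
        m = NEW_ERROR.search(line)
        if m:
            alerts.append((m["dev"], f"new error at hour {m['hour']}"))
            continue
        m = COUNT_UP.search(line)
        if m:
            alerts.append((m["dev"], f"count {m['old']} -> {m['new']}"))
    return alerts

# Made-up sample lines, for illustration only.
sample = [
    "Sep  3 21:00:01 freenas smartd[1234]: Device: /dev/ada3, "
    "new Self-Test Log error at hour timestamp 7049",
    "Sep  3 21:00:01 freenas smartd[1234]: Device: /dev/ada3, "
    "Self-Test Log error count increased from 4 to 5",
    "Sep  3 21:00:02 freenas kernel: unrelated message",
]
for dev, what in find_selftest_alerts(sample):
    print(dev, what)
```

To run it against the real log, pass `open("/var/log/messages")` instead of `sample`. An empty result would be consistent with the warning never having been logged.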
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Dusan said:
That's OK. It shows that the SMART tests discovered bad sector(s) that are not yet used by the pool. zpool status doesn't show any READ errors either. However, as the pool fills with data you would eventually start to see the errors. (SMART tests are far from worthless; they can detect bad sectors before the OS starts to use them.)

Here's my problem. If a test found a bad sector (whether or not it has been used before, is being used now, etc.), it should have shown up in the raw disk information. It didn't. That scares the living crap out of me. I'd have expected Read Error Rate, Reallocated Sector Count, Reallocated Event Count, Current Pending Sector Count, or Offline Uncorrectable to be non-zero. At least one of them should show something. Instead they are reporting a drive with no problems.

Short tests are worthless. We've been through this on the forums many times. The short version (pun not intended) is that a short test checks things that, if they had a problem, would have been blatantly evident before the next short test ran anyway. You'd have the drive disconnecting and reconnecting, zpool read/write/checksum errors, SMART errors, etc. The only exception in any form is the surface test (and that exception is limited to just a few million sectors or so). But the surface test in the short test is still pretty much worthless. It checks a small number of sectors in each of the "zones" of the disk and a few other sectors that are on its questionable sector list (which normally would be empty*), and that's all. A long test can be either the equivalent of a short test plus a full surface scan or just a surface scan (depends on the manufacturer).

* - Normally, sectors will only end up on the questionable sector list if a long smart test identifies them. If they aren't identified on a long sector test then they are usually hard failed or were soft errors(neither of which would be expected to "correct" themselves during a short test).

I don't know if we've seen a WD Red fail in this forum. I truly hope this isn't a game by WD to make disks not report their failing status when they are failing. If so, the WD Reds would require significantly more maintenance (SMART tests, scrubs, etc.) than other drives, which isn't good. Considering that most Windows servers I've seen in the world have never had a SMART test run on their array, it could be a very smart move by WD to lower the RMA numbers by basically having drives falsely report their raw data and pretty much force failure identification to be possible only via SMART tests. Many RAID controllers don't allow running SMART tests on disks. So with a disk that doesn't report errors like most disks, and no way to run a SMART test, you pretty much won't know a disk is failing until it's failing big time. If they can stretch that time out long enough, they won't have to deal with the cost of an RMA, so they save money. And you'll blindly assume that the disk failed suddenly after over 3 years of service and may wrongly jump to the conclusion that the disk was very reliable and trustworthy until it failed suddenly.

It's a win-win-win for WD and a loss for the users. I'll definitely be watching more WD Reds in the future to see if this is just a fluke or something that is seriously wrong.
 

Dusan

Guru
Joined
Jan 29, 2013
Messages
1,165
cyberjock said:
Here's my problem. If a test found a bad sector (whether or not it has been used before, is being used now, etc.), it should have shown up in the raw disk information. It didn't. That scares the living crap out of me. I'd have expected Read Error Rate, Reallocated Sector Count, Reallocated Event Count, Current Pending Sector Count, or Offline Uncorrectable to be non-zero. At least one of them should show something. Instead they are reporting a drive with no problems.

No. Wrong. This is perfectly possible if the bad sector was never accessed by the system (the pool was never full). Do you have anything to back the claim that the sector should show in the attributes if it was never accessed? Manufacturer documentation, spec sheet, ATA standard, ... (Trust me, I checked.)

The short/long tests do not update SMART attributes; the result is logged only in the self-test error log. Quoting from the smartctl man page:
"short - [ATA] runs SMART Short Self Test (usually under ten minutes). This command can be given during normal system operation (unless run in captive mode - see the '-C' option below). This is a test in a different category than the immediate or automatic offline tests. The "Self" tests check the electrical and mechanical performance as well as the read performance of the disk. Their results are reported in the Self Test Error Log, readable with the '-l selftest' option." (http://smartmontools.sourceforge.net/man/smartctl.8.html)

Only the immediate offline test updates the attributes. Quoting the documentation again:
"offline - [ATA] runs SMART Immediate Offline Test. This immediately starts the test described above. This command can be given during normal system operation. The effects of this test are visible only in that it updates the SMART Attribute values, and if errors are found they will appear in the SMART error log, visible with the '-l error' option."

You also mention the reallocated sector count. If it is a hard unrecoverable error, reallocation will not happen on a read operation, as there is nothing to reallocate (the sector is unreadable). The drive will keep the sector as is, hoping that it may be able to read it later. The reallocation will definitely happen only when you try to write into that sector. See: ATA drive is failing self-tests, but SMART health status is 'PASSED'. What's going on?

cyberjock said:
Short tests are worthless.

I must disagree again. Even this particular case shows that a short test can detect a problem before bad things happen. It is also possible for a short test to fail while the long test passes (mostly Seagate drives): http://www.hardwarecanucks.com/foru...-s-m-r-t-short-test-but-passes-long-test.html, http://forums.seagate.com/t5/Desktop-HDD-Desktop-SSHD/Fail-short-test-pass-long-test-query/m-p/89184, http://forums.seagate.com/t5/Deskto...used-Short-Test-Fail-Long-Tes-Pass/td-p/41857

I won't comment on the WD conspiracy theory.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Dusan said:
No. Wrong. This is perfectly possible if the bad sector was never accessed by the system (the pool was never full). Do you have anything to back the claim that the sector should show in the attributes if it was never accessed? Manufacturer documentation, spec sheet, ATA standard, ... (Trust me, I checked.)

The information you need is not provided by the Manufacturer to consumers. I have inside info on WD drives and I have some data for Seagates from circa 2008(I can't vouch if they changed though). I'm not a fan of Seagates because they do very unusual things with their SMART attributes. But, I do have a hard drive that has never had a partition table put on it, never been used for data storage, and only short tests were performed, and it has a value of over 10000 for Current Pending Sector Count. At the same time that the Current Pending Sector Count went above zero, only a minute before that a short test had been started.

Dusan said:
The short/long tests do not update SMART attributes; the result is logged only in the self-test error log. Quoting from the smartctl man page:
"short - [ATA] runs SMART Short Self Test (usually under ten minutes). This command can be given during normal system operation (unless run in captive mode - see the '-C' option below). This is a test in a different category than the immediate or automatic offline tests. The "Self" tests check the electrical and mechanical performance as well as the read performance of the disk. Their results are reported in the Self Test Error Log, readable with the '-l selftest' option." (http://smartmontools.sourceforge.net/man/smartctl.8.html)

That says nothing about what the manufacturer does for a short test though (or a long test for that matter). Some drives (one of my old SSDs comes to mind) take less than 3 seconds to complete a short test, and the manufacturer has said it actually performs no test at all, but they didn't disable SMART tests because that may cause problems with some software and/or hardware configurations that automatically run SMART tests on a schedule. They simply report that the SMART test passed. They said that if something was wrong with the drive it would fail completely or fail the diagnostic on bootup. Of course, that's not too useful, as I'd prefer that it not run the test at all and give me an error that the test isn't supported (my Intel SSD doesn't have a Conveyance test option and returns an error if you try to run one). Same for a long test on that drive, unfortunately. I was tipped off that something was horribly wrong when I tried to do a long test and a 64GB drive reported it passed 3 seconds later.

Dusan said:
Only the immediate offline test updates the attributes. Quoting the documentation again:
"offline - [ATA] runs SMART Immediate Offline Test. This immediately starts the test described above. This command can be given during normal system operation. The effects of this test are visible only in that it updates the SMART Attribute values, and if errors are found they will appear in the SMART error log, visible with the '-l error' option."

My issue is that you are quoting smartctl. smartctl only tells the drive to run a test. What the test actually performs is totally up to the manufacturer. They can do no test at all(such as the SSD I mentioned above), or they can do an exhaustive test of every single component on the drive. The choice is theirs, and they aren't about to tell you what a particular SMART test does or doesn't do. In general, a long test is supposed to be the short test, but include a total surface scan of all user areas. (I'll discuss this more below)

Dusan said:
You also mention the reallocated sector count. If it is a hard unrecoverable error, reallocation will not happen on a read operation, as there is nothing to reallocate (the sector is unreadable). The drive will keep the sector as is, hoping that it may be able to read it later. The reallocation will definitely happen only when you try to write into that sector. See: ATA drive is failing self-tests, but SMART health status is 'PASSED'. What's going on?

Yeah, I've read that link before. They aren't explaining some small details. If a sector is labeled as UNC it should be counted in the "Current_Pending_Sector" value. If you do happen to write to those sectors they will be marked "bad", your newly written data will then be written to the spare sectors, Reallocated_Sector_Ct and/or Reallocated_Event_Count may increment, you may see Raw_Read_Error_Rate go up, and life goes on. The whole reason it behaves this way is to let a RAID controller accept the fact that the sector is bad and regenerate the missing data from parity (if it exists). If it doesn't exist, well, the data was already lost, so it really doesn't matter whether the drive reports the loss on your next read or when it found the problem. But it's better to allow a possible RAID controller (or software backup) to restore the bad data. I have a drive that I just RMAed that did exactly that. I was somewhat disappointed, because I requested the RMA after a long SMART test found problems, and then I did a scrub just before the disk replacement (which lowered Current_Pending_Sector to zero, while Reallocated_Event_Count and Reallocated_Sector_Ct went above zero).

You have to keep in mind that the health status is, from what I understand, based solely on if the "Value" is worse than the "Threshold".

Here's one of my disks..

Code:
 
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   147   141   021    Pre-fail  Always       -       9650
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1017
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   063   063   000    Old_age   Always       -       27129
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       473
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       403
193 Load_Cycle_Count        0x0032   199   199   000    Old_age   Always       -       4225
194 Temperature_Celsius     0x0022   118   101   000    Old_age   Always       -       34
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0


Everything looks good.

Now, if I changed a line, say...Reallocated_Sector_Ct had a value of 139 or lower(note that the "THRESH" is 140) then I'd expect that the drive would have said:

SMART overall-health self-assessment test result: FAILED (or something to that effect). Then you get the BIOS warnings on bootup and all sorts of other nastiness. Of course, at that point, you are probably in deep doo doo and if you don't have a backup you've probably lost significant data. Not always, but usually. One RMAed drive I had gave me a BIOS warning on first powerup despite everything else being fine. I could test the drive all day long and have no errors. I still called Seagate and they sent me a second RMA in the same week.

Now check out all the drive THRESH that are zero. Those will never ever trip SMART failure warnings. Kind of crappy if you ask me because Current_Pending_Sector count seems to often be my first indication that a drive is failing(for WDs.. Seagates seem to be Multi_Zone_Error_Rate from my experience). And no matter how many sectors fail, you will never go "below" zero. So unless you look at the actual attributes and interpret them, your disk could be losing data left and right and you might not know it until the drive is practically dead. Me personally, I want to know the second a disk starts having problems of any kind. Not just when reallocation events reach the THRESH value.
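A rough sketch of the "look at the actual attributes" review described above. It parses rows in the layout smartctl prints, lists attributes whose THRESH is zero (so they can never trip the health check), and flags a non-zero raw value on a small watch list. The watch list, the function names, and the altered sample value are assumptions for illustration, not anything smartctl itself provides.

```python
# Attribute IDs named earlier in the thread as early-warning signs
# (an illustrative choice, not an official list).
WATCH_IDS = {1, 5, 196, 197, 198}

def parse_attributes(text):
    """Parse rows of a smartctl-style attribute table into dicts."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Data rows have at least 10 columns and start with a numeric ID.
        if len(parts) >= 10 and parts[0].isdigit():
            rows.append({
                "id": int(parts[0]),
                "name": parts[1],
                "value": int(parts[3]),
                "thresh": int(parts[5]),
                "raw": parts[9],
            })
    return rows

def review(rows):
    """Return (never_trip, worth_a_look) lists of attribute names."""
    never_trip = [r["name"] for r in rows if r["thresh"] == 0]
    worth_a_look = [
        r["name"] for r in rows
        if r["id"] in WATCH_IDS and r["raw"].isdigit() and int(r["raw"]) > 0
    ]
    return never_trip, worth_a_look

# Three rows from the table above; the raw value of 197 is changed to 8
# purely to show what a flagged attribute looks like.
sample = """
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       8
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""
rows = parse_attributes(sample)
never_trip, worth_a_look = review(rows)
print("THRESH of 0 (cannot trip the health check):", never_trip)
print("Non-zero raw value on a watched attribute:", worth_a_look)
```

In this made-up sample the pending-sector count is flagged even though the overall health result would still read PASSED, because its threshold is zero.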

Now let me muddy the waters. Reallocated_Sector_Ct was 200, right? And we said that when it hits 139 it will trigger the nasty SMART Failed message. That doesn't mean that if you reallocate 61 sectors it will trigger the warning. Each integer increment might be linear, might be exponential, or each increment might be 10k sectors. Which one is it for your disks? How about my disks? Are they even the same? Are you sure about that?

I do perform regular long tests on my drives (twice a month), on the 7th and 21st. Typically, if the RAW_VALUE looks like the disk isn't in perfect health, a long test has always failed for me. Maybe that's been luck, or the way my batch was made, I don't know. But the important thing is that a failed short or long test gives you the opportunity to RMA the drive. I'd always RMA a drive that fails a short or long test. Regardless of what it tests (or doesn't test), the outcome is the same. Any manufacturer's SMART test shouldn't ever fail. Luckily for hard disk manufacturers, I've never seen a Windows desktop that had SMART tests run on it regularly aside from ones I've set up myself, so the average user is often unaware of any indication of failure until it is significant (and frequently self-evident). I try to stay proactive with my server disks, I use RAIDZ3, and I replace them at the first sign of problems.
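FreeNAS sets up test schedules through its GUI, but on a plain smartd setup a "7th and 21st" schedule would usually be written as a `-s` regular expression in smartd.conf, which smartd matches against a 12-character `T/MM/DD/d/HH` string. The sketch below only imitates that matching in Python to show what such an expression accepts; the 02:00 hour and the function names are assumptions, since the post gives no time of day.

```python
import re
from datetime import datetime

# Hypothetical schedule: long test ("L") on the 7th and 21st of any month,
# any weekday, at 02:00. The hour is an assumption for illustration.
SCHEDULE = r"L/../(07|21)/./02"

def smartd_stamp(test_type, when):
    """Build the T/MM/DD/d/HH string for a given test type and time."""
    # isoweekday() numbers days Monday=1 through Sunday=7.
    return f"{test_type}/{when:%m}/{when:%d}/{when.isoweekday()}/{when:%H}"

def would_run(test_type, when, schedule=SCHEDULE):
    """True if the whole stamp matches the schedule expression."""
    return re.fullmatch(schedule, smartd_stamp(test_type, when)) is not None

print(smartd_stamp("L", datetime(2013, 9, 7, 2)))
print(would_run("L", datetime(2013, 9, 7, 2)))    # 7th at 02:00
print(would_run("L", datetime(2013, 9, 8, 2)))    # wrong day
print(would_run("S", datetime(2013, 9, 21, 2)))   # short test, not long
```

Python's `re` is not the same engine smartd uses, so treat this as a way to reason about the pattern rather than a guarantee of how smartd will behave.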

Dusan said:
I must disagree again. Even this particular case shows that a short test can detect a problem before bad things happen. It is also possible for a short test to fail while the long test passes (mostly Seagate drives): http://www.hardwarecanucks.com/foru...-s-m-r-t-short-test-but-passes-long-test.html, http://forums.seagate.com/t5/Desktop-HDD-Desktop-SSHD/Fail-short-test-pass-long-test-query/m-p/89184, http://forums.seagate.com/t5/Deskto...used-Short-Test-Fail-Long-Tes-Pass/td-p/41857

Your first link is similar to this thread, but Seagate had major firmware issues in 2009 (the main reason why I stopped using them after buying and recommending only Seagate for more than 10 years). If I buy $2k worth of drives that turn into paperweights within 90 days because they can't perform their function, and then they pass every test you throw at them (hence I don't qualify for an RMA), don't expect me to buy more of them. (I switched to WD at that point and I've had good luck with them...so far.) Most of those drives are still in the 20-drive box I put them in, because I can't trust them to store data without randomly disconnecting from any system they are put in. Of course, my issue is unrelated to the issue we are discussing. I really can't explain why a short test would fail while a long test would pass. The SMART spec used to say that a long test required all short tests plus the full surface scan, but I don't know if that has changed or not. But the short tests typically are things like a controller diagnostic (done on POST, and failure is often evident because the disk disconnects from the host), a check for bad RAM cache on the drive (often evident because you'll start seeing all sorts of nasty behavior as things get corrupted, zpool status will identify the corrupted data, etc.), and a short seek test (well, if the drive is having problems seeking you'd again know before you did a test that something was very wrong, it would probably show up in zpool status, etc.).

Dusan said:
I won't comment on the WD conspiracy theory.

And I don't blame you. The non-standard (based on the information I have) use of the SMART data worries me, but it could be a failing drive that is having other problems, so it's not a big deal. Right now we have no solid ground to claim anything is actually awry (except that the short test is failing, and that is definitely bad). But its behavior is not what I'd expect, which is why I'll keep an eye on the forums for future failing WD Reds and see how they behave. Might be a fluke or might not. I'm not going to go rushing out with a conspiracy theory based on a single disk's issues.

------------------------------------


We've kind of gotten way off topic on this. The reality is that the OP's hard drive clearly has something wrong with it. Regardless of whether this is normal behavior for SMART, normal behavior for WD drives, etc., the fact remains that it should definitely be RMAed (or at least not relied on long-term). I think we both agree on that. As for what is or isn't normal behavior, that seems to be the manufacturers' secret sauce and not something they'll discuss with the public.
 

hidperf

Dabbler
Joined
Jun 12, 2012
Messages
34
Well I have no idea what's going on, but I posted a reply to this last night and apparently it didn't show up. So I'll try and recap what I posted.

Cyberjock, those results are from ada3, not ada4. I typed it wrong. It's been a long week.

The NAS isn't acting strange or doing anything that would make me think it has a problem, other than this report I get every night. I just noticed that ada2 now shows up in it too, which has never happened before.

Code:
kernel log messages:
+++ /tmp/security.HFzGPfke      2013-09-05 03:01:00.000000000 -0500
+(ada2:ahcich2:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 d8 70 b9 40 23 00 00 00 00 00
+(ada2:ahcich2:0:0:0): CAM status: Uncorrectable parity/CRC error
+(ada2:ahcich2:0:0:0): Retrying command
+(ada2:ahcich2:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 e0 70 b9 40 23 00 00 00 00 00
+(ada2:ahcich2:0:0:0): CAM status: Uncorrectable parity/CRC error
+(ada2:ahcich2:0:0:0): Retrying command
+(ada2:ahcich2:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 e8 70 b9 40 23 00 00 00 00 00
+(ada2:ahcich2:0:0:0): CAM status: Uncorrectable parity/CRC error
+(ada2:ahcich2:0:0:0): Retrying command
+(ada2:ahcich2:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 08 95 bc 40 39 00 00 00 00 00
+(ada2:ahcich2:0:0:0): CAM status: Uncorrectable parity/CRC error
+(ada2:ahcich2:0:0:0): Retrying command
+(ada3:ahcich3:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 28 00 b0 69 40 65 00 00 00 00 00
+(ada3:ahcich3:0:0:0): CAM status: Uncorrectable parity/CRC error
+(ada3:ahcich3:0:0:0): Retrying command
+(ada3:ahcich3:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 88 cc 1f 40 24 00 00 00 00 00
+(ada3:ahcich3:0:0:0): CAM status: Uncorrectable parity/CRC error
+(ada3:ahcich3:0:0:0): Retrying command
+(ada3:ahcich3:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 98 c2 be 40 39 00 00 00 00 00
+(ada3:ahcich3:0:0:0): CAM status: Uncorrectable parity/CRC error
+(ada3:ahcich3:0:0:0): Retrying command
......
+(ada3:ahcich3:0:0:0): Retrying command
+(ada3:ahcich3:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 18 a8 ec d0 40 39 00 00 00 00 00
+(ada3:ahcich3:0:0:0): CAM status: Uncorrectable parity/CRC error
+(ada3:ahcich3:0:0:0): Retrying command
+(ada3:ahcich3:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 10 98 da 93 40 65 00 00 00 00 00
+(ada3:ahcich3:0:0:0): CAM status: Uncorrectable parity/CRC error
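
For a long nightly report like the one above, it can help to tally the CAM status lines per device rather than reading them one by one. The following is only a minimal sketch: the sample text is a few lines copied from the report, and the regular expression and the `count_cam_errors` helper are illustrative assumptions, not part of FreeNAS or FreeBSD.

```python
import re
from collections import Counter

# A few lines copied from the nightly report above.
LOG = """\
+(ada2:ahcich2:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 d8 70 b9 40 23 00 00 00 00 00
+(ada2:ahcich2:0:0:0): CAM status: Uncorrectable parity/CRC error
+(ada2:ahcich2:0:0:0): Retrying command
+(ada3:ahcich3:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 28 00 b0 69 40 65 00 00 00 00 00
+(ada3:ahcich3:0:0:0): CAM status: Uncorrectable parity/CRC error
+(ada3:ahcich3:0:0:0): Retrying command
+(ada3:ahcich3:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 88 cc 1f 40 24 00 00 00 00 00
+(ada3:ahcich3:0:0:0): CAM status: Uncorrectable parity/CRC error
"""

# Matches "(adaN:ahcichM:x:y:z): CAM status: <message>".
CAM_STATUS = re.compile(r"\((\w+):(\w+):[\d:]+\): CAM status: (.+)")


def count_cam_errors(text):
    """Return a Counter keyed by (device, channel, message)."""
    counts = Counter()
    for line in text.splitlines():
        match = CAM_STATUS.search(line)
        if match:
            counts[match.groups()] += 1
    return counts


for (device, channel, message), n in sorted(count_cam_errors(LOG).items()):
    print(f"{device} on {channel}: {n} x {message}")
```

Errors spread across several devices, as here with ada2 and ada3, are easier to spot in a tally like this than in the raw log.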


And I don't "blindly run commands"; I do my research first. Being familiar with how forums work, I'm always amazed how you get flamed if you post a question without trying to get results on your own. And you get flamed if you post a question even after attempting to fix things yourself. But the knowledge and help I've gotten from this forum is worth the mild burning smell. :)

I found a similar thread on another site HERE where they fixed the problem, so I tried it.

I ran a long test
Code:
smartctl -t long /dev/ada3


Which got this
Code:
 smartctl -l selftest /dev/ada3
smartctl 6.1 2013-03-16 r3800 [FreeBSD 9.1-STABLE amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
 
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Interrupted (host reset)      80%      7126        -
# 2  Short offline      Interrupted (host reset)      80%      7106        -
# 3  Short offline      Completed: read failure      10%      7049        1953523152
# 4  Short offline      Completed: read failure      10%      7046        1953523152
# 5  Short offline      Completed: read failure      10%      7043        1953523152
# 6  Short offline      Completed: read failure      10%      7040        1953523152
# 7  Short offline      Completed: read failure      60%      7037        1953523152
# 8  Extended offline    Completed: read failure      10%      7028        1953523152
# 9  Short offline      Completed: read failure      60%      6801        1953523152
#10  Short offline      Completed: read failure      60%      6777        1953523152
#11  Short offline      Completed: read failure      60%      6753        1953523152
#12  Short offline      Completed: read failure      60%      6729        1953523152
#13  Short offline      Completed: read failure      60%      6710        1953523152
#14  Short offline      Interrupted (host reset)      90%      6705        -
#15  Short offline      Completed: read failure      60%      6686        1953523152
#16  Short offline      Completed: read failure      60%      6662        1953523152
#17  Short offline      Completed: read failure      60%      6638        1953523152
#18  Short offline      Completed: read failure      60%      6614        1953523152
#19  Short offline      Completed: read failure      60%      6590        1953523152
#20  Short offline      Completed: read failure      60%      6566        1953523152
#21  Short offline      Completed: read failure      60%      6542        1953523152
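
The self-test log above is easier to summarize once it is parsed into rows. This is a rough sketch under the assumption that the columns are separated by runs of two or more spaces, as they are in the output pasted here; the `ROW` pattern and `parse_selftest_log` are made-up names for illustration, and only a handful of rows from the log are included as sample data.

```python
import re
from collections import Counter

# A few rows copied from the self-test log above.
SELFTEST_LOG = """\
# 1  Extended offline    Interrupted (host reset)      80%      7126        -
# 2  Short offline      Interrupted (host reset)      80%      7106        -
# 3  Short offline      Completed: read failure      10%      7049        1953523152
# 8  Extended offline    Completed: read failure      10%      7028        1953523152
# 9  Short offline      Completed: read failure      60%      6801        1953523152
"""

# Columns: number, test description, status, remaining %, lifetime hours, LBA.
ROW = re.compile(
    r"^#\s*(\d+)\s{2,}(.+?)\s{2,}(.+?)\s{2,}(\d+)%\s+(\d+)\s+(\S+)\s*$"
)


def parse_selftest_log(text):
    """Parse self-test log rows into dictionaries; non-matching lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            num, test, status, remaining, hours, lba = match.groups()
            rows.append({
                "num": int(num),
                "test": test,
                "status": status,
                "remaining_pct": int(remaining),
                "hours": int(hours),
                "lba": None if lba == "-" else int(lba),
            })
    return rows


rows = parse_selftest_log(SELFTEST_LOG)
print(Counter(row["status"] for row in rows))
print({row["lba"] for row in rows if row["lba"] is not None})
```

In the sample rows, every read failure reports the same LBA, which matches what the full log shows.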


Based on that thread I attempted to run this
Code:
dd if=/dev/zero of=/dev/ada3 conv=sync bs=4096 count=1 seek=244190394


to "zero out the sector" but it wouldn't run.

Also, the drive was unplugged. I have 6 hot-swap drive bays with the WD Red drives in them. While I had the system down, I pulled the drives to check the serial numbers so I knew which drive was ada3. I then swapped the SATA cable and fired it back up. I guess the drive didn't seat well enough, because it vanished from the pool, so I reseated it and it came back into the pool.
This makes me think a bad drive bay could be causing this too.

Dusan,
Services/ SMART is enabled
Email to report has a valid email address in it
Temperatures are all set to zero

Here are the most current results from ada3
Code:
 smartctl -a /dev/ada3
smartctl 6.1 2013-03-16 r3800 [FreeBSD 9.1-STABLE amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
 
=== START OF INFORMATION SECTION ===
Model Family:    Western Digital Red (AF)
Device Model:    WDC WD10EFRX-68JCSN0
Serial Number:    WD-WMC1U5316668
LU WWN Device Id: 5 0014ee 6ad43ffe3
Firmware Version: 01.01A01
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Sizes:    512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:  ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Sep  5 22:43:07 2013 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
 
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
 
General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (  40) The self-test routine was interrupted
                                        by the host with a hard or soft reset.
Total time to complete Offline
data collection:                (12960) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (  2) minutes.
Extended self-test routine
recommended polling time:        ( 148) minutes.
Conveyance self-test routine
recommended polling time:        (  5) minutes.
SCT capabilities:              (0x30bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.
 
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate    0x002f  200  200  051    Pre-fail  Always      -      0
  3 Spin_Up_Time            0x0027  138  135  021    Pre-fail  Always      -      4091
  4 Start_Stop_Count        0x0032  100  100  000    Old_age  Always      -      96
  5 Reallocated_Sector_Ct  0x0033  200  200  140    Pre-fail  Always      -      0
  7 Seek_Error_Rate        0x002e  200  200  000    Old_age  Always      -      0
  9 Power_On_Hours          0x0032  091  091  000    Old_age  Always      -      7155
10 Spin_Retry_Count        0x0032  100  253  000    Old_age  Always      -      0
11 Calibration_Retry_Count 0x0032  100  253  000    Old_age  Always      -      0
12 Power_Cycle_Count      0x0032  100  100  000    Old_age  Always      -      96
192 Power-Off_Retract_Count 0x0032  200  200  000    Old_age  Always      -      86
193 Load_Cycle_Count        0x0032  200  200  000    Old_age  Always      -      9
194 Temperature_Celsius    0x0022  114  107  000    Old_age  Always      -      29
196 Reallocated_Event_Count 0x0032  200  200  000    Old_age  Always      -      0
197 Current_Pending_Sector  0x0032  200  200  000    Old_age  Always      -      0
198 Offline_Uncorrectable  0x0030  100  253  000    Old_age  Offline      -      0
199 UDMA_CRC_Error_Count    0x0032  200  200  000    Old_age  Always      -      0
200 Multi_Zone_Error_Rate  0x0008  200  200  000    Old_age  Offline      -      1
 
SMART Error Log Version: 1
No Errors Logged
 
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Interrupted (host reset)      80%      7126        -
# 2  Short offline      Interrupted (host reset)      80%      7106        -
# 3  Short offline      Completed: read failure      10%      7049        1953523152
# 4  Short offline      Completed: read failure      10%      7046        1953523152
# 5  Short offline      Completed: read failure      10%      7043        1953523152
# 6  Short offline      Completed: read failure      10%      7040        1953523152
# 7  Short offline      Completed: read failure      60%      7037        1953523152
# 8  Extended offline    Completed: read failure      10%      7028        1953523152
# 9  Short offline      Completed: read failure      60%      6801        1953523152
#10  Short offline      Completed: read failure      60%      6777        1953523152
#11  Short offline      Completed: read failure      60%      6753        1953523152
#12  Short offline      Completed: read failure      60%      6729        1953523152
#13  Short offline      Completed: read failure      60%      6710        1953523152
#14  Short offline      Interrupted (host reset)      90%      6705        -
#15  Short offline      Completed: read failure      60%      6686        1953523152
#16  Short offline      Completed: read failure      60%      6662        1953523152
#17  Short offline      Completed: read failure      60%      6638        1953523152
#18  Short offline      Completed: read failure      60%      6614        1953523152
#19  Short offline      Completed: read failure      60%      6590        1953523152
#20  Short offline      Completed: read failure      60%      6566        1953523152
#21  Short offline      Completed: read failure      60%      6542        1953523152
 
SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


And from ada2
Code:
smartctl -a /dev/ada2
smartctl 6.1 2013-03-16 r3800 [FreeBSD 9.1-STABLE amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
 
=== START OF INFORMATION SECTION ===
Model Family:    Western Digital Red (AF)
Device Model:    WDC WD10EFRX-68JCSN0
Serial Number:    WD-WMC1U5310538
LU WWN Device Id: 5 0014ee 6ad4398ea
Firmware Version: 01.01A01
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Sizes:    512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:  ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Sep  5 22:44:31 2013 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
 
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
 
General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (  0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (13200) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (  2) minutes.
Extended self-test routine
recommended polling time:        ( 151) minutes.
Conveyance self-test routine
recommended polling time:        (  5) minutes.
SCT capabilities:              (0x30bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.
 
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate    0x002f  200  200  051    Pre-fail  Always      -      0
  3 Spin_Up_Time            0x0027  137  135  021    Pre-fail  Always      -      4150
  4 Start_Stop_Count        0x0032  100  100  000    Old_age  Always      -      91
  5 Reallocated_Sector_Ct  0x0033  200  200  140    Pre-fail  Always      -      0
  7 Seek_Error_Rate        0x002e  200  200  000    Old_age  Always      -      0
  9 Power_On_Hours          0x0032  091  091  000    Old_age  Always      -      7155
10 Spin_Retry_Count        0x0032  100  253  000    Old_age  Always      -      0
11 Calibration_Retry_Count 0x0032  100  253  000    Old_age  Always      -      0
12 Power_Cycle_Count      0x0032  100  100  000    Old_age  Always      -      91
192 Power-Off_Retract_Count 0x0032  200  200  000    Old_age  Always      -      81
193 Load_Cycle_Count        0x0032  200  200  000    Old_age  Always      -      9
194 Temperature_Celsius    0x0022  114  107  000    Old_age  Always      -      29
196 Reallocated_Event_Count 0x0032  200  200  000    Old_age  Always      -      0
197 Current_Pending_Sector  0x0032  200  200  000    Old_age  Always      -      0
198 Offline_Uncorrectable  0x0030  100  253  000    Old_age  Offline      -      0
199 UDMA_CRC_Error_Count    0x0032  200  200  000    Old_age  Always      -      0
200 Multi_Zone_Error_Rate  0x0008  200  200  000    Old_age  Offline      -      0
 
SMART Error Log Version: 1
No Errors Logged
 
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline      Completed without error      00%      7057        -
# 2  Short offline      Completed without error      00%      7054        -
# 3  Short offline      Completed without error      00%      7051        -
# 4  Short offline      Completed without error      00%      7049        -
# 5  Short offline      Completed without error      00%      7046        -
# 6  Short offline      Completed without error      00%      7043        -
# 7  Short offline      Completed without error      00%      7040        -
# 8  Short offline      Completed without error      00%      7037        -
# 9  Short offline      Completed without error      00%      6801        -
#10  Short offline      Completed without error      00%      6777        -
#11  Short offline      Completed without error      00%      6753        -
#12  Short offline      Completed without error      00%      6729        -
#13  Short offline      Completed without error      00%      6710        -
#14  Short offline      Interrupted (host reset)      90%      6705        -
#15  Short offline      Completed without error      00%      6686        -
#16  Short offline      Completed without error      00%      6662        -
#17  Short offline      Completed without error      00%      6639        -
#18  Short offline      Completed without error      00%      6615        -
#19  Short offline      Completed without error      00%      6591        -
#20  Short offline      Completed without error      00%      6567        -
#21  Short offline      Completed without error      00%      6543        -
 
SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
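
The two attribute tables above are nearly identical, so a mechanical comparison may be quicker than reading them side by side. The sketch below assumes the ten-column layout that smartctl printed here; `parse_attributes` and `diff_attributes` are hypothetical helper names, and only a few rows from each drive are included as sample data.

```python
# A few attribute rows copied from the ada3 and ada2 output above.
ADA3 = """\
  5 Reallocated_Sector_Ct  0x0033  200  200  140    Pre-fail  Always      -      0
197 Current_Pending_Sector  0x0032  200  200  000    Old_age  Always      -      0
198 Offline_Uncorrectable  0x0030  100  253  000    Old_age  Offline      -      0
199 UDMA_CRC_Error_Count    0x0032  200  200  000    Old_age  Always      -      0
200 Multi_Zone_Error_Rate  0x0008  200  200  000    Old_age  Offline      -      1
"""

ADA2 = """\
  5 Reallocated_Sector_Ct  0x0033  200  200  140    Pre-fail  Always      -      0
197 Current_Pending_Sector  0x0032  200  200  000    Old_age  Always      -      0
198 Offline_Uncorrectable  0x0030  100  253  000    Old_age  Offline      -      0
199 UDMA_CRC_Error_Count    0x0032  200  200  000    Old_age  Always      -      0
200 Multi_Zone_Error_Rate  0x0008  200  200  000    Old_age  Offline      -      0
"""


def parse_attributes(text):
    """Map ATTRIBUTE_NAME to RAW_VALUE for each row of the attribute table."""
    attrs = {}
    for line in text.splitlines():
        # Split into at most 10 fields so a raw value containing spaces stays whole.
        fields = line.split(None, 9)
        if len(fields) == 10 and fields[0].isdigit() and fields[2].startswith("0x"):
            attrs[fields[1]] = fields[9].strip()
    return attrs


def diff_attributes(a, b):
    """Return {name: (raw_a, raw_b)} for attributes whose raw values differ."""
    return {name: (a[name], b[name]) for name in a if name in b and a[name] != b[name]}


ada2 = parse_attributes(ADA2)
ada3 = parse_attributes(ADA3)
print(diff_attributes(ada2, ada3))
```

Among these sample rows, the only raw value that differs between the two drives is Multi_Zone_Error_Rate; the reallocated and pending sector counts are zero on both.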
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
First: The CAM errors are usually indicative of a bad SATA cable or poor power to the drives.

Second: You ran a long test on ada3, but then you interrupted it...Not sure if you realized that a long test can take anywhere from 20 minutes to 6+ hours.

How did you get the number for your dd command? 244190394? Also, you won't be able to dd to the drive if you have any partitions on the disk mounted. So you'll have to export your pool to do the repair. You might have to export, reboot, do the repair, then reboot and import.

The fact that a 3rd disk is having problems with the CAM errors tells me it's very unlikely to be bad disks. I'd definitely look at the power supply, unless all of the disks are sharing the same SFF-8087/8088 cable or something. But the disk that is ada3 should still be RMAed. Even if you do rewrite the sector, statistically it's very likely that more things will go wrong soon. So unless it's out of warranty, attempting to rewrite the bad sector isn't likely to help. Normally once bad sectors start appearing, the problem gets out of control rapidly. Your disk can't possibly be out of warranty, since Reds are covered for 3 years and the drives haven't been sold for a year yet.
 

Dusan

Guru
Joined
Jan 29, 2013
Messages
1,165
First: The CAM errors are usually indicative of a bad SATA cable or poor power to the drives.
Agree, there have already been many cases on this forum of CAM errors being caused by bad cables, backplanes, etc. However, a failed SMART test cannot be caused by a bad cable. The test runs internally in the drive; no data is transferred through the cable. ada3 should be replaced.

Second: You ran a long test on ada3, but then you interrupted it...Not sure if you realized that a long test can take anywhere from 20 minutes to 6+ hours.
No need to guess. The drive will report how long the test takes. In this case it's about 2.5 hours:
Code:
Extended self-test routine
recommended polling time:        ( 148) minutes.

smartctl -t will also tell you how long you should wait. However, if you are sure that you did not reboot the machine or cancel the test, then the interruption could point to a bad cable.

How did you get the number for your dd command? 244190394?
It's in the link hidperf posted. Multiply the LBA address by 512 (the LBA is in logical sectors) and divide it by 4096 to get the physical sector: 1953523152 * 512 / 4096 = 244190394.
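
The conversion described above can be checked with a few lines of Python. This is just a sketch of the arithmetic; `lba_to_block` is a made-up helper name, and the 512 and 4096 byte sizes are the logical and physical sector sizes reported by smartctl for this drive.

```python
def lba_to_block(lba, logical_size=512, block_size=4096):
    """Convert a logical LBA into (block index, byte offset inside that block)."""
    byte_offset = lba * logical_size
    return divmod(byte_offset, block_size)


# LBA_of_first_error from the self-test log.
block, remainder = lba_to_block(1953523152)
print(block, remainder)  # 244190394 0
```

The block index is the value used for seek= with bs=4096 in the dd command, and the zero remainder shows that the failing LBA falls exactly at the start of a 4096-byte physical sector.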

If you do the dd on an exported pool as cyberjock suggested, you will probably see the count of reallocated sectors increase.

Edit: removed a paragraph about LBAs. It seems I'm not able to properly compare two numbers :).
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
No wonder the math wasn't working for me.. 512 logical and 4096 physical. I swore it said 4096 for both.. sigh.
 

hidperf

Dabbler
Joined
Jun 12, 2012
Messages
34
The drive is still under warranty, so I'll be contacting WD as soon as I get back from vacation.

I didn't interrupt the test though. I started it around 8pm Wed. night and let it run and didn't touch anything until I retrieved the results. The first time I ran it, last week, I got an email telling me it completed. This time I got nothing, which is why I didn't post the results until Thursday.

Thanks for the help! I guess I was hoping for something other than a bad disk. Oh well, it could have been worse, right?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Oh, you are right! I bet the CAM errors initiated the reset. It's rare to have both at the same time. On some machines a CAM error results in a "host reset". So yeah, you could have gone to bed and the system would have done the reset for you via the CAM errors.

I'd definitely figure out the CAM errors. They are a symptom, not the cause (but they can be the cause of issues later if not fixed).
 

hidperf

Dabbler
Joined
Jun 12, 2012
Messages
34
Just thought I'd update this, since it's been several months already.

I returned from vacation and got tied up with work and personal business and haven't had time to rectify this situation until now.

The CAM errors kept occurring until the drive finally vanished from the pool and it was operating in a degraded state. I thought I really screwed up by not taking care of this right away. I took a look at my NAS box and discovered that one of my hot-swap housings, this model, had died. It just happened to be the same housing that ada3 was in. So I pulled the drive out of the housing and hooked it up directly, bypassing the housing. I fired up FreeNAS and held my breath.

The pool came back to life and I have no more CAM errors. I do still have an error at LBA 1953523152, so now I'm trying to fix that.

Thank you all for the help.
 