Is there a troubleshooting guide?

sdspieg · Nov 29, 2013

I am having problems with my FreeNAS 9.1.1 system. The error button is Green and says

OK: The volume StephanNAS (ZFS) status is HEALTHY

See also here:

But I do get loads of error messages along these lines:

Code:

Nov 29 18:15:34 freenas kernel: (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 40 4e 56 40 ee 00 00 01 00 00
Nov 29 18:15:34 freenas kernel: (ada0:ahcich0:0:0:0): CAM status: ATA Status Error
Nov 29 18:15:34 freenas kernel: (ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
Nov 29 18:15:34 freenas kernel: (ada0:ahcich0:0:0:0): RES: 41 40 40 4e 56 00 ee 00 00 00 01
Nov 29 18:15:34 freenas kernel: (ada0:ahcich0:0:0:0): Retrying command
Nov 29 18:15:55 freenas kernel: (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 f8 ac 56 40 ee 00 00 01 00 00
Nov 29 18:15:55 freenas kernel: (ada0:ahcich0:0:0:0): CAM status: ATA Status Error
Nov 29 18:15:55 freenas kernel: (ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
Nov 29 18:15:55 freenas kernel: (ada0:ahcich0:0:0:0): RES: 41 40 d0 ad 56 00 ee 00 00 00 01
Nov 29 18:15:55 freenas kernel: (ada0:ahcich0:0:0:0): Retrying command
Nov 29 18:16:09 freenas kernel: (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 10 ce 56 40 ee 00 00 01 00 00
Nov 29 18:16:09 freenas kernel: (ada0:ahcich0:0:0:0): CAM status: ATA Status Error
Nov 29 18:16:09 freenas kernel: (ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
Nov 29 18:16:09 freenas kernel: (ada0:ahcich0:0:0:0): RES: 41 40 28 ce 56 00 ee 00 00 00 01
Nov 29 18:16:09 freenas kernel: (ada0:ahcich0:0:0:0): Retrying command

And none of my Plugins load anymore, although I still 'see' all of my files from my Windows machine in my 'Network'.

Can anybody please tell me whether there is a systematic troubleshooting guide for things like this? So which commands I should use to diagnose what's wrong and then what can be done to solve it. Thanks!

-Stephan

sdspieg · Nov 29, 2013

Ok - no answer here so I'll try to document my own progress here. The closest I could find to what I was looking for was here - a post by Joshua Parker Ruehlig of sickbeard/sabnzbd/... fame.

So I ran

Code:

smartctl -t long /dev/ada0 |more

whereby the 0 of ada# refers to the drive that was reported to have problems

Here's what that resulted in (and I'll truncate, as there is a 30'000 character-limit on these messages)

Code:

smartctl 6.1 2013-03-16 r3800 [FreeBSD 9.1-STABLE amd64] (local build)                                                       
=== START OF READ SMART DATA SECTION ===                                                                                     
SMART overall-health self-assessment test result: PASSED                                                                     
See vendor-specific Attribute list for marginal Attributes.                                                                 
                                                                                                                             
General SMART Values:                                                                                                       
Offline data collection status:  (0x82) Offline data collection activity                                                     
                                        was completed without error.                                                         
                                        Auto Offline Data Collection: Enabled.                                               
Self-test execution status:      ( 121) The previous self-test completed having                                             
                                        the read element of the test failed.                                                 
Total time to complete Offline                                                                                               
data collection:                (  575) seconds.                                                                             
Offline data collection                                                                                                     
capabilities:                    (0x7b) SMART execute Offline immediate.                                                     
                                        Auto Offline data collection on/off support.                                         
                                        Suspend Offline collection upon new                                                 
                                        command.                                                                             
                                        Offline surface scan supported.                                                     
                                        Self-test supported.                                                                 
                                        Conveyance Self-test supported.                                                     
                                        Selective Self-test supported.                                                       
SMART capabilities:            (0x0003) Saves SMART data before entering                                                     
                                        power-saving mode.                                                                   
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.                                                             
                                        General Purpose Logging supported.                                                   
Short self-test routine                                                                                                     
recommended polling time:        (  1) minutes.                                                                             
Extended self-test routine                                                                                                   
recommended polling time:        ( 328) minutes.                                                                             
Conveyance self-test routine                                                                                                 
recommended polling time:        (  2) minutes.                                                                             
SCT capabilities:              (0x3085) SCT Status supported.                                                               
                                                                                                                             
SMART Attributes Data Structure revision number: 10                                                                         
Vendor Specific SMART Attributes with Thresholds:                                                                           
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE                             
  1 Raw_Read_Error_Rate    0x000f  074  074  006    Pre-fail  Always      -      188712502                             
  3 Spin_Up_Time            0x0003  095  093  000    Pre-fail  Always      -      0                                     
  4 Start_Stop_Count        0x0032  100  100  020    Old_age  Always      -      60                                   
  5 Reallocated_Sector_Ct  0x0033  053  051  036    Pre-fail  Always      -      62208                                 
  7 Seek_Error_Rate        0x000f  079  060  030    Pre-fail  Always      -      86134410                             
  9 Power_On_Hours          0x0032  088  088  000    Old_age  Always      -      10750                                 
10 Spin_Retry_Count        0x0013  100  100  097    Pre-fail  Always      -      0                                     
12 Power_Cycle_Count      0x0032  100  100  020    Old_age  Always      -      60                                   
183 Runtime_Bad_Block      0x0032  100  100  000    Old_age  Always      -      0                                     
184 End-to-End_Error        0x0032  100  100  099    Old_age  Always      -      0                                     
187 Reported_Uncorrect      0x0032  001  001  000    Old_age  Always      -      6705                                 
188 Command_Timeout        0x0032  100  078  000    Old_age  Always      -      110 110 139                           
189 High_Fly_Writes        0x003a  077  077  000    Old_age  Always      -      23                                   
190 Airflow_Temperature_Cel 0x0022  047  038  045    Old_age  Always  In_the_past 53 (3 94 57 51 0)                     
191 G-Sense_Error_Rate      0x0032  100  100  000    Old_age  Always      -      0                                     
192 Power-Off_Retract_Count 0x0032  100  100  000    Old_age  Always      -      48                                   
193 Load_Cycle_Count        0x0032  100  100  000    Old_age  Always      -      1998                                 
194 Temperature_Celsius    0x0022  053  062  000    Old_age  Always      -      53 (0 19 0 0 0)                       
197 Current_Pending_Sector  0x0012  084  081  000    Old_age  Always      -      2688                                 
198 Offline_Uncorrectable  0x0010  084  081  000    Old_age  Offline      -      2688                                 
199 UDMA_CRC_Error_Count    0x003e  200  200  000    Old_age  Always      -      3                                     
240 Head_Flying_Hours      0x0000  100  253  000    Old_age  Offline      -      10461h+52m+14.943s                   
241 Total_LBAs_Written      0x0000  100  253  000    Old_age  Offline      -      167398434356543                       
242 Total_LBAs_Read        0x0000  100  253  000    Old_age  Offline      -      67043716510972                       
                                                                                                                             
SMART Error Log Version: 1                                                                                                   
ATA Error Count: 6047 (device log contains only the most recent five errors)                                                 
        CR = Command Register [HEX]                                                                                         
        FR = Features Register [HEX]                                                                                         
        SC = Sector Count Register [HEX]                                                                                     
        SN = Sector Number Register [HEX]                                                                                   
        CL = Cylinder Low Register [HEX]                                                                                     
        CH = Cylinder High Register [HEX]                                                                                   
        DH = Device/Head Register [HEX]                                                                                     
        DC = Device Command Register [HEX]                                                                                   
        ER = Error register [HEX]
        ST = Status register [HEX]                                                                                           
Powered_Up_Time is measured from power on, and printed as                                                                   
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,                                                                       
SS=sec, and sss=millisec. It "wraps" after 49.710 days.                                                             
Error 6047 occurred at disk power-on lifetime: 10748 hours (447 days + 20 hours)                                                   
  When the command that caused the error occurred, the device was active or idle.                                                 
                                                                                                                                   
  After command completion occurred, registers were:                                                                               
  ER ST SC SN CL CH DH                                                                                                             
  -- -- -- -- -- -- --                                                                                                             
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455                                                                 
                                                                                                                                   
  Commands leading to the command that caused the error were:                                                                     
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name                                                                 
  -- -- -- -- -- -- -- --  ----------------  --------------------                                                                 
  60 00 48 ff ff ff 4f 00  16d+02:40:24.199  READ FPDMA QUEUED                                                                     
  2f 00 01 10 00 00 00 00  16d+02:40:24.165  READ LOG EXT                                                                         
  60 00 48 ff ff ff 4f 00  16d+02:40:21.558  READ FPDMA QUEUED                                                                     
  2f 00 01 10 00 00 00 00  16d+02:40:21.456  READ LOG EXT                                                                         
  60 00 48 ff ff ff 4f 00  16d+02:40:18.823  READ FPDMA QUEUED         
                                                                                                                             
SMART Self-test log structure revision number 1                                                                             
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error                             
# 1  Extended offline    Completed: read failure      90%    10749        -                                               
# 2  Short offline      Completed without error      00%      110        -                                               
                                                                                                                             
SMART Selective self-test log data structure revision number 1                                                               
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS                                                                                 
    1        0        0  Not_testing                                                                                         
    2        0        0  Not_testing                                                                                         
    3        0        0  Not_testing                                                                                         
    4        0        0  Not_testing                                                                                         
    5        0        0  Not_testing                                                                                         
Selective self-test flags (0x0):                                                                                             
  After scanning selected spans, do NOT read-scan remainder of disk.                                                         
If Selective self-test is pending on power-up, resume after 0 minute delay.

The scariest bit here is the line that says "ATA Error Count: 6047". I presume that means there are 6047 errors on the drive. So I guess I'll probably have to replace it. I had seen errors appear before, but I figured that the drive was only a bit older than a year, so it couldn't possibly be a big deal. Also since zfs is supposed to be so smart, I thought that'd probably fix the errors (i.e. move 'bad' sectors to 'good' ones) without me having to worry about it.
So I am now struggling with the next instruction, which says to check the smart information for the unreadable sector, but which does not say where to find this.

If anybody has any suggestions as I do through this (OR can still point me to the troubleshooting guide I was hoping would exist), I'd be more grateful. I'll keep updating this post as I make progress. IF I make progress :)

sdspieg · Nov 29, 2013

So I went with the "Current_Pending_Sector" from the log above, which had a raw value of 2688.

Code:

[root@freenas ~]# sysctl kern.geom.debugflags=16                                                                                 
kern.geom.debugflags: 16 -> 16                                                                                                   
[root@freenas ~]# dd if=/dev/zero of=/dev/ada0 bs=2688 count=1 seek=X conv=noerror,sync                                         
dd: seek: Invalid argument                                                                                                       
[root@freenas ~]#

Stuck again - looking for a solution.

cyberjock · Nov 29, 2013

Actually, there's a bunch of scary stuff there.

Your hard drive should be kept below 40C for longevity. Yours is 53C!
Your Current Pending Sector count(the number of sectors that the disk knows about that is bad but hasn't be reallocated) is 2688.
Your Offline Uncorrectable is also 2688.
The ATA error count is bad too. But basically what your info tells me is you've cooked your drives to the point of getting grossly large numbers of errors. You should have identified this drive long ago and clearly don't have email, SMART monitoring enabled properly, nor do you do any scheduled SMART testing of your disks, and let me take a guess that you don't do regular scrubs as often as you should either. Any of those could have provided indication that something was horribly wrong long ago.

That disk is definitely trashed. Replace is the only option.

You should definitely stop and look at ALL of your disks. I seriously doubt that disk is the only one you are letting get too hot and you probably will need to add more fans to your case to get the disks below 40C. I will tell you that I've had 2 disks that would start throwing ATA errors when they got over 50C, but when cooled off they'd start behaving normally. In the end though, the portions of the disk that were identified as bad when the disk was overheating were definitely damaged(from the heat?) and the disks only lasted another week before failing completely. At the end of the day, I'd be surprised if the other disks you have don't have a significantly shortened lifespan because of the temperature you run them at.

In fact, if you look at your disk SMART data you've actually exceeded the designed temperature envelope for the disks! If your disk is in warranty and you RMA it, its possible your RMA will be denied because your disk was allowed to get up to 62C. Most disks are rated for 60C maximum for warranty purposes.

sdspieg · Nov 30, 2013

Hi cyberjock,

Thanks for taking the time to look into my problem. I will get a new HDD. And you're right - I didn't have/do all of the things you mention. Even though I did follow the instructions I had found when I first installed FreeNAS quite carefully. But so: could you please refer me then to a good and reliable guide for how to do all of this properly? Because the wiki is great, but it is not really a seqeuntial installation guide. The forums have a LOT of stuff, but like in the 'How-to guides' (where I would have expected to find it), you only find guides to virtualize FreeNAS AND your exhortation NOT to do so plus some other stuff. And there are LOADS of 'installation guides' all around the internet, but many (most?) of them may not have these steps that I seem to have overlooked.
So if you could once again point me in the right direction, I'd really appreciate it,

In the meanwhile I'll go an buy a new HDD and will read up on http://doc.freenas.org/index.php/Volumes#Replacing_a_Failed_Drive_or_Zil_Device...

Thanks much again - your efforts on this forum here are truly amazing and are greatly appreciated!.

-Stephan

sdspieg · Nov 30, 2013

I also did zpool status and got the following:

Code:

[root@freenas ~]# zpool status                                                                                                 
  pool: StephanNAS                                                                                                             
state: ONLINE                                                                                                                 
status: One or more devices has experienced an unrecoverable error.  An                                                       
        attempt was made to correct the error.  Applications are unaffected.                                                   
action: Determine if the device needs to be replaced, and clear the errors                                                     
        using 'zpool clear' or replace the device with 'zpool replace'.                                                       
  see: http://illumos.org/msg/ZFS-8000-9P                                                                                     
  scan: resilvered 451G in 1h51m with 0 errors on Fri Nov 29 15:31:29 2013                                                     
config:                                                                                                                       
                                                                                                                               
        NAME                                            STATE    READ WRITE CKSUM                                             
        StephanNAS                                      ONLINE      0    0    0                                             
          mirror-0                                      ONLINE      0    0    0                                             
            gptid/d65eb91c-5201-11e2-87a9-3085a943b0e3  ONLINE      0    0    0                                             
            gptid/d6bde0bf-5201-11e2-87a9-3085a943b0e3  ONLINE      0    0    0                                             
          mirror-1                                      ONLINE      0    0    0                                             
            gptid/97d2ac78-526e-11e2-bbec-3085a943b0e3  ONLINE      0    0    1                                             
            gptid/98526e0f-526e-11e2-bbec-3085a943b0e3  ONLINE      0    0    0                                             
                                                                                                                               
errors: No known data errors                                                                                                   
[root@freenas ~]#

I'm now trying to figure out whether that 'gptid/97d2ac78-526e-11e2-bbec-3085a943b0e3' corresponds to my ada0 - for which I have noted the serial number, so that I know which one of the physical disks I have to replace.

Since yesterday night, it also seems like the zpool is back online. And so I can get into the

This does seem to suggest that it is ada0 that has the problem (and it IS in mirror 1, which is what the zpool status showed). So I'm now putting this one offline, and will replace it with a new one, to be followed by resilvering. I do take (some) comfort from the fact that all my plugins seems to be working well again too (especially plex media server, which I do use a lot). So maybe not all is lost.

So the quest continues ...

sdspieg · Nov 30, 2013

So I went for it. I did what the wiki says and it SEEMS like it's resilvering.

Code:

[root@freenas ~]# zpool status                                                                                                     
  pool: StephanNAS                                                                                                                 
state: DEGRADED                                                                                                                   
status: One or more devices is currently being resilvered.  The pool will                                                         
        continue to function, possibly in a degraded state.                                                                       
action: Wait for the resilver to complete.                                                                                         
  scan: resilver in progress since Sat Nov 30 12:02:38 2013                                                                       
        76.5G scanned out of 4.53T at 83.0M/s, 15h37m to go                                                                       
        74.7G resilvered, 1.65% done                                                                                               
config:                                                                                                                           
                                                                                                                                   
        NAME                                              STATE    READ WRITE CKSUM                                               
        StephanNAS                                        DEGRADED    0    0    0                                               
          mirror-0                                        ONLINE      0    0    0                                               
            gptid/d65eb91c-5201-11e2-87a9-3085a943b0e3    ONLINE      0    0    0                                               
            gptid/d6bde0bf-5201-11e2-87a9-3085a943b0e3    ONLINE      0    0    0                                               
          mirror-1                                        DEGRADED    0    0    0                                               
            replacing-0                                  OFFLINE      0    0    0                                               
              545852452536682837                          OFFLINE      0    0    0  was /dev/gptid/97d2ac78-526e-11e2-bbec-3085a94
3b0e3                                                                                                                             
              gptid/ebe6f24f-59ae-11e3-a143-3085a943b0e3  ONLINE      0    0    0  (resilvering)                               
            gptid/98526e0f-526e-11e2-bbec-3085a943b0e3    ONLINE      0    0    0                                               
                                                                                                                                   
errors: No known data errors                                                                                                       
[root@freenas ~]#

Fingers crossed. I also put the disks one slot apart from each other so that they may not get as hot. But I hope somebody here can still provide me with instructions on how to do all the things that I omitted to do last time.

sdspieg · Nov 30, 2013

I have now also created cronjobs for daily email reporting as well as a monthly short SMART-test. I also double-checked, and I had a ZSF-scrub set up for once a month. Is that not enough?

I'm still curious whether there is anything I can do to prevent the disks from running so hot? Because I certainly didn't change anything in the defaults, so I wonder what I could have done wrong there...

warri · Nov 30, 2013

sdspieg said:
I have now also created cronjobs for daily email reporting as well as a monthly short SMART-test. I also double-checked, and I had a ZSF-scrub set up for once a month. Is that not enough?

Have you also set up SMART reporting in addition to the SMART tests? This will usually give you warnings well in advance, and you can even set your custom temperature threshold to get notified via email. Consumer-grade hard disks should be scrubbed once every two weeks.

I'm still curious whether there is anything I can do to prevent the disks from running so hot? Because I certainly didn't change anything in the defaults, so I wonder what I could have done wrong there...

It's not a software problem. Add more fans or use another case with better airflow / reposition the drives. Keep in mind that hard drives can get rather hot while a scrub is being carried out.

sdspieg · Nov 30, 2013

All looks well. It resilvered well. Status is healthy again.

Thanks for the tips also warri. I'll change the scrubbing to twice a week. Could you also recommend what I should put as values for the SMART reporting (difference, informational, critical)?

sdspieg · Dec 5, 2013

Follow-up question. Since resilvering, it seems like my sabnzbd etc. are no longer connecting to the internet. And there is nothing else I changed to my configuration
Could that be a possible consequence of the resilvering? And does anybody have an idea on how to troubleshoot that? My FreeNAS seems to be connected ok to the internet (e.g. test mails come through just fine), but stuff in my jails doesn't seem to connect. Any ideas?

cyberjock · Dec 5, 2013

Can you post the output of zpool status -v?

Important Announcement for the TrueNAS Community.

Is there a troubleshooting guide?

sdspieg

Contributor

sdspieg

Contributor

sdspieg

Contributor

cyberjock

Inactive Account

sdspieg

Contributor

sdspieg

Contributor

sdspieg

Contributor

sdspieg

Contributor

warri

Guru

sdspieg

Contributor

sdspieg

Contributor

cyberjock

Inactive Account

Similar threads

Important Announcement for the TrueNAS Community.

Is there a troubleshooting guide?

Contributor

Contributor

Contributor

Inactive Account

Contributor

Contributor

Contributor

Contributor

Guru

Contributor

Contributor

Inactive Account

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Is there a troubleshooting guide?"

Similar threads