unable to open device, what is the cause?

SaiKo · Mar 29, 2016

I've got a new (started with the 9.10 nightlies, swithed to stable when 9.10 was released) freenas build and am running in to a problem with either my disks (no smart errors) or something else, and i'd like to know how to make sure.

The thing is, i bought 12 new drives in groups of 4.
One of the shipments contained drives with the same manufacturing date
1 of these drives had to be rma'd already.
Because of this i am not really confident in the other 3 drives of this group due to the possibility of a bad batch, but they didn't show any problems in my initial testing, and still show nothing strange in smart data.

I've been trying various nas os'es for the past 2 months, circling back to what i started with and sort of worked, freenas, only now there was 9.10 and everything seemed to work.
For the last week i've been installing and setting up a number of plugins, located on a 4-disk (zfs mirror, HGST 5K4000) volume which also holds the system dataset, and been copying data to another volume (10-disk raidz2, brand-new WD red's).

Yesterday i woke up to my pc having halted its file transfer.
Logging in to freenas i see that there are 2 alerts (don't remember the exact drive in the first error message):

freenas Device: /dev/da* [SAT], unable to open device
The volume S.01 (ZFS) state is OFFLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.

When i view disks, one drive doesn't show it's serial, and 2 drives aren't even in the list.
1 drive unable to open + 2 drives gone on a raidz2 = volume offline

rebooting freenas leaves me with just the second alert message, all 10 disks are there again as they should, volume is online again.

I checked smart status on all drives and found raw values were all 0 where the should be 0, so my disks are ok?

I had been mucking about trying to get my second nic to work in one of the plugins, failed and couldn't get it to work again on the first nic, + some other plugins i could no longer connect to their webgui, while others requiring the same permissions still worked so i couldn't figure out what was wrong.
i thought it best to do a fresh install.
I detached the 10 drive volume and detached and wiped the 4-disk volume.
After setting up the new install (9.10 release, updated) initialising the 4-disk volume and importing the existing 10 drive volume i did a zpool clear to remove the previous errors.

Yesterday evening i resumed copying files over to the 10-disk volume.
Today i again wake up to my pc having halted its file transfer.
This time in freenas:

freenas Device: /dev/da0 [SAT], unable to open device
The volume S.01 (ZFS) state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.

There's just the 1 drive (one of the drives that were missing yesterday) that doesn't have its serial filled in when i view disks, no missing drives this time, the pool is still online but degraded.
Pool is 'healthy' again after reboot.

Still nothing to see in smart data. (Manually started a long smart test on all drives in the 10-disk pool, will be done tomorow morning.)

Since there is nothing to see in the smart data (yet), how do i tell for sure if this issue is related to a disk failing, or what might be another reason for it?

Thank you.

smart output for aforementioned dev/da0:

Code:

[root@freenas] ~# smartctl -q noserial -a /dev/da0
smartctl 6.4 2015-06-04 r4109 [FreeBSD 10.3-RELEASE amd64] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD60EFRX-68L0BN1
Firmware Version: 82.00A82
User Capacity:    6,001,175,126,016 bytes [6.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5700 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Tue Mar 29 11:11:56 2016 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: Incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 248) Self-test routine in progress...
                                        80% of test remaining.
Total time to complete Offline
data collection:                ( 5384) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 707) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x303d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   230   198   021    Pre-fail  Always       -       7466
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       27
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       364
10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       27
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       23
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       76
194 Temperature_Celsius     0x0022   125   114   000    Old_age   Always       -       27
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Aborted by host               10%       361         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

[root@freenas] ~#

running FreeNAS-9.10-STABLE-201603252134
mobo: asus p10s-i
cpu: i3-6100t
ram: 2*16GB ECC
hba: highpoint dc7280
os: sandisk ultra 32GB
vol1: 4*HGST 5k4000 4TB in mirror
vol2: 10*WD red 6TB in raidz2

btw: both times, when a volume either went offline or degraded, i did not receive any email from freenas to notify me of this.
Meanwhile i did get a daily run output today that reflected the degraded state of the second pool, when i was already well aware, so it's not that email isn't working, but why then weren't there notifications sent?

Mr_N · Mar 30, 2016

did you burn all the new drives in?
had you run short and long smart tests after this?

SaiKo · Mar 30, 2016

hooked them up to my windows pc a couple at a time, gave them full formats and copied data over one to the other and vice versa until they were almost 100% full.

smart data for dev/da0 after long smart test:

Code:

[root@freenas] ~# smartctl -q noserial -a /dev/da0
smartctl 6.4 2015-06-04 r4109 [FreeBSD 10.3-RELEASE amd64] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:  Western Digital Red
Device Model:  WDC WD60EFRX-68L0BN1
Firmware Version: 82.00A82
User Capacity:  6,001,175,126,016 bytes [6.00 TB]
Sector Sizes:  512 bytes logical, 4096 bytes physical
Rotation Rate:  5700 rpm
Device is:  In smartctl database [for details use: -P show]
ATA Version is:  ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:  Wed Mar 30 01:28:36 2016 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: Incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
  was never started.
  Auto Offline Data Collection: Disabled.
Self-test execution status:  (  0) The previous self-test routine completed
  without error or no self-test has ever
  been run.
Total time to complete Offline
data collection:  ( 5384) seconds.
Offline data collection
capabilities:  (0x7b) SMART execute Offline immediate.
  Auto Offline data collection on/off support.
  Suspend Offline collection upon new
  command.
  Offline surface scan supported.
  Self-test supported.
  Conveyance Self-test supported.
  Selective Self-test supported.
SMART capabilities:  (0x0003) Saves SMART data before entering
  power-saving mode.
  Supports SMART auto save timer.
Error logging capability:  (0x01) Error logging supported.
  General Purpose Logging supported.
Short self-test routine
recommended polling time:  (  2) minutes.
Extended self-test routine
recommended polling time:  ( 707) minutes.
Conveyance self-test routine
recommended polling time:  (  5) minutes.
SCT capabilities:  (0x303d) SCT Status supported.
  SCT Error Recovery Control supported.
  SCT Feature Control supported.
  SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME  FLAG  VALUE WORST THRESH TYPE  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate  0x002f  200  200  051  Pre-fail  Always  -  0
  3 Spin_Up_Time  0x0027  230  198  021  Pre-fail  Always  -  7466
  4 Start_Stop_Count  0x0032  100  100  000  Old_age  Always  -  27
  5 Reallocated_Sector_Ct  0x0033  200  200  140  Pre-fail  Always  -  0
  7 Seek_Error_Rate  0x002e  200  200  000  Old_age  Always  -  0
  9 Power_On_Hours  0x0032  100  100  000  Old_age  Always  -  378
 10 Spin_Retry_Count  0x0032  100  253  000  Old_age  Always  -  0
 11 Calibration_Retry_Count 0x0032  100  253  000  Old_age  Always  -  0
 12 Power_Cycle_Count  0x0032  100  100  000  Old_age  Always  -  27
192 Power-Off_Retract_Count 0x0032  200  200  000  Old_age  Always  -  23
193 Load_Cycle_Count  0x0032  200  200  000  Old_age  Always  -  78
194 Temperature_Celsius  0x0022  129  114  000  Old_age  Always  -  23
196 Reallocated_Event_Count 0x0032  200  200  000  Old_age  Always  -  0
197 Current_Pending_Sector  0x0032  200  200  000  Old_age  Always  -  0
198 Offline_Uncorrectable  0x0030  100  253  000  Old_age  Offline  -  0
199 UDMA_CRC_Error_Count  0x0032  200  200  000  Old_age  Always  -  0
200 Multi_Zone_Error_Rate  0x0008  200  200  000  Old_age  Offline  -  0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description  Status  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline  Completed without error  00%  374  -
# 2  Short offline  Aborted by host  10%  361  -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
  1  0  0  Not_testing
  2  0  0  Not_testing
  3  0  0  Not_testing
  4  0  0  Not_testing
  5  0  0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

[root@freenas] ~#

SweetAndLow · Mar 30, 2016

You are having these problems because of your highpoint hba most likely. Get a LSI card for $100 and it will probably fix your problems.

SaiKo · Mar 30, 2016

i can't even find a 30+ port controller that isn't highpoint, finding one for 100$ well thats just fantasy

Ericloewe · Mar 30, 2016

SaiKo said:
i can't even find a 30+ port controller that isn't highpoint, finding one for 100$ well thats just fantasy

Why would you want such a ridiculous device? Use SAS expanders. Sure, added cost, but there's no way around that with 30+ drives.

SweetAndLow · Mar 30, 2016

You don't need that many ports you just need an expander to go with it. What case do you have that supports 30+ drives?

jgreco · Mar 30, 2016

SaiKo said:
i can't even find a 30+ port controller that isn't highpoint, finding one for 100$ well thats just fantasy

The Highpoint isn't worth $100. You wasted money. :-(

SaiKo · Mar 30, 2016

Lian-li PC-D8000 with everything discribed above in it but the 20 hot-swap bays in the case are still empty.
I have 20 drives currently holding data to be put in there when the data is safe on the new build.
I could add 2 more 4-in-3 modules and it would fit a total of 44 3.5" and 12 2.5" drives.

Can't use expanders cause i only got 1 pci slot.

jgreco · Mar 30, 2016

What does only having one PCIe slot have to do with using an SAS expander?

**extremely confused**

Ericloewe · Mar 30, 2016

jgreco said:
What does only having one PCIe slot have to do with using an SAS expander?

**extremely confused**

I smell confusion due to some discrete expanders having provisions for mounting in PCI-e slots and being powered from them.

@SaiKo - the PCI-e card edge connectors you see on some expanders are for power ONLY. Expanders do all communication over SAS to the HBA. Besides PCI-e power, they generally provide a molex plug you can use for power.

SaiKo · Apr 25, 2016

Ok so i now got a 9211-8i that came with ir mode p18, trying to flash it to it mode p20.

Code:

[root@freenas /nonexistent]# sas2flash -listall                                                                                   
LSI Corporation SAS2 Flash Utility                                                                                                
Version 16.00.00.00 (2013.03.01)                                                                                                  
Copyright (c) 2008-2013 LSI Corporation. All rights reserved                                                                      
                                                                                                                                  
        Adapter Selected is a LSI SAS: SAS2008(B2)                                                                                
                                                                                                                                  
Num   Ctlr            FW Ver        NVDATA        x86-BIOS         PCI Addr                                                       
----------------------------------------------------------------------------                                                      
                                                                                                                                  
0  SAS2008(B2)     18.00.00.00    11.00.00.08    07.35.00.00     00:01:00:00                                                      
                                                                                                                                  
        Finished Processing Commands Successfully.                                                                                
        Exiting SAS2Flash.

Code:

[root@freenas /nonexistent]# sas2flash -o -e 6                                                                                    
LSI Corporation SAS2 Flash Utility                                                                                                
Version 16.00.00.00 (2013.03.01)                                                                                                  
Copyright (c) 2008-2013 LSI Corporation. All rights reserved                                                                      
                                                                                                                                  
        Advanced Mode Set                                                                                                         
                                                                                                                                  
        Adapter Selected is a LSI SAS: SAS2008(B2)                                                                                
                                                                                                                                  
        Executing Operation: Erase Flash                                                                                          
                                                                                                                                  
        Erasing Flash Region...                                                                                                   
                                                                                                                                  
                Erase Flash Command not Supported on this platform.                                                               
                                                                                                                                  
        Resetting Adapter...                                                                                                      
        Reset Successful!                                                                                                         
                                                                                                                                  
        Due to Exception Command not Executed. IOCStatus=0x1, IOCLogInfo=0x0                                                      
        Finished Processing Commands Successfully.                                                                                
        Exiting SAS2Flash.

What can i do about the "Erase Flash Command not Supported on this platform." exception?

jgreco · Apr 25, 2016

Why are you erasing it...? Just update it. I think.

SaiKo · Apr 25, 2016

I need to erase it first because its on ir mode.

Important Announcement for the TrueNAS Community.

unable to open device, what is the cause?

SaiKo

Dabbler

Mr_N

Patron

SaiKo

Dabbler

SweetAndLow

Sweet'NASty

SaiKo

Dabbler

Ericloewe

Server Wrangler

SweetAndLow

Sweet'NASty

jgreco

Resident Grinch

SaiKo

Dabbler

jgreco

Resident Grinch

Ericloewe

Server Wrangler

SaiKo

Dabbler

jgreco

Resident Grinch

SaiKo

Dabbler

Similar threads

Important Announcement for the TrueNAS Community.

unable to open device, what is the cause?

Dabbler

Patron

Dabbler

Sweet'NASty

Dabbler

Server Wrangler

Sweet'NASty

Resident Grinch

Dabbler

Resident Grinch

Server Wrangler

Dabbler

Resident Grinch

Dabbler

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "unable to open device, what is the cause?"

Similar threads