Strange Issue with new SAS disk

Status
Not open for further replies.

Lorsung23647

Dabbler
Joined
Mar 10, 2014
Messages
17
So I started replacing some old disks in my array with new SAS disks. I'm doing them one at a time, as money allows or as disks fail. (I keep one spare on hand at all times and swap it in when the next one arrives.)

On the first disk I replaced I noticed that the activity light was on almost constantly. It would also flash for normal drive activity so I didn't really think much of it since my array performance wasn't actually any slower.

Last night I got a strange set of errors in a security run log that included that SAS disk:
Code:
>       (da0:mps0:0:16:0): WRITE(10). CDB: 2a 00 73 5a a7 90 00 01 00 00 length 131072 SMID 393 terminated ioc 804b scsi 0 state c xfer 131072
>       (da0:mps0:0:16:0): WRITE(10). CDB: 2a 00 73 84 35 98 00 00 c0 00 length 98304 SMID 867 terminated ioc 804b scsi 0 state c xfer 98304
>       (da0:mps0:0:16:0): WRITE(10). CDB: 2a 00 73 8f 80 88 00 01 00 00 length 131072 SMID 392 terminated ioc 804b scsi 0 state c xfer 131072
>       (da0:mps0:0:16:0): WRITE(10). CDB: 2a 00 73 a1 0f 10 00 01 00 00 length 131072 SMID 393 terminated ioc 804b scsi 0 state c xfer 131072
>       (da0:mps0:0:16:0): WRITE(10). CDB: 2a 00 73 a9 84 90 00 01 00 00 length 131072 SMID 273 terminated ioc 804b scsi 0 state c xfer 131072

I also had a nightly backup process take 50% longer to complete tonight than on previous nights.
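
For anyone reading the log lines above: the bytes after "CDB:" are the raw SCSI command. A minimal sketch, assuming the standard 10-byte WRITE(10) layout (opcode 0x2a, starting LBA in bytes 2-5, transfer length in blocks in bytes 7-8) and the 512-byte logical block size reported by smartctl. The function name `decode_write10` is made up for this illustration.

```python
def decode_write10(cdb_hex: str, block_size: int = 512) -> dict:
    """Decode the LBA and transfer length from a WRITE(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a 10-byte WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: starting logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in blocks
    return {"lba": lba, "blocks": blocks, "bytes": blocks * block_size}


if __name__ == "__main__":
    # CDBs copied from the first two log lines in this post.
    for cdb in ("2a 00 73 5a a7 90 00 01 00 00", "2a 00 73 84 35 98 00 00 c0 00"):
        print(cdb, "->", decode_write10(cdb))
```

The decoded byte counts line up with the "length" field the mps driver prints on the same lines, so the terminated commands are ordinary full-size writes rather than anything exotic.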

I found a thread with some info about those errors and ran a SMART check with "smartctl -a -q noserial /dev/da0":

Code:
=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HUS724040ALS640
Revision:             A1C4
User Capacity:        4,000,787,030,016 bytes [4.00 TB]
Logical block size:   512 bytes
LU is resource provisioned, LBPRZ=0
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Device type:          disk
Transport protocol:   SAS
Local Time is:        Fri Oct  3 19:13:18 2014 CDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     33 C
Drive Trip Temperature:        85 C

Manufactured in week 35 of year 2014
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  1
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  8
Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 273934406647808

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:       2168        0         0      2168      18337        351.764           0
write:         0        0         0         0       3596        808.894           0
verify:        0        0         0         0         39          0.000           0

Non-medium error count:        0

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                   -     163                 - [-   -    -]
# 2  Background short  Completed                   -     150                 - [-   -    -]
# 3  Background short  Completed                   -     126                 - [-   -    -]
# 4  Background short  Completed                   -     102                 - [-   -    -]
# 5  Background short  Completed                   -      78                 - [-   -    -]
# 6  Background short  Completed                   -      54                 - [-   -    -]
# 7  Background short  Completed                   -      30                 - [-   -    -]
# 8  Background short  Completed                   -       6                 - [-   -    -]
Long (extended) Self Test duration: 37038 seconds [617.3 minutes]


The output seems a bit short, but I haven't had to troubleshoot SAS disks in any depth yet. Still, the error counter log and the time to complete a self-test worry me: 10+ hours for a SMART test seems excessive, and 2k+ error corrections for 351 GB of data read seems high.
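
As a rough sanity check on the 10-hour figure, the arithmetic below works out the average read rate the drive would need to cover its whole capacity in the reported self-test time. This is only a sketch, and it assumes the long self-test is essentially one sequential read pass over the full surface, which the output itself does not state.

```python
# Figures copied from the smartctl output above.
CAPACITY_BYTES = 4_000_787_030_016   # "User Capacity"
LONG_TEST_SECONDS = 37_038           # "Long (extended) Self Test duration"


def implied_rate_mb_s(capacity_bytes: int, seconds: int) -> float:
    """Average rate in MB/s (10^6 bytes) needed to read the whole disk in the given time."""
    return capacity_bytes / seconds / 1e6


if __name__ == "__main__":
    rate = implied_rate_mb_s(CAPACITY_BYTES, LONG_TEST_SECONDS)
    print(f"implied average read rate: {rate:.1f} MB/s")
```

Under that assumption the estimate comes out at roughly 108 MB/s, so the duration looks like a consequence of the 4 TB capacity rather than a symptom by itself.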

If someone could help me figure out whether something is actually wrong, I'd appreciate it.

System Specs:

FreeNAS 9.2.1.7-Release-x64
32GB ECC Memory
LSI 9201-16i HBA
Norco RPC-4224 Case
 

pjc

Contributor
Joined
Aug 26, 2014
Messages
187
Your error counts do seem high. For reference, here are mine:

Error counter log:
Errors Corrected by Total Correction Gigabytes Total
ECC rereads/ errors algorithm processed uncorrected
fast | delayed rewrites corrected invocations [10^9 bytes] errors
read: 36 0 0 36 27 382.530 0
write: 0 0 0 0 229 388.859 0
verify: 126 0 0 126 71 16.143 0

That's about 10x the number of correction invocations per gigabyte written, and over 650x on reads.
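
The comparison can be reproduced from the two error counter logs. A minimal sketch, assuming "correction algorithm invocations" divided by "gigabytes processed" is the right normalisation; the class and function names are made up for this illustration.

```python
from typing import NamedTuple


class LogRow(NamedTuple):
    invocations: int   # "Correction algorithm invocations" column
    gigabytes: float   # "Gigabytes processed [10^9 bytes]" column

    def per_gb(self) -> float:
        return self.invocations / self.gigabytes


# Values copied from the two error counter logs in this thread.
FIRST_POST = {"read": LogRow(18337, 351.764), "write": LogRow(3596, 808.894)}
REFERENCE = {"read": LogRow(27, 382.530), "write": LogRow(229, 388.859)}


def ratio(kind: str) -> float:
    """How many times more invocations per GB the first post's drive shows."""
    return FIRST_POST[kind].per_gb() / REFERENCE[kind].per_gb()


if __name__ == "__main__":
    for kind in ("read", "write"):
        print(f"{kind}: {ratio(kind):.1f}x")
```

With these inputs the write ratio lands somewhat under 10x and the read ratio above 650x, so the order of magnitude quoted above holds.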

I don't know whether these are SAS protocol errors or platter read/write. Maybe someone else can enlighten us.

Have you tried running this drive through the HiTest diagnostics? Your drive was made the same week as one I just returned (same model) because it wouldn't show up in HiTest at all (whereas the one reported above was manufactured in week 10, and passed HiTest). Are you buying your drives from Newegg perchance?

You can sign up at http://www.hgst.com/partners and then you can download HiTest. It's a bit of a pain to use. Windows-only, and you need an ASPI driver in addition to your HBA driver. SAS drives show up as "server" drives (vs. ATA). I use the standard sequence + full-platter write + full-platter read before putting a drive into service.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Your bigger problem is the fact that your controller is masking the SMART parameters. So you're kind of on your own to figure out which drive is failing.

Until the disk starts actually reporting errors (which isn't always quickly) you have no way of proving easily which disk is failing.

So you might want to look at the SAS controller you are using and get something else. Note that we do not recommend the LSI 9201 and never have, and for good reason. It's not a good choice. ;)
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
cyberjock said:
Your bigger problem is the fact that your controller is masking the SMART parameters. So you're kind of on your own to figure out which drive is failing.

It's been a while since I've had sas disks hooked up directly to freenas, but I don't remember much more than what's in the 1st post of this thread. I don't see anything it's really masking. The error counter logs are shown, which is all I really had to go on with my sas drives. Since CODE tags weren't used, the formatting's been messed up, so it's hard to read which numbers are for what. I don't have enough experience reading sas smart info to really tell if it's good or not.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
titan_rw said:
It's been a while since I've had sas disks hooked up directly to freenas, but I don't remember much more than what's in the 1st post of this thread. I don't see anything it's really masking. The error counter logs are shown, which is all I really had to go on with my sas drives. Since CODE tags weren't used, the formatting's been messed up, so it's hard to read which numbers are for what. I don't have enough experience reading sas smart info to really tell if it's good or not.

Ok, so what defines an "error"? That's the problem, because SMART doesn't have an "error" counter. There are plenty of SMART parameters that could constitute an "error" such as CUPS, reallocated sector count, reallocated event count, offline uncorrectable, UDMA CRC errors. Even Raw Read Error Rate and Multi Zone Error Rates are only rates, not the number of hard errors.

If you look at his output and look at my output (which is true SMART output) they look *nothing* alike...
Code:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME  FLAG  VALUE WORST THRESH TYPE  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate  0x002f  200  200  051  Pre-fail  Always  -  0
  3 Spin_Up_Time  0x0027  192  192  021  Pre-fail  Always  -  9366
  4 Start_Stop_Count  0x0032  100  100  000  Old_age  Always  -  27
  5 Reallocated_Sector_Ct  0x0033  200  200  140  Pre-fail  Always  -  0
  7 Seek_Error_Rate  0x002e  200  200  000  Old_age  Always  -  0
  9 Power_On_Hours  0x0032  099  099  000  Old_age  Always  -  1423
 10 Spin_Retry_Count  0x0032  100  253  000  Old_age  Always  -  0
 11 Calibration_Retry_Count 0x0032  100  253  000  Old_age  Always  -  0
 12 Power_Cycle_Count  0x0032  100  100  000  Old_age  Always  -  27
192 Power-Off_Retract_Count 0x0032  200  200  000  Old_age  Always  -  26
193 Load_Cycle_Count  0x0032  200  200  000  Old_age  Always  -  28
194 Temperature_Celsius  0x0022  119  112  000  Old_age  Always  -  33
196 Reallocated_Event_Count 0x0032  200  200  000  Old_age  Always  -  0
197 Current_Pending_Sector  0x0032  200  200  000  Old_age  Always  -  0
198 Offline_Uncorrectable  0x0030  100  253  000  Old_age  Offline  -  0
199 UDMA_CRC_Error_Count  0x0032  200  200  000  Old_age  Always  -  0
200 Multi_Zone_Error_Rate  0x0008  200  200  000  Old_age  Offline  -  0



Notice he has almost none of those categories? That's because his SAS controller is providing them instead of the hard drive and giving you some kind of watered down answer based on whatever it defines as an error. The problem is that virtually *any* of those parameters I just mentioned could be a cause for serious concern, but you have no way of knowing which ones do (or don't) apply.

His card is basically like driving a vehicle that has a single light that says "I'm broken" when anything goes wrong. Tire with low pressure? Get the light. Need an oil change? You get the light. Engine is seriously overheated? Get the red light. So what is wrong when you get "the light"? Anything. You have no clue. It could be nothing, or it could be extremely serious. You don't know, and you don't get enough information to determine if this is something serious or not.

My SMART output is more like what traditional vehicles have; a dozen or more lights that give you a clue of what is wrong. An overheating engine gives you a red temp light, while a tire with low pressure is maybe a yellow warning light.
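
For SATA drives, the attributes named in this post can be pulled out of the `smartctl -A` attribute table with a few lines of code. A minimal sketch: the watch list of IDs and the function name are assumptions made for this illustration, not anything smartctl defines, and a nonzero raw value is only a prompt to look closer.

```python
import re

# Hypothetical watch list: reallocated sectors/events, pending, offline uncorrectable, CRC.
WATCHED_IDS = {5, 196, 197, 198, 199}

# ID, name, flag, value, worst, thresh, type, updated, when_failed, raw value
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def nonzero_watched(table: str) -> dict:
    """Return {attribute_name: raw_value} for watched attributes with a nonzero raw value."""
    hits = {}
    for line in table.splitlines():
        m = ROW.match(line)
        if m and int(m.group(1)) in WATCHED_IDS and int(m.group(6)) > 0:
            hits[m.group(2)] = int(m.group(6))
    return hits


if __name__ == "__main__":
    # Three rows copied from the attribute table in this post.
    sample = """\
  5 Reallocated_Sector_Ct  0x0033  200  200  140  Pre-fail  Always  -  0
197 Current_Pending_Sector  0x0032  200  200  000  Old_age  Always  -  0
199 UDMA_CRC_Error_Count  0x0032  200  200  000  Old_age  Always  -  0
"""
    print(nonzero_watched(sample))
```

This only applies to ATA-style attribute tables; a SAS drive's smartctl output uses a different layout and would need its own parsing.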
 

Lorsung23647

Dabbler
Joined
Mar 10, 2014
Messages
17
I'm also not entirely sure if it isn't the disk giving strange SMART output. Every other SATA disk that I have attached reports all attributes.

From my SAS disk:
Code:
=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HUS724040ALS640
Revision:             A1C4
User Capacity:        4,000,787,030,016 bytes [4.00 TB]
Logical block size:   512 bytes
LU is resource provisioned, LBPRZ=0
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Device type:          disk
Transport protocol:   SAS
Local Time is:        Thu Oct  9 18:54:31 2014 CDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     39 C
Drive Trip Temperature:        85 C

Manufactured in week 35 of year 2014
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  1
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  14
Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 343639108616192

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:       2803        0         0      2803      24012        438.624           0
write:         0        0         0         0       3981        978.709           0
verify:        0        0         0         0         61          0.000           0

Non-medium error count:        0

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -     294                 - [-   -    -]
# 2  Background short  Completed                   -     270                 - [-   -    -]
# 3  Background short  Completed                   -     246                 - [-   -    -]
# 4  Background short  Completed                   -     222                 - [-   -    -]
# 5  Background short  Completed                   -     198                 - [-   -    -]
# 6  Background short  Completed                   -     174                 - [-   -    -]
# 7  Background long   Completed                   -     163                 - [-   -    -]
# 8  Background short  Completed                   -     150                 - [-   -    -]
# 9  Background short  Completed                   -     126                 - [-   -    -]
#10  Background short  Completed                   -     102                 - [-   -    -]
#11  Background short  Completed                   -      78                 - [-   -    -]
#12  Background short  Completed                   -      54                 - [-   -    -]
#13  Background short  Completed                   -      30                 - [-   -    -]
#14  Background short  Completed                   -       6                 - [-   -    -]
Long (extended) Self Test duration: 37038 seconds [617.3 minutes]


From one of my SATA disks:

Code:
=== START OF INFORMATION SECTION ===
Device Model:     ST4000VN000-1H4168
Firmware Version: SC43
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5900 rpm
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Oct  9 18:55:36 2014 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
     was never started.
     Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
     without error or no self-test has ever
     been run.
Total time to complete Offline
data collection:   (  117) seconds.
Offline data collection
capabilities:     (0x73) SMART execute Offline immediate.
     Auto Offline data collection on/off support.
     Suspend Offline collection upon new
     command.
     No Offline surface scan supported.
     Self-test supported.
     Conveyance Self-test supported.
     Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
     power-saving mode.
     Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
     General Purpose Logging supported.
Short self-test routine
recommended polling time:   (   1) minutes.
Extended self-test routine
recommended polling time:   ( 536) minutes.
Conveyance self-test routine
recommended polling time:   (   2) minutes.
SCT capabilities:         (0x10bd) SCT Status supported.
     SCT Error Recovery Control supported.
     SCT Feature Control supported.
     SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   116   099   006    Pre-fail  Always       -       112060072
  3 Spin_Up_Time            0x0003   092   092   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       16
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   079   060   030    Pre-fail  Always       -       89343101
  9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       5531
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       16
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   001   001   000    Old_age   Always       -       166
190 Airflow_Temperature_Cel 0x0022   067   063   045    Old_age   Always       -       33 (Min/Max 25/35)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       16
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       16
194 Temperature_Celsius     0x0022   033   040   000    Old_age   Always       -       33 (0 20 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      5512         -
# 2  Conveyance offline  Completed without error       00%      5492         -
# 3  Short offline       Completed without error       00%      5488         -
# 4  Short offline       Completed without error       00%      5464         -
# 5  Short offline       Completed without error       00%      5439         -
# 6  Short offline       Completed without error       00%      5415         -
# 7  Short offline       Completed without error       00%      5391         -
# 8  Extended offline    Completed without error       00%      5380         -
# 9  Short offline       Completed without error       00%      5367         -
#10  Short offline       Completed without error       00%      5343         -
#11  Conveyance offline  Completed without error       00%      5324         -
#12  Short offline       Completed without error       00%      5319         -
#13  Short offline       Completed without error       00%      5295         -
#14  Short offline       Completed without error       00%      5271         -
#15  Short offline       Completed without error       00%      5247         -
#16  Short offline       Completed without error       00%      5223         -
#17  Extended offline    Completed without error       00%      5213         -
#18  Short offline       Completed without error       00%      5199         -
#19  Short offline       Completed without error       00%      5175         -
#20  Conveyance offline  Completed without error       00%      5155         -
#21  Short offline       Completed without error       00%      5151         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.




Edit: I also moved it to an LSI 9211-8i HBA I have in my box, and the output is the same as with the 9201.
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
cyberjock said:
Ok, so what defines an "error"? That's the problem, because SMART doesn't have an "error" counter. There are plenty of SMART parameters that could constitute an "error" such as CUPS, reallocated sector count, reallocated event count, offline uncorrectable, UDMA CRC errors. Even Raw Read Error Rate and Multi Zone Error Rates are only rates, not the number of hard errors.

And on SATA drives you get all these. These are the attributes I'm used to seeing as well.

cyberjock said:
If you look at his output and look at my output (which is true SMART output) they look *nothing* alike...
Code:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME  FLAG  VALUE WORST THRESH TYPE  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate  0x002f  200  200  051  Pre-fail  Always  -  0
  3 Spin_Up_Time  0x0027  192  192  021  Pre-fail  Always  -  9366
  4 Start_Stop_Count  0x0032  100  100  000  Old_age  Always  -  27
  5 Reallocated_Sector_Ct  0x0033  200  200  140  Pre-fail  Always  -  0
  7 Seek_Error_Rate  0x002e  200  200  000  Old_age  Always  -  0
  9 Power_On_Hours  0x0032  099  099  000  Old_age  Always  -  1423
 10 Spin_Retry_Count  0x0032  100  253  000  Old_age  Always  -  0
 11 Calibration_Retry_Count 0x0032  100  253  000  Old_age  Always  -  0
 12 Power_Cycle_Count  0x0032  100  100  000  Old_age  Always  -  27
192 Power-Off_Retract_Count 0x0032  200  200  000  Old_age  Always  -  26
193 Load_Cycle_Count  0x0032  200  200  000  Old_age  Always  -  28
194 Temperature_Celsius  0x0022  119  112  000  Old_age  Always  -  33
196 Reallocated_Event_Count 0x0032  200  200  000  Old_age  Always  -  0
197 Current_Pending_Sector  0x0032  200  200  000  Old_age  Always  -  0
198 Offline_Uncorrectable  0x0030  100  253  000  Old_age  Offline  -  0
199 UDMA_CRC_Error_Count  0x0032  200  200  000  Old_age  Always  -  0
200 Multi_Zone_Error_Rate  0x0008  200  200  000  Old_age  Offline  -  0

And I assume this is a SATA drive? Because the SAS protocol doesn't do these attributes.

cyberjock said:
Notice he has almost none of those categories? That's because his SAS controller is providing them instead of the hard drive and giving you some kind of watered down answer based on whatever it defines as an error. The problem is that virtually *any* of those parameters I just mentioned could be a cause for serious concern, but you have no way of knowing which ones do (or don't) apply.

His SAS controller is reporting everything it can about his SAS hard drive. The SAS controller is perfectly capable of reporting the SATA attributes you listed above from a SATA drive. Why would you expect SATA output from a SAS controller hooked up to a SAS hard drive?

cyberjock said:
His card is basically like driving a vehicle that has a single light that says "I'm broken" when anything goes wrong. Tire with low pressure? Get the light. Need an oil change? You get the light. Engine is seriously overheated? Get the red light. So what is wrong when you get "the light"? Anything. You have no clue. It could be nothing, or it could be extremely serious. You don't know, and you don't get enough information to determine if this is something serious or not.

Should we switch to using only SATA drives so we get the SATA attributes then? This is a valid argument, although I am comfortable with SAS drives even if I'm not used to how smartctl reads them.

cyberjock said:
My SMART output is more like what traditional vehicles have; a dozen or more lights that give you a clue of what is wrong. An overheating engine gives you a red temp light, while a tire with low pressure is maybe a yellow warning light.

Or it's more like a vehicle that has lights in a configuration / language that you recognize / don't recognize. Like I said in my other post, I'm not familiar enough with SAS smartctl output to tell a good drive from a bad one.
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
Lorsung23647 said:
I'm also not entirely sure if it isn't the disk giving strange SMART output. Every other SATA disk that I have attached reports all attributes.

From my SAS disk:

<SNIP>

From one of my SATA disks:

<SNIP>


I agree. Your controller looks like it's working fine to me. It looks like it's reading the sata specific attributes from sata drives, and sas specific data (including the error count logs I mentioned) from sas drives.

I do have some spare sas drives, and I know one of them has about 1,000 reallocated sectors, which I think it showed in the 'grown defect list'. And I think it showed much worse numbers in the error count area.
 

Lorsung23647

Dabbler
Joined
Mar 10, 2014
Messages
17
pjc said:
Your error counts do seem high. For reference, here are mine:

Code:
Error counter log:
          Errors Corrected by           Total   Correction     Gigabytes    Total
              ECC          rereads/    errors   algorithm      processed    uncorrected
          fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:         36        0         0        36         27        382.530           0
write:         0        0         0         0        229        388.859           0
verify:      126        0         0       126         71         16.143           0

That's about 10x the number of correction invocations per gigabyte written, and over 650x on reads.

I don't know whether these are SAS protocol errors or platter read/write. Maybe someone else can enlighten us.

Have you tried running this drive through the HiTest diagnostics? Your drive was made the same week as one I just returned (same model) because it wouldn't show up in HiTest at all (whereas the one reported above was manufactured in week 10, and passed HiTest). Are you buying your drives from Newegg perchance?

You can sign up at http://www.hgst.com/partners and then you can download HiTest. It's a bit of a pain to use. Windows-only, and you need an ASPI driver in addition to your HBA driver. SAS drives show up as "server" drives (vs. ATA). I use the standard sequence + full-platter write + full-platter read before putting a drive into service.

I don't have any other HBAs, other than what are in my FreeNAS box, otherwise I probably would run HiTest.

And it was a Newegg purchase.
 

pjc

Contributor
Joined
Aug 26, 2014
Messages
187
So it could be Newegg got a bad batch. Or their non-standard packaging might be coming back to haunt them.

We'll see how my replacement does (next week).

I can't really offer any additional insight unless you're willing to reboot your FreeNAS box to test the drive (that's what I ended up doing...scheduled downtime).

Incidentally, your SAS output is what I see from all my SAS drives as well. The SATA drives are more verbose, since SMART is an ATA feature, and smartctl is just fairly clever about querying SAS drives for the messages they support.
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
Here's one of my SAS drives hooked up to a 1015 hba in IT mode:

Code:
root@test-nas ~ # smartctl -a -q noserial /dev/da1
smartctl 6.2 2013-07-26 r3841 [FreeBSD 9.2-RELEASE amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:  HP
Product:  DG146ABAB4
Revision:  HPDE
User Capacity:  146,815,737,856 bytes [146 GB]
Logical block size:  512 bytes
Rotation Rate:  10033 rpm
Device type:  disk
Transport protocol:  SAS
Local Time is:  Thu Oct  9 18:41:45 2014 MDT
SMART support is:  Available - device has SMART capability.
SMART support is:  Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:  37 C
Drive Trip Temperature:  68 C

Elements in grown defect list: 782

Vendor (Seagate) cache information
  Blocks sent to initiator = 1675972621
  Blocks received from initiator = 4171778528
  Blocks read from cache and sent to initiator = 4143331125
  Number of read and write commands whose size <= segment size = 248575062
  Number of read and write commands whose size > segment size = 0

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 39939.38
  number of minutes until next internal SMART test = 52

Error counter log:
  Errors Corrected by  Total  Correction  Gigabytes  Total
  ECC  rereads/  errors  algorithm  processed  uncorrected
  fast | delayed  rewrites  corrected  invocations  [10^9 bytes]  errors
read:  0  0  0  0  0  0.000  0
write:  0  0  0  0  0  0.000  0

Non-medium error count:  0

SMART Self-test log
Num  Test  Status  segment  LifeTime  LBA_first_err [SK ASC ASQ]
  Description  number  (hours)
# 1  Background long  Completed  -  39935  - [-  -  -]
Long (extended) Self Test duration: 2070 seconds [34.5 minutes]



Note 782 'elements in grown defect list'. I read this as bad sectors.

It also shows all zeros for the error count log. I know some other drives have been good but shown data for this area. I don't know enough about it to know what kind of numbers are good or bad.
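
To compare these fields across several SAS drives without eyeballing each dump, the grown defect count and the error counter rows can be extracted from the text. A minimal sketch that assumes the output layout shown in this thread (smartctl 6.2); the function name and dictionary keys are made up for this illustration.

```python
import re

COLUMNS = ("ecc_fast", "ecc_delayed", "rereads_rewrites", "total_corrected",
           "invocations", "gigabytes", "uncorrected")

# One row of the error counter log: label followed by seven numeric columns.
ROW = re.compile(
    r"^(read|write|verify):[ \t]+(\d+)[ \t]+(\d+)[ \t]+(\d+)[ \t]+(\d+)"
    r"[ \t]+(\d+)[ \t]+([\d.]+)[ \t]+(\d+)[ \t]*$",
    re.MULTILINE,
)


def parse_sas_smart(text: str) -> dict:
    """Extract the grown defect count and error counter rows from SAS smartctl output."""
    result = {"grown_defects": None, "counters": {}}
    m = re.search(r"Elements in grown defect list:\s*(\d+)", text)
    if m:
        result["grown_defects"] = int(m.group(1))
    for row in ROW.finditer(text):
        values = [float(v) if "." in v else int(v) for v in row.groups()[1:]]
        result["counters"][row.group(1)] = dict(zip(COLUMNS, values))
    return result


if __name__ == "__main__":
    # Lines copied from the output in this post.
    sample = """\
Elements in grown defect list: 782
read:  0  0  0  0  0  0.000  0
write:  0  0  0  0  0  0.000  0
"""
    print(parse_sas_smart(sample))
```

The parser only reports what the drive prints; it makes no judgement about which numbers are good or bad.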
 

pjc

Contributor
Joined
Aug 26, 2014
Messages
187
cyberjock said:
If you look at his output and look at my output (which is true SMART output) they look *nothing* alike...

Notice he has almost none of those categories? That's because his SAS controller is providing them instead of the hard drive and giving you some kind of watered down answer based on whatever it defines as an error.

Good guess, but no. You're looking at SATA drives. He has a SAS drive. Take a look at smartctl's own SAS example (http://smartmontools.sourceforge.net/smartmontools_scsi.html). It looks like his output.

SAS has an entirely different set of "informational exceptions." For example, SAS uses the "grown defects list" instead of "reallocated sector count".

Looking at smartctl, he might try "-l sasphy" to see if there are any additional bus errors.
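
A minimal sketch of scripting that check. The device path is whatever the caller passes in (for example /dev/da0 from the first post), the helper names are made up for this illustration, and the output is returned as raw text because its exact layout is not shown in this thread.

```python
import shutil
import subprocess
from typing import List, Optional


def sasphy_command(device: str) -> List[str]:
    """Build the smartctl invocation that prints the SAS phy log for a device."""
    return ["smartctl", "-l", "sasphy", device]


def read_sasphy_log(device: str) -> Optional[str]:
    """Run smartctl and return its output, or None if smartctl is not installed."""
    if shutil.which("smartctl") is None:
        return None
    proc = subprocess.run(sasphy_command(device), capture_output=True,
                          text=True, check=False)
    return proc.stdout


if __name__ == "__main__":
    print(" ".join(sasphy_command("/dev/da0")))
```

Running it needs root on most systems, the same as the other smartctl commands in this thread.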

Separately, what's wrong with the 9201? I thought you guys liked LSI HBAs.
 