SMART output assistance requested

JediDan

Dabbler
Joined
Apr 9, 2019
Messages
11
Hello, I've recently acquired a FreeNAS server and am putting it through some tests before trusting it in the home lab. My first clue that something might be amiss was an apparent reboot during the first night: the next morning, the reported uptime started at ~5 AM. Searching online brought me to the forums and pointed me toward running SMART tests; more on those in a moment.

Questions:
1. Is the console output recorded in any system log that I can access for future diagnostics? Currently I only see it with a monitor plugged in.
2. I've run smartctl -x on all my drives and get different output from all the posts I've seen. I'm not seeing the output table titled "SMART Self-test log structure revision X" etc. I suspect smartctl may need a device specified, but the LSI 9210 isn't listed in the smartctl manual. This is a wild hypothesis formulated in ignorance; suggestions welcome. What I do see is below:

The terminal monitor shows the following output.
Code:
(da6:mps0:0:6:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da6:mps0:0:6:0): Info: 0x7640afc8
(da6:mps0:0:6:0): Field Replaceable Unit: 129
(da6:mps0:0:6:0): Actual Retry Count: 157
(da6:mps0:0:6:0): Error 5, Unretryable error
(da6:mps0:0:6:0): READ(10). CDB: 28 00 76 40 af f8 00 00 08 00
(da6:mps0:0:6:0): CAM status: SCSI Status Error
(da6:mps0:0:6:0): SCSI status: Check Condition
(da6:mps0:0:6:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da6:mps0:0:6:0): Info: 0x7640aff9
(da6:mps0:0:6:0): Field Replaceable Unit: 129
(da6:mps0:0:6:0): Actual Retry Count: 157
(da6:mps0:0:6:0): Error 5, Unretryable error
(da6:mps0:0:6:0): READ(10). CDB: 28 00 76 40 b0 10 00 00 08 00
(da6:mps0:0:6:0): CAM status: SCSI Status Error
(da6:mps0:0:6:0): SCSI status: Check Condition
(da6:mps0:0:6:0): SCSI sense: MEDIUM ERROR asc:18,5 (Recovered data - recommend reassignment)
(da6:mps0:0:6:0): Info: 0x7640b014
(da6:mps0:0:6:0): Field Replaceable Unit: 1
(da6:mps0:0:6:0): Actual Retry Count: 17
(da6:mps0:0:6:0): READ(10). CDB: 28 00 76 40 af f8 00 00 08 00
(da6:mps0:0:6:0): CAM status: SCSI Status Error
(da6:mps0:0:6:0): SCSI status: Check Condition
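For anyone trying to relate the CDB bytes to the "Info" values in the console output above, here is a minimal Python sketch. It assumes the standard READ(10) layout (opcode 0x28, bytes 2-5 the starting LBA, bytes 7-8 the transfer length in blocks, all big-endian) and assumes the Info field carries the LBA the drive failed on. The function name decode_read10 is made up for illustration.

```python
def decode_read10(cdb_hex: str) -> dict:
    """Decode a READ(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: starting LBA
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in blocks
    return {"lba": lba, "blocks": blocks, "last_lba": lba + blocks - 1}


# CDB and Info value copied from the console output in this post.
req = decode_read10("28 00 76 40 af f8 00 00 08 00")
failed_lba = 0x7640AFF9  # "Info: 0x7640aff9"

in_range = req["lba"] <= failed_lba <= req["last_lba"]
print(hex(req["lba"]), req["blocks"], in_range)  # prints: 0x7640aff8 8 True
```

Under those assumptions, the reported LBA falls inside the 8-block read that was requested, which is consistent with a single bad spot on da6 rather than a transport problem.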


Here's the output of smartctl -x /dev/da0.
Code:
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST32000445SS
Revision:             MS02
User Capacity:        2,000,398,934,016 bytes [2.00 TB]
Logical block size:   512 bytes
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c50034da11b7
Serial number:        9WM7CCZ40000920486P4
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Tue Apr  9 19:36:01 2019 PDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled
Read Cache is:        Enabled
Writeback Cache is:   Disabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     35 C
Drive Trip Temperature:        68 C

Manufactured in week 33 of year 2011
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  44
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  44
Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 4098019783
  Blocks received from initiator = 786283223
  Blocks read from cache and sent to initiator = 1061378457
  Number of read and write commands whose size <= segment size = 132265888
  Number of read and write commands whose size > segment size = 48163164

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 17845.47
  number of minutes until next internal SMART test = 22

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   3666938790        1         0  3666938791   3666938791     145034.698           0
write:         0        0         0         0          0       2628.822           0
verify:   134904        0         0    134904     134904          0.000           0

Non-medium error count:       26

No self-tests have been logged

Background scan results log
  Status: scan is active
    Accumulated power on time, hours:minutes 17845:28 [1070728 minutes]
    Number of background scans performed: 115,  scan progress: 72.55%
    Number of background medium scans performed: 22288

   #  when        lba(hex)    [sk,asc,ascq]    reassign_status
   1 6432:27  00000000e8d09238  [1,17,1]   Recovered via rewrite in-place
   2 9211:57  000000009284b09e  [1,17,1]   Recovered via rewrite in-place
   3 10126:32  00000000e8d097b2  [1,17,1]   Recovered via rewrite in-place

snip

  29 17673:56  00000000c8bacccd  [1,17,1]   Recovered via rewrite in-place
  30 17836:43  000000001d54bcac  [1,17,1]   Recovered via rewrite in-place

Protocol Specific port log page for SAS SSP
relative target port id = 1
  generation code = 0
  number of phys = 1
  phy identifier = 0
    attached device type: SAS or SATA device
    attached reason: power on
    reason: power on
    negotiated logical link rate: phy enabled; 6 Gbps
    attached initiator port: ssp=1 stp=1 smp=1
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x5000c50034da11b5
    attached SAS address = 0x500605b0056c8fb0
    attached phy identifier = 0
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 0
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 0
     Running disparity error count: 0
     Loss of dword synchronization count: 0
     Phy reset problem count: 0
relative target port id = 2
  generation code = 0
  number of phys = 1
  phy identifier = 1
    attached device type: no device attached
    attached reason: unknown
    reason: unknown
    negotiated logical link rate: phy enabled; 1.5 Gbps
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x5000c50034da11b6
    attached SAS address = 0x0
    attached phy identifier = 0
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 0
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 0
     Running disparity error count: 0
     Loss of dword synchronization count: 0
     Phy reset problem count: 0


I'm also running badblocks on all the drives and that may be finished tomorrow.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
That is fairly normal for SAS drives. I don't find the output very easily understandable. Sometimes the SATA drives are not easy to decipher either, but the SAS drives are more cryptic than they need to be. This one appears to have some sectors that are having to be "Recovered via rewrite in-place" and that sounds like a bad sector to me.
How many of the drives are doing that?
 

JediDan

Dabbler
Joined
Apr 9, 2019
Messages
11
That is fairly normal for SAS drives. I don't find the output very easily understandable. Sometimes the SATA drives are not easy to decipher either, but the SAS drives are more cryptic than they need to be. This one appears to have some sectors that are having to be "Recovered via rewrite in-place" and that sounds like a bad sector to me.
How many of the drives are doing that?

I have three that report slightly differently, as shown below. The remaining 6 report at least some "Recovered via rewrite in-place".

Code:
   1 25759:44  000000003289ef24  [4,32,0]   Reassignment by disk failed
   2 25759:44  000000003289ef24  [4,32,0]   Reassignment by disk failed
   3 25759:44  000000003289ef3b  [4,32,0]   Reassignment by disk failed
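
To compare drives on this point, one option is to tally the reassign_status column of the background scan table for each disk. The following is a rough Python sketch, assuming the row layout shown in the smartctl output in this thread (index, when, 16-digit hex LBA, bracketed sense triple, status text). The function name tally_scan_results is hypothetical.

```python
import re
from collections import Counter

# One background-scan row: "#  when  lba(hex)  [sk,asc,ascq]  reassign_status"
ROW = re.compile(
    r"^\s*(\d+)\s+(\d+:\d+)\s+([0-9a-fA-F]{16})\s+\[([0-9a-fA-F,]+)\]\s+(.+?)\s*$"
)


def tally_scan_results(text: str) -> Counter:
    """Count how often each reassign_status string appears."""
    counts = Counter()
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            counts[match.group(5)] += 1
    return counts


# Rows copied from the outputs posted in this thread.
sample = """\
   #  when        lba(hex)    [sk,asc,ascq]    reassign_status
   1 6432:27  00000000e8d09238  [1,17,1]   Recovered via rewrite in-place
   2 9211:57  000000009284b09e  [1,17,1]   Recovered via rewrite in-place
   1 25759:44  000000003289ef24  [4,32,0]   Reassignment by disk failed
"""

counts = tally_scan_results(sample)
for status, n in counts.most_common():
    print(f"{n:3d}  {status}")
```

Running it once per drive (feeding in that drive's smartctl -x text) gives a quick count of recovered versus failed reassignments.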


My badblocks scan is still ongoing; the second disk is almost done, with 7 more to go. My next question is how bad are these drives? I recognize that some bad sectors aren't necessarily indicative of the end, but some of these disks exhibit more errors than the others. What would you do?

I've attached a log file containing the output for all 9 disks.
 

Attachments

  • log.txt
    41.8 KB

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Reassignment by disk failed
That certainly looks like a bad sector that could not be reassigned. For the systems I manage at work, I replace drives that start showing bad sectors because it is usually an indicator that the drive either had a manufacturing defect (if it is new) or is wearing out (if it is old). Drives do have a kind of life span, like tires on a car, and they wear out after a while. If you just purchased the system recently, I would contact the vendor and see if they will exchange the drives. They might.
 

JediDan

Dabbler
Joined
Apr 9, 2019
Messages
11
...If you just purchased the system recently, I would contact the vendor and see if they will exchange the drives. They might.

Thanks, a very reasonable recommendation. I will do so, but first I'd like to ensure I understand which drives are the ones with the issues - or at least the most issues. Am I correct in concluding that "Recovered via rewrite in-place" means a sector that has been successfully read and rewritten elsewhere, so the disk is more or less OK, while "Reassignment by disk failed" means an unrecovered sector? If that's true then I need to see if they'll swap those 3 drives.

And last, any thoughts on the first question about locating that terminal output? I'm concerned if it reflects further issues that aren't being reported in a more visible location in FreeNAS.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
With SAS drives you need to be concerned primarily with two main attributes:

1. Elements in grown defect list
2. Uncorrected errors on any of read/write/verify

Looking at your log file, you have three drives that have major issues, but since you didn't include the /dev/daX identifiers, I'll ID them by serial number here.

Code:
Serial number:        9WM0LX910000C030C3UU
Elements in grown defect list: 4096

Serial number:        9WM0LLL70000C03239DZ
Elements in grown defect list: 4095

Serial number:        9WM0LM2Y0000C036AF70
Elements in grown defect list: 4092


They are also showing a number of uncorrected read errors. Definitely have these three replaced as DOA.
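
To check those two attributes across many disks without reading each report by eye, a small Python sketch along these lines might help. It assumes the SAS report layout shown earlier in the thread (an "Elements in grown defect list" line and read/write/verify rows whose last column is the total uncorrected errors). The names summarize_sas_smart and needs_attention are made up for illustration.

```python
import re


def summarize_sas_smart(text: str) -> dict:
    """Pull serial, grown defect count and uncorrected error totals from smartctl -x text."""
    summary = {"serial": None, "grown_defects": None, "uncorrected": {}}
    match = re.search(r"^Serial number:\s+(\S+)", text, re.MULTILINE)
    if match:
        summary["serial"] = match.group(1)
    match = re.search(r"^Elements in grown defect list:\s+(\d+)", text, re.MULTILINE)
    if match:
        summary["grown_defects"] = int(match.group(1))
    # Error counter log rows: the last column is "Total uncorrected errors".
    for row in re.finditer(r"^(read|write|verify):\s+(.*)$", text, re.MULTILINE):
        summary["uncorrected"][row.group(1)] = int(row.group(2).split()[-1])
    return summary


def needs_attention(summary: dict) -> bool:
    """Flag a drive with any grown defects or any uncorrected errors."""
    return bool(summary["grown_defects"]) or any(summary["uncorrected"].values())


# Lines copied from the reports quoted in this thread.
healthy = """\
Serial number:        9WM7CCZ40000920486P4
Elements in grown defect list: 0
read:   3666938790        1         0  3666938791   3666938791     145034.698           0
write:         0        0         0         0          0       2628.822           0
verify:   134904        0         0    134904     134904          0.000           0
"""
suspect = """\
Serial number:        9WM0LX910000C030C3UU
Elements in grown defect list: 4096
"""

for report in (healthy, suspect):
    info = summarize_sas_smart(report)
    print(info["serial"], info["grown_defects"], needs_attention(info))
```

The threshold here (anything non-zero gets flagged) is deliberately strict and is only one possible policy.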

With regard to wanting the console messages, you could set up SNMP if you have that in your environment. Other than that, you can periodically pull the files from the /var/log directory over SSH; console messages should be in dmesg or similar, depending on their source.
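
Once the log files have been pulled, a short Python sketch like the one below could count the CAM sense errors per disk. It assumes the lines keep the "(daN:mps0:...): SCSI sense: ..." shape seen on the console; the path /var/log/messages in the usage comment is an assumption, so check which file on your system actually holds these messages. The function name count_sense_errors is hypothetical.

```python
import re
from collections import Counter

# Matches e.g. "(da6:mps0:0:6:0): SCSI sense: MEDIUM ERROR asc:11,0 (...)"
SENSE = re.compile(r"\((da\d+):[^)]*\): SCSI sense: (.+)$")


def count_sense_errors(log_text: str) -> Counter:
    """Count (device, sense message) pairs found in kernel log text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = SENSE.search(line.rstrip())
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# Lines copied from the console output in the first post.
sample = """\
(da6:mps0:0:6:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da6:mps0:0:6:0): Info: 0x7640afc8
(da6:mps0:0:6:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da6:mps0:0:6:0): SCSI sense: MEDIUM ERROR asc:18,5 (Recovered data - recommend reassignment)
"""

errors = count_sense_errors(sample)
for (device, message), n in errors.most_common():
    print(f"{device}  x{n}  {message}")

# Usage on a real file (path is an assumption):
#   with open("/var/log/messages", errors="replace") as fh:
#       print(count_sense_errors(fh.read()))
```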

Sidebar: All of your drives are showing as having their write cache disabled:
Writeback Cache is: Disabled
I assume they weren't part of a pool when these smartctl logs were gathered?

Sidebar 2: I'd be a little miffed if I were sold a system promised to come with Ultrastar 7K3000s and got first-gen Constellation ES instead. Not that it's a particularly bad drive, but I'd rank the HGSTs better personally. They are SED (Self Encrypting Drives) though so you could enable that feature if you're feeling brave. ;)
 

JediDan

Dabbler
Joined
Apr 9, 2019
Messages
11
Thanks for confirming the growing error rate is what to watch. From what I can tell by the activity lights the drives aren't logically mounted in any relation to their physical position. Of the 4-wide x 3-tall case I have the right column empty. The indicator lights show the top-right drive corresponding to activity on /da8, bottom center /da5, and bottom left /da7... I'd have to pull them by serial number anyway.

...you can periodically pull the files from the /var/log directory over SSH, console messages should be in dmesg or similar depending on their source.
Thanks, I'll dig around and see what I can find.

Sidebar: All of your drives are showing as having their write cache disabled:
Writeback Cache is: Disabled
I assume they weren't part of a pool when these smartctl logs were gathered?
They are currently part of a pool that I'm using only for testing. If I had to rebuild the pool that would be fine since I just started copying test data to it. I haven't turned write cache on or off so I'll need to look into that provided I don't end up sending all drives back in light of your next comment.

Sidebar 2: I'd be a little miffed if I were sold a system promised to come with Ultrastar 7K3000s and got first-gen Constellation ES instead. Not that it's a particularly bad drive, but I'd rank the HGSTs better personally. They are SED (Self Encrypting Drives) though so you could enable that feature if you're feeling brave. ;)
I wasn't expecting much, given the seller's disclaimer that they'd use whatever junk they had on hand if they ran out of the performance drives. Once I have some decent information to make my case, I'll see how good their warranty service is, especially since the listing says both 1-year and 30-days.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Thanks for confirming the growing error rate is what to watch. From what I can tell by the activity lights the drives aren't logically mounted in any relation to their physical position. Of the 4-wide x 3-tall case I have the right column empty. The indicator lights show the top-right drive corresponding to activity on /da8, bottom center /da5, and bottom left /da7... I'd have to pull them by serial number anyway.

You should be able to use sesutil from a shell to map out your bays and find out what's living where, but using serial numbers for confirmation is a good idea regardless.
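
For the serial-number route, a minimal Python sketch could build a serial-to-device map by running smartctl -i on each disk and reading its "Serial number" line. This assumes smartctl is on the PATH and that the device names you pass in exist; the function names serial_from_smartctl and map_serials are made up for illustration.

```python
import re
import subprocess


def serial_from_smartctl(text: str):
    """Return the serial number from smartctl -i output, or None if absent."""
    match = re.search(r"^Serial number:\s+(\S+)", text, re.MULTILINE | re.IGNORECASE)
    return match.group(1) if match else None


def map_serials(devices):
    """Run smartctl -i on each device path and return {serial: device}."""
    mapping = {}
    for dev in devices:
        result = subprocess.run(
            ["smartctl", "-i", dev], capture_output=True, text=True
        )
        mapping[serial_from_smartctl(result.stdout)] = dev
    return mapping


# Parsing demo using the line from the da0 report earlier in this thread.
sample = "Product:              ST32000445SS\nSerial number:        9WM7CCZ40000920486P4\n"
print(serial_from_smartctl(sample))  # prints: 9WM7CCZ40000920486P4

# Usage on the server itself (device list is an assumption):
#   print(map_serials([f"/dev/da{n}" for n in range(9)]))
```

With that map in hand, the serial printed on each drive label tells you which daN it is, whatever bay it sits in.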

They are currently part of a pool that I'm using only for testing. If I had to rebuild the pool that would be fine since I just started copying test data to it. I haven't turned write cache on or off so I'll need to look into that provided I don't end up sending all drives back in light of your next comment.

ZFS should enable (and control) the drives' write cache once they're in a pool; if it didn't get enabled correctly, your performance would be pretty bad.

Good luck with the eBay seller; hopefully getting replacement drives isn't a hassle.
 

JediDan

Dabbler
Joined
Apr 9, 2019
Messages
11
Running 1000BASE-T Ethernet I was seeing ~85 mbps upload on ftp to the server from a 6-year-old desktop. Not great, not terrible either. I haven't run download tests yet.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
That advertisement lists the backplane as: BPN-SAS-826A I-Path Direct Attached Backplane
I am not sure if it supports this, but it can't hurt to try the command: sesutil locate da0 on
This should turn on the locate light for the disk identified by its da# name; the command sesutil locate da0 off would turn it off again.
If it works, it will save you from needing to figure it out by serial number.

I actually wrote a resource on this that might have some helpful info:

https://www.ixsystems.com/community/threads/find-a-drive-on-a-sas-backplane.73640/
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Running 1000BASE-T Ethernet I was seeing ~85 mbps upload on ftp to the server from a 6-year-old desktop. Not great, not terrible either. I haven't run download tests yet.
This is probably mostly down to the drives having problems. A defective drive will slow down the entire pool. I have been through that before.
Here is a utility written by one of the forum moderators that will help you test the speed of your drives:

solnet-array-test (for drive / array speed) non destructive test
https://forums.freenas.org/index.php?resources/solnet-array-test.1/
 

JediDan

Dabbler
Joined
Apr 9, 2019
Messages
11
Thanks for the resources. I tried sesutil and received the error sesutil: No SES devices found. Any suggestions?

Badblocks finished scanning the last disk today too. One drive showed over 7700 errors in 8 hours 45 minutes after crawling only ~0.1% of the disk; I cancelled the remainder of the scan since the heat death of the universe approaches. The next two showed 100 errors and climbing in the first ~0.05% of the disk, so I stopped scanning them after 15 minutes. The last disk finished without any noted errors, so I'll retire the three and start adding some new drives.
 