Weird smartctl bug on FreeNAS 11.3 causing disk errors

Joined
May 2, 2017
Messages
5
Ticket: NAS-105721
UPDATE: This happens every time I run smartctl, like:
Code:
smartctl -a /dev/da6

It affects 11.3 only, no issues on 11.2 or older.
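
One way to confirm that each smartctl call bumps the counter is to read "Non-medium error count" several times in a row and compare the readings. Below is a minimal Python sketch, not something from the original report: the helper names (`parse_non_medium_count`, `read_non_medium_count`, `count_growth`) are made up for illustration, and it assumes `smartctl` is on the PATH and that the drive prints the counter in the same text format as the output quoted further down in this thread.

```python
import re
import subprocess
import sys

# Matches the line smartctl prints for SCSI/SAS drives, e.g.
# "Non-medium error count:     2685"
NON_MEDIUM_RE = re.compile(r"^Non-medium error count:\s+(\d+)", re.MULTILINE)


def parse_non_medium_count(smartctl_output):
    """Return the counter from smartctl text output, or None if it is absent."""
    match = NON_MEDIUM_RE.search(smartctl_output)
    return int(match.group(1)) if match else None


def read_non_medium_count(device):
    """Run `smartctl -a DEVICE` once and return the parsed counter."""
    # smartctl's exit status is a bitmask and can be non-zero even when the
    # report was printed, so the return code is deliberately not checked.
    result = subprocess.run(
        ["smartctl", "-a", device],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_non_medium_count(result.stdout)


def count_growth(device, runs=5):
    """Query the device `runs` times and return the list of readings."""
    return [read_non_medium_count(device) for _ in range(runs)]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python3 check_counter.py /dev/da6
    print(count_growth(sys.argv[1]))
```

If the readings climb with every run on 11.3 and stay flat on 11.2, the queries themselves are the likely trigger rather than normal disk I/O.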

After a recent upgrade from 11.2-U8 to 11.3-U2, we noticed rapid, constant growth of "non-media errors" on all 12 connected disks simultaneously. I tried to replicate the issue on a test server with the same hardware but a fresh installation (no pools), and new disks showed the same behavior.

No such problems with 11.2-U8. After rolling back from 11.3-U2 to 11.2-U8, the errors stopped growing. Of course, we now can't reset the counters on our new disks to normal values...

CPU: 2x Intel Xeon E5-2620 V2 Hex (6) Core 2.1GHz
MEMORY: 256GB DDR3 ECC
Chassis: CSE-846E16-R1200B
Motherboard: X9DRi-F
Backplane: BPN-SAS2-846EL1 24-port 4U SAS2 6Gbps
Controller: 1x LSI00301(9207-8i)
Disks: MB6000FEDAU (12x6TB SAS)

Please check the attached screenshots for non-media disk errors and other details. Before the upgrade, all disks had <500 errors.
Any idea what could cause this issue?
 

Attachments

  • gr-01.jpg (158.2 KB)
  • gr-02.jpg (216.7 KB)
  • HDD.jpg (331.2 KB)

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
Short of someone coming into this thread and exclaiming “I’ve seen this, I know what it is!”, your best bet is to open a bug ticket via Jira, under system->support.
 
Joined
May 2, 2017
Messages
5
Yorick, my bad if I did something wrong. I had already opened the ticket before posting here. The ticket number and link are right at the beginning of my post.
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
So it is, sorry for missing that!

I am a user of this forum, just like you. I have no room to tell anyone that they “did something wrong”. I wanted to make sure you got some help, and, it turns out you are way ahead of me. Here’s to that ticket getting resolved!
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
That's a weird one. Are you running the latest (P20.00.07) firmware on the card? Are there, by any chance, firmware updates for the backplane?

As a side note, I removed your imgur link. For a number of reasons, we generally need users to upload images directly to the forum if they want to share them here. If there are any other images you'd like to show us besides the three already posted, please upload them directly (simple drag and drop should work).
 

tfran1990

Patron
Joined
Oct 18, 2017
Messages
294
That's a lot of errors, my friend!
 
Joined
May 2, 2017
Messages
5
Ericloewe, I have the latest firmware on the card. Unfortunately, I didn't check the backplane's firmware, and I don't have access to SC backplane firmware.
tfran1990, almost all of them are caused by weird smartctl requests.

Code:
smartctl -a -r ioctl /dev/da6


Code:
[log sense: 4d 00 40 ff 00 00 00 3e fc 00 ]
  CAM status=0x8c, SCSI status=0x2, resid=0x3efc
  sense_len=0x20, sense_resid=0x0
  status=0x2: [desc] sense_key=0x5 asc=0x24 ascq=0x0
Log Sense for supported pages and subpages failed [unsupported field in scsi command]
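
For anyone who wants to read the failing request by hand, the CDB can be decoded with a few lines of Python. This is a sketch based on the LOG SENSE layout in SPC-4 (page control and page code in byte 2, subpage code in byte 3, allocation length in bytes 7-8); the function name and the small lookup tables are hypothetical and cover only the values seen in this trace. Decoded this way, the rejected command asks for page 0x00 with subpage 0xff, which matches smartctl's own message about "supported pages and subpages", and the drive answers with sense key 0x5 (ILLEGAL REQUEST) and asc 0x24 (INVALID FIELD IN CDB). Whether that rejection is what the drive counts as a non-medium error is an assumption, not something confirmed in this thread.

```python
# Only the sense values that appear in the trace above; not a full table.
SENSE_KEYS = {0x5: "ILLEGAL REQUEST"}
ADDITIONAL_SENSE = {(0x24, 0x00): "INVALID FIELD IN CDB"}


def decode_log_sense_cdb(hex_bytes):
    """Decode a 10-byte LOG SENSE CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(hex_bytes)
    if len(cdb) != 10 or cdb[0] != 0x4D:
        raise ValueError("not a 10-byte LOG SENSE CDB")
    return {
        "page_control": cdb[2] >> 6,        # bits 7-6 of byte 2
        "page_code": cdb[2] & 0x3F,         # bits 5-0 of byte 2
        "subpage_code": cdb[3],
        "parameter_pointer": int.from_bytes(cdb[5:7], "big"),
        "allocation_length": int.from_bytes(cdb[7:9], "big"),
    }


if __name__ == "__main__":
    rejected = decode_log_sense_cdb("4d 00 40 ff 00 00 00 3e fc 00")
    print({name: hex(value) for name, value in rejected.items()})
    print(SENSE_KEYS[0x5], "/", ADDITIONAL_SENSE[(0x24, 0x00)])
```

For comparison, the request `4d 00 46 00 00 00 00 00 10 00` from the full output decodes to page 0x06 with subpage 0x00, i.e. the non-medium error page itself, and that one succeeds.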


Full output:

Code:
smartctl -a -r ioctl /dev/da6
smartctl 7.0 2018-12-30 r4883 [FreeBSD 11.3-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

 [inquiry: 12 00 00 00 24 00 ]
  CAM status=0x1, SCSI status=0x0, resid=0x0
  status=0x0
 [inquiry: 12 01 00 00 fc 00 ]
  CAM status=0x1, SCSI status=0x0, resid=0xe8
  status=0x0
 [inquiry: 12 00 00 00 24 00 ]
  CAM status=0x1, SCSI status=0x0, resid=0x0
  status=0x0
=== START OF INFORMATION SECTION ===
Vendor:               HP
Product:              MB6000FEDAU
Revision:             HPD7
Compliance:           SPC-4
 [read capacity(16): 9e 10 00 00 00 00 00 00 00 00 00 00 00 20 00 00 ]
  CAM status=0x1, SCSI status=0x0, resid=0x0
  status=0x0
User Capacity:        6,001,175,126,016 bytes [6.00 TB]
Logical block size:   512 bytes
 [inquiry: 12 01 b1 00 40 00 ]
  CAM status=0x1, SCSI status=0x0, resid=0x0
  status=0x0
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
 [mode sense(6): 1a 00 1c 00 40 00 ]
  CAM status=0x1, SCSI status=0x0, resid=0x28
  status=0x0
 [mode sense(6): 1a 00 5c 00 40 00 ]
  CAM status=0x1, SCSI status=0x0, resid=0x28
  status=0x0
 [inquiry: 12 01 83 00 fc 00 ]
  CAM status=0x1, SCSI status=0x0, resid=0xb0
  status=0x0
Logical Unit id:      0x5000cca2326c3dec
 [inquiry: 12 01 80 00 fc 00 ]
  CAM status=0x1, SCSI status=0x0, resid=0xe8
  status=0x0
Serial number:        1EHXJDLH
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Wed Apr 15 00:38:12 2020 PDT
 [test unit ready: 00 00 00 00 00 00 ]
  CAM status=0x1, SCSI status=0x0, resid=0x0
  status=0x0
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
 [log sense: 4d 00 40 00 00 00 00 00 04 00 ]
  CAM status=0x1, SCSI status=0x0, resid=0x0
  status=0x0
 [log sense: 4d 00 40 00 00 00 00 00 1a 00 ]
  CAM status=0x1, SCSI status=0x0, resid=0x1
  status=0x0
 [log sense: 4d 00 40 ff 00 00 00 3e fc 00 ]
  CAM status=0x8c, SCSI status=0x2, resid=0x3efc
  sense_len=0x20, sense_resid=0x0
  status=0x2: [desc] sense_key=0x5 asc=0x24 ascq=0x0
Log Sense for supported pages and subpages failed [unsupported field in scsi command]
 [log sense: 4d 00 6f 00 00 00 00 00 04 00 ]
  CAM status=0x1, SCSI status=0x0, resid=0x0
  status=0x0
 [log sense: 4d 00 6f 00 00 00 00 00 5c 00 ]
  CAM status=0x1, SCSI status=0x0, resid=0x0
  status=0x0
 [request sense: 03 00 00 00 12 00 ]
  CAM status=0x1, SCSI status=0x0, resid=0x0
  status=0x0
 [log sense: 4d 00 4d 00 00 00 00 00 04 00 ]
  CAM status=0x1, SCSI status=0x0, resid=0x0
  status=0x0
 [log sense: 4d 00 4d 00 00 00 00 00 10 00 ]
  CAM status=0x1, SCSI status=0x0, resid=0x0
  status=0x0
SMART Health Status: OK

 [log sense: 4d 00 4d 00 00 00 00 00 04 00 ]
  CAM status=0x1, SCSI status=0x0, resid=0x0
  status=0x0
 [log sense: 4d 00 4d 00 00 00 00 00 10 00 ]
  CAM status=0x1, SCSI status=0x0, resid=0x0
  status=0x0
Current Drive Temperature:     30 C
Drive Trip Temperature:        60 C

 [log sense: 4d 00 4e 00 00 00 00 00 04 00 ]
  CAM status=0x1, SCSI status=0x0, resid=0x0
  status=0x0
 [log sense: 4d 00 4e 00 00 00 00 00 38 00 ]
  CAM status=0x1, SCSI status=0x0, resid=0x0
  status=0x0
Manufactured in week 16 of year 2015
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  58
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  1315
 [read defect list(12): b7 0c 00 00 00 00 00 00 00 08 00 00 ]
  CAM status=0x1, SCSI status=0x0, resid=0x0
  status=0x0
Elements in grown defect list: 0

 [log sense: 4d 00 43 00 00 00 00 00 04 00 ]
  CAM status=0x1, SCSI status=0x0, resid=0x0
  status=0x0
 [log sense: 4d 00 43 00 00 00 00 00 58 00 ]
  CAM status=0x1, SCSI status=0x0, resid=0x0
  status=0x0
 [log sense: 4d 00 42 00 00 00 00 00 04 00 ]
  CAM status=0x1, SCSI status=0x0, resid=0x0
  status=0x0
 [log sense: 4d 00 42 00 00 00 00 00 58 00 ]
  CAM status=0x1, SCSI status=0x0, resid=0x0
  status=0x0
 [log sense: 4d 00 45 00 00 00 00 00 04 00 ]
  CAM status=0x1, SCSI status=0x0, resid=0x0
  status=0x0
 [log sense: 4d 00 45 00 00 00 00 00 58 00 ]
  CAM status=0x1, SCSI status=0x0, resid=0x0
  status=0x0
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0      97512.564           0
write:         0        0         0         0          0      12041.763           0
verify:        0        0         0         0          0          0.000           0
 [log sense: 4d 00 46 00 00 00 00 00 04 00 ]
  CAM status=0x1, SCSI status=0x0, resid=0x0
  status=0x0
 [log sense: 4d 00 46 00 00 00 00 00 10 00 ]
  CAM status=0x1, SCSI status=0x0, resid=0x0
  status=0x0

Non-medium error count:     2685

 [mode sense(6): 1a 00 0a 00 40 00 ]
  CAM status=0x1, SCSI status=0x0, resid=0x28
  status=0x0
 [request sense: 03 00 00 00 12 00 ]
  CAM status=0x1, SCSI status=0x0, resid=0x0
  status=0x0
 [log sense: 4d 00 50 00 00 00 00 00 04 00 ]
  CAM status=0x1, SCSI status=0x0, resid=0x0
  status=0x0
 [log sense: 4d 00 50 00 00 00 00 01 94 00 ]
  CAM status=0x1, SCSI status=0x0, resid=0x0
  status=0x0
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                   -   29858                 - [-   -    -]
# 2  Background long   Completed                   -   28564                 - [-   -    -]
 [mode sense(6): 1a 00 0a 00 40 00 ]
  CAM status=0x1, SCSI status=0x0, resid=0x28
  status=0x0

Long (extended) Self-test duration: 55200 seconds [920.0 minutes]
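
Since the full trace is long, a small filter helps to spot which commands the drive actually rejected. This is a Python sketch that assumes the `-r ioctl` trace format shown above (a bracketed command line followed by a `CAM status=..., SCSI status=...` line); `failed_commands` is a hypothetical helper name. Any SCSI status other than 0x0 is treated as a failure (0x2 is CHECK CONDITION).

```python
import re
import sys

# " [log sense: 4d 00 40 ff 00 00 00 3e fc 00 ]"
CMD_RE = re.compile(
    r"^\s*\[(?P<name>[^:\]]+):\s*(?P<cdb>[0-9a-f ]+?)\s*\]\s*$", re.IGNORECASE
)
# "  CAM status=0x8c, SCSI status=0x2, resid=0x3efc"
STATUS_RE = re.compile(
    r"CAM status=0x(?P<cam>[0-9a-f]+), SCSI status=0x(?P<scsi>[0-9a-f]+)",
    re.IGNORECASE,
)


def failed_commands(trace_text):
    """Return (name, cdb, scsi_status) for each command with non-zero SCSI status."""
    failures = []
    pending = None  # last command line seen, waiting for its status line
    for line in trace_text.splitlines():
        cmd = CMD_RE.match(line)
        if cmd:
            pending = (cmd.group("name"), cmd.group("cdb"))
            continue
        status = STATUS_RE.search(line)
        if status and pending:
            scsi_status = int(status.group("scsi"), 16)
            if scsi_status != 0:
                failures.append((pending[0], pending[1], scsi_status))
            pending = None
    return failures


if __name__ == "__main__" and not sys.stdin.isatty():
    # Hypothetical usage: smartctl -a -r ioctl /dev/da6 | python3 trace_filter.py
    for name, cdb, scsi_status in failed_commands(sys.stdin.read()):
        print(f"{name}: {cdb} -> SCSI status {scsi_status:#x}")
```

Run against the trace above, this should report only the LOG SENSE request for page 0x00 / subpage 0xff.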
 
Joined
May 2, 2017
Messages
5
So my Jira ticket NAS-105721 was closed as "Cannot Reproduce" :(

However, I just found that I'm not alone and some people are having the same problem with completely different hardware:
(PX04SVB320 - Every time SMART accessed, Non-medium error count increments)
I can say for sure that it affects all my HP SAS disks of different sizes (4TB-6TB), on different motherboards, cards/cables, and controllers. I have tested this on at least 4 storage servers already.
No issues with smartmontools release 6.6...

I haven't found a solution yet, so I can't upgrade from FreeNAS-11.2-U8.
 