S.M.A.R.T. tests fail in undefined segment

pinoli

Dabbler
Joined
Feb 20, 2021
Messages
34
Hello,
I have been receiving weird S.M.A.R.T. results on one of my disks.
I am running TrueNAS SCALE Bluefin 22.12.1 with a Broadcom 9405W-16i HBA (rest in signature) and never had issues with my other 9 drives.

The disk in question is a Seagate X18 SAS 18TB (ST18000NM004J), part of a mirrored vdev.
TrueNAS SCALE has been randomly showing these errors in the UI.
[Image: esc41HF.png]


When I run sudo smartctl -x /dev/sde I get this:
Code:
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.79+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org


=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST18000NM004J
Revision:             E001
Compliance:           SPC-5
User Capacity:        18,000,207,937,536 bytes [18.0 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c500d85c582f
Serial number:        ZR56JHQZ0000C216GKBL
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Thu Mar 16 04:05:39 2023 CET
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled
Read Cache is:        Enabled
Writeback Cache is:   Enabled


=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK


Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned <not available>
Power on minutes since format <not available>
Current Drive Temperature:     33 C
Drive Trip Temperature:        60 C


Manufactured in week 51 of year 2021
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  60
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  12008
Elements in grown defect list: 0


Vendor (Seagate Cache) information
  Blocks sent to initiator = 3677215160
  Blocks received from initiator = 3828203624
  Blocks read from cache and sent to initiator = 469201710
  Number of read and write commands whose size <= segment size = 133715517
  Number of read and write commands whose size > segment size = 3781631


Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 6534.43
  number of minutes until next internal SMART test = 49


Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0     272153.845           0
write:         0        0         0         0          0      67948.050           0
verify:        0        0         0         0          0        208.749           0


Non-medium error count:        0




[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Failed in segment -->       -    6532       35143810608 [0x3 0x11 0x0]
# 2  Background short  Completed                   -    6508                 - [-   -    -]
# 3  Background short  Completed                   -    6484                 - [-   -    -]
# 4  Background short  Completed                   -    6460                 - [-   -    -]
# 5  Background short  Failed in segment -->       -    6436       35145130168 [0x3 0x11 0x0]
# 6  Background short  Completed                   -    6412                 - [-   -    -]
# 7  Background short  Failed in segment -->       -    6388       35145130168 [0x3 0x11 0x0]
# 8  Background short  Completed                   -    6364                 - [-   -    -]
# 9  Background short  Completed                   -    6340                 - [-   -    -]
#10  Background short  Failed in segment -->       -    6316       35143810608 [0x3 0x11 0x0]
#11  Background short  Completed                   -    6292                 - [-   -    -]
#12  Background short  Failed in segment -->       -    6268       35145130168 [0x3 0x11 0x0]
#13  Background short  Completed                   -    6244                 - [-   -    -]
#14  Background short  Completed                   -    6220                 - [-   -    -]
#15  Background short  Completed                   -    6196                 - [-   -    -]
#16  Background short  Failed in segment -->       -    6172       35143810608 [0x3 0x11 0x0]
#17  Background short  Failed in segment -->       -    6148       35143810608 [0x3 0x11 0x0]
#18  Background short  Failed in segment -->       -    6124       35145130168 [0x3 0x11 0x0]
#19  Background short  Completed                   -    6100                 - [-   -    -]
#20  Background short  Failed in segment -->       -    6076       35143810608 [0x3 0x11 0x0]


Long (extended) Self-test duration: 65535 seconds [1092.2 minutes]


Background scan results log
  Status: no scans active
    Accumulated power on time, hours:minutes 6534:26 [392066 minutes]
    Number of background scans performed: 0,  scan progress: 0.00%
    Number of background medium scans performed: 0


   #  when        lba(hex)    [sk,asc,ascq]    reassign_status
   1  843:45  0000000820a74870  [1,18,4]   Recovered via rewrite in-place
   2  844:01  000000082a91c0c0  [1,18,8]   Recovered via rewrite in-place
   3  844:01  000000082a973580  [3,11,0]   Recovered via rewrite in-place
   4  844:02  000000082ac4d178  [1,18,8]   Recovered via rewrite in-place
   5  844:08  000000082ec152d8  [3,11,0]   Require Write or Reassign Blocks command
   6  844:09  000000082ec15b78  [3,11,0]   Require Write or Reassign Blocks command
   7  844:09  000000082ed020b8  [3,11,0]   Require Write or Reassign Blocks command
   8  844:09  000000082ed03ab8  [3,16,0]   Require Write or Reassign Blocks command
   9  844:09  000000082ed03aa0  [3,11,0]   Require Write or Reassign Blocks command
  10  844:10  000000082ebbfe30  [3,11,0]   Require Write or Reassign Blocks command
 49152  844:01  0001000105523818  [1,18,8]   Recovered via rewrite in-place
 49153  844:02  0001000105589a2f  [1,18,8]   Recovered via rewrite in-place


Protocol Specific port log page for SAS SSP
relative target port id = 1
  generation code = 2
  number of phys = 1
  phy identifier = 0
    attached device type: SAS or SATA device
    attached reason: unknown
    reason: unknown
    negotiated logical link rate: phy enabled; 12 Gbps
    attached initiator port: ssp=1 stp=1 smp=1
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x5000c500d85c582d
    attached SAS address = 0x500605b00fe37e22
    attached phy identifier = 2
    Invalid DWORD count = 1
    Running disparity error count = 1
    Loss of DWORD synchronization = 126
    Phy reset problem = 26
    Phy event descriptors:
     Invalid word count: 1
     Running disparity error count: 1
     Loss of dword synchronization count: 126
     Phy reset problem count: 26
relative target port id = 2
  generation code = 2
  number of phys = 1
  phy identifier = 1
    attached device type: no device attached
    attached reason: unknown
    reason: unknown
    negotiated logical link rate: phy enabled; unknown
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x5000c500d85c582e
    attached SAS address = 0x0
    attached phy identifier = 0
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 0
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 0
     Running disparity error count: 0
     Loss of dword synchronization count: 0
     Phy reset problem count: 0

What puzzles me is that the drive looks perfect aside from this section:
Code:
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Failed in segment -->       -    6532       35143810608 [0x3 0x11 0x0]
# 2  Background short  Completed                   -    6508                 - [-   -    -]
# 3  Background short  Completed                   -    6484                 - [-   -    -]
# 4  Background short  Completed                   -    6460                 - [-   -    -]
# 5  Background short  Failed in segment -->       -    6436       35145130168 [0x3 0x11 0x0]
# 6  Background short  Completed                   -    6412                 - [-   -    -]
# 7  Background short  Failed in segment -->       -    6388       35145130168 [0x3 0x11 0x0]
# 8  Background short  Completed                   -    6364                 - [-   -    -]
# 9  Background short  Completed                   -    6340                 - [-   -    -]
#10  Background short  Failed in segment -->       -    6316       35143810608 [0x3 0x11 0x0]
#11  Background short  Completed                   -    6292                 - [-   -    -]
#12  Background short  Failed in segment -->       -    6268       35145130168 [0x3 0x11 0x0]
#13  Background short  Completed                   -    6244                 - [-   -    -]
#14  Background short  Completed                   -    6220                 - [-   -    -]
#15  Background short  Completed                   -    6196                 - [-   -    -]
#16  Background short  Failed in segment -->       -    6172       35143810608 [0x3 0x11 0x0]
#17  Background short  Failed in segment -->       -    6148       35143810608 [0x3 0x11 0x0]
#18  Background short  Failed in segment -->       -    6124       35145130168 [0x3 0x11 0x0]
#19  Background short  Completed                   -    6100                 - [-   -    -]
#20  Background short  Failed in segment -->       -    6076       35143810608 [0x3 0x11 0x0]

When the S.M.A.R.T. test fails it says "Failed in segment -->" but doesn't show the segment number.
Also, only two LBA_first_err values ever appear: every failed test stops at one of the same two LBAs (35143810608 or 35145130168).
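
The pattern described above can be checked mechanically. This is a minimal Python sketch, assuming a few rows of the self-test log have been pasted into a string (`SELFTEST_LOG` is a hypothetical name, and the rows are an illustrative subset). The regular expression is written against the smartctl 7.2 layout shown in this post and may need adjusting for other versions.

```python
import re
from collections import Counter

# A few rows copied from the self-test log above (illustrative subset).
SELFTEST_LOG = """\
# 1  Background short  Failed in segment -->       -    6532       35143810608 [0x3 0x11 0x0]
# 2  Background short  Completed                   -    6508                 - [-   -    -]
# 5  Background short  Failed in segment -->       -    6436       35145130168 [0x3 0x11 0x0]
#10  Background short  Failed in segment -->       -    6316       35143810608 [0x3 0x11 0x0]
#12  Background short  Failed in segment -->       -    6268       35145130168 [0x3 0x11 0x0]
#16  Background short  Failed in segment -->       -    6172       35143810608 [0x3 0x11 0x0]
"""

# One row: number, test kind, status, segment, lifetime hours, first failing LBA.
ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+Background\s+(?P<kind>short|long)\s+"
    r"(?P<status>Completed|Failed in segment -->)\s+"
    r"(?P<segment>\S+)\s+(?P<hours>\d+)\s+(?P<lba>\d+|-)\s+\["
)


def failed_lbas(log_text):
    """Count failed self-tests per LBA_first_err value."""
    counts = Counter()
    for line in log_text.splitlines():
        m = ROW.match(line)
        if m and m.group("status").startswith("Failed") and m.group("lba") != "-":
            counts[int(m.group("lba"))] += 1
    return counts


counts = failed_lbas(SELFTEST_LOG)
for lba, n in sorted(counts.items()):
    print(f"LBA {lba}: {n} failed test(s)")
```

Counting all 20 rows of the log above by hand gives five failures at 35143810608 and four at 35145130168, with no other LBA reported.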

This drive has been in use for almost a year and has never had any issue, weird S.M.A.R.T. alerts aside.
[Image: cO0E9qT.png]

Code:
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0     272153.845           0
write:         0        0         0         0          0      67948.050           0
verify:        0        0         0         0          0        208.749           0

Non-medium error count:        0


As you can see, I waited a while before opening a thread about this; the behavior has stayed the same for almost 9 months.
I am not sure what to do. I can still RMA the disk since it's within the warranty period, but with no errors reported on the drive and zero issues with my data so far, I suspect this might be something different. Maybe those two blocks are unreadable? But then the read error count should go up, and yet it is zero.
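
One way to test the "unreadable blocks" idea against the output already posted: the self-test log prints LBAs in decimal, while the background scan results log prints them in hex. The Python sketch below converts one to the other and looks for matches. The dictionary is hand-copied from entries 1 to 10 of the background scan table above, and the variable and function names are illustrative only.

```python
# Failing LBAs from the self-test log (decimal).
SELFTEST_LBAS = [35143810608, 35145130168]

# Background scan results log, entries 1-10:
# hex LBA -> (entry number, "when" in power-on hours:minutes, reassign status)
BACKGROUND_SCAN = {
    0x0000000820A74870: (1, "843:45", "Recovered via rewrite in-place"),
    0x000000082A91C0C0: (2, "844:01", "Recovered via rewrite in-place"),
    0x000000082A973580: (3, "844:01", "Recovered via rewrite in-place"),
    0x000000082AC4D178: (4, "844:02", "Recovered via rewrite in-place"),
    0x000000082EC152D8: (5, "844:08", "Require Write or Reassign Blocks command"),
    0x000000082EC15B78: (6, "844:09", "Require Write or Reassign Blocks command"),
    0x000000082ED020B8: (7, "844:09", "Require Write or Reassign Blocks command"),
    0x000000082ED03AB8: (8, "844:09", "Require Write or Reassign Blocks command"),
    0x000000082ED03AA0: (9, "844:09", "Require Write or Reassign Blocks command"),
    0x000000082EBBFE30: (10, "844:10", "Require Write or Reassign Blocks command"),
}


def match_scan_entries(lbas, scan_log):
    """Map each decimal LBA to its background scan entry, or None if absent."""
    return {lba: scan_log.get(lba) for lba in lbas}


matches = match_scan_entries(SELFTEST_LBAS, BACKGROUND_SCAN)
for lba, entry in matches.items():
    print(f"{lba} = {lba:#x} -> {entry}")
```

Both self-test LBAs appear in that table (entries 10 and 7), logged at about 844 power-on hours with status "Require Write or Reassign Blocks command". That would be consistent with two sectors the drive found unreadable during a background scan and that have not been rewritten or reassigned since, although this is an inference from the log and not something the drive states outright.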

Should I run some specific test? Any help would be greatly appreciated, thanks.
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
Run a long smart test and post the results.
 

Ramboxman

Explorer
Joined
Jun 20, 2013
Messages
63
I used to use a cron task to run the script below. Does anyone know what the -d option is? Thanks.

sh /mnt/BackupDrive/Scripts/drive.sh /dev/ada0

/dev/ada0: Unable to detect device type
Please specify device type with the -d option
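
For context, `-d TYPE` is smartctl's device-type option: it tells smartctl which protocol to use instead of auto-detecting it. Below is a sketch of how a wrapper might build the command line, written in Python so the logic is easy to test. The helper name and its defaults are hypothetical, and since the contents of `drive.sh` are not shown in this thread, this is not a description of that script.

```python
import shlex

# A subset of the device types smartctl accepts for -d; see smartctl(8) for the full list.
KNOWN_TYPES = {"auto", "ata", "scsi", "sat", "nvme"}


def smartctl_command(device, device_type=None, extra=("-x",)):
    """Build a smartctl argument list, adding -d only when a type is given."""
    if device_type is not None and device_type not in KNOWN_TYPES:
        raise ValueError(f"device type not in this sketch's list: {device_type}")
    argv = ["smartctl", *extra]
    if device_type is not None:
        argv += ["-d", device_type]
    argv.append(device)
    return argv


# Example: a SAS disk addressed explicitly as a SCSI device.
print(shlex.join(smartctl_command("/dev/sde", "scsi")))
```

Note that /dev/ada0 is a FreeBSD-style device name. On a Linux-based system such as TrueNAS SCALE, disks normally appear as /dev/sdX, which may be why detection fails here, depending on which system the script is being run on.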
 

pinoli

Jailer said:
Run a long smart test and post the results.

As soon as I posted this I started a long S.M.A.R.T. test on the disk, knowing it would take forever (more than 22 hours).
As of my last edit it should be done in about 1 hour; I'll update this post as soon as it's over.
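
A side note on the expected run time: the "Long (extended) Self-test duration" line in the smartctl output reads 65535 seconds, which is the largest value a 16-bit field can hold. It is therefore probably a capped "at least this long" figure and not a real estimate. A quick Python check of the arithmetic follows; the 16-bit interpretation is an assumption on my part, since the output itself does not say so.

```python
REPORTED_SECONDS = 65535  # from "Long (extended) Self-test duration"


def describe_duration(seconds, field_bits=16):
    """Convert the reported duration and flag it if it equals the field maximum."""
    return {
        "minutes": seconds / 60,
        "hours": seconds / 3600,
        "saturated": seconds == (1 << field_bits) - 1,
    }


info = describe_duration(REPORTED_SECONDS)
print(f"{info['minutes']:.2f} min, {info['hours']:.1f} h, saturated={info['saturated']}")
```

The reported value works out to about 18.2 hours, which is less than the 22+ hours the test was expected to need, so a capped field would fit.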
 

pinoli

Here we go, it finished after 25 hours:

Code:
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.79+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST18000NM004J
Revision:             E001
Compliance:           SPC-5
User Capacity:        18,000,207,937,536 bytes [18.0 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c500d85c582f
Serial number:        ZR56JHQZ0000C216GKBL
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Fri Mar 17 06:52:16 2023 CET
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled
Read Cache is:        Enabled
Writeback Cache is:   Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned <not available>
Power on minutes since format <not available>
Current Drive Temperature:     34 C
Drive Trip Temperature:        60 C

Manufactured in week 51 of year 2021
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  60
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  12021
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 4004829072
  Blocks received from initiator = 3831201352
  Blocks read from cache and sent to initiator = 531163574
  Number of read and write commands whose size <= segment size = 133795690
  Number of read and write commands whose size > segment size = 3781711

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 6561.20
  number of minutes until next internal SMART test = 54

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0     272321.581           0
write:         0        0         0         0          0      67949.609           0
verify:        0        0         0         0          0        208.752           0

Non-medium error count:        0


[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Failed in segment -->       -    6560       35143810608 [0x3 0x11 0x0]
# 2  Background short  Failed in segment -->       -    6532       35143810608 [0x3 0x11 0x0]
# 3  Background short  Completed                   -    6508                 - [-   -    -]
# 4  Background short  Completed                   -    6484                 - [-   -    -]
# 5  Background short  Completed                   -    6460                 - [-   -    -]
# 6  Background short  Failed in segment -->       -    6436       35145130168 [0x3 0x11 0x0]
# 7  Background short  Completed                   -    6412                 - [-   -    -]
# 8  Background short  Failed in segment -->       -    6388       35145130168 [0x3 0x11 0x0]
# 9  Background short  Completed                   -    6364                 - [-   -    -]
#10  Background short  Completed                   -    6340                 - [-   -    -]
#11  Background short  Failed in segment -->       -    6316       35143810608 [0x3 0x11 0x0]
#12  Background short  Completed                   -    6292                 - [-   -    -]
#13  Background short  Failed in segment -->       -    6268       35145130168 [0x3 0x11 0x0]
#14  Background short  Completed                   -    6244                 - [-   -    -]
#15  Background short  Completed                   -    6220                 - [-   -    -]
#16  Background short  Completed                   -    6196                 - [-   -    -]
#17  Background short  Failed in segment -->       -    6172       35143810608 [0x3 0x11 0x0]
#18  Background short  Failed in segment -->       -    6148       35143810608 [0x3 0x11 0x0]
#19  Background short  Failed in segment -->       -    6124       35145130168 [0x3 0x11 0x0]
#20  Background short  Completed                   -    6100                 - [-   -    -]

Long (extended) Self-test duration: 65535 seconds [1092.2 minutes]

Background scan results log
  Status: no scans active
    Accumulated power on time, hours:minutes 6561:12 [393672 minutes]
    Number of background scans performed: 0,  scan progress: 0.00%
    Number of background medium scans performed: 0

   #  when        lba(hex)    [sk,asc,ascq]    reassign_status
   1  843:45  0000000820a74870  [1,18,4]   Recovered via rewrite in-place
   2  844:01  000000082a91c0c0  [1,18,8]   Recovered via rewrite in-place
   3  844:01  000000082a973580  [3,11,0]   Recovered via rewrite in-place
   4  844:02  000000082ac4d178  [1,18,8]   Recovered via rewrite in-place
   5  844:08  000000082ec152d8  [3,11,0]   Require Write or Reassign Blocks command
   6  844:09  000000082ec15b78  [3,11,0]   Require Write or Reassign Blocks command
   7  844:09  000000082ed020b8  [3,11,0]   Require Write or Reassign Blocks command
   8  844:09  000000082ed03ab8  [3,16,0]   Require Write or Reassign Blocks command
   9  844:09  000000082ed03aa0  [3,11,0]   Require Write or Reassign Blocks command
  10  844:10  000000082ebbfe30  [3,11,0]   Require Write or Reassign Blocks command
 49152  844:01  0001000105523818  [1,18,8]   Recovered via rewrite in-place
 49153  844:02  0001000105589a2f  [1,18,8]   Recovered via rewrite in-place

Protocol Specific port log page for SAS SSP
relative target port id = 1
  generation code = 2
  number of phys = 1
  phy identifier = 0
    attached device type: SAS or SATA device
    attached reason: unknown
    reason: unknown
    negotiated logical link rate: phy enabled; 12 Gbps
    attached initiator port: ssp=1 stp=1 smp=1
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x5000c500d85c582d
    attached SAS address = 0x500605b00fe37e22
    attached phy identifier = 2
    Invalid DWORD count = 1
    Running disparity error count = 1
    Loss of DWORD synchronization = 126
    Phy reset problem = 26
    Phy event descriptors:
     Invalid word count: 1
     Running disparity error count: 1
     Loss of dword synchronization count: 126
     Phy reset problem count: 26
relative target port id = 2
  generation code = 2
  number of phys = 1
  phy identifier = 1
    attached device type: no device attached
    attached reason: unknown
    reason: unknown
    negotiated logical link rate: phy enabled; unknown
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x5000c500d85c582e
    attached SAS address = 0x0
    attached phy identifier = 0
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 0
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 0
     Running disparity error count: 0
     Loss of dword synchronization count: 0
     Phy reset problem count: 0

Same error, still with no segment given, at one of those two infamous blocks.
It's as if the drive is failing the test but is otherwise fine.
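
For reference, the bracketed triple in the self-test log is SCSI sense data: sense key (SK), additional sense code (ASC), and additional sense code qualifier (ASCQ). Below is a minimal Python lookup covering only the codes that appear in the failed rows plus one other sense key. The tables are deliberately incomplete and the function name is illustrative.

```python
# Partial tables, taken from the standard SCSI sense key and ASC/ASCQ assignments.
SENSE_KEYS = {
    0x1: "RECOVERED ERROR",
    0x3: "MEDIUM ERROR",
}
ASC_ASCQ = {
    (0x11, 0x00): "UNRECOVERED READ ERROR",
}


def decode_sense(text):
    """Decode a string such as '[0x3 0x11 0x0]' as printed in the self-test log."""
    sk, asc, ascq = (int(tok, 16) for tok in text.strip("[]").split())
    key = SENSE_KEYS.get(sk, f"sense key {sk:#x} (not in this table)")
    detail = ASC_ASCQ.get((asc, ascq), f"ASC/ASCQ {asc:#x}/{ascq:#x} (not in this table)")
    return key, detail


print(decode_sense("[0x3 0x11 0x0]"))
```

Read this way, every failed test is reporting an unrecovered read error on the medium at the listed LBA, which is more specific than the blank segment column suggests.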
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
That looks like an RMA to me. However, Seagate will want you to take the drive out, put it in a Windows box, and run SeaTools (or whatever it is called) on it as a pre-RMA test.
 

pinoli

NugentS said:
That looks like an RMA to me.

Genuine question: what makes you think that? The disk has reported zero errors in over 270 days.
By the way, I might RMA it, but this is still pretty weird.
 

NugentS

It's failing internal SMART tests.
 
Joined
Jun 15, 2022
Messages
674
Am I understanding correctly:
  1. the operating system hasn't used that area of the disk to store data, so it's not generating Read/Write/Verify errors
  2. errors are generated during HDD self-scans that sometimes happen to read that part of the disk (and during the user-invoked long scan, which reads the whole disk), because the empty blocks and accompanying checksums created during drive formatting are destabilizing, so the data portions no longer agree with the ECC portions of those blocks
  3. errors would be reported if data were written to those blocks, such as via badblocks -w during a burn-in
  4. @pinoli "got lucky" that the short scans happened to cover those parts of the disk
 

Jailer

SAS drives don't always report SMART data in the same manner as SATA drives do. What you do know is that the drive has failed both short tests and now a long test. RMA the drive.
 

pinoli

Quote:
Am I understanding correctly:
  1. the operating system hasn't used that area of the disk to store data, so it's not generating Read/Write/Verify errors
  2. errors are generated during HDD self-scans that sometimes happen to read that part of the disk (and during the user-invoked long scan, which reads the whole disk), because the empty blocks and accompanying checksums created during drive formatting are destabilizing, so the data portions no longer agree with the ECC portions of those blocks
  3. errors would be reported if data were written to those blocks, such as via badblocks -w during a burn-in
  4. @pinoli "got lucky" that the short scans happened to cover those parts of the disk

Very nice explanation, it kind of makes sense. However, that pool has been filled up to 80% and has been in use for more than 270 days: how likely is it that those blocks have never been read or written? To me it looks like the drive is completely avoiding those blocks (hence no read or write errors), and only the S.M.A.R.T. test, when it reads blocks that should be left alone, says "something is wrong here, but I don't know what" (hence no segment when the test fails).
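
To put a number on where those blocks sit, here is a small Python sketch using only figures from the smartctl output: the user capacity, the 512-byte logical and 4096-byte physical block sizes, and the two failing LBAs. The function name is illustrative.

```python
CAPACITY_BYTES = 18_000_207_937_536  # "User Capacity"
LOGICAL_BLOCK = 512                  # "Logical block size"
PHYSICAL_BLOCK = 4096                # "Physical block size"
FAILING_LBAS = [35143810608, 35145130168]


def locate(lba):
    """Report where an LBA falls on the disk and whether it starts a 4K sector."""
    total_lbas = CAPACITY_BYTES // LOGICAL_BLOCK
    return {
        "fraction_of_disk": lba / total_lbas,
        "gb_from_end": (total_lbas - lba) * LOGICAL_BLOCK / 1e9,
        "starts_physical_sector": lba % (PHYSICAL_BLOCK // LOGICAL_BLOCK) == 0,
    }


for lba in FAILING_LBAS:
    info = locate(lba)
    print(
        f"LBA {lba}: {info['fraction_of_disk']:.4%} into the disk, "
        f"{info['gb_from_end']:.2f} GB from the end, "
        f"starts a 4K sector: {info['starts_physical_sector']}"
    )
```

Both LBAs fall in the last 0.04% of the drive, roughly 6 GB from the end, and each is the first 512-byte block of a 4096-byte physical sector. Whether ZFS has ever touched that region of an 80% full pool is not something these numbers can settle.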

Jailer said:
SAS drives don't always report SMART data in the same manner as SATA drives do. What you do know is that the drive has failed both short tests and now a long test. RMA the drive.

Agreed. I have 3 other drives like this one and they have never failed a single S.M.A.R.T. test, so something is wrong with this drive. I am going to run some more tests for the sake of it and then RMA it.

thanks again for all the help.
 

NugentS

If you have time, run a destructive badblocks across the entire surface. Warning: this will take a while, probably more than a week, I suspect.
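
A back-of-the-envelope check on that estimate, sketched in Python. It assumes that `badblocks -w` writes four test patterns and reads each one back, so eight full passes over the disk. It also assumes an average sustained throughput of 200 MB/s, which is a guess and not a measured figure for this drive.

```python
CAPACITY_BYTES = 18_000_207_937_536  # "User Capacity" from smartctl
PATTERNS = 4                         # test patterns written by badblocks -w
ASSUMED_MB_PER_S = 200               # hypothetical average throughput


def badblocks_days(capacity_bytes, throughput_mb_s, patterns=PATTERNS):
    """Estimate run time in days: one write pass and one read pass per pattern."""
    full_passes = patterns * 2
    seconds = capacity_bytes * full_passes / (throughput_mb_s * 1e6)
    return seconds / 86400


days = badblocks_days(CAPACITY_BYTES, ASSUMED_MB_PER_S)
print(f"about {days:.1f} days at {ASSUMED_MB_PER_S} MB/s")
```

At that assumed rate the run takes a little over eight days, in line with the "more than a week" guess; a faster or slower average changes the figure proportionally.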
 
Joined
Jun 15, 2022
Messages
674
There's also a non-destructive flag that does writes:
 