Another bad drive, or something else?

allanonmage

Dabbler
Joined
Aug 20, 2023
Messages
31
I flipped the switch and loaded up a bunch of data onto my NAS (specs in sig), and then we had a power outage yesterday. No big deal: I have UPSes and told FreeNAS to shut itself down gracefully, which it did. However, when I powered things back up later, I had 2 interesting messages:


Critical
Device: /dev/sdj [SAT], 1 Currently unreadable (pending) sectors.
2024-02-28 14:35:25 (America/Los_Angeles)

Critical
Device: /dev/sdj [SAT], ATA error count increased from 0 to 7.
2024-02-28 14:35:25 (America/Los_Angeles)


I've had the first error on 3 drives already, and replaced the drives under warranty. The sellers didn't argue about it either, which I thought was very respectful of them. However, given the circumstances of detecting the bad sectors last time (a cold power on), and this thread, I'm wondering if this is a brownout rather than an actual bad sector. I have a 750W power supply, which isn't large by today's standards, but isn't small either. However, I'm up to 14 drives too. I thought I had more than enough room in my power budget (yes, it's a single-rail design), but the specs on this page made me second-guess myself.

I have a spare power supply of larger capacity, but currently no backups of the data. My plan is to swap out the drive with one of the cold spares and let it resilver. Is the power supply relevant here? Or did I just get bad luck while buying refurbished drives?
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Drive sdj is reporting a minor failure, but you don't yet know whether this drive is about to fail and should be replaced.

Do this: at the command prompt, run smartctl -a /dev/sdj and post the output here in code brackets to retain the formatting. Next, run smartctl -t long /dev/sdj and let the test finish. Depending on the capacity of the drive, that could take anywhere from 1 hour (500GB) to several days (18TB). The first piece of data will state, in minutes, the duration of an "Extended" test.

Is the power supply relevant here?
Not at all.

The ATA Error Count "could" be a loose SATA connection. If the value is not increasing, the problem is not currently occurring. We will see more when you post the first piece of data.
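If you'd rather not hunt for the test duration by eye, here is a minimal Python sketch that pulls the "Extended self-test routine recommended polling time" out of saved smartctl -a text. The function name extended_test_minutes and the sample string are my own for illustration; the sample just mimics the layout smartctl prints, where the label and the value sit on two separate lines.

```python
import re
from typing import Optional


def extended_test_minutes(smart_output: str) -> Optional[int]:
    """Return the extended self-test polling time in minutes, or None if absent."""
    # smartctl splits the label and the value across two lines, so \s+ is
    # used to match the newline between them.
    match = re.search(
        r"Extended self-test routine\s+recommended polling time:\s*\(\s*(\d+)\)\s*minutes",
        smart_output,
    )
    return int(match.group(1)) if match else None


# Sample text in the same layout as the "General SMART Values" section.
sample = """\
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (1571) minutes.
"""

minutes = extended_test_minutes(sample)
hours, rem = divmod(minutes, 60)
print(f"Extended test: about {minutes} min ({hours} h {rem} min)")
```

Treat the figure as the drive's own estimate; a drive that is busy serving pool I/O may well take longer.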
 

allanonmage

Do I really want to force it into a several day long test if I'm worried that it's bad? That seems like it would kill it if it's on the fence.

I checked the web UI again, and now I have 3 critical alerts. The new one is:

Critical
Device: /dev/sdj [SAT], Self-Test Log error count increased from 0 to 1.
2024-02-29 14:05:25 (America/Los_Angeles)

Earlier today I did start a short self test from the UI, but I cancelled out of it. Accounting for the timezone that I didn't change, this alert is about the time I was kicking off the self test.

I ran the shell from the web UI, ran the first command, and got:

********@truenas[~]$ smartctl -a /dev/sdj
zsh: command not found: smartctl
 

joeschmuck

Are you running as an unprivileged user? I suspect so. If so, try sudo smartctl -a /dev/sdj and see what happens. You should be asked to enter your password.
 

joeschmuck

Do I really want to force it into a several day long test if I'm worried that it's bad? That seems like it would kill it if it's on the fence.
Yes, you do. But it can wait until you post the SMART data. If it's obviously bad I will tell you, and there's no need to test. If it is questionable, well, you should be running SMART Long tests often and a daily Short test anyway, just to make sure your drives are healthy. I'm also assuming that your system drives are spinning all the time.

The new error is not good. It's actually a bad sign, but let's see the data before jumping the gun.
 

allanonmage

Code:
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.63-production+truenas] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     WDC  WUH721414ALN6L4
Serial Number:    9JGYRH9T
LU WWN Device Id: 5 000cca 258cd8333
Firmware Version: LDGNW2L0
User Capacity:    14,000,519,643,136 bytes [14.0 TB]
Sector Size:      4096 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database 7.3/5319
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Mar  1 17:38:48 2024 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 121) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                (  101) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (1571) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   001    Pre-fail  Always       -       458752
  2 Throughput_Performance  0x0005   136   136   054    Pre-fail  Offline      -       96
  3 Spin_Up_Time            0x0007   082   082   001    Pre-fail  Always       -       375 (Average 374)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       10
  5 Reallocated_Sector_Ct   0x0033   100   100   001    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   001    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   128   128   020    Pre-fail  Offline      -       18
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       4006
 10 Spin_Retry_Count        0x0013   100   100   001    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       10
 22 Unknown_Attribute       0x0023   100   100   025    Pre-fail  Always       -       100
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       175
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       175
194 Temperature_Celsius     0x0002   059   059   000    Old_age   Always       -       35 (Min/Max 21/42)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 7 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 7 occurred at disk power-on lifetime: 3955 hours (164 days + 19 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 41 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 01 00 bb ff bb 40 00      00:01:15.995  READ FPDMA QUEUED
  60 01 00 ba ff bb 40 00      00:01:13.992  READ FPDMA QUEUED
  60 01 00 b9 ff bb 40 00      00:01:13.992  READ FPDMA QUEUED
  60 01 00 b8 ff bb 40 00      00:01:13.992  READ FPDMA QUEUED
  60 01 00 b7 ff bb 40 00      00:01:13.991  READ FPDMA QUEUED

Error 6 occurred at disk power-on lifetime: 3955 hours (164 days + 19 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 41 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 09 38 b7 ff bb 40 00      00:01:13.579  READ FPDMA QUEUED
  60 0a 30 ac ff bb 40 00      00:01:11.585  READ FPDMA QUEUED
  60 06 28 a5 ff bb 40 00      00:01:11.584  READ FPDMA QUEUED
  60 15 20 8f ff bb 40 00      00:01:11.584  READ FPDMA QUEUED
  60 04 18 8a ff bb 40 00      00:01:11.584  READ FPDMA QUEUED

Error 5 occurred at disk power-on lifetime: 3955 hours (164 days + 19 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 41 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 01 00 bb ff bb 40 00      00:01:09.045  READ FPDMA QUEUED
  60 01 00 ba ff bb 40 00      00:01:07.042  READ FPDMA QUEUED
  60 01 00 b9 ff bb 40 00      00:01:07.042  READ FPDMA QUEUED
  60 01 00 b8 ff bb 40 00      00:01:07.042  READ FPDMA QUEUED
  60 01 00 b7 ff bb 40 00      00:01:07.042  READ FPDMA QUEUED

Error 4 occurred at disk power-on lifetime: 3955 hours (164 days + 19 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 41 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 09 38 b7 ff bb 40 00      00:01:07.037  READ FPDMA QUEUED
  60 0a 30 ac ff bb 40 00      00:01:05.048  READ FPDMA QUEUED
  60 06 28 a5 ff bb 40 00      00:01:05.048  READ FPDMA QUEUED
  60 15 20 8f ff bb 40 00      00:01:05.048  READ FPDMA QUEUED
  60 04 18 8a ff bb 40 00      00:01:05.048  READ FPDMA QUEUED

Error 3 occurred at disk power-on lifetime: 3955 hours (164 days + 19 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 41 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 01 00 bb ff bb 40 00      00:01:04.987  READ FPDMA QUEUED
  60 01 00 ba ff bb 40 00      00:01:02.985  READ FPDMA QUEUED
  60 01 00 b9 ff bb 40 00      00:01:02.984  READ FPDMA QUEUED
  60 01 00 b8 ff bb 40 00      00:01:02.983  READ FPDMA QUEUED
  60 01 00 b7 ff bb 40 00      00:01:02.983  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%      3978         3418095547
# 2  Short offline       Completed without error       00%         8         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 

allanonmage

Yes, the drives are spinning all the time. I looked into power saving, but haven't implemented that yet.

4,000 hours..... whew that's a lot!

I have no new errors today. Just the 3 that I've pasted above.
 

joeschmuck

# 1  Short offline       Completed: read failure       90%      3978         3418095547
You have an actual failure. At this point it's time to run a SMART long test using sudo smartctl -t long /dev/sdj and then wait 1571 minutes (about 26 hours). That is the problem with high-capacity drives: they take forever to test.

If the Long test completes without any error, then you might be good. However, I'm fairly certain it will fail. I've never seen a drive pass the long test once it has failed the short test. The silver lining: with only 4000 hours on it, the drive should be covered by the warranty. This is an RMA-covered error. Depending on where you live, WD can do an Advanced RMA, where you give them a credit card number, they send you a replacement drive pretty much overnight, and you ship the failed drive back in the same shipping container; they pay for the return shipping as well. Your credit card is only charged if you do not return the failed drive. It's a great service and I have used it twice over the decades.
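For anyone who wants to check the same things across several drives, here is a rough Python sketch of what I'm looking at in that output: the raw values of the sector-health attributes and any failed entry in the self-test log. The helper names (parse_attributes, failed_self_tests) and the WATCH set are my own choices for illustration, not part of smartmontools, and the sample rows are copied from the output posted above.

```python
from typing import Dict, List, Tuple

# Attribute IDs worth watching for surface problems (an illustrative choice):
# 5 = Reallocated_Sector_Ct, 197 = Current_Pending_Sector, 198 = Offline_Uncorrectable
WATCH = {5, 197, 198}


def parse_attributes(smart_output: str) -> Dict[int, Tuple[str, int]]:
    """Map attribute ID -> (name, leading integer of the RAW_VALUE column)."""
    attrs: Dict[int, Tuple[str, int]] = {}
    for line in smart_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and carry a 0x.... flag column.
        if len(fields) >= 10 and fields[0].isdigit() and fields[2].startswith("0x"):
            try:
                attrs[int(fields[0])] = (fields[1], int(fields[9]))
            except ValueError:
                continue  # raw value is not a plain integer; skip it
    return attrs


def failed_self_tests(smart_output: str) -> List[str]:
    """Return self-test log rows whose status mentions a failure."""
    return [
        line.strip()
        for line in smart_output.splitlines()
        if line.lstrip().startswith("#") and "failure" in line
    ]


sample = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   001    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   100   100   000    Old_age   Always       -       0
# 1  Short offline       Completed: read failure       90%      3978         3418095547
# 2  Short offline       Completed without error       00%         8         -
"""

attrs = parse_attributes(sample)
nonzero = {name: raw for attr_id, (name, raw) in attrs.items() if attr_id in WATCH and raw > 0}
failed = failed_self_tests(sample)
print("Non-zero watched attributes:", nonzero)
print("Failed self-tests:", failed)
```

On this drive the sketch flags the single pending sector and the failed short test, which are the same two things that make me expect the long test to fail.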
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925

allanonmage

45.7% of a year ... "nothing" in context.
Well, when you put it that way, that sounds way less impressive. Do the numbers roll over after they max out a small INT value or anything, or can that number be counted on to be accurate?

Also, since I just got the drives, this is sufficient for me to reach out to the vendor to get an RMA. Thanks for the help!
 

joeschmuck

Do the numbers roll over
Some drives will roll over at 65536, but typically not all the timers roll over on a drive like that, which causes confusion.
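To put numbers on both points, here is a small Python sketch of the arithmetic. It assumes the counter ticks once per hour and, for the rollover case, that it is a 16-bit value wrapping at 65536; whether a given drive actually behaves that way is an assumption, and the reported_hours helper is purely illustrative.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years
ROLLOVER = 65536           # 2**16, the wrap point for a 16-bit counter

power_on_hours = 4006      # raw value of attribute 9 in the output above

# Fraction of one year the drive has been powered on.
fraction = power_on_hours / HOURS_PER_YEAR
print(f"{power_on_hours} h is {fraction:.1%} of a year")

# How long an hourly 16-bit counter lasts before it wraps.
years_to_rollover = ROLLOVER / HOURS_PER_YEAR
print(f"A 16-bit hour counter wraps after about {years_to_rollover:.2f} years")


def reported_hours(true_hours: int) -> int:
    """What a wrapping 16-bit hour counter would display for a given true age."""
    return true_hours % ROLLOVER


# A hypothetical drive with 70000 true hours would show a much smaller number.
print(reported_hours(70000))
```

So on a drive that does wrap, the counter only becomes ambiguous after roughly seven and a half years of continuous running, which is far beyond the 4006 hours shown here.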
 