New Scale build, SAS drives seem to be stuck in SMART Long test even after power cycle?

billbillw

Dabbler
Joined
Jan 6, 2023
Messages
33
Hello,
I am setting up a new home server with Scale. This is my first time dealing with TrueNAS (coming from Proxmox VE).
I'm just a hobbyist so I tend to get in over my head on some of this. I have an issue and I'm not sure if it is a problem or not. Also, if it is a problem, how can I resolve it?
Details below.

I have a fairly simple server based on a Supermicro X10SRH-CF (built-in SAS3008), 64GB of registered ECC RAM, a pool of 5x HGST He10 8TB drives (retired from enterprise service), and a few SSDs to run containers/apps. I have Scale 22.12.0 installed successfully and I've been trying to make sure the "new to me" 8TB drives are up to the task. I started by putting them all through a SMART Long test using the TrueNAS Storage GUI.

My initial mistake was starting the tests on the drives individually instead of selecting all 5. I saw some error messages about only being able to run one test at a time. Shortly after that, I realized that I should have selected all 5 drives and then started the SMART Long test.

Long story short: I have done several power on/off cycles, and I did get one full run of SMART Long tests (all 5 drives simultaneously) to complete successfully. Oddly, both the TrueNAS GUI and smartctl (run over PuTTY) show that all 5 drives are still running a long test, yet nothing shows in the Jobs tab. This has been the case for several days now. The test shown as running pre-dates the successful tests.

Here is an example of what comes up for /dev/sda:
root@truenas[~]# smartctl -a /dev/sda
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.79+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: HGST
Product: HUH721008AL5200
Revision: A384
Compliance: SPC-4
User Capacity: 8,001,563,222,016 bytes [8.00 TB]
Logical block size: 512 bytes
Physical block size: 4096 bytes
LU is fully provisioned
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Logical Unit id: 0x5000cca2520beba0
Serial number: 7SG6K74C
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Sat Jan 21 11:46:51 2023 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned <not available>
Power on minutes since format <not available>
Current Drive Temperature: 28 C
Drive Trip Temperature: 85 C

Accumulated power on time, hours:minutes 38718:40
Manufactured in week 12 of year 2017
Specified cycle count over device lifetime: 50000
Accumulated start-stop cycles: 95
Specified load-unload count over device lifetime: 600000
Accumulated load-unload cycles: 1583
Elements in grown defect list: 0

Vendor (Seagate Cache) information
Blocks sent to initiator = 22030695324975104

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0     4463         0      4463   73165473     209169.321           0
write:         0        0         0         0    4775605     183118.763           0
verify:        0     2013         0      2013    3100719     341980.772           0

Non-medium error count: 0

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                   -   38684                 - [-   -    -]
# 2  Background long   Self test in progress ...   -     NOW                 - [-   -    -]
# 3  Background short  Completed                   -   38647                 - [-   -    -]
# 4  Background short  Completed                   -   38623                 - [-   -    -]
# 5  Background short  Completed                   -   38599                 - [-   -    -]
# 6  Background short  Completed                   -   38575                 - [-   -    -]
# 7  Background short  Completed                   -   38556                 - [-   -    -]

Long (extended) Self-test duration: 60592 seconds [1009.9 minutes]

root@truenas[~]#
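
The self-test log above can be checked mechanically for this symptom. The Python sketch below is only an illustration: it assumes the SAS log row layout shown in this output and assumes that smartctl lists the newest entry first. The names `parse_selftest_log` and `stale_in_progress` are hypothetical helpers, not part of smartmontools. It flags any "in progress" entry that sits below a completed one, which matches the observation that the running test pre-dates the successful tests.

```python
import re
from typing import List, NamedTuple


class SelfTest(NamedTuple):
    num: int
    description: str
    status: str
    lifetime: str  # power-on hours as text, or "NOW" for a running test


# One row of the SCSI/SAS self-test log as printed in this thread, e.g.
# "# 2 Background long Self test in progress ... - NOW - [- - -]"
ROW = re.compile(
    r"^#\s*(\d+)\s+((?:Background|Foreground)\s+\w+)\s+(.+?)"
    r"\s+\S+\s+(NOW|\d+)\s+\S+\s+\["
)


def parse_selftest_log(text: str) -> List[SelfTest]:
    """Extract self-test rows from smartctl output (assumed SAS layout)."""
    entries = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            num, desc, status, lifetime = m.groups()
            entries.append(
                SelfTest(int(num), " ".join(desc.split()), status, lifetime)
            )
    return entries


def stale_in_progress(entries: List[SelfTest]) -> List[SelfTest]:
    """Return 'in progress' rows listed below (older than) a completed row.

    Assumption: entry #1 is the most recent test.
    """
    stale = []
    for e in entries:
        if "in progress" not in e.status.lower():
            continue
        if any(o.num < e.num and o.status.startswith("Completed") for o in entries):
            stale.append(e)
    return stale


SAMPLE = """\
# 1 Background long Completed - 38684 - [- - -]
# 2 Background long Self test in progress ... - NOW - [- - -]
# 3 Background short Completed - 38647 - [- - -]
"""

if __name__ == "__main__":
    for e in stale_in_progress(parse_selftest_log(SAMPLE)):
        print(f"entry #{e.num} ({e.description}) looks stale: {e.status}")
```

In practice the text would come from `smartctl -a /dev/sdX` for each drive. A flagged entry only suggests a leftover log record; it does not prove that the drive is idle.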

In addition, all of the drives look like this in the Storage GUI:
[Attached image: 1674319770356.png]


I don't understand why a complete power down doesn't clear these background long tests.

Is this something I should worry about, or is it just a harmless glitch? I haven't started the more strenuous badblocks testing or any migration yet.
Thanks for any help.
 

billbillw

I should also mention that I tried running smartctl -X on all the drives, and it returns an "unsupported field in scsi command" error:

root@truenas[~]#smartctl -X /dev/sda
zsh: parse error near `\n'
root@truenas[~]# smartctl -X /dev/sda
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.79+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

Abort self test failed [unsupported field in scsi command]
root@truenas[~]#
 

billbillw

Well, after another 24+hrs the drives are still showing that the long test is running, no change. I decided to go ahead and run badblocks on all the disks. About 10% done after an hour, which is faster than I expected.
 

przemeku6

Cadet
Joined
Mar 1, 2023
Messages
3
Hi, same here. The SMART background long test never finishes, and smartctl -X doesn't stop it; it just returns the error message. This happens on 2 HGST 7TB drives connected to an LSI 9211-8i 6G SAS HBA in IT mode with the newest firmware. I am a complete noob and started playing with TrueNAS a couple of months ago, so please be patient. I tried creating a pool (mirror) and using it, and everything else looks fine. Any advice and directions would be greatly appreciated.
 

Alecmascot

Guru
Joined
Mar 18, 2014
Messages
1,177
billbillw said:
"I don't understand why a complete power down doesn't clear these background long tests."

These tests are run by the drive itself and are designed to continue after a power off.
 

billbillw

Alecmascot said:
"These tests are run by the drive itself and are designed to continue after a power off."

From what I read, a power cycle is supposed to clear the test. In my case, the tests did not continue. It has been months and the test still shows as running, but I have run and completed other long tests in the meantime. The test that shows as running is one I started back in January. The other 3 disks in the storage pool show the same thing. Ultimately, it seems to be a glitch in the firmware of these SAS drives.
 

Attachments

  • Screenshot 2023-03-08 074941.jpg

billbillw

przemeku6 said:
"Hi, same here. The SMART background long test never finishes, and smartctl -X doesn't stop it; it just returns the error message. This happens on 2 HGST 7TB drives connected to an LSI 9211-8i 6G SAS HBA in IT mode with the newest firmware. I am a complete noob and started playing with TrueNAS a couple of months ago, so please be patient. I tried creating a pool (mirror) and using it, and everything else looks fine. Any advice and directions would be greatly appreciated."

Have you tried running the tests and leaving them for 24 hours or more, then checking? In my case, I was able to get the tests to complete. The only issue is that SMART (the SAS version) still reports a test as running even when it is not. I would advise running badblocks on the disks to see if they pass. Make sure to read the tutorials and understand how tmux works before you start.
Tutorial here: https://www.truenas.com/community/resources/hard-drive-burn-in-testing.92/
Note: some of the tmux keyboard commands are different in Scale vs Core because the base OS is Debian. Searching elsewhere for Debian tmux tutorials may help a little.

You may find that you have to destroy the pool before testing. In most cases, you want to wait on creating a zpool until after the testing is completed.
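
As a rough guide to how long to wait before checking, the drive reports its own estimate in the smartctl output. A minimal Python sketch, assuming the "Long (extended) Self-test duration" line format shown in the smartctl output earlier in this thread (the function name `long_test_hours` is made up for illustration):

```python
import re
from typing import Optional


def long_test_hours(smartctl_output: str) -> Optional[float]:
    """Return the drive's estimated long self-test duration in hours.

    Returns None when the expected line is not present.
    """
    m = re.search(
        r"Long \(extended\) Self-test duration:\s+(\d+)\s+seconds",
        smartctl_output,
    )
    return int(m.group(1)) / 3600 if m else None


if __name__ == "__main__":
    line = "Long (extended) Self-test duration: 60592 seconds [1009.9 minutes]"
    print(f"{long_test_hours(line):.1f} hours")  # prints "16.8 hours"
```

For the 8TB drive above the estimate is just under 17 hours, so a test that still shows as running after several days is well past the drive's own estimate.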
 

przemeku6

billbillw said:
"Have you tried running the tests and leaving them for 24 hours or more, then checking? In my case, I was able to get the tests to complete. The only issue is that SMART (the SAS version) still reports a test as running even when it is not. I would advise running badblocks on the disks to see if they pass. Make sure to read the tutorials and understand how tmux works before you start.
"Tutorial here: https://www.truenas.com/community/resources/hard-drive-burn-in-testing.92/
"Note: some of the tmux keyboard commands are different in Scale vs Core because the base OS is Debian. Searching elsewhere for Debian tmux tutorials may help a little.

"You may find that you have to destroy the pool before testing. In most cases, you want to wait on creating a zpool until after the testing is completed."

Thank you, sir. I definitely learned a lot over the last couple of days. I ran badblocks in tmux and later a smartctl long test. badblocks took 70 hours and the smartctl long test around 16 hours. Both disks passed both tests, with limited info (apparently that is the thing with SAS drives).

I still believe there is something wrong with how TrueNAS Scale deals with those tests. I ran all the testing from PuTTY and tmux, and TrueNAS still shows the old (previous) tests as running; smartctl -X does nothing for those old tests. This morning I tried to run a SMART short test from TrueNAS on the 2x 120GB SATA drives (mirrored boot pool), and 6 hours later the Jobs list still shows smart.test.wait at 0%, fetching data.

I also noticed that power management (spin-down) does not work on those two 6TB SAS drives. I suspect the SMART test prevents the drives from spinning down. To anyone reading this who is ready to tell me that spin-downs are bad: that is not the point! Besides, where I live 1 kWh in peak hours costs $0.70, so 5 watts per drive, 24 hours a day, 365 days a year, costs more than those recycled drives on eBay.

I read somewhere that the record (log) of previous tests is kept on the drive itself and that most disks keep fewer than 15 entries, so I am hoping that the old entries will eventually clear out.
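
The running-cost figure above can be checked with quick arithmetic. A minimal Python sketch, assuming a constant 5 W draw per drive and, as an upper bound, that every hour is billed at the $0.70 peak rate quoted above (the function `annual_cost` is hypothetical):

```python
def annual_cost(watts: float, price_per_kwh: float,
                hours_per_day: float = 24.0, days: int = 365) -> float:
    """Yearly electricity cost for a constant load, in the price's currency."""
    kwh = watts * hours_per_day * days / 1000.0  # 5 W -> 43.8 kWh per year
    return kwh * price_per_kwh


if __name__ == "__main__":
    cost = annual_cost(watts=5, price_per_kwh=0.70)
    print(f"${cost:.2f} per drive per year")  # prints "$30.66 per drive per year"
```

That works out to roughly $30 per drive per year under those assumptions; hours billed at a lower off-peak rate would reduce it.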
 