SOLVED Burn-in drives with Spearfoot's script questions

CheeryFlame

Contributor
Joined
Nov 21, 2022
Messages
184
Hello! First I want to say that I'm a rookie and this is my first build. Still, I'm proud that I got this far by reading documentation and by trial and error, all by myself!

However, I'm at the point where I need some help. I bought refurbished drives, and I can return them easily if I find defective ones within 30 days. I did some research, and the best way I could find to intensively test my drives is @Spearfoot's burn-in script.

I first tried running it in the TrueNAS Shell, but unfortunately I could only do one drive at a time, and I couldn't even finish one because I got disconnected and lost the progress. I'm not even sure if it's still going in the background. So far there's nothing in my /logs folder.

What I've achieved so far
  • Installed PuTTY, configured SSH, connected successfully and ran ./disk-burnin.sh
  • Set the kernel's GEOM debug flags to 16 with the command sysctl kern.geom.debugflags=0x10
  • Created 12 screen sessions in PuTTY, one for each of my 12 drives, and in each ran the command sudo ./disk-burnin.sh -f -o ~/logs da0 (with that drive's device name in place of da0)
  • Detached all screens one by one with Ctrl-A + D and closed PuTTY
  • Reopened PuTTY and used the command screen -ls to list all screens
  • Checked the progress of a drive with the command screen -RD da7
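The screen steps above could also be scripted rather than typed twelve times. A minimal sketch, assuming the script sits in the current directory and that session names match device names; `burnin_screen_commands` is a made-up helper name, not part of disk-burnin.sh. It only prints the commands so they can be reviewed first.

```shell
# Print one detached-screen command per drive name given as an argument.
# screen -dmS NAME starts a session called NAME already detached.
# Nothing is executed here; pipe the output to sh once it looks right.
burnin_screen_commands() {
  for drive in "$@"; do
    printf 'screen -dmS %s sudo ./disk-burnin.sh -f -o ~/logs %s\n' "$drive" "$drive"
  done
}

# Example (prints 12 lines, da0 through da11):
#   burnin_screen_commands $(seq -f 'da%g' 0 11)
```

Note that sudo may ask for a password inside each detached session, so it may be simpler to start the sessions from a root shell and drop sudo from the command.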
Where I'm lost
  1. Is the operation I started in the TrueNAS Shell on drive da0 still running in the background if the TrueNAS session closed but I haven't restarted the server?
  2. When I detach my screens, then shut down and reboot my computer (I need to do this for work), do the tests keep running in the background on the server?
  3. I made a Z1 pool with all 12x 12TB drives and installed the script on this pool, then ran the destructive test on all the drives. What will happen?
  4. I've used the command ./disk-burnin.sh -f -o ~/logs da[x] on each drive, but I'm not sure it runs the whole operation. Does it?
  5. I tried using the -s flag to show progress but it says invalid flag. How does it work?
  6. I tried using the -v flag to show some information but it says invalid flag. How does it work?
  7. I don't understand the -b 8192 flag. Do I need to use it?
  8. I tried using the -w flag but it says invalid flag. What's the advantage of running with this flag?
  9. I don't understand the -x flag. It says it performs a full pass of badblocks: is it doing -w at the same time, and does the test abort immediately if it finds a bad block?
  10. Is there a way to see progress?
Here are some screenshots of Putty, maybe it can help understand where I'm at.

Screenshot - 2022-12-13 - 23h02s05.png


Screenshot - 2022-12-13 - 23h00s28.png


I know I'm asking quite a lot of questions here, but please keep in mind that this is new ground for me and I'm already impressed I was able to get this far in the process. I'm asking because I'm in a rush: I need to return any faulty hard drives ASAP.

Thank you for helping!
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
Irrespective of the testing, I recommend that you check the number of hours the drives have been operational so far. If that number indicates more than 3 years I would be skeptical. It is a somewhat subjective area though, and I would personally never buy used drives for critical data; that in contrast to motherboard, CPU, and RAM.
 

nKk

Dabbler
Joined
Jan 8, 2018
Messages
42
2. When you detach from a screen session, the script keeps running.

3. You will lose all data on the disks being tested, so it's not a good idea to create a pool from them and store data on it, including the disk-burnin.sh script.

4. The command is OK, but because the disks are large the test will take a long time to complete.

About your questions 5 to 8:
These are options for badblocks, not for disk-burnin.sh, and you can't use them when calling disk-burnin.sh.

9. This is an option of disk-burnin.sh that makes it call badblocks with "-e 0", which forces badblocks to continue testing after an error is found; otherwise badblocks stops when it finds an error on the disk.

10. Progress is shown in the console (your last screenshot), and it depends on which test tool is currently running. Because the drives are big, badblocks and the long SMART test will take a long time to finish, but badblocks shows progress for every operation.
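For the SMART long test, which prints no live progress in the console, the remaining percentage can be pulled out of smartctl's output. A small sketch, assuming the output contains a "Self-test execution status: NN% of test remaining" line as SAS drives report it; `smart_remaining` is only an illustrative name.

```shell
# Read smartctl output on stdin and print the percentage of the
# self-test that remains, if a line of that form is present.
smart_remaining() {
  sed -n 's/^[[:space:]]*Self-test execution status:[[:space:]]*\([0-9][0-9]*\)% of test remaining.*/\1/p' | head -n 1
}

# Example, run as root on the server:
#   smartctl -a /dev/da7 | smart_remaining
```

If no self-test is running, the function prints nothing.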
 

CheeryFlame

ChrisRJ said: "Irrespective of the testing, I recommend that you check the number of hours the drives have been operational so far. If that number indicates more than 3 years I would be skeptical. It is a somewhat subjective area though, and I would personally never buy used drives for critical data; that in contrast to motherboard, CPU, and RAM."
Not related to the topic, but since you brought this up I suggest reading this, as there's a lot of misinformation about used/refurb/wl drives. SMART data "should" be wiped as part of the refurb/wl process. One trick is to get refurb/wl drives whose production date is from last year.

Thank you very much for your answers. The SMART test didn't show progress, but once it finished, the badblocks process does show progress. It's now over 70%, so I guess I'll lose my logs since, like a noob, I ran the test on the pool where the script is. I guess I'll have to redo the test after this one finishes.

nKk said: "About your questions 5 to 8: These are options for badblocks, not for disk-burnin.sh, and you can't use them when calling disk-burnin.sh."
Is there a way to customize the badblocks options so I could use -w and write the four patterns (0xaa, 0x55, 0xff, 0x00)?

Here's a screenshot of the progress so far for reference.
Screenshot - 2022-12-14 - 13h45s10.png
 

CheeryFlame

Just saw that all 12 drives are showing 24 bad blocks. Is that because they're in a RAID? It's improbable that all 12 drives each have exactly 24 bad blocks... It's currently doing the SMART long test; let's see the log files tomorrow. Also, I don't understand how my user folder with the script on this pool wasn't deleted after running the badblocks test simultaneously on all 12 drives.
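One way to check whether the counts really are identical is to count the entries in each drive's bad blocks file. A rough sketch, assuming the files were written by badblocks -o with one block number per line; `bb_entry_counts` is a hypothetical helper name.

```shell
# Print "<file>: <entries>" for each badblocks output file given.
# Counting non-empty lines gives the number of bad blocks recorded.
bb_entry_counts() {
  for bbfile in "$@"; do
    printf '%s: %s\n' "$bbfile" "$(grep -c . "$bbfile")"
  done
}

# Example, assuming one .bb file per drive under ~/logs:
#   bb_entry_counts ~/logs/*/*.bb
```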

Screenshot - 2022-12-15 - 00h06s57.png
 

nKk

CheeryFlame said: "Is there a way to customize the badblocks options so I could use -w and write the four patterns (0xaa, 0x55, 0xff, 0x00)?"
The script already uses the -w option. These are the badblocks parameters in the script:
badblocks -b 8192 -wsv ...
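For reference, a rough reading of those flags, plus a quick check of why the block size matters on drives this large. The flag meanings are from the badblocks man page. The 32-bit block-count limit is my understanding of how badblocks behaves, so treat it as an assumption. The helper names are made up for illustration and are not part of disk-burnin.sh.

```shell
# badblocks flags used by the script, per the badblocks man page:
#   -b 8192  block size in bytes (the default is 1024)
#   -w       destructive write-mode test: writes the patterns
#            0xaa, 0x55, 0xff, 0x00 and reads each one back
#   -s       show progress
#   -v       verbose output

# Number of blocks badblocks has to address for a drive of
# $1 bytes at $2 bytes per block.
bb_block_count() {
  echo $(( $1 / $2 ))
}

# Assumption: badblocks refuses block counts that do not fit in 32 bits.
bb_fits_32bit() {
  [ "$(bb_block_count "$1" "$2")" -lt 4294967296 ]
}

# Example with a 14 TB drive (14,000,519,643,136 bytes):
#   bb_block_count 14000519643136 8192   # 1709047808, fits
#   bb_block_count 14000519643136 1024   # 13672382464, too many
```

So -b 8192 is not something to add yourself; the script already passes it, and under the assumption above the default 1024-byte block size would not work on a drive of this size.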

About the similar errors on all drives: I think the best option is to destroy the pool and test the drives when they are not part of a pool.
 

CheeryFlame

Removing all the drives from the pool did work. All 12 drives seem to report no apparent errors after 6 days of running the tests.

Here's the log of one drive. It shows this entry:
Code:
Loss of dword synchronization count: 2
The same entry, with the same value, is present for every drive. There's practically no chance that 12 drives would independently show the exact same count of the exact same error. Is there an issue with my drive controller?

+-----------------------------------------------------------------------------
+ Started burn-in: Thu Dec 15 12:32:07 PST 2022
+-----------------------------------------------------------------------------
Host: truenas.local
OS: FreeBSD
Drive: /dev/da11
Disk Type: 7200_rpm
Drive Model:
Serial Number:
Short test duration: minutes
0 seconds
Extended test duration: minutes
0 seconds
Log file: /mnt/Pollen/Freeman/scripts/disk-burnin/logs/da11/burnin-_.log
Bad blocks file: /mnt/Pollen/Freeman/scripts/disk-burnin/logs/da11/burnin-_.bb
+-----------------------------------------------------------------------------
+ Running SMART short test: Thu Dec 15 12:32:07 PST 2022
+-----------------------------------------------------------------------------
SMART short test started, awaiting completion for 0 seconds ...
SMART self-test timeout threshold exceeded
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p2 amd64] (local build)

SMART Self-test log
Num Test Status segment LifeTime LBA_first_err [SK ASC ASQ]
Description number (hours)
# 1 Background short Completed - 67 - [- - -]
# 2 Background long Aborted (device reset ?) - 64 - [- - -]
# 3 Background short Completed - 26 - [- - -]

Long (extended) Self-test duration: 65535 seconds [1092.2 minutes]

Finished SMART short test
+-----------------------------------------------------------------------------
+ Running badblocks test: Thu Dec 15 16:54:13 PST 2022
+-----------------------------------------------------------------------------
Finished badblocks test
+-----------------------------------------------------------------------------
+ Running SMART long test: Wed Dec 21 15:11:08 PST 2022
+-----------------------------------------------------------------------------
SMART long test started, awaiting completion for 0 seconds ...
SMART self-test timeout threshold exceeded
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p2 amd64] (local build)

Self-test execution status: 67% of test remaining
SMART Self-test log
Num Test Status segment LifeTime LBA_first_err [SK ASC ASQ]
Description number (hours)
# 1 Background long Self test in progress ... - NOW - [- - -]
# 2 Background short Completed - 67 - [- - -]
# 3 Background long Aborted (device reset ?) - 64 - [- - -]
# 4 Background short Completed - 26 - [- - -]

Long (extended) Self-test duration: 65535 seconds [1092.2 minutes]

Finished SMART long test
+-----------------------------------------------------------------------------
+ Drive information: Wed Dec 21 19:33:14 PST 2022
+-----------------------------------------------------------------------------
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p2 amd64] (local build)

=== START OF INFORMATION SECTION ===
Vendor:
Product: OOS14000G
Revision: OOS1
Compliance: SPC-5
User Capacity: 14,000,519,643,136 bytes [14.0 TB]
Logical block size: 512 bytes
Physical block size: 4096 bytes
LU is fully provisioned
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Logical Unit id: 0x5000c500d9af983f
Serial number: 000BL9730000C245DGGY
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Wed Dec 21 19:33:14 2022 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled
Read Cache is: Enabled
Writeback Cache is: Enabled

SMART Health Status: OK

Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned <not available>
Power on minutes since format <not available>
Current Drive Temperature: 37 C
Drive Trip Temperature: 60 C

Manufactured in week 37 of year 2022
Specified cycle count over device lifetime: 50000
Accumulated start-stop cycles: 14
Specified load-unload count over device lifetime: 600000
Accumulated load-unload cycles: 296
Elements in grown defect list: 0

Error counter log:
Errors Corrected by Total Correction Gigabytes Total
ECC rereads/ errors algorithm processed uncorrected
fast | delayed rewrites corrected invocations [10^9 bytes] errors
read: 0 0 0 0 0 56009.675 0
write: 0 0 0 0 0 70004.301 0

Non-medium error count: 0


[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
Self-test execution status: 67% of test remaining
SMART Self-test log
Num Test Status segment LifeTime LBA_first_err [SK ASC ASQ]
Description number (hours)
# 1 Background long Self test in progress ... - NOW - [- - -]
# 2 Background short Completed - 67 - [- - -]
# 3 Background long Aborted (device reset ?) - 64 - [- - -]
# 4 Background short Completed - 26 - [- - -]

Long (extended) Self-test duration: 65535 seconds [1092.2 minutes]

Background scan results log
Status: no scans active
Accumulated power on time, hours:minutes 218:09 [13089 minutes]
Number of background scans performed: 0, scan progress: 0.00%
Number of background medium scans performed: 0

Protocol Specific port log page for SAS SSP
relative target port id = 1
generation code = 0
number of phys = 1
phy identifier = 0
attached device type: expander device
attached reason: SMP phy control function
reason: loss of dword synchronization
negotiated logical link rate: phy enabled; 12 Gbps
attached initiator port: ssp=0 stp=0 smp=0
attached target port: ssp=0 stp=0 smp=1
SAS address = 0x5000c500d9af983d
attached SAS address = 0x500056b38b092eff
attached phy identifier = 11
Invalid DWORD count = 0
Running disparity error count = 0
Loss of DWORD synchronization = 2
Phy reset problem = 0
Phy event descriptors:
Invalid word count: 0
Running disparity error count: 0
Loss of dword synchronization count: 2
Phy reset problem count: 0
relative target port id = 2
generation code = 0
number of phys = 1
phy identifier = 1
attached device type: no device attached
attached reason: unknown
reason: unknown
negotiated logical link rate: phy enabled; unknown
attached initiator port: ssp=0 stp=0 smp=0
attached target port: ssp=0 stp=0 smp=0
SAS address = 0x5000c500d9af983e
attached SAS address = 0x0
attached phy identifier = 0
Invalid DWORD count = 0
Running disparity error count = 0
Loss of DWORD synchronization = 0
Phy reset problem = 0
Phy event descriptors:
Invalid word count: 0
Running disparity error count: 0
Loss of dword synchronization count: 0
Phy reset problem count: 0

+-----------------------------------------------------------------------------
+ Finished burn-in: Wed Dec 21 19:33:14 PST 2022
+-----------------------------------------------------------------------------
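To compare that counter across all 12 logs without opening each one, something like the following could work. It is only a sketch: it assumes every log contains "Loss of dword synchronization count:" lines in the form shown above (one per SAS port), and `dword_loss_counts` is an illustrative name.

```shell
# For each log file given, print "<file>: <counts>", where <counts> are the
# "Loss of dword synchronization count" values in the order they appear.
dword_loss_counts() {
  for log in "$@"; do
    counts=$(awk '/Loss of dword synchronization count:/ { printf "%s ", $NF }' "$log")
    printf '%s: %s\n' "$log" "${counts% }"
  done
}

# Example, assuming one log per drive under the logs directory:
#   dword_loss_counts logs/*/*.log
```

For the log above this would print "2 0" after the file name: 2 for the port attached to the expander and 0 for the unconnected second port.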
 