disk pool throwing alert

georgelza

Patron
Joined
Feb 24, 2021
Messages
417
Will run the first badblocks tonight/overnight.
G
 

georgelza

Patron
Joined
Feb 24, 2021
Messages
417
zfs snapshot -r <old pool>/<dataset>@<snapshot name>
zfs send -R <old pool>/<dataset>@<snapshot name> | zfs recv -Fv <New Pool>/<dataset>

Once you're happy the data is all there and safe:
zfs destroy -f <old pool>/<dataset>

You could also set up a replication task and use it once, then do the delete.
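The steps above can be sketched end to end, with a couple of sanity checks before anything is destroyed (pool/dataset names here follow the thread; the snapshot name "migrate" is arbitrary):

```shell
# Recursive snapshot of the source dataset and all children
zfs snapshot -r tank/media@migrate

# Replicate the snapshot stream (with children and properties) to the new pool
zfs send -R tank/media@migrate | zfs recv -Fv bunker/media

# Sanity checks before destroying anything:
zfs list -o name,used,refer tank/media bunker/media   # sizes should match closely
zfs list -t snapshot -r bunker/media                  # the snapshot should exist on the target

# Only once you are satisfied the data is intact:
zfs destroy -r tank/media
```

This is a sketch, not a recipe: run the two `zfs list` checks (and spot-check files on the target) before the destroy, since `zfs destroy` is not recoverable.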
Once I've moved the media (from tank/media to bunker/media)... guessing (as this is only used by Plex) I need to update the Plex config to then point to bunker/media?

G
 
Joined
Jun 15, 2022
Messages
674
So, how much CPU does badblocks eat up?

I have one drive connected/unused already on which I can start badblocks.

I have a screen/keyboard available, so I was thinking of running it there... allowing me to disconnect.

Command planned:

badblocks -b 4096 <searching for block size for ST4000VN008-2DR166> -c 256 -e 64 -p 1 -svw -o <where ?>

Please explain a bit more: -svw to see what's going on (this does wipe the drive; a read won't detect future errors)?
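For reference, a filled-in version of that command might look like the following. The device path and log location are placeholders, not the poster's actual values; substitute your own:

```shell
# Hypothetical device and log path -- substitute your own.
DEV=/dev/sdX
LOG=/root/badblocks-sdX.log

# -b 4096  block size to test with (omit to let badblocks pick a default)
# -c 256   number of blocks tested at once (larger can be faster)
# -e 64    abort after this many bad blocks are found
# -p 1     number of passes
# -s -v    show progress, verbose output
# -w       DESTRUCTIVE write-mode test: this wipes the drive
# -o       write the list of bad blocks found to a file
badblocks -b 4096 -c 256 -e 64 -p 1 -svw -o "$LOG" "$DEV"
```

Double-check `$DEV` against `lsblk` or serial numbers before running: `-w` destroys everything on the target drive.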

G
badblocks CPU usage depends on many factors. I have an old server (about 15 years old, with a PCI HBA) where it easily uses 50% with 4 drives, but the HP server (PCIe) runs 12 drives at something like 6% total CPU usage until a verify pass, which runs about 7% CPU usage per drive. Your 1 drive shouldn't impact server performance.

The block size you can leave default if you want (don't specify -b).

-c 256 should be fine, though 1024 might be far faster. Anything over 2048 seems to be only a slight gain, so 4096 and above is probably not needed.

-o I run to a USB drive and analyze the results in my office, but you could just as easily | tee <network_path> and watch the file with tail.

man badblocks shows help for -s (show progress) and -v (verbose).

-w does a destructive write with bit patterns that make "stressful" magnetic encoding, so if the block is getting weak one of the patterns ought to show it:
0xaa is binary: 10101010
0x55: 01010101
0xff: 11111111
0x00: 00000000

Those patterns "cover the worst-case possibilities." Once I had just one of those patterns fail, but the rest stuck without issue, so the block's magnetic ability was "getting weak."

There's a lot of theory on magnetic storage. Years back, disks were magnetic film on glass platters and bad blocks were expected; there was actually a bad block list from the manufacturer on the label, which was then manually entered into the MFM or RLL controller so the controller wouldn't use those blocks (generally there were 3 to 9 bad blocks). From what I last heard (and I could be incorrect), modern platters are cut from a long aluminum cylinder that is squashed, sawed, and polished, because a way was found to pack data tighter and more reliably.

In modern industrial-quality drives, 1 bad block is often considered "the beginning of the end." I'd argue it depends on the size of the drive and how quickly blocks are going bad, as some of my drives developed 3 bad blocks years back and are still running with those 3 bad blocks. However, I've had other drives show 0 bad blocks on a Short SMART test and 1 or 2 bad blocks on a Long SMART test (both are read tests), and after pulling them from service, a badblocks write test started throwing tens of bad blocks, and a retest hundreds. This is why I keep several spare drives on the shelf to chuck into systems that look like they might have a drive going bad.
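The Short/Long SMART tests and the reallocated-block counts discussed here can be checked with smartctl (the device path is an example):

```shell
# Show the SMART attributes that track failing blocks
smartctl -A /dev/sdX | grep -Ei 'Reallocated_Sector|Current_Pending|Offline_Uncorrectable'

# Kick off the read-only self-tests mentioned above
smartctl -t short /dev/sdX
smartctl -t long  /dev/sdX

# View the results once a test finishes
smartctl -l selftest /dev/sdX
```

A rising Reallocated_Sector_Ct or a non-zero Current_Pending_Sector is the kind of climb described later in this thread, and a reasonable trigger for a full badblocks write test on a spare.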
 

georgelza

Patron
Joined
Feb 24, 2021
Messages
417
use tmux from ssh...

tmux new-session -d -s badblocks '/usr/sbin/badblocks -b.....'

Then at any time:

tmux attach -t badblocks

When in that session, CTRL + B, then D to disconnect. Also CTRL + C when in that session to terminate the running code.
I ran this last night.

How would I confirm it's completed? Is there a way to check/confirm it was all copied?

G
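One way to answer the "how would I confirm it's completed" question: tee the output to a log so it survives the session, then check whether the session still exists (log and device paths here are examples):

```shell
# Start badblocks detached in a named tmux session, teeing output to a file
# so the results survive even after the session exits (badblocks writes
# progress to stderr, hence the 2>&1).
tmux new-session -d -s badblocks \
  '/usr/sbin/badblocks -svw -o /root/bb-sdX.log /dev/sdX 2>&1 | tee /root/bb-sdX.out'

tmux attach -t badblocks        # watch it live (Ctrl+B then D to detach)
tmux has-session -t badblocks   # exit status 0 while it is still running
tail -f /root/bb-sdX.out        # or just follow the log from any shell
```

When `tmux has-session` reports the session is gone, the run has finished; the tail of the `.out` file shows whether it completed 100% and the `.log` file lists any bad blocks found.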
 

georgelza

Patron
Joined
Feb 24, 2021
Messages
417
FYI, I've had 24 bad blocks for a while on one of the disks... another had 16, and suddenly the reported count increased and started to climb; it's been semi-stable at 26, I think...

G
 

georgelza

Patron
Joined
Feb 24, 2021
Messages
417
Trying to reconfigure my NFS and SMB shares, it shows me a permission error on bunker/media.
On closer look, there are no permissions associated with it, whereas on tank/media there are?

G
 

Attachments

  • Screenshot 2023-08-08 at 07.40.07.png

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700

georgelza

Patron
Joined
Feb 24, 2021
Messages
417
The dataset eventually ("now") showed on the new disk pool with the same permissions... well, that gave me an indication of how long it took to move it allllll.
My initial update of the shares was incomplete; I forgot my apps don't point to the share, but to the actual mount point, which has changed...
<would have been nice if the app referenced the dataset, wherever it lived, so if you moved it, it understood/followed>.

G
 

georgelza

Patron
Joined
Feb 24, 2021
Messages
417
rsync -avnc /mnt/pool/original /mnt/pool/new

Any idea how long a 6TB dataset check might take on an i5?

Realise this could be "how long is a piece of string".

G
 

georgelza

Patron
Joined
Feb 24, 2021
Messages
417
I executed the rsync inside a tmux session... allowing me to walk away.


G
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
how long a 6TB dataset check might take on an i5.

realise this could be how long is the string.
Probably more important to know would be how many files/directories... I don't know how to equate that to a rate per processor type or pool, though, so it takes as long as it takes.
 

georgelza

Patron
Joined
Feb 24, 2021
Messages
417
With the above rsync, how would I monitor it... guessing that the tmux session lasts only as long as it's running; when done it will close, and with it any output?

I've attached to the rsync session and I'm seeing a lot of file names scroll past?
I sort of expected it to show not much, or maybe just files checked or files found lacking ;)
G
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
That list should be (as it would say at the top) the "incremental file list"... meaning it exists in the first structure but is missing in the second.

A long list isn't what you would want to see there if they are equal.
 

georgelza

Patron
Joined
Feb 24, 2021
Messages
417
Then I would also be concerned: why would there be differences?
terminal/tmux is not allowing me to scroll up... so I can't see how long this list is.
G

 

georgelza

Patron
Joined
Feb 24, 2021
Messages
417
A question would then be: can I run the rsync again (overwrite where the files exist and add where they don't)... to try and recopy the data, aka get rid of the delta?

G
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
Change the switches to remove the n and the copy will happen (only missing or changed files will copy).
 

georgelza

Patron
Joined
Feb 24, 2021
Messages
417
As in

change :
tmux new-session -d -s rsync 'rsync -avnc /mnt/tank/media /mnt/bunker/media'
To:
tmux new-session -d -s rsync 'rsync -avc /mnt/tank/media /mnt/bunker/media'

G
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
change :
tmux new-session -d -s rsync 'rsync -avnc /mnt/tank/media /mnt/bunker/media'
To:
tmux new-session -d -s rsync 'rsync -avc /mnt/tank/media /mnt/bunker/media'
That should do it.

It will probably go faster if you drop the c as well, since it won't do a checksum verify of the files to see if they are different and will trust timestamps instead.

Also add u to skip files that may have been modified on the target... maybe it was my mistake not to have that in there already. (I usually use -uav)
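The effect of -u can be demonstrated on a toy pair of directories (requires rsync; paths are made up for illustration):

```shell
# Toy demonstration of -u: a file that is newer on the target side
# is left alone rather than overwritten from the source.
src=$(mktemp -d); dst=$(mktemp -d)
echo original > "$src/f.txt"
echo edited   > "$dst/f.txt"
touch -t 200001010000 "$src/f.txt"   # backdate the source copy

rsync -uav "$src/" "$dst/"
cat "$dst/f.txt"                     # still "edited"
```

Without -u, rsync would have replaced the target's newer f.txt with the older source version, which is exactly the risk -u guards against once you start using the new dataset.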
 

georgelza

Patron
Joined
Feb 24, 2021
Messages
417
What does the "u" do?
Will have to go "man" a bit.
G
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
What does the "u" do?
It skips files that are newer on the target side (meaning you copied them already and then changed them... don't overwrite with the original from the source).

Probably not much of an impact for you if you haven't started using the new dataset yet.
 