How to run the output of "find" on multiple threads/cores?

Status: Not open for further replies.

Apollo · Wizard · Joined: Jun 13, 2013 · Messages: 1,458
Since I started using FreeNAS, I have made changes to my folder and dataset structure to make more efficient use of the snapshot capability of ZFS.
I have been moving files and directories across datasets, and because of the snapshots, I cannot recover the space unless I destroy the snapshots taken prior to the data migration.
I have been looking around for ways to check differences between folders, but the solutions I have found don't seem very efficient.
I have found mention of the "diff" command and the like, but those tools are limited to comparing two specific datasets, folders, or files against each other.
I am trying to push the search further, and today I realized I can use the checksum utilities available within FreeNAS to scan the entire pool, with the results placed into a file for further analysis.

The command I am using is:

Code:
find /mnt/my_pool/ -type f -exec shasum -a 512 -b {} \; >> shasum_My_pool.txt


This is going to be a fairly lengthy process, but it has the advantage of being run within FreeNAS itself over SSH under a "screen" or "tmux" session.

This may or may not be optimal, but I am looking for a way to break the "find" output across multiple threads/cores.

For instance, if "find" goes through a folder with 10 files, i.e. file_1, file_2, file_3, ..., file_10, it will run one instance of "shasum" per file, waiting for each instance to complete before moving to the next file.
Is it possible to have "find" send file_1 to one instance of "shasum" and file_2 to a second instance, so that multiple instances of "shasum" run on different files simultaneously? Then, when the first instance finishes, "find" would hand it the next file, file_3, and so on until "find" has nothing left to process.
 

fracai · Guru · Joined: Aug 22, 2012 · Messages: 1,212
You can use xargs, which can spread the work across multiple CPUs.
Code:
find /mnt/my_pool/ -print0 | xargs -0 -P2 -n100 shasum -a 512 -b >> shasum_my_pool.txt

Basically:
find:
"-print0" means separate the found file names with a NUL character
xargs:
"-0" means read file names separated by a NUL character
"-P2" means run up to 2 processes at the same time
"-n100" means send up to 100 file names to each process invocation

Now, there's one issue that I can see, and I'm not sure how it will turn out. The multiple processes will presumably all be printing at the same time, so you could get records from different invocations interleaved, even with entries printed partially on the same line. I actually just ran a test and did see this behavior. To get around it, you can instead have each process write to a different file and then concatenate them afterwards.

Code:
mkdir ./tmp/
find /mnt/my_pool/ -print0 | xargs -0 P2 -n100 sh -c 'sha1sum "$@" > "$(mktemp -p ./tmp/)"'
cat ./tmp/* > shasum_my_pool.txt


Now, when I ran this and compared the output to the single-threaded version, I was missing approximately two files for each temp file created. I'm not sure what's going on, but I'd bet it's something like the shell stripping off the first and last files for some reason. I'd need more time to investigate. You might have better luck writing a shell script that runs sha1sum on the input arguments and writes the result to a mktemp file. You could then call that script with xargs instead of doing things inline (sketched below).
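
One concrete possibility along those lines: "sh -c" assigns the first argument after the command string to $0, not $1, so "$@" would silently drop one file per invocation. Passing a throwaway name to fill $0 should plug that gap (a sketch; I haven't re-run the full comparison):

Code:
# The trailing "sh" is consumed as $0, so "$@" now sees every file xargs hands over
find /mnt/my_pool/ -print0 | xargs -0 -P2 -n100 sh -c 'sha1sum "$@" > "$(mktemp -p ./tmp/)"' sh

And if you go the script route, it would look something like this (hypothetical name "hash_batch.sh"; since a script receives its arguments normally, the $0 quirk doesn't apply):

Code:
#!/bin/sh
# hash_batch.sh - hash every argument into a freshly created temp file
sha1sum "$@" > "$(mktemp -p ./tmp/)"

called as:

Code:
chmod +x hash_batch.sh
find /mnt/my_pool/ -print0 | xargs -0 -P2 -n100 ./hash_batch.sh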

Anyway, this is generally the process you'd want to take. Hopefully it helps you get to your goal.

http://www.tummy.com/blogs/2010/04/19/tricks-using-xargs-to-feed-multiple-cpus/
http://www.unixcl.com/2014/04/unix-xargs-parallel-execution-of.html
 

Apollo · Wizard · Joined: Jun 13, 2013 · Messages: 1,458
Great info; however, the -p option for mktemp doesn't exist on FreeNAS 9.3.
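It looks like FreeBSD's mktemp wants an explicit template instead, e.g.:

Code:
# The trailing X's in the template are replaced to form a unique file name
mktemp ./tmp/XXXXXX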
 

Apollo · Wizard · Joined: Jun 13, 2013 · Messages: 1,458
I had lots of trouble executing your command reliably. I got lost between the single quotes, the double quotes, and the use of "$@". I think I am getting the hang of it, but I wish the man pages would make their descriptions clearer and, more importantly, point out the caveats, especially for the "mktemp" command.
I didn't realize the "X"s in the template had to be capitalized for it to work.
After numerous unsuccessful trials and errors, I was able to get things working. However, I still can't figure out the following.

I want to record the time it took for the operation to complete, with the result stored in the same file as the calculated SHA values. I would use the -a and -o options of "time" with a file name, but because "mktemp" creates a different file name on every call, I don't see an easy way to accomplish this.

I have tried many variations of the command "time -ao time.txt" without success. Any recommendations?
 

fracai · Guru · Joined: Aug 22, 2012 · Messages: 1,212
OK, so I think you want something like:
Code:
find /mnt/my_pool/ -print0 | xargs -0 P2 -n100 sh -c 'FILE="$(mktemp ./tmp/XXXXXX)"; /usr/bin/time -ao "$FILE" sha1sum "$@" > "$FILE"' sh


Quoting can be tricky, especially when you get into nesting quotes; try to copy and paste if possible. In this case, also note that I'm using '/usr/bin/time', as 'time' may be provided by the shell and won't understand '-ao'. So, in the example above, I'm saving the value returned by mktemp as FILE and then using that value later.

Hope this helps.
 

Apollo · Wizard · Joined: Jun 13, 2013 · Messages: 1,458
Thanks, it now works.
I just corrected a typo in your code.
Code:
find /mnt/my_pool/ -print0 | xargs -0 -P2 -n100 sh -c 'FILE="$(mktemp ./tmp/XXXXXX)"; /usr/bin/time -ao "$FILE" shasum "$@" > "$FILE"' sh

I have to iron out a thing or two to deal with the actual collected data.
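For instance, to flag files with identical content, I'm thinking of printing every repeated hash, something along these lines (an untested sketch; it assumes the hash is the first field on each line and that non-hash lines, such as the timing output, are filtered out first):

Code:
# Print every entry whose hash has already appeared once before
awk 'seen[$1]++' shasum_my_pool.txt > duplicates.txt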
 

fracai · Guru · Joined: Aug 22, 2012 · Messages: 1,212
Cool, glad I could help. I didn't initially know about "xargs -P" and I'll definitely be using it myself.
 

Apollo · Wizard · Joined: Jun 13, 2013 · Messages: 1,458
Some updates:
I ran the command last night, and when I checked the file that is supposed to contain all the concatenated output, it was empty except for the dates I placed before and after the run in order to know how long the process really took.
I thought I had a "carriage return" issue on the line, but no.
I ran the "cat" command manually and got:
"/bin/cat: Argument list too long."

Who knew there would be a FreeBSD limitation on the command itself. So much for ZFS.
 

SweetAndLow · Sweet'NASty · Joined: Nov 6, 2013 · Messages: 6,421
Cat doesn't work with multiple files and I suspect that is what that error message means.
 

Apollo · Wizard · Joined: Jun 13, 2013 · Messages: 1,458
SweetAndLow said: "Cat doesn't work with multiple files and I suspect that is what that error message means."
I did some research online, and it seems "cat" has a limit. Some of the posts I have seen recommend using a filter to reduce the number of files passed at once. That doesn't seem to be a viable solution when using "mktemp".
 

fracai · Guru · Joined: Aug 22, 2012 · Messages: 1,212
What? Of course you can cat multiple files. The error is a limit on how many files can be passed in at once.

I'd solve this by looping over the files.

Code:
for FILE in ./tmp/* ; do cat "$FILE" ; done > all_files.txt
 

fracai · Guru · Joined: Aug 22, 2012 · Messages: 1,212
If the glob still doesn't work, you can use find.

Code:
find ./tmp/ -type f -print0 | xargs -0 cat > all_files.txt


I'd probably go with find.
 

Apollo · Wizard · Joined: Jun 13, 2013 · Messages: 1,458
fracai said:
If the glob still doesn't work, you can use find.

Code:
find ./tmp/ -type f -print0 | xargs -0 cat > all_files.txt


I'd probably go with find.
It does the trick, but is not very fast.

Also, I was trying to delete the contents of the folder with the "rm ./tmp/*" command, and I got the same error.
I am using the following command to remove the files one at a time:
Code:
find ./tmp/ -type f -exec rm {} \;


From a ZFS viewpoint I don't understand why the base commands have limitations, and from a FreeBSD perspective, why we have to go through so many hoops to run what ought to be a fairly simple function.
I guess I get what I paid for.
 

fracai · Guru · Joined: Aug 22, 2012 · Messages: 1,212
To delete the whole folder you can use: rm -r

And this isn't an issue exclusive to ZFS or BSD; it's a limitation of the shell and the tools, which cap how much can be passed in to avoid buffer overflows and other problems. There are solutions, like using xargs, to overcome these limits, and they work quite well.
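
For the temp-file cleanup, for example, the same xargs pattern gets around the limit without spawning one rm per file:

Code:
# Delete in large batches instead of one rm process per file
find ./tmp/ -type f -print0 | xargs -0 rm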

I suspect the slowness you're experiencing is due to the number of files that have been generated.

What's the output of: "ls ./tmp | wc -l"

That'll tell you how many files there are.
 

Apollo · Wizard · Joined: Jun 13, 2013 · Messages: 1,458
I had about 275K files.
 

jgreco · Resident Grinch · Joined: May 29, 2011 · Messages: 18,680
Apollo said:
It does the trick, but is not very fast.

Also, I was trying to delete the contents of the folder with the "rm ./tmp/*" command, and I got the same error.
I am using the following command to remove the files one at a time:
Code:
find ./tmp/ -type f -exec rm {} \;


From a ZFS viewpoint I don't understand why the base commands have limitations, and from a FreeBSD perspective, why we have to go through so many hoops to run what ought to be a fairly simple function.
I guess I get what I paid for.

The base commands don't have limitations; it's the argument processing in UNIX that's biting you. Shell expansion is handled by the shell, and the processed arguments are then passed to the command; that's why there's such uniformity in things like wildcard expansion across UNIX. There are limits so that people don't try to pass a trillion arguments to a command. You can run a system out of resources.
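
If you're curious what the actual ceiling is on your box, FreeBSD will tell you:

Code:
# The kernel's limit on the combined size of arguments and environment (bytes)
sysctl kern.argmax
getconf ARG_MAX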
 