How to run the output of "find" on multiple threads/cores?

Status: Not open for further replies.

Apollo · Wizard · Joined: Jun 13, 2013 · Messages: 1,458
Since I started using FreeNAS, I have made changes to my folder and dataset structure to make more efficient use of the snapshot capability of ZFS.
I have been moving files and directories across datasets, and because of the snapshots, I cannot recover the space unless I destroy the snapshots taken prior to the data migration.
I have been looking around for ways to check differences between folders, but the solutions I have found don't seem very efficient.
I have found mention of the "diff" command and the like, but those tools are limited to comparing two specific datasets, folders, or files against each other.
I am trying to push the search further, and today I realized I can use the checksum utilities available within FreeNAS to scan the entire pool, with the results placed into a file for further analysis.

The command I am using is:

Code:
find /mnt/my_pool/ -type f -exec shasum -a 512 -b {} \; >> shasum_My_pool.txt


This is going to be a fairly lengthy process, but it has the advantage of being run within FreeNAS itself over SSH under a "screen" or "tmux" session.

This may or may not be optimal, but I am looking for a way to break the "find" output across multiple threads/cores.

For instance, if "find" goes through a folder with 10 files, i.e. file_1, file_2, file_3, ..., file_10, it will run one instance of "shasum" per file, waiting for each instance to complete before moving to the next file.
Is it possible to have "find" send file_1 to one instance of "shasum" and file_2 to a second instance, so that multiple instances of "shasum" run on different files simultaneously? Then, when the first instance finishes, "find" would hand it the next file, file_3, and so on until "find" has nothing left to process.
 

fracai · Guru · Joined: Aug 22, 2012 · Messages: 1,212
You can use xargs, which can spread the work across multiple CPUs.
Code:
find /mnt/my_pool/ -print0 | xargs -0 -P2 -n100 shasum -a 512 -b >> shasum_my_pool.txt

Basically:
find:
"-print0" means separate the found file names with a NUL character
xargs:
"-0" means read file names separated by a NUL character
"-P2" means run up to 2 processes at the same time
"-n100" means send up to 100 file names to each process invocation

Now, there's one issue that I can see, and I'm not sure how it will turn out. The multiple processes will presumably all be printing at the same time, so you could get records from different invocations interleaved, even with entries printed partially on the same line. I actually just ran a test and did see this behavior. To get around it, you can instead have each process write to a different file and then concatenate them afterwards.

Code:
mkdir ./tmp/
find /mnt/my_pool/ -print0 | xargs -0 P2 -n100 sh -c 'sha1sum "$@" > "$(mktemp -p ./tmp/)"'
cat ./tmp/* > shasum_my_pool.txt


Now, when I ran this and compared the output to the single-threaded version, I was missing approximately two files for each temp file created. I'm not sure what's going on, but I'd bet it's something like the shell stripping off the first and last files for some reason. I'd need more time to investigate. You might have better luck writing a shell script that runs sha1sum on the input arguments and writes the result to a mktemp file. You could then call that script with xargs instead of doing things inline (sketched below).
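
One concrete possibility along those lines: "sh -c" assigns the first argument after the command string to $0, not $1, so "$@" would silently drop one file per invocation. Passing a throwaway name to fill $0 should plug that gap (a sketch; I haven't re-run the full comparison):

Code:
# The trailing "sh" is consumed as $0, so "$@" now sees every file xargs hands over
find /mnt/my_pool/ -print0 | xargs -0 -P2 -n100 sh -c 'sha1sum "$@" > "$(mktemp -p ./tmp/)"' sh

And if you go the script route, it would look something like this (hypothetical name "hash_batch.sh"; since a script receives its arguments normally, the $0 quirk doesn't apply):

Code:
#!/bin/sh
# hash_batch.sh - hash every argument into a freshly created temp file
sha1sum "$@" > "$(mktemp -p ./tmp/)"

called as:

Code:
chmod +x hash_batch.sh
find /mnt/my_pool/ -print0 | xargs -0 -P2 -n100 ./hash_batch.sh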

Anyway, this is generally the process you'd want to take. Hopefully it helps you get to your goal.

http://www.tummy.com/blogs/2010/04/19/tricks-using-xargs-to-feed-multiple-cpus/
http://www.unixcl.com/2014/04/unix-xargs-parallel-execution-of.html
 

Apollo · Wizard · Joined: Jun 13, 2013 · Messages: 1,458
Great info; however, the -p option for mktemp doesn't exist on FreeNAS 9.3.
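It looks like FreeBSD's mktemp wants an explicit template instead, e.g.:

Code:
# The trailing X's in the template are replaced to form a unique file name
mktemp ./tmp/XXXXXX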
 

Apollo · Wizard · Joined: Jun 13, 2013 · Messages: 1,458
I had lots of trouble executing your command reliably. I got lost between the single quotes, the double quotes, and the use of "$@". I think I am getting the hang of it, but I wish the man pages would make their descriptions clearer and, more importantly, point out the caveats, especially for the "mktemp" command.
I didn't realize the "X"s in the template had to be capitalized for it to work.
After numerous unsuccessful trials and errors, I was able to get things working. However, I still can't figure out the following.

I want to record the time it took for the operation to complete, with the result stored in the same file as the calculated SHA values. I would use the -a and -o options of "time" with a file name, but because "mktemp" creates a different file name on every call, I don't see an easy way to accomplish this.

I have tried many variations of the command "time -ao time.txt" without success. Any recommendations?
 

fracai · Guru · Joined: Aug 22, 2012 · Messages: 1,212
OK, so I think you want something like:
Code:
find /mnt/my_pool/ -print0 | xargs -0 P2 -n100 sh -c 'FILE="$(mktemp ./tmp/XXXXXX)"; /usr/bin/time -ao "$FILE" sha1sum "$@" > "$FILE"' sh


Quoting can be tricky, especially when you get into nesting quotes; try to copy and paste if possible. In this case, also note that I'm using '/usr/bin/time', as 'time' may be provided by the shell and won't understand '-ao'. So, in the example above, I'm saving the value returned by mktemp as FILE and then using that value later.

Hope this helps.
 

Apollo · Wizard · Joined: Jun 13, 2013 · Messages: 1,458
Thanks, it now works.
I just corrected a typo in your code.
Code:
find /mnt/my_pool/ -print0 | xargs -0 -P2 -n100 sh -c 'FILE="$(mktemp ./tmp/XXXXXX)"; /usr/bin/time -ao "$FILE" shasum "$@" > "$FILE"' sh

I have to iron out a thing or two to deal with the actual collected data.
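For instance, to flag files with identical content, I'm thinking of printing every repeated hash, something along these lines (an untested sketch; it assumes the hash is the first field on each line and that non-hash lines, such as the timing output, are filtered out first):

Code:
# Print every entry whose hash has already appeared once before
awk 'seen[$1]++' shasum_my_pool.txt > duplicates.txt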
 

fracai · Guru · Joined: Aug 22, 2012 · Messages: 1,212
Cool, glad I could help. I didn't initially know about "xargs -P" and I'll definitely be using it myself.
 

Apollo · Wizard · Joined: Jun 13, 2013 · Messages: 1,458
Some updates:
I ran the command last night, and when I checked the file that is supposed to contain all the concatenated output, it was empty except for the dates I placed before and after the run in order to know how long the process really took.
I thought I had a "carriage return" issue on the line, but no.
I ran the "cat" command manually and got:
"/bin/cat: Argument list too long."

Who knew there would be a FreeBSD limitation on the command itself. So much for ZFS.
 

SweetAndLow · Sweet'NASty · Joined: Nov 6, 2013 · Messages: 6,421
Cat doesn't work with multiple files and I suspect that is what that error message means.
 

Apollo · Wizard · Joined: Jun 13, 2013 · Messages: 1,458
SweetAndLow said: "Cat doesn't work with multiple files and I suspect that is what that error message means."
I did some research online, and it seems "cat" has a limit. Some of the posts I have seen recommend using a filter to reduce the number of files passed at once. That doesn't seem to be a viable solution when using "mktemp".
 

fracai · Guru · Joined: Aug 22, 2012 · Messages: 1,212
What? Of course you can cat multiple files. The error is a limit on how many files can be passed in at once.

I'd solve this by looping over the files.

Code:
for FILE in ./tmp/* ; do cat "$FILE" ; done > all_files.txt
 

fracai · Guru · Joined: Aug 22, 2012 · Messages: 1,212
If the glob still doesn't work, you can use find.

Code:
find ./tmp/ -type f -print0 | xargs -0 cat > all_files.txt


I'd probably go with find.
 

Apollo · Wizard · Joined: Jun 13, 2013 · Messages: 1,458
fracai said:
If the glob still doesn't work, you can use find.

Code:
find ./tmp/ -type f -print0 | xargs -0 cat > all_files.txt


I'd probably go with find.
It does the trick, but is not very fast.

Also, I was trying to delete the contents of the folder with the "rm ./tmp/*" command, and I got the same error.
I am using the following command to remove the files one at a time:
Code:
find ./tmp/ -type f -exec rm {} \;


From a ZFS viewpoint I don't understand why the base commands have limitations, and from a FreeBSD perspective, why we have to go through so many hoops to run what ought to be a fairly simple function.
I guess I get what I paid for.
 

fracai · Guru · Joined: Aug 22, 2012 · Messages: 1,212
To delete the whole folder you can use: rm -r

And this isn't an issue exclusive to ZFS or BSD; it's a limitation of the shell and the tools, which cap how much can be passed in to avoid buffer overflows and other problems. There are solutions, like using xargs, to overcome these limits, and they work quite well.
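
For the temp-file cleanup, for example, the same xargs pattern gets around the limit without spawning one rm per file:

Code:
# Delete in large batches instead of one rm process per file
find ./tmp/ -type f -print0 | xargs -0 rm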

I suspect the slowness you're experiencing is due to the number of files that have been generated.

What's the output of: "ls ./tmp | wc -l"

That'll tell you how many files there are.
 

Apollo · Wizard · Joined: Jun 13, 2013 · Messages: 1,458
I had about 275K files.
 

jgreco · Resident Grinch · Joined: May 29, 2011 · Messages: 18,680
Apollo said:
It does the trick, but is not very fast.

Also, I was trying to delete the contents of the folder with the "rm ./tmp/*" command, and I got the same error.
I am using the following command to remove the files one at a time:
Code:
find ./tmp/ -type f -exec rm {} \;


From a ZFS viewpoint I don't understand why the base commands have limitations, and from a FreeBSD perspective, why we have to go through so many hoops to run what ought to be a fairly simple function.
I guess I get what I paid for.

The base commands don't have limitations; it's the argument processing in UNIX that's biting you. Shell expansion is handled by the shell, and the processed arguments are then passed to the command; that's why there's such uniformity in things like wildcard expansion across UNIX. There are limits so that people don't try to pass a trillion arguments to a command. You can run a system out of resources.
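
If you're curious what the actual ceiling is on your box, FreeBSD will tell you:

Code:
# The kernel's limit on the combined size of arguments and environment (bytes)
sysctl kern.argmax
getconf ARG_MAX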
 