
I need to create a script to search through just under a million files of text, code, etc. to find matches, and then output all hits on a particular string pattern to a CSV file.

So far I have made this:

$location = 'C:\Work*'
$arr = "foo", "bar" # where "foo" and "bar" are string patterns I want to search for (separately)

for ($i = 0; $i -lt $arr.Length; $i++) {
    Get-ChildItem $location -Recurse |
        Select-String -Pattern $arr[$i] |
        Select-Object Path |
        Export-Csv "C:\Work\Results\$($arr[$i]).txt"
}

This returns a CSV file named "foo.txt" listing all files that contain the word "foo", and a file named "bar.txt" listing all files that contain the word "bar".

Can anyone think of a way to optimize this script to make it run faster? Or ideas for an entirely different, but equivalent, script that is simply faster?

All input appreciated!

5 Comments
  • How long does it take now (just out of curiosity)? Do you need only the file paths that contain matches in the output? Commented Jan 11, 2011 at 12:18
  • Now it takes ~2 hours per item in the array. I just learned the Measure-Command trick a bit ago (see the sketch after these comments); I'll see if performance increases as the process gets cached. -- I do only need the file paths that contain matches, yes. Commented Jan 11, 2011 at 12:24
  • I can also add that the length of each array item (string) seems to significantly affect processing time. CPU usage was around 15-20% during the first run-through. Now it seems to be around 4-5%. Interesting stuff. Commented Jan 11, 2011 at 12:26
  • Are your files small enough to read all the text into memory, or is that not an option? Commented Jan 11, 2011 at 12:27
  • The total size of the files would be too big, but that is an interesting thought. If I could cache it all in RAM, I would be willing to split the operation and cache one subdirectory at a time before performing the search. Do you have any ideas on how to implement that? Commented Jan 11, 2011 at 12:45
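For reference, the Measure-Command timing mentioned in the comments can be done along these lines (a minimal sketch, reusing the question's path and one of its patterns):

# time one pass over the tree for a single pattern
Measure-Command {
    Get-ChildItem 'C:\Work*' -Recurse | Select-String -Pattern 'foo' | Out-Null
} | Select-Object TotalMinutes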

2 Answers


If your files are not huge and can be read into memory, then this version should work considerably faster (my quick and dirty local test seems to confirm that):

$location = 'C:\ROM'
$arr = "Roman", "Kuzmin"

# remove output files from previous runs
foreach ($test in $arr) {
    Remove-Item ".\$test.txt" -ErrorAction 0 -Confirm
}

Get-ChildItem $location -Recurse | .{process{
    if (!$_.PSIsContainer) {
        # read all of the file's text once
        $content = [System.IO.File]::ReadAllText($_.FullName)
        # test each pattern and output the path at most once per pattern
        foreach ($test in $arr) {
            if ($content -match $test) {
                $_.FullName >> ".\$test.txt"
            }
        }
    }
}}

Notes: 1) mind the changed paths and patterns in the example; 2) the output files are plain text, not CSV; there is not much point in CSV if you are only interested in paths: plain text files with one path per line will do.
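If CSV output is still needed downstream, the plain-text lists are easy to convert afterwards. A minimal sketch (assuming the Roman.txt file produced above; the calculated Path property just supplies the column header):

# turn a one-path-per-line text file into a CSV with a Path column
Get-Content '.\Roman.txt' |
    Select-Object @{ Name = 'Path'; Expression = { $_ } } |
    Export-Csv '.\Roman.csv' -NoTypeInformation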


3 Comments

Awesome, you were 2 seconds faster!
:-) But our suggestions are not exactly the same. Thus, @cc0 now has more options to choose from, and that's a good thing.
This is excellent :] Hopefully others will be able to learn from this as well. Thank you for taking the time!

Let's suppose that 1) the files are not too big and you can load them into memory, and 2) you really just want the Path of each matching file (not the matching line, etc.).

I tried to read each file only once and then iterate through the regexes. There is some gain (it's faster than the original solution), but the final result will depend on other factors like file sizes, the number of files, etc.

Also, removing 'ignorecase' makes it a little faster.

$res = @{}
$arr | % { $res[$_] = @() }

Get-ChildItem $location -Recurse |
    ? { !$_.PsIsContainer } |
    % {
        $file = $_
        # read each file's content only once
        $text = [Io.File]::ReadAllText($file.FullName)
        $arr | % {
            $regex = $_
            if ([Regex]::IsMatch($text, $regex, 'ignorecase')) {
                # += appends, so files matched earlier are not overwritten
                $res[$regex] += $file.FullName
            }
        }
    }

$res.GetEnumerator() | % { $_.Value | Export-Csv "d:\temp\so-res$($_.Key).txt" }
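A further tweak in the same spirit (not part of the answer above, just a sketch): since the same patterns are matched against every file, the Regex objects can be constructed once with the Compiled option, instead of handing a pattern string to the static IsMatch call for each file:

# build each regex once, up front ('IgnoreCase, Compiled' keeps the original behavior)
$compiled = @{}
foreach ($pattern in $arr) {
    $compiled[$pattern] = New-Object System.Text.RegularExpressions.Regex $pattern, 'IgnoreCase, Compiled'
}

# ...then, inside the per-file loop, reuse the prebuilt instance:
# if ($compiled[$regex].IsMatch($text)) { $res[$regex] += $file.FullName }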

2 Comments

Thank you :) I will give this a shot also and see which is faster for my situation. Should be interesting!
I'll do that as soon as I have them :] Might take a couple of days, I'll do some proper testing here with many items.
