22

I have a job that runs every night to pull xml files from a directory that has over 20,000 subfolders under the root. Here is what the structure looks like:

rootFolder/someFolder/someSubFolder/xml/myFile.xml
rootFolder/someFolder/someSubFolder1/xml/myFile1.xml
rootFolder/someFolder/someSubFolderN/xml/myFile2.xml
rootFolder/someFolder1
rootFolder/someFolderN

So looking at the above, the structure is always the same - a root folder, then two subfolders, then an xml directory, and then the xml file. Only the name of the rootFolder and the xml directory are known to me.

The code below traverses through all the directories and is extremely slow. Any recommendations on how I can optimize the search especially if the directory structure is known?

string[] files = Directory.GetFiles(@"\\somenetworkpath\rootFolder", "*.xml", SearchOption.AllDirectories); 
3 Comments
  • Are you looking for a particular xml files or do you want a list of all of them? Commented Apr 3, 2009 at 14:32
  • I am looking for all xml files Commented Apr 3, 2009 at 14:34
  • If the list of xml files doesn't change over time, you could just build the list once and read all the files from the list. But you probably knew that. Commented Apr 3, 2009 at 14:46

9 Answers

18

Rather than calling GetFiles and doing a brute-force search, you could use GetDirectories: first get the list of first-level subfolders, loop through those, repeat the process for the second level, then look for the xml folder, and finally search only there for .xml files.

Now, as for performance, the speed will vary, but searching for directories first and only then touching files should help a lot!

Update

Ok, I did a quick bit of testing and you can actually optimize it much further than I thought.

The following code snippet will search a directory structure and find ALL "xml" folders inside the entire directory tree.

string startPath = @"C:\Testing\Testing\bin\Debug";
string[] oDirectories = Directory.GetDirectories(startPath, "xml", SearchOption.AllDirectories);

Console.WriteLine(oDirectories.Length.ToString());
foreach (string oCurrent in oDirectories)
    Console.WriteLine(oCurrent);
Console.ReadLine();

If you drop that into a test console app you will see it output the results.

Now, once you have this, just look in each of the found directories for your .xml files.
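Putting the two steps together, a minimal sketch assuming the same layout as the question (the helper name is mine):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class XmlFolderSearch
{
    // Step 1: one GetDirectories call finds every "xml" folder in the tree.
    // Step 2: only those folders are then scanned for .xml files.
    public static List<string> FindXmlFiles(string root)
    {
        var files = new List<string>();
        foreach (string xmlDir in Directory.GetDirectories(root, "xml", SearchOption.AllDirectories))
            files.AddRange(Directory.GetFiles(xmlDir, "*.xml"));
        return files;
    }
}
```

Usage would be something like `var files = XmlFolderSearch.FindXmlFiles(@"\\somenetworkpath\rootFolder");`.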


1 Comment

What if I wanted to search for a specific text in the xml files and only collect those files, how can that be done fast and efficiently?
6

I created a recursive method, GetFolders, that uses Parallel.ForEach to find all the folders whose name matches the variable yourKeyword:

List<string> returnFolders = new List<string>();
object locker = new object();

Parallel.ForEach(subFolders, subFolder =>
{
    if (subFolder.ToUpper().EndsWith(yourKeyword))
    {
        lock (locker)
        {
            returnFolders.Add(subFolder);
        }
    }
    else
    {
        lock (locker)
        {
            returnFolders.AddRange(GetFolders(Directory.GetDirectories(subFolder)));
        }
    }
});

return returnFolders;
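For anyone copying this, here is a self-contained sketch of the same idea with the method signature filled in (the parameter names are my assumptions; note the keyword must be upper-case, because the paths are upper-cased before comparison):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

static class ParallelFolderSearch
{
    // Recursively collects every folder whose path ends with keyword,
    // scanning sibling folders in parallel with Parallel.ForEach.
    public static List<string> GetFolders(string[] subFolders, string keyword)
    {
        var returnFolders = new List<string>();
        object locker = new object();

        Parallel.ForEach(subFolders, subFolder =>
        {
            if (subFolder.ToUpper().EndsWith(keyword))
            {
                lock (locker)
                    returnFolders.Add(subFolder);
            }
            else
            {
                // Recurse outside the lock so parallel branches don't serialize.
                var nested = GetFolders(Directory.GetDirectories(subFolder), keyword);
                lock (locker)
                    returnFolders.AddRange(nested);
            }
        });

        return returnFolders;
    }
}
```

A caller might then do `var xmlFolders = ParallelFolderSearch.GetFolders(Directory.GetDirectories(rootPath), "XML");` and read the files from each returned folder.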

Comments

3

Are there additional directories at the same level as the xml folder? If so, you could probably speed up the search if you do it yourself and eliminate that level from searching.

System.IO.DirectoryInfo root = new System.IO.DirectoryInfo(rootPath);
List<System.IO.FileInfo> xmlFiles = new List<System.IO.FileInfo>();

foreach (System.IO.DirectoryInfo subDir1 in root.GetDirectories())
{
    foreach (System.IO.DirectoryInfo subDir2 in subDir1.GetDirectories())
    {
        System.IO.DirectoryInfo xmlDir = new System.IO.DirectoryInfo(System.IO.Path.Combine(subDir2.FullName, "xml"));
        if (xmlDir.Exists)
        {
            xmlFiles.AddRange(xmlDir.GetFiles("*.xml"));
        }
    }
}
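On .NET 4 and later, the same fixed-depth walk can be written lazily with EnumerateDirectories/EnumerateFiles, so results stream out instead of being buffered into arrays first. A sketch under that assumption:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class TwoLevelScan
{
    // Walks exactly two directory levels, then peeks only into the
    // "xml" folder at each leaf; enumeration is lazy throughout.
    public static IEnumerable<FileInfo> XmlFiles(string rootPath)
    {
        return from subDir1 in new DirectoryInfo(rootPath).EnumerateDirectories()
               from subDir2 in subDir1.EnumerateDirectories()
               let xmlDir = new DirectoryInfo(Path.Combine(subDir2.FullName, "xml"))
               where xmlDir.Exists
               from file in xmlDir.EnumerateFiles("*.xml")
               select file;
    }
}
```

Because nothing is materialized up front, you can start processing the first file while the rest of the 20,000 folders are still being scanned.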

Comments

1

For file and directory searches I would suggest a multithreaded .NET library with a wide range of search options. All information about the library is on GitHub: https://github.com/VladPVS/FastSearchLibrary You can download it here: https://github.com/VladPVS/FastSearchLibrary/releases If you have any questions, please ask.

Works really fast. Check it yourself!

Here is one example of how you can use it:

class Searcher
{
    private static object locker = new object();

    private FileSearcher searcher;
    List<FileInfo> files;

    public Searcher()
    {
        files = new List<FileInfo>();
    }

    public void StartSearch()
    {
        CancellationTokenSource tokenSource = new CancellationTokenSource();
        searcher = new FileSearcher(@"C:\", (f) =>
        {
            return Regex.IsMatch(f.Name, @".*[Dd]ragon.*.jpg$");
        }, tokenSource);

        searcher.FilesFound += (sender, arg) =>
        {
            lock (locker) // using a lock is obligatory
            {
                arg.Files.ForEach((f) =>
                {
                    files.Add(f);
                    Console.WriteLine($"File location: {f.FullName}, \nCreation time: {f.CreationTime}");
                });

                if (files.Count >= 10)
                    searcher.StopSearch();
            }
        };

        searcher.SearchCompleted += (sender, arg) =>
        {
            if (arg.IsCanceled)
                Console.WriteLine("Search stopped.");
            else
                Console.WriteLine("Search completed.");

            Console.WriteLine($"Quantity of files: {files.Count}");
        };

        searcher.StartSearchAsync();
    }
}

Here is part of another example:

***
List<string> folders = new List<string>
{
    @"C:\Users\Public",
    @"C:\Windows\System32",
    @"D:\Program Files",
    @"D:\Program Files (x86)"
}; // list of search directories

List<string> keywords = new List<string> { "word1", "word2", "word3" }; // list of search keywords

FileSearcherMultiple multipleSearcher = new FileSearcherMultiple(folders, (f) =>
{
    if (f.CreationTime >= new DateTime(2015, 3, 15) &&
        (f.Extension == ".cs" || f.Extension == ".sln"))
        foreach (var keyword in keywords)
            if (f.Name.Contains(keyword))
                return true;

    return false;
}, tokenSource, ExecuteHandlers.InCurrentTask, true);
***

Moreover, one can use a simple static method:

List<FileInfo> files = FileSearcher.GetFilesFast(@"C:\Users", "*.xml"); 

Note that all methods of this library DO NOT throw UnauthorizedAccessException, unlike the standard .NET search methods.

Furthermore, the fast methods of this library run at least 2 times faster than a simple single-threaded recursive algorithm on a multicore processor.

Comments

0

The only change I can see making much difference is to move away from a brute-force hunt and use a third-party or OS indexing service to speed up the lookup; that way the search is done offline from your app.

But I would also suggest you look at better ways to structure that data, if at all possible.

Comments

0

Use P/Invoke on FindFirstFile/FindNextFile/FindClose and avoid the overhead of creating lots of FileInfo instances.

But this will be hard work to get right (you will have to do all the handling of file vs. directory and recursion yourself). So try something simple (Directory.GetFiles(), Directory.GetDirectories()) to start with and get things working. If it is too slow, look at alternatives (but always measure; it is too easy to make things slower).
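The "something simple" starting point might look like the sketch below: a hand-rolled recursion over GetFiles/GetDirectories that also swallows the access errors a deep network share tends to throw (the helper name is mine):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class SafeScan
{
    // Manual recursion: one try/catch per directory, so a single
    // unreadable folder doesn't abort the whole traversal.
    public static IEnumerable<string> Files(string dir, string pattern)
    {
        string[] files = null, subDirs = null;
        try
        {
            files = Directory.GetFiles(dir, pattern);
            subDirs = Directory.GetDirectories(dir);
        }
        catch (UnauthorizedAccessException) { /* skip folders we can't read */ }

        if (files != null)
            foreach (string f in files)
                yield return f;

        if (subDirs != null)
            foreach (string d in subDirs)
                foreach (string f in Files(d, pattern))
                    yield return f;
    }
}
```

This is also exactly the file-vs-directory bookkeeping you would have to replicate on top of FindFirstFile/FindNextFile, so it makes a fair baseline to measure any P/Invoke version against.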

3 Comments

  • Do you know of some FindFirstFile/FindNextFile/FindClose wrapper for .NET that acts as an IEnumerable?
  • @DanielMošmondor System.IO.Directory now has such methods (added in .NET 4), e.g. EnumerateDirectories.
  • It's not something that I'm proud of, but I'm still on .NET 2 :)
0

Depending on your needs and configuration, you could utilize the Windows Search Index: https://msdn.microsoft.com/en-us/library/windows/desktop/bb266517(v=vs.85).aspx

Depending on your configuration this could increase performance greatly.

Comments

0

For those of you who want to search for a single file and know the root directory, I suggest you keep it as simple as possible. This approach worked for me.

private void btnSearch_Click(object sender, EventArgs e)
{
    string userinput = txtInput.Text;
    string sourceFolder = @"C:\mytestDir\";
    string searchWord = txtInput.Text + ".pdf";
    string filePresentCK = sourceFolder + searchWord;

    if (File.Exists(filePresentCK))
    {
        pdfViewer1.LoadFromFile(sourceFolder + searchWord);
    }
    else
    {
        MessageBox.Show("Unable to Find file :" + searchWord);
    }

    txtInput.Clear();
} // end of btnSearch method

Comments

-1

I can't think of anything faster in C#, but do you have indexing turned on for that file system?

1 Comment

Indexing shouldn't do much for simple traversal.
