
I'll try to explain what I want to achieve with my code:

  1. I open a CSV file.
  2. I take every element of the first row and search for that string in every file in every subdirectory, starting from rootdir.

With the design shown below it is extremely slow, even with 2 directories and one file in each. It takes approximately 1 second for each entry in the main file, and I've got 400,000 records in that file...

    import csv
    import os

    rootdir = 'C:\Users\ST\Desktop\Sample'
    f = open('C:\Users\ST\Desktop\inputIds.csv')
    f.readline()
    snipscsv_f = csv.reader(f, delimiter=' ')
    for row in snipscsv_f:
        print 'processing another ID'
        for subdir, dir, files in os.walk(rootdir):
            print 'processing another folder'
            for file in files:
                print 'processing another file'
                if 'csv' in file:  # i want only csv files to be processed
                    ft = open(os.path.join(subdir, file))
                    for ftrow in ft:
                        if row[0] in ftrow:
                            print row[0]
                    ft.close()
  • I mean, this is going to be really, really slow because of the sheer volume of files you're looking through. Can you provide some context on what your end goal is here? There is likely a way to improve your setup substantially. Commented Apr 7, 2016 at 21:40
  • Holy wow. You are searching for, and reading, all the CSV files once for each line of your ID file. Instead, read your input ID file once, store the IDs in a variable, and read each CSV file once, checking whether any of the IDs are in each line. Also, once you have found a match, break out of the loop so you don't read the rest of the file. If the files you're searching are small enough, you may also get some speedup by reading them into memory instead of line by line. (A sketch of this approach follows these comments.) Commented Apr 7, 2016 at 21:43
  • Can you post some sample lines from your ID file and the other files? Commented Apr 7, 2016 at 21:47
  • cut -d, -f1 inputIds.csv > Ids.txt, then grep -f Ids.txt -r *.csv? (Edit: oh right, Windows. unxutils.sourceforge.net has Win32 builds of the GNU utils cut and grep, if you want.) Commented Apr 7, 2016 at 22:13
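
A minimal sketch of the approach from the second comment: load the IDs once, walk the tree once, and scan each CSV a single time. The paths and the space delimiter are taken from the question; breaking after the first hit per file is the commenter's suggestion, not a requirement.

    import csv
    import os

    rootdir = r'C:\Users\ST\Desktop\Sample'          # paths taken from the question
    ids_path = r'C:\Users\ST\Desktop\inputIds.csv'

    # read the ID file once; keep the first field of every row in a set
    with open(ids_path) as f:
        f.readline()                                 # skip the header line, as the original does
        ids = {row[0] for row in csv.reader(f, delimiter=' ') if row}

    # walk the tree once and scan every csv file a single time
    for subdir, dirs, files in os.walk(rootdir):
        for name in files:
            if not name.endswith('.csv'):
                continue
            path = os.path.join(subdir, name)
            with open(path) as ft:
                for line in ft:
                    if any(i in line for i in ids):
                        print('match in %s: %s' % (path, line.strip()))
                        break                        # stop reading this file after the first hit

With 400,000 IDs, `any(i in line for i in ids)` still does a lot of substring checks per line; if the IDs appear as whole fields, splitting the line and intersecting with the set (e.g. `ids.intersection(line.split())`) would be considerably faster.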

1 Answer


I know you have a large CSV file, but it is still MUCH quicker to read it all in once and compare against that, rather than performing the os.walk for every entry.

Also, I'm not sure that Python is the best tool for this. You may find shell scripts (on Windows, PowerShell is the only decent option) much faster for this kind of task. Anyway, you added the Python tag, so...

    import csv
    import fnmatch
    import os

    # load the csv with entries
    with open('file_with_entries.csv', 'r') as f:
        readr = csv.reader(f)
        data = []
        for row in readr:
            data.extend(row)

    # find csv files
    rootdir = os.getcwd()  # could be anywhere
    matches = []
    for root, dirs, files in os.walk(rootdir):
        for filename in fnmatch.filter(files, '*.csv'):
            matches.append(os.path.join(root, filename))

    # find occurrences of each entry in each file
    for eachcsv in matches:
        with open(eachcsv, 'r') as f:
            text = f.read()
            for entry in data:
                if entry in text:
                    print("found %s in %s" % (entry, eachcsv))

I'm not sure how critical it is that you only read the first row of the entries file; it would be reasonably easy to amend the code to do just that.
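
For example, if only the first row (or, as the question's original code effectively does, only the first field of each row) should be searched for, the loading loop above could be trimmed to one of these variants; `file_with_entries.csv` is the same placeholder name used in the answer:

    import csv

    # variant 1: take only the entries in the first row
    with open('file_with_entries.csv', 'r') as f:
        data = next(csv.reader(f))

    # variant 2: skip the header and keep only the first field of every row,
    # which matches what the question's code searches for
    with open('file_with_entries.csv', 'r') as f:
        readr = csv.reader(f)
        next(readr)
        data = [row[0] for row in readr if row]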


1 Comment

Hi, thanks for the input. The only problem is that there may be more than one match in each file, and I need to keep track of all of them.
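
One possible tweak to the answer's last loop to record every match per file rather than just printing them: the `found` dict and the use of `text.count()` are illustrative, building on the `matches` and `data` variables defined in the answer.

    # record how many times each entry occurs in each csv file
    found = {}                                  # maps csv path -> {entry: occurrence count}
    for eachcsv in matches:
        with open(eachcsv, 'r') as f:
            text = f.read()
        counts = {entry: text.count(entry) for entry in data if entry in text}
        if counts:
            found[eachcsv] = counts
            print("found %d matching entries in %s" % (len(counts), eachcsv))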
