So I have about 400 files, ranging from 10 KB to 56 MB in size, of type .txt/.doc(x)/.pdf/.xml, and I have to read them all. My file-reading code is basically:
    # for txt files
    with open("TXT\\" + path, 'r') as content_file:
        content = content_file.read().split(' ')

    # for doc files, using python-docx
    contents = '\n'.join([para.text for para in doc.paragraphs]).encode("ascii", "ignore").decode("utf-8").split(' ')

    # for pdf files, using PyPDF2
    content = ""
    for i in range(0, pdf.getNumPages()):
        content += pdf.getPage(i).extractText() + "\n"
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    contents = content.encode("ascii", "ignore").decode("utf-8").split(' ')

    # for xml files, using lxml
    tree = etree.parse(path)
    contents = etree.tostring(tree, encoding='utf8', method='text')
    contents = contents.decode("utf-8").split(' ')

But I notice that even reading 30 text files of under 50 KB each and doing operations on them takes 41 seconds, while reading a single 56 MB text file takes 9 seconds. So I'm guessing it's the file I/O that's slowing me down rather than my program.
Any ideas on how to speed this process up? Maybe split each file type off into four different threads? But how would you go about doing that, since the threads would be sharing the same list, and that single list has to be written to a file when they are done.
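One way to structure that is a worker pool where each worker builds and returns its own word list, and only the parent process merges the lists and writes the file, so nothing is shared while the workers run. A minimal sketch, assuming a hypothetical parse_file() that dispatches on the file extension (the txt case is shown as a stand-in) and hypothetical input paths:

    from concurrent.futures import ProcessPoolExecutor

    def parse_file(path):
        # hypothetical dispatcher: pick the txt/docx/pdf/xml reader based on the
        # file extension and return that file's list of words (txt shown as a stand-in)
        with open(path, 'r', errors='ignore') as f:
            return f.read().split(' ')

    def parse_all(paths, out_path="output.txt"):
        all_words = []
        # four worker processes, one task per file; pool.map() hands the results back
        # to the parent in order, so each worker only ever touches its own list
        with ProcessPoolExecutor(max_workers=4) as pool:
            for words in pool.map(parse_file, paths):
                all_words.extend(words)
        # only the parent process touches the combined list, so no locking is needed
        with open(out_path, 'w') as out:
            out.write(' '.join(all_words))

    if __name__ == '__main__':
        parse_all(["TXT\\a.txt", "TXT\\b.txt"])  # hypothetical input paths

Parsing .pdf and .docx with pure-Python libraries is mostly CPU work, so processes tend to help more than threads there; if it turns out the time really is disk I/O, swapping in ThreadPoolExecutor (same code, different import) works too, since the workers then spend most of their time waiting rather than computing.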
First, time how long it takes to just read the files and do nothing, vs. how long it takes to do your processing. If it's about the same, you're right, it's definitely the I/O time. If it's a lot faster… well, it might still be I/O time (e.g., maybe the module you're using does a lot of inefficient seeks or small reads), but it might not be, so you need to profile.
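A rough way to make that comparison, assuming a hypothetical process_file() standing in for the existing per-file read + parse + split logic:

    import time
    from pathlib import Path

    def process_file(p):
        # hypothetical stand-in for the existing per-file read + parse + split code
        return p.read_text(errors="ignore").split(' ')

    paths = list(Path("TXT").glob("*.txt"))  # hypothetical test set

    # 1) raw I/O only: read the bytes and throw them away
    start = time.perf_counter()
    for p in paths:
        p.read_bytes()
    io_time = time.perf_counter() - start

    # 2) full pipeline: read + parse + split
    start = time.perf_counter()
    for p in paths:
        process_file(p)
    total_time = time.perf_counter() - start

    print("raw read: %.2fs, read + processing: %.2fs" % (io_time, total_time))

If the two numbers are close, the time really is going into reading the files; if the second is much larger, running cProfile over the per-file function will show which parser the time is actually going into.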