
I have a large (2.2 GB) delimited text file that holds chemical paths, which I search when I want to go from chemical A to chemical B. I'm wondering if anyone knows of a way (preferably in Python) to sort the file by the number of columns in each row.

Example:

CSV:

```
A B C D
E F G
H I
J K L M N
```

Should sort to:

```
H I
E F G
A B C D
J K L M N
```

I've been thinking of making a hashtable mapping row lengths to rows, but as the CSV files get larger (we're running longest-path on a chemical network, and the 2.2 GB file (30 million paths) only contains paths of length <= 10), I anticipate this approach may not be the fastest.
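For reference, a minimal in-memory sketch of the straightforward approach (this only works while the whole file fits in RAM, which a 2.2 GB file likely won't; the sample rows below stand in for the real file):

```python
# In-memory sketch: sort rows by how many whitespace-delimited
# columns each one has.  Sample rows stand in for the real file.
rows = [
    "A B C D\n",
    "E F G\n",
    "H I\n",
    "J K L M N\n",
]

# len(line.split()) counts the whitespace-delimited columns.
rows.sort(key=lambda line: len(line.split()))

for line in rows:
    print(line, end="")
```

Python's sort is stable, so rows with the same column count keep their original relative order.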

  • I would build an index and sort that. For example, for each line in your chemical paths file, create a tuple in the index of (length, pointer_to_line). The length is easy, because you could just do len(row.split()). The pointer to the line could be done through f.tell() or something similar. Sort the index. Once sorted, use it to grab lines out of your chemical paths file in order, which you can write to a new file. Edit: This post might be helpful. Commented Jul 11, 2013 at 22:00
  • My first reaction is to put this data in a database rather than trying to forcibly work with a CSV (although perhaps you've already read the data from a database!). You'd have the benefits of a database and be able to use Python + SQL to do more types of data analysis in the future if the need arises. Commented Jul 11, 2013 at 22:04

1 Answer


I'd go for splitting them into separate files based on the length, then joining them back together afterwards - something like:

```python
from tempfile import TemporaryFile
from itertools import chain
```

Keep a dict mapping row length -> output file. If a file is already open for that length, write to it; otherwise create a new temporary file.

```python
output = {}
with open('input') as fin:
    for line in fin:
        length = len(line.split())
        output.setdefault(length, TemporaryFile()).write(line)
```

As Steven Rumbalski has pointed out, this can be also done with a defaultdict:

```python
from collections import defaultdict

output = defaultdict(TemporaryFile)
...
output[length].write(line)
```

The temporary files will all be pointing to the end of the file. Reset them to the beginning so that when reading through them we get the data again...

```python
for fh in output.values():
    fh.seek(0)
```

Take the rows from each file in increasing order of length... and write them all to the final output file.

```python
with open('output', 'w') as fout:
    fout.writelines(chain.from_iterable(v for k, v in sorted(output.iteritems())))
```

Python should then clean up the temporary files upon program exit...
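Collecting the steps above into one runnable Python 3 sketch (the answer's snippets are Python 2: `iteritems()` becomes `items()`, and `TemporaryFile` defaults to binary mode in Python 3, so the lambda opens it in text mode; the function name is hypothetical):

```python
from collections import defaultdict
from itertools import chain
from tempfile import TemporaryFile

def sort_file_by_columns(infile, outfile):
    """Bucket lines into temporary files by column count, then concatenate."""
    # One text-mode temp file per row length, created on first use.
    output = defaultdict(lambda: TemporaryFile('w+'))

    with open(infile) as fin:
        for line in fin:
            output[len(line.split())].write(line)

    # Rewind each bucket so the reads below start from the beginning.
    for fh in output.values():
        fh.seek(0)

    # Concatenate the buckets in increasing order of row length.
    with open(outfile, 'w') as fout:
        fout.writelines(chain.from_iterable(
            fh for _, fh in sorted(output.items())))
```

This makes only two sequential passes over the data, which is why it scales well to files that don't fit in memory.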


7 Comments

Very pythonic, but not really readable for a newcomer. It should be commented, in my opinion.
@btoueg annotated it slightly - let me know if you think it needs anything further
+1. I would have preferred output = defaultdict(TemporaryFile), but that's just a matter of taste. The word "should" in your final sentence reads like you are exhorting the OP to clean up the files. Better would be "Python should (will?) clean up the temporary files upon program exit." Also, the last word is misspelled.
@JonClements in case you're wondering how well your algorithm did, it took 2 hours and 36 minutes to sort 30,000,000 rows!
@Darkstarone that seems pitiful - Maybe a good 5 - 10 minutes on a slow disk, but 2 and a half hours? That doesn't sound right...