I have a large (2.2 GB) delimited text file that holds chemical paths, which I search when I want to go from chemical A to chemical B. Does anyone know of a way (preferably in Python) to sort the file by the number of columns in each row?
Example:
CSV:
A B C D
E F G
H I
J K L M N

Should sort to:

H I
E F G
A B C D
J K L M N

I've been thinking of making a hashtable of row lengths and rows, but as the CSV files get larger, I anticipate this approach may not be the fastest. (We're running longest path on a chemical network, and the 2.2 GB file, about 30 million paths, only covers lengths <= 10.)
Build an index of (length, pointer_to_line) tuples. The length is easy, because you could just do len(row.split()). The pointer to the line could be obtained through f.tell() or something similar. Sort the index. Once sorted, use it to grab lines out of your chemical paths file in order, which you can write to a new file. Edit: This post might be helpful.
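
As a rough illustration, the index-and-seek idea could be sketched like this. The function name `sort_by_column_count` and the file names in the usage line are hypothetical, and the sketch assumes whitespace-delimited rows (matching `len(row.split())`). It tracks byte offsets by hand instead of calling `f.tell()`, because `tell()` is not usable while iterating over a text-mode file in Python 3.

```python
def sort_by_column_count(src_path, dst_path):
    """Copy the lines of src_path to dst_path, ordered by number of fields.

    Hypothetical helper: only a (field_count, byte_offset) pair per line
    is held in memory, not the lines themselves.
    """
    index = []
    with open(src_path, "rb") as src:
        # Pass 1: record the field count and starting byte offset of each line.
        offset = 0
        for line in src:
            index.append((len(line.split()), offset))
            offset += len(line)

        # Tuples sort by field count first, then by offset, so rows with
        # the same number of fields keep their original relative order.
        index.sort()

        # Pass 2: seek to each line in sorted order and copy it out.
        with open(dst_path, "wb") as dst:
            for _, line_offset in index:
                src.seek(line_offset)
                line = src.readline()
                # The last line of the source may lack a trailing newline.
                dst.write(line if line.endswith(b"\n") else line + b"\n")
```

It could be called as `sort_by_column_count("paths.txt", "paths_sorted.txt")`. Two caveats: with tens of millions of rows the list of tuples is itself sizeable, so a more compact index may be needed, and the second pass does one seek per line, which can be slow on spinning disks.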