3

I have a text file that have about 10 columns, 7 of those are date/time stamps (year, month, day, hour, minute, second, centisec). So the data looks roughly like...

User[TAB]System[TAB]Year[TAB]Month[TAB]Day ... centisec[TAB]Message 

Sorry for the horrible formatting, but I hope this gives you the idea.

So if I wanted to sort the file by years, I could use this

sorted_lines = sorted(unsortedfile,key=lambda l: int(l.split('\t')[2])) 

Take the unsorted file, split the lines by tabs, 3rd column, change it to an int and sort by that. I could do the same thing for any one column.

What I'm looking for is a better way to sort this by all the date/time columns. So sort by Year, then by month, then by day... etc

I can think of a few complicated ways of doing this (reading each line combining all the columns, sorting it... or doing a recursive sort by each column), but I'm hoping someone has a simpler, more pythonic, way to do the same thing.

3
  • How large is the file? Commented Apr 14, 2014 at 1:14
  • @dmcauslan I went to see how large the file would be on production, and ran into more bugs I have to fix before I can give an exact answer. My guess is around 150mb, but that's a wild guess. Commented Apr 14, 2014 at 2:22
  • Have you looked at any higher-level libraries that might help you here? I'm thinking primarily of pandas. Commented Apr 14, 2014 at 3:41

1 Answer 1

3

You can use csv module to parse the file with delimiter='\t' and apply sorted() on the reader object with a custom key function that parses the date to the datetime object:

import csv from datetime import datetime from pprint import pprint def sort_by_datetime(line): return datetime.strptime('{0}-{1}-{2}'.format(*line[2:5]), '%Y-%m-%d') with open('input.txt') as f: reader = csv.reader(f, delimiter='\t') pprint(sorted(reader, key=sort_by_datetime)) 

For the input.txt:

User1 System1 2013 1 31 User2 System2 2014 12 1 User3 System3 2012 12 31 User4 System4 2012 6 15 User5 System5 2014 1 1 

it would print:

[['User4', 'System4', '2012', '6', '15'], ['User3', 'System3', '2012', '12', '31'], ['User1', 'System1', '2013', '1', '31'], ['User5', 'System5', '2014', '1', '1'], ['User2', 'System2', '2014', '12', '1']] 
Sign up to request clarification or add additional context in comments.

2 Comments

or return datetime.datetime(*map(int, line[2:-1])) for sort_by_datetime() to be agnostic about the inclusion of hours, minutes, seconds, etc.
@alecxe This looks like exactly what I'm looking for. As I said to dmcauslan above though, I've found some other bugs preventing this from completing. It'll be tomorrow at the earliest. Wanted to let you know I saw this though and I'll come back when I get it working.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.