
I have a very large text file to parse for some information. On each line I check for certain keywords (I call them "flags"). Once I find a flag, I gather the data that comes right after it (usually just a name or number). To find the information after the flag, I use the method below (which works):

def findValue(string, flag):
    string = string.strip()
    startIndex = string.find(flag) + len(flag)
    index = startIndex
    char = string[index:index+1]
    while char != " " and index < len(string):
        index += 1
        char = string[index:index+1]
    endIndex = index
    return string[startIndex:endIndex]

However, it would be much easier to just use split() with whitespace as the separator and take the next item in the resulting list, rather than "crawling" the characters one at a time.

The log files I am parsing are really large (around 1.5 million lines or more), so I would like to know whether, and by how much, using split() on each line would hurt my efficiency compared to my current method.
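
Roughly, what I mean by the split() approach is something like this (just a sketch; it assumes the flag always appears as its own whitespace-separated token, and findValueSplit is only an illustrative name):

def findValueSplit(line, flag):
    # split the line on whitespace and take the item right after the flag
    parts = line.split()
    return parts[parts.index(flag) + 1]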

  • char = string[index:index+1] creates a new string on each loop iteration, which is very inefficient. split(string[startIndex:]) would be much faster than your current method. Commented Jul 15, 2015 at 19:13
  • Does string contain the entire contents of the file, or just a single line? Commented Jul 15, 2015 at 19:16
  • "string" is just a single line Commented Jul 15, 2015 at 19:25
  • Also, and correct me if I'm wrong, isn't string[i] equivalent to string[i:i+1]? And maybe a bit more efficient? Commented Jul 15, 2015 at 19:49

4 Answers


I did some timing tests using the string 'oabsecaosbeoiabsoeib;asdnvzldkxbcoszievbzldkvn.zlisebv;iszdb;vibzdlkv8niandsailbsdlivbslidznclkxvnlidbvlzidbvlzidbvlkxnv', searching for '8', running each method 100,000 times:

Your Method: 2.156 seconds

str.split: 0.151 seconds

Another test that is somewhat more realistic, using the string 'hello this is for stack overflow and i absolutely hate typing unecessary characters':

Your Method: 0.317 seconds

str.split: 0.267 seconds

A final test, with the above string repeated 100 times:

Your Method: 0.325 seconds

str.split: 7.376 seconds

Make of that what you will.

In your case, with super large strings, I would definitely use your function!
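
For reference, numbers like these can be reproduced with the timeit module. A minimal sketch, assuming findValue is the function from the question and using the second test string above:

import timeit

# the question's findValue, copied here so the snippet is self-contained
def findValue(string, flag):
    string = string.strip()
    startIndex = string.find(flag) + len(flag)
    index = startIndex
    char = string[index:index+1]
    while char != " " and index < len(string):
        index += 1
        char = string[index:index+1]
    return string[startIndex:index]

line = 'hello this is for stack overflow and i absolutely hate typing unecessary characters'

# time 100,000 calls of each approach
print(timeit.timeit(lambda: findValue(line, 'stack'), number=100000))
print(timeit.timeit(lambda: line.split(), number=100000))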


2 Comments

The last difference is probably due to the fact that his function only looks for the first occurrence of ' ', whereas split would look for all the occurrences of ' '.
Thank you for that data, the lines are in fact very long. You just saved me a lot of work changing my findValue implementation.

Python's split() function is almost certainly written in C, which means it will be faster than equivalent code written in Python. However, if you are just calling split() on a single line (not all 1.5 million of them at once), the difference won't be huge.

However, why bother even using split() when you just need the next item in the list? This may be the most efficient approach of all:

def findValue(string, flag):
    startIndex = string.find(flag) + len(flag)
    endIndex = string.find(' ', startIndex)
    if endIndex == -1:
        return string[startIndex:]
    else:
        return string[startIndex:endIndex]
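
For example, with a hypothetical log line of the kind described in the question (the line and flag here are made up for illustration):

line = "timestamp=1234 user=alice status=ok"
print(findValue(line, "user="))   # prints "alice"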

1 Comment

You might want to include the case where the value extends to the end of the line (no space left in the string). In that case just make endIndex = len(string) + 1. In your implementation, it's -1 if no space char is found, which truncates the last character (the second slicing parameter is exclusive).

Suppose you have a file object pointing to the file:

current_item = ""
char = file.read(1)
while char:
    if char != " ":
        current_item += char
    else:
        do_something_about_the_item(current_item)
        current_item = ""
    char = file.read(1)  # read the next character so the loop advances

Comments


You can try Python's regular expression tool, the re module, which is particularly suited for parsing text files. Some examples: http://www.thegeekstuff.com/2014/07/python-regex-examples/
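
A minimal sketch of that idea (the sample line and pattern are illustrative, not taken from the question):

import re

line = "timestamp=1234 user=alice status=ok"
# capture the run of non-whitespace characters that immediately follows the flag
match = re.search(r"user=(\S+)", line)
if match:
    print(match.group(1))  # prints "alice"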

Comments
