
My current Python project will require a lot of string splitting to process incoming packages. Since I will be running it on a pretty slow system, I was wondering what the most efficient way to go about this would be. The strings would be formatted something like this:

Item 1 | Item 2 | Item 3 <> Item 4 <> Item 5 

Explanation: This particular example would come from a list where the first two items are a title and a date, while item 3 to item 5 would be invited people (the number of those can be anything from zero to n, where n is the number of registered users on the server).

From what I see, I have the following options:

  1. Repeatedly use split()
  2. Use a regular expression and the regular-expression functions
  3. Some other Python functions I have not thought about yet (there are probably some)

Solution 1 would include splitting at | and then splitting the last element of the resulting list at <> for this example, while solution 2 would probably result in a regular expression like:

((.+)|)+((.+)(<>)?)+

Okay, this regular expression is horrible, I can see that myself. It is also untested. But you get the idea.
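For concreteness, option 1 could look something like the following sketch (the helper name split_two_step and the whitespace stripping are my own additions for illustration, not part of the format spec):

```python
def split_two_step(line):
    # Option 1: split on '|' first, then split the last field on '<>'
    # to recover the invited people.
    fields = line.split('|')
    people = fields[-1].split('<>')
    # Strip the padding spaces the example format carries; drop empty
    # entries so "Item 1 | Item 2 |" yields an empty people list.
    return ([f.strip() for f in fields[:-1]],
            [p.strip() for p in people if p.strip()])

head, invited = split_two_step('Item 1 | Item 2 | Item 3 <> Item 4 <> Item 5')
# head    -> ['Item 1', 'Item 2']
# invited -> ['Item 3', 'Item 4', 'Item 5']
```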

Now, I am looking for the way that a) takes the least amount of time and b) ideally uses the least amount of memory. If only one of the two is possible, I would prefer less time. The ideal solution would also work for strings that have more items separated with | and strings that completely lack the <>. At least the regular expression-based solution would do that.

My understanding would be that split() would use more memory (since you basically get two resulting lists, one that splits at | and a second that splits at <>), but I don't know enough about Python's regular-expression implementation to judge how the regex would perform. split() is also less dynamic than a regular expression when it comes to different numbers of items and the absence of the second separator. Still, I can't shake the impression that Python can do this better without regular expressions, which is why I am asking.

Some notes:

  • Yes, I could just benchmark both solutions, but I'm trying to learn something about Python in general and how it works here, and if I just benchmark these two, I still don't know what Python functions I have missed.
  • Yes, optimizing at this level is only really required for high-performance stuff, but as I said, I am trying to learn things about Python.
  • Addition: in the original question, I completely forgot to mention that I need to be able to distinguish the parts that were separated by | from the parts with the separator <>, so a simple flat list as generated by re.split(r"\||<>", input) (as proposed by obmarg) will not work too well. Solutions fitting this criterion are much appreciated.

To sum the question up: Which solution would be the most efficient one, for what reasons?

Due to multiple requests, I have run some timeit benchmarks on the split() solution and the first regular expression proposed by obmarg, as well as on the solutions by mgibsonbr and duncan:

import timeit
import re

def splitit(input):
    res0 = input.split("|")
    res = []
    for element in res0:
        t = element.split("<>")
        if t != [element]:
            res0.remove(element)
            res.append(t)
    return (res0, res)

def regexit(input):
    return re.split(r"\||<>", input)

def mgibsonbr(input):  # Solution by mgibsonbr
    items = re.split(r'\||<>', input)  # Split input in items
    offset = 0
    result = []  # The result: strings for regular items, lists for <> separated ones
    acc = None
    for i in items:
        delimiter = '|' if offset + len(i) < len(input) and input[offset + len(i)] == '|' else '<>'
        offset += len(i) + len(delimiter)
        if delimiter == '<>':  # Will always put the item in a list
            if acc is None:
                acc = [i]  # Create one if it doesn't exist
                result.append(acc)
            else:
                acc.append(i)
        else:
            if acc is not None:  # If there was a list, put the last item in it
                acc.append(i)
            else:
                result.append(i)  # Add the regular items
            acc = None  # Clear the list; what comes next is a regular item or a new list
    return result

def split2(input):  # Solution by duncan
    res0 = input.split("|")
    res1, res2 = [], []
    for r in res0:
        if "<>" in r:
            res2.append(r.split("<>"))
        else:
            res1.append(r)
    return res1, res2

SHORT = 'a|b|c|de|f<>ge<>ah'
LONG = SHORT * 11  # the same item string repeated eleven times

print "mgibs: ", timeit.Timer("mgibsonbr(SHORT)", "from __main__ import mgibsonbr, SHORT").timeit()
print "split: ", timeit.Timer("splitit(SHORT)", "from __main__ import splitit, SHORT").timeit()
print "split2:", timeit.Timer("split2(SHORT)", "from __main__ import split2, SHORT").timeit()
print "regex: ", timeit.Timer("regexit(SHORT)", "from __main__ import regexit, SHORT").timeit()
print "mgibs: ", timeit.Timer("mgibsonbr(LONG)", "from __main__ import mgibsonbr, LONG").timeit()
print "split: ", timeit.Timer("splitit(LONG)", "from __main__ import splitit, LONG").timeit()
print "split2:", timeit.Timer("split2(LONG)", "from __main__ import split2, LONG").timeit()
print "regex: ", timeit.Timer("regexit(LONG)", "from __main__ import regexit, LONG").timeit()

The results:

mgibs:  14.7349407408
split:  6.403942732
split2: 3.68306812233
regex:  5.28414318792
mgibs:  107.046683735
split:  46.0844590775
split2: 26.5595985591
regex:  28.6513302646

At the moment, it looks like split2 by duncan beats all other algorithms, regardless of length (with this limited dataset at least), and it also looks like mgibsonbr's solution has some performance issues (sorry about that, but thanks for the solution regardless).

Comments:

  • Go on, just use timeit and see for yourself. It's easy. My bets are on str.split. Commented Mar 7, 2012 at 14:04
  • I think for long strings, re may be faster. But you have to benchmark to be sure. Commented Mar 7, 2012 at 14:11
  • I'd bet that this is IO bound. My first guess would be to say that any type of splitting will be faster than the input it receives from disk or network. Commented Mar 7, 2012 at 14:30
  • If there are "no invited people" does your string look like this "Item 1 | Item 2" or like this "Item 1 | Item 2 |"? Commented Mar 7, 2012 at 14:33
  • @StevenRumbalski The second option. Commented Mar 7, 2012 at 14:36

5 Answers


I was slightly surprised that split() performed so badly in your code, so I looked at it a bit more closely and noticed that you're calling list.remove() in the inner loop. Also you're calling split() an extra time on each string. Get rid of those and a solution using split() beats the regex hands down on shorter strings and comes a pretty close second on the longer one.

import timeit
import re

def splitit(input):
    res0 = input.split("|")
    res = []
    for element in res0:
        t = element.split("<>")
        if t != [element]:
            res0.remove(element)
            res.append(t)
    return (res0, res)

def split2(input):
    res0 = input.split("|")
    res1, res2 = [], []
    for r in res0:
        if "<>" in r:
            res2.append(r.split("<>"))
        else:
            res1.append(r)
    return res1, res2

def regexit(input):
    return re.split(r"\||<>", input)

rSplitter = re.compile(r"\||<>")

def regexit2(input):
    return rSplitter.split(input)

SHORT = 'a|b|c|de|f<>ge<>ah'
LONG = SHORT * 11  # the same item string repeated eleven times

print("split: ", timeit.Timer("splitit(SHORT)", "from __main__ import splitit, SHORT").timeit())
print("split2:", timeit.Timer("split2(SHORT)", "from __main__ import split2, SHORT").timeit())
print("regex: ", timeit.Timer("regexit(SHORT)", "from __main__ import regexit, SHORT").timeit())
print("regex2:", timeit.Timer("regexit2(SHORT)", "from __main__ import regexit2, SHORT").timeit())
print("split: ", timeit.Timer("splitit(LONG)", "from __main__ import splitit, LONG").timeit())
print("split2:", timeit.Timer("split2(LONG)", "from __main__ import split2, LONG").timeit())
print("regex: ", timeit.Timer("regexit(LONG)", "from __main__ import regexit, LONG").timeit())
print("regex2:", timeit.Timer("regexit2(LONG)", "from __main__ import regexit2, LONG").timeit())

Which gives the following result:

split:  1.8427431439631619
split2: 1.0897291360306554
regex:  1.6694280610536225
regex2: 1.2277749050408602
split:  14.356198082969058
split2: 8.009285948995966
regex:  9.526430513011292
regex2: 9.083608677960001

And of course split2() gives the nested lists that you wanted whereas the regex solution doesn't.

Compiling the regex does make a slight difference, but Python caches compiled regular expressions, so the saving is not as large as you might expect. I think it usually isn't worth doing for speed (though it can be in some cases), but it is often worthwhile for making the code clearer.
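To illustrate that caching point, here is a small sketch (the function names are mine): calling re.split() with a pattern string still reuses re's internal cache of compiled patterns, so the precompiled version mainly saves the cache lookup:

```python
import re

SPLITTER = re.compile(r'\||<>')

def split_with_literal(s):
    # re compiles r'\||<>' once, caches it, and reuses it on later calls
    return re.split(r'\||<>', s)

def split_precompiled(s):
    # Skips the pattern-cache lookup entirely
    return SPLITTER.split(s)

assert split_with_literal('a|b<>c') == split_precompiled('a|b<>c') == ['a', 'b', 'c']
```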


4 Comments

Good catch. Did not think about that at all when implementing the test split-function. Now that you pointed it out, it makes a lot of sense, since removing from lists is "expensive", not to mention splitting everything regardless of content (I was looking for a java-like contains, but it seems python does this differently). Thanks a lot.
You can do even better if you "compile" the regular expression, namely add a line like rSplitter = re.compile(r"\||<>") and then define the splitting code as def regexit2(input): return rSplitter.split(input). YMMV, but for me this compiled version was noticeably faster than the original regex version, and comfortably fastest in all tests.
@F1Rumors, thanks. I've updated the example code though the difference isn't massive as Python will be caching the compiled regex.
Split is faster than a regex most of the time, but it depends on the complexity of the regex. Some new users are linking to this answer stating that "re is faster than split", which is not entirely true.

I'm not sure if it's the most efficient, but certainly the easiest to code seems to be something like this:

>>> import re
>>> input = "Item 1 | Item 2 | Item 3 <> Item 4 <> Item 5"
>>> re.split(r"\||<>", input)
['Item 1 ', ' Item 2 ', ' Item 3 ', ' Item 4 ', ' Item 5']

I would think there's a fair chance of it being more efficient than a plain old split as well (depending on the input data), since with plain split you'd need to perform the second split operation on every string output from the first split, which doesn't seem likely to be efficient in either memory or time.

Though having said that I could easily be wrong, and the only way to be sure would be to time it.

3 Comments

Thanks for your algorithm. I forgot to mention that I need the second part (Item 3-5) as a nested list to distinguish it from the other parts and to make processing easier. But since I forgot it, I can't really hold it against you that your solution does not do it ;-)
@malexmave: Once you have a list it's easy. alist = re.split( "\||<>", input); result = [alist[0], alist[1], alist[2:]]. Voila. Nested.
Thanks, that will definitely work if I have fixed positions inside the list. What about the problem I described in a comment further down? Do you have an easy solution to that? Or at least one that is easier than the one proposed by @mgibsonbr? Thanks for your effort!

Calling split multiple times is likely to be inefficient, because it might create unneeded intermediary strings. Using a regex like you proposed won't work, since the capturing group will only get the last item, not all of them. Splitting with a regex, as obmarg suggested, seems to be the best route, assuming a "flattened" list is what you're looking for.
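That capturing-group behaviour is easy to demonstrate (a quick illustrative snippet, not from the original answer): a group inside a repeated subpattern keeps only the text of its last repetition:

```python
import re

# A repeated group, as in the question's ((.+)|)+ attempt, retains only
# the capture from its final iteration:
m = re.match(r'(?:(\w+)\|)+', 'a|b|c|')
print(m.group(1))  # prints 'c'; the earlier captures 'a' and 'b' are lost
```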

If you don't want a flattened list, you can split using a regex first and then iterate over the results, checking the original input to see which delimiter was used:

items = re.split(r'\||<>', input)
offset = 0
for i in items:
    delimiter = '|' if offset + len(i) < len(input) and input[offset + len(i)] == '|' else '<>'
    offset += len(i) + len(delimiter)
    # Do something with i, depending on whether | or <> was the delimiter

At last, if you don't want the substrings created at all (using only the start and end indices to save space, for instance), re.finditer might do the job. Iterate over the delimiters, and do something to the text between them depending on which delimiter (| or <>) was found. It's a more complex operation, since you'll have to handle many corner cases, but might be worth it depending on your needs.
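A minimal sketch of that re.finditer idea (the helper item_spans is my own illustration): walk the delimiter matches and yield index pairs, so item text is only sliced out of the input if the caller actually needs it:

```python
import re

DELIM = re.compile(r'\||<>')

def item_spans(s):
    # Yield (start, end, following_delimiter) for each item;
    # the final item has no trailing delimiter, hence None.
    prev = 0
    for m in DELIM.finditer(s):
        yield prev, m.start(), m.group()
        prev = m.end()
    yield prev, len(s), None

print(list(item_spans('a|b<>c')))
# [(0, 1, '|'), (2, 3, '<>'), (5, 6, None)]
```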

Update: for your particular case, where the input format is uniform, obmarg's solution is the best one. If you must, post-process the result to get a nested list:

split_result = re.split(r"\||<>", input)
result = [split_result[0], split_result[1], [i for i in split_result[2:] if i]]

(that last list comprehension is to ensure you'll get [] instead of [''] if there are no items after the last |)

Update 2: After reading the updated question, I finally understood what you're trying to achieve. Here's the full example, using the framework suggested earlier:

items = re.split(r'\||<>', input)  # Split input in items
offset = 0
result = []  # The result: strings for regular items, lists for <> separated ones
acc = None
for i in items:
    delimiter = '|' if offset + len(i) < len(input) and input[offset + len(i)] == '|' else '<>'
    offset += len(i) + len(delimiter)
    if delimiter == '<>':  # Will always put the item in a list
        if acc is None:
            acc = [i]  # Create one if it doesn't exist
            result.append(acc)
        else:
            acc.append(i)
    else:
        if acc is not None:  # If there was a list, put the last item in it
            acc.append(i)
        else:
            result.append(i)  # Add the regular items
        acc = None  # Clear the list; what comes next is a regular item or a new list
print result

Tested it with your example, the result was:

['a', 'b', 'c', 'de', ['f', 'ge', 'aha'], 'b', 'c', 'de', ['f', 'ge', 'aha'], 'b', 'c', 'de', ['f', 'ge', 'aha'], 'b', 'c', 'de', ['f', 'ge', 'aha'], 'b', 'c', 'de', ['f', 'ge', 'aha'], 'b', 'c', 'de', ['f', 'ge', 'aha'], 'b', 'c', 'de', ['f', 'ge', 'aha'], 'b', 'c', 'de', ['f', 'ge', 'aha'], 'b', 'c', 'de', ['f', 'ge', 'aha'], 'b', 'c', 'de', ['f', 'ge', 'aha'], 'b', 'c', 'de', ['f', 'ge', 'ah']]

6 Comments

Thanks for your input. I am currently trying to understand what your for loop does. As I understand it, it calculates the offset at which the |-separated block ends and the <>-separated block begins, right?
Not really... at each iteration, offset is where the next item is. offset+len(i) is where the next delimiter (or the end of string) is. I used it so you can quickly find the delimiters without having to scan the input string again. However, if your strings will have a regular format, you don't really need it, see my updated answer.
Thanks for the update. I guess there is no similarly simple way to get the same result with a variable number of items separated by | before the <> begin, correct? Not to mention a format like item 1 | item 2 <> Item 3 | Item 4, which could of course be avoided by changing the formatting function on the client side.
See my updated answer. The suggested solution with a loop was addressing exactly that, the need to have some nesting depending on the delimiter type.
FYI: It looks like your new algorithm has some performance issues (See OP for benchmark).

If you know that <> is not going to appear elsewhere in the string then you could replace '<>' with '|' followed by a single split:

>>> input = "Item 1 | Item 2 | Item 3 <> Item 4 <> Item 5"
>>> input.replace("<>", "|").split("|")
['Item 1 ', ' Item 2 ', ' Item 3 ', ' Item 4 ', ' Item 5']

This will almost certainly be faster than doing multiple splits. It may or may not be faster than using re.split - timeit is your friend.

On my system with the sample string you supplied, my version is more than three times faster than re.split:

>>> timeit input.replace("<>", "|").split("|")
1000000 loops, best of 3: 980 ns per loop
>>> import re
>>> timeit re.split(r"\||<>", input)
100000 loops, best of 3: 3.07 us per loop

(N.B.: This is using IPython, which has timeit as a built-in command).

1 Comment

Thanks for the idea. But this will create a flat list with no way to find out which of the resulting substrings were separated by <> instead of |, which is a requirement for my code, as I edited into the OP. Still, it's a good idea if you don't require nested lists.

You can make use of replace. First replace <> with |, and then split by |.

def replace_way(input):
    return input.replace('<>', '|').split('|')

Time performance:

import timeit
import re

def splitit(input):
    res0 = input.split("|")
    res = []
    for element in res0:
        t = element.split("<>")
        if t != [element]:
            res0.remove(element)
            res.append(t)
    return (res0, res)

def regexit(input):
    return re.split(r"\||<>", input)

def replace_way(input):
    return input.replace('<>', '|').split('|')

SHORT = 'a|b|c|de|f<>ge<>ah'
LONG = SHORT * 11  # the same item string repeated eleven times

print "split:  ", timeit.Timer("splitit(SHORT)", "from __main__ import splitit, SHORT").timeit()
print "regex:  ", timeit.Timer("regexit(SHORT)", "from __main__ import regexit, SHORT").timeit()
print "replace:", timeit.Timer("replace_way(SHORT)", "from __main__ import replace_way, SHORT").timeit()
print "split:  ", timeit.Timer("splitit(LONG)", "from __main__ import splitit, LONG").timeit()
print "regex:  ", timeit.Timer("regexit(LONG)", "from __main__ import regexit, LONG").timeit()
print "replace:", timeit.Timer("replace_way(LONG)", "from __main__ import replace_way, LONG").timeit()

Results on my machine:

split:   11.8682055461
regex:   12.7430856814
replace: 2.54299265006
split:   79.2124379066
regex:   68.6917008003
replace: 10.944842347

1 Comment

Thank you for your suggestion. However, it has the problem that it does not return nested lists for the "inner" items (the ones separated by <> instead of |).
