
I have a very long string in Python:

x = "12;14;14;14;18;12;17;19"  # only a small part is shown: the full string contains about 10 million ';'-separated values

The goal is to transform it into:

y = array([12, 14, 14, 14, 18, 12, 17, 19], dtype=int) 

One way to do this is to use array(x.split(";")) or numpy.fromstring.

But both are extremely slow.
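For reference, a minimal sketch of those two baseline approaches (on a short stand-in string, not the real 10-million-entry one):

```python
import numpy as np

# Short stand-in for the real 10-million-entry string.
x = "12;14;14;14;18;12;17;19"

# Baseline 1: split in pure Python, then let NumPy convert the pieces.
y_split = np.array(x.split(";"), dtype=np.int32)

# Baseline 2: let NumPy parse the text directly (np.fromstring's binary
# mode is deprecated; the text mode with sep used here still works).
y_from = np.fromstring(x, dtype=np.int32, sep=";")

print(y_split)  # [12 14 14 14 18 12 17 19]
```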

Is there a quicker way to do it in Python?

Thank you very much and have a nice day.

  • Does this answer your question? Most efficient way to split strings in Python Commented Dec 21, 2022 at 8:52
  • Try np.fromstring(x, dtype=np.int, sep=';') Commented Dec 21, 2022 at 8:52
  • Hello, thank you for the answer. Unfortunately it performs the same way :( Commented Dec 21, 2022 at 8:54
  • Have you tried textwrap? Commented Dec 21, 2022 at 8:58
  • Strings are slow, especially unicode ones. CPython is slow, especially loops. Numpy is not really designed to (efficiently) deal with strings. Unfortunately, Numba is very slow for strings/bytes so far (an open issue that is not planned to be solved any time soon). The only possible solutions I see are cheats/tricks, or using C/Cython. I do not think Numpy can do this faster than fromstring. Numba certainly could, with some tricks. Are the numbers always made of 2 digits? Are they always small? Commented Dec 21, 2022 at 11:16
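As an aside, the "cheats/tricks" hinted at in the last comment can be very effective when the format is constrained. For instance, if every number had exactly two digits (which the question does not guarantee), the byte buffer would have a fixed 3-byte stride and pure vectorized NumPy arithmetic would suffice. A sketch under that assumption:

```python
import numpy as np

x = "12;14;14;14;18;12;17;19"
buf = np.frombuffer(x.encode('ascii'), np.uint8)

# With fixed 2-digit numbers, entries sit at a regular stride of 3 bytes
# ("dd;"), so both digits can be decoded with vectorized arithmetic.
tens = buf[::3] - ord('0')
ones = buf[1::3] - ord('0')
y = (10 * tens + ones).astype(np.int32)
print(y)  # [12 14 14 14 18 12 17 19]
```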

1 Answer


String parsing is often slow. Unicode decoding often makes things even slower (especially when there are non-ASCII characters) unless it is carefully optimized (which is hard). CPython is slow, especially its loops. Numpy is not really designed to (efficiently) deal with strings. I do not think Numpy can do this faster than fromstring yet. The only solutions I can come up with are Numba, Cython or even basic C extensions. The simplest solution is to use Numba; the fastest is to use Cython/C-extensions.

Unfortunately, Numba is very slow for strings/bytes so far (this is an open issue that is not planned to be solved any time soon). Some tricks are needed so that Numba can compute this efficiently: the string needs to be converted to a Numpy array. This means it must first be encoded to a byte array, to avoid any variable-sized encoding (like UTF-8). np.frombuffer seems to be the fastest way to convert the buffer to a Numpy array. Since the input is a read-only array (unusual, but efficient), the Numba signature is not very easy to read.
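The conversion trick described above can be sketched in isolation (the resulting array is a read-only view of the encoded bytes, which is why the Numba signature needs readonly=True):

```python
import numpy as np

x = "12;14;-3"
buf = np.frombuffer(x.encode('ascii'), np.uint8)

print(buf[:3])              # [49 50 59] -> ASCII codes of '1', '2', ';'
print(buf.flags.writeable)  # False -> hence readonly=True in the signature
```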

Here is the final solution:

    import numpy as np
    import numba as nb

    @nb.njit(nb.int32[::1](nb.types.Array(nb.uint8, 1, 'C', readonly=True)))
    def compute(arr):
        sep = ord(';')
        base = ord('0')
        minus = ord('-')

        # First pass: count the separators to pre-allocate the output.
        count = 1
        for c in arr:
            count += c == sep

        # Second pass: parse digit-by-digit, flushing a value at each ';'.
        res = np.empty(count, np.int32)
        val = 0
        positive = True
        cur = 0
        for c in arr:
            if c != sep and c != minus:
                val = (val * 10) + c - base
            elif c == minus:
                positive = False
            else:
                res[cur] = val if positive else -val
                cur += 1
                val = 0
                positive = True

        # Flush the last value (the string does not end with a separator).
        if cur < count:
            res[cur] = val if positive else -val
        return res

    x = ';'.join(np.random.randint(0, 200, 10_000_000).astype(str))
    result = compute(np.frombuffer(x.encode('ascii'), np.uint8))

Note that the Numba solution performs no checks, for the sake of performance (and the faster variant benchmarked below additionally assumes all numbers are positive). Thus, you must ensure the input is valid. Alternatively, you can perform additional checks in it (at the expense of slower code).
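If you do want a cheap sanity check before calling the parser, one option (a sketch, not part of the original answer; the helper name is made up) is a vectorized character whitelist over the byte buffer:

```python
import numpy as np

def looks_valid(buf):
    # Accept only ASCII digits, ';' and '-'. A thorough check would also
    # need to reject malformed tokens such as "1-2" or empty fields.
    allowed = np.zeros(256, np.bool_)
    allowed[ord('0'):ord('9') + 1] = True
    allowed[ord(';')] = True
    allowed[ord('-')] = True
    return bool(allowed[buf].all())

good = np.frombuffer(b"12;-7;19", np.uint8)
bad = np.frombuffer(b"12;x;19", np.uint8)
print(looks_valid(good), looks_valid(bad))  # True False
```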

Here are performance results on my machine with an i5-9600KF processor (with Numpy 1.22.4 on Windows):

  • np.fromstring(x, dtype=np.int32, sep=';'): 8927 ms
  • np.array(re.split(";", x), dtype=np.int32): 1419 ms
  • np.array(x.split(";"), dtype=np.int32): 1196 ms
  • Numba implementation: 78 ms
  • Numba implementation (without negative numbers): 67 ms

This solution is 114 times faster than np.fromstring and 15 times faster than the fastest solution (based on split). Note that removing the support for negative numbers makes the Numba function 18% faster. Also note that 10~12% of the time is spent in encode; the rest comes from the main loop in the Numba function. More specifically, the conditionals in the loop are the main source of the slowdown, because they can hardly be predicted by the processor and they prevent the use of fast (SIMD) instructions. This is often why string parsing is slow.
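Absolute timings will of course differ on other machines; a small script along these lines (on a shorter input than the 10-million-value one) can reproduce the baseline comparison locally:

```python
import timeit
import numpy as np

# 100k values instead of 10M so the script finishes quickly.
x = ";".join(np.random.randint(0, 200, 100_000).astype(str))

t_split = timeit.timeit(lambda: np.array(x.split(";"), dtype=np.int32), number=5)
t_from = timeit.timeit(lambda: np.fromstring(x, dtype=np.int32, sep=";"), number=5)

print(f"np.array(x.split(';')): {t_split / 5 * 1e3:.1f} ms per run")
print(f"np.fromstring:          {t_from / 5 * 1e3:.1f} ms per run")
```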

A possible improvement is to use a branchless implementation operating on chunks. Another possible improvement is to compute the chunks using multiple threads. However, both optimizations are tricky to do and they both make the resulting code significantly harder to read (and so to maintain).


6 Comments

The problem with unicode strings is that the variable-size encoding makes automatic SIMD computation nearly impossible for compilers. Since intrinsics are not available in Numba (and are not portable anyway) and there is no SIMD interface for that, the assembly code will be a scalar one. Scalar code is slow because it is bound to >=1 cycle per character (in fact, several cycles per character are generally needed). Native unbounded C-strings have a similar issue, though it is possible to process them quite efficiently using chunks, since the memory is pretty slow too (see "Memory Wall").
I agree with you that dealing with unicode specifics is what makes unicode strings slow, and if there are only ASCII characters then it is not so slow. Still, in CPython, the string has to be copied unless one deals with a low-level C extension (or maybe Cython). This copy is a bit expensive, but not critical here (same for the decoding). AFAIK, CPython does the decoding pretty efficiently thanks to a library doing the operation with manual SIMD intrinsics.
90% of the time is spent on the computation over the character array, which is not really a string. What makes the computation not very fast in Numba is basically the use of conditionals (in a loop with dependency chains), and this is more generally due to the parsing. Conditionals are slow when they cannot be predicted, and this is often the case for string parsing. Moreover, dependency chains and conditionals prevent any automatic vectorization of the loop. In fact, this is hard to do efficiently even manually, because of the variable size of the integers.
Unicode string parsing can indeed have close performance for UTF-8 ASCII-based strings, but not the same, due to the ASCII check. UCS-4 ASCII strings are generally significantly slower because they are 4x bigger in memory and in SIMD registers (compared to a basic byte-string). This is what Numpy uses internally.
Thus, maybe, I should have said that "string parsing is slow". Theoretically, string parsing (including unicode decoding) can often be made fast with SIMD intrinsics, but most people do not use them for strings (nor a portable library), and compilers are generally far from smart enough to do this job efficiently. In the end, this is why I said "strings are slow". This is not very precise nor very exact, but I think it is important to summarize this complex subject with a simple idea (simplistic?).
