String parsing is often slow. Unicode decoding often makes things even slower (especially when there are non-ASCII characters) unless it is carefully optimized, which is hard. CPython is slow, especially loops. Numpy is not really designed to deal (efficiently) with strings, and I do not think Numpy can do this faster than fromstring yet. The only solutions I can come up with are Numba, Cython, or even plain C extensions. The simplest solution is Numba; the fastest is Cython/C extensions.
Unfortunately, Numba is very slow for strings/bytes so far (this is an open issue that is not planned to be solved any time soon). Some tricks are needed so that Numba can compute this efficiently: the string needs to be converted to a Numpy array. This means it must first be encoded to a byte array, to avoid any variable-sized encoding (like UTF-8). np.frombuffer seems to be the fastest way to convert the buffer to a Numpy array. Since the input is then a read-only array (unusual, but efficient), the Numba signature is not very easy to read.
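To see why the read-only flag matters, here is a minimal sketch of the encode/frombuffer step on its own. It shows that np.frombuffer over a bytes object produces a zero-copy, non-writeable uint8 view, which is why the Numba signature needs readonly=True:

```python
import numpy as np

# encode() produces a fixed-width byte string (1 byte per ASCII char),
# and frombuffer wraps it without copying.
buf = "12;34;-5".encode('ascii')
arr = np.frombuffer(buf, np.uint8)

print(arr.flags.writeable)  # False: bytes objects are immutable, so the
                            # view is read-only, hence readonly=True in
                            # the Numba signature
print(arr[:3])              # ASCII codes of '1', '2', ';'
```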
Here is the final solution:
import numpy as np
import numba as nb

@nb.njit(nb.int32[::1](nb.types.Array(nb.uint8, 1, 'C', readonly=True)))
def compute(arr):
    sep = ord(';')
    base = ord('0')
    minus = ord('-')

    # First pass: count the numbers (one more than the separators)
    count = 1
    for c in arr:
        count += c == sep

    res = np.empty(count, np.int32)
    val = 0
    positive = True
    cur = 0

    # Second pass: accumulate digits, store a value at each separator
    for c in arr:
        if c != sep and c != minus:
            val = (val * 10) + c - base
        elif c == minus:
            positive = False
        else:
            res[cur] = val if positive else -val
            cur += 1
            val = 0
            positive = True

    # Store the last value (no trailing separator)
    if cur < count:
        res[cur] = val if positive else -val

    return res

x = ';'.join(np.random.randint(0, 200, 10_000_000).astype(str))
result = compute(np.frombuffer(x.encode('ascii'), np.uint8))
Note that the Numba solution performs no checks, for the sake of performance, so you must ensure the input is valid. Alternatively, you can add checks to it (at the expense of slower code). A variant that drops support for negative numbers is slightly faster (see the benchmark below).
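One cheap way to add such checks without touching the hot Numba loop is a vectorized pre-validation pass in Numpy. This is only a sketch (the function name validate is mine, not part of the solution above): it verifies that every byte is a digit, ';' or '-', and can be run once before calling the unchecked kernel:

```python
import numpy as np

# Hypothetical validation pass: vectorized check that the byte array only
# contains digits, separators and minus signs.
def validate(arr):
    digits = (arr >= ord('0')) & (arr <= ord('9'))
    ok = digits | (arr == ord(';')) | (arr == ord('-'))
    return bool(ok.all())

buf = np.frombuffer(b"12;-3;4", np.uint8)
print(validate(buf))                                 # True
print(validate(np.frombuffer(b"12;x;4", np.uint8)))  # False
```

This keeps the parsing loop branch-light while still rejecting malformed input, at the cost of one extra pass over the data.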
Here are performance results on my machine with an i5-9600KF processor (with Numpy 1.22.4 on Windows):
np.fromstring(x, dtype=np.int32, sep=';'):     8927 ms
np.array(re.split(";", x), dtype=np.int32):    1419 ms
np.array(x.split(";"), dtype=np.int32):        1196 ms
Numba implementation:                            78 ms
Numba implementation (no negative numbers):      67 ms
This solution is 114 times faster than np.fromstring and 15 times faster than the fastest alternative (based on split). Note that removing the support for negative numbers makes the Numba function 18% faster. Also note that 10~12% of the time is spent in encode; the rest comes from the main loop in the Numba function. More specifically, the conditionals in the loop are the main source of the slowdown, because they can hardly be predicted by the processor and they prevent the use of fast (SIMD) instructions. This is often why string parsing is slow.
A possible improvement is to use a branchless implementation operating on chunks. Another is to process the chunks using multiple threads. However, both optimizations are tricky to implement, and both make the resulting code significantly harder to read (and thus to maintain).
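For the multi-threaded idea, the main difficulty is splitting the byte array so that no number is cut in half. Here is a rough sketch of that splitting step only (chunk_bounds is a name I made up); each resulting chunk contains whole numbers and could then be parsed independently, e.g. in an nb.prange loop:

```python
import numpy as np

# Sketch: compute chunk boundaries aligned on ';' separators so each
# chunk contains only complete numbers.
def chunk_bounds(arr, n_chunks):
    sep = ord(';')
    bounds = [0]
    for i in range(1, n_chunks):
        pos = i * len(arr) // n_chunks
        # Advance to the next separator so we never split a number
        while pos < len(arr) and arr[pos] != sep:
            pos += 1
        bounds.append(min(pos + 1, len(arr)))
    bounds.append(len(arr))
    return bounds

buf = np.frombuffer(b"10;200;-3;4000;5", np.uint8)
print(chunk_bounds(buf, 2))  # [0, 10, 16]: "10;200;-3;" and "4000;5"
```

Each pair of consecutive bounds then defines an independent slice, but merging the per-chunk results (and counting outputs per chunk up front) is exactly the part that makes the parallel version harder to maintain.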