-
- Notifications
You must be signed in to change notification settings - Fork 19.4k
Description
Hi,
I'm working with two colleagues (@FrancescElies & @javiBi4) on seeing if we can add groupby to bcolz, what we did first was to see if we can make Pandas' excellent klib implementation more library independent. See https://github.com/CarstVaartjes/khash
A while back khash (https://github.com/attractivechaos/klib) was updated to 0.2.8:
2013-05-02 (0.2.8):
* Use quadratic probing. When the capacity is power of 2, stepping function i*(i+1)/2 guarantees to traverse each bucket. It is better than double hashing on cache performance and is more robust than linear probing. In theory, double hashing should be more robust than quadratic probing. However, my implementation is probably not for large hash tables, because the second hash function is closely tied to the first hash function, which reduce the effectiveness of double hashing. Reference: http://research.cs.vt.edu/AVresearch/hashing/quadratic.php We updated the klib to it (which was something of a headache to be honest), but the quadratic probing seems to be quite faster which might make it interesting for Pandas to retrofit this into the general Pandas master. See the example:
import numpy as np import random import pandas as pd from khash.hashtable import Int64HashTable, StringHashTable def new_klib_int(input_array): ht = Int64HashTable(len(input_array)) return ht.get_labels_groupby(input_array) def new_klib_str(input_array): ht = StringHashTable(len(input_array)) return ht.factorize(input_array) def new_klib_float(input_array): ht = Float64HashTable(len(input_array)) return ht.factorize(input_array) a = np.random.randint(100, size=10000000) b = np.fromiter((random.choice(['NY', 'IL', 'OH', 'CA']) for _ in xrange(10000000)), dtype='S2') b = pd.Series(b).values c = np.random.random_sample(10000000) * 10000 %timeit pd.factorize(a) %timeit new_klib_int(a) %timeit pd.factorize(b) %timeit new_klib_str(b) %timeit pd.factorize(c) %timeit new_klib_float(c) In [20]: %timeit pd.factorize(a) 10 loops, best of 3: 122 ms per loop In [21]: %timeit new_klib_int(a) 10 loops, best of 3: 101 ms per loop In [22]: %timeit pd.factorize(b) 1 loops, best of 3: 496 ms per loop In [23]: %timeit new_klib_str(b) 10 loops, best of 3: 165 ms per loop In [36]: %timeit pd.factorize(c) 1 loops, best of 3: 1.61 s per loop In [37]: %timeit new_klib_float(c) 1 loops, best of 3: 1.44 s per loop So a 20%ish improvement for int64 and a 65% increase for strings in this example. Just thought to give you a heads up about this, as you might have missed 0.2.8 (and it's a bit of a pain to adjust everything so this might help to make Pandas even faster)
Edit: added float64 example too (10-15% improvement)