Skip to content

Klib upgrade (factorizing performance increase) #8524

@CarstVaartjes

Description

@CarstVaartjes

Hi,

I'm working with two colleagues (@FrancescElies & @javiBi4) on seeing if we can add groupby to bcolz, what we did first was to see if we can make Pandas' excellent klib implementation more library independent. See https://github.com/CarstVaartjes/khash
A while back khash (https://github.com/attractivechaos/klib) was updated to 0.2.8:

2013-05-02 (0.2.8):

* Use quadratic probing. When the capacity is power of 2, stepping function i*(i+1)/2 guarantees to traverse each bucket. It is better than double hashing on cache performance and is more robust than linear probing. In theory, double hashing should be more robust than quadratic probing. However, my implementation is probably not for large hash tables, because the second hash function is closely tied to the first hash function, which reduce the effectiveness of double hashing. Reference: http://research.cs.vt.edu/AVresearch/hashing/quadratic.php 

We updated the klib to it (which was something of a headache to be honest), but the quadratic probing seems to be quite faster which might make it interesting for Pandas to retrofit this into the general Pandas master. See the example:

import numpy as np import random import pandas as pd from khash.hashtable import Int64HashTable, StringHashTable def new_klib_int(input_array): ht = Int64HashTable(len(input_array)) return ht.get_labels_groupby(input_array) def new_klib_str(input_array): ht = StringHashTable(len(input_array)) return ht.factorize(input_array) def new_klib_float(input_array): ht = Float64HashTable(len(input_array)) return ht.factorize(input_array) a = np.random.randint(100, size=10000000) b = np.fromiter((random.choice(['NY', 'IL', 'OH', 'CA']) for _ in xrange(10000000)), dtype='S2') b = pd.Series(b).values c = np.random.random_sample(10000000) * 10000 %timeit pd.factorize(a) %timeit new_klib_int(a) %timeit pd.factorize(b) %timeit new_klib_str(b) %timeit pd.factorize(c) %timeit new_klib_float(c) In [20]: %timeit pd.factorize(a) 10 loops, best of 3: 122 ms per loop In [21]: %timeit new_klib_int(a) 10 loops, best of 3: 101 ms per loop In [22]: %timeit pd.factorize(b) 1 loops, best of 3: 496 ms per loop In [23]: %timeit new_klib_str(b) 10 loops, best of 3: 165 ms per loop In [36]: %timeit pd.factorize(c) 1 loops, best of 3: 1.61 s per loop In [37]: %timeit new_klib_float(c) 1 loops, best of 3: 1.44 s per loop 

So a 20%ish improvement for int64 and a 65% increase for strings in this example. Just thought to give you a heads up about this, as you might have missed 0.2.8 (and it's a bit of a pain to adjust everything so this might help to make Pandas even faster)

Edit: added float64 example too (10-15% improvement)

Metadata

Metadata

Assignees

No one assigned

    Labels

    PerformanceMemory or execution speed performance

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions