I have a function that, for every row, fetches all the previous rows matching the current row's values in three columns. I use two ways to get the rows I need:
    import pandas as pd

    df = pd.read_csv("data.csv")

    # Way 1
    rows = df[(df["colA"] == 1.2) & (df["colB"] == 5) & (df["colC"] == 2.5)]

    # Way 2
    cols = ["colA", "colB", "colC"]
    group_by_cols = df.groupby(cols)
    rows = group_by_cols.get_group((1.2, 5, 2.5))

Using %timeit in an IPython Notebook:
    # Way 1
    100 loops, best of 3: 16.6 ms per loop

    # Way 2
    100 loops, best of 3: 3.42 ms per loop

I am trying to find a way to reduce the time this takes. I have read about using Cython to improve performance, but I have never used it.
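For reference, a minimal sketch of how the two timings could be reproduced in IPython cells (the lookup values are the example values from above, and the groupby object is built once, outside the timed statement, as in Way 2):

    cols = ["colA", "colB", "colC"]
    group_by_cols = df.groupby(cols)  # built once, not included in the Way 2 timing

    # Way 1: boolean mask over the whole frame
    %timeit df[(df["colA"] == 1.2) & (df["colB"] == 5) & (df["colC"] == 2.5)]

    # Way 2: lookup against the precomputed groups
    %timeit group_by_cols.get_group((1.2, 5, 2.5))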
The values in the columns I use are floats, if that helps.
Update:
In the comments, using HDF instead of csv was suggested.
I am not familiar with it, so I would like to ask: if I created an HDF file with a table called "data" containing all my data, plus tables containing the rows that match each combination of the parameters I want, and then fetched the table needed for each row, would that be faster than Way 2?
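For what it's worth, a minimal sketch of the single-table variant with pandas' HDF support (the file name store.h5 is an assumption; format="table" together with data_columns is what makes the three columns queryable on disk):

    import pandas as pd

    df = pd.read_csv("data.csv")

    # Write once: a queryable on-disk table with the three lookup columns indexed
    df.to_hdf("store.h5", key="data", format="table",
              data_columns=["colA", "colB", "colC"])

    # Per lookup: let PyTables do the selection on disk
    rows = pd.read_hdf("store.h5", "data",
                       where="colA == 1.2 & colB == 5 & colC == 2.5")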
I tried using HDF with pandas, but there is unicode text in my data, so that's a problem.
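One possible workaround, assuming the failure comes from object-dtype unicode columns (the column name text_col is hypothetical), is to encode those columns to UTF-8 bytes before writing and decode them again after reading:

    # Hypothetical unicode column: store it as UTF-8 bytes
    df["text_col"] = df["text_col"].str.encode("utf-8")
    df.to_hdf("store.h5", key="data", format="table",
              data_columns=["colA", "colB", "colC"])

    # After reading back, decode to restore the original strings
    rows = pd.read_hdf("store.h5", "data",
                       where="colA == 1.2 & colB == 5 & colC == 2.5")
    rows["text_col"] = rows["text_col"].str.decode("utf-8")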
Regarding the query() method proposed by @Chrisb below: given the timing of Way 2 above vs. the one he provides, I assume his benchmark was run on a very small data set. groupby with selection afterwards is essentially a binary search, which is much faster than an exhaustive scan for a large dataset. So if your dataset is really large, perhaps store it first in an on-disk database/HDF rather than a csv file, sort it there, and then query.
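In-memory, the same "sort once, then binary-search" idea can be sketched with a sorted MultiIndex (assuming the same three columns and example values as above); .loc on a sorted index avoids the full scan that the boolean mask performs:

    import pandas as pd

    df = pd.read_csv("data.csv")

    # Sort once up front: lookups on a sorted MultiIndex use a binary search
    indexed = df.set_index(["colA", "colB", "colC"]).sort_index()

    # Per lookup: all rows matching the current row's three values
    rows = indexed.loc[(1.2, 5, 2.5)]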