
I have a function that, for every row, gets all the previous rows that match the current row's values in three columns. I use two ways to get the rows I need:

    import pandas as pd

    df = pd.read_csv("data.csv")

    # Way 1
    rows = df[(df["colA"] == 1.2) & (df["colB"] == 5) & (df["colC"] == 2.5)]

    # Way 2
    cols = ["colA", "colB", "colC"]
    group_by_cols = df.groupby(cols)
    rows = group_by_cols.get_group((1.2, 5, 2.5))

Using %timeit in an IPython Notebook:

    # Way 1
    100 loops, best of 3: 16.6 ms per loop

    # Way 2
    100 loops, best of 3: 3.42 ms per loop

I am trying to find a way to reduce the time this takes. I have read that Cython can be used to improve performance, but I have never used it.

The values in the columns I use are floats, if that helps.
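Note: the comparisons above use exact float equality; if the values ever stop matching exactly, a tolerance-based match is possible. A minimal sketch using numpy.isclose (the tolerances shown are just the defaults, and the column names/values are the ones from the example above):

    import numpy as np
    import pandas as pd

    df = pd.read_csv("data.csv")

    # Tolerance-based match instead of exact float equality (sketch;
    # the default isclose tolerances are an assumption).
    mask = (
        np.isclose(df["colA"], 1.2)
        & np.isclose(df["colB"], 5)
        & np.isclose(df["colC"], 2.5)
    )
    rows = df[mask]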

Update:

In the comments it was suggested to use HDF instead of csv.

I am not familiar with it, so I would like to ask: if I created an HDF file with a table called "data" containing all my data, plus a table for each combination of the parameters I want, and then loaded the table needed for each row, would that be faster than Way 2?

I tried using HDF with pandas, but there is Unicode text in my data, and that is a problem.
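For reference, a rough sketch of the HDF idea from the comments, assuming pandas' HDFStore table format with data_columns so the three columns can be queried on disk. The file name "data.h5" and the key "data" are placeholders, and whether the Unicode columns round-trip cleanly depends on the pandas/PyTables versions, so this is a sketch rather than a benchmark:

    import pandas as pd

    df = pd.read_csv("data.csv")

    # One-off: write everything into a single queryable HDF table.
    # data_columns makes colA/colB/colC usable in on-disk "where" queries.
    df.to_hdf("data.h5", "data", format="table",
              data_columns=["colA", "colB", "colC"])

    # Per lookup: read back only the rows that match the combination.
    rows = pd.read_hdf("data.h5", "data",
                       where="colA == 1.2 & colB == 5 & colC == 2.5")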

  • I am not sure I get your question... You mean, what is the fastest way? If so, I am pretty sure that numpy/cython will do! P.S. I thought the first way was the best; the timeit results amaze me :O Commented Jul 27, 2015 at 0:17
  • @Challensois: You are right, I am looking for the fastest way. Since I am not used to Cython, can you estimate how much the performance would be increased? Commented Jul 27, 2015 at 0:54
  • On the larger dataset of 250k rows, Way 2 was 3x slower than Way 1 on my computer, and Way 1 was 3x slower than the (query) method proposed by @Chrisb below. Given the timing of Way 2 above vs. that provided by @Chrisb below, I assume it was based on a very small data set. Commented Jul 27, 2015 at 4:23
  • @evil_inside I think the reason Way 2 is faster than Way 1 is that groupby sorts the dataset first, so the selection afterwards is essentially a binary search, which is much faster than an exhaustive scan on a large dataset. So if your dataset is really large, perhaps store it first in an on-disk database/HDF rather than a csv file, sort it there, and then query. Commented Jul 27, 2015 at 6:42
  • @Jianxun Li: I think you are right about why Way 2 is faster. How much faster would it be if I used the way you mention? I also tried setting the index to the columns I want and then getting the slice, but it was much slower, about 170 ms (a sketch of that indexed lookup follows these comments). Commented Jul 27, 2015 at 9:07
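For reference, a minimal sketch of the indexed lookup mentioned in the comments, assuming the same column names as the question. As noted above, this turned out slower on the original data, so it is shown only to make the idea concrete:

    import pandas as pd

    df = pd.read_csv("data.csv")
    cols = ["colA", "colB", "colC"]

    # Build a sorted MultiIndex once so that each lookup is a sorted-index
    # selection rather than a full boolean scan (exact float keys assumed).
    indexed = df.set_index(cols).sort_index()

    # Per lookup: all rows matching the (colA, colB, colC) combination.
    rows = indexed.loc[(1.2, 5, 2.5)]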

1 Answer


Both of those methods are already pretty optimized; I'd be surprised if you picked up much by going to Cython.

But there is a .query method that should help performance, assuming your frame is somewhat large. See the docs for more, or the example below.

    df = pd.DataFrame({'A': [1.0, 1.2, 1.5] * 250000,
                       'B': [1.0, 5.0, 1.5] * 250000,
                       'C': [1.0, 2.5, 99.0] * 250000})

    In [5]: %timeit rows = df[(df["A"] == 1.2) & (df["B"] == 5) & (df["C"] == 2.5)]
    10 loops, best of 3: 33.4 ms per loop

    In [6]: %%timeit
       ...: cols = ["A", "B", "C"]
       ...: group_by_cols = df.groupby(cols)
       ...: rows = group_by_cols.get_group((1.2, 5, 2.5))
       ...:
    10 loops, best of 3: 140 ms per loop

    In [8]: %timeit rows = df.query('A == 1.2 and B == 5 and C == 2.5')
    100 loops, best of 3: 14.8 ms per loop
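Since the lookup runs once per row, the query string can also reference local Python variables with the @ prefix instead of hard-coding the values. A small sketch; the helper function and the way the per-row values arrive are assumptions on my part:

    import pandas as pd

    df = pd.DataFrame({'A': [1.0, 1.2, 1.5] * 250000,
                       'B': [1.0, 5.0, 1.5] * 250000,
                       'C': [1.0, 2.5, 99.0] * 250000})

    def matching_rows(frame, a, b, c):
        # @-variables let query() use the surrounding Python locals,
        # so no query string needs to be rebuilt by hand for every row.
        return frame.query('A == @a and B == @b and C == @c')

    rows = matching_rows(df, 1.2, 5.0, 2.5)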

1 Comment

I think the same: there is not going to be much improvement.
