
I have a function that, for every row, gets all the previous rows that match the current row's values in three columns. I use two ways to get the rows I need:

    import pandas as pd

    df = pd.read_csv("data.csv")

    # Way 1
    rows = df[(df["colA"] == 1.2) & (df["colB"] == 5) & (df["colC"] == 2.5)]

    # Way 2
    cols = ["colA", "colB", "colC"]
    group_by_cols = df.groupby(cols)
    rows = group_by_cols.get_group((1.2, 5, 2.5))

Using %timeit in an IPython Notebook:

    # Way 1
    100 loops, best of 3: 16.6 ms per loop

    # Way 2
    100 loops, best of 3: 3.42 ms per loop

I am trying to find a way to reduce the time this takes. I have read that Cython can be used to improve performance, but I have never used it.

The values in the columns I use are floats, if that helps.
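Note: the comparisons above use exact float equality; if the values ever stop matching exactly, a tolerance-based match is possible. A minimal sketch using numpy.isclose (the tolerances shown are just the defaults, and the column names/values are the ones from the example above):

    import numpy as np
    import pandas as pd

    df = pd.read_csv("data.csv")

    # Tolerance-based match instead of exact float equality (sketch;
    # the default isclose tolerances are an assumption).
    mask = (
        np.isclose(df["colA"], 1.2)
        & np.isclose(df["colB"], 5)
        & np.isclose(df["colC"], 2.5)
    )
    rows = df[mask]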

Update:

In the comments it was suggested to use HDF instead of csv.

I am not familiar with it, so I would like to ask: if I created an HDF file with a table called "data" containing all my data, plus a table for each combination of the parameters I want, and then loaded the table needed for each row, would that be faster than Way 2?

I tried using HDF with pandas, but there is Unicode text in my data, and that is a problem.
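For reference, a rough sketch of the HDF idea from the comments, assuming pandas' HDFStore table format with data_columns so the three columns can be queried on disk. The file name "data.h5" and the key "data" are placeholders, and whether the Unicode columns round-trip cleanly depends on the pandas/PyTables versions, so this is a sketch rather than a benchmark:

    import pandas as pd

    df = pd.read_csv("data.csv")

    # One-off: write everything into a single queryable HDF table.
    # data_columns makes colA/colB/colC usable in on-disk "where" queries.
    df.to_hdf("data.h5", "data", format="table",
              data_columns=["colA", "colB", "colC"])

    # Per lookup: read back only the rows that match the combination.
    rows = pd.read_hdf("data.h5", "data",
                       where="colA == 1.2 & colB == 5 & colC == 2.5")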

  • I am not sure I get your question... You mean, what is the fastest way? If so, I am pretty sure that numpy/cython will do! P.S. I thought the first way was the best; the timeit results amaze me :O Commented Jul 27, 2015 at 0:17
  • @Challensois: You are right, I am looking for the fastest way. Since I am not used to Cython, can you estimate how much the performance would be increased? Commented Jul 27, 2015 at 0:54
  • On the larger dataset of 250k rows, Way 2 was 3x slower than Way 1 on my computer, and Way 1 was 3x slower than the (query) method proposed by @Chrisb below. Given the timing of Way 2 above vs. that provided by @Chrisb below, I assume it was based on a very small data set. Commented Jul 27, 2015 at 4:23
  • @evil_inside I think the reason Way 2 is faster than Way 1 is that groupby sorts the dataset first, so the selection afterwards is essentially a binary search, which is much faster than an exhaustive scan on a large dataset. So if your dataset is really large, perhaps store it first in an on-disk database/HDF rather than a csv file, sort it there, and then query. Commented Jul 27, 2015 at 6:42
  • @Jianxun Li: I think you are right about why Way 2 is faster. How much faster would it be if I used the way you mention? I also tried setting the index to the columns I want and then getting the slice, but it was much slower, about 170 ms (a sketch of that indexed lookup follows these comments). Commented Jul 27, 2015 at 9:07
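For reference, a minimal sketch of the indexed lookup mentioned in the comments, assuming the same column names as the question. As noted above, this turned out slower on the original data, so it is shown only to make the idea concrete:

    import pandas as pd

    df = pd.read_csv("data.csv")
    cols = ["colA", "colB", "colC"]

    # Build a sorted MultiIndex once so that each lookup is a sorted-index
    # selection rather than a full boolean scan (exact float keys assumed).
    indexed = df.set_index(cols).sort_index()

    # Per lookup: all rows matching the (colA, colB, colC) combination.
    rows = indexed.loc[(1.2, 5, 2.5)]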

1 Answer


Both of those methods are already pretty optimized; I'd be surprised if you picked up much by going to Cython.

But there is a .query method that should help performance, assuming your frame is somewhat large. See the docs for more, or the example below.

    df = pd.DataFrame({'A': [1.0, 1.2, 1.5] * 250000,
                       'B': [1.0, 5.0, 1.5] * 250000,
                       'C': [1.0, 2.5, 99.0] * 250000})

    In [5]: %timeit rows = df[(df["A"] == 1.2) & (df["B"] == 5) & (df["C"] == 2.5)]
    10 loops, best of 3: 33.4 ms per loop

    In [6]: %%timeit
       ...: cols = ["A", "B", "C"]
       ...: group_by_cols = df.groupby(cols)
       ...: rows = group_by_cols.get_group((1.2, 5, 2.5))
       ...:
    10 loops, best of 3: 140 ms per loop

    In [8]: %timeit rows = df.query('A == 1.2 and B == 5 and C == 2.5')
    100 loops, best of 3: 14.8 ms per loop
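Since the lookup runs once per row, the query string can also reference local Python variables with the @ prefix instead of hard-coding the values. A small sketch; the helper function and the way the per-row values arrive are assumptions on my part:

    import pandas as pd

    df = pd.DataFrame({'A': [1.0, 1.2, 1.5] * 250000,
                       'B': [1.0, 5.0, 1.5] * 250000,
                       'C': [1.0, 2.5, 99.0] * 250000})

    def matching_rows(frame, a, b, c):
        # @-variables let query() use the surrounding Python locals,
        # so no query string needs to be rebuilt by hand for every row.
        return frame.query('A == @a and B == @b and C == @c')

    rows = matching_rows(df, 1.2, 5.0, 2.5)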

1 Comment

I think the same: there is not going to be much improvement.
