
This question relates to selecting a subset of rows in a data frame based on values in an array (or a single column), but that alone is not enough to solve my problem.

I have many different tables in multiple directories, and a dictionary describing the relations between tables (e.g. the keys to join on). For each table T1, I look up the other tables (T2, T3, ...) that share the same column names (keys), and I want to filter those tables (T2, T3, ...) down to the rows whose values in that set of columns match T1. The key set may vary: T1 may connect to T2 on one column (key), while it may connect to T3 on five keys. I do not know this beforehand.

For example, I have t1, t2, t3 with pks = ["id"] (t1 --> t2) and fks = ["id", "index", "zip"] (t1 --> t3):

t1

id|index|zip|v
10|10000|200|20

t2

id|v
10|30
20|50
30|70

t3

id|index|zip|v
00|10000|200|10
10|10000|200|20
10|10000|300|30
10|10000|200|10

The filtered output of t2 and t3 would be

t2

id|v
10|30

and t3

id|index|zip|v
10|10000|200|20
10|10000|200|10
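
For reference, the example tables can be reproduced roughly like this (just a sketch; dtypes are whatever pandas infers, so t3's leading "00" becomes the integer 0):

import pandas as pd

pks = ["id"]                    # key set for t1 --> t2
fks = ["id", "index", "zip"]    # key set for t1 --> t3

t1 = pd.DataFrame({"id": [10], "index": [10000], "zip": [200], "v": [20]})
t2 = pd.DataFrame({"id": [10, 20, 30], "v": [30, 50, 70]})
t3 = pd.DataFrame({"id": [0, 10, 10, 10],
                   "index": [10000, 10000, 10000, 10000],
                   "zip": [200, 200, 300, 200],
                   "v": [10, 20, 30, 10]})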

Looking at the previous answer, I would probably need to do something like

filtered_t2 = t2.loc[t2[pks].isin(t1[fks])] 

But I get the following error:

ValueError: Cannot index with multidimensional key 

Done this way I probably cannot handle a compound key, but it also fails if I provide just one key, 'id'. So maybe it cannot accept an array as values...
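
From what I can tell, t2[pks] is itself a DataFrame even when pks holds a single key, so .isin(...) returns a 2-D boolean frame, and .loc refuses a 2-D mask:

mask = t2[pks].isin(t1[fks])   # 2-D boolean DataFrame, not a 1-D Series
filtered_t2 = t2.loc[mask]     # ValueError: Cannot index with multidimensional key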

How do I handle it when pks and fks are arrays of variable sizes?

Would this be a correct approach:

filter = None
for p, f in zip(pks, fks):
    if filter is None:
        filter = t2[p].isin(t1[f])
    else:
        filter &= t2[p].isin(t1[f])
filtered_ft = t2.loc[filter]

Thanks!

1 Answer


Let us try merge here

t2.merge(t1, how='inner', on=['id'])
t3.merge(t1, how='inner', on=['id', 'index', 'zip'])
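
If you only want t2's own columns back (rather than the full merged result), one variation is to merge against just the deduplicated key columns of t1, for example:

t2.merge(t1[['id']].drop_duplicates(), how='inner', on=['id'])
t3.merge(t1[['id', 'index', 'zip']].drop_duplicates(), how='inner', on=['id', 'index', 'zip'])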

Or, another way:

t2[t2[pks].apply(tuple, axis=1).isin(t1[pks].apply(tuple, axis=1))]
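
To handle key lists of varying length generically, the same tuple trick can be wrapped in a small helper (a sketch; filter_by_keys is a made-up name, and it assumes the key columns share the same names in both tables):

def filter_by_keys(child, parent, keys):
    # Row-wise tuples let isin treat a compound key as a single value
    child_keys = child[keys].apply(tuple, axis=1)
    parent_keys = set(parent[keys].apply(tuple, axis=1))
    return child[child_keys.isin(parent_keys)]

filtered_t2 = filter_by_keys(t2, t1, pks)   # pks = ["id"]
filtered_t3 = filter_by_keys(t3, t1, fks)   # fks = ["id", "index", "zip"]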

5 Comments

But merge brings the two tables together, while I just want to filter the second table. I could probably drop t1's columns afterwards, but then the renaming might be a pain... Can you check my suggested approach in the edited version? Would something like that be correct?
It's a bit hard to understand the logic flow. Can you briefly describe what it is doing?
@YohanRoth convert the columns you need into a tuple (one per row), which allows us to use isin
@YohanRoth first select all the columns you need to check, then turn each row's values into a tuple; after that we can use isin to check whether it exists in the other data frame or not
I guess I am not super clear on why we need to convert it to a tuple... Do you think my solution in the post is also fine (apart from being longer and uglier)? Which do you think would be faster?
