
This question relates to selecting a subset of rows in a data frame based on values in an array (or a single column), but that alone is not enough to solve my problem.

I have many different tables in multiple directories, and a dictionary describing the relations between tables (e.g. the keys to join on). For each table T1, I look up the other tables (T2, T3, ...) that share the same column names (keys), and I want to filter those tables (T2, T3, ...) down to the rows whose values in that set of columns match T1. The key set may vary: T1 may connect to T2 on one column (key), while it may connect to T3 on five keys. I do not know this beforehand.

For example, I have t1, t2, t3 with pks = ["id"] (t1 --> t2) and fks = ["id", "index", "zip"] (t1 --> t3):

t1

id|index|zip|v
10|10000|200|20

t2

id|v
10|30
20|50
30|70

t3

id|index|zip|v
00|10000|200|10
10|10000|200|20
10|10000|300|30
10|10000|200|10

The filtered output of t2 and t3 would be

t2

id|v
10|30

and t3

id|index|zip|v
10|10000|200|20
10|10000|200|10
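
For reference, the example tables can be reproduced roughly like this (just a sketch; dtypes are whatever pandas infers, so t3's leading "00" becomes the integer 0):

import pandas as pd

pks = ["id"]                    # key set for t1 --> t2
fks = ["id", "index", "zip"]    # key set for t1 --> t3

t1 = pd.DataFrame({"id": [10], "index": [10000], "zip": [200], "v": [20]})
t2 = pd.DataFrame({"id": [10, 20, 30], "v": [30, 50, 70]})
t3 = pd.DataFrame({"id": [0, 10, 10, 10],
                   "index": [10000, 10000, 10000, 10000],
                   "zip": [200, 200, 300, 200],
                   "v": [10, 20, 30, 10]})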

Looking at the previous answer, I would probably need to do something like

filtered_t2 = t2.loc[t2[pks].isin(t1[fks])] 

But I get the following error:

ValueError: Cannot index with multidimensional key 

Done this way I probably cannot handle a compound key, but it also fails if I provide just one key, 'id'. So maybe it cannot accept an array as values...
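
From what I can tell, t2[pks] is itself a DataFrame even when pks holds a single key, so .isin(...) returns a 2-D boolean frame, and .loc refuses a 2-D mask:

mask = t2[pks].isin(t1[fks])   # 2-D boolean DataFrame, not a 1-D Series
filtered_t2 = t2.loc[mask]     # ValueError: Cannot index with multidimensional key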

How do I handle it when pks and fks are arrays of variable sizes?

Would this be a correct approach:

filter = None
for p, f in zip(pks, fks):
    if filter is None:
        filter = t2[p].isin(t1[f])
    else:
        filter &= t2[p].isin(t1[f])
filtered_ft = t2.loc[filter]

Thanks!

1 Answer


Let us try merge here

t2.merge(t1, how='inner', on=['id'])
t3.merge(t1, how='inner', on=['id', 'index', 'zip'])
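
If you only want t2's own columns back (rather than the full merged result), one variation is to merge against just the deduplicated key columns of t1, for example:

t2.merge(t1[['id']].drop_duplicates(), how='inner', on=['id'])
t3.merge(t1[['id', 'index', 'zip']].drop_duplicates(), how='inner', on=['id', 'index', 'zip'])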

Or, another way:

t2[t2[pks].apply(tuple, axis=1).isin(t1[pks].apply(tuple, axis=1))]
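
To handle key lists of varying length generically, the same tuple trick can be wrapped in a small helper (a sketch; filter_by_keys is a made-up name, and it assumes the key columns share the same names in both tables):

def filter_by_keys(child, parent, keys):
    # Row-wise tuples let isin treat a compound key as a single value
    child_keys = child[keys].apply(tuple, axis=1)
    parent_keys = set(parent[keys].apply(tuple, axis=1))
    return child[child_keys.isin(parent_keys)]

filtered_t2 = filter_by_keys(t2, t1, pks)   # pks = ["id"]
filtered_t3 = filter_by_keys(t3, t1, fks)   # fks = ["id", "index", "zip"]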

5 Comments

But merge brings the two tables together, while I just want to filter the second table. I could probably drop t1's columns afterwards, but then the renaming might be a pain... Can you check my suggested approach in the edited version? Would something like that be correct?
It's a bit hard to understand the logic flow. Can you briefly describe what it is doing?
@YohanRoth convert the columns you need into a tuple (one per row), which allows us to use isin
@YohanRoth first select all the columns you need to check, then turn each row's values into a tuple; after that we can use isin to check whether it exists in the other data frame or not
I guess I am not super clear on why we need to convert it to a tuple... Do you think my solution in the post is also fine (apart from being longer and uglier)? Which do you think would be faster?
