So this question concerns how to select a subset of rows in a data frame based on values in an array (or a single column). It is not enough for me to solve my problem.
I have many different tables in multiple directories. I have a dictionary with relations between tables (e.g keys for join). For each table T1, I lookup other tables (T2, T3...) that share same column names (keys) and I want to filter those tables (T2, T3...) to include rows that have matching key values in a set of columns with T1. Key set may vary! T1 may connect to T2 on one column (key) while T1 may connect with T2 on 5 keys! I do not know this beforehand.
So for example I have t1, t2, t3 and pks=["id"] (t1-->t2), fks=["id", "index", "zip"] (t1-->t3)
t1 id|index|zip|v 10|10000|200|20 t2 id|v 10|30 20|50 30|70 t3 id|index|zip|v 00|10000|200|10 10|10000|200|20 10|10000|300|30 10|10000|200|10 the output of t2 and t3 would be
t2 id|v 10|30 and t3
id|index|zip|v 10|10000|200|20 10|10000|200|10 Looking at the previous answer I would probably need to do smth like
filtered_t2 = t2.loc[t2[pks].isin(t1[fks])] But i get the following error
ValueError: Cannot index with multidimensional key Probably in this way I cannot handle compound key, but it also fails if I just provide one key -- 'id'! So maybe it cannot accept an array as values ...
How do I handle it when pks and fks are arrays of variable sizes?
Would this be a correct approach:
filter = None for p, f in zip(pks, fks): if filter is None: filter = t2[p].isin(t1[f]) else: filter &= t2[p].isin(t1[f]) filtered_ft = t2.loc[filter] Thanks!