Finding common intersection between values of a column in a dataframe where the values are arrays/lists

Question

I have a dataframe whose first 5 rows look like this:

    userID      CategoryID          sectorID
    agunii2035  [16, 17, 3, 12, 1]  [2, 33, 29, 18, 23]
    agunii3007  [2, 4, 6, 3, 16]    [4, 15, 29, 10, 18]
    agunii2006  [8, 16, 2, 5, 12]   [38, 18, 7, 36, 33]
    agunii2003  [6, 4, 2, 5, 17]    [37, 12, 3, 32, 34]
    agunii3000  [12, 11, 7, 3, 1]   [38, 1, 13, 25, 3]

Now, for any userID (say 'agunii2035'), I want to get the userIDs whose "CategoryID" or "sectorID" lists share at least one value with it. For example, agunii2035 and agunii3007 have at least one common "CategoryID" (e.g. 16) and at least one common "sectorID" (e.g. 29), so 'agunii3007' should be included.

The output can be a dataframe that looks like this

    userID      user_with_common_cat/sectorID
    agunii2035  {agunii3007, agunii2006, agunii2003, agunii3000}
    agunii3007  {agunii2035, agunii2006, agunii2003}
    and so on

or it can also be

    userID      user_with_common_cat/sectorID
    agunii2035  [agunii3007, agunii2006, agunii2003, agunii3000]
    agunii3007  [agunii2035, agunii2006, agunii2003]
    and so on

Any help on this please?

Edit

What I have done so far:

    userID = 'agunii2035'
    common_users = []
    for user in uniqueUsers:
        # categories shared by the chosen user and the current user
        common = list(
            set(df_interest.loc[df_interest['userID'] == userID, 'categoryID'].iloc[0])
            .intersection(df_interest.loc[df_interest['userID'] == user, 'categoryID'].iloc[0])
        )
        # intersect = len(common) > 0
        if len(common) > 0:
            common_users.append(user)

I want to do this for sectors as well, and append the user to the common_users list if either the category intersection or the sector intersection has at least one element.

Also, I want to do this for all the users.

I have just done a static one for a single user (the same loop as in the edit above). I want to add the sectors part as well and do this for all users.
– d_b, Commented Oct 16, 2020 at 7:19

Thomas · Accepted Answer · 2020-10-16 07:36:22Z

I usually don't like to manipulate a dataframe where a "cell" contains a list rather than a single element (float, str, etc.).

In the following, I will work with a Python dict instead of a dataframe.

Data

You can transform a pandas dataframe into a dict with the to_dict method (see the pandas documentation).
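As a minimal sketch of that conversion, assuming the dataframe has a userID column and list-valued columns renamed to category_id and sector_id (the names used in the dict below, not the question's original column names), setting userID as the index and calling to_dict with orient="index" gives one nested dict per user:

```python
import pandas as pd

# Assumed reconstruction of the first two rows of the question's dataframe,
# with the columns renamed to match the dict used in this answer.
df = pd.DataFrame(
    {
        "userID": ["agunii2035", "agunii3007"],
        "category_id": [[16, 17, 3, 12, 1], [2, 4, 6, 3, 16]],
        "sector_id": [[2, 33, 29, 18, 23], [4, 15, 29, 10, 18]],
    }
)

# orient="index" maps each index label to a {column: value} dict.
d = df.set_index("userID").to_dict(orient="index")
```

With this orientation, d["agunii2035"]["category_id"] is the list stored in that user's row.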

Here is the data as a dictionary:

    d = {
        "agunii2035": {
            "category_id": [16, 17, 3, 12, 1],
            "sector_id": [2, 33, 29, 18, 23],
        },
        "agunii3007": {
            "category_id": [2, 4, 6, 3, 16],
            "sector_id": [4, 15, 29, 10, 18],
        },
        "agunii2006": {
            "category_id": [8, 16, 2, 5, 12],
            "sector_id": [38, 18, 7, 36, 33],
        },
        "agunii2003": {
            "category_id": [6, 4, 2, 5, 17],
            "sector_id": [37, 12, 3, 32, 34],
        },
        "agunii3000": {
            "category_id": [12, 11, 7, 3, 1],
            "sector_id": [38, 1, 13, 25, 3],
        },
    }

Solution 1: Iterate over the dictionary with for loops

Here we can use two nested for loops to compare every user with every other user. The only thing to know is how to intersect two lists in Python with set.

    results = {}
    for user_a, category_sector_a in d.items():
        results[user_a] = []
        for user_b, category_sector_b in d.items():
            if user_a != user_b:
                # we use "set" to get the common elements between the two lists
                intersection_category = set(category_sector_a["category_id"]) & set(
                    category_sector_b["category_id"]
                )
                intersection_sector = set(category_sector_a["sector_id"]) & set(
                    category_sector_b["sector_id"]
                )
                if len(intersection_category) > 0 or len(intersection_sector) > 0:
                    results[user_a].append(user_b)
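The set intersection used inside the loop can be tried on its own. The sketch below uses the category lists of agunii2035 and agunii3007 from the data; the isdisjoint call is an alternative not used in the answer, shown for the case where only a yes/no result is needed:

```python
a = [16, 17, 3, 12, 1]  # category_id of agunii2035
b = [2, 4, 6, 3, 16]    # category_id of agunii3007

# "&" between two sets keeps only the elements present in both.
common = set(a) & set(b)

# isdisjoint is True when nothing is shared, so negate it for "has overlap".
has_overlap = not set(a).isdisjoint(b)

print(common)       # the set containing 3 and 16
print(has_overlap)  # True
```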

Solution 2: itertools

Here, we use itertools to generate all pairs of keys in the original data. This avoids writing the two nested for loops.

    import itertools

    results = {}
    for user_a, user_b in itertools.combinations(d.keys(), 2):
        # we use "set" to get the common elements between the two lists
        intersection_category = set(d[user_a]["category_id"]) & set(
            d[user_b]["category_id"]
        )
        intersection_sector = set(d[user_a]["sector_id"]) & set(d[user_b]["sector_id"])
        if len(intersection_category) > 0 or len(intersection_sector) > 0:
            if user_a in results:
                results[user_a].append(user_b)
            else:
                results[user_a] = [user_b]

It is almost the same as the previous solution, except that at the end we have to create the key in the results dictionary if it doesn't exist yet.
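One difference from Solution 1 is worth noting: itertools.combinations yields each unordered pair only once, so the loop above lists user_b under user_a but not user_a under user_b. If every user should appear as a key, as in the output the question asks for, a symmetric variant could look like the sketch below. The overlaps helper and the "user_x" entry are hypothetical, added only for illustration; defaultdict replaces the manual key check.

```python
import itertools
from collections import defaultdict

d = {
    "agunii2035": {"category_id": [16, 17, 3, 12, 1], "sector_id": [2, 33, 29, 18, 23]},
    "agunii2003": {"category_id": [6, 4, 2, 5, 17], "sector_id": [37, 12, 3, 32, 34]},
    "agunii3000": {"category_id": [12, 11, 7, 3, 1], "sector_id": [38, 1, 13, 25, 3]},
    # Hypothetical user with no shared values, not part of the question's data.
    "user_x": {"category_id": [99], "sector_id": [98]},
}


def overlaps(a, b):
    """Return True if the two users share a category or a sector."""
    return any(
        not set(a[key]).isdisjoint(b[key]) for key in ("category_id", "sector_id")
    )


results = defaultdict(list)
for user_a, user_b in itertools.combinations(d, 2):
    if overlaps(d[user_a], d[user_b]):
        # Record the pair in both directions.
        results[user_a].append(user_b)
        results[user_b].append(user_a)

print(dict(results))
```

Here agunii3000 gets its own entry even though it is never the first element of a pair it belongs to, and user_x does not appear at all.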

Solution 3: itertools and list comprehension

Here, we also use itertools, but inside a list comprehension. The list comprehension outputs only the user pairs that satisfy the condition (the if part).

    import operator
    import itertools

    results = [
        (user_a, user_b)
        for user_a, user_b in itertools.combinations(d.keys(), 2)
        if (len(set(d[user_a]["category_id"]) & set(d[user_b]["category_id"]))) > 0
        or (len(set(d[user_a]["sector_id"]) & set(d[user_b]["sector_id"])) > 0)
    ]

    results = {
        k: list(list(zip(*g))[1])
        for k, g in itertools.groupby(results, operator.itemgetter(0))
    }
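The final dict comprehension is the least obvious step, so here it is in isolation on a short list of placeholder pairs ("a", "b", "c" are not real user IDs). This sketch unpacks each tuple directly instead of using zip, which should give the same result. Note that itertools.groupby only merges consecutive items with the same key; that works here because combinations emits the pairs already ordered by their first element.

```python
import itertools
import operator

# Pairs in the order the list comprehension would emit them.
pairs = [("a", "b"), ("a", "c"), ("b", "c")]

grouped = {
    key: [second for _, second in group]
    for key, group in itertools.groupby(pairs, key=operator.itemgetter(0))
}

print(grouped)  # {'a': ['b', 'c'], 'b': ['c']}
```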

Note the part at the end where we need to group by the first user, because the output of the list comprehension is a list of user tuples (pairs). The way to group a list of tuples in Python comes from another answer on SO.
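The question asks for the result as a dataframe. Assuming a results dict of the shape built in the solutions above, one possible conversion is sketched below; the column name users_with_common_cat_or_sector is a placeholder chosen here, and the sample values are shortened for brevity.

```python
import pandas as pd

# Assumed shape: user -> list of users sharing a category or sector (shortened sample).
results = {
    "agunii2035": ["agunii3007", "agunii2006"],
    "agunii3007": ["agunii2035", "agunii2006"],
}

# One row per user, with the list of matching users kept in a single cell.
out = pd.DataFrame(
    {
        "userID": list(results.keys()),
        "users_with_common_cat_or_sector": list(results.values()),
    }
)

print(out)
```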
