Calculate distance among LDA distributions between two rows in Pandas data frame

Question

I have a dataframe with LDA topic distribution outputs along with other demographic information as below:

    single_df = pd.DataFrame([
        {"department": 'marketing', 'LDA_1': 0.252, 'LDA_2': 0.002, 'LDA_3': 0.50},
        {"department": 'engineering', 'LDA_1': 0.478, 'LDA_2': 0.152, 'LDA_3': 0.492},
        {"department": 'cooperate', 'LDA_1': 0.52, 'LDA_2': 0.780, 'LDA_3': 0.50},
        {"department": "marketing", 'LDA_1': 0.352, 'LDA_2': 0.052, 'LDA_3': 0.20},
    ])

I would like to produce the final dataframe below. How do I write a function that calculates the Jensen-Shannon distance between each pair of rows (using the columns whose names contain "LDA_") and returns this data frame?

    i  j  same_department  distance_LDA
    0  1                0          0.23
    0  2                0          0.43
    0  3                1          0.26
    1  2                0          0.24
    1  3                0          0.11
    2  3                0          0.29

I've written code to calculate JS distance between individual pairs as below. How do I turn it into a function?

    from scipy.spatial import distance

    array = single_df.filter(regex='LDA').to_numpy()
    distance.jensenshannon(array[0], array[1])

Then to calculate whether two people share the department, I have the code below:

    def same_department(i, j):
        if i['department'] == j['department']:
            return 1
        else:
            return 0

Henry Ecker · Accepted Answer · 2021-05-04 00:11:34Z

Let's try generating all possible row combinations, then merging them into a DataFrame where each pair sits on a single row so the comparison can happen within that row. The jensenshannon function is then applied row-wise, using the _x and _y column suffixes that the merge adds:

    from itertools import combinations

    import pandas as pd
    from scipy.spatial.distance import jensenshannon

    single_df = pd.DataFrame([
        {"department": 'marketing', 'LDA_1': 0.252, 'LDA_2': 0.002, 'LDA_3': 0.50},
        {"department": 'engineering', 'LDA_1': 0.478, 'LDA_2': 0.152, 'LDA_3': 0.492},
        {"department": 'cooperate', 'LDA_1': 0.52, 'LDA_2': 0.780, 'LDA_3': 0.50},
        {"department": "marketing", 'LDA_1': 0.352, 'LDA_2': 0.052, 'LDA_3': 0.20},
    ])

    # Merge the 3 LDA columns into a single column containing a list
    single_df['LDA'] = single_df.filter(regex='^LDA_.*').agg(list, axis=1)

    # Get rid of the original LDA_X columns
    single_df = single_df.filter(regex='^(?!LDA_.*)')

    # Get all row combinations
    a, b = map(list, zip(*combinations(single_df.index, 2)))

    # Merge together
    df = single_df.loc[a].reset_index().merge(
        single_df.loc[b].reset_index(),
        left_index=True,
        right_index=True,
    )

    # Apply jensenshannon to the LDA_x and LDA_y lists
    df['distance_LDA'] = df.apply(
        lambda x: jensenshannon(x['LDA_x'], x['LDA_y']), axis=1)

    # Flag whether both rows are in the same department
    df['same_department'] = df['department_x'].eq(df['department_y']).astype(int)

    # Rename and filter columns
    df = df.rename(columns={'index_x': 'i', 'index_y': 'j'})[
        ['i', 'j', 'same_department', 'distance_LDA']
    ]

    # For display
    print(df.to_string(index=False))

Output:

    i  j  same_department  distance_LDA
    0  1                0      0.235849
    0  2                0      0.429508
    0  3                1      0.264777
    1  2                0      0.238155
    1  3                0      0.112456
    2  3                0      0.299704
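
Since the question asks for a function, the same pairwise computation can also be wrapped in one. The following is only a minimal sketch, not part of the original answer: the name js_pairs and its prefix and group_col parameters are hypothetical, and it assumes the frame has a default integer index so that row positions and index labels coincide.

```python
from itertools import combinations

import pandas as pd
from scipy.spatial.distance import jensenshannon


def js_pairs(df, prefix="LDA_", group_col="department"):
    """Hypothetical helper: one output row per unordered pair of input rows."""
    # Topic columns as a 2-D array, one row of weights per input row
    topics = df.filter(regex=f"^{prefix}").to_numpy(dtype=float)
    groups = df[group_col].to_numpy()

    records = []
    # combinations yields positional pairs (0, 1), (0, 2), ... with i < j
    for i, j in combinations(range(len(df)), 2):
        records.append({
            "i": i,
            "j": j,
            "same_department": int(groups[i] == groups[j]),
            # jensenshannon normalizes both inputs to sum to 1 before comparing
            "distance_LDA": jensenshannon(topics[i], topics[j]),
        })
    return pd.DataFrame(
        records, columns=["i", "j", "same_department", "distance_LDA"]
    )


single_df = pd.DataFrame([
    {"department": "marketing", "LDA_1": 0.252, "LDA_2": 0.002, "LDA_3": 0.50},
    {"department": "engineering", "LDA_1": 0.478, "LDA_2": 0.152, "LDA_3": 0.492},
    {"department": "cooperate", "LDA_1": 0.52, "LDA_2": 0.780, "LDA_3": 0.50},
    {"department": "marketing", "LDA_1": 0.352, "LDA_2": 0.052, "LDA_3": 0.20},
])

result = js_pairs(single_df)
print(result.to_string(index=False))
```

This still calls jensenshannon once per pair in a Python loop, so it is a convenience wrapper rather than a speedup.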

Thanks! Was wondering if there is a faster way to calculate J-S distance other than using the "apply()" function? My actual data frame has over 1M rows. Any suggestion would be appreciated!
Some optimizations are possible, but a significant speedup would require major refactoring. There may be some good ideas in "Performance of Pandas apply vs np.vectorize to create new column from existing columns", or you might consider multiprocessing.
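
One possible refactoring along those lines is to drop apply() and compute every pair at once with NumPy array operations. The sketch below is an illustration under stated assumptions, not something from the thread: pairwise_js is a hypothetical name, the frame is assumed to have a default integer index, and the distance is written out by hand as the square root of the averaged KL divergences to the midpoint distribution (natural log, rows normalized to sum to 1), which is the definition scipy's jensenshannon follows.

```python
import numpy as np
import pandas as pd


def pairwise_js(df, prefix="LDA_", group_col="department"):
    """Hypothetical helper: JS distance for every row pair (i < j), vectorized."""
    # Select the topic columns and normalize each row to sum to 1
    probs = df.filter(regex=f"^{prefix}").to_numpy(dtype=float)
    probs = probs / probs.sum(axis=1, keepdims=True)

    # Positions of all pairs with i < j: (0, 1), (0, 2), ..., (n-2, n-1)
    i, j = np.triu_indices(len(df), k=1)
    p, q = probs[i], probs[j]
    m = 0.5 * (p + q)

    def kl(a, b):
        # Row-wise KL divergence; terms with a == 0 contribute 0
        with np.errstate(divide="ignore", invalid="ignore"):
            terms = np.where(a > 0, a * np.log(a / b), 0.0)
        return terms.sum(axis=1)

    # Clip at 0 to guard against tiny negative values from rounding
    js_divergence = np.maximum(0.5 * (kl(p, m) + kl(q, m)), 0.0)

    groups = df[group_col].to_numpy()
    return pd.DataFrame({
        "i": i,
        "j": j,
        "same_department": (groups[i] == groups[j]).astype(int),
        "distance_LDA": np.sqrt(js_divergence),
    })


single_df = pd.DataFrame([
    {"department": "marketing", "LDA_1": 0.252, "LDA_2": 0.002, "LDA_3": 0.50},
    {"department": "engineering", "LDA_1": 0.478, "LDA_2": 0.152, "LDA_3": 0.492},
    {"department": "cooperate", "LDA_1": 0.52, "LDA_2": 0.780, "LDA_3": 0.50},
    {"department": "marketing", "LDA_1": 0.352, "LDA_2": 0.052, "LDA_3": 0.20},
])

result = pairwise_js(single_df)
print(result.to_string(index=False))
```

Note that the number of pairs is n(n-1)/2, which for 1,000,000 rows is roughly 5 × 10^11. Vectorizing removes the per-row Python overhead but not that quadratic growth, so at that scale the pairs would still need to be processed in batches or restricted to a subset of interest.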
