
I have a dataframe with LDA topic distribution outputs along with other demographic information as below:

    import pandas as pd

    single_df = pd.DataFrame([
        {"department": 'marketing', 'LDA_1': 0.252, 'LDA_2': 0.002, 'LDA_3': 0.50},
        {"department": 'engineering', 'LDA_1': 0.478, 'LDA_2': 0.152, 'LDA_3': 0.492},
        {"department": 'cooperate', 'LDA_1': 0.52, 'LDA_2': 0.780, 'LDA_3': 0.50},
        {"department": "marketing", 'LDA_1': 0.352, 'LDA_2': 0.052, 'LDA_3': 0.20},
    ])


I would like to get to the final dataframe below. How do I write a function that calculates the Jensen-Shannon distance between every pair of rows (using the columns whose names contain "LDA_") and returns this data frame?

    i  j  same_department  distance_LDA
    0  1                0          0.23
    0  2                0          0.43
    0  3                1          0.26
    1  2                0          0.24
    1  3                0          0.11
    2  3                0          0.29

I've written code to calculate JS distance between individual pairs as below. How do I turn it into a function?

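    from scipy.spatial import distance

    # Rows of topic weights, one row per person
    array = single_df.filter(regex='LDA').to_numpy()
    # JS distance between the first two rows
    distance.jensenshannon(array[0], array[1])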

Then to calculate whether two people share the department, I have the code below:

    def same_department(i, j):
        if i['department'] == j['department']:
            return 1
        else:
            return 0

1 Answer

One approach is to generate all possible row combinations and merge them into a DataFrame in which each pair sits in a single row. The jensenshannon function can then be applied row-wise to the columns that the merge suffixes with _x and _y:

    from itertools import combinations

    from scipy.spatial.distance import jensenshannon
    import pandas as pd

    single_df = pd.DataFrame([
        {"department": 'marketing', 'LDA_1': 0.252, 'LDA_2': 0.002, 'LDA_3': 0.50},
        {"department": 'engineering', 'LDA_1': 0.478, 'LDA_2': 0.152, 'LDA_3': 0.492},
        {"department": 'cooperate', 'LDA_1': 0.52, 'LDA_2': 0.780, 'LDA_3': 0.50},
        {"department": "marketing", 'LDA_1': 0.352, 'LDA_2': 0.052, 'LDA_3': 0.20},
    ])

    # Merge the 3 LDA Columns Into A Single Column Containing a List
    single_df['LDA'] = single_df.filter(regex='^LDA_.*').agg(list, axis=1)
    # Get Rid Of The Original LDA_X columns
    single_df = single_df.filter(regex='^(?!LDA_.*)')

    # Get All Row Combinations
    a, b = map(list, zip(*combinations(single_df.index, 2)))

    # Merge Together
    df = single_df.loc[a].reset_index().merge(
        single_df.loc[b].reset_index(),
        left_index=True,
        right_index=True,
    )

    # Apply jensenshannon to LDA_x and LDA_y Lists
    df['distance_LDA'] = df.apply(
        lambda x: jensenshannon(x['LDA_x'], x['LDA_y']), axis=1)

    # Get If In Same Department
    df['same_department'] = df['department_x'].eq(df['department_y']).astype(int)

    # Rename and Filter Columns
    df = df \
        .rename(columns={'index_x': 'i', 'index_y': 'j'})[['i', 'j', 'same_department', 'distance_LDA']]

    # For Display
    print(df.to_string(index=False))

Output:

    i  j  same_department  distance_LDA
    0  1                0      0.235849
    0  2                0      0.429508
    0  3                1      0.264777
    1  2                0      0.238155
    1  3                0      0.112456
    2  3                0      0.299704
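
Since the question asks for a function, the same steps could be wrapped roughly as follows. This is only a sketch: the name pairwise_js and its prefix and group_col parameters are made up for illustration, and it loops over positional row pairs instead of merging, so i and j match the index labels only when the frame has the default 0..n-1 index.

```python
from itertools import combinations

import pandas as pd
from scipy.spatial.distance import jensenshannon


def pairwise_js(df, prefix="LDA_", group_col="department"):
    """Hypothetical helper: JS distance for every pair of rows in df."""
    # Topic weights as a 2-D array, one row per person
    topics = df.filter(regex=f"^{prefix}").to_numpy()
    groups = df[group_col].to_numpy()
    rows = [
        {
            "i": i,
            "j": j,
            "same_department": int(groups[i] == groups[j]),
            "distance_LDA": jensenshannon(topics[i], topics[j]),
        }
        for i, j in combinations(range(len(df)), 2)  # positional pairs
    ]
    return pd.DataFrame(
        rows, columns=["i", "j", "same_department", "distance_LDA"])


single_df = pd.DataFrame([
    {"department": "marketing", "LDA_1": 0.252, "LDA_2": 0.002, "LDA_3": 0.50},
    {"department": "engineering", "LDA_1": 0.478, "LDA_2": 0.152, "LDA_3": 0.492},
    {"department": "cooperate", "LDA_1": 0.52, "LDA_2": 0.780, "LDA_3": 0.50},
    {"department": "marketing", "LDA_1": 0.352, "LDA_2": 0.052, "LDA_3": 0.20},
])

result = pairwise_js(single_df)
print(result.to_string(index=False))
```

It still makes one Python-level call per pair, so it is a convenience wrapper and not a speed-up.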

2 Comments

Thanks! Was wondering if there is a faster way to calculate J-S distance other than using the "apply()" function? My actual data frame has over 1M rows. Any suggestion would be appreciated!
There may be some optimizations available, but a significant speed-up would require major refactoring. There may be some good ideas in "Performance of Pandas apply vs np.vectorize to create new column from existing columns", or you might consider multiprocessing.
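
As one possible direction for that refactoring, the per-row apply() could be replaced with NumPy array operations over all pairs at once. The sketch below is an assumption-laden illustration, not a tested solution for the 1M-row case: the name pairwise_js_vectorized is hypothetical, and it re-implements the distance by normalizing each row to sum to 1 and using scipy.special.rel_entr for the two KL terms, which is intended to mirror what jensenshannon computes with the natural logarithm.

```python
import numpy as np
import pandas as pd
from scipy.special import rel_entr


def pairwise_js_vectorized(df, prefix="LDA_", group_col="department"):
    """Hypothetical helper: all-pairs JS distance without apply()."""
    p = df.filter(regex=f"^{prefix}").to_numpy(dtype=float)
    # Normalize each row to a probability distribution
    p = p / p.sum(axis=1, keepdims=True)
    # Positional indices of every pair (i < j), in row-major order
    i, j = np.triu_indices(len(df), k=1)
    m = (p[i] + p[j]) / 2.0
    # JS divergence = mean of KL(p || m) and KL(q || m)
    js = (rel_entr(p[i], m).sum(axis=1) + rel_entr(p[j], m).sum(axis=1)) / 2.0
    groups = df[group_col].to_numpy()
    return pd.DataFrame({
        "i": i,
        "j": j,
        "same_department": (groups[i] == groups[j]).astype(int),
        # Clip guards against tiny negative values from rounding
        "distance_LDA": np.sqrt(np.clip(js, 0.0, None)),
    })


single_df = pd.DataFrame([
    {"department": "marketing", "LDA_1": 0.252, "LDA_2": 0.002, "LDA_3": 0.50},
    {"department": "engineering", "LDA_1": 0.478, "LDA_2": 0.152, "LDA_3": 0.492},
    {"department": "cooperate", "LDA_1": 0.52, "LDA_2": 0.780, "LDA_3": 0.50},
    {"department": "marketing", "LDA_1": 0.352, "LDA_2": 0.052, "LDA_3": 0.20},
])

result = pairwise_js_vectorized(single_df)
print(result.to_string(index=False))
```

Note that the number of pairs grows as n(n-1)/2, so for around 1M rows the full pair table will not fit in memory whichever way each distance is computed; the index arrays would have to be processed in chunks, or the set of pairs restricted first.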
