I have a function which takes in data for a particular year and returns a dataframe.
For example:
df

    year  fruit   license  grade
    1946  apple   XYZ      1
    1946  orange  XYZ      1
    1946  apple   PQR      3
    1946  orange  PQR      1
    1946  grape   XYZ      2
    1946  grape   PQR      1
    ..
    2014  grape   LMN      1

Note:

1) A specific license value will exist only for a particular year and only once for a particular fruit (e.g. XYZ only for 1946 and only once for apple, orange and grape).
2) Grade values are categorical.
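To make the example reproducible, here is a minimal sketch that builds the 1946 rows shown above as a DataFrame. Only the six sample rows are included (the real data spans many more years), and the duplicate check at the end is just an illustration of Note 1, which is what lets `pivot` on fruit and license work without raising an error.

```python
import pandas as pd

# Sample rows copied from the 1946 excerpt above; the real data has many more years.
df = pd.DataFrame({
    "year":    [1946] * 6,
    "fruit":   ["apple", "orange", "apple", "orange", "grape", "grape"],
    "license": ["XYZ", "XYZ", "PQR", "PQR", "XYZ", "PQR"],
    "grade":   [1, 1, 3, 1, 2, 1],
})

# Note 1 says each license appears at most once per fruit, so (fruit, license)
# pairs should be unique within a year.
has_duplicates = df.duplicated(["fruit", "license"]).any()
print(df)
print("duplicate (fruit, license) pairs:", has_duplicates)
```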
I realize the function below isn't a very efficient way to achieve its intended goals, but this is what I am currently working with.

    import numpy as np
    import pandas as pd
    from itertools import combinations

    def func(df, year):
        # 1. Filter out only the data for the year needed
        df_year = df[df['year'] == year]

        '''
        2. Transform DataFrame to the form:

                 XYZ  PQR  ..  LMN
        apple      1    3        1
        orange     1    1        3
        grape      2    1        1

        Note that 'LMN' is just used for representation purposes.
        It won't logically appear here because it can only appear for the year 2014.
        '''
        df_year = df_year.pivot(index='fruit', columns='license', values='grade')

        # 3. Remove all fruits (rows) that have ANY NaN values
        df_year = df_year.dropna(axis=0, how="any")

        # 4. Some additional filtering

        # 5. Function to calculate similarity between fruits
        #    (fruit1 and fruit2 are NumPy arrays of grades, one entry per license)
        def similarity_score(fruit1, fruit2):
            agreements = np.sum(((fruit1 == 1) & (fruit2 == 1)) |
                                ((fruit1 == 3) & (fruit2 == 3)))
            disagreements = np.sum(((fruit1 == 1) & (fruit2 == 3)) |
                                   ((fruit1 == 3) & (fruit2 == 1)))
            return (((agreements - disagreements) / float(len(fruit1))) + 1) / 2

        # 6. Create Network dataframe, one row per pair of fruits
        network_df = pd.DataFrame(columns=['Source', 'Target', 'Weight'])
        for i, c in enumerate(combinations(df_year.index, 2)):
            c1 = df_year.loc[c[0]].values
            c2 = df_year.loc[c[1]].values
            network_df.loc[i] = [c[0], c[1], similarity_score(c1, c2)]

        return network_df

Running the above gives:

    df_1946 = func(df, 1946)
    df_1946.head()

    Source  Target  Weight
    Apple   Orange  0.6
    Apple   Grape   0.3
    Orange  Grape   0.7

I want to flatten the above to a single row:

          (Apple,Orange)  (Apple,Grape)  (Orange,Grape)
    1946  0.6             0.3            0.7

Note that the above will not have 3 columns, but in fact around 5000 columns.
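To make the target shape concrete, here is one possible sketch of the flattening step. It assumes the column labels should be tuples, stored as a two-level `MultiIndex` of (Source, Target), rather than strings. The helper name `flatten_to_row` is hypothetical, and the small `network_df` is only a stand-in for the output of `func(df, 1946)`.

```python
import pandas as pd

# Hypothetical stand-in for the output of func(df, 1946).
network_df = pd.DataFrame({
    "Source": ["Apple", "Apple", "Orange"],
    "Target": ["Orange", "Grape", "Grape"],
    "Weight": [0.6, 0.3, 0.7],
})

def flatten_to_row(network_df, year):
    """Return a one-row DataFrame labelled by year, one column per (Source, Target) pair."""
    # Series of weights indexed by the (Source, Target) pair.
    weights = network_df.set_index(["Source", "Target"])["Weight"]
    # Turn the Series into a single row; the pairs become MultiIndex columns.
    row = weights.to_frame().T
    row.index = [year]
    return row

row_1946 = flatten_to_row(network_df, 1946)
print(row_1946)
```

A single weight can then be looked up with a tuple key, for example `row_1946.loc[1946, ("Apple", "Orange")]`.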
Eventually, I want to stack the transformed dataframe rows to get something like:
df_all_years

          (Apple,Orange)  (Apple,Grape)  (Orange,Grape)
    1946  0.6             0.3            0.7
    1947  0.7             0.25           0.8
    ..
    2015  0.75            0.3            0.65

What is the best way to do this?
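As a rough illustration of the stacking step, the sketch below turns each year's result into a Series keyed by (Source, Target) and combines them with `pd.concat`. The `results` dict is hypothetical sample data standing in for calls to `func(df, year)`, and the approach again assumes tuple (MultiIndex) column labels. It is one possible way to get this shape, not necessarily the best one for around 5000 columns.

```python
import pandas as pd

pairs = {"Source": ["Apple", "Apple", "Orange"],
         "Target": ["Orange", "Grape", "Grape"]}

# Hypothetical per-year outputs, standing in for func(df, year).
results = {
    1946: pd.DataFrame({**pairs, "Weight": [0.6, 0.3, 0.7]}),
    1947: pd.DataFrame({**pairs, "Weight": [0.7, 0.25, 0.8]}),
}

# One Series per year, indexed by the (Source, Target) pair.
per_year = {
    year: net.set_index(["Source", "Target"])["Weight"]
    for year, net in results.items()
}

# Concatenate side by side (one column per year), then transpose so that
# each year becomes a row and each pair becomes a column.
df_all_years = pd.concat(per_year, axis=1).T
print(df_all_years)
```

Because the Series are aligned on their index, a pair that is missing in some year would show up as NaN in that year's row.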



Is (Apple,Orange) a string or a tuple?