
I have a function which takes in data for a particular year and returns a dataframe.

For example:

df

```
year   fruit  license  grade
1946   apple      XYZ      1
1946  orange      XYZ      1
1946   apple      PQR      3
1946  orange      PQR      1
1946   grape      XYZ      2
1946   grape      PQR      1
..
2014   grape      LMN      1
```

Note:

  1. A specific license value will exist only for a particular year, and only once per fruit for that year (e.g. XYZ appears only in 1946, and only once each for apple, orange and grape).
  2. Grade values are categorical.

I realize the below function isn't very efficient to achieve its intended goals, but this is what I am currently working with.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def func(df, year):
    # 1. Filter out only the data for the year needed
    df_year = df[df['year'] == year]

    # 2. Transform DataFrame to the form:
    #             XYZ  PQR  ..  LMN
    #    apple      1    3        1
    #    orange     1    1        3
    #    grape      2    1        1
    #    Note that 'LMN' is just used for representation purposes.
    #    It won't logically appear here because it can only appear
    #    for the year 2014.
    df_year = df_year.pivot(index='fruit', columns='license', values='grade')

    # 3. Remove all columns that have ANY NaN values
    df_year = df_year.dropna(axis=1, how="any")

    # 4. Some additional filtering

    # 5. Function to calculate similarity between fruits
    def similarity_score(fruit1, fruit2):
        agreements = np.sum(((fruit1 == 1) & (fruit2 == 1)) |
                            ((fruit1 == 3) & (fruit2 == 3)))
        disagreements = np.sum(((fruit1 == 1) & (fruit2 == 3)) |
                               ((fruit1 == 3) & (fruit2 == 1)))
        return ((agreements - disagreements) / float(len(fruit1)) + 1) / 2

    # 6. Create Network dataframe
    network_df = pd.DataFrame(columns=['Source', 'Target', 'Weight'])
    for i, c in enumerate(combinations(df_year, 2)):
        c1 = df_year[[c[0]]].values.tolist()
        c2 = df_year[[c[1]]].values.tolist()
        c1 = [item for sublist in c1 for item in sublist]
        c2 = [item for sublist in c2 for item in sublist]
        network_df.loc[i] = [c[0], c[1], similarity_score(c1, c2)]
    return network_df
```

Running the above gives:

```
df_1946 = func(df, 1946)
df_1946.head()

  Source  Target  Weight
   Apple  Orange     0.6
   Apple   Grape     0.3
  Orange   Grape     0.7
```

I want to flatten the above to a single row:

```
      (Apple,Orange)  (Apple,Grape)  (Orange,Grape)
1946             0.6            0.3             0.7
```

Note: the real data will have around 5,000 such columns, not just 3.
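For concreteness, here is a minimal sketch of how that flattening could look for a single year. The tiny frame is hand-built to stand in for `func`'s output, and the choice of tuple column labels via `to_flat_index` is just one option:

```python
import pandas as pd

# Hand-built stand-in for the single-year Source/Target/Weight result above
df_1946 = pd.DataFrame({'Source': ['Apple', 'Apple', 'Orange'],
                        'Target': ['Orange', 'Grape', 'Grape'],
                        'Weight': [0.6, 0.3, 0.7]})

# Index the weights by their (Source, Target) pair, collapse that pair
# index into plain tuples, then transpose into one row labelled by year
row = df_1946.set_index(['Source', 'Target'])['Weight']
row.index = row.index.to_flat_index()
wide = row.to_frame(name=1946).T
```

Rows produced this way, one per year, could then be stacked with `pd.concat`.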

Eventually, I want to stack the transformed dataframe rows to get something like:

df_all_years

```
      (Apple,Orange)  (Apple,Grape)  (Orange,Grape)
1946             0.6            0.3             0.7
1947             0.7           0.25             0.8
..
2015            0.75            0.3            0.65
```

What is the best way to do this?

2 Comments
  • (Apple,Orange) - is it a string or a tuple? Commented Aug 19, 2017 at 19:34
  • Tuple. You can use anything you like, as long as there is a way to tell what combination a particular cell represents. Commented Aug 19, 2017 at 19:36

2 Answers


I would rearrange the computation a bit differently. Instead of looping over the years:

```python
for year in range(1946, 2015):
    partial_result = func(df, year)
```

and then concatenating the partial results, you can get better performance by doing as much work as possible on the whole DataFrame, `df`, before calling `df.groupby(...)`. Also, if you can express the computation in terms of built-in aggregators such as `sum` and `count`, it can be done more quickly than with custom functions via `groupby/apply`.
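As a tiny illustration of that last point (toy data, not the question's frame), both groupby paths give the same answer, but the built-in aggregator avoids a Python-level function call per group:

```python
import pandas as pd

# Toy data, just to show the two groupby paths agree
df = pd.DataFrame({'year': [1946, 1946, 1947, 1947, 1947],
                   'x':    [1,    2,    3,    4,    5]})

fast = df.groupby('year')['x'].sum()                      # built-in (cythonized) aggregation
slow = df.groupby('year')['x'].apply(lambda s: s.sum())   # Python-level call per group
```

On large frames the difference grows with the number of groups.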

```python
import itertools as IT
import numpy as np
import pandas as pd

np.random.seed(2017)

def make_df():
    N = 10000
    df = pd.DataFrame({'fruit': np.random.choice(['Apple', 'Orange', 'Grape'], size=N),
                       'grade': np.random.choice([1, 2, 3], p=[0.7, 0.1, 0.2], size=N),
                       'year': np.random.choice(range(1946, 1950), size=N)})
    df['manufacturer'] = (df['year'].astype(str) + '-'
                          + df.groupby(['year', 'fruit'])['fruit'].cumcount().astype(str))
    df = df.sort_values(by=['year'])
    return df

def similarity_score(df):
    """
    Compute the score between each pair of columns in df
    """
    agreements = {}
    disagreements = {}
    for col in IT.combinations(df, 2):
        fruit1 = df[col[0]].values
        fruit2 = df[col[1]].values
        agreements[col] = (((fruit1 == 1) & (fruit2 == 1))
                           | ((fruit1 == 3) & (fruit2 == 3)))
        disagreements[col] = (((fruit1 == 1) & (fruit2 == 3))
                              | ((fruit1 == 3) & (fruit2 == 1)))
    agreements = pd.DataFrame(agreements, index=df.index)
    disagreements = pd.DataFrame(disagreements, index=df.index)
    numerator = agreements.astype(int) - disagreements.astype(int)
    grouped = numerator.groupby(level='year')
    total = grouped.sum()
    count = grouped.count()
    score = ((total / count) + 1) / 2
    return score

df = make_df()
df2 = df.set_index(['year', 'fruit', 'manufacturer'])['grade'].unstack(['fruit'])
df2 = df2.dropna(axis=0, how="any")
print(similarity_score(df2))
```

yields

```
         Apple               Grape
         Grape     Orange   Orange
year
1946  0.629111  0.650426  0.641900
1947  0.644388  0.639344  0.633039
1948  0.613117  0.630566  0.616727
1949  0.634176  0.635379  0.637786
```
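If you then want the question's single-row-per-year layout with tuple column labels, the score frame's column MultiIndex can simply be collapsed. A sketch, where the small frame is hand-built to stand in for the `score` result above:

```python
import pandas as pd

# Hand-built stand-in for `score`: years down the rows,
# fruit-pair MultiIndex across the columns
cols = pd.MultiIndex.from_tuples([('Apple', 'Grape'),
                                  ('Apple', 'Orange'),
                                  ('Grape', 'Orange')])
score = pd.DataFrame([[0.63, 0.65, 0.64]], index=[1946], columns=cols)
score.index.name = 'year'

# Collapse the two column levels into plain (fruit1, fruit2) tuples
score.columns = score.columns.to_flat_index()
```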

1 Comment

I have edited the question and defined both df and func so that you can get a better idea of what is going on. Happy to provide more information.

Here's one pandas routine to pivot the table the way you describe. It handles ~5,000 columns (arising combinatorially from two initially separate classes) quickly enough; the bottleneck step took about 20 s on my quad-core MacBook. For much larger scales there are definitely faster strategies. The data in this example is fairly sparse (5,000 columns, with 5,000 random samples spread over 70 rows of years, 1947-2016), so execution time may be a few seconds longer with a fuller dataframe.

```python
from itertools import chain
import pandas as pd
import numpy as np
import random  # using python3 .choices()
import re

# Make bivariate data w/ 5000 total combinations (1000x5 categories)
# Also choose 5,000 randomly; some combinations may have >1 values or NaN
random_sample_data = np.array(
    [random.choices(['Apple', 'Orange', 'Lemon', 'Lime']
                    + ['of Fruit' + str(i) for i in range(1000)], k=5000),
     random.choices(['Grapes', 'Are Purple', 'And Make Wine',
                     'From the Yeast', 'That Love Sugar'], k=5000),
     [random.random() for _ in range(5000)]]
).T

df = pd.DataFrame(random_sample_data, columns=["Source", "Target", "Weight"])
df['Year'] = random.choices(range(1947, 2017), k=df.shape[0])

# Three views of resulting df in a jupyter notebook:
df
df[df.Year == 1947]
df.groupby(["Source", "Target"]).count().unstack()
```


To flatten the grouped-by-year data, since groupby requires a function to be applied, you can use a temporary intermediary dataframe to:

  1. Push each data.groupby("Year") group into an individual row, holding separate dataframes for the two columns "Target" + "Source" (to expand by later) plus "Weight".
  2. Use zip and pd.core.reshape.util.cartesian_product to create an empty, properly shaped pivot df that will become the final table, filled from df_temp.

e.g.,

```python
df_temp = df.groupby("Year").apply(
    lambda s: pd.DataFrame([(s.Target, s.Source, s.Weight)],
                           columns=["Target", "Source", "Weight"])
).sort_index()
df_temp.index = df_temp.index.droplevel(1)  # reduce MultiIndex to 1-d

# Predetermine all possible pairwise column category combinations
product_ts = [*zip(*(pd.core.reshape.util.cartesian_product(
    [df.Target.unique(), df.Source.unique()])))]
ts_combinations = [str(x + ' ' + y) for (x, y) in product_ts]
ts_combinations
```
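Note that `pd.core.reshape.util.cartesian_product` is a private pandas internal, not part of the public API, so it may move or disappear between versions. If it's unavailable, `itertools.product` yields the same pairings in the same order; a sketch with small hypothetical category lists standing in for `df.Target.unique()` and `df.Source.unique()`:

```python
from itertools import product

# Hypothetical unique category values, standing in for
# df.Target.unique() and df.Source.unique()
targets = ['Grapes', 'Are Purple']
sources = ['Apple', 'Orange']

# Same "Target Source" labels the cartesian_product/zip combination builds
ts_combinations = [str(t + ' ' + s) for t, s in product(targets, sources)]
```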


Finally, use a simple nested for loop (again, not the fastest approach, though pd.DataFrame.iterrows helps a bit, as shown). Because of the random sampling with replacement, I had to handle multiple values per cell, so you would probably want to remove the conditional below the second for loop. This is the step where, for each year, the three separate dataframes are zipped into a single row of cells via the pivoted ("Weight") x ("Target"-"Source") relation.

```python
df_pivot = pd.DataFrame(np.zeros((70, 5000)), columns=ts_combinations)
df_pivot.index = df_temp.index

for year, values in df_temp.iterrows():
    for (target, source, weight) in zip(*values):
        bivar_pair = str(target + ' ' + source)
        curr_weight = df_pivot.loc[year, bivar_pair]
        if curr_weight == 0.0:
            df_pivot.loc[year, bivar_pair] = [weight]
        # append additional values if encountered
        elif type(curr_weight) == list:
            df_pivot.loc[year, bivar_pair] = str(curr_weight + [weight])
```


```python
# Spotcheck:
# Verifies matching data in pivoted table vs. original for Target+Source
# combination "And Make Wine of Fruit614" across all 70 years 1947-2016
df
df_pivot['And Make Wine of Fruit614']
df[(df.Year == 1947) & (df.Target == 'And Make Wine')
   & (df.Source == 'of Fruit614')]
```
