I'm dealing with ranked ordered list data at massive scale. I need to compare how individuals rank institutions/programs across periods. I need help figuring out which is the most efficient way to deal with this.
- A ranked ordered list (ROL): a report by individual in which they rank programs in institutions from most preferred to least preferred (0 being the most preferred).
- Operations: I need to run multiple operations between ROLs. Such as if the order changes, are new institutions or programs are added, and a lot more that I'm not detailing here.
I started using dictionaries because I'm familiar with them, but for a subsample my code is taking 28 hours to run. I need to speed this up a lot. I'm particularly looking for advice in which is the most efficient way to work with this type of data.
Below there is a fake data set on which I'm running the code.
import pandas as pd import numpy as np # generate fake data frame df = pd.DataFrame([[1, 1, 0, 100, 101], [1, 2, 0, 100, 101], [1, 2, 1, 100, 102], [2, 1, 0, 100, 101], [2, 2, 0, 100, 101], [2, 2, 1, 200, 202], [3, 1, 0, 100, 101], [3, 1, 1, 200, 201], [3, 2, 0, 100, 101], [3, 2, 1, 200, 201], [4, 1, 0, 100, 101], [4, 1, 1, 200, 201], [4, 2, 0, 200, 201], [4, 2, 1, 100, 101] ], columns=['id_individual', 'period', 'rank', 'id_institution', 'id_program']) df['change_app'] = False df['change_order'] = False df['add_newinst'] = False df['add_newprog'] = False for indiv in df['id_individual'].unique(): # recover rank of each individual for each period r_pre = df.loc[(df['id_individual'] == indiv) & (df['period'] == 1)] r_post = df.loc[(df['id_individual'] == indiv) & (df['period'] == 2)] # generate empty dict to store ranks rank_pre = {} rank_post = {} # extract institution and program and assign to dictionary for i in range(0, len(r_pre)): rank_pre[i] = r_pre['id_institution'].loc[r_pre['rank'] == i].values[0], r_pre['id_program'].loc[r_pre['rank'] == i].values[0] for i in range(0, len(r_post)): rank_post[i] = r_post['id_institution'].loc[r_post['rank'] == i].values[0], r_post['id_program'].loc[r_post['rank'] == i].values[0] # if dictionaries are different, then compute some cases if rank_pre != rank_post: # Replace change app to true df['change_app'].loc[(df['id_individual'] == indiv)] = True # check if it was a reorder df['change_order'].loc[(df['id_individual'] == indiv)] = (set(rank_pre.values()) == set(rank_post.values())) & (len(rank_pre) == len(rank_post)) # get the set of values in the first position of the tuple programs_pre = set(rank_pre.values()) programs_post = set(rank_post.values()) inst_pre = set([x[0] for x in rank_pre.values()]) inst_post = set([x[0] for x in rank_post.values()]) # Added institution: if set of inst_post has an element that is not in inst_pre df['add_newinst'].loc[(df['id_individual'] == indiv)] = len(inst_post - inst_pre) > 0 # Added program: if set of programs_post has an element that is not in programs_pre df['add_newprog'].loc[(df['id_individual'] == indiv)] = len(programs_post - programs_pre) > 0 df.head(14) Expected Output:
id_individual period rank id_institution id_program change_app change_order add_newinst add_newprog 0 1 1 0 100 101 True False False True 1 1 2 0 100 101 True False False True 2 1 2 1 100 102 True False False True 3 2 1 0 100 101 True False True True 4 2 2 0 100 101 True False True True 5 2 2 1 200 202 True False True True 6 3 1 0 100 101 False False False False 7 3 1 1 200 201 False False False False 8 3 2 0 100 101 False False False False 9 3 2 1 200 201 False False False False 10 4 1 0 100 101 True True False False 11 4 1 1 200 201 True True False False 12 4 2 0 200 201 True True False False 13 4 2 1 100 101 True True False False - I tried: performing operations over ranked ordered lists from individuals using pandas/dictionaries.
- I expected: low computing time.
- For 500.000 individuals, comparing ranked ordered lists is taking around 20 hours