
I'm dealing with ranked ordered list data at massive scale. I need to compare how individuals rank institutions/programs across periods, and I need help figuring out the most efficient way to deal with this.

  • A ranked ordered list (ROL): a report per individual in which they rank programs at institutions from most preferred to least preferred (0 being the most preferred).
  • Operations: I need to run multiple operations between ROLs, such as whether the order changed, whether new institutions or programs were added, and a lot more that I'm not detailing here.
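The comparisons described above can be sketched with plain Python structures. This is a hypothetical standalone model for illustration only (the real data lives in a DataFrame): an ROL as a rank-ordered tuple of (id_institution, id_program) pairs.

```python
# Hypothetical sketch: one ROL per (individual, period), ordered by rank.
# Each entry is an (id_institution, id_program) pair; index 0 is most preferred.
rol_pre = ((100, 101), (200, 201))   # period 1
rol_post = ((200, 201), (100, 101))  # period 2: same choices, swapped order

# Any change at all? Tuple comparison is order-sensitive.
changed = rol_pre != rol_post
# A pure reorder: same pairs as a set, same length, but different order.
reordered = changed and set(rol_pre) == set(rol_post) and len(rol_pre) == len(rol_post)

print(changed, reordered)  # True True
```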

I started using dictionaries because I'm familiar with them, but for a subsample my code takes 28 hours to run. I need to speed this up a lot, and I'm particularly looking for advice on the most efficient way to work with this type of data.

Below is a fake data set on which I'm running the code.

```python
import pandas as pd
import numpy as np

# generate fake data frame
df = pd.DataFrame([[1, 1, 0, 100, 101],
                   [1, 2, 0, 100, 101],
                   [1, 2, 1, 100, 102],
                   [2, 1, 0, 100, 101],
                   [2, 2, 0, 100, 101],
                   [2, 2, 1, 200, 202],
                   [3, 1, 0, 100, 101],
                   [3, 1, 1, 200, 201],
                   [3, 2, 0, 100, 101],
                   [3, 2, 1, 200, 201],
                   [4, 1, 0, 100, 101],
                   [4, 1, 1, 200, 201],
                   [4, 2, 0, 200, 201],
                   [4, 2, 1, 100, 101]],
                  columns=['id_individual', 'period', 'rank',
                           'id_institution', 'id_program'])

df['change_app'] = False
df['change_order'] = False
df['add_newinst'] = False
df['add_newprog'] = False

for indiv in df['id_individual'].unique():
    # recover the ranking of each individual for each period
    r_pre = df.loc[(df['id_individual'] == indiv) & (df['period'] == 1)]
    r_post = df.loc[(df['id_individual'] == indiv) & (df['period'] == 2)]
    # generate empty dicts to store ranks
    rank_pre = {}
    rank_post = {}
    # extract institution and program and assign to the dictionaries
    for i in range(len(r_pre)):
        rank_pre[i] = (r_pre['id_institution'].loc[r_pre['rank'] == i].values[0],
                       r_pre['id_program'].loc[r_pre['rank'] == i].values[0])
    for i in range(len(r_post)):
        rank_post[i] = (r_post['id_institution'].loc[r_post['rank'] == i].values[0],
                        r_post['id_program'].loc[r_post['rank'] == i].values[0])
    # if the dictionaries differ, compute the cases
    if rank_pre != rank_post:
        # the application changed
        df['change_app'].loc[df['id_individual'] == indiv] = True
        # check if it was only a reorder
        df['change_order'].loc[df['id_individual'] == indiv] = (
            (set(rank_pre.values()) == set(rank_post.values()))
            & (len(rank_pre) == len(rank_post)))
        # sets of (institution, program) pairs, and of institutions alone
        programs_pre = set(rank_pre.values())
        programs_post = set(rank_post.values())
        inst_pre = set(x[0] for x in rank_pre.values())
        inst_post = set(x[0] for x in rank_post.values())
        # added institution: inst_post has an element that is not in inst_pre
        df['add_newinst'].loc[df['id_individual'] == indiv] = len(inst_post - inst_pre) > 0
        # added program: programs_post has an element that is not in programs_pre
        df['add_newprog'].loc[df['id_individual'] == indiv] = len(programs_post - programs_pre) > 0

df.head(14)
```

Expected Output:

```
    id_individual  period  rank  id_institution  id_program  change_app  change_order  add_newinst  add_newprog
0               1       1     0             100         101        True         False        False         True
1               1       2     0             100         101        True         False        False         True
2               1       2     1             100         102        True         False        False         True
3               2       1     0             100         101        True         False         True         True
4               2       2     0             100         101        True         False         True         True
5               2       2     1             200         202        True         False         True         True
6               3       1     0             100         101       False         False        False        False
7               3       1     1             200         201       False         False        False        False
8               3       2     0             100         101       False         False        False        False
9               3       2     1             200         201       False         False        False        False
10              4       1     0             100         101        True          True        False        False
11              4       1     1             200         201        True          True        False        False
12              4       2     0             200         201        True          True        False        False
13              4       2     1             100         101        True          True        False        False
```
  • I tried: performing operations over ranked ordered lists from individuals using pandas/dictionaries.
  • I expected: low computing time.
  • For 500,000 individuals, comparing ranked ordered lists takes around 20 hours.

1 Answer


Building pivot tables that we can apply vectorized functions to should perform far faster than any manual loop...

```python
import pandas as pd

# Test Data
df = pd.DataFrame([[1, 1, 0, 100, 101],
                   [1, 2, 0, 100, 101],
                   [1, 2, 1, 100, 102],
                   [2, 1, 0, 100, 101],
                   [2, 2, 0, 100, 101],
                   [2, 2, 1, 200, 202],
                   [3, 1, 0, 100, 101],
                   [3, 1, 1, 200, 201],
                   [3, 2, 0, 100, 101],
                   [3, 2, 1, 200, 201],
                   [4, 1, 0, 100, 101],
                   [4, 1, 1, 200, 201],
                   [4, 2, 0, 200, 201],
                   [4, 2, 1, 100, 101]],
                  columns=['id_individual', 'period', 'rank',
                           'id_institution', 'id_program'])

# Pivot config, we'll use this twice~
config = {
    'index': 'id_individual',
    'columns': 'period',
    'values': ['id_institution', 'id_program']
}

# First pivot table: the unique values by period
unique = df.pivot_table(**config, aggfunc=set)

# To look for changes, it'll help if there aren't missing values,
# so use a reindex trick to surface those missing rows:
names = ['id_individual', 'period', 'rank']
df = (df.set_index(names)
        .reindex(pd.MultiIndex.from_product(df[names].apply(set), names=names))
        .reset_index())

# Now we can make the pivot table of changes:
changes = df.pivot_table(**config, aggfunc=tuple)

# You're looking for a transform, so set the index we're using to
# facilitate it; also drop the NaN rows we created and convert back to ints:
df = df.set_index('id_individual').dropna().astype(int)

# Cross section of each period, looking at changes:
p1_c, p2_c = [changes.xs(x, level='period', axis=1) for x in (1, 2)]
# It was changed if either column had any change:
was_changed = p1_c.ne(p2_c).any(axis=1)
df['change_app'] = was_changed

# Cross section of each period, looking at unique values:
p1_u, p2_u = [unique.xs(x, level='period', axis=1) for x in (1, 2)]
# First, are they the same?
same_vals = p1_u.eq(p2_u).all(axis=1)
# If they're the same, and were changed, it was just an order change:
df['change_order'] = was_changed & same_vals

# We take advantage of set logic here: new things have been added
# if the first set is a proper subset (<) of the second:
df['add_newinst'] = p1_u.id_institution.lt(p2_u.id_institution)
df['add_newprog'] = p1_u.id_program.lt(p2_u.id_program)

# Reset the index back to where we started:
df = df.reset_index()
print(df)
```
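The `lt` trick works because the pivoted cells hold Python sets, and `<` between sets means "proper subset". A minimal standalone sketch of that behavior (the Series and index values here are hypothetical, not the answer's frame):

```python
import pandas as pd

# Hypothetical per-period institution sets, one row per individual
pre = pd.Series([{100}, {100}, {100, 200}], index=[1, 2, 3])
post = pd.Series([{100}, {100, 200}, {100, 200}], index=[1, 2, 3])

# Series.lt applies < element-wise; for sets that's a proper-subset test,
# i.e. "the post set contains something the pre set didn't".
added = pre.lt(post)
print(added.tolist())  # [False, True, False]
```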

Output:

```
    id_individual  period  rank  id_institution  id_program  change_app  change_order  add_newinst  add_newprog
0               1       1     0             100         101        True         False        False         True
1               1       2     0             100         101        True         False        False         True
2               1       2     1             100         102        True         False        False         True
3               2       1     0             100         101        True         False         True         True
4               2       2     0             100         101        True         False         True         True
5               2       2     1             200         202        True         False         True         True
6               3       1     0             100         101       False         False        False        False
7               3       1     1             200         201       False         False        False        False
8               3       2     0             100         101       False         False        False        False
9               3       2     1             200         201       False         False        False        False
10              4       1     0             100         101        True          True        False        False
11              4       1     1             200         201        True          True        False        False
12              4       2     0             200         201        True          True        False        False
13              4       2     1             100         101        True          True        False        False
```

~Large Frame Test - 200k IDs~

```python
import pandas as pd
import numpy as np
from time import time

df = pd.DataFrame([[1, 1, 0, 100, 101],
                   [1, 2, 0, 100, 101],
                   [1, 2, 1, 100, 102],
                   [2, 1, 0, 100, 101],
                   [2, 2, 0, 100, 101],
                   [2, 2, 1, 200, 202],
                   [3, 1, 0, 100, 101],
                   [3, 1, 1, 200, 201],
                   [3, 2, 0, 100, 101],
                   [3, 2, 1, 200, 201],
                   [4, 1, 0, 100, 101],
                   [4, 1, 1, 200, 201],
                   [4, 2, 0, 200, 201],
                   [4, 2, 1, 100, 101]],
                  columns=['id_individual', 'period', 'rank',
                           'id_institution', 'id_program'])

# 7 million rows, 200k individuals:
df = pd.concat([df.assign(id_individual=df.id_individual.add(4 * x))
                for x in range(50000)], ignore_index=True)

start = time()
config = {
    'index': 'id_individual',
    'columns': 'period',
    'values': ['id_institution', 'id_program']
}
unique = df.pivot_table(**config, aggfunc=set)
names = ['id_individual', 'period', 'rank']
df = (df.set_index(names)
        .reindex(pd.MultiIndex.from_product(df[names].apply(set), names=names))
        .reset_index())
changes = df.pivot_table(**config, aggfunc=tuple)
df = df.set_index('id_individual').dropna().astype(int)
p1_c, p2_c = [changes.xs(x, level='period', axis=1) for x in (1, 2)]
was_changed = p1_c.ne(p2_c).any(axis=1)
df['change_app'] = was_changed
p1_u, p2_u = [unique.xs(x, level='period', axis=1) for x in (1, 2)]
same_vals = p1_u.eq(p2_u).all(axis=1)
df['change_order'] = was_changed & same_vals
df['add_newinst'] = p1_u.id_institution.lt(p2_u.id_institution)
df['add_newprog'] = p1_u.id_program.lt(p2_u.id_program)
df = df.reset_index()
print('Time Taken:', round(time() - start, 2), 'seconds')
# Output: Time Taken: 12.87 seconds
```
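The gap between hours and seconds comes almost entirely from moving per-element work out of the Python interpreter. A minimal standalone sketch of that effect (the arrays here are hypothetical, and timings are illustrative and machine-dependent, not a benchmark of the answer's code):

```python
import numpy as np
from time import time

# Two hypothetical columns of a million values to compare
a = np.random.randint(0, 100, 1_000_000)
b = np.random.randint(0, 100, 1_000_000)

# Python-level loop: one interpreter round-trip per element
t0 = time()
loop_result = [x != y for x, y in zip(a, b)]
loop_time = time() - t0

# Vectorized: a single NumPy call over the whole array
t0 = time()
vec_result = a != b
vec_time = time() - t0

# Same answers either way, very different cost
assert bool((vec_result == np.array(loop_result)).all())
print(f'loop {loop_time:.3f}s vs vectorized {vec_time:.4f}s')
```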

2 Comments

Thank you very much! Incredible speed boost. I would still like to know which is the "best theoretical way" to deal with this type of data. But pivot tables are for sure faster than dictionaries by orders of magnitude. Thanks! :)
The shortest answer to "best theoretical way", while still using pandas, is anything vectorized. You may get marginal improvements with other vectorized approaches, but nothing that relies primarily on a standard for-loop will ever come close.
