\$\begingroup\$

I have code that quickly grabs the Twitter followers of 12 different users. After the data is appended to a pandas DataFrame, it is compared with the same file pulled the day before.

It allows me to see which users have gained, lost, and returned followers.

The code works fine. However, the for loops that compare the changes between days are slow. Any ideas on how to speed up the for loop section?

TRANSFORMATION FUNCTIONS

    # Transformation Functions

    # Gained Users
    def gained_users(account, new, old):
        gained_followers = []
        old_user_ids = old.Follower_ID[old.Handles == account].unique().tolist()
        new_user_ids = new.Follower_ID[new.Handles == account].unique().tolist()
        gained = list(set(new_user_ids) - set(old_user_ids))
        gained_followers.extend(map(str, gained))
        return gained_followers

    # Lost Users
    def lost_users(account, new, old):
        old_user_ids = old.Follower_ID[old.Handles == account].unique().tolist()
        new_user_ids = new.Follower_ID[new.Handles == account].unique().tolist()
        lost = list(set(old_user_ids) - set(new_user_ids))
        return lost

    # Returned Users
    def returned_users(account, new, old):
        returned_followers = []
        new_user_ids = new.Follower_ID[new.Handles == account].unique().tolist()
        returned_user_ids = old.Follower_ID[(old.Handles == account) & (old.End_Date.notnull() == True)].unique().tolist()
        returned = list(set(returned_user_ids).intersection(new_user_ids))
        returned_followers.extend(map(str, returned))
        return returned_followers

FOR LOOP SECTION

    # Add Returned Users
    for username in lookup_users:
        returned_ids = returned_users(username, new_followers_df, historical_followers_df)
        if returned_ids:
            for ids in returned_ids:
                historical_followers_df.loc[(historical_followers_df["Handles"] == username) & (historical_followers_df["Follower_ID"] == ids), "Start_Date"] = today
                historical_followers_df.loc[(historical_followers_df["Handles"] == username) & (historical_followers_df["Follower_ID"] == ids), "Returned_After_Days"] = pd.to_datetime(historical_followers_df.Start_Date) - pd.to_datetime(historical_followers_df.End_Date)
                historical_followers_df.loc[(historical_followers_df["Handles"] == username) & (historical_followers_df["Follower_ID"] == ids), "End_Date"] = np.NaN

    # Add Lost Users
    for username in lookup_users:
        lost_ids = lost_users(username, new_followers_df, historical_followers_df)
        if lost_ids:
            for ids in lost_ids:
                historical_followers_df.loc[(historical_followers_df["Handles"] == username) & (historical_followers_df["Follower_ID"] == ids), "End_Date"] = today

    # Add Gained Users
    for username in lookup_users:
        new_ids = gained_users(username, new_followers_df, historical_followers_df)
        if new_ids:
            gained_users_list = pd.DataFrame({
                "Handles": username,
                "Follower_ID": new_ids,
                "Start_Date": today})
            historical_followers_df = gained_users_list.append(historical_followers_df, ignore_index=True)

EDIT:

Hi @Graipher, thanks for your help. The reasoning makes sense and the structure is much neater! If you don't mind, could you please answer these three questions:

1. Could you explain the `old_account = old.Handles == account` idiom? I have never seen it before!

2. If I run the code as it is, it complains about the line `old_followers = user_followers & (historical_followers_df["Follower_ID"] == ids)`. Is this because `ids` in the filter is not specified?

   If so, is it correct to say that:

   a. I have to amend the `categorize_users` function above, adding `old = map(str, old_user_ids)` and `return old, gained, lost, returned`?

   b. In the for loop where I call the function, I add a new variable `ids`, so that it looks like `ids, new_ids, lost_ids, returned_ids = categorize_users(username, new_followers_df, historical_followers_df)`?

3. Finally, I think there's a problem with `new_ids`, as it says `TypeError: object of type 'map' has no len()`.

   a. Is this a code problem? I can't really figure that part out.

\$\endgroup\$

1 Answer

\$\begingroup\$

You should try to avoid calculating the same things over and over. The first example is your set of transformation functions. While it is nice that you have separated the concerns, each function recomputes the same per-account ID lists, which probably costs you quite a bit here. It might be better to make this one function:

    def categorize_users(new, old):
        """
        Return the gained, lost and returned followers
        from the two `pandas.DataFrame`s `old` and `new`.
        """
        old_user_ids = set(old.Follower_ID)
        new_user_ids = set(new.Follower_ID)
        returned_user_ids = set(old.Follower_ID[old.End_Date.notnull()])

        # Materialise the results as lists: in Python 3, map() returns a
        # lazy iterator that is always truthy and has no len().
        gained = list(map(str, new_user_ids - old_user_ids))
        lost = list(map(str, old_user_ids - new_user_ids))
        returned = list(map(str, returned_user_ids.intersection(new_user_ids)))
        return gained, lost, returned

Note that every value needed more than once is now computed only once. I also dropped the `.unique()` call, because `set` already removes duplicates, but you might want to add it back in to see whether it is any faster.
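To illustrate the point about `set`, here is a minimal sketch with made-up follower IDs (the values and variable names are purely illustrative, not taken from the question's data). It also shows two Python 3 details that matter here: `map` returns a lazy iterator that is always truthy and has no `len()`, and an ID converted to `str` no longer compares equal to the same ID stored as an integer.

```python
import pandas as pd

# Hypothetical follower IDs; the duplicates are deliberate.
old_ids = pd.Series([101, 102, 102, 103])
new_ids = pd.Series([102, 103, 104, 104])

# set() already removes duplicates, so .unique() is not needed for correctness.
old_set, new_set = set(old_ids), set(new_ids)

gained = sorted(map(str, new_set - old_set))   # in new but not in old
lost = sorted(map(str, old_set - new_set))     # in old but not in new

# A bare map object is truthy even when it would yield nothing,
# so `if some_map:` is not a reliable emptiness check.
empty_map_is_truthy = bool(map(str, set()))

# After converting to str, the IDs only match a column that is also str.
str_id_found_among_ints = "104" in new_set
```

If the `Follower_ID` column holds integers, it may be simpler to skip the `str` conversion, or to convert the column once with `astype(str)` before comparing.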

Now, to your actual for loop:

    # Assumes pandas as pd and numpy as np are imported, and that today,
    # lookup_users and the two DataFrames come from the surrounding script.
    for username in lookup_users:
        # Boolean masks selecting this user's rows in each DataFrame
        new_user_followers = new_followers_df["Handles"] == username
        old_user_followers = historical_followers_df["Handles"] == username
        new_ids, lost_ids, returned_ids = categorize_users(
            new_followers_df[new_user_followers],
            historical_followers_df[old_user_followers])

        # Add Returned Users
        if returned_ids:
            for ids in returned_ids:
                old_followers = old_user_followers & (historical_followers_df["Follower_ID"] == ids)
                historical_followers_df.loc[old_followers, "Start_Date"] = today
                historical_followers_df.loc[old_followers, "Returned_After_Days"] = pd.to_datetime(historical_followers_df.Start_Date) - pd.to_datetime(historical_followers_df.End_Date)
                historical_followers_df.loc[old_followers, "End_Date"] = np.nan

        # Add Lost Users
        if lost_ids:
            for ids in lost_ids:
                old_followers = old_user_followers & (historical_followers_df["Follower_ID"] == ids)
                historical_followers_df.loc[old_followers, "End_Date"] = today

        # Add New Users
        if new_ids:
            gained_users_list = pd.DataFrame({
                "Handles": username,
                "Follower_ID": new_ids,
                "Start_Date": today})
            # pd.concat works across pandas versions; DataFrame.append
            # was removed in pandas 2.0.
            historical_followers_df = pd.concat(
                [gained_users_list, historical_followers_df], ignore_index=True)

I merged your three for loops into one, because all three categories of followers are now available from a single function call. I also put the selection of the old followers that need to be modified into its own variable, so it is no longer computed four times per loop.
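Going one step further, the inner `for ids in ...` loops can probably be dropped altogether by building a single mask with `Series.isin` and assigning once. The following is a sketch under stated assumptions, not a drop-in replacement: the column names come from the question, while the `mark_lost` helper, the sample data and the assumption that `Follower_ID` is stored as strings are illustrative.

```python
import pandas as pd

def mark_lost(history, username, lost_ids, today):
    """Set End_Date for every lost follower of one account in one assignment."""
    # One mask for the whole batch instead of one mask per follower ID
    mask = (history["Handles"] == username) & history["Follower_ID"].isin(lost_ids)
    history.loc[mask, "End_Date"] = today
    return history

# Hypothetical history table; only the column names match the question.
history = pd.DataFrame({
    "Handles": ["a", "a", "a", "b"],
    "Follower_ID": ["1", "2", "3", "1"],
    "Start_Date": pd.to_datetime(["2017-06-01"] * 4),
    "End_Date": pd.Series([pd.NaT] * 4, dtype="datetime64[ns]"),
})
today = pd.Timestamp("2017-06-29")

# Followers "1" and "3" of account "a" are lost; account "b" is untouched.
history = mark_lost(history, "a", ["1", "3"], today)
```

The same pattern should carry over to the returned followers. Whether it is actually faster on the real data is worth measuring rather than assuming.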

I also moved the per-user filtering of the DataFrames into the loop, so that `categorize_users` no longer needs the account name and becomes more general.
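For readers who have not met the `df["Handles"] == username` idiom before: comparing a column with a scalar yields a boolean Series (a mask), and indexing a DataFrame with that mask keeps only the rows where it is `True`. A small sketch with invented data:

```python
import pandas as pd

# Invented example data; only the column names follow the question.
df = pd.DataFrame({
    "Handles": ["alice", "bob", "alice"],
    "Follower_ID": ["1", "2", "3"],
})

# Element-wise comparison: one True/False value per row
mask = df["Handles"] == "alice"

# Indexing with the mask keeps only the rows where it is True
alice_rows = df[mask]

# Masks combine element-wise with & (and), | (or) and ~ (not);
# each comparison needs its own parentheses.
both = mask & (df["Follower_ID"] == "3")
```

Storing the mask in a variable, as the loop above does, means the comparison runs once and can be reused in several `.loc` assignments.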

\$\endgroup\$
  • Hi Graipher, thank you for your reply! I edited my question above. Hopefully you can help me out with that part! Commented Jun 29, 2017 at 16:07
  • It seems that the set function is not doing its job. Is it due to the length of Twitter's IDs? If so, is there a workaround, do you think? Commented Jun 29, 2017 at 16:47
