
I have the following code:

new_df = pd.DataFrame(columns=df.columns)
for i in list:
    temp = df[df["customer id"] == i]
    new_df = new_df.append(temp)

where list is a list of customer ids for the customers that meet criteria chosen earlier. I use the temp dataframe because there are multiple rows for the same customer.

I consider that I know how to code, but I have never learned how to code for big-data efficiency. In this case, df has around 3 million rows and list contains around 100,000 items. This code ran for more than 24 hours and was still not done, so I have to ask: am I doing something terribly wrong? Is there a way to make this code more efficient?
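For reference, the usual way to avoid repeated append is to collect the pieces and concatenate once at the end. This is a minimal sketch with toy data standing in for the real df (note that DataFrame.append was deprecated and removed in pandas 2.0, so pd.concat is the supported path):

```python
import pandas as pd

# Toy stand-in for the real dataframe ("customer id" plus other columns).
df = pd.DataFrame({
    "customer id": [1, 1, 2, 3, 3],
    "value": [10, 11, 20, 30, 31],
})
ids = [1, 3]  # stands in for the 100,000-item list

# Collect the per-customer slices, then concatenate once
# instead of rebuilding the dataframe on every iteration.
pieces = [df[df["customer id"] == i] for i in ids]
new_df = pd.concat(pieces, ignore_index=True)
```

Each append copies the entire accumulated dataframe, which is what makes the loop quadratic; a single concat copies the data only once.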

  • Yes, new_df = new_df.append(temp) is very inefficient. It makes your algorithm quadratic time: pandas.DataFrame.append always creates a whole new dataframe. The most efficient way would probably be to make the 'customer id' column an index and simply select with your list. Commented Jul 11, 2020 at 23:40
  • I simulated with 3 million records and 100,000 customer ids. It takes only a few seconds with isin. Commented Jul 12, 2020 at 0:23
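The index-based selection suggested in the first comment can be sketched like this (toy data; assumes every id in the list is present in the dataframe, otherwise .loc raises a KeyError):

```python
import pandas as pd

# Toy stand-in for the real dataframe.
df = pd.DataFrame({
    "customer id": [1, 1, 2, 3, 3, 4],
    "value": [10, 11, 20, 30, 31, 40],
})
customer_ids = [1, 3]  # stands in for the 100,000-item list

# Make "customer id" the index, then select all rows for the
# wanted ids in a single lookup.
indexed = df.set_index("customer id")
new_df = indexed.loc[customer_ids].reset_index()
```

Selecting a list of labels through .loc returns every matching row for each label, so customers with multiple rows are handled the same way as in the original loop.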

2 Answers


list is a built-in type in Python, so you should avoid naming your variables after built-in types or functions. I simulated the problem with 3 million rows and a list of 100,000 customer ids. It took only a few seconds using isin.

new_df = df[df['customer id'].isin(customer_list)]

1 Comment

You're absolutely right; this worked perfectly in just seconds. I didn't realize isin could be used this way, so thank you! The huge difference in time between the two approaches still baffles me, but I get it. Thanks again.

You can try the code below, which should make things much faster:

new_df = df.loc[df['customer id'].isin(list)]

