I have the following code:
    new_df = pd.DataFrame(columns=df.columns)
    for i in list:
        temp = df[df["customer id"] == i]
        new_df = new_df.append(temp)

where list is a list of customer ids for the customers that meet a criterion chosen earlier. I use the temp dataframe because there are multiple rows for the same customer.
I consider myself able to code, but I have never learned how to write code efficiently for big data. In this case, df has around 3 million rows and list contains around 100,000 items. This code ran for more than 24 hours and still was not done, so I have to ask: am I doing something terribly wrong? Is there a way to make this code more efficient?
new_df = new_df.append(temp) is very inefficient. It makes your algorithm quadratic time, because pandas.DataFrame.append always creates a whole new dataframe on every call, copying everything accumulated so far. (It was deprecated in pandas 1.4 and removed in pandas 2.0, so newer versions will not even run this code.) The most efficient way would probably be to make the 'customer id' column an index and simply select with your list, or to filter the whole dataframe once with a boolean mask.
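A minimal sketch of both approaches, using a toy dataframe standing in for the 3-million-row one (the column name "customer id" is taken from the question; the data values are made up):

    import pandas as pd

    # Toy stand-in for the large dataframe; note duplicate customer ids,
    # since the question says each customer has multiple rows.
    df = pd.DataFrame({
        "customer id": [1, 2, 1, 3, 2, 4],
        "amount":      [10, 20, 30, 40, 50, 60],
    })
    ids = [1, 2]  # stand-in for the ~100,000 selected customer ids

    # Option 1: one vectorized pass with a boolean mask.
    # Keeps the original row order.
    new_df = df[df["customer id"].isin(ids)]

    # Option 2: index by customer id, then select all matching rows at once.
    # Rows come back grouped in the order of ids.
    indexed = df.set_index("customer id")
    new_df2 = indexed.loc[ids].reset_index()

Both do the selection in a single operation instead of 100,000 append calls; for a one-off filter, isin is usually the simplest choice, while setting the index pays off if you select by customer id repeatedly.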