
I have the following code:

new_df = pd.DataFrame(columns=df.columns)
for i in list:
    temp = df[df["customer id"] == i]
    new_df = new_df.append(temp)

where list is a list of customer ids for the customers that meet criteria chosen earlier. I use the temp dataframe because there are multiple rows for the same customer.

I consider that I know how to code, but I have never learned how to code for big-data efficiency. In this case, df has around 3 million rows and list contains around 100,000 items. This code ran for more than 24 hours and was still not done, so I have to ask: am I doing something terribly wrong? Is there a way to make this code more efficient?
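For reference, the usual way to avoid repeated append is to collect the pieces and concatenate once at the end. This is a minimal sketch with toy data standing in for the real df (note that DataFrame.append was deprecated and removed in pandas 2.0, so pd.concat is the supported path):

```python
import pandas as pd

# Toy stand-in for the real dataframe ("customer id" plus other columns).
df = pd.DataFrame({
    "customer id": [1, 1, 2, 3, 3],
    "value": [10, 11, 20, 30, 31],
})
ids = [1, 3]  # stands in for the 100,000-item list

# Collect the per-customer slices, then concatenate once
# instead of rebuilding the dataframe on every iteration.
pieces = [df[df["customer id"] == i] for i in ids]
new_df = pd.concat(pieces, ignore_index=True)
```

Each append copies the entire accumulated dataframe, which is what makes the loop quadratic; a single concat copies the data only once.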

  • Yes, new_df = new_df.append(temp) is very inefficient. It makes your algorithm quadratic time: pandas.DataFrame.append always creates a whole new dataframe. The most efficient way would probably be to make the 'customer id' column an index and simply select with your list. Commented Jul 11, 2020 at 23:40
  • I simulated with 3 million records and 100,000 customer ids. It takes only a few seconds with isin. Commented Jul 12, 2020 at 0:23
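The index-based selection suggested in the first comment can be sketched like this (toy data; assumes every id in the list is present in the dataframe, otherwise .loc raises a KeyError):

```python
import pandas as pd

# Toy stand-in for the real dataframe.
df = pd.DataFrame({
    "customer id": [1, 1, 2, 3, 3, 4],
    "value": [10, 11, 20, 30, 31, 40],
})
customer_ids = [1, 3]  # stands in for the 100,000-item list

# Make "customer id" the index, then select all rows for the
# wanted ids in a single lookup.
indexed = df.set_index("customer id")
new_df = indexed.loc[customer_ids].reset_index()
```

Selecting a list of labels through .loc returns every matching row for each label, so customers with multiple rows are handled the same way as in the original loop.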

2 Answers


list is a built-in type in Python, so you should avoid naming your variables after built-in types or functions. I simulated the problem with 3 million rows and a list of 100,000 customer ids. It took only a few seconds using isin.

new_df = df[df['customer id'].isin(customer_list)]

1 Comment

You're absolutely right; this worked perfectly in just seconds. I didn't realize isin could be used this way, so thank you! The huge difference in time between the two approaches still baffles me, but I get it. Thanks again.

You can try the code below, which should make things much faster:

new_df = df.loc[df['customer id'].isin(list)]

