
I have a non-optimal solution to a problem and I'm searching for a better one.

My data looks like this:

import pandas as pd

df = pd.DataFrame(columns=['id', 'score', 'duration', 'user'],
                  data=[[1, 800, 60, 'abc'], [1, 900, 60, 'zxc'],
                        [2, 800, 250, 'abc'], [2, 5000, 250, 'bvc'],
                        [3, 6000, 250, 'zxc'], [3, 8000, 250, 'klp'],
                        [4, 1400, 500, 'kod'], [4, 8000, 500, 'bvc']])

As you can see, instances come in pairs: two rows share the same id and duration but have different scores and users. My goal is to remove every id pair whose duration is less than 120, or where at least one of the two users has a score below 1500.
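For the sample data above, that means only the id 3 pair should survive (duration 250, and both scores are at least 1500), i.e. something like:

   id  score  duration user
4   3   6000       250  zxc
5   3   8000       250  klp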

So far my solution is like this:

# keep only instances with duration > 120
# (duration is the same for every instance of the same id)
df = df[df['duration'] > 120]

# group by id and get the min score per id
test = df.groupby('id')['score'].min().reset_index()

# get the list of ids where at least one user has a score below 1500
# and drop both instances with that id
for x in list(test[test['score'] < 1500]['id']):
    df.drop(df.loc[df['id'] == x].index, inplace=True)

However, the last bit is not very efficient and quite slow. I have around 700k instances in df and was wondering what the most efficient way is to remove all instances whose id appears in list(test[test['score'] < 1500]['id']). Also a note: for simplicity I used an integer id in this example, but my ids are objects with this kind of format: 4240c195g794530fj4e10z53.

However, you're welcome to show me a better initial approach to this problem. Thanks!


2 Answers


You can first create the condition, then group that boolean Series by the id column and transform with all to retain only the groups where the condition holds for every row.

# retain rows with duration greater than or equal to (ge) 120 and score ge 1500
cond = df['duration'].ge(120) & df['score'].ge(1500)
out = df[cond.groupby(df['id']).transform('all')]

Or chaining them up in 1 line:

out = df[(df['duration'].ge(120) & df['score'].ge(1500))
         .groupby(df['id']).transform('all')]

   id  score  duration user
4   3   6000       250  zxc
5   3   8000       250  klp

2 Comments

Thanks - that's a neat solution, I haven't used .transform('all') before. However, I noticed an inconsistency in my df: for some reason some ids are single (1 instance per id instead of 2) and I want to get rid of those too. If it's no bother, could you advise me on an appropriate way to do that?
@idontknowmuch try this cond = df['id'].duplicated(keep=False) & df['duration'].ge(120) & df['score'].ge(1500) and then out = df[cond.groupby(df['id']).transform('all')] ?
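Spelled out, the suggestion from that comment would look roughly like this (a quick sketch on the question's sample df, not verified against the real data):

# additionally require each id to occur more than once, so single-instance ids are dropped too
cond = (df['id'].duplicated(keep=False)
        & df['duration'].ge(120)
        & df['score'].ge(1500))
out = df[cond.groupby(df['id']).transform('all')]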

Looping over a pandas DataFrame or a NumPy array is almost always a bad idea performance-wise. You should use pandas or NumPy methods instead (the exception being apply, which is not very performant either).
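As a rough illustration of that point (ignoring the per-id grouping for a moment, and using the question's tiny sample, where the gap is of course much smaller than on 700k rows):

# loop version: build a boolean mask row by row with iterrows (slow)
mask = []
for _, row in df.iterrows():
    mask.append(row['duration'] > 120 and row['score'] >= 1500)
loop_out = df[mask]

# vectorized version: one boolean expression over whole columns (fast)
vec_out = df[(df['duration'] > 120) & (df['score'] >= 1500)]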

Below I include anky's answer and add two other, slightly less performant solutions:

def with_isin(df):
    df = df[df['duration'] > 120]
    # boolean Series indexed by id: True if at least one score in the pair is below 1500
    test = df.groupby('id')['score'].min() < 1500
    # drop every row whose id is flagged
    return df[~df['id'].isin(test[test].index)]

def with_join(df):
    df = df[df['duration'] > 120]
    test = df.groupby('id')['score'].min() < 1500
    # broadcast the per-id flag back onto the rows via a join, then filter
    return df[~df.join(test, rsuffix='_test', on='id')['score_test']]

def anky(df):
    return df[(df['duration'].ge(120) & df['score'].ge(1500))
              .groupby(df['id']).transform('all')]

%timeit with_isin(df)
#>>> 1.22 ms ± 18.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit with_join(df)
#>>> 2.23 ms ± 48.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit anky(df)
#>>> 1.15 ms ± 42.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
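Note that the timings above are on the 8-row sample, which says little about a 700k-row frame; to get a feel for the real case, one could rerun them on a larger synthetic frame along these lines (random made-up data, not the asker's real ids):

import numpy as np
import pandas as pd

# ~350k id pairs -> ~700k rows, roughly the size mentioned in the question
n_pairs = 350_000
big = pd.DataFrame({
    'id': np.repeat(np.arange(n_pairs), 2),
    'score': np.random.randint(0, 10_000, size=2 * n_pairs),
    'duration': np.repeat(np.random.randint(0, 600, size=n_pairs), 2),
    'user': 'u',
})

%timeit anky(big)
%timeit with_isin(big)
%timeit with_join(big)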

