I have a non-optimal solution to a problem and I'm searching for a better one.
My data looks like this:
```python
import pandas as pd

df = pd.DataFrame(
    columns=['id', 'score', 'duration', 'user'],
    data=[[1, 800, 60, 'abc'], [1, 900, 60, 'zxc'],
          [2, 800, 250, 'abc'], [2, 5000, 250, 'bvc'],
          [3, 6000, 250, 'zxc'], [3, 8000, 250, 'klp'],
          [4, 1400, 500, 'kod'], [4, 8000, 500, 'bvc']],
)
```

As you can see, the instances come in pairs that share the same id and duration but have different scores. My goal is to remove every id pair that has a duration of less than 120 or where at least one user has a score below 1500.
So far my solution is like this:
```python
# keep only instances with duration > 120
# (duration is the same for every instance of the same id)
df = df[df['duration'] > 120]

# group by id and get the min value of score
test = df.groupby('id')['score'].min().reset_index()

# then I can get a list of the ids where at least one user has a score
# below 1500 and remove both instances with the same id
for x in list(test[test['score'] < 1500]['id']):
    df.drop(df.loc[df['id'] == x].index, inplace=True)
```

However, the last bit is not very efficient and is quite slow. I have around 700k instances in `df`, and I was wondering what the most efficient way is to remove all instances whose id is among the ones found in `list(test[test['score'] < 1500]['id'])`.

Also a note: for simplicity I used an integer for id in this example, but my ids are objects (strings) in a format like `4240c195g794530fj4e10z53`.
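For reference, here is a loop-free variant of the steps above that I would expect to behave the same way. It is only a sketch, not a benchmarked solution: it assumes the goal is "keep an id only if its duration is above 120 and its lowest score is at least 1500", and the names `long_enough`, `min_score`, `bad_ids`, `result` and `result_isin` are made up for illustration. Both `groupby(...).transform('min')` and `isin` should work the same with string ids.

```python
import pandas as pd

df = pd.DataFrame(
    columns=['id', 'score', 'duration', 'user'],
    data=[[1, 800, 60, 'abc'], [1, 900, 60, 'zxc'],
          [2, 800, 250, 'abc'], [2, 5000, 250, 'bvc'],
          [3, 6000, 250, 'zxc'], [3, 8000, 250, 'klp'],
          [4, 1400, 500, 'kod'], [4, 8000, 500, 'bvc']],
)

# Keep only rows with duration > 120, as in the loop version.
long_enough = df[df['duration'] > 120]

# Variant 1: broadcast the per-id minimum score back onto every row,
# then keep the rows whose id has a minimum score of at least 1500.
min_score = long_enough.groupby('id')['score'].transform('min')
result = long_enough[min_score >= 1500]

# Variant 2: collect the ids that have any score below 1500
# and drop all of their rows with a single boolean mask.
bad_ids = long_enough.loc[long_enough['score'] < 1500, 'id'].unique()
result_isin = long_enough[~long_enough['id'].isin(bad_ids)]

print(result)
```

On the sample data both variants leave only the two rows with id 3. Neither modifies `df` in place, so the original frame stays available for comparison.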
You're also welcome to suggest a better overall approach to this problem. Thanks!