Hi there! I have the following situation, and any help would be much appreciated.

Let's say I have the following dataframe, containing 2 columns and 90 thousand rows (made this shorter so it can be easily reproduced):

   PRODUCT ID        PROBLEM
0           1       OIL LEAK
1           2      FLAT TIRE
2           3       OIL LEAK
3           4  ENGINE ISSUES
4           5  ENGINE ISSUES
5           6       OIL LEAK
6           7       OIL LEAK
7           8      FLAT TIRE
8           9      FLAT TIRE
9       90000       OIL LEAK

I need to drop SOME rows (but not all) based on the values in column 'PROBLEM'. Imagine the value 'OIL LEAK' appears in my dataframe 11 thousand times, but I want to keep only 50 rows with this value and delete all the other rows where it appears. It doesn't matter to me which rows are dropped, as long as 50 rows with this value remain in my dataframe.

Is there a way to perform it? Thanks in advance!

  • Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. Commented Jul 2, 2022 at 7:27

2 Answers

One option is to set aside 50 'OIL LEAK' rows, remove every 'OIL LEAK' row from the dataframe, and then concatenate the saved rows back:

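    import pandas as pd

    # Keep the first 50 'OIL LEAK' rows
    leaks = df[df['PROBLEM'] == 'OIL LEAK'].head(50)
    # DataFrame has no .concat method; use the top-level pd.concat instead
    df = pd.concat([df[df['PROBLEM'] != 'OIL LEAK'], leaks])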
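
For reference, here is a self-contained sketch of the same idea on the small frame from the question, keeping 2 rows instead of 50 so the effect is visible. The names `keep`, `is_leak` and `result` are only illustrative, and the `sort_index()` call is an optional extra that restores the original row order after concatenation.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "PRODUCT ID": [1, 2, 3, 4, 5, 6, 7, 8, 9, 90000],
    "PROBLEM": ["OIL LEAK", "FLAT TIRE", "OIL LEAK", "ENGINE ISSUES",
                "ENGINE ISSUES", "OIL LEAK", "OIL LEAK", "FLAT TIRE",
                "FLAT TIRE", "OIL LEAK"],
})

keep = 2  # the question uses 50; 2 makes the result easy to check by eye
is_leak = df["PROBLEM"] == "OIL LEAK"

# First `keep` leak rows, in their original order
leaks = df[is_leak].head(keep)

# All non-leak rows plus the saved leak rows, back in index order
result = pd.concat([df[~is_leak], leaks]).sort_index()
print(result)
```

If the order of the rows does not matter, the `sort_index()` call can simply be dropped.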

1 Comment

Precisely what I needed. Thanks for the help! Kind regards!

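In general, we can use grouping with a cumulative count. cumcount() numbers the rows within each group starting from 0, so the filter below keeps at most the first 50 rows for every value of PROBLEM: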

df[df.groupby('PROBLEM').cumcount() < 50] 
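
To see what the filter is acting on, this small sketch prints the per-group counter for the sample frame from the question and applies a limit of 2 instead of 50. The names `limit` and `capped` are assumptions made for the illustration.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "PRODUCT ID": [1, 2, 3, 4, 5, 6, 7, 8, 9, 90000],
    "PROBLEM": ["OIL LEAK", "FLAT TIRE", "OIL LEAK", "ENGINE ISSUES",
                "ENGINE ISSUES", "OIL LEAK", "OIL LEAK", "FLAT TIRE",
                "FLAT TIRE", "OIL LEAK"],
})

# Position of each row within its own PROBLEM group, starting at 0
counted = df.groupby("PROBLEM").cumcount()
print(counted.tolist())  # [0, 0, 1, 0, 1, 2, 3, 1, 2, 4]

limit = 2  # the answer uses 50
# Keep only rows whose in-group position is below the limit
capped = df[counted < limit]
print(capped)
```

Note that this caps every value of PROBLEM, so 'FLAT TIRE' is also cut down to 2 rows here.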

In order to apply this logic only to some values in the PROBLEM column:

    counted = df.groupby('PROBLEM').cumcount()
    max_count = 50
    problems_to_cut = ['OIL LEAK']
    selected = df[~((counted >= max_count) & (df.PROBLEM.isin(problems_to_cut)))]
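
The selective version could be wrapped in a small helper, sketched below on the question's sample data with a limit of 2. The function name `cap_rows` and its parameters are hypothetical, not part of pandas.

```python
import pandas as pd

def cap_rows(df, column, values, max_count):
    """Keep at most `max_count` rows for each listed value of `column`.

    Rows whose value is not in `values` are never dropped.
    (Hypothetical helper, written only for this illustration.)
    """
    counted = df.groupby(column).cumcount()
    over_limit = (counted >= max_count) & df[column].isin(values)
    return df[~over_limit]

# Sample data copied from the question
df = pd.DataFrame({
    "PRODUCT ID": [1, 2, 3, 4, 5, 6, 7, 8, 9, 90000],
    "PROBLEM": ["OIL LEAK", "FLAT TIRE", "OIL LEAK", "ENGINE ISSUES",
                "ENGINE ISSUES", "OIL LEAK", "OIL LEAK", "FLAT TIRE",
                "FLAT TIRE", "OIL LEAK"],
})

selected = cap_rows(df, "PROBLEM", ["OIL LEAK"], max_count=2)
print(selected["PROBLEM"].value_counts())
```

Unlike the concat approach, this keeps the surviving rows in their original order without any re-sorting.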

1 Comment

Hey there! Thanks for the help! Worked just fine! Another great solution! Best regards
