
I have a pd.DataFrame which has students' exam performance metrics in each row. Each student has a unique ID, and each student has a unique row for the questions they solved on the exam. For example, student with ID "a1a1" has attempted two questions whereas student with ID "w2e3" has attempted only one question. (sample df)


I want to find the students who have attempted fewer than 3 questions and remove the rows associated with them from the data frame. How can I do this with pd.DataFrame methods?

  • problemID is just a unique identification number for each problem. Each row represents a question solved by the student, so if a student solved three questions, he/she has three rows (entries) in this data frame. Commented Nov 30, 2020 at 16:33

1 Answer

Use value_counts() on the studentID column, keep the IDs that occur at least 3 times, and filter the original frame with isin():

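    import pandas as pd

    df = pd.DataFrame({'studentID': ['a', 'a', 'a', 'b', 'b', 'b', 'c'],
                       'problemID': [1, 2, 3, 1, 2, 3, 1]})
    print(df)

    # Count the rows (attempted questions) per student
    tmp = df['studentID'].value_counts()
    # Keep only students with at least 3 attempts
    tmp = tmp[tmp >= 3]
    # tmp.index holds the student IDs; keep the rows whose ID is in it
    new_df = df[df['studentID'].isin(tmp.index)]
    print(new_df)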

Output:

      studentID  problemID
    0         a          1
    1         a          2
    2         a          3
    3         b          1
    4         b          2
    5         b          3
    6         c          1
      studentID  problemID
    0         a          1
    1         a          2
    2         a          3
    3         b          1
    4         b          2
    5         b          3
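As a possible alternative to the two-step approach above, the per-student count can be attached to every row with groupby and transform, so the filter becomes a single boolean mask. This is only a sketch on the same toy data; the variable names `attempts` and `filtered` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'studentID': ['a', 'a', 'a', 'b', 'b', 'b', 'c'],
                   'problemID': [1, 2, 3, 1, 2, 3, 1]})

# For each row, the number of rows belonging to that row's student.
# transform returns a Series aligned with df's index.
attempts = df.groupby('studentID')['problemID'].transform('size')

# Keep only rows of students with at least 3 attempted questions.
filtered = df[attempts >= 3]
print(filtered)
```

Because `attempts` has the same index as `df`, no intermediate list of IDs is needed; whether this reads better than value_counts() plus isin() is mostly a matter of taste.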

5 Comments

This just returns a list. How can I remove those entries from my original data frame?
AttributeError: 'builtin_function_or_method' object has no attribute 'index'
Sorry, now it should work; I was a bit hasty.
I just don't understand why .index is needed on the last line
Because the tmp Series has the student IDs as its index and the number of solved problems as its values.
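To make the point about .index concrete, here is a small sketch on the answer's toy data. It compares passing `keep.index` to isin() with passing the Series itself; the names `keep`, `mask_by_index` and `mask_by_values` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'studentID': ['a', 'a', 'a', 'b', 'b', 'b', 'c'],
                   'problemID': [1, 2, 3, 1, 2, 3, 1]})

counts = df['studentID'].value_counts()
keep = counts[counts >= 3]

# The index holds the student IDs, the values hold the counts.
print(list(keep.index))
print(list(keep.values))

# isin() compares against the *values* of whatever it is given.
mask_by_index = df['studentID'].isin(keep.index)   # matches student IDs
mask_by_values = df['studentID'].isin(keep)        # compares IDs to counts

print(mask_by_index.sum(), mask_by_values.sum())
```

Without .index, the student IDs would be compared against the counts, so no row would match.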
